Aggregate functions (Similarity Estimation) , Window function syntax and usage
Combines input MinHash states into a single MinHash output state. This Minhash state can then be input to the APPROXIMATE_SIMILARITY function to estimate the similarity with other MinHash states.
This allows use cases in which MINHASH is run over horizontal rowsets of the same table, producing a MinHash state for each rowset. These states can then be combined using MINHASH_COMBINE, producing the same output state as a single run of MINHASH over the entire table.
For more information about MinHash states, see Estimating Similarity of Two or More Sets.
MINHASH_COMBINE( [ DISTINCT ] <state> )
Copy
Usage notes¶DISTINCT can be included as an argument, but has no effect.
Input MinHash state
must have MinHash arrays of equal length.
USE SCHEMA snowflake_sample_data.tpch_sf1; SELECT MINHASH_COMBINE(mh) FROM ( (SELECT MINHASH(5, c2) mh FROM orders WHERE c2 <= 10000) UNION (SELECT MINHASH(5, c2) mh FROM orders WHERE c2 > 10000 AND c2 <= 20000) UNION (SELECT MINHASH(5, C2) mh FROM orders WHERE c2 > 20000) ); +-----------------------+ | MINHASH_COMBINE(MH) | |-----------------------| | { | | "state": [ | | 628914288006793, | | 1071764954434168, | | 991489123966035, | | 2395105834644106, | | 680224867834949 | | ], | | "type": "minhash", | | "version": 1 | | } | +-----------------------+
Copy
Here is a more extensive example, showing the three related functions MINHASH
, MINHASH_COMBINE
and APPROXIMATE_SIMILARITY
. This example creates 3 tables (ta, tb, and tc), two of which (ta and tb) are similar, and two of which (ta and tc) are completely dissimilar.
Create and populate tables with values:
CREATE TABLE ta (i INTEGER); CREATE TABLE tb (i INTEGER); CREATE TABLE tc (i INTEGER); -- Insert values into the 3 tables. INSERT INTO ta (i) VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10); -- Almost the same as the preceding values. INSERT INTO tb (i) VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (11); -- Different values and different number of values. INSERT INTO tc (i) VALUES (-1), (-20), (-300), (-4000);Copy
Calculate minhash info for the initial set of data:
CREATE TABLE minhash_a_1 (mh) AS SELECT MINHASH(100, i) FROM ta; CREATE TABLE minhash_b (mh) AS SELECT MINHASH(100, i) FROM tb; CREATE TABLE minhash_c (mh) AS SELECT MINHASH(100, i) FROM tc;Copy
Add more data to one of the tables:
INSERT INTO ta (i) VALUES (12);Copy
Demonstrate the
MINHASH_COMBINE
function:-- Record minhash information about only the new rows: CREATE TABLE minhash_a_2 (mh) AS SELECT MINHASH(100, i) FROM ta WHERE i > 10; -- Now combine all the minhash info for the old and new rows in table ta. CREATE TABLE minhash_a (mh) AS SELECT MINHASH_COMBINE(mh) FROM ( (SELECT mh FROM minhash_a_1) UNION ALL (SELECT mh FROM minhash_a_2) );Copy
This query shows the approximate similarity of the two similar tables (ta and tb):
SELECT APPROXIMATE_SIMILARITY (mh) FROM ( (SELECT mh FROM minhash_a) UNION ALL (SELECT mh FROM minhash_b) ); +-----------------------------+ | APPROXIMATE_SIMILARITY (MH) | |-----------------------------| | 0.75 | +-----------------------------+Copy
This query shows the approximate similarity of the two very different tables (ta and tc):
SELECT APPROXIMATE_SIMILARITY (mh) FROM ( (SELECT mh FROM minhash_a) UNION ALL (SELECT mh FROM minhash_c) ); +-----------------------------+ | APPROXIMATE_SIMILARITY (MH) | |-----------------------------| | 0 | +-----------------------------+Copy
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4