GoogleSQL for BigQuery supports the following text analysis functions.
Function list

BAG_OF_WORDS: Gets the frequency of each term (token) in a tokenized document.
TEXT_ANALYZE: Extracts terms (tokens) from text and converts them into a tokenized document.
TF_IDF: Evaluates how relevant a term (token) is to a tokenized document in a set of tokenized documents.

BAG_OF_WORDS
BAG_OF_WORDS(tokenized_document)
Description
Gets the frequency of each term (token) in a tokenized document.
Definitions
tokenized_document: ARRAY<STRING> value that represents a document that has been tokenized. A tokenized document is a collection of terms (tokens), which are used for text analysis.

Return type

ARRAY<STRUCT<term STRING, count INT64>>

Definitions:

term: A unique term in the tokenized document.
count: The number of times the term was found in the tokenized document.

Examples
The following query produces terms and their frequencies in two tokenized documents:
WITH
  ExampleTable AS (
    SELECT 1 AS id, ['I', 'like', 'pie', 'pie', 'pie', NULL] AS f UNION ALL
    SELECT 2 AS id, ['yum', 'yum', 'pie', NULL] AS f
  )
SELECT id, BAG_OF_WORDS(f) AS results
FROM ExampleTable
ORDER BY id;
/*----+------------------------------------------------*
| id | results |
+----+------------------------------------------------+
| 1 | [(null, 1), ('I', 1), ('like', 1), ('pie', 3)] |
| 2 | [(null, 1), ('pie', 1), ('yum', 2)] |
*----+------------------------------------------------*/
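To get one row per term instead of an array of structs, you can flatten the results with UNNEST. The following sketch reuses the same ExampleTable data as the query above; the result alias is illustrative.
WITH
  ExampleTable AS (
    SELECT 1 AS id, ['I', 'like', 'pie', 'pie', 'pie', NULL] AS f UNION ALL
    SELECT 2 AS id, ['yum', 'yum', 'pie', NULL] AS f
  )
SELECT id, result.term, result.count
FROM ExampleTable, UNNEST(BAG_OF_WORDS(f)) AS result
ORDER BY id, result.term;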
TEXT_ANALYZE
TEXT_ANALYZE(
  text
  [, analyzer => { 'LOG_ANALYZER' | 'NO_OP_ANALYZER' | 'PATTERN_ANALYZER' } ]
  [, analyzer_options => analyzer_options_values ]
)
Description
Extracts terms (tokens) from text and converts them into a tokenized document.
Definitions
text: STRING value that represents the input text to tokenize.

analyzer: A named argument with a STRING value. Determines which analyzer to use to convert text into an array of terms (tokens). This can be:

'LOG_ANALYZER' (default): Breaks the input into terms when delimiters are encountered and then normalizes the terms. If analyzer isn't specified, this is used by default. For more information, see LOG_ANALYZER text analyzer.

'NO_OP_ANALYZER': Extracts the text as a single term (token), but doesn't apply normalization. For more information, see NO_OP_ANALYZER text analyzer.

'PATTERN_ANALYZER': Breaks the input into terms that match a regular expression. For more information, see PATTERN_ANALYZER text analyzer.

analyzer_options: A named argument with a JSON-formatted STRING value. Takes a list of text analysis rules. For more information, see Text analyzer options.
Details
There is no guarantee on the order of the tokens produced by this function.
If no analyzer is specified, the LOG_ANALYZER analyzer is used by default.
Return type
ARRAY<STRING>
Examples
The following query uses the default text analyzer, LOG_ANALYZER, with the input text:
SELECT TEXT_ANALYZE('I like pie, you like-pie, they like 2 PIEs.') AS results
/*--------------------------------------------------------------------------*
| results |
+--------------------------------------------------------------------------+
| ['i', 'like', 'pie', 'you', 'like', 'pie', 'they', 'like', '2', 'pies' ] |
*--------------------------------------------------------------------------*/
The following query uses the NO_OP_ANALYZER text analyzer with the input text:
SELECT TEXT_ANALYZE(
  'I like pie, you like-pie, they like 2 PIEs.',
  analyzer=>'NO_OP_ANALYZER'
) AS results
/*-----------------------------------------------*
| results |
+-----------------------------------------------+
| 'I like pie, you like-pie, they like 2 PIEs.' |
*-----------------------------------------------*/
The following query uses the PATTERN_ANALYZER text analyzer with the input text:
SELECT TEXT_ANALYZE(
  'I like pie, you like-pie, they like 2 PIEs.',
  analyzer=>'PATTERN_ANALYZER'
) AS results
/*----------------------------------------------------------------*
| results |
+----------------------------------------------------------------+
| ['like', 'pie', 'you', 'like', 'pie', 'they', 'like', 'pies' ] |
*----------------------------------------------------------------*/
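The analyzer_options argument takes a JSON-formatted STRING of text analysis rules. As a hedged sketch, the query below passes a set of delimiters to LOG_ANALYZER; the exact rule names and values that each analyzer accepts are documented in Text analyzer options.
SELECT TEXT_ANALYZE(
  'I like pie, you like-pie, they like 2 PIEs.',
  analyzer=>'LOG_ANALYZER',
  -- The "delimiters" rule shown here is illustrative; see Text analyzer options
  -- for the rules that each analyzer supports.
  analyzer_options=>'{"delimiters": [" ", ",", "-", "."]}'
) AS results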
For additional examples that include analyzer options, see Text analysis.
For helpful analyzer recipes that you can use to enhance analyzer-supported queries, see Search with text analyzers.
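Because TEXT_ANALYZE returns an ARRAY<STRING>, you can pass its output directly to functions that take a tokenized document, such as BAG_OF_WORDS. A minimal sketch:
SELECT BAG_OF_WORDS(
  TEXT_ANALYZE('I like pie, you like-pie, they like 2 PIEs.')
) AS results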
TF_IDF
TF_IDF(tokenized_document) OVER()
TF_IDF(tokenized_document, max_distinct_tokens) OVER()
TF_IDF(tokenized_document, max_distinct_tokens, frequency_threshold) OVER()
Description
Evaluates how relevant a term is to a tokenized document in a set of tokenized documents, using the TF-IDF (term frequency-inverse document frequency) algorithm.
Definitions
tokenized_document: ARRAY<STRING> value that represents a document that has been tokenized. A tokenized document is a collection of terms (tokens), which are used for text analysis.

max_distinct_tokens: Optional argument. Takes a non-negative INT64 value, which represents the size of the dictionary, excluding the unknown term.

Terms are added to the dictionary until this threshold is met. So, if this value is 20, the first 20 unique terms are added and then no additional terms are added.

If this argument isn't provided, the default value is 32000. If this argument is specified, the maximum value is 1048576.

frequency_threshold: Optional argument. Takes a non-negative INT64 value that represents the minimum number of times a term must appear in a tokenized document to be included in the dictionary. So, if this value is 3, a term must appear at least three times in the tokenized document to be added to the dictionary.

If this argument isn't provided, the default value is 5.
Details
This function uses a TF-IDF (term frequency-inverse document frequency) algorithm to compute the relevance of terms in a set of tokenized documents. TF-IDF multiplies two metrics: how many times a term appears in a document (term frequency), and the inverse document frequency of the term across a collection of documents (inverse document frequency).
TF-IDF:
term frequency * inverse document frequency

term frequency:
(count of term in document) / (document size)

inverse document frequency:
log(1 + document set size / (1 + count of documents containing term))
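For example, in the first query in the Examples section, the term 'pie' appears three times in document 1, which contains six terms (including NULL), so its term frequency is 3 / 6 = 0.5. The document set contains four documents and 'pie' appears in all four, so its inverse document frequency is log(1 + 4 / (1 + 4)) = log(1.8) ≈ 0.587787, using the natural logarithm. The product, 0.5 * 0.587787 ≈ 0.293893, matches the tf_idf value reported for 'pie' in document 1.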
Terms are added to a dictionary of terms if they satisfy the criteria for max_distinct_tokens and frequency_threshold; otherwise they are considered the unknown term. The unknown term is always the first term in the dictionary and is represented as NULL. The rest of the dictionary is ordered by term frequency rather than alphabetically.
Return type
ARRAY<STRUCT<term STRING, tf_idf DOUBLE>>
Definitions:
term: The unique term that was added to the dictionary.
tf_idf: The TF-IDF computation for the term.

Examples

The following query computes the relevance of up to 10 terms that appear at least twice in a set of tokenized documents. In this example, 10 represents max_distinct_tokens and 2 represents frequency_threshold:
WITH ExampleTable AS (
  SELECT 1 AS id, ['I', 'like', 'pie', 'pie', 'pie', NULL] AS f UNION ALL
  SELECT 2 AS id, ['yum', 'yum', 'pie', NULL] AS f UNION ALL
  SELECT 3 AS id, ['I', 'yum', 'pie', NULL] AS f UNION ALL
  SELECT 4 AS id, ['you', 'like', 'pie', 'too', NULL] AS f
)
SELECT id, TF_IDF(f, 10, 2) OVER() AS results
FROM ExampleTable
ORDER BY id;
/*----+-------------------------------------------------*
| id | results |
+----+-------------------------------------------------+
| 1 | [{"index":null,"value":"0.1304033435859887"}, |
| | {"index":"I","value":"0.1412163100645339"}, |
| | {"index":"like","value":"0.1412163100645339"}, |
| | {"index":"pie","value":"0.29389333245105953"}] |
+----+-------------------------------------------------+
| 2 | [{"index":null,"value":"0.1956050153789831"}, |
| | {"index":"pie","value":"0.14694666622552977"}, |
| | {"index":"yum","value":"0.4236489301936017"}] |
+----+-------------------------------------------------+
| 3 | [{"index":null,"value":"0.1956050153789831"}, |
| | {"index":"I","value":"0.21182446509680086"}, |
| | {"index":"pie","value":"0.14694666622552977"}, |
| | {"index":"yum","value":"0.21182446509680086"}] |
+----+-------------------------------------------------+
| 4 | [{"index":null,"value":"0.4694520369095594"}, |
| | {"index":"like","value":"0.1694595720774407"}, |
| | {"index":"pie","value":"0.11755733298042381"}] |
*----+-------------------------------------------------*/
The following query computes the relevance of up to three terms that appear at least twice in a set of tokenized documents. In this example, 3 represents max_distinct_tokens and 2 represents frequency_threshold:
WITH ExampleTable AS (
  SELECT 1 AS id, ['I', 'like', 'pie', 'pie', 'pie', NULL] AS f UNION ALL
  SELECT 2 AS id, ['yum', 'yum', 'pie', NULL] AS f UNION ALL
  SELECT 3 AS id, ['I', 'yum', 'pie', NULL] AS f UNION ALL
  SELECT 4 AS id, ['you', 'like', 'pie', 'too', NULL] AS f
)
SELECT id, TF_IDF(f, 3, 2) OVER() AS results
FROM ExampleTable
ORDER BY id;
/*----+-------------------------------------------------*
| id | results |
+----+-------------------------------------------------+
| 1 | [{"index":null,"value":"0.12679902142647365"}, |
| | {"index":"I","value":"0.1412163100645339"}, |
| | {"index":"like","value":"0.1412163100645339"}, |
| | {"index":"pie","value":"0.29389333245105953"}] |
+----+-------------------------------------------------+
| 2 | [{"index":null,"value":"0.5705955964191315"}, |
| | {"index":"pie","value":"0.14694666622552977"}] |
+----+-------------------------------------------------+
| 3 | [{"index":null,"value":"0.380397064279421"}, |
| | {"index":"I","value":"0.21182446509680086"}, |
| | {"index":"pie","value":"0.14694666622552977"}] |
+----+-------------------------------------------------+
| 4 | [{"index":null,"value":"0.45647647713530515"}, |
| | {"index":"like","value":"0.1694595720774407"}, |
| | {"index":"pie","value":"0.11755733298042381"}] |
*----+-------------------------------------------------*/
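To work with the scores as rows, you can flatten the TF_IDF output with UNNEST. The following sketch reuses the same ExampleTable data as the queries above; the field names follow the documented return type, and the CTE and alias names are illustrative.
WITH ExampleTable AS (
  SELECT 1 AS id, ['I', 'like', 'pie', 'pie', 'pie', NULL] AS f UNION ALL
  SELECT 2 AS id, ['yum', 'yum', 'pie', NULL] AS f UNION ALL
  SELECT 3 AS id, ['I', 'yum', 'pie', NULL] AS f UNION ALL
  SELECT 4 AS id, ['you', 'like', 'pie', 'too', NULL] AS f
),
Scored AS (
  SELECT id, TF_IDF(f, 10, 2) OVER() AS results
  FROM ExampleTable
)
SELECT id, result.term, result.tf_idf
FROM Scored, UNNEST(results) AS result
ORDER BY id, result.tf_idf DESC;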