BigQuery supports remote models, such as Vertex AI LLMs, to perform remote inference operations on both structured and unstructured data. When using remote inference, the user needs to pay attention to quotas and limits, which can result in retryable error in a subset of rows and require reprocessing.
The BQML dataform library assists users to create BQML pipelines that are resilient to transient failures by automatic reprocessing and incrementally updating the output table.
Add the bqml package to your package.json file in your Dataform project.
{ "dependencies": { "bqml": "https://github.com/dataform-co/dataform-bqml/archive/[RELEASE_VERSION].tar.gz" } }
You can find the most up to date package version on the releases page.
The following example shows how to generate text from images using the Vertex AI multimodel.
// Import the module const bqml = require("bqml"); // Name of the multimodel that has `gemini-pro-vision` as the endpoint let model = "multi-llm"; // Name of the object table that points to a set of images let source_table = "product_image"; // Name of the table for storing the result let output_table = "product_image_description"; // Optionally declare the model and source table as dataform datasources // if it is not defined in other actions. declare({name: model}); declare({name: source_table}); // Execute the pipeline bqml.vision_generate_text( source_table, output_table, model, "Describe the image in 20 words", { flatten_json_output: true } );
This is another example showing how to generate text using a source query to form the prompt.
// Import the module const bqml = require("bqml"); // Name of the model let model = "llm"; // Name of the table that forms the source query let source_table = "product_comment"; // Name of the table for storing the result let output_table = "sentiment"; // Columns to uniquely identify a row in the source table let keys = ["comment_id"]; // Optionally declare the model and source table as dataform datasources // if it is not defined in other actions. declare({name: model}); declare({name: source_table}); // Execute the pipeline bqml.generate_text( output_table, keys, model, (ctx) => `SELECT *, "Classify the text into neutral, negative, or positive. Text: " || comment AS prompt FROM ${ctx.ref(source_table)}`, {flatten_json_output: true});
function generate_text( output_table, unique_keys, ml_model, source_query, ml_configs, options)
Performs the ML.GENERATE_TEXT function on the given source table.
See: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-generate-text
Param Type Description output_tableString
the name of the table to store the final result unique_keys String
| Array
column name(s) for identifying an unique row in the source table ml_model Resolvable
the remote model to use for the ML operation that uses one of the Vertex AI LLM endpoints source_query String
| function
either a query string or a Contextable function to produce the query on the source data for the ML operation and it must have the unique key columns selected in addition to other fields ml_configs Object
configurations for the ML operation options Object
the configuration object for the table_ml function Function vision_generate_text
function vision_generate_text( source_table, output_table, model, prompt, llm_config, options)
Performs the ML.GENERATE_TEXT function on visual content in the given source table.
Param Type Description source_tableResolvable
represents the source object table output_table String
name of the output table model Resolvable
name the remote model with the gemini-pro-vision
endpoint prompt String
the prompt text for the LLM llm_config Object
extra configurations to the LLM options Object
the configuration object for the obj_table_ml function Function generate_embedding
function generate_embedding( output_table, unique_keys, ml_model, source_query, ml_configs, options)
Performs the ML.GENERATE_EMBEDDING function on the given source table.
See: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-generate-embedding
Param Type Description output_tableString
the name of the table to store the final result unique_keys String
| Array
column name(s) for identifying an unique row in the source table ml_model Resolvable
the remote model to use for the ML operation that uses one of the textembedding-gecko*
Vertex AI LLMs as endpoint source_query String
| function
either a query string or a Contextable function to produce the query on the source data for the ML operation and it must have the unique key columns selected in addition to other fields ml_configs Object
configurations for the ML operation options Object
the configuration object for the table_ml function
function understand_text( output_table, unique_keys, ml_model, source_query, ml_configs, options)
Performs the ML.UNDERSTAND_TEXT function on the given source table.
See: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-understand-text
Param Type Description output_tableString
the name of the table to store the final result unique_keys String
| Array
column name(s) for identifying an unique row in the source table ml_model Resolvable
the remote model with a REMOTE_SERVICE_TYPE of CLOUD_AI_NATURAL_LANGUAGE_V1 source_query String
| function
either a query string or a Contextable function to produce the query on the source data for the ML operation and it must have the unique key columns selected in addition to other fields ml_configs Object
configurations for the ML operation options Object
the configuration object for the table_ml function
function translate( output_table, unique_keys, ml_model, source_query, ml_configs, options)
Performs the ML.TRANSLATE function on the given source table.
See: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-translate
Param Type Description output_tableString
the name of the table to store the final result unique_keys String
| Array
column name(s) for identifying an unique row in the source table ml_model Resolvable
the remote model with a REMOTE_SERVICE_TYPE of CLOUD_AI_TRANSLATE_V3 source_query String
| function
either a query string or a Contextable function to produce the query on the source data for the ML operation and it must have the unique key columns selected in addition to other fields ml_configs Object
configurations for the ML operation options Object
the configuration object for the table_ml function
function annotate_image( source_table, output_table, model, features, options)
Performs the ML.ANNOTATE_IMAGE function on the given source table.
See: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-annotate-image
Param Type Description source_tableResolvable
represents the source object table output_table String
name of the output table model Resolvable
the remote model with a REMOTE_SERVICE_TYPE of CLOUD_AI_VISION_V1 features Array
specifies one or more feature names of supported Vision API features options Object
the configuration object for the obj_table_ml function
function transcribe( source_table, output_table, model, recognition_config, options)
Performs the ML.TRANSCRIBE function on the given source table.
See: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-transcribe
Param Type Description source_tableResolvable
represents the source object table output_table String
name of the output table model Resolvable
the remote model with a REMOTE_SERVICE_TYPE of CLOUD_AI_SPEECH_TO_TEXT_V2 recognition_config Object
the recognition configuration to override the default configuration of the specified recognizer options Object
the configuration object for the obj_table_ml function Function process_document
function process_document( source_table, output_table, model, options)
Performs the ML.PROCESS_DOCUMENT function on the given source table.
See: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-process-document
Param Type Description source_tableResolvable
represents the source object table output_table String
name of the output table model Resolvable
the remote model with a REMOTE_SERVICE_TYPE of CLOUD_AI_DOCUMENT_V1 options Object
the configuration object for the obj_table_ml function
function table_ml( output_table, unique_keys, ml_function, ml_model, source_query, accept_filter, ml_configs = {}, { batch_size = 10000, batch_duration_secs = 22 * 60 * 60 } = {})
A generic structured table ML pipeline. It incrementally performs an ML operation on rows from the source table and merges to the output table until all rows are processed or runs longer than the specific duration.
Param Type Description output_tableString
name of the output table unique_keys String
| Array
column name(s) for identifying an unique row in the source table ml_function String
the name of the BQML function to call ml_model Resolvable
the remote model to use for the ML operation source_query String
| function
either a query string or a Contextable function to produce the query on the source data for the ML operation and it must have the unique key columns selected in addition to other fields accept_filter String
a SQL boolean expression for accepting a row to the output table after the ML operation ml_configs Object
configurations for the ML operation batch_size Number
number of rows to process in each SQL job. Rows in the object table will be processed in batches according to the batch size. Default batch size is 10000 batch_duration_secs Number
the number of seconds to pass before breaking the batching loop if it hasn't been finished before within this duration. Default value is 22 hours
function obj_table_ml( source_table, source, output_table, accept_filter, { batch_size = 500, unique_key = "uri", updated_column = "updated", batch_duration_secs = 22 * 60 * 60, } = {})
A generic object table ML pipeline. It incrementally performs an ML operation on new rows from the source table and merges to the output table until no new row is detected or runs longer than the specific duration. A row from the source table is considered as new if the unique_key
(default to "uri") of a row is absent in the output table, or if the updated_column
(default to "updated") column is newer than the largest value in the output table.
Resolvable
represents the source object table source String
| function
either a query string or a Contextable function to produce the query on the source data output_table String
the name of the table to store the final result accept_filter String
a SQL expression for finding rows that contains retryable error batch_size Number
number of rows to process in each SQL job. Rows in the object table will be processed in batches according to the batch size. Default batch size is 500 unique_key String
the primary key in the output table for incremental update. Default value is "uri". updated_column String
the column that carries the last updated timestamp of an object in the object table. Default value is "updated" batch_duration_secs Number
the number of seconds to pass before breaking the batching loop if it hasn't been finished before within this duration. Default value is 22 hours
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4