This tutorial teaches you how to use the TRANSFORM clause of the CREATE MODEL statement to perform feature engineering at the same time that you create and train a model. Using the TRANSFORM clause, you can specify one or more preprocessing functions to transform the input data you use to train the model. The preprocessing that you apply to the model is automatically applied when you use the model with the ML.EVALUATE and ML.PREDICT functions.
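In general terms, the TRANSFORM clause sits between the model name and the OPTIONS list. The following skeleton is an illustrative sketch with placeholder names, not a runnable statement:

```sql
-- Skeleton of CREATE MODEL with a TRANSFORM clause (placeholders, not runnable).
CREATE OR REPLACE MODEL `project.dataset.model_name`
  TRANSFORM(
    label_column,                                         -- pass the label through unchanged
    raw_feature,                                          -- pass a feature through unchanged
    preprocessing_function(raw_feature) AS engineered_feature)  -- derived feature
  OPTIONS (
    model_type = '...',
    input_label_cols = ['label_column']) AS
SELECT ... FROM ...;
```

Because the transformations are stored with the model, later calls to ML.EVALUATE and ML.PREDICT don't repeat them.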
This tutorial uses the public bigquery-public-data.ml_datasets.penguins dataset.
This tutorial guides you through completing the following tasks:

- Creating a linear regression model by using the CREATE MODEL statement. Within the CREATE MODEL statement, use the ML.QUANTILE_BUCKETIZE and ML.FEATURE_CROSS functions to preprocess data.
- Evaluating the model by using the ML.EVALUATE function.
- Getting predictions from the model by using the ML.PREDICT function.

This tutorial uses billable components of Google Cloud, including:

- BigQuery
- BigQuery ML
For more information about BigQuery costs, see the BigQuery pricing page.
Before you begin

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

Verify that billing is enabled for your Google Cloud project.
Enable the BigQuery API.
Create a BigQuery dataset to store your ML model.
Console

In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, click your project name.
Click View actions > Create dataset.
On the Create dataset page, do the following:
For Dataset ID, enter bqml_tutorial.
For Location type, select Multi-region, and then select US (multiple regions in United States).
Leave the remaining default settings as they are, and click Create dataset.
bq

To create a new dataset, use the bq mk command with the --location flag. For a full list of possible parameters, see the bq mk --dataset command reference.

Create a dataset named bqml_tutorial with the data location set to US and a description of BigQuery ML tutorial dataset:
bq --location=US mk -d \
    --description "BigQuery ML tutorial dataset." \
    bqml_tutorial
Instead of using the --dataset flag, the command uses the -d shortcut. If you omit -d and --dataset, the command defaults to creating a dataset.
Confirm that the dataset was created:
bq ls
API

Call the datasets.insert method with a defined dataset resource:

{
  "datasetReference": {
    "datasetId": "bqml_tutorial"
  }
}

BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.
Create the model

Create a linear regression model to predict penguin weight and train it on the penguins sample table.

The OPTIONS(model_type='linear_reg', input_label_cols=['body_mass_g']) clause indicates that you are creating a linear regression model. A linear regression model generates a continuous value from a linear combination of input features. The body_mass_g column is the input label column. For linear regression models, the label column must be real valued (that is, the column values must be real numbers).
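For contrast, a minimal linear regression model without a TRANSFORM clause might look like the following sketch. The model name penguin_baseline and the column list are illustrative, not part of this tutorial:

```sql
-- Hypothetical baseline: train directly on raw columns, with no TRANSFORM clause.
CREATE OR REPLACE MODEL `bqml_tutorial.penguin_baseline`
  OPTIONS (
    model_type = 'linear_reg',
    input_label_cols = ['body_mass_g']) AS
SELECT
  body_mass_g,
  culmen_depth_mm,
  culmen_length_mm,
  flipper_length_mm
FROM
  `bigquery-public-data.ml_datasets.penguins`
WHERE
  body_mass_g IS NOT NULL;
```

With this form, any feature engineering would have to be repeated in every later evaluation or prediction query; the TRANSFORM clause used in this tutorial removes that duplication.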
This query's TRANSFORM clause uses the following columns from the SELECT statement:

- body_mass_g: Used in training without any change.
- culmen_depth_mm: Used in training without any change.
- flipper_length_mm: Used in training without any change.
- bucketized_culmen_length: Generated from culmen_length_mm by bucketizing culmen_length_mm based on quantiles using the ML.QUANTILE_BUCKETIZE analytic function.
- culmen_length_mm: The original culmen_length_mm value, cast to a STRING value and used in training.
- species_sex: Generated from crossing species and sex using the ML.FEATURE_CROSS function.

You don't need to use all of the columns from the training table in the TRANSFORM clause.
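To see what these preprocessing functions produce before committing to a model, you can run them in an ordinary query. This is a sketch; the column aliases and the example bucket label in the comment are illustrative:

```sql
-- Preview the engineered features outside of CREATE MODEL (illustrative).
SELECT
  culmen_length_mm,
  -- Assigns each row a quantile-based bucket label (a STRING such as 'bin_3').
  ML.QUANTILE_BUCKETIZE(culmen_length_mm, 10) OVER () AS bucketized_culmen_length,
  -- Produces a STRUCT containing the crossed categorical features.
  ML.FEATURE_CROSS(STRUCT(species, sex)) AS species_sex
FROM
  `bigquery-public-data.ml_datasets.penguins`
WHERE
  body_mass_g IS NOT NULL
LIMIT 5;
```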
The WHERE clause—WHERE body_mass_g IS NOT NULL AND RAND() < 0.2—excludes rows where the penguin's weight is NULL, and uses the RAND function to draw a random sample of about 20% of the data.
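Because RAND is nondeterministic, each run of the CREATE MODEL statement trains on a different sample. If you want a reproducible sample instead, one common sketch is to hash a stable expression with FARM_FINGERPRINT; the choice of hash input here is illustrative:

```sql
-- Reproducible ~20% sample: hash a stable expression instead of calling RAND().
SELECT *
FROM `bigquery-public-data.ml_datasets.penguins`
WHERE body_mass_g IS NOT NULL
  AND MOD(ABS(FARM_FINGERPRINT(CAST(body_mass_g AS STRING))), 10) < 2;
```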
Follow these steps to create the model:
In the Google Cloud console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:
CREATE OR REPLACE MODEL `bqml_tutorial.penguin_transform`
  TRANSFORM(
    body_mass_g,
    culmen_depth_mm,
    flipper_length_mm,
    ML.QUANTILE_BUCKETIZE(culmen_length_mm, 10) OVER () AS bucketized_culmen_length,
    CAST(culmen_length_mm AS string) AS culmen_length_mm,
    ML.FEATURE_CROSS(STRUCT(species, sex)) AS species_sex)
  OPTIONS (
    model_type = 'linear_reg',
    input_label_cols = ['body_mass_g']) AS
SELECT
  *
FROM
  `bigquery-public-data.ml_datasets.penguins`
WHERE
  body_mass_g IS NOT NULL
  AND RAND() < 0.2;
The query takes about 15 minutes to complete, after which the penguin_transform model appears in the Explorer pane. Because the query uses a CREATE MODEL statement to create a model, you don't see query results.
Evaluate the model

Evaluate the performance of the model by using the ML.EVALUATE function. The ML.EVALUATE function evaluates the predicted penguin weights returned by the model against the actual penguin weights from the training data.

This query's nested SELECT statement and FROM clause are the same as those in the CREATE MODEL query. Because you used the TRANSFORM clause when creating the model, you don't need to specify the columns and transformations again in the ML.EVALUATE function. The function automatically retrieves them from the model.
Follow these steps to evaluate the model:
In the Google Cloud console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:
SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `bqml_tutorial.penguin_transform`,
    (
      SELECT *
      FROM `bigquery-public-data.ml_datasets.penguins`
      WHERE body_mass_g IS NOT NULL
    ));
The results should look similar to the following:
+---------------------+--------------------+------------------------+-----------------------+--------------------+--------------------+
| mean_absolute_error | mean_squared_error | mean_squared_log_error | median_absolute_error | r2_score           | explained_variance |
+---------------------+--------------------+------------------------+-----------------------+--------------------+--------------------+
|   64.21134350607677 | 13016.433317859564 |   7.140935762696211E-4 |     15.31788461553515 | 0.9813042531507734 | 0.9813186268757634 |
+---------------------+--------------------+------------------------+-----------------------+--------------------+--------------------+
An important metric in the evaluation results is the R2 score. The R2 score is a statistical measure that determines if the linear regression predictions approximate the actual data. A value of 0 indicates that the model explains none of the variability of the response data around the mean. A value of 1 indicates that the model explains all the variability of the response data around the mean.
For more information about the ML.EVALUATE function output, see Regression models.

You can also call ML.EVALUATE without providing the input data. In that case, the function uses the evaluation metrics calculated during training.
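For example, the following call returns only the metrics that were computed during model training:

```sql
-- No input data: returns the evaluation metrics calculated during training.
SELECT *
FROM ML.EVALUATE(MODEL `bqml_tutorial.penguin_transform`);
```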
Use the model to predict outcomes

Use the model with the ML.PREDICT function to predict the weight of male penguins.

The ML.PREDICT function outputs the predicted value in the predicted_<label_column_name> column, in this case predicted_body_mass_g.

When you use the ML.PREDICT function, you don't have to pass in all of the columns used in model training. Only the columns that you used in the TRANSFORM clause are required. Similar to ML.EVALUATE, the ML.PREDICT function automatically retrieves the TRANSFORM columns and transformations from the model.
Follow these steps to get predictions from the model:
In the Google Cloud console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:
SELECT
  predicted_body_mass_g
FROM
  ML.PREDICT(
    MODEL `bqml_tutorial.penguin_transform`,
    (
      SELECT *
      FROM `bigquery-public-data.ml_datasets.penguins`
      WHERE sex = 'MALE'
    ));
The results should look similar to the following:
+-----------------------+
| predicted_body_mass_g |
+-----------------------+
|    2810.2868541725757 |
+-----------------------+
|    3813.6574220842676 |
+-----------------------+
|     4098.844698262214 |
+-----------------------+
|     4256.587135004173 |
+-----------------------+
|     3008.393497302691 |
+-----------------------+
| ...                   |
+-----------------------+
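To illustrate that only the TRANSFORM input columns are needed, you can also predict from a hand-constructed row. The measurement values below are made up for illustration, and this is a sketch rather than part of the tutorial's required steps:

```sql
-- Predict from a single hypothetical row; only the TRANSFORM input columns
-- are supplied (the measurement values are invented for illustration).
SELECT predicted_body_mass_g
FROM ML.PREDICT(
    MODEL `bqml_tutorial.penguin_transform`,
    (
      SELECT
        46.0 AS culmen_length_mm,    -- made-up measurement (mm)
        18.0 AS culmen_depth_mm,     -- made-up measurement (mm)
        195.0 AS flipper_length_mm,  -- made-up measurement (mm)
        'Adelie Penguin (Pygoscelis adeliae)' AS species,
        'MALE' AS sex
    ));
```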
Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:
If necessary, open the BigQuery page in the Google Cloud console.
In the navigation panel, click the bqml_tutorial dataset you created.
On the right side of the window, click Delete dataset. This action deletes the dataset, the table, and all the data.
In the Delete dataset dialog box, confirm the delete command by typing the name of your dataset (bqml_tutorial) and then click Delete.
To delete the project:
If you plan to reuse custom identifiers such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project. If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.