A RetroSearch Logo

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Search Query:

Showing content from https://developers.google.com/bigquery/docs/create-machine-learning-model below:

Create a machine learning model in BigQuery ML by using SQL

Create a machine learning model in BigQuery ML by using SQL

This tutorial shows you how to create a logistic regression model by using BigQuery ML SQL queries.

BigQuery ML lets you create and train machine learning models in BigQuery by using SQL queries. This helps make machine learning more approachable by letting you use familiar tools like the BigQuery SQL editor, and also increases development speed by removing the need to move data into a separate machine learning environment.

In this tutorial, you use the sample Google Analytics sample dataset for BigQuery to create a model that predicts whether a website visitor will make a transaction. For information on the schema of the Analytics dataset, see BigQuery export schema in the Analytics Help Center.

To learn how to create models by using the Google Cloud console user interface, see work with models by using a UI. (Preview)

Objectives

This tutorial shows you how to perform the following tasks:

Costs

This tutorial uses billable components of Google Cloud, including the following:

For more information on BigQuery costs, see the BigQuery pricing page.

For more information on BigQuery ML costs, see BigQuery ML pricing.

Required roles Before you begin
  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. Make sure that you have the following role or roles on the project: BigQuery Data Editor, BigQuery Job User, Service Usage Admin

    Check for the roles
    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.

    4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
    Grant the roles
    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. Click person_add Grant access.
    4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.

    5. In the Select a role list, select a role.
    6. To grant additional roles, click add Add another role and add each additional role.
    7. Click Save.
  5. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

    Go to project selector

  6. Verify that billing is enabled for your Google Cloud project.

  7. Make sure that you have the following role or roles on the project: BigQuery Data Editor, BigQuery Job User, Service Usage Admin

    Check for the roles
    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.

    4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
    Grant the roles
    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. Click person_add Grant access.
    4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.

    5. In the Select a role list, select a role.
    6. To grant additional roles, click add Add another role and add each additional role.
    7. Click Save.
  8. BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, go to

    Enable the BigQuery API.

    Enable the API

Create a dataset

Create a BigQuery dataset to store your ML model.

Console
  1. In the Google Cloud console, go to the BigQuery page.

    Go to the BigQuery page

  2. In the Explorer pane, click your project name.

  3. Click more_vert View actions > Create dataset.

  4. On the Create dataset page, do the following:

bq

To create a new dataset, use the bq mk command with the --location flag. For a full list of possible parameters, see the bq mk --dataset command reference.

  1. Create a dataset named bqml_tutorial with the data location set to US and a description of BigQuery ML tutorial dataset:

    bq --location=US mk -d \
     --description "BigQuery ML tutorial dataset." \
     bqml_tutorial

    Instead of using the --dataset flag, the command uses the -d shortcut. If you omit -d and --dataset, the command defaults to creating a dataset.

  2. Confirm that the dataset was created:

    bq ls
API

Call the datasets.insert method with a defined dataset resource.

{
  "datasetReference": {
     "datasetId": "bqml_tutorial"
  }
}
BigQuery DataFrames

Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

Create a logistic regression model

Create a logistic regression model using the Analytics sample dataset for BigQuery.

SQL
  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, run the following statement:

    CREATE OR REPLACE MODEL `bqml_tutorial.sample_model`
    OPTIONS(model_type='logistic_reg') AS
    SELECT
    IF(totals.transactions IS NULL, 0, 1) AS label,
    IFNULL(device.operatingSystem, "") AS os,
    device.isMobile AS is_mobile,
    IFNULL(geoNetwork.country, "") AS country,
    IFNULL(totals.pageviews, 0) AS pageviews
    FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
    _TABLE_SUFFIX BETWEEN '20160801' AND '20170630'

    The query takes several minutes to complete. After the first iteration is complete, your model (sample_model) appears in the navigation panel. Because the query uses a CREATE MODEL statement to create a model, you don't see query results.

Query details

The CREATE MODEL statement creates the model and then trains the model using the data retrieved by your query's SELECT statement.

The OPTIONS(model_type='logistic_reg') clause creates a logistic regression model. A logistic regression model splits input data into two classes, and then estimates the probability that the data is in one of the classes. What you are trying to detect, such as whether an email is spam, is represented by 1 and other values are represented by 0. The likelihood of a given value belonging to the class you are trying to detect is indicated by a value between 0 and 1. For example, if an email receives a probability estimate of 0.9, then there is a 90% probability that the email is spam.

This query's SELECT statement retrieves the following columns that are used by the model to predict the probability that a customer will complete a transaction:

The FROM clause — causes the query to train the model by using the bigquery-public-data.google_analytics_sample.ga_sessions sample tables. These tables are sharded by date, so you aggregate them by using a wildcard in the table name: google_analytics_sample.ga_sessions_*.

The WHERE clause — _TABLE_SUFFIX BETWEEN '20160801' AND '20170630' — limits the number of tables scanned by the query. The date range scanned is August 1, 2016 to June 30, 2017.

BigQuery DataFrames

Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

View the model's loss statistics

Machine learning is about creating a model that can use data to make a prediction. The model is essentially a function that takes inputs and applies calculations to the inputs to produce an output — a prediction.

Machine learning algorithms work by taking several examples where the prediction is already known (such as the historical data of user purchases) and iteratively adjusting various weights in the model so that the model's predictions match the true values. It does this by minimizing how wrong the model is using a metric called loss.

The expectation is that for each iteration, the loss should be decreasing, ideally to zero. A loss of zero means the model is 100% accurate.

When training the model, BigQuery ML automatically splits the input data into training and evaluation sets, in order to avoid overfitting the model. This is necessary so that the training algorithm doesn't fit itself so closely to the training data that it can't generalize to new examples.

Use the Google Cloud console to see how the model's loss changes over the model's training iterations:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, expand bqml_tutorial > Models and then click sample_model.

  3. Click the Training tab and look at the Loss graph. The Loss graph shows the change in the loss metric over the iterations on the training dataset. If you hold your cursor over the graph, you can see that there are lines for Training loss and Evaluation loss. Since you performed a logistic regression, the training loss value is calculated as log loss, using the training data. The evaluation loss is the log loss calculated on the evaluation data. Both loss types represent average loss values, averaged over all examples in the respective datasets for each iteration.

You can also see the results of the model training by using the ML.TRAINING_INFO function.

Evaluate the model

Evaluate the performance of the model by using the ML.EVALUATE function. The ML.EVALUATE function evaluates the predicted values generated by the model against the actual data. To calculate logistic regression specific metrics, you can use the ML.ROC_CURVE SQL function or the bigframes.ml.metrics.roc_curve BigQuery DataFrames function.

In this tutorial, you are using a binary classification model that detects transactions. The values in the label column are the two classes generated by the model: 0 (no transactions) and 1 (transaction made).

SQL
  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, run the following statement:

    SELECT
    *
    FROM
    ML.EVALUATE(MODEL `bqml_tutorial.sample_model`, (
    SELECT
    IF(totals.transactions IS NULL, 0, 1) AS label,
    IFNULL(device.operatingSystem, "") AS os,
    device.isMobile AS is_mobile,
    IFNULL(geoNetwork.country, "") AS country,
    IFNULL(totals.pageviews, 0) AS pageviews
    FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
    _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))

    The results should look like the following:

      +--------------------+---------------------+---------------------+---------------------+---------------------+--------------------+
      |     precision      |       recall        |      accuracy       |      f1_score       |      log_loss       | roc_auc                   |
      +--------------------+---------------------+---------------------+---------------------+---------------------+--------------------+
      | 0.468503937007874  | 0.11080074487895716 | 0.98534315834767638 | 0.17921686746987953 | 0.04624221101176898    | 0.98174125874125873 |
      +--------------------+---------------------+---------------------+---------------------+---------------------+--------------------+
      

    Because you performed a logistic regression, the results include the following columns:

Query details

The initial SELECT statement retrieves the columns from your model.

The FROM clause uses the ML.EVALUATE function against your model.

The nested SELECT statement and FROM clause are the same as those in the CREATE MODEL query.

The WHERE clause — _TABLE_SUFFIX BETWEEN '20170701' AND '20170801' — limits the number of tables scanned by the query. The date range scanned is July 1, 2017 to August 1, 2017. This is the data you're using to evaluate the predictive performance of the model. It was collected in the month immediately following the time period spanned by the training data.

BigQuery DataFrames

Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

Use the model to predict outcomes

Use the model to predict the number of transactions made by website visitors from each country.

SQL
  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, run the following statement:

    SELECT
    country,
    SUM(predicted_label) as total_predicted_purchases
    FROM
    ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
    SELECT
    IFNULL(device.operatingSystem, "") AS os,
    device.isMobile AS is_mobile,
    IFNULL(totals.pageviews, 0) AS pageviews,
    IFNULL(geoNetwork.country, "") AS country
    FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
    _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
    GROUP BY country
    ORDER BY total_predicted_purchases DESC
    LIMIT 10

    The results should look like the following:

    +----------------+---------------------------+
    |    country     | total_predicted_purchases |
    +----------------+---------------------------+
    | United States  |                       220 |
    | Taiwan         |                         8 |
    | Canada         |                         7 |
    | India          |                         2 |
    | Turkey         |                         2 |
    | Japan          |                         2 |
    | Italy          |                         1 |
    | Brazil         |                         1 |
    | Singapore      |                         1 |
    | Australia      |                         1 |
    +----------------+---------------------------+
    

Query details

The initial SELECT statement retrieves the country column and sums the predicted_label column. The predicted_label column is generated by the ML.PREDICT function. When you use the ML.PREDICT function, the output column name for the model is predicted_<label_column_name>. For linear regression models, predicted_label is the estimated value of label. For logistic regression models, predicted_label is the label that best describes the given input data value, either 0 or 1.

The ML.PREDICT function is used to predict results using your model.

The nested SELECT statement and FROM clause are the same as those in the CREATE MODEL query.

The WHERE clause — _TABLE_SUFFIX BETWEEN '20170701' AND '20170801' — limits the number of tables scanned by the query. The date range scanned is July 1, 2017 to August 1, 2017. This is the data for which you're making predictions. It was collected in the month immediately following the time period spanned by the training data.

The GROUP BY and ORDER BY clauses group the results by country and order them by the sum of the predicted purchases in descending order.

The LIMIT clause is used here to display only the top 10 results.

BigQuery DataFrames

Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

Predict purchases per user

Predict the number of transactions each website visitor will make.

SQL

This query is identical to the query in the previous section except for the GROUP BY clause. Here the GROUP BY clause — GROUP BY fullVisitorId — is used to group the results by visitor ID.

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, run the following statement:

    SELECT
    fullVisitorId,
    SUM(predicted_label) as total_predicted_purchases
    FROM
    ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
    SELECT
    IFNULL(device.operatingSystem, "") AS os,
    device.isMobile AS is_mobile,
    IFNULL(totals.pageviews, 0) AS pageviews,
    IFNULL(geoNetwork.country, "") AS country,
    fullVisitorId
    FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
    _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
    GROUP BY fullVisitorId
    ORDER BY total_predicted_purchases DESC
    LIMIT 10

    The results should look like the following:

      +---------------------+---------------------------+
      |    fullVisitorId    | total_predicted_purchases |
      +---------------------+---------------------------+
      | 9417857471295131045 |                         4 |
      | 112288330928895942  |                         2 |
      | 2158257269735455737 |                         2 |
      | 489038402765684003  |                         2 |
      | 057693500927581077  |                         2 |
      | 2969418676126258798 |                         2 |
      | 5073919761051630191 |                         2 |
      | 7420300501523012460 |                         2 |
      | 0456807427403774085 |                         2 |
      | 2105122376016897629 |                         2 |
      +---------------------+---------------------------+
      
BigQuery DataFrames

Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.

You can delete the project you created, or keep the project and delete the dataset.

Delete the dataset

Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, select the bqml_tutorial dataset you created.

  3. Click more_vert Actions > Delete.

  4. In the Delete dataset dialog, confirm the delete command by typing delete.

  5. Click Delete.

Delete the project

To delete the project:

    Caution: Deleting a project has the following effects:

    If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.
What's next

RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4