Create a BigQuery dataset to store your ML model.
Console
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, click your project name.
Click more_vert View actions > Create dataset.
On the Create dataset page, do the following:
For Dataset ID, enter bqml_tutorial.
For Location type, select Multi-region, and then select US (multiple regions in United States).
Leave the remaining default settings as they are, and click Create dataset.
To create a new dataset, use the bq mk command with the --location flag. For a full list of possible parameters, see the bq mk --dataset command reference.
Create a dataset named bqml_tutorial with the data location set to US and a description of BigQuery ML tutorial dataset:
bq --location=US mk -d \
  --description "BigQuery ML tutorial dataset." \
  bqml_tutorial
Instead of using the --dataset flag, the command uses the -d shortcut. If you omit -d and --dataset, the command defaults to creating a dataset.
Confirm that the dataset was created:
bq ls
Call the datasets.insert method with a defined dataset resource.
{
  "datasetReference": {
    "datasetId": "bqml_tutorial"
  }
}
BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.
Visualize the input data
Before creating the model, you can optionally visualize your input time series data to get a sense of the distribution. You can do this by using Looker Studio.
SQL
The SELECT statement of the following query uses the EXTRACT function to extract the date information from the starttime column. The query uses the COUNT(*) aggregate function, together with GROUP BY date, to get the daily total number of Citi Bike trips.
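To make the aggregation concrete, the following is a minimal local sketch of what the query computes: it truncates each trip's start time to a calendar date and counts the trips per date. The sample timestamps and the daily_trip_counts helper are hypothetical and only illustrate the idea; the real data is in the public citibike_trips table.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample of trip start times (assumption for illustration only);
# the real values come from the starttime column of the public table.
start_times = [
    datetime(2016, 7, 1, 8, 15),
    datetime(2016, 7, 1, 17, 40),
    datetime(2016, 7, 2, 9, 5),
]


def daily_trip_counts(timestamps):
    """Count trips per calendar date.

    ts.date() plays the role of EXTRACT(DATE FROM starttime), and the
    Counter plays the role of COUNT(*) with GROUP BY date.
    """
    return Counter(ts.date() for ts in timestamps)


num_trips = daily_trip_counts(start_times)
for day, count in sorted(num_trips.items()):
    print(day, count)
```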
Follow these steps to visualize the time series data:
In the Google Cloud console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:
SELECT
  EXTRACT(DATE FROM starttime) AS date,
  COUNT(*) AS num_trips
FROM
  `bigquery-public-data.new_york.citibike_trips`
GROUP BY date;
When the query completes, click Explore data > Explore with Looker Studio. Looker Studio opens in a new tab. Complete the following steps in the new tab.
In Looker Studio, click Insert > Time series chart.
In the Chart pane, choose the Setup tab.
In the Metric section, add the num_trips field, and remove the default Record Count metric. The resulting chart looks similar to the following:
Create the time series model
You want to forecast the number of bike trips for each Citi Bike station, which requires many time series models, one for each Citi Bike station that is included in the input data. You can create multiple models to do this, but that can be a tedious and time-consuming process, especially when you have a large number of time series. Instead, you can use a single query to create and fit a set of time series models in order to forecast multiple time series at once.
SQL
In the following query, the OPTIONS(model_type='ARIMA_PLUS', time_series_timestamp_col='date', ...) clause indicates that you are creating an ARIMA-based time series model. You use the time_series_id_col option of the CREATE MODEL statement to specify one or more columns in the input data that you want to get forecasts for, in this case the Citi Bike station, as represented by the start_station_name column. You use the WHERE clause to limit the start stations to those with Central Park in their names. The auto_arima_max_order option of the CREATE MODEL statement controls the search space for hyperparameter tuning in the auto.ARIMA algorithm. The decompose_time_series option of the CREATE MODEL statement defaults to TRUE, so that information about the time series data is returned when you evaluate the model in the next step.
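As a rough local illustration of how an ID column splits one input table into many time series, the sketch below groups hypothetical trip rows by station and date after applying a substring filter equivalent to LIKE '%Central Park%'. The station names, rows, and the build_series helper are assumptions made up for this example. Each key of the resulting dictionary corresponds to one time series that would get its own model.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rows of (start_station_name, starttime); not real trip data.
trips = [
    ("Central Park S & 6 Ave", datetime(2016, 7, 1, 8, 15)),
    ("Central Park S & 6 Ave", datetime(2016, 7, 1, 17, 40)),
    ("Central Park West & W 72 St", datetime(2016, 7, 1, 9, 5)),
    ("Broadway & W 58 St", datetime(2016, 7, 1, 10, 0)),
]


def build_series(rows, name_filter="Central Park"):
    """Return {station: {date: num_trips}} for stations matching name_filter."""
    series = defaultdict(lambda: defaultdict(int))
    for station, start in rows:
        # Case-sensitive substring match, like LIKE '%Central Park%'.
        if name_filter in station:
            series[station][start.date()] += 1
    return {station: dict(days) for station, days in series.items()}


series_by_station = build_series(trips)
# One entry per station, that is, one time series per ID value.
print(len(series_by_station))
```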
Follow these steps to create the model:
In the Google Cloud console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:
CREATE OR REPLACE MODEL `bqml_tutorial.nyc_citibike_arima_model_group`
OPTIONS (
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'date',
  time_series_data_col = 'num_trips',
  time_series_id_col = 'start_station_name',
  auto_arima_max_order = 5
) AS
SELECT
  start_station_name,
  EXTRACT(DATE FROM starttime) AS date,
  COUNT(*) AS num_trips
FROM
  `bigquery-public-data.new_york.citibike_trips`
WHERE start_station_name LIKE '%Central Park%'
GROUP BY start_station_name, date;
The query takes approximately 24 seconds to complete, after which the nyc_citibike_arima_model_group model appears in the Explorer pane. Because the query uses a CREATE MODEL statement, you don't see query results.
This query creates twelve time series models, one for each of the twelve Citi Bike start stations in the input data. Because the models are trained in parallel, the time cost, approximately 24 seconds, is only 1.4 times more than that of creating a single time series model. However, if you remove the WHERE ... LIKE ... clause, there would be 600+ time series to forecast, and they wouldn't be forecast completely in parallel because of slot capacity limitations. In that case, the query would take approximately 15 minutes to finish. To reduce the query runtime at the cost of a potential slight drop in model quality, you could decrease the value of the auto_arima_max_order option. This shrinks the search space of hyperparameter tuning in the auto.ARIMA algorithm. For more information, see Large-scale time series forecasting best practices.
In the following snippet, you are creating an ARIMA-based time series model.
This creates twelve time series models, one for each of the twelve Citi Bike start stations in the input data. The time cost, approximately 24 seconds, is only 1.4 times more than that of creating a single time series model because of the parallelism.
Evaluate the model
SQL
Evaluate the time series model by using the ML.ARIMA_EVALUATE function. The ML.ARIMA_EVALUATE function shows you the evaluation metrics that were generated for the model during the process of automatic hyperparameter tuning.
Follow these steps to evaluate the model:
In the Google Cloud console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:
SELECT * FROM ML.ARIMA_EVALUATE(MODEL `bqml_tutorial.nyc_citibike_arima_model_group`);
The results should look like the following:
While auto.ARIMA evaluates dozens of candidate ARIMA models for each time series, ML.ARIMA_EVALUATE by default only outputs the information of the best model to make the output table compact. To view all the candidate models, you can set the ML.ARIMA_EVALUATE function's show_all_candidate_models argument to TRUE.
The start_station_name column identifies the input data column for which time series were created. This is the column that you specified with the time_series_id_col option when creating the model.
The non_seasonal_p, non_seasonal_d, non_seasonal_q, and has_drift output columns define an ARIMA model in the training pipeline. The log_likelihood, AIC, and variance output columns are relevant to the ARIMA model fitting process. The fitting process determines the best ARIMA model by using the auto.ARIMA algorithm, one for each time series.
The auto.ARIMA algorithm uses the KPSS test to determine the best value for non_seasonal_d, which in this case is 1. When non_seasonal_d is 1, the auto.ARIMA algorithm trains 42 different candidate ARIMA models in parallel. In this example, all 42 candidate models are valid, so the output contains 42 rows, one for each candidate ARIMA model; in cases where some of the models aren't valid, they are excluded from the output. These candidate models are returned in ascending order by AIC. The model in the first row has the lowest AIC and is considered the best model. This best model is saved as the final model and is used when you forecast data, evaluate the model, and inspect the model's coefficients as shown in the following steps.
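The text does not say how the figure of 42 is derived. One way to arrive at it, offered here as an assumption rather than a description of the actual search, is to suppose that auto.ARIMA tries every (p, q) pair whose sum is at most auto_arima_max_order (5 in this tutorial) and fits each pair with and without a drift term when non_seasonal_d is 1. The hypothetical candidate_orders helper below enumerates that grid.

```python
from itertools import product


def candidate_orders(max_order=5):
    """Enumerate (p, q, has_drift) combinations with p + q <= max_order.

    Assumption for illustration: the search space is the full grid of
    non-negative p and q bounded by max_order, each tried with and
    without drift. The real auto.ARIMA search may differ.
    """
    return [
        (p, q, has_drift)
        for p, q in product(range(max_order + 1), repeat=2)
        if p + q <= max_order
        for has_drift in (False, True)
    ]


# 21 (p, q) pairs times 2 drift settings.
print(len(candidate_orders(5)))  # 42
```

Under the same assumption, lowering auto_arima_max_order shrinks the grid quickly, which is consistent with the runtime advice given for large numbers of time series.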
The seasonal_periods column contains information about the seasonal pattern identified in the time series data. Each time series can have different seasonal patterns. For example, from the figure, you can see that one time series has a yearly pattern, while others don't.
The has_holiday_effect, has_spikes_and_dips, and has_step_changes columns are only populated when decompose_time_series=TRUE. These columns also reflect information about the input time series data, and are not related to the ARIMA modeling. They have the same values across all output rows.
Inspect the time series model's coefficients by using the ML.ARIMA_COEFFICIENTS function.
Follow these steps to retrieve the model's coefficients:
In the Google Cloud console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:
SELECT * FROM ML.ARIMA_COEFFICIENTS(MODEL `bqml_tutorial.nyc_citibike_arima_model_group`);
The query takes less than a second to complete. The results should look similar to the following:
For more information about the output columns, see ML.ARIMA_COEFFICIENTS function.
Inspect the time series model's coefficients by using the coef_ property.
The start_station_name column identifies the input data column for which time series were created. This is the column that you specified in the time_series_id_col option when creating the model.
The ar_coefficients output column shows the model coefficients of the autoregressive (AR) part of the ARIMA model. Similarly, the ma_coefficients output column shows the model coefficients of the moving-average (MA) part of the ARIMA model. Both of these columns contain array values, whose lengths are equal to non_seasonal_p and non_seasonal_q, respectively. The intercept_or_drift value is the constant term in the ARIMA model.
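To show how these three outputs fit together, the sketch below computes a one-step-ahead prediction from the textbook ARMA recursion: the constant term, plus the AR coefficients applied to recent values, plus the MA coefficients applied to recent residuals. The arma_one_step helper and all numbers are hypothetical. It assumes the common sign convention in which MA terms are added, and it operates on the differenced series when non_seasonal_d is 1. BigQuery ML's internal conventions may differ.

```python
def arma_one_step(recent_values, recent_residuals,
                  ar_coefficients, ma_coefficients, intercept_or_drift):
    """One-step-ahead ARMA prediction on an (already differenced) series.

    recent_values and recent_residuals are ordered most recent first, so
    the i-th coefficient multiplies the value (or residual) i + 1 steps back.
    """
    ar_part = sum(phi * value
                  for phi, value in zip(ar_coefficients, recent_values))
    ma_part = sum(theta * residual
                  for theta, residual in zip(ma_coefficients, recent_residuals))
    return intercept_or_drift + ar_part + ma_part


# Hypothetical coefficients: non_seasonal_p = 2 and non_seasonal_q = 1,
# matching the lengths of the two coefficient arrays.
prediction = arma_one_step(
    recent_values=[10.0, 8.0],
    recent_residuals=[2.0],
    ar_coefficients=[0.5, -0.2],
    ma_coefficients=[0.3],
    intercept_or_drift=1.0,
)
print(round(prediction, 6))
```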
Forecast future time series values by using the ML.FORECAST function.
In the following GoogleSQL query, the STRUCT(3 AS horizon, 0.9 AS confidence_level) clause indicates that the query forecasts 3 future time points, and generates a prediction interval with a 90% confidence level.
Follow these steps to forecast data with the model:
In the Google Cloud console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:
SELECT * FROM ML.FORECAST(MODEL `bqml_tutorial.nyc_citibike_arima_model_group`, STRUCT(3 AS horizon, 0.9 AS confidence_level));
The query takes less than a second to complete. The results should look like the following:
For more information about the output columns, see ML.FORECAST function.
Forecast future time series values by using the predict function.
The first column, start_station_name, annotates the time series that each time series model is fitted against. Each start_station_name has three rows of forecasted results, as specified by the horizon value.
For each start_station_name, the output rows are in chronological order by the forecast_timestamp column value. In time series forecasting, the prediction interval, as represented by the prediction_interval_lower_bound and prediction_interval_upper_bound column values, is as important as the forecast_value column value. The forecast_value value is the middle point of the prediction interval. The prediction interval depends on the standard_error and confidence_level column values.
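The relationship between these columns can be sketched with the standard symmetric interval around a point forecast. The example below assumes normally distributed forecast errors, which is a simplifying assumption and not necessarily how BigQuery ML computes the bounds. The prediction_interval helper and the input numbers are hypothetical.

```python
from statistics import NormalDist


def prediction_interval(forecast_value, standard_error, confidence_level):
    """Return (lower, upper) bounds centered on forecast_value.

    Assumes normally distributed errors: the half-width is the standard
    error scaled by the two-sided normal quantile for confidence_level.
    """
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    margin = z * standard_error
    return forecast_value - margin, forecast_value + margin


# Hypothetical forecast of 100 trips with a standard error of 10.
lower, upper = prediction_interval(100.0, 10.0, 0.9)
print(round(lower, 2), round(upper, 2))
```

A larger standard_error or a higher confidence_level widens the interval, while forecast_value stays at its midpoint.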
You can get explainability metrics in addition to forecast data by using the ML.EXPLAIN_FORECAST function. The ML.EXPLAIN_FORECAST function forecasts future time series values and also returns all the separate components of the time series. If you just want to return forecast data, use the ML.FORECAST function instead, as shown in Use the model to forecast data.
The STRUCT(3 AS horizon, 0.9 AS confidence_level) clause used in the ML.EXPLAIN_FORECAST function indicates that the query forecasts 3 future time points and generates a prediction interval with 90% confidence.
Follow these steps to explain the model's results:
In the Google Cloud console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:
SELECT * FROM ML.EXPLAIN_FORECAST(MODEL `bqml_tutorial.nyc_citibike_arima_model_group`, STRUCT(3 AS horizon, 0.9 AS confidence_level));
The query takes less than a second to complete. The results should look like the following:
The first several thousand rows returned are all historical data. You must scroll through the results to see the forecast data.
The output rows are ordered first by start_station_name, then chronologically by the time_series_timestamp column value. In time series forecasting, the prediction interval, as represented by the prediction_interval_lower_bound and prediction_interval_upper_bound column values, is as important as the forecast_value column value. The forecast_value value is the middle point of the prediction interval. The prediction interval depends on the standard_error and confidence_level column values.
For more information about the output columns, see ML.EXPLAIN_FORECAST.
You can get explainability metrics in addition to forecast data by using the predict_explain function. The predict_explain function forecasts future time series values and also returns all the separate components of the time series. If you just want to return forecast data, use the predict function instead, as shown in Use the model to forecast data.
The horizon=3, confidence_level=0.9 arguments used in the predict_explain function indicate that the call forecasts 3 future time points and generates a prediction interval with 90% confidence.
The output rows are ordered chronologically by the time_series_timestamp column value, then by start_station_name. In time series forecasting, the prediction interval, as represented by the prediction_interval_lower_bound and prediction_interval_upper_bound column values, is as important as the forecast_value column value. The forecast_value value is the middle point of the prediction interval. The prediction interval depends on the standard_error and confidence_level column values.