This document describes how to stream data into BigQuery by using the legacy tabledata.insertAll method.
For new projects, we recommend using the BigQuery Storage Write API instead of the tabledata.insertAll method. The Storage Write API has lower pricing and more robust features, including exactly-once delivery semantics. If you are migrating an existing project from the tabledata.insertAll method to the Storage Write API, we recommend selecting the default stream. The tabledata.insertAll method is still fully supported.
Ensure that you have write access to the dataset that contains your destination table. The table must exist before you begin writing data to it unless you are using template tables. For more information on template tables, see Creating tables automatically using template tables.
Check the quota policy for streaming data.
Verify that billing is enabled for your Google Cloud project.
Streaming is not available through the free tier. If you attempt to use streaming without enabling billing, you receive the following error: BigQuery: Streaming insert is not allowed in the free tier.
Grant Identity and Access Management (IAM) roles that give users the necessary permissions to perform each task in this document.
To stream data into BigQuery, you need the following IAM permissions:
bigquery.tables.updateData (lets you insert data into the table)
bigquery.tables.get (lets you obtain table metadata)
bigquery.datasets.get (lets you obtain dataset metadata)
bigquery.tables.create (required if you use a template table to create the table automatically)
Each of the following predefined IAM roles includes the permissions that you need in order to stream data into BigQuery:
roles/bigquery.dataEditor
roles/bigquery.dataOwner
roles/bigquery.admin
For more information about IAM roles and permissions in BigQuery, see Predefined roles and permissions.
Stream data into BigQuery
Before trying a sample in this section, follow the setup instructions for your language (C#, Go, Java, Node.js, PHP, Python, or Ruby) in the BigQuery quickstart using client libraries. For more information, see the BigQuery API reference documentation for that language.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
Note: To specify a NUMERIC or BIGNUMERIC value in a row, you must surround the value with double quotation marks, such as "big_numeric_col":"0.123456789123".
You don't need to populate the insertId field when you insert rows. The following example shows how to avoid sending an insertId for each row when streaming.
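The scraped code samples are not reproduced here; as a minimal sketch, the following pure-Python snippet builds a tabledata.insertAll request body in which no row carries an insertId key (the example row fields are hypothetical):

```python
import json

def build_insert_all_body(rows):
    """Build a tabledata.insertAll request body that omits the
    optional insertId field, disabling best effort de-duplication."""
    return {
        "kind": "bigquery#tableDataInsertAllRequest",
        # Each entry carries only "json"; no "insertId" key is set.
        "rows": [{"json": row} for row in rows],
    }

body = build_insert_all_body([{"name": "alice", "score": 1},
                              {"name": "bob", "score": 2}])
print(json.dumps(body, indent=2))
```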
For date and time fields, format the data in the tabledata.insertAll method as follows:
DATE: a string in the form "YYYY-MM-DD"
DATETIME: a string in the form "YYYY-MM-DD [HH:MM:SS]"
TIME: a string in the form "HH:MM:SS"
TIMESTAMP: the number of seconds since 1970-01-01 (the Unix epoch), or a string in the form "YYYY-MM-DD HH:MM[:SS]"
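As a sketch of these formats, the following snippet derives each field value from one Python datetime (the column names are hypothetical):

```python
from datetime import datetime, timezone

ts = datetime(2021, 3, 1, 12, 30, 45, tzinfo=timezone.utc)

row = {
    "d": ts.strftime("%Y-%m-%d"),            # DATE: "YYYY-MM-DD"
    "dt": ts.strftime("%Y-%m-%d %H:%M:%S"),  # DATETIME: "YYYY-MM-DD HH:MM:SS"
    "t": ts.strftime("%H:%M:%S"),            # TIME: "HH:MM:SS"
    "event_ts": ts.timestamp(),              # TIMESTAMP: seconds since the Unix epoch
}
print(row)
```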
Send range data
For fields with type RANGE<T>, format the data in the tabledata.insertAll method as a JSON object with two fields, start and end. Missing or NULL values for the start and end fields represent unbounded boundaries. These fields must have the same supported JSON format of type T, where T can be one of DATE, DATETIME, or TIMESTAMP.
In the following example, the f_range_date field represents a RANGE<DATE> column in a table. A row is inserted into this column using the tabledata.insertAll API.
{
"f_range_date": {
"start": "1970-01-02",
"end": null
}
}
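A small helper can build such range values, using None for an unbounded boundary; this is a sketch, not part of any client library:

```python
def range_value(start=None, end=None):
    """Build a RANGE<T> value for tabledata.insertAll.
    None for start or end represents an unbounded boundary."""
    return {"start": start, "end": end}

# Matches the JSON example above: bounded start, unbounded end.
row = {"f_range_date": range_value(start="1970-01-02")}
print(row)
```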
Stream data availability
Data is available for real-time analysis using GoogleSQL queries immediately after BigQuery successfully acknowledges a tabledata.insertAll request.
Rows recently streamed to an ingestion-time partitioned table temporarily have a NULL value for the _PARTITIONTIME pseudocolumn. For such rows, BigQuery assigns the final non-NULL value of the _PARTITIONTIME column in the background, typically within a few minutes. In rare cases, this can take up to 90 minutes.
Some recently streamed rows might not be available for table copy, typically for a few minutes. In rare cases, this can take up to 90 minutes. To see whether data is available for table copy, check the tables.get response for a section named streamingBuffer. If the streamingBuffer section is absent, your data is available for copy. You can also use the streamingBuffer.oldestEntryTime field to identify the age of records in the streaming buffer.
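The streamingBuffer check can be sketched as a simple inspection of the tables.get resource (the example resource values below are hypothetical):

```python
def copy_ready(table_resource):
    """Return True if a tables.get resource has no streamingBuffer
    section, meaning streamed data is available for table copy."""
    return "streamingBuffer" not in table_resource

# Abridged, hypothetical tables.get responses:
buffered = {"id": "proj:ds.t",
            "streamingBuffer": {"oldestEntryTime": "1616000000000"}}
flushed = {"id": "proj:ds.t"}

print(copy_ready(buffered), copy_ready(flushed))
```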
When you supply insertId for an inserted row, BigQuery uses this ID to support best effort de-duplication for up to one minute. That is, if you stream the same row with the same insertId more than once within that time period into the same table, BigQuery might de-duplicate the multiple occurrences of that row, retaining only one of those occurrences.
The system expects that rows provided with identical insertIds are also identical. If two rows have identical insertIds, it is nondeterministic which row BigQuery preserves.
De-duplication is generally meant for retry scenarios in a distributed system where there's no way to determine the state of a streaming insert under certain error conditions, such as network errors between your system and BigQuery or internal errors within BigQuery. If you retry an insert, use the same insertId for the same set of rows so that BigQuery can attempt to de-duplicate your data. For more information, see troubleshooting streaming inserts.
De-duplication offered by BigQuery is best effort, and it should not be relied upon as a mechanism to guarantee the absence of duplicates in your data. Additionally, BigQuery might degrade the quality of best effort de-duplication at any time in order to guarantee higher reliability and availability for your data.
If you have strict de-duplication requirements for your data, Google Cloud Datastore is an alternative service that supports transactions.
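One way to keep insertId stable across retries, sketched below, is to derive it deterministically from the row content; this is an illustrative approach, not one prescribed by BigQuery, and it also satisfies the expectation that rows with identical insertIds are identical:

```python
import hashlib
import json

def rows_with_insert_ids(rows):
    """Attach a deterministic insertId derived from each row's content,
    so a retried request carries the same IDs and BigQuery can attempt
    best effort de-duplication."""
    out = []
    for row in rows:
        digest = hashlib.sha256(
            json.dumps(row, sort_keys=True).encode()).hexdigest()
        out.append({"insertId": digest, "json": row})
    return out

first = rows_with_insert_ids([{"user": "alice", "score": 10}])
retry = rows_with_insert_ids([{"user": "alice", "score": 10}])
# A retried request reuses the same insertId for the same row.
print(first[0]["insertId"] == retry[0]["insertId"])
```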
Disabling best effort de-duplication
You can disable best effort de-duplication by not populating the insertId field for each row inserted. This is the recommended way to insert data.
To disable best effort de-duplication when you use Apache Beam's BigQuery I/O connector for Java, use the ignoreInsertIds() method.
To ensure that no duplicate rows exist after you are done streaming, use the following manual process:
1. Add insertId as a column in your table schema and include the insertId value in the data for each row.
2. After streaming has stopped, perform the following query to check for duplicates:

   #standardSQL
   SELECT MAX(count)
   FROM (
     SELECT ID_COLUMN, COUNT(*) AS count
     FROM `TABLE_NAME`
     GROUP BY ID_COLUMN)

   If the result is greater than 1, duplicates exist.
3. To remove duplicates, run the following query:

   #standardSQL
   SELECT * EXCEPT(row_number)
   FROM (
     SELECT *, ROW_NUMBER() OVER (PARTITION BY ID_COLUMN) AS row_number
     FROM `TABLE_NAME`)
   WHERE row_number = 1
Notes about the duplicate removal query:
The safer strategy for the duplicate removal query is to target a new table. Alternatively, you can target the source table with the write disposition WRITE_TRUNCATE.
The duplicate removal query adds a row_number column with the value 1 to the end of the table schema. The query uses a SELECT * EXCEPT statement from GoogleSQL to exclude the row_number column from the destination table. The #standardSQL prefix enables GoogleSQL for this query. Alternatively, you can select by specific column names to omit this column.
When you stream data to a time-partitioned table, each partition has a streaming buffer. The streaming buffer is retained when you perform a load, query, or copy job that overwrites a partition by setting the writeDisposition property to WRITE_TRUNCATE. If you want to remove the streaming buffer, verify that the streaming buffer is empty by calling tables.get on the partition.
When you stream to an ingestion-time partitioned table, BigQuery infers the destination partition from the current UTC time.
Newly arriving data is temporarily placed in the __UNPARTITIONED__ partition while in the streaming buffer. When there's enough unpartitioned data, BigQuery partitions the data into the correct partition. However, there is no SLA for how long it takes for data to move out of the __UNPARTITIONED__ partition. A query can exclude data in the streaming buffer by filtering out the NULL values from the __UNPARTITIONED__ partition by using one of the pseudocolumns (_PARTITIONTIME or _PARTITIONDATE, depending on your preferred data type).
If you are streaming data into a daily partitioned table, then you can override the date inference by supplying a partition decorator as part of the insertAll request. Include the decorator in the tableId parameter. For example, you can stream to the partition corresponding to 2021-03-01 for table table1 by using the partition decorator:

table1$20210301
When streaming using a partition decorator, you can stream to partitions within the last 31 days in the past and 16 days in the future relative to the current date, based on current UTC time. To write to partitions for dates outside these allowed bounds, use a load or query job instead, as described in Appending to and overwriting partitioned table data.
Streaming using a partition decorator is only supported for daily partitioned tables. It is not supported for hourly, monthly, or yearly partitioned tables.
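The decorator format and the 31-days-past/16-days-future window can be sketched with a small helper (the function name is hypothetical):

```python
from datetime import date

def decorated_table_id(table, partition_date, today):
    """Build a daily partition decorator such as table1$20210301,
    rejecting dates outside the streaming window of 31 days in the
    past and 16 days in the future relative to the current UTC date."""
    delta = (partition_date - today).days
    if delta < -31 or delta > 16:
        raise ValueError(
            "partition outside streaming bounds; use a load or query job")
    return f"{table}${partition_date:%Y%m%d}"

today = date(2021, 3, 10)
print(decorated_table_id("table1", date(2021, 3, 1), today))
```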
For testing, you can use the bq insert command of the bq command-line tool. For example, the following command streams a single row to the partition for the date January 1, 2017 ($20170101) in a partitioned table named mydataset.mytable:

echo '{"a":1, "b":2}' | bq insert 'mydataset.mytable$20170101'

Caution: The bq insert command is intended for testing only.
Time-unit column partitioning
You can stream data into a table partitioned on a DATE, DATETIME, or TIMESTAMP column if the column value is between 10 years in the past and 1 year in the future. Data outside this range is rejected.
When the data is streamed, it is initially placed in the __UNPARTITIONED__ partition. When there's enough unpartitioned data, BigQuery automatically repartitions the data, placing it into the appropriate partition. However, there is no SLA for how long it takes for data to move out of the __UNPARTITIONED__ partition.
Template tables provide a mechanism to split a logical table into many smaller tables to create smaller sets of data (for example, by user ID). Template tables have a number of limitations, described below. Partitioned tables and clustered tables are the recommended ways to achieve this behavior instead.
To use a template table through the BigQuery API, add a templateSuffix parameter to your insertAll request. For the bq command-line tool, add the template_suffix flag to your insert command. If BigQuery detects a templateSuffix parameter or the template_suffix flag, it treats the targeted table as a base template. It creates a new table that shares the same schema as the targeted table and has a name that includes the specified suffix:

<targeted_table_name> + <templateSuffix>
By using a template table, you avoid the overhead of creating each table individually and specifying the schema for each table. You need only create a single template, and supply different suffixes so that BigQuery can create the new tables for you. BigQuery places the tables in the same project and dataset.
Tables created by using template tables are usually available within a few seconds. On rare occasions, they may take longer to become available.
Change the template table schema
If you change a template table schema, all tables that are generated subsequently use the updated schema. Previously generated tables are not affected, unless the existing table still has a streaming buffer.
For existing tables that still have a streaming buffer, if you modify the template table schema in a backward compatible way, the schema of those actively streamed generated tables is also updated. However, if you modify the template table schema in a non-backward compatible way, any buffered data that uses the old schema is lost. Also, you cannot stream new data to existing generated tables that use the old, but now incompatible, schema.
After you change a template table schema, wait until the changes have propagated before you try to insert new data or query the generated tables. Requests to insert new fields should succeed within a few minutes. Attempts to query the new fields might require a longer wait of up to 90 minutes.
If you want to change a generated table's schema, do not change the schema until streaming through the template table has ceased and the generated table's streaming statistics section is absent from the tables.get() response, which indicates that no data is buffered on the table.
Partitioned tables and clustered tables do not suffer from the preceding limitations and are the recommended mechanism.
Template table details
The templateSuffix (or --template_suffix) value must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_). The maximum combined length of the table name and the table suffix is 1024 characters.
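These naming rules can be sketched as a client-side check (the helper name is hypothetical; BigQuery enforces the rules server-side regardless):

```python
import re

SUFFIX_RE = re.compile(r"^[A-Za-z0-9_]+$")

def generated_table_name(base_table, template_suffix):
    """Validate a templateSuffix and return the generated table name,
    which is <targeted_table_name> + <templateSuffix>."""
    if not SUFFIX_RE.match(template_suffix):
        raise ValueError(
            "suffix may contain only letters, numbers, and underscores")
    name = base_table + template_suffix
    if len(name) > 1024:
        raise ValueError(
            "combined table name and suffix exceeds 1024 characters")
    return name

print(generated_table_name("events_", "user_12345"))
```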
Template tables are subject to streaming quota limitations. Your project can create up to 10 tables per second with template tables, similar to the tables.insert API. This quota applies only to tables being created, not to tables being modified.
If your application needs to create more than 10 tables per second, we recommend using clustered tables. For example, you can put the high-cardinality table ID into the key column of a single clustered table.
The generated table inherits its expiration time from the dataset. As with normal streaming data, generated tables cannot be copied immediately.
De-duplication only happens between uniform references to a destination table. For example, if you simultaneously stream to a generated table using both template tables and a regular insertAll command, no de-duplication occurs between rows inserted by template tables and by a regular insertAll command.
The template table and the generated tables shouldn't be views.
The following sections discuss how to troubleshoot errors that occur when you stream data into BigQuery using the legacy streaming API. For more information on how to resolve quota errors for streaming inserts, see Streaming insert quota errors.
Failure HTTP response codes
If you receive a failure HTTP response code such as a network error, there's no way to tell whether the streaming insert succeeded. If you try to re-send the request, you might end up with duplicated rows in your table. To help protect your table against duplication, set the insertId property when sending your request. BigQuery uses the insertId property for de-duplication.
If you receive a permission error, an invalid table name error, or an exceeded quota error, no rows are inserted and the entire request fails.
Success HTTP response codes
Even if you receive a success HTTP response code, you'll need to check the insertErrors property of the response to determine whether the row insertions were successful, because it's possible that BigQuery was only partially successful at inserting the rows. You might encounter one of the following scenarios:
If the insertErrors property is an empty list, all of the rows were inserted successfully.
Rows specified in the insertErrors property are not inserted, and all other rows are inserted successfully. The errors property contains detailed information about why each unsuccessful row failed. The index property indicates the 0-based row index of the request that the error applies to.
If rows failed because of a schema mismatch, an insertErrors entry is returned for each row, even the rows that did not have a schema mismatch. Rows that did not have a schema mismatch have an error with the reason property set to stopped, and can be re-sent as-is. Rows that failed include detailed information about the schema mismatch. To learn about the supported protocol buffer types for each BigQuery data type, see Data type conversions.
Because BigQuery's streaming API is designed for high insertion rates, modifications to the underlying table metadata are eventually consistent when interacting with the streaming system. Most of the time, metadata changes are propagated within minutes, but during this period API responses might reflect the inconsistent state of the table.
Some scenarios include:
Streaming to a nonexistent table returns a variation of a notFound response. A table created in response might not immediately be recognized by subsequent streaming inserts. Similarly, deleting or recreating a table can create a period of time where streaming inserts are effectively delivered to the old table. The streaming inserts might not be present in the new table.
Streaming inserts reside temporarily in the write-optimized storage, which has different availability characteristics than managed storage. Certain operations in BigQuery do not interact with the write-optimized storage, such as table copy jobs and API methods like tabledata.list. Recent streaming data won't be present in the destination table or output.
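The insertErrors handling described above can be sketched client-side: rows whose only error reason is stopped are safe to re-send as-is, while other rows need to be fixed first (the sample response values are hypothetical):

```python
def rows_to_resend(request_rows, insert_errors):
    """Given the original request rows and the insertErrors property of
    a tabledata.insertAll response, return the rows rejected only with
    reason "stopped", which can be re-sent as-is."""
    resend = []
    for entry in insert_errors:
        reasons = {err.get("reason") for err in entry.get("errors", [])}
        if reasons == {"stopped"}:
            # entry["index"] is the 0-based row index in the request.
            resend.append(request_rows[entry["index"]])
    return resend

rows = [{"json": {"a": 1}}, {"json": {"a": "bad"}}]
insert_errors = [
    {"index": 0, "errors": [{"reason": "stopped"}]},
    {"index": 1, "errors": [{"reason": "invalid",
                             "message": "schema mismatch"}]},
]
print(rows_to_resend(rows, insert_errors))
```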