A RetroSearch Logo

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Search Query:

Showing content from https://developers.google.com/bigquery/docs/streaming-data-into-bigquery below:

Use the legacy streaming API | BigQuery

This document describes how to stream data into BigQuery by using the legacy tabledata.insertAll method.

For new projects, we recommend using the BigQuery Storage Write API instead of the tabledata.insertAll method. The Storage Write API has lower pricing and more robust features, including exactly-once delivery semantics. If you are migrating an existing project from the tabledata.insertAll method to the Storage Write API, we recommend selecting the default stream. The tabledata.insertAll method is still fully supported.

Before you begin
  1. Ensure that you have write access to the dataset that contains your destination table. The table must exist before you begin writing data to it unless you are using template tables. For more information on template tables, see Creating tables automatically using template tables.

  2. Check the quota policy for streaming data.

  3. Verify that billing is enabled for your Google Cloud project.

  4. Streaming is not available through the free tier. If you attempt to use streaming without enabling billing, you receive the following error: BigQuery: Streaming insert is not allowed in the free tier.

  5. Grant Identity and Access Management (IAM) roles that give users the necessary permissions to perform each task in this document.

Required permissions

To stream data into BigQuery, you need the following IAM permissions:

Each of the following predefined IAM roles includes the permissions that you need in order to stream data into BigQuery:

For more information about IAM roles and permissions in BigQuery, see Predefined roles and permissions.

Stream data into BigQuery C#

Before trying this sample, follow the C# setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery C# API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

Go

Before trying this sample, follow the Go setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Go API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

Node.js

Before trying this sample, follow the Node.js setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Node.js API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

PHP

Before trying this sample, follow the PHP setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery PHP API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

Python

Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

Ruby

Before trying this sample, follow the Ruby setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Ruby API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

Note: To specify a NUMERIC or BIGNUMERIC value in a row, you must surround the value with double quotation marks, such as "big_numeric_col":"0.123456789123".

You don't need to populate the insertID field when you insert rows. The following example shows how to avoid sending an insertID for each row when streaming.

Send date and time data

For date and time fields, format the data in the tabledata.insertAll method as follows:

Type Format DATE A string in the form "YYYY-MM-DD" DATETIME A string in the form "YYYY-MM-DD [HH:MM:SS]" TIME A string in the form "HH:MM:SS" TIMESTAMP The number of seconds since 1970-01-01 (the Unix epoch), or a string in the form "YYYY-MM-DD HH:MM[:SS]" Send range data

For fields with type RANGE<T>, format the data in the tabledata.insertAll method as a JSON object with two fields, start and end. Missing or NULL values for the start and end fields represent unbounded boundaries. These fields must have the same supported JSON format of type T, where T can be one of DATE, DATETIME, and TIMESTAMP.

In the following example, the f_range_date field represents a RANGE<DATE> column in a table. A row is inserted into this column using the tabledata.insertAll API.

{
    "f_range_date": {
        "start": "1970-01-02",
        "end": null
    }
}
Stream data availability

Data is available for real-time analysis using GoogleSQL queries immediately after BigQuery successfully acknowledges a tabledata.insertAll request.

Recently streamed rows to an ingestion time partitioned table temporarily have a NULL value for the _PARTITIONTIME pseudocolumn. For such rows, BigQuery assigns the final non-NULL value of the PARTITIONTIME column in the background, typically within a few minutes. In rare cases, this can take up to 90 minutes.

Some recently streamed rows might not be available for table copy typically for a few minutes. In rare cases, this can take up to 90 minutes. To see whether data is available for table copy, check the tables.get response for a section named streamingBuffer. If the streamingBuffer section is absent, your data is available for copy. You can also use the streamingBuffer.oldestEntryTime field to identify the age of records in the streaming buffer.

Best effort de-duplication

When you supply insertId for an inserted row, BigQuery uses this ID to support best effort de-duplication for up to one minute. That is, if you stream the same row with the same insertId more than once within that time period into the same table, BigQuery might de-duplicate the multiple occurrences of that row, retaining only one of those occurrences.

The system expects that rows provided with identical insertIds are also identical. If two rows have identical insertIds, it is nondeterministic which row BigQuery preserves.

De-duplication is generally meant for retry scenarios in a distributed system where there's no way to determine the state of a streaming insert under certain error conditions, such as network errors between your system and BigQuery or internal errors within BigQuery. If you retry an insert, use the same insertId for the same set of rows so that BigQuery can attempt to de-duplicate your data. For more information, see troubleshooting streaming inserts.

De-duplication offered by BigQuery is best effort, and it should not be relied upon as a mechanism to guarantee the absence of duplicates in your data. Additionally, BigQuery might degrade the quality of best effort de-duplication at any time in order to guarantee higher reliability and availability for your data.

If you have strict de-duplication requirements for your data, Google Cloud Datastore is an alternative service that supports transactions.

Disabling best effort de-duplication

You can disable best effort de-duplication by not populating the insertId field for each row inserted. This is the recommended way to insert data.

Apache Beam and Dataflow

To disable best effort de-duplication when you use Apache Beam's BigQuery I/O connector for Java, use the ignoreInsertIds() method.

Manually removing duplicates

To ensure that no duplicate rows exist after you are done streaming, use the following manual process:

  1. Add the insertId as a column in your table schema and include the insertId value in the data for each row.
  2. After streaming has stopped, perform the following query to check for duplicates:
    #standardSQL
    SELECT
      MAX(count) FROM(
      SELECT
        ID_COLUMN,
        count(*) as count
      FROM
        `TABLE_NAME`
      GROUP BY
        ID_COLUMN)
    If the result is greater than 1, duplicates exist.
  3. To remove duplicates, run the following query. Specify a destination table, allow large results, and disable result flattening.
    #standardSQL
    SELECT
      * EXCEPT(row_number)
    FROM (
      SELECT
        *,
        ROW_NUMBER()
              OVER (PARTITION BY ID_COLUMN) row_number
      FROM
        `TABLE_NAME`)
    WHERE
      row_number = 1

Notes about the duplicate removal query:

Stream into time-partitioned tables

When you stream data to a time-partitioned table, each partition has a streaming buffer. The streaming buffer is retained when you perform a load, query, or copy job that overwrites a partition by setting the writeDisposition property to WRITE_TRUNCATE. If you want to remove the streaming buffer, verify that the streaming buffer is empty by calling tables.get on the partition.

Ingestion-time partitioning

When you stream to an ingestion-time partitioned table, BigQuery infers the destination partition from the current UTC time.

Newly arriving data is temporarily placed in the __UNPARTITIONED__ partition while in the streaming buffer. When there's enough unpartitioned data, BigQuery partitions the data into the correct partition. However, there is no SLA for how long it takes for data to move out of the __UNPARTITIONED__ partition. A query can exclude data in the streaming buffer from a query by filtering out the NULL values from the __UNPARTITIONED__ partition by using one of the pseudocolumns (_PARTITIONTIME or _PARTITIONDATE depending on your preferred data type).

If you are streaming data into a daily partitioned table, then you can override the date inference by supplying a partition decorator as part of the insertAll request. Include the decorator in the tableId parameter. For example, you can stream to the partition corresponding to 2021-03-01 for table table1 using the partition decorator:

table1$20210301

When streaming using a partition decorator, you can stream to partitions within the last 31 days in the past and 16 days in the future relative to the current date, based on current UTC time. To write to partitions for dates outside these allowed bounds, use a load or query job instead, as described in Appending to and overwriting partitioned table data.

Streaming using a partition decorator is only supported for daily partitioned tables. It is not supported for hourly, monthly, or yearly partitioned tables.

For testing, you can use the bq command-line tool bq insert CLI command. For example, the following command streams a single row to a partition for the date January 1, 2017 ($20170101) into a partitioned table named mydataset.mytable:

echo '{"a":1, "b":2}' | bq insert 'mydataset.mytable$20170101'
Caution: The bq insert command is intended for testing only. Time-unit column partitioning

You can stream data into a table partitioned on a DATE, DATETIME, or TIMESTAMP column that is between 10 years in the past and 1 year in the future. Data outside this range is rejected.

When the data is streamed, it is initially placed in the __UNPARTITIONED__ partition. When there's enough unpartitioned data, BigQuery automatically repartitions the data, placing it into the appropriate partition. However, there is no SLA for how long it takes for data to move out of the __UNPARTITIONED__ partition.

Create tables automatically using template tables

Template tables provide a mechanism to split a logical table into many smaller tables to create smaller sets of data (for example, by user ID). Template tables have a number of limitations described below. Instead, partitioned tables and clustered tables are the recommended ways to achieve this behavior.

To use a template table through the BigQuery API, add a templateSuffix parameter to your insertAll request. For the bq command-line tool, add the template_suffix flag to your insert command. If BigQuery detects a templateSuffix parameter or the template_suffix flag, it treats the targeted table as a base template. It creates a new table that shares the same schema as the targeted table and has a name that includes the specified suffix:

<targeted_table_name> + <templateSuffix>

By using a template table, you avoid the overhead of creating each table individually and specifying the schema for each table. You need only create a single template, and supply different suffixes so that BigQuery can create the new tables for you. BigQuery places the tables in the same project and dataset.

Tables created by using template tables are usually available within a few seconds. On rare occasions, they may take longer to become available.

Change the template table schema

If you change a template table schema, all tables that are generated subsequently use the updated schema. Previously generated tables are not affected, unless the existing table still has a streaming buffer.

For existing tables that still have a streaming buffer, if you modify the template table schema in a backward compatible way, the schema of those actively streamed generated tables is also updated. However, if you modify the template table schema in a non-backward compatible way, any buffered data that uses the old schema is lost. Also, you cannot stream new data to existing generated tables that use the old, but now incompatible, schema.

After you change a template table schema, wait until the changes have propagated before you try to insert new data or query the generated tables. Requests to insert new fields should succeed within a few minutes. Attempts to query the new fields might require a longer wait of up to 90 minutes.

If you want to change a generated table's schema, do not change the schema until streaming through the template table has ceased and the generated table's streaming statistics section is absent from the tables.get() response, which indicates that no data is buffered on the table.

Partitioned tables and clustered tables do not suffer from the preceding limitations and are the recommended mechanism.

Template table details
Template suffix value
The templateSuffix (or --template_suffix) value must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_). The maximum combined length of the table name and the table suffix is 1024 characters.
Quota

Template tables are subject to streaming quota limitations. Your project can make up to 10 tables per second with template tables, similar to the tables.insert API. This quota only applies to tables being created, not to tables being modified.

If your application needs to create more than 10 tables per second, we recommend using clustered tables. For example, you can put the high cardinality table ID into the key column of a single clustering table.

Time to live

The generated table inherits its expiration time from the dataset. As with normal streaming data, generated tables cannot be copied immediately.

Deduplication

Deduplication only happens between uniform references to a destination table. For example, if you simultaneously stream to a generated table using both template tables and a regular insertAll command, no deduplication occurs between rows inserted by template tables and a regular insertAll command.

Views

The template table and the generated tables shouldn't be views.

Troubleshoot streaming inserts

The following sections discuss how to troubleshoot errors that occur when you stream data into BigQuery using the legacy streaming API. For more information on how to resolve quota errors for streaming inserts, see Streaming insert quota errors.

Failure HTTP response codes

If you receive a failure HTTP response code such as a network error, there's no way to tell whether the streaming insert succeeded. If you try to re-send the request, you might end up with duplicated rows in your table. To help protect your table against duplication, set the insertId property when sending your request. BigQuery uses the insertId property for de-duplication.

If you receive a permission error, an invalid table name error, or an exceeded quota error, no rows are inserted and the entire request fails.

Success HTTP response codes

Even if you receive a success HTTP response code, you'll need to check the insertErrors property of the response to determine whether the row insertions were successful because it's possible that BigQuery was only partially successful at inserting the rows. You might encounter one of the following scenarios:

Metadata errors for streaming inserts

Because BigQuery's streaming API is designed for high insertion rates, modifications to the underlying table metadata exhibit are eventually consistent when interacting with the streaming system. Most of the time, metadata changes are propagated within minutes, but during this period API responses might reflect the inconsistent state of the table.

Some scenarios include:

Missing/Unavailable data

Streaming inserts reside temporarily in the write-optimized storage, which has different availability characteristics than managed storage. Certain operations in BigQuery do not interact with the write-optimized storage, such as table copy jobs and API methods like tabledata.list. Recent streaming data won't be present in the destination table or output.


RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4