Create Amazon S3 BigLake external tables

This document describes how to create an Amazon Simple Storage Service (Amazon S3) BigLake table. A BigLake table lets you use access delegation to query data in Amazon S3. Access delegation decouples access to the BigLake table from access to the underlying datastore.

For information about how data flows between BigQuery and Amazon S3, see Data flow when querying data.

Before you begin

Ensure that you have a connection to access Amazon S3 data.

Required roles

To get the permissions that you need to create an external table, ask your administrator to grant you the BigQuery Admin (roles/bigquery.admin) IAM role on your dataset. For more information about granting roles, see Manage access to projects, folders, and organizations.

This predefined role contains the permissions required to create an external table. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to create an external table:

You might also be able to get these permissions with custom roles or other predefined roles.

Create a dataset

Before you create an external table, you need to create a dataset in a supported region. Select one of the following options:

Console
  1. Go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, select the project where you want to create the dataset.
  3. Expand the more_vert View actions option and click Create dataset.
  4. On the Create dataset page, specify the following details:
    1. For Dataset ID enter a unique dataset name.
    2. For Data location choose a supported region.
    3. Optional: To delete tables automatically, select the Enable table expiration checkbox and set the Default maximum table age in days. Data in Amazon S3 is not deleted when the table expires.
    4. If you want to use default collation, expand the Advanced options section and then select the Enable default collation option.
    5. Click Create dataset.
SQL

Use the CREATE SCHEMA DDL statement. The following example creates a dataset in the aws-us-east-1 region:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE SCHEMA mydataset
    OPTIONS (
      location = 'aws-us-east-1');
  3. Click play_circle Run.

For more information about how to run queries, see Run an interactive query.

bq

In a command-line environment, create a dataset using the bq mk command:

bq --location=LOCATION mk \
    --dataset \
PROJECT_ID:DATASET_NAME

The --project_id parameter overrides the default project.

Replace the following:

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
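As a rough illustration, a minimal sketch with the google-cloud-bigquery client library might look like the following. The project ID and dataset name are placeholders, and builder method names can vary slightly between client versions.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Dataset;
import com.google.cloud.bigquery.DatasetId;
import com.google.cloud.bigquery.DatasetInfo;

public class CreateDatasetAws {
  public static void main(String[] args) {
    // Placeholder values: replace with your project ID and dataset name.
    String projectId = "my-project-id";
    String datasetName = "mydataset";

    // The client picks up Application Default Credentials.
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // The dataset must be created in a supported BigQuery Omni region,
    // for example aws-us-east-1.
    DatasetInfo datasetInfo =
        DatasetInfo.newBuilder(DatasetId.of(projectId, datasetName))
            .setLocation("aws-us-east-1")
            .build();

    Dataset dataset = bigquery.create(datasetInfo);
    System.out.println("Created dataset " + dataset.getDatasetId().getDataset());
  }
}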

Create BigLake tables on unpartitioned data

Select one of the following options:

Console
  1. Go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, expand your project, and then select a dataset.

  3. In the Dataset info section, click add_box Create table.

  4. On the Create table page, in the Source section, do the following:

    1. For Create table from, select Amazon S3.
    2. For Select S3 path, enter a URI pointing to the Amazon S3 data in the format s3://BUCKET_NAME/PATH. Replace BUCKET_NAME with the name of the Amazon S3 bucket; the bucket's region should be the same as the dataset's region. Replace PATH with the path to the source data; it can contain one wildcard *.
    3. For File format, select the data format in Amazon S3. Supported formats are AVRO, CSV, DELTA_LAKE, ICEBERG, JSONL, ORC, and PARQUET.
  5. In the Destination section, specify the following details:

    1. For Dataset, choose the appropriate dataset.
    2. In the Table field, enter the name of the table.
    3. Verify that Table type is set to External table.
    4. For Connection ID, choose the appropriate connection ID from the drop-down. For information about connections, see Connect to Amazon S3.
  6. In the Schema section, if you have a source file, you can either enable schema auto-detection or manually specify a schema. If you don't have a source file, you must manually specify a schema.

  7. Click Create table.

SQL

To create a BigLake table, use the CREATE EXTERNAL TABLE statement with the WITH CONNECTION clause:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE EXTERNAL TABLE DATASET_NAME.TABLE_NAME
      WITH CONNECTION `AWS_LOCATION.CONNECTION_NAME`
      OPTIONS (
        format = "DATA_FORMAT",
        uris = ["S3_URI"],
        max_staleness = STALENESS_INTERVAL,
        metadata_cache_mode = 'CACHE_MODE');

    Replace the following:

  3. Click play_circle Run.

For more information about how to run queries, see Run an interactive query.

Example:

CREATE EXTERNAL TABLE awsdataset.awstable
  WITH CONNECTION `aws-us-east-1.s3-read-connection`
  OPTIONS (
    format="CSV",
    uris=["s3://s3-bucket/path/file.csv"],
    max_staleness = INTERVAL 1 DAY,
    metadata_cache_mode = 'AUTOMATIC'
);
bq

Create a table definition file:

bq mkdef  \
--source_format=DATA_FORMAT \
--connection_id=AWS_LOCATION.CONNECTION_NAME \
--metadata_cache_mode=CACHE_MODE \
S3_URI > table_def

Replace the following:

Note: To override the default project, use the --project_id=PROJECT_ID parameter. Replace PROJECT_ID with the ID of your Google Cloud project.

Next, create the BigLake table:

bq mk --max_staleness=STALENESS_INTERVAL --external_table_definition=table_def DATASET_NAME.TABLE_NAME

Replace the following:

For example, the following command creates a new BigLake table, awsdataset.awstable, which can query your Amazon S3 data that's stored at the path s3://s3-bucket/path/file.csv and has a read connection in the location aws-us-east-1:

bq mkdef  \
--autodetect \
--source_format=CSV \
--connection_id=aws-us-east-1.s3-read-connection \
--metadata_cache_mode=AUTOMATIC \
s3://s3-bucket/path/file.csv > table_def

bq mk --max_staleness=INTERVAL "1" HOUR \
--external_table_definition=table_def awsdataset.awstable
API

Call the tables.insert API method, and create an ExternalDataConfiguration in the Table resource that you pass in.

Specify the schema property or set the autodetect property to true to enable schema auto-detection for supported data sources.

Specify the connectionId property to identify the connection to use for connecting to Amazon S3.

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
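A minimal sketch with the google-cloud-bigquery client library might look like the following. The dataset, table, schema, S3 URI, and connection ID are placeholders, and the connection ID format shown here (project.location.connection) is an assumption; check your connection details for the exact value.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.CsvOptions;
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;

public class CreateExternalTableAws {
  public static void main(String[] args) {
    // Placeholder values: replace with your own names.
    String datasetName = "awsdataset";
    String tableName = "awstable";
    String connectionId = "my-project.aws-us-east-1.s3-read-connection"; // assumed format
    String sourceUri = "s3://s3-bucket/path/file.csv";

    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Example schema for the CSV file; adjust to match your data.
    Schema schema =
        Schema.of(
            Field.of("name", StandardSQLTypeName.STRING),
            Field.of("post_abbr", StandardSQLTypeName.STRING));

    CsvOptions csvOptions = CsvOptions.newBuilder().setSkipLeadingRows(1).build();

    // The connection ID is what makes this external table a BigLake table.
    ExternalTableDefinition externalTableDefinition =
        ExternalTableDefinition.newBuilder(sourceUri, schema, csvOptions)
            .setConnectionId(connectionId)
            .build();

    TableInfo tableInfo =
        TableInfo.of(TableId.of(datasetName, tableName), externalTableDefinition);
    bigquery.create(tableInfo);
    System.out.println("Created BigLake table " + datasetName + "." + tableName);
  }
}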

Create BigLake tables on partitioned data

You can create a BigLake table for Hive partitioned data in Amazon S3. After you create an externally partitioned table, you can't change the partition key. You need to recreate the table to change the partition key.

To create a BigLake table based on Hive partitioned data, select one of the following options:

Console
  1. Go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, expand your project and select a dataset.

  3. Click more_vert View actions, and then click Create table. This opens the Create table pane.

  4. In the Source section, specify the following details:

    1. For Create table from, select Amazon S3.

    2. Provide the path to the folder, using wildcards. For example, s3://mybucket/*.

      The folder must be in the same location as the dataset that contains the table you want to create, append, or overwrite.

    3. From the File format list, select the file type.

    4. Select the Source data partitioning checkbox, and then specify the following details:

      1. For Select Source URI Prefix, enter the URI prefix. For example, s3://mybucket/my_files.
      2. Optional: To require a partition filter on all queries for this table, select the Require partition filter checkbox. Requiring a partition filter can reduce cost and improve performance. For more information, see Requiring predicate filters on partition keys in queries.
      3. In the Partition inference mode section, select one of the following options:

        • Automatically infer types: set the partition schema detection mode to AUTO.
        • All columns are strings: set the partition schema detection mode to STRINGS.
        • Provide my own: set the partition schema detection mode to CUSTOM and manually enter the schema information for the partition keys. For more information, see Custom partition key schema.
  5. In the Destination section, specify the following details:

    1. For Project, select the project in which you want to create the table.
    2. For Dataset, select the dataset in which you want to create the table.
    3. For Table, enter the name of the table that you want to create.
    4. For Table type, verify that External table is selected.
    5. For Connection ID, select the connection that you created earlier.
  6. In the Schema section, if you have a source file, you can either enable schema auto-detection or manually specify a schema. If you don't have a source file, you must manually specify a schema.

  7. To ignore rows with extra column values that don't match the schema, expand the Advanced options section and select Unknown values.

  8. Click Create table.

SQL

Use the CREATE EXTERNAL TABLE DDL statement:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE EXTERNAL TABLE `PROJECT_ID.DATASET.EXTERNAL_TABLE_NAME`
    WITH PARTITION COLUMNS
    (
      PARTITION_COLUMN PARTITION_COLUMN_TYPE,
    )
    WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_ID`
    OPTIONS (
      hive_partition_uri_prefix = "HIVE_PARTITION_URI_PREFIX",
      uris=['FILE_PATH'],
      format = "TABLE_FORMAT",
      max_staleness = STALENESS_INTERVAL,
      metadata_cache_mode = 'CACHE_MODE'
    );

    Replace the following:

  3. Click play_circle Run.

For more information about how to run queries, see Run an interactive query.

Examples

The following example creates a BigLake table over partitioned data in Amazon S3. The schema is autodetected.

CREATE EXTERNAL TABLE `my_dataset.my_table`
WITH PARTITION COLUMNS
(
  sku STRING,
)
WITH CONNECTION `us.my-connection`
OPTIONS(
  hive_partition_uri_prefix = "s3://mybucket/products",
  uris = ['s3://mybucket/products/*'],
  max_staleness = INTERVAL 1 DAY,
  metadata_cache_mode = 'AUTOMATIC'
);
bq

First, use the bq mkdef command to create a table definition file:

bq mkdef \
--source_format=SOURCE_FORMAT \
--connection_id=REGION.CONNECTION_ID \
--hive_partitioning_mode=PARTITIONING_MODE \
--hive_partitioning_source_uri_prefix=URI_SHARED_PREFIX \
--require_hive_partition_filter=BOOLEAN \
--metadata_cache_mode=CACHE_MODE \
 URIS > DEFINITION_FILE

Replace the following:

If PARTITIONING_MODE is CUSTOM, include the partition key schema in the source URI prefix, using the following format:

--hive_partitioning_source_uri_prefix=URI_SHARED_PREFIX/{KEY1:TYPE1}/{KEY2:TYPE2}/...

After you create the table definition file, use the bq mk command to create the BigLake table:

bq mk --max_staleness=STALENESS_INTERVAL \
--external_table_definition=DEFINITION_FILE \
DATASET_NAME.TABLE_NAME \
SCHEMA

Replace the following:

Examples

The following example uses AUTO Hive partitioning mode for Amazon S3 data:

bq mkdef --source_format=CSV \
  --connection_id=us.my-connection \
  --hive_partitioning_mode=AUTO \
  --hive_partitioning_source_uri_prefix=s3://mybucket/myTable \
  --metadata_cache_mode=AUTOMATIC \
  s3://mybucket/* > mytable_def

bq mk --max_staleness=INTERVAL "1" HOUR \
  --external_table_definition=mytable_def \
  mydataset.mytable \
  Region:STRING,Quarter:STRING,Total_sales:INTEGER

The following example uses STRING Hive partitioning mode for Amazon S3 data:

bq mkdef --source_format=CSV \
  --connection_id=us.my-connection \
  --hive_partitioning_mode=STRING \
  --hive_partitioning_source_uri_prefix=s3://mybucket/myTable \
  --metadata_cache_mode=AUTOMATIC \
  s3://mybucket/myTable/* > mytable_def

bq mk --max_staleness=INTERVAL "1" HOUR \
  --external_table_definition=mytable_def \
  mydataset.mytable \
  Region:STRING,Quarter:STRING,Total_sales:INTEGER
API

To set Hive partitioning using the BigQuery API, include the hivePartitioningOptions object in the ExternalDataConfiguration object when you create the table definition file. To create a BigLake table, you must also specify a value for the connectionId field.

If you set the hivePartitioningOptions.mode field to CUSTOM, you must encode the partition key schema in the hivePartitioningOptions.sourceUriPrefix field as follows: s3://BUCKET/PATH_TO_TABLE/{KEY1:TYPE1}/{KEY2:TYPE2}/...

To enforce the use of a predicate filter at query time, set the hivePartitioningOptions.requirePartitionFilter field to true.
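A rough Java-client equivalent of this configuration is sketched below; the names and URIs are placeholders, the connection ID format is an assumption, and method availability depends on your google-cloud-bigquery version.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.HivePartitioningOptions;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;

public class CreateHivePartitionedTableAws {
  public static void main(String[] args) {
    // Placeholder values.
    String datasetName = "mydataset";
    String tableName = "mytable";
    String connectionId = "my-project.aws-us-east-1.s3-read-connection"; // assumed format
    String sourceUri = "s3://mybucket/myTable/*";
    String sourceUriPrefix = "s3://mybucket/myTable";

    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Corresponds to hivePartitioningOptions in ExternalDataConfiguration.
    HivePartitioningOptions hivePartitioningOptions =
        HivePartitioningOptions.newBuilder()
            .setMode("AUTO")                     // AUTO, STRINGS, or CUSTOM (CUSTOM needs the key schema in the prefix)
            .setRequirePartitionFilter(true)     // enforce predicate filters at query time
            .setSourceUriPrefix(sourceUriPrefix)
            .build();

    ExternalTableDefinition externalTableDefinition =
        ExternalTableDefinition.newBuilder(sourceUri, FormatOptions.csv())
            .setAutodetect(true)
            .setConnectionId(connectionId)       // required for a BigLake table
            .setHivePartitioningOptions(hivePartitioningOptions)
            .build();

    bigquery.create(TableInfo.of(TableId.of(datasetName, tableName), externalTableDefinition));
    System.out.println("Created Hive-partitioned BigLake table.");
  }
}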

Delta Lake tables

Delta Lake is an open source table format that supports petabyte-scale data tables. Delta Lake tables can be queried as both temporary and permanent tables, and they are supported as BigLake tables.

Schema synchronization

Delta Lake maintains a canonical schema as part of its metadata. You can't update a schema using a JSON metadata file. To update the schema:

  1. Use the bq update command with the --autodetect_schema flag:

    bq update --autodetect_schema PROJECT_ID:DATASET.TABLE

    Replace the following:

Type conversion

BigQuery converts Delta Lake data types to the following BigQuery data types:

| Delta Lake type | BigQuery type |
| --- | --- |
| boolean | BOOL |
| byte | INT64 |
| int | INT64 |
| long | INT64 |
| float | FLOAT64 |
| double | FLOAT64 |
| decimal(P,S) | NUMERIC or BIGNUMERIC, depending on precision |
| date | DATE |
| time | TIME |
| timestamp (not a partition column) | TIMESTAMP |
| timestamp (partition column) | DATETIME |
| string | STRING |
| binary | BYTES |
| array<Type> | ARRAY<Type> |
| struct | STRUCT |
| map<KeyType, ValueType> | ARRAY<STRUCT<key KeyType, value ValueType>> |

Limitations

The following limitations apply to Delta Lake tables:

Create a Delta Lake table

The following example creates an external table by using the CREATE EXTERNAL TABLE statement with the Delta Lake format:

CREATE [OR REPLACE] EXTERNAL TABLE table_name
WITH CONNECTION connection_name
OPTIONS (
         format = 'DELTA_LAKE',
         uris = ["parent_directory"]
       );

Replace the following:

Cross-cloud transfer with Delta Lake

The following example uses the LOAD DATA statement to load data to the appropriate table:

LOAD DATA [INTO | OVERWRITE] table_name
FROM FILES (
        format = 'DELTA_LAKE',
        uris = ["parent_directory"]
)
WITH CONNECTION connection_name;

For more examples of cross-cloud data transfers, see Load data with cross cloud operations.
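If you want to submit this statement programmatically, a minimal sketch with the google-cloud-bigquery Java client could look like the following; the table, connection, and URI names are placeholders.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;

public class LoadDeltaLakeCrossCloud {
  public static void main(String[] args) throws InterruptedException {
    // Placeholder names; the statement mirrors the LOAD DATA template above.
    String sql =
        "LOAD DATA INTO mydataset.delta_table "
            + "FROM FILES (format = 'DELTA_LAKE', uris = ['s3://mybucket/delta/']) "
            + "WITH CONNECTION `aws-us-east-1.s3-read-connection`";

    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Runs the load as a query job and waits for it to finish.
    bigquery.query(QueryJobConfiguration.newBuilder(sql).build());
    System.out.println("Cross-cloud load complete.");
  }
}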

Query BigLake tables

For more information, see Query Amazon S3 data.

View resource metadata with INFORMATION_SCHEMA views

Preview

This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see the launch stage descriptions.

You can view the resource metadata with INFORMATION_SCHEMA views. When you query the JOBS_BY_*, JOBS_TIMELINE_BY_*, and RESERVATION* views, you must specify the query's processing location that is collocated with the table's region. For information about BigQuery Omni locations, see Locations. For all other system tables, specifying the query job location is optional.

For information about the system tables that BigQuery Omni supports, see Limitations.

To query JOBS_* and RESERVATION* system tables, select one of the following methods to specify the processing location:

Console
  1. Go to the BigQuery page.

    Go to BigQuery

  2. If the Editor tab isn't visible, then click add_box Compose new query.

  3. Click More > Query settings. The Query settings dialog opens.

  4. In the Query settings dialog, for Additional settings > Data location, select the BigQuery region that is collocated with the BigQuery Omni region. For example, if your BigQuery Omni region is aws-us-east-1, specify us-east4.

  5. Select the remaining fields and click Save.

bq

Use the --location flag to set the job's processing location to the BigQuery region that is collocated with the BigQuery Omni region. For example, if your BigQuery Omni region is aws-us-east-1, specify us-east4.

Example

bq query --use_legacy_sql=false --location=us-east4 \
"SELECT * FROM region-aws-us-east-1.INFORMATION_SCHEMA.JOBS limit 10;"
bq query --use_legacy_sql=false --location=asia-northeast3 \
"SELECT * FROM region-aws-ap-northeast-2.INFORMATION_SCHEMA.JOBS limit 10;"
API

If you are running jobs programmatically, set the location argument to the BigQuery region that is collocated with the BigQuery Omni region. For example, if your BigQuery Omni region is aws-us-east-1, specify us-east4.
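For example, with the google-cloud-bigquery Java client you can set the processing location on the job ID (a sketch; the region names follow the example above):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.JobId;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class QueryOmniInformationSchema {
  public static void main(String[] args) throws InterruptedException {
    // The Omni region is aws-us-east-1; its collocated BigQuery region is us-east4.
    String sql =
        "SELECT * FROM `region-aws-us-east-1`.INFORMATION_SCHEMA.JOBS LIMIT 10";

    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Set the job's processing location explicitly.
    JobId jobId = JobId.newBuilder().setLocation("us-east4").build();
    TableResult result =
        bigquery.query(QueryJobConfiguration.newBuilder(sql).build(), jobId);
    result.iterateAll().forEach(row -> System.out.println(row));
  }
}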

The following example lists the metadata refresh jobs in the last six hours:

SELECT
 *
FROM
 `region-REGION_NAME`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
 job_id LIKE '%metadata_cache_refresh%'
 AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 6 HOUR)
ORDER BY start_time desc
LIMIT 10;

Replace REGION_NAME with your region.

VPC Service Controls

You can use VPC Service Controls perimeters to restrict access from BigQuery Omni to an external cloud service as an extra layer of defense. For example, VPC Service Controls perimeters can limit exports from your BigQuery Omni tables to a specific Amazon S3 bucket or Blob Storage container.

To learn more about VPC Service Controls, see Overview of VPC Service Controls.

Required permission

Ensure that you have the required permissions to configure service perimeters. To view a list of IAM roles required to configure VPC Service Controls, see Access control with IAM in the VPC Service Controls documentation.

Set up VPC Service Controls using the Google Cloud console
  1. In the Google Cloud console navigation menu, click Security, and then click VPC Service Controls.

    Go to VPC Service Controls

  2. To set up VPC Service Controls for BigQuery Omni, follow the steps in the Create a service perimeter guide, and when you are in the Egress rules pane, follow these steps:

    1. In the Egress rules panel, click Add rule.

    2. In the From attributes of the API client section, select an option from the Identity list.

    3. Select To attributes of external resources.

    4. To add an external resource, click Add external resources.

    5. In the Add external resource dialog, for External resource name, enter a valid resource name. For example:

      • For Amazon Simple Storage Service (Amazon S3): s3://BUCKET_NAME

        Replace BUCKET_NAME with the name of your Amazon S3 bucket.

      • For Azure Blob Storage: azure://myaccount.blob.core.windows.net/CONTAINER_NAME

        Replace CONTAINER_NAME with the name of your Blob Storage container.

      For a list of egress rule attributes, see Egress rules reference.

    6. Select the methods that you want to allow on your external resources:

      1. If you want to allow all methods, select All methods in the Methods list.
      2. If you want to allow specific methods, select Selected method, click Select methods, and then select the methods that you want to allow on your external resources.
    7. Click Create perimeter.

Set up VPC Service Controls using the gcloud CLI

To set up VPC Service Controls using the gcloud CLI, follow these steps:

  1. Set the default access policy.
  2. Create the egress policy input file.
  3. Add the egress policy.
Set the default access policy

An access policy is an organization-wide container for access levels and service perimeters. For information about setting a default access policy or getting an access policy name, see Managing an access policy.

Create the egress policy input file

An egress rule block defines the allowed access from within a perimeter to resources outside of that perimeter. For external resources, the externalResources property defines the external resource paths allowed access from within your VPC Service Controls perimeter.

Egress rules can be configured using a JSON file or a YAML file. The following sample uses the .yaml format:

- egressTo:
    operations:
    - serviceName: bigquery.googleapis.com
      methodSelectors:
      - method: "*"
      *OR*
      - permission: "externalResource.read"
    externalResources:
      - EXTERNAL_RESOURCE_PATH
  egressFrom:
    identityType: IDENTITY_TYPE
    *OR*
    identities:
    - serviceAccount:SERVICE_ACCOUNT
Examples

The following example is a policy that allows egress operations from inside the perimeter to the s3://mybucket and s3://mybucket2 Amazon S3 locations in AWS.

- egressTo:
    operations:
    - serviceName: bigquery.googleapis.com
      methodSelectors:
      - method: "*"
    externalResources:
      - s3://mybucket
      - s3://mybucket2
  egressFrom:
    identityType: ANY_IDENTITY

The following example allows egress operations to a Blob Storage container:

- egressTo:
    operations:
    - serviceName: bigquery.googleapis.com
      methodSelectors:
      - method: "*"
    externalResources:
      - azure://myaccount.blob.core.windows.net/mycontainer
  egressFrom:
    identityType: ANY_IDENTITY

For more information about egress policies, see the Egress rules reference.

Add the egress policy

To add the egress policy when you create a new service perimeter, use the gcloud access-context-manager perimeters create command. For example, the following command creates a new perimeter named omniPerimeter that includes the project with project number 12345, restricts the BigQuery API, and adds an egress policy defined in the egress.yaml file:

gcloud access-context-manager perimeters create omniPerimeter \
    --title="Omni Perimeter" \
    --resources=projects/12345 \
    --restricted-services=bigquery.googleapis.com \
    --egress-policies=egress.yaml

To add the egress policy to an existing service perimeter, use the gcloud access-context-manager perimeters update command. For example, the following command adds an egress policy defined in the egress.yaml file to an existing service perimeter named omniPerimeter:

gcloud access-context-manager perimeters update omniPerimeter \
    --set-egress-policies=egress.yaml
Verify your perimeter

To verify the perimeter, use the gcloud access-context-manager perimeters describe command:

gcloud access-context-manager perimeters describe PERIMETER_NAME

Replace PERIMETER_NAME with the name of the perimeter.

For example, the following command describes the perimeter omniPerimeter:

gcloud access-context-manager perimeters describe omniPerimeter

For more information, see Managing service perimeters.

Allow BigQuery Omni VPC access to Amazon S3

As a BigQuery administrator, you can create an S3 bucket policy to grant BigQuery Omni access to your Amazon S3 resources. This ensures that only authorized BigQuery Omni VPCs can interact with your Amazon S3 buckets, which enhances the security of your data.

Apply an S3 bucket policy for BigQuery Omni VPC

To apply an S3 bucket policy, use the AWS CLI or Terraform:

AWS CLI

Run the following command to apply an S3 bucket policy that includes a condition using the aws:SourceVpc attribute:

  aws s3api put-bucket-policy \
    --bucket=BUCKET_NAME \
    --policy "{
      \"Version\": \"2012-10-17\",
      \"Id\": \"RestrictBucketReads\",
      \"Statement\": [
          {
              \"Sid\": \"AccessOnlyToOmniVPC\",
              \"Principal\": \"*\",
              \"Action\": [\"s3:ListBucket\", \"s3:GetObject\"],
              \"Effect\": \"Allow\",
              \"Resource\": [\"arn:aws:s3:::BUCKET_NAME\",
                             \"arn:aws:s3:::BUCKET_NAME/*\"],
              \"Condition\": {
                  \"StringEquals\": {
                    \"aws:SourceVpc\": \"VPC_ID\"
                  }
              }
          }
      ]
    }"

Replace the following:

Terraform

Add the following to your Terraform configuration file:

  resource "aws_s3_bucket" "example" {
    bucket = "BUCKET_NAME"
  }

  resource "aws_s3_bucket_policy" "example" {
    bucket = aws_s3_bucket.example.id
    policy = jsonencode({
      Version = "2012-10-17"
      Id      = "RestrictBucketReads"
      Statement = [
          {
              Sid       = "AccessOnlyToOmniVPC"
              Effect    = "Allow"
              Principal = "*"
              Action    = ["s3:GetObject", "s3:ListBucket"]
              Resource  = [
                  aws_s3_bucket.example.arn,
                  "${aws_s3_bucket.example.arn}/*"
                  ]
              Condition = {
                  StringEquals = {
                      "aws:SourceVpc": "VPC_ID"
                  }
              }
          },
      ]
    })
  }

Replace the following:

BigQuery Omni VPC Resource IDs

| Region | VPC ID |
| --- | --- |
| aws-ap-northeast-2 | vpc-0b488548024288af2 |
| aws-ap-southeast-2 | vpc-0726e08afef3667ca |
| aws-eu-central-1 | vpc-05c7bba12ad45558f |
| aws-eu-west-1 | vpc-0e5c646979bbe73a0 |
| aws-us-east-1 | vpc-0bf63a2e71287dace |
| aws-us-west-2 | vpc-0cc24e567b9d2c1cb |

Limitations

For a full list of limitations that apply to BigLake tables based on Amazon S3 and Blob Storage, see Limitations.

What's next
