A RetroSearch Logo

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Search Query:

Showing content from https://developers.google.com/bigquery/docs/querying-clustered-tables below:

Querying clustered tables | BigQuery

Stay organized with collections Save and categorize content based on your preferences.

Querying clustered tables

When you create a clustered table in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table's schema. The columns you specify are used to colocate related data. When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.

To optimize performance when you run queries against clustered tables, use an expression that filters on a clustered column or on multiple clustered columns in the order the clustered columns are specified. Queries that filter on clustered columns generally perform better than queries that filter only on non-clustered columns.

BigQuery sorts the data in a clustered table based on the values in the clustering columns and organizes them into blocks.

When you submit a query that contains a filter on a clustered column, BigQuery uses the clustering information to efficiently determine whether a block contains any data relevant to the query. This allows BigQuery to only scan the relevant blocks — a process referred to as block pruning.

You can query clustered tables by:

You can only use GoogleSQL with clustered tables.

Go

Before trying this sample, follow the Go setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Go API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

Python

Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

Required permissions

To run a query job, you need the bigquery.jobs.create Identity and Access Management (IAM) permission on the project that runs the query job.

Each of the following predefined IAM roles includes the permissions that you need to run a query job:

You also need the bigquery.tables.getData permission on all tables and views that your query references. In addition, when querying a view you need this permission on all underlying tables and views. However, if you are using authorized views or authorized datasets, you don't need access to the underlying source data.

Each of the following predefined IAM roles includes the permission that you need on all tables and views that the query references:

For more information about IAM roles in BigQuery, see Predefined roles and permissions.

Best practices

To get the best performance from queries against clustered tables, use the following best practices.

For context, the sample table used in the best practice examples is a clustered table that is created by using a DDL statement. The DDL statement creates a table named ClusteredSalesData. The table is clustered by the following columns: customer_id, product_id, order_id, in that sort order.

CREATE TABLE
  `mydataset.ClusteredSalesData`
PARTITION BY
  DATE(timestamp)
CLUSTER BY
  customer_id,
  product_id,
  order_id AS
SELECT
  *
FROM
  `mydataset.SalesData`
Filter clustered columns by sort order

When you specify a filter, use expressions that filter on the clustered columns in sort order. Sort order is the column order given in the CLUSTER BY clause. To get the benefits of clustering, include one or more of the clustered columns in left-to-right sort order, starting with the first column. In most cases, the first clustering column is the most effective in block pruning, then the second column, then the third. You can still use the second or third column alone in the query, but block pruning probably won't be as effective. The ordering of the column names inside the filter expression doesn't affect performance.

The following example queries the ClusteredSalesData clustered table that was created in the preceding example. The query includes a filter expression that filters on customer_id and then on product_id. This query optimizes performance by filtering the clustered columns in sort order—the column order given in the CLUSTER BY clause.

SELECT
  SUM(totalSale)
FROM
  `mydataset.ClusteredSalesData`
WHERE
  customer_id = 10000
  AND product_id LIKE 'gcp_analytics%'

The following query does not filter the clustered columns in sort order. As a result, the performance of the query is not optimal. This query filters on product_id then on order_id (skipping customer_id).

SELECT
  SUM(totalSale)
FROM
  `mydataset.ClusteredSalesData`
WHERE
  product_id LIKE 'gcp_analytics%'
  AND order_id = 20000
Don't use clustered columns in complex filter expressions

If you use a clustered column in a complex filter expression, the performance of the query is not optimized because block pruning cannot be applied.

For example, the following query won't prune blocks because a clustered column—customer_id—is used in a function in the filter expression.

SELECT
  SUM(totalSale)
FROM
  `mydataset.ClusteredSalesData`
WHERE
  CAST(customer_id AS STRING) = "10000"

To optimize query performance by pruning blocks, use simple filter expressions like the following. In this example, a simple filter is applied to the clustered column—customer_id.

SELECT
  SUM(totalSale)
FROM
  `mydataset.ClusteredSalesData`
WHERE
  customer_id = 10000
Don't compare clustered columns to other columns

If a filter expression compares a clustered column to another column (either a clustered column or a non-clustered column), the performance of the query is not optimized because block pruning cannot be applied.

The following query does not prune blocks because the filter expression compares a clustered column—customer_id to another column—order_id.

SELECT
  SUM(totalSale)
FROM
  `mydataset.ClusteredSalesData`
WHERE
  customer_id = order_id
Table security

To control access to tables in BigQuery, see Control access to resources with IAM.

What's next

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-08-07 UTC.

[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-08-07 UTC."],[[["Clustered tables in BigQuery automatically organize data based on specified columns, which colocate related data for optimized performance."],["To maximize query performance, filter on clustered columns in the order they were specified during table creation; filtering on all or a subset of columns from left to right benefits from clustering."],["BigQuery utilizes block pruning to scan only relevant data blocks, based on clustering information when a query filters on clustered columns."],["Avoid using clustered columns in complex filter expressions or comparing them to other columns, as this prevents block pruning and thus negatively affects query performance."],["You can query clustered tables using various methods including the Google Cloud console, the `bq` command-line tool, API calls via the `jobs.insert` method, and client libraries, but they can currently only be done with GoogleSQL."]]],[]]


RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4