Optimize query computation

This document provides the best practices for optimizing your query performance.
When you run a query, you can view the query plan in the Google Cloud console. You can also request execution details by using the INFORMATION_SCHEMA.JOBS* views or the jobs.get REST API method.
The query plan includes details about query stages and steps. These details can help you identify ways to improve query performance. For example, if you notice a stage that writes a lot more output than other stages, it might mean that you need to filter earlier in the query.
To learn more about the query plan and see examples of how the query plan information can help you to improve query performance, see Get query performance insights. After addressing the query performance insights, you can further optimize your query by performing the following tasks:
Reduce data processed

You can reduce data that needs to be processed by using the options described in the following sections.
Avoid SELECT *
Best practice: Control projection by querying only the columns that you need.
Projection refers to the number of columns that are read by your query. Projecting excess columns incurs additional (wasted) I/O and materialization (writing results).
Using SELECT * is the most expensive way to query data, because BigQuery reads every column in the table. Applying a LIMIT clause to a SELECT * query does not affect the amount of data read. You are billed for reading all bytes in the entire table, and the query counts against your free tier quota. Instead, query only the columns you need. For example, use SELECT * EXCEPT to exclude one or more columns from the results, or, on a partitioned table, use a filter such as WHERE _PARTITIONDATE="2017-01-01" to query only the January 1, 2017 partition. Querying a subset of data or using SELECT * EXCEPT can greatly reduce the amount of data that is read by a query. In addition to the cost savings, performance is improved by reducing the amount of data I/O and the amount of materialization that is required for the query results.
SELECT * EXCEPT (col1, col2, col5) FROM mydataset.newtable
Avoid excessive wildcard tables

Best practice: When querying wildcard tables, use the most granular prefix.

Use wildcards to query multiple tables by using concise SQL statements. Wildcard tables are a union of tables that match the wildcard expression. Wildcard tables are useful if your dataset contains multiple similarly named tables with compatible schemas.
When you query a wildcard table, specify a wildcard (*) after the common table prefix. For example, FROM `bigquery-public-data.noaa_gsod.gsod194*` queries all tables from the 1940s.

More granular prefixes perform better than shorter prefixes. For example, FROM `bigquery-public-data.noaa_gsod.gsod194*` performs better than FROM `bigquery-public-data.noaa_gsod.*` because fewer tables match the wildcard.
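As an illustrative sketch (the column names assume the public NOAA GSOD schema), combining a granular prefix with a _TABLE_SUFFIX filter narrows the scan even further, down to a single year:

```sql
-- Scans only the gsod1940 table: the prefix matches gsod194* and
-- _TABLE_SUFFIX restricts the remaining suffix to '0'.
SELECT
  stn,
  `max`
FROM
  `bigquery-public-data.noaa_gsod.gsod194*`
WHERE
  _TABLE_SUFFIX = '0';
```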
Avoid date-sharded tables

Best practice: Don't use tables sharded by date (also called date-named tables) in place of time-partitioned tables.

Partitioned tables perform better than date-named tables. When you create tables sharded by date, BigQuery must maintain a copy of the schema and metadata for each date-named table. Also, when date-named tables are used, BigQuery might be required to verify permissions for each queried table. This practice also adds to query overhead and impacts query performance.
Avoid oversharding tables

Best practice: Avoid creating too many table shards. If you are sharding tables by date, use time-partitioned tables instead.
Table sharding refers to dividing large datasets into separate tables and adding a suffix to each table name. If you are sharding tables by date, use time-partitioned tables instead.
Because of the low cost of BigQuery storage, you don't need to optimize your tables for cost as you would in a relational database system. Creating a large number of table shards has performance impacts that outweigh any cost benefits.
Sharded tables require BigQuery to maintain schema, metadata, and permissions for each shard. Because of the added overhead required to maintain information on each shard, oversharding tables can impact query performance.
The amount and source of data read by a query can impact query performance and cost.
Prune partitioned queries

Best practice: When querying a partitioned table, filter on the _PARTITIONTIME pseudocolumn or on the partitioning column.

For time-unit partitioned tables, filtering the data with _PARTITIONTIME or the partitioning column lets you specify a date or range of dates. For example, the following WHERE clause uses the _PARTITIONTIME pseudocolumn to specify partitions between January 1, 2016 and January 31, 2016:
WHERE _PARTITIONTIME
BETWEEN TIMESTAMP("2016-01-01")
AND TIMESTAMP("2016-01-31")
The query processes data only in the partitions that are indicated by the date range. Filtering your partitions improves query performance and reduces costs.
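The same pruning applies when you filter directly on the partitioning column. As a sketch, assuming a hypothetical mydataset.sales table partitioned on a DATE column named transaction_date:

```sql
-- Only the January 2016 partitions are scanned.
SELECT
  order_id,
  amount
FROM
  mydataset.sales
WHERE
  transaction_date BETWEEN "2016-01-01" AND "2016-01-31";
```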
Reduce data before using a JOIN

Best practice: Reduce the amount of data that is processed before a JOIN clause by performing aggregations.
Using a GROUP BY clause with aggregate functions is computationally intensive, because these types of queries use shuffle. Because these queries are computationally intensive, use a GROUP BY clause only when necessary.
For queries with GROUP BY and JOIN, perform aggregation earlier in the query to reduce the amount of data processed. For example, the following query performs a JOIN on two large tables without any filtering beforehand:
WITH users_posts AS (
  SELECT *
  FROM `bigquery-public-data`.stackoverflow.comments AS c
  JOIN `bigquery-public-data`.stackoverflow.users AS u
    ON c.user_id = u.id
)
SELECT
  user_id,
  ANY_VALUE(display_name) AS display_name,
  ANY_VALUE(reputation) AS reputation,
  COUNT(text) AS comments_count
FROM users_posts
GROUP BY user_id
ORDER BY comments_count DESC
LIMIT 20;
This query pre-aggregates the comment counts, which reduces the amount of data read for the JOIN:
WITH comments AS (
  SELECT
    user_id,
    COUNT(text) AS comments_count
  FROM `bigquery-public-data`.stackoverflow.comments
  WHERE user_id IS NOT NULL
  GROUP BY user_id
  ORDER BY comments_count DESC
  LIMIT 20
)
SELECT
  user_id,
  display_name,
  reputation,
  comments_count
FROM comments
JOIN `bigquery-public-data`.stackoverflow.users AS u
  ON user_id = u.id
ORDER BY comments_count DESC;

Note: WITH clauses with common table expressions (CTEs) are used for query readability, not performance. There is no guarantee that adding a WITH clause causes BigQuery to materialize temporary intermediate tables and reuse the temporary result for multiple references. The WITH clause might be evaluated multiple times within a query, depending on query optimizer decisions.

Use the WHERE clause
Best practice: Use a WHERE clause to limit the amount of data a query returns. When possible, use BOOL, INT64, FLOAT64, or DATE columns in the WHERE clause.
Operations on BOOL, INT64, FLOAT64, and DATE columns are typically faster than operations on STRING or BYTES columns. When possible, use a column that uses one of these data types in the WHERE clause to reduce the amount of data returned by the query.
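As a sketch (the table and columns here are hypothetical), prefer a BOOL or DATE filter over an equivalent STRING comparison:

```sql
-- Filtering on BOOL and DATE columns instead of a STRING status column.
SELECT
  order_id
FROM
  mydataset.orders
WHERE
  is_shipped = TRUE
  AND order_date >= "2024-01-01";
```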
Use materialized views

Best practice: Use materialized views to precompute the results of a query for increased performance and efficiency.
Materialized views are precomputed views that periodically cache the results of a query for increased performance and efficiency. BigQuery leverages precomputed results from materialized views and whenever possible reads only changes from the base tables to compute up-to-date results. Materialized views can be queried directly or can be used by the BigQuery optimizer to process queries to the base tables.
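For example, a materialized view can precompute a frequent aggregation. The tables and columns below are hypothetical:

```sql
-- Precompute daily click counts; BigQuery keeps the view fresh by
-- reading only the changes from the base table.
CREATE MATERIALIZED VIEW mydataset.daily_clicks AS
SELECT
  ad_id,
  DATE(event_time) AS event_date,
  COUNT(*) AS clicks
FROM mydataset.click_events
GROUP BY ad_id, event_date;
```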
Use BI Engine

Best practice: Use BigQuery BI Engine to accelerate queries by caching the data that you use most frequently.

Consider adding a BI Engine reservation to the project where the queries are being computed. BigQuery BI Engine uses a vectorized query engine to accelerate SELECT query performance.
Use search indexes

Best practice: Use search indexes for efficient row lookups when you need to find individual rows of data in large tables.

A search index is a data structure designed to enable very efficient search with the SEARCH function, but it can also accelerate queries that use other operators and functions, such as the equal (=), IN, or LIKE operators and certain string and JSON functions.
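A minimal sketch, assuming a hypothetical mydataset.logs table:

```sql
-- Create the index once; BigQuery maintains it as the table changes.
CREATE SEARCH INDEX logs_index
ON mydataset.logs (ALL COLUMNS);

-- Lookups can then use the SEARCH function over the indexed table.
SELECT *
FROM mydataset.logs
WHERE SEARCH(logs, "corp.example.com");
```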
Optimize query operations

You can optimize your query operations by using the options described in the following sections.
Avoid repeatedly transforming data

Best practice: If you are using SQL to perform ETL operations, then avoid situations where you are repeatedly transforming the same data.
For example, if you are using SQL to trim strings or extract data by using regular expressions, it is more performant to materialize the transformed results in a destination table. Functions like regular expressions require additional computation. Querying the destination table without the added transformation overhead is much more efficient.
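For instance, the transformation can be written once to a destination table, and downstream queries can read the clean table directly. The table and column names here are hypothetical:

```sql
-- Materialize the transformed rows once...
CREATE TABLE mydataset.clean_events AS
SELECT
  TRIM(user_name) AS user_name,
  REGEXP_EXTRACT(url, r'https?://([^/]+)/') AS host
FROM mydataset.raw_events;

-- ...then query them without re-running TRIM or REGEXP_EXTRACT.
SELECT host, COUNT(*) AS hits
FROM mydataset.clean_events
GROUP BY host;
```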
Avoid multiple evaluations of the same CTEs

Best practice: Use procedural language, variables, temporary tables, and automatically expiring tables to persist calculations and use them later in the query.
When your query contains common table expressions (CTEs) that are used in multiple places in the query, they might end up being evaluated each time they are referenced. The query optimizer attempts to detect parts of the query that could be executed only once, but this might not always be possible. As a result, using a CTE might not help reduce internal query complexity and resource consumption.
You can store the result of a CTE in a scalar variable or a temporary table depending on the data that the CTE returns.
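As a sketch, a CTE that would otherwise be referenced several times can be computed once into a temporary table, and a scalar result can be kept in a variable. The table and column names are hypothetical:

```sql
-- Variables must be declared at the start of the script.
DECLARE avg_total FLOAT64;

-- Evaluate the expensive aggregation exactly once.
CREATE TEMP TABLE user_totals AS
SELECT user_id, SUM(amount) AS total
FROM mydataset.payments
GROUP BY user_id;

SET avg_total = (SELECT AVG(total) FROM user_totals);

-- Both the temp table and the variable are reused without re-computation.
SELECT user_id, total
FROM user_totals
WHERE total > avg_total;
```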
Avoid repeated joins and subqueries

Best practice: Avoid repeatedly joining the same tables and using the same subqueries.
Instead of repeatedly joining the data, it might be more performant for you to use nested repeated data to represent the relationships. Nested repeated data saves you the performance impact of the communication bandwidth that a join requires. It also saves you the I/O costs that you incur by repeatedly reading and writing the same data. For more information, see Use nested and repeated fields.
Similarly, repeating the same subqueries affects performance through repetitive query processing. If you are using the same subqueries in multiple queries, consider materializing the subquery results in a table. Then consume the materialized data in your queries.
Materializing your subquery results improves performance and reduces the overall amount of data that BigQuery reads and writes. The small cost of storing the materialized data outweighs the performance impact of repeated I/O and query processing.
Optimize your join patterns

Best practice: For queries that join data from multiple tables, optimize your join patterns by starting with the largest table.
When you create a query by using a JOIN clause, consider the order in which you are merging the data. The GoogleSQL query optimizer determines which table should be on which side of the join. As a best practice, place the table with the largest number of rows first, followed by the table with the fewest rows, and then place the remaining tables by decreasing size.

When you have a large table as the left side of the JOIN and a small one on the right side of the JOIN, a broadcast join is created. A broadcast join sends all the data in the smaller table to each slot that processes the larger table. It is advisable to perform the broadcast join first.

To view the size of the tables in your JOIN, see Get information about tables.
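As a sketch with hypothetical tables, the largest table goes first and the smallest second:

```sql
-- clicks (billions of rows) first, currency_codes (hundreds of rows)
-- second: the small table is broadcast to every slot scanning the large one.
SELECT
  c.click_id,
  cc.currency_name
FROM mydataset.clicks AS c
JOIN mydataset.currency_codes AS cc
  ON c.currency_code = cc.currency_code;
```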
Specify constraints in the table schema

Best practice: Specify key constraints in the table schema when table data satisfies the data integrity requirements of primary key or foreign key constraints. The query engine can use the key constraints to optimize query plans.
BigQuery doesn't automatically check for data integrity, so you must ensure that your data meets the constraints specified in the table schema. If you don't maintain data integrity in tables with specified constraints, your query results might be inaccurate.
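BigQuery expresses these as unenforced constraints. A sketch with hypothetical tables:

```sql
CREATE TABLE mydataset.customers (
  customer_id INT64 NOT NULL,
  name STRING,
  -- BigQuery records but does not enforce the constraint.
  PRIMARY KEY (customer_id) NOT ENFORCED
);

CREATE TABLE mydataset.orders (
  order_id INT64 NOT NULL,
  customer_id INT64,
  PRIMARY KEY (order_id) NOT ENFORCED,
  FOREIGN KEY (customer_id)
    REFERENCES mydataset.customers (customer_id) NOT ENFORCED
);
```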
Optimize the ORDER BY clause

Best practice: When you use the ORDER BY clause, follow these best practices:

Use ORDER BY in the outermost query or within window clauses. Placing an ORDER BY clause in the middle of a query greatly impacts performance unless it is being used in a window function.

Push complex operations to the end of the query.
Another technique for ordering your query is to push complex operations, such as regular expressions and mathematical functions, to the end of the query. This technique reduces the data to be processed before the complex operations are performed.
Use a LIMIT clause. If you are ordering a very large number of values but don't need to have all of them returned, use a LIMIT clause. For example, the following query orders a very large result set and throws a Resources exceeded error. The query sorts by the title column in mytable. The title column contains millions of values.
SELECT title FROM `my-project.mydataset.mytable` ORDER BY title;
To remove the error, use a query like the following:
SELECT title FROM `my-project.mydataset.mytable` ORDER BY title DESC LIMIT 1000;
Use a window function. If you are ordering a very large number of values, use a window function, and limit data before calling the window function. For example, the following query lists the ten oldest Stack Overflow users and their ranking, with the oldest account being ranked lowest:
SELECT
  id,
  reputation,
  creation_date,
  DENSE_RANK() OVER (ORDER BY creation_date) AS user_rank
FROM `bigquery-public-data.stackoverflow.users`
ORDER BY user_rank ASC
LIMIT 10;
This query takes approximately 15 seconds to run. This query uses LIMIT at the end of the query, but not in the DENSE_RANK() OVER window function. Because of this, the query requires all of the data to be sorted on a single worker node.
Instead, you should limit the dataset before computing the window function in order to improve performance:
WITH users AS (
  SELECT
    id,
    reputation,
    creation_date
  FROM `bigquery-public-data.stackoverflow.users`
  ORDER BY creation_date ASC
  LIMIT 10
)
SELECT
  id,
  reputation,
  creation_date,
  DENSE_RANK() OVER (ORDER BY creation_date) AS user_rank
FROM users
ORDER BY user_rank;
This query takes approximately 2 seconds to run, while returning the same results as the previous query.
One caveat is that the DENSE_RANK()
function ranks the data within years, so for ranking data that spans across multiple years, these queries don't give identical results.
Split complex queries into multiple smaller ones

Best practice: Leverage multi-statement query capabilities and stored procedures to perform the computations that were designed as one complex query as multiple smaller and simpler queries instead.
Complex queries, REGEX
functions, and layered subqueries or joins can be slow and resource intensive to run. Trying to fit all computations in one huge SELECT
statement, for example to make it a view, is sometimes an antipattern, and it can result in a slow, resource-intensive query. In extreme cases, the internal query plan becomes so complex that BigQuery is unable to execute it.
Splitting up a complex query allows for materializing intermediate results in variables or temporary tables. You can then use these intermediate results in other parts of the query. It is especially useful when these results are needed in more than one place in the query.
Often it lets you better express the true intent of parts of the query with temporary tables being the data materialization points.
Use nested and repeated fields

For information about how to denormalize data storage using nested and repeated fields, see Use nested and repeated fields.
Use INT64 data types in joins

Best practice: Use INT64 data types in joins instead of STRING data types to reduce cost and improve comparison performance.

BigQuery doesn't index primary keys like traditional databases, so the wider the join column is, the longer the comparison takes. Therefore, using INT64 data types in joins is cheaper and more efficient than using STRING data types.
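As a sketch with a hypothetical schema, join on a narrow INT64 surrogate key rather than a wide STRING key:

```sql
-- customer_id is INT64 on both sides; comparing it is cheaper than
-- comparing a STRING key such as an email address or UUID.
SELECT
  o.order_id,
  c.name
FROM mydataset.orders AS o
JOIN mydataset.customers AS c
  ON o.customer_id = c.customer_id;
```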
Reduce query outputs

You can reduce the query outputs by using the options described in the following sections.
Materialize large result sets

Best practice: Consider materializing large result sets to a destination table. Writing large result sets has performance and cost impacts.
BigQuery limits cached results to approximately 10 GB compressed. Queries that return larger results exceed this limit and frequently result in the following error: Response too large.
This error often occurs when you select a large number of fields from a table with a considerable amount of data. Issues writing cached results can also occur in ETL-style queries that normalize data without reduction or aggregation.
You can overcome the limitation on cached result size by using the following options:

Use a LIMIT clause to reduce the result set, especially if you are using an ORDER BY clause.
Page through the results by using the BigQuery REST API. For more information, see Paging through table data.
Note: Writing very large result sets to destination tables impacts query performance (I/O). In addition, you incur a small cost for storing the destination table. You can automatically delete a large destination table by using the dataset's default table expiration. For more information, see Use the expiration settings in the storage best practices.

Avoid anti-SQL patterns

The following best practices provide guidance on avoiding query anti-patterns that impact performance in BigQuery.
Avoid self joins

Best practice: Instead of using self-joins, use a window (analytic) function or the PIVOT operator.
Typically, self-joins are used to compute row-dependent relationships. The result of using a self-join is that it potentially squares the number of output rows. This increase in output data can cause poor performance.
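For example, a day-over-day difference that might otherwise be written as a self-join can use the LAG window function instead. The table and columns are hypothetical:

```sql
-- Each row gets the change from the user's previous day without
-- joining daily_totals to itself.
SELECT
  user_id,
  event_date,
  amount,
  amount - LAG(amount) OVER (
    PARTITION BY user_id
    ORDER BY event_date
  ) AS day_over_day_change
FROM mydataset.daily_totals;
```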
Avoid cross joins

Best practice: Avoid joins that generate more outputs than inputs. When a CROSS JOIN is required, pre-aggregate your data.
Cross joins are queries where each row from the first table is joined to every row in the second table, with non-unique keys on both sides. The worst case output is the number of rows in the left table multiplied by the number of rows in the right table. In extreme cases, the query might not finish.
If the query job completes, the query plan explanation shows output rows versus input rows. You can confirm a Cartesian product by modifying the query to print the number of rows on each side of the JOIN clause, grouped by the join key.
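A quick diagnostic sketch (table names hypothetical) that counts rows per join key on each side:

```sql
-- Keys with large counts on both sides multiply in the join output.
SELECT "left" AS side, join_key, COUNT(*) AS row_count
FROM mydataset.left_table
GROUP BY join_key
UNION ALL
SELECT "right", join_key, COUNT(*)
FROM mydataset.right_table
GROUP BY join_key
ORDER BY row_count DESC;
```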
To avoid performance issues associated with joins that generate more outputs than inputs, use a GROUP BY clause to pre-aggregate the data.

Avoid DML statements that update or insert single rows

Best practice: Avoid DML statements that update or insert single rows. Batch your updates and inserts.
Using point-specific DML statements is an attempt to treat BigQuery like an Online Transaction Processing (OLTP) system. BigQuery focuses on Online Analytical Processing (OLAP) by using table scans and not point lookups. If you need OLTP-like behavior (single-row updates or inserts), consider a database designed to support OLTP use cases such as Cloud SQL.
BigQuery DML statements are intended for bulk updates. UPDATE
and DELETE
DML statements in BigQuery are oriented towards periodic rewrites of your data, not single row mutations. The INSERT
DML statement is intended to be used sparingly. Inserts consume the same modification quotas as load jobs. If your use case involves frequent single row inserts, consider streaming your data instead.
If batching your UPDATE statements yields many tuples in very long queries, you might approach the query length limit of 256 KB. To work around the query length limit, consider whether your updates can be handled based on a logical criterion instead of a series of direct tuple replacements.
For example, you could load your set of replacement records into another table, then write the DML statement to update all values in the original table if the non-updated columns match. For example, if the original data is in table t
and the updates are staged in table u
, the query would look like the following:
UPDATE dataset.t t
SET my_column = u.my_column
FROM dataset.u u
WHERE t.my_key = u.my_key

Use alias names for similarly named columns
Best Practice: Use column and table aliases when you work with similarly named columns across queries, including subqueries.
Aliases help to identify which columns and tables are referenced in addition to your initial reference of the column. Using aliases can help you understand and address problems in your SQL query, including finding the columns that are used in subqueries.
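For example (hypothetical tables), aliases make it clear which id each column refers to:

```sql
-- Both tables have an id column; aliases disambiguate them.
SELECT
  c.id AS comment_id,
  u.id AS user_id,
  u.display_name
FROM mydataset.comments AS c
JOIN mydataset.users AS u
  ON c.user_id = u.id;
```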