This document describes how to use metadata caching (also known as column metadata indexing) to improve query performance on object tables and some types of BigLake tables.
Object tables and some types of BigLake tables can cache metadata information about files in external datastores, for example, Cloud Storage. The following types of BigLake tables support metadata caching:

- Amazon S3 BigLake tables
- Cloud Storage BigLake tables
The metadata includes file names, partitioning information, and physical metadata from files, such as row counts. You can choose whether to enable metadata caching on a table. Queries with a large number of files and with Hive partition filters benefit the most from metadata caching.
If you don't enable metadata caching, queries on the table must read the external data source to get object metadata. Reading this data increases the query latency; listing millions of files from the external data source can take several minutes. If you enable metadata caching, queries can avoid listing files from the external data source and can partition and prune files more quickly.
You can enable metadata caching on a BigLake or object table when you create the table. For more information about creating object tables, see Create object tables. For more information about creating BigLake tables, see Create Amazon S3 BigLake external tables or Create BigLake external tables for Cloud Storage.

Two properties control the behavior of this feature: the maximum staleness interval, which specifies when queries can use cached metadata, and the metadata caching mode, which specifies how the cache is refreshed.
When you have metadata caching enabled, you specify the maximum interval of metadata staleness that is acceptable for operations against the table. For example, if you specify an interval of 1 hour, then operations against the table use cached metadata if it has been refreshed within the past hour. If the cached metadata is older than that, the operation falls back to retrieving metadata from the datastore (Amazon S3 or Cloud Storage) instead. You can specify a staleness interval between 30 minutes and 7 days.
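For example, the staleness interval maps to the max_staleness table option, which takes an INTERVAL literal. A minimal sketch of changing it on an existing table (the table name is hypothetical):

ALTER TABLE `mydataset.my_biglake_table`
SET OPTIONS (
  -- Allow cached metadata up to 4 hours old.
  max_staleness = INTERVAL 4 HOUR
);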
You can choose to refresh the cache either automatically or manually:

If you refresh the cache automatically, the cache is refreshed at a system-defined interval. Automatic refreshes are a good approach if the files in the datastore are added, deleted, or modified at unpredictable intervals.

If you refresh the cache manually, you call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the metadata cache on whatever schedule you determine. For BigLake tables, you can refresh the metadata selectively by providing subdirectories of the table data directory, which lets you avoid unnecessary metadata processing. Refreshing the cache manually is a good approach if the files in the datastore are added, deleted, or modified at known intervals, for example, as the output of a pipeline.

Both manual and automatic cache refreshes are executed with INTERACTIVE query priority.
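For example, a manual refresh is a single system procedure call; the table name here is hypothetical:

-- Refresh the metadata cache for one table on demand.
CALL BQ.REFRESH_EXTERNAL_METADATA_CACHE('mydataset.my_biglake_table');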
If you choose to use automatic refreshes, we recommend that you create a reservation, and then create an assignment with a BACKGROUND job type for the project that runs the metadata cache refresh jobs. This prevents the refresh jobs from competing with user queries for resources, and from potentially failing when sufficient resources aren't available.
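As a sketch, assuming hypothetical administration project, reservation, and assignee names, you can create the reservation and assignment with DDL statements such as the following:

-- Create a reservation to run background jobs.
CREATE RESERVATION `admin-project.region-us.metadata-refresh`
OPTIONS (slot_capacity = 100);

-- Route the project's metadata cache refresh jobs to the reservation.
CREATE ASSIGNMENT `admin-project.region-us.metadata-refresh.refresh-assignment`
OPTIONS (
  assignee = 'projects/my-project',
  job_type = 'BACKGROUND'
);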
You should consider how the staleness interval and metadata caching mode values will interact before you set them. For example, if you refresh the metadata cache manually and you set the staleness interval to 2 days, you must run the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure every 2 days or less if you want operations against the table to use cached metadata.

For more information on setting metadata caching options for BigLake tables, see Create Amazon S3 BigLake external tables or Create BigLake external tables for Cloud Storage.
For more information on setting metadata caching options for object tables, see Create object tables.
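For illustration, here is a minimal sketch of a Cloud Storage BigLake table created with both caching options set; the dataset, connection, and bucket names are hypothetical:

CREATE EXTERNAL TABLE `mydataset.my_biglake_table`
WITH CONNECTION `my-project.us.my-connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/data/*.parquet'],
  -- Use cached metadata if it was refreshed within the past 4 hours.
  max_staleness = INTERVAL 4 HOUR,
  -- Refresh the cache on a system-defined schedule.
  metadata_cache_mode = 'AUTOMATIC'
);

Set metadata_cache_mode to 'MANUAL' instead if you plan to refresh the cache with BQ.REFRESH_EXTERNAL_METADATA_CACHE.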
To find information about metadata cache refresh jobs, query the INFORMATION_SCHEMA.JOBS view, as shown in the following example:

SELECT *
FROM `region-us.INFORMATION_SCHEMA.JOBS`
WHERE job_id LIKE '%metadata_cache_refresh%'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 6 HOUR)
ORDER BY start_time DESC
LIMIT 10;
Cached metadata is protected by the customer-managed encryption key (CMEK) used for the table that the cached metadata is associated with. This might be a CMEK applied directly to the table, or a CMEK that the table inherits from the dataset or project.
If a default CMEK is set for the project or dataset, or if the existing CMEK for the project or dataset is changed, this doesn't affect existing tables or their cached metadata. You must change the key for the table to apply the new key to both the table and its cached metadata.
CMEKs created in BigQuery don't apply to the Cloud Storage files that are used by BigLake and object tables. To obtain end-to-end CMEK encryption, configure CMEKs in Cloud Storage for those files.
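As a sketch, assuming hypothetical bucket and key names, you can set a default CMEK on the Cloud Storage bucket with the gcloud CLI:

gcloud storage buckets update gs://my-bucket \
  --default-encryption-key=projects/my-project/locations/us/keyRings/my-keyring/cryptoKeys/my-key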
To get information about metadata cache usage for a query job, call the jobs.get method for that job and look at the MetadataCacheStatistics field in the JobStatistics2 section of the Job resource. This field identifies which metadata cache-enabled tables the query referenced, whether the query used the metadata cache, and if not, the reason why not.
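For example, assuming a hypothetical job ID, you can retrieve the same job resource with the bq CLI and look for the metadata cache statistics in the query statistics of the JSON output:

bq show --format=prettyjson -j my-project:US.bquxjob_12345_abcde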
For BigLake tables that are based on Parquet files, table statistics are collected when the metadata cache is refreshed. Table statistic collection happens during both automatic and manual refreshes, and the statistics are kept for the same period as the metadata cache.
The table statistics collected include file information like row counts, physical and uncompressed file sizes, and cardinality of columns. When you run a query on a Parquet-based BigLake table, these statistics are supplied to the query optimizer to enable better query planning and potentially improve query performance for some types of queries. For example, a common query optimization is dynamic constraint propagation, where the query optimizer dynamically infers predicates on the larger fact tables in a join from the smaller dimension tables. While this optimization can speed up queries that use normalized table schemas, it requires accurate table statistics. The table statistics collected by metadata caching enable greater optimization of query plans in both BigQuery and Apache Spark.
Limitations

The following limitations apply to the metadata cache:
If you are manually refreshing the metadata cache, and your target dataset and Cloud Storage bucket are in a regional location, you must explicitly specify this location when you run the BQ.REFRESH_EXTERNAL_METADATA_CACHE procedure call. You can do this in one of the following ways:
In the Google Cloud console:

1. Go to the BigQuery page.
2. Select a tab in the Editor.
3. Click More, and then click Query settings.
4. In the Advanced options section, unselect the Automatic location selection checkbox, and then specify the target region.
5. Click Save.
6. Run the query containing the BQ.REFRESH_EXTERNAL_METADATA_CACHE procedure call in that Editor tab.
If you run the query containing the BQ.REFRESH_EXTERNAL_METADATA_CACHE procedure call by using bq query, be sure to specify the --location flag.
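For example, a minimal sketch with hypothetical location, dataset, and table names:

bq query --location=us-central1 --nouse_legacy_sql \
  'CALL BQ.REFRESH_EXTERNAL_METADATA_CACHE("mydataset.my_biglake_table")'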