Caching is an essential technique for improving the performance of data warehouse systems by avoiding the need to recompute or fetch the same data multiple times. In Databricks SQL, caching can significantly speed up query execution and minimize warehouse usage, resulting in lower costs and more efficient resource utilization. Each caching layer improves query performance, minimizes cluster usage, and optimizes resource utilization for a seamless data warehouse experience.
Caching provides numerous advantages in data warehouses, including:
Databricks SQL performs several types of query caching.
Databricks SQL UI cache: Per user caching of all query and dashboard results in the Databricks SQL UI. When users first open a dashboard or SQL query, the Databricks SQL UI cache displays the most recent query result, including the results from scheduled executions.
The Databricks SQL UI cache has at most a 7-day life cycle. The cache is located within your Databricks filesystem in your account. You can delete query results by re-running the query that you no longer want to be stored. Once re-run, the old query results are removed from cache. Additionally, the cache is invalidated once the underlying tables have been updated.
Result cache: Per cluster caching of query results for all queries through SQL warehouses. Result caching includes both local and remote result caches, which work together to improve query performance by storing query results in memory or remote storage mediums.
Accessing the remote result cache requires a running warehouse. When processing a query, a cluster first looks in its local cache and then looks in the remote result cache if necessary. Only if the query result isn't cached in either cache is the query executed. Both the local and the remote caches have a life cycle of 24 hours, which starts at cache entry. The remote result cache persists through the stopping or restarting of a SQL warehouse. Both caches are invalidated when the underlying tables are updated.
Remote result cache is available for queries using ODBC / JDBC clients and SQL Statement API.
To disable query result caching, you can run SET use_cached_result = false
in the SQL editor.
important
You should use this option only in testing or benchmarking.
Disk cache: Local SSD caching for data read from data storage for queries through SQL warehouses. The disk cache is designed to enhance query performance by storing data on disk, allowing for accelerated data reads. Data is automatically cached when files are fetched, utilizing a fast intermediate format. By storing copies of the files on the local storage attached to compute nodes, the disk cache ensures the data is located closer to the workers, resulting in improved query performance. See Optimize performance with caching on Databricks.
In addition to its primary function, the disk cache automatically detects changes to the underlying data files. When it detects changes, the cache is invalidated. The disk cache shares the same lifecycle characteristics as the local result cache. This means that when the cluster is stopped or restarted, the cache is cleaned and needs to be repopulated.
The query results caching and disk cache affect queries in the Databricks SQL UI and BI and other external clients.
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4