
Optimize performance with caching on Databricks

Databricks uses disk caching to accelerate data reads by creating copies of remote Parquet data files in nodes' local storage using a fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the same data are then performed locally, which results in significantly improved reading speed. The cache works for all Parquet data files (including Delta Lake tables).
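
As a rough illustration, the first read of a remote Parquet-backed table fetches the files and caches them locally; later reads of the same files are served from the local copies. A minimal sketch, assuming a hypothetical table named sales (the speedup you see depends on the cluster and data size):

Scala

// First read: files are fetched from remote storage and, with the disk
// cache enabled, transparently copied onto the workers' local SSDs.
val sales = spark.read.table("sales") // hypothetical table name
sales.filter("region = 'EMEA'").count()

// Later reads of the same files are served from the local cache, with no
// code changes required.
sales.filter("region = 'APAC'").count()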

note

In SQL warehouses and Databricks Runtime 14.2 and above, the CACHE SELECT command is ignored. An enhanced disk caching algorithm is used instead.

Delta cache renamed to disk cache

Disk caching on Databricks was formerly referred to as the Delta cache and the DBIO cache. Disk caching is a proprietary Databricks feature. The name change resolves the misconception that disk caching is part of the Delta Lake protocol.

Disk cache vs. Spark cache

The Databricks disk cache differs from Apache Spark caching. Databricks recommends using automatic disk caching.

The following table summarizes the key differences between disk and Apache Spark caching so that you can choose the best tool for your workflow:

| Feature | Disk cache | Apache Spark cache |
| --- | --- | --- |
| Stored as | Local files on a worker node | In-memory blocks; depends on storage level |
| Applied to | Any Parquet table stored on S3, ABFS, and other file systems | Any DataFrame or RDD |
| Triggered | Automatically, on the first read (if the cache is enabled) | Manually, requires code changes |
| Evaluated | Lazily | Lazily |
| Availability | Can be enabled or disabled with configuration flags; disabled on certain node types | Always available |
| Evicted | Automatically on any file change; manually when restarting a cluster | Automatically in LRU fashion; manually with unpersist |
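
To make the "triggered" row concrete: the disk cache needs no code at all, while the Spark cache is opt-in per DataFrame. A minimal sketch, assuming a hypothetical table named orders:

Scala

// Disk cache: nothing to call. With the cache enabled, reading a
// Parquet-backed table caches the underlying files on local SSDs as a
// side effect of the first read.
val orders = spark.read.table("orders") // hypothetical table name
orders.count()

// Spark cache: opt-in per DataFrame, and only materialized by an action.
orders.cache()
orders.count() // this action populates the in-memory Spark cache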

Disk cache consistency

The disk cache automatically detects when data files are created, deleted, modified, or overwritten and updates its content accordingly. You can write, modify, and delete table data with no need to explicitly invalidate cached data. Any stale entries are automatically invalidated and evicted from the cache.
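
For example, overwriting a table requires no manual invalidation; the next read sees the new files. A minimal sketch, assuming a hypothetical table named events:

Scala

// With the disk cache enabled, this read caches the table's Parquet files.
spark.read.table("events").count()

// Overwrite the table; the cached copies of the old files become stale.
spark.range(1000).toDF("id").write.mode("overwrite").saveAsTable("events")

// No explicit invalidation is needed: stale entries are evicted, and this
// read fetches and caches the new files.
spark.read.table("events").count()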

Selecting instance types to use disk caching

The recommended (and easiest) way to use disk caching is to choose a worker type with SSD volumes when you configure your cluster. Such workers are enabled and configured for disk caching.

The disk cache is configured to use at most half of the space available on the local SSDs provided with the worker nodes. For configuration options, see Configure the disk cache.
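
For example, on a worker that provides 600 GB of local SSD storage, the cache would use at most about 300 GB; this figure is illustrative, and the exact limit depends on the instance type and configuration.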

Configure the disk cache

Databricks recommends that you choose cache-accelerated worker instance types for your compute. Such instances are automatically configured optimally for the disk cache.

note

When a worker is decommissioned, the Spark cache stored on that worker is lost. So if autoscaling is enabled, the cache is somewhat unstable, and Spark must reread the missing partitions from the source as needed.

Configure disk usage

To configure how the disk cache uses the worker nodes' local storage, specify the following Spark configuration settings during cluster creation:

spark.databricks.io.cache.maxDiskUsage: disk space per node reserved for cached data, in bytes.
spark.databricks.io.cache.maxMetaDataCache: disk space per node reserved for cached metadata, in bytes.
spark.databricks.io.cache.compression.enabled: whether the cache is stored in compressed format.

Example configuration:

ini

spark.databricks.io.cache.maxDiskUsage 50g
spark.databricks.io.cache.maxMetaDataCache 1g
spark.databricks.io.cache.compression.enabled false

Enable or disable the disk cache

To see the current setting for the disk cache, run the following command:

Scala

spark.conf.get("spark.databricks.io.cache.enabled")
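
The call returns the current value as a string, for example true on instance types where the cache is enabled by default.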

To enable or disable the disk cache, run:

Scala

spark.conf.set("spark.databricks.io.cache.enabled", "[true | false]")

Disabling the cache does not drop the data already in local storage. Instead, it prevents queries from adding new data to the cache or reading data from it.
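
For example, you might temporarily disable the cache around a large one-off scan so it does not evict hot data, then re-enable it. A minimal sketch, assuming a hypothetical table named archive_2019 (whether this helps depends on the workload):

Scala

// Disable the disk cache; data already cached stays on local disk but is
// neither read from nor added to.
spark.conf.set("spark.databricks.io.cache.enabled", "false")
spark.read.table("archive_2019").count() // hypothetical one-off scan

// Re-enable the cache for the regular workload.
spark.conf.set("spark.databricks.io.cache.enabled", "true")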

