Showing content from https://docs.databricks.com/aws/en/compute/cluster-config-best-practices below:


Compute configuration recommendations

This article includes recommendations and best practices related to compute configuration.

If your workload is supported, Databricks recommends using serverless compute rather than configuring your own compute resource. Serverless compute is the simplest and most reliable compute option. It requires no configuration, is always available, and scales according to your workload. Serverless compute is available for notebooks, jobs, and Lakeflow Declarative Pipelines. See Connect to serverless compute.

Additionally, data analysts can use serverless SQL warehouses to query and explore data on Databricks. See What are Serverless SQL warehouses?.

Use compute policies​

If you are creating new compute from scratch, Databricks recommends using compute policies. Compute policies let you create preconfigured compute resources designed for specific purposes, such as personal compute, shared compute, power users, and jobs. Policies limit the decisions you need to make when configuring compute settings.

If you don't have access to policies, contact your workspace admin. See Default policies and policy families.
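As a concrete illustration, a compute policy is defined as a JSON document mapping cluster attributes to rules such as `fixed`, `range`, and `allowlist`. The sketch below uses hypothetical limits and node types chosen for illustration; your workspace admin would pick values appropriate for your organization.

```python
import json

# A minimal sketch of a cluster policy definition. The limits and the
# node type allowlist are illustrative, not recommended values.
policy_definition = {
    "autotermination_minutes": {
        "type": "range",
        "maxValue": 120,      # force clusters to shut down within 2 hours
        "defaultValue": 60,
    },
    "node_type_id": {
        "type": "allowlist",
        "values": ["i3.xlarge", "i3.2xlarge"],  # hypothetical approved types
    },
    "num_workers": {
        "type": "range",
        "maxValue": 10,       # cap cluster size
    },
}

print(json.dumps(policy_definition, indent=2))
```

A policy like this narrows the choices users face when creating compute, which is exactly the purpose described above.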

Compute sizing considerations​

note

The following recommendations assume that you have unrestricted cluster creation. Workspace admins should only grant this privilege to advanced users.

People often think of compute size in terms of the number of workers, but other factors matter just as much: the total number of executor cores and the total executor memory across the compute resource. Worker instance type and size also influence these factors.

When sizing your compute, consider the nature of the workload, how much data it processes, and how complex its transformations are. Answering these questions will help you determine optimal compute configurations for your workloads.

There's a balancing act between the number of workers and the size of the worker instance type. Compute configured with two workers, each with 16 cores and 128 GB of RAM, has the same total cores and memory as compute configured with eight workers, each with 4 cores and 32 GB of RAM.
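The equivalence above is simple arithmetic, which can be checked directly:

```python
# Two configurations with the same total capacity, as described above:
# fewer large workers vs. more small workers.
few_large = {"workers": 2, "cores_per_worker": 16, "ram_gb_per_worker": 128}
many_small = {"workers": 8, "cores_per_worker": 4, "ram_gb_per_worker": 32}

def totals(cfg):
    """Return (total cores, total RAM in GB) for a worker configuration."""
    return (cfg["workers"] * cfg["cores_per_worker"],
            cfg["workers"] * cfg["ram_gb_per_worker"])

print(totals(few_large))   # (32, 256)
print(totals(many_small))  # (32, 256)
```

Both configurations provide 32 cores and 256 GB of RAM in total; they differ in how that capacity is distributed, which affects shuffle behavior as discussed in the examples below.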

Compute configuration examples​

The following examples show compute recommendations based on specific types of workloads. These examples also include configurations to avoid and why those configurations are not suitable for the workload types.

note

All of the examples in this section (besides machine learning training) could benefit from using serverless compute rather than spinning up a new compute resource. If your workload isn't supported on serverless, use the recommendations below to help configure your compute resource.

Data analysis​

Data analysts typically run workloads that read data from many partitions, leading to many shuffle operations. A compute resource with a smaller number of larger nodes can reduce the network and disk I/O needed to perform these shuffles.

A single-node compute with a large VM type is likely the best choice, particularly for a single analyst.

Analytical workloads will likely require reading the same data repeatedly, so recommended node types are storage optimized with disk cache enabled or instances with local storage.
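Putting these recommendations together, a cluster spec for an analyst might look like the sketch below. The cluster name, worker count, and node type are illustrative assumptions; the Spark configuration key `spark.databricks.io.cache.enabled` enables the disk cache for repeated reads.

```python
# Sketch of a cluster spec for analytical workloads (illustrative values):
# a few storage-optimized nodes with the disk cache enabled.
analysis_cluster = {
    "cluster_name": "analyst-adhoc",   # hypothetical name
    "node_type_id": "i3.xlarge",       # storage-optimized instance (assumed choice)
    "num_workers": 2,                  # small count of larger nodes
    "spark_conf": {
        # Cache remote data on local disk so repeated reads hit local storage.
        "spark.databricks.io.cache.enabled": "true",
    },
}
```

For a single analyst, the same spec could be reduced to single-node compute with a large VM type, as noted above.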

Additional features, such as pools to reduce launch times and auto termination to avoid paying for idle compute, are also recommended for analytical workloads.

Basic batch ETL​

Simple batch ETL jobs that don't require wide transformations, such as joins or aggregations, typically benefit from Photon. Choose a general-purpose instance type that supports Photon.

Instances with modest memory and storage can also yield cost savings over larger worker types.
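A basic batch ETL cluster spec following these recommendations might look like the sketch below. The name, node type, and worker count are illustrative assumptions; Photon is selected with the Clusters API `runtime_engine` field.

```python
# Sketch of a basic batch-ETL cluster spec (illustrative values) that
# selects Photon via the `runtime_engine` field.
etl_cluster = {
    "cluster_name": "nightly-etl",   # hypothetical name
    "node_type_id": "m5d.xlarge",    # general-purpose instance (assumed choice)
    "num_workers": 4,
    "runtime_engine": "PHOTON",      # enable the Photon engine
}
```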

Complex batch ETL​

For a complex ETL job, such as one that requires unions and joins across multiple tables, Databricks recommends using fewer workers to reduce the amount of data shuffled. To compensate for having fewer workers, increase the size of your instances.

Complex transformations can be compute-intensive. If you observe significant spill to disk or OOM errors, increase the amount of memory available on your instances.

Optionally, use pools to decrease compute launch times and reduce total runtime when running job pipelines.
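Combining these points, a complex ETL cluster spec might use a small number of large workers drawn from a pool, as in the sketch below. The name, worker count, and pool ID are hypothetical placeholders.

```python
# Sketch of a complex-ETL cluster spec (illustrative values): fewer, larger
# workers to limit shuffle, drawing instances from a pool to cut launch time.
complex_etl_cluster = {
    "cluster_name": "complex-etl",         # hypothetical name
    "num_workers": 2,                      # fewer, larger workers to reduce shuffle
    "instance_pool_id": "pool-1234-abcd",  # hypothetical pool ID; the pool
                                           # determines the instance type
}
```

Note that when a cluster draws from a pool, the pool's instance type applies, so a separate node type is not specified here.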

Training machine learning models​

To train machine learning models, Databricks recommends creating a compute resource using the Personal compute policy.

Use single-node compute with a large node type for initial experimentation when training machine learning models. Having fewer nodes reduces the impact of shuffles.

Adding more workers can help with stability, but you should avoid adding too many workers because of the overhead of shuffling data.

Recommended worker types are storage-optimized instances with disk caching enabled, or instances with local storage, to account for repeated reads of the same data and to enable caching of training data.
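A single-node training cluster following these recommendations might look like the sketch below. The name and node type are illustrative assumptions; the `singleNode` profile settings and zero workers are the documented way to configure single-node compute.

```python
# Sketch of a single-node ML training cluster spec (illustrative values):
# zero workers plus the single-node profile settings.
ml_cluster = {
    "cluster_name": "ml-training",    # hypothetical name
    "num_workers": 0,                 # single-node compute
    "node_type_id": "i3.2xlarge",     # large, storage-optimized (assumed choice)
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",   # run Spark locally on the driver
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
```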

Additional features, such as auto termination and pools, are also recommended for machine learning workloads.

