Best practices for cost optimization

This article covers best practices supporting the principles of cost optimization, organized by principle.

1. Choose optimal resources

Use performance-optimized data formats

To get the most out of the Databricks Data Intelligence Platform, you must use Delta Lake as your storage framework. It helps build simpler and more reliable ETL pipelines and comes with many performance enhancements that can significantly speed up workloads compared to Parquet, ORC, and JSON. See Optimization recommendations on Databricks. If the workload runs on job compute, faster execution translates directly into shorter uptime of compute resources and lower costs.
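As a minimal sketch of this in practice, the PySpark snippet below writes a raw dataset to a Delta table and compacts it with OPTIMIZE; the source path and table name are illustrative assumptions.

```python
# Minimal sketch: store data in Delta format and compact it.
# The source path and table name are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw = spark.read.json("/landing/events/")  # hypothetical raw JSON input

# Writing in Delta format enables data skipping, OPTIMIZE, and other
# performance features that plain Parquet/ORC/JSON files lack.
raw.write.format("delta").mode("overwrite").saveAsTable("main.analytics.events")

# Compact small files; faster scans shorten job uptime and lower cost.
spark.sql("OPTIMIZE main.analytics.events")
```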

Use job compute

A job is a way to run non-interactive code on a Databricks compute instance. For example, you can run an extract, transform, and load (ETL) workload interactively or on a schedule. Of course, you can also run jobs interactively in the notebook UI. However, on job compute, the non-interactive workloads will cost significantly less than on all-purpose compute. See the pricing overview to compare Jobs Compute and All-Purpose Compute.

An additional benefit for some jobs is that each job can run on a new compute instance, isolating workloads from each other. However, multitask jobs can also reuse compute resources for all tasks, so the compute startup time occurs only once per job. See Configure compute for jobs.
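The following hedged sketch uses the Databricks SDK for Python to define a multitask job in which both tasks share a single job cluster, so startup cost is paid once per run; the notebook paths, instance type, and runtime version are assumptions.

```python
# Sketch: a multitask job whose tasks share a single job cluster,
# so cluster startup cost is paid once per run. Names are illustrative.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute, jobs

w = WorkspaceClient()

shared = jobs.JobCluster(
    job_cluster_key="shared",
    new_cluster=compute.ClusterSpec(
        spark_version="15.4.x-scala2.12",  # assumed DBR version
        node_type_id="m5d.xlarge",         # assumed instance type
        num_workers=2,
    ),
)

w.jobs.create(
    name="nightly-etl",
    job_clusters=[shared],
    tasks=[
        jobs.Task(
            task_key="ingest",
            job_cluster_key="shared",
            notebook_task=jobs.NotebookTask(notebook_path="/ETL/ingest"),
        ),
        jobs.Task(
            task_key="transform",
            depends_on=[jobs.TaskDependency(task_key="ingest")],
            job_cluster_key="shared",  # reuses the same compute
            notebook_task=jobs.NotebookTask(notebook_path="/ETL/transform"),
        ),
    ],
)
```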

Use SQL warehouse for SQL workloads

For interactive SQL workloads, a Databricks SQL warehouse is the most cost-efficient engine. See the pricing overview. All SQL warehouses come with Photon by default, which accelerates your existing SQL and DataFrame API calls and reduces your overall cost per workload.

In addition, serverless SQL warehouses support intelligent workload management (IWM), a set of features that enhances the ability of Databricks SQL Serverless to process large numbers of queries quickly and cost-effectively.
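As an illustrative sketch using the Databricks SDK for Python, the snippet below creates a serverless SQL warehouse with an aggressive auto-stop setting; the warehouse name, size, and limits are assumptions.

```python
# Sketch: a serverless SQL warehouse that stops quickly when idle.
# Name, size, and scaling limits are illustrative.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

w.warehouses.create(
    name="bi-serverless",
    cluster_size="Small",
    enable_serverless_compute=True,  # starts in seconds, so an
    auto_stop_mins=5,                # aggressive auto-stop is affordable
    max_num_clusters=3,              # scale out for bursts of queries
)
```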

Use up-to-date runtimes for your workloads

The Databricks platform provides different runtimes that are optimized for data engineering tasks (Databricks Runtime) or machine learning tasks (Databricks Runtime for Machine Learning). The runtimes are built to provide the best selection of libraries for the tasks, and to ensure that all libraries provided are up-to-date and work together optimally. The Databricks Runtimes are released on a regular cadence, providing performance improvements between major releases. These performance improvements often result in cost savings due to more efficient use of compute resources.

Only use GPUs for the right workloads

Virtual machines with GPUs can dramatically speed up computations for deep learning, but are significantly more expensive than CPU-only machines. Use GPU instances only for workloads that have GPU-accelerated libraries.

Most workloads do not use GPU-accelerated libraries, so they do not benefit from GPU-enabled instances. Workspace administrators can restrict GPU machines and compute resources to prevent unnecessary usage. See the blog post “Are GPUs Really Expensive? Benchmarking GPUs for Inference on Databricks Clusters”.

Use serverless services for your workloads

BI use cases

BI workloads typically consume data in bursts and generate multiple concurrent queries. For example, someone using a BI tool might update a dashboard or write a query and then simply analyze the results without further interaction with the platform. In this scenario, the data platform should serve bursts of queries quickly, yet terminate compute resources during the idle periods in between to avoid paying for unused capacity.

Non-serverless Databricks SQL warehouses have a startup time of minutes, so many users tend to accept the higher cost and do not terminate them during idle periods. On the other hand, serverless SQL warehouses start and scale up in seconds, so both instant availability and idle termination can be achieved. This results in a great user experience and overall cost savings.

Additionally, serverless SQL warehouses scale down earlier than non-serverless warehouses, again, resulting in lower costs.

ML and AI model serving

Most models are served as a REST API for integration into your web or client application. The model serving service receives varying loads of requests over time, and a model serving platform should always provide sufficient resources, but only as many as are actually needed (scaling up and down).

Mosaic AI Model Serving uses serverless compute and provides a highly available and low latency service for deploying models. The service automatically scales up or down to meet changes in demand, reducing infrastructure costs while optimizing latency performance.
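A hedged sketch with the Databricks SDK for Python: the endpoint below serves a registered model with scale-to-zero enabled, so no compute cost accrues while the endpoint is idle. The model name and version are hypothetical.

```python
# Sketch: a serverless model serving endpoint that scales to zero
# when idle. Entity name and version are hypothetical.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import (
    EndpointCoreConfigInput,
    ServedEntityInput,
)

w = WorkspaceClient()

w.serving_endpoints.create(
    name="churn-model",
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="main.models.churn",  # hypothetical UC model
                entity_version="1",
                workload_size="Small",
                scale_to_zero_enabled=True,  # no cost while idle
            )
        ]
    ),
)
```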

Use the right instance type

Using the latest generation of cloud instance types almost always provides performance benefits, as they offer the newest hardware features and the best price-performance.

For example, Graviton2-based Amazon EC2 instances can deliver significantly better price-performance than comparable Amazon EC2 instances.

Based on your workloads, it is also important to choose the right instance family to get the best performance/price ratio. Some simple rules of thumb: use memory-optimized instances for machine learning and for workloads with heavy shuffle and spill, compute-optimized instances for structured streaming and maintenance jobs, and general-purpose instances in the absence of specific requirements.

Choose the most efficient compute size

Databricks runs one executor per worker node. Therefore, the terms executor and worker are used interchangeably in the context of the Databricks architecture. People often think of cluster size in terms of the number of workers, but there are other important factors to consider, such as the total number of executor cores across the cluster (the maximum degree of parallelism) and the total executor memory available.

Worker instance type and size also influence these factors. When sizing your compute, consider the characteristics of your workload, for example whether it is CPU-intensive, memory-intensive, or I/O-intensive.

Details and examples can be found under Compute sizing considerations.

Evaluate performance-optimized query engines

Photon is a high-performance Databricks-native vectorized query engine that speeds up your SQL workloads and DataFrame API calls (for data ingestion, ETL, streaming, data science, and interactive queries). Photon is compatible with Apache Spark APIs, so getting started is as easy as turning it on: no code changes and no lock-in.

The observed speedup can lead to significant cost savings, and jobs that run regularly should be evaluated to see whether they are not only faster but also cheaper with Photon.
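As a brief sketch using the Databricks SDK for Python, the cluster below opts into Photon via the runtime engine setting; the runtime version and instance type are assumptions.

```python
# Sketch: enable Photon on a classic cluster via the runtime engine field.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import RuntimeEngine

w = WorkspaceClient()

w.clusters.create(
    cluster_name="photon-etl",
    spark_version="15.4.x-scala2.12",     # assumed DBR version
    node_type_id="m5d.xlarge",            # assumed instance type
    num_workers=4,
    runtime_engine=RuntimeEngine.PHOTON,  # no code changes required
)
```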

2. Dynamically allocate resources

Use auto-scaling compute

With autoscaling, Databricks dynamically reallocates workers to account for the characteristics of your job. Certain parts of your pipeline may be more computationally intensive than others, and Databricks automatically adds additional workers during those phases of your job (and removes them when they're no longer needed). Autoscaling can reduce overall costs compared to a statically sized compute instance.

Compute auto-scaling has limitations when scaling down cluster size for structured streaming workloads. Databricks recommends using Lakeflow Declarative Pipelines with enhanced autoscaling for streaming workloads.
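A minimal sketch using the Databricks SDK for Python: the cluster below scales between two and eight workers as load changes. The runtime version and instance type are assumptions.

```python
# Sketch: an autoscaling cluster that grows to 8 workers under load
# and shrinks back to 2 when demand drops.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import AutoScale

w = WorkspaceClient()

w.clusters.create(
    cluster_name="autoscaling-etl",
    spark_version="15.4.x-scala2.12",  # assumed DBR version
    node_type_id="m5d.xlarge",         # assumed instance type
    autoscale=AutoScale(min_workers=2, max_workers=8),
)
```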

Use auto termination

Databricks provides several features to help control costs by reducing idle resources and controlling when compute resources can be deployed. For example, auto termination shuts down a compute resource after a specified period of inactivity, and instance pools keep a set of ready-to-use cloud instances warm to reduce start times.

Databricks does not charge Databricks Units (DBUs) while instances are idle in a pool, resulting in cost savings. Instance provider billing does apply.
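The sketch below, again using the Databricks SDK for Python, combines both ideas: an instance pool whose idle instances self-terminate, and a cluster drawn from that pool with auto termination enabled. All names and sizes are illustrative.

```python
# Sketch: a pool whose idle instances terminate after 15 minutes,
# plus a cluster that auto-terminates after 30 idle minutes.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

pool = w.instance_pools.create(
    instance_pool_name="etl-pool",
    node_type_id="m5d.xlarge",                 # assumed instance type
    min_idle_instances=1,                      # kept warm for fast starts
    idle_instance_autotermination_minutes=15,  # cap idle VM spend
)

w.clusters.create(
    cluster_name="pooled-cluster",
    spark_version="15.4.x-scala2.12",          # assumed DBR version
    instance_pool_id=pool.instance_pool_id,
    num_workers=2,
    autotermination_minutes=30,                # shut down when idle
)
```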

Use compute policies to control costs

Compute policies can enforce many cost-specific restrictions for compute resources, for example capping the maximum number of workers, enforcing auto termination, and restricting the allowed instance types. See Operational Excellence - Use compute policies.
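As a hedged example, the policy below (created with the Databricks SDK for Python) caps cluster size, enforces auto termination, and limits node types to an assumed CPU-only allowlist, which also prevents unnecessary GPU usage.

```python
# Sketch: a compute policy with cost guardrails. The node type
# allowlist is an assumed CPU-only selection.
import json

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

policy = {
    # Cap autoscaling so clusters cannot grow unbounded.
    "autoscale.max_workers": {"type": "range", "maxValue": 10},
    # Require auto termination within an hour; default to 30 minutes.
    "autotermination_minutes": {
        "type": "range", "maxValue": 60, "defaultValue": 30,
    },
    # Only allow these (CPU-only) instance types.
    "node_type_id": {
        "type": "allowlist",
        "values": ["m5d.xlarge", "m5d.2xlarge"],
    },
}

w.cluster_policies.create(name="cost-guardrails", definition=json.dumps(policy))
```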

3. Monitor and control cost

Cost management in Databricks is a critical aspect of optimizing cloud spending while maintaining performance. The process can be broken down into three key areas: setting up cost attribution and budgets, monitoring costs against expectations, and managing costs to align usage with organizational needs.

The following best practices cover these three areas.

Set up tagging for cost attribution

To monitor costs in general and to accurately attribute Databricks usage to your organization's business units and teams (for example, for chargebacks in your organization), you can tag workspaces, clusters, SQL warehouses, and pools.

In the setup phase, organizations should implement effective tagging practices. This involves creating tag naming conventions across the organization. It is important to use both general tags that attribute usage to specific user groups and more granular tags that provide highly specific insights, for example based on roles, products, or services.

Start tagging from the very beginning of using Databricks. In addition to the default tags set by Databricks, as a minimum, set up the custom tags _Business Units_ and _Projects_ and populate them for your specific organization. If you need to differentiate between development, quality assurance, and production costs, consider adding the tag Environment to workspaces and compute resources.
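As an illustrative sketch with the Databricks SDK for Python, the cluster below carries the suggested tags; the tag values are placeholders for your organization's own conventions.

```python
# Sketch: custom tags applied at cluster creation. Tag keys follow
# the conventions suggested above; values are illustrative.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

w.clusters.create(
    cluster_name="marketing-etl",
    spark_version="15.4.x-scala2.12",  # assumed DBR version
    node_type_id="m5d.xlarge",         # assumed instance type
    num_workers=2,
    custom_tags={
        "BusinessUnit": "marketing",
        "Project": "attribution",
        "Environment": "dev",
    },
)
```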

The tags propagate both to usage logs and to cloud provider resources for cost analysis. Total costs include Databricks Units (DBUs) plus virtual machine, disk, and associated network costs. Note that for serverless services, the DBU cost already includes the virtual machine costs.

Since adding tags only affects future usage, it is better to start with a more detailed tagging structure. It is always possible to ignore tags if practical use over time shows that they have no impact on cost understanding and attribution. But missing tags can't be added to past events.

Set up budgets and alerts to enable monitoring of account spending

Budgets allow you to monitor usage across your entire account. They provide a way to set financial targets and allow you to track either account-wide spending or apply filters to track the spending of specific teams, projects or workspaces. If your account uses serverless compute, be sure to use budget policies to attribute your account's serverless usage. See Attribute serverless usage with budget policies.

It is recommended that you set up email notifications when the monthly budget is reached to avoid unexpected overspends.

Monitor costs to align spending with expectations

Cost observability dashboards help to visualize spending patterns, and budget policies help attribute serverless compute usage to specific users, groups, or projects, enabling more accurate cost allocation. To stay on top of spending, Databricks provides tools such as usage dashboards and the billable usage system tables to track and analyze costs.
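For example, here is a minimal sketch of a cost-attribution query against the billable usage system tables, executed through PySpark; the tag key is an assumption, and list prices reflect default rates rather than negotiated discounts.

```python
# Sketch: estimate this month's list-price cost per Project tag by
# joining billable usage with list prices. Tag key is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    SELECT u.custom_tags['Project']                   AS project,
           SUM(u.usage_quantity * lp.pricing.default) AS est_list_cost
    FROM system.billing.usage u
    JOIN system.billing.list_prices lp
      ON u.sku_name = lp.sku_name
     AND u.usage_start_time >= lp.price_start_time
     AND (lp.price_end_time IS NULL OR u.usage_start_time < lp.price_end_time)
    WHERE u.usage_date >= date_trunc('month', current_date())
    GROUP BY 1
    ORDER BY est_list_cost DESC
""").show()
```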

Manage costs to align usage with organizational needs

Cost management goes beyond technical implementation to include broader organizational strategies.

Overall, cost optimization needs to be seen as an ongoing process, and strategies need to be revisited regularly as workloads scale, new projects start, or unexpected cost spikes occur. Use both Databricks' native cost management capabilities and third-party tools for comprehensive control and optimization.

4. Design cost-effective workloads

Balance always-on and triggered streaming

Traditionally, when people think about streaming, terms such as “real-time”, “24/7,” or “always on” come to mind. If data ingestion happens in real-time, the underlying compute resources must run 24/7, incurring costs every single hour of the day.

However, not every use case that relies on a continuous stream of events requires those events to be immediately added to the analytics data set. If the business requirement for the use case only requires fresh data every few hours or every day, then that requirement can be met with only a few runs per day, resulting in a significant reduction in workload cost. Databricks recommends using Structured Streaming with the AvailableNow trigger for incremental workloads that do not have low latency requirements. See Configuring incremental batch processing.
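A minimal sketch of this pattern: the Structured Streaming job below uses Auto Loader with the AvailableNow trigger, so each run processes only the data that arrived since the previous run and then stops. The paths and table name are assumptions.

```python
# Sketch: incremental batch ingestion with the AvailableNow trigger.
# Compute only runs for the duration of each scheduled run.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

(spark.readStream
    .format("cloudFiles")                                   # Auto Loader
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/chk/events/schema")
    .load("/landing/events/")                               # assumed source path
    .writeStream
    .option("checkpointLocation", "/chk/events")
    .trigger(availableNow=True)    # process the backlog, then stop
    .toTable("main.analytics.events"))
```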

Balance between on-demand and capacity excess instances

Spot instances take advantage of excess virtual machine resources in the cloud that are available at a lower price. To save costs, Databricks supports creating clusters using spot instances. It is recommended that the first instance (which hosts the Spark driver) always be an on-demand virtual machine. Spot instances are a good choice for workloads that can tolerate longer runtimes when one or more spot instances are evicted by the cloud provider.

Also, consider using fleet instance types. When a cluster uses a fleet instance type, Databricks selects the matching physical AWS instance types with the best price and availability to use in your cluster.
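As a hedged sketch using the Databricks SDK for Python, the cluster below keeps the driver on demand while workers use spot capacity with fallback to on-demand; the instance type and runtime version are assumptions.

```python
# Sketch: on-demand driver, spot workers with on-demand fallback.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import AwsAttributes, AwsAvailability

w = WorkspaceClient()

w.clusters.create(
    cluster_name="spot-etl",
    spark_version="15.4.x-scala2.12",  # assumed DBR version
    node_type_id="m5d.xlarge",         # assumed instance type
    num_workers=8,
    aws_attributes=AwsAttributes(
        first_on_demand=1,  # keep the driver on an on-demand VM
        availability=AwsAvailability.SPOT_WITH_FALLBACK,
    ),
)
```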

