This article covers best practices supporting principles of cost optimization, organized by principle.
1. Choose optimal resources

Use performance optimized data formats

To get the most out of the Databricks Data Intelligence Platform, you must use Delta Lake as your storage framework. It helps build simpler and more reliable ETL pipelines, and comes with many performance enhancements that can significantly speed up workloads compared to using Parquet, ORC, and JSON. See Optimization recommendations on Databricks. If the workload also runs on job compute, this directly translates into shorter uptime of compute resources, leading to lower costs.
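To make this concrete, here is a minimal sketch of writing and reading a Delta table in a Databricks notebook (where spark is predefined; the three-level table name is a placeholder):

```python
# Minimal sketch: Delta is the default table format on Databricks, so saving a
# table like this produces a Delta table (catalog/schema names are placeholders).
df = spark.range(1000).withColumnRenamed("id", "event_id")
df.write.format("delta").mode("overwrite").saveAsTable("main.analytics.events")

# Reads benefit from Delta features such as data skipping and file compaction.
events = spark.table("main.analytics.events")
```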
Use job compute

A job is a way to run non-interactive code on a Databricks compute instance. For example, you can run an extract, transform, and load (ETL) workload interactively or on a schedule. Of course, you can also run jobs interactively in the notebook UI. However, on job compute, non-interactive workloads cost significantly less than on all-purpose compute. See the pricing overview to compare Jobs Compute and All-Purpose Compute.
An additional benefit for some jobs is that each job can run on a new compute instance, isolating workloads from each other. However, multitask jobs can also reuse compute resources for all tasks, so the compute startup time occurs only once per job. See Configure compute for jobs.
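To illustrate, a minimal Jobs API 2.1 payload that runs a notebook task on a dedicated job cluster could look like the following sketch; the job name, notebook path, runtime version, and instance values are placeholders:

```python
# Hypothetical job specification (submit with POST /api/2.1/jobs/create):
# the notebook task runs on a job cluster created for the run and terminated
# afterwards; tasks that reference the same job_cluster_key reuse the cluster,
# so startup time is paid only once per job.
job_spec = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/etl/ingest"},
            "job_cluster_key": "etl_cluster",
        }
    ],
    "job_clusters": [
        {
            "job_cluster_key": "etl_cluster",
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 4,
            },
        }
    ],
}
```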
Use SQL warehouse for SQL workloads

For interactive SQL workloads, a Databricks SQL warehouse is the most cost-efficient engine. See the pricing overview. All SQL warehouses come with Photon by default, which accelerates your existing SQL and DataFrame API calls and reduces your overall cost per workload.
In addition, serverless SQL warehouses support intelligent workload management (IWM), a set of features that enhances the ability of Databricks SQL Serverless to process large numbers of queries quickly and cost-effectively.
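As a sketch of the client side, the databricks-sql-connector package lets Python applications send queries to a SQL warehouse; the hostname, HTTP path, and token below are placeholders for your workspace values:

```python
# Sketch: query a SQL warehouse from Python using databricks-sql-connector.
from databricks import sql

with sql.connect(
    server_hostname="<workspace-host>",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
) as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT current_date()")
        print(cursor.fetchone())
```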
Use up-to-date runtimes for your workloads

The Databricks platform provides different runtimes that are optimized for data engineering tasks (Databricks Runtime) or machine learning tasks (Databricks Runtime for Machine Learning). The runtimes are built to provide the best selection of libraries for the tasks, and to ensure that all libraries provided are up-to-date and work together optimally. The Databricks Runtimes are released on a regular cadence, providing performance improvements between major releases. These performance improvements often result in cost savings due to more efficient use of compute resources.
Only use GPUs for the right workloads

Virtual machines with GPUs can dramatically speed up computations for deep learning, but are significantly more expensive than CPU-only machines. Use GPU instances only for workloads that have GPU-accelerated libraries.
Most workloads do not use GPU-accelerated libraries, so they do not benefit from GPU-enabled instances. Workspace administrators can restrict GPU machines and compute resources to prevent unnecessary usage. See the blog post "Are GPUs Really Expensive? Benchmarking GPUs for Inference on Databricks Clusters".
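One way to enforce this is a compute policy that allowlists CPU-only node types; the sketch below uses example AWS instance names, which you would replace with your own approved list:

```python
# Hypothetical compute policy definition: pin clusters to CPU-only node types so
# GPU instances cannot be selected (instance names are AWS examples).
policy_definition = {
    "node_type_id": {
        "type": "allowlist",
        "values": ["i3.xlarge", "i3.2xlarge", "m5d.2xlarge"],
    },
    "driver_node_type_id": {
        "type": "allowlist",
        "values": ["i3.xlarge", "m5d.2xlarge"],
    },
}
```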
Use serverless services for your workloads

BI use cases

BI workloads typically consume data in bursts and generate multiple concurrent queries. For example, someone using a BI tool might update a dashboard or write a query and then simply analyze the results without further interaction with the platform. In this scenario, the data platform:

- Must terminate compute resources during idle periods to save costs.
- Must be able to quickly provide compute resources when a user requests new or updated data with a dashboard or query.
Non-serverless Databricks SQL warehouses have a startup time of minutes, so many users tend to accept the higher cost and do not terminate them during idle periods. On the other hand, serverless SQL warehouses start and scale up in seconds, so both instant availability and idle termination can be achieved. This results in a great user experience and overall cost savings.
Additionally, serverless SQL warehouses scale down earlier than non-serverless warehouses, again, resulting in lower costs.
ML and AI model serving
Most models are served as a REST API for integration into web or client applications. The serving workload receives varying loads of requests over time, and a model serving platform should always provide sufficient resources, but only as many as are actually needed (scaling up and down).
Mosaic AI Model Serving uses serverless compute and provides a highly available and low latency service for deploying models. The service automatically scales up or down to meet changes in demand, reducing infrastructure costs while optimizing latency performance.
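A sketch of creating such an endpoint through the serving-endpoints REST API follows; the endpoint name, model name, and workspace host are placeholders, and scale_to_zero_enabled lets the endpoint release all resources when idle:

```python
# Sketch: create a serverless model serving endpoint that scales to zero.
import requests

payload = {
    "name": "churn-model-endpoint",          # placeholder endpoint name
    "config": {
        "served_entities": [
            {
                "entity_name": "main.models.churn_model",  # placeholder model
                "entity_version": "1",
                "workload_size": "Small",
                "scale_to_zero_enabled": True,  # no cost while idle
            }
        ]
    },
}
resp = requests.post(
    "https://<workspace-host>/api/2.0/serving-endpoints",
    headers={"Authorization": "Bearer <token>"},
    json=payload,
)
resp.raise_for_status()
```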
Use the right instance type

Using the latest generation of cloud instance types almost always provides performance benefits, as these instances offer the latest hardware and features.
For example, Graviton2-based Amazon EC2 instances can deliver significantly better price-performance than comparable Amazon EC2 instances.
Based on your workloads, it is also important to choose the right instance family to get the best performance/price ratio. Some simple rules of thumb are:

- Memory-optimized instances for machine learning and other memory-intensive workloads
- Compute-optimized instances for streaming workloads and other workloads with high processing demands
- Storage-optimized instances for workloads that benefit from disk caching
- GPU-optimized instances for workloads that use GPU-accelerated libraries
- General-purpose instances in the absence of specific requirements
Databricks runs one executor per worker node. Therefore, the terms executor and worker are used interchangeably in the context of the Databricks architecture. People often think of cluster size in terms of the number of workers, but there are other important factors to consider:

- Total executor cores (compute): the total number of cores across all executors, which determines the maximum parallelism of a compute resource.
- Total executor memory: the total amount of RAM across all executors, which determines how much data can be held in memory before spilling to disk.
- Executor local storage: the type and amount of local disk storage, used mainly for spills during shuffles and for caching.
Additional considerations include worker instance type and size, which also influence the preceding factors. When sizing your compute, consider the following:

- How much data will your workload consume?
- What is the computational complexity of your workload?
- Where are you reading data from?
- How is the data partitioned in external storage?
- How much parallelism do you need?
Details and examples can be found under Compute sizing considerations.
Evaluate performance-optimized query engines

Photon is a high-performance Databricks-native vectorized query engine that speeds up your SQL workloads and DataFrame API calls (for data ingestion, ETL, streaming, data science, and interactive queries). Photon is compatible with Apache Spark APIs, so getting started is as easy as turning it on: no code changes and no lock-in.
The observed speedup can lead to significant cost savings, and jobs that run regularly should be evaluated to see whether they are not only faster but also cheaper with Photon.
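Enabling Photon on a cluster is a single field in the cluster specification, as in this sketch (the other values are illustrative):

```python
# Sketch: cluster specification with Photon enabled via the Clusters API
# (clusters/create); the workload itself needs no code changes.
cluster_spec = {
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 4,
    "runtime_engine": "PHOTON",
}
```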
2. Dynamically allocate resources

Use auto-scaling compute

With autoscaling, Databricks dynamically reallocates workers to account for the characteristics of your job. Certain parts of your pipeline may be more computationally intensive than others, and Databricks automatically adds additional workers during those phases of your job (and removes them when they're no longer needed). Autoscaling can reduce overall costs compared to a statically sized compute instance.
Compute auto-scaling has limitations when scaling down cluster size for structured streaming workloads. Databricks recommends using Lakeflow Declarative Pipelines with enhanced autoscaling for streaming workloads.
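A minimal autoscaling cluster specification might look like this sketch (bounds and instance values are illustrative):

```python
# Sketch: instead of a fixed num_workers, give the cluster an autoscale range;
# Databricks adds and removes workers between these bounds as load changes.
cluster_spec = {
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "autoscale": {"min_workers": 2, "max_workers": 10},
}
```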
Use auto termination

Databricks provides several features to help control costs by reducing idle resources and controlling when compute resources can be deployed.

- Configure auto termination for all interactive compute resources: after a specified idle time, the compute resource shuts down.
- Use compute pools for workloads that need fast startup: pools keep a set of idle, ready-to-use instances that reduce cluster start and scale-up times. Databricks does not charge Databricks Units (DBUs) while instances are idle in the pool, resulting in cost savings. Instance provider billing does apply. Both settings appear in the sketch below.
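Here is a combined sketch, assuming a pre-created instance pool (the pool ID is a placeholder):

```python
# Sketch: the cluster shuts down after 30 idle minutes and draws its instances
# from a pool, so restarts are fast; node_type_id is omitted because the pool
# determines the instance type.
cluster_spec = {
    "spark_version": "15.4.x-scala2.12",
    "instance_pool_id": "<pool-id>",
    "num_workers": 4,
    "autotermination_minutes": 30,
}
```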
Use compute policies to control costs

Compute policies can enforce many cost-specific restrictions for compute resources. See Operational Excellence - Use compute policies. For example:

- Enable cluster autoscaling with a set minimum number of workers and a reasonable maximum number of workers.
- Enforce auto termination with a reasonable idle timeout.
- Restrict clusters to cost-efficient instance types.
- Cap the DBUs per hour that a cluster can consume.
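A minimal policy definition along these lines might look like the following sketch (the limits are illustrative):

```python
# Hypothetical compute policy definition with cost guardrails: bounded
# autoscaling, a mandatory hidden auto-termination timeout, and a hard
# ceiling on hourly DBU consumption.
policy_definition = {
    "autoscale.max_workers": {"type": "range", "maxValue": 10},
    "autotermination_minutes": {"type": "fixed", "value": 30, "hidden": True},
    "dbus_per_hour": {"type": "range", "maxValue": 50},
}
```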
3. Monitor and control cost

Cost management in Databricks is a critical aspect of optimizing cloud spending while maintaining performance. The process can be broken down into three key areas:

- Setting up tagging for accurate cost attribution.
- Setting up budgets and alerts to enable monitoring of account spending.
- Monitoring costs to align spending with expectations.
The following best practices cover these three areas.
Set up tagging for cost attribution

To monitor costs in general and to accurately attribute Databricks usage to your organization's business units and teams (for example, for chargebacks in your organization), you can tag workspaces, clusters, SQL warehouses, and pools.
In the setup phase, organizations should implement effective tagging practices. This involves creating tag naming conventions across the organization. It is important to use both general tags that attribute usage to specific user groups and more granular tags that provide highly specific insights, for example based on roles, products, or services.
Start tagging from the very beginning of using Databricks. In addition to the default tags set by Databricks, as a minimum, set up the custom tags Business Units and Projects and populate them for your specific organization. If you need to differentiate between development, quality assurance, and production costs, consider adding the tag Environment to workspaces and compute resources.
The tags propagate both to usage logs and to cloud provider resources for cost analysis. Total costs include Databricks Units (DBUs) plus virtual machine, disk, and associated network costs. Note that for serverless services, the DBU cost already includes the virtual machine costs.
Since adding tags only affects future usage, it is better to start with a more detailed tagging structure. It is always possible to ignore tags if practical use over time shows that they have no impact on cost understanding and attribution. But missing tags can't be added to past events.
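As a sketch, custom tags are set in the cluster specification and follow your own naming conventions (the keys and values below are examples):

```python
# Sketch: custom tags on a cluster propagate to usage logs and to the
# underlying cloud resources for cost analysis.
cluster_spec = {
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 4,
    "custom_tags": {
        "BusinessUnit": "marketing",
        "Project": "campaign-analytics",
        "Environment": "prod",
    },
}
```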
Set up budgets and alerts to enable monitoring of account spending

Budgets allow you to monitor usage across your entire account. They provide a way to set financial targets and allow you to track either account-wide spending or apply filters to track the spending of specific teams, projects, or workspaces. If your account uses serverless compute, be sure to use budget policies to attribute your account's serverless usage. See Attribute serverless usage with budget policies.
It is recommended that you set up email notifications when the monthly budget is reached to avoid unexpected overspends.
Monitor costs to align spending with expectations

Cost observability dashboards help to visualize spending patterns, and budget policies help attribute serverless compute usage to specific users, groups, or projects, enabling more accurate cost allocation. To stay on top of spending, Databricks offers a range of tools and features to track and analyze costs:
Monitor usage in the account console: Databricks offers cost management AI/BI dashboards in the account console, which account admins can import into any Unity Catalog-enabled workspace in their account. This allows you to monitor either account-wide usage or the usage of a single workspace.
Use budgets to monitor account spending: Budgets enable you to monitor usage across your account.
Budget policies can be used to attribute serverless usage by applying tags to any serverless compute activity incurred by a user assigned to the policy.
Monitor and manage Delta Sharing egress costs: Unlike other data sharing platforms, Delta Sharing does not require data replication. This model has many advantages, but it means that your cloud vendor may charge data egress fees when you share data across clouds or regions. See Monitor and manage Delta Sharing egress costs (for providers) to monitor and manage egress charges.
Monitor costs using system tables: The system table system.billing.usage allows you to monitor costs. Custom tags applied to workspaces and compute resources are propagated to this system table. You can monitor the costs of serverless compute, jobs, and model serving.
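For example, here is a sketch of aggregating DBU consumption by a Project tag over the last 30 days, run in a notebook; the column names follow the current system table schema, so verify them in your workspace:

```python
# Sketch: daily DBU usage per project, read from the billing system table.
usage_by_project = spark.sql("""
    SELECT
        usage_date,
        custom_tags['Project'] AS project,
        SUM(usage_quantity)    AS dbus
    FROM system.billing.usage
    WHERE usage_date >= current_date() - INTERVAL 30 DAYS
    GROUP BY usage_date, custom_tags['Project']
    ORDER BY usage_date
""")
display(usage_by_project)  # display() is available in Databricks notebooks
```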
Cost management goes beyond technical implementation to include broader organizational strategies. Overall, cost optimization needs to be seen as an ongoing process, and strategies need to be revisited regularly in the event of scaling, new projects, or unexpected cost spikes. Use both Databricks' native cost management capabilities and third-party tools for comprehensive control and optimization.
4. Design cost-effective workloads

Balance always-on and triggered streaming

Traditionally, when people think about streaming, terms such as "real-time", "24/7", or "always on" come to mind. If data ingestion happens in real time, the underlying compute resources must run 24/7, incurring costs every single hour of the day.
However, not every use case that relies on a continuous stream of events requires those events to be immediately added to the analytics data set. If the business requirement for the use case only requires fresh data every few hours or every day, then that requirement can be met with only a few runs per day, resulting in a significant reduction in workload cost. Databricks recommends using Structured Streaming with the AvailableNow trigger for incremental workloads that do not have low latency requirements. See Configuring incremental batch processing.
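A sketch of such a triggered pipeline using Auto Loader follows; the bucket paths and target table name are placeholders:

```python
# Sketch: process everything that arrived since the last run, then stop, so
# compute only runs for scheduled batches; the checkpoint tracks progress
# between runs.
(spark.readStream
    .format("cloudFiles")  # Auto Loader for incremental file ingestion
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "s3://<bucket>/schema/events")
    .load("s3://<bucket>/raw/events")
    .writeStream
    .option("checkpointLocation", "s3://<bucket>/chk/events")
    .trigger(availableNow=True)
    .toTable("main.analytics.events_bronze"))
```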
Use spot instances

Spot instances take advantage of excess virtual machine resources in the cloud that are available at a lower price. To save costs, Databricks supports creating clusters using spot instances. It is recommended that the first instance (the Spark driver) always be an on-demand virtual machine. Spot instances are a good choice for workloads where it is acceptable for them to take longer because one or more spot instances have been evicted by the cloud provider.
Also, consider using fleet instance types. If a cluster uses a fleet instance type, Databricks selects the matching physical AWS instance types with the best price and availability to use in your cluster.
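Here is a sketch of a spot-backed cluster specification on AWS (the instance values are illustrative):

```python
# Sketch: keep the driver on demand, run workers on spot with on-demand
# fallback if spot capacity is unavailable.
cluster_spec = {
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 8,
    "aws_attributes": {
        "first_on_demand": 1,                 # the driver stays on demand
        "availability": "SPOT_WITH_FALLBACK",
        "spot_bid_price_percent": 100,
    },
}
```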