Best practices for reliability

This article covers best practices for reliability organized by architectural principles listed in the following sections.

1. Design for failure

Use a data format that supports ACID transactions

ACID transactions are a critical feature for maintaining data integrity and consistency. Choosing a data format that supports ACID transactions helps build pipelines that are simpler and much more reliable.

Delta Lake is an open source storage framework that provides ACID transactions, schema enforcement, and scalable metadata handling, and that unifies streaming and batch data processing. Delta Lake is fully compatible with the Apache Spark APIs and is designed for tight integration with Structured Streaming, allowing you to use a single copy of data for both batch and streaming operations and to process data incrementally at scale.
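As a minimal sketch (paths and table names are illustrative), the same Delta table can back a batch write and a streaming append, so one copy of the data serves both modes:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Batch write: the Delta format adds ACID guarantees and schema enforcement.
# Paths and table names are illustrative.
(spark.read.json("/data/raw/events")
    .write.format("delta")
    .mode("overwrite")
    .saveAsTable("events_bronze"))

# Streaming append to the same table: a single copy of the data serves
# both batch and streaming consumers.
(spark.readStream.table("events_updates")
    .writeStream.format("delta")
    .option("checkpointLocation", "/chk/events_bronze")
    .toTable("events_bronze"))
```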

Use a resilient distributed data engine for all workloads​

Apache Spark, as the compute engine of the Databricks lakehouse, is based on resilient distributed data processing. If an internal Spark task does not return a result as expected, Apache Spark automatically reschedules the missing tasks and continues to execute the entire job. This is particularly helpful for failures that originate outside the code, such as a brief network problem or a revoked spot VM. Whether you work through the SQL API or the Spark DataFrame API, this resiliency is built into the engine.

In the Databricks lakehouse, Photon, a native vectorized engine written entirely in C++, provides high-performance compute that is compatible with the Apache Spark APIs.

Automatically rescue invalid or nonconforming data​

Invalid or nonconforming data can cause workloads that rely on an established data format to crash. To increase the end-to-end resiliency of the entire process, it is best practice to filter out invalid and nonconforming data at ingest. Support for rescued data ensures that you never lose or miss data during ingest or ETL. The rescued data column contains any data that was not parsed, either because it was missing from the given schema, because there was a type mismatch, or because the column body in the record or file did not match that in the schema.
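As a minimal sketch (paths are illustrative; `_rescued_data` is the default column name), Auto Loader can capture nonconforming fields in the rescued data column instead of failing the ingest:

```python
# Illustrative paths; schema inference with a schema location enables the
# rescued data column, which collects fields that do not match the schema.
raw = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/chk/schemas/orders")
    .load("/data/raw/orders"))

# Route rows with rescued content to a quarantine table for later inspection.
quarantined = raw.filter("_rescued_data IS NOT NULL")
```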

Configure jobs for automatic retries and termination​

Distributed systems are complex, and a failure at one point can potentially cascade throughout the system. Lakeflow Jobs support automatic retry policies for tasks, so transient failures do not fail the entire job.

On the other hand, a hanging task can prevent the entire job from completing, resulting in high costs. Lakeflow Jobs support timeout configuration to kill jobs that take longer than expected.
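A minimal sketch using the Databricks SDK for Python (the job name, notebook path, and values are illustrative; the fields mirror the Jobs API):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()
w.jobs.create(
    name="nightly-etl",                    # illustrative job name
    timeout_seconds=3 * 60 * 60,           # terminate runs that exceed 3 hours
    tasks=[
        jobs.Task(
            task_key="ingest",
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/etl/ingest"),
            max_retries=2,                 # retry the task on failure
            min_retry_interval_millis=60_000,
        )
    ],
)
```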

Use scalable and production-grade model serving infrastructure​

For batch and streaming inference, use Lakeflow Jobs and MLflow to deploy models as Apache Spark UDFs to leverage job scheduling, retries, autoscaling, and so on. See Deploy models for batch inference and prediction.
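For example, a registered model can be loaded as a Spark UDF for batch scoring (a minimal sketch; the model URI, table, and feature columns are illustrative):

```python
import mlflow.pyfunc

# Load a registered model version as a Spark UDF (illustrative model URI).
predict = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn_model/3",
                                  result_type="double")

# Score a feature table in batch; the job handles scheduling and retries.
scored = (spark.table("features.customers")
    .withColumn("prediction", predict("tenure", "monthly_spend")))
```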

Model Serving provides scalable, production-grade infrastructure for real-time model serving. It deploys machine learning models managed with MLflow and exposes them as REST API endpoints. This functionality uses serverless compute, which means that the endpoints and associated compute resources are managed and run in the Databricks cloud account.
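A minimal sketch of querying such an endpoint over REST (the workspace URL, endpoint name, token handling, and input fields are illustrative):

```python
import requests

response = requests.post(
    "https://<workspace-url>/serving-endpoints/churn-model/invocations",
    headers={"Authorization": "Bearer <access-token>"},
    json={"dataframe_records": [{"tenure": 12, "monthly_spend": 79.5}]},
)
print(response.json())
```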

Use managed services where possible​

Leverage managed services (serverless compute) from the Databricks Data Intelligence Platform, such as serverless SQL warehouses, Model Serving, serverless jobs, and serverless compute for notebooks.

These services are operated by Databricks in a reliable and scalable manner, making workloads more reliable.

2. Manage data quality

Use a layered storage architecture

Curate data by creating a layered architecture and ensuring that data quality increases as data moves through the layers. A common layering approach is a raw (bronze) layer for ingested source data, a curated (silver) layer for cleansed and validated data, and a final (gold) layer for aggregated, business-ready data.

The final layer should contain only high quality data and be fully trusted from a business perspective.

Improve data integrity by reducing data redundancy​

Copying or duplicating data creates data redundancy and leads to loss of integrity, loss of data lineage, and often different access permissions. This reduces the quality of the data in the lakehouse.

A temporary or disposable copy of data is not harmful in itself - it is sometimes necessary to increase agility, experimentation, and innovation. However, when these copies become operational and are regularly used to make business decisions, they become data silos. When these data silos become out of sync, it has a significant negative impact on data integrity and quality, raising questions such as “Which data set is the master?” or “Is the data set current?”

Actively manage schemas​

Uncontrolled schema changes can lead to invalid data and failing jobs that consume these data sets. Databricks provides several methods to validate and enforce the schema: Delta Lake rejects writes that do not match the target table's schema (schema enforcement) and supports explicit, opt-in schema evolution, and Auto Loader can detect new columns during ingest and either evolve the schema or rescue the nonconforming data.
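A minimal sketch of Delta schema enforcement and opt-in evolution (table names are illustrative):

```python
df_new = spark.table("staging.orders_new")  # illustrative source table

# Schema enforcement: this append fails if df_new does not match the target
# table's schema, instead of silently landing mismatched data.
df_new.write.format("delta").mode("append").saveAsTable("orders")

# Deliberate, additive schema changes can be opted into with mergeSchema.
(df_new.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("orders"))
```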

Use constraints and data expectations​

Delta tables support standard SQL constraint management clauses that ensure that the quality and integrity of data added to a table is automatically checked. When a constraint is violated, Delta Lake throws an InvariantViolationException error to signal that the new data can't be added. See Constraints on Databricks.
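For example, a CHECK constraint can be added so that invalid rows are rejected at write time (a minimal sketch; the table and column names are illustrative):

```python
# Reject negative amounts on every future write to the table.
spark.sql("ALTER TABLE orders ADD CONSTRAINT valid_amount CHECK (amount >= 0)")

# A later write containing amount < 0 now fails with a constraint violation
# instead of landing bad data in the table.
```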

To further improve this handling, Lakeflow Declarative Pipelines supports expectations: Expectations define data quality constraints on the contents of a data set. An expectation consists of a description, an invariant, and an action to be taken if a record violates the invariant. Expectations on queries use Python decorators or SQL constraint clauses. See Manage data quality with pipeline expectations.
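A minimal sketch of a pipeline expectation using the Python decorator API (the dataset, expectation name, and columns are illustrative):

```python
import dlt

@dlt.table
@dlt.expect_or_drop("valid_order", "amount >= 0 AND customer_id IS NOT NULL")
def orders_silver():
    # Rows that violate the expectation are dropped and recorded in pipeline
    # metrics; expect() would keep them, expect_or_fail() would stop the run.
    return dlt.read_stream("orders_bronze")
```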

Take a data-centric approach to machine learning​

A guiding principle that remains at the core of the AI vision for the Databricks Data Intelligence Platform is a data-centric approach to machine learning. As generative AI becomes more prevalent, this perspective remains just as important.

The core components of any ML project can simply be thought of as data pipelines: Feature engineering, training, model deployment, inference, and monitoring pipelines are all data pipelines. As such, operationalizing an ML solution requires merging data from prediction, monitoring, and feature tables with other relevant data. Fundamentally, the easiest way to achieve this is to develop AI-powered solutions on the same platform used to manage production data. See Data-centric MLOps and LLMOps.

3. Design for autoscaling

Enable autoscaling for ETL workloads

Autoscaling allows clusters to automatically resize based on workloads. Autoscaling can benefit many use cases and scenarios from both a cost and performance perspective. The documentation provides considerations for determining whether to use autoscaling and how to get the most benefit.
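A minimal sketch of an autoscaling job cluster specification via the Databricks SDK for Python (the runtime version, node type, and worker bounds are illustrative):

```python
from databricks.sdk.service import compute

etl_cluster = compute.ClusterSpec(
    spark_version="15.4.x-scala2.12",      # illustrative runtime version
    node_type_id="i3.xlarge",              # illustrative node type
    autoscale=compute.AutoScale(min_workers=2, max_workers=8),
)
```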

For streaming workloads, Databricks recommends using Lakeflow Declarative Pipelines with autoscaling. Databricks enhanced autoscaling optimizes cluster utilization by automatically allocating cluster resources based on workload volume, with minimal impact on the data processing latency of your pipelines.

Enable autoscaling for SQL warehouse​

The scaling parameter of a SQL warehouse sets the minimum and the maximum number of clusters over which queries sent to the warehouse are distributed. The default is one cluster without autoscaling.

To handle more concurrent users for a given warehouse, increase the number of clusters. To learn how Databricks adds clusters to and removes clusters from a warehouse, see SQL warehouse sizing, scaling, and queuing behavior.
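A minimal sketch via the Databricks SDK for Python (the warehouse name, size, and cluster bounds are illustrative):

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
w.warehouses.create(
    name="analytics-wh",          # illustrative name
    cluster_size="Small",
    min_num_clusters=1,           # scale-out bounds for concurrent queries
    max_num_clusters=4,
    auto_stop_mins=15,
)
```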

4. Test recovery procedures

Recover from Structured Streaming query failures

Structured Streaming provides fault tolerance and data consistency for streaming queries. Using Lakeflow Jobs, you can easily configure your Structured Streaming queries to automatically restart on failure. By enabling checkpointing for a streaming query, you can restart the query after a failure. The restarted query continues from where the failed query left off. See Structured Streaming checkpoints and Production considerations for Structured Streaming.
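A minimal sketch of a checkpointed streaming write, so a restarted query resumes from where the failed run left off (the table names and checkpoint path are illustrative):

```python
# The checkpoint location records progress; on restart, the query continues
# from the last committed offset instead of reprocessing everything.
(spark.readStream.table("events_bronze")
    .writeStream
    .option("checkpointLocation", "/chk/events_silver")
    .toTable("events_silver"))
```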

Recover ETL jobs using data time travel capabilities​

Despite thorough testing, a job can fail in production or produce unexpected, even invalid, data. Sometimes this can be fixed with an additional job after the source of the problem is understood and the pipeline that caused it is repaired. However, often this is not straightforward, and the job in question should be rolled back. Using Delta time travel, users can easily roll back changes to an older version or timestamp, repair the pipeline, and restart it.

A convenient way to do so is the RESTORE command.
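For example (a minimal sketch; the table name, version, and timestamp are illustrative):

```python
# Roll the table back to a known-good version...
spark.sql("RESTORE TABLE orders TO VERSION AS OF 42")

# ...or to a point in time.
spark.sql("RESTORE TABLE orders TO TIMESTAMP AS OF '2024-06-01 00:00:00'")
```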

Leverage a job automation framework with built-in recovery​

Lakeflow Jobs are designed for recovery. When a task in a multi-task job fails (and, with it, all dependent tasks), jobs provide a matrix view of the runs that lets you investigate the problem that caused the failure; see View runs for a single job. Whether it was a brief network issue or a real issue in the data, you can fix it and start a repair run in Lakeflow Jobs. The repair run executes only the failed and dependent tasks and keeps the successful results from the earlier run, saving time and money; see Troubleshoot and repair job failures.
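A minimal sketch of triggering a repair run via the Databricks SDK for Python (the run ID is illustrative):

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
# Re-run only the failed tasks (and their dependents) of an existing job run.
w.jobs.repair_run(run_id=123456, rerun_all_failed_tasks=True)
```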

Configure a disaster recovery pattern​

For a cloud-native data analytics platform like Databricks, a clear disaster recovery pattern is critical. Your data teams must be able to use the Databricks platform even in the rare event of a regional, service-wide outage of a cloud service provider, whether caused by a hurricane, an earthquake, or some other event.

Databricks is often a core part of an overall data ecosystem that includes many services, including upstream data ingestion services (batch/streaming), cloud-native storage such as Amazon S3, downstream tools and services such as business intelligence apps, and orchestration tooling. Some of your use cases may be particularly sensitive to a regional service-wide outage.

Disaster recovery involves a set of policies, tools, and procedures that enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster. A large cloud service such as AWS serves many customers and has built-in protections against a single failure. For example, a region is a group of buildings connected to different power sources to ensure that a single power outage will not bring down a region. However, cloud region failures can occur, and the severity of the failure and its impact on your business can vary.

5. Automate deployments and workloads​

See Operational Excellence - Automate deployments and workloads.

6. Monitor systems and workloads​

See Operational Excellence - Set up monitoring, alerting, and logging.

