Best practices and recommended CI/CD workflows on Databricks

CI/CD (Continuous Integration and Continuous Delivery) has become a cornerstone of modern data engineering and analytics, as it ensures that code changes are integrated, tested, and deployed rapidly and reliably. Databricks recognizes that you may have diverse CI/CD requirements shaped by your organizational preferences, existing workflows, and specific technology environment, and provides a flexible framework that supports various CI/CD options.

This page describes best practices to help you design and build robust, customized CI/CD pipelines that align with your unique needs and constraints. By leveraging these insights, you can accelerate your data engineering and analytics initiatives, improve code quality, and reduce the risk of deployment failures.

Core principles of CI/CD

Effective CI/CD pipelines share foundational principles regardless of implementation specifics. These universal best practices apply across organizational preferences, developer workflows, and cloud environments, and they ensure consistency across diverse implementations, whether your team prioritizes notebook-first development or infrastructure-as-code workflows. Adopt these principles as guardrails while tailoring the specifics to your organization's technology stack and processes.

note

Databricks recommends workload identity federation for CI/CD authentication. Workload identity federation eliminates the need for Databricks secrets, which makes it the most secure way to authenticate your automated flows to Databricks. See Enable workload identity federation in CI/CD.
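
The following is a minimal, hedged sketch of what this can look like in a GitHub Actions job: the job requests an OIDC token and the Databricks CLI exchanges it for workspace access, so no long-lived secret is stored. The workspace URL, client ID, and auth type value are placeholders and assumptions; confirm the exact configuration on the linked page.

# Sketch only: workload identity federation from GitHub Actions (values are placeholders).
name: deploy-with-wif
on:
  push:
    branches: [main]

permissions:
  id-token: write        # allow the job to request a GitHub OIDC token
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      DATABRICKS_HOST: https://my-workspace.cloud.databricks.com   # placeholder workspace URL
      DATABRICKS_CLIENT_ID: <service-principal-application-id>     # placeholder client ID
      DATABRICKS_AUTH_TYPE: github-oidc                             # assumed auth type; verify in the linked docs
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - run: databricks current-user me    # quick check that federated authentication works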

Databricks Asset Bundles for CI/CD

Databricks Asset Bundles offer a powerful, unified approach to managing code, workflows, and infrastructure within the Databricks ecosystem and are recommended for your CI/CD pipelines. By bundling these elements into a single YAML-defined unit, bundles simplify deployment and ensure consistency across environments. However, for users accustomed to traditional CI/CD workflows, adopting bundles may require a shift in mindset.
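
As a rough illustration, a minimal databricks.yml declares the bundle, the resource files to include, and the deployment targets in one place. The project name and workspace URLs below are placeholders:

# Minimal databricks.yml sketch (names and workspace URLs are illustrative).
bundle:
  name: my_project

include:
  - resources/*.yml

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://my-dev-workspace.cloud.databricks.com
  prod:
    mode: production
    workspace:
      host: https://my-prod-workspace.cloud.databricks.com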

For example, Java developers are used to building JARs with Maven or Gradle, running unit tests with JUnit, and integrating these steps into CI/CD pipelines. Similarly, Python developers often package code into wheels and test with pytest, while SQL developers focus on query validation and notebook management. With bundles, these workflows converge into a more structured and prescriptive format, emphasizing bundling code and infrastructure for seamless deployment.

The following sections explore how developers can adapt their workflows to leverage bundles effectively.

To quickly get started with Databricks Asset Bundles, try a tutorial: Develop a job with Databricks Asset Bundles or Develop Lakeflow Declarative Pipelines with Databricks Asset Bundles.

CI/CD source control recommendations

The first choice developers need to make when implementing CI/CD is how to store and version source files. Bundles enable you to keep everything, including source code, build artifacts, and configuration files, in the same source code repository, but another option is to separate the bundle configuration files from code-related files. The choice depends on your team's workflow, project complexity, and CI/CD requirements, but Databricks recommends the following:

Whether you choose to co-locate or separate your code-related files and bundle configuration files, always use versioned artifacts, for example by embedding the Git commit hash in artifact names, when uploading to Databricks or external storage, to ensure traceability and rollback capabilities.

Single repository for code and configuration

In this approach, both the source code and bundle configuration files are stored in the same repository. This simplifies workflows and ensures atomic changes.

Example: Python code in a bundle

This example has Python files and bundle files in one repository:

databricks-dab-repo/
├── databricks.yml                     # Bundle definition
├── resources/
│   ├── workflows/
│   │   ├── my_pipeline.yml            # YAML pipeline definition
│   │   └── my_pipeline_job.yml        # YAML definition of the job that runs the pipeline
│   └── clusters/
│       ├── dev_cluster.yml            # Development cluster definition
│       └── prod_cluster.yml           # Production cluster definition
├── src/
│   ├── dlt_pipeline.ipynb             # Pipeline notebook
│   └── mypython.py                    # Additional Python source
└── README.md
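
As a sketch of how the bundle ties these files together, the two YAML files under resources/workflows/ might look roughly like the following; the catalog, schema, and resource names are illustrative assumptions:

# resources/workflows/my_pipeline.yml (illustrative sketch)
resources:
  pipelines:
    my_pipeline:
      name: my_pipeline
      catalog: main                    # assumed Unity Catalog catalog
      target: my_schema                # assumed target schema
      libraries:
        - notebook:
            path: ../../src/dlt_pipeline.ipynb

# resources/workflows/my_pipeline_job.yml (illustrative sketch)
resources:
  jobs:
    my_pipeline_job:
      name: my_pipeline_job
      tasks:
        - task_key: run_pipeline
          pipeline_task:
            pipeline_id: ${resources.pipelines.my_pipeline.id}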

Separate repositories for code and configuration

In this approach, the source code resides in one repository, while the bundle configuration files are maintained in another. This option is ideal for larger teams or projects where separate groups handle application development and Databricks workflow management.

Example: Java project and bundle

In this example, a Java project and its files are in one repository and the bundle files are in another repository.

Repository 1: Java files

The first repository contains all Java-related files:

java-app-repo/
├── pom.xml                            # Maven build configuration
├── src/
│   ├── main/
│   │   ├── java/                      # Java source code
│   │   │   └── com/
│   │   │       └── mycompany/
│   │   │           └── app/
│   │   │               └── App.java
│   │   └── resources/                 # Application resources
│   └── test/
│       ├── java/                      # Unit tests for Java code
│       │   └── com/
│       │       └── mycompany/
│       │           └── app/
│       │               └── AppTest.java
│       └── resources/                 # Test-specific resources
├── target/                            # Compiled JARs and classes
└── README.md

Repository 2: Bundle files

A second repository contains only the bundle configuration files:

databricks-dab-repo/
├── databricks.yml                     # Bundle definition
├── resources/
│   ├── jobs/
│   │   ├── my_java_job.yml            # YAML job definition
│   │   └── my_other_job.yml           # Additional job definitions
│   └── clusters/
│       ├── dev_cluster.yml            # Development cluster definition
│       └── prod_cluster.yml           # Production cluster definition
└── README.md
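
In this layout, resources/jobs/my_java_job.yml might reference the versioned JAR that the code repository's pipeline uploaded to a Unity Catalog volume. The volume path, main class, and cluster settings below are illustrative; the cluster settings could equally come from the definitions under clusters/:

# resources/jobs/my_java_job.yml (illustrative sketch; paths and names are placeholders)
resources:
  jobs:
    my_java_job:
      name: my_java_job
      tasks:
        - task_key: run_app
          spark_jar_task:
            main_class_name: com.mycompany.app.App
          libraries:
            - jar: /Volumes/main/artifacts/jars/app-1.0.0-abc1234.jar   # JAR versioned with the commit hash
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: i3.xlarge
            num_workers: 2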

Recommended CI/CD workflow

Regardless of whether you co-locate or separate your code files and bundle configuration files, the recommended workflow is the following (an illustrative CI pipeline sketch follows the steps):

  1. Compile and test the code.

  2. Upload the compiled artifact, such as a JAR, to a Databricks Unity Catalog volume.

  3. Validate the bundle.

  4. Deploy the bundle.
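
The following is a hedged GitHub Actions sketch of these four steps. The Maven build, volume path, bundle target, and authentication secrets are illustrative assumptions; workload identity federation, as recommended above, can replace the token-based secrets:

# Illustrative CI workflow for the four steps above (names and paths are placeholders).
name: build-and-deploy-bundle
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}    # placeholder; prefer workload identity federation
      DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}  # placeholder secret
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main

      # 1. Compile and test the code
      - run: mvn --batch-mode verify

      # 2. Upload the versioned JAR to a Unity Catalog volume (path is illustrative)
      - run: databricks fs cp target/app-1.0.0.jar dbfs:/Volumes/main/artifacts/jars/app-1.0.0-${GITHUB_SHA::7}.jar

      # 3. Validate the bundle
      - run: databricks bundle validate -t prod

      # 4. Deploy the bundle to the production target
      - run: databricks bundle deploy -t prod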

CI/CD for machine learning

Machine learning projects introduce unique CI/CD challenges compared to traditional software development. When implementing CI/CD for ML projects, you will likely need to account for concerns beyond application code, such as how models and experiments are versioned, validated, and promoted across environments alongside it.

MLOps Stacks for ML CI/CD

Databricks addresses ML CI/CD complexity through MLOps Stacks, a production-grade framework that combines Databricks Asset Bundles, preconfigured CI/CD workflows, and modular ML project templates. These stacks enforce best practices while allowing flexibility for multi-team collaboration across data engineering, data science, and MLOps roles.

In a typical setup, data engineers, data scientists, and ML engineers collaborate in the same bundle-backed project, each owning the pipelines, training code, or deployment workflows relevant to their role.

For implementation details, see Databricks Asset Bundles for MLOps Stacks.

By aligning teams with standardized bundles and MLOps Stacks, organizations can streamline collaboration while maintaining auditability across the ML lifecycle.

CI/CD for SQL developers

SQL developers using Databricks SQL to manage streaming tables and materialized views can use Git integration and CI/CD pipelines to streamline their workflows and maintain quality. With Git support for queries, SQL developers can focus on writing queries while using Git to version control their .sql files, which enables collaboration and automation without requiring deep infrastructure expertise. In addition, the SQL editor supports real-time collaboration and integrates with Git workflows.

For SQL-centric workflows, a typical CI/CD loop might be the following (an illustrative bundle job sketch follows the steps):

  1. Develop: Write and test .sql scripts locally or in the Databricks SQL editor, then commit them to a Git branch.
  2. Validate: During a pull request, validate syntax and schema compatibility using automated CI checks.
  3. Deploy: Upon merge, deploy the .sql scripts to the target environment using CI/CD pipelines, for example, GitHub Actions or Azure Pipelines.
  4. Monitor: Use Databricks dashboards and alerts to track query performance and data freshness.
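
As one illustrative sketch, a committed .sql script can be wired into a bundle-managed job with a SQL file task. The warehouse variable (assumed to be declared in databricks.yml), file path, and schedule below are assumptions:

# Illustrative bundle job that runs a versioned .sql script on a SQL warehouse.
resources:
  jobs:
    refresh_reporting_tables:
      name: refresh_reporting_tables
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"    # assumed daily 06:00 refresh
        timezone_id: UTC
      tasks:
        - task_key: run_refresh_sql
          sql_task:
            warehouse_id: ${var.warehouse_id}    # assumed bundle variable
            file:
              path: ../src/refresh_reporting_tables.sql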

CI/CD for dashboard developers

Databricks supports integrating dashboards into CI/CD workflows using Databricks Asset Bundles. This enables dashboard developers to keep dashboard definitions in source control and deploy them together with the jobs and pipelines they depend on.

For information about dashboards in bundles, see dashboard resource. For details about bundle commands, see bundle command group.
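
A dashboard resource in a bundle might look roughly like the following; the resource name, dashboard file, and warehouse variable are illustrative:

# Illustrative dashboard resource in a bundle (names and variables are placeholders).
resources:
  dashboards:
    sales_overview:
      display_name: "Sales Overview"
      file_path: ../src/sales_overview.lvdash.json
      warehouse_id: ${var.warehouse_id}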

