CI/CD (Continuous Integration and Continuous Delivery) has become a cornerstone of modern data engineering and analytics, as it ensures that code changes are integrated, tested, and deployed rapidly and reliably. Databricks recognizes that you may have diverse CI/CD requirements shaped by your organizational preferences, existing workflows, and specific technology environment, and provides a flexible framework that supports various CI/CD options.
This page describes best practices to help you design and build robust, customized CI/CD pipelines that align with your unique needs and constraints. By leveraging these insights, you can accelerate your data engineering and analytics initiatives, improve code quality, and reduce the risk of deployment failures.
Core principles of CI/CD

Effective CI/CD pipelines share foundational principles regardless of implementation specifics. The following universal best practices apply across organizational preferences, developer workflows, and cloud environments, and they ensure consistency across diverse implementations, whether your team prioritizes notebook-first development or infrastructure-as-code workflows. Adopt these principles as guardrails while tailoring specifics to your organization's technology stack and processes.
note
Databricks recommends workload identity federation for CI/CD authentication. Workload identity federation eliminates the need for Databricks secrets, which makes it the most secure way to authenticate your automated flows to Databricks. See Enable workload identity federation in CI/CD.
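To illustrate the recommendation above, the following is a minimal sketch of a GitHub Actions workflow that authenticates with workload identity federation instead of a token. It assumes a federation policy is already configured for this repository and that the Databricks CLI's github-oidc auth type is available; the workspace URL and client ID variable are placeholders.

YAML
# Illustrative sketch: secretless CI authentication via workload identity federation.
# Assumes a federation policy maps this repository's OIDC identity to a service principal.
name: Validate bundle with federated auth
on: push

permissions:
  id-token: write    # allow the job to request a GitHub OIDC token
  contents: read

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main        # installs the Databricks CLI
      - name: Validate bundle
        run: databricks bundle validate
        env:
          DATABRICKS_HOST: https://example.cloud.databricks.com   # placeholder workspace URL
          DATABRICKS_AUTH_TYPE: github-oidc                       # assumed federation auth type
          DATABRICKS_CLIENT_ID: ${{ vars.DATABRICKS_CLIENT_ID }}  # service principal application ID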
Databricks Asset Bundles for CI/CD

Databricks Asset Bundles offer a powerful, unified approach to managing code, workflows, and infrastructure within the Databricks ecosystem and are recommended for your CI/CD pipelines. By bundling these elements into a single YAML-defined unit, bundles simplify deployment and ensure consistency across environments. However, for users accustomed to traditional CI/CD workflows, adopting bundles may require a shift in mindset.
For example, Java developers are used to building JARs with Maven or Gradle, running unit tests with JUnit, and integrating these steps into CI/CD pipelines. Similarly, Python developers often package code into wheels and test with pytest, while SQL developers focus on query validation and notebook management. With bundles, these workflows converge into a more structured and prescriptive format, emphasizing bundling code and infrastructure for seamless deployment.
The following sections explore how developers can adapt their workflows to leverage bundles effectively.
To quickly get started with Databricks Asset Bundles, try a tutorial: Develop a job with Databricks Asset Bundles or Develop Lakeflow Declarative Pipelines with Databricks Asset Bundles.
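To make the structure concrete, a minimal bundle definition might look like the following sketch. The bundle name, workspace URLs, and include path are illustrative placeholders, not values from the tutorials.

YAML
# databricks.yml (illustrative): one bundle with separate dev and prod targets.
bundle:
  name: my-analytics-project        # hypothetical bundle name

include:
  - resources/*.yml                 # job and pipeline definitions versioned alongside the code

targets:
  dev:
    mode: development               # deploys resources with a per-user prefix, suited to iteration
    default: true
    workspace:
      host: https://dev.example.cloud.databricks.com    # placeholder workspace URL
  prod:
    mode: production
    workspace:
      host: https://prod.example.cloud.databricks.com   # placeholder workspace URL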
CI/CD source control recommendations

The first choice developers need to make when implementing CI/CD is how to store and version source files. Bundles make it easy to keep everything (source code, build artifacts, and configuration files) in the same source code repository, but another option is to separate bundle configuration files from code-related files. The choice depends on your team's workflow, project complexity, and CI/CD requirements, but Databricks recommends the following:
Whether you choose to co-locate or separate your code-related files from your bundle configuration files, always use versioned artifacts, such as Git commit hashes, when uploading to Databricks or external storage to ensure traceability and rollback capabilities.
Single repository for code and configuration

In this approach, both the source code and bundle configuration files are stored in the same repository. This simplifies workflows and ensures atomic changes.
Example: Python code in a bundle

This example has Python files and bundle files in one repository:
databricks-dab-repo/
├── databricks.yml               # Bundle definition
├── resources/
│   ├── workflows/
│   │   ├── my_pipeline.yml      # YAML pipeline def
│   │   └── my_pipeline_job.yml  # YAML job def that runs pipeline
│   └── clusters/
│       ├── dev_cluster.yml      # development cluster def
│       └── prod_cluster.yml     # production def
├── src/
│   ├── dlt_pipeline.ipynb       # pipeline notebook
│   └── mypython.py              # Additional Python
└── README.md
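As a sketch of how the resource files above might be filled in, the following shows one way resources/workflows/my_pipeline.yml and my_pipeline_job.yml could define a pipeline over src/dlt_pipeline.ipynb and a job that runs it. The names, catalog, and schema are illustrative assumptions.

YAML
# resources/workflows/my_pipeline.yml (illustrative)
resources:
  pipelines:
    my_pipeline:
      name: my_pipeline
      catalog: main                                 # hypothetical Unity Catalog catalog
      schema: analytics                             # hypothetical target schema
      libraries:
        - notebook:
            path: ../../src/dlt_pipeline.ipynb      # pipeline notebook from the layout above
---
# resources/workflows/my_pipeline_job.yml (illustrative)
resources:
  jobs:
    my_pipeline_job:
      name: my_pipeline_job
      tasks:
        - task_key: refresh_pipeline
          pipeline_task:
            pipeline_id: ${resources.pipelines.my_pipeline.id}   # reference the pipeline defined above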
Separate repositories for code and configuration
In this approach, the source code resides in one repository, while the bundle configuration files are maintained in another. This option is ideal for larger teams or projects where separate groups handle application development and Databricks workflow management.
Example: Java project and bundle

In this example, a Java project and its files are in one repository and the bundle files are in another repository.
Repository 1: Java files
The first repository contains all Java-related files:
java-app-repo/
├── pom.xml                  # Maven build configuration
├── src/
│   ├── main/
│   │   ├── java/            # Java source code
│   │   │   └── com/
│   │   │       └── mycompany/
│   │   │           └── app/
│   │   │               └── App.java
│   │   └── resources/       # Application resources
│   └── test/
│       ├── java/            # Unit tests for Java code
│       │   └── com/
│       │       └── mycompany/
│       │           └── app/
│       │               └── AppTest.java
│       └── resources/       # Test-specific resources
├── target/                  # Compiled JARs and classes
└── README.md
Key locations in this repository are the Java or Scala source code in src/main/java or src/main/scala, the unit tests in src/test/java or src/test/scala, and the compiled artifact, for example target/my-app-1.0.jar.

Repository 2: Bundle files
A second repository contains only the bundle configuration files:
databricks-dab-repo/
├── databricks.yml            # Bundle definition
├── resources/
│   ├── jobs/
│   │   ├── my_java_job.yml       # YAML job def
│   │   └── my_other_job.yml      # Additional job definitions
│   └── clusters/
│       ├── dev_cluster.yml       # development cluster def
│       └── prod_cluster.yml      # production def
└── README.md
The bundle configuration databricks.yml and job definitions are maintained independently.
The databricks.yml references the uploaded JAR artifact, for example:
YAML
- jar: /Volumes/artifacts/my-app-${{ GIT_SHA }}.jar
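In fuller form, the job definition that consumes this artifact (for example, resources/jobs/my_java_job.yml from the layout above) might look roughly like the following sketch. The main class, cluster settings, and the use of a bundle variable for the commit hash are illustrative assumptions, not values from this page.

YAML
# resources/jobs/my_java_job.yml (illustrative sketch)
resources:
  jobs:
    my_java_job:
      name: my_java_job
      tasks:
        - task_key: run_app
          spark_jar_task:
            main_class_name: com.mycompany.app.App              # hypothetical entry point
          libraries:
            - jar: /Volumes/artifacts/my-app-${var.git_sha}.jar # commit hash supplied as a bundle variable
          new_cluster:
            spark_version: 15.4.x-scala2.12                     # placeholder Databricks Runtime version
            node_type_id: i3.xlarge                             # placeholder node type
            num_workers: 1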
Regardless of whether you co-locate or separate your code files and your bundle configuration files, a recommended workflow is the following:

1. Compile and test the code, producing a versioned artifact such as my-app-1.0.jar.
2. Upload and store the compiled file, such as a JAR, to a Databricks Unity Catalog volume or other artifact storage, for example dbfs:/mnt/artifacts/my-app-${{ github.sha }}.jar.
3. Validate the bundle: Run databricks bundle validate to ensure that the databricks.yml configuration is correct.
4. Deploy the bundle: Run databricks bundle deploy to deploy the bundle to a staging or production environment, referencing the uploaded artifact in databricks.yml. For information about referencing libraries, see Databricks Asset Bundles library dependencies.
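Tied together, these steps might look roughly like the following GitHub Actions sketch. It assumes workload identity federation is configured as recommended earlier on this page, and the Maven goals, volume path, and target name are illustrative assumptions.

YAML
# .github/workflows/deploy.yml (illustrative sketch of the four steps above)
name: Build and deploy bundle
on:
  push:
    branches: [main]

permissions:
  id-token: write      # for workload identity federation (see the note above)
  contents: read

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    env:
      DATABRICKS_HOST: https://example.cloud.databricks.com   # placeholder workspace URL
      DATABRICKS_AUTH_TYPE: github-oidc                       # assumes a federation policy exists
      DATABRICKS_CLIENT_ID: ${{ vars.DATABRICKS_CLIENT_ID }}
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main

      # 1. Compile and run unit tests.
      - run: mvn --batch-mode verify

      # 2. Upload the versioned JAR to a Unity Catalog volume (illustrative path).
      - run: databricks fs cp target/my-app-1.0.jar dbfs:/Volumes/main/artifacts/libs/my-app-${{ github.sha }}.jar

      # 3. Validate the bundle configuration.
      - run: databricks bundle validate

      # 4. Deploy the bundle to the production target.
      - run: databricks bundle deploy --target prod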
CI/CD for machine learning

Machine learning projects introduce unique CI/CD challenges compared to traditional software development. When implementing CI/CD for ML projects, you will likely need to consider requirements beyond those of traditional code pipelines.
Databricks addresses ML CI/CD complexity through MLOps Stacks, a production-grade framework that combines Databricks Asset Bundles, preconfigured CI/CD workflows, and modular ML project templates. These stacks enforce best practices while allowing flexibility for multi-team collaboration across data engineering, data science, and MLOps roles.
ML CI/CD collaboration typically spans data engineering, data science, and MLOps roles working against the same bundle-managed project. For implementation details, see the MLOps Stacks documentation.
By aligning teams with standardized bundles and MLOps Stacks, organizations can streamline collaboration while maintaining auditability across the ML lifecycle.
CI/CD for SQL developers

SQL developers using Databricks SQL to manage streaming tables and materialized views can leverage Git integration and CI/CD pipelines to streamline their workflows and maintain high-quality pipelines. With the introduction of Git support for queries, SQL developers can focus on writing queries while using Git to version control their .sql files, which enables collaboration and automation without needing deep infrastructure expertise. In addition, the SQL editor enables real-time collaboration and integrates seamlessly with Git workflows.
For SQL-centric workflows:

Version control SQL files: Store the .sql files that define your streaming tables and materialized views in a Git repository so that changes are tracked and reviewable.

Integrate .sql files into CI/CD pipelines to automate deployment, for example by deploying .sql files to Databricks SQL workflows or jobs.

Parameterize for environment isolation: Use variables in .sql files to dynamically reference environment-specific resources, such as data paths or table names:

SQL
CREATE OR REFRESH STREAMING TABLE ${env}_sales_ingest AS
SELECT * FROM read_files('s3://${env}-sales-data')
Schedule and monitor refreshes: Schedule refreshes with Databricks jobs and monitor them (for example, by running REFRESH MATERIALIZED VIEW view_name).

A workflow might be to develop .sql scripts locally or in the Databricks SQL editor, then commit them to a Git branch so that changes flow through your CI/CD pipeline.
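As one possible wiring, a version-controlled .sql file can be attached to a bundle-managed job as a SQL file task, which also covers the scheduling point above. This is a sketch; the warehouse variable, cron schedule, and file path are illustrative assumptions.

YAML
# resources/jobs/sales_ingest_job.yml (illustrative sketch)
resources:
  jobs:
    sales_ingest_refresh:
      name: sales_ingest_refresh
      schedule:
        quartz_cron_expression: "0 0 2 * * ?"   # hypothetical nightly refresh at 02:00
        timezone_id: UTC
      tasks:
        - task_key: run_sales_sql
          sql_task:
            warehouse_id: ${var.warehouse_id}    # SQL warehouse supplied per target
            file:
              path: ../sql/sales_ingest.sql      # the version-controlled .sql file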
CI/CD for dashboards

Databricks supports integrating dashboards into CI/CD workflows using Databricks Asset Bundles. This capability enables dashboard developers to version, review, and automatically deploy dashboards alongside the other resources in a bundle.

For dashboards in CI/CD:
Use the databricks bundle generate command to export existing dashboards as JSON files and generate the YAML configuration that includes them in the bundle:
YAML
resources:
  dashboards:
    sales_dashboard:
      display_name: 'Sales Dashboard'
      file_path: ./dashboards/sales_dashboard.lvdash.json
      warehouse_id: ${var.warehouse_id}
Store these .lvdash.json
files in Git repositories to track changes and collaborate effectively.
Automatically deploy dashboards in CI/CD pipelines with databricks bundle deploy. For example, a GitHub Actions step for deployment:
YAML
- name: Deploy Dashboard
  run: databricks bundle deploy --target=prod
  env:
    DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
Use variables, for example ${var.warehouse_id}
, to parameterize configurations like SQL warehouses or data sources, ensuring seamless deployment across dev, staging, and production environments.
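For instance, warehouse_id could be declared once as a bundle variable and overridden per deployment target, roughly as follows. The warehouse IDs shown are placeholders.

YAML
# Illustrative: declare the variable once, then override it per target.
variables:
  warehouse_id:
    description: SQL warehouse used by dashboards in this bundle

targets:
  dev:
    variables:
      warehouse_id: 1234567890abcdef    # placeholder dev warehouse ID
  prod:
    variables:
      warehouse_id: fedcba0987654321    # placeholder prod warehouse ID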
Use the bundle generate --watch
option to continuously sync local dashboard JSON files with changes made in the Databricks UI. If discrepancies occur, use the --force
flag during deployment to overwrite remote dashboards with local versions.
For information about dashboards in bundles, see dashboard resource. For details about bundle commands, see bundle
command group.