This section provides a guide to developing notebooks and jobs in Databricks using the Python language, including tutorials for common workflows and tasks, and links to APIs, libraries, and tools.
To get started:
The tutorials below provide example code and notebooks for learning about common workflows. See Import a notebook for instructions on importing notebook examples into your workspace.
Data engineering
The example notebook illustrates how to use the Python debugger (pdb) in Databricks notebooks. To use the Python debugger, you must be running Databricks Runtime 11.3 LTS or above.
With Databricks Runtime 12.2 LTS and above, you can use variable explorer to track the current value of Python variables in the notebook UI. You can use variable explorer to observe the values of Python variables as you step through breakpoints.
Python debugger example notebook
note
breakpoint() is not supported in IPython and thus does not work in Databricks notebooks. You can use import pdb; pdb.set_trace() instead of breakpoint().
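As a minimal sketch of the pattern described in the note, the function below calls pdb.set_trace() behind a flag. The function name normalize and the debug parameter are hypothetical and exist only for illustration. With debug=True in a notebook cell, execution would pause at the marked line and open the debugger prompt.

```python
import pdb


def normalize(values, debug=False):
    """Scale values so that they sum to 1 (hypothetical example function)."""
    total = sum(values)
    if debug:
        # Pauses here and opens the interactive pdb prompt.
        # Used in place of breakpoint(), per the note above.
        pdb.set_trace()
    return [v / total for v in values]


# With debug=False the function runs straight through without pausing.
shares = normalize([1, 2, 3, 4])
print(shares)
```

Calling normalize([1, 2, 3, 4], debug=True) instead would stop inside the function, where pdb commands such as p total, n, and c can be used to inspect and continue.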
Python code that runs outside of Databricks can generally run within Databricks, and vice versa. If you have existing code, just import it into Databricks to get started. See Manage code with notebooks and Databricks Git folders below for details.
Databricks can run both single-machine and distributed Python workloads. For single-machine computing, you can use Python APIs and libraries as usual; for example, pandas and scikit-learn will "just work." For distributed Python workloads, Databricks offers two popular APIs out of the box: PySpark and Pandas API on Spark.
PySpark API
PySpark is the official Python API for Apache Spark and combines the power of Python and Apache Spark. PySpark offers more flexibility than the Pandas API on Spark and provides extensive support and features for data science and engineering functionality such as Spark SQL, Structured Streaming, MLlib, and GraphX.
Pandas API on Spark
pandas is a Python package commonly used by data scientists for data analysis and manipulation. However, pandas does not scale out to big data. Pandas API on Spark fills this gap by providing pandas-equivalent APIs that work on Apache Spark. This open-source API is an ideal choice for data scientists who are familiar with pandas but not Apache Spark.
Manage code with notebooks and Databricks Git folders
Databricks notebooks support Python. These notebooks provide functionality similar to that of Jupyter, but with additions such as built-in visualizations using big data, Apache Spark integrations for debugging and performance monitoring, and MLflow integrations for tracking machine learning experiments. Get started by importing a notebook. Once you have access to a cluster, you can attach a notebook to the cluster and run the notebook.
tip
To reset the state of your notebook, restart the IPython kernel. For Jupyter users, the "restart kernel" option in Jupyter corresponds to detaching and reattaching a notebook in Databricks. To restart the kernel in a Python notebook, click the compute selector in the notebook toolbar and hover over the attached cluster or SQL warehouse in the list to display a side menu. Select Detach & re-attach. This detaches the notebook from your cluster and reattaches it, which restarts the Python process.
Databricks Git folders allow users to synchronize notebooks and other files with Git repositories. Databricks Git folders help with code versioning and collaboration, and they can simplify importing a full repository of code into Databricks, viewing past notebook versions, and integrating with IDE development. Get started by cloning a remote Git repository. You can then open or create notebooks with the repository clone, attach the notebook to a cluster, and run the notebook.
Clusters and libraries
Databricks compute provides compute management for clusters of any size: from single node clusters up to large clusters. You can customize cluster hardware and libraries according to your needs. Data scientists will generally begin work either by creating a cluster or using an existing shared cluster. Once you have access to a cluster, you can attach a notebook to the cluster or run a job on the cluster.
Databricks clusters use a Databricks Runtime, which provides many popular libraries out-of-the-box, including Apache Spark, Delta Lake, pandas, and more. You can also install additional third-party or custom Python libraries to use with notebooks and jobs.
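To see which versions of these libraries are available in the current Python environment, one option is the standard-library importlib.metadata module. The sketch below is not Databricks-specific. The helper names installed_version and report are hypothetical, and my_library is a placeholder for a package name that may not be installed.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string for a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def report(packages):
    """Map each package name to its installed version (or None)."""
    return {name: installed_version(name) for name in packages}


# my_library is a placeholder name; it is expected to be missing
# unless a package by that name has been installed.
versions = report(["pandas", "scikit-learn", "my_library"])
for name, found in versions.items():
    print(f"{name}: {found or 'not installed'}")
```

Running the same cell again after installing a library shows whether the install took effect in the notebook's environment.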
The %pip install my_library magic command installs my_library to all nodes in your currently attached cluster, yet does not interfere with other workloads on compute with standard access mode.
Databricks Python notebooks have built-in support for many types of visualizations. You can also use legacy visualizations.
You can also visualize data using third-party libraries; some are pre-installed in the Databricks Runtime, but you can install custom libraries as well. Popular options include:
Jobs
You can automate Python workloads as scheduled or triggered jobs in Databricks. Jobs can run notebooks, Python scripts, and Python wheel files.
tip
To schedule a Python script instead of a notebook, use the spark_python_task field under tasks in the body of a create job request.
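The tip above can be sketched as a request body built in Python. This is an illustration, not a definitive payload. The helper build_script_job, the job name, the script path, and the cluster ID are all hypothetical. The fields other than tasks and spark_python_task (task_key, existing_cluster_id, python_file, parameters) follow the Jobs API as commonly documented and should be checked against the API reference for your workspace.

```python
import json


def build_script_job(name, python_file, cluster_id, parameters=None):
    """Build a create-job request body that runs a Python script (sketch)."""
    return {
        "name": name,
        "tasks": [
            {
                "task_key": "run_script",
                # Assumes an existing cluster; the ID passed in is a placeholder.
                "existing_cluster_id": cluster_id,
                # spark_python_task points the task at a script, not a notebook.
                "spark_python_task": {
                    "python_file": python_file,
                    "parameters": list(parameters or []),
                },
            }
        ],
    }


# All values below are hypothetical placeholders.
payload = build_script_job(
    name="nightly-etl",
    python_file="dbfs:/FileStore/scripts/etl.py",
    cluster_id="1234-567890-abcde123",
    parameters=["--date", "2024-01-01"],
)
print(json.dumps(payload, indent=2))
```

The resulting JSON is what would be sent as the body of the create job request; authentication and the HTTP call itself are left out of this sketch.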
Databricks supports a wide variety of machine learning (ML) workloads, including traditional ML on tabular data, deep learning for computer vision and natural language processing, recommendation systems, graph analytics, and more. For general information about machine learning on Databricks, see AI and machine learning on Databricks.
For ML algorithms, you can use pre-installed libraries in Databricks Runtime for Machine Learning, which includes popular Python tools such as scikit-learn, TensorFlow, Keras, PyTorch, Apache Spark MLlib, and XGBoost. You can also install custom libraries.
For machine learning operations (MLOps), Databricks provides a managed service for the open source library MLflow. With MLflow Tracking you can record model development and save models in reusable formats. You can use the MLflow Model Registry to manage and automate the promotion of models towards production. Jobs and Model Serving allow hosting models as batch and streaming jobs and as REST endpoints. For more information and examples, see the MLflow for ML model lifecycle or the MLflow Python API docs.
To get started with common machine learning workloads, see the following pages:
In addition to developing Python code within Databricks notebooks, you can develop externally using integrated development environments (IDEs) such as PyCharm, Jupyter, and Visual Studio Code. To synchronize work between external development environments and Databricks, there are several options:
Databricks provides a set of SDKs, including a Python SDK, that support automation and integration with external tooling. You can use the Databricks SDKs to manage resources like clusters and libraries, code and other workspace objects, workloads and jobs, and more. See the Databricks SDKs.
For more information on IDEs, developer tools, and SDKs, see Local development tools.
Additional resources