Showing content from https://learn.microsoft.com/en-us/azure/databricks/dev-tools/sdk-python below:

Databricks SDK for Python - Azure Databricks

Note

Databricks recommends Databricks Asset Bundles for creating, developing, deploying, and testing jobs and other Databricks resources as source code. See What are Databricks Asset Bundles?.

In this article, you learn how to automate Azure Databricks operations and accelerate development with the Databricks SDK for Python. This article supplements the Databricks SDK for Python documentation on Read The Docs and the code examples in the Databricks SDK for Python repository in GitHub.

Note

The Databricks SDK for Python is in Beta and is okay to use in production.

During the Beta period, Databricks recommends that you pin a dependency on the specific minor version of the Databricks SDK for Python that your code depends on. For example, you can pin dependencies in files such as requirements.txt for venv, or pyproject.toml and poetry.lock for Poetry. For more information about pinning dependencies, see Virtual Environments and Packages for venv, or Installing dependencies for Poetry.

Before you begin

You can use the Databricks SDK for Python from within an Azure Databricks notebook or from your local development machine.

Before you begin to use the Databricks SDK for Python, your development machine must have:

Get started with the Databricks SDK for Python

This section describes how to get started with the Databricks SDK for Python from your local development machine. To use the Databricks SDK for Python from within an Azure Databricks notebook, skip ahead to Use the Databricks SDK for Python from an Azure Databricks notebook.

  1. On your development machine with Azure Databricks authentication configured, Python already installed, and your Python virtual environment already activated, install the databricks-sdk package (and its dependencies) from the Python Package Index (PyPI), as follows:

    Venv

    Use pip to install the databricks-sdk package. (On some systems, you might need to replace pip3 with pip, here and throughout.)

    pip3 install databricks-sdk
    
    Poetry
    poetry add databricks-sdk
    

    To install a specific version of the databricks-sdk package while the Databricks SDK for Python is in Beta, see the package's Release history. For example, to install version 0.1.6:

    Venv
    pip3 install databricks-sdk==0.1.6
    
    Poetry
    poetry add databricks-sdk==0.1.6
    

    To upgrade an existing installation of the Databricks SDK for Python package to the latest version, run the following command:

    Venv
    pip3 install --upgrade databricks-sdk
    
    Poetry
    poetry add databricks-sdk@latest
    

    To show the Databricks SDK for Python package's current Version and other details, run the following command:

    Venv
    pip3 show databricks-sdk
    
    Poetry
    poetry show databricks-sdk
    
  2. In your Python virtual environment, create a Python code file that imports the Databricks SDK for Python. The following example, saved in a file named main.py, lists all the clusters in your Azure Databricks workspace:

    from databricks.sdk import WorkspaceClient
    
    w = WorkspaceClient()
    
    for c in w.clusters.list():
      print(c.cluster_name)
    
  3. Run your Python code file, assuming a file named main.py, by running the python command:

    Venv
    python3.10 main.py
    
    Poetry

    If you are in the virtual environment's shell:

    python3.10 main.py
    

    If you are not in the virtual environment's shell:

    poetry run python3.10 main.py
    

    Note

    Because the preceding call to w = WorkspaceClient() sets no arguments, the Databricks SDK for Python uses its default process to attempt Azure Databricks authentication. To override this default behavior, see the following authentication section.

Authenticate the Databricks SDK for Python with your Azure Databricks account or workspace

This section describes how to authenticate the Databricks SDK for Python from your local development machine over to your Azure Databricks account or workspace. To authenticate the Databricks SDK for Python from within an Azure Databricks notebook, skip ahead to Use the Databricks SDK for Python from an Azure Databricks notebook.

The Databricks SDK for Python implements the Databricks client unified authentication standard, a consolidated and consistent architectural and programmatic approach to authentication. This approach helps make setting up and automating authentication with Azure Databricks more centralized and predictable. It enables you to configure Databricks authentication once and then use that configuration across multiple Databricks tools and SDKs without further authentication configuration changes. For more information, including more complete code examples in Python, see Databricks client unified authentication.

Some of the available coding patterns to initialize Databricks authentication with the Databricks SDK for Python include:

See also Authentication in the Databricks SDK for Python documentation.
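As a rough sketch of overriding the default authentication process, the following helper builds constructor arguments for WorkspaceClient from either a named configuration profile or an explicit workspace URL and personal access token. The profile, host, and token keyword arguments belong to WorkspaceClient; the helper names client_kwargs and make_client, the profile name, and the placeholder values are assumptions made for illustration only.

```python
from typing import Optional


def client_kwargs(profile: Optional[str] = None,
                  host: Optional[str] = None,
                  token: Optional[str] = None) -> dict:
    """Build keyword arguments for WorkspaceClient (hypothetical helper)."""
    if profile:
        # Use a named profile from the local .databrickscfg file.
        return {"profile": profile}
    if host and token:
        # Use an explicit workspace URL and personal access token.
        return {"host": host, "token": token}
    # No arguments: the SDK falls back to its default authentication process.
    return {}


def make_client(**kwargs):
    """Create a WorkspaceClient from the arguments selected above."""
    # Imported lazily so that client_kwargs can be exercised without the SDK.
    from databricks.sdk import WorkspaceClient
    return WorkspaceClient(**client_kwargs(**kwargs))


# Inspect the arguments without contacting any workspace.
print(client_kwargs(profile="DEFAULT"))
print(client_kwargs(host="https://<workspace-url>", token="<token>"))
```

In real code, read the token from a secret store or an environment variable instead of writing it into the source file.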

Use the Databricks SDK for Python from an Azure Databricks notebook

You can call Databricks SDK for Python functionality from an Azure Databricks notebook that has an attached Azure Databricks cluster with the Databricks SDK for Python installed. It is installed by default on all Azure Databricks clusters that use Databricks Runtime 13.3 LTS or above. For Azure Databricks clusters that use Databricks Runtime 12.2 LTS and below, you must install the Databricks SDK for Python first. See Step 1: Install or upgrade the Databricks SDK for Python.

To see the Databricks SDK for Python version that is installed for a specific Databricks Runtime version, see the Installed Python libraries section of the Databricks Runtime release notes for that version.

Databricks recommends that you install the latest available version of the SDK from PyPI. At a minimum, install or upgrade to Databricks SDK for Python 0.6.0 or above, because versions 0.6.0 and above use default Azure Databricks notebook authentication on all Databricks Runtime versions.
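The minimum-version rule can be checked in code before relying on default notebook authentication. The following is a minimal sketch that uses only the Python standard library. The function names are hypothetical, and the simple parser assumes plain numeric versions such as 0.20.0 (no pre-release suffixes).

```python
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Convert a plain numeric version such as '0.20.0' to (0, 20, 0)."""
    return tuple(int(part) for part in text.split("."))


def supports_default_notebook_auth(sdk_version: str) -> bool:
    """Return True if the version meets the 0.6.0 minimum named in this article."""
    return parse_version(sdk_version) >= (0, 6, 0)


def installed_sdk_version():
    """Return the installed databricks-sdk version, or None if it is not installed."""
    try:
        return metadata.version("databricks-sdk")
    except metadata.PackageNotFoundError:
        return None


print(supports_default_notebook_auth("0.1.10"))  # False
print(supports_default_notebook_auth("0.20.0"))  # True
```

Calling installed_sdk_version() in a notebook cell returns the version string of the library attached to the cluster, which can then be passed to supports_default_notebook_auth.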

Note

Databricks Runtime 15.1 is the first Databricks Runtime to have a version of the Databricks SDK for Python (0.20.0) installed that supports default notebook authentication with no upgrade required.

The following table outlines notebook authentication support for Databricks SDK for Python and Databricks Runtime versions:

SDK/DBR            10.4 LTS   11.3 LTS   12.2 LTS   13.3 LTS   14.3 LTS   15.1 and above
0.1.7 and below    -          -          -          -          -          -
0.1.10             ✓          ✓          ✓          ✓          ✓          -
0.6.0              ✓          ✓          ✓          ✓          ✓          ✓
0.20.0 and above   ✓          ✓          ✓          ✓          ✓          ✓

Default Azure Databricks notebook authentication relies on a temporary Azure Databricks personal access token that Azure Databricks automatically generates in the background for its own use. Azure Databricks deletes this temporary token after the notebook stops running.

Important

If you want to call Azure Databricks account-level APIs or if you want to use a Databricks authentication type other than default Databricks notebook authentication, the following authentication types are also supported:

Azure managed identities authentication is not yet supported.

Step 1: Install or upgrade the Databricks SDK for Python

Note

Databricks SDK for Python is installed by default on all Azure Databricks clusters that use Databricks Runtime 13.3 LTS or above.

  1. Azure Databricks Python notebooks can use the Databricks SDK for Python just like any other Python library. To install or upgrade the Databricks SDK for Python library on the attached Azure Databricks cluster, run the %pip magic command from a notebook cell as follows:

    %pip install databricks-sdk --upgrade
    
  2. After you run the %pip magic command, you must restart Python to make the installed or upgraded library available to the notebook. To do this, run the following command from a notebook cell immediately after the cell with the %pip magic command:

    dbutils.library.restartPython()
    
  3. To display the installed version of the Databricks SDK for Python, run the following command from a notebook cell:

    %pip show databricks-sdk | grep -oP '(?<=Version: )\S+'
    
Step 2: Run your code

In your notebook cells, create Python code that imports and then calls the Databricks SDK for Python. The following example uses default Azure Databricks notebook authentication to list all the clusters in your Azure Databricks workspace:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

for c in w.clusters.list():
  print(c.cluster_name)

When you run this cell, a list of the names of all of the available clusters in your Azure Databricks workspace appears.
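Building on the listing above, a small helper can group cluster names by state, for example to see which clusters are running. This is only a sketch: the names_by_state function is hypothetical, and the sample objects are stand-ins so that the snippet runs without a workspace. With the SDK, pass w.clusters.list() instead; each returned item carries cluster_name and state attributes.

```python
from types import SimpleNamespace


def names_by_state(clusters) -> dict:
    """Group cluster names by the name of their state."""
    grouped = {}
    for c in clusters:
        # In the SDK, c.state is an enum whose name is, for example, RUNNING.
        # Plain strings (used by the stand-ins below) are kept as they are.
        state = getattr(c.state, "name", str(c.state))
        grouped.setdefault(state, []).append(c.cluster_name)
    return grouped


# Stand-in objects that mimic the two attributes used above.
sample = [
    SimpleNamespace(cluster_name="etl", state="RUNNING"),
    SimpleNamespace(cluster_name="adhoc", state="TERMINATED"),
    SimpleNamespace(cluster_name="ml", state="RUNNING"),
]

print(names_by_state(sample))
# {'RUNNING': ['etl', 'ml'], 'TERMINATED': ['adhoc']}
```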

To use a different Azure Databricks authentication type, see Azure Databricks authorization methods and click the corresponding link for additional technical details.

Use Databricks Utilities

You can use Databricks Utilities from Databricks SDK for Python code running on your local development machine or from within an Azure Databricks notebook.

To call Databricks Utilities from either your local development machine or an Azure Databricks notebook, use dbutils within WorkspaceClient. This code example uses default Azure Databricks notebook authentication to call dbutils within WorkspaceClient to list the paths of all of the objects in the DBFS root of the workspace.

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
d = w.dbutils.fs.ls('/')

for f in d:
  print(f.path)

Alternatively, you can call dbutils directly, but then only default Azure Databricks notebook authentication is available. This code example calls dbutils directly to list all of the objects in the DBFS root of the workspace.

from databricks.sdk.runtime import *

d = dbutils.fs.ls('/')

for f in d:
  print(f.path)

To access Unity Catalog volumes, use files within WorkspaceClient. See Manage files in Unity Catalog volumes. You cannot use dbutils by itself or within WorkspaceClient to access volumes.

See also Interaction with dbutils.

Code examples

The following code examples demonstrate how to use the Databricks SDK for Python to create and delete clusters, run jobs, and list account-level groups. These code examples use default Azure Databricks notebook authentication. For details about default Azure Databricks notebook authentication, see Use the Databricks SDK for Python from an Azure Databricks notebook. For details about default authentication outside of notebooks, see Authenticate the Databricks SDK for Python with your Azure Databricks account or workspace.

For additional code examples, see the examples in the Databricks SDK for Python repository in GitHub. See also:

Create a cluster

This code example creates a cluster with the specified Databricks Runtime version and cluster node type. This cluster has one worker, and the cluster will automatically terminate after 15 minutes of idle time.

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

print("Attempting to create cluster. Please wait...")

c = w.clusters.create_and_wait(
  cluster_name             = 'my-cluster',
  spark_version            = '12.2.x-scala2.12',
  node_type_id             = 'Standard_DS3_v2',
  autotermination_minutes  = 15,
  num_workers              = 1
)

print(f"The cluster is now ready at " \
      f"{w.config.host}#setting/clusters/{c.cluster_id}/configuration\n")
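Hard-coded runtime and node type strings can go stale. As a hedged variation on the example above, the SDK's select_spark_version and select_node_type helpers on w.clusters can choose these values from what the workspace offers. The wrapper name create_small_cluster is hypothetical, and the snippet does a dry run against a stand-in object so that it executes without creating a cluster.

```python
from unittest.mock import MagicMock


def create_small_cluster(w, cluster_name: str):
    """Create a one-worker cluster, letting the SDK pick runtime and node type."""
    # Ask the workspace for the latest long-term-support Databricks Runtime.
    spark_version = w.clusters.select_spark_version(latest=True, long_term_support=True)
    # Ask for the smallest available node type that has a local disk.
    node_type_id = w.clusters.select_node_type(local_disk=True)
    return w.clusters.create_and_wait(
        cluster_name=cluster_name,
        spark_version=spark_version,
        node_type_id=node_type_id,
        autotermination_minutes=15,
        num_workers=1,
    )


# Dry run against a stand-in object. The return values below are sample data.
# Pass a real WorkspaceClient() instead to create an actual (billable) cluster.
fake_w = MagicMock()
fake_w.clusters.select_spark_version.return_value = "13.3.x-scala2.12"
fake_w.clusters.select_node_type.return_value = "Standard_DS3_v2"

create_small_cluster(fake_w, "my-cluster")
print(fake_w.clusters.create_and_wait.call_args.kwargs["spark_version"])
```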

Permanently delete a cluster

This code example permanently deletes the cluster with the specified cluster ID from the workspace.

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

c_id = input('ID of cluster to delete (for example, 1234-567890-ab123cd4): ')

w.clusters.permanent_delete(cluster_id = c_id)

Create a job

This code example creates an Azure Databricks job that runs the specified notebook on the specified cluster. As the code runs, it gets the existing notebook's path, the existing cluster ID, and related job settings from the user at the terminal.

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import Task, NotebookTask, Source

w = WorkspaceClient()

job_name            = input("Some short name for the job (for example, my-job): ")
description         = input("Some short description for the job (for example, My job): ")
existing_cluster_id = input("ID of the existing cluster in the workspace to run the job on (for example, 1234-567890-ab123cd4): ")
notebook_path       = input("Workspace path of the notebook to run (for example, /Users/someone@example.com/my-notebook): ")
task_key            = input("Some key to apply to the job's tasks (for example, my-key): ")

print("Attempting to create the job. Please wait...\n")

j = w.jobs.create(
  name = job_name,
  tasks = [
    Task(
      description = description,
      existing_cluster_id = existing_cluster_id,
      notebook_task = NotebookTask(
        base_parameters = {},
        notebook_path = notebook_path,
        source = Source("WORKSPACE")
      ),
      task_key = task_key
    )
  ]
)

print(f"View the job at {w.config.host}/#job/{j.job_id}\n")
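Once a job exists, it can be triggered and awaited from code. The following sketch assumes the job ID returned by the preceding example and uses the SDK's run_now_and_wait method on w.jobs. The wrapper name run_job is hypothetical, and a stand-in object replaces the real client so that the snippet runs without a workspace.

```python
from unittest.mock import MagicMock


def run_job(w, job_id: int) -> str:
    """Trigger a job run, wait for it to finish, and return its result state as text."""
    run = w.jobs.run_now_and_wait(job_id=job_id)
    # In the SDK, result_state is an enum; str() gives a printable form.
    return str(run.state.result_state)


# Dry run against a stand-in object. With a real client, use:
#   run_job(WorkspaceClient(), j.job_id)
fake_w = MagicMock()
fake_w.jobs.run_now_and_wait.return_value.state.result_state = "SUCCESS"

print(run_job(fake_w, 123))  # SUCCESS
```

With a real client, the printed text is the string form of the SDK's result-state enum rather than the bare word shown in this dry run.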

Create a job that uses serverless compute

The following example creates a job that uses Serverless Compute for Jobs:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import NotebookTask, Source, Task

w = WorkspaceClient()

j = w.jobs.create(
  name = "My Serverless Job",
  tasks = [
    Task(
      notebook_task = NotebookTask(
        notebook_path = "/Users/someone@example.com/MyNotebook",
        source = Source("WORKSPACE")
      ),
      task_key = "MyTask",
    )
  ]
)

Manage files in Unity Catalog volumes

This code example demonstrates various calls to files functionality within WorkspaceClient to access a Unity Catalog volume.

import io  # Needed for io.BytesIO in the upload step below.

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Define volume, folder, and file details.
catalog            = 'main'
schema             = 'default'
volume             = 'my-volume'
volume_path        = f"/Volumes/{catalog}/{schema}/{volume}" # /Volumes/main/default/my-volume
volume_folder      = 'my-folder'
volume_folder_path = f"{volume_path}/{volume_folder}" # /Volumes/main/default/my-volume/my-folder
volume_file        = 'data.csv'
volume_file_path   = f"{volume_folder_path}/{volume_file}" # /Volumes/main/default/my-volume/my-folder/data.csv
upload_file_path   = './data.csv'

# Create an empty folder in a volume.
w.files.create_directory(volume_folder_path)

# Upload a file to a volume.
with open(upload_file_path, 'rb') as file:
  file_bytes = file.read()
  binary_data = io.BytesIO(file_bytes)
  w.files.upload(volume_file_path, binary_data, overwrite = True)

# List the contents of a volume.
for item in w.files.list_directory_contents(volume_path):
  print(item.path)

# List the contents of a folder in a volume.
for item in w.files.list_directory_contents(volume_folder_path):
  print(item.path)

# Print the contents of a file in a volume.
resp = w.files.download(volume_file_path)
print(str(resp.contents.read(), encoding='utf-8'))

# Delete a file from a volume.
w.files.delete(volume_file_path)

# Delete a folder from a volume.
w.files.delete_directory(volume_folder_path)

List account-level groups

This code example lists the display names for all of the available groups within the Azure Databricks account.

Note

Notebook-native authentication is not supported for AccountClient, so you must set credentials in the constructor to run this example in a notebook.

from databricks.sdk import AccountClient

a = AccountClient()

for g in a.groups.list():
  print(g.display_name)
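To run the preceding example in a notebook, credentials must be passed to the AccountClient constructor. The following sketch shows one possible choice, Microsoft Entra service principal credentials, and assumes that the host, account_id, azure_client_id, azure_client_secret, and azure_tenant_id keyword arguments fit your setup. The helper names are hypothetical and all values are placeholders.

```python
def account_client_kwargs(account_id, client_id, client_secret, tenant_id) -> dict:
    """Collect constructor arguments for AccountClient (hypothetical helper)."""
    values = {
        # Azure Databricks account console URL.
        "host": "https://accounts.azuredatabricks.net",
        "account_id": account_id,
        "azure_client_id": client_id,
        "azure_client_secret": client_secret,
        "azure_tenant_id": tenant_id,
    }
    # Fail early with a clear message if any value is empty.
    missing = [key for key, value in values.items() if not value]
    if missing:
        raise ValueError(f"Missing credential values: {', '.join(missing)}")
    return values


def list_group_names(**credentials):
    """List account-level group display names using explicit credentials."""
    # Imported lazily; requires the databricks-sdk package.
    from databricks.sdk import AccountClient
    a = AccountClient(**account_client_kwargs(**credentials))
    return [g.display_name for g in a.groups.list()]


# Inspect the arguments without contacting the account console.
kwargs = account_client_kwargs("<account-id>", "<client-id>", "<client-secret>", "<tenant-id>")
print(kwargs["host"])
```

In a notebook, fetch the secret values from a secret scope instead of writing them into a cell.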

Testing

To test your code, use Python test frameworks such as pytest. To test your code under simulated conditions without calling Azure Databricks REST API endpoints or changing the state of your Azure Databricks accounts or workspaces, use Python mocking libraries such as unittest.mock.

The following example file named helpers.py contains a create_cluster function that returns information about the new cluster:

# helpers.py

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import ClusterDetails

def create_cluster(
  w: WorkspaceClient,
  cluster_name:            str,
  spark_version:           str,
  node_type_id:            str,
  autotermination_minutes: int,
  num_workers:             int
) -> ClusterDetails:
  response = w.clusters.create(
    cluster_name            = cluster_name,
    spark_version           = spark_version,
    node_type_id            = node_type_id,
    autotermination_minutes = autotermination_minutes,
    num_workers             = num_workers
  )
  return response

Given the following file named main.py that calls the create_cluster function:

# main.py

from databricks.sdk import WorkspaceClient
from helpers import *

w = WorkspaceClient()

# Replace <spark-version> with the target Spark version string.
# Replace <node-type-id> with the target node type string.
response = create_cluster(
  w = w,
  cluster_name            = 'Test Cluster',
  spark_version           = '<spark-version>',
  node_type_id            = '<node-type-id>',
  autotermination_minutes = 15,
  num_workers             = 1
)

print(response.cluster_id)

The following file named test_helpers.py tests whether the create_cluster function returns the expected response. Rather than creating a cluster in the target workspace, this test mocks a WorkspaceClient object, defines the mocked object's settings, and then passes the mocked object to the create_cluster function. The test then checks whether the function returns the new mocked cluster's expected ID.

# test_helpers.py

from databricks.sdk import WorkspaceClient
from helpers import *
from unittest.mock import create_autospec # Included with the Python standard library.

def test_create_cluster():
  # Create a mock WorkspaceClient.
  mock_workspace_client = create_autospec(WorkspaceClient)

  # Set the mock WorkspaceClient's clusters.create().cluster_id value.
  mock_workspace_client.clusters.create.return_value.cluster_id = '123abc'

  # Call the actual function but with the mock WorkspaceClient.
  # Replace <spark-version> with the target Spark version string.
  # Replace <node-type-id> with the target node type string.
  response = create_cluster(
    w = mock_workspace_client,
    cluster_name            = 'Test Cluster',
    spark_version           = '<spark-version>',
    node_type_id            = '<node-type-id>',
    autotermination_minutes = 15,
    num_workers             = 1
  )

  # Assert that the function returned the mocked cluster ID.
  assert response.cluster_id == '123abc'

To run this test, run the pytest command from the code project's root, which should produce test results similar to the following:

$ pytest
=================== test session starts ====================
platform darwin -- Python 3.12.2, pytest-8.1.1, pluggy-1.4.0
rootdir: <project-rootdir>
collected 1 item

test_helpers.py . [100%]
======================== 1 passed ==========================
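A complementary check is to verify that the function forwards its arguments unchanged to clusters.create. The following sketch repeats a function with the same shape as create_cluster in helpers.py so that it is self-contained. It uses a plain MagicMock rather than create_autospec(WorkspaceClient), an assumption made so that the snippet runs even where the SDK is not installed. The argument values are sample data.

```python
from unittest.mock import MagicMock


def create_cluster(w, cluster_name, spark_version, node_type_id,
                   autotermination_minutes, num_workers):
    # Same shape as the create_cluster function in helpers.py.
    return w.clusters.create(
        cluster_name=cluster_name,
        spark_version=spark_version,
        node_type_id=node_type_id,
        autotermination_minutes=autotermination_minutes,
        num_workers=num_workers,
    )


def test_create_cluster_forwards_arguments():
    # A stand-in for WorkspaceClient that records every call made on it.
    mock_workspace_client = MagicMock()

    create_cluster(
        w=mock_workspace_client,
        cluster_name="Test Cluster",
        spark_version="13.3.x-scala2.12",
        node_type_id="Standard_DS3_v2",
        autotermination_minutes=15,
        num_workers=1,
    )

    # Assert that clusters.create was called exactly once with these values.
    mock_workspace_client.clusters.create.assert_called_once_with(
        cluster_name="Test Cluster",
        spark_version="13.3.x-scala2.12",
        node_type_id="Standard_DS3_v2",
        autotermination_minutes=15,
        num_workers=1,
    )


# pytest discovers the test by name; calling it directly also works.
test_create_cluster_forwards_arguments()
```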

Additional resources

For more information, see:

