This article shows you how to use Python to create and manage directories and files in storage accounts that have a hierarchical namespace.
To learn about how to get, set, and update the access control lists (ACL) of directories and files, see Use Python to manage ACLs in Azure Data Lake Storage.
Package (PyPI) | Samples | API reference | Gen1 to Gen2 mapping | Give Feedback
Prerequisites
An Azure subscription. See Get Azure free trial.
A storage account that has hierarchical namespace enabled. Follow these instructions to create one.
Set up your project
This section walks you through preparing a project to work with the Azure Data Lake Storage client library for Python.
From your project directory, install packages for the Azure Data Lake Storage and Azure Identity client libraries using the pip install command. The azure-identity package is needed for passwordless connections to Azure services.
pip install azure-storage-file-datalake azure-identity
Then open your code file and add the necessary import statements. In this example, we add the following to our .py file:
import os
from azure.storage.filedatalake import (
DataLakeServiceClient,
DataLakeDirectoryClient,
FileSystemClient
)
from azure.identity import DefaultAzureCredential
Note
Multi-protocol access on Data Lake Storage enables applications to use both Blob APIs and Data Lake Storage Gen2 APIs to work with data in storage accounts with hierarchical namespace (HNS) enabled. When working with capabilities unique to Data Lake Storage Gen2, such as directory operations and ACLs, use the Data Lake Storage Gen2 APIs, as shown in this article.
When choosing which APIs to use in a given scenario, consider the workload and the needs of your application, along with the known issues and impact of HNS on workloads and applications.
Connect to the account
To work with the code examples in this article, you need to create an authorized DataLakeServiceClient instance that represents the storage account. You can authorize a DataLakeServiceClient object using Microsoft Entra ID, an account access key, or a shared access signature (SAS).
You can use the Azure identity client library for Python to authenticate your application with Microsoft Entra ID.
Create an instance of the DataLakeServiceClient class and pass in a DefaultAzureCredential object.
def get_service_client_token_credential(self, account_name: str) -> DataLakeServiceClient:
    account_url = f"https://{account_name}.dfs.core.windows.net"
    token_credential = DefaultAzureCredential()

    service_client = DataLakeServiceClient(account_url, credential=token_credential)

    return service_client
To learn more about using DefaultAzureCredential to authorize access to data, see Overview: Authenticate Python apps to Azure using the Azure SDK.
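As a quick way to confirm the credential works, you can create the client and list the account's file systems. This is a minimal sketch, not part of the library samples; the account name is a placeholder, and it assumes your signed-in identity has a data-access role such as Storage Blob Data Contributor on the account:
# Minimal connectivity check; "mystorageaccount" is a placeholder account name
account_url = "https://mystorageaccount.dfs.core.windows.net"
service_client = DataLakeServiceClient(account_url, credential=DefaultAzureCredential())

# Listing file systems confirms that the credential can reach the account
for file_system in service_client.list_file_systems():
    print(file_system.name)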
To use a shared access signature (SAS) token, provide the token as a string and initialize a DataLakeServiceClient object. If your account URL includes the SAS token, omit the credential parameter.
def get_service_client_sas(self, account_name: str, sas_token: str) -> DataLakeServiceClient:
    account_url = f"https://{account_name}.dfs.core.windows.net"

    # The SAS token string can be passed in as credential param or appended to the account URL
    service_client = DataLakeServiceClient(account_url, credential=sas_token)

    return service_client
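As the comment above notes, the SAS token can instead be appended to the account URL, in which case the credential parameter is omitted. A minimal sketch of that variant (the helper name here is ours, not from the library):
# Sketch: carry the SAS token in the URL query string instead of the credential param
def get_service_client_sas_url(self, account_name: str, sas_token: str) -> DataLakeServiceClient:
    account_url = f"https://{account_name}.dfs.core.windows.net"

    # No credential argument is needed because the token is part of the URL
    service_client = DataLakeServiceClient(f"{account_url}?{sas_token}")

    return service_client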
To learn more about generating and managing SAS tokens, see Grant limited access to Azure Storage resources using shared access signatures (SAS).
You can authorize access to data using your account access keys (Shared Key). The following code example creates a DataLakeServiceClient instance that is authorized with the account key:
def get_service_client_account_key(self, account_name: str, account_key: str) -> DataLakeServiceClient:
    account_url = f"https://{account_name}.dfs.core.windows.net"
    service_client = DataLakeServiceClient(account_url, credential=account_key)

    return service_client
Caution
Authorization with Shared Key is not recommended as it may be less secure. For optimal security, disable authorization via Shared Key for your storage account, as described in Prevent Shared Key authorization for an Azure Storage account.
Use of access keys and connection strings should be limited to initial proof of concept apps or development prototypes that don't access production or sensitive data. Otherwise, the token-based authentication classes available in the Azure SDK should always be preferred when authenticating to Azure resources.
Microsoft recommends that clients use either Microsoft Entra ID or a shared access signature (SAS) to authorize access to data in Azure Storage. For more information, see Authorize operations for data access.
Create a container
A container acts as a file system for your files. You can create a container by using the DataLakeServiceClient.create_file_system method.
The following code example creates a container and returns a FileSystemClient object for later use:
def create_file_system(self, service_client: DataLakeServiceClient, file_system_name: str) -> FileSystemClient:
    file_system_client = service_client.create_file_system(file_system=file_system_name)

    return file_system_client
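If a container with that name already exists, create_file_system raises an error. One hedged way to handle this, using the ResourceExistsError exception from azure-core, is to fall back to a client for the existing container (the helper name is ours, not from the library):
from azure.core.exceptions import ResourceExistsError

# Sketch: create the container, or get a client for it if it already exists
def create_file_system_if_not_exists(self, service_client: DataLakeServiceClient, file_system_name: str) -> FileSystemClient:
    try:
        file_system_client = service_client.create_file_system(file_system=file_system_name)
    except ResourceExistsError:
        # The container already exists, so just get a client that points to it
        file_system_client = service_client.get_file_system_client(file_system=file_system_name)

    return file_system_client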
Create a directory
You can create a directory reference in the container by using the FileSystemClient.create_directory method.
The following code example adds a directory to a container and returns a DataLakeDirectoryClient object for later use:
def create_directory(self, file_system_client: FileSystemClient, directory_name: str) -> DataLakeDirectoryClient:
    directory_client = file_system_client.create_directory(directory_name)

    return directory_client
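If the directory already exists, you don't need to create it again; you can get a DataLakeDirectoryClient for it directly. A minimal sketch (the helper name is ours):
# Sketch: get a client for a directory that already exists in the container
def get_existing_directory(self, file_system_client: FileSystemClient, directory_name: str) -> DataLakeDirectoryClient:
    directory_client = file_system_client.get_directory_client(directory_name)

    return directory_client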
Rename or move a directory
You can rename or move a directory by using the DataLakeDirectoryClient.rename_directory method.
Pass the path with the new directory name in the new_name argument. The value must have the following format: {filesystem}/{directory}/{subdirectory}.
The following code example shows how to rename a subdirectory:
def rename_directory(self, directory_client: DataLakeDirectoryClient, new_dir_name: str):
    directory_client.rename_directory(
        new_name=f"{directory_client.file_system_name}/{new_dir_name}")
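Because new_name takes a full path, the same method can also move a directory under a different parent. A hedged sketch of that variant (the helper name and parameters are ours, not from the library):
# Sketch: move a directory under a different parent by passing a full path
def move_directory(self, directory_client: DataLakeDirectoryClient, new_parent_dir: str, new_dir_name: str):
    directory_client.rename_directory(
        new_name=f"{directory_client.file_system_name}/{new_parent_dir}/{new_dir_name}")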
Upload a file to a directory
You can upload content to a new or existing file by using the DataLakeFileClient.upload_data method.
The following code example shows how to upload a file to a directory using the upload_data method:
def upload_file_to_directory(self, directory_client: DataLakeDirectoryClient, local_path: str, file_name: str):
    file_client = directory_client.get_file_client(file_name)

    with open(file=os.path.join(local_path, file_name), mode="rb") as data:
        file_client.upload_data(data, overwrite=True)
You can use this method to create and upload content to a new file, or you can set the overwrite argument to True to overwrite an existing file.
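The data argument doesn't have to be a file stream; upload_data also accepts bytes or a string. A minimal sketch that writes in-memory data to a file (the helper name is ours, not from the library):
# Sketch: upload in-memory bytes instead of reading from a local file
def upload_bytes_to_directory(self, directory_client: DataLakeDirectoryClient, file_name: str, data: bytes):
    file_client = directory_client.get_file_client(file_name)

    # overwrite=True creates the file, or replaces its contents if it already exists
    file_client.upload_data(data, overwrite=True)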
You can upload data to be appended to a file by using the DataLakeFileClient.append_data method.
The following code example shows how to append data to the end of a file using these steps:
Create a DataLakeFileClient object to represent the file resource you're working with.
Upload data to the file using the append_data method.
Complete the upload by calling the flush_data method.
def append_data_to_file(self, directory_client: DataLakeDirectoryClient, file_name: str):
    file_client = directory_client.get_file_client(file_name)
    file_size = file_client.get_file_properties().size

    data = b"Data to append to end of file"
    file_client.append_data(data, offset=file_size, length=len(data))

    file_client.flush_data(file_size + len(data))
With this method, data can only be appended to a file and the operation is limited to 4000 MiB per request.
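Because each append_data call is limited to 4000 MiB, larger payloads need to be appended across multiple requests before a single flush_data call commits the new total length. A hedged sketch of that pattern (the helper name is ours, and the chunk size is an assumption based on the stated limit):
# Sketch: append a large payload in chunks, then commit with one flush_data call
def append_large_data_to_file(self, directory_client: DataLakeDirectoryClient, file_name: str, data: bytes):
    chunk_size = 4000 * 1024 * 1024  # stay within the 4000 MiB per-request limit

    file_client = directory_client.get_file_client(file_name)
    offset = file_client.get_file_properties().size

    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        file_client.append_data(chunk, offset=offset, length=len(chunk))
        offset += len(chunk)

    # flush_data commits everything appended so far at the new total length
    file_client.flush_data(offset)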
Download from a directory
The following code example shows how to download a file from a directory to a local file using these steps:
Create a DataLakeFileClient object to represent the file you want to download.
Open a local file for writing.
Call the DataLakeFileClient.download_file method to read the file, then write the data to the local file.
def download_file_from_directory(self, directory_client: DataLakeDirectoryClient, local_path: str, file_name: str):
    file_client = directory_client.get_file_client(file_name)

    with open(file=os.path.join(local_path, file_name), mode="wb") as local_file:
        download = file_client.download_file()
        local_file.write(download.readall())
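readall loads the entire file into memory. For large files, one way to consume the StorageStreamDownloader returned by download_file incrementally is its chunks() iterator; a minimal sketch (the helper name is ours, not from the library):
# Sketch: stream a large file to disk chunk by chunk instead of buffering it all in memory
def download_large_file_from_directory(self, directory_client: DataLakeDirectoryClient, local_path: str, file_name: str):
    file_client = directory_client.get_file_client(file_name)

    with open(file=os.path.join(local_path, file_name), mode="wb") as local_file:
        download = file_client.download_file()
        for chunk in download.chunks():
            local_file.write(chunk)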
List directory contents
You can list directory contents by using the FileSystemClient.get_paths method and enumerating the result.
Enumerating the paths in the result may make multiple requests to the service while fetching the values.
The following code example prints the path of each subdirectory and file that is located in a directory:
def list_directory_contents(self, file_system_client: FileSystemClient, directory_name: str):
    paths = file_system_client.get_paths(path=directory_name)

    for path in paths:
        print(path.name + '\n')
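Each item returned by get_paths is a PathProperties object, so you can also inspect attributes such as is_directory to distinguish subdirectories from files. A minimal sketch (the helper name is ours):
# Sketch: list only the subdirectories within a directory
def list_subdirectories(self, file_system_client: FileSystemClient, directory_name: str):
    paths = file_system_client.get_paths(path=directory_name)

    for path in paths:
        if path.is_directory:
            print(path.name)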
Delete a directory
You can delete a directory by using the DataLakeDirectoryClient.delete_directory method.
The following code example shows how to delete a directory:
def delete_directory(self, directory_client: DataLakeDirectoryClient):
    directory_client.delete_directory()
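Deleting a single file works the same way through its client, using the DataLakeFileClient.delete_file method. A minimal sketch (the helper name is ours):
# Sketch: delete a single file within a directory
def delete_file_from_directory(self, directory_client: DataLakeDirectoryClient, file_name: str):
    file_client = directory_client.get_file_client(file_name)
    file_client.delete_file()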
See also
Use Python to manage ACLs in Azure Data Lake Storage