A RetroSearch Logo

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Search Query:

Showing content from https://learn.microsoft.com/en-us/azure/machine-learning/how-to-read-write-data-v2 below:

Access data in a job - Azure Machine Learning

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

In this article you learn:

Prerequisites Quickstart

Before you explore the detailed options available to you when you access data, we first describe the relevant code snippets for data access.

Read data from Azure storage in an Azure Machine Learning job

In this example, you submit an Azure Machine Learning job that accesses data from a public blob storage account. However, you can adapt the snippet to access your own data in a private Azure Storage account. Update the path as described here. Azure Machine Learning seamlessly handles authentication to cloud storage, with Microsoft Entra passthrough. When you submit a job, you can choose:

Tip

If you use keys or SAS tokens to authenticate, we recommend creating an Azure Machine Learning datastore. The runtime automatically connects to storage without exposing your credentials.

from azure.ai.ml import command, Input, MLClient, UserIdentityConfiguration, ManagedIdentityConfiguration
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.identity import DefaultAzureCredential

# Set your subscription, resource group and workspace name:
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

# connect to the AzureML workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# ==============================================================
# Set the URI path for the data.
# Supported `path` formats for input include:
# local: `./<path>
# Blob: wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>
# ADLS: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>
# Datastore: azureml://datastores/<data_store_name>/paths/<path>
# Data Asset: azureml:<my_data>:<version>
# Supported `path` format for output is:
# Datastore: azureml://datastores/<data_store_name>/paths/<path>
# We set the input path to a file on a public blob container
# ==============================================================
path = "wasbs://data@azuremlexampledata.blob.core.windows.net/titanic.csv"


# ==============================================================
# What type of data does the path point to? Options include:
# data_type = AssetTypes.URI_FILE # a specific file
# data_type = AssetTypes.URI_FOLDER # a folder
# data_type = AssetTypes.MLTABLE # an mltable
# The path we set above is a specific file
# ==============================================================
data_type = AssetTypes.URI_FILE

# ==============================================================
# Set the mode. The popular modes include:
# mode = InputOutputModes.RO_MOUNT # Read-only mount on the compute target
# mode = InputOutputModes.DOWNLOAD # Download the data to the compute target
# ==============================================================
mode = InputOutputModes.RO_MOUNT

# ==============================================================
# You can set the identity you want to use in a job to access the data. Options include:
# identity = UserIdentityConfiguration() # Use the user's identity
# identity = ManagedIdentityConfiguration() # Use the compute target managed identity
# ==============================================================
# This example accesses public data, so we don't need an identity.
# You also set identity to None if you use a credential-based datastore
identity = None

# Set the input for the job:
inputs = {
    "input_data": Input(type=data_type, path=path, mode=mode)
}

# This command job uses the head Linux command to print the first 10 lines of the file
job = command(
    command="head ${{inputs.input_data}}",
    inputs=inputs,
    environment="azureml://registries/azureml/environments/sklearn-1.1/versions/4",
    compute="cpu-cluster",
    identity=identity,
)

# Submit the command
ml_client.jobs.create_or_update(job)

Create a job specification YAML file (<file-name>.yml).

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

# ==============================================================
# Set the URI path for the data.
# Supported `path` formats for input include:
# local: `./<path>
# Blob: wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>
# ADLS: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>
# Datastore: azureml://datastores/<data_store_name>/paths/<path>
# Data Asset: azureml:<my_data>:<version>
# Supported `path` format for output is:
# Datastore: azureml://datastores/<data_store_name>/paths/<path>
# ==============================================================

# ==============================================================
# What type of data does the path point to? Options include:
# type: uri_file # a specific file
# type: uri_folder # a folder
# type: mltable # an mltable
# ==============================================================

# ==============================================================
# Set the mode. The popular modes include:
# mode: ro_mount # Read-only mount on the compute target
# mode: download # Download the data to the compute target
# ==============================================================

# ==============================================================
# You can set the identity you want to use in a job to access the data. Options include:
# identity:
#   type: user_identity
# identity:
#   type: managed_identity
# ==============================================================
# This example accesses public data, so we don't need an identity.
# You don't need to set an identity to if you use a credential-based datastore

# This command job prints the first 10 lines of the file

type: command
command: head ${{inputs.input_data}}
compute: azureml:cpu-cluster
environment: azureml://registries/azureml/environments/sklearn-1.1/versions/4
inputs:
  input_data:
    mode: ro_mount
    path: wasbs://data@azuremlexampledata.blob.core.windows.net/titanic.csv
    type: uri_file

Next, submit your job using the CLI:

az ml job create -f <file-name>.yml
Write data from your Azure Machine Learning job to Azure Storage

In this example, you submit an Azure Machine Learning job that writes data to your default Azure Machine Learning Datastore. You can optionally set the name value of your data asset to create a data asset in the output.

from azure.ai.ml import command, Input, Output, MLClient
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.identity import DefaultAzureCredential

# Set your subscription, resource group and workspace name:
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

# connect to the AzureML workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# ==============================================================
# Set the URI path for the data.
# Supported `path` formats for input include:
# local: `./<path>
# Blob: wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>
# ADLS: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>
# Datastore: azureml://datastores/<data_store_name>/paths/<path>
# Data Asset: azureml:<my_data>:<version>
# Supported `path` format for output is:
# Datastore: azureml://datastores/<data_store_name>/paths/<path>
# As an example, we set the input path to a file on a public blob container
# As an example, we set the output path to a folder in the default datastore
# ==============================================================
input_path = "wasbs://data@azuremlexampledata.blob.core.windows.net/titanic.csv"
output_path = "azureml://datastores/workspaceblobstore/paths/quickstart-output/titanic.csv"

# ==============================================================
# What type of data are you pointing to?
# AssetTypes.URI_FILE (a specific file)
# AssetTypes.URI_FOLDER (a folder)
# AssetTypes.MLTABLE (a table)
# The path we set above is a specific file
# ==============================================================
data_type = AssetTypes.URI_FILE

# ==============================================================
# Set the input mode. The most commonly-used modes:
# InputOutputModes.RO_MOUNT
# InputOutputModes.DOWNLOAD
# Set the mode to Read Only (RO) to mount the data
# ==============================================================
input_mode = InputOutputModes.RO_MOUNT

# ==============================================================
# Set the output mode. The most commonly-used modes:
# InputOutputModes.RW_MOUNT
# InputOutputModes.UPLOAD
# Set the mode to Read Write (RW) to mount the data
# ==============================================================
output_mode = InputOutputModes.RW_MOUNT

# Set the input and output for the job:
inputs = {
    "input_data": Input(type=data_type, path=input_path, mode=input_mode)
}

outputs = {
    "output_data": Output(type=data_type, 
                          path=output_path, 
                          mode=output_mode,
                          # optional: if you want to create a data asset from the output, 
                          # then uncomment `name` (`name` can be set without setting `version`, and in this way, we will set `version` automatically for you)
                          # name = "<name_of_data_asset>", # use `name` and `version` to create a data asset from the output
                          # version = "<version>",
                  )
}

# This command job copies the data to your default Datastore
job = command(
    command="cp ${{inputs.input_data}} ${{outputs.output_data}}",
    inputs=inputs,
    outputs=outputs,
    environment="azureml://registries/azureml/environments/sklearn-1.1/versions/4",
    compute="cpu-cluster",
)

# Submit the command
ml_client.jobs.create_or_update(job)

Create a job specification YAML file (<file-name>.yml):

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

# path: Set the URI path for the data. Supported paths include
# local: `./<path>
# Blob: wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>
# ADLS: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>
# Datastore: azureml://datastores/<data_store_name>/paths/<path>
# Data Asset: azureml:<my_data>:<version>

# type: What type of data are you pointing to?
# uri_file (a specific file)
# uri_folder (a folder)
# mltable (a table)

# mode: Set INPUT mode:
# ro_mount (read-only mount)
# download (download from storage to node)
# mode: Set the OUTPUT mode
# rw_mount (read-write mount)
# upload (upload data from node to storage)

type: command
command: cp ${{inputs.input_data}} ${{outputs.output_data}}
compute: azureml:cpu-cluster
environment: azureml://registries/azureml/environments/sklearn-1.1/versions/4
inputs:
  input_data:
    mode: ro_mount
    path: wasbs://data@azuremlexampledata.blob.core.windows.net/titanic.csv
    type: uri_file
outputs:
  output_data:
    mode: rw_mount
    path: azureml://datastores/workspaceblobstore/paths/quickstart-output/titanic.csv
    type: uri_file
    # optional: if you want to create a data asset from the output, 
    # then uncomment `name` (`name` can be set without setting `version`, and in this way, we will set `version` automatically for you)
    # name: <name_of_data_asset> # use `name` and `version` to create a data asset from the output
    # version: <version>

Next, submit the job using the CLI:

az ml job create --file <file-name>.yml
The Azure Machine Learning data runtime

When you submit a job, the Azure Machine Learning data runtime controls the data load, from the storage location to the compute target. The Azure Machine Learning data runtime is optimized for speed and efficiency for machine learning tasks. The key benefits include:

Tip

We suggest that you apply the Azure Machine Learning data runtime, instead of creation of your own mounting/downloading capability in your training (client) code. We have observed storage throughput constraints when the client code uses Python to download data from storage, because of Global Interpreter Lock (GIL) issues.

Paths

When you provide a data input or output to a job, you must specify a path parameter that points to the data location. This table shows the different data locations that Azure Machine Learning supports and provides path parameter examples:

Location Examples Input Output A path on your local computer ./home/username/data/my_data Y N A path on a public HTTP(S) server https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv Y N A path on Azure Storage wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>
abfss://<file_system>@<account_name>.dfs.core.windows.net/<path> Y, only for identity-based authentication N A path on an Azure Machine Learning datastore azureml://datastores/<data_store_name>/paths/<path> Y Y A path to a data asset azureml:<my_data>:<version> Y N, but you can use name and version to create a data asset from output Modes

When you run a job with data inputs/outputs, you can select from these mode options:

This table shows the possible modes for different type/mode/input/output combinations:

Type Input/Output upload download ro_mount rw_mount direct eval_download eval_mount uri_folder Input ✓ ✓ ✓ uri_file Input ✓ ✓ ✓ mltable Input ✓ ✓ ✓ ✓ ✓ uri_folder Output ✓ ✓ uri_file Output ✓ ✓ mltable Output ✓ ✓ ✓ Download

In download mode, all the input data is copied to the local disk (SSD) of the compute target. The Azure Machine Learning data runtime starts the user training script, once all the data is copied. When the user script starts, it reads data from the local disk, just like any other files. When the job finishes, the data is removed from the disk of the compute target.

Advantages Disadvantages When training starts, all the data is available on the local disk (SSD) of the compute target, for the training script. No Azure storage / network interaction is required. The dataset must completely fit on a compute target disk. After the user script starts, there are no dependencies on storage / network reliability. The entire dataset is downloaded (if training needs to randomly select only a small portion of a data, much of the download is then wasted). Azure Machine Learning data runtime can parallelize the download (significant difference on many small files) and max network / storage throughput. The job waits until all data downloads to the local disk of the compute target. For a submitted deep-learning job, the GPUs idle until data is ready. No unavoidable overhead added by the FUSE layer (roundtrip: user space call in user script → kernel → user space fuse daemon → kernel → response to user script in user space) Storage changes aren't reflected on the data after download is done. When to use download Available download settings

You can tune the download settings with these environment variables in your job:

Environment Variable Name Type Default Value Description RSLEX_DOWNLOADER_THREADS u64 NUMBER_OF_CPU_CORES * 4 Number of concurrent threads download can use AZUREML_DATASET_HTTP_RETRY_COUNT u64 7 Number of retry attempts of individual storage / http request to recover from transient errors.

In your job, you can change the above defaults by setting the environment variables - for example:

For brevity, we only show how to define the environment variables in the job.

from azure.ai.ml import command

env_var = {
"RSLEX_DOWNLOADER_THREADS": 64,
"AZUREML_DATASET_HTTP_RETRY_COUNT": 10
}

job = command(
        environment_variables=env_var
)

For brevity, we only show how to define the environment variables in the job.

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

environment_variables:
    RSLEX_DOWNLOADER_THREADS: 64
    AZUREML_DATASET_HTTP_RETRY_COUNT: 10
Download performance metrics

The VM size of your compute target has an effect on the download time of your data. Specifically:

Note

For A100 GPU VMs, the Azure Machine Learning data runtime can saturate the NIC (Network Interface Card) when downloading data to the compute target (~24 Gbit/s): The theoretical maximum throughput possible.

This table shows the download performance the Azure Machine Learning data runtime can handle for a 100-GB file on a Standard_D15_v2 VM (20cores, 25 Gbit/s Network throughput):

Data structure Download only (secs) Download and calculate MD5 (secs) Throughput Achieved (Gbit/s) 10 x 10 GB Files 55.74 260.97 14.35 Gbit/s 100 x 1 GB Files 58.09 259.47 13.77 Gbit/s 1 x 100 GB File 96.13 300.61 8.32 Gbit/s

We can see that a larger file, broken up into smaller files, can improve download performance due to parallelism. We recommend that you avoid files that become too small (less than 4 MB) because the time needed for storage request submissions increases, relative to time spent downloading the payload. For more information, read Many small files problem.

Mount (streaming)

In mount mode, the Azure Machine Learning data capability uses the FUSE (filesystem in user space) Linux feature, to create an emulated filesystem. Instead of downloading all the data to the local disk (SSD) of the compute target, the runtime can react to the user's script actions in real-time. For example, "open file", "read 2-KB chunk from position X", "list directory content".

Advantages Disadvantages Data that exceeds the compute target local disk capacity can be used (not limited by compute hardware) Added overhead of the Linux FUSE module. No delay at the start of training (unlike download mode). Dependency on user's code behavior (if the training code that sequentially reads small files in a single thread mount also requests data from storage, it might not maximize the network or storage throughput). More available settings to tune for a usage scenario. No windows support. Only data needed for training is read from storage. When to use Mount Available mount settings

You can tune the mount settings with these environment variables in your job:

Env variable name Type Default value Description DATASET_MOUNT_ATTRIBUTE_CACHE_TTL u64 Not set (cache never expires) Time, in milliseconds, needed to keep the getattr call results in cache, and to avoid subsequent requests of this info from storage again. DATASET_RESERVED_FREE_DISK_SPACE u64 150 MB Intended for a system configuration, to keep compute healthy. No matter what values the other settings have, Azure Machine Learning data runtime doesn't use the last RESERVED_FREE_DISK_SPACE bytes of disk space. DATASET_MOUNT_CACHE_SIZE usize Unlimited Controls how much disk space mount can use. A positive value sets absolute value in bytes. Negative value sets how much of a disk space to leave free. This table provides more disk cache options. Supports KB, MB and GB modifiers for convenience. DATASET_MOUNT_FILE_CACHE_PRUNE_THRESHOLD f64 1.0 Volume mount starts cache pruning when cache is filled up to AVAILABLE_CACHE_SIZE * DATASET_MOUNT_FILE_CACHE_PRUNE_THRESHOLD. Should be between 0 and 1. Setting it < 1 triggers background cache pruning earlier. AVAILABLE_CACHE_SIZE isn't an environment variable you can modify or view directly. In this context, it refers to the "number of bytes the system calculates as available for caching." This value depends on factors such as disk size, the amount of disk space required for system health, and configurations set in environment variables (like DATASET_RESERVED_FREE_DISK_SPACE and DATASET_MOUNT_CACHE_SIZE). DATASET_MOUNT_FILE_CACHE_PRUNE_TARGET f64 0.7 Pruning cache tries to free at least (1-DATASET_MOUNT_FILE_CACHE_PRUNE_TARGET) of a cache space. DATASET_MOUNT_READ_BLOCK_SIZE usize 2 MB Streaming read block size. When file is large enough, request at least DATASET_MOUNT_READ_BLOCK_SIZE of data from storage, and cache even when fuse requested read operation was for less. DATASET_MOUNT_READ_BUFFER_BLOCK_COUNT usize 32 Number of blocks to prefetch (reading block k triggers background prefetching of blocks k+1, ..., k.+DATASET_MOUNT_READ_BUFFER_BLOCK_COUNT) DATASET_MOUNT_READ_THREADS usize NUMBER_OF_CORES * 4 Number of background prefetching threads. DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED bool false Enable block-based caching. DATASET_MOUNT_MEMORY_CACHE_SIZE usize 128 MB Applies to block-based caching only. Size of RAM block-based caching can use. A value of 0 disables memory caching completely. DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED bool true Applies to block-based caching only. When set to true, block-based caching uses local hard drive to cache blocks. DATASET_MOUNT_BLOCK_FILE_CACHE_MAX_QUEUE_SIZE usize 512 MB Applies to block-based caching only. Block-based caching writes cached block to a local disk in a background. This setting controls how much memory mount can use to store blocks waiting for flush to the local disk cache. DATASET_MOUNT_BLOCK_FILE_CACHE_WRITE_THREADS usize NUMBER_OF_CORES * 2 Applies to block-based caching only. Number of background threads block-based caching uses to write downloaded blocks to the local disk of the compute target. DATASET_UNMOUNT_TIMEOUT_SECONDS u64 30 Time in seconds for unmount to (gracefully) finish all pending operations (for example, flush calls) before forcefully terminating the mount message loop.

In your job, you can change the above defaults by setting the environment variables, for example:

from azure.ai.ml import command

env_var = {
"DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED": True
}

job = command(
        environment_variables=env_var
)
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

environment_variables:
    DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED: true
Block-based open mode

Block-based open mode splits each file into blocks of a predefined size (except for the last block). A read request from a specified position requests a corresponding block from storage, and returns the requested data immediately. A read also triggers background prefetching of N next blocks, using multiple threads (optimized for sequential read). Downloaded blocks are cached in two layer cache (RAM and local disk).

Advantages Disadvantages Fast data delivery to the training script (less blocking for chunks that weren't yet requested). Random reads may waste forward-prefetched blocks. More work offloads to background threads (prefetching / caching). The training can then proceed. Added overhead to navigate between caches, compared to direct reads from a file on a local disk cache (for example, in whole-file cache mode). Only requested data (plus prefetching) is read from storage. For small enough data, fast RAM-based cache is used. When to use block-based open mode

Recommended for most scenarios except when you need fast reads from random file locations. In those cases, use Whole file cache open mode.

Whole file cache open mode

When a file under a mount folder is opened (for example, f = open(path, args)) in whole file mode, the call is blocked until the entire file is downloaded into a compute target cache folder on the disk. All subsequent read calls redirect to the cached file, so no storage interaction is needed. If cache doesn't have enough available space to fit the current file, mount tries to prune by deleting the least-recently used file from the cache. In cases where the file can't fit on disk (with respect to cache settings), the data runtime falls back to streaming mode.

Advantages Disadvantages No storage reliability / throughput dependencies after the file is opened. Open call is blocked until the entire file is downloaded. Fast random reads (reading chunks from random places of the file). The entire file is read from storage, even when some portions of the file may not be needed. When to use it

When random reads are needed for relatively large files that exceed 128 MB.

Usage

Set environment variable DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED to false in your job:

from azure.ai.ml import command

env_var = {
"DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED": False
}

job = command(
        environment_variables=env_var
)
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

environment_variables:
    DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED: false
Mount: Listing files

When working with millions of files, avoid a recursive listing - for example ls -R /mnt/dataset/folder/. A recursive listing triggers many calls to list the directory contents of the parent directory. It then requires a separate recursive call for each directory inside, at all child levels. Typically, Azure Storage allows only 5,000 elements to be returned per single list request. As a result, a recursive listing of 1M folders containing 10 files each requires 1,000,000 / 5000 + 1,000,000 = 1,000,200 requests to storage. In comparison, 1,000 folders with 10,000 files would only need 1,001 requests to storage for a recursive listing.

Azure Machine Learning mount handles listing in a lazy manner. Therefore, to list many small files, it's better to use an iterative client library call (for example, os.scandir() in Python) instead of a client library call that returns the full list (for example, os.listdir() in Python). An iterative client library call returns a generator, meaning that it doesn't need to wait until the entire list loads. It can then proceed faster.

This table compares the time needed for the Python os.scandir() and os.listdir() functions to list a folder that contains ~4M files in a flat structure:

Metric os.scandir() os.listdir() Time to get first entry (secs) 0.67 553.79 Time to get first 50k entries (secs) 9.56 562.73 Time to get all entries (secs) 558.35 582.14 Optimum mount settings for common scenarios

For certain common scenarios, we show the optimal mount settings you need to set in your Azure Machine Learning job.

Reading large file sequentially one time (processing lines in csv file)

Include these mount settings in the environment_variables section of your Azure Machine Learning job:

from azure.ai.ml import command

env_var = {
  "DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED": True, # Enable block-based caching
  "DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED": False, # Disable caching on disk
  "DATASET_MOUNT_MEMORY_CACHE_SIZE": 0, # Disabling in-memory caching

  # Increase the number of blocks used for prefetch. This leads to use of more RAM (2 MB * #value set).
  # Can adjust up and down for fine-tuning, depending on the actual data processing pattern.
  # An optimal setting based on our test ~= the number of prefetching threads (#CPU_CORES * 4 by default)
  "DATASET_MOUNT_READ_BUFFER_BLOCK_COUNT": 80,
}

job = command(
        environment_variables=env_var
)
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

environment_variables:
  DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED: true # Enable block-based caching
  DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED: false # Disable caching on disk
  DATASET_MOUNT_MEMORY_CACHE_SIZE: 0 # Disabling in-memory caching

  # Increase the number of blocks used for prefetch. This leads to use of more RAM (2 MB * #value set).
  # Can adjust up and down for fine-tuning, depending on the actual data processing pattern.
  # An optimal setting based on our test ~= the number of prefetching threads (#CPU_CORES * 4 by default)
  DATASET_MOUNT_READ_BUFFER_BLOCK_COUNT: 80
Reading large file one time from multiple threads (processing partitioned csv file in multiple threads)

Include these mount settings in the environment_variables section of your Azure Machine Learning job:

from azure.ai.ml import command

env_var = {
  "DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED": True, # Enable block-based caching
  "DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED": False, # Disable caching on disk
  "DATASET_MOUNT_MEMORY_CACHE_SIZE": 0, # Disabling in-memory caching
}

job = command(
        environment_variables=env_var
)
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

environment_variables:
  DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED: true # Enable block-based caching
  DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED: false # Disable caching on disk
  DATASET_MOUNT_MEMORY_CACHE_SIZE: 0 # Disabling in-memory caching
Reading millions of small files (images) from multiple threads one time (single epoch training on images)

Include these mount settings in the environment_variables section of your Azure Machine Learning job:

from azure.ai.ml import command

env_var = {
  "DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED": True, # Enable block-based caching
  "DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED": False, # Disable caching on disk
  "DATASET_MOUNT_MEMORY_CACHE_SIZE": 0, # Disabling in-memory caching
}

job = command(
        environment_variables=env_var
)
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

environment_variables:
  DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED: true # Enable block-based caching
  DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED: false # Disable caching on disk
  DATASET_MOUNT_MEMORY_CACHE_SIZE: 0 # Disabling in-memory caching
Reading millions of small files (images) from multiple threads multiple times (multiple epochs training on images)

Include these mount settings in the environment_variables section of your Azure Machine Learning job:

from azure.ai.ml import command

env_var = {
  "DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED": True, # Enable block-based caching
}

job = command(
        environment_variables=env_var
)
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

environment_variables:
  DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED: true # Enable block-based caching
Reading large file with random seeks (like serving file database from mounted folder)

Include these mount settings in the environment_variables section of your Azure Machine Learning job:

from azure.ai.ml import command

env_var = {
  "DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED": False, # Disable block-based caching
}

job = command(
        environment_variables=env_var
)
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

environment_variables:
  DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED: false # Disable block-based caching
Diagnosing and solving data loading bottlenecks

When an Azure Machine Learning job executes with data, the mode of an input determines how bytes are read from storage and cached on the compute target local SSD disk. For download mode, all the data caches on disk before the user code starts its execution. Several factors affect maximum download speeds:

For mount mode, the user code must start to open files before the data starts to cache. Different mount settings result in different reading and caching behavior. Various factors affect the speed that data loads from storage:

Using logs to diagnose issues

To access the logs of the data runtime from your job:

  1. Select Outputs+Logs tab from the job page.
  2. Select the system_logs folder, followed by data_capability folder.
  3. You should see two log files:

The log file data-capability.log shows the high-level information about the time spent on key data loading tasks. For example, when you download data, the runtime logs the download activity start and finish times:

INFO 2023-05-18 17:14:47,790 sdk_logger.py:44 [28] - ActivityStarted, download
INFO 2023-05-18 17:14:50,295 sdk_logger.py:44 [28] - ActivityCompleted: Activity=download, HowEnded=Success, Duration=2504.39 [ms]

If the download throughput is a fraction of the expected network bandwidth for the VM size, you can inspect the log file rslex.log.<TIMESTAMP>. This file contains all the fine-grain logging from the Rust-based runtime; for example, parallelization:

2023-05-18T14:08:25.388670Z  INFO copy_uri:copy_uri:copy_dataset:write_streams_to_files:collect:reduce:reduce_and_combine:reduce:get_iter: rslex::prefetching: close time.busy=23.2µs time.idle=1.90µs sessionId=012ea46a-341c-4258-8aba-90bde4fdfb51 source=Dataset[Partitions: 1, Sources: 1] file_name_column=None break_on_first_error=true skip_existing_files=false parallelization_degree=4 self=Dataset[Partitions: 1, Sources: 1] parallelization_degree=4 self=Dataset[Partitions: 1, Sources: 1] parallelization_degree=4 self=Dataset[Partitions: 1, Sources: 1] parallelization_degree=4 i=0 index=0
2023-05-18T14:08:25.388731Z  INFO copy_uri:copy_uri:copy_dataset:write_streams_to_files:collect:reduce:reduce_and_combine:reduce: rslex::dataset_crossbeam: close time.busy=90.9µs time.idle=9.10µs sessionId=012ea46a-341c-4258-8aba-90bde4fdfb51 source=Dataset[Partitions: 1, Sources: 1] file_name_column=None break_on_first_error=true skip_existing_files=false parallelization_degree=4 self=Dataset[Partitions: 1, Sources: 1] parallelization_degree=4 self=Dataset[Partitions: 1, Sources: 1] parallelization_degree=4 self=Dataset[Partitions: 1, Sources: 1] parallelization_degree=4 i=0
2023-05-18T14:08:25.388762Z  INFO copy_uri:copy_uri:copy_dataset:write_streams_to_files:collect:reduce:reduce_and_combine:combine: rslex::dataset_crossbeam: close time.busy=1.22ms time.idle=9.50µs sessionId=012ea46a-341c-4258-8aba-90bde4fdfb51 source=Dataset[Partitions: 1, Sources: 1] file_name_column=None break_on_first_error=true skip_existing_files=false parallelization_degree=4 self=Dataset[Partitions: 1, Sources: 1] parallelization_degree=4 self=Dataset[Partitions: 1, Sources: 1] parallelization_degree=4 self=Dataset[Partitions: 1, Sources: 1] parallelization_degree=4

The rslex.log file provides details about all the file copying, whether or not you chose the mount or download modes. It also describes the Settings (environment variables) used. To start debugging, check whether you set the Optimum mount settings for common scenarios.

Monitor Azure storage

In the Azure portal, you can select your Storage account, and then Metrics, to see the storage metrics:

You then plot the SuccessE2ELatency with SuccessServerLatency. If the metrics show high SuccessE2ELatency and low SuccessServerLatency, you have limited available threads, or you run low on resources such as CPU, memory, or network bandwidth, you should:

If the metrics show low SuccessE2ELatency and low SuccessServerLatency, but the client experiences high latency, you have a delay in the storage request that reaches the service. You should check:

Monitor disk usage during a job

From the Azure Machine Learning studio, you can also monitor the compute target disk IO and usage during your job execution. Navigate to your job, and select the Monitoring tab. This tab provides insights on the resources of your job, on a 30 day rolling basis. For example:

Note

Job monitoring supports only compute resources that Azure Machine Learning manages. Jobs with a runtime of less than 5 minutes will not have enough data to populate this view.

Azure Machine Learning data runtime doesn't use the last RESERVED_FREE_DISK_SPACE bytes of disk space, to keep the compute healthy (the default value is 150MB). If your disk is full, your code is writing files to disk without declaring the files as an output. Therefore, check your code to make sure that data isn't being written erroneously to temporary disk. If you must write files to temporary disk, and that resource is becoming full, consider:

Colocate data with compute

Caution

If your storage and compute are in different regions, your performance degrades because data must transfer across regions. This increases costs. Make sure that your storage account and compute resources are in the same region.

If your data and Azure Machine Learning Workspace are stored in different regions, we recommend that you copy the data to a storage account in the same region with the azcopy utility. AzCopy uses server-to-server APIs, so that data copies directly between storage servers. These copy operations don't use the network bandwidth of your computer. You can increase the throughput of these operations with the AZCOPY_CONCURRENCY_VALUE environment variable. To learn more, see Increase concurrency.

Storage load

A single storage account can become throttled when it comes under high load, when:

This section shows the calculations to determine if throttling might become an issue for your workload, and how to approach reductions of throttling.

Calculate bandwidth limits

An Azure Storage account has a default egress limit of 120 Gbit/s. Azure VMs have different network bandwidths, which have an effect on the theoretical number of compute nodes needed to hit the maximum default egress capacity of storage:

Size GPU Card vCPU Memory: GiB Temp storage (SSD) GiB Number of GPU Cards GPU memory: GiB Expected network bandwidth (Gbit/s) Storage Account Egress Default Max (Gbit/s)* Number of Nodes to hit default egress capacity Standard_ND96asr_v4 A100 96 900 6000 8 40 24 120 5 Standard_ND96amsr_A100_v4 A100 96 1900 6400 8 80 24 120 5 Standard_NC6s_v3 V100 6 112 736 1 16 24 120 5 Standard_NC12s_v3 V100 12 224 1474 2 32 24 120 5 Standard_NC24s_v3 V100 24 448 2948 4 64 24 120 5 Standard_NC24rs_v3 V100 24 448 2948 4 64 24 120 5 Standard_NC4as_T4_v3 T4 4 28 180 1 16 8 120 15 Standard_NC8as_T4_v3 T4 8 56 360 1 16 8 120 15 Standard_NC16as_T4_v3 T4 16 110 360 1 16 8 120 15 Standard_NC64as_T4_v3 T4 64 440 2880 4 64 32 120 3

Both the A100/V100 SKUs have a maximum network bandwidth per node of 24 Gbit/s. If each node that reads data from a single account can read close to the theoretical maximum of 24 Gbit/s, egress capacity would occur with five nodes. Use of six or more compute nodes would start to degrade data throughput across all nodes.

Important

If your workload needs more than six nodes of A100/V100, or you believe you'll exceed the default egress capacity of storage (120Gbit/s), contact support (via the Azure portal) and request a storage egress limit increase.

Scaling across multiple storage accounts

You might exceed the maximum egress capacity of storage, and/or you might hit the request rate limits. If these issues occur, we suggest that you contact support first, to increase these limits on the storage account.

If you can't increase the maximum egress capacity or request rate limit, you should consider replicating the data across multiple storage accounts. Copy the data to multiple accounts with Azure Data Factory, Azure Storage Explorer, or azcopy, and mount all the accounts in your training job. Only the data accessed on a mount is downloaded. Therefore, your training code can read the RANK from the environment variable, to pick which of the multiple inputs mounts from which to read. Your job definition passes in a list of storage accounts:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >-
  python train.py
  --epochs ${{inputs.epochs}}
  --learning-rate ${{inputs.learning_rate}}
  --data ${{inputs.cifar_storage1}}, ${{inputs.cifar_storage2}}
inputs:
  epochs: 1
  learning_rate: 0.2
  cifar_storage1:
    type: uri_folder
    path: azureml://datastores/storage1/paths/cifar
  cifar_storage2:
    type: uri_folder
    path: azureml://datastores/storage2/paths/cifar
environment: azureml:AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu@latest
compute: azureml:gpu-cluster
distribution:
  type: pytorch
  process_count_per_instance: 1
resources:
  instance_count: 2
display_name: pytorch-cifar-distributed-example
experiment_name: pytorch-cifar-distributed-example
description: Train a basic convolutional neural network (CNN) with PyTorch on the CIFAR-10 dataset, distributed via PyTorch.

Your training python code can then use RANK to get the storage account specific to that node:

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--data', nargs='+')
args = parser.parse_args()

world_size = int(os.environ["WORLD_SIZE"])
rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])

data_path_for_this_rank = args.data[rank]
Many small files problem

Reading files from storage involves making requests for each file. The request count per file varies, based on file sizes and the settings of the software that handles the file reads.

Files are read in blocks of 1-4 MB in size. Files smaller than a block are read with a single request (GET file.jpg 0-4MB), and files larger than a block have one request made per block (GET file.jpg 0-4MB, GET file.jpg 4-8 MB). This table shows that files smaller than a 4-MB block result in more storage requests compared to larger files:

# Files File Size Total data size Block size # Storage requests 2,000,000 500KB 1 TB 4 MB 2,000,000 1,000 1 GB 1 TB 4 MB 256,000

For small files, the latency interval mostly involves handling the requests to storage, instead of data transfers. Therefore, we offer these recommendations to increase the file size:

If you can't increase your file sizes, explore your Azure Storage options.

Azure Storage options

Azure Storage offers two tiers - standard and premium:

Storage Scenario Azure Blob - Standard (HDD) Your data is structured in larger blobs - images, video, etc. Azure Blob - Premium (SSD) High transaction rates, smaller objects, or consistently low storage latency requirements Read V1 data assets

This section explains how to read V1 FileDataset and TabularDataset data entities in a V2 job.

Read a FileDataset

In the Input object, specify the type as AssetTypes.MLTABLE and mode as InputOutputModes.EVAL_MOUNT:

For more information about the MLClient object, MLClient object initialization options, and how to connect to a workspace, visit Connect to a workspace.

from azure.ai.ml import command
from azure.ai.ml.entities import Data
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.ai.ml import MLClient

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

filedataset_asset = ml_client.data.get(name="<filedataset_name>", version="<version>")

my_job_inputs = {
    "input_data": Input(
            type=AssetTypes.MLTABLE,
            path=filedataset_asset.id,
            mode=InputOutputModes.EVAL_MOUNT
    )
}

job = command(
    code="./src",  # Local path where the code is stored
    command="ls ${{inputs.input_data}}",
    inputs=my_job_inputs,
    environment="<environment_name>:<version>",
    compute="cpu-cluster",
)

# Submit the command
returned_job = ml_client.jobs.create_or_update(job)
# Get a URL for the job status
returned_job.services["Studio"].endpoint

Create a job specification YAML file (<file-name>.yml), with the type set to mltable and the mode set to eval_mount:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

command: |
  ls ${{inputs.my_data}}
code: <folder where code is located>
inputs:
  my_data:
    type: mltable
    mode: eval_mount
    path: azureml:<filedataset_name>@latest
environment: azureml:<environment_name>@latest
compute: azureml:cpu-cluster

Next, run in the CLI

az ml job create -f <file-name>.yml
Read a TabularDataset

In the Input object, specify the type as AssetTypes.MLTABLE, and mode as InputOutputModes.DIRECT:

from azure.ai.ml import command
from azure.ai.ml.entities import Data
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.ai.ml import MLClient

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

filedataset_asset = ml_client.data.get(name="<tabulardataset_name>", version="<version>")

my_job_inputs = {
    "input_data": Input(
            type=AssetTypes.MLTABLE,
            path=filedataset_asset.id,
            mode=InputOutputModes.DIRECT
    )
}

job = command(
    code="./src",  # Local path where the code is stored
    command="python train.py --inputs ${{inputs.input_data}}",
    inputs=my_job_inputs,
    environment="<environment_name>:<version>",
    compute="cpu-cluster",
)

# Submit the command
returned_job = ml_client.jobs.create_or_update(job)
# Get a URL for the status of the job
returned_job.services["Studio"].endpoint

Create a job specification YAML file (<file-name>.yml), with the type set to mltable and the mode set to direct:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

command: |
  ls ${{inputs.my_data}}
code: <folder where code is located>
inputs:
  my_data:
    type: mltable
    mode: direct
    path: azureml:<tabulardataset_name>@latest
environment: azureml:<environment_name>@latest
compute: azureml:cpu-cluster

Next, run in the CLI

az ml job create -f <file-name>.yml
Next steps

RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4