APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)
In this article, you learn how to:

- Read data from Azure storage in an Azure Machine Learning job
- Write data from your Azure Machine Learning job to Azure Storage
- Choose between mount and download modes, and tune their settings
- Apply optimum mount settings for common scenarios
- Diagnose and solve data loading bottlenecks
- Read V1 data assets in a V2 job
Prerequisites

- An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning.
- An Azure Machine Learning workspace
Before you explore the detailed options available when you access data, we first describe the relevant code snippets for data access.
Read data from Azure storage in an Azure Machine Learning job

In this example, you submit an Azure Machine Learning job that accesses data from a public blob storage account. However, you can adapt the snippet to access your own data in a private Azure Storage account: update the path as described in the Paths section. Azure Machine Learning seamlessly handles authentication to cloud storage with Microsoft Entra passthrough. When you submit a job, you can choose the identity that accesses the data: your user identity (passthrough), the managed identity of the compute target, or no identity (for public data, or when you use a credential-based datastore).
Tip
If you use keys or SAS tokens to authenticate, we recommend creating an Azure Machine Learning datastore. The runtime automatically connects to storage without exposing your credentials.
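As a minimal sketch of creating such a credential-based datastore with the Python SDK (the account, container, key, and datastore names here are placeholders you must replace):

from azure.ai.ml import MLClient
from azure.ai.ml.entities import AzureBlobDatastore, AccountKeyConfiguration
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(), "<SUBSCRIPTION_ID>", "<RESOURCE_GROUP>", "<AML_WORKSPACE_NAME>"
)

# Placeholder names; point these at your own storage account and container.
blob_datastore = AzureBlobDatastore(
    name="my_blob_datastore",
    account_name="<STORAGE_ACCOUNT_NAME>",
    container_name="<CONTAINER_NAME>",
    credentials=AccountKeyConfiguration(account_key="<ACCOUNT_KEY>"),
)
ml_client.datastores.create_or_update(blob_datastore)

Jobs can then reference data with an azureml://datastores/my_blob_datastore/paths/<path> URI, and the runtime connects to storage on your behalf.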
from azure.ai.ml import command, Input, MLClient, UserIdentityConfiguration, ManagedIdentityConfiguration
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.identity import DefaultAzureCredential
# Set your subscription, resource group and workspace name:
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"
# connect to the AzureML workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)
# ==============================================================
# Set the URI path for the data.
# Supported `path` formats for input include:
# local: ./<path>
# Blob: wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>
# ADLS: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>
# Datastore: azureml://datastores/<data_store_name>/paths/<path>
# Data Asset: azureml:<my_data>:<version>
# Supported `path` format for output is:
# Datastore: azureml://datastores/<data_store_name>/paths/<path>
# We set the input path to a file on a public blob container
# ==============================================================
path = "wasbs://data@azuremlexampledata.blob.core.windows.net/titanic.csv"
# ==============================================================
# What type of data does the path point to? Options include:
# data_type = AssetTypes.URI_FILE # a specific file
# data_type = AssetTypes.URI_FOLDER # a folder
# data_type = AssetTypes.MLTABLE # an mltable
# The path we set above is a specific file
# ==============================================================
data_type = AssetTypes.URI_FILE
# ==============================================================
# Set the mode. The popular modes include:
# mode = InputOutputModes.RO_MOUNT # Read-only mount on the compute target
# mode = InputOutputModes.DOWNLOAD # Download the data to the compute target
# ==============================================================
mode = InputOutputModes.RO_MOUNT
# ==============================================================
# You can set the identity you want to use in a job to access the data. Options include:
# identity = UserIdentityConfiguration() # Use the user's identity
# identity = ManagedIdentityConfiguration() # Use the compute target managed identity
# ==============================================================
# This example accesses public data, so we don't need an identity.
# You can also set identity to None if you use a credential-based datastore
identity = None
# Set the input for the job:
inputs = {
    "input_data": Input(type=data_type, path=path, mode=mode)
}
# This command job uses the head Linux command to print the first 10 lines of the file
job = command(
    command="head ${{inputs.input_data}}",
    inputs=inputs,
    environment="azureml://registries/azureml/environments/sklearn-1.1/versions/4",
    compute="cpu-cluster",
    identity=identity,
)
# Submit the command
ml_client.jobs.create_or_update(job)
Create a job specification YAML file (<file-name>.yml).
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
# ==============================================================
# Set the URI path for the data.
# Supported `path` formats for input include:
# local: ./<path>
# Blob: wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>
# ADLS: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>
# Datastore: azureml://datastores/<data_store_name>/paths/<path>
# Data Asset: azureml:<my_data>:<version>
# Supported `path` format for output is:
# Datastore: azureml://datastores/<data_store_name>/paths/<path>
# ==============================================================
# ==============================================================
# What type of data does the path point to? Options include:
# type: uri_file # a specific file
# type: uri_folder # a folder
# type: mltable # an mltable
# ==============================================================
# ==============================================================
# Set the mode. The popular modes include:
# mode: ro_mount # Read-only mount on the compute target
# mode: download # Download the data to the compute target
# ==============================================================
# ==============================================================
# You can set the identity you want to use in a job to access the data. Options include:
# identity:
# type: user_identity
# identity:
# type: managed_identity
# ==============================================================
# This example accesses public data, so we don't need an identity.
# You don't need to set an identity if you use a credential-based datastore
# This command job prints the first 10 lines of the file
type: command
command: head ${{inputs.input_data}}
compute: azureml:cpu-cluster
environment: azureml://registries/azureml/environments/sklearn-1.1/versions/4
inputs:
  input_data:
    mode: ro_mount
    path: wasbs://data@azuremlexampledata.blob.core.windows.net/titanic.csv
    type: uri_file
Next, submit your job using the CLI:
az ml job create -f <file-name>.yml
Write data from your Azure Machine Learning job to Azure Storage
In this example, you submit an Azure Machine Learning job that writes data to your default Azure Machine Learning datastore. You can optionally set the name value of your data asset to create a data asset in the output.
from azure.ai.ml import command, Input, Output, MLClient
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.identity import DefaultAzureCredential
# Set your subscription, resource group and workspace name:
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"
# connect to the AzureML workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)
# ==============================================================
# Set the URI path for the data.
# Supported `path` formats for input include:
# local: ./<path>
# Blob: wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>
# ADLS: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>
# Datastore: azureml://datastores/<data_store_name>/paths/<path>
# Data Asset: azureml:<my_data>:<version>
# Supported `path` format for output is:
# Datastore: azureml://datastores/<data_store_name>/paths/<path>
# As an example, we set the input path to a file on a public blob container
# As an example, we set the output path to a folder in the default datastore
# ==============================================================
input_path = "wasbs://data@azuremlexampledata.blob.core.windows.net/titanic.csv"
output_path = "azureml://datastores/workspaceblobstore/paths/quickstart-output/titanic.csv"
# ==============================================================
# What type of data are you pointing to?
# AssetTypes.URI_FILE (a specific file)
# AssetTypes.URI_FOLDER (a folder)
# AssetTypes.MLTABLE (a table)
# The path we set above is a specific file
# ==============================================================
data_type = AssetTypes.URI_FILE
# ==============================================================
# Set the input mode. The most commonly-used modes:
# InputOutputModes.RO_MOUNT
# InputOutputModes.DOWNLOAD
# Set the mode to Read Only (RO) to mount the data
# ==============================================================
input_mode = InputOutputModes.RO_MOUNT
# ==============================================================
# Set the output mode. The most commonly-used modes:
# InputOutputModes.RW_MOUNT
# InputOutputModes.UPLOAD
# Set the mode to Read Write (RW) to mount the data
# ==============================================================
output_mode = InputOutputModes.RW_MOUNT
# Set the input and output for the job:
inputs = {
    "input_data": Input(type=data_type, path=input_path, mode=input_mode)
}

outputs = {
    "output_data": Output(
        type=data_type,
        path=output_path,
        mode=output_mode,
        # optional: to create a data asset from the output, uncomment `name`
        # (`name` can be set without `version`; `version` is then set automatically for you)
        # name="<name_of_data_asset>",
        # version="<version>",
    )
}
# This command job copies the data to your default Datastore
job = command(
    command="cp ${{inputs.input_data}} ${{outputs.output_data}}",
    inputs=inputs,
    outputs=outputs,
    environment="azureml://registries/azureml/environments/sklearn-1.1/versions/4",
    compute="cpu-cluster",
)
# Submit the command
ml_client.jobs.create_or_update(job)
Create a job specification YAML file (<file-name>.yml):
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
# path: Set the URI path for the data. Supported paths include
# local: ./<path>
# Blob: wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>
# ADLS: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>
# Datastore: azureml://datastores/<data_store_name>/paths/<path>
# Data Asset: azureml:<my_data>:<version>
# type: What type of data are you pointing to?
# uri_file (a specific file)
# uri_folder (a folder)
# mltable (a table)
# mode: Set INPUT mode:
# ro_mount (read-only mount)
# download (download from storage to node)
# mode: Set the OUTPUT mode
# rw_mount (read-write mount)
# upload (upload data from node to storage)
type: command
command: cp ${{inputs.input_data}} ${{outputs.output_data}}
compute: azureml:cpu-cluster
environment: azureml://registries/azureml/environments/sklearn-1.1/versions/4
inputs:
  input_data:
    mode: ro_mount
    path: wasbs://data@azuremlexampledata.blob.core.windows.net/titanic.csv
    type: uri_file
outputs:
  output_data:
    mode: rw_mount
    path: azureml://datastores/workspaceblobstore/paths/quickstart-output/titanic.csv
    type: uri_file
    # optional: to create a data asset from the output, uncomment `name`
    # (`name` can be set without `version`; `version` is then set automatically for you)
    # name: <name_of_data_asset>
    # version: <version>
Next, submit the job using the CLI:
az ml job create --file <file-name>.yml
The Azure Machine Learning data runtime
When you submit a job, the Azure Machine Learning data runtime controls the data load from the storage location to the compute target. The runtime is optimized for speed and efficiency in machine learning tasks: it's written in Rust, which avoids Python Global Interpreter Lock (GIL) bottlenecks, and it can parallelize data transfers.
Tip
We suggest that you use the Azure Machine Learning data runtime, instead of creating your own mounting/downloading capability in your training (client) code. We have observed storage throughput constraints when client code uses Python to download data from storage, because of Global Interpreter Lock (GIL) issues.
Paths

When you provide a data input or output to a job, you must specify a path parameter that points to the data location. This table shows the different data locations that Azure Machine Learning supports, with path parameter examples:

| Location | Examples | Input | Output |
| --- | --- | --- | --- |
| A path on your local computer | ./home/username/data/my_data | Y | N |
| A path on a public HTTP(S) server | https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv | Y | N |
| A path on Azure Storage | wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>, abfss://<file_system>@<account_name>.dfs.core.windows.net/<path> | Y, only for identity-based authentication | N |
| A path on an Azure Machine Learning datastore | azureml://datastores/<data_store_name>/paths/<path> | Y | Y |
| A path to a data asset | azureml:<my_data>:<version> | Y | N, but you can use name and version to create a data asset from the output |

Modes
When you run a job with data inputs/outputs, you can select from these mode options:

- ro_mount: Mount the storage location, as read-only, on the local disk (SSD) of the compute target.
- rw_mount: Mount the storage location, as read-write, on the local disk (SSD) of the compute target.
- download: Download the data from the storage location to the local disk (SSD) of the compute target.
- upload: Upload data from the compute target to the storage location.
- eval_mount/eval_download: These modes are unique to MLTable. In some scenarios, an MLTable can yield files that might be located in a storage account different from the storage account that hosts the MLTable file. Or, an MLTable can subset or shuffle the data located in the storage resource. That view of the subset/shuffle becomes visible only when the Azure Machine Learning data runtime evaluates the MLTable file. For example, an MLTable used with eval_mount or eval_download can take images from two different storage containers, and an annotations file located in a different storage account, and then mount/download them to the filesystem of the remote compute target.
  The camera1 folder, camera2 folder, and annotations.csv file are then accessible on the compute target's filesystem in this folder structure:
  /INPUT_DATA
  ├── account-a
  │   ├── container1
  │   │   └── camera1
  │   │       ├── image1.jpg
  │   │       └── image2.jpg
  │   └── container2
  │       └── camera2
  │           ├── image1.jpg
  │           └── image2.jpg
  └── account-b
      └── container1
          └── annotations.csv
- direct: You might want to read data directly from a URI through other APIs, rather than go through the Azure Machine Learning data runtime. For example, you might want to access data in an S3 bucket (with a virtual-hosted-style or path-style https URL) using the boto S3 client. You can obtain the URI of the input as a string with the direct mode. You see use of the direct mode in Spark jobs, because the spark.read_*() methods know how to process the URIs. For non-Spark jobs, it's your responsibility to manage access credentials. For example, you must explicitly make use of compute MSI, or otherwise broker access.
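As a minimal sketch of consuming a direct-mode input in a non-Spark job (the URI parsing and credential setup here are illustrative assumptions, not part of the Azure Machine Learning API), the training script receives the URI as a plain string and reads the object itself:

import argparse
import boto3

parser = argparse.ArgumentParser()
parser.add_argument("--input_data")  # a direct-mode input arrives as a URI string
args = parser.parse_args()

# Hypothetical example: parse an s3://<bucket>/<key> URI and read it with boto3.
# Credentials must be brokered by you (for example, via environment variables or compute MSI).
_, _, bucket_and_key = args.input_data.partition("s3://")
bucket, _, key = bucket_and_key.partition("/")

s3 = boto3.client("s3")
body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
print(f"Read {len(body)} bytes from {args.input_data}")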
This table shows the possible modes for different type/mode/input/output combinations:

| Type | Input/Output | upload | download | ro_mount | rw_mount | direct | eval_download | eval_mount |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| uri_folder | Input | | ✓ | ✓ | | ✓ | | |
| uri_file | Input | | ✓ | ✓ | | ✓ | | |
| mltable | Input | | ✓ | ✓ | | ✓ | ✓ | ✓ |
| uri_folder | Output | ✓ | | | ✓ | | | |
| uri_file | Output | ✓ | | | ✓ | | | |
| mltable | Output | ✓ | | | ✓ | ✓ | | |

Download
In download mode, all the input data is copied to the local disk (SSD) of the compute target. The Azure Machine Learning data runtime starts the user training script once all the data is copied. When the user script starts, it reads data from the local disk, just like any other files. When the job finishes, the data is removed from the disk of the compute target.

| Advantages | Disadvantages |
| --- | --- |
| When training starts, all the data is available on the local disk (SSD) of the compute target, for the training script. No Azure storage / network interaction is required. | The dataset must completely fit on a compute target disk. |
| After the user script starts, there are no dependencies on storage / network reliability. | The entire dataset is downloaded (if training needs to randomly select only a small portion of the data, much of the download is then wasted). |
| Azure Machine Learning data runtime can parallelize the download (significant difference on many small files) and maximize network / storage throughput. | The job waits until all data downloads to the local disk of the compute target. For a submitted deep-learning job, the GPUs idle until data is ready. |
| No unavoidable overhead added by the FUSE layer (roundtrip: user space call in user script → kernel → user space fuse daemon → kernel → response to user script in user space) | Storage changes aren't reflected on the data after the download is done. |

When to use download

Use download when the dataset fits on the compute target disk, and when training reads all or most of the data, possibly more than once, so that the one-time download cost pays off.

Download settings

You can tune the download settings with these environment variables in your job:
| Environment variable name | Type | Default value | Description |
| --- | --- | --- | --- |
| RSLEX_DOWNLOADER_THREADS | u64 | NUMBER_OF_CPU_CORES * 4 | Number of concurrent threads that the download can use |
| AZUREML_DATASET_HTTP_RETRY_COUNT | u64 | 7 | Number of retry attempts of an individual storage / http request, to recover from transient errors. |
In your job, you can change these defaults by setting the environment variables. For example:
For brevity, we only show how to define the environment variables in the job.
from azure.ai.ml import command
env_var = {
    "RSLEX_DOWNLOADER_THREADS": 64,
    "AZUREML_DATASET_HTTP_RETRY_COUNT": 10
}

job = command(
    environment_variables=env_var
)
For brevity, we only show how to define the environment variables in the job.
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
environment_variables:
  RSLEX_DOWNLOADER_THREADS: 64
  AZUREML_DATASET_HTTP_RETRY_COUNT: 10
Download performance metrics
The VM size of your compute target affects the download time of your data. Specifically, the number of cores (which bounds download parallelism) and the network bandwidth of the VM both matter.
Note
For A100 GPU VMs, the Azure Machine Learning data runtime can saturate the NIC (Network Interface Card) when downloading data to the compute target (~24 Gbit/s, the theoretical maximum possible throughput).
On a Standard_D15_v2 VM (20 cores, 25 Gbit/s network throughput), downloading a 100-GB file is faster when the file is broken up into smaller files, because the Azure Machine Learning data runtime can parallelize the download. That said, avoid files that become too small (less than 4 MB), because the time needed for storage request submissions increases relative to the time spent downloading the payload. For more information, read the Many small files problem section.
Mount (streaming)

In mount mode, the Azure Machine Learning data capability uses the FUSE (filesystem in user space) Linux feature to create an emulated filesystem. Instead of downloading all the data to the local disk (SSD) of the compute target, the runtime can react to the user's script actions in real time. For example: "open file", "read 2-KB chunk from position X", "list directory content".
| Advantages | Disadvantages |
| --- | --- |
| Data that exceeds the compute target local disk capacity can be used (not limited by compute hardware) | Added overhead of the Linux FUSE module. |
| No delay at the start of training (unlike download mode). | Dependency on the user's code behavior (if training code that sequentially reads small files in a single-thread mount also requests data from storage, it might not maximize the network or storage throughput). |
| More available settings to tune for a usage scenario. | No Windows support. |
| Only the data needed for training is read from storage. | |

When to use mount

Use mount when the data doesn't fit on the compute target disk, or when you want training to start without waiting for the full dataset to download, reading only the data it needs from storage.

Mount settings

You can tune the mount settings with these environment variables in your job:
| Env variable name | Type | Default value | Description |
| --- | --- | --- | --- |
| DATASET_MOUNT_ATTRIBUTE_CACHE_TTL | u64 | Not set (cache never expires) | Time, in milliseconds, needed to keep the getattr call results in cache, and to avoid subsequent requests of this info from storage again. |
| DATASET_RESERVED_FREE_DISK_SPACE | u64 | 150 MB | Intended for a system configuration, to keep compute healthy. No matter what values the other settings have, Azure Machine Learning data runtime doesn't use the last RESERVED_FREE_DISK_SPACE bytes of disk space. |
| DATASET_MOUNT_CACHE_SIZE | usize | Unlimited | Controls how much disk space mount can use. A positive value sets the absolute value in bytes. A negative value sets how much disk space to leave free. Supports KB, MB and GB modifiers for convenience. |
| DATASET_MOUNT_FILE_CACHE_PRUNE_THRESHOLD | f64 | 1.0 | Volume mount starts cache pruning when the cache is filled up to AVAILABLE_CACHE_SIZE * DATASET_MOUNT_FILE_CACHE_PRUNE_THRESHOLD. Should be between 0 and 1. Setting it < 1 triggers background cache pruning earlier. AVAILABLE_CACHE_SIZE isn't an environment variable you can modify or view directly. In this context, it refers to the "number of bytes the system calculates as available for caching." This value depends on factors such as disk size, the amount of disk space required for system health, and configurations set in environment variables (like DATASET_RESERVED_FREE_DISK_SPACE and DATASET_MOUNT_CACHE_SIZE). |
| DATASET_MOUNT_FILE_CACHE_PRUNE_TARGET | f64 | 0.7 | Pruning tries to free at least (1 - DATASET_MOUNT_FILE_CACHE_PRUNE_TARGET) of the cache space. |
| DATASET_MOUNT_READ_BLOCK_SIZE | usize | 2 MB | Streaming read block size. When a file is large enough, request at least DATASET_MOUNT_READ_BLOCK_SIZE of data from storage, and cache it even when the FUSE read operation requested less. |
| DATASET_MOUNT_READ_BUFFER_BLOCK_COUNT | usize | 32 | Number of blocks to prefetch (reading block k triggers background prefetching of blocks k+1, ..., k+DATASET_MOUNT_READ_BUFFER_BLOCK_COUNT). |
| DATASET_MOUNT_READ_THREADS | usize | NUMBER_OF_CORES * 4 | Number of background prefetching threads. |
| DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED | bool | false | Enable block-based caching. |
| DATASET_MOUNT_MEMORY_CACHE_SIZE | usize | 128 MB | Applies to block-based caching only. Size of RAM that block-based caching can use. A value of 0 disables memory caching completely. |
| DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED | bool | true | Applies to block-based caching only. When set to true, block-based caching uses the local hard drive to cache blocks. |
| DATASET_MOUNT_BLOCK_FILE_CACHE_MAX_QUEUE_SIZE | usize | 512 MB | Applies to block-based caching only. Block-based caching writes cached blocks to the local disk in the background. This setting controls how much memory mount can use to store blocks waiting to be flushed to the local disk cache. |
| DATASET_MOUNT_BLOCK_FILE_CACHE_WRITE_THREADS | usize | NUMBER_OF_CORES * 2 | Applies to block-based caching only. Number of background threads block-based caching uses to write downloaded blocks to the local disk of the compute target. |
| DATASET_UNMOUNT_TIMEOUT_SECONDS | u64 | 30 | Time in seconds for unmount to (gracefully) finish all pending operations (for example, flush calls) before forcefully terminating the mount message loop. |
In your job, you can change these defaults by setting the environment variables. For example:
from azure.ai.ml import command
env_var = {
    "DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED": True
}

job = command(
    environment_variables=env_var
)
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
environment_variables:
  DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED: true
Block-based open mode
Block-based open mode splits each file into blocks of a predefined size (except for the last block). A read request from a specified position requests the corresponding block from storage, and returns the requested data immediately. A read also triggers background prefetching of the next N blocks, using multiple threads (optimized for sequential reads). Downloaded blocks are cached in a two-layer cache (RAM and local disk).
| Advantages | Disadvantages |
| --- | --- |
| Fast data delivery to the training script (less blocking for chunks that weren't yet requested). | Random reads may waste forward-prefetched blocks. |
| More work offloads to background threads (prefetching / caching). The training can then proceed. | Added overhead to navigate between caches, compared to direct reads from a file in a local disk cache (for example, in whole-file cache mode). |
| Only requested data (plus prefetching) is read from storage. | |
| For small enough data, a fast RAM-based cache is used. | |

When to use block-based open mode

Recommended for most scenarios, except when you need fast reads from random file locations. In those cases, use whole file cache open mode.
Whole file cache open mode

When a file under a mount folder is opened (for example, f = open(path, args)) in whole file mode, the call is blocked until the entire file is downloaded into a compute target cache folder on disk. All subsequent read calls redirect to the cached file, so no storage interaction is needed. If the cache doesn't have enough available space to fit the current file, mount tries to prune the cache by deleting the least-recently used file. If the file can't fit on disk (with respect to the cache settings), the data runtime falls back to streaming mode.
When to use whole file cache open mode: when random reads are needed for relatively large files that exceed 128 MB.
Usage

Set the environment variable DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED to false in your job:
from azure.ai.ml import command
env_var = {
    "DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED": False
}

job = command(
    environment_variables=env_var
)
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
environment_variables:
  DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED: false
Mount: Listing files
When working with millions of files, avoid a recursive listing - for example, ls -R /mnt/dataset/folder/. A recursive listing triggers many calls to list the directory contents of the parent directory. It then requires a separate recursive call for each directory inside, at all child levels. Typically, Azure Storage allows only 5,000 elements to be returned per single list request. As a result, a recursive listing of 1M folders containing 10 files each requires 1,000,000 / 5,000 + 1,000,000 = 1,000,200 requests to storage. In comparison, 1,000 folders that hold 10,000 files in total would need only 1,000 + 1 = 1,001 requests to storage for a recursive listing.
Azure Machine Learning mount handles listing lazily. Therefore, to list many small files, it's better to use an iterative client library call (for example, os.scandir() in Python) instead of a client library call that returns the full list (for example, os.listdir() in Python). An iterative call returns a generator, which means it doesn't need to wait until the entire list loads, and it can proceed faster.
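As a minimal sketch (the mount path and per-file work are placeholders), iterating lazily with os.scandir() lets processing begin before the full directory listing completes:

import os

mount_path = "/mnt/dataset/folder"  # hypothetical mounted input path

# os.scandir() returns an iterator, so the first entries arrive quickly
# even when the folder holds millions of files; os.listdir() would block
# until the entire listing is materialized as one list.
with os.scandir(mount_path) as entries:
    for entry in entries:
        if entry.is_file():
            print(entry.path)  # replace with your per-file processing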
This table compares the time needed for the Python os.scandir() and os.listdir() functions to list a folder that contains ~4M files in a flat structure:

| Metric | os.scandir() | os.listdir() |
| --- | --- | --- |
| Time to get first entry (secs) | 0.67 | 553.79 |
| Time to get first 50k entries (secs) | 9.56 | 562.73 |
| Time to get all entries (secs) | 558.35 | 582.14 |

Optimum mount settings for common scenarios
For certain common scenarios, we show the optimal mount settings you need to set in your Azure Machine Learning job.
Reading large file sequentially one time (processing lines in csv file)

Include these mount settings in the environment_variables section of your Azure Machine Learning job:
from azure.ai.ml import command
env_var = {
    "DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED": True,  # Enable block-based caching
    "DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED": False,  # Disable caching on disk
    "DATASET_MOUNT_MEMORY_CACHE_SIZE": 0,             # Disable in-memory caching

    # Increase the number of blocks used for prefetch. This leads to use of more RAM (2 MB * #value set).
    # Adjust up or down for fine-tuning, depending on the actual data processing pattern.
    # An optimal setting based on our tests ~= the number of prefetching threads (#CPU_CORES * 4 by default)
    "DATASET_MOUNT_READ_BUFFER_BLOCK_COUNT": 80,
}

job = command(
    environment_variables=env_var
)
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
environment_variables:
  DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED: true # Enable block-based caching
  DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED: false # Disable caching on disk
  DATASET_MOUNT_MEMORY_CACHE_SIZE: 0 # Disable in-memory caching
  # Increase the number of blocks used for prefetch. This leads to use of more RAM (2 MB * #value set).
  # Adjust up or down for fine-tuning, depending on the actual data processing pattern.
  # An optimal setting based on our tests ~= the number of prefetching threads (#CPU_CORES * 4 by default)
  DATASET_MOUNT_READ_BUFFER_BLOCK_COUNT: 80
Reading large file one time from multiple threads (processing partitioned csv file in multiple threads)
Include these mount settings in the environment_variables
section of your Azure Machine Learning job:
from azure.ai.ml import command
env_var = {
    "DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED": True,  # Enable block-based caching
    "DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED": False,  # Disable caching on disk
    "DATASET_MOUNT_MEMORY_CACHE_SIZE": 0,             # Disable in-memory caching
}

job = command(
    environment_variables=env_var
)
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
environment_variables:
  DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED: true # Enable block-based caching
  DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED: false # Disable caching on disk
  DATASET_MOUNT_MEMORY_CACHE_SIZE: 0 # Disable in-memory caching
Reading millions of small files (images) from multiple threads one time (single epoch training on images)
Include these mount settings in the environment_variables
section of your Azure Machine Learning job:
from azure.ai.ml import command
env_var = {
    "DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED": True,  # Enable block-based caching
    "DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED": False,  # Disable caching on disk
    "DATASET_MOUNT_MEMORY_CACHE_SIZE": 0,             # Disable in-memory caching
}

job = command(
    environment_variables=env_var
)
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
environment_variables:
  DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED: true # Enable block-based caching
  DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED: false # Disable caching on disk
  DATASET_MOUNT_MEMORY_CACHE_SIZE: 0 # Disable in-memory caching
Reading millions of small files (images) from multiple threads multiple times (multiple epochs training on images)
Include these mount settings in the environment_variables
section of your Azure Machine Learning job:
from azure.ai.ml import command
env_var = {
    "DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED": True,  # Enable block-based caching
}

job = command(
    environment_variables=env_var
)
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
environment_variables:
  DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED: true # Enable block-based caching
Reading large file with random seeks (like serving file database from mounted folder)
Include these mount settings in the environment_variables
section of your Azure Machine Learning job:
from azure.ai.ml import command
env_var = {
    "DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED": False,  # Disable block-based caching
}

job = command(
    environment_variables=env_var
)
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
environment_variables:
  DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED: false # Disable block-based caching
Diagnosing and solving data loading bottlenecks
When an Azure Machine Learning job executes with data, the mode of an input determines how bytes are read from storage and cached on the local SSD disk of the compute target. For download mode, all the data caches on disk before the user code starts its execution. Several factors affect the maximum download speed, including the number of download threads (RSLEX_DOWNLOADER_THREADS), the VM network bandwidth, and storage throttling.

For mount mode, the user code must start to open files before the data starts to cache. Different mount settings result in different reading and caching behavior. Several factors affect the speed at which data loads from storage, including the mount settings, the read pattern of your code, and the VM network bandwidth.
For example, with a Standard_D2_v2 (2 cores, 1,500 Mbps NIC), loading 50,000 MB (50 GB) of data takes at best ~270 secs (assuming you saturate the NIC at 187.5 MB/s throughput). In contrast, a Standard_D5_v2 (16 cores, 12,000 Mbps) would load the same data in ~33 secs (assuming you saturate the NIC at 1,500 MB/s throughput).

To access the logs of the data runtime from your job, select the Outputs + logs tab of the job page, and open the system_logs > data_capability folder.
The log file data-capability.log shows the high-level information about the time spent on key data loading tasks. For example, when you download data, the runtime logs the download activity start and finish times:
INFO 2023-05-18 17:14:47,790 sdk_logger.py:44 [28] - ActivityStarted, download
INFO 2023-05-18 17:14:50,295 sdk_logger.py:44 [28] - ActivityCompleted: Activity=download, HowEnded=Success, Duration=2504.39 [ms]
If the download throughput is a fraction of the expected network bandwidth for the VM size, you can inspect the log file rslex.log.<TIMESTAMP>. This file contains all the fine-grained logging from the Rust-based runtime; for example, parallelization:
2023-05-18T14:08:25.388670Z INFO copy_uri:copy_uri:copy_dataset:write_streams_to_files:collect:reduce:reduce_and_combine:reduce:get_iter: rslex::prefetching: close time.busy=23.2µs time.idle=1.90µs sessionId=012ea46a-341c-4258-8aba-90bde4fdfb51 source=Dataset[Partitions: 1, Sources: 1] file_name_column=None break_on_first_error=true skip_existing_files=false parallelization_degree=4 self=Dataset[Partitions: 1, Sources: 1] parallelization_degree=4 self=Dataset[Partitions: 1, Sources: 1] parallelization_degree=4 self=Dataset[Partitions: 1, Sources: 1] parallelization_degree=4 i=0 index=0
2023-05-18T14:08:25.388731Z INFO copy_uri:copy_uri:copy_dataset:write_streams_to_files:collect:reduce:reduce_and_combine:reduce: rslex::dataset_crossbeam: close time.busy=90.9µs time.idle=9.10µs sessionId=012ea46a-341c-4258-8aba-90bde4fdfb51 source=Dataset[Partitions: 1, Sources: 1] file_name_column=None break_on_first_error=true skip_existing_files=false parallelization_degree=4 self=Dataset[Partitions: 1, Sources: 1] parallelization_degree=4 self=Dataset[Partitions: 1, Sources: 1] parallelization_degree=4 self=Dataset[Partitions: 1, Sources: 1] parallelization_degree=4 i=0
2023-05-18T14:08:25.388762Z INFO copy_uri:copy_uri:copy_dataset:write_streams_to_files:collect:reduce:reduce_and_combine:combine: rslex::dataset_crossbeam: close time.busy=1.22ms time.idle=9.50µs sessionId=012ea46a-341c-4258-8aba-90bde4fdfb51 source=Dataset[Partitions: 1, Sources: 1] file_name_column=None break_on_first_error=true skip_existing_files=false parallelization_degree=4 self=Dataset[Partitions: 1, Sources: 1] parallelization_degree=4 self=Dataset[Partitions: 1, Sources: 1] parallelization_degree=4 self=Dataset[Partitions: 1, Sources: 1] parallelization_degree=4
The rslex.log file provides details about all the file copying, whether you chose the mount or download mode. It also describes the settings (environment variables) used. To start debugging, check whether you set the optimum mount settings for common scenarios.
Monitor Azure storage

In the Azure portal, you can select your storage account, and then Metrics, to see the storage metrics. You can then plot SuccessE2ELatency against SuccessServerLatency.

If the metrics show high SuccessE2ELatency and low SuccessServerLatency, you have limited available threads, or you're running low on resources such as CPU, memory, or network bandwidth. You should:

- Increase RSLEX_DOWNLOADER_THREADS if you're downloading and you don't fully utilize the CPU and memory.
- If you use mount, increase DATASET_MOUNT_READ_BUFFER_BLOCK_COUNT to do more prefetching, and increase DATASET_MOUNT_READ_THREADS for more read threads.

If the metrics show low SuccessE2ELatency and low SuccessServerLatency, but the client experiences high latency, you have a delay in the storage request reaching the service. You should check:

- Whether the number of threads used by mount/download (DATASET_MOUNT_READ_THREADS/RSLEX_DOWNLOADER_THREADS) is set too low, relative to the number of cores available on the compute target. If the setting is too low, increase the number of threads.
- Whether the number of retries for downloading (AZUREML_DATASET_HTTP_RETRY_COUNT) is set too high. If so, decrease the number of retries.

From the Azure Machine Learning studio, you can also monitor the compute target disk IO and usage during your job execution. Navigate to your job, and select the Monitoring tab. This tab provides insights on the resources of your job, on a 30-day rolling basis.
Note
Job monitoring supports only compute resources that Azure Machine Learning manages. Jobs with a runtime of less than 5 minutes will not have enough data to populate this view.
The Azure Machine Learning data runtime doesn't use the last RESERVED_FREE_DISK_SPACE bytes of disk space, to keep the compute healthy (the default value is 150 MB). If your disk fills up, your code might be writing files to disk without declaring them as an output. Therefore, check your code to make sure that data isn't being written erroneously to the temporary disk. If you must write files to the temporary disk, and that resource is becoming full, consider:

- Setting a TTL on the cached data (DATASET_MOUNT_ATTRIBUTE_CACHE_TTL) to purge your data from disk

Caution
If your storage and compute are in different regions, your performance degrades because data must transfer across regions. This also increases costs. Make sure that your storage account and compute resources are in the same region.

If your data and Azure Machine Learning workspace are stored in different regions, we recommend that you copy the data to a storage account in the same region with the azcopy utility. AzCopy uses server-to-server APIs, so data copies directly between storage servers. These copy operations don't use the network bandwidth of your computer. You can increase the throughput of these operations with the AZCOPY_CONCURRENCY_VALUE environment variable. To learn more, see Increase concurrency.
A single storage account can become throttled when it comes under high load - for example, when many compute nodes read from it in parallel. This section shows the calculations that determine whether throttling might become an issue for your workload, and how to approach throttling reductions.
Calculate bandwidth limits

An Azure Storage account has a default egress limit of 120 Gbit/s. Azure VMs have different network bandwidths, which affect the theoretical number of compute nodes needed to hit the maximum default egress capacity of storage:

| Size | GPU card | vCPU | Memory: GiB | Temp storage (SSD) GiB | Number of GPU cards | GPU memory: GiB | Expected network bandwidth (Gbit/s) | Storage account egress default max (Gbit/s)* | Number of nodes to hit default egress capacity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Standard_ND96asr_v4 | A100 | 96 | 900 | 6000 | 8 | 40 | 24 | 120 | 5 |
| Standard_ND96amsr_A100_v4 | A100 | 96 | 1900 | 6400 | 8 | 80 | 24 | 120 | 5 |
| Standard_NC6s_v3 | V100 | 6 | 112 | 736 | 1 | 16 | 24 | 120 | 5 |
| Standard_NC12s_v3 | V100 | 12 | 224 | 1474 | 2 | 32 | 24 | 120 | 5 |
| Standard_NC24s_v3 | V100 | 24 | 448 | 2948 | 4 | 64 | 24 | 120 | 5 |
| Standard_NC24rs_v3 | V100 | 24 | 448 | 2948 | 4 | 64 | 24 | 120 | 5 |
| Standard_NC4as_T4_v3 | T4 | 4 | 28 | 180 | 1 | 16 | 8 | 120 | 15 |
| Standard_NC8as_T4_v3 | T4 | 8 | 56 | 360 | 1 | 16 | 8 | 120 | 15 |
| Standard_NC16as_T4_v3 | T4 | 16 | 110 | 360 | 1 | 16 | 8 | 120 | 15 |
| Standard_NC64as_T4_v3 | T4 | 64 | 440 | 2880 | 4 | 64 | 32 | 120 | 3 |

Both the A100 and V100 SKUs have a maximum network bandwidth per node of 24 Gbit/s. If each node that reads data from a single account can read close to the theoretical maximum of 24 Gbit/s, the default egress capacity is reached with five nodes. Use of six or more compute nodes would start to degrade data throughput across all nodes.
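The node count in the last column is simply the storage egress limit divided by the per-node bandwidth, rounded down. A minimal sketch of that arithmetic:

# Number of nodes that can read at full speed before hitting the storage egress limit.
def nodes_to_saturate_egress(egress_gbit_s: float, node_bandwidth_gbit_s: float) -> int:
    return int(egress_gbit_s // node_bandwidth_gbit_s)

print(nodes_to_saturate_egress(120, 24))  # A100/V100 SKUs -> 5
print(nodes_to_saturate_egress(120, 8))   # smaller T4 SKUs -> 15
print(nodes_to_saturate_egress(120, 32))  # Standard_NC64as_T4_v3 -> 3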
Important
If your workload needs more than six nodes of A100/V100, or if you believe you'll exceed the default egress capacity of storage (120 Gbit/s), contact support (via the Azure portal) and request a storage egress limit increase.
Scaling across multiple storage accounts

You might exceed the maximum egress capacity of storage, and/or you might hit the request rate limits. If these issues occur, we suggest that you contact support first, to increase these limits on the storage account.
If you can't increase the maximum egress capacity or request rate limit, you should consider replicating the data across multiple storage accounts. Copy the data to multiple accounts with Azure Data Factory, Azure Storage Explorer, or azcopy
, and mount all the accounts in your training job. Only the data accessed on a mount is downloaded. Therefore, your training code can read the RANK environment variable to pick which of the multiple input mounts to read from. Your job definition passes in a list of storage accounts:
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >-
  python train.py
  --epochs ${{inputs.epochs}}
  --learning-rate ${{inputs.learning_rate}}
  --data ${{inputs.cifar_storage1}}, ${{inputs.cifar_storage2}}
inputs:
  epochs: 1
  learning_rate: 0.2
  cifar_storage1:
    type: uri_folder
    path: azureml://datastores/storage1/paths/cifar
  cifar_storage2:
    type: uri_folder
    path: azureml://datastores/storage2/paths/cifar
environment: azureml:AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu@latest
compute: azureml:gpu-cluster
distribution:
  type: pytorch
  process_count_per_instance: 1
resources:
  instance_count: 2
display_name: pytorch-cifar-distributed-example
experiment_name: pytorch-cifar-distributed-example
description: Train a basic convolutional neural network (CNN) with PyTorch on the CIFAR-10 dataset, distributed via PyTorch.
Your training Python code can then use RANK to get the storage account specific to that node:
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--data', nargs='+')
args = parser.parse_args()

# The PyTorch distribution sets these variables; each node reads the input that matches its rank
world_size = int(os.environ["WORLD_SIZE"])
rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])
data_path_for_this_rank = args.data[rank]
Many small files problem
Reading files from storage involves making requests for each file. The request count per file varies, based on file sizes and the settings of the software that handles the file reads.
Files are read in blocks of 1-4 MB in size. Files smaller than a block are read with a single request (GET file.jpg 0-4MB), and files larger than a block have one request made per block (GET file.jpg 0-4MB, GET file.jpg 4-8 MB). This table shows that files smaller than a 4-MB block result in more storage requests compared to larger files:

| # Files | File size | Total data size | Block size | # Storage requests |
| --- | --- | --- | --- | --- |
| 2,000,000 | 500 KB | 1 TB | 4 MB | 2,000,000 |
| 1,000 | 1 GB | 1 TB | 4 MB | 256,000 |

For small files, the latency interval mostly involves handling the requests to storage, instead of data transfers. Therefore, we offer these recommendations to increase the file size:
- For structured data, if you use Apache Spark, its repartition() and coalesce() methods can help combine small files into larger ones.

If you can't increase your file sizes, explore your Azure Storage options.
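As a minimal PySpark sketch (the datastore paths and file count here are placeholders), coalesce() rewrites many small files into a handful of larger ones:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder paths; replace with your own input and output locations.
df = spark.read.parquet("azureml://datastores/workspaceblobstore/paths/many-small-files/")

# Rewrite the dataset as 16 larger files instead of thousands of small ones.
df.coalesce(16).write.mode("overwrite").parquet(
    "azureml://datastores/workspaceblobstore/paths/combined-files/"
)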
Azure Storage options

Azure Storage offers two tiers - standard and premium:

| Storage | Scenario |
| --- | --- |
| Azure Blob - Standard (HDD) | Your data is structured in larger blobs - images, video, and so on |
| Azure Blob - Premium (SSD) | High transaction rates, smaller objects, or consistently low storage latency requirements |

Read V1 data assets

This section explains how to read V1 FileDataset and TabularDataset data entities in a V2 job.
Read a FileDataset

In the Input object, specify the type as AssetTypes.MLTABLE and the mode as InputOutputModes.EVAL_MOUNT:
For more information about the MLClient object, MLClient object initialization options, and how to connect to a workspace, visit Connect to a workspace.
from azure.ai.ml import command, Input, MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())
filedataset_asset = ml_client.data.get(name="<filedataset_name>", version="<version>")
my_job_inputs = {
    "input_data": Input(
        type=AssetTypes.MLTABLE,
        path=filedataset_asset.id,
        mode=InputOutputModes.EVAL_MOUNT
    )
}
job = command(
    code="./src",  # Local path where the code is stored
    command="ls ${{inputs.input_data}}",
    inputs=my_job_inputs,
    environment="<environment_name>:<version>",
    compute="cpu-cluster",
)
# Submit the command
returned_job = ml_client.jobs.create_or_update(job)
# Get a URL for the job status
returned_job.services["Studio"].endpoint
Create a job specification YAML file (<file-name>.yml), with the type set to mltable and the mode set to eval_mount:
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
  ls ${{inputs.my_data}}
code: <folder where code is located>
inputs:
  my_data:
    type: mltable
    mode: eval_mount
    path: azureml:<filedataset_name>@latest
environment: azureml:<environment_name>@latest
compute: azureml:cpu-cluster
Next, run the job in the CLI:
az ml job create -f <file-name>.yml
Read a TabularDataset
In the Input object, specify the type as AssetTypes.MLTABLE, and the mode as InputOutputModes.DIRECT:
from azure.ai.ml import command, Input, MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())
tabulardataset_asset = ml_client.data.get(name="<tabulardataset_name>", version="<version>")
my_job_inputs = {
    "input_data": Input(
        type=AssetTypes.MLTABLE,
        path=tabulardataset_asset.id,
        mode=InputOutputModes.DIRECT
    )
}
job = command(
    code="./src",  # Local path where the code is stored
    command="python train.py --inputs ${{inputs.input_data}}",
    inputs=my_job_inputs,
    environment="<environment_name>:<version>",
    compute="cpu-cluster",
)
# Submit the command
returned_job = ml_client.jobs.create_or_update(job)
# Get a URL for the status of the job
returned_job.services["Studio"].endpoint
Create a job specification YAML file (<file-name>.yml), with the type set to mltable and the mode set to direct:
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
  ls ${{inputs.my_data}}
code: <folder where code is located>
inputs:
  my_data:
    type: mltable
    mode: direct
    path: azureml:<tabulardataset_name>@latest
environment: azureml:<environment_name>@latest
compute: azureml:cpu-cluster
Next, run the job in the CLI:
az ml job create -f <file-name>.yml
Next steps