The mlflow.data
module helps you record your model training and evaluation datasets to runs with MLflow Tracking, as well as retrieve dataset information from runs. It provides the following important interfaces:
Dataset
: Represents a dataset used in model training or evaluation, including features, targets, predictions, and metadata such as the dataset's name, digest (hash), schema, profile, and source. You can log this metadata to a run in MLflow Tracking using the mlflow.log_input()
API. mlflow.data
provides APIs for constructing Datasets
from a variety of Python data objects, including Pandas DataFrames (mlflow.data.from_pandas()
), NumPy arrays (mlflow.data.from_numpy()
), Spark DataFrames (mlflow.data.from_spark()
/ mlflow.data.load_delta()
), Polars DataFrames (mlflow.data.from_polars()
), and more.
DatasetSource
: Represents the source of a dataset. For example, this may be a directory of files stored in S3, a Delta Table, or a web URL. Each Dataset
references the source from which it was derived. A Dataset
's features and targets may differ from the source if transformations and filtering were applied. You can get the DatasetSource
of a dataset logged to a run in MLflow Tracking using the mlflow.data.get_source()
API.
The following example demonstrates how to use mlflow.data
to log a training dataset to a run, retrieve information about the dataset from the run, and load the dataset's source.
```python
import mlflow.data
import pandas as pd
from mlflow.data.pandas_dataset import PandasDataset

# Construct a Pandas DataFrame using wine quality data from a web URL
dataset_source_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
df = pd.read_csv(dataset_source_url)
# Construct an MLflow PandasDataset from the Pandas DataFrame, and specify the web URL
# as the source
dataset: PandasDataset = mlflow.data.from_pandas(df, source=dataset_source_url)

with mlflow.start_run():
    # Log the dataset to the MLflow Run. Specify the "training" context to indicate that the
    # dataset is used for model training
    mlflow.log_input(dataset, context="training")

# Retrieve the run, including dataset information
run = mlflow.get_run(mlflow.last_active_run().info.run_id)
dataset_info = run.inputs.dataset_inputs[0].dataset
print(f"Dataset name: {dataset_info.name}")
print(f"Dataset digest: {dataset_info.digest}")
print(f"Dataset profile: {dataset_info.profile}")
print(f"Dataset schema: {dataset_info.schema}")

# Load the dataset's source, which downloads the content from the source URL to the local
# filesystem
dataset_source = mlflow.data.get_source(dataset_info)
dataset_source.load()
```
Bases: object
Represents a dataset for use with MLflow Tracking, including the name, digest (hash), schema, and profile of the dataset as well as source information (e.g. the S3 bucket or managed Delta table from which the dataset was derived). Most datasets expose features and targets for training and evaluation as well.
A unique hash or fingerprint of the dataset, e.g. "498c7496"
.
The name of the dataset, e.g. "iris_data"
, "myschema.mycatalog.mytable@v1"
, etc.
Optional summary statistics for the dataset, such as the number of rows in a table, the mean / median / std of each table column, etc.
Optional dataset schema, such as an instance of mlflow.types.Schema
representing the features and targets of the dataset.
Information about the dataset's source, represented as an instance of DatasetSource
. For example, this may be the S3 location or the name of the managed Delta Table from which the dataset was derived.
Create config dictionary for the dataset.
Subclasses should override this method to provide additional fields in the config dict, e.g., schema, profile, etc.
Returns a string dictionary containing the following fields: name, digest, source, source type.
Bases: object
Represents the source of a dataset used in MLflow Tracking, providing information such as cloud storage location, delta table name / version, etc.
Constructs an instance of the DatasetSource from a dictionary representation.
source_dict – A dictionary representation of the DatasetSource.
A DatasetSource instance.
Loads files / objects referred to by the DatasetSource. For example, depending on the type of DatasetSource
, this may download source CSV files from S3 to the local filesystem, load a source Delta Table as a Spark DataFrame, etc.
The downloaded source, e.g. a local filesystem path, a Spark DataFrame, etc.
Obtains a JSON-compatible dictionary representation of the DatasetSource.
A JSON-compatible dictionary representation of the DatasetSource.
Obtains a JSON string representation of the DatasetSource
.
A JSON string representation of the DatasetSource
.
Obtains the source of the specified dataset or dataset input.
dataset – An instance of mlflow.data.dataset.Dataset
, mlflow.entities.Dataset
, or mlflow.entities.DatasetInput
.
An instance of DatasetSource
.
Constructs a PandasDataset
instance from a Pandas DataFrame, optional targets, optional predictions, and source.
df – A Pandas DataFrame.
source – The source from which the DataFrame was derived, e.g. a filesystem path, an S3 URI, an HTTPS URL, a Delta table name with version, or a Spark table. source
may be specified as a URI, a path-like string, or an instance of DatasetSource
. If unspecified, the source is assumed to be the code location (e.g. notebook cell, script, etc.) where from_pandas
is being called.
targets – An optional target column name for supervised training. This column must be present in the dataframe (df
).
name – The name of the dataset. If unspecified, a name is generated.
digest – The dataset digest (hash). If unspecified, a digest is computed automatically.
predictions – An optional predictions column name for model evaluation. This column must be present in the dataframe (df
).
```python
import mlflow
import pandas as pd

x = pd.DataFrame(
    [["tom", 10, 1, 1], ["nick", 15, 0, 1], ["july", 14, 1, 1]],
    columns=["Name", "Age", "Label", "ModelOutput"],
)
dataset = mlflow.data.from_pandas(x, targets="Label", predictions="ModelOutput")
```
Represents a Pandas DataFrame for use with MLflow Tracking.
The underlying pandas DataFrame.
The name of the predictions column. May be None
if no predictions column is available.
A profile of the dataset. May be None
if a profile cannot be computed.
An instance of mlflow.types.Schema
representing the tabular dataset. May be None
if the schema cannot be inferred from the dataset.
The source of the dataset.
The name of the target column. May be None
if no target column is available.
Create config dictionary for the dataset.
Returns a string dictionary containing the following fields: name, digest, source, source type, schema, and profile.
Constructs a NumpyDataset
object from NumPy features, optional targets, and source. If the source is path-like, a DatasetSource object is constructed from the source path. Otherwise, the source is assumed to be a DatasetSource object.
features – NumPy features, represented as an np.ndarray or dictionary of named np.ndarrays.
source – The source from which the numpy data was derived, e.g. a filesystem path, an S3 URI, an HTTPS URL, a Delta table name with version, or a Spark table. source
may be specified as a URI, a path-like string, or an instance of DatasetSource
. If unspecified, the source is assumed to be the code location (e.g. notebook cell, script, etc.) where from_numpy
is being called.
targets – Optional NumPy targets, represented as an np.ndarray or dictionary of named np.ndarrays.
name – The name of the dataset. If unspecified, a name is generated.
digest – The dataset digest (hash). If unspecified, a digest is computed automatically.
```python
import mlflow
import numpy as np

x = np.random.uniform(size=[2, 5, 4])
y = np.random.randint(2, size=[2])
dataset = mlflow.data.from_numpy(x, targets=y)
```
```python
import mlflow
import numpy as np

x = {
    "feature_1": np.random.uniform(size=[2, 5, 4]),
    "feature_2": np.random.uniform(size=[2, 5, 4]),
}
y = np.random.randint(2, size=[2])
dataset = mlflow.data.from_numpy(x, targets=y)
```
Represents a NumPy dataset for use with MLflow Tracking.
The features of the dataset.
A profile of the dataset. May be None
if a profile cannot be computed.
MLflow TensorSpec schema representing the dataset features and targets (optional).
The source of the dataset.
The targets of the dataset. May be None
if no targets are available.
Create config dictionary for the dataset.
Returns a string dictionary containing the following fields: name, digest, source, source type, schema, and profile.
Loads a SparkDataset
from a Delta table for use with MLflow Tracking.
path – The path to the Delta table. Either path
or table_name
must be specified.
table_name – The name of the Delta table. Either path
or table_name
must be specified.
version – The Delta table version. If not specified, the version will be inferred.
targets – Optional. The name of the Delta table column containing targets (labels) for supervised learning.
name – The name of the dataset, e.g. "wiki_train". If unspecified, a name is automatically generated.
digest – The digest (hash, fingerprint) of the dataset. If unspecified, a digest is automatically computed.
An instance of SparkDataset
.
Given a Spark DataFrame, constructs a SparkDataset
object for use with MLflow Tracking.
df – The Spark DataFrame from which to construct a SparkDataset.
path – The path of the Spark or Delta source that the DataFrame originally came from. Note that the path does not have to match the DataFrame exactly, since the DataFrame may have been modified by Spark operations. This is used to reload the dataset upon request via SparkDataset.source.load()
. If none of path
, table_name
, or sql
are specified, a CodeDatasetSource is used, which will source information from the run context.
table_name – The name of the Spark or Delta table that the DataFrame originally came from. Note that the table does not have to match the DataFrame exactly, since the DataFrame may have been modified by Spark operations. This is used to reload the dataset upon request via SparkDataset.source.load()
. If none of path
, table_name
, or sql
are specified, a CodeDatasetSource is used, which will source information from the run context.
version – If the DataFrame originally came from a Delta table, specifies the version of the Delta table. This is used to reload the dataset upon request via SparkDataset.source.load()
. version
cannot be specified if sql
is specified.
sql – The Spark SQL statement that was originally used to construct the DataFrame. Note that the Spark SQL statement does not have to match the DataFrame exactly, since the DataFrame may have been modified by Spark operations. This is used to reload the dataset upon request via SparkDataset.source.load()
. If none of path
, table_name
, or sql
are specified, a CodeDatasetSource is used, which will source information from the run context.
targets – Optional. The name of the DataFrame column containing targets (labels) for supervised learning.
name – The name of the dataset, e.g. "wiki_train". If unspecified, a name is automatically generated.
digest – The digest (hash, fingerprint) of the dataset. If unspecified, a digest is automatically computed.
predictions – Optional. The name of the column containing model predictions, if the dataset contains model predictions. If specified, this column must be present in the dataframe (df
).
An instance of SparkDataset
.
Represents a Spark dataset (e.g. data derived from a Spark Table / file directory or Delta Table) for use with MLflow Tracking.
The Spark DataFrame instance.
The name of the predictions column. May be None
if no predictions column was specified when the dataset was created.
A profile of the dataset. May be None if no profile is available.
The MLflow ColSpec schema of the Spark dataset.
Spark dataset source information.
An instance of SparkDatasetSource
or DeltaDatasetSource
.
The name of the Spark DataFrame column containing targets (labels) for supervised learning.
The string name of the Spark DataFrame column containing targets.
Create config dictionary for the dataset.
Returns a string dictionary containing the following fields: name, digest, source, source type, schema, and profile.
Create a mlflow.data.huggingface_dataset.HuggingFaceDataset from a Hugging Face dataset.
ds – A Hugging Face dataset. Must be an instance of datasets.Dataset. Other types, such as datasets.DatasetDict, are not supported.
path – The path of the Hugging Face dataset used to construct the source. This is the same argument as path in the datasets.load_dataset() function. To be able to reload the dataset via MLflow, path must match the path of the dataset on the hub, e.g., "databricks/databricks-dolly-15k". If no path is specified, a CodeDatasetSource is used, which will source information from the run context.
targets – The name of the Hugging Face datasets.Dataset column containing targets (labels) for supervised learning.
data_dir – The data_dir of the Hugging Face dataset configuration. This is used by the datasets.load_dataset() function to reload the dataset upon request via HuggingFaceDataset.source.load()
.
data_files – Paths to source data file(s) for the Hugging Face dataset configuration. This is used by the datasets.load_dataset() function to reload the dataset upon request via HuggingFaceDataset.source.load()
.
revision – Version of the dataset script to load. This is used by the datasets.load_dataset() function to reload the dataset upon request via HuggingFaceDataset.source.load()
.
name – The name of the dataset, e.g. "wiki_train". If unspecified, a name is automatically generated.
digest – The digest (hash, fingerprint) of the dataset. If unspecified, a digest is automatically computed.
trust_remote_code – Whether to trust remote code from the dataset repo.
source – The source of the dataset, e.g. an S3 URI, an HTTPS URL, etc.
Represents a HuggingFace dataset for use with MLflow Tracking.
The Hugging Face datasets.Dataset
instance.
Summary statistics for the Hugging Face dataset, including the number of rows, size, and size in bytes.
The MLflow ColSpec schema of the Hugging Face dataset.
Hugging Face dataset source information.
The name of the Hugging Face dataset column containing targets (labels) for supervised learning.
The string name of the Hugging Face dataset column containing targets.
Create config dictionary for the dataset.
Returns a string dictionary containing the following fields: name, digest, source, source type, schema, and profile.
Converts the dataset to an EvaluationDataset for model evaluation. Required for use with mlflow.evaluate().
Constructs a TensorFlowDataset object from TensorFlow data, optional targets, and source.
If the source is path-like, a DatasetSource object is constructed from the source path. Otherwise, the source is assumed to be a DatasetSource object.
features – A TensorFlow dataset or tensor of features.
source – The source from which the data was derived, e.g. a filesystem path, an S3 URI, an HTTPS URL, a Delta table name with version, or a Spark table. If source is not a path-like string, pass in a DatasetSource object directly. If no source is specified, a CodeDatasetSource is used, which will source information from the run context.
targets – A TensorFlow dataset or tensor of targets. Optional.
name – The name of the dataset. If unspecified, a name is generated.
digest – A dataset digest (hash). If unspecified, a digest is computed automatically.
Represents a TensorFlow dataset for use with MLflow Tracking.
The underlying TensorFlow data.
A profile of the dataset. May be None if no profile is available.
An MLflow TensorSpec schema representing the tensor dataset.
The source of the dataset.
The targets of the dataset.
Create config dictionary for the dataset.
Returns a string dictionary containing the following fields: name, digest, source, source type, schema, and profile.
Converts the dataset to an EvaluationDataset for model evaluation. Only supported if the dataset is a Tensor. Required for use with mlflow.evaluate().
An input dataset for model evaluation. This is intended for use with the mlflow.models.evaluate()
API.
Return the digest of the dataset.
Returns the features data as a numpy array or a pandas DataFrame.
Returns True if the dataset has predictions, False otherwise.
Returns True if the dataset has targets, False otherwise.
Dataset hash, which includes a hash of the first 20 rows and the last 20 rows.
Returns the labels data as a numpy array.
Dataset name: the name specified by the user, or the dataset hash if no name was specified.
Dataset path
Returns the predictions data as a numpy array.
Returns the predictions name.
Returns the targets name.
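The hash description above says only the first 20 and last 20 rows are hashed. The sketch below illustrates that general idea with the standard library. It is not MLflow's implementation: head_tail_digest is a hypothetical helper, and the choice of SHA-256 over the repr of each row is an assumption made for the example. It shows the trade-off of sampling: a change in the middle of a large dataset goes undetected, while a change near either end alters the digest.

```python
import hashlib


def head_tail_digest(rows, n=20):
    """Hash only the first n and last n rows (hypothetical helper, not MLflow's code)."""
    rows = list(rows)
    sample = rows if len(rows) <= 2 * n else rows[:n] + rows[-n:]
    hasher = hashlib.sha256()
    # Include the total row count so datasets of different lengths differ.
    hasher.update(str(len(rows)).encode("utf-8"))
    for row in sample:
        hasher.update(repr(row).encode("utf-8"))
    return hasher.hexdigest()


rows = [(i, i * 2) for i in range(1000)]

changed_middle = list(rows)
changed_middle[500] = (500, -1)  # outside the sampled head and tail

changed_tail = list(rows)
changed_tail[-1] = (999, -1)  # inside the sampled tail

same_after_middle_change = head_tail_digest(rows) == head_tail_digest(changed_middle)
same_after_tail_change = head_tail_digest(rows) == head_tail_digest(changed_tail)
print(same_after_middle_change, same_after_tail_change)
```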
Construct a PolarsDataset
instance.
df – A polars DataFrame.
source – Source from which the DataFrame was derived, e.g. a filesystem path, an S3 URI, an HTTPS URL, a Delta table name with version, or a Spark table. source
may be specified as a URI, a path-like string, or an instance of DatasetSource
. If unspecified, the source is assumed to be the code location (e.g. notebook cell, script, etc.) where from_polars
is being called.
targets – An optional target column name for supervised training. This column must be present in df
.
name – Name of the dataset. If unspecified, a name is generated.
digest – Dataset digest (hash). If unspecified, a digest is computed automatically.
predictions – An optional predictions column name for model evaluation. This column must be present in df
.
```python
import mlflow
import polars as pl

# Each inner list is one row, so the row orientation is stated explicitly.
x = pl.DataFrame(
    [["tom", 10, 1, 1], ["nick", 15, 0, 1], ["julie", 14, 1, 1]],
    schema=["Name", "Age", "Label", "ModelOutput"],
    orient="row",
)
dataset = mlflow.data.from_polars(x, targets="Label", predictions="ModelOutput")
```
A polars DataFrame for use with MLflow Tracking.
Underlying DataFrame.
Name of the predictions column.
May be None
if no predictions column is available.
Profile of the dataset.
Instance of mlflow.types.Schema
representing the tabular dataset.
May be None
if the schema cannot be inferred from the dataset.
Source of the dataset.
Name of the target column.
May be None
if no target column is available.
Create config dictionary for the dataset.
Return a string dictionary containing the following fields: name, digest, source, source type, schema, and profile.
Represents the source of a dataset stored on a filesystem, e.g. a local UNIX filesystem, blob storage services like S3, etc.
source_dict – A dictionary representation of the FileSystemDatasetSource.
Downloads the dataset source to the local filesystem.
dst_path – Path of the local filesystem destination directory to which to download the dataset source. If the directory does not exist, it is created. If unspecified, the dataset source is downloaded to a new uniquely-named directory on the local filesystem, unless the dataset source already exists on the local filesystem, in which case its local path is returned directly.
The path to the downloaded dataset source on the local filesystem.
A JSON-compatible dictionary representation of the FileSystemDatasetSource.
The URI referring to the dataset source filesystem location.
The URI referring to the dataset source filesystem location, e.g. "s3://mybucket/path/to/mydataset", "/tmp/path/to/my/dataset", etc.
Represents the source of a dataset stored at a web location and referred to by an HTTP or HTTPS URL.
source_dict – A dictionary representation of the HTTPDatasetSource.
Downloads the dataset source to the local filesystem.
dst_path – Path of the local filesystem destination directory to which to download the dataset source. If the directory does not exist, it is created. If unspecified, the dataset source is downloaded to a new uniquely-named directory on the local filesystem.
The path to the downloaded dataset source on the local filesystem.
A JSON-compatible dictionary representation of the HTTPDatasetSource.
The HTTP/S URL referring to the dataset source location.
Represents the source of a Hugging Face dataset used in MLflow Tracking.
Load the Hugging Face dataset based on HuggingFaceDatasetSource.
kwargs – Additional keyword arguments used for loading the dataset with the Hugging Face datasets.load_dataset() method.
An instance of datasets.Dataset.
Obtains a JSON-compatible dictionary representation of the DatasetSource.
A JSON-compatible dictionary representation of the DatasetSource.
Represents the source of a dataset stored in a Delta table.
Loads the dataset source as a Delta Dataset Source.
An instance of pyspark.sql.DataFrame
.
Obtains a JSON-compatible dictionary representation of the DatasetSource.
A JSON-compatible dictionary representation of the DatasetSource.
Represents the source of a dataset stored in a spark table.
Loads the dataset source as a Spark Dataset Source.
An instance of pyspark.sql.DataFrame
.
Obtains a JSON-compatible dictionary representation of the DatasetSource.
A JSON-compatible dictionary representation of the DatasetSource.