Author: Thomas J. Fan
Status: Accepted
Type: Standards Track
Created: 2022-06-22
This SLEP proposes a set_output method to configure the output data container of scikit-learn transformers.
Currently, scikit-learn transformers return NumPy ndarrays or SciPy sparse matrices. This SLEP proposes adding a set_output method to configure a transformer to output pandas DataFrames:
    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler().set_output(transform="pandas")
    scaler.fit(X_df)

    # X_trans_df is a pandas DataFrame
    X_trans_df = scaler.transform(X_df)
The index of the output DataFrame must match the index of the input. If the transformer does not support transform="pandas", then it must raise a ValueError stating that it does not support the feature.
This SLEP focuses only on dense data for set_output. If a transformer returns sparse data, e.g. OneHotEncoder(sparse=True), then transform will raise a ValueError if set_output(transform="pandas") has been called. Dealing with sparse output might be the scope of a future SLEP.
For a pipeline, calling set_output will configure all inner transformers and does not configure non-transformers. This enables the following workflow:
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    log_reg = make_pipeline(SimpleImputer(), StandardScaler(), LogisticRegression())
    log_reg.set_output(transform="pandas")

    # All transformers return DataFrames during fit
    log_reg.fit(X_df, y)

    # X_trans_df is a pandas DataFrame
    X_trans_df = log_reg[:-1].transform(X_df)

    # X_trans_df is again a pandas DataFrame
    X_trans_df = log_reg[0].transform(X_df)

    # The classifier contains the feature names in
    log_reg[-1].feature_names_in_
Meta-estimators that support set_output are required to configure all inner transformers by calling set_output. Specifically, all fitted and non-fitted inner transformers must be configured with set_output. This enables transform’s output to be a DataFrame before and after the meta-estimator is fitted. If an inner transformer does not define set_output, then an error is raised.
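The propagation rule for meta-estimators can be sketched without scikit-learn at all. Every class below is a hypothetical stand-in written for this illustration, not scikit-learn code. The choice of ValueError for the missing-method case is an assumption, since the text only says that an error is raised.

```python
class ToyScaler:
    """Stand-in transformer that records its configured output container."""

    def __init__(self):
        self.output = "default"

    def transform(self, X):
        return X

    def set_output(self, *, transform=None):
        if transform is not None:
            self.output = transform
        return self


class ToyClassifier:
    """Stand-in non-transformer: it has no transform and no set_output."""

    def predict(self, X):
        return [0 for _ in X]


class LegacyTransformer:
    """Stand-in transformer that does not define set_output."""

    def transform(self, X):
        return X


class ToyPipeline:
    """Hypothetical meta-estimator that forwards set_output to its steps."""

    def __init__(self, *steps):
        self.steps = list(steps)

    def set_output(self, *, transform=None):
        for step in self.steps:
            if not hasattr(step, "transform"):
                continue  # non-transformers are left untouched
            if not hasattr(step, "set_output"):
                # Assumption: the error type is not fixed by the text.
                raise ValueError(
                    f"{type(step).__name__} does not define set_output"
                )
            step.set_output(transform=transform)
        return self


scaler = ToyScaler()
pipe = ToyPipeline(scaler, ToyClassifier()).set_output(transform="pandas")
print(scaler.output)

message = None
try:
    ToyPipeline(LegacyTransformer()).set_output(transform="pandas")
except ValueError as exc:
    message = str(exc)
print(message)
```

Because the forwarding happens in set_output itself rather than in fit, the inner transformers are configured whether or not the meta-estimator has been fitted yet.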
For ease of use, this SLEP proposes a global configuration flag that sets the output for all transformers:
    import sklearn

    sklearn.set_config(transform_output="pandas")
The global default configuration is "default", where the transformer determines the output container.
The configuration can also be set locally using the config_context context manager:
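    from sklearn import config_context
    from sklearn.decomposition import PCA
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    with config_context(transform_output="pandas"):
        num_prep = make_pipeline(SimpleImputer(), StandardScaler(), PCA())
        num_prep.fit_transform(X_df)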
The three ways to configure the output container take precedence in the following order, from highest to lowest:
1. Locally configure a transformer: transformer.set_output
2. Context manager: config_context
3. Global configuration: set_config
A possible implementation of this SLEP is worked out in #23734.
Backward compatibility
There are no backward compatibility concerns, because the set_output method is a new API. Third-party transformers can opt in to the API by defining set_output.
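A third-party transformer opting in might look like the sketch below. LogTransformer and its _output attribute are hypothetical names invented for this example. The sketch handles only local configuration and ignores the global flag. Treating transform=None as "leave the setting unchanged" is an assumption.

```python
import numpy as np
import pandas as pd


class LogTransformer:
    """Hypothetical third-party transformer that opts in to set_output."""

    def __init__(self):
        self._output = "default"

    def set_output(self, *, transform=None):
        if transform is None:
            return self  # assumption: None leaves the configuration unchanged
        if transform not in ("default", "pandas"):
            raise ValueError(f"Unsupported output container: {transform!r}")
        self._output = transform
        return self

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        values = np.log1p(np.asarray(X, dtype=float))
        if self._output == "pandas":
            # Reuse the input's index and columns when the input has them.
            index = getattr(X, "index", None)
            columns = getattr(X, "columns", None)
            return pd.DataFrame(values, index=index, columns=columns)
        return values


X_df = pd.DataFrame({"x": [0.0, 1.0]}, index=[10, 20])

as_array = LogTransformer().fit(X_df).transform(X_df)
as_frame = LogTransformer().set_output(transform="pandas").fit(X_df).transform(X_df)

print(type(as_array).__name__, type(as_frame).__name__)
print(list(as_frame.index))
```

Returning self from set_output keeps the method chainable, matching the usage shown earlier in this document.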
Alternatives to this SLEP include:
SLEP014 proposes that if the input is a DataFrame, then the output is a DataFrame.
Prototype #20100 showcases array_out="pandas" in transform. This API is limited because it does not directly support fitting a pipeline whose steps require DataFrame input.
Issues discussing pandas output include #14315, #20100, and #23001. This SLEP proposes configuring the output to be pandas because it is the DataFrame library that is most widely used and requested by users. The set_output API can be extended to support additional DataFrame libraries and sparse data formats in the future.
This document has been placed in the public domain. [1]