Showing content from https://scikit-learn.org/dev/auto_examples/miscellaneous/plot_set_output.html below:

Introducing the set_output API — scikit-learn 1.8.dev0 documentation

Introducing the set_output API

This example will demonstrate the set_output API to configure transformers to output pandas DataFrames. set_output can be configured per estimator by calling the set_output method or globally by setting set_config(transform_output="pandas"). For details, see SLEP018.

First, we load the iris dataset as a DataFrame to demonstrate the set_output API.
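
The loading code is not shown here. A minimal sketch, assuming the data comes from `sklearn.datasets.load_iris` with `as_frame=True` and is split with `train_test_split`; the `stratify` and `random_state` arguments are assumptions, so the rows it displays may differ from the output shown below.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# as_frame=True returns X as a pandas DataFrame and y as a pandas Series.
X, y = load_iris(as_frame=True, return_X_y=True)

# Assumed split settings (not confirmed by the surrounding text).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)
X_train.head()
```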

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
60                 5.0               2.0                3.5               1.0
1                  4.9               3.0                1.4               0.2
8                  4.4               2.9                1.4               0.2
93                 5.0               2.3                3.3               1.0
106                4.9               2.5                4.5               1.7

To configure an estimator such as preprocessing.StandardScaler to return DataFrames, call set_output. This feature requires pandas to be installed.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().set_output(transform="pandas")

scaler.fit(X_train)
X_test_scaled = scaler.transform(X_test)
X_test_scaled.head()
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
39          -0.894264          0.798301          -1.271411         -1.327605
12          -1.244466         -0.086944          -1.327407         -1.459074
48          -0.660797          1.462234          -1.271411         -1.327605
23          -0.894264          0.576989          -1.159419         -0.933197
81          -0.427329         -1.414810          -0.039497         -0.275851

set_output can be called after fit to configure transform after the fact.

scaler2 = StandardScaler()

scaler2.fit(X_train)
X_test_np = scaler2.transform(X_test)
print(f"Default output type: {type(X_test_np).__name__}")

scaler2.set_output(transform="pandas")
X_test_df = scaler2.transform(X_test)
print(f"Configured pandas output type: {type(X_test_df).__name__}")
Default output type: ndarray
Configured pandas output type: DataFrame

In a pipeline.Pipeline, set_output configures all steps to output DataFrames.

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('selectpercentile', SelectPercentile(percentile=75)),
                ('logisticregression', LogisticRegression())])

Each transformer in the pipeline is configured to return DataFrames. This means that the final logistic regression step contains the feature names of the input.

clf[-1].feature_names_in_
array(['sepal length (cm)', 'petal length (cm)', 'petal width (cm)'],
      dtype=object)

Note

If one uses the method set_params, the transformer will be replaced by a new one with the default output format.

clf.set_params(standardscaler=StandardScaler())
clf.fit(X_train, y_train)
clf[-1].feature_names_in_
array(['x0', 'x2', 'x3'], dtype=object)

To keep the intended behavior, call set_output on the new transformer before passing it to set_params:

scaler = StandardScaler().set_output(transform="pandas")
clf.set_params(standardscaler=scaler)
clf.fit(X_train, y_train)
clf[-1].feature_names_in_
array(['sepal length (cm)', 'petal length (cm)', 'petal width (cm)'],
      dtype=object)

Next we load the titanic dataset to demonstrate set_output with compose.ColumnTransformer and heterogeneous data.
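
The loading code is not shown here. One possible sketch, assuming the data is fetched from OpenML with `fetch_openml("titanic", version=1)`; the helper name `load_titanic_split` and the split settings are hypothetical, and the download needs network access the first time it runs.

```python
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split


def load_titanic_split(random_state=0):
    """Download titanic from OpenML and return a stratified train/test split.

    Hypothetical helper for illustration; requires network access on first use.
    """
    # as_frame=True keeps the mixed numeric/categorical columns in a DataFrame.
    X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
    return train_test_split(X, y, stratify=y, random_state=random_state)
```

Calling `X_train, X_test, y_train, y_test = load_titanic_split()` would then provide the variables used in the code below.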

The set_output API can be configured globally by using set_config and setting transform_output to "pandas".

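from sklearn import set_config
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectPercentile
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler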

set_config(transform_output="pandas")

num_pipe = make_pipeline(SimpleImputer(), StandardScaler())
num_cols = ["age", "fare"]
ct = ColumnTransformer(
    (
        ("numerical", num_pipe, num_cols),
        (
            "categorical",
            OneHotEncoder(
                sparse_output=False, drop="if_binary", handle_unknown="ignore"
            ),
            ["embarked", "sex", "pclass"],
        ),
    ),
    verbose_feature_names_out=False,
)
clf = make_pipeline(ct, SelectPercentile(percentile=50), LogisticRegression())
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

With the global configuration, all transformers output DataFrames. This allows us to easily plot the logistic regression coefficients with the corresponding feature names.

import pandas as pd

log_reg = clf[-1]
coef = pd.Series(log_reg.coef_.ravel(), index=log_reg.feature_names_in_)
_ = coef.sort_values().plot.barh()

In order to demonstrate the config_context functionality below, let us first reset transform_output to its default value.

When the output type is configured with config_context, what counts is the configuration in effect when transform or fit_transform is called. Setting it only while constructing or fitting the transformer has no effect.

StandardScaler()

from sklearn import config_context

with config_context(transform_output="pandas"):
    # transform is called inside the context, so it returns a pandas DataFrame
    X_test_scaled = scaler.transform(X_test[num_cols])
X_test_scaled.head()
          age      fare
629  0.628306 -0.063210
688 -0.057984 -0.515704
439  1.314596  0.566624
664 -0.675645 -0.512279
669 -0.744274 -0.496950

Outside of the context manager, the output will be a NumPy array.

X_test_scaled = scaler.transform(X_test[num_cols])
X_test_scaled[:5]
array([[ 0.62830616, -0.06320955],
       [-0.05798371, -0.51570367],
       [ 1.31459603,  0.56662405],
       [-0.6756446 , -0.51227857],
       [-0.74427358, -0.49694966]])
