n_features_in_ attribute

Author: Nicolas Hug
Status: Final
Type: Standards Track
Created: 2019-11-23
Resolution: Implemented with v0.23.
Abstract

This SLEP proposes the introduction of a public n_features_in_ attribute for most estimators (where relevant).
Knowing the number of features that an estimator expects is useful for inspection purposes. It is also useful for implementing feature name propagation (SLEP 8). For example, any of the scalers can easily create feature names if they know n_features_in_.
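As an illustration (not part of the proposal itself), a scaler-like transformer that records n_features_in_ could derive default feature names along the following lines; MinimalScaler and its get_feature_names method are hypothetical sketches, not scikit-learn code:

import numpy as np

class MinimalScaler:
    """Hypothetical transformer used only to illustrate n_features_in_."""

    def fit(self, X, y=None):
        X = np.asarray(X)
        # Record how many features were seen during fit.
        self.n_features_in_ = X.shape[1]
        self.mean_ = X.mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X) - self.mean_

    def get_feature_names(self):
        # Default names can be derived directly from n_features_in_.
        return [f"x{i}" for i in range(self.n_features_in_)]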
The proposed solution is to replace most calls to check_array() or check_X_y() by calls to a newly created private method:

def _validate_data(self, X, y=None, reset=True, **check_array_params):
    ...
The _validate_data() method will call the check_array() or check_X_y() function, depending on the y parameter.
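A minimal sketch of what such a method could look like, assuming it lives on a shared base class (the ValidationMixinSketch name is illustrative; the actual scikit-learn implementation handles more cases):

from sklearn.utils.validation import check_array, check_X_y

class ValidationMixinSketch:
    """Illustrative mixin; not the actual scikit-learn base class."""

    def _validate_data(self, X, y=None, reset=True, **check_array_params):
        # Dispatch to the existing stateless helpers depending on y.
        if y is None:
            X = check_array(X, **check_array_params)
            out = X
        else:
            X, y = check_X_y(X, y, **check_array_params)
            out = (X, y)

        if reset:
            # fit / first partial_fit: (re)set the attribute unconditionally.
            self.n_features_in_ = X.shape[1]
        elif X.shape[1] != self.n_features_in_:
            # predict / transform / later partial_fit: check consistency.
            raise ValueError(
                f"X has {X.shape[1]} features, but this estimator was "
                f"fitted with {self.n_features_in_} features."
            )
        return out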
If the reset parameter is True (the default), the method will set the n_features_in_ attribute of the estimator, regardless of any previous value. This should typically be done in fit(), or in the first call to partial_fit(). Passing reset=False will not set the attribute but will instead check the input against it, potentially raising an error. This should typically be done in predict() or transform(), or in subsequent calls to partial_fit().
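For instance, an estimator using the method might look like the following sketch (building on the illustrative ValidationMixinSketch above):

class MyTransformer(ValidationMixinSketch):
    def fit(self, X, y=None):
        X = self._validate_data(X, reset=True)   # sets n_features_in_
        self.offset_ = X.mean(axis=0)
        return self

    def transform(self, X):
        X = self._validate_data(X, reset=False)  # checks n_features_in_
        return X - self.offset_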
In most cases, the n_features_in_
attribute exists only once fit
has been called, but there are exceptions (see below).
A new common check is added: it makes sure that for most estimators, the n_features_in_
attribute does not exist until fit
is called, and that its value is correct. Instead of raising an exception, this check will raise a warning for the next two releases. This will give downstream packages some time to adjust (see considerations below).
Since the introduced method is private, third party libraries are recommended not to rely on it.
The logic proposed here (calling a stateful method instead of a stateless function) is a prerequisite to fixing the dataframe column ordering issue: with a stateless check_array, there is no way to raise an error if the column ordering of a dataframe was changed between fit and predict. This is, however, out of scope for this SLEP, which only focuses on the introduction of the n_features_in_ attribute.
The main consideration is that the addition of the common test means that existing estimators in downstream libraries will not pass our test suite, unless the estimators also have the n_features_in_
attribute.
The newly introduced checks will only raise a warning instead of an exception for the next two releases, giving downstream packages more time to adjust.
There are other minor considerations:
In most meta-estimators, the input validation is handled by the sub-estimator(s). The n_features_in_ attribute of the meta-estimator is thus explicitly set to that of the sub-estimator, either via a @property or directly in fit().
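As a sketch (not the actual scikit-learn code), a meta-estimator could forward the attribute of its fitted sub-estimator like this; MyMetaEstimator is a hypothetical example:

from sklearn.base import clone

class MyMetaEstimator:
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        # Input validation, and thus n_features_in_, is handled by the
        # sub-estimator.
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    @property
    def n_features_in_(self):
        # Expose the fitted sub-estimator's attribute on the meta-estimator.
        return self.estimator_.n_features_in_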
Some estimators like the dummy estimators do not validate the input (the ‘no_validation’ tag should be True). The n_features_in_
attribute should be set to None, though this is not enforced in the common check.
Some estimators expect a non-rectangular input: the vectorizers. These estimators expect dicts or lists, not an n_samples * n_features matrix. n_features_in_ makes no sense here, and these estimators simply do not have the attribute.
Some estimators may know the number of input features before fit is called: typically the SparseCoder, where n_features_in_ is known at __init__ from the dictionary parameter. In this case, the attribute is a property and is available right after object instantiation.
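A sketch of how such an estimator could expose the attribute as a property derived from its constructor parameter, assuming dictionary is an array of shape (n_components, n_features); SparseCoderLike is illustrative, not the actual SparseCoder implementation:

class SparseCoderLike:
    def __init__(self, dictionary):
        # dictionary is expected to have shape (n_components, n_features).
        self.dictionary = dictionary

    @property
    def n_features_in_(self):
        # Known right after __init__, before fit is ever called.
        return self.dictionary.shape[1]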
This document has been placed in the public domain. [1]