Univariate linear regression tests returning F-statistic and p-values.
Quick linear model for testing the effect of a single regressor, sequentially for many regressors.
This is done in 2 steps:

1. The cross correlation between each regressor and the target is computed using r_regression as:

   E[(X[:, i] - mean(X[:, i])) * (y - mean(y))] / (std(X[:, i]) * std(y))

2. It is converted to an F score and then to a p-value.
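For illustration, here is a minimal sketch of those two steps in plain NumPy/SciPy. It assumes centered data (hence n - 2 degrees of freedom), the helper name f_regression_sketch is hypothetical, and degenerate cases (constant or perfectly correlated features) are not handled; this is a sketch, not the library's implementation.

import numpy as np
from scipy import stats

def f_regression_sketch(X, y):
    # Hypothetical helper: a sketch of the two steps above, assuming
    # centered data; not the library implementation.
    n_samples = y.size
    # Step 1: Pearson correlation between each column of X and y.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc * yc[:, None]).mean(axis=0) / (X.std(axis=0) * y.std())
    # Step 2: convert r to an F statistic with 1 and (n - 2) degrees
    # of freedom, then to a p-value via the F survival function.
    dof = n_samples - 2
    f_statistic = r**2 / (1.0 - r**2) * dof
    p_values = stats.f.sf(f_statistic, 1, dof)
    return f_statistic, p_values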
f_regression is derived from r_regression and will rank features in the same order if all the features are positively correlated with the target.

Note, however, that contrary to f_regression, r_regression values lie in [-1, 1] and can thus be negative. f_regression is therefore recommended as a feature selection criterion to identify potentially predictive features for a downstream classifier, irrespective of the sign of the association with the target variable.

Furthermore, f_regression returns p-values while r_regression does not.
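To make the sign distinction concrete, the following sketch builds a feature that is strongly anti-correlated with the target: r_regression reports a value near -1, while f_regression still assigns it a large F-statistic. The toy data and variable names are illustrative.

import numpy as np
from sklearn.feature_selection import f_regression, r_regression

rng = np.random.RandomState(0)
x = rng.normal(size=100)
X = np.column_stack([x, rng.normal(size=100)])
y = -3.0 * x + rng.normal(scale=0.1, size=100)  # anti-correlated with X[:, 0]

print(r_regression(X, y))     # first value is close to -1 (negative association)
print(f_regression(X, y)[0])  # first F-statistic is large regardless of the sign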
Read more in the User Guide.
Parameters

X : array-like or sparse matrix of shape (n_samples, n_features)
    The data matrix.

y : array-like of shape (n_samples,)
    The target vector.

center : bool, default=True
    Whether or not to center the data matrix X and the target vector y. By default, X and y will be centered.
force_finite : bool, default=True
    Whether or not to force the F-statistics and associated p-values to be finite. There are two cases where the F-statistic is expected to not be finite (both are demonstrated in the sketch below):

    - when the target y or some features in X are constant. In this case, the Pearson's R correlation is not defined, leading to np.nan values in the F-statistic and p-value. When force_finite=True, the F-statistic is set to 0.0 and the associated p-value is set to 1.0.
    - when a feature in X is perfectly correlated (or anti-correlated) with the target y. In this case, the F-statistic is expected to be np.inf. When force_finite=True, the F-statistic is set to np.finfo(dtype).max and the associated p-value is set to 0.0.

    Added in version 1.1.
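As a sketch of this parameter's effect, the snippet below scores one constant feature and one perfectly correlated feature, first with the default force_finite=True and then with force_finite=False. The toy data is illustrative.

import numpy as np
from sklearn.feature_selection import f_regression

y = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([
    np.ones(4),  # constant feature: Pearson's R is not defined
    y,           # perfectly correlated with the target
])

# Default force_finite=True maps the two degenerate cases to finite
# values: (F=0.0, p=1.0) and (F=np.finfo(X.dtype).max, p=0.0).
print(f_regression(X, y))
# force_finite=False returns the raw np.nan / np.inf values instead.
print(f_regression(X, y, force_finite=False))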
Returns

f_statistic : ndarray of shape (n_features,)
    F-statistic for each feature.

p_values : ndarray of shape (n_features,)
    P-values associated with the F-statistic.
See also
r_regression
Pearson’s R between label/feature for regression tasks.
f_classif
ANOVA F-value between label/feature for classification tasks.
chi2
Chi-squared stats of non-negative features for classification tasks.
SelectKBest
Select features based on the k highest scores.
SelectFpr
Select features based on a false positive rate test.
SelectFdr
Select features based on an estimated false discovery rate.
SelectFwe
Select features based on family-wise error rate.
SelectPercentile
Select features based on percentile of the highest scores.
Examples
>>> from sklearn.datasets import make_regression
>>> from sklearn.feature_selection import f_regression
>>> X, y = make_regression(
...     n_samples=50, n_features=3, n_informative=1, noise=1e-4, random_state=42
... )
>>> f_statistic, p_values = f_regression(X, y)
>>> f_statistic
array([1.21, 2.67e13, 2.66])
>>> p_values
array([0.276, 1.54e-283, 0.11])
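As a follow-up usage sketch, f_regression can be passed as the score_func of SelectKBest (listed under See also) to keep only the top-scoring features; the snippet below reuses the dataset from the example above.

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

X, y = make_regression(
    n_samples=50, n_features=3, n_informative=1, noise=1e-4, random_state=42
)
# Keep the single feature with the highest F-statistic.
selector = SelectKBest(score_func=f_regression, k=1)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # (50, 1): only the informative feature is kept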