Compute chi-squared stats between each non-negative feature and class.
This score can be used to select the n_features features with the highest values for the test chi-squared statistic from X, which must contain only non-negative integer feature values such as booleans or frequencies (e.g., term counts in document classification), relative to the classes.

If some of your features are continuous, you need to bin them, for example by using KBinsDiscretizer.
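As a rough illustration of that preprocessing step, the sketch below bins continuous features into ordinal, non-negative integer codes before scoring them; the random data and the choice of n_bins=3 are made up for the example.

import numpy as np
from sklearn.feature_selection import chi2
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical continuous features and class labels.
rng = np.random.RandomState(0)
X_continuous = rng.uniform(size=(20, 2))
y = rng.randint(0, 2, size=20)

# Bin each feature into 3 ordinal codes (0, 1, 2), which are
# non-negative integers and therefore valid input for chi2.
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
X_binned = binner.fit_transform(X_continuous)

chi2_stats, p_values = chi2(X_binned, y)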
Recall that the chi-square test measures dependence between stochastic variables, so using this function “weeds out” the features that are the most likely to be independent of class and therefore irrelevant for classification.
Read more in the User Guide.
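For the selection step described above, a minimal sketch using SelectKBest to keep the k=2 highest-scoring features (reusing the count data from the Examples section below; the choice of k is arbitrary here):

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Non-negative count features and class labels.
X = np.array([[1, 1, 3],
              [0, 1, 5],
              [5, 4, 1],
              [6, 6, 2],
              [1, 4, 0],
              [0, 0, 0]])
y = np.array([1, 1, 0, 0, 2, 2])

# Keep the 2 features with the highest chi-squared scores.
selector = SelectKBest(chi2, k=2)
X_selected = selector.fit_transform(X, y)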
Parameters
X : {array-like, sparse matrix} of shape (n_samples, n_features)
Sample vectors.
y : array-like of shape (n_samples,)
Target vector (class labels).
Returns
chi2 : ndarray of shape (n_features,)
Chi2 statistics for each feature.
p_values : ndarray of shape (n_features,)
P-values for each feature.
See also
f_classif
ANOVA F-value between label/feature for classification tasks.
f_regression
F-value between label/feature for regression tasks.
Notes
Complexity of this algorithm is O(n_classes * n_features).
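That cost comes from the observed and expected count tables, each of shape (n_classes, n_features). A minimal sketch of the computation, assuming dense input and at least one nonzero value per feature (not the library's actual implementation, which also handles sparse matrices and edge cases):

import numpy as np
from scipy.stats import chi2 as chi2_dist

def chi2_sketch(X, y):
    X = np.asarray(X, dtype=float)
    # One-hot encode the class labels: shape (n_samples, n_classes).
    classes = np.unique(y)
    Y = (np.asarray(y)[:, None] == classes[None, :]).astype(float)

    # Observed per-class feature totals: shape (n_classes, n_features).
    observed = Y.T @ X

    # Expected totals if each feature were independent of the class.
    feature_count = X.sum(axis=0)
    class_prob = Y.mean(axis=0)
    expected = np.outer(class_prob, feature_count)

    # Chi-squared statistic and p-value, with n_classes - 1
    # degrees of freedom per feature.
    stat = ((observed - expected) ** 2 / expected).sum(axis=0)
    p = chi2_dist.sf(stat, df=len(classes) - 1)
    return stat, p

On the data from the Examples section, this reproduces the statistics shown below.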
Examples
>>> import numpy as np
>>> from sklearn.feature_selection import chi2
>>> X = np.array([[1, 1, 3],
...               [0, 1, 5],
...               [5, 4, 1],
...               [6, 6, 2],
...               [1, 4, 0],
...               [0, 0, 0]])
>>> y = np.array([1, 1, 0, 0, 2, 2])
>>> chi2_stats, p_values = chi2(X, y)
>>> chi2_stats
array([15.3,  6.5,  8.9])
>>> p_values
array([0.000456, 0.0387, 0.0116])