fs_param_grids <- list(
  "Lasso" = list(
    "feature_selector__estimator__C" = c(0.01, 0.1, 1, 10),
    "feature_selector__estimator__solver" = c('liblinear', 'saga')
  ),
  "Univariate" = list(
    "feature_selector__param" = seq(50L, 200L, by = 50L)
  ),
  "boruta" = list(
    "feature_selector__perc" = seq(80L, 100L, by = 10L),
    "feature_selector__n_estimators" = c(50L, 100L, 250L, 500L)
  ),
  "RandomForest" = list(
    "feature_selector__estimator__n_estimators" = seq(100L, 500L, by = 50L),
    "feature_selector__estimator__max_depth" = c(10L, 20L, 30L),
    "feature_selector__estimator__min_samples_split" = c(2L, 5L, 10L),
    "feature_selector__estimator__min_samples_leaf" = c(1L, 2L, 4L),
    "feature_selector__estimator__bootstrap" = c(TRUE, FALSE)
  )
)
Pipeline Selection: If custom pipelines are provided, they are used. Otherwise, default pipelines are created from the chosen feature selection methods. Each pipeline contains two preprocessing steps: a VarianceThreshold (set to 0.85) to filter out low-variance features, and a MinMaxScaler. In line with the standard sklearn workflow, and to prevent data leakage, every pipeline is fit on the training split only and then applied to the test/validation split.
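The leakage-avoidance point can be illustrated with a small base-R sketch (an illustration of the idea only, not GeneSelectR internals, which use sklearn's VarianceThreshold and MinMaxScaler): the filter threshold and scaling parameters are learned from the training split and merely reused on the test split.

```r
# Base-R sketch: fit preprocessing on train only, apply to test
set.seed(1)
X <- matrix(rnorm(200), nrow = 20)   # 20 samples x 10 features
X[, 1] <- 0.5                        # a zero-variance feature to be filtered
train_idx <- 1:15
X_train <- X[train_idx, ]; X_test <- X[-train_idx, ]

# 1. Variance filter: which columns to keep is decided on TRAINING data only
keep <- apply(X_train, 2, var) > 1e-8
X_train <- X_train[, keep]; X_test <- X_test[, keep]

# 2. Min-max scaling: min and range are learned from TRAINING data only
mins <- apply(X_train, 2, min)
rngs <- apply(X_train, 2, max) - mins
X_train_scaled <- sweep(sweep(X_train, 2, mins), 2, rngs, "/")
X_test_scaled  <- sweep(sweep(X_test,  2, mins), 2, rngs, "/")  # same params, no refit
```

The test split is never used to choose the filter or the scaling parameters, which is exactly what fitting the sklearn pipeline on the training data achieves.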
Repeated Train-Test Splitting: The data is repeatedly split into training and test sets, with the number of repetitions defined by n_splits. For each split:
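Conceptually, the repeated splitting looks like the following base-R sketch (an illustration only; the class-wise sampling and the test fraction shown here are assumptions, not GeneSelectR's exact internals):

```r
# Sketch of repeated, stratified train/test splitting
set.seed(42)
n_samples <- 100
y <- rep(c(0, 1), each = n_samples / 2)  # two balanced classes
n_splits  <- 5
test_frac <- 0.2

splits <- lapply(seq_len(n_splits), function(i) {
  # stratified: draw the test fraction within each class separately
  test_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
    sample(idx, size = round(length(idx) * test_frac))
  }))
  list(train = setdiff(seq_along(y), test_idx), test = sort(test_idx))
})
# each repetition yields an independent train/test partition of the data
```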
4.1 Note on Inbuilt vs Permutation Feature Importance: The GeneSelectR package offers two ways of ranking features, and it is important to understand the difference between these two approaches. In short, the two methods can be summarized as follows:
A Note on Permutation Importance: Permutation importance provides insights specific to the model for which it is computed. A feature that is vital for one model might not be for another, so permutation importance does not indicate the intrinsic predictive value of a feature in isolation. Moreover, it is important to have a trustworthy model (with good cross-validation scores) before interpreting its permutation importance values: features deemed unimportant in a poorly performing model could be crucial for a well-performing one. It is therefore recommended to evaluate a model's predictive power on held-out data or via cross-validation before computing permutation importances. For more information, please refer to the official sklearn documentation on permutation importance.
Note: Permutation importance is computationally intensive and may substantially increase the time needed to finish an analysis run. As a general recommendation, it should be used when large computational resources are available.
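The idea behind permutation importance can be sketched without sklearn: permute one feature, re-score the fitted model, and take the drop in performance as that feature's importance. The base-R example below is an illustration of the concept only (GeneSelectR delegates the actual computation to sklearn), using a simple logistic regression on simulated data:

```r
# Base-R sketch of permutation importance (illustration only)
set.seed(7)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)                 # x1 informative, x2 pure noise
y  <- as.integer(x1 + rnorm(n, sd = 0.3) > 0)
dat <- data.frame(x1, x2, y)

fit <- suppressWarnings(glm(y ~ x1 + x2, data = dat, family = binomial))
acc <- function(d) mean((predict(fit, newdata = d, type = "response") > 0.5) == d$y)

baseline <- acc(dat)
perm_importance <- sapply(c("x1", "x2"), function(f) {
  drops <- replicate(10, {
    d <- dat
    d[[f]] <- sample(d[[f]])                   # permute one feature
    baseline - acc(d)                          # accuracy drop = importance
  })
  mean(drops)
})
# permuting the informative feature should cost far more accuracy
```

A feature's importance here is tied to this particular fitted model, which is exactly why a poorly performing model yields permutation importances that should not be trusted.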
selection_results <- GeneSelectR::GeneSelectR(X = X,
                                              y = y,
                                              njobs = -1,
                                              search_type = 'grid')
GeneSelectR also supports scikit-optimize's BayesSearchCV, which performs Bayesian optimization of the hyperparameters. Depending on the settings and the data, it can speed up computation:
selection_results <- GeneSelectR::GeneSelectR(X = X,
                                              y = y,
                                              njobs = -1,
                                              search_type = 'bayesian')
The GeneSelectR workflow is highly customizable. For example, if you wish to add other feature selection methods from sklearn, pass them as a named list like this:
# sklearn is already imported when the library is loaded
# define the feature selection submodule and wanted methods with an estimator
feature_selection <- sklearn$feature_selection
select_from_model <- feature_selection$SelectFromModel
RFE <- feature_selection$RFE
rf <- sklearn$ensemble$RandomForestClassifier
# feature selection methods of your choice
my_methods <- list('RFE' = RFE(estimator = rf(), n_features_to_select = 100L),
                   'SelectFromModel' = select_from_model(estimator = rf()))
The parameters for these methods should be passed in the same way, with the prefix 'feature_selector__' added to every parameter name:
# params for the feature selection methods of your choice
my_params <- list(
  'RFE' = list('feature_selector__step' = c(0.001, 0.1, 1, 10)),
  'SelectFromModel' = list(
    'feature_selector__estimator__n_estimators' = c(50L, 100L, 250L, 500L),
    'feature_selector__estimator__max_depth' = c(10L, 20L, 30L),
    'feature_selector__estimator__bootstrap' = c(TRUE, FALSE)
  )
)
Finally, we can pass them as arguments to the GeneSelectR() function:
selection_results <- GeneSelectR::GeneSelectR(X = X,
                                              y = y,
                                              njobs = -1,
                                              feature_selection_methods = my_methods,
                                              fs_param_grids = my_params)
Besides feature selection methods, preprocessing steps and other classification estimators can also be passed. For example, to supply your own preprocessing steps:
minmax <- sklearn$preprocessing$MinMaxScaler()
varthr <- sklearn$feature_selection$VarianceThreshold()
preprocessing <- list('MinMaxScaler' = minmax,
                      'VarianceThreshold' = varthr)
selection_results <- GeneSelectR::GeneSelectR(X = X,
                                              y = y,
                                              njobs = -1,
                                              feature_selection_methods = my_methods,
                                              fs_param_grids = my_params,
                                              preprocessing_steps = preprocessing)
The same can be done with the classifying estimator. For example, to use the XGBoost classifier instead of the default random forest:
# import xgboost
# NOTE: it should be installed in the working conda env
xgb <- reticulate::import('xgboost')
xgb.classifier <- xgb$XGBClassifier()
selection_results <- GeneSelectR::GeneSelectR(X = X,
                                              y = y,
                                              njobs = -1,
                                              classifier = xgb.classifier)
Note: if you supply your own classifier, you must also specify a parameter grid for it, with a 'classifier__' prefix on every parameter:
xgb <- reticulate::import('xgboost')
xgb.classifier <- xgb$XGBClassifier()
xgb_param_grid <- list(
  "classifier__learning_rate" = c(0.01, 0.05, 0.1),
  "classifier__n_estimators" = c(100L, 200L, 300L),
  "classifier__max_depth" = c(3L, 5L, 7L),
  "classifier__min_child_weight" = c(1L, 3L, 5L),
  "classifier__gamma" = c(0, 0.1, 0.2),
  "classifier__subsample" = c(0.8, 1.0),
  "classifier__colsample_bytree" = c(0.8, 1.0)
)
selection_results <- GeneSelectR::GeneSelectR(X = X,
                                              y = y,
                                              njobs = -1,
                                              classifier = xgb.classifier,
                                              classifier_grid = xgb_param_grid)