Source: https://dzhakparov.github.io/GeneSelectR/vignettes/example.html

GeneSelectR Tutorial

Feature Selection Procedure
  1. Setting Defaults: If not provided, default feature selection methods and hyperparameter grids are used. By default, four methods are implemented to select features: univariate feature selection, logistic regression with an L1 penalty (Lasso), Boruta, and random forest. The default grid for each method is as follows:
fs_param_grids <- list(
  "Lasso" = list(
    "feature_selector__estimator__C" = c(0.01, 0.1, 1L, 10L),
    "feature_selector__estimator__solver" = c('liblinear','saga')
  ),
  "Univariate" = list(
    "feature_selector__param" = seq(50L, 200L, by = 50L)
  ),
  "boruta" = list(
    "feature_selector__perc" = seq(80L, 100L, by = 10L),
    'feature_selector__n_estimators' = c(50L, 100L, 250L, 500L)
  ),
  "RandomForest" = list(
    "feature_selector__estimator__n_estimators" = seq(100L, 500L,by = 50L),
    "feature_selector__estimator__max_depth" = c(10L, 20L, 30L),
    "feature_selector__estimator__min_samples_split" = c(2L, 5L, 10L),
    "feature_selector__estimator__min_samples_leaf" = c(1L, 2L, 4L),
    "feature_selector__estimator__bootstrap" = c(TRUE, FALSE)
  )
)
  2. Pipeline Selection: If custom pipelines are provided, they are used. Otherwise, default pipelines are created from the chosen feature selection methods. Each pipeline contains two preprocessing steps: a VarianceThreshold (set to 0.85) to filter out low-variance features, and a MinMaxScaler. Following the standard sklearn workflow to prevent data leakage, each pipeline is fit on the training data only and then applied to the test/validation data.
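A default-style pipeline can be sketched with reticulate as below. This is illustrative only: GeneSelectR builds its pipelines internally, and the step names here are assumptions, not the package's exact internals. It assumes a conda environment with scikit-learn available.

```r
# Sketch of a default-style pipeline (step names are illustrative)
sklearn <- reticulate::import('sklearn')

pipe <- sklearn$pipeline$Pipeline(steps = list(
  # drop features whose variance falls below the 0.85 threshold
  reticulate::tuple('VarianceThreshold',
                    sklearn$feature_selection$VarianceThreshold(threshold = 0.85)),
  # scale the remaining features to the [0, 1] range
  reticulate::tuple('MinMaxScaler', sklearn$preprocessing$MinMaxScaler()),
  # feature selector, e.g. L1-penalized logistic regression
  reticulate::tuple('feature_selector', sklearn$feature_selection$SelectFromModel(
    estimator = sklearn$linear_model$LogisticRegression(penalty = 'l1',
                                                        solver = 'liblinear')))
))

# fit on the training split only, then transform the test split:
# pipe$fit(X_train, y_train); X_test_selected <- pipe$transform(X_test)
```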

  3. Repeated Train-Test Splitting: The data undergoes repeated splitting into training and test sets, with the number of repetitions defined by n_splits. For each split:

  4. Feature Importance Calculation: For each split and each pipeline, feature importance scores are computed. By default, the model-specific, inbuilt feature importance is reported. If calculate_permutation_importance is set to TRUE, permutation feature importance scores are calculated in addition to the inbuilt scores.

A Note on Inbuilt vs. Permutation Feature Importance: The GeneSelectR package offers two ways of scoring selected features, and it is important to understand the difference between the two approaches.

A Note on Permutation Importance: Permutation importance is specific to the model it is computed for: a feature that is vital for one model might not be for another. Hence, permutation importance does not indicate the intrinsic predictive value of a feature in isolation. Moreover, a trustworthy model (one with good cross-validation scores) is needed before its permutation importance values can be trusted: features deemed unimportant in a poorly performing model could be crucial for a well-performing one. It is always recommended to evaluate a model's predictive power using held-out data or cross-validation before computing permutation importances. For more information, please refer to the official sklearn documentation on permutation importance.

Note: Permutation importance calculation is computationally intensive and may considerably increase the analysis run time. As a general recommendation, it should be used when large computational resources are available.
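With that caveat in mind, permutation scores can be requested alongside the inbuilt ones via the calculate_permutation_importance argument mentioned above (other arguments follow the earlier examples):

```r
selection_results <- GeneSelectR::GeneSelectR(X = X,
                                              y = y,
                                              njobs = -1,
                                              calculate_permutation_importance = TRUE)
```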

  5. Hyperparameter Adjustment: To save computation time, RandomizedSearchCV is used by default. However, if you want an extensive hyperparameter optimization, a grid search testing all possible combinations can be performed:
selection_results <- GeneSelectR::GeneSelectR(X = X, 
                                              y = y,
                                              njobs = -1,
                                              search_type = 'grid')

GeneSelectR also supports the scikit-optimize method BayesSearchCV, which performs Bayesian optimization of the hyperparameters. Depending on the settings and data, it can speed up the search:

selection_results <- GeneSelectR::GeneSelectR(X = X, 
                                              y = y,
                                              njobs = -1,
                                              search_type = 'bayesian')
  6. Aggregating Results: Across all splits, the mean inbuilt feature importance and its standard deviation are aggregated for each method. The same is done for permutation importance if it is enabled. Additionally, the rank of every feature at every data split is reported.
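As an illustration of this aggregation step, here is how the mean, standard deviation, and per-split rank could be computed for toy importance scores from three splits. This is plain base R on made-up numbers, not GeneSelectR's internals:

```r
# toy importance scores: rows = splits, columns = features
imp <- rbind(split1 = c(geneA = 0.50, geneB = 0.30, geneC = 0.20),
             split2 = c(geneA = 0.45, geneB = 0.35, geneC = 0.20),
             split3 = c(geneA = 0.55, geneB = 0.25, geneC = 0.20))

mean_imp <- colMeans(imp)                      # mean importance per feature
sd_imp   <- apply(imp, 2, sd)                  # standard deviation per feature
ranks    <- t(apply(imp, 1, function(x) rank(-x)))  # rank per split (1 = top)

mean_imp['geneA']   # 0.5 — geneA has the highest mean importance
ranks[, 'geneA']    # 1 in every split
```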
Customizing GeneSelectR Workflow with Other sklearn Methods

The GeneSelectR workflow is highly customizable depending on your needs. For example, if you wish to add any other feature selection method from sklearn, pass it as a named list like this:

# sklearn is already imported when the library is loaded 
# define the feature selection submodule and wanted methods with an estimator
feature_selection <- sklearn$feature_selection
select_from_model <- feature_selection$SelectFromModel
RFE <- feature_selection$RFE
rf <- sklearn$ensemble$RandomForestClassifier

# feature selection methods of your choice 
my_methods <- list('RFE' = RFE(estimator = rf(), n_features_to_select = 100L),
                   'SelectFromModel' = select_from_model(estimator = rf()))

The parameters for these methods should be passed the same way, with the prefix ‘feature_selector__’ on every parameter:

# params for the feature selection methods of your choice 
my_params <- list('RFE' = list('feature_selector__step' = c(0.1, 0.001, 1L, 10L)),
                  'SelectFromModel' = list('feature_selector__estimator__n_estimators' = c(50L, 100L, 250L, 500L),
                                           "feature_selector__estimator__max_depth" = c(10L, 20L, 30L),
                                           "feature_selector__estimator__bootstrap" = c(TRUE, FALSE))
                  )

Finally, we can pass them as arguments to the GeneSelectR() function:

selection_results <- GeneSelectR::GeneSelectR(X = X, 
                                              y = y,
                                              njobs = -1,
                                              feature_selection_methods=my_methods,
                                              fs_param_grids = my_params)

Besides feature selection methods, preprocessing steps as well as other estimators for classification can be passed. For example, if you want to pass other preprocessing steps, you can do it like this:

minmax <- sklearn$preprocessing$MinMaxScaler()
varthr <- sklearn$feature_selection$VarianceThreshold()
preprocessing <- list( 'MinMaxScaler' = minmax, 
                       'VarianceThreshold' = varthr)

selection_results <- GeneSelectR::GeneSelectR(X = X, 
                                              y = y,
                                              njobs = -1,
                                              feature_selection_methods=my_methods,
                                              fs_param_grids = my_params,
                                              preprocessing_steps = preprocessing)

The same can be done with the classifying estimator. For example, if you want to use the XGBoost classifier instead of the default random forest:

# import xgboost 
# NOTE: it should be installed in the working conda env 
xgb <- reticulate::import('xgboost')
xgb.classifier <- xgb$XGBClassifier()

selection_results <- GeneSelectR::GeneSelectR(X = X, 
                                              y = y,
                                              njobs = -1,
                                              classifier = xgb.classifier)

Note: if you supply your own classifier, you have to specify the parameter grid for it with a ‘classifier__’ prefix:

xgb <- reticulate::import('xgboost')
xgb.classifier <- xgb$XGBClassifier()

xgb_param_grid <- list(
  "classifier__learning_rate" = c(0.01, 0.05, 0.1),
  "classifier__n_estimators" = c(100L, 200L, 300L),
  "classifier__max_depth" = c(3L, 5L, 7L),
  "classifier__min_child_weight" = c(1L, 3L, 5L),
  "classifier__gamma" = c(0, 0.1, 0.2),
  "classifier__subsample" = c(0.8, 1.0),
  "classifier__colsample_bytree" = c(0.8, 1.0)
)

selection_results <- GeneSelectR::GeneSelectR(X = X, 
                                              y = y,
                                              njobs = -1,
                                              classifier = xgb.classifier,
                                              classifier_grid = xgb_param_grid)
