Time limit in seconds for the search of appropriate models. By increasing this value, auto-sklearn has a higher chance of finding better models.
Time limit for a single call to the machine learning model. Model fitting will be terminated if the machine learning algorithm runs over the time limit. Set this value high enough so that typical machine learning algorithms can be fit on the training data.
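For instance, a hedged sketch of setting both limits together (the parameter names are auto-sklearn's; the values are illustrative, not recommendations):

    import autosklearn.classification

    # Search for 10 minutes overall, but cap any single model fit at
    # 60 seconds so one slow configuration cannot eat the whole budget.
    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=600,
        per_run_time_limit=60,
    )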
Initialize the hyperparameter optimization algorithm with this many configurations which worked well on previously seen datasets. Disable if the hyperparameter optimization algorithm should start from scratch.
Number of models added to the ensemble built by Ensemble selection from libraries of models. Models are drawn with replacement. If set to 0, no ensemble is fit.
Deprecated - will be removed in Auto-sklearn 0.16. Please pass this argument via ensemble_kwargs={"ensemble_size": int} if you want to change the ensemble size for ensemble selection.
Class implementing the post-hoc ensemble algorithm. Set to None to disable ensemble building, or use SingleBest to obtain only the single best model instead of an ensemble. If set to "default" it will use EnsembleSelection for single-objective problems and MultiObjectiveDummyEnsemble for multi-objective problems.
Keyword arguments that are passed to the ensemble class upon initialization.
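As a hedged sketch, ensemble behaviour might be configured through these two parameters together (the EnsembleSelection import path is an assumption about the package layout):

    import autosklearn.classification
    # Assumed import path for the ensemble selection class.
    from autosklearn.ensembles.ensemble_selection import EnsembleSelection

    automl = autosklearn.classification.AutoSklearnClassifier(
        ensemble_class=EnsembleSelection,
        # Replaces the deprecated top-level ensemble_size argument.
        ensemble_kwargs={"ensemble_size": 10},
    )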
Only consider the ensemble_nbest models when building an ensemble. This is inspired by a concept called library pruning introduced in Getting the Most Out of Ensemble Selection. This is independent of the ensemble_class argument, and this pruning step is done prior to constructing an ensemble.
Defines the maximum number of models that are kept on disk. Any additional models are permanently deleted. Due to the nature of this variable, it sets the upper limit on how many models can be used for an ensemble. It must be an integer greater than or equal to 1. If set to None, all models are kept on disk.
Used to seed SMAC. Will determine the output file names.
Memory limit in MB for the machine learning algorithm. auto-sklearn will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB.
Important notes:
- If None is provided, no memory limit is set.
- In case of multi-processing, memory_limit will be per job, so the total usage is n_jobs x memory_limit (see the sketch below).
- The memory limit also applies to the ensemble creation process.
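A brief sketch of how the per-job limit adds up (the values are illustrative):

    import autosklearn.classification

    # Each of the 4 workers may allocate up to 3072 MB, so worst-case
    # total usage is 4 x 3072 MB = 12 GB.
    automl = autosklearn.classification.AutoSklearnClassifier(
        memory_limit=3072,
        n_jobs=4,
    )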
If None, all possible algorithms are used.
Otherwise, specifies a step and the components that are included in search. See /pipeline/components/<step>/* for available components. Incompatible with parameter exclude.
Possible Steps:
"data_preprocessor"
"balancing"
"feature_preprocessor"
"classifier" - Only when using AutoSklearnClassifier
"regressor" - Only when using AutoSklearnRegressor
Example:

    include = {
        'classifier': ["random_forest"],
        'feature_preprocessor': ["no_preprocessing"]
    }
If None, all possible algorithms are used.
Otherwise, specifies a step and the components that are excluded from search. See /pipeline/components/<step>/* for available components. Incompatible with parameter include.
Possible Steps:
"data_preprocessor"
"balancing"
"feature_preprocessor"
"classifier" - Only when using AutoSklearnClassifier
"regressor" - Only when using AutoSklearnRegressor
Example:

    exclude = {
        'classifier': ["random_forest"],
        'feature_preprocessor': ["no_preprocessing"]
    }
How to handle overfitting; you might need to use resampling_strategy_arguments if using a "cv" based method or a Splitter object.
"holdout"
- Use a 67:33 (train:test) split
"cv"
: perform cross validation, requires “folds” in resampling_strategy_arguments
"holdout-iterative-fit"
- Same as “holdout” but iterative fit where possible
"cv-iterative-fit"
: Same as “cv” but iterative fit where possible
"partial-cv"
: Same as “cv” but uses intensification.
BaseCrossValidator
- any BaseCrossValidator subclass (found in scikit-learn model_selection module)
_RepeatedSplits
- any _RepeatedSplits subclass (found in scikit-learn model_selection module)
BaseShuffleSplit
- any BaseShuffleSplit subclass (found in scikit-learn model_selection module)
If using a Splitter object that relies on the dataset retaining its current size and order, you will need to look at the dataset_compression argument and ensure that "subsample" is not included in the applied compression "methods", or disable it entirely with False, as shown in the sketch below.
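A minimal sketch of the splitter case, assuming a classification task and scikit-learn's PredefinedSplit:

    import numpy as np
    from sklearn.model_selection import PredefinedSplit
    import autosklearn.classification

    # Rows marked -1 always stay in training; rows marked 0 form the
    # held-out fold. This assumes a 1000-row dataset.
    test_fold = np.array([-1] * 800 + [0] * 200)

    automl = autosklearn.classification.AutoSklearnClassifier(
        resampling_strategy=PredefinedSplit(test_fold=test_fold),
        # Disable compression so row count and order are preserved.
        dataset_compression=False,
    )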
Additional arguments for resampling_strategy; this is required if using a "cv" based strategy. The default arguments if left as None are:

    {
        "train_size": 0.67,  # The size of the training set
        "shuffle": True,     # Whether to shuffle before splitting data
        "folds": 5           # Used in 'cv' based resampling strategies
    }
If using a custom splitter class which takes n_splits, such as PredefinedSplit, the value of "folds" will be used.
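For example, a hedged sketch of a "cv" based setup (X_train and y_train are assumed to exist; refit() retrains the found models on the full training set, since cross-validation alone leaves no single final model):

    import autosklearn.classification

    automl = autosklearn.classification.AutoSklearnClassifier(
        resampling_strategy="cv",
        resampling_strategy_arguments={"folds": 5},
    )
    automl.fit(X_train, y_train)
    automl.refit(X_train, y_train)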
Folder to store configuration output and log files; if None, automatically use /tmp/autosklearn_tmp_$pid_$random_number.
Remove tmp_folder when finished. If tmp_folder is None, tmp_dir will always be deleted.
The number of jobs to run in parallel for fit(). -1 means using all processors.
Important notes:
- By default, Auto-sklearn uses one core.
- Ensemble building is not affected by n_jobs but can be controlled by the number of models in the ensemble.
- predict() is not affected by n_jobs (in contrast to most scikit-learn models).
- If dask_client is None, a new dask client is created.
User-created dask client, which can be used to start a dask cluster and then attach auto-sklearn to it.
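A hedged sketch of attaching your own client (the LocalCluster setup is an assumption about the environment):

    from dask.distributed import Client, LocalCluster
    import autosklearn.classification

    # A local 4-worker cluster; any reachable dask cluster would do.
    cluster = LocalCluster(n_workers=4, threads_per_worker=1)
    client = Client(cluster)

    automl = autosklearn.classification.AutoSklearnClassifier(
        n_jobs=4,
        dask_client=client,
    )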
If True, disable model and prediction output. Cannot be used together with ensemble building; predict() cannot be used when setting this to True. Can also be used as a list to pass more fine-grained information on what to save. Allowed elements in the list are:
- 'y_optimization' - do not save the predictions for the optimization set, which would later on be used to build an ensemble.
- 'model' - do not save any model files.
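A brief sketch of the list form; since the optimization-set predictions feed ensemble building, this assumes ensembles are disabled as well:

    import autosklearn.classification

    automl = autosklearn.classification.AutoSklearnClassifier(
        # Skip saving optimization-set predictions...
        disable_evaluator_output=['y_optimization'],
        # ...so ensemble building, which needs them, is turned off.
        ensemble_class=None,
    )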
Additional arguments inserted into the scenario of SMAC. See the SMAC documentation for a list of available arguments.
Callback function to create an object of class smac.optimizer.smbo.SMBO. The function must accept the arguments scenario_dict, instances, num_params, runhistory, seed and ta. This is an advanced feature. Use only if you are familiar with SMAC.
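A heavily hedged sketch of such a callback; the SMAC4AC facade and Scenario import paths refer to the SMAC 1.x series that auto-sklearn builds on and should be treated as assumptions:

    from smac.facade.smac_ac_facade import SMAC4AC
    from smac.scenario.scenario import Scenario

    def get_smac_object(scenario_dict, instances, num_params,
                        runhistory, seed, ta):
        # Build a Scenario from the dictionary auto-sklearn prepared,
        # then hand the run history and target algorithm back to SMAC.
        scenario = Scenario(scenario_dict)
        return SMAC4AC(
            scenario=scenario,
            rng=seed,
            runhistory=runhistory,
            tae_runner=ta,
        )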
Dictionary object specifying the logger configuration. If None, the default logging.yaml file is used, which can be found in the directory util/logging.yaml relative to the installation.
Path to the metadata directory. If None, the default directory (autosklearn.metalearning.files) is used.
An instance of autosklearn.metrics.Scorer as created by autosklearn.metrics.make_scorer(). These are the Built-in Metrics. If None is provided, a default metric is selected depending on the task.
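For instance, a minimal sketch selecting one of the built-in metrics (roc_auc is among the scorers shipped in autosklearn.metrics):

    import autosklearn.classification
    import autosklearn.metrics

    automl = autosklearn.classification.AutoSklearnClassifier(
        metric=autosklearn.metrics.roc_auc,
    )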
List of scorers which will be calculated for each pipeline; results will be available via cv_results_.
Whether to load the models after fitting Auto-sklearn.
A callable with the following definition:

    (smac.SMBO, smac.RunInfo, smac.RunValue, time_left: float) -> bool | None

This will be called after SMAC, the underlying optimizer for auto-sklearn, finishes training each run. You can use this to record your own information about the optimization process. You can also use this to enable early stopping based on some criteria. See the example: Early Stopping And Callbacks.
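A minimal sketch of an early-stopping callback, assuming result.cost holds the validation loss of the finished run and 0.05 is an arbitrary target:

    import autosklearn.classification

    def callback(smbo, run_info, result, time_left):
        # Returning False stops the optimization early; returning None
        # (implicitly) lets the search continue.
        if result.cost <= 0.05:
            return False

    automl = autosklearn.classification.AutoSklearnClassifier(
        get_trials_callback=callback,
    )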
We compress datasets so that they fit into some predefined amount of memory. Currently this does not apply to dataframes or sparse arrays, only to raw numpy arrays.
NOTE - If using a custom resampling_strategy that relies on specific size or ordering of data, this must be disabled to preserve these properties.
You can disable this entirely by passing False, or leave it as the default True for the configuration below:

    {
        "memory_allocation": 0.1,
        "methods": ["precision", "subsample"]
    }
You can also pass your own configuration with the same keys, choosing from the available "methods" (see the sketch after the method descriptions below).
The available options are described here:
By default, we attempt to fit the dataset into 0.1 * memory_limit. This float value can be set with "memory_allocation": 0.1. We also allow for specifying absolute memory in MB, e.g. 10MB is "memory_allocation": 10.
The memory used by the dataset is checked after each reduction method is performed. If the dataset fits into the allocated memory, any further methods listed in "methods" will not be performed. For example, if methods: ["precision", "subsample"] and the "precision" reduction step was enough to make the dataset fit into memory, then the "subsample" reduction step will not be performed.
We provide the following methods for reducing the dataset size. These can be provided in a list and are performed in the order as given.
"precision"
- We reduce floating point precision as follows: * np.float128 -> np.float64
* np.float96 -> np.float64
* np.float64 -> np.float32
"subsample" - We subsample data such that it fits directly into the memory allocation memory_allocation * memory_limit. Therefore, this should likely be the last method listed in "methods". Subsampling takes into account classification labels and stratifies accordingly. We guarantee that at least one occurrence of each label is included in the sampled set.
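A brief configuration sketch, keeping only the "precision" method so rows are never subsampled (the allocation value is illustrative):

    import autosklearn.classification

    automl = autosklearn.classification.AutoSklearnClassifier(
        dataset_compression={
            # Let the dataset use up to 20% of memory_limit.
            "memory_allocation": 0.2,
            "methods": ["precision"],
        },
    )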
Whether auto-sklearn should process string features. By default, text preprocessing is enabled.