This page gives the Python API reference for XGBoost. Please also refer to the Python Package Introduction for more information about the Python package.
Context manager for global XGBoost configuration.
Global configuration consists of a collection of parameters that can be applied in the global scope. See Global Configuration for the full list of parameters supported in the global configuration.
Note
All settings, not just those presently modified, will be returned to their previous values when the context manager is exited. This is not thread-safe.
Added in version 1.4.0.
new_config (Dict[str, Any]) – Keyword arguments representing the parameters and their values
Iterator[None]
Example
import xgboost as xgb

# Show all messages, including ones pertaining to debugging
xgb.set_config(verbosity=2)

# Get current value of global configuration
# This is a dict containing all parameters in the global configuration,
# including 'verbosity'
config = xgb.get_config()
assert config['verbosity'] == 2

# Example of using the context manager xgb.config_context().
# The context manager will restore the previous value of the global
# configuration upon exiting.
with xgb.config_context(verbosity=0):
    # Suppress warning caused by model generated with XGBoost version < 1.0.0
    bst = xgb.Booster(model_file='./old_model.bin')
assert xgb.get_config()['verbosity'] == 2  # old value restored
Nested configuration context is also supported:
Example
with xgb.config_context(verbosity=3):
    assert xgb.get_config()["verbosity"] == 3
    with xgb.config_context(verbosity=2):
        assert xgb.get_config()["verbosity"] == 2

xgb.set_config(verbosity=2)
assert xgb.get_config()["verbosity"] == 2
with xgb.config_context(verbosity=3):
    assert xgb.get_config()["verbosity"] == 3
See also
set_config
Set global XGBoost configuration
get_config
Get current values of the global configuration
Set global configuration.
Global configuration consists of a collection of parameters that can be applied in the global scope. See Global Configuration for the full list of parameters supported in the global configuration.
Added in version 1.4.0.
new_config (Dict[str, Any]) – Keyword arguments representing the parameters and their values
None
Get current values of the global configuration.
Global configuration consists of a collection of parameters that can be applied in the global scope. See Global Configuration for the full list of parameters supported in the global configuration.
Added in version 1.4.0.
args – The list of global parameters and their values
Dict[str, Any]
Build information of XGBoost. The format of the returned value is not stable. Also, please note that build-time dependencies are not the same as runtime dependencies. For instance, it’s possible to build XGBoost with an older CUDA version but run it with the latest one.
Core XGBoost Library.
Bases: object
Data Matrix used in XGBoost.
DMatrix is an internal data structure that is used by XGBoost, which is optimized for both memory efficiency and training speed. You can construct DMatrix from multiple different sources of data.
data (Any) –
Data source of DMatrix. See Markers for a list of supported input types.
Note that, if passing an iterator, it will cache data on disk, and fields like label will be concatenated in-memory from multiple calls to the iterator.
label (Any | None) – Label of the training data.
weight (Any | None) –
Weight for each instance.
Note
For ranking tasks, weights are per-group: one weight is assigned to each group (not each data point). This is because we only care about the relative ordering of data points within each group, so it doesn’t make sense to assign weights to individual data points.
base_margin (Any | None) – Global bias for each instance. See Intercept for details.
missing (float | None) – Value in the input data that should be treated as a missing value. If None, defaults to np.nan.
silent (bool) – Whether to print messages during construction.
feature_names (Sequence[str] | None) – Set names for features.
feature_types (Sequence[str] | Categories | None) –
Set types for features. If data is a DataFrame type and passing enable_categorical=True, the types will be deduced automatically from the column types.
Otherwise, one can pass a list-like input with the same length as number of columns in data, with the following possible values:
"c", which represents categorical columns.
"q", which represents numeric columns.
"int", which represents integer columns.
"i", which represents boolean columns.
Note that, while categorical types are treated differently from the rest for model fitting purposes, the other types do not influence the generated model, but have effects in other functionalities such as feature importances.
For categorical features, the input is assumed to be preprocessed and encoded by the users. The encoding can be done via sklearn.preprocessing.OrdinalEncoder or the pandas dataframe .cat.codes method. This is useful when users want to specify categorical features without having to construct a dataframe as input.
Added in version 3.1.0.
Alternatively, users can pass a Categories object returned from a previous training as a reference for re-coding. One can obtain the reference with get_categories() from the previous training DMatrix or the Booster. This feature is experimental.
nthread (int | None) – Number of threads to use for loading data when parallelization is applicable. If -1, uses maximum threads available on the system.
group (Any | None) – Group sizes for all ranking groups.
qid (Any | None) – Query ID for data samples, used for ranking.
label_lower_bound (Any | None) – Lower bound for survival training.
label_upper_bound (Any | None) – Upper bound for survival training.
feature_weights (Any | None) – Set feature weights for column sampling.
enable_categorical (bool) –
Added in version 1.3.0.
Note
This parameter is experimental
Experimental support of specializing for categorical features. See Categorical Data for more info.
If passing True and data is a data frame (from supported libraries such as Pandas, Modin, polars, and cuDF), the DMatrix recognizes categorical columns and automatically sets the feature_types parameter. If data is not a data frame, this argument is ignored.
If passing False and data is a data frame with categorical columns, it will result in an error.
See notes in the DataIter for the consistency requirement when the input is an iterator.
Changed in version 3.1.0.
XGBoost can remember the encoding of categories when the input is a dataframe.
data_split_mode (DataSplitMode)
Get the data split mode of the DMatrix.
Added in version 2.1.0.
DataSplitMode
Labels for features (column labels).
Setting it to None resets existing feature names.
Type of features (column types).
This is for displaying the results and categorical data support. See DMatrix for details.
Setting it to None resets existing feature types.
Get the base margin of the DMatrix.
base_margin
Get the categories in the dataset.
Added in version 3.1.0.
Warning
This function is experimental.
export_to_arrow (bool) – The returned container will contain a list of pyarrow arrays for the categories. See to_arrow() for more info.
Get the predictors from DMatrix as a CSR matrix. This getter is mostly for testing purposes. If this is a quantized DMatrix then quantized values are returned instead of input values.
Added in version 1.7.0.
Get float property from the DMatrix.
field (str) – The field name of the information
info – a numpy array of float information of the data
array
Get the group of the DMatrix.
group
Get the label of the DMatrix.
label
array
Get quantile cuts for quantization.
Added in version 2.0.0.
Get unsigned integer property from the DMatrix.
field (str) – The field name of the information
info – a numpy array of unsigned integer information of the data
array
Get the weight of the DMatrix.
weight
array
Get the number of columns (features) in the DMatrix.
Get the number of non-missing values in the DMatrix.
Added in version 1.7.0.
Get the number of rows in the DMatrix.
Save DMatrix to an XGBoost buffer. Saved binary can be later loaded by providing the path to xgboost.DMatrix()
as input.
fname (string or os.PathLike) – Name of the output buffer file.
silent (bool (optional; default: True)) – If set, the output is suppressed.
None
Set base margin of booster to start from.
This can be used to supply the prediction of an existing model as base_margin. Note that the raw margin is needed rather than the transformed prediction; e.g., for logistic regression, pass the value before the logistic transformation. See also example/demo.py.
margin (array like) – Prediction margin of each datapoint
None
Set float type property into the DMatrix.
field (str) – The field name of the information
data (numpy array) – The array of data to be set
None
Set float type property into the DMatrix for numpy 2D array input.
field (str) – The field name of the information
data (numpy array) – The array of data to be set
None
Set group size of DMatrix (used for ranking).
group (array like) – Group size of each group
None
Set meta info for DMatrix. See doc string for xgboost.DMatrix.
Set the label of the DMatrix.
label (array like) – The label information to be set into DMatrix
None
Set uint type property into the DMatrix.
field (str) – The field name of the information
data (numpy array) – The array of data to be set
None
Set weight of each instance.
weight (array like) –
Weight for each data point
Note
For ranking tasks, weights are per-group.
In ranking tasks, one weight is assigned to each group (not each data point). This is because we only care about the relative ordering of data points within each group, so it doesn’t make sense to assign weights to individual data points.
None
Slice the DMatrix and return a new DMatrix that only contains rindex.
Bases: DMatrix, _RefMixIn
A DMatrix variant that generates quantized data directly from input for the hist tree method. This DMatrix is primarily designed to save memory in training by avoiding intermediate storage. Set max_bin to control the number of bins during quantisation, which should be consistent with the training parameter max_bin. When QuantileDMatrix is used for a validation/test dataset, ref should be another QuantileDMatrix (or DMatrix, but this is not recommended as it defeats the purpose of saving memory) constructed from the training dataset. See xgboost.DMatrix for documents on meta info.
Note
Do not use QuantileDMatrix as a validation/test dataset without supplying a reference (the training dataset) QuantileDMatrix using ref, as some information may be lost in quantisation.
Added in version 1.7.0.
Examples
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression()
X_train, X_test, y_train, y_test = train_test_split(X, y)
Xy_train = xgb.QuantileDMatrix(X_train, y_train)
# It's necessary to have the training DMatrix as a reference for valid
# quantiles.
Xy_test = xgb.QuantileDMatrix(X_test, y_test, ref=Xy_train)
max_bin (int | None) – The number of histogram bins, should be consistent with the training parameter max_bin.
ref (DMatrix | None) – The training dataset that provides quantile information, needed when creating a validation/test dataset with QuantileDMatrix. Supplying the training DMatrix as a reference means that the same quantisation applied to the training data is applied to the validation/test data.
max_quantile_batches (int | None) –
For GPU-based inputs from an iterator, XGBoost handles incoming batches with multiple growing sub-streams. This parameter sets the maximum number of batches before XGBoost can cut a sub-stream and create a new one. This can help bound the memory usage. By default, XGBoost grows a sub-stream exponentially until batches are exhausted. This option is only used for the training dataset and the default is None (unbounded). Lastly, if the data is a single batch instead of an iterator, this parameter has no effect.
Added in version 3.0.0.
Warning
This is an experimental parameter and subject to change.
data (Any) –
Data source of DMatrix. See Markers for a list of supported input types.
Note that, if passing an iterator, it will cache data on disk, and fields like label will be concatenated in-memory from multiple calls to the iterator.
label (Any | None) – Label of the training data.
weight (Any | None) –
Weight for each instance.
Note
For ranking tasks, weights are per-group: one weight is assigned to each group (not each data point). This is because we only care about the relative ordering of data points within each group, so it doesn’t make sense to assign weights to individual data points.
base_margin (Any | None) – Global bias for each instance. See Intercept for details.
missing (float | None) – Value in the input data that should be treated as a missing value. If None, defaults to np.nan.
silent (bool) – Whether to print messages during construction.
feature_names (Sequence[str] | None) – Set names for features.
feature_types (Sequence[str] | None) –
Set types for features. If data is a DataFrame type and passing enable_categorical=True, the types will be deduced automatically from the column types.
Otherwise, one can pass a list-like input with the same length as number of columns in data, with the following possible values:
"c", which represents categorical columns.
"q", which represents numeric columns.
"int", which represents integer columns.
"i", which represents boolean columns.
Note that, while categorical types are treated differently from the rest for model fitting purposes, the other types do not influence the generated model, but have effects in other functionalities such as feature importances.
For categorical features, the input is assumed to be preprocessed and encoded by the users. The encoding can be done via sklearn.preprocessing.OrdinalEncoder or the pandas dataframe .cat.codes method. This is useful when users want to specify categorical features without having to construct a dataframe as input.
Added in version 3.1.0.
Alternatively, users can pass a Categories object returned from a previous training as a reference for re-coding. One can obtain the reference with get_categories() from the previous training DMatrix or the Booster. This feature is experimental.
nthread (int | None) – Number of threads to use for loading data when parallelization is applicable. If -1, uses maximum threads available on the system.
group (Any | None) – Group sizes for all ranking groups.
qid (Any | None) – Query ID for data samples, used for ranking.
label_lower_bound (Any | None) – Lower bound for survival training.
label_upper_bound (Any | None) – Upper bound for survival training.
feature_weights (Any | None) – Set feature weights for column sampling.
enable_categorical (bool) –
Added in version 1.3.0.
Note
This parameter is experimental
Experimental support of specializing for categorical features. See Categorical Data for more info.
If passing True and data is a data frame (from supported libraries such as Pandas, Modin, polars, and cuDF), the DMatrix recognizes categorical columns and automatically sets the feature_types parameter. If data is not a data frame, this argument is ignored.
If passing False and data is a data frame with categorical columns, it will result in an error.
See notes in the DataIter for the consistency requirement when the input is an iterator.
Changed in version 3.1.0.
XGBoost can remember the encoding of categories when the input is a dataframe.
data_split_mode (DataSplitMode)
Internal method for retrieving a reference to the training DMatrix.
Bases: DMatrix, _RefMixIn
The external memory version of the QuantileDMatrix.
See Using XGBoost External Memory Version for explanation and usage examples, and QuantileDMatrix for parameter documentation.
Warning
This is an experimental feature and subject to change.
Added in version 3.0.0.
max_quantile_batches (int | None) – See QuantileDMatrix.
cache_host_ratio (float | None) –
Added in version 3.1.0.
Used by the GPU implementation. For GPU-based inputs, XGBoost can split the cache into host and device caches to reduce the data transfer overhead. This parameter specifies the size of host cache compared to the size of the entire cache: \(host / (host + device)\).
See Adaptive Cache for more info.
missing (float | None)
nthread (int | None)
max_bin (int | None)
ref (DMatrix | None)
enable_categorical (bool)
Internal method for retrieving a reference to the training DMatrix.
Bases: object
A Booster of XGBoost.
Booster is the model of XGBoost. It contains low-level routines for training, prediction, and evaluation.
Get a slice of the tree-based model. Attributes like best_iteration and best_score are removed in the resulting booster.
Added in version 1.3.0.
Get attribute string from the Booster.
key (str) – The key to get attribute from.
The attribute value of the key; returns None if the attribute does not exist.
value
Get attributes stored in the Booster as a dictionary.
result – Returns an empty dict if there are no attributes.
dictionary of attribute_name: attribute_value pairs of strings.
The best iteration during training.
The best evaluation score during training.
Boost the booster for one iteration with customized gradient statistics. Like xgboost.Booster.update(), this function should not be called directly by users.
Copy the booster object.
A copied booster model
booster
Dump model into a text or JSON file. Unlike save_model(), the output format is primarily used for visualization or interpretation, hence it’s more human readable but cannot be loaded back to XGBoost.
Evaluate the model on mat.
Evaluate a set of data.
Feature names for this booster. Can be directly set by input data or by assignment.
Feature types for this booster. Can be directly set by input data or by assignment. See DMatrix for details.
Same method as DMatrix.get_categories().
export_to_arrow (bool)
Returns the model dump as a list of strings. Unlike save_model(), the output format is primarily used for visualization or interpretation, hence it’s more human readable but cannot be loaded back to XGBoost.
Get feature importance of each feature.
Note
Zero-importance features will not be included
Keep in mind that this function does not include zero-importance features, i.e. those features that have not been used in any split conditions.
Get feature importance of each feature. For tree models, the importance type can be defined as:
‘weight’: the number of times a feature is used to split the data across all trees.
‘gain’: the average gain across all splits the feature is used in.
‘cover’: the average coverage across all splits the feature is used in.
‘total_gain’: the total gain across all splits the feature is used in.
‘total_cover’: the total coverage across all splits the feature is used in.
Note
For linear models, only “weight” is defined and it’s the normalized coefficients without bias.
Note
Zero-importance features will not be included
Keep in mind that this function does not include zero-importance features, i.e. those features that have not been used in any split conditions.
A map between feature names and their scores. When gblinear is used for multi-class classification, the scores for each feature are a list with length n_classes; otherwise they’re scalars.
Get split value histogram of a feature
feature (str) – The name of the feature.
bins (int | None) – The maximum number of bins. The number of bins equals the number of unique split values n_unique if bins == None or bins > n_unique.
as_pandas (bool) – Return pd.DataFrame when pandas is installed. If False or pandas is not installed, return numpy ndarray.
a histogram of used splitting values for the specified feature, either as a numpy array or a pandas DataFrame.
ndarray | PdDataFrame
Run prediction in-place when possible. Unlike the predict() method, in-place prediction does not cache the prediction result.
Calling only inplace_predict in multiple threads is safe and lock-free, but the safety does not hold when it is used in conjunction with other methods. For example, you can’t train the booster in one thread and perform prediction in another.
Note
If the device ordinal of the input data doesn’t match the one configured for the booster, data will be copied to the booster device.
booster.set_param({"device": "cuda:0"})
booster.inplace_predict(cupy_array)

booster.set_param({"device": "cpu"})
booster.inplace_predict(numpy_array)
Added in version 1.1.0.
data (Any) – The input data.
iteration_range (Tuple[int | integer, int | integer]) – See predict() for details.
predict_type (str) –
value: Output model prediction values.
margin: Output the raw untransformed margin value.
missing (float) – See xgboost.DMatrix for details.
validate_features (bool) – See xgboost.Booster.predict() for details.
base_margin (Any) –
See xgboost.DMatrix for details.
Added in version 1.4.0.
strict_shape (bool) –
See xgboost.Booster.predict() for details.
Added in version 1.4.0.
prediction – The prediction result. When input data is on GPU, prediction result is stored in a cupy array.
numpy.ndarray/cupy.ndarray
Load configuration returned by save_config.
Added in version 1.0.0.
config (str)
None
Load the model from a file or a bytearray.
The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) are only saved when using JSON or UBJSON (default) format. Also, parameters that are not part of the model (like metrics, max_depth, etc) are not saved, see Model IO for more info.
model.save_model("model.json")
model.load_model("model.json")

# or
model.save_model("model.ubj")
model.load_model("model.ubj")

# or
buf = model.save_raw()
model.load_model(buf)
Get number of boosted rounds. For gblinear this is reset to 0 after serializing the model.
Number of features in booster.
Predict with data. The full model will be used unless iteration_range is specified, meaning users have to either slice the model or use the best_iteration attribute to get predictions from the best model returned from early stopping.
Note
See Prediction for issues like thread safety and a summary of outputs from this function.
data (DMatrix) – The dmatrix storing the input.
output_margin (bool) – Whether to output the raw untransformed margin value.
pred_leaf (bool) – When this option is on, the output will be a matrix of (nsample, ntrees) with each record indicating the predicted leaf index of each sample in each tree. Note that the leaf index of a tree is unique per tree, so you may find leaf 1 in both tree 1 and tree 0.
pred_contribs (bool) – When this is True the output will be a matrix of size (nsample, nfeats + 1) with each record indicating the feature contributions (SHAP values) for that prediction. The sum of all feature contributions is equal to the raw untransformed margin value of the prediction. Note the final column is the bias term.
approx_contribs (bool) – Approximate the contributions of each feature. Used when pred_contribs or pred_interactions is set to True. Changing the default of this parameter (False) is not recommended.
pred_interactions (bool) – When this is True the output will be a matrix of size (nsample, nfeats + 1, nfeats + 1) indicating the SHAP interaction values for each pair of features. The sum of each row (or column) of the interaction values equals the corresponding SHAP value (from pred_contribs), and the sum of the entire matrix equals the raw untransformed margin value of the prediction. Note the last row and column correspond to the bias term.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
training (bool) –
Whether the prediction value is used for training. This can affect the dart booster, which performs dropouts during training iterations but uses all trees for inference. If you want to obtain results with dropouts, set this parameter to True. The parameter is also set to True when obtaining predictions for a custom objective function.
Added in version 1.0.0.
iteration_range (Tuple[int | integer, int | integer]) –
Specifies which layers of trees are used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means that only the forests built during rounds [10, 20) (half-open interval) are used in this prediction.
Added in version 1.4.0.
strict_shape (bool) –
When set to True, the output shape is invariant to whether classification is used. For both value and margin prediction, the output shape is (n_samples, n_groups), with n_groups == 1 when multi-class is not used. Defaults to False, in which case the output shape can be (n_samples, ) if multi-class is not used.
Added in version 1.4.0.
prediction
numpy array
Reset the booster object to release data caches used for training.
Added in version 3.0.0.
Output internal parameter configuration of Booster as a JSON string.
Added in version 1.0.0.
Save the model to a file.
The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) are only saved when using JSON or UBJSON (default) format. Also, parameters that are not part of the model (like metrics, max_depth, etc) are not saved, see Model IO for more info.
model.save_model("model.json")
# or
model.save_model("model.ubj")
Save the model to an in-memory buffer representation instead of a file.
The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) are only saved when using JSON or UBJSON (default) format. Also, parameters that are not part of the model (like metrics, max_depth, etc) are not saved, see Model IO for more info.
raw_format (str) – Format of output buffer. Can be json or ubj.
An in memory buffer representation of the model
Set the attribute of the Booster.
**kwargs (Any | None) – The attributes to set. Setting a value to None deletes an attribute.
None
Set parameters into the Booster.
Parse a boosted tree model text dump into a pandas DataFrame structure.
This feature is only defined when the decision tree model is chosen as base learner (booster in {gbtree, dart}). It is not defined for other base learner types, such as linear learners (booster=gblinear).
Update for one iteration, with objective function calculated internally. This function should not be called directly by users.
Bases: ABC
The interface for user defined data iterator. The iterator facilitates distributed training, QuantileDMatrix, and external memory support using DMatrix or ExtMemQuantileDMatrix. Most of the time, users don’t need to interact with this class directly.
Note
The class caches some intermediate results using the data input (predictor X) as key. Don’t reuse the same X for multiple batches with different metadata (like label); make a copy if necessary.
Note
When the input for each batch is a DataFrame, we assume categories are consistently encoded for all batches. For example, given two dataframes for two batches, this is invalid:
import pandas as pd

x0 = pd.DataFrame({"a": [0, 1]}, dtype="category")
x1 = pd.DataFrame({"a": [1, 2]}, dtype="category")
This is invalid because x0 has [0, 1] as categories while x1 has [1, 2]. They should share the same set of categories and encoding:
import numpy as np

categories = np.array([0, 1, 2])
x0["a"] = pd.Categorical.from_codes(
    codes=np.array([0, 1]), categories=categories
)
x1["a"] = pd.Categorical.from_codes(
    codes=np.array([1, 2]), categories=categories
)
You can ensure consistent encoding in your preprocessing step. Be careful that the data is stored in formats that preserve the encoding when chunking the data.
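One possible way to keep the encoding consistent across batches is to build every batch from a single shared pandas CategoricalDtype. This is only a sketch of that idea, not a requirement of the iterator; the variable names are illustrative.

```python
import pandas as pd

# One dtype object defines both the category set and the code assignment.
shared = pd.CategoricalDtype(categories=[0, 1, 2])

x0 = pd.DataFrame({"a": pd.Series([0, 1], dtype=shared)})
x1 = pd.DataFrame({"a": pd.Series([1, 2], dtype=shared)})

# Both batches agree on the categories, so equal values get equal codes.
same_categories = x0["a"].cat.categories.equals(x1["a"].cat.categories)
codes0 = x0["a"].cat.codes.tolist()
codes1 = x1["a"].cat.codes.tolist()
```

A check such as same_categories can be run on each chunk after loading it from disk to confirm that the storage format preserved the encoding.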
cache_prefix (str | None) –
Prefix to the cache files, only used in external memory.
Note that using this class for external memory will cache data on disk under the path passed here.
release_data (bool) – Whether the iterator should release the data during iteration. Set it to True if the data transformation (converting data to np.float32 type) is memory intensive. Otherwise, if the transformation is computation intensive then we can keep the cache.
on_host (bool) –
Whether the data should be cached on the host memory instead of the file system when using GPU with external memory. When set to true (the default), the “external memory” is the CPU (host) memory. See Using XGBoost External Memory Version for more info.
Added in version 3.0.0.
Warning
This is an experimental parameter and subject to change.
min_cache_page_bytes (int | None) –
The minimum number of bytes of each cached page. Only used for on-host cache with GPU-based ExtMemQuantileDMatrix. When using GPU-based external memory with the data cached in the host memory, XGBoost can concatenate the pages internally to increase the batch size for the GPU. The default page size is about 1/16 of the total device memory. Users can manually set the value based on the actual hardware and datasets. Set this to 0 to disable page concatenation.
Added in version 3.0.0.
Warning
This is an experimental parameter and subject to change.
Get callback functions for iterating in C. This is an internal function.
Set the next batch of data.
input_data (Callable) – A function that accepts the same data fields as xgboost.DMatrix, such as data and label.
False if there are no more batches, otherwise True.
Handle of DMatrix proxy.
Reraise the exception thrown during iteration.
None
Reset the data iterator. Prototype for user defined function.
None
An internal storage class for categories returned by the DMatrix and the Booster. This class is designed to be opaque. It is intended to be used exclusively by XGBoost as an intermediate storage for re-coding categorical data.
The categories are saved along with the booster object. As a result, users don’t need to preserve this class for re-coding. Use the booster model IO instead if you want to preserve the categories in a stable format.
Added in version 3.1.0.
Warning
This class is internal.
Xy = xgboost.QuantileDMatrix(X, y, enable_categorical=True)
booster = xgboost.train({}, Xy)

categories = booster.get_categories()  # Get categories

# Use categories as a reference for re-coding
Xy_new = xgboost.QuantileDMatrix(
    X_new, y_new, feature_types=categories, enable_categorical=True, ref=Xy
)

# Categories will be part of the `model.json`.
booster.save_model("model.json")
Training Library containing training routines.
Train a booster with given parameters.
dtrain (DMatrix) – Data to be trained.
num_boost_round (int) – Number of boosting iterations.
evals (Sequence[Tuple[DMatrix, str]] | None) – List of validation sets for which metrics will be evaluated during training. Validation metrics help track the performance of the model.
obj (Callable[[ndarray, DMatrix], Tuple[ndarray, ndarray]] | None) – Custom objective function. See Custom Objective for details.
maximize (bool | None) – Whether to maximize custom_metric.
early_stopping_rounds (int | None) –
Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training.
Requires at least one item in evals.
The method returns the model from the last iteration (not the best one). Use the custom callback EarlyStopping or model slicing if the best model is desired. If there’s more than one item in evals, the last entry will be used for early stopping.
If there’s more than one metric in the eval_metric parameter given in params, the last metric will be used for early stopping.
If early stopping occurs, the model will have two additional fields: bst.best_score, bst.best_iteration.
evals_result (Dict[str, Dict[str, List[float] | List[Tuple[float, float]]]] | None) –
This dictionary stores the evaluation results of all the items in watchlist.
Example: with a watchlist containing [(dtest,'eval'), (dtrain,'train')] and a parameter containing ('eval_metric': 'logloss'), the evals_result returns
{'train': {'logloss': ['0.48253', '0.35953']}, 'eval': {'logloss': ['0.480385', '0.357756']}}
verbose_eval (bool | int | None) –
Requires at least one item in evals.
If verbose_eval is True then the evaluation metric on the validation set is printed at each boosting stage.
If verbose_eval is an integer then the evaluation metric on the validation set is printed at every given verbose_eval boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.
Example: with verbose_eval=4 and at least one item in evals, an evaluation metric is printed every 4 boosting stages, instead of every boosting stage.
xgb_model (str | PathLike | Booster | bytearray | None) – Xgb model to be loaded before training (allows training continuation).
callbacks (Sequence[TrainingCallback] | None) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.
Note
States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.
for params in parameters_grid:
    # be sure to (re)initialize the callbacks before each run
    callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
    xgb.train(params, Xy, callbacks=callbacks)
custom_metric (Callable[[ndarray, DMatrix], Tuple[str, float]] | None) –
Custom metric function. See Custom Metric for details. The metric receives transformed prediction (after applying the reverse link function) when using a builtin objective, and raw output when using a custom objective.
Booster
a trained booster model
Cross-validation with given parameters.
params (dict) – Booster params.
dtrain (DMatrix) – Data to be trained. Only the DMatrix without external memory is supported.
num_boost_round (int) – Number of boosting iterations.
nfold (int) – Number of folds in CV.
stratified (bool) – Perform stratified sampling.
folds (a KFold or StratifiedKFold instance or list of fold indices) – Sklearn KFolds or StratifiedKFolds object. Alternatively, you may explicitly pass sample indices for each fold. For n folds, folds should be a length-n list of tuples. Each tuple is (in, out), where in is a list of indices to be used as the training samples for the n-th fold and out is a list of indices to be used as the testing samples for the n-th fold.
metrics (string or list of strings) – Evaluation metrics to be watched in CV.
obj (Callable[[ndarray, DMatrix], Tuple[ndarray, ndarray]] | None) – Custom objective function. See Custom Objective for details.
maximize (bool) – Whether to maximize the evaluation metric (score or error).
early_stopping_rounds (int) – Activates early stopping. Cross-Validation metric (average of validation metric computed over CV folds) needs to improve at least once in every early_stopping_rounds round(s) to continue training. The last entry in the evaluation history will represent the best iteration. If there’s more than one metric in the eval_metric parameter given in params, the last metric will be used for early stopping.
fpreproc (function) – Preprocessing function that takes (dtrain, dtest, param) and returns transformed versions of those.
as_pandas (bool, default True) – Return pd.DataFrame when pandas is installed. If False or pandas is not installed, return np.ndarray.
verbose_eval (bool, int, or None, default None) – Whether to display the progress. If None, progress will be displayed when np.ndarray is returned. If True, progress will be displayed at every boosting stage. If an integer is given, progress will be displayed at every given verbose_eval boosting stage.
show_stdv (bool, default True) – Whether to display the standard deviation in progress. Results are not affected, and always contain std.
seed (int) – Seed used to generate the folds (passed to numpy.random.seed).
callbacks (Sequence[TrainingCallback] | None) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.
Note
States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.
for params in parameters_grid:
    # be sure to (re)initialize the callbacks before each run
    callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
    xgboost.train(params, Xy, callbacks=callbacks)
shuffle (bool) – Shuffle data before creating folds.
custom_metric (Callable[[ndarray, DMatrix], Tuple[str, float]] | None) –
Custom metric function. See Custom Metric for details.
evaluation history
list(string)
Scikit-Learn Wrapper interface for XGBoost.
Bases: RegressorMixin, XGBModel
Implementation of the scikit-learn API for XGBoost regression. See Using the Scikit-Learn Estimator Interface for more information.
n_estimators (Optional[int]) – Number of gradient boosted trees. Equivalent to number of boosting rounds.
max_depth (Optional[int]) – Maximum tree depth for base learners.
max_leaves (Optional[int]) – Maximum number of leaves; 0 indicates no limit.
max_bin (Optional[int]) – If using histogram-based algorithm, maximum number of bins per feature
grow_policy (Optional[str]) –
Tree growing policy.
depthwise: Favors splitting at nodes closest to the root.
lossguide: Favors splitting at nodes with highest loss change.
learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)
verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
objective (Union[str, xgboost.sklearn._SklObjWProto, Callable[[Any, Any], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) –
Specify the learning task and the corresponding learning objective or a custom objective function to be used.
For custom objective, see Custom Objective and Evaluation Metric and Custom objective and metric for more information, along with the end note for function signatures.
booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.
tree_method (Optional[str]) – Specify which tree method to use. Defaults to auto. If this parameter is left at its default, XGBoost will choose the most conservative option available. It’s recommended to study this option in the tree method section of the parameters document.
n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight (Optional[float]) – Minimum sum of instance weight (hessian) needed in a child.
max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.
subsample (Optional[float]) – Subsample ratio of the training instance.
sampling_method (Optional[str]) –
Sampling method. Used only by the GPU version of the hist tree method.
uniform: Select random training instances uniformly.
gradient_based: Select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)
colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.
colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.
reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).
reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).
scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
base_score (Optional[float]) – The initial prediction score of all instances, global bias.
random_state (Union[numpy.random.mtrand.RandomState, numpy.random._generator.Generator, int, NoneType]) –
Random number seed.
Note
Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.
missing (float) – Value in the data that is to be treated as a missing value. Defaults to numpy.nan.
num_parallel_tree (Optional[int]) – Used for boosting random forest.
monotone_constraints (Union[Dict[str, int], str, NoneType]) – Constraint of variable monotonicity. See tutorial for more information.
interaction_constraints (Union[str, List[Tuple[str]], NoneType]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information.
importance_type (Optional[str]) –
The feature importance type for the feature_importances_ property:
For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.
For linear model, only “weight” is defined and it’s the normalized coefficients without bias.
device (Optional[str]) –
Added in version 2.0.0.
Device ordinal. Available options are cpu, cuda, and gpu.
validate_parameters (Optional[bool]) – Give warnings for unknown parameter.
enable_categorical (bool) – See the same parameter of DMatrix for details.
feature_types (Optional[Sequence[str]]) –
Added in version 1.7.0.
Used for specifying feature types without constructing a dataframe. See the DMatrix for details.
feature_weights (Optional[ArrayLike]) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.
max_cat_to_onehot (Optional[int]) –
Added in version 1.6.0.
Note
This parameter is experimental
A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When the number of categories is less than the threshold, one-hot encoding is chosen; otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
max_cat_threshold (Optional[int]) –
Added in version 1.7.0.
Note
This parameter is experimental
Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
multi_strategy (Optional[str]) –
Added in version 2.0.0.
Note
This parameter is a work in progress.
The strategy used for training multi-target models, including multi-target regression and multi-class classification. See Multiple Outputs for more information.
one_output_per_tree: One model for each target.
multi_output_tree: Use multi-target trees.
eval_metric (Union[str, List[Union[str, Callable]], Callable, NoneType]) –
Added in version 1.6.0.
Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metric in XGBoost (See XGBoost Parameters), one of the metrics in sklearn.metrics, or any other user defined metric that looks like sklearn.metrics.
If custom objective is also provided, then custom metric should implement the corresponding reverse link function.
Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.
For advanced usage of early stopping, like directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.
See Custom Objective and Evaluation Metric and Custom objective and metric for more information.
import xgboost as xgb
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error

X, y = load_diabetes(return_X_y=True)
reg = xgb.XGBRegressor(
    tree_method="hist",
    eval_metric=mean_absolute_error,
)
reg.fit(X, y, eval_set=[(X, y)])
early_stopping_rounds (Optional[int]) –
Added in version 1.6.0.
Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().
If early stopping occurs, the model will have two additional attributes: best_score and best_iteration. These are used by the predict() and apply() methods to determine the optimal number of trees during inference. If users want to access the full model (including trees built after early stopping), they can specify the iteration_range in these inference methods. In addition, other utilities like model plotting can also use the entire model.
If you prefer to discard the trees after best_iteration, consider using the callback function xgboost.callback.EarlyStopping.
If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.
callbacks (Optional[List[xgboost.callback.TrainingCallback]]) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.
Note
States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.
for params in parameters_grid:
    # be sure to (re)initialize the callbacks before each run
    callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
    reg = xgb.XGBRegressor(**params, callbacks=callbacks)
    reg.fit(X, y)
Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.
Note
**kwargs unsupported by scikit-learn
**kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.
Note
Custom objective function
A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> [grad, hess] or objective(y_true, y_pred, *, sample_weight) -> [grad, hess]:
y_true: The target values.
y_pred: The predicted values.
sample_weight: Optional sample weights.
grad: The value of the gradient for each sample point.
hess: The value of the second derivative for each sample point.
Note that, if the custom objective produces negative values for the Hessian, these will be clipped. If the objective is non-convex, one might also consider using the expected Hessian (Fisher information) instead.
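As an illustration of the objective(y_true, y_pred) signature, here is a sketch of a squared log error objective written with NumPy only. The function name, the loss choice, and the clipping threshold are assumptions for the example, not part of the library.

```python
import numpy as np


def squared_log_error(y_true, y_pred):
    """Gradient and Hessian of 0.5 * (log1p(y_pred) - log1p(y_true)) ** 2."""
    # Keep log1p defined; the clipping threshold is an illustrative choice.
    y_pred = np.maximum(y_pred, -1.0 + 1e-6)
    diff = np.log1p(y_pred) - np.log1p(y_true)
    grad = diff / (y_pred + 1.0)
    # Negative when the prediction is far above the target (non-convex loss).
    hess = (1.0 - diff) / (y_pred + 1.0) ** 2
    return grad, hess


y_true = np.array([1.0, 2.0, 3.0])
grad, hess = squared_log_error(y_true, y_true.copy())

# A prediction far above the target produces a negative Hessian entry.
_, hess_far = squared_log_error(np.array([0.0]), np.array([10.0]))
```

Such a function could then be passed as the objective parameter of the estimator. The hess_far case shows the kind of negative Hessian value that the note above says will be clipped.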
Return the predicted leaf of every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.
X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.
array_like, shape=[n_samples, n_trees]
The best iteration obtained by early stopping. This attribute is 0-based, for instance if the best iteration is the first round, then best_iteration is 0.
The best score obtained by early stopping.
Coefficients property
Note
Coefficients are defined only for linear learners
Coefficients are only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).
coef_
array of shape [n_features] or [n_classes, n_features]
Return the evaluation results.
If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit() function, the evals_result will contain the eval_metrics passed to the fit() function.
The returned evaluation result is a dictionary:
{'validation_0': {'logloss': ['0.604835', '0.531479']},
 'validation_1': {'logloss': ['0.41965', '0.17686']}}
evals_result
Feature importances property; the return value depends on the importance_type parameter. When the model is trained with a multi-class/multi-label/multi-target dataset, the feature importance is “averaged” over all targets. The “average” is defined based on the importance type. For instance, if the importance type is “total_gain”, then the score is the sum of loss change for each split from all trees.
feature_importances_ (array of shape [n_features], except for the multi-class linear model, which returns an array with shape (n_features, n_classes))
Names of features seen during fit(). Defined only when X has feature names that are all strings.
Fit gradient boosting model.
Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass the xgb_model argument.
X (Any) –
Input feature matrix. See Markers for a list of supported types.
When the tree_method is set to hist, internally, the QuantileDMatrix will be used instead of the DMatrix for conserving memory. However, this has performance implications when the device of the input data does not match the algorithm. For instance, if the input is a numpy array on CPU but cuda is used for training, then the data is first processed on CPU and then transferred to GPU.
y (Any) – Labels
sample_weight (Any | None) – instance weights
base_margin (Any | None) – Global bias for each instance. See Intercept for details.
eval_set (Sequence[Tuple[Any, Any]] | None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.
verbose (bool | int | None) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.
xgb_model (Booster | XGBModel | str | None) – File name of a stored XGBoost model, or a Booster instance, to be loaded before training (allows training continuation).
sample_weight_eval_set (Sequence[Any] | None) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array like object storing instance weights for the i-th validation set.
base_margin_eval_set (Sequence[Any] | None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.
feature_weights (Any | None) –
Deprecated since version 3.0.0.
Use feature_weights in __init__() or set_params() instead.
XGBModel
Get the underlying xgboost Booster of this model.
This raises an exception if fit() has not been called.
booster
The xgboost Booster of the underlying model.
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
routing – A MetadataRequest encapsulating routing information.
MetadataRequest
Gets the number of xgboost boosting rounds.
Intercept (bias) property
For tree-based model, the returned value is the base_score.
intercept_
array of shape (1,) or [n_classes]
Load the model from a file or a bytearray.
The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) are only saved when using JSON or UBJSON (default) format. Also, parameters that are not part of the model (like metrics, max_depth, etc) are not saved, see Model IO for more info.
model.save_model("model.json")
model.load_model("model.json")

# or
model.save_model("model.ubj")
model.load_model("model.ubj")

# or
buf = model.save_raw()
model.load_model(buf)
Number of features seen during fit().
Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. The estimator uses inplace_predict by default and falls back to using DMatrix if devices between the data and the estimator don’t match.
Note
This function is only thread safe for gbtree and dart.
X (Any) – Data to predict with. See Markers for a list of supported types.
output_margin (bool) – Whether to output the raw untransformed margin value.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (Any | None) – Global bias for each instance. See Intercept for details.
iteration_range (Tuple[int | integer, int | integer] | None) –
Specifies which layers of trees are used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means that only the forests built during rounds [10, 20) (half-open interval) are used in this prediction.
Added in version 1.4.0.
prediction
Save the model to a file.
The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) are only saved when using JSON or UBJSON (default) format. Also, parameters that are not part of the model (like metrics, max_depth, etc) are not saved, see Model IO for more info.
model.save_model("model.json")
# or
model.save_model("model.ubj")
Return coefficient of determination on test data.
The coefficient of determination, \(R^2\), is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a \(R^2\) score of 0.0.
X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
score – \(R^2\) of self.predict(X) w.r.t. y.
Notes
The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with the default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).
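The definition of \(R^2\) given above can be checked by hand with a few lines of NumPy. The helper name r2 and the sample values are illustrative, and the sketch ignores sample weights and multi-output targets.

```python
import numpy as np


def r2(y_true, y_pred):
    u = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
    v = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
    return 1.0 - u / v


y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r2(y_true, y_true)                                 # exact predictions
constant = r2(y_true, np.full_like(y_true, y_true.mean()))   # always the mean
worse = r2(y_true, y_true[::-1])                             # reversed targets
print(perfect, constant, worse)
```

The three cases correspond to the best possible score, the constant-model baseline, and a model that is worse than the baseline and therefore scores below zero.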
Configure whether metadata should be requested to be passed to the fit
method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with
enable_metadata_routing=True
(seesklearn.set_config()
). Please check the User Guide on how the routing mechanism works.The options for each parameter are:
True
: metadata is requested, and passed tofit
if provided. The request is ignored if metadata is not provided.
False
: metadata is not requested and the meta-estimator will not pass it tofit
.
None
: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str
: metadata should be passed to the meta-estimator with this given alias instead of the original name.The default (
sklearn.utils.metadata_routing.UNCHANGED
) retains the existing request. This allows you to change the request for some parameters and not others.Added in version 1.3.
base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in fit.
base_margin_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin_eval_set parameter in fit.
eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_set parameter in fit.
feature_weights (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for feature_weights parameter in fit.
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
sample_weight_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight_eval_set parameter in fit.
verbose (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for verbose parameter in fit.
xgb_model (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for xgb_model parameter in fit.
self (object) – The updated object.
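The four request options can be pictured with a small toy model. The sketch below is plain Python and is not scikit-learn's implementation: the function resolve_request, its arguments, and the use of ValueError are all hypothetical, chosen only to mirror the behaviour described for True, False, None, and a str alias.

```python
def resolve_request(request, name, provided):
    """Toy model of one routing decision (hypothetical helper).

    request  -- True, False, None, or a str alias
    name     -- parameter name on the sub-estimator, e.g. "sample_weight"
    provided -- dict of metadata the user handed to the meta-estimator
    Returns the keyword arguments that would be forwarded to the method.
    """
    if isinstance(request, str):
        # Alias: the user supplies the metadata under the alias name.
        return {name: provided[request]} if request in provided else {}
    if request is True:
        # Requested: forwarded when present, silently skipped otherwise.
        return {name: provided[name]} if name in provided else {}
    if request is False:
        # Not requested: never forwarded.
        return {}
    # None: not requested, and supplying the metadata is an error.
    if name in provided:
        raise ValueError(f"{name} was provided but not requested")
    return {}


weights = [0.5, 1.5]
forwarded = resolve_request(True, "sample_weight", {"sample_weight": weights})
aliased = resolve_request("my_weights", "sample_weight", {"my_weights": weights})
dropped = resolve_request(False, "sample_weight", {"sample_weight": weights})
```

In this toy model, forwarded and aliased both carry the weights under the name sample_weight, while dropped is empty.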
Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.
self
params (Any)
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in predict.
iteration_range (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for iteration_range parameter in predict.
output_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for output_margin parameter in predict.
validate_features (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for validate_features parameter in predict.
self (object) – The updated object.
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
self (object) – The updated object.
self (XGBRegressor)
Bases: ClassifierMixin, XGBModel
Implementation of the scikit-learn API for XGBoost classification. See Using the Scikit-Learn Estimator Interface for more information.
n_estimators (Optional[int]) – Number of boosting rounds.
max_depth (Optional[int]) – Maximum tree depth for base learners.
max_leaves (Optional[int]) – Maximum number of leaves; 0 indicates no limit.
max_bin (Optional[int]) – If using histogram-based algorithm, maximum number of bins per feature
grow_policy (Optional[str]) –
Tree growing policy.
depthwise: Favors splitting at nodes closest to the root.
lossguide: Favors splitting at nodes with highest loss change.
learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)
verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
objective (Union[str, xgboost.sklearn._SklObjWProto, Callable[[Any, Any], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) –
Specify the learning task and the corresponding learning objective or a custom objective function to be used.
For custom objective, see Custom Objective and Evaluation Metric and Custom objective and metric for more information, along with the end note for function signatures.
booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.
tree_method (Optional[str]) – Specify which tree method to use. Defaults to auto. If this parameter is left at its default, XGBoost will choose the most conservative option available. It’s recommended to study this option in the tree method section of the parameters document.
n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight (Optional[float]) – Minimum sum of instance weight (hessian) needed in a child.
max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.
subsample (Optional[float]) – Subsample ratio of the training instance.
sampling_method (Optional[str]) –
Sampling method. Used only by the GPU version of the hist tree method.
uniform: Select random training instances uniformly.
gradient_based: Select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)
colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.
colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.
reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).
reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).
scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
base_score (Optional[float]) – The initial prediction score of all instances, global bias.
random_state (Union[numpy.random.mtrand.RandomState, numpy.random._generator.Generator, int, NoneType]) –
Random number seed.
Note
Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.
missing (float) – Value in the data to be treated as a missing value. Defaults to numpy.nan.
num_parallel_tree (Optional[int]) – Used for boosting random forest.
monotone_constraints (Union[Dict[str, int], str, NoneType]) – Constraint of variable monotonicity. See tutorial for more information.
interaction_constraints (Union[str, List[Tuple[str]], NoneType]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information.
importance_type (Optional[str]) –
The feature importance type for the feature_importances_ property:
For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.
For linear model, only “weight” is defined and it’s the normalized coefficients without bias.
device (Optional[str]) –
Added in version 2.0.0.
Device ordinal, available options are cpu, cuda, and gpu.
validate_parameters (Optional[bool]) – Give warnings for unknown parameter.
enable_categorical (bool) – See the same parameter of DMatrix for details.
feature_types (Optional[Sequence[str]]) –
Added in version 1.7.0.
Used for specifying feature types without constructing a dataframe. See the DMatrix for details.
feature_weights (Optional[ArrayLike]) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.
max_cat_to_onehot (Optional[int]) –
Added in version 1.6.0.
Note
This parameter is experimental
A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When the number of categories is less than the threshold, one-hot encoding is chosen; otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
max_cat_threshold (Optional[int]) –
Added in version 1.7.0.
Note
This parameter is experimental
Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
multi_strategy (Optional[str]) –
Added in version 2.0.0.
Note
This parameter is a work in progress.
The strategy used for training multi-target models, including multi-target regression and multi-class classification. See Multiple Outputs for more information.
one_output_per_tree: One model for each target.
multi_output_tree: Use multi-target trees.
eval_metric (Union[str, List[Union[str, Callable]], Callable, NoneType]) –
Added in version 1.6.0.
Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metric in XGBoost (See XGBoost Parameters), one of the metrics in sklearn.metrics, or any other user defined metric that looks like sklearn.metrics.
If custom objective is also provided, then custom metric should implement the corresponding reverse link function.
Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.
For advanced usage on Early stopping like directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.
See Custom Objective and Evaluation Metric and Custom objective and metric for more information.
    import xgboost as xgb
    from sklearn.datasets import load_diabetes
    from sklearn.metrics import mean_absolute_error

    X, y = load_diabetes(return_X_y=True)
    reg = xgb.XGBRegressor(
        tree_method="hist",
        eval_metric=mean_absolute_error,
    )
    reg.fit(X, y, eval_set=[(X, y)])
early_stopping_rounds (Optional[int]) –
Added in version 1.6.0.
Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().
If early stopping occurs, the model will have two additional attributes: best_score and best_iteration. These are used by the predict() and apply() methods to determine the optimal number of trees during inference. If users want to access the full model (including trees built after early stopping), they can specify the iteration_range in these inference methods. In addition, other utilities like model plotting can also use the entire model.
If you prefer to discard the trees after best_iteration, consider using the callback function xgboost.callback.EarlyStopping.
If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.
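The stopping rule can be illustrated with a plain-Python simulation. This is only a sketch of the rule as described here, for a metric that is minimized; it is not XGBoost's implementation, and the function name and the loss values are hypothetical.

```python
def simulate_early_stopping(scores, early_stopping_rounds):
    """Return (best_iteration, best_score, rounds_run) for a minimized metric.

    Illustrative model of the rule only; not XGBoost's implementation.
    """
    best_iteration, best_score = 0, scores[0]
    for i, score in enumerate(scores):
        if score < best_score:
            best_iteration, best_score = i, score
        elif i - best_iteration >= early_stopping_rounds:
            # No improvement in the last `early_stopping_rounds` rounds.
            return best_iteration, best_score, i + 1
    return best_iteration, best_score, len(scores)


# Hypothetical validation losses, one per boosting round.
losses = [0.60, 0.53, 0.50, 0.51, 0.52, 0.55, 0.49]
best_iteration, best_score, rounds_run = simulate_early_stopping(losses, 3)
```

With these losses the best value is found at round 2 (0-based), training stops after six rounds, and the lower loss in the seventh round is never reached. That is the trade-off a small early_stopping_rounds makes.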
callbacks (Optional[List[xgboost.callback.TrainingCallback]]) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.
Note
States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.
    for params in parameters_grid:
        # be sure to (re)initialize the callbacks before each run
        callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
        reg = xgb.XGBRegressor(**params, callbacks=callbacks)
        reg.fit(X, y)
kwargs – Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found in XGBoost Parameters. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.
Note
**kwargs unsupported by scikit-learn
**kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.
Note
Custom objective function
A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> [grad, hess] or objective(y_true, y_pred, *, sample_weight) -> [grad, hess]:
y_true – The target values.
y_pred – The predicted values.
sample_weight – Optional sample weights.
grad – The value of the gradient for each sample point.
hess – The value of the second derivative for each sample point.
Note that, if the custom objective produces negative values for the Hessian, these will be clipped. If the objective is non-convex, one might also consider using the expected Hessian (Fisher information) instead.
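A minimal sketch of a function with this signature, using squared error as the loss. The function name squared_error is hypothetical, and scaling both the gradient and the Hessian by the weights is an assumption of this sketch rather than something stated in this reference.

```python
import numpy as np


def squared_error(y_true, y_pred, *, sample_weight=None):
    """Gradient and Hessian of 0.5 * (y_pred - y_true) ** 2 per sample."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    grad = y_pred - y_true        # first derivative w.r.t. y_pred
    hess = np.ones_like(y_pred)   # second derivative is constant
    if sample_weight is not None:
        # Assumption: weights scale both derivatives.
        w = np.asarray(sample_weight, dtype=float)
        grad, hess = grad * w, hess * w
    return grad, hess


grad, hess = squared_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
w_grad, w_hess = squared_error(
    [1.0, 2.0, 3.0], [1.5, 2.0, 2.0], sample_weight=[2.0, 1.0, 0.5]
)
```

A function of this shape is what the objective parameter expects when a callable is supplied. Because this Hessian is always positive, the clipping mentioned in the note would not come into play.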
Return the predicted leaf of every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.
X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.
array_like, shape=[n_samples, n_trees]
The best iteration obtained by early stopping. This attribute is 0-based, for instance if the best iteration is the first round, then best_iteration is 0.
The best score obtained by early stopping.
Coefficients property
Note
Coefficients are defined only for linear learners
Coefficients are only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).
coef_ – array of shape [n_features] or [n_classes, n_features]
Return the evaluation results.
If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit() function, the evals_result will contain the eval_metrics passed to the fit() function.
The returned evaluation result is a dictionary:
{'validation_0': {'logloss': ['0.604835', '0.531479']}, 'validation_1': {'logloss': ['0.41965', '0.17686']}}
evals_result
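As an illustration of working with this structure, the sketch below finds the round with the lowest value for each validation set and metric. It is plain Python over a dictionary in the shape shown above; the values are copied from that example, and converting with float() is a precaution so the loop handles either string or numeric entries.

```python
# Result in the shape shown above; values copied from that example.
evals_result = {
    "validation_0": {"logloss": ["0.604835", "0.531479"]},
    "validation_1": {"logloss": ["0.41965", "0.17686"]},
}

best = {}
for eval_name, metrics in evals_result.items():
    for metric_name, history in metrics.items():
        values = [float(v) for v in history]  # tolerate str or float entries
        best_round = min(range(len(values)), key=values.__getitem__)
        best[(eval_name, metric_name)] = (best_round, values[best_round])
```

Each key of best is an (eval set, metric) pair and each value is a (0-based round, metric value) pair.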
Feature importances property, return depends on importance_type parameter. When the model is trained with a multi-class/multi-label/multi-target dataset, the feature importance is “averaged” over all targets. The “average” is defined based on the importance type. For instance, if the importance type is “total_gain”, then the score is the sum of loss change for each split from all trees.
feature_importances_ – array of shape [n_features], except for a multi-class linear model, which returns an array with shape (n_features, n_classes)
Names of features seen during fit(). Defined only when X has feature names that are all strings.
Fit gradient boosting classifier.
Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass the xgb_model argument.
X (Any) –
Input feature matrix. See Markers for a list of supported types.
When the tree_method is set to hist, internally, the QuantileDMatrix will be used instead of the DMatrix for conserving memory. However, this has performance implications when the device of input data is not matched with algorithm. For instance, if the input is a numpy array on CPU but cuda is used for training, then the data is first processed on CPU then transferred to GPU.
y (Any) – Labels
sample_weight (Any | None) – instance weights
base_margin (Any | None) – Global bias for each instance. See Intercept for details.
eval_set (Sequence[Tuple[Any, Any]] | None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.
verbose (bool | int | None) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.
xgb_model (Booster | str | XGBModel | None) – File name of a stored XGBoost model, or a Booster instance, to be loaded before training (allows training continuation).
sample_weight_eval_set (Sequence[Any] | None) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array like object storing instance weights for the i-th validation set.
base_margin_eval_set (Sequence[Any] | None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.
feature_weights (Any | None) –
Deprecated since version 3.0.0.
Use feature_weights in __init__() or set_params() instead.
Get the underlying xgboost Booster of this model.
This will raise an exception if fit has not been called.
booster – An xgboost Booster of the underlying model.
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
routing – A MetadataRequest encapsulating routing information.
MetadataRequest
Gets the number of xgboost boosting rounds.
Intercept (bias) property
For tree-based model, the returned value is the base_score.
intercept_ – array of shape (1,) or [n_classes]
Load the model from a file or a bytearray.
The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) are only saved when using JSON or UBJSON (default) format. Also, parameters that are not part of the model (like metrics, max_depth, etc) are not saved, see Model IO for more info.
    model.save_model("model.json")
    model.load_model("model.json")

    # or
    model.save_model("model.ubj")
    model.load_model("model.ubj")

    # or
    buf = model.save_raw()
    model.load_model(buf)
Number of features seen during fit().
Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. The estimator uses inplace_predict by default and falls back to using DMatrix if devices between the data and the estimator don’t match.
Note
This function is only thread safe for gbtree and dart.
X (Any) – Data to predict with. See Markers for a list of supported types.
output_margin (bool) – Whether to output the raw untransformed margin value.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (Any | None) – Global bias for each instance. See Intercept for details.
iteration_range (Tuple[int | integer, int | integer] | None) –
Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means that only the forests built during rounds [10, 20) (a half-open interval) are used in this prediction.
Added in version 1.4.0.
prediction
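The half-open convention for iteration_range matches Python's own range(). The sketch below is plain Python, not a call into XGBoost; the helper rounds_used and the forest size of 4 are hypothetical and only illustrate how many rounds and trees a given range would cover.

```python
def rounds_used(iteration_range):
    """Boosting rounds selected by a half-open [begin, end) range."""
    begin, end = iteration_range
    return list(range(begin, end))


rounds = rounds_used((10, 20))

# Hypothetical forest size: with num_parallel_tree trees built per round,
# the selected slice would hold len(rounds) * num_parallel_tree trees.
num_parallel_tree = 4
n_trees = len(rounds) * num_parallel_tree
```

Round 10 is included and round 20 is not, so ten rounds are selected.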
Predict the probability of each X example being of a given class. If the model is trained with early stopping, then best_iteration is used automatically. The estimator uses inplace_predict by default and falls back to using DMatrix if devices between the data and the estimator don’t match.
Note
This function is only thread safe for gbtree and dart.
X (Any) – Feature matrix. See Markers for a list of supported types.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (Any | None) – Global bias for each instance. See Intercept for details.
iteration_range (Tuple[int | integer, int | integer] | None) – Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means that only the forests built during rounds [10, 20) (a half-open interval) are used in this prediction.
A numpy array of shape (n_samples, n_classes) with the probability of each data example being of a given class.
prediction
Save the model to a file.
The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) are only saved when using JSON or UBJSON (default) format. Also, parameters that are not part of the model (like metrics, max_depth, etc) are not saved, see Model IO for more info.
    model.save_model("model.json")

    # or
    model.save_model("model.ubj")
Return accuracy on provided data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
score – Mean accuracy of self.predict(X) w.r.t. y.
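To see why subset accuracy is called a harsh metric, the sketch below compares it with per-label accuracy on a tiny multi-label example. It uses NumPy only, and the label arrays are made up for illustration.

```python
import numpy as np

# Hypothetical multi-label targets: 3 samples, 2 labels each.
y_true = np.array([[1, 0], [0, 1], [1, 1]])
y_pred = np.array([[1, 0], [0, 0], [1, 1]])

# Subset accuracy: a sample counts only if every one of its labels matches.
subset_accuracy = np.all(y_true == y_pred, axis=1).mean()

# Per-label accuracy, for contrast, gives partial credit.
label_accuracy = (y_true == y_pred).mean()
```

One wrong label in the second sample costs that whole sample under subset accuracy (2 of 3 samples correct), while per-label accuracy still credits its other label (5 of 6 labels correct).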
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in fit.
base_margin_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin_eval_set parameter in fit.
eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_set parameter in fit.
feature_weights (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for feature_weights parameter in fit.
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
sample_weight_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight_eval_set parameter in fit.
verbose (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for verbose parameter in fit.
xgb_model (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for xgb_model parameter in fit.
self (object) – The updated object.
Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.
self
params (Any)
Configure whether metadata should be requested to be passed to the predict_proba method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict_proba.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in predict_proba.
iteration_range (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for iteration_range parameter in predict_proba.
validate_features (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for validate_features parameter in predict_proba.
self (object) – The updated object.
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in predict.
iteration_range (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for iteration_range parameter in predict.
output_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for output_margin parameter in predict.
validate_features (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for validate_features parameter in predict.
self (object) – The updated object.
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
self (object) – The updated object.
self (XGBClassifier)
Bases: XGBRankerMixIn, XGBModel
Implementation of the Scikit-Learn API for XGBoost Ranking.
See Learning to Rank for an introduction.
See Using the Scikit-Learn Estimator Interface for more information.
n_estimators (Optional[int]) – Number of gradient boosted trees. Equivalent to number of boosting rounds.
max_depth (Optional[int]) – Maximum tree depth for base learners.
max_leaves (Optional[int]) – Maximum number of leaves; 0 indicates no limit.
max_bin (Optional[int]) – If using histogram-based algorithm, maximum number of bins per feature
grow_policy (Optional[str]) –
Tree growing policy.
depthwise: Favors splitting at nodes closest to the root.
lossguide: Favors splitting at nodes with highest loss change.
learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)
verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
objective (Union[str, xgboost.sklearn._SklObjWProto, Callable[[Any, Any], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) –
Specify the learning task and the corresponding learning objective or a custom objective function to be used.
For custom objective, see Custom Objective and Evaluation Metric and Custom objective and metric for more information, along with the end note for function signatures.
booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.
tree_method (Optional[str]) – Specify which tree method to use. Defaults to auto. If this parameter is left at its default, XGBoost will choose the most conservative option available. It’s recommended to study this option in the tree method section of the parameters document.
n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight (Optional[float]) – Minimum sum of instance weight (hessian) needed in a child.
max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.
subsample (Optional[float]) – Subsample ratio of the training instance.
sampling_method (Optional[str]) –
Sampling method. Used only by the GPU version of the hist tree method.
uniform: Select random training instances uniformly.
gradient_based: Select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)
colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.
colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.
reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).
reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).
scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
base_score (Optional[float]) – The initial prediction score of all instances, global bias.
random_state (Union[numpy.random.mtrand.RandomState, numpy.random._generator.Generator, int, NoneType]) –
Random number seed.
Note
Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.
missing (float) – Value in the data that is to be treated as a missing value. Defaults to numpy.nan.
num_parallel_tree (Optional[int]) – Used for boosting random forest.
monotone_constraints (Union[Dict[str, int], str, NoneType]) – Constraint of variable monotonicity. See tutorial for more information.
interaction_constraints (Union[str, List[Tuple[str]], NoneType]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information.
importance_type (Optional[str]) –
The feature importance type for the feature_importances_ property:
For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.
For linear model, only “weight” is defined and it’s the normalized coefficients without bias.
device (Optional[str]) –
Added in version 2.0.0.
Device ordinal, available options are cpu, cuda, and gpu.
validate_parameters (Optional[bool]) – Give warnings for unknown parameter.
enable_categorical (bool) – See the same parameter of DMatrix for details.
feature_types (Optional[Sequence[str]]) –
Added in version 1.7.0.
Used for specifying feature types without constructing a dataframe. See DMatrix for details.
feature_weights (Optional[ArrayLike]) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.
max_cat_to_onehot (Optional[int]) –
Added in version 1.6.0.
Note
This parameter is experimental
A threshold for deciding whether XGBoost should use a one-hot encoding based split for categorical data. When the number of categories is less than the threshold, one-hot encoding is chosen; otherwise the categories are partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
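The threshold rule for max_cat_to_onehot can be sketched in a few lines. This only illustrates the decision as it is described here and is not XGBoost's internal code; the helper name choose_categorical_split is hypothetical.

```python
def choose_categorical_split(n_categories: int, max_cat_to_onehot: int) -> str:
    """Illustrative only: mirror the rule described for max_cat_to_onehot.

    Hypothetical helper, not part of the xgboost API.
    """
    # Fewer categories than the threshold: one-hot based split.
    if n_categories < max_cat_to_onehot:
        return "onehot"
    # Otherwise the categories are partitioned into children nodes.
    return "partition"


print(choose_categorical_split(3, 4))   # few categories
print(choose_categorical_split(10, 4))  # many categories
```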
max_cat_threshold (Optional[int]) –
Added in version 1.7.0.
Note
This parameter is experimental
Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
multi_strategy (Optional[str]) –
Added in version 2.0.0.
Note
This parameter is a work in progress.
The strategy used for training multi-target models, including multi-target regression and multi-class classification. See Multiple Outputs for more information.
one_output_per_tree: One model for each target.
multi_output_tree: Use multi-target trees.
eval_metric (Union[str, List[Union[str, Callable]], Callable, NoneType]) –
Added in version 1.6.0.
Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metrics in XGBoost (See XGBoost Parameters), one of the metrics in sklearn.metrics, or any other user defined metric that looks like sklearn.metrics.
If custom objective is also provided, then custom metric should implement the corresponding reverse link function.
Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.
For advanced usage of early stopping, such as directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.
See Custom Objective and Evaluation Metric and Custom objective and metric for more information.
    import xgboost as xgb
    from sklearn.datasets import load_diabetes
    from sklearn.metrics import mean_absolute_error

    X, y = load_diabetes(return_X_y=True)
    # A scikit-learn style metric is treated as a cost to minimize
    reg = xgb.XGBRegressor(
        tree_method="hist",
        eval_metric=mean_absolute_error,
    )
    reg.fit(X, y, eval_set=[(X, y)])
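A user-defined metric that "looks like sklearn.metrics" takes the true and predicted values and returns a single number. The following is a minimal sketch, assuming the metric is a cost (lower is better) as described for callables; both function names are hypothetical and only numpy is used.

```python
import numpy as np


def mean_abs_error(y_true, y_pred) -> float:
    """Cost-style metric: lower is better, so it can be minimized directly."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def negated_r2(y_true, y_pred) -> float:
    """A score where higher is better can be negated to behave like a cost."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(-(1.0 - ss_res / ss_tot))


print(mean_abs_error([1, 2], [1, 4]))
print(negated_r2([1, 2, 3], [1, 2, 3]))
```

Either function could be passed as eval_metric in the same way as mean_absolute_error.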
early_stopping_rounds (Optional[int]) –
Added in version 1.6.0.
Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().
If early stopping occurs, the model will have two additional attributes: best_score and best_iteration. These are used by the predict() and apply() methods to determine the optimal number of trees during inference. If users want to access the full model (including trees built after early stopping), they can specify the iteration_range in these inference methods. In addition, other utilities like model plotting can also use the entire model.
If you prefer to discard the trees after best_iteration, consider using the callback function xgboost.callback.EarlyStopping.
If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.
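The stopping rule can be illustrated without training a model. The sketch below is a simplified simulation over a list of metric values (lower is better) and is not XGBoost's implementation; simulate_early_stopping is a hypothetical helper.

```python
def simulate_early_stopping(scores, early_stopping_rounds):
    """Return (best_iteration, best_score, rounds_run) for a cost metric.

    Simplified illustration: training stops once the metric has failed to
    improve for `early_stopping_rounds` consecutive rounds.
    """
    best_iteration, best_score = 0, float("inf")
    rounds_run = 0
    for i, score in enumerate(scores):
        rounds_run = i + 1
        if score < best_score:
            # 0-based index, like the best_iteration attribute
            best_iteration, best_score = i, score
        elif i - best_iteration >= early_stopping_rounds:
            break
    return best_iteration, best_score, rounds_run


result = simulate_early_stopping([0.9, 0.7, 0.6, 0.65, 0.64, 0.63, 0.5], 2)
print(result)
```

In this run the final value 0.5 is never reached, because two rounds in a row fail to improve on 0.6.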
callbacks (Optional[List[xgboost.callback.TrainingCallback]]) –
List of callback functions that are applied at the end of each iteration. It is possible to use predefined callbacks by using the Callback API.
Note
States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.
    for params in parameters_grid:
        # be sure to (re)initialize the callbacks before each run
        callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
        reg = xgb.XGBRegressor(**params, callbacks=callbacks)
        reg.fit(X, y)
kwargs (Optional[Any]) –
Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found in XGBoost Parameters. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.
Note
**kwargs unsupported by scikit-learn
**kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.
Note
A custom objective function is currently not supported by XGBRanker.
Note
Query group information is only required for ranking training but not prediction. Multiple groups can be predicted on a single call to predict().
When fitting the model with the group parameter, your data need to be sorted by the query group first. group is an array that contains the size of each query group.
Similarly, when fitting the model with the qid parameter, the data should be sorted according to query index and qid is an array that contains the query index for each training sample.
For example, if your original data look like:
qid  label  features
1    0      x_1
1    1      x_2
1    0      x_3
2    0      x_4
2    1      x_5
2    1      x_6
2    1      x_7
then the fit() method can be called with either the group array as [3, 4] or with qid as [1, 1, 1, 2, 2, 2, 2], that is, the qid column. Also, qid can be a special column of the input X instead of a separate parameter; see fit() for more info.
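The relationship between qid and group can be sketched with numpy alone. The arrays below are made-up illustration data; the snippet sorts rows by query index and derives the group sizes from the sorted qid array.

```python
import numpy as np

# Unsorted example data: one query id and one label per row.
qid = np.array([2, 1, 2, 1, 2, 1, 2])
label = np.array([0, 0, 1, 1, 1, 0, 1])

# Sort rows by query id; a stable sort keeps the original order within a query.
order = np.argsort(qid, kind="stable")
qid_sorted = qid[order]
label_sorted = label[order]

# Group sizes, in the same order as the sorted queries.
_, group = np.unique(qid_sorted, return_counts=True)

print(qid_sorted.tolist())  # [1, 1, 1, 2, 2, 2, 2]
print(group.tolist())       # [3, 4]
```

The same `order` index would also be applied to the feature matrix so that rows, labels and query ids stay aligned.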
Return the predicted leaf of every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.
X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.
array_like, shape=[n_samples, n_trees]
The best iteration obtained by early stopping. This attribute is 0-based, for instance if the best iteration is the first round, then best_iteration is 0.
The best score obtained by early stopping.
Coefficients property
Note
Coefficients are defined only for linear learners
Coefficients are only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).
coef_
array of shape [n_features] or [n_classes, n_features]
Return the evaluation results.
If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit() function, the evals_result will contain the eval_metrics passed to the fit() function.
The returned evaluation result is a dictionary:
{'validation_0': {'logloss': ['0.604835', '0.531479']}, 'validation_1': {'logloss': ['0.41965', '0.17686']}}
evals_result
Feature importances property; the return value depends on the importance_type parameter. When the model is trained with a multi-class/multi-label/multi-target dataset, the feature importance is “averaged” over all targets. The “average” is defined based on the importance type. For instance, if the importance type is “total_gain”, then the score is the sum of loss change for each split from all trees.
feature_importances_ – array of shape [n_features], except for a multi-class linear model, which returns an array with shape (n_features, n_classes)
Names of features seen during fit(). Defined only when X has feature names that are all strings.
Fit gradient boosting ranker.
Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass the xgb_model argument.
X (Any) –
Feature matrix. See Markers for a list of supported types.
When this is a pandas.DataFrame or a cudf.DataFrame, it may contain a special column called qid for specifying the query index. Using a special column is the same as using the qid parameter, except that it is compatible with sklearn utility functions like sklearn.model_selection.cross_validate(). The same convention applies to XGBRanker.score() and XGBRanker.predict().
qid  feat_0      feat_1
0    \(x_{00}\)  \(x_{01}\)
1    \(x_{10}\)  \(x_{11}\)
1    \(x_{20}\)  \(x_{21}\)
When the tree_method is set to hist, internally, the QuantileDMatrix will be used instead of the DMatrix for conserving memory. However, this has performance implications when the device of the input data does not match the device used by the algorithm. For instance, if the input is a numpy array on CPU but cuda is used for training, then the data is first processed on CPU then transferred to GPU.
y (Any) – Labels
group (Any | None) – Size of each query group of training data. Should have as many elements as the query groups in the training data. If this is set to None, then the user must provide qid.
qid (Any | None) – Query ID for each training sample. Should have the size of n_samples. If this is set to None, then the user must provide group or a special column in X.
sample_weight (Any | None) –
Query group weights
Note
Weights are per-group for ranking tasks
In a ranking task, one weight is assigned to each query group/id (not each data point). This is because we only care about the relative ordering of data points within each group, so it doesn’t make sense to assign weights to individual data points.
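To make the shape requirement concrete, the sketch below builds a weight vector with one entry per query group rather than one per row. The data is made up for illustration, and the per-row expansion is shown only for inspection; the weights themselves remain per-group.

```python
import numpy as np

qid = np.array([1, 1, 1, 2, 2, 2, 2])   # sorted query ids, one per row
group_weight = np.array([1.0, 0.5])     # one weight per query group

n_groups = np.unique(qid).size
# The weight vector is indexed by group, not by row.
assert group_weight.shape == (n_groups,)

# For inspection only: which weight the group of each row carries.
_, group_index = np.unique(qid, return_inverse=True)
per_row_view = group_weight[group_index]
print(per_row_view.tolist())
```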
base_margin (Any | None) – Global bias for each instance. See Intercept for details.
eval_set (Sequence[Tuple[Any, Any]] | None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.
eval_group (Sequence[Any] | None) – A list in which eval_group[i] is the list containing the sizes of all query groups in the i-th pair in eval_set.
eval_qid (Sequence[Any] | None) – A list in which eval_qid[i] is the array containing the query ID of the i-th pair in eval_set. The special column convention in X applies to validation datasets as well.
verbose (bool | int | None) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed once every verbose boosting stages. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.
xgb_model (Booster | str | XGBModel | None) – File name of a stored XGBoost model, or a Booster instance, to be loaded before training (allows training continuation).
sample_weight_eval_set (Sequence[Any] | None) –
A list of the form [L_1, L_2, …, L_n], where each L_i is a list of group weights on the i-th validation set.
Note
Weights are per-group for ranking tasks
In a ranking task, one weight is assigned to each query group (not each data point). This is because we only care about the relative ordering of data points within each group, so it doesn’t make sense to assign weights to individual data points.
base_margin_eval_set (Sequence[Any] | None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.
feature_weights (Any | None) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.
Get the underlying xgboost Booster of this model.
This will raise an exception if fit has not been called.
booster
The xgboost Booster of the underlying model.
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
routing – A MetadataRequest encapsulating routing information.
MetadataRequest
Gets the number of xgboost boosting rounds.
Intercept (bias) property
For tree-based model, the returned value is the base_score.
intercept_
array of shape (1,) or [n_classes]
Load the model from a file or a bytearray.
The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) are only saved when using JSON or UBJSON (default) format. Also, parameters that are not part of the model (like metrics, max_depth, etc) are not saved, see Model IO for more info.
    model.save_model("model.json")
    model.load_model("model.json")

    # or
    model.save_model("model.ubj")
    model.load_model("model.ubj")

    # or
    buf = model.save_raw()
    model.load_model(buf)
Number of features seen during fit().
Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. The estimator uses inplace_predict by default and falls back to using DMatrix if devices between the data and the estimator don’t match.
Note
This function is only thread safe for gbtree and dart.
X (Any) – Data to predict with. See Markers for a list of supported types.
output_margin (bool) – Whether to output the raw untransformed margin value.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (Any | None) – Global bias for each instance. See Intercept for details.
iteration_range (Tuple[int | integer, int | integer] | None) –
Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means that only the forests built during rounds [10, 20) (a half-open interval) are used in this prediction.
Added in version 1.4.0.
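The half-open convention of iteration_range can be checked with plain Python. The tree count at the end assumes a single-output model in which every round builds num_parallel_tree trees; both the helper rounds_in_range and the value of num_parallel_tree are hypothetical.

```python
def rounds_in_range(iteration_range):
    """Boosting rounds selected by a half-open (begin, end) range."""
    begin, end = iteration_range
    return list(range(begin, end))


num_parallel_tree = 4  # hypothetical forest size per round
selected = rounds_in_range((10, 20))

print(selected[0], selected[-1])          # first and last selected round
print(len(selected) * num_parallel_tree)  # trees used under the assumption above
```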
prediction
Save the model to a file.
The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) are only saved when using JSON or UBJSON (default) format. Also, parameters that are not part of the model (like metrics, max_depth, etc) are not saved, see Model IO for more info.
    model.save_model("model.json")
    # or
    model.save_model("model.ubj")
Evaluate score for data using the last evaluation metric. If the model is trained with early stopping, then best_iteration is used automatically.
X (Union[pd.DataFrame, cudf.DataFrame]) – Feature matrix. A DataFrame with a special qid column.
y (Any) – Labels
The result of the first evaluation metric for the ranker.
score
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- base_margin : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for base_margin parameter in fit.
- base_margin_eval_set : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for base_margin_eval_set parameter in fit.
- eval_group : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for eval_group parameter in fit.
- eval_qid : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for eval_qid parameter in fit.
- eval_set : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for eval_set parameter in fit.
- feature_weights : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for feature_weights parameter in fit.
- group : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for group parameter in fit.
- qid : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for qid parameter in fit.
- sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for sample_weight parameter in fit.
- sample_weight_eval_set : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for sample_weight_eval_set parameter in fit.
- verbose : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for verbose parameter in fit.
- xgb_model : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for xgb_model parameter in fit.
- self : object
The updated object.
self (XGBRanker)
Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.
self
params (Any)
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- base_margin : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for base_margin parameter in predict.
- iteration_range : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for iteration_range parameter in predict.
- output_margin : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for output_margin parameter in predict.
- validate_features : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for validate_features parameter in predict.
- self : object
The updated object.
Bases: XGBRegressor
scikit-learn API for XGBoost random forest regression. See Using the Scikit-Learn Estimator Interface for more information.
n_estimators (Optional[int]) – Number of trees in random forest to fit.
max_depth (Optional[int]) – Maximum tree depth for base learners.
max_leaves (Optional[int]) – Maximum number of leaves; 0 indicates no limit.
max_bin (Optional[int]) – If using histogram-based algorithm, maximum number of bins per feature.
grow_policy (Optional[str]) –
Tree growing policy.
depthwise: Favors splitting at nodes closest to the root.
lossguide: Favors splitting at nodes with the highest loss change.
learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)
verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
objective (Union[str, xgboost.sklearn._SklObjWProto, Callable[[Any, Any], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) –
Specify the learning task and the corresponding learning objective or a custom objective function to be used.
For custom objective, see Custom Objective and Evaluation Metric and Custom objective and metric for more information, along with the end note for function signatures.
booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.
tree_method (Optional[str]) – Specify which tree method to use. Defaults to auto. If this parameter is left at its default, XGBoost will choose the most conservative option available. It’s recommended to study this option in the tree method section of the parameters document.
n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight (Optional[float]) – Minimum sum of instance weight (hessian) needed in a child.
max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.
subsample (Optional[float]) – Subsample ratio of the training instance.
sampling_method (Optional[str]) –
Sampling method. Used only by the GPU version of the hist tree method.
uniform: Select random training instances uniformly.
gradient_based: Select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)
colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.
colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.
reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).
reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).
scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
base_score (Optional[float]) – The initial prediction score of all instances, global bias.
random_state (Union[numpy.random.mtrand.RandomState, numpy.random._generator.Generator, int, NoneType]) –
Random number seed.
Note
Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.
missing (float) – Value in the data that is to be treated as a missing value. Defaults to numpy.nan.
num_parallel_tree (Optional[int]) – Used for boosting random forest.
monotone_constraints (Union[Dict[str, int], str, NoneType]) – Constraint of variable monotonicity. See tutorial for more information.
interaction_constraints (Union[str, List[Tuple[str]], NoneType]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information.
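The nested-list form of interaction_constraints can be read as "two features may interact if some group contains both". The helper below is a simplified, hypothetical reading of that rule for illustration; it does not cover how XGBoost treats features that appear in no group, for which the tutorial is the reference.

```python
def may_interact(constraints, i, j):
    """True if features i and j appear together in at least one group."""
    return any(i in group and j in group for group in constraints)


constraints = [[0, 1], [2, 3, 4]]
print(may_interact(constraints, 0, 1))  # same group
print(may_interact(constraints, 1, 2))  # different groups
```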
importance_type (Optional[str]) –
The feature importance type for the feature_importances_ property:
For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.
For linear model, only “weight” is defined and it’s the normalized coefficients without bias.
device (Optional[str]) –
Added in version 2.0.0.
Device ordinal, available options are cpu, cuda, and gpu.
validate_parameters (Optional[bool]) – Give warnings for unknown parameter.
enable_categorical (bool) – See the same parameter of DMatrix for details.
feature_types (Optional[Sequence[str]]) –
Added in version 1.7.0.
Used for specifying feature types without constructing a dataframe. See DMatrix for details.
feature_weights (Optional[ArrayLike]) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.
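One way to picture how feature_weights bias column sampling is to normalize the weights into probabilities and draw columns with numpy. This is an assumption made for illustration and not XGBoost's internal sampler; column_sampling_probabilities is a hypothetical helper.

```python
import numpy as np


def column_sampling_probabilities(feature_weights):
    """Validate weights and normalize them into selection probabilities."""
    w = np.asarray(feature_weights, dtype=float)
    if np.any(w <= 0):
        # Mirrors the documented requirement that all values be > 0.
        raise ValueError("all feature weights must be greater than 0")
    return w / w.sum()


probs = column_sampling_probabilities([1.0, 1.0, 2.0])
rng = np.random.default_rng(0)
# Draw 2 of the 3 columns without replacement, biased by the weights.
sampled = rng.choice(probs.size, size=2, replace=False, p=probs)
print(probs.tolist(), sorted(sampled.tolist()))
```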
max_cat_to_onehot (Optional[int]) –
Added in version 1.6.0.
Note
This parameter is experimental
A threshold for deciding whether XGBoost should use a one-hot encoding based split for categorical data. When the number of categories is less than the threshold, one-hot encoding is chosen; otherwise the categories are partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
max_cat_threshold (Optional[int]) –
Added in version 1.7.0.
Note
This parameter is experimental
Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
multi_strategy (Optional[str]) –
Added in version 2.0.0.
Note
This parameter is a work in progress.
The strategy used for training multi-target models, including multi-target regression and multi-class classification. See Multiple Outputs for more information.
one_output_per_tree: One model for each target.
multi_output_tree: Use multi-target trees.
eval_metric (Union[str, List[Union[str, Callable]], Callable, NoneType]) –
Added in version 1.6.0.
Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metrics in XGBoost (See XGBoost Parameters), one of the metrics in sklearn.metrics, or any other user defined metric that looks like sklearn.metrics.
If custom objective is also provided, then custom metric should implement the corresponding reverse link function.
Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.
For advanced usage of early stopping, such as directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.
See Custom Objective and Evaluation Metric and Custom objective and metric for more information.
    import xgboost as xgb
    from sklearn.datasets import load_diabetes
    from sklearn.metrics import mean_absolute_error

    X, y = load_diabetes(return_X_y=True)
    # A scikit-learn style metric is treated as a cost to minimize
    reg = xgb.XGBRegressor(
        tree_method="hist",
        eval_metric=mean_absolute_error,
    )
    reg.fit(X, y, eval_set=[(X, y)])
early_stopping_rounds (Optional[int]) –
Added in version 1.6.0.
Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().
If early stopping occurs, the model will have two additional attributes: best_score and best_iteration. These are used by the predict() and apply() methods to determine the optimal number of trees during inference. If users want to access the full model (including trees built after early stopping), they can specify the iteration_range in these inference methods. In addition, other utilities like model plotting can also use the entire model.
If you prefer to discard the trees after best_iteration, consider using the callback function xgboost.callback.EarlyStopping.
If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.
callbacks (Optional[List[xgboost.callback.TrainingCallback]]) –
List of callback functions that are applied at the end of each iteration. It is possible to use predefined callbacks by using the Callback API.
Note
States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.
    for params in parameters_grid:
        # be sure to (re)initialize the callbacks before each run
        callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
        reg = xgb.XGBRegressor(**params, callbacks=callbacks)
        reg.fit(X, y)
kwargs (Optional[Any]) –
Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found in XGBoost Parameters. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.
Note
**kwargs unsupported by scikit-learn
**kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.
Note
Custom objective function
A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> [grad, hess] or objective(y_true, y_pred, *, sample_weight) -> [grad, hess]:
y_true: The target values.
y_pred: The predicted values.
sample_weight: Optional sample weights.
grad: The value of the gradient for each sample point.
hess: The value of the second derivative for each sample point.
Note that, if the custom objective produces negative values for the Hessian, these will be clipped. If the objective is non-convex, one might also consider using the expected Hessian (Fisher information) instead.
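As a minimal sketch of the two-argument signature, the function below returns the gradient and Hessian of the squared error 0.5 * (y_pred - y_true)**2 with respect to the prediction. The function name and the sample values are illustrative assumptions; only numpy is used.

```python
import numpy as np


def squared_error_objective(y_true, y_pred):
    """Custom objective for 0.5 * (y_pred - y_true)**2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    grad = y_pred - y_true       # first derivative w.r.t. y_pred
    hess = np.ones_like(y_pred)  # second derivative w.r.t. y_pred
    return grad, hess


grad, hess = squared_error_objective([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
print(grad.tolist(), hess.tolist())
```

A function of this form could then be supplied through the objective parameter. Its Hessian is constant and positive, so the clipping mentioned in the note does not come into play.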
Return the predicted leaf of every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.
X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.
array_like, shape=[n_samples, n_trees]
The best iteration obtained by early stopping. This attribute is 0-based, for instance if the best iteration is the first round, then best_iteration is 0.
The best score obtained by early stopping.
Coefficients property
Note
Coefficients are defined only for linear learners
Coefficients are only defined when the linear model is chosen as base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).
coef_
array of shape [n_features] or [n_classes, n_features]
Return the evaluation results.
If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit() function, the evals_result will contain the eval_metrics passed to the fit() function.
The returned evaluation result is a dictionary:
{'validation_0': {'logloss': ['0.604835', '0.531479']}, 'validation_1': {'logloss': ['0.41965', '0.17686']}}
evals_result
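The returned dictionary is keyed first by validation set name and then by metric name, with one value per boosting round. The sketch below reuses the example dictionary and finds the round with the lowest value for each metric; converting with float() is a precaution so that the loop works whether the entries are strings or numbers.

```python
evals_result = {
    "validation_0": {"logloss": ["0.604835", "0.531479"]},
    "validation_1": {"logloss": ["0.41965", "0.17686"]},
}

best = {}
for name, metrics in evals_result.items():
    for metric, history in metrics.items():
        values = [float(v) for v in history]
        # Index of the smallest value, i.e. the best round for a cost metric.
        best_round = min(range(len(values)), key=values.__getitem__)
        best[(name, metric)] = (best_round, values[best_round])

print(best)
```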
Feature importances property; the return value depends on the importance_type parameter. When the model is trained with a multi-class/multi-label/multi-target dataset, the feature importance is “averaged” over all targets. The “average” is defined based on the importance type. For instance, if the importance type is “total_gain”, then the score is the sum of loss change for each split from all trees.
feature_importances_ – array of shape [n_features], except for a multi-class linear model, which returns an array with shape (n_features, n_classes)
Names of features seen during fit(). Defined only when X has feature names that are all strings.
Fit gradient boosting model.
Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass the xgb_model argument.
X (Any) –
Input feature matrix. See Markers for a list of supported types.
When the tree_method is set to hist, a QuantileDMatrix is used internally instead of the DMatrix to conserve memory. However, this has performance implications when the device of the input data does not match the device used by the algorithm. For instance, if the input is a numpy array on CPU but cuda is used for training, then the data is first processed on CPU and then transferred to GPU.
y (Any) – Labels
sample_weight (Any | None) – instance weights
base_margin (Any | None) – Global bias for each instance. See Intercept for details.
eval_set (Sequence[Tuple[Any, Any]] | None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.
verbose (bool | int | None) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed once every verbose boosting stages. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.
xgb_model (Booster | str | XGBModel | None) – File name of a stored XGBoost model, or a Booster instance, to be loaded before training (allows training continuation).
sample_weight_eval_set (Sequence[Any] | None) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array like object storing instance weights for the i-th validation set.
base_margin_eval_set (Sequence[Any] | None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.
feature_weights (Any | None) –
Deprecated since version 3.0.0.
Use feature_weights in __init__() or set_params() instead.
Get the underlying xgboost Booster of this model.
This will raise an exception if fit has not been called.
booster
An xgboost Booster of the underlying model.
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
routing – A MetadataRequest encapsulating routing information.
MetadataRequest
Gets the number of xgboost boosting rounds.
Intercept (bias) property
For tree-based models, the returned value is the base_score.
intercept_
array of shape (1,) or [n_classes]
Load the model from a file or a bytearray.
The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) are only saved when using JSON or UBJSON (default) format. Also, parameters that are not part of the model (like metrics, max_depth, etc) are not saved, see Model IO for more info.
model.save_model("model.json")
model.load_model("model.json")
# or
model.save_model("model.ubj")
model.load_model("model.ubj")
# or
buf = model.save_raw()
model.load_model(buf)
Number of features seen during fit().
Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. The estimator uses inplace_predict by default and falls back to using DMatrix if devices between the data and the estimator don’t match.
Note
This function is only thread safe for gbtree and dart.
X (Any) – Data to predict with. See Markers for a list of supported types.
output_margin (bool) – Whether to output the raw untransformed margin value.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (Any | None) – Global bias for each instance. See Intercept for details.
iteration_range (Tuple[int | integer, int | integer] | None) –
Specifies which layers of trees are used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means that only the forests built during rounds [10, 20) (a half-open interval) are used in this prediction.
Added in version 1.4.0.
prediction
Save the model to a file.
The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) are only saved when using JSON or UBJSON (default) format. Also, parameters that are not part of the model (like metrics, max_depth, etc) are not saved, see Model IO for more info.
model.save_model("model.json")
# or
model.save_model("model.ubj")
Return coefficient of determination on test data.
The coefficient of determination, \(R^2\), is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an \(R^2\) score of 0.0.
X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
score – \(R^2\) of self.predict(X) w.r.t. y.
Notes
The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with the default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).
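The definition of \(R^2\) given for score can be checked by hand with a few lines of plain Python. This is only a sketch of the formula for a single output; the function name r2_sketch and the sample values are made up for illustration, and it does not handle sample weights or the degenerate case where the total sum of squares is zero.

```python
def r2_sketch(y_true, y_pred):
    """Compute 1 - u/v for a single-output regression problem."""
    mean = sum(y_true) / len(y_true)
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    v = sum((t - mean) ** 2 for t in y_true)               # total sum of squares
    return 1.0 - u / v


y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2_sketch(y_true, y_true)                  # exact predictions
constant = r2_sketch(y_true, [2.5, 2.5, 2.5, 2.5])   # always predict the mean
reversed_ = r2_sketch(y_true, [4.0, 3.0, 2.0, 1.0])  # worse than the mean

print(perfect, constant, reversed_)
```

The three cases correspond to the statements above: a perfect model scores 1.0, the constant mean predictor scores 0.0, and a sufficiently bad model scores below zero.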
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in fit.
base_margin_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin_eval_set parameter in fit.
eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_set parameter in fit.
feature_weights (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for feature_weights parameter in fit.
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
sample_weight_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight_eval_set parameter in fit.
verbose (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for verbose parameter in fit.
xgb_model (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for xgb_model parameter in fit.
self – The updated object.
object
Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.
self
params (Any)
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in predict.
iteration_range (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for iteration_range parameter in predict.
output_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for output_margin parameter in predict.
validate_features (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for validate_features parameter in predict.
self – The updated object.
object
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
self – The updated object.
object
self (XGBRFRegressor)
Bases: XGBClassifier
scikit-learn API for XGBoost random forest classification. See Using the Scikit-Learn Estimator Interface for more information.
n_estimators (Optional[int]) – Number of trees in random forest to fit.
max_depth (Optional[int]) – Maximum tree depth for base learners.
max_leaves (Optional[int]) – Maximum number of leaves; 0 indicates no limit.
max_bin (Optional[int]) – If using histogram-based algorithm, maximum number of bins per feature
grow_policy (Optional[str]) –
Tree growing policy.
depthwise: Favors splitting at nodes closest to the root,
lossguide: Favors splitting at nodes with highest loss change.
learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)
verbosity (Optional[int]) – The degree of verbosity. Valid values range from 0 (silent) to 3 (debug).
objective (Union[str, xgboost.sklearn._SklObjWProto, Callable[[Any, Any], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) –
Specify the learning task and the corresponding learning objective or a custom objective function to be used.
For custom objective, see Custom Objective and Evaluation Metric and Custom objective and metric for more information, along with the end note for function signatures.
booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.
tree_method (Optional[str]) – Specify which tree method to use. Defaults to auto. If this parameter is left at its default, XGBoost will choose the most conservative option available. It is recommended to study this option in the tree method section of the parameters document.
n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight (Optional[float]) – Minimum sum of instance weight (hessian) needed in a child.
max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.
subsample (Optional[float]) – Subsample ratio of the training instance.
sampling_method (Optional[str]) –
Sampling method. Used only by the GPU version of the hist tree method.
uniform: Select random training instances uniformly.
gradient_based: Select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)
colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.
colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.
reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).
reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).
scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
base_score (Optional[float]) – The initial prediction score of all instances, global bias.
random_state (Union[numpy.random.mtrand.RandomState, numpy.random._generator.Generator, int, NoneType]) –
Random number seed.
Note
Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.
missing (float) – Value in the data that is to be treated as a missing value. Defaults to numpy.nan.
num_parallel_tree (Optional[int]) – Used for boosting random forest.
monotone_constraints (Union[Dict[str, int], str, NoneType]) – Constraint of variable monotonicity. See tutorial for more information.
interaction_constraints (Union[str, List[Tuple[str]], NoneType]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information.
importance_type (Optional[str]) –
The feature importance type for the feature_importances_ property:
For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.
For linear model, only “weight” is defined and it’s the normalized coefficients without bias.
device (Optional[str]) –
Added in version 2.0.0.
Device ordinal, available options are cpu, cuda, and gpu.
validate_parameters (Optional[bool]) – Give warnings for unknown parameters.
enable_categorical (bool) – See the same parameter of DMatrix for details.
feature_types (Optional[Sequence[str]]) –
Added in version 1.7.0.
Used for specifying feature types without constructing a dataframe. See the DMatrix for details.
feature_weights (Optional[ArrayLike]) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.
max_cat_to_onehot (Optional[int]) –
Added in version 1.6.0.
Note
This parameter is experimental
A threshold for deciding whether XGBoost should use a one-hot encoding based split for categorical data. When the number of categories is less than the threshold, one-hot encoding is chosen; otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
max_cat_threshold (Optional[int]) –
Added in version 1.7.0.
Note
This parameter is experimental
Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
multi_strategy (Optional[str]) –
Added in version 2.0.0.
Note
This parameter is a work in progress.
The strategy used for training multi-target models, including multi-target regression and multi-class classification. See Multiple Outputs for more information.
one_output_per_tree: One model for each target.
multi_output_tree: Use multi-target trees.
eval_metric (Union[str, List[Union[str, Callable]], Callable, NoneType]) –
Added in version 1.6.0.
Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metrics in XGBoost (See XGBoost Parameters), one of the metrics in sklearn.metrics, or any other user defined metric that looks like sklearn.metrics.
If custom objective is also provided, then custom metric should implement the corresponding reverse link function.
Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.
For advanced usage of early stopping, like directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.
See Custom Objective and Evaluation Metric and Custom objective and metric for more information.
import xgboost as xgb
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error

X, y = load_diabetes(return_X_y=True)
reg = xgb.XGBRegressor(
    tree_method="hist",
    eval_metric=mean_absolute_error,
)
reg.fit(X, y, eval_set=[(X, y)])
early_stopping_rounds (Optional[int]) –
Added in version 1.6.0.
Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().
If early stopping occurs, the model will have two additional attributes: best_score and best_iteration. These are used by the predict() and apply() methods to determine the optimal number of trees during inference. If users want to access the full model (including trees built after early stopping), they can specify the iteration_range in these inference methods. In addition, other utilities like model plotting can also use the entire model.
If you prefer to discard the trees after best_iteration, consider using the callback function xgboost.callback.EarlyStopping.
If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.
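The stopping rule described for early_stopping_rounds can be simulated on a plain list of validation scores. The sketch below is a simplified stand-in written for illustration, not XGBoost's implementation: the function name is hypothetical, the metric is assumed to be minimized, and the scores are made up.

```python
def simulate_early_stopping(scores, early_stopping_rounds):
    """Return (best_iteration, best_score, rounds_run) for a minimized metric."""
    best_iteration = 0
    best_score = float("inf")
    rounds_run = 0
    for i, score in enumerate(scores):
        rounds_run = i + 1
        if score < best_score:
            # Improvement: remember this round (0-based, like best_iteration).
            best_score, best_iteration = score, i
        elif i - best_iteration >= early_stopping_rounds:
            # No improvement for early_stopping_rounds consecutive rounds.
            break
    return best_iteration, best_score, rounds_run


# Hypothetical per-round validation scores.
scores = [0.60, 0.53, 0.50, 0.51, 0.52, 0.49]
result = simulate_early_stopping(scores, early_stopping_rounds=2)
print(result)
```

With these numbers, training stops after the fifth round, so the lower score in the sixth round is never observed; a larger early_stopping_rounds trades extra training time for a chance to find such later improvements.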
callbacks (Optional[List[xgboost.callback.TrainingCallback]]) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.
Note
States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.
import xgboost as xgb

# parameters_grid, custom_rates, X and y are defined by the user
for params in parameters_grid:
    # be sure to (re)initialize the callbacks before each run
    callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
    reg = xgb.XGBRegressor(**params, callbacks=callbacks)
    reg.fit(X, y)
kwargs (Optional[Any]) –
Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.
Note
**kwargs unsupported by scikit-learn
**kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.
Note
Custom objective function
A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> [grad, hess] or objective(y_true, y_pred, *, sample_weight) -> [grad, hess]:
y_true: The target values
y_pred: The predicted values
sample_weight: Optional sample weights.
grad: The value of the gradient for each sample point.
hess: The value of the second derivative for each sample point
Note that, if the custom objective produces negative values for the Hessian, these will be clipped. If the objective is non-convex, one might also consider using the expected Hessian (Fisher information) instead.
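A function with the objective(y_true, y_pred) -> [grad, hess] signature might look like the sketch below, which writes out the gradient and Hessian of the binary logistic loss with NumPy. It assumes that y_pred holds raw margin scores rather than probabilities; the function name and the sample arrays are illustrative.

```python
import numpy as np


def logistic_objective(y_true, y_pred):
    """Gradient and Hessian of binary log loss with respect to the raw margin."""
    prob = 1.0 / (1.0 + np.exp(-y_pred))  # sigmoid of the margin
    grad = prob - y_true                  # first derivative per sample
    hess = prob * (1.0 - prob)            # second derivative per sample
    return grad, hess


y_true = np.array([0.0, 1.0, 1.0])
y_pred = np.array([0.0, 0.0, 2.0])
grad, hess = logistic_objective(y_true, y_pred)
print(grad, hess)
```

Such a callable would be supplied through the objective parameter of the estimator. The Hessian here is strictly positive for finite margins, so the clipping of negative values mentioned above does not come into play for this loss.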
Return the predicted leaf of every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.
X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0, 2**(self.max_depth+1)), possibly with gaps in the numbering.
array_like, shape=[n_samples, n_trees]
The best iteration obtained by early stopping. This attribute is 0-based, for instance if the best iteration is the first round, then best_iteration is 0.
The best score obtained by early stopping.
Coefficients property
Note
Coefficients are defined only for linear learners
Coefficients are only defined when the linear model is chosen as the base learner (booster=gblinear). They are not defined for other base learner types, such as tree learners (booster=gbtree).
coef_
array of shape [n_features] or [n_classes, n_features]
Return the evaluation results.
If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit() function, evals_result will contain the results for those metrics.
The returned evaluation result is a dictionary:
{'validation_0': {'logloss': ['0.604835', '0.531479']}, 'validation_1': {'logloss': ['0.41965', '0.17686']}}
evals_result
Feature importances property; the return value depends on the importance_type parameter. When the model is trained with a multi-class/multi-label/multi-target dataset, the feature importance is “averaged” over all targets. The “average” is defined based on the importance type. For instance, if the importance type is “total_gain”, then the score is the sum of loss change for each split from all trees.
feature_importances_ – array of shape [n_features], except for a multi-class linear model, which returns an array with shape (n_features, n_classes)
Names of features seen during fit(). Defined only when X has feature names that are all strings.
Fit gradient boosting classifier.
Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass the xgb_model argument.
X (Any) –
Input feature matrix. See Markers for a list of supported types.
When the tree_method is set to hist, a QuantileDMatrix is used internally instead of the DMatrix to conserve memory. However, this has performance implications when the device of the input data does not match the device used by the algorithm. For instance, if the input is a numpy array on CPU but cuda is used for training, then the data is first processed on CPU and then transferred to GPU.
y (Any) – Labels
sample_weight (Any | None) – instance weights
base_margin (Any | None) – Global bias for each instance. See Intercept for details.
eval_set (Sequence[Tuple[Any, Any]] | None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.
verbose (bool | int | None) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed once every verbose boosting stages. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.
xgb_model (Booster | str | XGBModel | None) – File name of a stored XGBoost model, or a Booster instance, to be loaded before training (allows training continuation).
sample_weight_eval_set (Sequence[Any] | None) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array like object storing instance weights for the i-th validation set.
base_margin_eval_set (Sequence[Any] | None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.
feature_weights (Any | None) –
Deprecated since version 3.0.0.
Use feature_weights in __init__() or set_params() instead.
Get the underlying xgboost Booster of this model.
This will raise an exception if fit has not been called.
booster
An xgboost Booster of the underlying model.
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
routing – A MetadataRequest encapsulating routing information.
MetadataRequest
Gets the number of xgboost boosting rounds.
Intercept (bias) property
For tree-based models, the returned value is the base_score.
intercept_
array of shape (1,) or [n_classes]
Load the model from a file or a bytearray.
The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) are only saved when using JSON or UBJSON (default) format. Also, parameters that are not part of the model (like metrics, max_depth, etc) are not saved, see Model IO for more info.
model.save_model("model.json")
model.load_model("model.json")
# or
model.save_model("model.ubj")
model.load_model("model.ubj")
# or
buf = model.save_raw()
model.load_model(buf)
Number of features seen during fit().
Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. The estimator uses inplace_predict by default and falls back to using DMatrix if devices between the data and the estimator don’t match.
Note
This function is only thread safe for gbtree and dart.
X (Any) – Data to predict with. See Markers for a list of supported types.
output_margin (bool) – Whether to output the raw untransformed margin value.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (Any | None) – Global bias for each instance. See Intercept for details.
iteration_range (Tuple[int | integer, int | integer] | None) –
Specifies which layers of trees are used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means that only the forests built during rounds [10, 20) (a half-open interval) are used in this prediction.
Added in version 1.4.0.
prediction
Predict the probability of each X example being of a given class. If the model is trained with early stopping, then best_iteration is used automatically. The estimator uses inplace_predict by default and falls back to using DMatrix if devices between the data and the estimator don’t match.
Note
This function is only thread safe for gbtree and dart.
X (Any) – Feature matrix. See Markers for a list of supported types.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (Any | None) – Global bias for each instance. See Intercept for details.
iteration_range (Tuple[int | integer, int | integer] | None) – Specifies which layers of trees are used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means that only the forests built during rounds [10, 20) (a half-open interval) are used in this prediction.
A numpy array of shape (n_samples, n_classes) with the probability of each data example being of a given class.
prediction
Save the model to a file.
The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) are only saved when using JSON or UBJSON (default) format. Also, parameters that are not part of the model (like metrics, max_depth, etc) are not saved, see Model IO for more info.
model.save_model("model.json")
# or
model.save_model("model.ubj")
Return accuracy on provided data and labels.
In multi-label classification, this is the subset accuracy, which is a harsh metric since it requires that the entire label set of each sample be correctly predicted.
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
score – Mean accuracy of self.predict(X) w.r.t. y.
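The subset accuracy mentioned for multi-label classification counts a sample as correct only when its whole label set matches. The plain-Python sketch below illustrates that rule; the function name and the label matrices are made up for the example.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return matches / len(y_true)


# Three samples, two labels each (hypothetical values).
y_true = [[1, 0], [0, 1], [1, 1]]
y_pred = [[1, 0], [0, 0], [1, 1]]

acc = subset_accuracy(y_true, y_pred)
print(acc)
```

Five of the six individual labels are right here, yet only two of the three samples count as correct, which is why the metric is described as harsh.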
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin parameter in fit.
base_margin_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for base_margin_eval_set parameter in fit.
eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_set parameter in fit.
feature_weights (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for feature_weights parameter in fit.
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
sample_weight_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight_eval_set parameter in fit.
verbose (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for verbose parameter in fit.
xgb_model (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for xgb_model parameter in fit.
self – The updated object.
object
Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.
self
params (Any)
Configure whether metadata should be requested to be passed to the predict_proba method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict_proba.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the base_margin parameter in predict_proba.
iteration_range (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the iteration_range parameter in predict_proba.
validate_features (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the validate_features parameter in predict_proba.
self (object) – The updated object.
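The four request options can be easier to follow as code. The sketch below is a plain-Python stand-in for the behaviour described above, not scikit-learn's implementation; the function `route_metadata` and its arguments are hypothetical names introduced only for illustration.

```python
from typing import Any, Dict, Union

# A request is True, False, None, or a string alias, as listed above.
Request = Union[bool, None, str]


def route_metadata(
    requests: Dict[str, Request], provided: Dict[str, Any]
) -> Dict[str, Any]:
    """Return the keyword arguments a meta-estimator would forward (sketch)."""
    forwarded: Dict[str, Any] = {}
    for name, request in requests.items():
        if isinstance(request, str):
            # Alias: the caller supplies the value under a different name.
            if request in provided:
                forwarded[name] = provided[request]
        elif request is True:
            # Requested: forwarded if given, silently skipped otherwise.
            if name in provided:
                forwarded[name] = provided[name]
        elif request is False:
            # Not requested: never forwarded, even if given.
            continue
        elif request is None and name in provided:
            # No decision made: supplying the metadata is an error.
            raise ValueError(f"{name!r} was provided but not requested")
    return forwarded


# base_margin is requested under the alias "margin"; iteration_range is ignored.
kwargs = route_metadata(
    {"base_margin": "margin", "iteration_range": False},
    {"margin": [0.1, 0.2], "iteration_range": (0, 10)},
)
```

Here `kwargs` contains only `base_margin`, taken from the value supplied as `margin`.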
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
base_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the base_margin parameter in predict.
iteration_range (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the iteration_range parameter in predict.
output_margin (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the output_margin parameter in predict.
validate_features (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the validate_features parameter in predict.
self (object) – The updated object.
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the sample_weight parameter in score.
self (object) – The updated object.
self (XGBRFClassifier)
Plotting Library.
Plot importance based on fitted trees.
booster (XGBModel | Booster | dict) – Booster or XGBModel instance, or dict taken by Booster.get_fscore()
ax (matplotlib Axes) – Target axes instance. If None, new figure and axes will be created.
grid (bool) – Turn the axes grids on or off. Default is True (On).
importance_type (str) –
How the importance is calculated: either “weight”, “gain”, or “cover”.
“weight” is the number of times a feature appears in a tree.
“gain” is the average gain of splits which use the feature.
“cover” is the average coverage of splits which use the feature, where coverage is defined as the number of samples affected by the split.
max_num_features (int | None) – Maximum number of top features displayed on plot. If None, all features will be displayed.
height (float) – Bar height, passed to ax.barh()
xlim (tuple | None) – Tuple passed to axes.xlim()
ylim (tuple | None) – Tuple passed to axes.ylim()
title (str) – Axes title. To disable, pass None.
xlabel (str) – X axis title label. To disable, pass None.
ylabel (str) – Y axis title label. To disable, pass None.
show_values (bool) – Show values on plot. To disable, pass False.
values_format (str) – Format string for values. “v” will be replaced by the value of the feature importance. For example, pass “{v:.2f}” to limit each value printed on the graph to two digits after the decimal point.
kwargs (Any) – Other keywords passed to ax.barh()
ax
matplotlib Axes
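As a rough illustration of how max_num_features and values_format shape the labels, the plain-Python sketch below ranks a set of made-up importance scores and formats them. The scores and the helper names `top_features` and `format_values` are hypothetical; the sketch does not call the plotting function itself.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical importance scores; real ones would come from a fitted booster.
scores: Dict[str, float] = {"f0": 12.3456, "f1": 3.0, "f2": 0.98765}


def top_features(
    scores: Dict[str, float], max_num_features: Optional[int] = None
) -> List[Tuple[str, float]]:
    """Keep the highest-scoring features; None keeps all of them."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked if max_num_features is None else ranked[:max_num_features]


def format_values(
    pairs: List[Tuple[str, float]], values_format: str = "{v:.2f}"
) -> List[str]:
    """Substitute each importance value for "v" in the format string."""
    return [values_format.format(v=value) for _, value in pairs]


labels = format_values(top_features(scores, max_num_features=2))
print(labels)
```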
Plot specified tree.
booster (Booster | XGBModel) – Booster or XGBModel instance
fmap (str (optional)) – The name of feature map file
num_trees (int | None) –
Deprecated since version 3.0.
rankdir (str, default "TB") – Passed to graphviz via graph_attr
ax (matplotlib Axes, default None) – Target axes instance. If None, new figure and axes will be created.
with_stats (bool) –
Added in version 3.0.
See to_graphviz().
tree_idx (int) –
Added in version 3.0.
See to_graphviz().
kwargs (Any) – Other keywords passed to to_graphviz()
ax
matplotlib Axes
Convert the specified tree to a graphviz instance. IPython can automatically plot the returned graphviz instance. Otherwise, you should call the .render() method of the returned graphviz instance.
booster (Booster | XGBModel) – Booster or XGBModel instance
num_trees (int | None) –
Deprecated since version 3.0.
Specify the ordinal number of target tree
rankdir (str | None) – Passed to graphviz via graph_attr
yes_color (str | None) – Edge color when the node condition is met.
no_color (str | None) – Edge color when the node condition is not met.
condition_node_params (dict | None) –
Condition node configuration for graphviz. Example:
{'shape': 'box', 'style': 'filled,rounded', 'fillcolor': '#78bceb'}
leaf_node_params (dict | None) –
Leaf node configuration for graphviz. Example:
{'shape': 'box', 'style': 'filled', 'fillcolor': '#e48038'}
with_stats (bool) –
Added in version 3.0.
Controls whether the split statistics should be included.
tree_idx (int) –
Added in version 3.0.
Specify the ordinal index of target tree.
kwargs (Any) – Other keywords passed to graphviz graph_attr, e.g. graph [ {key} = {value} ]
graph
graphviz.Source
Callback library containing training routines. See Callback Functions for a quick introduction.
Interface for training callback.
Added in version 1.3.0.
Run after each iteration. Returns True when training should stop.
model (Any) – Either a Booster object, or a CVPack if the cv function in xgboost is being used.
epoch (int) – The current training iteration.
evals_log (Dict[str, Dict[str, List[float] | List[Tuple[float, float]]]]) –
A dictionary containing the evaluation history:
{"data_name": {"metric_name": [0.5, ...]}}
Run before each iteration. Returns True when training should stop. See after_iteration() for details.
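The contract of the two hooks (return True to stop training) can be sketched without the library. In the example below, `StopAfter` is a hypothetical class that mirrors the method signatures described above, and `run_training` is a toy loop standing in for the real training routine; neither is part of XGBoost.

```python
from typing import Any, Dict, List

EvalsLog = Dict[str, Dict[str, List[float]]]


class StopAfter:
    """Hypothetical callback following the before/after_iteration protocol."""

    def __init__(self, max_epochs: int) -> None:
        self.max_epochs = max_epochs

    def before_iteration(self, model: Any, epoch: int, evals_log: EvalsLog) -> bool:
        # Never stop before an iteration starts.
        return False

    def after_iteration(self, model: Any, epoch: int, evals_log: EvalsLog) -> bool:
        # Returning True asks the training loop to stop.
        return epoch + 1 >= self.max_epochs


def run_training(callback: StopAfter, num_boost_round: int) -> int:
    """Toy driver loop; returns the number of completed iterations."""
    evals_log: EvalsLog = {"validation_0": {"rmse": []}}
    completed = 0
    for epoch in range(num_boost_round):
        if callback.before_iteration(None, epoch, evals_log):
            break
        # Fake metric in place of a real boosting step and evaluation.
        evals_log["validation_0"]["rmse"].append(1.0 / (epoch + 1))
        completed += 1
        if callback.after_iteration(None, epoch, evals_log):
            break
    return completed


completed_rounds = run_training(StopAfter(max_epochs=3), num_boost_round=10)
```

With the real library, the same two methods would be defined on a subclass of TrainingCallback.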
Bases: TrainingCallback
Print the evaluation result at each iteration.
Added in version 1.3.0.
Run after each iteration. Returns True when training should stop.
model (Any) – Either a Booster object, or a CVPack if the cv function in xgboost is being used.
epoch (int) – The current training iteration.
evals_log (Dict[str, Dict[str, List[float] | List[Tuple[float, float]]]]) –
A dictionary containing the evaluation history:
{"data_name": {"metric_name": [0.5, ...]}}
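A callback usually reads only the most recent entry of each metric from evals_log. The sketch below walks the nested dictionary and handles both entry types from the signature; it assumes that tuple entries are (mean, std) pairs as produced during cross validation, and `latest_scores` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple, Union

Score = Union[float, Tuple[float, float]]
EvalsLog = Dict[str, Dict[str, List[Score]]]


def latest_scores(evals_log: EvalsLog) -> Dict[str, float]:
    """Flatten the most recent value of every metric into 'data-metric' keys."""
    latest: Dict[str, float] = {}
    for data_name, metrics in evals_log.items():
        for metric_name, history in metrics.items():
            if not history:
                continue  # nothing recorded yet for this metric
            last = history[-1]
            # Assumption: a tuple entry is (mean, std); keep the mean.
            value = last[0] if isinstance(last, tuple) else last
            latest[f"{data_name}-{metric_name}"] = value
    return latest


# Made-up history for a plain training run and for a cross-validation run.
train_log: EvalsLog = {"validation_0": {"rmse": [0.9, 0.7, 0.5]}}
cv_log: EvalsLog = {"test": {"rmse": [(0.9, 0.05), (0.7, 0.04)]}}
```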
Bases: TrainingCallback
Callback function for early stopping
Added in version 1.3.0.
rounds (int) – Early stopping rounds.
metric_name (str | None) – Name of metric that is used for early stopping.
data_name (str | None) – Name of dataset that is used for early stopping.
maximize (bool | None) – Whether to maximize evaluation metric. None means auto (discouraged).
save_best (bool | None) – Whether training should return the best model or the last model. If set to True, it will only keep the boosting rounds up to the detected best iteration, discarding the ones that come after. This is only supported with tree methods (not gblinear). Also, because the cv function doesn’t return a model, this parameter is not applicable to it.
min_delta (float) –
Added in version 1.5.0.
Minimum absolute change in score to be qualified as an improvement.
Examples
from sklearn.datasets import load_digits
import xgboost

es = xgboost.callback.EarlyStopping(
    rounds=2,
    min_delta=1e-3,
    save_best=True,
    maximize=False,
    data_name="validation_0",
    metric_name="mlogloss",
)
clf = xgboost.XGBClassifier(tree_method="hist", device="cuda", callbacks=[es])

X, y = load_digits(return_X_y=True)
clf.fit(X, y, eval_set=[(X, y)])
Run after each iteration. Returns True when training should stop.
model (Any) – Either a Booster object, or a CVPack if the cv function in xgboost is being used.
epoch (int) – The current training iteration.
evals_log (Dict[str, Dict[str, List[float] | List[Tuple[float, float]]]]) –
A dictionary containing the evaluation history:
{"data_name": {"metric_name": [0.5, ...]}}
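How rounds, min_delta and maximize interact can be sketched on a plain list of scores. This is a simplified illustration of the stopping rule described by the parameters, not XGBoost's implementation; `early_stop_epoch` is a hypothetical helper, and the exact bookkeeping in the library may differ.

```python
from typing import List, Optional


def early_stop_epoch(
    scores: List[float],
    rounds: int,
    min_delta: float = 0.0,
    maximize: bool = False,
) -> Optional[int]:
    """Return the epoch at which training would stop, or None if it never does."""
    if not scores:
        return None
    best = scores[0]
    stale = 0  # consecutive epochs without a qualifying improvement
    for epoch, score in enumerate(scores[1:], start=1):
        gain = score - best if maximize else best - score
        if gain > min_delta:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= rounds:
                return epoch
    return None


# Made-up validation losses: the changes after epoch 1 are smaller than 1e-3.
losses = [0.9, 0.8, 0.7995, 0.7999, 0.6]
stop_with_delta = early_stop_epoch(losses, rounds=2, min_delta=1e-3)
stop_without_delta = early_stop_epoch(losses, rounds=2, min_delta=0.0)
```

With min_delta=1e-3 the small changes at epochs 2 and 3 do not count as improvements, so the sketch stops at epoch 3; with min_delta=0.0 it runs to the end.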
Bases: TrainingCallback
Callback function for scheduling learning rate.
Added in version 1.3.0.
learning_rates (Callable[[int], float] | Sequence[float]) – If it’s a callable object, it should accept an integer parameter epoch and return the corresponding learning rate. Otherwise it should be a sequence, such as a list or tuple, whose length equals the number of boosting rounds.
Run after each iteration. Returns True when training should stop.
model (Any) – Either a Booster object, or a CVPack if the cv function in xgboost is being used.
epoch (int) – The current training iteration.
evals_log (Dict[str, Dict[str, List[float] | List[Tuple[float, float]]]]) –
A dictionary containing the evaluation history:
{"data_name": {"metric_name": [0.5, ...]}}
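The two accepted forms of learning_rates carry the same information. The sketch below defines a hypothetical decay function (start at 0.3, shrink by 10% per round; both numbers are made up) and the equivalent precomputed list, and `rate_at` is an illustrative helper showing how either form yields the rate for a given round.

```python
from typing import Callable, List, Sequence, Union

LearningRates = Union[Callable[[int], float], Sequence[float]]


def exponential_decay(epoch: int) -> float:
    """Hypothetical schedule: start at 0.3 and shrink by 10% per round."""
    return 0.3 * (0.9 ** epoch)


def rate_at(learning_rates: LearningRates, epoch: int) -> float:
    """Look up the rate for one round from either accepted form."""
    if callable(learning_rates):
        return learning_rates(epoch)
    return learning_rates[epoch]


num_boost_round = 4
# Sequence form: one entry per boosting round.
as_list: List[float] = [exponential_decay(i) for i in range(num_boost_round)]
```

Either `exponential_decay` or `as_list` could then be passed as learning_rates, provided the list has one entry per boosting round.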
Bases: TrainingCallback
Checkpointing operation. Users are encouraged to create their own callbacks for checkpoint as XGBoost doesn’t handle distributed file systems. When checkpointing on distributed systems, be sure to know the rank of the worker to avoid multiple workers checkpointing to the same place.
Added in version 1.3.0.
Since XGBoost 2.1.0, the default format is changed to UBJSON.
name (str) – Pattern of the output model file. Models will be saved as name_0.ubj, name_1.ubj, name_2.ubj, ….
as_pickle (bool) – When set to True, all training parameters will be saved in pickle format, instead of saving only the model.
interval (int) – Interval of checkpointing. Checkpointing is slow, so a larger interval reduces the performance hit.
Run after each iteration. Returns True when training should stop.
model (Any) – Either a Booster object, or a CVPack if the cv function in xgboost is being used.
epoch (int) – The current training iteration.
evals_log (Dict[str, Dict[str, List[float] | List[Tuple[float, float]]]]) –
A dictionary containing the evaluation history:
{"data_name": {"metric_name": [0.5, ...]}}
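One way to follow the advice about worker ranks is to fold the rank into the name pattern, so that no two workers write the same file. The sketch below only builds file names in the name_0.ubj style described above; the helpers `worker_checkpoint_name` and `checkpoint_files` are hypothetical, and how the rank is obtained depends on the distributed framework in use.

```python
from typing import Iterable, List


def worker_checkpoint_name(base: str, rank: int) -> str:
    """Build a per-worker name pattern so workers never share a file."""
    return f"{base}_rank{rank}"


def checkpoint_files(name: str, iterations: Iterable[int]) -> List[str]:
    """List the file names produced for the given iteration indices."""
    return [f"{name}_{i}.ubj" for i in iterations]


# Two workers checkpointing the same iterations end up with disjoint files.
files_rank0 = checkpoint_files(worker_checkpoint_name("model", 0), [0, 1, 2])
files_rank1 = checkpoint_files(worker_checkpoint_name("model", 1), [0, 1, 2])
```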
PySpark XGBoost integration interface
Bases: _SparkXGBEstimator, HasProbabilityCol, HasRawPredictionCol
SparkXGBClassifier is a PySpark ML estimator. It implements the XGBoost classification algorithm based on the XGBoost Python library, and it can be used in PySpark Pipeline and PySpark ML meta algorithms like CrossValidator, TrainValidationSplit, and OneVsRest.
SparkXGBClassifier automatically supports most of the parameters in the xgboost.XGBClassifier constructor and most of the parameters used in the xgboost.XGBClassifier.fit() and xgboost.XGBClassifier.predict() methods.
To enable GPU support, set device to cuda or gpu.
SparkXGBClassifier doesn’t support setting base_margin explicitly, but supports another param called base_margin_col. See the doc below for more details.
SparkXGBClassifier doesn’t support setting output_margin, but the output margin is available from the raw prediction column. See the raw_prediction_col param doc below for more details.
SparkXGBClassifier doesn’t support the validate_features and output_margin params.
SparkXGBClassifier doesn’t support setting the nthread xgboost param. Instead, the nthread param for each xgboost worker is set equal to the spark.task.cpus config value.
features_col (str | List[str]) – When the value is string, it requires the features column name to be vector type. When the value is a list of string, it requires all the feature columns to be numeric types.
label_col (str) – Label column name. Defaults to “label”.
prediction_col (str) – Prediction column name. Defaults to “prediction”.
probability_col (str) – Column name for predicted class conditional probabilities. Defaults to probabilityCol.
raw_prediction_col (str) – The output_margin=True is implicitly supported by the rawPredictionCol output column, which is always returned with the predicted margin values.
pred_contrib_col (pyspark.ml.param.Param[str]) – Contribution prediction column name.
validation_indicator_col (str | None) – For params related to xgboost.XGBClassifier training with evaluation dataset’s supervision, set the xgboost.spark.SparkXGBClassifier.validation_indicator_col parameter instead of setting the eval_set parameter in the xgboost.XGBClassifier fit method.
weight_col (str | None) – To specify the weight of the training and validation dataset, set the xgboost.spark.SparkXGBClassifier.weight_col parameter instead of setting the sample_weight and sample_weight_eval_set parameters in the xgboost.XGBClassifier fit method.
base_margin_col (str | None) – To specify the base margins of the training and validation dataset, set the xgboost.spark.SparkXGBClassifier.base_margin_col parameter instead of setting base_margin and base_margin_eval_set in the xgboost.XGBClassifier fit method.
num_workers (int) – How many XGBoost workers to be used to train. Each XGBoost worker corresponds to one spark task.
device (str | None) –
Added in version 2.0.0.
Device for XGBoost workers, available options are cpu, cuda, and gpu.
force_repartition (bool) – Boolean value to specify if forcing the input dataset to be repartitioned before XGBoost training.
repartition_random_shuffle (bool) – Boolean value to specify if randomly shuffling the dataset when repartitioning is required.
enable_sparse_data_optim (bool) – Boolean value to specify whether to enable sparse data optimization. If True, the XGBoost DMatrix object will be constructed from a sparse matrix instead of a dense matrix.
launch_tracker_on_driver (bool) – Boolean value to indicate whether the tracker should be launched on the driver side or the executor side.
coll_cfg (Config | None) – The collective configuration. See Config
kwargs (Any) – A dictionary of xgboost parameters, please refer to https://xgboost.readthedocs.io/en/stable/parameter.html
Note
The Parameters chart above contains parameters that need special handling. For a full list of parameters, see entries with Param(parent=… below.
This API is experimental.
Examples
>>> from xgboost.spark import SparkXGBClassifier
>>> from pyspark.ml.linalg import Vectors
>>> df_train = spark.createDataFrame([
...     (Vectors.dense(1.0, 2.0, 3.0), 0, False, 1.0),
...     (Vectors.sparse(3, {1: 1.0, 2: 5.5}), 1, False, 2.0),
...     (Vectors.dense(4.0, 5.0, 6.0), 0, True, 1.0),
...     (Vectors.sparse(3, {1: 6.0, 2: 7.5}), 1, True, 2.0),
... ], ["features", "label", "isVal", "weight"])
>>> df_test = spark.createDataFrame([
...     (Vectors.dense(1.0, 2.0, 3.0), ),
... ], ["features"])
>>> xgb_classifier = SparkXGBClassifier(max_depth=5, missing=0.0,
...     validation_indicator_col='isVal', weight_col='weight',
...     early_stopping_rounds=1, eval_metric='logloss')
>>> xgb_clf_model = xgb_classifier.fit(df_train)
>>> xgb_clf_model.transform(df_test).show()
Clears a param from the param map if it has been explicitly set.
param (Param)
None
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
extra (dict, optional) – Extra parameters to copy to the new instance
self (P)
Copy of this instance
Params
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optionally default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
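The precedence rule (default param values < user-supplied values < extra) behaves like successive dictionary updates. The sketch below illustrates it with plain dicts rather than PySpark Param objects; `merge_param_maps` is a hypothetical helper and the values are made up.

```python
from typing import Any, Dict, Optional


def merge_param_maps(
    defaults: Dict[str, Any],
    user_supplied: Dict[str, Any],
    extra: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Merge three param maps; later maps win when keys conflict."""
    merged = dict(defaults)
    merged.update(user_supplied)
    merged.update(extra or {})
    return merged


defaults = {"max_depth": 6, "num_workers": 1}
user_supplied = {"max_depth": 5}
extra = {"num_workers": 4}

# max_depth comes from the user-supplied map, num_workers from extra.
flat = merge_param_maps(defaults, user_supplied, extra)
```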
Fits a model to the input dataset with optional parameters.
Added in version 1.3.0.
dataset (pyspark.sql.DataFrame) – input dataset.
params (dict or list or tuple, optional) – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
fitted model(s)
Transformer or a list of Transformer
Fits a model to the input dataset for each param map in paramMaps.
Added in version 2.3.0.
dataset (pyspark.sql.DataFrame) – input dataset.
paramMaps (collections.abc.Sequence) – A Sequence of param maps.
A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.
_FitMultipleIterator
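Because the index values may arrive out of order, callers typically place each model by its index rather than appending. The sketch below shows that pattern with a stand-in iterator of made-up (index, model) pairs in place of a real fitMultiple result; `collect_models` is a hypothetical helper.

```python
from typing import Any, Iterable, List, Optional, Tuple


def collect_models(
    model_iterator: Iterable[Tuple[int, Any]], num_param_maps: int
) -> List[Optional[Any]]:
    """Place each fitted model at the position of its param map."""
    models: List[Optional[Any]] = [None] * num_param_maps
    for index, model in model_iterator:
        models[index] = model
    return models


# Stand-in for the iterator: results arrive in completion order, not index order.
fake_iterator = iter([(2, "model-c"), (0, "model-a"), (1, "model-b")])
ordered = collect_models(fake_iterator, num_param_maps=3)
```

After the loop, `ordered[i]` corresponds to `paramMaps[i]`.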
Gets the value of featuresCol or its default value.
Gets the value of labelCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of predictionCol or its default value.
Gets the value of probabilityCol or its default value.
Gets the value of rawPredictionCol or its default value.
Gets the value of validationIndicatorCol or its default value.
Gets the value of weightCol or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
path (str)
RL
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Return the reader for loading the estimator.
SparkXGBReader
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
path (str)
None
Sets a parameter in the embedded param map.
Set params for the estimator.
kwargs (Any)
None
Set collective configuration
value (Config)
_SparkXGBParams
Set device, optional value: cpu, cuda, gpu
value (str)
_SparkXGBParams
A unique id for the object.
Return the writer for saving the estimator.
SparkXGBWriter
Bases: _ClassificationModel
The model returned by xgboost.spark.SparkXGBClassifier.fit()
Note
This API is experimental.
xgb_sklearn_model (XGBModel | None)
training_summary (XGBoostTrainingSummary | None)
Clears a param from the param map if it has been explicitly set.
param (Param)
None
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
extra (dict, optional) – Extra parameters to copy to the new instance
self (P)
Copy of this instance
Params
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optionally default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Gets the value of featuresCol or its default value.
Gets the value of labelCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of predictionCol or its default value.
Gets the value of probabilityCol or its default value.
Gets the value of rawPredictionCol or its default value.
Gets the value of validationIndicatorCol or its default value.
Gets the value of weightCol or its default value.
Return the xgboost.core.Booster instance.
Get feature importance of each feature. Importance type can be defined as:
‘weight’: the number of times a feature is used to split the data across all trees.
‘gain’: the average gain across all splits the feature is used in.
‘cover’: the average coverage across all splits the feature is used in.
‘total_gain’: the total gain across all splits the feature is used in.
‘total_cover’: the total coverage across all splits the feature is used in.
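The five importance types are related: the averages are the totals divided by the number of splits. The sketch below computes them from made-up (feature, gain, cover) split records to show that relationship; it is not how the library reads its trees, and `feature_importance` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical (feature, gain, cover) records, one per split in the model.
Split = Tuple[str, float, float]


def feature_importance(
    splits: List[Split], importance_type: str = "weight"
) -> Dict[str, float]:
    """Compute one of the five importance types from split records."""
    count: Dict[str, int] = defaultdict(int)
    total_gain: Dict[str, float] = defaultdict(float)
    total_cover: Dict[str, float] = defaultdict(float)
    for feature, gain, cover in splits:
        count[feature] += 1
        total_gain[feature] += gain
        total_cover[feature] += cover

    if importance_type == "weight":
        return {f: float(n) for f, n in count.items()}
    if importance_type == "total_gain":
        return dict(total_gain)
    if importance_type == "total_cover":
        return dict(total_cover)
    if importance_type == "gain":
        return {f: total_gain[f] / count[f] for f in count}
    if importance_type == "cover":
        return {f: total_cover[f] / count[f] for f in count}
    raise ValueError(f"unknown importance_type: {importance_type!r}")


splits: List[Split] = [("f0", 4.0, 100.0), ("f0", 2.0, 50.0), ("f1", 9.0, 30.0)]
```

In this made-up data f0 is used more often (higher weight) while f1 has the larger average gain, so the ranking depends on the importance type chosen.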
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
path (str)
RL
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Return the reader for loading the model.
SparkXGBModelReader
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
path (str)
None
Sets a parameter in the embedded param map.
Set collective configuration
value (Config)
_SparkXGBParams
Set device, optional value: cpu, cuda, gpu
value (str)
_SparkXGBParams
Transforms the input dataset with optional parameters.
Added in version 1.3.0.
dataset (pyspark.sql.DataFrame) – input dataset
params (dict, optional) – an optional param map that overrides embedded params.
transformed dataset
A unique id for the object.
Return the writer for saving the model.
SparkXGBModelWriter
Bases: _SparkXGBEstimator
SparkXGBRegressor is a PySpark ML estimator. It implements the XGBoost regression algorithm based on the XGBoost Python library, and it can be used in PySpark Pipeline and PySpark ML meta algorithms like CrossValidator, TrainValidationSplit, and OneVsRest.
SparkXGBRegressor automatically supports most of the parameters in the xgboost.XGBRegressor constructor and most of the parameters used in the xgboost.XGBRegressor.fit() and xgboost.XGBRegressor.predict() methods.
To enable GPU support, set device to cuda or gpu.
SparkXGBRegressor doesn’t support setting base_margin explicitly, but supports another param called base_margin_col. See the doc below for more details.
SparkXGBRegressor doesn’t support the validate_features and output_margin params.
SparkXGBRegressor doesn’t support setting the nthread xgboost param. Instead, the nthread param for each xgboost worker is set equal to the spark.task.cpus config value.
features_col (str | List[str]) – When the value is string, it requires the features column name to be vector type. When the value is a list of string, it requires all the feature columns to be numeric types.
label_col (str) – Label column name. Defaults to “label”.
prediction_col (str) – Prediction column name. Defaults to “prediction”.
pred_contrib_col (pyspark.ml.param.Param[str]) – Contribution prediction column name.
validation_indicator_col (str | None) – For params related to xgboost.XGBRegressor training with evaluation dataset’s supervision, set the xgboost.spark.SparkXGBRegressor.validation_indicator_col parameter instead of setting the eval_set parameter in the xgboost.XGBRegressor fit method.
weight_col (str | None) – To specify the weight of the training and validation dataset, set the xgboost.spark.SparkXGBRegressor.weight_col parameter instead of setting the sample_weight and sample_weight_eval_set parameters in the xgboost.XGBRegressor fit method.
base_margin_col (str | None) – To specify the base margins of the training and validation dataset, set the xgboost.spark.SparkXGBRegressor.base_margin_col parameter instead of setting base_margin and base_margin_eval_set in the xgboost.XGBRegressor fit method.
num_workers (int) – How many XGBoost workers to be used to train. Each XGBoost worker corresponds to one spark task.
device (str | None) –
Added in version 2.0.0.
Device for XGBoost workers, available options are cpu, cuda, and gpu.
force_repartition (bool) – Boolean value to specify if forcing the input dataset to be repartitioned before XGBoost training.
repartition_random_shuffle (bool) – Boolean value to specify if randomly shuffling the dataset when repartitioning is required.
enable_sparse_data_optim (bool) – Boolean value to specify whether to enable sparse data optimization. If True, the XGBoost DMatrix object will be constructed from a sparse matrix instead of a dense matrix.
launch_tracker_on_driver (bool) – Boolean value to indicate whether the tracker should be launched on the driver side or the executor side.
coll_cfg (Config | None) – The collective configuration. See Config
kwargs (Any) – A dictionary of xgboost parameters, please refer to https://xgboost.readthedocs.io/en/stable/parameter.html
Note
The Parameters chart above contains parameters that need special handling. For a full list of parameters, see entries with Param(parent=… below.
This API is experimental.
Examples
>>> from xgboost.spark import SparkXGBRegressor
>>> from pyspark.ml.linalg import Vectors
>>> df_train = spark.createDataFrame([
...     (Vectors.dense(1.0, 2.0, 3.0), 0, False, 1.0),
...     (Vectors.sparse(3, {1: 1.0, 2: 5.5}), 1, False, 2.0),
...     (Vectors.dense(4.0, 5.0, 6.0), 2, True, 1.0),
...     (Vectors.sparse(3, {1: 6.0, 2: 7.5}), 3, True, 2.0),
... ], ["features", "label", "isVal", "weight"])
>>> df_test = spark.createDataFrame([
...     (Vectors.dense(1.0, 2.0, 3.0), ),
...     (Vectors.sparse(3, {1: 1.0, 2: 5.5}), )
... ], ["features"])
>>> xgb_regressor = SparkXGBRegressor(max_depth=5, missing=0.0,
...     validation_indicator_col='isVal', weight_col='weight',
...     early_stopping_rounds=1, eval_metric='rmse')
>>> xgb_reg_model = xgb_regressor.fit(df_train)
>>> xgb_reg_model.transform(df_test)
Clears a param from the param map if it has been explicitly set.
param (Param)
None
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
extra (dict, optional) – Extra parameters to copy to the new instance
self (P)
Copy of this instance
Params
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optionally default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Fits a model to the input dataset with optional parameters.
Added in version 1.3.0.
dataset (pyspark.sql.DataFrame) – input dataset.
params (dict or list or tuple, optional) – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
fitted model(s)
Transformer or a list of Transformer
Fits a model to the input dataset for each param map in paramMaps.
Added in version 2.3.0.
dataset (pyspark.sql.DataFrame) – input dataset.
paramMaps (collections.abc.Sequence) – A Sequence of param maps.
A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.
_FitMultipleIterator
Gets the value of featuresCol or its default value.
Gets the value of labelCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of predictionCol or its default value.
Gets the value of validationIndicatorCol or its default value.
Gets the value of weightCol or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
path (str)
RL
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Return the reader for loading the estimator.
SparkXGBReader
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
path (str)
None
Sets a parameter in the embedded param map.
Set params for the estimator.
kwargs (Any)
None
Set collective configuration
value (Config)
_SparkXGBParams
Set device, optional value: cpu, cuda, gpu
value (str)
_SparkXGBParams
A unique id for the object.
Return the writer for saving the estimator.
SparkXGBWriter
Bases: _SparkXGBModel
The model returned by xgboost.spark.SparkXGBRegressor.fit()
Note
This API is experimental.
xgb_sklearn_model (XGBModel | None)
training_summary (XGBoostTrainingSummary | None)
Clears a param from the param map if it has been explicitly set.
param (Param)
None
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
extra (dict, optional) – Extra parameters to copy to the new instance
self (P)
Copy of this instance
Params
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optionally default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Gets the value of featuresCol or its default value.
Gets the value of labelCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of predictionCol or its default value.
Gets the value of validationIndicatorCol or its default value.
Gets the value of weightCol or its default value.
Return the xgboost.core.Booster instance.
Get feature importance of each feature. Importance type can be defined as:
‘weight’: the number of times a feature is used to split the data across all trees.
‘gain’: the average gain across all splits the feature is used in.
‘cover’: the average coverage across all splits the feature is used in.
‘total_gain’: the total gain across all splits the feature is used in.
‘total_cover’: the total coverage across all splits the feature is used in.
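The relationship between the five importance types can be sketched in plain Python. This is an illustration of the definitions above, not XGBoost's implementation: the split records and the feature_importance helper are hypothetical, and each record stands for one split as (feature name, gain, cover).

```python
from collections import defaultdict

# Hypothetical split records collected across all trees: (feature, gain, cover).
splits = [
    ("f0", 4.0, 10.0),
    ("f1", 1.0, 6.0),
    ("f0", 2.0, 4.0),
]


def feature_importance(splits, importance_type="weight"):
    """Aggregate per-feature importance from a list of split records."""
    count = defaultdict(int)
    total_gain = defaultdict(float)
    total_cover = defaultdict(float)
    for feature, gain, cover in splits:
        count[feature] += 1
        total_gain[feature] += gain
        total_cover[feature] += cover

    if importance_type == "weight":
        return dict(count)
    if importance_type == "total_gain":
        return dict(total_gain)
    if importance_type == "total_cover":
        return dict(total_cover)
    if importance_type == "gain":
        # Average gain: total gain divided by the number of splits.
        return {f: total_gain[f] / count[f] for f in count}
    if importance_type == "cover":
        # Average cover: total cover divided by the number of splits.
        return {f: total_cover[f] / count[f] for f in count}
    raise ValueError(f"unknown importance type: {importance_type}")


print(feature_importance(splits, "gain"))
```

The averaged types ('gain', 'cover') are the corresponding totals divided by 'weight', which is why a feature used in many weak splits can rank high on 'total_gain' but low on 'gain'.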
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
path (str)
RL
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Return the reader for loading the model.
SparkXGBModelReader
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
path (str)
None
Sets a parameter in the embedded param map.
Set collective configuration
value (Config)
_SparkXGBParams
Set the device. Supported values: cpu, cuda, gpu.
value (str)
_SparkXGBParams
Transforms the input dataset with optional parameters.
Added in version 1.3.0.
dataset (pyspark.sql.DataFrame) – input dataset
params (dict, optional) – an optional param map that overrides embedded params.
transformed dataset
A unique id for the object.
Return the writer for saving the model.
SparkXGBModelWriter
Bases: _SparkXGBEstimator
SparkXGBRanker is a PySpark ML estimator. It implements the XGBoost ranking algorithm based on the XGBoost Python library, and it can be used in PySpark Pipeline and PySpark ML meta algorithms like CrossValidator / TrainValidationSplit / OneVsRest.
SparkXGBRanker automatically supports most of the parameters in the xgboost.XGBRanker constructor and most of the parameters used in the xgboost.XGBRanker.fit() and xgboost.XGBRanker.predict() methods.
To enable GPU support, set device to cuda or gpu.
SparkXGBRanker doesn’t support setting base_margin explicitly, but it supports another param called base_margin_col. See the doc below for more details.
SparkXGBRanker doesn’t support setting output_margin, but the output margin can be obtained from the raw prediction column. See the raw_prediction_col param doc below for more details.
SparkXGBRanker doesn’t support the validate_features and output_margin params.
SparkXGBRanker doesn’t support setting the nthread xgboost param; instead, the nthread param for each xgboost worker is set equal to the spark.task.cpus config value.
features_col (str | List[str]) – When the value is a string, the features column must be of vector type. When the value is a list of strings, all the feature columns must be of numeric types.
label_col (str) – Label column name. Defaults to “label”.
prediction_col (str) – Prediction column name. Defaults to “prediction”.
pred_contrib_col (pyspark.ml.param.Param[str]) – Contribution prediction column name.
validation_indicator_col (str | None) – For params related to xgboost.XGBRanker training with evaluation dataset’s supervision, set the xgboost.spark.SparkXGBRanker.validation_indicator_col parameter instead of setting the eval_set parameter in the xgboost.XGBRanker fit method.
weight_col (str | None) – To specify the weight of the training and validation dataset, set the xgboost.spark.SparkXGBRanker.weight_col parameter instead of setting the sample_weight and sample_weight_eval_set parameters in the xgboost.XGBRanker fit method.
base_margin_col (str | None) – To specify the base margins of the training and validation dataset, set the xgboost.spark.SparkXGBRanker.base_margin_col parameter instead of setting base_margin and base_margin_eval_set in the xgboost.XGBRanker fit method.
qid_col (str | None) – Query id column name.
num_workers (int) – Number of XGBoost workers used for training. Each XGBoost worker corresponds to one spark task.
device (str | None) –
Added in version 2.0.0.
Device for XGBoost workers, available options are cpu, cuda, and gpu.
force_repartition (bool) – Boolean value to specify if forcing the input dataset to be repartitioned before XGBoost training.
repartition_random_shuffle (bool) – Boolean value to specify if randomly shuffling the dataset when repartitioning is required.
enable_sparse_data_optim (bool) – Boolean value to specify whether to enable sparse data optimization. If True, the XGBoost DMatrix object will be constructed from a sparse matrix instead of a dense matrix.
launch_tracker_on_driver (bool) – Boolean value to indicate whether the tracker should be launched on the driver side or the executor side.
coll_cfg (Config | None) – The collective configuration. See Config
kwargs (Any) – A dictionary of xgboost parameters, please refer to https://xgboost.readthedocs.io/en/stable/parameter.html
Note
The Parameters chart above contains parameters that need special handling. For a full list of parameters, see entries with Param(parent=… below.
Note
This API is experimental.
Examples
>>> from xgboost.spark import SparkXGBRanker
>>> from pyspark.ml.linalg import Vectors
>>> ranker = SparkXGBRanker(qid_col="qid")
>>> df_train = spark.createDataFrame(
...     [
...         (Vectors.dense(1.0, 2.0, 3.0), 0, 0),
...         (Vectors.dense(4.0, 5.0, 6.0), 1, 0),
...         (Vectors.dense(9.0, 4.0, 8.0), 2, 0),
...         (Vectors.sparse(3, {1: 1.0, 2: 5.5}), 0, 1),
...         (Vectors.sparse(3, {1: 6.0, 2: 7.5}), 1, 1),
...         (Vectors.sparse(3, {1: 8.0, 2: 9.5}), 2, 1),
...     ],
...     ["features", "label", "qid"],
... )
>>> df_test = spark.createDataFrame(
...     [
...         (Vectors.dense(1.5, 2.0, 3.0), 0),
...         (Vectors.dense(4.5, 5.0, 6.0), 0),
...         (Vectors.dense(9.0, 4.5, 8.0), 0),
...         (Vectors.sparse(3, {1: 1.0, 2: 6.0}), 1),
...         (Vectors.sparse(3, {1: 6.0, 2: 7.0}), 1),
...         (Vectors.sparse(3, {1: 8.0, 2: 10.5}), 1),
...     ],
...     ["features", "qid"],
... )
>>> model = ranker.fit(df_train)
>>> model.transform(df_test).show()
Clears a param from the param map if it has been explicitly set.
param (Param)
None
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
extra (dict, optional) – Extra parameters to copy to the new instance
self (P)
Copy of this instance
Params
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Fits a model to the input dataset with optional parameters.
Added in version 1.3.0.
dataset (pyspark.sql.DataFrame) – input dataset.
params (dict or list or tuple, optional) – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
fitted model(s)
Transformer or a list of Transformer
Fits a model to the input dataset for each param map in paramMaps.
Added in version 2.3.0.
dataset (pyspark.sql.DataFrame) – input dataset.
paramMaps (collections.abc.Sequence) – A Sequence of param maps.
A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.
_FitMultipleIterator
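Because the index values may arrive out of order, callers typically place each model by its index rather than appending. The sketch below shows that pattern with a stand-in iterator; collect_models and the string "models" are hypothetical, and in real use the iterator would be the object returned by fitMultiple.

```python
def collect_models(model_iterator, n_param_maps):
    """Place each fitted model at the position of its param map."""
    models = [None] * n_param_maps
    for index, model in model_iterator:
        # index refers to paramMaps[index], regardless of arrival order.
        models[index] = model
    return models


# Stand-in for the iterator returned by fitMultiple; results arrive out of order.
fake_iterator = iter([(2, "model-c"), (0, "model-a"), (1, "model-b")])

models = collect_models(fake_iterator, 3)
print(models)
```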
Gets the value of featuresCol or its default value.
Gets the value of labelCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of predictionCol or its default value.
Gets the value of validationIndicatorCol or its default value.
Gets the value of weightCol or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
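The difference between the four checks above, and how they relate to reading and clearing a value, can be sketched with a toy class. ToyParams is a hypothetical stand-in written for illustration; it is not the PySpark Params implementation and uses plain string names instead of Param objects.

```python
class ToyParams:
    """Toy param holder that separates defaults from user-set values."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)
        self._set = {}

    def set(self, name, value):
        self._set[name] = value

    def clear(self, name):
        # Only removes a user-set value; the default is left untouched.
        self._set.pop(name, None)

    def has_param(self, name):
        return name in self._defaults or name in self._set

    def has_default(self, name):
        return name in self._defaults

    def is_set(self, name):
        return name in self._set

    def is_defined(self, name):
        return self.is_set(name) or self.has_default(name)

    def get_or_default(self, name):
        if name in self._set:
            return self._set[name]
        if name in self._defaults:
            return self._defaults[name]
        raise KeyError(f"param {name!r} is neither set nor has a default")


params = ToyParams({"labelCol": "label"})
params.set("labelCol", "y")
value_when_set = params.get_or_default("labelCol")
params.clear("labelCol")
value_after_clear = params.get_or_default("labelCol")
print(value_when_set, value_after_clear)
```

After clearing, the param is no longer set but is still defined, because the default remains.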
Reads an ML instance from the input path, a shortcut of read().load(path).
path (str)
RL
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Return the reader for loading the estimator.
SparkXGBReader
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
path (str)
None
Sets a parameter in the embedded param map.
Set params for the estimator.
kwargs (Any)
None
Set collective configuration
value (Config)
_SparkXGBParams
Set the device. Supported values: cpu, cuda, gpu.
value (str)
_SparkXGBParams
A unique id for the object.
Return the writer for saving the estimator.
SparkXGBWriter
Bases: _SparkXGBModel
The model returned by xgboost.spark.SparkXGBRanker.fit()
Note
This API is experimental.
xgb_sklearn_model (XGBModel | None)
training_summary (XGBoostTrainingSummary | None)
Clears a param from the param map if it has been explicitly set.
param (Param)
None
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
extra (dict, optional) – Extra parameters to copy to the new instance
self (P)
Copy of this instance
Params
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Gets the value of featuresCol or its default value.
Gets the value of labelCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of predictionCol or its default value.
Gets the value of validationIndicatorCol or its default value.
Gets the value of weightCol or its default value.
Return the xgboost.core.Booster instance.
Get feature importance of each feature. Importance type can be defined as:
‘weight’: the number of times a feature is used to split the data across all trees.
‘gain’: the average gain across all splits the feature is used in.
‘cover’: the average coverage across all splits the feature is used in.
‘total_gain’: the total gain across all splits the feature is used in.
‘total_cover’: the total coverage across all splits the feature is used in.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
path (str)
RL
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Return the reader for loading the model.
SparkXGBModelReader
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
path (str)
None
Sets a parameter in the embedded param map.
Set collective configuration
value (Config)
_SparkXGBParams
Set the device. Supported values: cpu, cuda, gpu.
value (str)
_SparkXGBParams
Transforms the input dataset with optional parameters.
Added in version 1.3.0.
dataset (pyspark.sql.DataFrame) – input dataset
params (dict, optional) – an optional param map that overrides embedded params.
transformed dataset
A unique id for the object.
Return the writer for saving the model.
SparkXGBModelWriter
XGBoost collective communication related API.
User configuration for the communicator context. This is used for easier integration with distributed frameworks. Users of the collective module can pass the parameters directly into tracker and the communicator.
Added in version 3.0.
See dmlc_retry in init().
See dmlc_timeout in init(). This is only used for communicators, not the tracker. They are different parameters since the timeout for tracker limits only the time for starting and finalizing the communication group, whereas the timeout for communicators limits the time used for collective operations, like allreduce().
int | None
See RabitTracker.
See RabitTracker.
See RabitTracker.
Initialize the collective library with arguments.
Keyword arguments representing the parameters and their values.
dmlc_communicator: The type of the communicator.
* rabit: Use Rabit. This is the default if the type is unspecified.
* federated: Use the gRPC interface for Federated Learning.
dmlc_tracker_uri: Hostname of the tracker.
dmlc_tracker_port: Port number of the tracker.
dmlc_task_id: ID of the current task, can be used to obtain deterministic rank assignment.
dmlc_retry: The number of retries when handling network errors.
dmlc_timeout: Timeout in seconds.
dmlc_nccl_path: Path to load (dlopen) nccl for GPU-based communication.
federated_server_address: Address of the federated server.
federated_world_size: Number of federated workers.
federated_rank: Rank of the current worker.
federated_server_cert: Server certificate file path. Only needed for the SSL mode.
federated_client_key: Client key file path. Only needed for the SSL mode.
federated_client_cert: Client certificate file path. Only needed for the SSL mode.
Use upper case for environment variables, use lower case for runtime configuration.
None
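The upper-case/lower-case convention above suggests a simple mapping from environment variables to runtime keyword arguments. The following is a minimal sketch under stated assumptions: collective_args_from_env and KNOWN_KEYS are hypothetical helpers, only a subset of the documented keys is listed, and values are left as strings since the accepted value types are not specified here.

```python
import os

# A subset of the lower-case runtime keys documented above.
KNOWN_KEYS = (
    "dmlc_communicator",
    "dmlc_tracker_uri",
    "dmlc_tracker_port",
    "dmlc_task_id",
    "dmlc_retry",
    "dmlc_timeout",
)


def collective_args_from_env(environ=None):
    """Collect upper-case environment variables into lower-case kwargs."""
    environ = os.environ if environ is None else environ
    args = {}
    for key in KNOWN_KEYS:
        value = environ.get(key.upper())
        if value is not None:
            args[key] = value  # kept as a string; conversion is left to the caller
    return args


# Fake environment for illustration; unrelated variables are ignored.
fake_env = {
    "DMLC_TRACKER_URI": "127.0.0.1",
    "DMLC_TRACKER_PORT": "9091",
    "PATH": "/usr/bin",
}
args = collective_args_from_env(fake_env)
print(args)
```

The resulting dictionary has the shape of the keyword arguments described for init(); whether a given deployment needs additional keys or type conversion depends on the communicator in use.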
Finalize the communicator.
None
Get rank of current process.
rank – Rank of current process.
Get the total number of workers.
n – Total number of processes.
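A common use of the rank and the total number of workers is to give each process a disjoint slice of the input. The sketch below shows one such scheme in plain Python; shard_for_rank is a hypothetical helper, and in a real worker the rank and world size would come from the collective API rather than being passed as literals.

```python
def shard_for_rank(items, rank, world_size):
    """Return the items assigned to one worker using a strided split."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return items[rank::world_size]


items = list(range(10))
world_size = 4

# Every rank gets a disjoint slice; together they cover all items.
shards = [shard_for_rank(items, rank, world_size) for rank in range(world_size)]
print(shards)
```

A strided split keeps shard sizes within one item of each other without needing to know the total length up front.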
A context controlling collective communicator initialization and finalization.
Tracker for XGBoost collective.
Tracker for the collective used in XGBoost, acting as a coordinator between workers.
n_workers (int) – The total number of workers in the communication group.
host_ip (str | None) – The IP address of the tracker node. XGBoost can try to guess one by probing with sockets. But it’s best to explicitly pass an address.
port (int) – The port this tracker should listen on. XGBoost can query an available port from the OS; this configuration is useful for restricted network environments.
sortby (str) –
How to sort the workers for rank assignment. The default is host, but users can set the DMLC_TASK_ID via arguments of init() and obtain deterministic rank assignment through sorting by task name. Available options are host and task.
timeout (int) –
Timeout for constructing (bootstrap) and shutting down the communication group, doesn’t apply to communication when the group is up and running.
The timeout value should take the time of data loading and pre-processing into account, due to potential lazy execution. By default the Tracker doesn’t have any timeout, to avoid premature aborting.
The wait_for() method has a different timeout parameter that can stop the tracker even if the tracker is still being used. A value error is raised when the timeout is reached.
Examples
from xgboost.tracker import RabitTracker
from xgboost import collective as coll

tracker = RabitTracker(host_ip="127.0.0.1", n_workers=2)
tracker.start()

with coll.CommunicatorContext(**tracker.worker_args()):
    ret = coll.broadcast("msg", 0)
    assert str(ret) == "msg"