Bases: object
Dataset in LightGBM.
LightGBM does not train on raw data. It discretizes continuous features into histogram bins, tries to combine categorical features, and automatically handles missing and infinite values.
This class handles that preprocessing, and holds that alternative representation of the input data.
Initialize Dataset.
data (str, pathlib.Path, numpy array, pandas DataFrame, scipy.sparse, Sequence, list of Sequence, list of numpy array or pyarrow Table) – Data source of Dataset. If str or pathlib.Path, it represents the path to a text file (CSV, TSV, or LibSVM) or a LightGBM Dataset binary file.
label (list, numpy 1-D array, pandas Series / one-column DataFrame, pyarrow Array, pyarrow ChunkedArray or None, optional (default=None)) – Label of the data.
reference (Dataset or None, optional (default=None)) – If this is Dataset for validation, training data should be used as reference.
weight (list, numpy 1-D array, pandas Series, pyarrow Array, pyarrow ChunkedArray or None, optional (default=None)) – Weight for each instance. Weights should be non-negative.
group (list, numpy 1-D array, pandas Series, pyarrow Array, pyarrow ChunkedArray or None, optional (default=None)) – Group/query data. Only used in the learning-to-rank task. sum(group) = n_samples. For example, if you have a 100-document dataset with group = [10, 20, 40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.
init_score (list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), pyarrow Array, pyarrow ChunkedArray, pyarrow Table (for multi-class task) or None, optional (default=None)) – Init score for Dataset.
feature_name (list of str, or 'auto', optional (default="auto")) – Feature names. If ‘auto’ and data is pandas DataFrame or pyarrow Table, data columns names are used.
categorical_feature (list of str or int, or 'auto', optional (default="auto")) – Categorical features. If list of int, interpreted as indices. If list of str, interpreted as feature names (need to specify feature_name as well). If 'auto' and data is pandas DataFrame, pandas unordered categorical columns are used. All values in categorical features will be cast to int32 and thus should be less than the int32 max value (2147483647). Large values could be memory consuming. Consider using consecutive integers starting from zero. All negative values in categorical features will be treated as missing values. The output cannot be monotonically constrained with respect to a categorical feature. Floating point numbers in categorical features will be rounded towards 0.
params (dict or None, optional (default=None)) – Other parameters for Dataset.
free_raw_data (bool, optional (default=True)) – If True, raw data is freed after constructing inner Dataset.
position (numpy 1-D array, pandas Series or None, optional (default=None)) – Position of items used in unbiased learning-to-rank task.
Methods
__init__(data[, label, reference, weight, ...]) – Initialize Dataset.
add_features_from(other) – Add features from other Dataset to the current Dataset.
construct() – Lazy init.
create_valid(data[, label, weight, group, ...]) – Create validation data aligned with the current Dataset.
feature_num_bin(feature) – Get the number of bins for a feature.
get_data() – Get the raw data of the Dataset.
get_feature_name() – Get the names of columns (features) in the Dataset.
get_field(field_name) – Get property from the Dataset.
get_group() – Get the group of the Dataset.
get_init_score() – Get the initial score of the Dataset.
get_label() – Get the label of the Dataset.
get_params() – Get the used parameters in the Dataset.
get_position() – Get the position of the Dataset.
get_ref_chain([ref_limit]) – Get a chain of Dataset objects.
get_weight() – Get the weight of the Dataset.
num_data() – Get the number of rows in the Dataset.
num_feature() – Get the number of columns (features) in the Dataset.
save_binary(filename) – Save Dataset to a binary file.
set_categorical_feature(categorical_feature) – Set categorical features.
set_feature_name(feature_name) – Set feature name.
set_field(field_name, data) – Set property into the Dataset.
set_group(group) – Set group size of Dataset (used for ranking).
set_init_score(init_score) – Set init score of Booster to start from.
set_label(label) – Set label of Dataset.
set_position(position) – Set position of Dataset (used for ranking).
set_reference(reference) – Set reference Dataset.
set_weight(weight) – Set weight of each instance.
subset(used_indices[, params]) – Get subset of current Dataset.
Add features from other Dataset to the current Dataset.
Both Datasets must be constructed before calling this method.
Lazy init: construct the inner Dataset if it has not already been constructed.
self – Constructed Dataset object.
Create validation data aligned with the current Dataset.
data (str, pathlib.Path, numpy array, pandas DataFrame, scipy.sparse, Sequence, list of Sequence, list of numpy array or pyarrow Table) – Data source of Dataset. If str or pathlib.Path, it represents the path to a text file (CSV, TSV, or LibSVM) or a LightGBM Dataset binary file.
label (list, numpy 1-D array, pandas Series / one-column DataFrame, pyarrow Array, pyarrow ChunkedArray or None, optional (default=None)) – Label of the data.
weight (list, numpy 1-D array, pandas Series, pyarrow Array, pyarrow ChunkedArray or None, optional (default=None)) – Weight for each instance. Weights should be non-negative.
group (list, numpy 1-D array, pandas Series, pyarrow Array, pyarrow ChunkedArray or None, optional (default=None)) – Group/query data. Only used in the learning-to-rank task. sum(group) = n_samples. For example, if you have a 100-document dataset with group = [10, 20, 40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.
init_score (list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), pyarrow Array, pyarrow ChunkedArray, pyarrow Table (for multi-class task) or None, optional (default=None)) – Init score for Dataset.
params (dict or None, optional (default=None)) – Other parameters for validation Dataset.
position (numpy 1-D array, pandas Series or None, optional (default=None)) – Position of items used in unbiased learning-to-rank task.
valid – Validation Dataset with reference to self.
Get the number of bins for a feature.
Added in version 4.0.0.
feature (int or str) – Index or name of the feature.
number_of_bins – The number of constructed bins for the feature in the Dataset.
int
Get the raw data of the Dataset.
Get the names of columns (features) in the Dataset.
feature_names – The names of columns (features) in the Dataset.
list of str
Get property from the Dataset.
Can only be run on a constructed Dataset.
Unlike get_group(), get_init_score(), get_label(), get_position(), and get_weight(), this method ignores any raw data passed into lgb.Dataset() on the Python side, and will only read data from the constructed C++ Dataset object.
field_name (str) – The field name of the information.
info – A numpy array with information from the Dataset.
numpy array or None
Get the group of the Dataset.
group – Group/query data. Only used in the learning-to-rank task. sum(group) = n_samples. For example, if you have a 100-document dataset with group = [10, 20, 40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc. For a constructed Dataset, this will only return None or a numpy array.
list, numpy 1-D array, pandas Series, pyarrow Array, pyarrow ChunkedArray or None
Get the initial score of the Dataset.
init_score – Init score of Booster. For a constructed Dataset, this will only return None or a numpy array.
list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), pyarrow Array, pyarrow ChunkedArray, pyarrow Table (for multi-class task) or None
Get the label of the Dataset.
label – The label information from the Dataset. For a constructed Dataset, this will only return a numpy array.
list, numpy 1-D array, pandas Series / one-column DataFrame, pyarrow Array, pyarrow ChunkedArray or None
Get the used parameters in the Dataset.
params – The used parameters in this Dataset object.
dict
Get the position of the Dataset.
position – Position of items used in unbiased learning-to-rank task. For a constructed Dataset, this will only return None or a numpy array.
numpy 1-D array, pandas Series or None
Get a chain of Dataset objects.
Starts with this Dataset r, then goes to r.reference (if it exists), then to r.reference.reference, and so on, until ref_limit is reached or a reference loop is detected.
ref_limit (int, optional (default=100)) – The limit number of references.
ref_chain – Chain of references of the Datasets.
set of Dataset
Get the weight of the Dataset.
weight – Weight for each data point from the Dataset. Weights should be non-negative. For a constructed Dataset, this will only return None or a numpy array.
list, numpy 1-D array, pandas Series, pyarrow Array, pyarrow ChunkedArray or None
Get the number of rows in the Dataset.
number_of_rows – The number of rows in the Dataset.
int
Get the number of columns (features) in the Dataset.
number_of_columns – The number of columns (features) in the Dataset.
int
Save Dataset to a binary file.
Note
Please note that init_score is not saved in the binary file. If you need it, set it again after loading the Dataset.
filename (str or pathlib.Path) – Name of the output file.
self – Returns self.
Set categorical features.
categorical_feature (list of str or int, or 'auto') – Names or indices of categorical features.
self – Dataset with set categorical features.
Set feature name.
feature_name (list of str) – Feature names.
self – Dataset with set feature name.
Set property into the Dataset.
field_name (str) – The field name of the information.
data (list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), pyarrow Array, pyarrow ChunkedArray, pyarrow Table (for multi-class task) or None) – The data to be set.
self – Dataset with set property.
Set group size of Dataset (used for ranking).
group (list, numpy 1-D array, pandas Series, pyarrow Array, pyarrow ChunkedArray or None) – Group/query data. Only used in the learning-to-rank task. sum(group) = n_samples. For example, if you have a 100-document dataset with group = [10, 20, 40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.
self – Dataset with set group.
Set init score of Booster to start from.
init_score (list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), pyarrow Array, pyarrow ChunkedArray, pyarrow Table (for multi-class task) or None) – Init score for Booster.
self – Dataset with set init score.
Set label of Dataset.
label (list, numpy 1-D array, pandas Series / one-column DataFrame, pyarrow Array, pyarrow ChunkedArray or None) – The label information to be set into Dataset.
self – Dataset with set label.
Set position of Dataset (used for ranking).
position (numpy 1-D array, pandas Series or None, optional (default=None)) – Position of items used in unbiased learning-to-rank task.
self – Dataset with set position.
Set reference Dataset.
Set weight of each instance.
weight (list, numpy 1-D array, pandas Series, pyarrow Array, pyarrow ChunkedArray or None) – Weight to be set for each data point. Weights should be non-negative.
self – Dataset with set weight.
Get subset of current Dataset.
used_indices (list of int) – Indices used to create the subset.
params (dict or None, optional (default=None)) – These parameters will be passed to Dataset constructor.
subset – Subset of the current Dataset.