All Implemented Interfaces:
Serializable, org.apache.spark.internal.Logging, QuantileDiscretizerBase, Params, HasHandleInvalid, HasInputCol, HasInputCols, HasOutputCol, HasOutputCols, HasRelativeError, DefaultParamsWritable, Identifiable, MLWritable
QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins can be set using the numBuckets parameter. The number of buckets actually used may be smaller than this value, for example if there are too few distinct values in the input to create enough distinct quantiles. Since 2.3.0, QuantileDiscretizer can map multiple columns at once by setting the inputCols parameter. If both the inputCol and inputCols parameters are set, an Exception is thrown. To specify the number of buckets for each column, the numBucketsArray parameter can be set; if the number of buckets should be the same across columns, numBuckets can be set as a convenience. Note that in the multiple-columns case, the relative error is applied to all columns.
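A minimal usage sketch (assuming a running SparkSession named spark; the column names and data are illustrative, not from this page):

  import org.apache.spark.ml.feature.QuantileDiscretizer

  // Toy data: a continuous "hour" column to be binned into 3 buckets.
  val df = spark.createDataFrame(
    Seq((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2))
  ).toDF("id", "hour")

  val discretizer = new QuantileDiscretizer()
    .setInputCol("hour")
    .setOutputCol("result")
    .setNumBuckets(3)

  // fit() learns the quantile splits; transform() assigns each row to a bucket.
  val binned = discretizer.fit(df).transform(df)
  binned.show()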
NaN handling: null and NaN values are ignored during QuantileDiscretizer fitting. Fitting produces a Bucketizer model for making predictions. During transformation, Bucketizer raises an error when it finds NaN values in the dataset, but the user can choose to keep or remove NaN values in the dataset by setting handleInvalid. If the user chooses to keep NaN values, they are handled specially and placed into their own bucket: for example, if 4 buckets are used, non-NaN data is put into buckets[0-3] and NaNs are counted in a special bucket[4].
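A sketch of the 'keep' behavior (assuming the same SparkSession named spark; the data is illustrative):

  import org.apache.spark.ml.feature.QuantileDiscretizer

  val withNaN = spark.createDataFrame(
    Seq((0, 1.0), (1, 2.0), (2, 3.0), (3, Double.NaN))
  ).toDF("id", "feature")

  val keepNaN = new QuantileDiscretizer()
    .setInputCol("feature")
    .setOutputCol("bucket")
    .setNumBuckets(2)
    .setHandleInvalid("keep") // NaN rows land in an extra bucket instead of raising an error

  keepNaN.fit(withNaN).transform(withNaN).show()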
Algorithm: the bin ranges are chosen using an approximate algorithm (see the documentation for org.apache.spark.sql.DataFrameStatFunctions.approxQuantile for a detailed description). The precision of the approximation can be controlled with the relativeError parameter. The lower and upper bin bounds are -Infinity and +Infinity, covering all real values.
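The same quantile machinery is available directly on a DataFrame; a short sketch (reusing the df from the first example, with illustrative error values):

  // approxQuantile uses the algorithm referenced above; relativeError = 0.0
  // computes exact quantiles at greater cost, larger values trade precision for speed.
  val quartiles = df.stat.approxQuantile("hour", Array(0.25, 0.5, 0.75), 0.001)

  // The equivalent knob on the discretizer itself:
  val coarse = new QuantileDiscretizer()
    .setInputCol("hour")
    .setOutputCol("result")
    .setNumBuckets(10)
    .setRelativeError(0.01)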
Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging: org.apache.spark.internal.Logging.LogStringContext, org.apache.spark.internal.Logging.SparkShellLoggingFilter
Constructors
QuantileDiscretizer()

Methods
copy(ParamMap extra) - Creates a copy of this instance with the same UID and some extra params.
fit(Dataset<?> dataset) - Fits a model to the input data.
handleInvalid() - Param for how to handle invalid entries.
inputCol() - Param for input column name.
inputCols() - Param for input column names.
static LogStringContext(scala.StringContext sc) - returns org.apache.spark.internal.Logging.LogStringContext
numBuckets() - Number of buckets (quantiles, or categories) into which data points are grouped.
numBucketsArray() - Array of number of buckets (quantiles, or categories) into which data points are grouped.
static org$apache$spark$internal$Logging$$log_() - returns org.slf4j.Logger
static org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1) - returns void
outputCol() - Param for output column name.
outputCols() - Param for output column names.
relativeError() - Param for the relative target precision for the approximate quantile algorithm.
transformSchema(StructType schema) - Check transform validity and derive the output schema from the input schema.
uid() - An immutable unique ID for the object and its derivatives.
Methods inherited from interface org.apache.spark.internal.Logging: initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logDebug, logDebug, logDebug, logDebug, logError, logError, logError, logError, logInfo, logInfo, logInfo, logInfo, logName, LogStringContext, logTrace, logTrace, logTrace, logTrace, logWarning, logWarning, logWarning, logWarning, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq, withLogContext
Methods inherited from interface org.apache.spark.ml.util.MLWritable: save
Methods inherited from interface org.apache.spark.ml.param.Params: clear, copyValues, defaultCopy, defaultParamMap, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, onParamChange, paramMap, params, set, set, set, setDefault, setDefault, shouldOwn
public QuantileDiscretizer()
public static org.slf4j.Logger org$apache$spark$internal$Logging$$log_()
public static void org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)
public static org.apache.spark.internal.Logging.LogStringContext LogStringContext(scala.StringContext sc)
numBuckets
Number of buckets (quantiles, or categories) into which data points are grouped. Must be greater than or equal to 2.
See also QuantileDiscretizerBase.handleInvalid(), which can optionally create an additional bucket for NaN values.
Default: 2
Specified by: numBuckets in interface QuantileDiscretizerBase
numBucketsArray
Array of number of buckets (quantiles, or categories) into which data points are grouped. Each value must be greater than or equal to 2.
See also QuantileDiscretizerBase.handleInvalid(), which can optionally create an additional bucket for NaN values.
Specified by: numBucketsArray in interface QuantileDiscretizerBase
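A multi-column sketch with per-column bucket counts (the column names are illustrative):

  val multiCol = new QuantileDiscretizer()
    .setInputCols(Array("hour", "temperature"))
    .setOutputCols(Array("hourBucket", "tempBucket"))
    .setNumBucketsArray(Array(3, 5)) // 3 buckets for "hour", 5 for "temperature"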
handleInvalid
Param for how to handle invalid entries. Options are 'skip' (filter out rows with invalid values), 'error' (throw an error), or 'keep' (keep invalid values in a special additional bucket). Note that in the multiple-columns case, the invalid handling is applied to all columns: 'error' throws an error if any invalid value is found in any column, 'skip' drops rows with an invalid value in any column, and so on. Default: "error"
Specified by: handleInvalid in interface HasHandleInvalid
Specified by: handleInvalid in interface QuantileDiscretizerBase
relativeError
Param for the relative target precision for the approximate quantile algorithm. Must be in the range [0, 1].
Specified by: relativeError in interface HasRelativeError

outputCols
Param for output column names.
Specified by: outputCols in interface HasOutputCols

inputCols
Param for input column names.
Specified by: inputCols in interface HasInputCols

outputCol
Param for output column name.
Specified by: outputCol in interface HasOutputCol

inputCol
Param for input column name.
Specified by: inputCol in interface HasInputCol

uid
An immutable unique ID for the object and its derivatives.
Specified by: uid in interface Identifiable
transformSchema
Check transform validity and derive the output schema from the input schema.
We check validity for interactions between parameters during transformSchema and raise an exception if any parameter value is invalid. Parameter value checks that do not depend on other parameters are handled by Param.validate().
A typical implementation should first verify the schema change and parameter validity, including complex parameter interaction checks.
Specified by: transformSchema in class PipelineStage
Parameters: schema - (undocumented)
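A sketch of validating a stage without touching any data (reusing the discretizer from the first example; the schema fields are illustrative):

  import org.apache.spark.sql.types._

  val schema = StructType(Seq(
    StructField("id", IntegerType),
    StructField("hour", DoubleType)
  ))

  // Throws if any parameter value or interaction is invalid; otherwise
  // returns the input schema with the output column appended.
  val outSchema = discretizer.transformSchema(schema)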
fit
Fits a model to the input data.
Specified by: fit in class Estimator<Bucketizer>
Parameters: dataset - (undocumented)
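The fitted model is a Bucketizer; a sketch of inspecting its learned splits (reusing the discretizer and df from the first example):

  val model = discretizer.fit(df)
  // The learned quantile boundaries, bounded below and above by -Infinity and +Infinity.
  println(model.getSplits.mkString(", "))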
copy
Description copied from interface: Params
Creates a copy of this instance with the same UID and some extra params. Subclasses should implement this method and set the return type properly. See defaultCopy().
Specified by: copy in interface Params
Specified by: copy in class Estimator<Bucketizer>
Parameters: extra - (undocumented)