All Implemented Interfaces: Serializable, org.apache.spark.internal.Logging, CountVectorizerParams, Params, HasInputCol, HasOutputCol, DefaultParamsWritable, Identifiable, MLWritable
Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging: org.apache.spark.internal.Logging.LogStringContext, org.apache.spark.internal.Logging.SparkShellLoggingFilter
Constructors
CountVectorizer()
Methods
binary: Binary toggle to control the output vector values.
copy: Creates a copy of this instance with the same UID and some extra params.
fit: Fits a model to the input data.
inputCol: Param for input column name.
maxDF: Specifies the maximum number of different documents a term could appear in to be included in the vocabulary.
minDF: Specifies the minimum number of different documents a term must appear in to be included in the vocabulary.
minTF: Filter to ignore rare words in a document.
outputCol: Param for output column name.
transformSchema: Check transform validity and derive the output schema from the input schema.
uid: An immutable unique ID for the object and its derivatives.
vocabSize: Max size of the vocabulary.
Methods inherited from interface org.apache.spark.internal.Logging: initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logDebug, logDebug, logDebug, logDebug, logError, logError, logError, logError, logInfo, logInfo, logInfo, logInfo, logName, LogStringContext, logTrace, logTrace, logTrace, logTrace, logWarning, logWarning, logWarning, logWarning, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq, withLogContext
Methods inherited from interface org.apache.spark.ml.util.MLWritable: save
Methods inherited from interface org.apache.spark.ml.param.Params: clear, copyValues, defaultCopy, defaultParamMap, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, onParamChange, paramMap, params, set, set, set, setDefault, setDefault, shouldOwn
public CountVectorizer()
Max size of the vocabulary.
Default: 2^18 (262,144)
vocabSize
in interface CountVectorizerParams
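A minimal sketch of how a vocabulary size cap can work, assuming the vocabulary keeps the terms with the highest total count across the corpus (the text here only states that vocabSize is the maximum vocabulary size). `VocabSizeSketch` and `topTerms` are hypothetical names in plain Java, not Spark's implementation, and the alphabetical tie-break is an assumption made only to keep the result deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Spark: caps a vocabulary at vocabSize terms.
class VocabSizeSketch {
    static List<String> topTerms(List<List<String>> docs, int vocabSize) {
        // Count every occurrence of every term across all documents.
        Map<String, Long> counts = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                counts.merge(term, 1L, Long::sum);
            }
        }
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        // Highest count first; ties broken alphabetically (an assumption of this sketch).
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        // Keep at most vocabSize terms.
        List<String> vocab = new ArrayList<>();
        for (int i = 0; i < Math.min(vocabSize, entries.size()); i++) {
            vocab.add(entries.get(i).getKey());
        }
        return vocab;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("a", "b", "c"),
            List.of("a", "b", "b", "c", "a"));
        System.out.println(topTerms(docs, 2));
    }
}
```

With the default of 2^18 the cap rarely matters for small corpora; it becomes relevant when the number of distinct terms exceeds it.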
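Specifies the minimum number of different documents a term must appear in to be included in the vocabulary.
Default: 1.0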
minDF
in interface CountVectorizerParams
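Specifies the maximum number of different documents a term could appear in to be included in the vocabulary.
Default: 2^63 - 1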
maxDF
in interface CountVectorizerParams
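The minDF and maxDF bounds can be illustrated with a small plain-Java sketch. It assumes that a value of 1 or more is read as an absolute document count and a value in [0,1) as a fraction of all documents, mirroring the wording used for minTF; that reading, and the names `DocFreqSketch`, `resolve`, and `filter`, are assumptions of this sketch rather than Spark's code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Spark: filters terms by document frequency.
class DocFreqSketch {
    // Assumption: values >= 1 are document counts, values in [0,1) are fractions.
    static double resolve(double threshold, long numDocs) {
        return threshold >= 1.0 ? threshold : threshold * numDocs;
    }

    static List<String> filter(List<List<String>> docs, double minDF, double maxDF) {
        // Document frequency: the number of distinct documents containing each term.
        Map<String, Integer> docFreq = new TreeMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        double lo = resolve(minDF, docs.size());
        double hi = resolve(maxDF, docs.size());
        // Keep terms whose document frequency lies within [lo, hi].
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            if (e.getValue() >= lo && e.getValue() <= hi) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("a", "b"),
            List.of("a", "c"),
            List.of("a", "b", "b"));
        // "a" is in 3 documents, "b" in 2, "c" in 1.
        System.out.println(filter(docs, 2, 2));
    }
}
```

Under the defaults (minDF of 1.0 and maxDF of 2^63 - 1) neither bound excludes any term that occurs at least once.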
Filter to ignore rare words in a document. For each document, terms with frequency/count less than the given threshold are ignored. If this is an integer greater than or equal to 1, then this specifies a count (of times the term must appear in the document); if this is a double in [0,1), then this specifies a fraction (out of the document's token count).
Note that the parameter is only used in transform of CountVectorizerModel and does not affect fitting.
Default: 1.0
minTF
in interface CountVectorizerParams
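The minTF rule described above can be sketched in plain Java as follows. `MinTfSketch` and `counts` are hypothetical names, and the code illustrates only the documented threshold semantics (count when the value is 1 or more, fraction of the document's token count when it is in [0,1)), not Spark's implementation.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Spark: applies a minTF threshold to one document.
class MinTfSketch {
    static Map<String, Integer> counts(List<String> doc, double minTF) {
        // Term frequency within this single document.
        Map<String, Integer> tf = new TreeMap<>();
        for (String term : doc) {
            tf.merge(term, 1, Integer::sum);
        }
        // >= 1 is an absolute count; [0,1) is a fraction of the document's token count.
        double threshold = minTF >= 1.0 ? minTF : minTF * doc.size();
        // Drop terms whose count falls below the threshold.
        tf.values().removeIf(count -> count < threshold);
        return tf;
    }

    public static void main(String[] args) {
        List<String> doc = List.of("a", "a", "a", "b", "b", "c");
        System.out.println(counts(doc, 2.0)); // {a=3, b=2}
        System.out.println(counts(doc, 0.5)); // {a=3}
    }
}
```

In the second call the document has 6 tokens, so a fraction of 0.5 resolves to a threshold of 3 occurrences.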
Binary toggle to control the output vector values. If true, all nonzero counts (after the minTF filter is applied) are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.
Default: false
binary
in interface CountVectorizerParams
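As a rough illustration of the binary toggle, the sketch below maps per-term counts to output values. `BinarySketch` and `toVector` are hypothetical names, and the input is assumed to be the counts that remain after any minTF filtering; this is not Spark's code.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Spark: applies the binary toggle to term counts.
class BinarySketch {
    static double[] toVector(int[] counts, boolean binary) {
        double[] out = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            // With binary set, every nonzero count becomes 1; zeros stay 0.
            out[i] = (binary && counts[i] > 0) ? 1.0 : counts[i];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] counts = {3, 0, 1};
        System.out.println(Arrays.toString(toVector(counts, false))); // [3.0, 0.0, 1.0]
        System.out.println(Arrays.toString(toVector(counts, true)));  // [1.0, 0.0, 1.0]
    }
}
```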
Param for output column name.
outputCol
in interface HasOutputCol
Param for input column name.
inputCol
in interface HasInputCol
An immutable unique ID for the object and its derivatives.
uid
in interface Identifiable
Fits a model to the input data.
fit
in class Estimator<CountVectorizerModel>
dataset - (undocumented)
Check transform validity and derive the output schema from the input schema.
We check validity for interactions between parameters during transformSchema and raise an exception if any parameter value is invalid. Parameter value checks which do not depend on other parameters are handled by Param.validate().
A typical implementation should first verify the schema change and parameter validity, including complex parameter interaction checks.
transformSchema
in class PipelineStage
schema - (undocumented)
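The division of labor described for transformSchema (interaction checks first, then deriving the output schema) can be sketched in plain Java. Everything here is hypothetical: the schema is modeled as an ordered map from column name to a type label, the "vector" label is a placeholder, and the specific checks (maxDF not below minDF with both treated as absolute counts, input column present, output column absent) are assumptions chosen for illustration, not a statement of what Spark validates.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not Spark's API: validate parameters, then derive the output schema.
class TransformSchemaSketch {
    static Map<String, String> transformSchema(Map<String, String> schema,
                                               String inputCol, String outputCol,
                                               double minDF, double maxDF) {
        // Interaction check: depends on two parameters at once, so it belongs here
        // rather than in a per-parameter validator.
        if (maxDF < minDF) {
            throw new IllegalArgumentException(
                "maxDF (" + maxDF + ") must be >= minDF (" + minDF + ")");
        }
        // Schema checks: the input column must exist and the output column must be new.
        if (!schema.containsKey(inputCol)) {
            throw new IllegalArgumentException("Missing input column: " + inputCol);
        }
        if (schema.containsKey(outputCol)) {
            throw new IllegalArgumentException("Output column already exists: " + outputCol);
        }
        // Derive the output schema: a copy of the input schema plus the new column.
        Map<String, String> out = new LinkedHashMap<>(schema);
        out.put(outputCol, "vector");
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("words", "array<string>");
        System.out.println(transformSchema(schema, "words", "features", 1.0, 10.0));
    }
}
```

Failing early in a schema-level check like this surfaces configuration mistakes before any data is processed.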
Description copied from interface: Params
Creates a copy of this instance with the same UID and some extra params. Subclasses should implement this method and set the return type properly. See defaultCopy().
copy
in interface Params
copy
in class Estimator<CountVectorizerModel>
extra - (undocumented)