Showing content from https://api-docs.databricks.com/python/pyspark/latest/api/pyspark.ml.stat.ChiSquareTest.html below:

ChiSquareTest — PySpark master documentation

ChiSquareTest
class pyspark.ml.stat.ChiSquareTest

Conduct Pearson’s independence test for every feature against the label. For each feature, the (feature, label) pairs are converted into a contingency matrix for which the Chi-squared statistic is computed. All label and feature values must be categorical.

The null hypothesis is that the occurrence of the outcomes is statistically independent.
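The computation described above can be sketched for a single feature in plain Python. This is an illustrative sketch only, not Spark's implementation: the helper name `chi_square_statistic` is hypothetical, and it assumes the feature values and labels arrive as two equal-length sequences of categorical values.

```python
from collections import Counter


def chi_square_statistic(feature_values, labels):
    """Pearson chi-squared statistic and degrees of freedom for one feature.

    Hypothetical helper for illustration; not part of pyspark.
    """
    n = len(labels)
    # Contingency matrix: observed count for each (feature value, label) pair.
    observed = Counter(zip(feature_values, labels))
    row_totals = Counter(feature_values)  # counts per distinct feature value
    col_totals = Counter(labels)          # counts per distinct label

    statistic = 0.0
    for f, row_total in row_totals.items():
        for l, col_total in col_totals.items():
            # Expected count under the null hypothesis of independence.
            expected = row_total * col_total / n
            statistic += (observed[(f, l)] - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# First feature column and labels from the data in the Examples section.
print(chi_square_statistic([0, 1, 2, 3], [0, 0, 1, 1]))  # (4.0, 3)
```

A feature with a single distinct value yields a one-row contingency matrix, so its degrees of freedom come out as 0.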

Methods

test(dataset, featuresCol, labelCol[, flatten])

Perform a Pearson’s independence test using dataset.

Methods Documentation

static test(dataset: pyspark.sql.dataframe.DataFrame, featuresCol: str, labelCol: str, flatten: bool = False) → pyspark.sql.dataframe.DataFrame

Perform a Pearson’s independence test using dataset.

Changed in version 3.1.0: Added optional flatten argument.

Parameters
dataset : pyspark.sql.DataFrame

DataFrame of categorical labels and categorical features. Real-valued features will be treated as categorical for each distinct value.

featuresCol : str

Name of features column in dataset, of type Vector (VectorUDT).

labelCol : str

Name of label column in dataset, of any numerical type.

flatten : bool, optional

If True, flattens the returned DataFrame so that it contains one row per feature. Defaults to False.

Returns
pyspark.sql.DataFrame

DataFrame containing the test result for every feature against the label. If flatten is True, this DataFrame will contain one row per feature with the following fields:

  • featureIndex: int

  • pValue: float

  • degreesOfFreedom: int

  • statistic: float

If flatten is False, this DataFrame will contain a single Row with the following fields:

  • pValues: Vector

  • degreesOfFreedom: Array[int]

  • statistics: Vector

Each of these fields has one value per feature.
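To show how a pValue relates to a statistic and its degreesOfFreedom, the following is a minimal pure-Python sketch of the upper-tail probability of the chi-squared distribution. It is an assumption-laden illustration, not the routine Spark uses: the function name `chi2_upper_tail` is hypothetical, and it handles only degrees of freedom of at least 1.

```python
import math


def chi2_upper_tail(statistic, dof):
    """P(X >= statistic) for X ~ chi-squared with `dof` degrees of freedom.

    Hypothetical helper for illustration; not part of pyspark.
    """
    if dof < 1:
        raise ValueError("this sketch only handles dof >= 1")
    half_x = statistic / 2.0
    # Closed-form starting points: dof = 1 (odd case) or dof = 2 (even case).
    if dof % 2 == 1:
        n = 1
        q = math.erfc(math.sqrt(half_x))
    else:
        n = 2
        q = math.exp(-half_x)
    # Recurrence Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / Gamma(a + 1),
    # applied with a = n / 2 to step the degrees of freedom up by 2.
    while n < dof:
        a = n / 2.0
        q += half_x ** a * math.exp(-half_x) / math.gamma(a + 1.0)
        n += 2
    return q


# A statistic of 4.0 with 3 degrees of freedom gives a p-value near 0.26,
# so independence would not be rejected at the 0.05 level.
print(round(chi2_upper_tail(4.0, 3), 4))  # 0.2615
```

A small p-value means the observed contingency matrix would be unlikely if the feature and label were independent.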

Examples

>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.stat import ChiSquareTest
>>> dataset = [[0, Vectors.dense([0, 0, 1])],
...            [0, Vectors.dense([1, 0, 1])],
...            [1, Vectors.dense([2, 1, 1])],
...            [1, Vectors.dense([3, 1, 1])]]
>>> dataset = spark.createDataFrame(dataset, ["label", "features"])
>>> chiSqResult = ChiSquareTest.test(dataset, 'features', 'label')
>>> chiSqResult.select("degreesOfFreedom").collect()[0]
Row(degreesOfFreedom=[3, 1, 0])
>>> chiSqResult = ChiSquareTest.test(dataset, 'features', 'label', True)
>>> row = chiSqResult.orderBy("featureIndex").collect()
>>> row[0].statistic
4.0
