Last Updated : 26 Jul, 2025
The Chi-Square test helps us determine whether there is a significant relationship between two categorical variables. It is a non-parametric statistical test, meaning it does not assume the data follow a normal distribution.
Example of Chi-square test
The Chi-square test compares the observed frequencies (the actual data) with the expected frequencies (what we would expect if there were no relationship). This comparison helps identify which features are important for predicting the target variable in machine learning models.
Formula for Chi-square test
The Chi-square statistic is calculated as:
\chi^2 = \sum \frac{(O_{i} - E_{i})^2}{E_{i}} ...eq (1)
where,
O_{i} = observed frequency for category i
E_{i} = expected frequency for category i
The test is often used with non-normally distributed data. Before we jump into the calculations, let's understand the two main types: the chi-square test for independence and the chi-square goodness-of-fit test.
Types of chi-square tests
1. Chi-Square Test for Independence: This test is used to determine whether there is a significant relationship between two categorical variables.
2. Chi-Square Goodness-of-Fit Test: This test is used to check whether a variable follows a specific expected pattern or distribution.
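A goodness-of-fit test can be run directly with scipy.stats.chisquare. The die-roll counts below are made-up illustrative data, not from the article:

```python
import scipy.stats as stats

# Hypothetical observed counts from 120 rolls of a die (illustrative data)
observed = [25, 17, 15, 23, 24, 16]

# A fair die would give 120 / 6 = 20 rolls per face
expected = [20] * 6

# chisquare returns the statistic and the p-value for H0: data match expected
chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(chi2, p)
```

Here the statistic is 5.0 with 5 degrees of freedom, so the p-value is well above 0.05 and we would not reject the hypothesis that the die is fair.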
Step 1: Define Your Hypotheses: State a null hypothesis H0 (the two variables are independent) and an alternative hypothesis H1 (the two variables are related).
Step 2: Create a Contingency Table: This is simply a table that displays the frequency distribution of the two categorical variables.
Step 3: Calculate Expected Values: To find the expected value for each cell use this formula:
E_{i} = \frac{(Row\ Total \times Column\ Total)}{Grand\ Total}
Step 4: Compute the Chi-Square Statistic: Now use the Chi-Square formula:
\chi^2 = \sum \frac{(O_{i} - E_{i})^2}{E_{i}}
where:
O_{i} = observed frequency
E_{i} = expected frequency
If the observed and expected values are very different, the Chi-Square value will be high, which indicates a strong relationship between the variables.
Step 5: Compare with the Critical Value: Compare the calculated \chi^2 with the critical value from the chi-square distribution for the chosen significance level and degrees of freedom. If the calculated value exceeds the critical value, reject the null hypothesis.
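The five steps above can be sketched in one call with scipy.stats.chi2_contingency; the 2x2 table below uses made-up illustrative counts:

```python
import numpy as np
import scipy.stats as stats

# Hypothetical 2x2 contingency table (illustrative counts)
observed = np.array([[30, 10],
                     [20, 40]])

# chi2_contingency computes the expected frequencies, the statistic,
# the p-value and the degrees of freedom in one call.
# correction=False disables Yates' continuity correction so the result
# matches the plain formula in eq (1).
chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)

print(f"chi2 = {chi2:.3f}, p = {p:.5f}, dof = {dof}")
print("expected frequencies:\n", expected)
```

For this table the expected frequencies are [[20, 20], [30, 30]], the statistic is 50/3 ≈ 16.667 on 1 degree of freedom, and the tiny p-value would lead us to reject independence.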
The Chi-Square Test helps us find relationships or differences between categories. Its main uses are testing independence between two categorical variables, checking goodness of fit to an expected distribution, and selecting categorical features for machine learning models.
Let us examine a dataset with a feature "income level" (low, medium, high) and a target "subscription status" (subscribed, not subscribed) indicating whether a customer subscribed to a service. The goal is to determine whether this feature is relevant for predicting subscription status.
Step 1: Make Hypotheses: H0: income level and subscription status are independent. H1: income level and subscription status are related.
Step 2: Contingency table

Income Level | Subscribed | Not Subscribed | Row Total
Low          | 20         | 30             | 50
Medium       | 40         | 25             | 65
High         | 10         | 15             | 25
Column Total | 70         | 70             | 140
Step 3: Now calculate the expected frequencies. For example, the expected frequency for "Low Income" and "Subscribed" is:
E = \frac{50 \times 70}{140} = 25
Similarly, we can find the expected frequencies for the other cells:

              | Subscribed | Not Subscribed
Low Income    | 25         | 25
Medium Income | 32.5       | 32.5
High Income   | 12.5       | 12.5
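The expected frequencies follow mechanically from the row and column totals; a quick numpy check of the table above (a sketch using an outer product):

```python
import numpy as np

# Observed contingency table from the worked example:
# rows = income level (low, medium, high), cols = (subscribed, not subscribed)
observed = np.array([[20, 30],
                     [40, 25],
                     [10, 15]])

row_totals = observed.sum(axis=1)   # [50, 65, 25]
col_totals = observed.sum(axis=0)   # [70, 70]
grand_total = observed.sum()        # 140

# E_ij = (row total * column total) / grand total, via an outer product
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)
```

This prints 25 for both Low Income cells, 32.5 for both Medium Income cells, and 12.5 for both High Income cells.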
Step 4: Calculate the Chi-Square Statistic: Let's summarize the observed and expected values in a table and calculate the Chi-Square value:

              | Subscribed (O) | Not Subscribed (O) | Subscribed (E) | Not Subscribed (E)
Low Income    | 20             | 30                 | 25             | 25
Medium Income | 40             | 25                 | 32.5           | 32.5
High Income   | 10             | 15                 | 12.5           | 12.5
Now, using the formula specified in equation 1, we can compute the chi-square statistic as follows:
\chi^2 = \frac{(20 - 25)^2}{25} + \frac{(30 - 25)^2}{25} + \frac{(40 - 32.5)^2}{32.5} + \frac{(25 - 32.5)^2}{32.5} + \frac{(10 - 12.5)^2}{12.5} + \frac{(15 - 12.5)^2}{12.5}
= 1 + 1 + 1.731 + 1.731 + 0.5 + 0.5 = 6.462
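We can verify the statistic for this table with scipy.stats.chi2_contingency (correction=False keeps the plain formula of eq (1), although scipy only applies Yates' correction for 2x2 tables anyway):

```python
import numpy as np
import scipy.stats as stats

# Observed counts from the worked example
observed = np.array([[20, 30],
                     [40, 25],
                     [10, 15]])

chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)
print(round(chi2, 3), round(p, 4), dof)
```

This confirms a statistic of about 6.462 on 2 degrees of freedom, with a p-value just under 0.05.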
Step 5: Degrees of Freedom
\text{Degrees of Freedom (df)} = (3 - 1) \times (2 - 1) = 2
Step 6: Interpretation
Now compare the calculated \chi^2 value (6.462) with the critical value for 2 degrees of freedom. If \chi^2 is greater than the critical value, we reject the null hypothesis, which means "income level" is significantly related to "subscription status" and is an important feature. Before implementing this in code, some basic knowledge of numpy, matplotlib and scipy is helpful.
Python
import scipy.stats as stats

df = 2        # degrees of freedom
alpha = 0.05  # significance level

# Critical value: the chi-square value with alpha probability in the right tail
critical_value = stats.chi2.ppf(1 - alpha, df)
print(critical_value)
Output:
5.991464547107979
For df = 2 and significance level \alpha = 0.05 , the critical value is 5.991.
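Equivalently, instead of comparing the statistic with a critical value, we can compute the p-value with the survival function and compare it with alpha. This sketch plugs in the statistic computed from the observed counts in the table above:

```python
import scipy.stats as stats

chi2_stat = 6.462   # chi-square statistic from the worked example
df = 2

# Right-tail probability of observing a statistic at least this large under H0
p_value = stats.chi2.sf(chi2_stat, df)
print(round(p_value, 4))
```

The p-value is below 0.05, so the decision (reject H0) matches the critical-value comparison.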
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

df = 2
alpha = 0.05
c_val = stats.chi2.ppf(1 - alpha, df)  # critical value ~ 5.991
cal_chi_s = 6.462                      # calculated chi-square statistic

# Plot the chi-square pdf and shade the critical (rejection) region
x = np.linspace(0, 10, 1000)
y = stats.chi2.pdf(x, df)
plt.plot(x, y, label='Chi-Square Distribution (df=2)')
plt.fill_between(x, y, where=(x > c_val), color='red', alpha=0.5, label='Critical Region')
plt.axvline(cal_chi_s, color='blue', linestyle='dashed', label='Calculated Chi-Square')
plt.axvline(c_val, color='green', linestyle='dashed', label='Critical Value')
plt.title('Chi-Square Distribution and Critical Region')
plt.xlabel('Chi-Square Value')
plt.ylabel('Probability Density')
plt.legend()
plt.show()
Output:
Chi-square Distribution
In this plot, the green dashed line represents the critical value, the threshold beyond which we reject the null hypothesis. If the calculated Chi-Square statistic falls within the shaded area, we reject the null hypothesis. Here the calculated value (6.462) exceeds the critical value (5.991) and falls inside the critical region, so we reject the null hypothesis. Hence there is a significant association between income level and subscription status, and the feature is worth keeping.