How to select features using chi-squared in Python

The Chi-Squared test is a statistical test that can be used to select features for a machine learning model. It tests the independence of two categorical variables by comparing the observed frequencies of the variables to the expected frequencies if they were independent.
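As a rough illustration of that comparison, the sketch below computes the Chi-Squared statistic by hand for a small contingency table. The counts are made up for the example, and the variable names are illustrative only.

```python
import numpy as np

# Hypothetical 2x2 contingency table (made-up counts):
# rows = two values of a categorical feature, columns = two target classes.
observed = np.array([[30.0, 10.0],
                     [20.0, 40.0]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
total = observed.sum()

# Frequencies we would expect if the feature and the target were independent.
expected = row_totals @ col_totals / total

# Chi-Squared statistic: sum of (observed - expected)^2 / expected over all cells.
chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(expected)
print(chi2_stat)
```

The larger the statistic, the further the observed counts are from what independence would predict.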

In Python, Chi-Squared feature selection can be performed using the SelectKBest class together with the chi2 scoring function, both from the sklearn.feature_selection module. Here are the steps to select features using Chi-Squared in Python:

  1. Import the necessary classes. You will need to have scikit-learn and NumPy installed.
from sklearn.feature_selection import SelectKBest, chi2
  2. Prepare your data. Put your feature data in one 2-D array/matrix and your target data in a 1-D array/vector. The chi2 scoring function requires non-negative feature values, such as counts, frequencies, or one-hot encoded categories.
X = your_feature_matrix
y = your_target_vector
  3. Create a SelectKBest object, passing chi2 as the scoring function and k as the number of features you want to select.
selector = SelectKBest(chi2, k=10)
  4. Fit the selector to your data. This will calculate the Chi-Squared test statistic for each feature.
selector.fit(X, y)
  5. Get the Chi-Squared test statistic and p-value for each feature using the scores_ and pvalues_ attributes of the fitted selector object.
chi2_scores = selector.scores_
p_values = selector.pvalues_
  6. Optionally, choose a p-value threshold that suits your requirements (e.g. 0.05) and keep the features whose p-values are below it. This may return more or fewer than k features, and the indexing below assumes X is a NumPy array.
mask = p_values < 0.05
significant_features = X[:, mask]
  7. To get exactly the top k features specified in step 3, ranked by Chi-Squared score, transform the data with the fitted selector.
top_k_features = selector.transform(X)
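The steps above can be sketched end to end as follows. This is a minimal example rather than a definitive recipe: the iris dataset and k=2 are assumptions chosen only because the data ships with scikit-learn and all of its features are non-negative.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Assumption: iris stands in for "your_feature_matrix" and "your_target_vector".
X, y = load_iris(return_X_y=True)  # X has shape (150, 4), all values >= 0

# Keep the 2 features with the highest Chi-Squared scores.
selector = SelectKBest(chi2, k=2)
selector.fit(X, y)

chi2_scores = selector.scores_
p_values = selector.pvalues_

# Top-k selection: exactly k columns, chosen by score.
top_k_features = selector.transform(X)
top_k_indices = selector.get_support(indices=True)

# Threshold selection: however many features have p < 0.05.
mask = p_values < 0.05
significant_features = X[:, mask]

print("scores:", chi2_scores)
print("p-values:", p_values)
print("top-k column indices:", top_k_indices)
print("significant feature count:", int(mask.sum()))
```

Note that the two selections answer different questions: transform always returns k columns, while the p-value mask returns whichever columns pass the threshold.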


The Chi-Squared test is most often used to test the relationship between categorical variables, but it can also be used to test the relationship between categorical and continuous variables, provided the continuous variables are discretized first. It is easy to apply and its results are easy to interpret. However, the test assumes that the samples are independent and that the expected frequencies are large enough for the test statistic to approximately follow a chi-squared distribution. It is not suitable for all cases: its results are less reliable with small sample sizes, and the chi2 scoring function in scikit-learn only accepts non-negative feature values.
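One possible way to discretize continuous features before scoring them is sketched below. The synthetic data, the choice of three quantile bins, and the use of KBinsDiscretizer with one-hot output are all assumptions made for illustration; other binning schemes may suit your data better.

```python
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic data (hypothetical): one feature related to the target, one pure noise.
rng = np.random.default_rng(0)
n_samples = 300
y = rng.integers(0, 2, size=n_samples)
informative = y + rng.normal(scale=0.5, size=n_samples)
noise = rng.normal(size=n_samples)
X_continuous = np.column_stack([informative, noise])

# Discretize each continuous feature into 3 quantile bins and one-hot encode
# the bins, so every column is a non-negative 0/1 indicator that chi2 accepts.
binner = KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="quantile")
X_binned = binner.fit_transform(X_continuous)  # 2 features x 3 bins = 6 columns

# Columns 0-2 come from the informative feature, columns 3-5 from the noise.
scores, p_values = chi2(X_binned, y)
print("scores:", scores)
print("p-values:", p_values)
```

Each original feature now contributes several indicator columns, so a score belongs to a single bin rather than to the whole feature.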


