How to select features using chi-squared in Python
The Chi-Squared test is a statistical test that can be used to select features for a machine learning model. It tests the independence of two categorical variables by comparing the observed frequencies of the variables to the expected frequencies if they were independent.
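To make the comparison of observed and expected frequencies concrete, here is a minimal sketch that computes the textbook Pearson chi-squared statistic by hand with NumPy. The 2x2 table of counts is made up purely for illustration. scikit-learn's chi2 scorer computes a related per-feature statistic from class-wise feature sums, so treat this as an illustration of the idea rather than a reproduction of the library's internals.

```python
import numpy as np

# Hypothetical contingency table (illustrative counts only):
# rows are categories of one feature, columns are target classes.
observed = np.array([[30.0, 10.0],
                     [20.0, 40.0]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
total = observed.sum()

# Counts we would expect in each cell if feature and target were independent.
expected = row_totals @ col_totals / total

# Pearson chi-squared statistic: sum over cells of (O - E)^2 / E.
chi2_stat = ((observed - expected) ** 2 / expected).sum()

print(expected)   # [[20. 20.] [30. 30.]]
print(chi2_stat)  # about 16.67
```

The larger the statistic, the further the observed counts are from what independence would predict, and the stronger the evidence that the feature is related to the target.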
In Python, Chi-Squared feature selection can be performed using the SelectKBest class together with the chi2 scoring function, both from the sklearn.feature_selection module. Here are the steps to select features using Chi-Squared in Python:
- Import the necessary libraries. You will need to have scikit-learn and numpy installed.
from sklearn.feature_selection import SelectKBest, chi2
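- Prepare your data. You will need to have your feature data in one array/matrix and target data in another array/vector. The chi2 scorer requires non-negative feature values, such as counts or one-hot encoded categories.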
X = your_feature_matrix
y = your_target_vector
- Create a SelectKBest object. You can specify the number of features you want to select.
selector = SelectKBest(chi2, k=10)
- Fit the SelectKBest model to your data. This will calculate the Chi-Squared test statistic for each feature.
selector.fit(X, y)
- Get the Chi-Squared test statistic and p-values for each feature using the scores_ and pvalues_ attributes of the selector object.
chi2_scores = selector.scores_
p_values = selector.pvalues_
- Choose a p-value threshold to suit your requirements (e.g. 0.05), then select the features whose p-values are below it.
mask = p_values < 0.05
selected_features = X[:, mask]
- After following these steps you will have the features whose p-values fall below the threshold. Note that this is a separate criterion from k: to keep the top k features set when creating the SelectKBest object, call selector.transform(X) instead.
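The steps above can be put together in one runnable sketch. The iris dataset bundled with scikit-learn and the choice of k=2 are assumptions made only so the example runs on its own; substitute your own X, y and k.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Assumed example data: 150 samples, 4 non-negative numeric features.
X, y = load_iris(return_X_y=True)

# Score every feature against the target with the chi-squared statistic.
selector = SelectKBest(chi2, k=2)
selector.fit(X, y)

chi2_scores = selector.scores_
p_values = selector.pvalues_

# Option 1: keep the k highest-scoring features.
X_top_k = selector.transform(X)
top_k_indices = selector.get_support(indices=True)

# Option 2: keep every feature whose p-value is below a chosen threshold.
mask = p_values < 0.05
X_significant = X[:, mask]

print("chi2 scores:", np.round(chi2_scores, 2))
print("top-k feature indices:", top_k_indices)
print("features with p < 0.05:", np.flatnonzero(mask))
```

The two options answer different questions: k fixes how many features you keep, while the p-value threshold lets the data decide how many pass, so the two selections need not contain the same number of columns.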
The Chi-Squared test is most often used to test the relationship between categorical variables, but it can also be applied to a categorical target and continuous features, provided the continuous features are discretized first. It is easy to apply and its results are easy to interpret. However, the test assumes that the samples are independent and that the expected frequencies are large enough for the test statistic to be well approximated by a chi-squared distribution. It may therefore not be suitable for all cases, and its results can be unreliable with small sample sizes or sparse categories.
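As a rough sketch of the discretization idea, the example below bins two synthetic continuous features with scikit-learn's KBinsDiscretizer and then scores the one-hot encoded bins with chi2. The synthetic data, the choice of 4 uniform-width bins, and the idea of summing bin scores per original feature are all assumptions made for illustration, not a prescribed recipe.

```python
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic data (illustrative only): feature 0 depends on the class,
# feature 1 is pure noise. Both contain negative values, which chi2
# would reject if they were passed in directly.
rng = np.random.default_rng(0)
n_samples = 500
y = rng.integers(0, 2, size=n_samples)
informative = 2.0 * y + rng.normal(size=n_samples)
noise = rng.normal(size=n_samples)
X = np.column_stack([informative, noise])

# Discretize each feature into 4 equal-width bins, one-hot encoded,
# so that every column becomes a non-negative 0/1 indicator.
n_bins = 4
binner = KBinsDiscretizer(n_bins=n_bins, encode="onehot-dense", strategy="uniform")
X_binned = binner.fit_transform(X)

# chi2 returns one score and one p-value per binned column.
bin_scores, bin_p_values = chi2(X_binned, y)

# Sum the bin scores belonging to each original feature to compare features.
per_feature_score = bin_scores.reshape(X.shape[1], n_bins).sum(axis=1)
print(per_feature_score)
```

The number of bins is a tuning choice: too few bins can hide a relationship, while too many leave some bins with very few samples, which is exactly the small-count situation where the test becomes unreliable.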