Machine Learning for Beginners in Python: Feature Selection Using Random Forest

Feature Selection Using Random Forest

Often in data science we have hundreds or even millions of features and we want a way to create a model that only includes the most important ones. This has three benefits. First, we make our model simpler to interpret. Second, we can reduce the variance of the model, and therefore overfitting. Finally, we can reduce the computational cost (and time) of training a model. The process of identifying only the most relevant features is called “feature selection.”

Random forests are often used for feature selection in a data science workflow. The reason is that the tree-based strategies used by random forests naturally rank features by how well they improve the purity of a node, i.e. by the mean decrease in impurity (also called gini importance) computed across all trees. Nodes with the greatest decrease in impurity occur near the start of the trees, while nodes with the least decrease in impurity occur near the leaves. Thus, by pruning trees below a particular node, we can create a subset of the most important features.
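To make this concrete: the forest-level importance reported by scikit-learn is simply the impurity-based importance of each feature averaged over the individual trees. Below is a minimal sketch, assuming the fitted RandomForestClassifier clf that we train later in this tutorial:

import numpy as np

# Each tree exposes its own impurity-based importances; averaging them
# over the forest reproduces clf.feature_importances_ (up to floating point error)
per_tree_importances = [tree.feature_importances_ for tree in clf.estimators_]
print(np.mean(per_tree_importances, axis=0))
print(clf.feature_importances_)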

In this tutorial we will:

  1. Prepare the dataset
  2. Train a random forest classifier
  3. Identify the most important features
  4. Create a new ‘limited featured’ dataset containing only those features
  5. Train a second classifier on this new dataset
  6. Compare the accuracy of the ‘full featured’ classifier to the accuracy of the ‘limited featured’ classifier

 

Note: There are other definitions of feature importance (such as permutation importance, sketched below); however, in this tutorial we limit our discussion to gini importance.
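For example, recent versions of scikit-learn provide permutation importance, which measures how much randomly shuffling a single feature’s values degrades a fitted model’s score. A minimal sketch, assuming the classifier clf and the test split created later in this tutorial:

from sklearn.inspection import permutation_importance

# Shuffle each feature 10 times and record the mean drop in test accuracy
result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)
for name, score in zip(feat_labels, result.importances_mean):
    print(name, score)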

Preliminaries


import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score

Create The Data

The dataset used in this tutorial is the famous iris dataset. The iris data contains 50 samples from each of three species of iris (150 samples in total); the species labels form the target, y, and the four measurements form the feature variables, X.


# Load the iris dataset
iris = datasets.load_iris()

# Create a list of feature names
feat_labels = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width']

# Create X from the features
X = iris.data

# Create y from output
y = iris.target

View The Data


# View the features
X[0:5]
array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2]])

# View the target data
y
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

Split The Data Into Training And Test Sets


# Split the data into 40% test and 60% training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

Train A Random Forest Classifier


# Create a random forest classifier
clf = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)

# Train the classifier
clf.fit(X_train, y_train)

# Print the name and gini importance of each feature
for feature in zip(feat_labels, clf.feature_importances_):
    print(feature)

('Sepal Length', 0.11024282328064565)
('Sepal Width', 0.016255033655398394)
('Petal Length', 0.45028123999239533)
('Petal Width', 0.42322090307156124)

The scores above are the importance scores for each variable. There are two things to note. First, the importance scores are normalized, so they add up to 1.0 (100%). Second, Petal Length and Petal Width are far more important than the other two features. Combined, Petal Length and Petal Width have an importance of ~0.87! Clearly these are the most important features.
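As a quick sanity check, the snippet below verifies that the scores sum to 1.0 and ranks the features from most to least important:

# The importances are normalized, so they sum to 1.0
print(clf.feature_importances_.sum())

# Rank the features from most to least important
for idx in np.argsort(clf.feature_importances_)[::-1]:
    print(feat_labels[idx], clf.feature_importances_[idx])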

Identify And Select Most Important Features


# Create a selector object that will use the random forest classifier to identify
# features that have an importance of more than 0.15
sfm = SelectFromModel(clf, threshold=0.15)

# Train the selector
sfm.fit(X_train, y_train)

SelectFromModel(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10000, n_jobs=-1, oob_score=False, random_state=0,
            verbose=0, warm_start=False),
        prefit=False, threshold=0.15)

# Print the names of the most important features
for feature_list_index in sfm.get_support(indices=True):
    print(feat_labels[feature_list_index])

Petal Length
Petal Width
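For intuition, the same selection can be reproduced by hand with a boolean mask over the importance scores; a minimal sketch that is equivalent here to SelectFromModel with a fixed threshold:

# Boolean mask of features whose importance meets the threshold
mask = clf.feature_importances_ >= 0.15

# The masked labels match the output of sfm.get_support above
print([label for label, keep in zip(feat_labels, mask) if keep])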

Create A Data Subset With Only The Most Important Features


# Transform the data to create a new dataset containing only the most important features
# Note: We have to apply the transform to both the training X and test X data.
X_important_train = sfm.transform(X_train)
X_important_test = sfm.transform(X_test)
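A quick shape check confirms that the transform kept only the two selected columns:

# The original training data has four feature columns; the reduced data has two
print(X_train.shape)            # (90, 4)
print(X_important_train.shape)  # (90, 2)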

Train A New Random Forest Classifier Using Only Most Important Features


# Create a new random forest classifier for the most important features
clf_important = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)

# Train the new classifier on the new dataset containing the most important features
clf_important.fit(X_important_train, y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10000, n_jobs=-1, oob_score=False, random_state=0,
            verbose=0, warm_start=False)

Compare The Accuracy Of Our Full Feature Classifier To Our Limited Feature Classifier


# Apply The Full Featured Classifier To The Test Data
y_pred = clf.predict(X_test)

# View The Accuracy Of Our Full Feature (4 Features) Model
accuracy_score(y_test, y_pred)
0.93333333333333335

# Apply The Limited Featured Classifier To The Test Data
y_important_pred = clf_important.predict(X_important_test)

# View The Accuracy Of Our Limited Feature (2 Features) Model
accuracy_score(y_test, y_important_pred)
0.8833333333333333

As can be seen from the accuracy scores, our original model, which contained all four features, is 93.3% accurate, while our ‘limited’ model, which contained only two features, is 88.3% accurate. Thus, for a small cost in accuracy we halved the number of features in the model.
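Because a single train/test split can be noisy, it is worth confirming this comparison with cross-validation. A minimal sketch using cross_val_score (note that refitting 10,000-tree forests five times can take a while):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy on all four features
print(cross_val_score(clf, X, y, cv=5).mean())

# 5-fold cross-validated accuracy on the two selected features
print(cross_val_score(clf_important, sfm.transform(X), y, cv=5).mean())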

 

