Machine Learning for Beginners in Python: Random Forest Classifier Example

Random Forest Classifier Example

This tutorial is based on Yhat’s 2013 tutorial on Random Forests in Python. If you want a good summary of the theory and uses of random forests, I suggest you check out their guide. In the tutorial below, I annotate, correct, and expand on a short code example of random forests they present at the end of the article. Specifically, I 1) update the code so it runs in the latest version of pandas and Python, 2) write detailed comments explaining what is happening in each step, and 3) expand the code in a number of ways.

Let’s get started!

A Note About The Data

The data for this tutorial is famous. Called, the iris dataset, it contains four variables measuring various parts of iris flowers of three related species, and then a fourth variable with the species name. The reason it is so famous in machine learning and statistics communities is because the data requires very little preprocessing (i.e. no missing values, all features are floating numbers, etc.).

Preliminaries


/* Load the library with the iris dataset */
from sklearn.datasets import load_iris

/* Load scikit's random forest classifier library */
from sklearn.ensemble import RandomForestClassifier

/* Load pandas */
import pandas as pd

/* Load numpy */
import numpy as np

/* Set random seed */
np.random.seed(0)

Load Data


/* Create an object called iris with the iris data */
iris = load_iris()

/* Create a dataframe with the four feature variables */
df = pd.DataFrame(iris.data, columns=iris.feature_names)

/* View the top 5 rows */
df.head()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2

/* Add a new column with the species names, this is what we are going to try to predict */
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

/* View the top 5 rows */
df.head()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

Create Training And Test Data


/* Create a new column that for each row, generates a random number between 0 and 1, and
   if that value is less than or equal to .75, then sets the value of that cell as True
   and false otherwise. This is a quick and dirty way of randomly assigning some rows to
   be used as the training data and some as the test data. */
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75

/* View the top 5 rows */
df.head()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) species is_train
0 5.1 3.5 1.4 0.2 setosa True
1 4.9 3.0 1.4 0.2 setosa True
2 4.7 3.2 1.3 0.2 setosa True
3 4.6 3.1 1.5 0.2 setosa True
4 5.0 3.6 1.4 0.2 setosa True

/* Create two new dataframes, one with the training rows, one with the test rows */
train, test = df[df['is_train']==True], df[df['is_train']==False]

/* Show the number of observations for the test and training dataframes */
print('Number of observations in the training data:', len(train))
print('Number of observations in the test data:',len(test))

Number of observations in the training data: 118
Number of observations in the test data: 32

Preprocess Data


/* Create a list of the feature column's names */
features = df.columns[:4]

/* View features */
features
Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)'],
      dtype='object')

/* train['species'] contains the actual species names. Before we can use it,
   we need to convert each species name into a digit. So, in this case there
   are three species, which have been coded as 0, 1, or 2. */
y = pd.factorize(train['species'])[0]

/* View target */
y
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2])

Train The Random Forest Classifier


/* Create a random forest Classifier. By convention, clf means 'Classifier' */
clf = RandomForestClassifier(n_jobs=2, random_state=0)

/* Train the Classifier to take the training features and learn how they relate
   to the training y (the species) */
clf.fit(train[features], y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=2, oob_score=False, random_state=0,
            verbose=0, warm_start=False)

Huzzah! We have done it! We have officially trained our random forest Classifier! Now let’s play with it. The Classifier model itself is stored in the clf variable.

Apply Classifier To Test Data

If you have been following along, you will know we only trained our classifier on part of the data, leaving the rest out. This is, in my humble opinion, the most important part of machine learning. Why? Because by leaving out a portion of the data, we have a set of data to test the accuracy of our model!

Let’s do that now.


/* Apply the Classifier we trained to the test data (which, remember, it has never seen before) */
clf.predict(test[features])
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 1, 1, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2])

What are you looking at above? Remember that we coded each of the three species of plant as 0, 1, or 2. What the list of numbers above is showing you is what species our model predicts each plant is based on the the sepal length, sepal width, petal length, and petal width. How confident is the classifier about each plant? We can see that too.


/* View the predicted probabilities of the first 10 observations */
clf.predict_proba(test[features])[0:10]
array([[ 1. ,  0. ,  0. ],
       [ 1. ,  0. ,  0. ],
       [ 1. ,  0. ,  0. ],
       [ 1. ,  0. ,  0. ],
       [ 1. ,  0. ,  0. ],
       [ 1. ,  0. ,  0. ],
       [ 1. ,  0. ,  0. ],
       [ 0.9,  0.1,  0. ],
       [ 1. ,  0. ,  0. ],
       [ 1. ,  0. ,  0. ]])

There are three species of plant, thus [ 1. , 0. , 0. ] tells us that the classifier is certain that the plant is the first class. Taking another example, [ 0.9, 0.1, 0. ] tells us that the classifier gives a 90% probability the plant belongs to the first class and a 10% probability the plant belongs to the second class. Because 90 is greater than 10, the classifier predicts the plant is the first class.

Evaluate Classifier

Now that we have predicted the species of all plants in the test data, we can compare our predicted species with the that plant’s actual species.


/* Create actual english names for the plants for each predicted plant class */
preds = iris.target_names[clf.predict(test[features])]

/* View the PREDICTED species for the first five observations */
preds[0:5]
array(['setosa', 'setosa', 'setosa', 'setosa', 'setosa'],
      dtype='<U10')

/* View the ACTUAL species for the first five observations */
test['species'].head()

7     setosa
8     setosa
10    setosa
13    setosa
17    setosa
Name: species, dtype: category
Categories (3, object): [setosa, versicolor, virginica]

That looks pretty good! At least for the first five observations. Now let’s use look at all the data.

Create a confusion matrix

confusion matrix can be, no pun intended, a little confusing to interpret at first, but it is actually very straightforward. The columns are the species we predicted for the test data and the rows are the actual species for the test data. So, if we take the top row, we can wee that we predicted all 13 setosa plants in the test data perfectly. However, in the next row, we predicted 5 of the versicolor plants correctly, but mis-predicted two of the versicolor plants as virginica.

The short explanation of how to interpret a confusion matrix is: anything on the diagonal was classified correctly and anything off the diagonal was classified incorrectly.


/* Create confusion matrix */
pd.crosstab(test['species'], preds, rownames=['Actual Species'], colnames=['Predicted Species'])

Predicted Species setosa versicolor virginica
Actual Species
setosa 13 0 0
versicolor 0 5 2
virginica 0 0 12

View Feature Importance

While we don’t get regression coefficients like with OLS, we do get a score telling us how important each feature was in classifying. This is one of the most powerful parts of random forests, because we can clearly see that petal width was more important in classification than sepal width.


/* View a list of the features and their importance scores */
list(zip(train[features], clf.feature_importances_))
[('sepal length (cm)', 0.11185992930506346),
 ('sepal width (cm)', 0.016341813006098178),
 ('petal length (cm)', 0.36439533040889194),
 ('petal width (cm)', 0.5074029272799464)]

 

Python Example for Beginners

Two Machine Learning Fields

There are two sides to machine learning:

  • Practical Machine Learning:This is about querying databases, cleaning data, writing scripts to transform data and gluing algorithm and libraries together and writing custom code to squeeze reliable answers from data to satisfy difficult and ill defined questions. It’s the mess of reality.
  • Theoretical Machine Learning: This is about math and abstraction and idealized scenarios and limits and beauty and informing what is possible. It is a whole lot neater and cleaner and removed from the mess of reality.

Data Science Resources: Data Science Recipes and Applied Machine Learning Recipes

Introduction to Applied Machine Learning & Data Science for Beginners, Business Analysts, Students, Researchers and Freelancers with Python & R Codes @ Western Australian Center for Applied Machine Learning & Data Science (WACAMLDS) !!!

Latest end-to-end Learn by Coding Recipes in Project-Based Learning:

Applied Statistics with R for Beginners and Business Professionals

Data Science and Machine Learning Projects in Python: Tabular Data Analytics

Data Science and Machine Learning Projects in R: Tabular Data Analytics

Python Machine Learning & Data Science Recipes: Learn by Coding

R Machine Learning & Data Science Recipes: Learn by Coding

Comparing Different Machine Learning Algorithms in Python for Classification (FREE)

Disclaimer: The information and code presented within this recipe/tutorial is only for educational and coaching purposes for beginners and developers. Anyone can practice and apply the recipe/tutorial presented here, but the reader is taking full responsibility for his/her actions. The author (content curator) of this recipe (code / program) has made every effort to ensure the accuracy of the information was correct at time of publication. The author (content curator) does not assume and hereby disclaims any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from accident, negligence, or any other cause. The information presented here could also be found in public knowledge domains.  

Google –> SETScholars

A list of Python, R and SQL Codes for Applied Machine Learning and Data  Science at https://setscholars.net/Learn by Coding Categories:

  1. Classification: https://setscholars.net/category/classification/
  2. Data Analytics: https://setscholars.net/category/data-analytics/
  3. Data Science: https://setscholars.net/category/data-science/
  4. Data Visualisation: https://setscholars.net/category/data-visualisation/
  5. Machine Learning Recipe: https://setscholars.net/category/machine-learning-recipe/
  6. Pandas: https://setscholars.net/category/pandas/
  7. Python: https://setscholars.net/category/python/
  8. SKLEARN: https://setscholars.net/category/sklearn/
  9. Supervised Learning: https://setscholars.net/category/supervised-learning/
  10. Tabular Data Analytics: https://setscholars.net/category/tabular-data-analytics/
  11. End-to-End Data Science Recipes: https://setscholars.net/category/a-star-data-science-recipe/
  12. Applied Statistics: https://setscholars.net/category/applied-statistics/
  13. Bagging Ensemble: https://setscholars.net/category/bagging-ensemble/
  14. Boosting Ensemble: https://setscholars.net/category/boosting-ensemble/
  15. CatBoost: https://setscholars.net/category/catboost/
  16. Clustering: https://setscholars.net/category/clustering/
  17. Data Analytics: https://setscholars.net/category/data-analytics/
  18. Data Science: https://setscholars.net/category/data-science/
  19. Data Visualisation: https://setscholars.net/category/data-visualisation/
  20. Decision Tree: https://setscholars.net/category/decision-tree/
  21. LightGBM: https://setscholars.net/category/lightgbm/
  22. Machine Learning Recipe: https://setscholars.net/category/machine-learning-recipe/
  23. Multi-Class Classification: https://setscholars.net/category/multi-class-classification/
  24. Neural Networks: https://setscholars.net/category/neural-networks/
  25. Python Machine Learning: https://setscholars.net/category/python-machine-learning/
  26. Python Machine Learning Crash Course: https://setscholars.net/category/python-machine-learning-crash-course/
  27. R Classification: https://setscholars.net/category/r-classification/
  28. R for Beginners: https://setscholars.net/category/r-for-beginners/
  29. R for Business Analytics: https://setscholars.net/category/r-for-business-analytics/
  30. R for Data Science: https://setscholars.net/category/r-for-data-science/
  31. R for Data Visualisation: https://setscholars.net/category/r-for-data-visualisation/
  32. R for Excel Users: https://setscholars.net/category/r-for-excel-users/
  33. R Machine Learning: https://setscholars.net/category/r-machine-learning/
  34. R Machine Learning Crash Course: https://setscholars.net/category/r-machine-learning-crash-course/
  35. R Regression: https://setscholars.net/category/r-regression/
  36. Regression: https://setscholars.net/category/regression/
  37. XGBOOST: https://setscholars.net/category/xgboost/
  38. Excel examples for beginners: https://setscholars.net/category/excel-examples-for-beginners/
  39. C Programming tutorials & examples: https://setscholars.net/category/c-programming-tutorials/
  40. Javascript tutorials & examples: https://setscholars.net/category/javascript-tutorials-and-examples/
  41. Python tutorials & examples: https://setscholars.net/category/python-tutorials/
  42. R tutorials & examples: https://setscholars.net/category/r-for-beginners/
  43. SQL tutorials & examples: https://setscholars.net/category/sql-tutorials-for-business-analyst/

 

( FREE downloadable Mathematics Worksheet for Kids )