Machine Learning for Beginners in Python: How to Build a Cross Validation Pipeline


Cross Validation Pipeline

The code below does a lot in only a few lines. To help explain things, here are the steps the code performs:

  1. Split the raw data into three folds. Select one for testing and two for training.
  2. Preprocess the data by fitting the scaler on the training features, then scaling both the training and test features with it.
  3. Train a support vector classifier on the training data.
  4. Apply the classifier to the test data.
  5. Record the accuracy score.
  6. Repeat steps 1-5 two more times, so that each fold is used for testing once.
  7. Calculate the mean score for all the folds.
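
The steps above can also be written out by hand, which makes it easier to see what the helper function does later on. This is only a minimal sketch, assuming a recent scikit-learn in which the fold splitters live in sklearn.model_selection. The names splitter, fold_scores and mean_score are illustrative and do not come from the original tutorial.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

fold_scores = []
# Steps 1 and 6: split into three folds; each fold is the test set once
splitter = StratifiedKFold(n_splits=3)
for train_idx, test_idx in splitter.split(X, y):
    # Step 2: fit the scaler on the training features only
    scaler = StandardScaler().fit(X[train_idx])
    # Step 3: train the support vector classifier on the scaled training data
    model = SVC(C=1).fit(scaler.transform(X[train_idx]), y[train_idx])
    # Steps 4 and 5: predict on the scaled test fold and record the accuracy
    fold_scores.append(model.score(scaler.transform(X[test_idx]), y[test_idx]))

# Step 7: mean accuracy over the three folds
mean_score = sum(fold_scores) / len(fold_scores)
print(fold_scores, mean_score)
```

Fitting the scaler inside the loop means the test fold never influences the scaling statistics, which is the main reason to wrap preprocessing and the model in a single pipeline.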


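# Load libraries
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
# sklearn.cross_validation was removed in scikit-learn 0.20;
# cross_val_score now lives in sklearn.model_selection
from sklearn.model_selection import cross_val_score
from sklearn import svm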

Load Data

For this tutorial we will use the famous iris dataset. The iris data contains four measurements of 150 iris flowers and their species. We will use a support vector classifier to predict the species of the iris flowers.

# Load the iris data
iris = load_iris()

# View the iris data features for the first three rows
iris.data[0:3]
array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2]])

# View the iris data target for the first three rows. '0' means the flower is of the setosa species.
iris.target[0:3]
array([0, 0, 0])
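
To see how the integer targets map to species names, the dataset object can be inspected directly. This is a small sketch using attributes that load_iris returns; the variable species_of_first is only an illustrative name.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Number of flowers and number of measurements per flower
print(iris.data.shape)

# Names of the four measurements
print(iris.feature_names)

# Position in this array corresponds to the integer code in iris.target
print(iris.target_names)

# Look up the species name of the first flower from its integer code
species_of_first = iris.target_names[iris.target[0]]
print(species_of_first)
```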

Create Classifier Pipeline

Now we create a pipeline for the data. First, the pipeline preprocesses the data by scaling each feature to zero mean and unit variance. Second, the pipeline trains a support vector classifier on the data with C=1. C is the penalty applied to margin violations. The higher the C, the less tolerant the model is of observations being on the wrong side of the margin or hyperplane.

# Create a pipeline that scales the data then trains a support vector classifier
classifier_pipeline = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
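
To check what the scaling step does on its own, the scaler can be applied to the iris features outside the pipeline. This is a minimal sketch for illustration only; inside cross validation the pipeline fits the scaler on the training folds, not on the full data as done here. The names scaled, col_means and col_stds are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# Fit the scaler on all rows and transform them (illustration only)
scaled = StandardScaler().fit_transform(iris.data)

# After scaling, each column should have mean close to 0 and
# standard deviation close to 1
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)
print(np.round(col_means, 6))
print(np.round(col_stds, 6))
```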

Cross Validation

scikit-learn provides a helper function that makes cross validation easy. Specifically, the code below splits the data into three folds, then fits and scores the classifier pipeline on the iris data once per fold.

Important note from the scikit-learn docs: for integer/None values of cv, if the estimator is a classifier and y is binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.


# Stratified k-fold cross validation with 3 folds, applying the
# classifier pipeline to the feature and target data
# (the import is repeated here so this snippet runs on its own)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(classifier_pipeline, iris.data, iris.target, cv=3)
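
Because the estimator is a classifier and the iris target is multiclass, an integer cv leads to stratified folds. The sketch below, assuming a recent scikit-learn, builds the same kind of splitter explicitly and counts how many flowers of each species land in each test fold. The names folds and class_counts are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

iris = load_iris()
folds = StratifiedKFold(n_splits=3)

class_counts = []
for _, test_idx in folds.split(iris.data, iris.target):
    # Count how many samples of each of the three species are in this test fold
    counts = np.bincount(iris.target[test_idx], minlength=3)
    class_counts.append(counts)
    print(len(test_idx), counts)
```

Each test fold should contain roughly one third of every species, which keeps the accuracy on each fold comparable.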

Evaluate Model

Here is the output of our 3-fold cross validation. Each value is the accuracy score of the support vector classifier when a different fold is held out for testing. There are three values because there are three folds. The higher the accuracy score, the better.

array([ 0.98039216,  0.90196078,  0.97916667])

To get a good measure of the model's accuracy, we calculate the mean of the three scores. This is our measure of model accuracy.
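
The averaging step can be sketched as follows. The block is self-contained so it can run on its own, and it assumes a recent scikit-learn. The exact scores may differ slightly from the array shown above, since fold assignment has changed between scikit-learn versions. The name mean_accuracy is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris()
pipeline = make_pipeline(StandardScaler(), SVC(C=1))

# One accuracy score per fold
scores = cross_val_score(pipeline, iris.data, iris.target, cv=3)

# Mean accuracy across folds, with the spread for context
mean_accuracy = scores.mean()
print(scores)
print(mean_accuracy, scores.std())
```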



