Machine Learning for Beginners in Python: How to do Cross Validation With Parameter Tuning Using Grid Search

Cross Validation With Parameter Tuning Using Grid Search

In machine learning, two tasks are commonly done at the same time in data pipelines: cross validation and (hyper)parameter tuning. Cross validation is the process of training a learner on one set of data and testing it on a different set. Parameter tuning is the process of selecting the values for a model’s parameters that maximize the accuracy of the model.
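As a rough illustration of cross validation on its own, before any tuning, the sketch below scores a support vector classifier with its default parameters on five train/test splits. The choice of five folds and of the digits data is an assumption made for this example only.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

# Load a small built-in dataset (the same digits data used later in this tutorial)
digits = datasets.load_digits()

# A support vector classifier with default parameters; nothing is tuned here
model = svm.SVC()

# 5-fold cross validation: train on four folds, score on the held-out fold, repeat
scores = cross_val_score(model, digits.data, digits.target, cv=5)

print('Accuracy per fold:', scores)
print('Mean accuracy:', scores.mean())
```

Each entry in scores comes from a model that never saw the fold it was tested on, which is what makes the mean a fairer estimate than accuracy on the training data.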

In this tutorial we work through an example which combines cross validation and parameter tuning using scikit-learn.


import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn import datasets, svm
import matplotlib.pyplot as plt

Create Two Datasets

In the code below, we load the digits dataset, which contains 64 feature variables. Each feature denotes the darkness of a pixel in an 8 by 8 image of a handwritten digit. We can see these features for the first observation:

# Load the digit data
digits = datasets.load_digits()

# View the features of the first observation
digits.data[0:1]
array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.,   0.,   0.,  13.,
         15.,  10.,  15.,   5.,   0.,   0.,   3.,  15.,   2.,   0.,  11.,
          8.,   0.,   0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.,   0.,
          5.,   8.,   0.,   0.,   9.,   8.,   0.,   0.,   4.,  11.,   0.,
          1.,  12.,   7.,   0.,   0.,   2.,  14.,   5.,  10.,  12.,   0.,
          0.,   0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]])
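To make the link between the 64 features and the 8 by 8 image concrete, the following sketch reshapes the first observation back into a grid. The variable name first_image is introduced here for illustration.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Reshape the 64 features of the first observation back into an 8 by 8 grid
first_image = digits.data[0].reshape(8, 8)
print(first_image)

# scikit-learn also ships the unflattened images; the two should match
print(np.array_equal(first_image, digits.images[0]))
```

Read row by row, the larger values trace the outline of the handwritten digit.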

The target data is a vector containing the image’s true digit. For example, the first observation is a handwritten digit for ‘0’.

# View the target of the first observation
digits.target[0:1]

To demonstrate cross validation and parameter tuning, first we are going to divide the digit data into two datasets called data1 and data2. data1 contains the first 1000 rows of the digits data, while data2 contains the remaining 797 rows. Note that this split is separate from the cross validation we will conduct and is done purely to demonstrate something at the end of the tutorial. In other words, don’t worry about data2 for now; we will come back to it.

# Create dataset 1
data1_features = digits.data[:1000]
data1_target = digits.target[:1000]

# Create dataset 2
data2_features = digits.data[1000:]
data2_target = digits.target[1000:]
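A quick way to confirm the split is to print the shape of each piece. This is a small self-contained check, assuming the slices are taken from digits.data and digits.target as described above.

```python
from sklearn import datasets

digits = datasets.load_digits()

# First 1000 rows go to data1, the rest to data2
data1_features = digits.data[:1000]
data1_target = digits.target[:1000]
data2_features = digits.data[1000:]
data2_target = digits.target[1000:]

# The digits dataset has 1797 rows and 64 features
print(data1_features.shape)  # (1000, 64)
print(data2_features.shape)  # (797, 64)
```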

Create Parameter Candidates

Before looking for which combination of parameter values produces the most accurate model, we must specify the different candidate values we want to try. In the code below we have a number of candidate parameter values, including four different values for C (1, 10, 100, 1000), two values for gamma (0.001, 0.0001), and two kernels (linear, rbf). Because gamma is only tried with the rbf kernel, the grid pairs the linear kernel with the four values of C and the rbf kernel with every combination of C and gamma, for 12 candidates in total. The grid search will try all of these combinations and select the set of parameters which provides the most accurate model.

parameter_candidates = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
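To see exactly which combinations the grid search would try, the candidates can be expanded with scikit-learn's ParameterGrid helper. This is a sketch for inspection only; GridSearchCV does not require this step.

```python
from sklearn.model_selection import ParameterGrid

parameter_candidates = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

# Expand the list of dictionaries into individual parameter settings
grid = ParameterGrid(parameter_candidates)

# 4 linear settings + 4 * 2 rbf settings = 12
print(len(grid))

for params in grid:
    print(params)
```

Each dictionary in the list is expanded separately, which is why gamma never appears alongside the linear kernel.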

Conduct Grid Search To Find Parameters Producing Highest Score

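Now we are ready to conduct the grid search using scikit-learn’s GridSearchCV, which stands for grid search cross validation. When no cv argument is given, GridSearchCV uses StratifiedKFold if the estimator is a classifier with a binary or multiclass target, and KFold otherwise. The default number of folds was 3 in the scikit-learn version that produced the output below; since version 0.22 it is 5.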
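Since the default number of folds has changed across scikit-learn releases, one option is to pass the cv argument explicitly. The sketch below does so with a 3-fold StratifiedKFold and a deliberately small grid to keep it quick; the names X, y, param_grid and search are introduced here for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, StratifiedKFold

digits = datasets.load_digits()
X, y = digits.data[:1000], digits.target[:1000]

# A reduced grid, used only to keep this example fast
param_grid = {'C': [1, 10], 'kernel': ['linear']}

# Fix the splitting strategy instead of relying on the library default
cv = StratifiedKFold(n_splits=3)

search = GridSearchCV(estimator=svm.SVC(), param_grid=param_grid, cv=cv)
search.fit(X, y)

# Number of cross validation splits actually used
print(search.n_splits_)
```

Stating the folds explicitly makes results easier to compare between machines running different scikit-learn versions.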

# Create a classifier object with the classifier and parameter candidates
clf = GridSearchCV(estimator=svm.SVC(), param_grid=parameter_candidates, n_jobs=-1)

# Train the classifier on data1's feature and target data
clf.fit(data1_features, data1_target)

GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid=[{'kernel': ['linear'], 'C': [1, 10, 100, 1000]}, {'kernel': ['rbf'], 'gamma': [0.001, 0.0001], 'C': [1, 10, 100, 1000]}],
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

Success! We have our results! First, let’s look at the best score, which is the mean cross-validated accuracy of the best parameter combination on data1’s held-out folds.

# View the accuracy score
print('Best score for data1:', clf.best_score_)
Best score for data1: 0.942
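The best score is only one entry in the search results. The following sketch lists the mean cross-validated score for every candidate through the cv_results_ attribute. It relies on the library's default number of folds, so the exact numbers depend on the scikit-learn version and none are shown here.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()
data1_features = digits.data[:1000]
data1_target = digits.target[:1000]

parameter_candidates = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

clf = GridSearchCV(estimator=svm.SVC(), param_grid=parameter_candidates)
clf.fit(data1_features, data1_target)

# cv_results_ is a dictionary of arrays with one entry per candidate
results = clf.cv_results_
for params, mean_score in zip(results['params'], results['mean_test_score']):
    print(params, round(mean_score, 3))
```

The largest value in the mean_test_score column is the one reported as best_score_.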

Which parameters are the best? We can tell scikit-learn to display them:

# View the best parameters for the model found using grid search
print('Best C:', clf.best_estimator_.C)
print('Best Kernel:', clf.best_estimator_.kernel)
print('Best Gamma:', clf.best_estimator_.gamma)

Best C: 10
Best Kernel: rbf
Best Gamma: 0.001

This tells us that the most accurate model uses C=10, the rbf kernel, and gamma=0.001.

Sanity Check Using Second Dataset

Remember the second dataset we created? Now we will use it to confirm that those parameters are actually the ones used by the model. First, we apply the classifier we just trained to the second dataset. Then we train a new support vector classifier from scratch using the parameters found by the grid search. Because GridSearchCV refits the best parameter combination on all of data1 once the search finishes (refit=True), we should get the same results for both models.

# Apply the classifier trained using data1 to data2, and view the accuracy score
clf.score(data2_features, data2_target)

# Train a new classifier using the best parameters found by the grid search
svm.SVC(C=10, kernel='rbf', gamma=0.001).fit(data1_features, data1_target).score(data2_features, data2_target)
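The whole sanity check can also be run as one self-contained script. In this sketch the fresh classifier is built from clf.best_params_ rather than hard-coded values, so the comparison holds even if a different scikit-learn version selects different parameters. The names grid_score, fresh and fresh_score are introduced here for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()
data1_features, data1_target = digits.data[:1000], digits.target[:1000]
data2_features, data2_target = digits.data[1000:], digits.target[1000:]

parameter_candidates = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

# Grid search on data1; the best candidate is refit on all of data1 (refit=True)
clf = GridSearchCV(estimator=svm.SVC(), param_grid=parameter_candidates)
clf.fit(data1_features, data1_target)

# Accuracy of the refit best model on data2
grid_score = clf.score(data2_features, data2_target)

# Train a fresh classifier with whatever parameters the search selected
fresh = svm.SVC(**clf.best_params_).fit(data1_features, data1_target)
fresh_score = fresh.score(data2_features, data2_target)

print('Best parameters:', clf.best_params_)
print('Grid search model on data2:', grid_score)
print('Fresh model on data2:', fresh_score)
```

The two scores agree because both models are support vector classifiers with identical parameters trained on identical data.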

