Decoding the Art of Evaluating Machine Learning Algorithms: A Comprehensive Guide to Choosing the Right Test Options
In the world of machine learning, evaluating algorithms is a critical step that can significantly influence the effectiveness of your models. The test options you choose when evaluating machine learning algorithms can mean the difference between overfitting, a mediocre result, and a usable, state-of-the-art result that you can confidently shout from the rooftops. This guide aims to demystify the process of choosing the right test options when evaluating machine learning algorithms.
The Role of Randomness in Machine Learning
The root of the difficulty in choosing the right test options lies in randomness. Most machine learning algorithms use randomness in some way. The randomness may be explicit in the algorithm, or it may be in the sample of the data selected to train the algorithm. This does not mean that the algorithms produce random results, but rather that they produce results with some noise or variance. We call this kind of limited variance stochastic, and the algorithms that exploit it stochastic algorithms.
Training and Testing on the Same Data
One might be tempted to train a model on a dataset and then report the results of the model on that same dataset. While this approach will indeed provide an indication of the performance of the algorithm on the dataset, it does not provide any indication of how the algorithm will perform on data that the model was not trained on, also known as unseen data. This is a crucial consideration if you plan to use the model to make predictions on unseen data.
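To see why this is misleading, consider a small sketch in plain Python (the dataset and the 1-nearest-neighbor model here are made up for illustration): evaluated on its own training data, the model scores perfect accuracy, because every instance is its own nearest neighbor, yet this says nothing about unseen data.

```python
import random

# A tiny made-up dataset: 1-D points with binary labels.
rng = random.Random(1)
data = [(rng.random(), rng.choice([0, 1])) for _ in range(20)]

def predict_1nn(train, x):
    # Predict the label of the single closest training point.
    return min(train, key=lambda p: abs(p[0] - x))[1]

# Evaluate on the SAME data used for "training".
correct = sum(predict_1nn(data, x) == y for x, y in data)
accuracy = correct / len(data)
print(accuracy)  # 1.0 -- every instance is its own nearest neighbor
```

A perfect score here is an artifact of the evaluation setup, not evidence of a good model.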
Split Tests
A simple way to use one dataset both to train the algorithm and to estimate its performance on unseen data is to split the dataset. For example, you could randomly select 66% of the instances for training and use the remaining 34% as a test dataset. The algorithm is run on the training dataset, and the resulting model is assessed on the test dataset, yielding a performance measure such as classification accuracy.
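A minimal sketch of such a 66%/34% split, using only the Python standard library (the list of integers stands in for real data instances):

```python
import random

dataset = list(range(100))  # stand-in indices for 100 data instances

# Shuffle once, then take the first 66% for training and the rest for testing.
random.Random(42).shuffle(dataset)
cut = round(0.66 * len(dataset))
train, test = dataset[:cut], dataset[cut:]

print(len(train), len(test))  # 66 34
```

Because the split is done on shuffled indices, no instance appears in both the training and the test portion.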
Split tests are fast and great when you have a lot of data, or when training a model is expensive in terms of compute resources or time. A split test on a very large dataset can produce an accurate estimate of the actual performance of the algorithm. However, a problem with split tests is that if we split the dataset again into a different 66%/34% split, we would get a different result from our algorithm. This is known as model variance.
Multiple Split Tests
A solution to the problem of model variance in split tests is to reduce the variance of the random process by repeating it many times. We can collect the results from a fair number of runs (say 10) and take the average. This approach, however, may still leave some data instances never selected for training or testing, while others are selected multiple times. This can skew results and may not give a meaningful idea of the accuracy of the algorithm.
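One way to sketch repeated split tests in plain Python, using a deliberately simple majority-class "model" on a made-up label set so the run-to-run variation is easy to see:

```python
import random
from statistics import mean

# 100 made-up labeled instances: roughly 70% belong to class 1.
labels = [1] * 70 + [0] * 30

def split_test_accuracy(labels, seed):
    # One random 66%/34% split; the "model" simply predicts the
    # majority class observed in the training portion.
    idx = list(range(len(labels)))
    random.Random(seed).shuffle(idx)
    cut = round(0.66 * len(idx))
    train, test = idx[:cut], idx[cut:]
    majority = 1 if sum(labels[i] for i in train) * 2 >= len(train) else 0
    return sum(labels[i] == majority for i in test) / len(test)

# Ten runs with ten different random seeds, then the averaged estimate.
scores = [split_test_accuracy(labels, seed) for seed in range(10)]
print([round(s, 2) for s in scores])  # the individual scores differ
print(round(mean(scores), 3))        # the averaged estimate
```

The individual scores vary from seed to seed; the average is a steadier estimate, but nothing guarantees that every instance was ever tested.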
Cross Validation
Cross validation, specifically k-fold cross validation, solves the problem of ensuring each instance is used for training and testing an equal number of times, while reducing the variance of the accuracy score. For example, let’s choose a value of k=10. This splits the dataset into 10 parts (10 folds), and the algorithm is run 10 times. On each run, the algorithm is trained on 90% of the data and tested on the remaining 10%, and each run changes which 10% of the data the algorithm is tested on.
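A from-scratch sketch of k-fold splitting in plain Python (a real project would typically use a library routine, but the mechanics are the same):

```python
import random

def kfold_indices(n, k, seed=7):
    # Shuffle the indices once, then deal them into k (nearly) equal folds.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(100, 10)
for test_fold in folds:
    train_idx = [j for f in folds if f is not test_fold for j in f]
    # Train on 90 instances, test on the held-out 10.
    assert len(train_idx) == 90 and len(test_fold) == 10

# Every instance lands in exactly one test fold.
tested = sorted(j for f in folds for j in f)
print(tested == list(range(100)))  # True
```

Unlike repeated random splits, this construction guarantees that each instance is tested exactly once and used for training exactly k-1 times.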
The k-fold cross-validation method is the go-to method for evaluating the performance of an algorithm on a dataset. However, cross-validation does not account for variance in the algorithm’s predictions. Another point of concern is that cross-validation itself uses randomness to decide how to split the dataset into k folds. Cross-validation does not estimate how the algorithm performs with different sets of folds.
Multiple Cross Validation
To account for the variance in the algorithm itself, one can run cross-validation multiple times and take the mean and the standard deviation of the algorithm accuracy from each run. This will give an estimate of the performance of the algorithm on the dataset and an estimation of how robust the performance is.
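A sketch of repeated cross validation, assuming a toy 1-D dataset and a 1-nearest-neighbor model (both made up for illustration); each run reshuffles the folds with a different seed, and the spread of the run scores shows how robust the estimate is:

```python
import random
from statistics import mean, stdev

# A made-up 1-D dataset: label is 1 when x > 0.5, with 15% label noise.
rng = random.Random(3)
data = []
for _ in range(100):
    x = rng.random()
    label = int(x > 0.5)
    if rng.random() < 0.15:
        label = 1 - label  # flip some labels to add noise
    data.append((x, label))

def predict_1nn(train, x):
    # Predict the label of the closest training point in 1-D.
    return min(train, key=lambda p: abs(p[0] - x))[1]

def cv_accuracy(data, k=10, seed=0):
    # One run of k-fold cross validation; the seed controls the fold split.
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    fold_scores = []
    for test_fold in folds:
        train = [data[j] for f in folds if f is not test_fold for j in f]
        hits = sum(predict_1nn(train, data[j][0]) == data[j][1]
                   for j in test_fold)
        fold_scores.append(hits / len(test_fold))
    return mean(fold_scores)

# Repeat cross validation with different shuffles to gauge robustness.
runs = [cv_accuracy(data, seed=s) for s in range(10)]
print(round(mean(runs), 3), round(stdev(runs), 3))
```

The mean summarizes performance; the standard deviation across runs summarizes how sensitive that estimate is to the particular choice of folds.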
When comparing algorithm performance measures from multiple runs of k-fold cross validation, one can use statistical significance tests, such as the Student’s t-test. These tests give meaning to the differences (or lack thereof) between algorithm results. This is useful when you want to make accurate claims about results (for example, that algorithm A was better than algorithm B and the difference was statistically significant).
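A sketch of a paired Student’s t-test on two sets of matched cross-validation scores, using only the standard library. The scores below are made up for illustration, and the critical value is the standard tabulated two-tailed value for 9 degrees of freedom at alpha = 0.05:

```python
from math import sqrt
from statistics import mean, stdev

# Illustrative (made-up) accuracy scores from 10 matched runs of k-fold CV.
scores_a = [0.81, 0.79, 0.84, 0.80, 0.82, 0.78, 0.83, 0.80, 0.81, 0.82]
scores_b = [0.76, 0.77, 0.79, 0.75, 0.78, 0.74, 0.77, 0.76, 0.75, 0.78]

# Paired t-test: work on the per-run differences.
diffs = [a - b for a, b in zip(scores_a, scores_b)]
t_stat = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Two-tailed critical value for df = 9 at alpha = 0.05 (from standard t tables).
t_crit = 2.262
print(abs(t_stat) > t_crit)  # True here: the difference is significant
```

In practice, scipy.stats.ttest_rel performs the same paired test and also reports a p-value directly.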
In this post, you have discovered the difference between the main test options available to you when designing a test harness to evaluate machine learning algorithms. Specifically, you learned the utility and problems with:
1. Training and testing on the same dataset
2. Split tests
3. Multiple split tests
4. Cross validation
5. Multiple cross validation
6. Statistical significance
When in doubt, use k-fold cross validation (k=10) and use multiple runs of k-fold cross validation with statistical significance tests when you want to meaningfully compare algorithms on your dataset.
Relevant Prompts for Evaluating Machine Learning Algorithms
Here are some prompts that you can use when evaluating machine learning algorithms:
1. “What is the performance of the algorithm when trained and tested on the same dataset?”
2. “How does the performance of the algorithm change with different splits in a split test?”
3. “What is the average performance of the algorithm over multiple split tests?”
4. “How does the performance of the algorithm change with different folds in k-fold cross validation?”
5. “What is the average performance of the algorithm over multiple runs of k-fold cross validation?”
6. “How does the performance of the algorithm change with different random number seeds in multiple runs of k-fold cross validation?”
7. “Is the difference in performance between algorithm A and algorithm B statistically significant?”
8. “How does the conclusion change with different significance levels (alpha values) in statistical significance tests?”
9. “What is the impact of model variance on the performance of the algorithm?”
10. “How does the performance of the algorithm change with different test options?”
In conclusion, choosing the right test options when evaluating machine learning algorithms is a critical step that can significantly influence the effectiveness of your models. By understanding the various test options and their implications, you can make more informed decisions and ultimately build more effective machine learning models.