Variance of the Generalization Error
After a good day's work, you want to judge what you created. You forged a machine learning model, and the generalization error will tell you how well your model performs. The generalization error estimate such as the mean squared error (MSE) tells you how well the model will work on fresh data (which, nonetheless, has to come from the same distribution as the training data).
This is quite simply! Leave a part of your data out of the training and tuning process and compute the MSE or another generalization error measure on this untouched data set. To get more stable results, you can repeat the split into training data and evaluation data mutliple times with resampling schemes such as cross-validation and bootstrapping.
If you repeat splitting+training+evaluation 10-times, you get 10 estimates of the generalization error per model (). And these estimates are simply averaged per model .
Let's assume you compare a random forest, a support vector machine and a linear model. The MSE results (that I totally made up) are:
Which one is the best model? The random forest would be the obvious first choice. But this table does not allow you to draw this conclusion.
A Coin Toss
A simple example why this conclusion might be wrong. Let's toss some coins. Three fair coins which I call A, B and C. We repeat it 10 times and count how often each coin shows "tails". You might already know where this is going. Here are the results:
Coin A is the clear winner here with 5/10 tails and the others only have 3/10 tails.
Objection! A coin is random, so this is purely noise that we are interpreting! Yes, exactly. The difference between the coin and the ML models is that for the coin we know that it's purely noise. For the ML benchmark it will be a mixture of noise and real differences. The solution is to quantify the uncertainty caused by the noise.
Statistics to the Rescue
Statistics offers hypothesis tests and confidence intervals to quantify uncertainty. To compute a confidence interval for the 10 generalization error estimates, we could – in a first naive approach – compute the variance across the 10 estimates per model. We know that the generalization error is a mean estimate with unknown variance (which we have to estimate), therefore we can use the Student t-distribution with 9 degrees of freedom (10 - 1 = 9) to construct tests and confidence intervals.
More Complicated Than Anticipated
BUT, there is a big problem with this variance + t-distribution approach. When we simply compute the variance, we silently assume that the MSE estimates are independent of each other. But they are certainly not. The estimate are correlated because the training data for the models overlap and also the test-data can overlap when using boostrap or subsampling. Also the training and test data are not drawn independently. Once you have drawn a sample as the training data, it is already determined which data points will be your test data.
So, when we ignore the correlation, we will underestimate the true variance of our MSE estimates. It turns out that there is no simple, exact solution to the problem. Of course, solutions have been proposed to tackle this issue. I found the approach by Nadeau and Bengio [1,2] the most convincing. They offer a simple, but approximate solution.
This One Simple Trick: Adjusting the Variance
They propose to estimate the variance as you normally would for uncorrelated data: . In the case of no correlation, this would mean that the variance of the mean estimate is: .
But instead of multiplying the variance estimate with , Nadeau and Bengio [1,2] propose to multiply the variance with the correction term , where are the number of data points in the training data and are the number of data points in the test data.
This correction increases the variance and should approximately acount for the correlation. Keep in mind that this is just a gross approximation and might still deliver too conservative or too liberal variance estimates. To arrive at such a simple correction term, the authors made the assumption that the correlation between the models depends only on the number of observations that they share, and not on the specific observations that they share.
How often should you refit the model and get a new estimate? Nadeau and Bengio propose to use 15 estimates to compute the adjusted variance, based on simulation results.
Other papers propose specific resampling schemes. Dietterich  proposed to use 2-fold cross-validation repeated 5 times with specific estimators. Hothorn et. al  built their testing framework on bootstrap theory.
A nice paper to understand the source of uncertainty in benchmarking is .b
 Nadeau, C., & Bengio, Y. (2003). Inference for the generalization error. Machine learning, 52(3), 239-281. PDF
 Technical Report of  PDF
 Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural computation, 10(7), 1895-1923. PDF
 Hothorn, T., Leisch, F., Zeileis, A., & Hornik, K. (2005). The design and analysis of benchmark experiments. Journal of Computational and Graphical Statistics, 14(3), 675-699. PDF
 Bouthillier, X., Delaunay, P., Bronzi, M., Trofimov, A., Nichyporuk, B., Szeto, J., ... & Vincent, P. (2021). Accounting for Variance in Machine Learning Benchmarks. Proceedings of Machine Learning and Systems, 3. PDF