Generalization Mindsets: What's a good model?

During a seminar in my Master's program, I presented the famous paper "Statistical Modeling: The Two Cultures" by Leo Breiman [1]. The professor, a statistician through and through, rejected the idea of machine learning in general. He said something like: you are doomed to simply learn all the "errors" in the data. I objected, pointing out that the usual practice in machine learning is to regularize the model using validation data and to assess the performance on additional test data, hence reducing overfitting effects. Well, we didn't agree in the end, and to this day I (mostly) object to his statement. At the time, I believed that he was referring to overfitting. But today I think it might have been a deeper, more philosophical objection to machine learning. I was trained as a statistician for many years, worked as a statistician on many projects, and published papers. I have also acquired machine learning skills, participated (with moderate success) in machine learning competitions, deployed machine learning models in products, and published papers on machine learning model interpretation. It took me years of working with different modeling cultures to formulate these distinctions in what I would call generalization mindsets.

The disagreement between statisticians and machine learners is NOT about which models should be used (like linear models vs. deep learning). Statisticians and machine learners disagree on the question: What does it mean for a model to generalize? The following discussion is not meant to erect artificial borders between disciplines, but to show that the data modeling world needs to adopt a blend of mindsets.

Generalization Mindset in Statistics

At the heart of classical statistical modeling is thinking and making assumptions about the data-generating process and encoding this knowledge into the model.

Let's walk through a very simplified version of classical statistical modeling: You might start with a linear model. The outcome is a count (like the number of fish caught in a day). Let's assume a Poisson distribution, which requires a link function and puts the model into the context of generalized linear models (GLMs). The effect of temperature might not be linear but better captured with a smooth effect estimator, so we move to a generalized additive model (GAM). Maybe there is an interaction between the bait used and the time of day? Let's add an interaction term. These modifications continue until the modeler thinks the data-generating process is reflected well enough, judged both by domain expertise and by data-driven evaluations. How is this model evaluated? Usually by looking at the model deviance or the AIC, which measure how the model fares against the background of its distributional assumptions. To compare two models, for example one where we included the "bait" feature and one where we didn't, we use a likelihood-ratio test. All these evaluations are in-sample, using the same data that were used to estimate the parameters (something that would cause nausea for any machine learner).
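To make this concrete, here is a minimal sketch of that in-sample workflow in Python with statsmodels, using hypothetical, simulated fishing data (the GAM step is omitted): fit a Poisson GLM with and without the bait feature, then compare the two via deviance, AIC, and a likelihood-ratio test, all on the training data.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

# Hypothetical fishing data, simulated so the example runs.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "temperature": rng.normal(18, 5, n),
    "bait": rng.choice(["worm", "lure"], n),
})
rate = np.exp(0.3 + 0.04 * df["temperature"] + 0.4 * (df["bait"] == "worm"))
df["fish_caught"] = rng.poisson(rate)

# Poisson GLM with a log link connecting the linear predictor to the counts.
m0 = smf.glm("fish_caught ~ temperature", data=df,
             family=sm.families.Poisson()).fit()
m1 = smf.glm("fish_caught ~ temperature + bait", data=df,
             family=sm.families.Poisson()).fit()

# In-sample evaluation: coefficients with confidence intervals, deviance, AIC.
print(m1.summary())
print("deviance:", m0.deviance, m1.deviance)
print("AIC:", m0.aic, m1.aic)

# Likelihood-ratio test for the model with vs. without the bait feature.
lr = 2 * (m1.llf - m0.llf)
p_value = stats.chi2.sf(lr, df=1)  # one extra parameter
print("LR test p-value:", p_value)
```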

What does that process tell us about the generalization mindset of statisticians? The entire modeling process tightly integrates knowledge about the data-generating process. A good statistician works closely with a person who knows the data inside out. The way the models are built allows model parameters to be compared with domain knowledge, so errors or inadequate modeling choices can be spotted. Throughout the process, the flexibility of the model is tightly controlled: only the minimum of flexibility is granted in the fitting process. For example, interactions between two features are only allowed when the statistician chooses to include them. Besides the modeling process itself, the statistician judges whether we can assume that the data are a representative sample from whatever we want to study. To the statistician, the degree of representativeness is important for judging how generalizable the results are to the data-generating process.

For the statistician, generalization requires: representative sample + model structure based on the data-generating process + distributional match + interpretable results that allow feedback

For the statistician, the question is not "How well will my model generalize to new data?", but rather "Which conclusions can be drawn from the model (and the data sample) about aspects of the data-generating process?". Such an aspect can be the treatment effect of a medication, or whether customers are more likely to buy a product when they are younger.

The structural assumptions (like linearity) allow us to map coefficients to individual features, and the additional distributional assumptions allow us to calculate confidence intervals to extend the interpretation to the data-generating process.

The disadvantage of an approach focused on in-sample evaluation is that the model might not perform well at all on a new sample of data. After years of being trained as a statistician, it took me some time to accept out-of-sample evaluation as a way to choose a model. Let's have a closer look at this particular generalization mindset:

Generalization Mindset in Machine Learning

For the machine learner, generalization requires good performance on test data.

In machine learning, it usually does not matter that much which model you use, as long as you can show that it performs well on unseen test data. Simple, yet elegant. Isn't that exactly what you want? A model that generalizes to other data from the same distribution? When I learned about this generalization mindset – measuring model generalization as performance on unseen data – my mind was blown. And after I had adopted this mindset, I found it difficult to accept a different mindset of generalization for some time.
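In code, this mindset is disarmingly simple. A minimal sketch with scikit-learn on synthetic data: hold out a test set and let the test metric be the only judge of the model.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Synthetic data; the point is the workflow, not the dataset.
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# The only question asked here: how well does it do on unseen data?
print("test MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```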

When we compare the generalization mindset of machine learners to statisticians, we see that the entire emphasis on the data-generating process has gone out the window (exaggerating here a bit, e.g., see Inductive Bias). But not all is well in "good-performance-on-test-data"-country.

While I still believe that test performance is a good tool to quantify generalization, it can fail in the dumbest ways. There are many examples, like classifying images based on watermarks [2], wrongly classifying dogs as wolves because of snow in the background [3], or predicting a better pneumonia outcome for asthma patients [4]. And, of course, there is the big issue of adversarial attacks, where a change to even one pixel can derail an image classifier. It's convenient to only look at the test performance. But this can come at the price of ignoring the data-generating process, not talking enough with the people who understand the data, and not making sure that the model doesn't behave in a stupid way.

I once worked part-time for a machine learning startup that built a document classifier. With my background in statistics, I asked many questions, like how well our training documents would reflect the later, actual distribution of documents. My supervisor at the time sometimes told me that he valued that I asked these questions. The questions were mostly about the data-generating process, and I didn't understand why they would be valuable, because I thought it very natural to think about them. But this type of thinking came from thinking like a statistician, while many team members had an engineering background. Today, I often see papers or products using machine learning where the creators are seemingly blinded by the test performance and would have profited immensely from the generalization mindset of a statistician.

Generalization Mindset in Causality

Causal inference aims to identify the causal effect a component has in a system (very broadly speaking). Causality is a topic avoided by many statisticians, at least where I did my Bachelor's and Master's degrees. Causality was only mentioned on two occasions: "correlation is not causation" and the instruction to include all the confounders in our regression models. Typical introductions to machine learning do not mention causality either. The prevalent mentality is to use any information that helps the prediction, because, remember, for the machine learner generalization is performance on unseen data.

The generalization mindset in causality builds on assumptions about causal structures. Like the statistician, the causality person (is there a better word?) is focused on the data-generating process. But where the statistician thinks more associatively, like "What does the distribution of my target look like given all these features?", practitioners of causal inference make causal assumptions and restrict their models accordingly. This focus on causality also leads to a natural generalization mindset: if you modeled the correct causal relationships, your model can be generalized to other situations.

To summarize, generalization in causality requires assumptions about causal structures + model restrictions based on those structures.
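A minimal sketch of what such restrictions mean in practice, using simulated data with an assumed confounder: if the causal structure says a variable Z drives both the treatment X and the outcome Y, the model must adjust for Z, and the "use whatever predicts" attitude gives a biased effect estimate.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5000
z = rng.normal(size=n)                       # confounder
x = 0.8 * z + rng.normal(size=n)             # treatment, driven by z
y = 1.0 * x + 2.0 * z + rng.normal(size=n)   # true causal effect of x is 1.0

# Naive regression of y on x alone: a biased estimate of the causal effect.
naive = sm.OLS(y, sm.add_constant(x)).fit()

# Adjusting for the confounder, as the assumed causal graph demands,
# recovers an estimate close to the true effect of 1.0.
adjusted = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()

print("naive effect estimate:   ", naive.params[1])
print("adjusted effect estimate:", adjusted.params[1])
```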

Choose Your Flavor

In some ways these mindsets are archetypes, and individual practitioners hold a blend of them. But at the aggregate level, say, a scientific journal, a platform for machine learning competitions, or a book, you can find these mindsets in their pure form. Especially the mindset of good test performance acts as a repellent for other mindsets. The measurability of generalization performance with a single metric transformed modeling into a mindless game of high scores. Popular machine learning platforms with leaderboards are the most vivid proof of that.

Another offspring of modeling as a game is how we measure progress in machine learning research: scientific progress has been made when model B outperforms model A on task XYZ on test data. KPIs, popularized by risk-avoiding, cover-my-own-ass, dull-minded middle managers, have the unfortunate property of introducing an absolute focus on them and overruling common sense. KPIs have crept into how we evaluate scientists and their impact (journal impact factor, h-index, ...) but also into how we evaluate the merit of new methods in machine learning. It is simple to publish a new machine learning model that somehow beats the status quo. But it is more difficult to argue that your method is better for other reasons. Machine learning competition platforms take the KPI idea to its final conclusion and only pay out the people with the best-performing model, judged by a single metric, regardless of whether the model makes any sense based on causal reasoning, how interpretable it is, how long it takes to train, and so on. Other fields are also quite restrictive: some medical journals will give you a funny look when you show up with machine learning but don't provide any p-values.

Blending It All Together

Each mindset has its strengths and blind spots. None of the mindsets is the ONE correct generalization mindset. And if you uncritically follow just one mindset, you lose out. The best generalization mindset is the one that blends many together.

Did you fit a generalized regression model? See how well it fares on unseen data and compare it with something like a random forest. Recognize the price you pay (in predictive performance) for the structural assumptions of the generalized linear regression model. Try to draw the causal structures using a directed acyclic graph, ideally together with a domain expert. Did you include all the variables necessary to model the causal relationship? A sketch of the first step follows below.
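Here is a minimal sketch of that comparison on hypothetical count data with scikit-learn: a Poisson regression (structural and distributional assumptions) against a random forest (flexible, few assumptions), both judged on held-out data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import PoissonRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Simulated count data so the example runs end to end.
rng = np.random.default_rng(2)
n = 2000
X = rng.normal(size=(n, 3))
y = rng.poisson(np.exp(0.2 + 0.5 * X[:, 0] - 0.3 * X[:, 1]))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

glm = PoissonRegressor().fit(X_train, y_train)                         # structured
forest = RandomForestRegressor(random_state=0).fit(X_train, y_train)  # flexible

# The price (or gain) of structural assumptions, measured out of sample.
print("GLM test MAE:   ", mean_absolute_error(y_test, glm.predict(X_test)))
print("forest test MAE:", mean_absolute_error(y_test, forest.predict(X_test)))
```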

Are you working on a machine learning model that is going to be deployed in production? Put on the statistician's hat and think about the data-generating process. How would you expect the data to be distributed? Would you expect some features to have a linear effect on the outcome? How much does the performance drop when you impose some restrictions on the model, and how much more robust does the model become in return? Do the features used causally influence the target in reality? If not, this might make the model vulnerable.
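One way to act on this, sketched below with scikit-learn on synthetic data: impose a restriction that encodes an (assumed) piece of domain knowledge, here a monotonicity constraint on one feature of a gradient-boosted model, and measure what it costs in test performance.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Synthetic data where the outcome truly increases with the first feature.
rng = np.random.default_rng(3)
X = rng.normal(size=(3000, 4))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.5, size=3000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

free = HistGradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Hypothetical domain knowledge: the outcome may only increase with feature 0.
constrained = HistGradientBoostingRegressor(
    monotonic_cst=[1, 0, 0, 0], random_state=0).fit(X_train, y_train)

print("unconstrained MAE:", mean_absolute_error(y_test, free.predict(X_test)))
print("constrained MAE:  ", mean_absolute_error(y_test, constrained.predict(X_test)))
```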

[1] Breiman, Leo. "Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author)." Statistical Science 16.3 (2001): 199-231.

[2] Lapuschkin, Sebastian, et al. "Unmasking Clever Hans predictors and assessing what machines really learn." Nature Communications 10.1 (2019): 1-8.

[3] Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "'Why should I trust you?' Explaining the predictions of any classifier." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016.

[4] Caruana, Rich, et al. "Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission." Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2015.