A review of “Underspecification Presents Challenges for Credibility in Modern Machine Learning” by D’Amour et al.

https://arxiv.org/abs/2011.03395  0/
Main claim: ML pipelines exhibit “underspecification”; i.e., they can produce a range of fitted models that all give similar iid test-set performance but very different performance on “stress tests”. 1/
These stress tests measure performance on subpopulations and performance after data shift. The body of the paper catalogues a long list of systems that exhibit these underspecification phenomena, and the authors describe interesting stress tests for each of them. 2/
However, it is no surprise that a finite data set combined with an over-parameterized or non-identifiable model class yields a diversity of fitted models. That diversity is at the heart of Bayesian and ensemble methods. 3/
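
To make this concrete, here is a toy sketch (my own construction, not from the paper): x1 is a causal feature, x2 is spuriously correlated with the label in training, and models that differ only in their random seed reach similar iid test accuracy but can diverge on a stress set that breaks the correlation.

```python
# Sketch of underspecification on synthetic data (my construction, not the
# paper's): x1 is causal, x2 is spuriously correlated with y in training.
import numpy as np
from sklearn.neural_network import MLPClassifier

def make_data(n, flip, rng):
    y = rng.integers(0, 2, n)
    x1 = y + rng.normal(0, 1.0, n)                 # causal feature
    p = 0.05 if flip else 0.95                     # spurious correlation strength
    x2 = np.where(rng.random(n) < p, y, 1 - y) + rng.normal(0, 0.1, n)
    return np.column_stack([x1, x2]), y

rng = np.random.default_rng(0)
X_tr, y_tr = make_data(2000, flip=False, rng=rng)
X_te, y_te = make_data(2000, flip=False, rng=rng)  # iid test set
X_sh, y_sh = make_data(2000, flip=True, rng=rng)   # stress set: correlation broken

models = []
for seed in range(5):
    m = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500,
                      random_state=seed).fit(X_tr, y_tr)
    models.append(m)
    print(f"seed={seed}  iid={m.score(X_te, y_te):.3f}  "
          f"shifted={m.score(X_sh, y_sh):.3f}")
```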
ML has standard techniques for eliminating this diversity, including (a) computing the ensemble average, (b) regularization/shrinkage to a prior, (c) seeking flat minima, and (d) imposing invariants. These all improve iid generalization, and they would eliminate “underspecification”. 4/
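
Technique (a), for instance, is a one-liner once the ensemble exists. This sketch reuses the `models`, `X_te`, and `y_te` from the snippet above; the note on (b) uses sklearn's L2 penalty parameter.

```python
# Technique (a): average the ensemble's predicted probabilities.
# Reuses `models`, `X_te`, `y_te` from the previous sketch.
import numpy as np

avg_proba = np.mean([m.predict_proba(X_te) for m in models], axis=0)
print("ensemble iid accuracy:", np.mean(avg_proba.argmax(axis=1) == y_te))

# Technique (b) in sklearn amounts to raising the L2 penalty,
# e.g. MLPClassifier(alpha=1.0, ...).
```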
The authors cite but don’t discuss these standard methods. Instead, their real concern is not to eliminate variability but to attain good performance after data shift. They view this as a problem of selecting a good model from the ensemble of fitted models. 5/
They suggest that ML engineers should think carefully about potential data shifts and design stress tests that exercise similar shifts. These can then be applied to select good models from the fitted model ensemble. 6/
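
A sketch of that selection step, again reusing the ensemble and stress set from above (the screening threshold is illustrative, my choice, not the paper's):

```python
# Select from the fitted ensemble: keep models that pass iid screening,
# then pick the one that does best on the stress set.
candidates = [m for m in models if m.score(X_te, y_te) >= 0.8]
best = max(candidates, key=lambda m: m.score(X_sh, y_sh))
print("selected model shifted accuracy:", best.score(X_sh, y_sh))
```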
I agree that good stress tests are important, but they are not sufficient, because they don’t provide an efficient way to ensure that the fitted models will pass them. 7/
I think it would be better to identify the invariants that are implicit in those tests and enforce those during learning. We have some ways of doing this, but much more research is needed. 8/
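
Continuing the toy example, one crude way to do this: if the stress test reveals that predictions should be invariant to x2, enforce that invariance directly, e.g. by augmenting the training set with x2 randomly permuted (a simple stand-in for more principled approaches such as invariant risk minimization).

```python
# Enforce the invariant "predictions must not depend on x2" during learning
# by augmentation: duplicate the training set with x2 randomly permuted.
# Reuses X_tr, y_tr, X_te, y_te, X_sh, y_sh from the first sketch.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng2 = np.random.default_rng(1)
X_aug = X_tr.copy()
X_aug[:, 1] = rng2.permutation(X_aug[:, 1])        # break x2's link to y
clf_inv = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
clf_inv.fit(np.vstack([X_tr, X_aug]), np.concatenate([y_tr, y_tr]))
print(f"invariant model  iid={clf_inv.score(X_te, y_te):.3f}  "
      f"shifted={clf_inv.score(X_sh, y_sh):.3f}")
```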
I was surprised that the article didn’t mention another drawback of underspecification. Elsewhere, Google folks have described their MLOps tools for automatic retraining, deployment, and monitoring (with automatic rollback if new models are found to be defective). 9/
Clearly underspecification makes automatic retraining risky, and we need ways to constrain the retraining process to limit defects. Stress tests would also help detect retraining failures. 10/
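
A minimal sketch of what such a guardrail might look like; every name here is a hypothetical placeholder, not Google's actual tooling.

```python
# Hypothetical stress-test gate for an automatic retraining pipeline.
# train_fn, iid_eval, stress_tests, and the thresholds are all placeholders.
def retrain_with_guardrails(train_fn, iid_eval, stress_tests,
                            current_model, min_iid=0.95, min_stress=0.80):
    candidate = train_fn()
    if iid_eval(candidate) < min_iid:
        return current_model                  # rollback: iid regression
    for name, test in stress_tests.items():
        if test(candidate) < min_stress:
            print(f"rollback: candidate failed stress test {name!r}")
            return current_model              # rollback: stress-test failure
    return candidate                          # safe to deploy
```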
Summary: Underspecification is easy to fix, but it isn't the real problem. Rather, the issue is poor performance after data shift. This is indeed a central challenge for machine learning, and this paper presents a great catalogue of examples of the problem! end/
You can follow @tdietterich.