Hot off the press. The 2022 Machine Learning for Health conference was held in Las Vegas last week. This post is about a paper by Google and Stanford researchers presented there.
Model stability. Many machine learning models depend on random parameters (for example, the random seed used for initialization). Models that perform similarly across different values of those parameters are called "stable". Stability is important: if slight differences in the random parameters lead to wild variations in performance, then the model is pretty useless.
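As a quick illustration (my own sketch, not from the paper), one simple way to probe stability is to train the same model on the same data with different random seeds and compare performance. The dataset and model below are stand-ins chosen for brevity:

```python
# Minimal stability probe: same data, same model class, different random seeds.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for seed in (1, 2, 3):
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"seed={seed}: test AUC = {auc:.3f}")
```

If the scores across seeds are close, the model is stable by this (aggregate) measure; the paper's point is that aggregate measures like this can hide instability that matters clinically.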
In healthcare applications, model stability has not been well studied. This paper focuses on stability in the context of risk stratification and selecting high-risk patients for clinical intervention.
Stability metrics. To evaluate stability, we need to define "similar performance". Performance can be evaluated with statistical metrics like AUC ("area under the curve"). But these aren't always the right metrics for real-world applications.
For risk stratification, a more relevant metric for comparing two models is the overlap between the high-risk populations they identify. The researchers define metrics along these lines and show that random variation can lead to models with very similar AUCs, but very different risk-stratification performance.
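To make the idea concrete, here is a hedged sketch of a top-k overlap metric (my own illustration, not the paper's exact definition), assuming each model outputs one risk score per patient and the top k patients are selected for intervention:

```python
# Overlap of the high-risk groups selected by two models' risk scores.
import numpy as np

def top_k_overlap(scores_a: np.ndarray, scores_b: np.ndarray, k: int) -> float:
    """Fraction of the k highest-risk patients flagged by both models."""
    top_a = set(np.argsort(scores_a)[-k:])
    top_b = set(np.argsort(scores_b)[-k:])
    return len(top_a & top_b) / k

# Example with synthetic scores: two highly correlated score vectors can
# still disagree substantially on which patients land in the top k.
rng = np.random.default_rng(0)
scores_a = rng.random(1000)
scores_b = np.clip(scores_a + rng.normal(0, 0.1, 1000), 0, 1)
print(top_k_overlap(scores_a, scores_b, k=50))
```

Two models can rank the overall population almost identically (and so have nearly the same AUC) while still swapping many patients in and out of the small high-risk group that actually receives the intervention.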
The bottom line. Choosing the right metrics for optimization is a recurring theme of this blog. This paper adds a dimension to that: it's important to tie not only performance metrics but also stability metrics to the right clinical context.