Stability metrics for risk scoring

Hot off the press. The 2022 Machine Learning for Health conference was held in Las Vegas last week. This post is about a paper by Google and Stanford researchers presented there.

Model stability. Many machine learning models depend on random parameters. Models that perform similarly under different values of those random parameters are called "stable". Stability is important: if slight differences in the random parameters lead to wild variations in performance, then the model is pretty useless.

In healthcare applications, model stability has not been well studied. This paper focuses on stability in the context of risk stratification and selecting high-risk patients for clinical intervention.

Stability metrics. To evaluate stability, we need to define "similar performance". Performance can be evaluated with statistical metrics like AUC (“area under the curve”). But these aren't always the right metrics for real-world applications.

For risk stratification, a more relevant metric for comparing two models is the overlap between the high-risk populations they identify. The researchers define metrics along these lines and show that random variation can lead to models with very similar AUCs, but very different risk-stratification performance.
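
To make this concrete, here is a minimal sketch in Python of what such a stability check could look like (not the paper's code or its exact metric, and the data are synthetic): train the same model twice with different random seeds, then compare both the AUCs and the overlap between the top-100 patients each run flags as high risk.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic "patients": 5,000 rows, roughly 5% positive outcome.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

def high_risk_list(seed, k=100):
    """Train with a given random seed; return the top-k highest-risk patients and the AUC."""
    model = RandomForestClassifier(n_estimators=50, random_state=seed).fit(X_tr, y_tr)
    scores = model.predict_proba(X_te)[:, 1]
    return set(np.argsort(scores)[-k:]), roc_auc_score(y_te, scores)

top_a, auc_a = high_risk_list(seed=1)
top_b, auc_b = high_risk_list(seed=2)

print(f"AUC (run A): {auc_a:.3f}   AUC (run B): {auc_b:.3f}")
print(f"Overlap of the two top-100 high-risk lists: {len(top_a & top_b) / 100:.0%}")
```

On real EHR data, two such runs can produce nearly identical AUCs while flagging noticeably different patient lists, which is exactly the gap the paper's metrics are designed to expose.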

The bottom line. Choosing the right metrics for optimization is a recurring theme of this blog. This paper adds a dimension to that: it's important to tie not only performance metrics but also stability metrics to the right clinical context.

Using AI to understand physician decision-making

Most health AI papers are about new ways to build better predictive models. But this post highlights a different kind of work: it's about new ways to use them. In this study, AI models are a tool to probe and understand human decision-making. (A summary also appeared in JAMA.)

Testing for a heart attack is complex. The researchers focus on testing patterns for acute coronary syndrome (heart attack) in the emergency department (ED). The decision to test is not straightforward. Many heart attacks don't involve stereotypical symptoms like chest pain. And definitive diagnosis is invasive and expensive, because it often requires catheterization.

Using AI to probe human decision-making. The researchers build an AI model to predict the risk of heart attack from EHR data. The technical methodology is standard, but that is not the main point. Rather, the idea is to compare model predictions to actual testing patterns and use that comparison to detect both undertesting and overtesting.
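
As a rough illustration of the idea (this is not the paper's methodology; the data and column names below are invented), one could bucket ED visits by model-predicted risk and compare observed testing rates across the buckets:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Made-up stand-ins: the model's predicted risk per ED visit, and whether
# the patient was actually tested.
predicted_risk = rng.beta(1, 9, size=n)
was_tested = rng.binomial(1, 0.3 + 0.4 * (predicted_risk > 0.2))

visits = pd.DataFrame({"predicted_risk": predicted_risk, "was_tested": was_tested})
visits["risk_decile"] = pd.qcut(visits["predicted_risk"], 10, labels=False)

# Testing rate by predicted-risk decile: ideally it should climb steeply with risk.
testing_rate = visits.groupby("risk_decile")["was_tested"].mean()
print(testing_rate)
# Low rates in the top deciles point to undertesting; high rates in the
# bottom deciles point to overtesting.
```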

Inefficiency is not just overtesting. It turns out that undertesting is also a big problem. For example, the authors show that patients with chest pain tend to be overtested, while patients without it are undertested. The way to improve the system is, of course, to test the right patients.

A framework for insight generation. The paper takes an economic rather than a clinical outlook. But it offers a useful and approachable framework for gaining clinical, operational, and financial insight. It is relevant for any healthcare organization trying to identify areas for improvement, especially in value-based settings.

Clinical phenotype beyond coding

Today: a new machine learning approach called "Guided MixEHR" for extracting clinical characteristics that are not properly captured by a simple code, including many common chronic conditions. (h/t to my friend Jacob Oppenheim from EQRx for pointing me to this research.)


Codes are not enough. How do you determine clinical characteristics from healthcare data (claims or EHR)? Many conditions, like frailty, are not captured by a single code. Even common chronic conditions, like COPD, require complex rules for accurate determination, for example to avoid counting rule-out diagnosis codes. Extracting this information accurately and at scale can help analytics teams in many use cases, from basic reporting to generating inputs into machine learning models.  
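
To see why a single code is not enough, here is a toy, rule-based definition of COPD in Python. The rule, thresholds, and column names are purely illustrative, not the paper’s method or a validated algorithm:

```python
import pandas as pd

# Toy claims table; dx code J44.9 is COPD, and rule_out marks "rule out COPD" visits.
claims = pd.DataFrame({
    "patient_id":   [1, 1, 2, 2, 3],
    "dx_code":      ["J44.9"] * 5,
    "setting":      ["outpatient", "outpatient", "outpatient", "outpatient", "inpatient"],
    "service_date": pd.to_datetime(["2023-01-05", "2023-03-10", "2023-02-01",
                                    "2023-02-01", "2023-04-20"]),
    "rule_out":     [False, False, True, False, False],
})

confirmed = claims[~claims["rule_out"]]

# Illustrative rule: two outpatient COPD diagnoses on different dates, or any inpatient one.
outpatient_dates = (confirmed[confirmed["setting"] == "outpatient"]
                    .groupby("patient_id")["service_date"].nunique())
inpatient_patients = confirmed.loc[confirmed["setting"] == "inpatient", "patient_id"].unique()

has_copd = set(outpatient_dates[outpatient_dates >= 2].index) | set(inpatient_patients)
print(sorted(has_copd))   # -> [1, 3]
```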


Oldie but goodie. This paper uses classical methods from topic modeling, thinking of clinical characteristics as “topics” pertaining to each patient (“classical” roughly means more than 10 years old, in this case about 20). It’s pretty mathy, but the code is available online.
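
For data science readers, here is roughly what the classical version of the idea looks like using plain LDA (latent Dirichlet allocation) from scikit-learn, a simpler cousin of the paper’s Guided MixEHR, run on made-up data:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
n_patients, n_codes = 500, 200

# Stand-in for a patient-by-code count matrix built from claims or EHR data.
patient_code_counts = rng.poisson(0.2, size=(n_patients, n_codes))

lda = LatentDirichletAllocation(n_components=10, random_state=0)
patient_topics = lda.fit_transform(patient_code_counts)         # each patient's mix of 10 latent "phenotypes"
top_codes_per_topic = lda.components_.argsort(axis=1)[:, -10:]  # the 10 most characteristic codes per topic

print(patient_topics.shape)        # (500, 10)
print(top_codes_per_topic.shape)   # (10, 10)
```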
 

What about deep learning? Most deep learning (aka neural network) methods don’t work well on structured data, and there are no high-quality off-the-shelf deep learning tools that perform well in this case. And, while a theme of this newsletter is that deep learning is now within reach for many data science teams, as a rule you should not build your own deep learning models from scratch; fine-tune existing ones instead. Building from scratch takes too much data, compute power, and specialized expertise. 


A good middle ground. This paper is significantly more sophisticated than standard models that are easy to build in-house, but it requires much less data and compute power than deep learning. And it performs well, though it was only tested on public datasets from two locations. So, as is usually the case, expect most of the work to be on data processing and validation.    

Revisiting longitudinal acute kidney injury prediction

A blast from the past. Today, I am revisiting a 2019 paper about using deep learning to predict acute kidney injury (AKI) among hospitalized patients from structured EHR data (I summarized it when it was published). At the time, the model significantly outperformed any other results in the field, and I was curious how things have changed since then.
 

A bevy of models. Predicting AKI is an important use case, and a new review assesses the performance of 46 recent models. While they use data from different hospitals and are not directly comparable, the 2019 study remains a top performer. (Practical point: reviews tend to be accessible and are useful for getting a sense of how difficult a problem is and what performance to expect.)
 

Ask “when”, not just “what”. The main innovation of the 2019 paper was designing a method that used data longitudinally, taking into account not just which diagnoses, procedures, and lab values are recorded, but also when. The importance of longitudinality has been demonstrated in many healthcare use cases in recent years, but most standard off-the-shelf models don’t handle longitudinal data well.
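
Here is a tiny sketch, with made-up events and column names, of the difference between a flattened “what” representation and a longitudinal “what and when” one:

```python
import pandas as pd

# Made-up event table: one row per recorded event, with a timestamp.
events = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime(["2023-01-01 08:00", "2023-01-01 14:00",
                                 "2023-01-02 09:00", "2023-01-03 10:00",
                                 "2023-01-03 22:00"]),
    "event": ["creatinine_1.1", "vancomycin_given", "creatinine_1.8",
              "creatinine_0.9", "ibuprofen_given"],
})

# "What": a bag of events per patient -- order and timing are thrown away.
bag_of_events = events.groupby("patient_id")["event"].value_counts().unstack(fill_value=0)

# "What and when": a time-ordered sequence per patient, the form that
# sequence models (RNNs, and more recently transformers) consume directly.
sequences = (events.sort_values("timestamp")
             .groupby("patient_id")[["timestamp", "event"]]
             .apply(lambda g: list(zip(g["timestamp"], g["event"]))))
print(sequences.loc[1])
```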
 

More good news. The authors expanded their method to a general protocol for longitudinal analysis of EHR data and made it available as open source. This is one of the themes of this newsletter: more and more cutting-edge deep learning models are now accessible to practicing data science teams. 


What I’m curious about. The 2019 model uses a type of model called a “recurrent neural network”, which is becoming less popular. A newer type called a “transformer” has taken over for many longitudinal tasks, but there aren’t many applications to healthcare yet – this is an area I’ll be following.

Using SDoH to improve accuracy and equitability of predictive models

In the last post, I shared a paper about extracting social determinants of health (SDoH) from EMR. Continuing with the SDoH theme, this week’s paper is about using them to improve accuracy and equitability of predictive models.

TL;DR: The researchers develop models to predict in-hospital mortality among heart failure patients. First, they show that simple, off-the-shelf ML tools are significantly more accurate than standard risk scores. Second, adding race and area-level SDoH, like median household income, improved performance further, particularly for Black patients.
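
The general recipe is easy to sketch. The features, data, and model below are invented and far simpler than the paper’s; the point is only to show the mechanics of joining area-level SDoH to clinical features and comparing the same model with and without them:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000

# Made-up clinical features plus a ZIP code per patient.
patients = pd.DataFrame({
    "age": rng.integers(40, 90, n),
    "ejection_fraction": rng.normal(45, 12, n),
    "zip": rng.choice(["02118", "02139", "02467"], n),
})
# Made-up area-level SDoH, joined on ZIP code.
area_sdoh = pd.DataFrame({"zip": ["02118", "02139", "02467"],
                          "median_income": [45_000, 70_000, 120_000]})
df = patients.merge(area_sdoh, on="zip")

# Synthetic outcome that depends partly on the SDoH signal.
y = rng.binomial(1, 0.02 + 0.03 * (df["median_income"] < 50_000))

clinical_only = ["age", "ejection_fraction"]
with_sdoh = clinical_only + ["median_income"]

X_tr, X_te, y_tr, y_te = train_test_split(df, y, test_size=0.3, random_state=0)
for cols in (clinical_only, with_sdoh):
    model = GradientBoostingClassifier(random_state=0).fit(X_tr[cols], y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te[cols])[:, 1])
    print(cols, round(auc, 3))
```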

Why should you care? When considering machine learning applications, healthcare organizations increasingly need to evaluate equitability, not just performance, and ensure that models don’t perpetuate bias rooted in historical patterns. Incorporating SDoH can help improve both the accuracy and the equitability of a model.

Worth your attention: Mortality is a rare outcome (low single digits) even among hospital patients. The number of observed deaths is small, which makes comparing models more challenging. This paper uses metrics targeted specifically for rare outcomes. This is good practice. Always ask which metrics are being used to assess model performance, especially if only standard-fare metrics like accuracy and AUC are being reported.
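
Here is a quick illustration on synthetic data. Average precision (the area under the precision-recall curve) is one common choice for rare outcomes; the paper’s exact metrics may differ:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.02, 10_000)                  # ~2% mortality: a rare outcome
scores = y * 0.3 + rng.normal(0.2, 0.15, 10_000)   # a made-up, imperfect risk score

print("AUC:", round(roc_auc_score(y, scores), 3))
print("Average precision (PR curve):", round(average_precision_score(y, scores), 3))
# With only ~200 observed deaths, the precision-recall view is far more
# sensitive to how well the few true cases are ranked, which a
# healthy-looking AUC can hide.
```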

One caveat: SDoH typically come from a different data source than clinical data. Whenever two data sources are combined, special care is needed. Perhaps SDoH are more likely to be available for a certain patient population? This could affect both the accuracy and equitability of a model. In this particular work, more information on this topic would be helpful.

Using deep learning to classify SDoH

After a two-year hiatus, I am excited to relaunch my newsletter, intended for non-technical (but data curious) healthcare professionals. Each post will highlight a recent AI research article, why it’s interesting, and how it may be applicable in practice to healthcare companies, with their actual systems, teams, and needs.

This week's paper is about using deep learning to classify social determinants of health (SDoH) from electronic health records (paywall; email me for a copy).
 

TL;DR: The researchers extract SDoH from clinical notes using various machine learning tools, including off-the-shelf deep learning models. The latter outperformed other methods and achieved good accuracy.   
 

Why is this important? SDoH are increasingly recognized as critical features for applications like risk stratification, predicting health outcomes, and evaluating programs for equity and bias. But SDoH are generally not coded in a structured way, so automatic extraction from EMR would be useful. 
 

Also interesting: The performance of off-the-shelf models on ad-hoc clinical annotation tasks is promising. This was generally not the case until around 2020, following the launch and broad adoption of a model called “BERT” developed by Google.
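
For data science readers, here is a hedged sketch of what “off-the-shelf” means in practice: fine-tuning a publicly available BERT-family model to tag note sentences with an SDoH category, using the Hugging Face transformers library. The categories, sentences, and labels below are invented, and the paper’s own models, label set, and data differ.

```python
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical SDoH categories and tiny annotated examples.
labels = ["housing", "employment", "social_support", "none"]
sentences = ["Patient is currently living in a shelter.",
             "Lost job three months ago.",
             "Lives with daughter who assists with medications.",
             "Denies chest pain or shortness of breath."]
targets = [0, 1, 2, 3]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels))

class NoteDataset(Dataset):
    """Wrap tokenized sentences and labels for the Trainer."""
    def __init__(self, texts, ys):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.ys = ys
    def __len__(self):
        return len(self.ys)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.ys[i])
        return item

trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="sdoh_model", num_train_epochs=1),
                  train_dataset=NoteDataset(sentences, targets))
trainer.train()
```

With a real annotated corpus, most of the effort goes into defining the categories and producing the annotations, not into the modeling code.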
 

Who is this relevant for? Healthcare organizations with access to EHR data and an interest in incorporating SDoH into their work. Good results were achieved with off-the-shelf models, which are accessible to data science teams without specialized research capabilities.
 

Caveats: The results were tested on data from a single hospital and based on a set of SDoH defined by the researchers. Real-world applications will require careful definition of the SDoH of interest, an investment in annotation, and a rigorous evaluation of the results.
 

The bottom line: With standardized SDoH definitions and studies across diverse datasets, it may be possible to extract SDoH from EHR at scale and with high accuracy. This is of particular importance for evaluating equity and bias of clinical programs. Meanwhile, off-the-shelf deep learning models show promising performance on ad-hoc clinical annotation tasks, and are within reach for most data science teams.