Clinical phenotyping beyond coding

Today: a new machine learning approach called “Guided MixEHR” for extracting clinical characteristics that are not properly captured by a single code, including many common chronic conditions. (h/t to my friend Jacob Oppenheim from EQRx for pointing me to this research.)

Codes are not enough. How do you determine clinical characteristics from healthcare data (claims or EHR)? Many conditions, like frailty, are not captured by a single code. Even common chronic conditions, like COPD, require complex rules for accurate determination, for example to avoid counting rule-out diagnosis codes (codes recorded when a condition is being tested for, not confirmed). Extracting this information accurately and at scale helps analytics teams in many use cases, from basic reporting to generating inputs for machine learning models.
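To make the "complex rules" point concrete, here is a minimal sketch of one common style of rule: count a patient as having COPD only with an inpatient diagnosis, or with outpatient diagnoses on at least two distinct dates, so that a single rule-out code does not qualify. The threshold, the tuple layout, and the function name `flag_copd` are assumptions for illustration, not rules taken from the paper; J44 is the ICD-10 category for COPD.

```python
from collections import defaultdict
from datetime import date


def flag_copd(claims, min_outpatient_dates=2):
    """Return the set of patient IDs that pass an illustrative COPD rule.

    claims: iterable of (patient_id, service_date, icd10_code, setting),
    where setting is "inpatient" or "outpatient" (hypothetical layout).
    """
    inpatient = set()
    outpatient_dates = defaultdict(set)
    for patient_id, service_date, code, setting in claims:
        if not code.startswith("J44"):  # keep only COPD diagnosis codes
            continue
        if setting == "inpatient":
            inpatient.add(patient_id)
        else:
            outpatient_dates[patient_id].add(service_date)
    # A lone outpatient code may be a rule-out, so require repeat dates.
    confirmed = {
        patient_id
        for patient_id, dates in outpatient_dates.items()
        if len(dates) >= min_outpatient_dates
    }
    return inpatient | confirmed


# Toy data, invented for this example.
claims = [
    ("A", date(2023, 1, 5), "J44.9", "outpatient"),
    ("A", date(2023, 3, 2), "J44.1", "outpatient"),
    ("B", date(2023, 2, 1), "J44.9", "outpatient"),  # single code: possible rule-out
    ("C", date(2023, 4, 10), "J44.1", "inpatient"),
    ("D", date(2023, 5, 7), "I10", "outpatient"),    # hypertension, not COPD
]
copd_patients = flag_copd(claims)  # patients A and C
```

Even this toy rule has tunable choices (which codes, how many dates, which settings), and each condition needs its own version. That maintenance burden is what a learned approach tries to avoid.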

Oldie but goodie. This paper builds on classical topic modeling methods, treating clinical characteristics as “topics” associated with each patient (“classical” roughly means more than 10 years old; in this case, about 20). It’s pretty mathy, but the code is available online.
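For intuition about the "patients as documents, codes as words" framing, here is a minimal sketch of plain, unguided LDA fitted by collapsed Gibbs sampling, using only the standard library. It is not the paper's model and makes no attempt to reproduce its guidance mechanism; the function name `fit_lda`, the hyperparameters, and the toy patients are all assumptions for illustration.

```python
import random


def fit_lda(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Plain LDA via collapsed Gibbs sampling.

    docs: list of patients, each a list of code strings.
    Returns (theta, phi, vocab): per-patient topic mixtures,
    per-topic code distributions, and the sorted code vocabulary.
    """
    rng = random.Random(seed)
    vocab = sorted({code for doc in docs for code in doc})
    index = {code: i for i, code in enumerate(vocab)}
    n_codes = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]              # patient x topic counts
    topic_code = [[0] * n_codes for _ in range(n_topics)]   # topic x code counts
    topic_total = [0] * n_topics
    assignments = []

    # Start from a random topic assignment for every code occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for code in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_code[z][index[code]] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, code in enumerate(doc):
                w = index[code]
                z = assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_code[z][w] -= 1
                topic_total[z] -= 1
                # Resample the topic given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_code[k][w] + beta)
                    / (topic_total[k] + n_codes * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_code[z][w] += 1
                topic_total[z] += 1

    theta = [
        [(count + alpha) / (len(doc) + n_topics * alpha) for count in row]
        for row, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(count + beta) / (topic_total[k] + n_codes * beta) for count in topic_code[k]]
        for k in range(n_topics)
    ]
    return theta, phi, vocab


# Toy patients, invented for this example (ICD-10 codes as "words").
patients = [
    ["J44.9", "J44.1", "J44.9"],
    ["J44.1", "J44.9", "I10"],
    ["E11.9", "E11.65", "Z79.4"],
    ["E11.9", "Z79.4", "I10"],
]
theta, phi, vocab = fit_lda(patients, n_topics=2)
```

Each row of `theta` is one patient's mixture over topics, and each row of `phi` is one topic's distribution over codes. In plain LDA the topics are unlabeled, so someone has to inspect `phi` to decide what each topic means.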

What about deep learning? Most deep learning (aka neural network) methods don’t work well on structured data, and there are no high-quality off-the-shelf deep learning tools that perform well in this case. And, while a theme of this newsletter is that deep learning is now within reach for many data science teams, as a rule you should not build your own deep learning models from scratch; fine-tune existing ones instead. Building from scratch takes too much data, compute power, and specialized expertise.

A good middle ground. The approach in this paper is significantly more sophisticated than the standard models that are easy to build in-house, but it requires much less data and compute power than deep learning. And it performs well, though it was only tested on public datasets from two locations, so results on your own data may differ. As is usually the case, expect most of the work to go into data processing and validation.