Natural language processing for scalable feature engineering and ultra-high-dimensional confounding adjustment in healthcare database studies

Healthcare data generated from clinical practice, including electronic health records (EHR) and insurance claims, can complement randomized controlled trials to provide evidence on the effects of medical products to support clinical decisions [1]. However, generating such real-world evidence (RWE) by estimating causal effects from these data sources can be challenging because of confounding arising from non-randomized treatment allocation and poorly measured information on comorbidities [2,3]. Approaches to mitigate confounding bias would ideally rely on causal diagrams and expert knowledge for variable selection [4]. Covariate adjustment based on expert knowledge alone, however, is not always adequate, because some confounders may not be considered by researchers or may not be directly measurable in such secondary healthcare data.

To improve confounding control in RWE studies, data-driven algorithms can be used to empirically identify and adjust for large numbers of pre-exposure variables that indirectly capture information on unmeasured or unspecified confounding factors (‘proxy’ confounders) [5–8]. A growing literature has shown that supplementing investigator-specified variables with large numbers of empirically identified features can often improve confounding control compared with adjustment based on investigator-specified variables alone [5–13]. Current approaches for high-dimensional proxy adjustment (HDPA), however, require data to be in a structured format (e.g., claims and structured EHR data), leaving unstructured EHR text information underutilized for confounding control. Leveraging this information can be challenging because patient-reported information is often recorded in free-text documents that are not readily analyzable at a large scale.

Recent work has demonstrated that unsupervised natural language processing (NLP) technology can scale to generate large numbers of structured features from unstructured clinical documents [14–16]. However, the added value of supplementing administrative claims data with large numbers of NLP-generated EHR features to improve high-dimensional proxy confounding control in healthcare database studies remains unclear. Here, we use three empirical studies to investigate the added value of supplementing administrative claims with high-dimensional sets of NLP-generated features from time-indexed EHR notes for causal analyses. We consider several NLP methods for generating structured features from pre-exposure free-text clinical notes and observe how adjustment for different covariate sets affects covariate balance and effect estimates after propensity score (PS) weighting. Our objective is to assess whether unsupervised NLP tools can leverage information in EHR free-text notes to supplement claims data for improved high-dimensional proxy adjustment in healthcare database studies.
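
To make the workflow described above concrete, the following is a minimal sketch rather than the method used in the studies. It assumes a simple bag-of-words representation (binary term indicators) as a stand-in for the NLP feature generators, inverse probability of treatment weights as the form of PS weighting, and propensity scores supplied by some separately fitted model. The function names `term_features`, `iptw`, and `weighted_smd` and the `min_patients` threshold are hypothetical and chosen for illustration only.

```python
import math
import re
from collections import Counter


def term_features(notes, min_patients=2):
    """Turn one pre-exposure note per patient into binary term indicators.

    Assumption: a plain bag-of-words stands in for the NLP methods; terms
    seen in fewer than `min_patients` patients are dropped.
    """
    tokenised = [set(re.findall(r"[a-z]+", note.lower())) for note in notes]
    patient_counts = Counter(term for terms in tokenised for term in terms)
    vocab = sorted(t for t, n in patient_counts.items() if n >= min_patients)
    matrix = [[1 if t in terms else 0 for t in vocab] for terms in tokenised]
    return vocab, matrix


def iptw(treated, ps):
    """Inverse probability of treatment weights from propensity scores.

    Assumption: `ps` comes from a separately fitted PS model (not shown).
    """
    return [1.0 / p if t == 1 else 1.0 / (1.0 - p) for t, p in zip(treated, ps)]


def weighted_smd(x, treated, weights):
    """Weighted standardized mean difference of one feature between groups."""

    def moments(group):
        rows = [(v, w) for v, t, w in zip(x, treated, weights) if t == group]
        total = sum(w for _, w in rows)
        mean = sum(v * w for v, w in rows) / total
        var = sum(w * (v - mean) ** 2 for v, w in rows) / total
        return mean, var

    m1, v1 = moments(1)
    m0, v0 = moments(0)
    pooled_sd = math.sqrt((v1 + v0) / 2.0)
    if pooled_sd == 0.0:
        # No within-group variation: balanced only if the means coincide.
        return 0.0 if m1 == m0 else math.copysign(math.inf, m1 - m0)
    return (m1 - m0) / pooled_sd
```

In use, each column of the matrix returned by `term_features` would be passed to `weighted_smd` once with unit weights and once with the weights from `iptw`, so that balance on every text-derived feature can be compared before and after weighting.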
