399 Using Machine Learning to Inform Extraction of Clinical Data from Sleep Study Reports
Abstract
Introduction In-laboratory and home sleep studies are important tools for diagnosing sleep disorders. However, only a limited number of measurements are used to characterize disease severity, and only specific measures, if any, are stored as structured fields in electronic health records (EHR). We propose a sleep study data extraction approach based on supervised machine learning to facilitate the development of specialized, format-specific parsers for large-scale automated sleep data extraction.
Methods Using retrospective data from the Penn Medicine Sleep Center, we identified 64,100 sleep study reports stored as Microsoft Word documents of varying formats, recorded from 2001–2018. A random sample of 200 reports was selected for manual annotation of format (e.g., layout) and type (e.g., baseline, split-night, home sleep apnea test). We identified 14 different formats and 7 study types. Using text mining tools, we extracted 71 document property features (e.g., section dimensions, paragraph and table elements, regular expression matches). We used the manual annotations as multiclass outcomes in random forest classifiers to evaluate prediction of sleep study format and type from the document property features. Out-of-bag (OOB) error rates and the multiclass area under the receiver operating characteristic curve (mAUC) were estimated to evaluate the training and testing performance of each model.
Results We successfully predicted sleep study format and type using random forest classifiers. The training OOB error rate was 5.6% for study format and 8.1% for study type. When these models were evaluated in independent testing data, the mAUC was 0.85 for classification of study format and 1.00 for study type. When applied to the broader set of diagnostic sleep study reports, we successfully extracted hundreds of discrete fields from 38,252 reports representing 33,696 unique patients.
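The classification step described in the Methods can be illustrated with a minimal sketch. It is not the authors' code: the abstract does not name the software used, so scikit-learn is an assumption, the feature matrix is synthetic (200 rows by 71 columns, mirroring only the sample size and feature count), the number of classes is reduced to 4 for the toy data, and the one-vs-one macro average is one common definition of multiclass AUC that may differ from the one used in the study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the annotated sample: 200 reports x 71 features.
# The 4 classes are an arbitrary choice for this toy data, not the
# 14 formats or 7 study types reported in the abstract.
X, y = make_classification(
    n_samples=200,
    n_features=71,
    n_informative=20,
    n_classes=4,
    n_clusters_per_class=1,
    random_state=0,
)

# Hold out an independent test set, stratified so every class appears in it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# oob_score=True scores each training sample using only the trees
# whose bootstrap sample did not contain it.
forest = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

# oob_score_ is an accuracy, so the OOB error rate is its complement.
oob_error = 1.0 - forest.oob_score_

# Multiclass AUC on the held-out data from predicted class probabilities
# (one-vs-one, macro-averaged over class pairs).
proba = forest.predict_proba(X_test)
mauc = roc_auc_score(y_test, proba, multi_class="ovo", average="macro")

print(f"OOB error rate: {oob_error:.3f}")
print(f"Test mAUC:      {mauc:.3f}")
```

In a setting like the one described, the same pattern would be fitted twice, once with format labels and once with study type labels as the outcome, with the predicted format then used to decide which format-specific parser to apply to a report.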
Conclusion We accurately classified a sample of sleep study reports according to their format and type, using a random forest multiclass classification method. This informed the development and successful deployment of custom data extraction tools for sleep study reports. The ability to leverage these data can improve understanding of sleep disorders in the clinical setting and facilitate implementation of large-scale research studies within the EHR.
Support American Heart Association (20CDA35310360).