Data integration of structured and unstructured sources for assigning clinical codes to patient stays

Scheurwegs, Alyne; Luyckx, Kim; Luyten, Leon; Daelemans, Walter; van den Bulcke, Tim

doi:10.1093/JAMIA/OCV115

Author

Scheurwegs, Alyne

Luyckx, Kim

Luyten, Leon

Daelemans, Walter

van den Bulcke, Tim

Abstract

OBJECTIVE: Enormous amounts of healthcare data are becoming increasingly accessible through the large-scale adoption of electronic health records. In this work, structured and unstructured (textual) data are combined to assign clinical diagnostic and procedural codes (specifically International Classification of Diseases, Ninth Revision, Clinical Modification [ICD-9-CM] codes) to patient stays. We investigate whether integrating these heterogeneous data types improves prediction strength compared to using the data types in isolation. METHODS: Two separate data integration approaches were evaluated. Early data integration combines features of several sources within a single model, and late data integration learns a separate model per data source and combines their predictions with a meta-learner. This is evaluated on data sources and clinical codes from a broad set of medical specialties. RESULTS: When compared with the best individual prediction source, late data integration leads to improvements in predictive power (e.g., overall F-measure increased from 30.6% to 38.3% for ICD-9-CM diagnostic codes), while early data integration is less consistent. The predictive strength strongly differs between medical specialties, both for ICD-9-CM diagnostic and procedural codes. DISCUSSION: Structured data provide complementary information to unstructured data (and vice versa) for predicting ICD-9-CM codes. This can be captured most effectively by the proposed late data integration approach. CONCLUSIONS: We demonstrated that models using multiple electronic health record data sources systematically outperform models using data sources in isolation in the task of predicting ICD-9-CM codes over a broad range of medical specialties.
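
The two integration strategies named in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the learner (a small hand-written logistic regression), the synthetic stays produced by the hypothetical make_stays helper, the single binary code being predicted, and the choice to train the meta-learner on a held-out half of the training stays. The abstract does not state which learners or features the authors used.

```python
import math
import random


def sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme scores.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(model, x):
    weights, bias = model
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def train_logreg(X, y, epochs=300, lr=0.5):
    """Fit a small logistic regression with batch gradient descent."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, target in zip(X, y):
            err = predict_proba((weights, bias), x) - target
            grad_b += err
            for j, v in enumerate(x):
                grad_w[j] += err * v
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


def flag(rng, y, p_pos, p_neg):
    # Noisy binary feature: fires more often when the code applies (y == 1).
    return 1.0 if rng.random() < (p_pos if y else p_neg) else 0.0


def make_stays(n, seed=0):
    """Hypothetical synthetic stays with two sources and one binary code."""
    rng = random.Random(seed)
    structured, text, labels = [], [], []
    for _ in range(n):
        y = 1 if rng.random() < 0.5 else 0
        structured.append([flag(rng, y, 0.8, 0.2), flag(rng, y, 0.6, 0.4)])
        text.append([flag(rng, y, 0.8, 0.2), flag(rng, y, 0.7, 0.3)])
        labels.append(y)
    return structured, text, labels


def early_integration(structured, text, labels):
    # Early: concatenate the features of all sources, train one model.
    return train_logreg([s + t for s, t in zip(structured, text)], labels)


def late_integration(structured, text, labels):
    # Late: one model per source on the first half of the stays, then a
    # meta-learner on the per-source predictions for the second half.
    half = len(labels) // 2
    base_s = train_logreg(structured[:half], labels[:half])
    base_t = train_logreg(text[:half], labels[:half])
    meta_X = [[predict_proba(base_s, s), predict_proba(base_t, t)]
              for s, t in zip(structured[half:], text[half:])]
    return base_s, base_t, train_logreg(meta_X, labels[half:])


def late_predict(models, s, t):
    base_s, base_t, meta = models
    return predict_proba(meta, [predict_proba(base_s, s), predict_proba(base_t, t)])


def accuracy(probs, labels):
    return sum((p >= 0.5) == bool(y) for p, y in zip(probs, labels)) / len(labels)


def run_demo():
    s_train, t_train, y_train = make_stays(300, seed=0)
    s_test, t_test, y_test = make_stays(100, seed=1)
    early = early_integration(s_train, t_train, y_train)
    late = late_integration(s_train, t_train, y_train)
    early_probs = [predict_proba(early, s + t) for s, t in zip(s_test, t_test)]
    late_probs = [late_predict(late, s, t) for s, t in zip(s_test, t_test)]
    return {
        "early_probs": early_probs,
        "late_probs": late_probs,
        "early_accuracy": accuracy(early_probs, y_test),
        "late_accuracy": accuracy(late_probs, y_test),
    }


if __name__ == "__main__":
    demo = run_demo()
    print("early:", demo["early_accuracy"], "late:", demo["late_accuracy"])
```

The sketch only shows how the two pipelines differ structurally. Scores on this toy data say nothing about the results reported in the abstract.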

Language

English

Source (journal)

Journal of the American Medical Informatics Association. - Philadelphia, Pa

Publication

Philadelphia, Pa : 2016

ISSN

1067-5027

Volume/pages

23 (2016), p. 11-19

ISI

000375292600003

Full text (Publisher's DOI)

https://doi.org/10.1093/JAMIA/OCV115

Full text (open access)

https://repository.uantwerpen.be/docman/irua/8ee6cc/129923_2017_01_01.pdf

Faculty/Department				Faculty of Arts. Linguistics; Faculty of Sciences. Mathematics and Computer Science; Faculty of Medicine and Health Sciences

Research group				Translational Neurosciences (TNW); ADReM Data Lab (ADReM); Centre for Computational Linguistics, Psycholinguistics and Sociolinguistics (CLiPS)
Project info				Data fusion and structured input and output Machine Learning techniques for automated clinical coding.
Publication type				A1 Journal article

Subject				Computer. Automation; Linguistics

Affiliation				Publications with a UAntwerp address

Identifier

Creation

05.01.2016

Last edited

04.03.2024

To cite this reference

https://hdl.handle.net/10067/1299230151162165141