Data Mining Approach to Identify Disease Cohorts from Primary Care Electronic Medical Records: A Case of Diabetes Mellitus

Owusu Adjah, Ebenezer S.; Montvida, Olga; Agbeve, Julius; Paul, Sanjoy K.

RESEARCH ARTICLE

Ebenezer S. Owusu Adjah¹,², Olga Montvida¹,³, Julius Agbeve¹, Sanjoy K. Paul⁴,*

The Open Bioinformatics Journal • 12 December 2017 • RESEARCH ARTICLE • DOI: 10.2174/1875036201710010016

Background:

Identifying diseased patients from primary care-based electronic medical records (EMRs) poses methodological challenges that may affect epidemiologic inferences.

Objective:

To compare deterministic, clinically guided selection algorithms with probabilistic machine learning (ML) methodologies in their ability to identify patients with type 2 diabetes mellitus (T2DM) from large population-based EMRs in a nationally representative primary care database.

Methods:

Four cohorts of patients with T2DM were defined by a deterministic approach based on disease codes. The database was mined for a set of best predictors of T2DM, and the performance of six ML algorithms was compared based on cross-validated true positive rate, true negative rate, and area under the receiver operating characteristic curve.
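The three comparison metrics named above can be illustrated with a minimal sketch. It is not the authors' pipeline: the labels, scores, and the 0.5 decision threshold are hypothetical toy values, and the function names are chosen here for illustration. In a cross-validated comparison, these quantities would typically be computed on each held-out fold and then averaged.

```python
from typing import Sequence, Tuple


def tpr_tnr(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (true positive rate, true negative rate) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive scores higher than a randomly chosen negative
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical held-out fold: 1 = T2DM, 0 = no T2DM (toy values only).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]

# Assumed decision threshold of 0.5 on the predicted probability.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tpr, tnr = tpr_tnr(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"TPR={tpr:.2f} TNR={tnr:.2f} AUC={auc:.4f}")
```

The pairwise AUC formulation is quadratic in the number of patients, so it only suits small illustrations; a rank-based computation would be needed at the scale of the database described here.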

Results:

In the database of 11,018,025 research-suitable individuals, 379,657 (3.4%) were coded as having T2DM. The logistic regression classifier was selected as the best ML algorithm and produced a cohort of 383,330 patients with potential T2DM. Eighty-three percent (83%) of this cohort had a T2DM code, and 16% of the patients with a T2DM code were not included in the ML cohort. Of those in the ML cohort without a disease code, 52% had at least one measure of elevated glucose level and 22% had received at least one prescription for antidiabetic medication.
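The overlap between the coded cohort and the ML cohort can be made concrete with some arithmetic. The sketch below uses only the totals and rounded percentages reported above; the derived counts are approximations that follow from that rounding and are not figures reported by the authors.

```python
# Totals as reported in the Results.
total_individuals = 11_018_025
coded_t2dm = 379_657   # individuals with a T2DM disease code
ml_cohort = 383_330    # individuals selected by the logistic regression classifier

# Share of the database carrying a T2DM code (reported as 3.4%).
coded_share = coded_t2dm / total_individuals

# Approximate counts implied by the rounded percentages (assumption:
# the published 83%, 16%, 52% and 22% are applied as exact fractions).
ml_with_code = round(ml_cohort * 83 / 100)
ml_without_code = ml_cohort - ml_with_code
coded_not_in_ml = round(coded_t2dm * 16 / 100)
uncoded_high_glucose = round(ml_without_code * 52 / 100)
uncoded_on_medication = round(ml_without_code * 22 / 100)

print(f"coded share of database:        {coded_share:.1%}")
print(f"ML cohort with T2DM code:       ~{ml_with_code:,}")
print(f"ML cohort without T2DM code:    ~{ml_without_code:,}")
print(f"coded patients missed by ML:    ~{coded_not_in_ml:,}")
print(f"uncoded with elevated glucose:  ~{uncoded_high_glucose:,}")
print(f"uncoded with antidiabetic Rx:   ~{uncoded_on_medication:,}")
```

The two routes to the overlap (83% of the ML cohort, and 84% of the coded cohort) agree to within a few hundred patients, which is the level of discrepancy expected from percentages rounded to whole numbers.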

Conclusion:

Deterministic cohort selection based on disease coding potentially introduces a significant misclassification problem. ML techniques allow potential disease predictors to be tested and, given meaningful data input, can identify diseased cohorts in a holistic way.

Keywords: Electronic Medical Records, Primary Care Database, Machine Learning Algorithm, Diabetes, Type 2 Diabetes, Cohort Identification.

© 2017 Owusu Adjah et al.
