Data Mining Approach to Identify Disease Cohorts from Primary Care Electronic Medical Records: A Case of Diabetes Mellitus

Owusu Adjah, Ebenezer S.; Montvida, Olga; Agbeve, Julius; Paul, Sanjoy K.

RESEARCH ARTICLE

Ebenezer S. Owusu Adjah¹,², Olga Montvida¹,³, Julius Agbeve¹, Sanjoy K. Paul⁴,*

The Open Bioinformatics Journal • 12 December 2017 • RESEARCH ARTICLE • DOI: 10.2174/1875036201710010016

Background:

Identifying patients with a disease from primary care electronic medical records (EMRs) poses methodological challenges that may affect epidemiologic inferences.

Objective:

To compare deterministic, clinically guided selection algorithms with probabilistic machine learning (ML) methodologies in their ability to identify patients with type 2 diabetes mellitus (T2DM) in large population-based EMRs from a nationally representative primary care database.

Methods:

Four cohorts of patients with T2DM were defined using a deterministic approach based on disease codes. The database was mined for a set of best predictors of T2DM, and the performance of six ML algorithms was compared based on cross-validated true positive rate, true negative rate, and area under the receiver operating characteristic curve.
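The three comparison metrics named above can be illustrated with a minimal, dependency-free sketch. It is not the study's code: the function names, the 0.5 decision threshold, and the toy labels and scores are assumptions made purely for illustration, and the metrics are computed on a single hypothetical validation fold rather than averaged over cross-validation folds.

```python
def confusion_rates(y_true, y_pred):
    """Return (true positive rate, true negative rate) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tpr = tp / (tp + fn)  # sensitivity: share of true cases detected
    tnr = tn / (tn + fp)  # specificity: share of non-cases correctly excluded
    return tpr, tnr


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive receives a
    higher score than a randomly chosen negative (ties count as 0.5).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical fold: 1 = patient has T2DM, 0 = patient does not.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]  # predicted probabilities
y_pred = [1 if s >= 0.5 else 0 for s in scores]    # assumed 0.5 threshold

tpr, tnr = confusion_rates(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"TPR={tpr:.2f} TNR={tnr:.2f} AUC={auc:.4f}")
```

In a cross-validated comparison, these values would be computed on each held-out fold and then summarized per algorithm. The pairwise AUC loop is quadratic in the number of patients, so it suits small illustrations only.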

Results:

In the database of 11,018,025 research-suitable individuals, 379,657 (3.4%) were coded as having T2DM. The logistic regression classifier was selected as the best ML algorithm and yielded a cohort of 383,330 patients with potential T2DM. Eighty-three percent (83%) of this cohort had a T2DM code, and 16% of the patients with a T2DM code were not included in the ML cohort. Of those in the ML cohort without a disease code, 52% had at least one measurement of elevated glucose and 22% had received at least one prescription for antidiabetic medication.
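To make the overlap between the coded and ML cohorts concrete, the reported figures can be turned into approximate patient counts. This is a rough sketch only: the three counts and the percentages come from the text above, but the derived counts are not reported in the source and inherit the rounding of the published percentages. All variable names are illustrative.

```python
# Figures reported in the Results.
research_suitable = 11_018_025   # research-suitable individuals in the database
coded_t2dm = 379_657             # patients with a T2DM disease code
ml_cohort = 383_330              # patients selected by the logistic regression classifier

# Prevalence of coded T2DM in the database.
coded_prevalence = coded_t2dm / research_suitable

# Approximate counts implied by the rounded percentages (assumption:
# percentages are applied as exact fractions, so counts are estimates).
ml_with_code = 0.83 * ml_cohort            # ML cohort members who also carry a code
ml_without_code = ml_cohort - ml_with_code # ML cohort members with no code
coded_missed = 0.16 * coded_t2dm           # coded patients not selected by ML
coded_captured = coded_t2dm - coded_missed # coded patients selected by ML

# Among ML-selected patients without a code:
elevated_glucose = 0.52 * ml_without_code  # at least one elevated glucose measurement
antidiabetic_rx = 0.22 * ml_without_code   # at least one antidiabetic prescription

print(f"Coded prevalence: {coded_prevalence:.1%}")
print(f"ML cohort with code (approx.): {ml_with_code:,.0f}")
print(f"ML cohort without code (approx.): {ml_without_code:,.0f}")
print(f"Coded patients missed by ML (approx.): {coded_missed:,.0f}")
```

The overlap can be estimated from either side (83% of the ML cohort, or 84% of the coded cohort). The two estimates differ by well under 1%, which is consistent with rounding in the reported percentages.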

Conclusion:

Deterministic cohort selection based on disease coding may introduce substantial misclassification. ML techniques allow potential disease predictors to be tested and, given meaningful data input, can identify diseased cohorts in a holistic way.

Keywords: Electronic Medical Records, Primary Care Database, Machine Learning Algorithm, Diabetes, Type 2 Diabetes, Cohort Identification.

© 2017 Owusu Adjah et al.
