RESEARCH ARTICLE


Data Mining Approach to Identify Disease Cohorts from Primary Care Electronic Medical Records: A Case of Diabetes Mellitus



Ebenezer S. Owusu Adjah1, 2, Olga Montvida1, 3, Julius Agbeve1, Sanjoy K. Paul4, *
1 QIMR Berghofer Medical Research Institute, Brisbane, Australia
2 Faculty of Medicine, The University of Queensland, Brisbane, Australia
3 School of Biomedical Sciences, Institute of Health and Biomedical Innovation, Faculty of Health, Queensland University of Technology, Brisbane, Australia
4 Melbourne EpiCentre, University of Melbourne and Melbourne Health, Melbourne, Australia





© 2017 Owusu Adjah et al.

open-access license: This is an open access article distributed under the terms of the Creative Commons Attribution 4.0 International Public License (CC-BY 4.0), a copy of which is available at: (https://creativecommons.org/licenses/by/4.0/legalcode). This license permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

* Address correspondence to this author at the Melbourne EpiCentre, University of Melbourne and Melbourne Health, Melbourne, Australia; Tel: +61-3-93428433; E-mail: Sanjoy.Paul@unimelb.edu.au


Abstract

Background:

Identifying patients with a given disease from primary care-based electronic medical records (EMRs) poses methodological challenges that may affect epidemiologic inferences.

Objective:

To compare deterministic, clinically guided selection algorithms with probabilistic machine learning (ML) methodologies in their ability to identify patients with type 2 diabetes mellitus (T2DM) from large population-based EMRs in a nationally representative primary care database.

Methods:

Four cohorts of patients with T2DM were defined by a deterministic approach based on disease codes. The database was mined for a set of best predictors of T2DM, and the performance of six ML algorithms was compared based on cross-validated true positive rate, true negative rate, and area under the receiver operating characteristic curve.
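The three comparison metrics named above can be illustrated with a minimal sketch. It is not the authors' pipeline: the labels, scores, 0.5 decision threshold, two-fold split, and all function names below are assumptions made up for illustration, and the averaging over folds is only one common way to summarise cross-validated performance.

```python
from statistics import mean


def true_positive_rate(y_true, y_pred):
    """Share of truly diseased patients that the classifier flags."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def true_negative_rate(y_true, y_pred):
    """Share of truly non-diseased patients that the classifier leaves unflagged."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tn / (tn + fp)


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate_fold(y_true, scores, threshold=0.5):
    """Compute all three metrics on one held-out fold (threshold is an assumption)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    return {
        "tpr": true_positive_rate(y_true, y_pred),
        "tnr": true_negative_rate(y_true, y_pred),
        "auc": roc_auc(y_true, scores),
    }


# Hypothetical held-out folds: (true labels, predicted probabilities).
folds = [
    ([1, 1, 1, 0, 0, 0, 0, 1], [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]),
    ([1, 0, 0, 1], [0.6, 0.4, 0.7, 0.3]),
]

per_fold = [evaluate_fold(y, s) for y, s in folds]
# Cross-validated summary: average each metric over the folds.
cv_summary = {name: mean(f[name] for f in per_fold) for name in ("tpr", "tnr", "auc")}
print(cv_summary)
```

Running the same evaluation for each candidate classifier would give one such summary per algorithm, which is the kind of table on which a comparison of this sort can rest.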

Results:

In the database of 11,018,025 research-suitable individuals, 379,657 (3.4%) were coded as having T2DM. The logistic regression classifier was selected as the best ML algorithm and yielded a cohort of 383,330 patients with potential T2DM. Eighty-three percent (83%) of this cohort had a T2DM code, and 16% of the patients with a T2DM code were not included in the ML cohort. Of those in the ML cohort without a disease code, 52% had at least one elevated glucose measurement and 22% had received at least one prescription for antidiabetic medication.
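As a rough consistency check on the figures above, the overlap between the coded cohort and the ML cohort can be approximated from either side. The sketch below uses only the counts and rounded percentages reported here, so the derived numbers are approximations rather than reported results, and the variable names are illustrative.

```python
total_individuals = 11_018_025
coded_t2dm = 379_657   # patients with a T2DM disease code
ml_cohort = 383_330    # patients selected by the ML classifier

# Prevalence of coded T2DM in the research-suitable population.
coded_prevalence = coded_t2dm / total_individuals

# Overlap estimated from the ML side: 83% of the ML cohort had a T2DM code.
overlap_from_ml = 0.83 * ml_cohort
# Overlap estimated from the coded side: 16% of coded patients were not in the ML cohort.
overlap_from_coded = (1 - 0.16) * coded_t2dm

# ML-cohort patients without a disease code (approximate, from the rounded 83%).
ml_without_code = ml_cohort - overlap_from_ml

print(f"coded prevalence: {coded_prevalence:.1%}")
print(f"overlap (ML side): about {overlap_from_ml:,.0f}")
print(f"overlap (coded side): about {overlap_from_coded:,.0f}")
print(f"ML cohort without code: about {ml_without_code:,.0f}")
```

The two overlap estimates agree to within the rounding of the reported percentages, which suggests the 83% and 16% figures describe the same shared group of patients.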

Conclusion:

Deterministic cohort selection based on disease coding may introduce a substantial misclassification problem. ML techniques allow potential disease predictors to be tested and, given meaningful data input, can identify diseased cohorts in a holistic way.

Keywords: Electronic Medical Records, Primary Care Database, Machine Learning Algorithm, Diabetes, Type 2 Diabetes, Cohort Identification.