Data Mining Approach to Estimate the Duration of Drug Therapy from Longitudinal Electronic Medical Records

RESEARCH ARTICLE
Olga Montvida, Ognjen Arandjelović, Edward Reiner and Sanjoy K. Paul
Clinical Trials and Biostatistics Unit, QIMR Berghofer Medical Research Institute, Brisbane, Australia
School of Biomedical Sciences, Institute of Health and Biomedical Innovation, Faculty of Health, Queensland University of Technology, Brisbane, Australia
School of Computer Science, University of St. Andrews, St. Andrews, United Kingdom
Smart Analyst Inc., New York, United States of America
Melbourne EpiCentre, University of Melbourne and Melbourne Health, Melbourne, Australia


INTRODUCTION
Electronic medical records (EMRs) and administrative data from primary/ambulatory care systems are increasingly being used in epidemiological [1-3], pharmaco-epidemiological [4-6], pharmacovigilance [7-9], clinical outcome [5, 10-12], health economic [13, 14] and public health related studies [15-18]. Analyses of large primary care based EMRs from various countries, most notably the UK, USA and Sweden, have provided significant insight into the effects of changes in health care practices/policies on overall disease and health management costs [3, 15, 19, 20], in addition to population-level evidence on the safety and effectiveness of various therapies and the association of disease-related risk factors with long-term outcomes [5, 6, 18, 21-23]. The increasing use of such large real-world patient-level data is well illustrated by the sixfold increase in EMR-based published studies since 2000 [10, 24].
In structured EMRs, especially those from primary/ambulatory care systems, comprehensive patient-level data are captured across different domains simultaneously and stored in the form of a relational database [25, 26]. Representative examples include the UK Clinical Practice Research Database and the Centricity™ EMR (CEMR) database of the USA [27, 28]. The extraction, quality control and management of such voluminous longitudinal data under individual study protocols is methodologically and computationally demanding, and challenging from data mining and analytical viewpoints [22, 29]. In data science it is generally considered that data preparation tasks consume about 80% of the total project timeline, leaving only 20% for the analysis itself [30, 31]. Data completeness, systematic biases, reproducibility and quality are some of the notable limitations of such databases [18, 29, 32].
Most EMR databases capture large amounts of detailed information on medications provided to individuals over time, although the specific form in which this information is stored varies from database to database [26]. It is usually possible to obtain the drug class, the specific brand name within the corresponding class, prescription dates, dosage, and number of refills [32]. However, a significant number of entries for an individual prescription may be missing or contain errors. Problems with information completeness can also arise when the medication nomenclature is not correctly matched [29]. Clinical and pharmaco-epidemiological studies that rely on data from EMRs are often interested in the effectiveness of specific therapies, therapeutic dynamics, treatments with concomitant medications, and the durations thereof in specific disease areas. Such real-world analysis provides an extremely valuable means of understanding drug utilization patterns, treatment initiation periods following the diagnosis of a disease, the effectiveness of specific therapies on disease-related risk factors, and possible associations of therapies with long-term outcomes [1, 6]. These studies require appropriate extraction of longitudinal information on prescriptions or medications at the individual patient level; inappropriate extraction of the data may result in misleading inferences [33-35]. Generally, pharmaco-epidemiological studies do not estimate treatment duration, but only account for the presence of one or more prescriptions for the particular drug(s) [36, 37]. Some studies calculated medication duration by subtracting the first prescription date from the last prescription date [38, 39], and only a few studies additionally considered a drug to be discontinued if the subsequent prescription was not refilled within the expected time of drug coverage [40, 41]. While some studies have discussed the challenges in the analysis of medication data from EMRs [18, 42], to the best of our knowledge no existing study has analysed the quality, consistency, and completeness of EMR prescription information, nor proposed a practical algorithm able to extract salient medication information from large and complex longitudinal data sets [43].
The aims of this explanatory and methodological study are (1) to discuss and analyse the most pressing challenges encountered by computer-based methods in the process of extracting and aggregating longitudinal medication data from EMRs, (2) to describe two algorithms to extract prescription information on individual therapies and to estimate the corresponding duration of treatment, and (3) to discuss how estimates of individual medication duration are affected by the choice of the study design. The two algorithms are compared on a cohort of patients with a clinical diagnosis of type 2 diabetes (T2DM) using a real-world EMR database collected across the USA.

Centricity Electronic Medical Records
The CEMR database contains clinical/treatment records of more than 40 million patients from 1995 onwards. CEMR represents 49 US states and a variety of ambulatory medical practices, including solo practitioners, community clinics, academic medical centres, and large integrated delivery networks. The database has been extensively used for academic research worldwide [3, 37, 44-47]. The CEMR database includes over 30,000 health care providers, of whom approximately 70% are primary care providers. For both insured and uninsured patients, this database contains comprehensive patient-level information on many aspects, including demographic information, laboratory results, history of diseases, clinical diagnosis of symptoms/diseases, vital signs, history of medications and detailed information on ongoing medications. For this study we used longitudinal information from January 1995 to October 2014.

Medication Data in Centricity EMR database
The medications taken by an individual (medication domain) and the prescriptions for drugs provided to the individual by the service provider registered within the EMR system (prescription domain) are extensively documented in the database by means of three tables: medication dimension (MD), medication fact (MF) and prescription fact (PF). The MF and PF belong to the medication and prescription domains, respectively. The MF may include a broader list of all medications that a patient is taking, including over-the-counter medications, herbal remedies and medications prescribed by a provider outside the EMR network. The MD is linked to both the MF and the PF. Each record in the MD contains information on an individual drug, which includes the National Drug Code (NDC) and Generic Product Identifier (GPI), as well as the four ordered attributes derived from the GPI, such as generic drug names. The MD also includes the medication doses corresponding to different brands' products, identified by a unique medication key value assigned to each record.
The entries in the MF capture an individual patient's medication prescription history and active prescriptions from all practitioners, including the service provider registered within the EMR system. The MF contains several special fields to track longitudinal patterns, such as the active medication flag, which indicates whether a patient was taking the drug at the moment of database extraction. The active medication list is identified by records whose active flag has the value "Y". The chain identification (ID) values facilitate tracking of treatment alterations (including the addition of new medications) over time, together with the related chain sequence values, which track medication adjustments within the same chain ID. The initiation ('start') and cessation ('stop') dates associated with different treatments are also stored in the MF. However, we found that the corresponding values are missing with alarming frequency: in 67% of cases for the former and 11% for the latter. In addition, some of the start and stop date entries could be erroneous, such as a stop date preceding the start date. An excerpt from the MF for an individual patient is shown in Table 1. The entries in the PF capture the prescription date and the associated number of refills only for medications that have been prescribed by the responsible provider within the EMR network. The MF dataset contains a broader set of entry sources; moreover, its form of recording potentially comprises more details than the corresponding data in the PF. Nevertheless, it was determined that the PF may contain unique entries that are not stored in the MF. Therefore, the MF was considered the primary source of medication information and the PF a complementary one.

METHODS
In this section, we introduce two algorithms for mining large-scale longitudinal EMRs, with the ultimate goal of estimating the duration of treatment of a particular individual with the drug(s) of interest. The first method ("chaining") relies on the chain ID and chain sequence values recorded in the MF. This feature allows the approach to account for treatments which include alternative drug use. To assess the importance and power of longitudinal chain information, we also describe a modification of the "chaining" method ("continuous"), which disregards chain ID and chain sequence values and instead relies only on the chronology of the patient's records for the particular drug(s). In the current literature, the latter approach is used more frequently.

Data Pre-processing: Auxiliary Fields
Although erroneous entries generally cannot be identified, various types of global consistency rules may be applied to reduce the error. The chronology of events may be corrected by incorporating two additional fields: the patient's last available follow-up date and the patient's date of birth (DOB).
The CEMR database stores the last available follow-up date for each patient. As an initial data pre-processing step, erroneous follow-up date entries were identified and replaced by the latest record creation date across all activities within the database for the corresponding patient.
As in many anonymized EMRs, the exact DOB was not available within CEMR. A simple procedure was applied to approximate the DOB:
1. Obtain multiple DOB estimates per patient by subtracting the reported 'valid' age from the record creation date for all activities within the database. CEMR groups patients older than 80 years under a single age key. Non-missing age data and non-80+ age keys were considered 'valid' age entries.
2. Approximate the DOB as the minimum of all estimates from Step 1.
3. For patients without reported activities, estimate the DOB from the dataset containing demographic information by subtracting the reported 'valid' age from the database extraction date.
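The three-step DOB approximation can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the record layout (creation date, age in years, 80+ flag) and the names `shift_years` and `approximate_dob` are hypothetical, and ages are assumed to be recorded in whole years.

```python
from datetime import date
from typing import Iterable, Optional, Tuple

# Hypothetical record layout: (record creation date, reported age or None, is 80+ age key)
Activity = Tuple[date, Optional[int], bool]


def shift_years(d: date, years: int) -> date:
    """Move a date back by a whole number of years (29 Feb falls back to 28 Feb)."""
    try:
        return d.replace(year=d.year - years)
    except ValueError:
        return d.replace(year=d.year - years, day=28)


def approximate_dob(
    activities: Iterable[Activity],
    demographic_age: Optional[int] = None,
    extraction_date: Optional[date] = None,
) -> Optional[date]:
    # Step 1: one DOB estimate per activity with a 'valid' age
    # (age present and not grouped under the 80+ key).
    estimates = [
        shift_years(created, age)
        for created, age, is_80_plus in activities
        if age is not None and not is_80_plus
    ]
    # Step 2: take the minimum (earliest) of all estimates.
    if estimates:
        return min(estimates)
    # Step 3: no usable activities, so fall back to the demographic age
    # subtracted from the database extraction date.
    if demographic_age is not None and extraction_date is not None:
        return shift_years(extraction_date, demographic_age)
    return None
```

Taking the minimum in Step 2 picks the earliest candidate birth date, which is the estimate least affected by the truncation of ages to whole years.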
The parameters for the mathematical formulations are listed in Table 2. The scalars sd and mx may be defined on the basis of the standard prescription protocol for individual drugs. The default values of sd = 1 and mx = 24 were used in our analyses.
MS may be identified by text-mining the MD dataset. For example, glucagon-like peptide-1 receptor agonists (GLP-1RA) may be identified by searching for "GLP-1 RECEPTOR AGONIST" in the second-order GPI attribute field.
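As a rough illustration of how the set MS might be built, the sketch below filters MD rows by a case-insensitive substring search. The column names `medication_key` and `gpi_level_2` are hypothetical placeholders, not the actual CEMR schema, and the function name is likewise an assumption.

```python
from typing import Dict, Iterable, Optional, Set


def select_medication_keys(
    md_rows: Iterable[Dict[str, Optional[object]]],
    search_term: str,
    field: str = "gpi_level_2",  # hypothetical name for the second-order GPI attribute
) -> Set[object]:
    """Return the medication keys whose GPI attribute contains the search term."""
    term = search_term.upper()
    return {
        row["medication_key"]
        for row in md_rows
        # Missing attribute values are treated as empty strings.
        if term in str(row.get(field) or "").upper()
    }
```

A call such as `select_medication_keys(md_rows, "GLP-1 RECEPTOR AGONIST")` would then yield the medication keys to use as MS for GLP-1RA.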

"Chaining" Method
The algorithm for the first approach to extract and aggregate data for the estimation of duration of treatment is elaborated below.

2. Replace erroneous values of start dates ($b_i$) with missing values.
3. Sort by patient key ascending, chain ID ascending within the same patient, and chain sequence descending within the same chain ID.
9. If the number of refills is greater than the pre-defined maximal number of possible refills, negative, or missing, replace it with zero.
10. Calculate end dates $E = (e_1, e_2, \ldots, e_k)^T$ by the following rules:
10.1) if the prescription date is not missing, the end date equals the prescription date plus the standard duration multiplied by (number of refills + 1);
10.2) if the prescription date is missing, the end date equals the record creation date plus the standard duration multiplied by (number of refills + 1).
11. Update end dates as described in Step 5.
12. Reduce $PF_1$ to the set of patients from the cohort of interest, to the set of patients not in $MF_2$, and to the set of keys of the selected drug(s).
13. Append both datasets by the following values: patient key, record creation date, start/prescription date and end date; assume that the new dataset MP contains $n'$ records.
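Steps 9 and 10 can be sketched as below. The source does not state the unit of sd; the sketch assumes sd is expressed in months and approximates one month as 30 days (the constant `DAYS_PER_UNIT` is an assumption, as are the function names).

```python
from datetime import date, timedelta
from typing import Optional

DAYS_PER_UNIT = 30  # assumption: sd is in months, one month taken as 30 days


def clean_refills(refills: Optional[int], mx: int = 24) -> int:
    """Step 9: missing, negative or implausibly large refill counts become zero."""
    if refills is None or refills < 0 or refills > mx:
        return 0
    return refills


def end_date(
    prescription_date: Optional[date],
    record_creation_date: date,
    refills: Optional[int],
    sd: int = 1,
    mx: int = 24,
) -> date:
    """Step 10: base date plus sd * (refills + 1)."""
    # 10.1 uses the prescription date; 10.2 falls back to the record creation date.
    base = prescription_date if prescription_date is not None else record_creation_date
    covered_days = sd * (clean_refills(refills, mx) + 1) * DAYS_PER_UNIT
    return base + timedelta(days=covered_days)
```

The "+ 1" accounts for the initial fill: a prescription with two refills covers three standard durations.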
where $\mathbb{1}\{\cdot\}$ is an indicator function;
4.4) otherwise, the end date equals the start date of the previous record.
6. Reduce $PF_1$ to the set of patients from the cohort of interest, to the set of patients not in $MF_3$, and to the set of keys of the selected drug(s).
7. Treat missing values in $PF_2$ as described in Step 14 of the "chaining" method.
8. Append both datasets by the following values: patient key, record creation date, start/prescription date and end date, to form the new dataset MP.
9. Perform Steps 15-17 of the "chaining" method.

REMARKS
Identified erroneous entries are declared as missing in Steps 2, 8, and 9 of the "chaining" method. In Step 14, the algorithm counts the number of unique creation dates for the selected drug(s) at the patient level. If the obtained number is greater than one, missing start dates are replaced with record creation dates. In this way, a patient is considered to be taking a particular drug if the medication records were entered in a systematic manner; otherwise, the records with missing start dates are disregarded.
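The Step 14 rule described above can be sketched for a single patient as follows. The dictionary keys `created` and `start` and the function name are hypothetical; the input is assumed to be already restricted to one patient and to the selected drug(s).

```python
from datetime import date
from typing import Dict, List, Optional

Record = Dict[str, Optional[date]]


def resolve_missing_start_dates(records: List[Record]) -> List[Record]:
    """Fill or drop records with a missing start date, following the Step 14 rule."""
    # Count the unique record creation dates for this patient and drug selection.
    unique_created = {r["created"] for r in records}
    systematic = len(unique_created) > 1

    resolved: List[Record] = []
    for r in records:
        if r["start"] is not None:
            resolved.append(dict(r))
        elif systematic:
            # Records were entered on more than one date: use the creation date.
            resolved.append({**r, "start": r["created"]})
        # Otherwise the record with a missing start date is disregarded.
    return resolved
```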
As an example, the prescription scenario for anti-diabetes drugs for a patient with type 2 diabetes is presented in Table 1. The treatment was initiated with metformin (METFORMIN HCL) on the 6th of May 2009 and continued until the 25th of April 2011, when a switch to GLP-1RA (LIRAGLUTIDE) was made. With a stop date for GLP-1RA recorded on the 14th of December 2011, the data show a gap in the treatment until the 26th of September 2012, when insulin therapy began. However, a patient with diabetes using GLP-1RA is unlikely to have had a nine-month gap in treatment. Indeed, careful examination of the data leads to the conclusion that insulin treatment started on the 27th of February 2012, as would be estimated by the algorithm.
As mentioned earlier, the MF was considered the primary data source; thus, if at least one record for the selected drug(s) at the patient level is present in the MF, both methods disregard the entries in the PF. However, if no data are available in the MF table, the methods append data from the PF.
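This source-priority rule can be sketched as follows, assuming both record lists have already been reduced to the cohort of interest and to the selected drug(s). The key name `patient_key` and the function name are illustrative only.

```python
from typing import Dict, List


def combine_sources(mf_records: List[Dict], pf_records: List[Dict]) -> List[Dict]:
    """Use MF records where a patient has any; fall back to PF records otherwise."""
    patients_in_mf = {r["patient_key"] for r in mf_records}
    # PF entries are kept only for patients with no MF record for the selected drug(s).
    pf_fallback = [r for r in pf_records if r["patient_key"] not in patients_in_mf]
    return list(mf_records) + pf_fallback
```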
Assessment of the first marketing date for a particular drug is an example of an additional global consistency audit that is omitted from the description of the methods. For instance, no start date for a GLP-1RA drug may be prior to April 2005, when the first representative of the class (Exenatide) was approved.
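Such an audit could look like the sketch below. Only the month of approval is given in the text, so using the first day of April 2005 as the cut-off is an assumption, as are the lookup table and function names.

```python
from datetime import date
from typing import Dict, Optional

# Assumption: first day of the approval month is used as the earliest plausible date.
FIRST_MARKETING_DATE: Dict[str, date] = {"GLP-1RA": date(2005, 4, 1)}


def audit_start_date(drug_class: str, start: Optional[date]) -> Optional[date]:
    """Declare a start date missing (None) if it precedes the first marketing date."""
    if start is None:
        return None
    earliest = FIRST_MARKETING_DATE.get(drug_class)
    if earliest is not None and start < earliest:
        return None
    return start
```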

RESULTS
To evaluate the performance of the described methods, we chose to focus on the estimation of the duration of treatment with two widely used anti-diabetic drugs, namely GLP-1RA and insulin. In the CEMR database, 1,861,560 patients were identified as having been diagnosed with type 2 diabetes mellitus, as inferred from the assigned ICD-9 codes.

Case Study 1
As the first case study, we consider a randomly selected patient from the CEMR database, whose relevant treatment details are shown in Table 3. The treatments with EXENATIDE and INSULIN GLARGINE started on the 18th of June 2007. The treatment with EXENATIDE was terminated on the 7th of January 2008, while INSULIN therapy continued until the last recorded follow-up date on the 24th of January 2008 (notice that the treatment is flagged as active, "Y"). In this case, the "chaining" and "continuous" methods produce the same estimates for the durations of the two treatments. Specifically, the estimates corresponding to insulin and GLP-1RA are 7.2 and 6.7 months, respectively.
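The reported durations can be checked with a few lines of date arithmetic. The conversion from days to months is not stated in the text; the sketch assumes an average month of 365.25/12 days, under which the dates above reproduce the reported values.

```python
from datetime import date

DAYS_PER_MONTH = 365.25 / 12  # assumption: average month length, not stated in the text


def duration_months(start: date, end: date) -> float:
    """Treatment duration in months between a start and an end date."""
    return (end - start).days / DAYS_PER_MONTH


glp1ra = duration_months(date(2007, 6, 18), date(2008, 1, 7))    # EXENATIDE
insulin = duration_months(date(2007, 6, 18), date(2008, 1, 24))  # INSULIN GLARGINE
print(round(glp1ra, 1), round(insulin, 1))  # 6.7 7.2
```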

Case Study 2
As an insightful case study, we consider a patient whose relevant treatment details are shown in Table 4. Since all of the records shown have the same chain ID, it can be concluded that in the period from the 23rd of April 2010 until the 13th of March 2013 the patient was alternating between two therapies, namely GLP-1RA (EXENATIDE) and insulin (INSULIN GLARGINE). This example illustrates the importance of chain ID information, as readily corroborated by comparing the predicted therapy end dates using the "chaining" and "continuous" methods (per-record estimates are shown in the two rightmost columns of Table 4). Because the latter disregards chain ID information, it implicitly assumes that EXENATIDE was taken continuously from the 23rd of April 2010 until the 27th of April 2011, with the last prescription date being the 28th of March 2011. However, treatment with EXENATIDE was terminated on the 29th of December 2010, when a switch to insulin was made. Treatment with insulin continued until the 28th of March 2011, when a switch back to EXENATIDE occurred. This complex and frequent pattern of therapy alteration leads to vastly different treatment duration estimates when chain ID information is used ("chaining") and when it is not ("continuous"). For example, in this particular case, the "continuous" approach estimates the total duration of insulin/EXENATIDE treatment to be 5.7/28.9 months, compared to 26.5/12.1 months estimated by the "chaining" method.

General Analysis
Given our focus on GLP-1RA and insulin, to facilitate further analysis, from the cohort of all T2DM patients we selected those who at any point in their medical history received treatment with either of the two drugs of interest. Text mining of drug names in the MD table revealed various insulin regimens as well as related devices (e.g. insulin syringes). We found that approximately 30% of the patients in the T2DM cohort received at least one prescription for an insulin drug. Interestingly, a large number of patients (~25,000) were found to have received prescriptions for insulin devices but not for insulin therapy itself. Further exploration of these patients revealed that the average duration of use of these devices in this patient group was 21 months (Table 5), strongly suggesting that there was an accompanying insulin therapy which was not recorded in the stored EMRs. This conclusion is further corroborated by the finding that the mean glycated haemoglobin (HbA1c) level for these patients was 7.8% on the date of the first record associated with the device.
Table 5. Summary statistics on the estimated duration in months of treatment with specific medications in T2DM cohort (n=1,861,560) by "chaining" and "continuous" methods, and the difference in the estimated duration between "chaining" and "continuous" methods.
[Table 5 column groups: "Chaining" method; "Continuous" method; "Chaining" - "continuous"]
The number of patients receiving insulin and GLP-1RA, and the corresponding treatment duration estimates (in months) produced by our algorithms ("chaining" and "continuous"), are summarized in Table 5. Different insulin regimens were treated jointly, as we found that any finer level of detail is poorly recorded in the database.
The estimated proportion of patients identified as having received specific individual drugs was very similar using the "chaining" approach and the non-chain-ID-based "continuous" approach, as shown in Table 5. The corresponding values of the key statistics (the mean, standard deviation (SD), median, and interquartile range (IQR)) of the respective estimates of the duration of treatment with individual drugs were also similar. The average differences in the estimated duration of treatment with insulin only and with GLP-1RA drugs were 0.3 months and 1 month, respectively. There were no differences at the median level. Separate analyses for patients with a minimum of 2 months of treatment duration with individual therapies yielded the same results. However, it is important to note that although the cumulative statistics of the estimated treatment durations with different therapies were not significantly different, we did find notable differences in the minimum and maximum duration estimates for specific patient subgroups, as is evident from Table 5.

DISCUSSION
In this work we addressed a number of challenging data mining issues that arise when extracting patient-level longitudinal information on prescription patterns and medication usage from large relational databases (our data set comprises more than a billion records). There are several key contributions of note. Firstly, we identified the specific challenges which automatic methods must deal with in processing this complex, voluminous data. We corroborated our arguments using analysis of real-world EMRs and discussed the importance and the implications of being able to handle erroneous and incomplete longitudinal information. Secondly, we introduced two methods for the estimation of the duration of treatment with specific drug(s) in the presence of the aforementioned challenges. The sequentially ordered, case-by-case rules that we developed were presented mathematically. To the best of our knowledge, no robust algorithmic approach has yet been reported to evaluate treatment duration with individual medications in a multiple-treatment scenario [22, 27].
We have described two algorithmic approaches to estimate treatment duration at the individual record level. The first method ("chaining") relies on specific chaining fields of medication information, while the second approach ("continuous") does not use chain-related information and employs only chronological record information instead. Our results on the large Centricity EMR database show that the two approaches do not produce significantly different results on average at the population level. However, when examined in detail, the "chaining" method could identify treatment alterations longitudinally and was shown to be more robust at the individual patient level. Furthermore, treatment duration estimates from the "continuous" approach are more sensitive to the set of selected medications. The difference between the methods is particularly prominent in studies involving multiple drugs, as opposed to single-drug therapies, or focusing on the order of treatment initiation [48, 49].
Our study highlighted the potential risk of underestimating the duration of treatment when EMR data are used directly, due to erroneous or incomplete data arising from omissions in the data entry process, appointments missed by patients, typographical errors, and numerous other causes. Both proposed algorithms handle these challenges robustly whenever possible, estimating the values of missing or erroneous entries. Importantly, being rule based, the decisions of our algorithms are readily interpretable by humans and lend themselves to use by medical professionals not necessarily proficient in data mining and related disciplines. Both approaches use the two fact datasets available in the Centricity EMRs; however, the algorithms are easily adjusted when only one dataset is available.

CONCLUSION
This study discusses the challenges in exploring the prescription/medication patterns of individual patients in large primary/ambulatory care electronic databases, and introduces two algorithmic approaches for robust estimation of treatment duration with individual drug(s). We have demonstrated that using the chaining fields of medication information further improves the quality of the estimates. Given the importance of extracting medication information appropriately in pharmaco-epidemiological studies based on real-world data, the proposed algorithms have the potential to contribute significantly to the analytical quality of future EMR-based clinical and epidemiological studies.

ETHICS APPROVAL AND CONSENT TO PARTICIPATE
Not applicable.

HUMAN AND ANIMAL RIGHTS
No animals/humans were used for the studies that are the basis of this research.

CONSENT FOR PUBLICATION
Not applicable.

CONFLICT OF INTEREST
Sanjoy K. Paul (SKP) has acted as a consultant and/or speaker for Novartis, GI Dynamics, Roche, AstraZeneca, Guangzhou Zhongyi Pharmaceutical and Amylin Pharmaceuticals LLC. He has received grants in support of investigator and investigator initiated clinical studies from Merck, Novo Nordisk, AstraZeneca, Hospira, Amylin Pharmaceuticals, Sanofi-Aventis and Pfizer. Olga Montvida (OM) and Ognjen Arandjelović (OA) have no conflict of interest to declare. Edward Reiner (ER) was an employee of Quintiles and was responsible for the strategic development of the Centricity EMR database.

1. Merge the following to the MF dataset by patient key:
1.1) date of birth $DOB = (db_1, db_2, \ldots, db_n)^T$;
1.2) last available follow-up date $L = (l_1, l_2, \ldots, l_n)^T$.
The extended MF dataset would be of the form.

7. Merge the following to the PF set by patient key:
7.1) date of birth $DOB = (db_1, db_2, \ldots, db_k)^T$;
7.2) last available follow-up date within the database $L = (l_1, l_2, \ldots, l_k)^T$.
The extended PF dataset would take the following form:
8. Replace erroneous prescription dates ($b_i \notin V \wedge (b_i < db_i \vee b_i > l_i)$, $i = 1, \ldots, k$) with missing values.
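The date consistency rule in Step 8 can be read as: a recorded date is erroneous if it falls before the patient's date of birth or after the last available follow-up date. A minimal sketch of that reading follows; the function name is hypothetical and `None` stands in for a missing value.

```python
from datetime import date
from typing import Optional


def validate_date(
    b: Optional[date],
    dob: Optional[date],
    last_follow_up: Optional[date],
) -> Optional[date]:
    """Return the date unchanged if plausible, otherwise None (missing)."""
    if b is None:
        return None  # already missing, nothing to check
    if dob is not None and b < dob:
        return None  # dated before the patient's birth
    if last_follow_up is not None and b > last_follow_up:
        return None  # dated after the last available follow-up
    return b
```

Declaring implausible dates missing, rather than correcting them, leaves the later steps free to fill them in from the record creation dates.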

Table 2. Mathematical Formulation
Scalars
n: number of records in MF table
k: number of records in PF table
sd: standard prescription duration for individual drug
mx: maximal number of prescription refills for individual drug
u: number of unique patient keys in the cohort of interest
Sets
PS = {ps_1, ps_2, ..., ps_u}: set of unique patient keys in the cohort of interest
V: set of missing values
MS: set of medication keys of selected drug(s)
