Predicting Neutropenia Risk in Breast Cancer Patients from Pre-Chemotherapy Characteristics

A previous study (Pittman, Hopman, Mates) of breast cancer patients undergoing curative chemotherapy (CT) found that the third most common reason for emergency room (ER) visits and hospital admission (HA) was febrile neutropenia. Factors associated with ER visits and HA included (1) stage of the cancer, (2) size of tumor, (3) adjuvant versus neo-adjuvant CT ("adjuvance"), and (4) number of CT cycles. We hypothesized that a statistically significant predictor of neutropenia could be built from some of these factors, so that the neutropenia risk predicted for a patient feeling unwell during CT could be used in weighing the need to visit the ER. The number of CT cycles was not used as a factor, so that the predictor could calculate the neutropenia risk for a patient before the first CT cycle. Different models were built corresponding to different pre-chemotherapy factors or combinations of factors. The single factor yielding the best classification accuracy was tumor size (Matthews correlation coefficient = +0.18, Fisher's exact two-tailed probability P < 0.0374). The odds ratio of developing febrile neutropenia for the predicted high-risk group compared to the predicted low-risk group was 5.1875. Combining tumor size with adjuvance yielded a slightly more accurate predictor (Matthews correlation coefficient = +0.19, Fisher's exact two-tailed probability P < 0.0331, odds ratio = 5.5093). Based on the observed odds ratios, we conclude that a simple predictor of neutropenia may have value in deciding whether to recommend an ER visit. The predictor is fast enough to run conveniently as an applet on a mobile computing device.


INTRODUCTION
The present paper introduces a fast and efficient method to predict whether a breast cancer patient undergoing chemotherapy (CT) is at high risk of developing febrile neutropenia, using predictive factors available before the first chemotherapy cycle. The motivation for this work is a recent study [1] into factors associated with emergency room visits and hospital admissions in patients undergoing curative chemotherapy for breast cancer in the Southeast Ontario Local Health Integration Network (LHIN). These patients have higher rates of emergency room visits and hospital admissions than patients in other LHINs in Ontario, Canada. The study found that febrile neutropenia was the third most common cause of emergency room (ER) visits [1]. It also found that the only statistically significant factor associated with ER visits was the stage of the cancer, while the factors with statistically significant associations with hospital admissions (HA) were tumor size, chemotherapy type (namely adjuvant versus neoadjuvant), and the number of CT cycles [1]. A natural follow-up of this work is to develop a statistically significant predictor of neutropenia risk, which could preferentially earmark high-risk patients for increased surveillance. Previously, another study [2] of neutropenia prediction using first-cycle blood cell counts produced a FOS-3NN classifier that was extremely accurate in predicting neutropenic events. It had a Fisher's exact two-tailed probability of P < 0.00023 and a Matthews correlation coefficient of +0.83.
Building on the results from [1], we hypothesized that a statistically significant predictor can be developed to calculate the neutropenia risk of a patient undergoing chemotherapy, based on predictive factors available before the first CT cycle. Since the predictor was developed assuming that the patient had yet to undergo the first CT cycle, we did not use the number of CT cycles as one of the factors in the predictor because that information will not be available before the patient's CT.
The predictive model was developed in MATLAB. The goal is to predict whether a patient is at high risk of developing neutropenia based on the factors mentioned above, namely Stage of Cancer, Tumor Size, and CT Type ("Adjuvance"). Using some of these factors, we built a predictor of neutropenia risk based on a nearest neighbour classifier. A Leave-One-Out test protocol was employed, omitting the data of the patient under test from the training data for the predictive model. The model first found the mean value of each factor for the neutropenic and non-neutropenic training patients, then classified the left-out patient as neutropenic or non-neutropenic depending on which mean that patient's factor value was closest to in terms of standard deviations. For example, if tumor size was the only factor used, the mean size $m_1$ and standard deviation $s_1$ for the neutropenia patients, and the mean size $m_2$ and standard deviation $s_2$ for the non-neutropenia patients, were calculated without including the data for the test patient. The test patient was then classified according to whether that patient's tumor size was closer, in number of standard deviations, to $m_1$ or $m_2$. If two or more factors were used, we summed the number of standard deviations over the factors, then chose the class for which the sum was smallest. This algorithm places little load on a computer's processor and produces the standard-deviation-weighted sums needed for classifying a test patient.

METHOD
The program operates by first importing the Breast Cancer Patient Database from an Excel spreadsheet into MATLAB. The database contains data for 149 patients, of whom 9 are neutropenic and 140 are non-neutropenic. We could not use one non-neutropenic patient from the database because of missing data, leaving 148 patients for analysis. Each patient's data consist of information such as age, gender, and dates of CT cycles and, more importantly, the three characteristics (factors) the predictor uses to predict the risk of developing neutropenia, namely Tumor Size, Chemotherapy Type, and Stage of Cancer. In the database, each type, size, and stage is assigned an integer value as in Table 1.
The program then separates the training data into two classes. One class contains the neutropenic patients; the other contains the non-neutropenic patients.

Nearest Neighbour Classifier Algorithm and Leave-One-Out Protocol
Once the data have been separated, each patient's classification is determined using a nearest neighbour classifier algorithm and a Leave-One-Out protocol. The Leave-One-Out protocol means that the data of the patient being classified are not used when the mean and standard deviation for each factor are calculated. The algorithm finds the arithmetic mean value of each of the three factors for both classes using the equation

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i \qquad (1)$$

Here $\bar{x}$ is the mean of a factor x for a class, N is the number of patients in the class (not including the test patient being classified), and $x_i$ is the value of the i-th patient's factor x. For example, if x denotes tumor size, then the mean tumor size is calculated for the neutropenic patients class and again for the non-neutropenic patients class, without including the test patient in either calculation.
Once the arithmetic mean is found, the magnitude of one standard deviation for each factor is calculated using the equation

$$s_x = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} \left(x_i - \bar{x}\right)^2} \qquad (2)$$

Here $s_x$ is the magnitude of one standard deviation for a factor x for a class, N is the number of patients in the class (not including the patient being classified), $\bar{x}$ is the mean for the factor x in the class, and $x_i$ is the value of the factor x for patient i. Note that $s_x$ is calculated for the neutropenic patients class and again for the non-neutropenic patients class.
The program then calculates the distance from the mean, in standard deviations, for the test patient k under classification, for each factor x in both classes using the equation

$$d_x = \frac{\left|x_k - \bar{x}\right|}{s_x} \qquad (3)$$

Here $x_k$ is the value of the factor x for the test patient k under classification, $\bar{x}$ is the arithmetic mean of the values of factor x in one of the classes, $s_x$ is the magnitude of one standard deviation for factor x in that class, and $d_x$ is the distance from the mean for the factor x in that class. Thus, for the test patient k, Eq. (3) was used to calculate the distance from the mean for factor x in the neutropenia patients class and again in the non-neutropenia patients class, without including patient k in calculating $\bar{x}$ and $s_x$. The test patient k was then predicted to belong to the class for which $d_x$ was smaller.
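The single-factor procedure described above (class mean, class standard deviation, distance in standard deviations, Leave-One-Out) can be sketched in Python as follows. This is a minimal illustration, not the authors' MATLAB program: the function names and the toy factor values are hypothetical, the sample standard deviation (N − 1 denominator, as computed by `statistics.stdev`) is an assumption, and so is the handling of ties.

```python
from statistics import mean, stdev


def distance_in_sds(value, class_values):
    """Distance of `value` from the class mean, in class standard deviations."""
    # Assumption: sample standard deviation (N - 1 denominator).
    return abs(value - mean(class_values)) / stdev(class_values)


def leave_one_out_predictions(values, labels):
    """Classify each patient using only the other patients' data.

    values: one factor value per patient (for example a tumor size code)
    labels: True for neutropenic, False for non-neutropenic
    Returns a list of booleans, True meaning "predicted neutropenic".
    """
    predictions = []
    for k, x_k in enumerate(values):
        # Leave patient k out of both classes before computing mean and SD.
        neutropenic = [x for i, x in enumerate(values) if labels[i] and i != k]
        others = [x for i, x in enumerate(values) if not labels[i] and i != k]
        d_neutropenic = distance_in_sds(x_k, neutropenic)
        d_others = distance_in_sds(x_k, others)
        # Assumption: ties go to the non-neutropenic class.
        predictions.append(d_neutropenic < d_others)
    return predictions


# Hypothetical factor values, chosen only to illustrate the mechanics.
toy_values = [10, 11, 12, 10, 1, 2, 3, 1, 2]
toy_labels = [True, True, True, True, False, False, False, False, False]
toy_predictions = leave_one_out_predictions(toy_values, toy_labels)
print(toy_predictions)
```

Because the class statistics are recomputed for every left-out patient, each prediction is made without access to that patient's own value.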

Classification Based on Distance From Mean
As noted, once the distance from the mean $d_x$ is determined for each factor for both classes for a test patient, that patient is classified as neutropenic or non-neutropenic depending on which mean the patient's factor value is closest to in terms of standard deviations. Different combinations of factors were used for classifying the patient: a single factor was used, or the corresponding values of $d_x$ for two factors were added or multiplied together, or the corresponding values of $d_x$ for three factors were added or multiplied together.
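The additive and multiplicative combinations described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names, the `mode` parameter, and the toy factor codes are hypothetical, and the sample standard deviation (N − 1 denominator) is assumed.

```python
from math import prod
from statistics import mean, stdev


def distance_in_sds(value, class_values):
    """Distance from the class mean in (assumed sample) standard deviations."""
    return abs(value - mean(class_values)) / stdev(class_values)


def combined_distance(patient, class_rows, mode="sum"):
    """Combine per-factor distances for one class.

    patient:    tuple of factor values for the test patient
    class_rows: list of tuples, one per training patient in the class
    mode:       "sum" adds the distances, "product" multiplies them
    """
    per_factor = [
        distance_in_sds(value, [row[j] for row in class_rows])
        for j, value in enumerate(patient)
    ]
    return sum(per_factor) if mode == "sum" else prod(per_factor)


def classify(patient, neutropenic_rows, other_rows, mode="sum"):
    """Assign the class with the smaller combined distance."""
    d_neutropenic = combined_distance(patient, neutropenic_rows, mode)
    d_other = combined_distance(patient, other_rows, mode)
    return "neutropenic" if d_neutropenic < d_other else "non-neutropenic"


# Hypothetical two-factor codes (not patient data); the test patient is
# assumed to be excluded from both training lists already.
neutropenic_rows = [(3, 1), (2, 2), (3, 2)]
other_rows = [(1, 1), (2, 2), (1, 2), (1, 1)]

print(classify((3, 2), neutropenic_rows, other_rows, mode="sum"))
print(classify((1, 1), neutropenic_rows, other_rows, mode="product"))
```

One consequence of the multiplicative form is that a single factor with zero distance drives the whole product to zero, whereas in the additive form every factor contributes.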
Since the earlier study [1] had identified tumor size, chemotherapy type (adjuvant vs. neoadjuvant), and number of CT cycles as statistically associated with HA, the first two factors were tried both alone and together as predictors of neutropenia risk. In our nearest neighbour classifiers, tumor size was the best single predictor of neutropenia risk (Matthews correlation coefficient = +0.18, Fisher's exact two-tailed probability P < 0.0374). The odds ratio of developing febrile neutropenia in the predicted high-risk group was 5.1875 relative to the predicted low-risk group; this is statistically significant, since the 95% confidence interval [1.0394, 25.8907] does not contain 1. Combining tumor size with adjuvance slightly improved accuracy (Matthews correlation coefficient = +0.19, Fisher's exact two-tailed probability P < 0.0331). The odds ratio of developing febrile neutropenia in the predicted high-risk group rose to 5.5093 relative to the predicted low-risk group, with a 95% confidence interval [1.1033, 27.509] that again does not contain 1. For completeness, we also explored all three factors and factor combinations, as summarized in Table 2.
Once the predictor had classified each patient, as seen in Table 3, the results were entered into a 2x2 contingency table (VassarStats [6]) to find Fisher's exact two-tailed probability, the Matthews correlation coefficient, and the odds ratio. From these we determined the feasibility of using the factor combinations to predict the risk of a neutropenic event.
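The statistics computed from the 2x2 contingency table can be sketched in Python as follows. The counts used here are not copied from Table 3. They are inferred from figures stated in the text for the tumor size predictor (148 usable patients, 9 neutropenic of whom 2 were misclassified, and a predicted low-risk group of 85 patients) and should be read as an assumption. The Wald interval with z = 1.96 and the two-tailed rule (summing all tables no more probable than the observed one) are likewise assumptions about how such values are commonly computed.

```python
from math import comb, exp, log, sqrt


def odds_ratio(tp, fp, fn, tn):
    """Odds of neutropenia in the predicted high-risk vs. low-risk group."""
    return (tp * tn) / (fp * fn)


def matthews_cc(tp, fp, fn, tn):
    """Matthews correlation coefficient of a 2x2 table."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom


def odds_ratio_ci(tp, fp, fn, tn, z=1.96):
    """Wald confidence interval for the odds ratio, built on the log scale."""
    se = sqrt(1 / tp + 1 / fp + 1 / fn + 1 / tn)
    centre = log(odds_ratio(tp, fp, fn, tn))
    return exp(centre - z * se), exp(centre + z * se)


def fisher_two_tailed(tp, fp, fn, tn):
    """Fisher's exact test: sum of all tables no more probable than observed."""
    n = tp + fp + fn + tn
    high_risk = tp + fp      # row total: predicted high risk
    neutropenic = tp + fn    # column total: actually neutropenic
    total = comb(n, neutropenic)

    def prob(k):
        # Hypergeometric probability of k neutropenic patients in the high-risk row.
        return comb(high_risk, k) * comb(n - high_risk, neutropenic - k) / total

    p_observed = prob(tp)
    k_min = max(0, neutropenic - (n - high_risk))
    k_max = min(high_risk, neutropenic)
    return sum(
        prob(k) for k in range(k_min, k_max + 1)
        if prob(k) <= p_observed * (1 + 1e-9)  # tolerance for float comparison
    )


# Inferred (assumed) counts for the tumor size predictor:
# tp = neutropenic and predicted high risk, fn = neutropenic and predicted low risk,
# fp = non-neutropenic and predicted high risk, tn = non-neutropenic and predicted low risk.
counts = dict(tp=7, fp=56, fn=2, tn=83)

print("odds ratio:", odds_ratio(**counts))
print("MCC:", round(matthews_cc(**counts), 2))
print("95% CI:", odds_ratio_ci(**counts))
print("Fisher two-tailed P:", fisher_two_tailed(**counts))
```

With these assumed counts the sketch gives an odds ratio of 5.1875 and a Matthews correlation coefficient of about 0.18, in line with the values reported for tumor size.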

SIGNIFICANCE OF RESULTS
As shown in Table 4, the most statistically significant single factor predicting neutropenia is tumor size, and the best factor combination is CT Type + Tumor Size. Both have a Fisher's exact two-tailed probability P less than 0.04 and an odds ratio above 5.1. Tumor size and the combination CT Type + Tumor Size have the highest Matthews correlation coefficients, +0.18 and +0.19 respectively. This result agrees with the conclusion of ref. [1]. As mentioned above, that study concluded that Tumor Size and CT Type were the two most significant factors associated with hospital admissions (among factors available before CT). It should be noted that every statistically significant result had exactly the same two (and only two) neutropenia patients misclassified.
A significant pattern was found when using the factor CT Type by itself, or multiplied by Tumor Stage, Tumor Size, or both. All patients with neutropenia were taking adjuvant therapy (equivalently, they have a 1 in their CT Type column). By Eq. (1), the mean of the CT Type factor was therefore 1 for the neutropenic class and, since there was no variation, by Eq. (2) the standard deviation was 0 for that class. As a result, any patient who took adjuvant therapy was classified as neutropenic and any patient undertaking neo-adjuvant therapy was classified as non-neutropenic. The classification results for CT Type, CT Type X Tumor Stage, CT Type X Tumor Size, and CT Type X Tumor Stage X Tumor Size thus all reflect how many patients took adjuvant therapy and how many took neo-adjuvant therapy. The odds ratio for that factor and those factor combinations is also infinite, again because every patient with neutropenia was taking adjuvant therapy.
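The zero-variance case described above can be illustrated with a short Python sketch. The text does not say how a division by a zero standard deviation was handled in the program, so the convention used here (distance 0 when the value equals the class mean, infinite otherwise) is an assumption, as are the function names, the code 2 for neo-adjuvant therapy, and the made-up class sizes.

```python
import math
from statistics import mean, stdev


def distance_in_sds(value, class_values):
    """Distance from the class mean in standard deviations, guarding SD = 0."""
    m = mean(class_values)
    s = stdev(class_values)
    if s == 0:
        # Assumed convention, not stated in the text: a value equal to the
        # class mean is at distance 0; any other value is infinitely far away.
        return 0.0 if value == m else math.inf
    return abs(value - m) / s


ADJUVANT = 1      # coding stated in the text
NEOADJUVANT = 2   # hypothetical code, for illustration only

# Hypothetical training classes (test patient already left out).
neutropenic_ct = [ADJUVANT] * 8                 # no variation, so SD = 0
other_ct = [ADJUVANT] * 3 + [NEOADJUVANT] * 2   # made-up mix of both types


def classify_by_ct(ct_code):
    """Classify on CT Type alone by comparing distances to the two class means."""
    d_neutropenic = distance_in_sds(ct_code, neutropenic_ct)
    d_other = distance_in_sds(ct_code, other_ct)
    return "neutropenic" if d_neutropenic < d_other else "non-neutropenic"


print(classify_by_ct(ADJUVANT))
print(classify_by_ct(NEOADJUVANT))
```

Under the same assumed convention, a zero CT Type distance makes any product of distances zero for the neutropenic class, which is consistent with the multiplicative combinations involving CT Type all giving the same classification.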

Speed of Results
The predictor takes 25-30 seconds to run on a first-generation Core i3-370M processor. Table 5 below shows MATLAB's profile data, which records how long the CPU takes to process each line of code. The profile shown is for the program classifying patients using all the additive factor combinations. The important thing to note is that MATLAB spends 88.2% of the time in the function "xlswrite", which writes the output of the predictor to an Excel file, while 6.9% of the CPU time is spent in the xlsread function, which loads the Breast Cancer Patient Data from an Excel file into MATLAB's workspace. The actual time spent creating the model and classifying all the patients in the database is 4.9%, or 1.32 seconds. If this model were to be used in practical applications, a more efficient means of reading from and writing to a database would be implemented, as MATLAB is known to be slow when importing and exporting data.

DISCUSSION
This paper demonstrates that a nearest neighbour classifier can be used to achieve statistically significant prediction of neutropenia risk. The results of the present study show that the most statistically significant factor combination associated with neutropenia is CT Type combined with Tumor Size, followed closely by Tumor Size alone. The study also corroborates the finding of the earlier study [1] that a patient with a tumor size of T2 was more likely to be admitted to hospital. Finally, it highlights that, in this particular group, only patients undertaking adjuvant therapy were at risk of developing neutropenia. No patient receiving neo-adjuvant therapy had been diagnosed with neutropenia, but this observation is limited by the small number of patients in that group. Additionally, patients' comorbidities were not significant predictors of febrile neutropenia in this small cohort, but this merits further research.

CONCLUSION
Based on our observed odds ratios of over 5 for developing neutropenia in the predicted high-risk group relative to the predicted low-risk group, we conclude that a simple predictor of neutropenia may have value in deciding whether to recommend an ER visit. A possible use of the predictor is that, when it places a patient in the non-neutropenia group, the probability that the patient will develop neutropenia is apparently small (about 2/85 in the present study), so the predictor appears to have good negative predictive value. The success of this neutropenia prediction must be confirmed in a larger independent data set. In addition, in view of ref. [2], first-cycle blood count data should be combined with the predictive factors used in the present study to increase the sensitivity and specificity of recognizing high-risk patients.
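As a worked check of the negative predictive value (NPV), assuming the 2/85 figure means that 2 of the 85 patients predicted low-risk developed neutropenia (so that 83 did not), and writing TN and FN for the true and false negatives as illustrative symbols:

```latex
\[
\mathrm{NPV} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FN}} = \frac{83}{85} \approx 0.976,
\qquad
1 - \mathrm{NPV} = \frac{2}{85} \approx 0.024 .
\]
```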
The processing time for the classifier algorithm is less than 1.5 seconds when classifying all 148 patients. In a practical setting only one patient will be classified at a time, so the processing time will be approximately a hundredth of a second.

Table 1. Tumor Size, CT Type, and Stage, their respective subtypes, and the corresponding number value in the database.
Column headings: Number Value in Database; Tumor Size (Size of Primary Tumor); CT Type (Pre-Surgical Chemotherapy vs. Post-Surgical Chemotherapy); Stage (Extent of Disease in Terms of Spread and/or Size).
Note: T3, T4 indicate size and/or extent of tumor; the higher the number, the larger the tumor. Definitions of cancer pathology and treatment notations and terms are summarized from the American Cancer Society and National Cancer Institute websites [3][4][5].