Haplotype Classification Using Copy Number Variation and Principal Components Analysis

Elaborate downstream methods are required to analyze large microarray data-sets. When the end goal is to look for relationships between (or patterns within) different subgroups, or even just individual samples, large data-sets must first be filtered using statistical thresholds in order to reduce their overall volume. As an example, in anthropological microarray studies, such 'dimension reduction' techniques are essential to elucidate any links between polymorphisms and phenotypes for given populations. In such large data-sets, a subset can first be taken to represent the larger data-set, in the same way that polling results taken during elections are used to infer the opinions of the population at large. However, what is the best and easiest method of capturing a subset of variation in a data-set that can represent the overall portrait of variation? In this article, principal components analysis (PCA) is discussed in detail, including its history, the mathematics behind the process, and the ways in which it can be applied to modern large-scale biological data-sets. New methods of analysis using PCA are also suggested, with tentative results outlined.


INTRODUCTION
Principal components analysis and other multivariate tools are used to analyze large volumes of data in order to tease out the differences/relationships between the logical entities being analyzed (for example, a data-set consisting of a large number of samples, each with its own data points/variables) [1]. PCA extracts the fundamental structure of the data without the need to build any model to represent it [2]. This 'summary' of the data is arrived at through a process of reduction that transforms the large number of variables into a smaller number that are uncorrelated (i.e. the 'principal' components), whilst remaining easy to interpret in terms of the original data [3,4].
Principal components analysis has broad applications and is used in a wide range of areas. Examples include craniofacial recognition [5], analysis of water quality [3], and the derivation of a set of highly confident genes [6] or single nucleotide polymorphisms (SNPs) [7,8] for classification purposes. It has also been used in subject areas such as climatology, geology, meteorology, psychology, quality control [4], forensics and population genetics (particularly in relation to SNPs), medical genetics [2], and bacteriology [9]. It can also help in the identification of subgroups within samples by visually scanning the resulting bi-plot created to represent the data [10]. There has also been notable success in applying PCA to protein data-sets. Du [11] successfully adapted and applied PCA to protein data in the form of Amino Acid PCA (AAPCA), where the aim was to classify proteins into structural classes; meanwhile, Li [12] combined PCA with continuous wavelet transform (CWT) to also successfully predict protein structural classes; Zhao [13] also used PCA to help predict protein-protein interaction (PPI) networks by using this method to first derive an optimised subset and then using this subset as input to a support vector machine (SVM). Chou [14] also outlines 'pseudo-amino acid composition' as a means of managing and using the large amount of protein sequence that is currently held in public repositories. With pseudo-amino acid composition and PCA, patterns in protein sequences can be found, which can then be used to infer the cellular attributes of the corresponding proteins. Pseudo-amino acid composition and PCA have also been employed by Liu [15].
*Address correspondence to this author at the Sheffield Children's NHS Foundation Trust, Western Bank, Sheffield, S10 2TH, United Kingdom; Tel: +44 7500 190333; E-mail: kevinblighe@gmail.com
In PCA experiments, two different approaches can be taken: 1) looking at relationships between variables; and 2) looking at relationships between samples. If only two variables are involved, then a simple linear correlation analysis can be employed. However, having numerous variables prevents this [3].

Post-PCA Analysis
After carrying out PCA and deriving the principal components, there is no standard way of choosing how many components to include or exclude in end-analysis, a fact that is probably related to the broad spectrum of analyses on which PCA is carried out. The end goal of the study is of critical importance: if you wanted to determine the variables that defined differences between samples, then you would observe the first few principal components (or even just the first); if you wanted to determine the variables that were common across samples, then you would observe the last few. In the former case, components could be retained until their combined percentage of variance reached a pre-determined level (generally ≥70%) [4,16]. They could also be chosen using 'Kaiser's rule' [17], which states that all principal components with an eigenvalue greater than 1 should be retained, or by inspecting a Scree plot [18], which shows a line-graph of the eigenvalues of each principal component (see Appendix for detailed information on how principal components are derived).
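The two numerical retention rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used in this work: the function names and the eigenvalues are hypothetical, and Kaiser's rule is assumed to be applied to eigenvalues from a correlation matrix, where the average eigenvalue is 1.

```python
def n_components_by_variance(eigenvalues, threshold=0.70):
    """Smallest number of leading components whose cumulative
    share of the total variance reaches the threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(ordered)


def n_components_by_kaiser(eigenvalues):
    """Number of components with an eigenvalue greater than 1."""
    return sum(1 for value in eigenvalues if value > 1.0)


# Hypothetical eigenvalues, for illustration only.
eigenvalues = [4.0, 2.5, 1.5, 1.0, 0.6, 0.4]
kept_by_variance = n_components_by_variance(eigenvalues)
kept_by_kaiser = n_components_by_kaiser(eigenvalues)
```

The two rules need not agree on real data, which is one reason a Scree plot is often inspected alongside them.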
After deriving the principal components, three methods for arriving at a subset of variables/markers for end-analysis can be pursued. The first is to observe the resulting PCA bi-plot, remove a pre-determined number of markers, and then re-create the bi-plot to see if the original structure and variance are still visually/graphically represented. If the same structure is present, then further markers are removed and the process is repeated until a manageable subset of markers has been chosen. In essence, this method involves comparing the principal components derived from subsets of markers to those of the full set, but key markers could be lost in the process. The second is to choose only markers that have high correlation coefficients to each of the generated eigenvectors of the chosen principal components and then search for overlap between them [4]. There is no standard cut-off value for the correlation coefficients, but Mahloch [19] indicates that a value larger than 0.6 (r²) is sufficient, while a coefficient larger than 0.8 is regarded as good. Correlation coefficients close to 0 indicate that the marker is not contributing significantly to variance and is common across samples [2,20].
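The second method can be sketched as a set intersection. The sketch below assumes that the cut-off is applied to the absolute correlation coefficient on each chosen component; the marker names, loadings, and function name are hypothetical.

```python
def markers_above_cutoff(loadings, components, cutoff=0.6):
    """Return the markers whose absolute correlation meets the
    cut-off on every one of the chosen components (the overlap)."""
    selected = None
    for c in components:
        hits = {m for m, row in loadings.items() if abs(row[c]) >= cutoff}
        selected = hits if selected is None else selected & hits
    return selected if selected is not None else set()


# Hypothetical marker-to-component correlations (two components).
loadings = {
    "cnv_a": [0.90, 0.70],
    "cnv_b": [0.80, 0.10],
    "cnv_c": [-0.65, -0.75],
    "cnv_d": [0.05, 0.02],
}
overlap = markers_above_cutoff(loadings, components=[0, 1])
```

Here `cnv_b` passes on the first component only and `cnv_d` on neither, so both drop out of the overlap.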
A third method of selecting the markers to include in end-analysis is to first perform a rotation of the derived principal components. There are a number of ways to do this, including the orthogonal varimax, quartimax, and equamax rotations, and the oblique promax rotation. The rotation of principal components is performed to increase the accuracy of the relationship/correlation of the original markers to the newly-derived eigenvectors (principal components). As a result, this also serves to maximize the differences between each eigenvector [3,7], in some respects resembling the complete linkage method in hierarchical clustering.
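As an illustration of what a rotation optimises, the raw varimax criterion can be written as follows. The notation is introduced here and is not taken from the cited works: a_{ij} is the loading of marker i on component j, p is the number of markers, and k is the number of rotated components.

```latex
V = \sum_{j=1}^{k} \left[ \frac{1}{p} \sum_{i=1}^{p} a_{ij}^{4}
    - \left( \frac{1}{p} \sum_{i=1}^{p} a_{ij}^{2} \right)^{2} \right]
```

Varimax seeks the orthogonal rotation that maximises V, i.e. the variance of the squared loadings within each component, so that each marker tends to load strongly on few components and weakly on the rest.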

Limitations of Principal Components Analysis
There are some limitations to employing PCA as the sole analytical tool for large data-sets. For example, PCA is a linear transformation; thus, any data-set that is non-linear will not be represented sufficiently after data-reduction. In addition, PCA assumes that the directions with the largest variances are of most interest [21], which might not necessarily be true. It also follows the assumption that the original data variables are correlated; if they are not, then PCA cannot reduce the data [21,22].

Case Study: Deriving a Genetic Signature
A method used to minimally characterize individual samples from large data-sets was previously employed by finding haplotype-tagging SNPs (htSNPs), i.e. a reduced list of SNP markers that captured the majority of haplotype diversity in a population [8], whilst Horne and Camp [23] devised a separate method that aimed to find SNPs in linkage disequilibrium (LD), known as group-tagging SNPs (gtSNPs). The original htSNP method, which was devised by Meng [7], was only applied to analyze certain regions of the genome. Additionally, redundancy could still exist using this approach if, for example, 2 or more of the derived htSNPs mapped to regions affected by LD [2,24], which can encompass segments ranging in size from 1 to 100 kb on the same chromosome and can also be used for classification purposes [25]. If two alleles are in LD with each other, measuring the value of one can reveal, with a certain level of confidence, that of the other due to their high correlation [24]. LD can arise for different reasons, such as selection, favorable mutations, population mixture and migration, et cetera; if measured around known genes, it can help in the elucidation of those that are involved in disease [25,26]. It is more common around genes that are 'rare' or recently evolved, as these would have had less time for recombination to break the disequilibrium in question [25].
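For reference, the strength of LD between two loci is commonly summarised by the coefficient D and by the squared correlation r². The notation below is introduced for illustration and is not taken from the cited methods: p_A and p_B are the frequencies of allele A at the first locus and allele B at the second, and p_{AB} is the frequency of the haplotype carrying both.

```latex
D = p_{AB} - p_{A}\,p_{B},
\qquad
r^{2} = \frac{D^{2}}{p_{A}\,(1 - p_{A})\,p_{B}\,(1 - p_{B})}
```

When the two loci are independent, D = 0; an r² close to 1 means that the allele at one locus predicts the allele at the other almost perfectly, which is the source of the redundancy described above.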
Lin & Altman [8] found that using this method by Meng [7] could produce a list of htSNPs that achieved a 90% reconstruction precision of each observed haplotype. The process involved the generation of eigenvectors (the synonymous term 'eigenSNPs' was coined by Meng [7]) and then reverse-mapping these to original SNPs to arrive at a minimal set that could define maximum diversity/variance. Single nucleotide polymorphisms are useful in this regard because they are regarded as the most common type of genetic variation in the human genome [27].
The derivation of eigenSNPs has also been employed in population and anthropological studies. Present-day genetic variation in humans around the world has been shaped and strengthened by the history of migration patterns [2,28]. Contributing factors included mutation, genetic drift, and natural selection, each driven by an interaction with new climates, pathogens, etc. [2]. This prompted Paschou [2] to attempt to build a scoring system to assign an individual of unknown origin to a population group based on PCA and SNPs, with some success. However, their system was based on diverse population groups that already exhibited known differences/separation, both genetically and geographically, which made assignation to a particular clade/population easier. Also, they only had small sample sizes from each observed group.
The method has yet to be applied to other polymorphisms in the human genome. However, through the use of a high-density microarray that can scan the entire genome, applying the htSNP method could generate a genetic signature for a particular population group or even a disease state, depending on the design of the study. In the latter sense, it could potentially provide a minimum set of diagnostic and prognostic markers and assist in disease-type classification.
Thus, the aim of this work was to derive a set of haplotype-tagging copy number variants (htCNVs) amongst 128 female samples from the International HapMap that could be used for assignation and characterization purposes to deduce the origin of unknown samples in the future.

Principal Components Analysis
Principal components analysis (PCA) was performed as follows. In brief, principal components were determined using a covariance matrix method (used for mean-centred data) with normalized eigenvector scaling (see Supplementary Methods for detailed information on how principal components were derived). A Bonferroni-corrected p < 0.0001 for multiple comparisons was used to filter out non-significant markers before determining principal components. False discovery rate was not used as it was considered too lenient and unsuitable for the number of comparisons involved. The component loadings for each of the derived principal components were rotated using varimax rotation with Kaiser normalization [31] on a UNIX-based system using custom R [32] scripts.
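The pre-filtering step can be sketched as below. The text does not state exactly how the correction was applied, so this sketch assumes the common form in which each raw p-value is multiplied by the number of comparisons (capped at 1) and then compared against 0.0001; the marker names, p-values, and function name are hypothetical, and Python is used here purely for illustration in place of the R scripts mentioned above.

```python
def bonferroni_significant(p_values, alpha=0.0001):
    """Keep markers whose Bonferroni-adjusted p-value is below alpha.

    The adjusted p-value is the raw p-value multiplied by the number
    of comparisons, capped at 1.
    """
    m = len(p_values)
    return [name for name, p in p_values.items() if min(1.0, p * m) < alpha]


# Hypothetical raw p-values from a per-marker test.
p_values = {"m1": 1e-9, "m2": 5e-5, "m3": 0.03, "m4": 1e-6}
retained = bonferroni_significant(p_values)
```

Note that `m2` would pass an uncorrected threshold of 0.0001 but is removed once the four comparisons are taken into account.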

Haplotype-tagging CNV
For deriving the genetic signature of variation amongst the HapMap samples, the method according to Meng [7] was applied as follows. Using absolute values on all component loadings, the mean correlation coefficient for each marker to the first few principal components whose total variance accumulated to ≥70% was obtained. Then, the corresponding mean correlation coefficient for the marker was calculated for all remaining principal components. If the mean of a marker for the first set of components was greater than its corresponding mean for the remaining components, then the marker was included.
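One reading of the selection rule above is sketched here in Python. It is an illustration under stated assumptions, not the custom scripts used in this work: components are assumed to be ordered by decreasing variance, `explained` holds each component's proportion of total variance, and all names and values are hypothetical.

```python
def mean(values):
    """Arithmetic mean; 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def select_htcnv(loadings, explained, threshold=0.70):
    """Keep markers whose mean absolute loading on the leading
    components exceeds their mean absolute loading on the rest."""
    # Number of leading components needed to reach the threshold.
    cumulative, k = 0.0, 0
    for share in explained:
        cumulative += share
        k += 1
        if cumulative >= threshold:
            break
    selected = []
    for marker, row in loadings.items():
        magnitudes = [abs(value) for value in row]
        if mean(magnitudes[:k]) > mean(magnitudes[k:]):
            selected.append(marker)
    return selected


# Hypothetical loadings on four components and their variance shares.
explained = [0.50, 0.25, 0.15, 0.10]
loadings = {
    "cnv_a": [0.8, -0.6, 0.1, 0.05],
    "cnv_b": [0.1, 0.2, -0.7, 0.5],
    "cnv_c": [-0.4, 0.5, 0.3, -0.2],
}
ht_markers = select_htcnv(loadings, explained)
```

In this toy case the first two components reach 75% of the variance, so each marker's mean absolute loading on components 1 and 2 is compared with its mean on components 3 and 4.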
Pathway analysis and keyword/term-enrichment for genes was performed using DAVID [33,34].

RESULTS
On the Affymetrix SNP 6.0, 762,463 markers target known genes that are listed in the RefSeq gene database [29]. After pre-filtering for markers of difference amongst the 128 healthy female HapMap samples through a Bonferroni-corrected ANOVA, the number of markers for PCA was reduced to 5,896. The data generated through PCA were then channelled through the htCNV pipeline, which reduced the markers further to 4,594. This reduced number of markers covered a total of 2,866 genes that were significantly driving differences based on copy number between the four HapMap sub-population groups analyzed. A total of 1,893 of these genes were enriched for the UniProt [30] key-term 'sequence variant', whilst 1,838 were enriched for the keyword 'polymorphism'. In addition, 1,412 were genes that are expressed in the brain, while 456 are expressed in the epithelium.
Hierarchical clustering using this severely reduced number of 4,594 markers was capable of distinguishing the different sub-populations (Fig. 1). However, similarities were revealed between the CHB (Han Chinese in Beijing, China) and JPT (Japanese in Tokyo, Japan). The dendrogram also revealed that the YRI (Yoruba in Ibadan, Nigeria) were distinct from all other groups. The remaining samples from the International HapMap, which comprised 172 healthy male samples, were then added and clustered using the same panel of markers to again reveal sub-groups based on both ethnicity and sex (Fig. 2).

DISCUSSION AND CONCLUSION
Thus, the htCNV method, through PCA, is capable of defining a reduced number of markers for characterization purposes in a large sample cohort. Moreover, it is robust in the sense that new samples can be prospectively added to the cohort using these markers and still be classified correctly by both population group and sex. It is reasonable to suggest that this method could be applied to other, larger data-sets and used to derive panels of markers for diagnostic purposes. For example, in metabolomic studies, it could be used to drastically reduce the large (and sometimes incoherent) number of variables to a select few that carry real meaning in distinguishing a healthy state from a disease state. Although the derived, reduced set of markers cannot provide 100% confidence that the genomic loci that they target are indeed capable of classifying the population groups studied, and further work in the wet-laboratory would be required to prove this, it is my belief that the results herewith are evidence of the sound computational methodology employed. Indeed, such a method had not previously been applied to copy number markers in the human genome; thus, the results show how copy number loci can equally provide for haplotype classification along with other polymorphism-types in the human genome.

CONFLICT OF INTEREST
The author confirms that this article content has no conflicts of interest.

ACKNOWLEDGEMENT
I would like to thank Bentham Science for their great support during the peer-review and publication process.

APPENDIX

Principal Components Analysis
The process of reducing data using PCA involves the following steps: 1. subtract the mean; 2. build the covariance matrix; 3. calculate eigenvectors and eigenvalues; 4. transform the original data; 5. end-analysis. The first step in PCA is to subtract the mean of each data-set from each data-point in the set. For example, if we had data-sets X, Y, and Z, then from each value in X, Y, and Z, we would subtract the mean of each data-set, respectively. This would result in the data-sets having means of 0. The covariance matrix is then built using the mean-subtracted data-sets (note: if the variables were measured in different units, then a correlation matrix should be employed, whereas if the variables were in the same units and are mean-centred, then a covariance matrix should be employed). Thirdly, eigenvectors (a) and eigenvalues (e) for the built covariance matrix are determined. Eigenvectors characterize the data using straight, orthogonal lines and each is scaled with an eigenvalue. Eigenvalues indicate the amount of variance that a component contains (similar to the percentage of variance). The eigenvector is the direction cosine of the axis of the principal component, such that:

$$Z_v = a_v^{T} x = \sum_{i=1}^{n} a_{vi}\, x_i$$

Where: n = number of original variables; Z_v = v-th principal component; a_v = v-th eigenvector; x = vector of the original variables.

As mentioned, it also holds true that the variance of a principal component is equivalent to its corresponding eigenvalue:

$$\mathrm{var}(Z_v) = e_v$$

Where: Z_v = v-th principal component; e_v = v-th eigenvalue.

Fig. (2). By adding 172 male samples and using the same markers as per Fig. 1, the htCNV pipeline is also shown to be capable of distinguishing between both sex and sub-population group within the International HapMap.
The eigenvectors are then ordered by their eigenvalue (highest to lowest), indicating the level of significance to the data-set.
Transformation of the original data then occurs by multiplying a matrix containing the derived eigenvectors by one containing the mean-subtracted original data. The result is a matrix with the same dimensions as that of the mean-subtracted data but whose values have been transformed. The original data can then be said to be expressed with regard to the patterns within it. Once the data has been transformed, end-analysis can be performed, which can involve the selection of significant data-points and eigenvectors; it can also involve viewing the data on bi-plots.
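Steps 1 to 4 can be sketched end to end in plain Python. This is a minimal sketch under stated assumptions rather than the implementation used in this work: the data values are hypothetical, the sample covariance (n - 1 denominator) is assumed, and the eigenvectors are found by power iteration with deflation, a simple technique chosen here only to avoid external libraries.

```python
import math


def mean_centre(columns):
    """Step 1: subtract each variable's mean from its own values."""
    centred = []
    for col in columns:
        mean = sum(col) / len(col)
        centred.append([v - mean for v in col])
    return centred


def covariance_matrix(centred):
    """Step 2: sample covariance between every pair of variables."""
    n = len(centred[0])
    return [[sum(a * b for a, b in zip(x, y)) / (n - 1) for y in centred]
            for x in centred]


def eigen_decompose(matrix, iterations=200):
    """Step 3: (eigenvalue, unit eigenvector) pairs of a symmetric
    matrix, largest first, by power iteration with deflation."""
    size = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _ in range(size):
        vec = [1.0 / (i + 1) for i in range(size)]  # arbitrary start
        for _step in range(iterations):
            nxt = [sum(work[r][c] * vec[c] for c in range(size))
                   for r in range(size)]
            norm = math.sqrt(sum(v * v for v in nxt))
            if norm == 0.0:
                break
            vec = [v / norm for v in nxt]
        length = math.sqrt(sum(v * v for v in vec))
        vec = [v / length for v in vec]
        product = [sum(work[r][c] * vec[c] for c in range(size))
                   for r in range(size)]
        value = sum(v * p for v, p in zip(vec, product))
        pairs.append((value, vec))
        # Deflate: remove the component just found from the matrix.
        for r in range(size):
            for c in range(size):
                work[r][c] -= value * vec[r] * vec[c]
    return pairs


def project(centred, eigenvectors):
    """Step 4: multiply the eigenvectors by the mean-centred data;
    one row of scores per component, one column per sample."""
    n = len(centred[0])
    return [[sum(vec[i] * centred[i][j] for i in range(len(vec)))
             for j in range(n)] for vec in eigenvectors]


# Hypothetical data: two variables measured on ten samples.
x = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
y = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

centred = mean_centre([x, y])
cov = covariance_matrix(centred)
pairs = eigen_decompose(cov)
scores = project(centred, [vec for _, vec in pairs])
```

The variance of the scores on each component equals that component's eigenvalue, and the eigenvalues sum to the total variance of the original variables, which is what allows each component to be assigned a percentage of variance.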
The transformed data contains the principal components and each is assigned a percentage that corresponds to the amount of total variance in the data towards which the component contributes. The relationship of the original variables to the new components is represented by correlation (r) values that are scaled between -1 and +1. If there are variables having high correlation to one or more of the derived components, then most of the variation will be accounted for by these components and such variables will be the ones that are accounting for the differences amongst samples. The last few, in such a case, will account for little variation as they will define constant or near-constant linear relationships amongst the variables (i.e. variables whose values were common across samples).
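For a PCA based on the covariance matrix with unit-length eigenvectors, the correlation between an original variable and a component follows directly from the quantities already defined. The symbol s_i, the standard deviation of the i-th original variable, is introduced here for illustration; a_{vi} is the i-th element of the v-th eigenvector and e_v is the v-th eigenvalue.

```latex
r(x_i, Z_v) = \frac{\mathrm{cov}(x_i, Z_v)}{s_i \sqrt{\mathrm{var}(Z_v)}}
            = \frac{e_v\, a_{vi}}{s_i \sqrt{e_v}}
            = \frac{a_{vi} \sqrt{e_v}}{s_i}
```

A variable therefore correlates strongly with a component when it has a large eigenvector element on a component with a large eigenvalue, relative to its own spread.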

Eigenvectors and Eigenvalues
Common statistical measures include the mean ($\bar{X}$) and standard deviation ($\sigma$). However, variance (var), which is merely the square of the standard deviation ($\sigma^2$), and covariance (cov) can also be used. To derive the eigenvectors and eigenvalues for a set of data, the covariance matrix must first be derived.
The standard deviation and variance are represented as follows (shown here in their sample form, with n - 1 in the denominator):

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}, \qquad \mathrm{var}(X) = \sigma^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

Where: n = number of values in the data-set; $\bar{X}$ = mean of the data-set; X_i = i-th element of the data-set.

These measure the spread of the data. However, whereas the variance is one-dimensional, covariance is used on two-dimensional data (for example, measuring the relationships between the height of a person and their body-weight, or between hours studied and exam results). Covariance retains the same formula as variance, except for a minor difference:

$$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

Where: n = number of values in the data-set; $\bar{X}$ and $\bar{Y}$ = means of data-sets X and Y; X_i and Y_i = i-th elements of data-sets X and Y.

For n-dimensional data-sets, a covariance matrix can be constructed that represents the covariance between each possible combination of data-sets. For example, a covariance matrix for a three-dimensional data-set (X, Y, Z) would look like the following:

$$C = \begin{pmatrix} \mathrm{cov}(X,X) & \mathrm{cov}(X,Y) & \mathrm{cov}(X,Z) \\ \mathrm{cov}(Y,X) & \mathrm{cov}(Y,Y) & \mathrm{cov}(Y,Z) \\ \mathrm{cov}(Z,X) & \mathrm{cov}(Z,Y) & \mathrm{cov}(Z,Z) \end{pmatrix}$$

The matrix is symmetrical across the main diagonal, since cov(X, Y) = cov(Y, X). Also, the covariance values on this diagonal are equivalent to simply finding the covariance of each particular data-set to itself, i.e. its variance. If a transformation matrix is multiplied by a vector, the result is a vector that either is or is not a scalar multiple of the original vector. Vectors that are returned as scalar multiples of themselves are eigenvectors of the matrix, with the scaling value being the eigenvalue. The eigenvectors of a symmetric matrix, such as a covariance matrix, are orthogonal (i.e. perpendicular) to each other. For any n × n symmetric matrix, n such eigenvectors can be found, and by convention they are scaled to have a length of 1. Visually, the orthogonal nature of eigenvectors means that at most 3 eigenvectors can be displayed on a three-dimensional plot.
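The definition of an eigenvector can be illustrated with a small symmetric matrix. The matrix and vectors below are chosen purely for illustration and are not taken from the data analyzed in this work.

```latex
\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}
\begin{pmatrix} 1 \\ 1 \end{pmatrix}
= \begin{pmatrix} 3 \\ 3 \end{pmatrix}
= 3 \begin{pmatrix} 1 \\ 1 \end{pmatrix},
\qquad
\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}
\begin{pmatrix} 1 \\ -1 \end{pmatrix}
= \begin{pmatrix} 1 \\ -1 \end{pmatrix}
= 1 \begin{pmatrix} 1 \\ -1 \end{pmatrix},
\qquad
\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}
\begin{pmatrix} 1 \\ 0 \end{pmatrix}
= \begin{pmatrix} 2 \\ 1 \end{pmatrix}
```

The first two vectors are eigenvectors, with eigenvalues 3 and 1, and they are orthogonal because their dot product is (1)(1) + (1)(-1) = 0. The third vector is not an eigenvector, since (2, 1) is not a scalar multiple of (1, 0). Dividing each eigenvector by its length, here the square root of 2, gives the unit-length form.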

Fig. (1). Principal components analysis incorporating the htCNV pipeline reveals a reduced set of 4,594 copy number markers that can distinguish the International HapMap sub-population groups' female samples in hierarchical clustering.