Yeast Gene Function Prediction from Different Data Sources: An Empirical Comparison

Different data sources have been used to learn gene function. Although combining heterogeneous data sets to infer gene function has been widely studied, there has been no empirical comparison to determine the relative effectiveness or usefulness of different types of data for gene function prediction. In this paper, we report a comparative study of yeast gene function prediction using different data sources, namely microarray data, phylogenetic data, literature text data, and a combination of these three data sources. Our results showed that text data outperformed microarray data and phylogenetic data in gene function prediction (p<0.01) as measured by sensitivity, accuracy, and correlation coefficient. There was no significant difference between the results derived from microarray data and phylogenetic data (p>0.05). The combined data led to decreased prediction performance relative to text data. In addition, we showed that feature selection did not improve the prediction performance of support vector machines.


INTRODUCTION
Functional genomics studies gene function on a large scale by conducting parallel analysis of gene expression for a large number of genes [1,2]. This research is a natural successor to genome sequencing efforts such as the Human Genome Project, and is made possible by DNA microarrays. Such arrays, which allow researchers to simultaneously measure the expression levels of thousands of different genes, produce overwhelming amounts of data. In response, much recent research has been concerned with automating the analysis of microarray data [3]. Current approaches mainly concentrate on applying clustering techniques to the expression data in order to find clusters of genes demonstrating similar expression patterns. The assumption motivating this search for co-expressed genes is that simultaneously expressed genes often share a common function. However, there are several reasons why cluster analysis alone cannot fully address this core issue [3].
High-throughput gene and protein assays give a view into the organization of molecular cellular life through quantitative measurements of gene expression levels [1]. Increasing quantities of high-throughput biological data have become available to assess functional relationships between gene products on a large scale. Different data sources can be used to predict gene function.
First, gene function can be inferred from DNA microarray expression data. DNA microarray analysis is based on the assumption that genes with similar functions have similar expression profiles in cells. This is utilized by inductive learning methods that predict the function of genes that have an unknown function (unknown genes) from their expression similarity with genes of known function (known genes) [3]. Currently, techniques pursued for microarray data analysis concentrate on applying clustering methods directly to the expression data. However, cluster analysis alone cannot fully address the issue of gene function prediction [3]. Furthermore, many high-throughput methods sacrifice specificity for scale. Whereas gene coexpression data are an excellent tool for hypothesis generation, microarray data alone often lack the degree of specificity needed for accurate gene function prediction [4].

*Address correspondence to this author at the Department of Mathematics and Information Sciences, University of North Texas at Dallas, 7300 University Hills Blvd, Dallas, TX 75241, USA; Tel: 972-338-1573; Fax: 972-338-1911; E-mail: ying.liu@unt.edu
Secondly, gene function can be inferred from phylogenetic profiles. The complete genomic sequences of human and other species provide a tremendous opportunity for understanding the functions of biological macromolecules [5]. Phylogenetic profiles are derived from a comparison between a given gene and a collection of complete genomes. Each profile characterizes the evolutionary history of a given gene. There is evidence that two genes with similar phylogenetic profiles may have similar functions, the idea being that their similar pattern of inheritance across species is the result of a functional link [6].
Finally, one more data source that can be used to infer gene function is the scientific literature. The function of many genes is described in the literature. By relating documents discussing well-understood genes to documents discussing other genes, we can predict, detect, and explain the functional relationships between the genes that are involved in large-scale experiments. A number of groups are developing tools that use the literature to organize genes. The web tool PubGene finds links between pairs of genes based on their co-occurrence in MEDLINE abstracts [7]. Liu et al. [8,9] developed a tool to retrieve functional keywords automatically from biomedical literature for each gene, and then cluster the genes by shared functional keywords. Using a similarity-based search in document space, Shatkay et al. [3] developed an approach

for utilizing literature to establish functional relationships among genes on a genome-wide scale.
Different data sources have been used to infer gene function [3-9]. Furthermore, heterogeneous data sources have been combined to predict gene function [10-12]. However, there has been no empirical comparison to determine the relative effectiveness or usefulness of different types of data in terms of gene function prediction. In this paper, we performed a comparative study for functional prediction of Saccharomyces cerevisiae genes from different data sources. Four data sets were compared: microarray data, phylogenetic profile data, biomedical literature data, and a combination of these three heterogeneous data sets. The goal was to determine the relative effectiveness or usefulness of these data in terms of gene function prediction.

1. Microarray Data
The first data set derives from a collection of DNA microarray hybridization experiments [13]. Each data point represents the logarithm of the ratio of expression levels of a particular gene under two different experimental conditions. The data consist of a set of 79-element gene expression vectors for 2,465 yeast genes [4]. These genes were selected by Eisen et al. [13] based on the availability of accurate functional annotations. The data were generated from spotted arrays using samples collected at various time points during the diauxic shift [14], the mitotic cell division cycle [15], sporulation [16], and temperature and reducing shocks [4].

2. Phylogenetic Profile Data
In addition to the microarray expression data, each of the 2,465 yeast genes is characterized by a phylogenetic profile [5]. In its simplest form, a phylogenetic profile is a bit string, in which the Boolean value of each bit indicates whether the gene of interest has a close homolog in the corresponding genome. The profiles employed in this paper contain, at each position, the negative logarithm of the lowest E-value reported by BLAST version 2.0 [17] in a search against a complete genome, with negative values (corresponding to E-values greater than 1) truncated to 0. Two genes in an organism can have similar phylogenetic profiles for one of two reasons [4]. First, genes with a high level of sequence similarity will have, by definition, similar phylogenetic profiles. Second, for two genes which lack sequence similarity, similarity in their phylogenetic profiles reflects a similar pattern of occurrence of their homologs across species [4]. This coupled inheritance may indicate a functional link between the genes, on the hypothesis that the genes are always present together or always both absent because they cannot function independently of one another [4].
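The mapping from BLAST E-values to profile entries can be sketched as follows. This is an illustration of the transformation described above, not the authors' code; the natural logarithm is assumed, since the base is not specified.

```python
import math

def profile_entry(e_value):
    """One phylogenetic-profile position: the negative logarithm of the
    lowest E-value against a genome, truncated below at 0 (so E-values
    greater than 1, i.e. weak or absent homologs, map to 0)."""
    return max(0.0, -math.log(e_value))

# Hypothetical best E-values of one gene against three complete genomes:
profile = [profile_entry(e) for e in (1e-40, 0.5, 3.0)]
```

A strong homolog (very small E-value) yields a large positive entry, while an E-value above 1 yields exactly 0.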

3. Literature Text Data
In this paper, the experiments are carried out using gene functional categories from the MIPS Comprehensive Yeast Genome Database (CYGD) (http://mips.helmholtz-muenchen.de/genre/proj/yeast/index.jsp). The database contains several hundred functional classes, whose definitions come from biochemical and genetic studies of gene function [4]. For each of the genes, the abstracts used to curate the CYGD were extracted and formed a document. Abstracts may occur in more than one document if they refer to multiple genes. All the documents form a document database. Since each document represents one gene, we use the words document and gene interchangeably.
The abstracts in each document were tokenized, stemmed by Porter's stemming algorithm, and filtered by a stop list [9]. The standard term frequency-inverse document frequency (TFIDF) function [18] was used to assign a weight to each word in a document. TFIDF combines term frequency (TF), which measures the number of times a word occurs in the gene's set of abstracts (reflecting the importance of the word to the gene), and inverse document frequency (IDF), which measures the information content of a word, i.e., its rarity across all the abstracts in the background set. The inverse document frequency is calculated as

    idf_a = log(N / df_a)

where idf_a denotes the inverse document frequency of word a in all the documents; df_a denotes the number of abstracts in which word a occurs; and N is the total number of abstracts in all the documents. TFIDF is defined as

    tfidf_{a,g} = tf_{a,g} x idf_a

where tfidf_{a,g} denotes the weight of word a for gene g, and tf_{a,g} is the number of times word a occurs in the abstracts of gene g.
To distribute the word weights over the [0, 1] interval, the weights resulting from TFIDF were normalized by cosine normalization:

    w_{a,g} = tfidf_{a,g} / sqrt( sum_{i=1}^{|W|} tfidf_{i,g}^2 )

where |W| denotes the number of words in the abstracts of gene g.
Each document, which corresponded to one gene, was modeled as an M-dimensional TFIDF vector, where M is the number of distinct words in the document. Formally, a document was a vector (tfidf_1, tfidf_2, ..., tfidf_M), where tfidf_i is the TFIDF value of word i.
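As a concrete illustration of the weighting scheme above, the following sketch builds one gene's cosine-normalized TFIDF vector from already-tokenized abstracts. It is a simplified reconstruction under the definitions given in the text, not the authors' implementation; tokenization, stemming, and stop-word filtering are assumed to have been done beforehand.

```python
import math
from collections import Counter

def tfidf_vector(gene_abstracts, all_abstracts):
    """Normalized TFIDF vector for one gene (= one document).

    gene_abstracts: list of token lists for the abstracts curating this gene.
    all_abstracts:  list of token lists for every abstract in the background set.
    """
    n_abstracts = len(all_abstracts)
    # df_a: number of abstracts in which word a occurs
    df = Counter()
    for abstract in all_abstracts:
        df.update(set(abstract))
    # tf_{a,g}: occurrences of word a in the gene's set of abstracts
    tf = Counter(tok for abstract in gene_abstracts for tok in abstract)
    # tfidf_{a,g} = tf_{a,g} * log(N / df_a)
    weights = {a: tf[a] * math.log(n_abstracts / df[a]) for a in tf}
    # cosine normalization distributes the weights over [0, 1]
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {a: w / norm for a, w in weights.items()} if norm else weights
```

Rare words (low document frequency) receive higher normalized weights than common ones, as intended by the IDF term.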

4. Combined Data
The three types of data described above are combined by concatenating the three types of vectors to form a single set of vectors. This is also called early integration or feature integration [4]. We used feature integration because it considers the various types of data at once, making a single prediction for each gene with respect to each functional category [4].
Prior to learning, the gene expression, phylogenetic profile, and text TFIDF vectors, as well as the combined data, are adjusted to have a mean of 0 and a variance of 1. The gene expression and phylogenetic profile data were taken from [4].
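The early-integration step and the subsequent standardization can be sketched as follows. The 79-element expression vectors and the 2,465 genes match the cited data; the widths of the profile and TFIDF blocks, and the random values, are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 2465                                 # yeast genes in the data set
expression = rng.normal(size=(n_genes, 79))    # 79-element expression vectors
phylo      = rng.random(size=(n_genes, 24))    # profile width is a placeholder
tfidf      = rng.random(size=(n_genes, 500))   # TFIDF width is a placeholder

# Early (feature-level) integration: concatenate the vectors per gene,
# then adjust every feature to mean 0 and variance 1.
combined = np.hstack([expression, phylo, tfidf])
combined = (combined - combined.mean(axis=0)) / combined.std(axis=0)
```

Standardizing after concatenation keeps features measured on very different scales (log-ratios, log E-values, TFIDF weights) from dominating the kernel.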

Classifier
In this study, Support Vector Machines (SVMs) were used for gene function prediction, as implemented in SVMLight v.3.5 [19]. SVMs have been widely used in gene and protein function prediction [12,20]. Linear and polynomial kernels were applied.
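The two kernel functions compared here have the standard forms sketched below. The polynomial degree and offset are illustrative defaults; the paper does not state the settings used with SVMLight.

```python
import numpy as np

def linear_kernel(x, y):
    """K(x, y) = x . y"""
    return float(np.dot(x, y))

def polynomial_kernel(x, y, degree=3, c=1.0):
    """K(x, y) = (x . y + c)^degree; degree and c are illustrative
    defaults, not the settings used in the paper."""
    return (float(np.dot(x, y)) + c) ** degree
```

Because the polynomial kernel raises the dot product to a power, small differences (including noise) in the input vectors are amplified in the implicit feature space, which is relevant to the kernel comparison reported in the Results.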

Cross-Validation of the Models
The standard method for evaluating prediction results is to perform cross-validation on the prediction algorithms [21]. Tenfold cross-validation has been shown to be statistically adequate for evaluating prediction performance [22,23]. In this paper, each of the data sets (microarray, phylogenetic, text mining, and the combined data set) was partitioned into ten subsets with both positive and negative genes spread as equally as possible between the sets. Each of these sets in turn was set aside while a model was built using the other nine sets. This model was then used to classify the genes in the tenth set, and the accuracy was computed by comparing these predictions with the actual categories. This process was repeated ten times and the results averaged [24].
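The stratified ten-fold split described above can be sketched as follows; this is a simplified reconstruction, since the authors' exact partitioning procedure is not given.

```python
def stratified_folds(labels, k=10):
    """Partition example indices into k folds, spreading positive and
    negative genes as equally as possible between the folds."""
    folds = [[] for _ in range(k)]
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y != 1]
    for group in (positives, negatives):
        for j, idx in enumerate(group):
            folds[j % k].append(idx)   # deal indices out round-robin
    return folds
```

Each fold in turn is held out while a model is trained on the other nine; the ten held-out accuracies are then averaged.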

Feature Selection
The feature selection method we used in this study is the MIT correlation score, also known as the signal-to-noise score [25], which helps to eliminate "noisy" features. For a given feature i, we compute the mean and standard deviation of that feature across the positive examples (μ_i^+ and σ_i^+, respectively) and across the negative examples (μ_i^- and σ_i^-, respectively). The MIT correlation score is defined as

    MIT(i) = | μ_i^+ - μ_i^- | / ( σ_i^+ + σ_i^- )

When making the selection, we simply take the features with the highest scores as the most discriminatory features. For the text data, the features are the terms, i.e., the distinct words.
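The score and the selection step can be sketched as a direct reading of the formula above (an illustration, not the authors' code):

```python
import numpy as np

def mit_score(feature, labels):
    """Signal-to-noise score |mu+ - mu-| / (sigma+ + sigma-) for one feature."""
    pos = feature[labels == 1]
    neg = feature[labels != 1]
    return abs(pos.mean() - neg.mean()) / (pos.std() + neg.std())

def top_features(X, labels, n):
    """Indices of the n highest-scoring (most discriminatory) features."""
    scores = [mit_score(X[:, j], labels) for j in range(X.shape[1])]
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:n]
```

A feature whose values separate the positive and negative classes cleanly scores high; a feature with overlapping class distributions scores near zero.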

Performance Measures
Several statistics were used as performance measures. (1). Accuracy: the proportion of correctly classified instances:

    Accuracy = (TP + TN) / (TP + TN + FP + FN)

Paired t-tests were performed to evaluate whether the results obtained from the four data sets were significantly different from each other.

RESULTS
The database contains different functional classes, whose definitions come from biochemical and genetic studies of gene function. The experiments reported here used the 8 CYGD functional categories with the most genes available in the CYGD data set as of July 30th, 2009 (Table 1).

When microarray data was used and the linear kernel was applied for gene function prediction, all the genes in each category except category #12 were misclassified (true positives = 0), as can be observed in Fig. (1), where the sensitivity values were 0. Similar results were observed when phylogenetic data was used with the linear kernel to classify gene function, except for category #1 (Fig. 1). When the linear kernel was applied to text data, the results significantly outperformed those derived from microarray data and phylogenetic data (p<0.01). The SVM correctly classified the function of the genes in category #12 with an accuracy of 0.963 and a sensitivity of 0.7.
When the polynomial kernel was applied, the results derived from text data outperformed those derived from microarray data and phylogenetic data (p<0.05), except for category #1 (Fig. 2). No significant difference was observed between the gene function prediction results derived from microarray data and phylogenetic data (p>0.05) (Fig. 2).
For text data, the linear kernel outperformed the polynomial kernel (p<0.01) as measured by sensitivity, PPV, accuracy, and CC. The polynomial kernel worked significantly better than the linear kernel (p<0.01) for microarray data and phylogenetic data (Fig. 1, Fig. 2, Table 2, and Table 3). A linear SVM can outperform a polynomial SVM because noise contained in the data set can be amplified by the high-order polynomial kernel in the feature space, which may weaken the classifier's discriminative power. From Fig. 1, Fig. 2, Table 2, and Table 3, we can see that using combined data to classify yeast gene function did not improve the SVM performance. When the linear kernel was applied, the results derived from text data significantly outperformed those derived from the combined data, as measured by sensitivity (p<0.01) (Fig. 1A), accuracy (p<0.05) (Fig. 1C), and CC (p<0.05) (Table 2). There was no significant difference between the combined data results, microarray data results, and phylogenetic data results (p>0.05).

Gene Function Prediction from Combined Data
As with microarray data and phylogenetic data, the polynomial kernel worked significantly better than the linear kernel (p<0.01) for combined data (Fig. 1, Fig. 2, Table 2, and Table 3). However, the results derived from text data with the linear kernel still outperformed those derived from combined data with the polynomial kernel (p<0.01).

Feature Selection
In this study, the MIT correlation score was used as the feature selection method to test whether feature selection can improve SVM performance on gene function prediction using text data. The linear kernel was applied. The experiments demonstrated that MIT, a naïve feature selection algorithm which does not take into account the heterogeneity of the data, did not yield improved prediction performance (Fig. 3). The highest sensitivity, accuracy, PPV, NPV, and CC were obtained when all the features were used.

DISCUSSION
A primary goal in biology is to understand the molecular machinery of the cell. The sequencing projects provide one view of this machinery. A complementary view is provided by data from microarray hybridization experiments. High-throughput techniques such as DNA microarrays and sequencing, accompanied by an increase in the number of publications discussing gene-related discoveries, provide researchers with great resources to better understand gene function. In this paper, we classified yeast gene functions from different data sources. The CYGD database categorizes yeast genes into different categories, of which we analyzed eight (category numbers 1, 11, 14, 20, 12, 10, 42, and 34). Although the idea of combining heterogeneous data sets to infer gene function is not new [4], there has been no empirical comparison to determine the relative effectiveness or usefulness of different types of data in terms of gene function prediction. In this paper, we report a comparative study of yeast gene function prediction using different data sources, namely microarray data, phylogenetic data, literature text data, and a combination of these three data sources.

Effect of Different Data Sources on Gene Function Prediction
The results showed that, using SVM as the classifier, text data can provide better prediction results than microarray data and phylogenetic data, particularly when the linear kernel was applied (Fig. 1 and Table 2), as measured by sensitivity, PPV, NPV, accuracy, and CC. These results confirmed that the CYGD predictions we tested are not learnable from either microarray data or phylogenetic data [4]. Pavlidis et al. [4] pointed out that the failure to classify gene functions from microarray data or phylogenetic data was not a failure of the SVM model. Rather, for many functional categories, the data are simply not informative. Microarray data are only informative if the genes in the category are coordinately regulated at the level of transcription under the conditions tested. Simultaneously expressed genes may not always share a function. On the other hand, genes that are functionally related may demonstrate strong anti-correlation in their expression levels (a gene may be strongly suppressed to allow another to be expressed) [3]. Similarly, phylogenetic data are limited in resolution, in part because relatively few genomes are available. In particular, among the genomes from which the phylogenetic profiles were derived, all but one are bacterial. Thus it is difficult to generate useful phylogenetic profiles for genes that are specific to eukaryotes [4].
One complementary data source we can use to classify gene functions is literature data. With the advancement of genome sequencing techniques comes an overwhelming increase in the amount of literature discussing the discovered genes [26]. As an illustration, the number of documents containing the word "gene" published between 1970 and 1980 is a little over 35,000, while the number of such documents published between 1990 and 2000 is 402,700, over a tenfold increase [3]. Gene functions have been described in the literature. Therefore, we believe that gene functions can be classified by revealing coherent themes within the literature. Content-based relationships among abstracts are then translated into functional connections among genes. Liu et al. [8,9,25] developed a system to retrieve functional keywords automatically from biomedical literature for each gene, and then cluster the genes by shared functional keywords. The keywords extracted by the system revealed a wealth of potential functional concepts that were not represented in existing public databases [27]. The system also clustered the genes into appropriate functional groups based on functional keyword association [8,9].
Our strategy for gene function prediction from text data is similar to document categorization in information retrieval. In our case, each document is the collection of abstracts related to a specific gene. Document categorization, defined as classifying documents into categories according to their topics or main contents in a supervised manner, organizes large amounts of information into a small number of meaningful categories and improves information retrieval performance either via term weighting or via query expansion.

Combining Heterogeneous Data Sets for Gene Function Prediction
The problem of learning from multiple information sources has been extensively studied in machine learning, where it is called multi-modal learning. Generally there are two types of multi-modal learning: feature-level integration and semantic integration [28]. Feature integration combines the information at the feature level and performs learning in the joint feature space. The correlation structure between different sources can be discovered via learning. Semantic integration, on the other hand, first builds individual models based on separate information sources and then combines these models via some process, say, mutual information maximization [29]. Li et al. [28] listed four reasons why semantic integration was preferred over feature integration. However, Pavlidis et al. [4] argued that feature integration considers the various types of data at once, making a single prediction for each gene with respect to each functional category. They also argued that the performance of SVMs when data types are combined and a single hypothesis is formed is superior to combining different independent hypotheses [4].
In this study, we used feature integration. The results showed that the combined data did not improve the prediction results, especially compared with text data. Our results confirmed the conclusion drawn by Pavlidis et al. [5] when they studied gene function learning from microarray data and phylogenetic data: learning from different data types is not always a good idea. The combined data led to decreased prediction performance relative to an SVM trained on a single type of data (e.g., text data). In this case, the decrease occurs when one data type provides much more information than the others, indicating that the inferior data types (e.g., microarray and phylogenetic data) contribute noise that disrupts learning [5].

Effect of Feature Selection on Gene Function Prediction
In this study, the MIT correlation score was used as the feature selection method. By treating each feature independently, the MIT correlation score does not take into account possible correlations between features, but it has the advantages of simplicity and efficiency. Prediction performance declined as features were removed. MIT has been successfully applied to gene expression data analysis for cancer prediction [25].
The results of the experiments indicated that the SVM did not benefit from feature selection (Fig. 3), which has also been reported in text classification [30-32]. Taira and Haruno [33] compared SVMs and decision trees in text categorization; the best average performance was achieved when all the features were given to the SVM, a distinct characteristic of SVMs compared with the decision tree learning algorithm. Joachims [19] argued that, in text classification, feature selection is often not needed for SVMs, as SVMs tend to be fairly robust to overfitting and can scale up to considerable dimensionalities. An SVM avoids overfitting by choosing the maximum-margin separating hyperplane from among the many that can separate the positive from the negative examples in the feature space. Also, the decision function for classifying points with respect to the hyperplane only involves dot products between points in the feature space. Because the algorithm that finds a separating hyperplane in the feature space can be stated entirely in terms of vectors in the input space and dot products in the feature space, a support vector machine can locate the hyperplane without ever representing the feature space explicitly, simply by defining a function, called a kernel function, that plays the role of the dot product in the feature space. This technique avoids the computational burden of explicitly representing the feature vectors [34].
True positives (TP) denote the correct predictions of positive examples; true negatives (TN) are the correct predictions of negative examples; false positives (FP) represent the incorrect prediction of negative examples into the positive class; and false negatives (FN) are the positive examples incorrectly classified into the negative class.

(2). Sensitivity: the percentage of positive examples which were correctly classified:

    Sensitivity = TP / (TP + FN)

(3). Specificity: the percentage of negative examples which were correctly classified:

    Specificity = TN / (TN + FP)

(4). Positive Predictive Value (PPV): the percentage of the examples predicted to be positive that were correct:

    PPV = TP / (TP + FP)

(5). Negative Predictive Value (NPV): the percentage of the examples predicted to be negative that were in fact negative:

    NPV = TN / (TN + FN)
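The measures above reduce to the following computation over the confusion counts. CC is omitted here because its exact formula is not reproduced in the text.

```python
def performance(tp, tn, fp, fn):
    """Performance measures computed from the confusion counts; assumes
    every denominator is nonzero (cases where it is not are marked with
    '*' in the tables)."""
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv":         tp / (tp + fp),
        "npv":         tn / (tn + fn),
    }
```

For example, with 7 true positives, 90 true negatives, 1 false positive, and 2 false negatives, accuracy is 0.97 and PPV is 0.875.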

3.1. Gene Function Prediction from Microarray, Phylogenetic, and Text Data

The results of gene function prediction from different data sources are shown in Fig. (1), Fig. (2), Table 2, and Table 3. Each value is an average over ten-fold cross-validation; values in the brackets are the standard errors.

Fig. (3). Effect of feature selection in combination with the SVM classifier on sensitivity (A), specificity (B), PPV (C), NPV (D), accuracy (E), and correlation coefficient (F). Note the different scales on the vertical axes. The horizontal axes refer to the number of features used by the SVM to predict gene function. Error bars indicate the standard errors. The series 1, 11, 14, 20, 10, 12, 42, and 34 are the functional categories tested in this study.
(6). Correlation Coefficient (CC): also known as the Simple Matching Coefficient (SMC). CC depends not only on sensitivity and specificity, but also on PPV and NPV.

Table 2. The Prediction Performance (PPV, NPV, and CC) of Support Vector Machines (Linear Kernel) with Different Data Sets as Inputs. Column headings: Kernel Type, Functional Category, Phylogenetic Data, Microarray Data, Text Data, Combined Data.
*-: no positive was predicted; *: no value was calculated because of division by zero. Each value is an average over ten-fold cross-validation. Values in the brackets are the standard errors. Bold values indicate a significant difference.