ECGpred: Correlation and Prediction of Gene Expression from Nucleotide Sequence
Abstract
Development of gene expression prediction systems from huge amount of microarray data is an inevitable problem. In the present study a support vector machine (SVM) based method has been developed to predict expression of genes from its nucleotide sequence. In this method, SVM was trained on microarray data of genes and trained SVM was used to predict the expression of other genes of the same organism under the same condition. The SVM models were developed using nucleotide, dinucleotide, and trinucleotide composition of genes and achieved correlation coefficients (r) 0.25, 0.70, 0.82 respectively, between predicted and experimentally determined gene expression. Besides, trinucleotide composition, we also tried codon composition in each forward reading frame and achieved the correlation r = 0.86, 0.83 and 0.73 between the predicted and the actual expression using trinucleotide composition from the first, second and third frames respectively. The method was developed on 4807 genes of Saccharomyces cerevisiae obtained from Holstege et al., (1998) and evaluated using 5-fold cross validation techniques. A web server ECGpred has been developed to allow users to understand the relationship between expression and various components of genes like coding/non-coding regions, transcription factor (http://www.imtech.res.in/raghava/ecgpred/).