Correlation of Metabolic Pathways with the Primary Structure in Acetylated Proteins

Signaling pathways are a major component of cellular networks, but most recent studies of signaling pathways either used pathways as context to enhance other molecular predictions or predicted pathways only indirectly. The former assumed that the pathways of the biomolecules (genes or proteins) used in the modelling were already known. Although the latter suits bioscience research at the systems level, indirect predictions depend to some degree on the prediction accuracy of other systems. So far, no work has studied the direct correlation between signaling pathways and protein primary structures, although acetylated proteins are among the main players in metabolic signaling pathways. To investigate this correlation, the sequences of 76 experimentally verified acetylated proteins were downloaded from NCBI. They cover three major metabolic pathways, i.e., biosynthesis, degradation, and metabolism. Without any a priori knowledge of how these three metabolic pathways are correlated with the primary structures of acetylated proteins, we constructed classification models to discriminate between the pathways. Computer simulations showed that the signaling pathways are indeed correlated with the primary structures of acetylated proteins, further supporting the well-known biological principle that sequence determines structure and structure determines function.

*Address correspondence to this author at the School of Biosciences, University of Exeter, UK; E-mail: Z.R.Yang@exeter.ac.uk

Bioinformatics studies involving signaling pathways mainly use them for genome annotation and signaling network analysis. For the former, signaling pathway information was used to classify gene expression data [44]. In that study, the classification of gene expression data with two or three disease-related classes was enhanced by the pathway information. In the corresponding data set, each gene is associated with a pathway, and a pathway is ranked higher if the prediction error associated with it is lower. The same research group later published related work on clustering pathways [45]. Some web-based tools were developed for the visualisation of pathways. For instance, PathExpress was developed for gene expression data with functional context extracted from the KEGG ligand database [46]. ArrayXPath II was developed using the scalable vector graphics technique for gene expression data based on integrated biological pathway resources [47]. Path-A was developed for metabolic pathway prediction, where machine learning algorithms and homology alignment algorithms were used to build basic reaction classifiers. Based on the reaction predictions, pathways were then integrated [48]. In that work, predicting whether a query protein is a catalyst of a particular reaction was implemented using the support vector machine, hidden Markov models, and BLAST.
Because it is difficult to predict pathways in a completely automatic way, some work focused on identifying pathway fragments by text mining [6]. In addition, database technology was employed to curate and visualise pathway data [49]. Because pathway information is incomplete in many databanks, identifying the missing enzymes of pathways has also been an important research direction in pathway analysis [50][51][52]. In these works, pathway context was used to guide the identification of missing enzymes in specific pathways. Once pathway networks have been built, biologists often need to pinpoint where a newly sequenced protein sits in the network; sequence similarity was therefore used in QPath (Querying pathway) to predict pathways in a pathway network [53].
Although it is known that many essential pathways remain unknown or incomplete for newly sequenced proteins [6], very little work has been conducted to predict pathways for query sequences. To the best of our knowledge, the only such work in the literature predicted pathways implicitly [48], with the catalytic reactions of various pathways precompiled. The prediction was conducted on individual reactions. If a query protein exactly matched the reactions belonging to a pathway, it was predicted to belong to that pathway.
There are two opposite chemical activities in cells [54]. The catabolic pathways break down large molecules into smaller ones. The small molecules are then used as building blocks to form new molecules by the anabolic pathways, which are referred to as biosynthesis processes. The combination of the two is metabolism, a vital process that maintains life through chemical reactions in living organisms [7,55]. Degradation is the process of breaking down molecules in living organisms [54].
Acetylation is a reaction that introduces an acetyl functional group into an organic compound and is also a major metabolic pathway. Acetylation has been found to play an important role in the development of some diseases. For instance, the interplay of acetylation and methylation in gene transcription regulation is one of the contributing factors in cancer development, and whether its patterns can be used for disease diagnosis has been investigated [56]. Histone deacetylase inhibitors, if properly designed, can be used for tumour radiosensitization [57].
The relation between biosynthesis and acetylation can be seen as a cycle in which ATP generates S-adenosylmethionine for biosynthesis while acetyl-CoA is consumed in acetylation [58]. GDP-N-acetyl-d-perosamine was found to be a precursor of LPS O-antigen biosynthesis in E. coli [59]. The acetylated polyamines induced by spermidine/spermine N(1)-acetyltransferase increased in biosynthesis [60]. Naturally occurring proteins with an acetylated NH2-terminus will normally be degraded, except for the involvement of a conjugating enzyme (possibly a ubiquitin-protein ligase) [61]. It was shown that acetylated GATA-1 can be targeted for degradation through the ubiquitin/proteasome pathway. It was suggested that acetylation may signal ubiquitination, which in turn leads GATA-1 to degradation [62]. In a study of inter-individual variability in 5-fluorouracil metabolism, N-acetylation was found to play a more important role than hydrolysis [63]. In a study of oxidative hair dyes, N-acetylation was found to be a predominant metabolic pathway [64]. In a study of severe sulphonamide hypersensitivity reactions, N-acetylation of the parent compound was found to be the important metabolic pathway [65], and patients who were slow acetylators showed diverse reactions [66].
The present study was initiated in an attempt to develop a computational approach for predicting the metabolic pathways in which a query protein is involved from its primary sequence.

Benchmark Dataset
Searching the NCBI database for acetylated proteins led to 1251 hits. The following rules were used to select proper data for the current study. First, a sequence without a pathway annotation (denoted by [PATHWAY]) was discarded. Second, a sequence without any experimentally verified acetylated residue was dropped. The sequences annotated with [PATHWAY] contained five types of acetylation, i.e., of lysine, serine, methionine, threonine, and valine. An experimentally verified acetylated residue was denoted by "/experiment=…" in the NCBI GenPept file. A computer program written in C that implements these rules was used to scan the GenPept files downloaded from NCBI automatically. As a result, 87 sequences were kept and the remaining 1164 sequences were discarded. To reduce redundancy and homology bias, sequence identity was checked using the CD-HIT algorithm [67][68][69] to remove those sequences that have 90% or higher pairwise sequence identity to any other in the same subset. This removed a further 11 sequences, leaving 76 sequences in the benchmark dataset for the current study. The accession numbers and sequences of the 76 proteins are given in the Online Supporting Information A.
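The two selection rules above can be illustrated with a short Python sketch. It is not the authors' C program: the function names are hypothetical, and the test for an experimentally verified residue is simplified to a plain substring search for "/experiment=", which assumes the qualifier appears in a record only on acetylation features. Records are split on the "//" terminator line used by GenPept flat files.

```python
def split_records(flatfile_text):
    """Split GenPept flat-file text into records; each record ends with a '//' line."""
    records, current = [], []
    for line in flatfile_text.splitlines():
        if line.strip() == "//":
            if current:
                records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    # Keep a trailing record that lacks the terminator, unless it is blank.
    if any(line.strip() for line in current):
        records.append("\n".join(current))
    return records


def keep_record(record_text):
    """Rule 1: pathway annotation present. Rule 2: experimental evidence present.

    Simplifying assumption: any '/experiment=' qualifier in the record is
    taken to refer to an acetylated residue.
    """
    return "[PATHWAY]" in record_text and "/experiment=" in record_text


def filter_records(flatfile_text):
    """Return only the records that satisfy both selection rules."""
    return [r for r in split_records(flatfile_text) if keep_record(r)]
```

A production version would parse the feature table so that the "/experiment=" qualifier is checked only on acetylation sites.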
The 76 sequences involve many sub-pathways of three major metabolic pathways, most of which occur only once, i.e., in a single protein sequence. This made the task of model construction very difficult. To simplify the problem, let us focus on the following three major pathways: biosynthesis (BIO), degradation (DEG), and metabolism (MET). Although some of the 76 proteins may be involved in two of the three major pathways, none is involved in both BIO and DEG, or in more than two of the three pathways. Therefore, according to the modes of their involvement in the three pathways, we can classify the 76 proteins into the following five categories: (1) BIO, (2) DEG, (3) MET, (4) BIO+MET, and (5) DEG+MET. The numbers of the 76 proteins in each of the five categories are listed in Table 1. Moreover, the 23 sub-pathways of BIO, 11 sub-pathways of DEG, and 15 sub-pathways of MET are listed in Table 2.
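The mapping from a protein's major pathways to one of the five categories can be sketched as follows. This is an illustrative Python fragment; the function name and the choice to raise an error for combinations outside the five categories are assumptions made here.

```python
# The five categories described in the text, keyed by the set of
# major pathways a protein is involved in.
CATEGORIES = {
    frozenset({"BIO"}): "BIO",
    frozenset({"DEG"}): "DEG",
    frozenset({"MET"}): "MET",
    frozenset({"BIO", "MET"}): "BIO+MET",
    frozenset({"DEG", "MET"}): "DEG+MET",
}


def pathway_category(major_pathways):
    """Return the category label for an iterable of major pathway names.

    Combinations that do not occur in the benchmark dataset (e.g. BIO with
    DEG, or all three pathways) raise ValueError.
    """
    key = frozenset(major_pathways)
    if key not in CATEGORIES:
        raise ValueError(f"unsupported pathway combination: {sorted(key)}")
    return CATEGORIES[key]
```

For example, `pathway_category(["MET", "BIO"])` gives `"BIO+MET"` regardless of the order in which the pathways are listed.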

Experimental Design
The computer simulation was designed in several steps, each with a specific purpose. First, the sequence data needed to be encoded in a form suitable for the machine learning algorithms used to construct the classification model. The encoded data were then fed to the machine learning algorithms to train the predictor. To automate model construction, an evolutionary algorithm was used to tune the predictor by optimizing its parameters.
Each protein sequence was coded as a 20-D (dimensional) vector [70,71], with each component representing the occurrence frequency of one of the 20 native amino acids. For instance, if there are 20 alanines in a sequence of 100 residues, the frequency, i.e., the first component of the 20-D vector, is 0.2.
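A minimal Python sketch of this encoding is given below. The ordering of the 20 components (alphabetical by one-letter code, so that alanine comes first) and the function name are assumptions; the source only states that alanine is the first component.

```python
# One-letter codes of the 20 native amino acids, alanine (A) first.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(sequence):
    """Encode a protein sequence as a 20-D amino acid composition vector.

    Each component is the count of one amino acid divided by the sequence
    length. Non-standard residue codes (e.g. X) are counted in the length
    but in no component, so the vector may then sum to less than 1.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
```

With a 100-residue sequence containing 20 alanines, such as `"A" * 20 + "G" * 80`, the first component is 20/100 = 0.2, matching the example in the text.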
The SVM (support vector machine) algorithm [72,73] was used as the identification engine for the current study. SVM has been successfully used to predict protein subcellular locations, membrane protein types, and other protein attributes (see, e.g., [74][75][76]). For a brief introduction to using SVM to classify protein attributes based on a discrete model, refer to [75]. The SVM light package (http://svmlight.joachims.org/) [72] was employed for model construction and evaluation. The package provides four kernel functions, i.e., the linear kernel, the radial basis function kernel, the sigmoid function kernel, and the polynomial kernel. The radial basis function was found to fit this application best and was used. The package requires the user to tune three hyper-parameters (C, J, and γ). The C parameter controls the trade-off between training error and margin. The J parameter was introduced for dealing with heavily imbalanced data. The γ parameter is associated with the radial basis function and determines the sensitivity of the function. The combinations of these three parameters span a very large space, and finding the best one is a non-trivial problem. An evolutionary algorithm was therefore used for this optimization problem. After many generations of competition among multiple candidate solutions, the final solution is believed to be close to the best one.
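The source does not specify which evolutionary algorithm was used, so the following Python sketch shows only one plausible scheme: an elitist search with Gaussian mutation over the three hyper-parameters on a log10 scale. All names, the search range of 10^-3 to 10^3, and the population settings are assumptions. The `fitness` argument is a caller-supplied function of the three hyper-parameters; in practice it would train and evaluate the SVM, whereas `toy_fitness` is only a stand-in so that the sketch runs without any SVM package.

```python
import math
import random


def toy_fitness(c, j, gamma):
    """Placeholder objective (hypothetical): peaks at C=10, J=1, gamma=0.01."""
    return -((math.log10(c) - 1.0) ** 2
             + math.log10(j) ** 2
             + (math.log10(gamma) + 2.0) ** 2)


def evolve_hyperparameters(fitness, generations=30, population_size=20,
                           n_parents=5, sigma=0.5, seed=0):
    """Search (C, J, gamma) with an elitist evolutionary loop.

    Individuals store log10 of each parameter. In every generation the
    best n_parents survive unchanged and the rest of the population is
    refilled with mutated copies of randomly chosen parents.
    """
    rng = random.Random(seed)

    def decode(individual):
        return tuple(10.0 ** gene for gene in individual)

    def score(individual):
        return fitness(*decode(individual))

    population = [tuple(rng.uniform(-3.0, 3.0) for _ in range(3))
                  for _ in range(population_size)]
    for _ in range(generations):
        parents = sorted(population, key=score, reverse=True)[:n_parents]
        children = []
        while len(parents) + len(children) < population_size:
            parent = rng.choice(parents)
            children.append(tuple(g + rng.gauss(0.0, sigma) for g in parent))
        population = parents + children
    return decode(max(population, key=score))
```

Calling `evolve_hyperparameters(toy_fitness)` returns a `(C, J, gamma)` tuple. Replacing `toy_fitness` with a function that returns the cross-validated success rate of an SVM trained with those settings would connect the loop to the actual model.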

RESULTS
In statistical prediction, the following three cross-validation methods are often used to examine the effectiveness of a predictor in practical application: the independent dataset test, the subsampling test, and the jackknife test [77]. However, as elucidated in [34] and demonstrated by Eq. 50 of [78], among the three cross-validation methods the jackknife test is deemed the most objective because it always yields a unique result for a given benchmark dataset, and hence it has been increasingly used by investigators to examine the accuracy of various predictors [1,17,…]. Accordingly, the jackknife test was also used to examine the success rates in identifying the pathway in which a query protein is involved. The jackknife success rates of the current approach in identifying the five classes of pathways for the 76 proteins are listed in Table 3. As the table shows, the overall success rate is about 66%. For comparison, if the 76 proteins were distributed completely at random among the 5 possible pathway classes, the overall success rate of random assignment would be 1/5 = 20%; if the random assignments were weighted according to the number of proteins in each pathway class (see column 2 of Table 3), the overall success rate would be $\sum_{i=1}^{5} (n_i/76)^2$, where $n_i$ is the number of proteins in the $i$-th pathway class [103], which is about 41% lower than the overall success rate of the current approach, indicating that the metabolic pathways are indeed correlated with the primary structure of acetylated proteins.
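The jackknife test and the two random baselines discussed above can be sketched in Python as follows. The function names are hypothetical, and the classifier is passed in as a pair of callables (`train`, `predict`) so that the sketch does not depend on any SVM package. No values from Table 3 are reproduced here.

```python
from collections import Counter


def jackknife_success_rate(samples, labels, train, predict):
    """Leave-one-out test: each sample is predicted by a model trained on the rest."""
    correct = 0
    for i in range(len(samples)):
        model = train(samples[:i] + samples[i + 1:],
                      labels[:i] + labels[i + 1:])
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def uniform_random_rate(n_classes):
    """Expected success rate when classes are guessed with equal probability."""
    return 1.0 / n_classes


def weighted_random_rate(labels):
    """Expected success rate when classes are guessed in proportion to their sizes.

    A sample from class i (probability n_i/N) is guessed correctly with
    probability n_i/N, so the expected rate is the sum of (n_i/N)^2.
    """
    n = len(labels)
    return sum((count / n) ** 2 for count in Counter(labels).values())
```

Because every sample is left out exactly once, the jackknife result does not depend on any random split, which is why it is unique for a given benchmark dataset.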

CONCLUSION
This study has demonstrated that there is a correlation between signaling pathways and protein primary structures. This is an encouraging sign, indicating that it is possible to predict the pathway involvement of a query protein, and hence that its functions can also be analyzed at the systems level. In particular, for a protein identified in disease-related tissue, the current approach can be used to explore which kinds of signaling pathways might be triggered during disease development. It is instructive to point out that the current approach adopted the simplest discrete model, i.e., the 20-D vector, to represent the protein samples. It is anticipated that with more sophisticated discrete models [78], such as the pseudo amino acid (PseAA) composition approach [104], the functional domain (FunD) approach [36], or the hybridization approach that fuses FunD with sequential evolution information [40], the success rates in predicting protein metabolic pathways will be further enhanced. The bioinformatics tool thus established will be very useful for studying biomedicine at the systems level.