FoldRate: A Web-Server for Predicting Protein Folding Rates from Primary Sequence

: With the avalanche of gene products in the postgenomic age, the gap between newly found protein sequences and the knowledge of their 3D (three dimensional) structures is becoming increasingly wide. It is highly desired to develop a method by which one can predict the folding rates of proteins based on their amino acid sequence information alone. To address this problem, an ensemble predictor, called FoldRate, was developed by fusing the folding-correlated features that can be either directly obtained or easily derived from the sequences of proteins. It was demonstrated by the jackknife cross-validation on a benchmark dataset constructed recently that FoldRate is at least comparable with or even better than the existing methods that, however, need both the sequence and 3D structure information for predicting the folding rate. As a user-friendly web-server, FoldRate is freely accessible to the public at www.csbio.sjtu.edu.cn/bioinf/FoldRate/, by which one can get the desired result for a query protein sequence in around 30 seconds.


I. INTRODUCTION
A protein can function properly only if it is folded into a very special and individual shape or conformation, i.e., has the correct secondary, tertiary and quaternary structure [1].Failure to fold into the intended 3D (three-dimensional) structure usually produces inactive proteins or misfolded proteins [2] that may cause cell death and tissue damage [3] and be implicated in prion diseases such as bovine spongiform encephalopathy (BSE, also known as "mad cow disease") in cattle and Creutzfeldt-Jakob disease (CJD) in humans.All prion diseases are currently untreatable and are always fatal [4].
Since each protein begins as a polypeptide translated from a sequence of mRNA as a linear chain of amino acids, it is interesting to study the folding rates of proteins from their primary sequences.Actually, protein chains can fold into the functional 3D structures with quite different rates, varying from several microseconds [5] to even an hour [6].Since the 3D structure of a protein is determined by its primary sequence, we can assume the same is true for its folding rate.In view of this, we are challenged by an interesting question: Given a protein sequence, can we find its folding rate?Although the answer can be found by conducting various biochemical experiments, doing so is both timeconsuming and expensive.Also, although a number of prediction methods were proposed [7][8][9][10][11][12], they need the input from the 3D structure of the protein concerned, and hence the prediction is feasible only after its 3D structure has been determined.Particularly, the newly-found protein sequences have been increasing explosively.For instance, in 1986 the Swiss-Prot databank (www.ebi.ac.uk/swissprot) contained merely 3,939 protein sequence entries, but the number has jumped to 428,650 according to version 57.0 of 24-March-2009, meaning that the number of protein sequence entries now is more than 108 times the number about 23 years ago.In contrast, as of 5-May-2009, the RCSB Protein Data Bank (http://www.rcsb.org/pdb)contains only 57,424 3D structure entries, meaning that the structure-known proteins is about 1.34% of sequence-known proteins.Facing the avalanche of protein sequences generated in the post-genomic age and also considering the huge gap between the numbers of known protein sequences and 3D structures, it is highly desired to develop an automated method that can rapidly and approximately predict the folding rates of proteins according to their sequence information alone.
The present study was initiated in an attempt to address this problem in hopes that our approach can play a complementary role to the existing methods [13,14].Below, let us first clarify the meaning of the protein folding rates as usually observed by experiments.

II. THE PROTEIN FOLDING RATE
Since the prediction object in the current study is the protein folding rate, a clear understanding of its implication is necessary.The folding rate of a protein chain observed by experiments is usually measured by the "apparent folding rate constant" [15], as denoted by f K .It is instructive to unravel its relationship with the detailed rate constants, as given below.
The apparent folding rate constant f K for a protein chain is defined via the following differential equation: P ( ) 0 t = when 0 t = .Subsequently, the protein system is subjected to a sudden change in temperature, solvent, or any other factor that causes the protein to fold.Obviously, the solution for Eq. 1 is: It can be seen from the above equation that the larger the f K , the faster the folding rate will be.Given the value of f K , the half-life of an unfolded protein chain can be ex- pressed by: ( ) which can also be used to reflect the time that is needed for a protein chain to be half folded.However, the actual folding process is much more complicated than the one as described by Eq. 1 even if the reverse rate for the folding system concerned can be ignored.As an illustration, let us consider the following three-state folding mechanism: Eqs.4 and 5 can be expressed via an intuitive diagram called "directed graph" or "digraph" G [15, 16] as shown in Fig. (1a).To reflect the variation of the concentrations of the three protein states with time, the digraph G is further trans- formed to the phase digraph G [15, 16] as shown in Fig.
(1b), where s is an interim parameter associated with the following Laplace transform: P and fold P , respectively.Thus, according to the

Fig. (1). (a)
The directed graph or digraph G [15, 16] for the three-state protein folding mechanism as schematically expressed by Eq. 4 and formulated by Eq. 5. (b) The phase digraph G obtained from G of panel (a) according to graphic rule 4 for enzyme and protein folding kinetics [15,16] that is also called "Chou's graphic rule for non-steady-state kinetics" in literatures (see, e.g., [17]).The symbol s in the phase digraph G is an interim parameter (see the text for further explanation).[15,16], which is also called "Chou's graphic rule for nonsteady-state kinetics" in literatures (see, e.g., [17]), we can directly write out the following phase concentrations: )( ) P ( ) Through the above phase concentrations and using Laplace transform table (see, e.g., [18] or any standard mathematical tables), we can immediately obtain the desired concentrations for unfold P , inter P and fold P of Eq. 5, as given by: P unfold (t) = C 0 e Comparing Eq. 9 with Eq. 1, we obtain the following equivalent relation: meaning that the apparent folding rate constant f K is a func- tion of not only the detailed rate constants, but also t .Ac- cordingly, f K is actually not a constant but will change with time.Only when and f K be treated as a constant.
Even for a two-state protein folding system when the reverse effect needs to be considered, i.e., the system described by the following scheme and equation: It can be imagined that for a general multi-state folding system, f K will be much more complicated.It is important to keep this in mind to avoid confusion of the apparent rate constants with the detailed rate constants.
We can also see from the above derivation that using the graphic analysis to deal with kinetic systems is quite efficient and intuitive, particularly in dealing with complicated kinetic systems.For more discussions about the graphic analysis and its applications to kinetic systems, see [19][20][21][22][23][24][25].

III. MATERIALS AND METHODS
To develop an effective statistical predictor, the following three things are indispensable: (1) a valid benchmark dataset; (2) a mathematical expression for the samples that can effectively reflect their intrinsic correlation with the object to be predicted; and (3) a powerful prediction algorithm or engine.The three necessities for establishing the current protein folding rate predictor were realized via the following procedures.

Benchmark Dataset
The dataset recently constructed by Ouyang and Liang [12] was used in the current study.It contains 80 proteins whose apparent folding rate constants ( f K ) have been ex- perimentally determined.However, it is instructive to point out that, when the experimentally measured f K is a constant independent on time t , the conditions as mentioned in Sec- tion II (see Eqs.10 and 14 and the relevant texts) must be satisfied.Accordingly, the folding kinetic mechanisms for all these 80 proteins can be approximately described by Eq. 1, and hence there is no need here to specify which proteins belong to the two-state folding and which ones to the threestate or other multiple-state as done in [12].Furthermore, although the experimental 3D structures of the 80 proteins are known, none of this kind of information will be used here because we are intending to develop a statistical predictor purely based on the experimental f K values of proteins and their sequence information alone.If the success rates thus T (sec) (cf.Eq. 3)

Protein Folding Rate Prediction
The Open Bioinformatics Journal, 2009, Volume 3 35 (Table 1).Contd…..  obtained can be comparable or about the same as those by the method of Ouyang and Liang where the 3D structure information was needed as an input [12], the new predictor will have the advantage of being able to also cover those proteins whose 3D structures are unknown yet.This is particularly useful due to the huge gap between the number of known protein sequences and the number of known protein 3D structures, as mentioned in Section I.

Number PDB Code
For readers' convenience, the benchmark dataset, denoted as bench

S
, is given in Appendix A which can also be downloaded from the web-site at www.csbio.sjtu.edu.cn/bioinf/FoldRate/.As we can see there, f ln K (where ln means taking the natural logarithm for the number right after it) ranges from 6.9 to 12.9 ; i.e., f K ranges from 6.9 3 e 1.01 10 to 12.9 6 e 4.00 10 (where e 2.718 is the natural number, sometimes called Euler's number), meaning that the apparent folding rate constants of the 80 proteins span more than eight orders of magnitude (cf.Table 1).

Sample Expression or Feature Extraction
As shown in [12], the features extracted from the 3D structures of proteins are very useful for predicting their folding rates.However, for the majority of proteins, their 3D structures are unknown yet.To enable the prediction model to cover as many proteins as possible, here let us focus on those features that can be derived from the amino acid sequential information alone, either directly or indirectly.Owing to the fact that smaller proteins usually (although far from always) fold faster than larger ones [26], and thathelix and -sheet are the two most major structural elements [27], our attention should be particularly focused on the size of proteins as well as the effects of -helices and -strands.

(a) Protein Size or Length Effect
In protein science, the length of a protein chain is usually measured by L , the number of amino acids it contains.
Many lines of evidences (see, e.g., [12,13]) have indicated that the length of a protein chain is correlated with its folding rate, suggesting that L , as well as its various functions, could be useful for representing protein samples in predicting their folding rates.Our preliminary studies showed that ln( ) L was particularly remarkable in this regard and hence will be used in the current study.

(b) Predicted -Helix Effect and the Effective Folding Chain Length
Driven by the short-range interaction, -helices can be formed independently in a much faster pace than the entire structural frame.These helices can be treated as rigid blocks so as to reduce the original chain length L counted accord- ing to the number of amino acids.The effective folding chain length L eff thus considered is given by [13]: (15) where h L is the total number of amino acids in the helix blocks that can be easily predicted by using PSIPRED [28] for a given protein sequence; h-block N the number of predicted helix blocks; and the pseudo length of a helix block that was set at 3 in the current study, meaning that each helix block is equivalent to 3 amino acid units in length.Again, our preliminary studies showed that among various functions of L eff , ln(L eff ) was particularly remarkable in correlation with the protein folding rates, and hence will be used in the current study.

(c) Effect of -Sheet Propensity
It was hinted in some previous studies (see, e.g., [29,30]) that the folding of a protein is strongly correlated with those amino acids that have a high propensity to form -strands [31,32].To reflect the overall -sheet propensity of a protein chain, let us take the following consideration.Suppose a protein chain is formulated by: where the -th can be one of the 20 different types of amino acids each having its own propensity to form -strand [31].The overall -sheet propensity of the protein concerned is defined by: where , i is the -strand propensity for the -th i amino acid in the protein P .Note that be- fore substituting the values of -strand propensity into Eq.17, they are subject to a Max-Min normalization as given by: where , i 0 represent the original -strand propensity value for R i in Eq. 16 and can be obtained from [31] because it must be one of the 20 native amino acids, Max{ 0 } means taking the maximum value among the 20 original -strand propensities, and Min{ 0 } the corresponding minimum one.For reader's convenience, the converted -strand propensity value obtained through the Max-Min normalization procedure (cf.Eq. 18) for each of the 20 native amino acids is given in

Prediction Algorithm
According to the above discussion, we have the following three quantitative features extracted from a protein sequence: ln( ) L , ln( L eff ) , and .Each of these features derived from a protein may be correlated with its folding rate f K through the following equations.
ln K f where are the protein folding rate constants predicted based on the length of protein, its -helix related effective length, and its overall -sheet propensity, respectively; while i a and i b are the corresponding parame- ters that can be determined through a training dataset by the following regression procedure [33].(k) , respectively.In order to deter- mine the coefficients of Eq. 19, let us define three objective functions given by: ( where f ( ) K k is the observed folding rate for the -th k pro- tein in the dataset bench S as given in Appendix A. The process of determining these coefficients is actually a process of finding the minimum of ( ) ( 1,2,3) , and hence can be easily obtained by the following equation: Substituting Eq. 20 into Eq.21, followed by using the data provided in Appendix A and the data derived therefrom as given in Appendix B, we can easily determine the coefficients in Eq. 19, as given below: However, as explained below, the accuracy of a predictor is usually examined by the jackknife cross-validation in which the query sample should be in term excluded from the training dataset.Thus, instead of Eqs.20-21, we should have: (1) The results thus obtained for 3) can be used to predict the protein folding rates but they each reflect only one of the three features described above.To incorporate all these features into one predictor, let us consider the following equation: where i w is the weight that reflects the impact of the -th i formula on the protein folding rate.If the impacts of the three formulae were the same, we should have . Since they are actually not the same, it would be rational to introduce some sort of statistical criterion to reflect their different impacts, as formulated below.
Given a system containing N statistical samples, we can define a cosine function as formulated by [34,35]: where i x and i y are, respectively, the observed and pre- dicted results for the -th i sample.Obviously, the cosine function is within the range of 1 and 1 [36].When and only when all the predicted results are exactly the same as the observed ones, we have 1 = .Suppose the value of the cosine function yielded with the -th i predictor in Eq. 19 on the benchmark dataset bench S by the self-consistency test [37] is Then the weight i w in Eq. 25 can be formulated as: which yields However, when the accuracy of Eq. 25 is examined by the jackknife cross-validation, by following the similar procedures in treating Eq. 19, we should instead have where the values for The ensemble predictor formed by fusing the three individual predictors of Eq. 19 as formulated by Eq. 25 or Eq. 30 or Eq. 31 is called the FoldRate, which can yield much better prediction quality than the individual predictors as shown below.

IV. RESULTS AND DICSUSSIONS
In statistics the independent test, sub-sampling test, and jackknife test are the three cross-validation methods often used to examine the quality of a predictor [38].To demonstrate the quality of FoldRate, we adopted the jackknife cross-validation on the benchmark dataset bench S (see the Appendix A).During the jackknife cross-validation, each of protein samples in the benchmark dataset is in turn singled out as a tested protein and the predictor is trained by the remaining proteins.Compared with the other two crossvalidation test methods, the jackknife test is deemed more objective that can always yield a unique result for a given benchmark dataset [37,39], and hence has been increasingly used by investigators to examine the accuracy of various predictors (see, e.g., [40][41][42][43][44][45][46][47][48][49][50][51][52][53][54]).
In the current study, two kinds of scales are used to measure the prediction quality.One is the Pearson correlation coefficient (PCC) (see wikipedia.org/wiki/Correlation)and the other is the root mean square deviation (RMSD).They are respectively formulated as follows: where i x , i y and N have the same meanings as Eq. 26, while x and y the corresponding mean values for the N samples.The meaning of RMSD is obvious; i.e., the smaller the value of RMSD, the more accurate the prediction.PCC is usually used to reflect the correlation of the predicted results with the observed ones: the closer the value of PCC is to 1, the better the correlation is.When all the predicted results are exactly the same as the observed ones, we have PCC=1 and RMSD=0.3 are the PCC and RMSD results obtained by the ensemble predictor FoldRate on the benchmark dataset bench S via the jackknife cross-validation.For facilitating comparison, the corresponding results obtained by individual predictors are given in Table 3 as well.

Listed in Table
As we can see from Table 3, the overall PCC value yielded by the ensemble predictor of Eq. 25 is 0.88, which is the closest to 1 in comparison with those by the individual predictors in Eq. 19.Such an overall PCC value is even higher than 0.86 obtained for the same benchmark dataset by the method in which, however, the 3D structural information is needed [12].Although the method developed recently by Ouyang and Liang could also be used to predict the protein folding rate without using the 3D structural information, the overall PCC value thus obtained would drop to 0.82 [12].
Moreover, it can be seen from Table 3 that the overall RMSD value for the ensemble predictor is the lowest one in comparison with those by the individual predictors.The highest correlation and lowest deviation results indicate that the FoldRate ensemble predictor formed by fusing individual predictors is indeed a quite promising approach.

V. CONCLUSIONS
FoldRate is developed for predicting protein folding rate.It is an ensemble predictor formed by fusing three individual predictors with each based on the size of a protein, its -helix effect, and its -sheet effect, respectively.Given a protein, all these effects can be derived from its sequence.
Therefore, FoldRate can be used to predict the folding rate of a protein according to its sequence information alone.FoldRate is freely accessible to the public via the web-site at www.csbio.sjtu.edu.cn/bioinf/FoldingRate/.

ACKNOWLEDGEMENTS
This work was supported by the National Natural Science Foundation of China (Grant No. 60704047), Science and Technology Commission of Shanghai Municipality (Grant No. 08ZR1410600, 08JC1410600) and sponsored by Shanghai Pujiang Program.

APPENDIX A
The benchmark dataset bench S consists of 80 proteins.The PDB codes listed below are just for the role of identity.In this study, only the protein sequences and their ( ) f ln K values are used for developing the current predictor.See the text for further explanation.

APPENDIX C
The values of a 1 (k), b 1 (k) , a 2 (k), b 2 (k) , and a 3 (k), b 3 (k) determined according to Eqs. 23-24 by excluding (jackknifing) the -th k protein sample in term from bench S of Appendix A. See the text for further explanation.
G of Fig. (1b) and using the graphic rule 4

First, let
us just use the 80 proteins in the benchmark dataset bench S (Appendix A) as the training data.Suppose the length, effective folding chain length, and overall -sheet propensity for the -th k protein in the dataset are denoted by ( ) L k , L eff (k) , and

Protein Folding Rate Prediction The Open Bioinformatics Journal, 2009, Volume 3 45 PDB Code
See the text for further explanation.
of Appendix A.