The Miyazawa-jernigan Contact Energies Revisited

The Miyazawa-Jernigan (MJ) contact potential for globular proteins is a widely used knowledge-based potential. It is well known that MJ's contact energies mainly come from one-body terms. Directly in the framework of the MJ energy for a protein, we derive the one-body term based on a probabilistic model, and compare the term with several hydrophobicity scales of amino acids. This derivation is based on a set of native structures, and the only information of structures manipulated in the analysis is the contact numbers of each residue. Contact numbers strongly correlate with layers of a protein when it is viewed as an ellipsoid. Using an entropic clustering approach, we obtain two coarse-grained states by maximizing the mutual information between coordination numbers and residue types, and find their differences in the two-body correction. A contact definition using sidechain centers roughly estimated from C atoms results in no significant changes.


INTRODUCTION
Comprehending the nature of the physical interactions between different types of amino acids is crucial for understanding protein folding and structure prediction.The physics-based potential functions are based on full atomic models and therefore require high computational cost.Furthermore, they need not fully capture all of the important physical interactions.The known three-dimensional structures of proteins contain a large amount of information on the forces stabilizing proteins.Potential functions and the rules governing protein stability can be revealed from statistical analysis on known structures.By assuming that frequently observed structural features correspond to lowenergy states, the observed statistical frequencies of various features, after comparing with that in a reference state or a null model, are converted into effective free energies [1][2][3][4][5].A recent review on the theory and methods used to derive such potentials is in ref. [6].Although such potentials, implicitly incorporating many physical interactions, do not necessarily reflect real energies, their application in protein folding, protein-protein docking, and protein design has achieved impressive successes.Minimal models of proteins, which have a significantly reduced number of degrees of freedom and give a coarsegrained yet accurate description of polypeptide chains, are widely used to obtain insights into folding mechanisms of proteins, as well as in structure prediction and sequence design.Minimal models also enhance the statistical significance of knowledge-based potentials.In the simplest model, only the -carbon atoms are considered.In many applications, a sidechain is represented by a center attached to a C atom.For a coarse-grained representation of *Address correspondence to this author at the Institute of Theoretical Physics, Academia Sinica, Beijing 100190, China; Tel: 86-10-62541820; Fax: 86-10-62562587; E-mail: zheng@itp.ac.cn polypeptide chains, interaction potentials, either contact-or distance-dependent, can be extracted from a database of known structures.
The Miyazawa-Jernigan (MJ) contact potential for globular proteins is a widely used knowledge-based potential [2,7].In the MJ model, a residue is represented by its sidechain center, which for Gly is taken as the position of its C atom.A pair of residues are defined to be in contact if they are not nearest neighbors in sequence and the distance between their centers is less than 6.5Å.The number of different types of residue-residue contacts can be counted directly from the structure of proteins.Miyazawa and Jernigan also introduced an effective solvent molecule, which has the volume of an average residue, to consider the residue-solvent contacts for explicitly including the solvent effect.In this model, residues make the same number of contacts (coordination number) on average, with either effective solvent molecules or other residues.By means of the approximation that the solvent and solute molecules are in quasi-chemical equilibrium and an approximate treatment of the effects of chain connectivity, Miyazawa and Jernigan estimated their interresidue contact energies from known crystal structures of globular proteins.
The high correlation between MJ effective contact energies e ab and the energies required to transfer amino acids from water to less polar environments is well known.Energy e ab between residue types a and b may be decomposed into two components: the desolvation terms e a0 , e b0 and the mixing term e ab , and then written as e ab = e ab e a0 e b0 , where subscript '0' is for the solvent or water.Among different types of contacts, the average difference of the desolvation terms is about 9 times larger than that of the mixing terms.This means that contact energies e ab are dominant by the 'one-body' desolvation term.Similar conclusion can be drawn by an eigenvalue decomposition analysis of the MJ matrix (e ab ) [8].
If energy e ab came completely from the 'one-body' term, i.e. e ab = a + b , the one-body a would be well estimated ( (The difference between a and the least square estimation of Ref. [9] is insignificant.)The error of this one-body approximation is measured by the standard deviation of these 210 estimated values.The values a and their standard deviations are listed in Table 1.Indeed, the one-body approximation is quite good.(The estimation includes the special case: a = e aa / 2 = e a0 .)It is our main purpose here to explore the origin of the one-body term and the two-body correction based on a probabilistic consideration, rather than to improve the potential function.The MJ energies ij e were first derived in 1985.Later in 1996 the energies were updated based on a much larger database, but no significant effects of the database size were seen.We shall focus on the 1996 version of e ij .

METHODS
To make a close correspondence with the 1996 version of e ij , in the present study we use the PDB_select25 of 2001 Sep Release, instead of the current version.The database contains representative protein structures with sequence identity less than 25% [10].We exclude those with chain lengths less than 50 and those annotated as membrane proteins.The final number of structures used in the analysis is 1274.The center representing the i -th residue a i of a protein is a point along the joint line from the C atom of a i to its C atom and at a distance from the C atom, where values of depend on the residue types, and, taken from Park & Levitt, have been listed in Table 1 [11].For Gly, = 0 , i.e. the center is just its C atom.A pair of residues are held to be in contact if the distance between their centers are less than 6.5Å and one is not the nearest neighbors of the other along the chain.

A Probabilistic Model
According to Miyazawa and Jernigan, the total contact energy of a protein native structure is defined as the difference between the energy of its native structure and that of its extended conformation: where ab n is the number of contacts or 'sides' between residue types a and b .If e ab can be contributed completely to the one-body term, i.e. e ab = a + b , we have where i is the site index, a i indicates the residue type at site i , and i is the contact number of a i or its coordination number.In this form of the total contact energy, the only structural information is { i } , the coordination number at each site.Correspondingly, it is reasonable to assume that the joint probability P(A, S) for the sequence A = a 1 a 2 …a n to take its native structure S is where P(a i , i ) is the shorthand notation for P(a = a i , = i ) , and P(a, ) is the probability for a residue of type a to have contacts.We may associate an energy U with the logarithmic probability: Usually, instead of the probability, in statistical modeling some probability ratio with respect to certain null model or reference state is considered.Correspondingly, the energy U is associated with a log-odds.The proper reference state concerns the problem under study.For example, the gapless threading and the Rosetta's fragment assembling should use different reference states.One essential feature of the MJ model missing in P(A, S) or U is the solvent or 'water' effect.In the MJ model residues are in contact not only with other residues, but also with 'unseen' solvent.To connect U with E of Eq. ( 3), we have to distribute u i = log P(a i , i ) to its i sides in some way.
We have inspected the sign of log[P(a | ) / P(a)] , where P(a) is the fraction of residue type a in the entire database, and found that the sign changes only once when increases from zero for all residue types except Thr.In the picture of MJ, = 0 may be interpreted as the case when a residue is fully exposed to the solvent.A very large then corresponds to the case when a residue is deeply buried into the interior of a protein.Thus, for a given residue type a , at a * where P(a | a * ) = P(a) , residue a would freely contact with either other residue or water.Increasing from * by 1 means that a contact with water is converted to that with residue.Regarding the case when the coordination number of residue type a is always a * as the reference state, we may estimate the contribution a u of a contact 'emitted' from residue type a to the energy U as where N(a, ) is the total number of residues of type a and with coordination number in the entire database, and We have only a few discrete integer values of .Usually, at any , P(a | ) never exactly equals P(a) .We determine * a by interpolation as: where + = + 1 and [P(a | + ) P(a)][P(a | ) P(a)] < 0 .

Two-Body Correction
So far, we have ignored any explicit interaction between residues.To include the interaction between residue pairs, we may consider where a i denotes the contacts of a i , Q(a j | a i , i ) is the probability for a residue of the type a j to be in contact with a i conditional on the coordination number of a i being i , Q(a j ) is the probability for a residue of type a j to be in contact with any other residues, and power 1 2 is used to avoid double counting of pairs.We can then derive the two- body energy u ab for correction to u a as where the dependence of u ab on has been explicitly indicated.

RESULTS AND DISCUSSION
The fundamental numbers for analysis are counts N(a, ) , the total number of residues of type a with coordination number being , in the entire database.We do not count each protein separately.These counts are listed in Table 2.As seen from the table, the counts at large are small.We have to combine them into a big bin.Specifically speaking, each from 0 to 8 is a bin while all 9 form a single bin of = 9 .Undetermined residue types B, Z and X are included in counting , but their statistics is not considered.By estimating P(a | ) directly from N(a, ) (without adding pseudocounts), we calculate a * using Eq. ( 7) and then the one-body u a using Eq. ( 6).Their values (together with the average coordination numbers a ) are listed in Table 3.

A Binary Partition of Residue States According to the Coordination Number
Our analysis on the two-body correction is based on counts like N(a, ; b) , which is the number of the 'sides' emitted from a residue of type a with coordination number and reaching at a residue of type b .There are 20 20 = 400 types of pair ab , when distributed to different the counts become too small to be statistically significant.Thus, a coarse-graining is considered to give a binary partition of these .When we reduce into two classes, say '0' and '1', the mutual information between residue type and coordination state where P('0') = < c P( ) , P('1') = c P( ) , the definitions of P(a,'0') and P(a,'1') are analogous, and c is the parameter for partition.It has been proven that the coarse-grained mutual information I(a; ) is never greater than the original one, and an optimal binary partition can be obtained by maximizing I(a; ) with respect to c [12].We find that the optimal partition is at c = 4 , so we may call '0' the 'exposed', and '1' the 'buried'.Note that c is very close to the values a * listed in Table 3. (For Thr, its sign of log[P(a, ) / P(a)] changes twice; only the one closest to c = 4 is kept.)

Additivity of the Contribution to u a from Each Contact
In the calculation of u a according to Eq. ( 6), it is implied that the contribution to a u from each contact is additive.To examine the additivity, we plot u a, = log[P(a | ) / P(a)] versus in Figs.(1a and 1b), where = 9 is actually the combined bin of 9 .The additivity corresponds to the linearity of the curves.Roughly speaking, this linearity is still recognizable, but two slopes separated at c = 4 are seen rather clearly for most residue types.

Comparison Among u a , MJ's
a and some

Hydrophobicity Scales
The one-body terms a of MJ's contact energies have been attributed to the hydrophobic effect.We notice that the correlation between MJ's a and MJ's 'average contact energies' e a is very high (with the correlation coefficient r = 0.996 ), and the strong correlation between e a and the experimental transfer free energies of amino acids have been examined [7].
Several hydrophobicity scales for amino acid side chains based on statistical analyses of residue distributions of known structures have been reported [13][14][15][16].In these analyses it is assumed that residues are distributed between the surface and interior of the protein in the same way they would be distributed between water and a solvent with a polarity similar to that of the protein's interior.For example, Wertz and Scheraga (WS) classified all residues as being either buried in the protein or exposed to water based on the number of times that a grid of lines parallel with each of three orthogonal axes intersects the solvent exposed van der Waals surface of each residue.An apparent transfer energy for residue type a is then estimated as where M a e and M a b are the mole fractions of residue type a that are exposed to water and buried in the protein, respectively.Most side chains are neither entirely buried nor entirely exposed.Taking this into account, Guy modeled the spacial distribution of residues as a function of their distance from the protein surface by dividing the protein ellipsoid 'body' into layers parallel to the protein surface [17].This layer analysis leaded to a refined hydrophobicity scale.We have examined correlations between u a , MJ's a , WS's scale and Guy's scale.We have also added a scale w a derived from the similar version of binary partition according to c = 4 .The comparison among them is displayed in Table 4.It is seen that u a has a strong correlation with MJ's a , but the correlation of Guy's scale is even stronger.(We have also examined the reference state with the above 'neutral' * replaced by * = 0 or * = q a , MJ's average coordination numbers.However, the correlations between the obtained u a and MJ's a are much weaker.) The linear regression lines of MJ's a versus the probabilistic energies u a is shown in Fig. (2).In fact, the regression between a and u a (or Guy's hydrophobicity scale F a ) shows rather strong correlation for polar as well as apolar residues, except for typical hydrophobic residue types Leu, Phe and Ala, besides Cys which also involves the disulfide bond.The regression line between a and u a seems to imply that the correlation of u a with a is not restricted just on apolar residues.

Two-Body Corrections
Instead of using ( 9), the two-body corrections are calculated for the binary partition of at c = 4 into '0' (exposed) and '1' (buried).The values of Tables 5a and 5b.(Note that the values there have been multiplied by 200.)The two tables both show some asymmetry.The most striking difference between the two tables is: while the interactions between apolar residues are stronger in the exposed environment, the interactions between polar residues are stronger in the buried environment.This is consistent with the physics of residue interaction: apolar residues near the protein surface strongly attract each other to reduce their areas accessible to water, while ionizable sidechain groups in the protein interior are virtually always uncharged to reduce the permittivity effect.(There are some charges in the interior, but they are almost always functional.)It is not surprising that Cys, involving in disulfide bonds, behaves peculiarly in Cys-Cys pairing.In the tables, the environment propensities u a, = log[P(a | ) / P(a)] of residues are also listed.(Our scale a w of binary partition corresponds to w a = u a, '1' u a, '0' .)In consistence with the demixing effect, the magnitude of diagonal entries is generally larger than that of the nondiagonal ones.The correlation between e ab and u ab = u a + u b without two-body correction is r = 0.921 while the correlation using the one-body values ( w a + w b ) of the binary partition is r = 0.902 .To include the two-body correction, we consider with {'0','1'} .The correlation between e ab and this corrected u ab is r = 0.911 for ='0' , and r = 0.909 for ='1' . (The two-body correction deduced from the binary partition is able to improve the correlation between u ab = u a + u b and MJ's e ab to r = 0.931 , but only when one fifth of the correction is kept.)

Contacts Defined by C Atoms
In the simplest representation, the polypeptide chains are modeled using only the C atoms to provide a global picture of low-resolution structures.It is worth examining contact energies v a with the contact definition involving only C atoms.Generally, when contacts are defined by a cutoff distance between C atoms, the correlation of v a with a is much weaker than that of u a .At cutoff R c = 7.5Å the correlation is only 0.86 .Thus, proper representative centers for sidechains are advantageous to the quality and utility of v a .
We have tried the simplest way to assume that the sidechain center of the i -th residue is along the direction from the midpoint of the line joining the (i ± 1) -th C atoms to the i -th C , and at the distance associated with the sidechain residue type.The correlation between a and u a is r = 0.929 , and the regression line is a = 4.20u a 1.49 .
A more careful approximation takes the mean orientation of the sidechain center into account (mainly an off-plane angle of 40.1 ), but the change in correlation is almost negligible.Although the simplest models are useful, a certain degree of complexity is needed for more realistic applications.

CONCLUDING REMARKS
Knowledge-based potentials for simplified models of protein are essential to understanding the protein structure and folding dynamics.Among numerous forms of potential functions for coarse-grained protein structures, the most widely used one is the weighted linear sum over pairwise contacts, which provides the great computational expediency.Such potentials include the MJ contact energies as a special case.It is not so obvious that many of them may be approximated with a simple function of individual residue properties such as hydrophobicity, demixing, and electrostatics [9].(Note that the 'one-body' approximations of Ref. [9] may contain also 'two-body' terms in the sense of interactions, and that the one-body term of the MJ contact energies is the contribution of each contact associated with a residue of a given type to the energy of a protein.) According to the equilibrium statistical thermodynamics, the relative probability P(s) of a microstate s with energy u(s) is given by the Boltzmann distribution P(s) = Z 1 e u(s ) , where = 1 / kT , k is the Boltzmann constant and Z the normalization factor or the partition function.This Boltzmann distribution is assumed in most statistical potential functions.However, from the viewpoint of probabilistic modelling, the derivation of knowledgebased force fields need not rely on statistical mechanics.A model constructed based on certain conditional independency can be assessed by the fit of that model to the  data.The Boltzmann distribution as an exponential form is related to the exponential decay of the probability measures for certain tail events in the so-called large deviation theory [18].An example of deriving a 'temperature' factor for protein stability is given in Ref. [19] (see Eq. (16.8) there).
We have derived the one-body term based on a probabilistic model directly in the framework of the MJ energy of a protein.Although no direct relation is available between the scaling factor of our a u and the 'room temperature', the correlation between u a and MJ's a is evident.The amino acid hydrophobicity estimated from a layer analysis by Guy showed strong correlation with MJ's a , but their frameworks for the energy of a protein differ.The coordination number carries some interpretation as a layer index, so Guy's hydrophobicities strongly correlate with our u a or w a .
The structural information most concerning one-body energies is coordination numbers, which play an essential role in our analysis.According to the one-dimesional meanfield-like approximation of Ref. [20], the correlation between the mean coordination number a and one-body energy a should be strong.Indeed, this is verified by our calculation of the correlation, r = 0.909 .More sophisticated contact energy potentials have been proposed, which incorporate several features of residues, such as their solvent exposure, their secondary structures, or orientation of side chains.For example, effective contact energies for an expanded 60-residue alphabet (including three secondary structural classes) have been estimated [21].Using an entropic clustering approach, we have obtained two coarse-grained states by maximizing the mutual information between coordination numbers and residue types.Obvious differences between the two states are seen in their two-body corrections.The dependence of residue pair correlations on structural environment is worth a detailed investigation [22].Nonlocal sidechain contacts between regular secondary structure elements, those cross-strand or those between different helices or -sheets, which show stronger correlation than that of local contacts [22], would play a role different from local ones.Especially, the contacts between different helices and/or -sheets should be more responsible to forming the stable structure core than other contacts, and would be expected being more conservative in sequence.Inclusion of such features in designing new contact potentials is important in practice.

Table 4 .
Correlations between u a , a and some Hydrophobicity Scales.WS: the Scale of Wertz-Scheraga; Guy: the Scale of Guy Averaged Over four Datasets; w a : the Scale Based on the Binary Partition at c = 4

Multiplied by 200) for the 'Exposed' Environment (with rows for a , Except for the Last Row which is
u a, '0' Multiplied by 100)

Table 5b
. Two-Body Corrections u ab, '1' (Multiplied by 200) for the 'Buried' Environment (with Rows for a , Except for the Last Row which is u a, '1' Multiplied by 100)