Characterizing Protein Shape by a Volume Distribution Asymmetry Index

A fully quantitative shape index relying upon the asymmetry of mass distribution of protein molecules along the three space dimensions is proposed. Multidimensional statistical analysis, based on principal component extraction and subsequent linear discriminant analysis, showed the presence of three major 'attractor forms' roughly correspondent to rod-like, discoidal and spherical shapes. This classification of protein shapes was in turn demonstrated to be strictly connected with topological features of proteins, as emerging from complex network invariants of their contact maps.


INTRODUCTION
It is commonly stated that the activity of a protein is somehow encoded in its shape [1]. A rough classification of proteins on the basis of their shape identifies two distinct classes: globins (nearly spherical molecules) and scleroproteins (rod-like or fibrous). Fibrous proteins are mainly structural elements (for instance, collagen in connective tissue); globins, on the other hand, perform many different tasks, which often depend on the presence of specific interaction sites located on the protein surface [2].
The concept of molecular shape is somewhat elusive: the identification of quantitative descriptors for the molecular structure is, thus, a potentially very interesting avenue of research [3].
Several methods have been proposed to characterize protein shape [4,5]. So far, shape analysis has been limited to representations of the protein surface, on the assumption that the surface is the privileged view because it is the region where biologically meaningful interactions take place [6]. Indeed, geometric shape has often been defined with reference to a finite set of points, a space curve, or a surface [7], instead of considering the overall volume of a molecule. The latter is specifically the purpose of this work, which builds upon previous results in which we demonstrated both the lack of any marked separation between the internal and external milieu of proteins and the basic fractal structure of the protein fold (Di Paola et al., JCIM).
The literature provides several different approaches to describing molecular surfaces. Among these, the van der Waals (VdW) surface is the boundary of the union of the atoms (modeled as balls) according to their van der Waals radii [8]. The Solvent Accessible Surface (SAS), originally proposed by Lee and Richards [9], is the surface traced out by the center of a probe sphere (typically a water molecule) rolling on top of the VdW surface; in this way, the overall steric hindrance of the protein molecule also includes the hydration shell. The Solvent Excluded Surface (SES) is the result of the erosion of the SAS by the same probe [10,11]. For a graphic representation, see Fig. (1).
Hopfinger developed a useful method for small molecules named 'molecular shape analysis' [12], based on the comparison of electrostatic fields, later adapted by Arteca and Mezey to define shape descriptors of macromolecules [13]. Some authors have focused on using shape descriptors to detect protrusions and cavities of known input structures. The Connolly function [10], the best known and most widely used, is derived as follows: a sphere with the diameter of a water molecule is centered at each point of the surface. If the fraction of the sphere volume within the SES volume (see dashed sphere in Fig. 1) is smaller than 0.5, the surface is considered locally convex, otherwise concave.
Formally, for any point $x \in M$, let us consider the ball $B(r, x)$ centered at $x$ with radius $r$. If $S(r, x) = \partial B(r, x)$ is the boundary of $B(r, x)$ and $S_I$ is the portion of $S(r, x)$ contained within the surface, the Connolly function $f_r : M \to \mathbb{R}$ is defined as:

$$f_r(x) = \frac{\mathrm{Area}(S_I)}{\mathrm{Area}(S(r, x))}$$

High values of $f_r(x)$ indicate that the surface around $x$ is largely concave, while low values point to a prevalent convexity around $x$. Røgen and Sinclair introduced protein shape descriptors based on the backbone [14].
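The convex/concave criterion can be illustrated numerically. The following is a minimal sketch, not the procedure of the cited works: it assumes the molecule is modeled as a plain union of balls (rather than the SES), samples the sphere S(r, x) with a Fibonacci lattice, and counts the fraction of sample points that fall inside the molecule. The names `fibonacci_sphere` and `connolly_fraction` are hypothetical.

```python
import math

def fibonacci_sphere(n):
    """Return n approximately uniform unit vectors (Fibonacci lattice)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n          # uniform in z gives equal-area bands
        rho = math.sqrt(max(0.0, 1.0 - z * z))
        phi = golden_angle * i
        points.append((rho * math.cos(phi), rho * math.sin(phi), z))
    return points

def connolly_fraction(x, r, atoms, n=20000):
    """Estimate the area fraction of the sphere S(r, x) lying inside the molecule.

    atoms: list of ((cx, cy, cz), radius) balls whose union stands in for the
    molecule (an assumption of this sketch).
    """
    inside = 0
    for ux, uy, uz in fibonacci_sphere(n):
        p = (x[0] + r * ux, x[1] + r * uy, x[2] + r * uz)
        if any(math.dist(p, center) <= radius for center, radius in atoms):
            inside += 1
    return inside / n

# A point on the surface of a single ball of radius R = 2, probed with r = 1.
# For this geometry the exact fraction is (1 - r / (2 R)) / 2 = 0.375, i.e.
# below 0.5, so the surface is classified as locally convex.
single_atom = [((0.0, 0.0, 0.0), 2.0)]
f = connolly_fraction((2.0, 0.0, 0.0), 1.0, single_atom)
print(f"estimated fraction: {f:.3f}")
```

The estimate depends on the chosen radius r, which is the limitation of curvature-based descriptors discussed in the text.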
Although curvature-based methods (such as Connolly's) identify points located at local protrusions and cavities well, they all depend on a pre-fixed value r (the neighborhood size); in many cases, it is desirable that the function value also give some clue about the length scale of the conformational feature to which it refers.
All these models have a strong 'theoretical flavor' and concentrate on the shape of the molecular surface. By contrast, we adopted a mainly statistical, bottom-up approach in order to derive a shape index that is coarse-grained but easily computable and free from a priori constraints. Unlike surface-based approaches, the proposed index is based on the volume distribution of the atoms along the three axes of space.
The starting point of this work is that the most interesting geometrical templates in structural biochemistry are the sphere, the disc and the cylinder. We therefore decided to rely upon the degree of symmetry of the volume distribution in three-dimensional space to develop a global index capturing the relative 'spherical', 'discoidal' or 'cylindrical' character of the studied structure.
A data set spanning the entire range of variation of protein shapes, from perfect spheres to almost perfect cylinders, was assembled in order to check, by means of a correlative approach based on Principal Component Analysis (PCA) [15], the consistency of the proposed index with relevant size-, geometry- and topology-related properties of protein structures.
The demonstrated ability of the proposed method not only to discriminate different shapes but also to capture the shape variability typical of a functional protein class (membrane proteins) confirms the relevance of a volume-based shape representation.
In this work, we analyze three-dimensional protein structures along the canonical axes, as reported in the PDB files containing the relevant information about biomolecular structures.
As a first step, we identify the Center of Mass (CM) of the molecule, reducing each amino acid residue to the corresponding α-carbon. In the case of the sphere, the center of mass coincides with its geometrical center and the distance from the CM to the surface of the molecule is identical along each of the three dimensions. In the case of the disc, the CM represents its center, two dimensions have an almost identical elongation and the third is not relevant; in the case of the cylinder, there is just one relevant dimension and the CM is located approximately at the midpoint of the main axis of the cylinder.
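Once the CM has been identified, the maximal distance of the α-carbons from the CM is computed along each of the three axes. The three values $R_x$, $R_y$, $R_z$ represent the radii of hypothetical spheres (Fig. 2), whose volumes provide an indication of the length of the object along a specific direction:

$$R_x = \max_i |x_i - x_{CM}|, \quad R_y = \max_i |y_i - y_{CM}|, \quad R_z = \max_i |z_i - z_{CM}|$$

where $(x_i, y_i, z_i)$ are the coordinates of the i-th α-carbon. The corresponding equivalent spherical volumes are then computed:

$$V_x = \frac{4}{3}\pi R_x^3, \quad V_y = \frac{4}{3}\pi R_y^3, \quad V_z = \frac{4}{3}\pi R_z^3$$

In the case of the sphere, $V_x = V_y = V_z$, whereas for a generic non-spherical molecule these three volumes differ from each other.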
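The computation of the axis radii and equivalent volumes can be sketched as follows. This is a minimal illustration, assuming that each α-carbon carries the same unit mass (so the CM is the plain coordinate average) and that coordinates are taken in the frame given in the PDB file; the function names and the toy coordinates are hypothetical.

```python
import math

def axis_radii(ca_coords):
    """Return (cm, (Rx, Ry, Rz)) for a list of (x, y, z) alpha-carbon coordinates.

    Assumes equal masses, so the center of mass is the arithmetic mean.
    """
    n = len(ca_coords)
    cm = tuple(sum(p[k] for p in ca_coords) / n for k in range(3))
    # Largest absolute displacement from the CM along each canonical axis.
    radii = tuple(max(abs(p[k] - cm[k]) for p in ca_coords) for k in range(3))
    return cm, radii

def equivalent_volumes(radii):
    """Volumes of the spheres whose radii are Rx, Ry and Rz."""
    return tuple(4.0 / 3.0 * math.pi * r ** 3 for r in radii)

# Toy rod-like object: 11 points along x plus four points one unit off-axis.
coords = [(float(x), 0.0, 0.0) for x in range(-5, 6)]
coords += [(0.0, 1.0, 0.0), (0.0, -1.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, -1.0)]

cm, radii = axis_radii(coords)
volumes = equivalent_volumes(radii)
print("CM:", cm)
print("radii:", radii)
print("volumes:", volumes)
```

Because the volume grows with the cube of the radius, a five-fold difference in elongation becomes a 125-fold difference in equivalent volume, which is what makes the index sensitive to asymmetry.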
Let us now introduce a shape space, in which each protein is identified by a vector ρ defined as follows: [equation missing]

The reference shapes correspond to the following points in the shape space (Fig. 3): [coordinates of the sphere (S), disc and rod points missing]

The tetrahedron represented in Fig. (3) is the space of possible protein molecular shapes. Clearly, a nearly spherical molecule is represented by a point located close to S; conversely, the largest distance from S corresponds to rod-like proteins. We therefore use the ratio between the actual distance of the protein from S and the maximum distance, which corresponds to a perfect rod-like shape; this maximum distance is: [equation missing]

On the other hand, the distance corresponding to the disc template is: [equation missing]

Thus, let us define a normalized distance ξ as the distance of a generic point in the shape space from S divided by this maximum distance. According to the value of ξ, molecules can be classified as follows: [classification thresholds missing]

Here ξ is an asymmetry index, since it measures the departure from perfect symmetry in space.
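To make the idea of a normalized distance from the sphere point concrete, here is a hedged sketch. The exact definition of ρ and the coordinates of the reference points are not recoverable from this text, so the sketch assumes, purely for illustration, that ρ is the vector of equivalent volumes divided by their sum. Under that assumption the sphere maps to (1/3, 1/3, 1/3), an ideal disc to (1/2, 1/2, 0) and an ideal rod to (1, 0, 0). None of these choices should be read as the authors' definition.

```python
import math

# Reference points under the ASSUMED normalization rho = V / sum(V).
SPHERE = (1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0)
DISC = (0.5, 0.5, 0.0)
ROD = (1.0, 0.0, 0.0)

def shape_vector(volumes):
    """Normalize the three equivalent volumes so that they sum to one (assumption)."""
    total = sum(volumes)
    return tuple(v / total for v in volumes)

def asymmetry(volumes):
    """Distance from the sphere point, divided by the sphere-to-rod distance."""
    d = math.dist(shape_vector(volumes), SPHERE)
    d_max = math.dist(ROD, SPHERE)
    return d / d_max

print(asymmetry((1.0, 1.0, 1.0)))   # ideal sphere
print(asymmetry((1.0, 1.0, 0.0)))   # ideal disc
print(asymmetry((1.0, 0.0, 0.0)))   # ideal rod
```

With this assumed normalization the index is 0 for a sphere, 1 for a rod, and the disc template falls halfway between them; a different definition of ρ would shift the intermediate values but keep the same ordering of the three templates.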
To prove the effectiveness of our index, we tested it on a data set of 40 proteins, half chosen among globular shapes and half among fibrous ones.
In Table 1 the values of ξ are reported along with the classification consistent with our proposal and the number of residues (nodes). 3D structures of three proteins belonging to the spherical, discoidal and rod-like groups are shown in Fig.
In order to put the proposed asymmetry index into perspective, we introduce some topological descriptors, based on a representation of protein structure in terms of inter-residue contact graphs [16].
As a matter of fact, the 3D crystal structure of a protein can be translated into a contact matrix among α-carbons, which in turn can be considered as a network with α-carbons as nodes and the contacts between them as edges. This kind of formalization is extremely useful for studying protein properties in general [17][18][19].
Starting from the spatial positions of the α-carbons in the PDB files, the mutual residue distance matrix $d = \{d_{ij}\}$ is computed: the generic element $d_{ij}$ is the Euclidean distance in 3D space between the i-th and j-th residues, kept in primary structure order. A link is established between two residues if their mutual distance lies in the range 4-8 Å; the contact graph adjacency matrix $A = \{a_{ij}\}$ is therefore defined as:

$$a_{ij} = \begin{cases} 1 & \text{if } 4\ \text{\AA} \le d_{ij} \le 8\ \text{\AA} \\ 0 & \text{otherwise} \end{cases}$$

Some topological descriptors can be extracted from A [20]:
• N: number of nodes (residues) in the graph;
• E: number of edges connecting the graph nodes;
• density: ratio between the actual number of edges E and the maximum value N(N − 1)/2, corresponding to the complete graph;
• avdegree: the average of the node degrees $k_i$, where $k_i = \sum_j a_{ij}$ is the number of links involving the i-th node;
• avshortpath: the shortest path is the minimum number of links connecting two residues; this value, averaged over all the residue pairs, is the average shortest path;
• diameter: the longest shortest path;
• avcluscoeff: the clustering coefficient $C_i$ is a measure of connectivity on a local scale for the i-th node: it measures the connectivity of the sub-graph made of the nodes connected to the i-th node. $C_i$ averaged over the whole set of nodes is the avcluscoeff.
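The contact graph and its descriptors can be sketched with the standard library alone. This is an illustration under stated assumptions: the 4-8 Å bounds are treated as inclusive, unreachable residue pairs are skipped when averaging shortest paths, and the clustering coefficient uses the usual graph-theoretical form (links among the neighbours of a node divided by the maximum possible number), which is assumed here because the formula is not preserved in the text. Function names and the toy chain are hypothetical.

```python
import math
from collections import deque

def contact_graph(ca_coords, lo=4.0, hi=8.0):
    """Adjacency sets: residues i, j are linked if lo <= d_ij <= hi (in Angstrom)."""
    n = len(ca_coords)
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if lo <= math.dist(ca_coords[i], ca_coords[j]) <= hi:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def bfs_distances(adj, src):
    """Number of links on the shortest path from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def descriptors(adj):
    n = len(adj)
    e = sum(len(nb) for nb in adj) // 2
    paths = []                      # one entry per connected unordered pair
    for i in range(n):
        d = bfs_distances(adj, i)
        paths.extend(d[j] for j in d if j > i)
    clus = []
    for i in range(n):
        nb = sorted(adj[i])
        k = len(nb)
        if k < 2:
            clus.append(0.0)
            continue
        links = sum(1 for a in range(k) for b in range(a + 1, k)
                    if nb[b] in adj[nb[a]])
        clus.append(2.0 * links / (k * (k - 1)))
    return {
        "N": n,
        "E": e,
        "density": e / (n * (n - 1) / 2),
        "avdegree": 2.0 * e / n,
        "avshortpath": sum(paths) / len(paths) if paths else 0.0,
        "diameter": max(paths) if paths else 0,
        "avcluscoeff": sum(clus) / n,
    }

# Toy chain: five pseudo-residues spaced 5 Angstrom apart along x,
# so that only consecutive residues are in contact.
chain = [(5.0 * i, 0.0, 0.0) for i in range(5)]
print(descriptors(contact_graph(chain)))
```

For real structures the coordinate list would come from the CA atoms of the PDB file; the pairwise loop is quadratic in the number of residues, which is acceptable for single chains.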

RESULTS
We computed the asymmetry ξ and the seven above-mentioned topological properties for each protein of the data set. In order to evaluate the correlation of ξ with the other parameters, a multivariate data analysis is required. To this aim, we applied PCA to the data matrix containing all the computed properties for each protein in the data set.
The presence of a specific component highly correlated with ξ (PC2) is a consequence of the selection of a data set spanning the entire range from spherical to rod-like structures. On the other hand, protein size, as measured by N, is the main order parameter shaping the data set (PC1).
Results are reported in Table 2 in terms of component loadings, i.e., the correlation coefficients between principal components (PCs) and original variables. A high absolute value of the loading between a variable and a component is used as a guide for the structural interpretation of the extracted components.
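The notion of a loading can be illustrated with a small, self-contained sketch. It assumes PCA is carried out on the correlation matrix (standardized variables), extracts only the first component by power iteration, and obtains the loadings as the eigenvector scaled by the square root of its eigenvalue. The data rows are invented for illustration (columns: N, E, ξ) and are not taken from the paper's tables.

```python
import math

def correlation_matrix(rows):
    """Pearson correlation matrix of the columns of rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(p)]
    z = [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]
    return [[sum(zr[a] * zr[b] for zr in z) / (n - 1) for b in range(p)]
            for a in range(p)]

def first_component(corr, iters=500):
    """Leading eigenpair by power iteration; returns (eigenvalue, loadings, share)."""
    p = len(corr)
    v = [1.0 + 0.1 * j for j in range(p)]      # arbitrary starting vector
    for _ in range(iters):
        w = [sum(corr[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigval = sum(v[a] * sum(corr[a][b] * v[b] for b in range(p))
                 for a in range(p))
    # Loading = correlation between an original variable and the component.
    loadings = [math.sqrt(eigval) * x for x in v]
    return eigval, loadings, eigval / p

# Hypothetical toy data: size-related columns N and E, plus an asymmetry column.
data = [(50, 150, 0.9), (100, 300, 0.2), (150, 450, 0.8),
        (200, 600, 0.3), (250, 750, 0.5)]
eigval, loadings, share = first_component(correlation_matrix(data))
print("eigenvalue:", round(eigval, 3))
print("loadings (N, E, xi):", [round(x, 3) for x in loadings])
print("explained variance share:", round(share, 3))
```

In this toy case the two size-related columns load almost entirely on the first component, mirroring how a size component is read off a loading table; a full analysis would extract all components, typically with a numerical library.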
PCA highlighted a three-component solution as explaining by far the most important (and reasonably signal-like) part of the information, corresponding to 86% of the total variance, with PC1 explaining 47% of the variability and PC2 and PC3 accounting for 25% and 14%, respectively. Not surprisingly, the first component (PC1) corresponds to protein size: the number of nodes N, as well as the number of links E, is strongly related to this component.
Both contact density and clustering coefficient scale negatively with size, confirming previous results [19]. As shown in Fig. (5), density scales neatly with size (here, the number of nodes).
The second component (PC2) identifies the 'general shape', since asymmetry has the highest correlation with it. It should be stressed that diameter and avcluscoeff make considerable contributions to PC2, suggesting that topology influences general shape.
In the case of PC3, the only relevant descriptor is avdegree. Therefore, this component is a topological invariant, since it is influenced neither by size nor by general shape.
Afterwards, we repeated the analysis leaving out asymmetry, thus evaluating the ability of topological features alone to predict protein shape. The 'reduced' principal components (PC1r-PC3r, i.e., those components generated without the explicit contribution of ξ) are indeed able to perform a very significant classification of the three groups of rod-like, discoidal, and spherical proteins, as reported in Table 3, which shows the classification matrix obtained by linear discriminant analysis on PC1r-PC3r.
As evident from Table 3, the discriminant analysis achieves 84.4% correct classification.
The efficacy of this discrimination can be appreciated in Fig. (6), which shows the space spanned by the first two reduced components; this space is approximately subdivided into three regions, corresponding to the three classes.
The ability of the components to separate the three classes is a proof-of-concept that ξ is consistent with other features of protein organization (such as those used for generating PC1r-PC3r). As can be observed in Table 4, PC1r still represents size, but avshortpath is now concentrated on PC1r.
Furthermore, avcluscoeff is now the descriptor most strongly associated with PC2r, since clustering is linked to shape; by contrast, avdegree does not change appreciably, confirming that it is a general invariant property. The 'reduced' component space explains 88% of the total variability as follows: PC1r = 52%, PC2r = 21% and PC3r = 15%.
In order to check the relevance of the proposed index on an independent data set, the asymmetry index was computed on a sample made of three different classes of proteins: globins, membrane proteins and fibrous proteins. While the 'globin' and 'fibrous' classifications refer directly to protein shape, the denomination 'membrane' has to do only with the location of the molecule in the cell. Given that part of the structure of membrane proteins takes the form of an elongated (mainly alpha-helical) patch inside the membrane, we expect membrane proteins to lie between 'globin' and 'fibrous' shapes in terms of their asymmetry index values. At the same time, we expect membrane proteins to show a higher variability in their asymmetry values than the other two classes (see Fig. (7)).

CONCLUSION
As previously suggested by Holm and Sander [1], the generation of a principal component space based on the mutual correlation of different shape features allows for the identification of 'attractor shapes' acting as ideal templates that rationalize the apparently wild variety of protein forms. In this work, the same strategy was adopted in order to validate a global shape index allowing for a quantitative appreciation of the position of a given structure in the continuum spanning from very asymmetric fibrous structures to approximately globular shapes. The possibility of assigning a given molecule to the 'rod-like', 'discoidal' or 'spherical' attractor by means of the components of a feature space that does not explicitly take the proposed index into account was a proof-of-concept of both the existence of such attractors and the consistency of the asymmetry descriptor defined here. Focusing on the quantification of symmetry in order to build a general shape descriptor is not merely one of many possible choices. Rather, symmetry, as aptly explained by Goodsell and Olson [2], is a crucial property for rationalizing the structure, function and even evolutionary history of protein molecules. Here it is sufficient to recall the role played by internal structural symmetries of proteins in allosteric effects, folding and cell localization [2], and the importance of detecting sequence-based symmetries for both the modeling of sequence-structure relations and protein evolution by gene duplication [21].
Our results point to the possibility of sketching a quantitative formalization of a concept that has so far been largely qualitative, namely protein form, which could have very relevant outcomes in protein science.
This hope is substantiated by the strong, and still largely unexploited, link between general shape information and graph-theoretical properties of protein contact networks.

CONFLICTS OF INTEREST
None Declared

Fig. (2). Radii $R_x$, $R_y$ and $R_z$ in the case of HSA (PDB code 1E7I).

Fig. (3). Shape space: each point of the space represents a molecule in terms of its own shape; green, red and black dots refer to spheres, discs and rods, respectively.
In Fig. (7), we report the results of the shape index for the above three classes of proteins: blue dots are proteins sharing the same globin-fold pattern, resulting in a spheroidal structure; green triangles are membrane proteins, known to have widely different shapes with a slight prevalence of elongated forms (at least for the membrane-embedded part of the structure); red squares correspond to fibrous proteins, which have an elongated, rod-like molecular shape. As shown in the figure, fibrous proteins segregate in the upper part of the figure, with an asymmetry index close to the maximum (Mean = 0.96, Std. Dev. = 0.03); globins, which are approximately spherical, are located, as expected, at the bottom, in a wider area than rod-like structures (Mean = 0.24, Std. Dev. = 0.09). Membrane proteins, finally, spread out over the wide central part of the figure (Mean = 0.44, Std. Dev. = 0.24), consistent with their morphological variability, which goes hand in hand with a tendency toward an elongated shape in their membrane-embedded part. The above results were highly statistically significant for both mean (Student's t-test) and variance (F-test) pairwise comparisons. The fibrous vs. membrane comparison scored a t value = 6.8 (p < 0.0001) and an F value = 44.35 (p ≤ 0.0001); the globin vs. membrane comparison scored a t value = 2.53 (p ≤ 0.03) and an F value = 6.05 (p ≤ 0.02); finally, the globin vs. fibrous comparison scored a t value = 21.2 (p ≤ 0.0001) and an F value = 7.34 (p ≤ 0.008). The ability of the index not only to discriminate between different classes but also to account for the internal variance of the membrane proteins is a further proof of its possible use as a simple quantitative shape index to study different protein folds.
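The two kinds of pairwise comparison mentioned above can be sketched as follows. This is a minimal illustration that assumes the pooled-variance form of Student's t and a variance ratio with the larger variance in the numerator; the text does not state which variants were used. The sample values are invented, chosen only to have means and standard deviations similar to those reported, and are not the paper's data.

```python
import math
from statistics import mean, variance

def student_t(a, b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1.0 / na + 1.0 / nb))
    return t, na + nb - 2

def f_ratio(a, b):
    """Variance ratio (larger over smaller) with (numerator, denominator) dfs."""
    va, vb = variance(a), variance(b)
    if va >= vb:
        return va / vb, (len(a) - 1, len(b) - 1)
    return vb / va, (len(b) - 1, len(a) - 1)

# Hypothetical asymmetry values for two small groups.
fibrous = [0.93, 0.96, 0.99]      # mean 0.96, standard deviation 0.03
membrane = [0.20, 0.44, 0.68]     # mean 0.44, standard deviation 0.24

t, df = student_t(fibrous, membrane)
f, f_df = f_ratio(fibrous, membrane)
print(f"t = {t:.2f} with {df} degrees of freedom")
print(f"F = {f:.1f} with {f_df} degrees of freedom")
```

Turning these statistics into p-values requires the t and F distributions, which the Python standard library does not provide; a statistics package would normally be used for that step.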

Fig. (7). Structural class of proteins discriminated on the basis of the asymmetry index.