A Short Survey on Genetic Sequences, Chou's Pseudo Amino Acid Composition and Its Combination with Fuzzy Set Theory

The study of genetic sequences is of great importance in biology and medicine. Sequence analysis and taxonomy are two major fields of application of bioinformatics. In this survey, we present results concerning genetic sequences and Chou's pseudo amino acid composition, as well as methodologies developed on the basis of this concept together with elements of fuzzy set theory, with emphasis on fuzzy clustering and its application to the analysis of genetic sequences.


INTRODUCTION
The study of genetic sequences is of particular interest because of its importance for diagnosis and taxonomy. Several efforts have been made in this direction. The computational effort required becomes significant as the length of polynucleotides increases, and it is also difficult to compare polynucleotides of different lengths. An important advance came with the introduction of the pseudo amino acid composition concept by Chou in 2001 [1], which transforms the sequence into a vector of limited size based on several key properties such as hydrophobicity, hydrophilicity and side-chain molecular weight, as well as other approaches.
Many studies from various research laboratories around the world have indicated that mathematical analysis, computational modeling, and the introduction of novel physical concepts to solve important problems in the classification of genetic sequences can provide timely and very useful information and insights for basic research, and hence are widely welcomed by the scientific community. Fuzzy theory has been used to identify G-protein-coupled receptor functional classes [31], nuclear receptor subfamilies [32,33], and antimicrobial peptides and their functional types [34]. It has also been employed in identifying nuclear receptor subfamilies based on sequence-derived features [35] and in predicting G protein-coupled receptors and their types by hybridizing two different modes of pseudo amino acid composition [36].
The paper is structured as follows: Section 2 presents results concerning genetic sequences and Chou's pseudo amino acid composition, as well as other ideas for transforming long polynucleotides into representations in a lower-dimensional space, and relevant work. Section 3 presents notions and methodologies concerning fuzzy sets and fuzzy clustering and mentions work that combines the two methodologies.

Genetic Sequences
The coding sequences of DNA and RNA are made of codons, that is, triplets XYZ in which each position is occupied by one of four nucleotides: {T, C, A, G} in the case of DNA and {U, C, A, G} in the case of RNA (A = Adenine, C = Cytosine, G = Guanine, T = Thymine, U = Uracil).
In the case of the RNA alphabet, if U is the first letter, one codes it as (1, 0, 0, 0): 1 because the first letter U is present, 0 since the second letter C does not appear, 0 since the third letter A is not present and 0 since the fourth letter G does not appear. In a similar way, C is represented as (0, 1, 0, 0), A as (0, 0, 1, 0) and G as (0, 0, 0, 1). So the codon UCG (serine) is written in the $[0,1]^{12}$ hypercube as (1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1).
There are cases where the exact chemical structure is not known for the complete sequence. In such cases some components of the fuzzy code are neither 0 nor 1 but take a value in the interval (0,1), so the sequence is not necessarily located at a corner of the hypercube.
Sadegh-Zadeh [37] first showed that nucleic acids (DNA and RNA) can be treated as ordered fuzzy sets in a 12-dimensional space. The genetic code can be represented in a 12-dimensional space because a triplet codon XYZ has a $3 \times 4 = 12$ dimensional fuzzy code $(a_1, \ldots, a_{12})$, which is a point in the 12-dimensional fuzzy polynucleotide space $[0,1]^{12}$, a subspace of the real space $\mathbb{R}^{12}$. Sadegh-Zadeh (see [37]) and Nieto et al. (see [38-40]) introduced the Fuzzy Polynucleotide Space (FPS) based on the principle of the fuzzy hypercube [41].
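The coding rule described above can be illustrated with a short Python sketch. The function names `encode_nucleotide` and `encode_codon` are hypothetical and introduced only for illustration; the letter order U, C, A, G is the one used in the text.

```python
# Order of the RNA alphabet used in the text: U, C, A, G.
ALPHABET = "UCAG"


def encode_nucleotide(base):
    """Return the 4-component indicator vector of a single nucleotide."""
    return tuple(1 if base == letter else 0 for letter in ALPHABET)


def encode_codon(codon):
    """Concatenate the codes of the three bases into a corner of [0,1]^12."""
    if len(codon) != 3 or any(base not in ALPHABET for base in codon):
        raise ValueError("expected a triplet over the alphabet U, C, A, G")
    code = []
    for base in codon:
        code.extend(encode_nucleotide(base))
    return tuple(code)


print(encode_codon("UCG"))  # (1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1)
```

A completely known codon always yields a vector with exactly three components equal to 1, that is, a corner of the hypercube.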
The situation becomes more complex when dealing with polynucleotides, since a polynucleotide consisting of a sequence of $k$ triplets XYZ can be seen as a point in a $[0,1]^{12 \times k}$ space. This means that as the length of polynucleotides increases, comparison and taxonomy become increasingly complex [1]. In his initial approach, Chou, instead of dealing with the whole sequence, proposed a methodology that takes into account the correlation between the residues of the sequence based on physical properties such as hydrophobicity, hydrophilicity and side-chain mass [1]. This approach results in a representation of the sequence by a $(20 + \lambda)$-dimensional vector, where $\lambda$ is the degree of correlation. In such an approach, the reduced representation carries sequence-order effects.
When the $(20 + \lambda)$ components are constrained by a normalization condition, as in the case of the classical amino acid composition, a dimension-reduction operation is often needed when performing prediction with some operation engines (such as the covariant discriminant [42]) in order to avoid the divergence problem. However, which of the normalized components should be removed? Will the result differ if a different component is removed? To address these questions, Chou's Invariance Theorem was developed in 1995. According to this theorem, the outcome of the covariant discriminant remains exactly the same regardless of which component is left out. In other words, any one of the normalized components can be left out to overcome the divergence problem without changing the final result. For more information about Chou's Invariance Theorem and its applications, as well as the relevant references, see the Wikipedia article at http://en.wikipedia.org/wiki/Chou.
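The $(20 + \lambda)$-dimensional representation mentioned above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a reference implementation: it assumes the commonly quoted form of the pseudo amino acid composition, in which the first 20 components are normalized occurrence frequencies and the remaining $\lambda$ components are weighted sequence-order correlation factors computed from standardized property values. The function names, the default weight `0.05` and the property table `toy_property` are hypothetical; real property values (for example from the AAindex database) would have to be supplied by the user.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 amino acids


def standardize(prop):
    """Rescale a property table to zero mean and unit standard deviation."""
    values = [prop[a] for a in AMINO_ACIDS]
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return {a: (prop[a] - mean) / std for a in AMINO_ACIDS}


def pseaac(sequence, properties, lam=2, weight=0.05):
    """Return a (20 + lam)-dimensional vector for a protein sequence.

    properties: list of dicts mapping each amino acid to a numeric value.
    lam:        number of correlation tiers (must be smaller than len(sequence)).
    weight:     weight of the sequence-order terms (assumed default).
    """
    n = len(sequence)
    if not 0 <= lam < n:
        raise ValueError("lam must satisfy 0 <= lam < len(sequence)")
    props = [standardize(p) for p in properties]

    # Classical amino acid composition: occurrence frequencies.
    freqs = [sequence.count(a) / n for a in AMINO_ACIDS]

    # k-th tier correlation factor: mean squared property difference
    # between residues that are k positions apart.
    thetas = []
    for k in range(1, lam + 1):
        total = 0.0
        for i in range(n - k):
            a, b = sequence[i], sequence[i + k]
            total += sum((p[a] - p[b]) ** 2 for p in props) / len(props)
        thetas.append(total / (n - k))

    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]


# Hypothetical property table used only to make the sketch runnable.
toy_property = {a: float(i) for i, a in enumerate(AMINO_ACIDS)}
vector = pseaac("ACDKLMACDW", [toy_property], lam=2)
print(len(vector))  # 22
```

Because the components are divided by a common denominator, they sum to 1, which is the normalization condition that motivates the dimension-reduction question discussed above.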
In a similar spirit, Torres & Nieto [38] mapped a polynucleotide onto the $[0,1]^{12}$ space by considering the frequencies of appearance of the nucleotides at the three base sites of a codon in the coding sequence. This results in an even greater reduction of the information needed for the representation. In this representation, only the nature of the residues plays a role. In that work, using a metric motivated by publications of Lin [43] and Sadegh-Zadeh [37], they calculated distances between nucleotides. They also applied their algorithm to the comparison of complete genomes (for example, M. tuberculosis and E. coli). Further work has recently been performed using the idea of Nieto et al. [40], in which the influence of several metrics has been examined. The advantages of this methodology are: a) one can compare very long polynucleotides in a computationally efficient way, and b) one can apply the algorithm to compare polynucleotides of different lengths, as is the case for genomes of different organisms.
We point out that metrics play an important role in computational biology. Different metrics have been used to study secondary structures (see [44]) or biopolymer contact structures (see [45]). The interest in this domain, as well as the many important biological and medical implications of the study of genetic sequences, is reflected in the number of works performed (see, for example, [46-51]).
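The frequency-based mapping described above can be sketched as follows. The function name `fuzzy_point` is hypothetical, and the handling of an incomplete trailing codon (it is dropped) and of DNA input (T is read as U) are assumptions made for the sake of a runnable example.

```python
ALPHABET = "UCAG"  # same letter order as in the codon coding above


def fuzzy_point(coding_sequence):
    """Map a coding sequence of any length to a point of [0,1]^12.

    Component 4*s + j is the relative frequency of letter ALPHABET[j]
    at base site s (s = 0, 1, 2) over all codons of the sequence.
    """
    seq = coding_sequence.upper().replace("T", "U")
    usable = len(seq) - len(seq) % 3  # assumption: drop an incomplete last codon
    if usable == 0:
        raise ValueError("sequence must contain at least one complete codon")
    n_codons = usable // 3
    point = []
    for site in range(3):
        bases = seq[site:usable:3]  # every base found at this codon site
        point.extend(bases.count(letter) / n_codons for letter in ALPHABET)
    return tuple(point)


print(fuzzy_point("UCG"))
print(fuzzy_point("UCGUCA"))
```

Sequences of different lengths are mapped to points of the same 12-dimensional space, which is what makes the comparison of genomes of different organisms possible with a single metric.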

Polynucleotides
The genetic code consists of strings made up of four constitutive elements, which correspond to the four nucleotides: A (adenine), T (thymine), C (cytosine), and G (guanine). Codons are strings consisting of three nucleotides, and each codon specifies an amino acid or a stop signal. Given four letters (nucleotides) and three positions in a codon, there are $4^3 = 64$ possible codons (see the table, for example, in http://psyche.uthct.edu/shaun/SBlack/geneticd.html, Freeland and Hurst 1998 [52]). Three of these codons specify the termination of the polypeptide chain and are thus called "stop codons". This leaves 61 codons to specify only 20 different amino acids.
An important problem is how to carry out the taxonomy of polynucleotides and how to quantify their similarity. In this context, a reduced common basis for describing a polynucleotide, such as Chou's pseudo amino acid composition concept, can be very useful. In fact, in an earlier paper [53], the physicochemical distance among the 20 amino acids [54] was adopted to define PseAA. Subsequently, several researchers employed the complexity measure factor [55], results from cellular automata analysis have been presented [56-59], several researchers used hydrophobicity and/or hydrophilicity properties [60-66], while others employed the Fourier transform [67,68].
The pseudo amino acid composition was originally introduced in order to improve the prediction of protein subcellular localization and membrane protein type [1], as well as of enzyme functional class [61]. The pseudo amino acid composition has the advantage that it can represent a protein sequence with a discrete model without losing its sequence-order information [69]. For this reason, it can be very useful in the analysis of a large number of complex sequences in taxonomic studies. Later, many researchers studied proteins with this approach and obtained interesting results. Indicative examples are presented in a number of research papers (see [1, 43, 53, 55-59, 60-62, 65-66, 69-103]). Several excellent reviews of these results can also be found in [104,105].
In the next section we present a brief description of fuzzy set theory.
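The counting argument above can be checked with a few lines of Python. The DNA alphabet is the one used in the text; the three stop codons listed are those of the standard genetic code.

```python
from itertools import product

NUCLEOTIDES = "TCAG"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code

# Every string of three letters over the four-letter alphabet.
codons = ["".join(triplet) for triplet in product(NUCLEOTIDES, repeat=3)]
sense_codons = [codon for codon in codons if codon not in STOP_CODONS]

print(len(codons), len(sense_codons))  # 64 61
```

Since 61 codons encode only 20 amino acids, several codons necessarily correspond to the same amino acid.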

FUZZY SET THEORY
In this section, we give basic notions of Fuzzy Set Theory which are essential and have important applications to issues associated with genetic sequences and Chou's pseudo amino acid composition.

Fuzzy Sets and Similarity
Let $X$ be a nonempty set. A fuzzy set $A$ in $X$ is characterized by its membership function $\mu_A : X \to [0,1]$, that is, $A$ is the set of all pairs $(x, \mu_A(x))$ such that $x \in X$, where $\mu_A(x)$ is the degree of membership of $x$ in $A$. In what follows, if $X = \{x_1, x_2, \ldots, x_n\}$ and $A = \{(x_1, \mu_A(x_1)), \ldots, (x_n, \mu_A(x_n))\}$ (1), then we write $A = (\mu_A(x_1), \ldots, \mu_A(x_n))$.
Kosko [41] introduced a geometrical interpretation of fuzzy sets as points in a hypercube. For a given set $X = \{x_1, x_2, \ldots, x_n\}$, the set of all fuzzy subsets of $X$ is precisely the unit hypercube $I^n = [0,1]^n$, since any fuzzy subset $A$ determines a point $P \in I^n$ given by $P = (\mu_A(x_1), \ldots, \mu_A(x_n))$.
Reciprocally, any point $P = (a_1, \ldots, a_n) \in I^n$ generates a fuzzy subset $A$ of $X$ defined by the map $\mu_A : X \to [0,1]$, $\mu_A(x_i) = a_i$, $i = 1, \ldots, n$. Non-fuzzy or crisp subsets of $X = \{x_1, \ldots, x_n\}$ are given by mappings from the set $X$ into the set $\{0,1\}$, and they are located at the $2^n$ corners of the $n$-dimensional unit hypercube $I^n$. So the ground set $X = \{x_1, \ldots, x_n\}$ is itself the fuzzy set $(1, 1, \ldots, 1) \in I^n$. Also, the empty fuzzy set is the fuzzy set $(0, 0, \ldots, 0) \in I^n$, denoted by $\emptyset$.
The distance $d$ is motivated by publications [37] and [43]. It is known that $d$ is a metric [110], and it has already been employed in [38] and [40]. In [111] (see also [112]) it is proposed to call this metric the NTV metric.
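As an illustration, the sketch below computes a distance between two points of the fuzzy hypercube. It assumes the form in which the NTV metric is usually stated, $d(p, q) = \sum_i |p_i - q_i| / \sum_i \max(p_i, q_i)$ with $d(\emptyset, \emptyset) = 0$; this formula is an assumption of the sketch and should be checked against [38] and [110]. The function name `ntv_distance` is hypothetical.

```python
def ntv_distance(p, q):
    """Distance between two points of [0,1]^n (assumed NTV form).

    d(p, q) = sum_i |p_i - q_i| / sum_i max(p_i, q_i), and 0 if both are zero.
    """
    if len(p) != len(q):
        raise ValueError("points must have the same dimension")
    denom = sum(max(a, b) for a, b in zip(p, q))
    if denom == 0:
        return 0.0  # both points are the empty fuzzy set
    return sum(abs(a - b) for a, b in zip(p, q)) / denom


# Codes of the codons UCG and UCA in the letter order U, C, A, G.
ucg = (1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1)
uca = (1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0)
print(ntv_distance(ucg, uca))  # 0.5
```

Under this assumed form the distance always lies in $[0,1]$, it is 0 for identical points, and it is 1 for two crisp codes that share no component.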
Let $X = \{x_1, \ldots, x_n\}$ be a set and let $A = (a_1, \ldots, a_n)$ and $B = (b_1, \ldots, b_n)$, where $a_i, b_i \in [0,1]$, be two fuzzy sets of $X$. The degree of similarity between $A$ and $B$, denoted by $\mathrm{sim}(A, B)$, is defined in [40].

Fuzzy Relations and Fuzzy Clustering
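When performing clustering, an important parameter is the metric (see [113]) employed in the distance calculation between the elements to be classified. These elements are considered as points in a finite-dimensional space.
In what follows, by $R$ we denote a fuzzy relation on a set $X$, that is, a fuzzy set in the direct product $X \times X = \{(x, y) : x, y \in X\}$, which is characterized by the membership function $\mu_R : X \times X \to [0,1]$. Also, by $\mathbb{R}$ we denote the set of all real numbers and by $\mathbb{R}^+$ the set of all positive real numbers.
Let $X = \{x_1, x_2, \ldots, x_n\}$ be a finite set. A fuzzy relation $R$ in $X \times X$ can be expressed by an $n \times n$ matrix $R = [r_{ij}]$, where $r_{ij} = \mu_R(x_i, x_j)$ for $i, j = 1, \ldots, n$. The relation $R$ is reflexive if $\mu_R(x, x) = 1$ for all $x \in X$, symmetric if $\mu_R(x, y) = \mu_R(y, x)$ for all $x, y \in X$, and max-min transitive if $\mu_R(x, z) \ge \max_{y \in X} \min\{\mu_R(x, y), \mu_R(y, z)\}$ for all $x, z \in X$. A fuzzy relation with the above properties is called a fuzzy similarity relation or fuzzy equivalence relation.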
A fuzzy relation on X that is reflexive and symmetric is usually called a compatibility relation.
The max-min transitive closure of a fuzzy relation R on X is defined as the smallest max-min fuzzy transitive relation containing R .
It is known (see, for example, [114]) that if $R$ is a fuzzy compatibility relation on a finite set $X = \{x_1, \ldots, x_n\}$, then the max-min transitive closure $R_T$ is the relation $R^{(n-1)} = R \circ R \circ \cdots \circ R$ ($n - 1$ factors). We note that if $R$ and $S$ are two fuzzy relations on $X$, the composition $R \circ S$ is characterized by the membership function $\mu_{R \circ S}(x, z) = \max_{y \in X} \min\{\mu_R(x, y), \mu_S(y, z)\}$.
Several techniques exist for the extraction of genome characteristics. A common approach is the investigation of common characteristic properties of its constituent elements. Fuzzy clustering is a procedure that can be employed in this direction. There are several methods of fuzzy clustering. Two of the most commonly employed approaches are:
a) The fuzzy c-means algorithm [115], which needs an a priori definition of the number of classes (called clusters) and whose final result depends critically on this choice.
b) The fuzzy equivalence relation-based hierarchical clustering method (see, for example, [116,117]), which does not need any a priori assumption on the number of classes. This is a significant advantage, since unwanted biases on the classified data are avoided.
A simple fuzzy cluster analysis of amino acids was introduced by Mocz [118] to recognize secondary structure in proteins. Biological distances between the twenty amino acids can be calculated based on their properties. Since the introduction of the pseudo amino acid composition by Chou [1], a significant number of efforts have been made to employ different quantities in order to create representations of the twenty amino acids, with the aim of better capturing and representing sequence-order effects through the principle of pseudo amino acid composition (PseAA).
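In order to illustrate the clustering methodology based on fuzzy equivalence relations (see, for example, [53,72-74]), let us consider a data set $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i = (x_{i1}, \ldots, x_{ip}) \in \mathbb{R}^p$ and $i = 1, 2, \ldots, n$. Then we proceed with the following three steps:
Step 1: We define a fuzzy relation $R$ on $X$, using the Minkowski distance function, via the membership function $\mu_R(x_i, x_k) = 1 - \delta \left( \sum_{j=1}^{p} |x_{ij} - x_{kj}|^q \right)^{1/q}$ for all $(x_i, x_k) \in X \times X$, where $q \in \mathbb{R}^+$ and $\delta$ is the inverse of the largest Minkowski distance between elements of $X$, so that $\mu_R$ takes values in $[0,1]$. Clearly $R$ is a fuzzy compatibility relation, but not necessarily a fuzzy equivalence relation [117].
Step 2: We find the max-min transitive closure $R_T$.
Step 3: For every $\alpha \in [0,1]$, called the degree of similarity, we define a new matrix $R_T^{\alpha} = [r_{ij}^{\alpha}]$, where $r_{ij}^{\alpha} = 1$ if the $(i, j)$ entry of $R_T$ is greater than or equal to $\alpha$ and $r_{ij}^{\alpha} = 0$ otherwise. The intervals of $\alpha$ that determine the partitions are derived from the values of the matrix $R_T$. In this way, by examining the matrix $R_T$, we can determine the resulting partitions for all intervals of $\alpha$-cuts.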
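The three steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the cited works: the function names and the four sample points are hypothetical, the scaling constant in Step 1 is taken to be the inverse of the largest pairwise distance, and the closure is obtained by repeated max-min squaring until the matrix no longer changes.

```python
def minkowski(u, v, q=2.0):
    """Minkowski distance of order q between two points."""
    return sum(abs(a - b) ** q for a, b in zip(u, v)) ** (1.0 / q)


def compatibility_relation(data, q=2.0):
    """Step 1: r_ik = 1 - delta * distance(x_i, x_k), with values in [0, 1]."""
    n = len(data)
    dist = [[minkowski(data[i], data[k], q) for k in range(n)] for i in range(n)]
    largest = max(max(row) for row in dist)
    delta = 1.0 / largest if largest > 0 else 0.0  # assumed scaling constant
    return [[1.0 - delta * dist[i][k] for k in range(n)] for i in range(n)]


def max_min_compose(r, s):
    """Max-min composition of two fuzzy relations given as square matrices."""
    n = len(r)
    return [[max(min(r[i][j], s[j][k]) for j in range(n)) for k in range(n)]
            for i in range(n)]


def transitive_closure(r):
    """Step 2: square a reflexive relation repeatedly until it stabilizes."""
    closure = r
    while True:
        squared = max_min_compose(closure, closure)
        if squared == closure:  # max and min only select entries, so this is exact
            return closure
        closure = squared


def alpha_cut_partition(closure, alpha):
    """Step 3: group the indices whose closure entry is at least alpha."""
    n = len(closure)
    clusters, assigned = [], set()
    for i in range(n):
        if i in assigned:
            continue
        block = [k for k in range(n) if closure[i][k] >= alpha]
        assigned.update(block)
        clusters.append(block)
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6)]  # hypothetical data set
r_t = transitive_closure(compatibility_relation(points))
print(alpha_cut_partition(r_t, 0.5))  # [[0, 1], [2, 3]]
print(alpha_cut_partition(r_t, 0.1))  # [[0, 1, 2, 3]]
```

Raising the degree of similarity splits the data into more and smaller clusters, so the number of classes is read off from the closure matrix rather than fixed in advance.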

Applications of Fuzzy Set Theory
Several properties can be used in the clustering methodology. Such properties can be selected from the AAindex database (see [119,120]), which can motivate the choice of the properties considered. The results of such classifications can contribute to the prediction of the subcellular location of proteins, which is a very active topic of research in bioinformatics. The resulting clusters may also be very helpful in understanding the origin and emergence of the alphabet of amino acids encoded by the standard genetic code.
A clustering analysis of the twenty amino acids based on several physical properties is presented in [121]. The properties employed in the analysis are the following: the number of codons that code for the amino acid, molecular weight, hydrophobicity, the number of atoms of each type and the corresponding number of protons, as well as the total number of protons. The analysis concerned the influence of the properties on the classification procedure and the resulting clusters, as well as the effect of the metric employed in the clustering procedure.
Recently, Stephen and Freeland [122] presented the first quantitative exploration of nature's "choices" set against various models for plausible alternatives with the help of computational chemistry. This analysis made clear that fuzzy approaches such as fuzzy clustering and fuzzy cognitive maps can be very useful in the prediction of protein content and of protein structural classes.
Fuzzy set theory has also been combined with the pseudo amino acid composition concept in other works. Representative results are those of Ding et al. [123], who employed a fuzzy support vector machine for the prediction of protein structural classes with pseudo amino acid composition, Shen et al. [124], who used fuzzy KNN for predicting membrane protein types from pseudo amino acid composition, Hayat et al. [125], who employed fuzzy K-nearest neighbor algorithms based on Chou's pseudo amino acid composition, Shen et al. [126], who applied supervised fuzzy clustering to predict protein structural classes, and Georgiou et al. [121], who applied fuzzy clustering techniques to classify amino acids based on Chou's pseudo amino acid composition.

CONCLUSIONS
In a nutshell, the pseudo amino acid composition principle introduced by Chou has greatly contributed to taxonomy, study and prediction in bioinformatics. It has also initiated further work by other researchers in the field on studying genetic sequences in a reduced representation space. The combination of such principles with elements of fuzzy set theory seems very promising for further extension of research on genetic sequences and calls for further interdisciplinary research.