Protein Secondary Structure Prediction Using RT-RICO: A Rule-Based Approach

Protein structure prediction has always been an important research area in biochemistry. In particular, the prediction of protein secondary structure has been a well-studied research topic. The experimental methods currently used to determine protein structure are accurate, yet costly both in terms of equipment and time. Despite the recent breakthrough of combining multiple sequence alignment information and artificial intelligence algorithms to predict protein secondary structure, the Q3 accuracy of various computational prediction methods has rarely exceeded 75%. In this paper, a newly developed rule-based data-mining approach called RT-RICO (Relaxed Threshold Rule Induction from Coverings) is presented. This method identifies dependencies between amino acids in a protein sequence and generates rules that can be used to predict secondary structure. RT-RICO achieved a Q3 score of 81.75% on the standard test dataset RS126 and a Q3 score of 79.19% on the standard test dataset CB396, an improvement over comparable computational methods.


INTRODUCTION
Prediction of 3D structure of a protein from its amino acid sequence is a very important research goal in biochemistry and bioinformatics, and has been studied extensively since the 1960s. Protein structure prediction is valuable for drug design, enzyme design, and many other biotechnology applications. Rost [1] suggests that although protein 3D structure prediction from sequence still cannot be achieved fully, in general, research has continuously improved methods for predicting simplified aspects of structure. Particularly in the area of secondary structure prediction, accuracy has surpassed the 70% threshold for all residues of a protein. That breakthrough was achieved by combining multiple sequence alignment information and artificial intelligence algorithms.
It is not an easy task to evaluate the performance of a protein secondary structure prediction method. For example, the use of different datasets for training and testing each algorithm makes it difficult to find an objective comparison of methods [2]. Interestingly, Kabsch and Sander [3] tested prediction methods using proteins that had not been used in the development of the algorithms and found that the reported prediction accuracy of most of those methods decreased by more than 7%. One method's prediction accuracy decreased by as much as 27%. Rost [1] stated that "there is no value in comparing methods evaluated on different datasets." Efforts have been made to develop standard test datasets to accurately evaluate the performance of prediction methods. Rost and Sander [4] selected a list of 126 protein domains (the RS126 set) that now constitutes a comparative standard.
Cuff and Barton [2] described the development of a non-redundant test set of 396 protein domains (the CB396 set), where non-redundant means that no two proteins in the set share more than 25% sequence identity over a length of more than 80 residues [4]. They used the CB396 set to test four secondary structure prediction methods: PHD [4], DSC [5], PREDATOR [6] and NNSSP [7]. They also combined the four methods by a simple majority-wins method, the CONSENSUS method [2]. The resulting Q3 scores for the CB396 set were 71.9% (PHD), 68.4% (DSC), 68.6% (PREDATOR), 71.4% (NNSSP) and 72.9% (CONSENSUS).
An interesting secondary structure prediction method described by Fadime, Özlem and Metin [8] uses a two-stage approach. In the first stage, the folding type of a protein is determined. The second stage utilizes data from the Protein Data Bank (PDB) [9] and a probabilistic search algorithm to determine the locations of secondary structure elements. The resulting average prediction accuracy is 74.1%. However, the test dataset was not RS126 or CB396.
In this paper, we present a new method for predicting the secondary structure elements for different folding types. The algorithm, RT-RICO (Relaxed Threshold Rule Induction from Coverings), generates rules for discovering dependencies between protein amino acid sequences and related secondary structure elements. These rules are then used to predict protein secondary structure. The RT-RICO method performed better than previously reported methods, with a Q3 accuracy of 81.75% on the RS126 set and 79.19% on the CB396 set.
The RT-RICO approach and the main RT-RICO rule generation algorithm are discussed in Sections 3 and 4. A parallelized version of this algorithm is presented in Section 5, and detailed results of this method are presented in Section 6.

RELATED WORK
Rost [1] classifies protein secondary structure prediction methods into three generations. The first generation methods depend on single residue statistics to perform prediction. The second generation methods depend on segment statistics. The third generation methods use evolutionary information to predict secondary structure. For example, PHD [4] is a third generation prediction method based on a multiple-level neural network approach. It has been the most accurate method for many years.
One of the best secondary structure predictors is the PSIPRED Protein Structure Prediction Server [10], which was developed at University College London [10,11]. PSIPRED uses a two-stage neural network to predict the protein's secondary structure based on position-specific scoring matrices. The matrices are generated by PSI-BLAST (Position-Specific Iterated BLAST) [12]. PSIPRED's Q3 score based on a set of 187 unique folds is between 76.5% and 78.3% [10]. There are other secondary structure prediction methods that utilize neural network prediction algorithms. For example, Jnet works by applying multiple sequence alignments alongside profiles such as PSI-BLAST and HMM [13].
Random errors in the DNA sequence lead to a different translation of protein sequences. These 'errors' are the basis for evolution [1]. Because mutations resulting in a structural change are not likely to survive, Rost states that the evolutionary pressure to conserve structure and function has led to a record of the unlikely event: structure is more conserved than sequence [1]. Many third generation methods capitalize on this event to improve prediction accuracy. In PHD [4], Rost and Sander use multiple sequence alignments rather than single sequences as input to a neural network. At the training stage, a database of protein families aligned to proteins of known structure is used. At the prediction stage, the database of sequences is scanned for all homologues of the protein to be predicted, and the family profile of amino acid frequencies at each alignment position is fed into the network [14]. PSIPRED takes advantage of the same concept, but uses a slightly different approach, via matrices generated by PSI-BLAST [10].
These artificial neural network methods are revolutionary in the sense that they employ the homologues of proteins for training and prediction. However, a neural network is often regarded as a "black box"; it is difficult to formulate an algorithm from a neural network. A trained network may succeed in solving a problem, but it is hard to understand how it works. As a result, we are inspired to utilize a different approach, a rule-based prediction method. This approach still makes use of the fundamental principle that structure is more conserved than sequence. We establish rules between each known secondary structure element and its "neighboring" amino acid residues. These rules are used to perform predictions. Due to the different approaches, it is difficult to directly compare prediction results between this method and other methods. Neural network methods normally employ rigorous cross-validation testing techniques. The final Q3 scores comparison should be used as a general guide, not a strict percentile comparison.
Recently, there has been a trend toward using the support vector machine (SVM) to predict protein secondary structures. Hu, Pan, Harrison and Tai [15] achieved a Q3 accuracy of 78.8% on the RS126 dataset using an SVM approach. Kim and Park [16] developed the SVMpsi method that resulted in Q3 scores of 76.1% on the RS126 dataset and 78.5% on their KP480 dataset. Nguyen and Rajapakse [17] proposed a two-stage multi-class SVM approach utilizing position-specific scoring matrices generated by PSI-BLAST. Their Q3 scores were 78.0% on the RS126 dataset and 76.3% on the CB396 dataset.
Levitt and Chothia [18] proposed to classify proteins into four basic types according to their α-helix and β-sheet content. "All-α" class proteins consist almost entirely (at least 90%) of α-helices. "All-β" class proteins are composed mostly of β-sheets (at least 90%). The "α/β" class proteins have alternating, mainly parallel segments of α-helices and β-sheets. The "α+β" class proteins have a mixture of all-α and all-β regions, mostly in sequential order. The first stage of the two-stage method developed by Fadime, Özlem and Metin [8] is able to determine the class of unknown proteins with 100% accuracy. Given a protein sequence, they use a mixed-integer linear program (MILP) approach to decide if the protein sequence belongs to one of the four classes ("all-α", "all-β", "α/β", or "α+β"). In the second stage they use a probabilistic approach based on their stage one results. The amino acid sequences of the training set are distributed into overlapping sequence groups of three to seven residues. These groups are used to calculate the probability statistics for secondary structure. Specifically, the secondary structure at a particular sequence location is determined by comparing the probabilities that an amino acid residue is a particular secondary structure type based on the statistics.
Their results are impressive. They achieved a 100% accuracy for classifying proteins into one of the four protein type classes ("all-α", "all-β", "α/β", or "α+β"). This greatly simplifies part of the protein secondary structure prediction problem. That is, given a protein amino acid sequence, if it can be determined which one of the four classes this protein belongs to, then other approaches can be applied to predict the secondary structure elements within these four classes. In contrast, the RT-RICO method uses a rule-based approach as an alternative way to make the prediction.
A study by Maglia, Leopold, and Ghatti [19] implemented a data mining approach based on rule induction from coverings in order to identify non-independence in phylogenetic data. Although rule induction from coverings appeared to be a promising solution for the phylogenetic data non-independence problem, it suffered from exponential computational complexity (which was in part addressed by a parallelized implementation that was tailored for the phylogenetic data [20]) as well as the strictness required for the resulting rules (i.e., all rules had to be correct for all instances in the dataset). The restrictive requirement for the rules is addressed in Section 3, and this allowed the research team to discover meaningful rules in another problem domain, protein datasets.
Kabsch and Sander developed a set of simple and physically motivated criteria for secondary structure, programmed as a pattern-recognition process of hydrogen-bonded and geometrical features extracted from X-ray coordinates [21]. This DSSP (Define Secondary Structure of Proteins) algorithm is the standard method for assigning secondary structure to the primary structure (amino acids) of a protein. Depending on the pattern of hydrogen bonds, DSSP recognizes eight types or states of secondary structure. The 3-helix (3/10 helix), alpha helix, and 5-helix (pi helix) are symbolized as G, H and I, respectively. DSSP recognizes two types of hydrogen-bond pairs in beta sheet structures, the parallel and antiparallel bridge. A residue in an isolated beta-bridge is symbolized by B, whereas E represents an extended strand that participates in a beta ladder. The remaining types are T for hydrogen-bonded turn, and S for bend. There is also blank or "-" meaning "loop" or "other." These eight types are usually grouped into three classes: helix (G, H, and I), strand/sheet (E and B) and loop/coil (all others).
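The three-class grouping just described can be sketched in a few lines of Python. This is only an illustration of the grouping stated above (helix = G, H, I; strand/sheet = E, B; loop/coil = all others). The function name `reduce_dssp` and the output letters H, E and C are assumptions chosen for the example, not part of DSSP itself.

```python
# Three-class grouping as described in the text:
# helix (G, H, I), strand/sheet (E, B), loop/coil (all others).
HELIX = set("GHI")
STRAND = set("EB")


def reduce_dssp(states: str) -> str:
    """Map an eight-state DSSP string to the three classes H, E, C.

    The name and the output alphabet are illustrative assumptions.
    """
    reduced = []
    for state in states:
        if state in HELIX:
            reduced.append("H")
        elif state in STRAND:
            reduced.append("E")
        else:
            # T, S, blank, "-" and any other symbol count as loop/coil.
            reduced.append("C")
    return "".join(reduced)


if __name__ == "__main__":
    print(reduce_dssp("HGIEBTS- "))
```

Note that this is the three-class grouping used for scoring; the four-state conversion used later for rule generation is a separate mapping.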

Problem Description
In general, the protein secondary structure prediction problem can be characterized in terms of the following components [22]:

3-Class Prediction [23]
This is a characterization of the problem as a multi-class prediction problem with 3 classes {H, E, C} in which one obtains a 3 × 3 confusion matrix $Z = (z_{ij})$, where $z_{ij}$ represents the number of times the input is predicted to be in class $j$ while belonging to class $i$.
Accuracy is computed as
$$Q_3 = W_{\alpha\alpha} + W_{\beta\beta} + W_{cc}$$
where
$W_{\alpha\alpha}$ = % of helices correctly predicted,
$W_{\beta\beta}$ = % of sheets correctly predicted,
$W_{cc}$ = % of coils correctly predicted.
In other words, a protein secondary structure data sequence D is compared to the prediction result sequence M to calculate the Q3 score.
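As a minimal sketch of this calculation, the following Python code builds the 3 × 3 confusion matrix from an observed sequence D and a predicted sequence M, then sums its diagonal. It assumes that each W term is expressed as a percentage of all residues, so that the three terms add up to the overall percentage of correctly predicted residues. The function names and the toy sequences are illustrative only.

```python
CLASSES = "HEC"  # helix, strand/sheet, coil


def confusion_matrix(observed: str, predicted: str):
    """Return Z where Z[i][j] counts residues of class i predicted as class j."""
    if len(observed) != len(predicted):
        raise ValueError("sequences must have the same length")
    index = {c: i for i, c in enumerate(CLASSES)}
    z = [[0] * len(CLASSES) for _ in CLASSES]
    for obs, pred in zip(observed, predicted):
        z[index[obs]][index[pred]] += 1
    return z


def q3(observed: str, predicted: str) -> float:
    """Percentage of residues whose three-state class is predicted correctly.

    Assumption: each W term is a percentage of *all* residues, so Q3 is the
    sum of the diagonal of Z divided by the total number of residues.
    """
    z = confusion_matrix(observed, predicted)
    total = sum(sum(row) for row in z)
    correct = sum(z[i][i] for i in range(len(CLASSES)))
    return 100.0 * correct / total


if __name__ == "__main__":
    d = "HHHEECCC"  # toy observed sequence (D)
    m = "HHEEECCH"  # toy predicted sequence (M)
    print(confusion_matrix(d, m))
    print(q3(d, m))
```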

RT-RICO Step 1, Data Preparation
RT-RICO (Relaxed Threshold Rule Induction from Coverings) is the implementation of a prediction method for solving the protein secondary structure prediction problem. First, all protein names and corresponding folding types of each protein are retrieved from the SCOP database [24,25]. All available corresponding protein sequences and secondary structure sequences are retrieved from the PDB database [9]. Five databases of protein domains (with their amino acid sequences and secondary structure sequences) of different protein domain types ("all-α", "all-β", "α/β", "α+β" and "others") are built. Proteins from the test datasets (RS126 or CB396) are first removed from these databases, so that they will be excluded from the possible training datasets. Protein domains from different protein families are selected to form the training datasets. See Table 1 for the number of protein domains in each training dataset on the RS126 test dataset.
The protein secondary structure sequences from PDB are formed by elements of eight states of secondary structure, {H, G, I, E, B, T, S, -}. The eight states are converted to four states to facilitate rule generation as follows:
Note that rule generation uses a four-state decision attribute. The final Q3 score calculation uses a three-state decision attribute:
The basis for the RT-RICO approach is to first search segments of amino acid sequences of known protein secondary structures, and then find the rules that relate amino acid residues to secondary structure elements. The generated rules are subsequently used to predict the secondary structure. Klepeis and Floudas [26] showed that the use of overlapping segments of five residues is very effective in predicting the helical segments of proteins. Thus, the overlapping 5-residue segments approach was used to prepare the training data records. As shown in Fig. (1), for each secondary structure element, five "neighboring" amino acid residues are extracted to form a segment of five amino acid residues, plus one secondary structure element. These segments are used as input to the RT-RICO rule generation algorithm (Section 3.3, with more detail in Section 4) to generate rules. The numbers of 5-residue segments generated for the five protein type classes are shown in Table 1.
Although we use 5-residue segments, there is no evidence that five is the best segment length for this algorithm. PSIPRED uses a window of 15 amino acid residues for the neural network design [10]. Most previous methods combine multiple sequence alignment information and machine learning techniques. The purpose is to find the highly-correlated patterns from the training databases. A challenging future research problem remaining for RT-RICO is how to choose the best residue segment length, hence extracting correct and concise rules.
The main inputs to the RT-RICO rule generation algorithm are in the form of 6-tuples. The first five elements of a 6-tuple are formed by amino acid residues, {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}. The last element of a 6-tuple is formed by one of four secondary structure states {H, E, C, -}. The last element is considered the decision attribute. In other words, the input to RT-RICO Step 2, Rule Generation, is in the form of an m × (n+1) matrix, where m is the number of all entities (the number of 5-residue plus one secondary structure element segments), and n = |S| (the number of attributes, n = 5 in this case).
As shown in Fig. (1), for a protein amino acid sequence and corresponding secondary structure sequence of length k, only the secondary structure elements from the third position to position (k-2) are extracted as the 5-residue segments. In other words, the first and second positions at the beginning of the secondary structure sequence, as well as the last and second-to-last positions at the end of the secondary structure sequence, are not extracted as 5-residue segments. To handle these positions, extractions are done slightly differently, as shown in Fig. (2). These 3-residue and 4-residue segments also are used as input to the RT-RICO rule generation algorithm to generate rules. As previously mentioned, the input to RT-RICO Step 2, Rule Generation, is in the form of an m × (n+1) matrix, where m is the number of all entities, and n = |S| (the number of attributes, where n = 3 for 3-residue segments, and n = 4 for 4-residue segments). The same rule generation algorithm applies to all these segments. The rules generated are used in step 3 to predict the secondary structure elements at the first and second positions, as well as the last and second-to-last positions of unknown secondary structure sequences, respectively.
For an amino acid sequence of length k, (k-4) 5-residue segments are extracted, whereas only two 3-residue segments (in the first and last positions), and two 4-residue segments (in the second and second-to-last positions) are extracted. As the extraction is done for a large number of protein domains (Table 1), the rule generation and prediction operations in later steps involve mostly 5-residue segments in terms of the training data size. For this reason, only 5-residue segment numbers are recorded in the prediction result tables, and only 5-residue segment numbers are considered in the algorithm time complexity that is discussed in later sections.
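The extraction described above can be sketched as follows. The exact layout of the windows is given in Fig. (1) and Fig. (2), which are not reproduced here, so the sketch makes two assumptions. The 5-residue window is taken to be centered on the position being labeled. The edge segments are taken to be the first three, first four, last four and last three residues for positions 1, 2, k-1 and k, respectively. The function name `extract_segments` is hypothetical.

```python
def extract_segments(residues: str, structure: str):
    """Return (segment, decision) records for one protein domain.

    Assumed window layout (the figures are not reproduced here):
      position 1      -> first 3 residues
      position 2      -> first 4 residues
      positions 3..k-2 -> centered 5-residue window
      position k-1    -> last 4 residues
      position k      -> last 3 residues
    """
    if len(residues) != len(structure):
        raise ValueError("sequences must have the same length")
    k = len(residues)
    if k < 5:
        raise ValueError("need at least five residues")

    records = [
        (residues[0:3], structure[0]),  # 3-residue segment, first position
        (residues[0:4], structure[1]),  # 4-residue segment, second position
    ]
    # Zero-based index i corresponds to one-based position i + 1.
    for i in range(2, k - 2):
        records.append((residues[i - 2:i + 3], structure[i]))
    records.append((residues[k - 4:k], structure[k - 2]))  # second-to-last
    records.append((residues[k - 3:k], structure[k - 1]))  # last position
    return records


if __name__ == "__main__":
    for segment, decision in extract_segments("ACDEFGH", "CCHHHEE"):
        print(segment, decision)
```

For a sequence of length k this yields k-4 records with five residues, two with four and two with three, matching the counts given in the text.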

RT-RICO Step 2, Rule Generation
RT-RICO generates rules based on the segments in the form of an m × (n+1) matrix. The main RT-RICO rule generation algorithm is covered in Section 4. Some examples of the generated rules are shown in Fig. (3) in two separate formats. The first format is intended to be read by the computer programs at the later prediction stage (i.e., the computer rule format). The second format is intended to be read by the user (i.e., the human rule format). The first rule (in human rule format) is interpreted as follows: if the fourth position attribute (or "3" as interpreted by the program) is "C", and the fifth position attribute (or "4" as interpreted by the program) is "C", then the sixth attribute (the decision attribute, or "5" as interpreted by the program) is "H" with a confidence of 91.53% and a support of 0.04864442%. The definitions of confidence and support can be found in [27].
The corresponding first rule (in computer rule format) is interpreted as follows: if the first position attribute is "+" (representing any amino acid element), the second position attribute is "+", the third position attribute is "+", the fourth position attribute is "C", and the fifth position attribute is "C", then the sixth attribute (i.e., the decision attribute) is "H." The number of occurrences of the fourth position attribute (which is "C") and the fifth position attribute (which is "C") equals 720 among all inputs to RT-RICO. The number of occurrences of the fourth position attribute (which is "C"), the fifth position attribute (which is "C"), and the sixth attribute (which is "H"), equals 659 among all inputs to RT-RICO. The confidence is 91.53% and the support is 0.04864442%.
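The two occurrence counts quoted above are enough to reproduce the confidence value: 659 / 720 ≈ 91.53%. The Python sketch below shows how such statistics could be computed from a list of (segment, decision) records. It assumes that confidence is the rule count divided by the count of the condition part, and that support is the rule count divided by the total number of records. The precise definitions are those of [27], so both formulas, and the helper names, should be read as assumptions.

```python
def matches(pattern: str, segment: str) -> bool:
    """True if the segment fits the pattern; '+' stands for any amino acid."""
    return len(pattern) == len(segment) and all(
        p == "+" or p == s for p, s in zip(pattern, segment)
    )


def rule_stats(records, pattern: str, decision: str):
    """Return (condition count, rule count, confidence %, support %).

    Assumed definitions: confidence = rule count / condition count,
    support = rule count / number of records.
    """
    m = len(records)
    n_condition = sum(1 for seg, _ in records if matches(pattern, seg))
    n_rule = sum(
        1 for seg, dec in records if dec == decision and matches(pattern, seg)
    )
    confidence = 100.0 * n_rule / n_condition if n_condition else 0.0
    support = 100.0 * n_rule / m if m else 0.0
    return n_condition, n_rule, confidence, support


if __name__ == "__main__":
    toy = [("AAACC", "H"), ("GGGCC", "H"), ("LLLCC", "E"), ("AAAAA", "C")]
    print(rule_stats(toy, "+++CC", "H"))
    # Confidence of the rule quoted in the text, from its two counts.
    print(round(100.0 * 659 / 720, 2))
```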

RT-RICO Step 3, Prediction
Finally RT-RICO loads protein primary structures from the test dataset, and predicts the secondary structure elements. As shown in Fig. (4), for each secondary structure element prediction position (for a corresponding amino acid sequence of length k, from position 3 to k-2), five "neighboring" amino acid residues are extracted to form a segment of five amino acid residues. Each of these segments is compared with the generated rules (generated from 5-residue segments). If a segment matches a rule, the support value of the rule is taken into consideration for the prediction of the related secondary structure element.
The algorithm first searches for matching rules with 100% confidence value. The secondary structure element with the highest total support value (among 100% confidence value rules) is selected.
If no matching rule exists among 100% confidence value rules, the algorithm then searches for other matching rules (with confidence values greater than or equal to 90%, but less than 100%). The secondary structure element with the highest total support value among these rules is selected as the predicted secondary structure element for that specific position.
If no matching rule is found for the segment at all, the secondary structure of the previous position is used as the predicted secondary structure.
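The tiered search described in the last three paragraphs can be sketched as below. Rules are represented as hypothetical tuples (pattern, decision, confidence %, support %), with "+" as the wildcard. The fallback state "C" used when there is no previous position is an assumption, since the text does not say what happens in that case. Only positions 3 to k-2 are handled here.

```python
from collections import defaultdict


def matches(pattern: str, segment: str) -> bool:
    """True if the segment fits the pattern; '+' stands for any amino acid."""
    return len(pattern) == len(segment) and all(
        p == "+" or p == s for p, s in zip(pattern, segment)
    )


def predict_position(segment, rules, previous, default="C"):
    """Predict one secondary structure element from a residue segment."""
    tiers = (
        lambda c: c >= 100.0,         # first: rules with 100% confidence
        lambda c: 90.0 <= c < 100.0,  # then: rules with 90% <= confidence < 100%
    )
    for in_tier in tiers:
        totals = defaultdict(float)
        for pattern, decision, confidence, support in rules:
            if in_tier(confidence) and matches(pattern, segment):
                totals[decision] += support
        if totals:
            # Element with the highest total support among matching rules.
            return max(totals, key=totals.get)
    # No matching rule: reuse the previous prediction. The default for the
    # very first prediction is an assumption not taken from the text.
    return previous if previous is not None else default


def predict_core(residues: str, rules) -> str:
    """Predict positions 3..k-2 using centered 5-residue windows."""
    predicted = []
    previous = None
    for i in range(2, len(residues) - 2):
        previous = predict_position(residues[i - 2:i + 3], rules, previous)
        predicted.append(previous)
    return "".join(predicted)


if __name__ == "__main__":
    toy_rules = [
        ("+++CC", "H", 100.0, 0.05),
        ("A++++", "E", 100.0, 0.02),
        ("+L+++", "E", 95.0, 0.50),
    ]
    print(predict_core("AAACCGLGGG", toy_rules))
```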
To predict the first and second positions at the beginning of a secondary structure sequence, and the last and second-to-last positions at the end of a secondary structure sequence, three or four "neighboring" amino acid residues are extracted, as shown in Fig. (5). The same prediction algorithm described above handles the secondary structure prediction at these positions, but uses the rules generated from 3-residue and 4-residue segments, as discussed in Section 3.2.
The number of residues used in the RS126 test dataset, and the final Q3 score of the RS126 set are shown in Table 1.

MAIN RT-RICO RULE-GENERATION ALGORITHM
Although the RT-RICO protein secondary structure prediction method consists of the three steps mentioned in Section 3, the most computationally intensive part is in the second step, rule generation. This section covers the details of that algorithm.

Rule Induction From Coverings
RT-RICO is based on a previously implemented method called RICO (Rule Induction from Coverings) [20]. RICO uses some of the concepts introduced by Pawlak [28] for rough sets, a classification scheme based on partitions of entities in a dataset [29].
In this approach, if S is a set of attributes and R is a set of decision attributes (i.e., attributes whose values we are interested in being able to determine if the values of the attributes in the set S are known), then a covering P of R in S can be found if the following three conditions are satisfied: i. P is a subset of S.
ii. R depends on P (i.e., P determines R). That is, if a pair of entities x and y cannot be distinguished by means of attributes from P, then x and y also cannot be distinguished by means of attributes from R. If this is true, then entities x and y are said to be indiscernible by P (and, hence, R), denoted x ~P y. An indiscernibility relation ~P is such a partition over all entities in the data set.
iii. P is minimal.
Condition (ii) is true if and only if an equivalent condition ≤, known as the attribute dependency inequality, holds for P* and R*, the partitions of all attributes and decisions generated by P and R, respectively, where, for a set of attributes A: The inequality P* ≤ R* holds if and only if for each block B of P*, there exists a block B' of R* such that B is a subset of B'.
Once a covering is found, it is a straightforward process to induce rules from it. For example, if a set of attributes $P = \{a_1, a_2\}$ is found to determine a set of attributes $R = \{a_3\}$ (i.e., P is a covering for R), then rules of the form $(a_1, v_1)\,(a_2, v_2) \to (a_3, v_3)$ (read as "if $a_1$ equals $v_1$ and $a_2$ equals $v_2$, then $a_3$ equals $v_3$") can be generated, where $v_1$, $v_2$, and $v_3$ are actual values of attributes $a_1$, $a_2$, and $a_3$, respectively, for which the relationship holds in the dataset. Such a rule also conveys a notion of non-independence between the attributes in the sets P and R (e.g., $a_3$ is not independent of $a_1$ and $a_2$). Here non-independence means that the relationship between the two attributes could be correlation, dependency, or co-dependency.
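A small Python sketch may help make the attribute dependency inequality and the rule induction step concrete. It partitions a toy dataset by the values of P and of R, checks that every block of P* lies inside some block of R*, and then reads the rules off the rows. The dataset, the function names and the use of integer attribute indices are illustrative assumptions. The minimality condition (iii) is not checked here.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices by their values on the given attribute indices."""
    blocks = defaultdict(set)
    for idx, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].add(idx)
    return dict(blocks)


def depends_on(rows, p_attrs, r_attrs) -> bool:
    """P* <= R*: every block of P* is a subset of some block of R*."""
    r_blocks = list(partition(rows, r_attrs).values())
    return all(
        any(block <= r_block for r_block in r_blocks)
        for block in partition(rows, p_attrs).values()
    )


def induce_rules(rows, p_attrs, r_attrs):
    """Return sorted (conditions, conclusions) pairs if R depends on P."""
    if not depends_on(rows, p_attrs, r_attrs):
        return []
    rules = {
        (
            tuple((a, row[a]) for a in p_attrs),
            tuple((a, row[a]) for a in r_attrs),
        )
        for row in rows
    }
    return sorted(rules)


if __name__ == "__main__":
    # Toy dataset: attributes 0 and 1 are conditions, attribute 2 is the decision.
    data = [("x", "p", "1"), ("x", "q", "2"), ("y", "p", "2"), ("x", "p", "1")]
    print(depends_on(data, [0, 1], [2]))
    print(depends_on(data, [0], [2]))
    for rule in induce_rules(data, [0, 1], [2]):
        print(rule)
```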

Relaxed Attribute Dependency Inequality
All rules generated from coverings in this manner are "perfect" in the sense that there is no instance in the dataset for which the rule is not true. In order to relax this restriction somewhat (in much the same way that rules generated by decision tree induction are not always true for all instances in the dataset), the definition of the attribute dependency inequality can be modified as follows.

Definition 1 Relaxed Attribute Dependency Inequality
The inequality $P^* \le_r R^*$ holds if and only if there exists a block $B$ of $P^*$ and a block $B'$ of $R^*$ such that $B$ is a subset of $B'$.
As an example for the data set of

Relaxed Coverings
Similarly, the definition of a covering can be relaxed in order to induce rules depending on as small a number of attributes as possible.

Definition 2 Relaxed Coverings
A subset P of the set S is called a relaxed covering of R in S if and only if $P \to_r R$ and P is minimal in S. This is equivalent to saying that a subset P of the set S is a relaxed covering of R in S if and only if $P \to_r R$ and no proper subset P' of P exists such that $P' \to_r R$.
As an example for the dataset of Table 2, suppose rules need to be induced for R = {3}. The covering {1, 2} can be used; that is, for any assignment of values for the covering {1, 2}, each entity in

Checking Attribute Dependency
To implement rule induction from coverings with the relaxed constraints, it is necessary to use the concept of checking attribute dependency, which was introduced by Grzymala-Busse [29]. In order for P to be a relaxed covering of R in S, the following conditions must be true:
i. P must be a subset of S,
ii. R must depend on set P (for some values of P), and
iii. P must be minimal.
For the specific application of generating rules for protein secondary structure prediction, rules involving more attributes are preferred over rules involving fewer attributes, because they normally generate higher confidence values. In addition, all the possible attribute position combinations are needed to predict secondary structure. As a result, condition (iii) is not enforced for rule generation in our implementation. In fact, condition (iii) cannot be enforced for this particular application; otherwise, many meaningful rules involving multiple attributes and high confidence values would not be generated, leading to inaccurate predictions.

Condition (ii) is true if and only if the relaxed attribute dependency inequality, $P^* \le_r R^*$, is satisfied.
The question then becomes how the above inequality can be efficiently checked. For each set P, a new partition, generated by P, must be determined. Partition U should be generated by P. For partitions π and τ of U, π·τ is a partition of U such that two entities, x and y, are in the same block of π·τ if and only if x and y are in the same block for both partitions π and τ of U. For example, referring to

Finding the Set of All Relaxed Coverings
The algorithm R-RICO (Relaxed Rule Induction from Coverings) which is given below can be used to find the set C of all relaxed coverings of R in S (as well as the related rules).
Let S be the set of all attributes, and let R be the set of all decision attributes. Let k be a positive integer. The set of all subsets of the same cardinality k of the set S is denoted $P_k = \{\{x_{i_1}, x_{i_2}, \ldots, x_{i_k}\} \mid x_{i_1}, x_{i_2}, \ldots, x_{i_k} \in S\}$ [29]. Note that the condition (iii) for a relaxed covering is not enforced in the R-RICO algorithm. The time complexity of the R-RICO algorithm is exponential in |S|, the number of attributes in the dataset.

RT-RICO Algorithm
The R-RICO algorithm produces rules that are 100% correct. However, unlike decision tree induction, R-RICO produces a more comprehensive rule set. The algorithm can be further modified to satisfy some particular level of uncertainty in the rules (e.g., the rule is ≥ 50% true). That is, rather than just reporting a rule R, the rule can be reported as a tuple (R, p) where p is the probability that rule R is true. To accommodate this information in the rules, the definition of attribute dependency inequality must be further modified as in Definition 3.

Definition 3 Relaxed Attribute Dependency Inequality with Threshold
Set R depends on a set P with threshold probability t (0 < t ≤ 1), denoted by $P \to_{r,t} R$, if and only if $P^* \le_{r,t} R^*$; that is, there exists a block $B$ of $P^*$ and a block $B'$ of $R^*$ such that $|B \cap B'| / |B| \ge t$.
It can be observed that, when t= 1, Definitions 1 and 3 represent the same mathematical relation.
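To illustrate Definition 3 on a small scale, the sketch below computes the ratio |B ∩ B'| / |B| for every pair of blocks and keeps the pairs that reach the threshold t. With t = 1 the ratio test reduces to B being a subset of B', which is the condition of Definition 1. The toy dataset and the function names are assumptions made for the example.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices by their values on the given attribute indices."""
    blocks = defaultdict(set)
    for idx, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].add(idx)
    return dict(blocks)


def threshold_blocks(rows, p_attrs, r_attrs, t):
    """Return (P values, R values, ratio) for block pairs with ratio >= t."""
    if not 0 < t <= 1:
        raise ValueError("t must satisfy 0 < t <= 1")
    r_part = partition(rows, r_attrs)
    hits = []
    for p_values, block in partition(rows, p_attrs).items():
        for r_values, r_block in r_part.items():
            ratio = len(block & r_block) / len(block)
            if ratio >= t:
                hits.append((p_values, r_values, ratio))
    return hits


def depends_with_threshold(rows, p_attrs, r_attrs, t) -> bool:
    """True if at least one pair of blocks reaches the threshold t."""
    return bool(threshold_blocks(rows, p_attrs, r_attrs, t))


if __name__ == "__main__":
    # Attribute 0 is the condition, attribute 1 is the decision.
    data = [("x", "1"), ("x", "1"), ("x", "1"), ("x", "2"), ("y", "2"), ("y", "1")]
    print(threshold_blocks(data, [0], [1], 0.75))
    print(depends_with_threshold(data, [0], [1], 1.0))
```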
As an example, for the dataset of Table 4
The definition of relaxed coverings must also be modified to incorporate the notion of the threshold probability, as given in Definition 4.

Definition 4 Relaxed Coverings with Threshold Probability
Let S be a nonempty subset of a set of all attributes, and let R be a nonempty subset of decision attributes, where S and R are disjoint. A subset P of the set S is called a relaxed covering of R in S with threshold probability t (0 < t ≤ 1) if and only if $P \to_{r,t} R$ and P is minimal in S.
Algorithm RT-RICO (Relaxed Threshold Rule Induction From Coverings) finds the set C of all relaxed coverings of R in S (and the related rules), with threshold probability t (0 < t ≤ 1), where S is the set of all attributes, and R is the set of all decisions. The set of all subsets of the same cardinality k of the set S is denoted $P_k = \{\{x_{i_1}, x_{i_2}, \ldots, x_{i_k}\} \mid x_{i_1}, x_{i_2}, \ldots, x_{i_k} \in S\}$.
add rule to output file;
end
k := k+1
end-while;
end-algorithm.
Note that the condition "P is minimal in S" of a relaxed covering with threshold probability is not enforced in the RT-RICO algorithm. The reason for not implementing this condition is the same as the reason mentioned for the R-RICO algorithm. For this application, to generate rules for protein secondary structure prediction, rules involving more attributes are preferred over rules involving fewer attributes, because they normally generate higher confidence values. Also, all the possible attribute position combinations are needed for accurate prediction.
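Since only the tail of the algorithm listing survives above, the following Python sketch is a reconstruction under stated assumptions rather than the authors' listing. It loops over k = 1, ..., |S|, visits every subset in P_k, groups the records into blocks of P*, and emits a rule whenever the fraction of a block that shares one decision value reaches t. Minimality is not enforced, as noted in the text. The sketch groups records with a dictionary, so it does not reproduce the quadratic counting cost of the original implementation. All names, the "+" wildcard output format, and the use of fractions rather than percentages are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def rt_rico(records, t):
    """Return rules (pattern, decision, confidence, support) with confidence >= t.

    records: list of (segment, decision) pairs with equal-length segments.
    Confidence and support are returned as fractions, not percentages.
    """
    if not 0 < t <= 1:
        raise ValueError("t must satisfy 0 < t <= 1")
    m = len(records)
    n = len(records[0][0])
    rules = []
    k = 1
    while k <= n:
        for positions in combinations(range(n), k):  # every subset in P_k
            # Blocks of P*: decisions grouped by the values at `positions`.
            blocks = defaultdict(list)
            for segment, decision in records:
                blocks[tuple(segment[i] for i in positions)].append(decision)
            for values, decisions in blocks.items():
                counts = defaultdict(int)
                for decision in decisions:
                    counts[decision] += 1
                for decision, count in counts.items():
                    confidence = count / len(decisions)  # |B ∩ B'| / |B|
                    if confidence >= t:
                        pattern = ["+"] * n
                        for i, value in zip(positions, values):
                            pattern[i] = value
                        rules.append(
                            ("".join(pattern), decision, confidence, count / m)
                        )
        k += 1
    return rules


if __name__ == "__main__":
    toy = [("AC", "H"), ("AC", "H"), ("AD", "E"), ("GC", "H")]
    for rule in rt_rico(toy, 0.9):
        print(rule)
```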
The time complexity of the RT-RICO algorithm is again exponential in |S|, the number of attributes in the dataset. Specifically, the time complexity is $O(m^2 2^n)$, where m is the number of all entities (the number of 5-residue segments), and n = |S| (the number of attributes). It would appear that $2^n$ dominates the time complexity. But, for the training datasets of this application, n = |S| = 5, and m is sufficiently large. Hence, $m^2$ dominates the time complexity in this case.
As mentioned in Section 3, the rules generated by the RT-RICO algorithm are then compared with the proteins in the test dataset to predict the secondary structure elements.

PARALLELIZED/MODIFIED RT-RICO ALGORITHM
The RT-RICO algorithm has a time complexity of O(m 2 2 n ), where m is the number of all entities (the number of 5-residue segments), and n = |S| (the number of attributes). In practice, n is only 5, while m can be fairly large. Hence, m 2 dominates the time complexity. The test programs were written in PERL, and the largest m value tested was 137,715. When executed on a computer with an Intel Pentium Dual-Core processor, 2 GB of RAM, and Windows XP OS, the total program running time was approximately 14 days.
In order to accommodate a larger dataset (e.g., m value 3,366,832), two new algorithms (Modified RT-RICO and Parallelization of Modified RT-RICO) were developed. The time complexity of modified RT-RICO is only O(m2 n ), although it comes at an acceptable sacrifice of space complexity (i.e., more main memory space is needed, as is discussed later in this section). The program was parallelized using an NVIDIA Tesla C1060 GPU with 4GB of RAM. The 240 cores on this GPU each run at 1.3 GHz. The CPU on the same test machine is a 4core Intel Core i7-920 with 8GB of RAM. With the modified algorithm, and the new hardware, the total program running time improved from days to a few minutes.
The focus of the parallelization of RT-RICO was the rule generation step. It is the most expensive part of the algorithm, since it involves generating rules from each segment, counting the frequency of each rule, and finally calculating the confidence and support of each rule. As mentioned earlier, in the sequential implementation of RT-RICO, the complexity of this step is $O(m^2 2^n)$, where m is the number of segments and n is the number of amino acid residues in a segment. Usually n is fixed at 5, but m can range from a few thousand to millions. To reduce the complexity, and hence the running time, it was essential to reduce the dependence on m from quadratic to linear.
The $m^2$ in $O(m^2 2^n)$ is a result of counting the occurrences of each rule. After generating a rule from a segment, the algorithm has to iterate through the list of m segments to count how many times that rule has been seen. This has to be repeated for each of the $m 2^n$ rules that can be generated. Hence the complexity is $O(m^2 2^n)$.
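The counting step described above can be illustrated with a minimal Python sketch. It is not the authors' Perl program: the function names, the "+" wildcard convention (borrowed from the [+,L,+,E,Q] notation used in this section), and the omission of the secondary structure label are all assumptions made for illustration. The inner rescan over every segment is what produces the quadratic factor.

```python
WILDCARD = "+"  # assumed placeholder for an ignored position


def rules_from_segment(segment):
    """Yield the 2^n - 1 patterns that keep a non-empty subset of positions."""
    n = len(segment)
    for mask in range(1, 2 ** n):
        yield tuple(
            res if (mask >> i) & 1 else WILDCARD
            for i, res in enumerate(segment)
        )


def matches(rule, segment):
    """A rule matches a segment if every non-wildcard position agrees."""
    return all(r == WILDCARD or r == s for r, s in zip(rule, segment))


def naive_rule_counts(segments):
    """Count rule occurrences by rescanning all segments for every rule."""
    counts = {}
    for seg in segments:                      # m segments
        for rule in rules_from_segment(seg):  # 2^n - 1 rules per segment
            # Rescanning all m segments here is the source of the m^2 factor.
            counts[rule] = sum(matches(rule, other) for other in segments)
    return counts
```

With only two segments the rescan is harmless, but its cost grows with the square of the number of segments.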
RT-RICO can, however, avoid rescanning the list of m segments for every rule if it simply increments a rule-specific counter each time a rule is generated. The drawback is that there needs to be a counter for every possible rule that can be generated, and this requires an immense amount of main memory. In the worst case, $20^n \times 2^n$ rules can be generated, which translates to approximately 99 Megabytes for 5aa segments, and 163 Gigabytes for 7aa segments. This requirement grows exponentially with n. The calculation of space complexity is illustrated in Fig. (6).
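As a rough check of the quoted memory figures, the short Python calculation below counts the counters under two assumptions that are not stated explicitly in the text: each segment contributes $2^n - 1$ non-empty position subsets (the per-segment rule count given later in this section), and each counter occupies one byte. Under those assumptions the totals are 99.2 million counters for n = 5 and about 162.6 billion for n = 7, which is consistent with the approximate 99 Megabyte and 163 Gigabyte figures. The function name is hypothetical, and the calculation in Fig. (6) may differ in detail.

```python
ALPHABET_SIZE = 20  # the 20 standard amino acids


def worst_case_counters(n):
    """Counters needed if every residue pattern has a slot per non-empty position subset."""
    return ALPHABET_SIZE ** n * (2 ** n - 1)


for n in (5, 7):
    counters = worst_case_counters(n)
    # Assumption: one byte per counter, decimal units (1 MB = 10^6 bytes).
    print(f"n={n}: {counters:,} counters, about {counters / 1e6:,.1f} MB")
```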
Despite the exponential space complexity, 5aa segments only require 99 Megabytes of memory. This was further reduced to just 4 Megabytes, by accounting for the duplicate rules that two different segments can generate. For example, the two 5aa segments [S,L,F,E,Q] and [E,L,S,E,Q] can generate the same rule for [+,L,+,E,Q]. The mathematics behind this space optimization is not explained here, because the 99 MB, or the 4 MB required by the modified algorithm, are both trivial amounts on the newer test machine that was used (which has 8192 Megabytes of memory).

Modified Algorithm for Rule Generation
In essence, the modified RT-RICO algorithm compromises on space complexity for the sake of reducing time complexity. Algorithm 3 describes this modification in more detail.

Algorithm 3: Modified RT-RICO
begin
    Allocate counters for every possible rule (initialize to 0)
    for each segment
        for each of the $2^n - 1$ rules that can be generated from this segment
            Calculate the memory location of the counter corresponding to this rule, and increment it by 1
        end-for
    end-for
    Read each counter and calculate the confidence and support for those rules that pass the relaxed threshold
end-algorithm.
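The counter-based loop of Algorithm 3 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the indexing scheme in `counter_index` (a bit mask of kept positions plus a base-20 encoding of the kept residues) is one possible way to "calculate the memory location" and is not taken from the paper. A sparse `Counter` stands in for the preallocated dense array so that the example stays small, and the secondary structure label needed for confidence and support is omitted.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CODE = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def counter_index(segment, mask):
    """Map (segment, kept-position mask) to a slot in [0, 20^n * (2^n - 1)).

    Hypothetical layout: one block of 20^n slots per non-empty mask;
    wildcard positions contribute 0 to the offset inside the block.
    """
    n = len(segment)
    offset = 0
    for i, aa in enumerate(segment):
        if (mask >> i) & 1:
            offset += CODE[aa] * 20 ** i
    return (mask - 1) * 20 ** n + offset


def count_rules(segments):
    """Single pass over the segments: O(m * 2^n) counter increments."""
    counters = Counter()  # sparse stand-in for the dense counter array
    for seg in segments:
        for mask in range(1, 2 ** len(seg)):
            counters[counter_index(seg, mask)] += 1
    return counters
```

For the two segments [S,L,F,E,Q] and [E,L,S,E,Q], the mask that keeps positions 2, 4, and 5 maps both to the same slot, so the shared rule [+,L,+,E,Q] is counted twice without any rescan of the segment list.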
The complexity of this algorithm is just $O(m 2^n)$ because the algorithm does not need to count the reoccurrence of each rule. The generated rules simply increment a counter whenever they are generated. There is an additional amount of time required to calculate the memory location of the counter that corresponds to a rule. However, this is negligible, and as a constant, it does not affect the overall complexity of the algorithm.
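To give a sense of scale, the ratio of the two bounds can be worked out for the largest sequential run reported in this section (m = 137,715, n = 5). These are operation-count estimates that ignore constant factors, so they indicate the order of the improvement rather than measured running times.

```latex
\[
\frac{m^2 \, 2^n}{m \, 2^n} = m, \qquad
m = 137{,}715,\ n = 5:\quad
m^2 \, 2^n \approx 6.07 \times 10^{11}, \qquad
m \, 2^n = 4{,}406{,}880 .
\]
```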

Parallelization of Rule Generation
The modified RT-RICO rule generation algorithm places no restrictions on the order in which rules are generated, so parallelizing it involves a straightforward distribution of the input data among processing units. Each processing unit accepts a segment as input, determines a rule from that segment, and increments the shared memory counter corresponding to that rule. Theoretically, these operations can be performed in parallel by any number of concurrent processing units. However, to minimize potentially conflicting concurrent updates of shared memory locations, the number of concurrent processing units (p) is kept at $2^n - 1$, which is the number of rules that a single segment can generate. Because these $2^n - 1$ rules are guaranteed to be distinct, their updates to the shared memory counters are mutually exclusive. Algorithm 4 shows a parallelized version of Algorithm 3. The time complexity of Algorithm 4 is $O((m 2^n)/p)$, where p is the number of concurrent processing units.

Algorithm 4: Modified RT-RICO
begin
    Allocate counters for every possible rule (initialize to 0)
    for each segment s
        Send s to $2^n - 1$ processes that each calculate a different rule from it, and increment the corresponding shared memory counter
    end-for
    Read each counter and calculate the confidence and support for those rules that pass the relaxed threshold
end-algorithm.
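The partitioning idea behind Algorithm 4 can be sketched in Python with one worker per position mask. This is only an analogy to the GPU implementation and rests on several assumptions: the function names are hypothetical, each worker processes all segments for its own mask (rather than receiving one segment at a time), and the "+" wildcard convention follows the notation used earlier in this section. CPython threads do not run this kind of work truly in parallel, so the sketch shows why the updates never collide, not how to obtain a speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def rule_for_mask(segment, mask):
    """Build the rule that keeps the positions selected by mask."""
    return tuple(aa if (mask >> i) & 1 else "+" for i, aa in enumerate(segment))


def count_for_mask(segments, mask):
    """Worker that owns one mask and counts only rules with that wildcard pattern."""
    return Counter(rule_for_mask(seg, mask) for seg in segments)


def parallel_rule_counts(segments, n):
    """Run p = 2^n - 1 workers, one per non-empty position mask."""
    masks = range(1, 2 ** n)
    merged = Counter()
    with ThreadPoolExecutor(max_workers=len(masks)) as pool:
        for part in pool.map(lambda m: count_for_mask(segments, m), masks):
            # Rules from different masks have different wildcard patterns,
            # so the partial results never share a key.
            merged.update(part)
    return merged
```

Because each worker writes only to rules with its own wildcard pattern, no two workers ever update the same counter, which mirrors the mutual exclusion argument made for the $2^n - 1$ processing units.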

Massively Parallel Computation Using GPUs
Compute Unified Device Architecture (CUDA) is a programming interface for developing general-purpose applications on Graphics Processing Units (GPUs). GPUs are conventionally used for graphics acceleration, which typically involves repeatedly performing the same computational operation on multiple input data, also known as SIMD (single instruction, multiple data) operations. Because of the constraints placed on SIMD operations, GPU hardware is designed with features such as massively parallel processing and pipelining to accelerate the execution of these operations. With CUDA, GPUs can be programmed directly in the C programming language to perform general-purpose operations that would normally be assigned to CPUs. However, because the GPU hardware remains the same, GPUs are still best suited to SIMD operations, and more complex operations are likely to run faster sequentially on a CPU.
The modified RT-RICO rule generation algorithm is an ideal SIMD operation. The calculation of the memory location of the counter that corresponds to a rule extracted from a segment is performed over and over again for all the given segments in the input file. This SIMD operation was parallelized using an NVIDIA Tesla C1060 GPU with 4 GB of RAM. The 240 cores on this GPU each run at 1.3 GHz. The CPU on the same test machine was a 4-core Intel Core i7-920 with 8 GB of RAM. The total program running time was close to 3 minutes for rule generation on the dataset in Table 1.

RESULTS
The RS126 set [4] and the CB396 set [2] are both non-redundant test datasets created with the objective of comparing different protein secondary structure prediction methods.
These two standard test datasets were used to evaluate the performance of the RT-RICO protein secondary structure prediction method. The two datasets have been studied extensively in the literature, and have been used as standard datasets to evaluate other prediction methods. Some of the prediction scores obtained with different methods on the same datasets are mentioned in Sections 1 and 2. It should be noted that the CB396 set does not include protein domains from the RS126 set. Table 1 lists the number of protein domains in each training dataset and the performance of the RT-RICO prediction method on the RS126 test dataset. Table 5 shows the number of protein domains in each training dataset and the performance of RT-RICO on the CB396 test dataset. Cuff and Barton [2] tested the RS126 set with various prediction methods and obtained Q3 scores of 73.5% (PHD), 71.1% (DSC), 70.3% (PREDATOR), 72.7% (NNSSP) and 74.8% for the CONSENSUS method. The final Q3 scores of RT-RICO prediction using the RS126 test dataset are shown in Table 1. The "all-α" protein domains have the highest Q3 score, 87.40%. The "all-β" and "α/β" protein domains have Q3 scores of 82.22% and 78.05%, respectively. The "α+β" and "Others" protein domains have Q3 scores of 84.64% and 81.23%, respectively. On average, RT-RICO has a Q3 score of 81.75%, which is higher than the Q3 scores of the other methods reported in [2] for the same RS126 test dataset.
Cuff and Barton [2] also tested the same prediction methods using the CB396 set, resulting in Q3 scores of 71.9% (PHD), 68.4% (DSC), 68.6% (PREDATOR), 71.4% (NNSSP) and 72.9% for the CONSENSUS method. The final Q3 scores of the RT-RICO prediction method on the CB396 test dataset are shown in Table 5. The "all-α" protein domains have the highest Q3 score, 83.50%. The "all-β" and "α/β" protein domains have Q3 scores of 80.14% and 78.79%, respectively. The "α+β" and "Others" protein domains have Q3 scores of 76.50% and 76.35%, respectively. On average, RT-RICO has a Q3 score of 79.19%, which is higher than the Q3 scores of the other methods reported in [2] for the same CB396 test dataset.
It should be noted that, because of the different approaches and test designs, it is difficult to compare prediction results directly between this method and other methods. The final Q3 score comparison should be used as a general guide, not as a strict percentile comparison.

CONCLUSIONS
A novel rule-based method, RT-RICO, which generates rules that can be used to predict protein secondary structure, was presented in this paper. This method performed very well on the standard test datasets RS126 and CB396. The Q3 scores of 81.75% for the RS126 set and 79.19% for the CB396 set are better than the Q3 scores generated by comparable computational methods using the same datasets. The fast running time of the modified, parallelized program makes it possible to generate rules from the large amount of available protein data within an acceptable timeframe, and to predict the secondary structure of available test datasets efficiently. In the future, we plan to continue to look for ways to improve the accuracy of this promising new rule-based prediction method.