Using Chou's Pseudo Amino Acid Composition and Machine Learning Method to Predict the Antiviral Peptides

Traditional antiviral therapies are expensive, of limited availability, and cause several side effects. Designing antiviral peptides is therefore very important, because these peptides interfere with key stages of the virus life cycle. Most antiviral peptides are derived from viral proteins, for example peptides derived from the HIV-1 capsid protein. Because of the importance of these peptides, in this study the concept of pseudo amino acid composition (PseAAC) and machine learning methods are used to classify and identify antiviral peptides.


INTRODUCTION
Antimicrobial peptides exist naturally in all organisms. They play an important role in the immune system and have a broad spectrum of antimicrobial activity [1,2]. They act as antibacterial, antifungal, antiviral and sometimes anticancer molecules [3][4][5]. Some peptides with antiviral activity, such as C34 from human immunodeficiency virus [5], NS5A from hepatitis C virus [6] and 7524BVS7 from hepatitis B virus, have been studied previously [7]. In recent decades, many anti-HIV peptides have been isolated from HIV-1 proteins, and most of them are fusion inhibitors [8]. Some of these peptides are in clinical use, for example Enfuvirtide (T20). T20 was isolated from the heptad repeat region of the gp41 protein [9]. Capsid (p24) is another HIV-1 protein, which plays an important role in maturation and viral assembly. It has been demonstrated that peptides corresponding to the HIV-1 capsid protein can prevent the spread of viral infection [10].
Antiviral peptides are sometimes preferable to other antiviral agents, because they have low molecular weight, low toxicity, rapid elimination from the host cells and few side effects [11,12]. Antiviral peptides have been developed to block virus attachment to or entry into host cells, or to inhibit viral replication. Despite sharing the same mechanism of action, antiviral peptides have low sequence homology. Thus, it is difficult to predict antiviral peptides on the basis of sequence homology. Because computational methods are inexpensive, predicting antiviral peptides before carrying out the bench work is valuable [13]. Many characteristics of these molecules can be predicted by a variety of computational methods, which employ several protein features, for instance amino acid sequence [14,15], template [16][17][18] and amino acid composition (AAC) [19,20]. On the other hand, one of the most important and most difficult problems in computational biology is how to formulate biological sequences as vectors while still retaining a considerable part of their sequence-order information. To address this challenging problem, the pseudo amino acid composition (PseAAC) was proposed by Chou. Since the concept of PseAAC, or Chou's PseAAC [21], was proposed in 2001, it has rapidly penetrated into almost all areas of computational proteomics, such as predicting cyclins [22], metalloproteinase families [23], the risk type of human papillomaviruses [24] and protein quaternary structure [25]; discriminating protein structure classes [26]; predicting anticancer peptides [27], bacterial protein subcellular localization [28], membrane protein types [29] and cysteine S-nitrosylation sites in proteins [30]; identifying heat shock protein families [31]; and predicting hydroxyproline and hydroxylysine in proteins [32], among many other examples [33][34][35][36][37][38][39][40][41][42][43][44][45][46][47]. Recently, the concept of PseAAC was further extended to represent the feature vectors of DNA and nucleotides [41][48][49][50][51][52][53], as well as other biological samples (see, e.g., [54]).
Because PseAAC has been widely and increasingly used, in addition to the web server 'PseAAC' built in 2008, three powerful open-access software packages, called 'PseAAC-Builder' [55], 'propy' [56], and 'PseAAC-General' [39], were recently established: the first two generate various modes of Chou's special PseAAC, while the third generates those of Chou's general PseAAC. Similarly to PseAAC for protein/peptide sequences, two powerful web servers have also been established to generate the pseudo K-tuple nucleotide composition (PseKNC) for DNA/RNA sequences. In the present study, the concept of PseAAC and machine learning methods were used for the classification and prediction of antiviral peptides.

Dataset
In this work, two datasets were used. The positive set includes 614 sequences of antiviral peptides and the negative set includes 452 non-antiviral peptides. These sequences were taken from the Antiviral Peptides prediction database (http://crdd.osdd.net/server/avppred). All peptides contain 10 to 100 amino acids. The CD-HIT program was applied to remove redundant peptides at a 90% similarity threshold [57]. After CD-HIT filtering, the number of antiviral peptides was reduced to 342 and the number of negative peptides was reduced to 312.

Producing Chou's PseAAC
The concept of Chou's PseAAC was presented in 2001 and quickly spread into many areas of computational proteomics [23,24][58][59][60][61]. A flexible web server generates a variety of protein PseAACs (http://chou.med.harvard.edu/bioinf/PseAAC). The PseAAC of a protein or peptide is represented by more than 20 discrete factors. The first 20 factors are the components of its conventional amino acid composition, whereas the additional factors incorporate its sequence-order information through a variety of methods [62][63][64]. Three parameters are typically set to produce the various types of PseAAC: the quantitative amino acid characters, the weight factor and the rank of correlation. The PseAAC server supports six amino acid characters: (1) hydrophobicity, (2) hydrophilicity, (3) side chain mass, (4) pK1 (alpha-COOH), (5) pK2 (NH3) and (6) pI (isoelectric point). These characters are used to quantify the effect of the positions of the amino acids along the peptide sequence.
In this study, type 1 PseAAC (the parallel type) was used with λ = 1 and weight factor = 0.05, as in Chou's original paper and similar studies [27,65]. Applying the six characters and their combinations produced 126 features per peptide, which were used to classify the dataset [66].
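As an illustration, the parallel (type 1) PseAAC computation can be sketched in Python. This is a minimal sketch under stated assumptions, not the implementation of the PseAAC web server: the function names, the parameter name `lam` for the correlation rank, and the example peptide are hypothetical, and the Kyte-Doolittle hydropathy scale is used only as a stand-in for the server's own hydrophobicity values. The sketch yields 20 + `lam` components for one set of characters and does not attempt to reproduce the 126-feature set used in this study.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values: an assumed stand-in property scale,
# not necessarily the hydrophobicity scale used by the PseAAC server.
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def standardize(scale):
    """Rescale a property to zero mean and unit standard deviation over the 20 residues."""
    values = [scale[a] for a in AMINO_ACIDS]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return {a: (scale[a] - mean) / std for a in AMINO_ACIDS}


def pseaac_type1(sequence, scales, lam=1, weight=0.05):
    """Return the (20 + lam)-dimensional parallel-correlation PseAAC vector."""
    if len(sequence) <= lam:
        raise ValueError("sequence must be longer than lam")
    if any(residue not in AMINO_ACIDS for residue in sequence):
        raise ValueError("sequence contains a non-standard residue")
    props = [standardize(s) for s in scales]
    n = len(sequence)
    # Conventional amino acid composition (20 normalized frequencies).
    freqs = [sequence.count(a) / n for a in AMINO_ACIDS]
    # Sequence-order correlation factors theta_1 .. theta_lam.
    thetas = []
    for k in range(1, lam + 1):
        total = 0.0
        for i in range(n - k):
            a, b = sequence[i], sequence[i + k]
            total += sum((p[b] - p[a]) ** 2 for p in props) / len(props)
        thetas.append(total / (n - k))
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]


peptide = "GLFDIVKKVVGALGSL"  # hypothetical example sequence
vector = pseaac_type1(peptide, [KD_HYDROPATHY], lam=1, weight=0.05)
print(len(vector), round(sum(vector), 6))
```

The first 20 components carry the composition and the last component carries the sequence-order term; by construction the components sum to one.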

Adaboost
Freund and Schapire presented Adaboost (adaptive boosting) as a meta-algorithm in 1995 [67]. This algorithm combines simple weak (base) classifiers to construct a strong classifier. In contrast with bagging, it re-weights the training set instead of resampling it. Initially, equal weights are assigned to all training samples. In other words, if the training set consists of $(x_1, y_1), \ldots, (x_m, y_m)$, the initial weights are $D_1(i) = 1/m$, $i = 1, \ldots, m$. With these weights, a weak classifier is trained on the dataset. Then, depending on how well the classifier has learned the training set, the weights are updated, and this process is repeated over a series of T rounds.
The weights of incorrectly classified examples are increased in each round, so that each new classifier concentrates on the hard examples in the training set. After the predefined number of rounds (T), the predicted class of each sample is obtained by a weighted vote over the predictions of the individual classifiers, with each vote weighted in proportion to the accuracy of that classifier on its training set.
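The re-weighting and weighted-vote steps described above can be sketched as follows. This is a minimal illustration of binary Adaboost with labels in {-1, +1} and one-level threshold rules as weak learners, written in plain Python; it is not the WEKA implementation used in this study, and the function names and the toy one-feature dataset are hypothetical.

```python
import math


def train_stump(X, y, weights):
    """Return (weighted error, feature index, threshold, polarity) of the best one-level rule."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[j] <= threshold else -polarity for row in X]
                err = sum(w for w, p, t in zip(weights, preds, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, threshold, polarity)
    return best


def stump_predict(stump, row):
    _, j, threshold, polarity = stump
    return polarity if row[j] <= threshold else -polarity


def adaboost_fit(X, y, rounds=10):
    m = len(X)
    weights = [1.0 / m] * m  # D_1(i) = 1/m
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, weights)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)     # vote weight grows with accuracy
        ensemble.append((alpha, stump))
        # Misclassified samples gain weight, correctly classified ones lose weight.
        weights = [w * math.exp(-alpha * t * stump_predict(stump, row))
                   for w, t, row in zip(weights, y, X)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy data: no single threshold separates the classes, but three rounds do.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
ensemble = adaboost_fit(X, y, rounds=3)
predictions = [adaboost_predict(ensemble, row) for row in X]
print(predictions)
```

On this toy set the first weak rule alone misclassifies two of the six samples, whereas the weighted vote of three rules classifies all of them correctly, which is the behaviour the re-weighting is meant to produce.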
The choice of base classifier is one of the factors that strongly affect the performance of the Adaboost algorithm. In the present study, three decision trees were used as base classifiers: the reduced-error pruning tree (REPTree), J48 and Decision Stump. REPTree builds a decision/regression tree using information gain/variance reduction and then prunes it. This fast algorithm reduces the complexity of the classification process. In addition, pruning is used to find the best sub-tree of the initially grown tree, that is, the one with the minimum error for the test set [68].
J48 is the implementation of the C4.5 algorithm in the WEKA data mining tool and produces decision trees. It is a standard algorithm that is commonly used in practical machine learning [68]. Decision stumps are one-level decision trees that use a single feature value for prediction. Such trees are of little use for prediction on their own and are often used as weak (base) learners in ensemble techniques such as bagging and boosting [69].
Moreover, Radial Basis Function (RBF) and Naïve Bayes were used as base classifiers. RBF is a feed-forward multilayer neural network that can classify non-linearly separable data. This algorithm uses non-linear functions to transform the input feature space into a working feature space. Then, it applies a linear function to the working feature space to produce the output space. This method can be used for classification and/or function approximation problems. The Naïve Bayes algorithm is a supervised learning method that exploits the Bayes rule and assumes that the attributes of the training set are conditionally independent given the class. This method assigns each sample to the class with the maximum posterior probability [70].
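The maximum-posterior rule of Naïve Bayes can be illustrated with a short sketch. It assumes Gaussian class-conditional densities for numeric features, which is an assumption of this example and is not stated in the text; the function names and the two-feature toy data are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate a log-prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # var_floor avoids division by zero for constant features.
        variances = [sum((v - mu) ** 2 for v in col) / n + var_floor
                     for col, mu in zip(columns, means)]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the maximum (log) posterior probability."""
    def log_posterior(params):
        log_prior, means, variances = params
        # Conditional independence: the log-likelihood is a sum over features.
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
            for x, mu, var in zip(row, means, variances))
    return max(model, key=lambda label: log_posterior(model[label]))


# Hypothetical two-feature toy data.
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 0.5], [3.2, 0.7], [2.9, 0.4]]
y = ["AVP", "AVP", "AVP", "non-AVP", "non-AVP", "non-AVP"]
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [1.1, 2.0]), predict_gaussian_nb(model, [3.1, 0.5]))
```

Working with log-probabilities avoids numerical underflow when many features are multiplied together, as would be the case for a 126-dimensional feature vector.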

RESULTS
To assess the performance of a predictor in statistical prediction, three cross-validation methods are often used: the independent dataset test, the subsampling test, and the jackknife test [62]. In this study, five-fold cross-validation was used to assess the validity of the proposed predictors. In five-fold cross-validation, the dataset is divided into five subsets; each time, four subsets are used for training and the remaining subset is used for testing. The process is repeated five times, so that each subset is used once for testing, and the average performance is calculated. In this article, the performance of the predictor was estimated with several measures: overall accuracy (ACC), sensitivity (SEN), specificity (SPEC), Matthews correlation coefficient (MCC), and area under the curve (AUC). Overall accuracy is the total accuracy rate of the predictor. Sensitivity shows the ability of the predictor to correctly classify antiviral peptides (AVPs). Specificity expresses the correct prediction of non-AVPs. The Matthews correlation coefficient is a measure of the quality of binary classifications.
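The five-fold procedure can be sketched as follows. This is a minimal illustration in plain Python, assuming a simple shuffled (not stratified) split, since the text does not say how the folds were drawn; the function names are hypothetical, and the majority-class `fit`/`predict` pair is only a placeholder for a real classifier.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) so that every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(X, y, fit, predict, k=5, seed=0):
    """Train on k-1 folds, test on the held-out fold, and average the accuracy."""
    accuracies = []
    for train, test in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k


# Placeholder classifier: always predicts the majority class of the training fold.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def predict_majority(model, row):
    return model


# Same class sizes as the filtered dataset (342 AVPs, 312 non-AVPs); features are dummies.
y = [1] * 342 + [0] * 312
X = [[0.0]] * len(y)
splits = list(k_fold_indices(len(y), k=5, seed=0))
mean_accuracy = cross_validate(X, y, fit_majority, predict_majority)
print([len(test) for _, test in splits], round(mean_accuracy, 3))
```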
For the AUC, the area under the receiver operating characteristic (ROC) curve is calculated to measure the quality of the prediction [71]. If the AUC equals one, the predictor is perfect [72][73][74][75][76].
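A short sketch may clarify how the AUC is obtained from predicted scores. It uses the rank interpretation of the AUC, namely the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half; this equals the area under the ROC curve. The function name and the example scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores: 1 = AVP, 0 = non-AVP.
perfect = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
mixed = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.3])
print(perfect, mixed)
```

A perfect ranking gives 1, a completely reversed ranking gives 0, and a random ranking gives about 0.5.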
These parameters are given by the following equations (1-4) [77]:

$$\mathrm{SEN} = \frac{TP}{TP + FN} \tag{1}$$

$$\mathrm{SPEC} = \frac{TN}{TN + FP} \tag{2}$$

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{4}$$

where TP, TN, FP and FN refer to the numbers of true positives (AVPs predicted as AVPs), true negatives (non-AVPs predicted as non-AVPs), false positives (non-AVPs predicted as AVPs) and false negatives (AVPs predicted as non-AVPs), respectively. The AUC determines the quality of the prediction by calculating the area under the ROC curve [78]. The ROC curve is a graphical plot of the true-positive rate against the false-positive rate. For a perfect predictor, the AUC equals one.
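The four count-based measures can be computed directly from the confusion-matrix entries, as in the following sketch. The function name and the example counts are hypothetical and are not results of this study.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    sen = tp / (tp + fn)            # fraction of AVPs predicted as AVPs
    spec = tn / (tn + fp)           # fraction of non-AVPs predicted as non-AVPs
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # 0 when undefined
    return {"SEN": sen, "SPEC": spec, "ACC": acc, "MCC": mcc}


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=90, tn=80, fp=20, fn=10)
print({name: round(value, 4) for name, value in metrics.items()})
```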
To most biologists, unfortunately, the four metrics as formulated in Eqs. 1-4 are not quite intuitive and easy to understand, particularly the equation for MCC. Here we adopt the formulation proposed recently in [30,41,79], based on the symbols introduced by Chou [80,81] in predicting signal peptides. According to Chou's formulation, the same four metrics can be expressed as

$$
\begin{cases}
\mathrm{SEN} = 1 - \dfrac{N^{+}_{-}}{N^{+}} \\[2ex]
\mathrm{SPEC} = 1 - \dfrac{N^{-}_{+}}{N^{-}} \\[2ex]
\mathrm{ACC} = 1 - \dfrac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}} \\[2ex]
\mathrm{MCC} = \dfrac{1 - \left( \dfrac{N^{+}_{-}}{N^{+}} + \dfrac{N^{-}_{+}}{N^{-}} \right)}{\sqrt{\left( 1 + \dfrac{N^{-}_{+} - N^{+}_{-}}{N^{+}} \right) \left( 1 + \dfrac{N^{+}_{-} - N^{-}_{+}}{N^{-}} \right)}}
\end{cases}
\tag{5}
$$

where $N^{+}$ is the total number of AVPs investigated, while $N^{+}_{-}$ is the number of AVPs incorrectly predicted as non-AVPs; $N^{-}$ is the total number of non-AVPs investigated, while $N^{-}_{+}$ is the number of non-AVPs incorrectly predicted as AVPs [82]. Now, it is clear from Eq. 5 that when $N^{+}_{-} = 0$, meaning that none of the AVPs was incorrectly predicted as a non-AVP, we have the sensitivity $\mathrm{SEN} = 1$; when $N^{+}_{-} = N^{+}$, meaning that all the AVPs were incorrectly predicted as non-AVPs, we have the sensitivity $\mathrm{SEN} = 0$. Likewise, when $N^{-}_{+} = 0$, meaning that none of the non-AVPs was incorrectly predicted as an AVP, we have the specificity $\mathrm{SPEC} = 1$; whereas when $N^{-}_{+} = N^{-}$, meaning that all the non-AVPs were incorrectly predicted as AVPs, we have the specificity $\mathrm{SPEC} = 0$. When $N^{+}_{-} = N^{-}_{+} = 0$, meaning that none of the AVPs in the positive dataset and none of the non-AVPs in the negative dataset was incorrectly predicted, we have the overall accuracy $\mathrm{ACC} = 1$ and $\mathrm{MCC} = 1$; when $N^{+}_{-} = N^{+}$ and $N^{-}_{+} = N^{-}$, meaning that all the AVPs in the positive dataset and all the non-AVPs in the negative dataset were incorrectly predicted, we have the overall accuracy $\mathrm{ACC} = 0$ and $\mathrm{MCC} = -1$; whereas when $N^{+}_{-} = N^{+}/2$ and $N^{-}_{+} = N^{-}/2$, we have $\mathrm{ACC} = 0.5$ and $\mathrm{MCC} = 0$, meaning no better than random prediction. As we can see from the above discussion based on Eq. 5, the meanings of sensitivity, specificity, overall accuracy, and Matthews correlation coefficient have become much more intuitive and easier to understand. It is instructive to point out, however, that the set of metrics in Eqs. 1-4 as well as Eq. 5 is valid only for single-label systems. For multi-label systems, such as those for the subcellular localization of multiplex proteins (see, e.g., [83][84][85][86][87][88][89][90][91][92][93][94][95]), where a protein may have two or more locations, and those for the functional types of antimicrobial peptides (see, e.g., [96]), where a peptide may possess two or more functional types, a completely different set of metrics is needed, as elaborated in [97].
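The limiting cases discussed for Chou's formulation can be checked numerically with a short sketch. Here `n_pos` and `n_neg` denote the total numbers of AVPs and non-AVPs, and `n_pos_missed` and `n_neg_missed` denote the numbers of AVPs predicted as non-AVPs and of non-AVPs predicted as AVPs; these argument names are hypothetical. The class sizes 342 and 312 are those of the filtered dataset, while the error counts are chosen only to reproduce the limiting cases and are not results of this study.

```python
import math


def chou_metrics(n_pos, n_pos_missed, n_neg, n_neg_missed):
    """Sensitivity, specificity, accuracy and MCC in terms of misprediction counts."""
    sen = 1 - n_pos_missed / n_pos
    spec = 1 - n_neg_missed / n_neg
    acc = 1 - (n_pos_missed + n_neg_missed) / (n_pos + n_neg)
    numerator = 1 - (n_pos_missed / n_pos + n_neg_missed / n_neg)
    denominator = math.sqrt(
        (1 + (n_neg_missed - n_pos_missed) / n_pos)
        * (1 + (n_pos_missed - n_neg_missed) / n_neg))
    mcc = numerator / denominator if denominator else 0.0  # 0 when undefined
    return sen, spec, acc, mcc


all_correct = chou_metrics(342, 0, 312, 0)      # no sample mispredicted
all_wrong = chou_metrics(342, 342, 312, 312)    # every sample mispredicted
half_wrong = chou_metrics(342, 171, 312, 156)   # half of each class mispredicted
for name, values in [("all correct", all_correct),
                     ("all wrong", all_wrong),
                     ("half wrong", half_wrong)]:
    print(name, [round(v, 6) for v in values])
```

The three cases give accuracy and MCC values of (1, 1), (0, -1) and (0.5, 0) respectively, in agreement with the discussion of the limiting cases.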
In this work, the concept of PseAAC was applied. The Adaboost algorithm was then used as the classifier with five different base classifiers: RBF, Naïve Bayes, J48, REPTree, and Decision Stump. The results of applying Adaboost with the different base classifiers under five-fold cross-validation are shown in Table 1. According to the results, the maximum values of the evaluation parameters were obtained when Adaboost was applied with J48 as the base classifier. Under this condition, the accuracy, Matthews correlation coefficient and area under the curve are 93.26%, 0.86 and 0.982, respectively.
The Adaboost algorithm is known as a successful meta-technique for improving the predictive power of a classifier. It constructs a strong classifier as a linear combination of base classifiers. Indeed, if the individual classifiers make errors on different instances, a strategic combination of these classifiers can reduce the total error. To examine this issue, the five classifiers were also applied individually to classify the data; the results are shown in Table 2.
According to the results in Tables 1 and 2, combining a set of classifiers by voting with Adaboost improves the evaluation parameters in comparison with a single classifier.
Since the number of cycles (T) is an important parameter of the Adaboost method, three different T values were used to assess the effect of this parameter on the accuracy. The results of applying Adaboost with the different base classifiers under five-fold cross-validation for the different T values are shown in Table 1. The results showed that a larger T value yields better classification performance, but the running time and computational complexity increase. For example, the running time for T=20 is twice that for T=10. When the amount of data is very large, the running time becomes problematic; therefore, T=10 was selected as the optimum value in this study. The maximum values reported above for Adaboost with J48 as the base classifier (accuracy 93.26%, Matthews correlation coefficient 0.86 and area under the curve 0.982) were obtained with T=10.

DISCUSSION
Only a few antiviral drugs are available to combat viral infections, and the emergence of drug-resistant strains is not a rare occurrence. Therefore, there is considerable focus on new antiviral agents, including antiviral peptides. These peptides are preferable because of their low toxicity, low molecular weight, rapid elimination from the host cell and selective function. Since predicting the antiviral activity of peptides is faster and cheaper than experimental methods, an antiviral peptide prediction method has been designed in this study. Antiviral peptides have very low sequence homology, so the concept of PseAAC and machine learning methods have been applied for their classification and prediction. In the present study, PseAACs were extracted from the sequences, and the Adaboost algorithm with five different base classifiers was then used for the classification task. The Adaboost algorithm adjusts the weights of the training samples and produces a classifier as a linear combination of weak classifiers. This algorithm is reported to be a successful method for improving the accuracy of a classifier learning system. The results show that employing PseAAC and Adaboost with J48 as the base classifier can be useful in predicting antiviral peptides. Because of the advantages of a web server [98], we will try to develop a server based on the method presented in this paper.