Protein-Protein Interaction Prediction Using PCA and SVR-PHCS

Protein-Protein Interactions (PPIs) play a key role in many biological systems; thus, identifying PPIs is critical for understanding cellular processes. Many experimental techniques have been applied to detect PPIs, but the data they produce are incomplete and noisy. Consequently, a number of computational methods, including machine learning classification techniques, have been developed to reduce the effect of noisy data and predict new PPIs. Because regression methods have been used successfully to solve classification problems in other applications, this paper applies a regression view to the PPI prediction classification problem: a new approach is proposed using Principal Component Analysis (PCA) and Support Vector Regression (SVR), improved by a new Parallel Hierarchical Cube Search (PHCS) method. First, the PCA algorithm is applied to select an optimal subset of features, which reduces processing time and lessens the effect of noise. Then, PPIs are predicted using SVR. To improve the performance of SVR, the new PHCS method is applied to select appropriate values for the SVR parameters. The classification accuracy of the proposed method is 74.505% on the KUPS (The University of Kansas Proteomics Service) dataset, which outperforms other methods.


INTRODUCTION
Proteins have a major responsibility in cellular processes such as signal transduction, gene regulation, cell-cell contact and many other processes [1]. These responsibilities are performed through interactions between proteins. Therefore, the prediction of PPIs improves our knowledge of cell functionality, protein functions [2][3][4], gene functions [5], signaling pathways [6] and disease protein finding [7].
Several high-throughput experimental approaches have been introduced to detect PPIs, including yeast two-hybrid systems [8], mass spectrometry [9], protein chips [10] and so on. Unfortunately, the data produced by these methods contain a large number of false positives and false negatives. Moreover, these methods suffer from high computational time and seem able to identify only a small fraction of all interactions that exist in the cell [11].
In recent years, great efforts have been made to develop reliable computational methods for predicting PPIs. These methods are mostly classified based on the type of data source used in the prediction procedure. For instance, some of them use gene data [12], including gene neighborhood [13], gene fusion [1,14], phylogenetic profiles [15] and mirror-tree [16]. Other methods employ structural information [17,18], protein sequences [19][20][21][22][23] and domain information [24][25][26]. Since each of these data sources provides only partial information about the interacting pairs, many researchers have attempted to integrate several data sources to predict PPIs with more reliability [27][28][29]. Results have indicated that integrating protein pair information can improve the quality of protein interaction data [30,31].
Among the machine learning methods proposed for predicting protein-protein interactions, the Support Vector Machine (SVM) has shown better performance than other single classifiers such as Decision Trees and K-Nearest Neighbors [30]. SVM constructs a hyperplane or set of hyperplanes in feature space, which can be used for classification and regression, and is capable of dealing with high-dimensional input features [32]. The version of SVM for regression is called Support Vector Regression (SVR). SVR can outperform SVM due to its better generalization performance and greater robustness against outliers [33]. Like SVM, the SVR method requires its parameters to be tuned and set properly to achieve better performance and minimize an estimate of the generalization error [34,35].
In this paper, a new method, consisting of feature extraction and SVR improved by a new Parallel Hierarchical Cube Search (PHCS) method, is presented for solving the PPI prediction problem. Since the integration of various data sources produces a high-dimensional feature vector, applying a feature extraction algorithm is necessary to reduce processing time and lessen noise effects. First, the Principal Component Analysis (PCA) algorithm is used for feature extraction; then, the SVR algorithm is carried out for classification. To improve the performance of the model, the new PHCS method is implemented to tune the SVR kernel parameters optimally. This method improves the performance of the prediction system without significantly increasing the overall learning time.
The KUPS dataset [28] has been used to evaluate and compare the performance of the proposed method. KUPS is freely available at http://www.ittc.ku.edu/chenlab. The experimental results indicate that the classification accuracy has been increased to 74.505% in comparison with other works. This paper is organized as follows: the background of protein-protein interaction prediction and support vector regression is discussed in Section 2. SVR based on Parallel Hierarchical Cube Search (SVR-PHCS) is presented in Section 3. Performance evaluation and experimental results are shown in Section 4. Section 5 concludes the paper.
In 2005, Chen and Liu considered protein domain information to present a domain-based random forest for inferring protein interactions [24].
In 2010, Xia et al. suggested a Moran autocorrelation descriptor to translate protein sequences into numerical feature vectors and then predict PPIs by applying the rotation forest method [44]. In 2011, Xing and Dunson proposed a new Bayesian integration method, called Nonparametric Bayes Ensemble Learning (NBEL), to predict PPIs using the sequences of protein pairs [37].
In 2006, Nanni and Lumini attempted to combine multiple K-local Hyperplane distance Nearest Neighbor (HKNN) classifiers with different physicochemical properties of protein sequences to obtain a better classification result [45].
In 2007, Shen et al. proposed a new method based on SVM with a kernel function [38]. They applied the conjoint triad composition method to construct feature vectors from the sequences of protein pairs. In 2010, Xia et al. presented a meta approach for PPI prediction, which predicts PPIs by combining six independent SVM-based predictors [19].
It should be mentioned that all the above methods employ a single type of data source to predict PPIs.
In 2005, Qi et al. used multiple high-throughput biological data sources to construct their feature vectors, including Y2H, gene expression, protein expression, gene neighborhood, and domain-domain data. They then presented a hybrid of random forests and weighted k-nearest neighbors for predicting PPIs [29].
In another study in 2007, they also employed a Mixture-of-Feature-Experts (MFE) method to improve the classification accuracy [31]. The results of these methods show that the integration of multiple data sources can improve PPI prediction.
Using an appropriate classification technique is crucial in all the mentioned PPI prediction methods. In the machine learning literature, there have been some attempts to use regression methods to solve classification problems [34]. In this work, Support Vector Regression (SVR), one of the most powerful methods in the field of machine intelligence, is applied to classify PPIs properly.
Since the selection of optimal values for the parameters of the SVR model is important to improve the performance of the model and minimize an estimate of the generalization error [46], a new Parallel Hierarchical Cube Search (PHCS) method is introduced in this paper. PHCS selects the optimal values of the SVR parameters by searching the three-dimensional parameter space in parallel and hierarchically. To evaluate the efficiency and validity of the method, the KUPS dataset [28] has been employed, which is an aggregation of different data sources related to PPIs.

Support Vector Regression (SVR)
The Support Vector Machine (SVM) is known as a popular and useful technique for data classification and regression in machine learning. Let {(x_i, y_i)}, i = 1, ..., n, be a set of n training samples, where x_i is an input sample and y_i ∈ {−1, +1} is the corresponding class label. The main idea is to find a linear separating hyperplane

w · x + b = 0   (1)

that maximizes the distance between the two classes, where w and b are the weight vector and the bias, respectively. In some cases, the data in the original input space cannot be linearly separated, and therefore some nonlinear kernel function should be used. Polynomial, sigmoid and Radial Basis Function (RBF) kernels are the most well-known kernel functions. These kernel functions implicitly map their inputs into high-dimensional feature spaces.
The optimal hyperplane can be determined as follows:

min (1/2) ||w||²   (2)

subject to y_i (w · x_i + b) ≥ 1, i = 1, ..., n. Equation (2) is a nonlinear optimization problem with inequality constraints. This problem is solved using the Lagrange multipliers method, which yields the following dual optimization problem (kernel tricks are used for nonlinearly separable problems):

max Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j),  subject to Σ_i α_i y_i = 0, 0 ≤ α_i ≤ C   (3)

In Equation (3), C and the kernel parameter (e.g., γ for the RBF kernel) are two parameters which are determined experimentally. A linear decision function can be written as f(x) = sign(w · x + b), where b is obtained from the support vectors. In cases where the decision function is nonlinear, the input space is first mapped to another Euclidean space by the kernel function. This decision function is formulated as:

f(x) = sign( Σ_i α_i y_i K(x_i, x) + b )   (4)

In SVR, the mathematical formulation has to consider the approximation errors. SVR solves the regression problem by introducing an ε-insensitive loss function:

|y − f(x)|_ε = max{0, |y − f(x)| − ε}   (5)

Considering the above function, SVR performs regression by minimizing the following function:

min (1/2) ||w||² + C Σ_i (ξ_i + ξ_i*)   (6)

subject to y_i − (w · x_i + b) ≤ ε + ξ_i, (w · x_i + b) − y_i ≤ ε + ξ_i*, ξ_i, ξ_i* ≥ 0, where the slack variable ξ_i represents the upper training error and ξ_i* the lower training error. In nonlinear SVR, the following equation gives the kernel expansion of the decision function:

f(x) = Σ_i (α_i − α_i*) K(x_i, x) + b   (7)

The SVR parameters directly affect the classification performance and the complexity of the regression. Tuning and setting these parameters to obtain a better decision function is an open research problem, and the main contribution of this study addresses it. Therefore, a new PHCS method is proposed to select these parameters optimally.
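As a concrete illustration of the ε-insensitive formulation, the following sketch fits an SVR with an RBF kernel using scikit-learn. The toy data and the particular C, γ and ε values are assumptions for illustration only, not the paper's setup:

```python
# Minimal sketch: epsilon-insensitive SVR with an RBF kernel.
# C, gamma and epsilon correspond to the parameters discussed in the text;
# the noisy sine data is an illustrative assumption.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)

# Internally the fitted function is the kernel expansion
# f(x) = sum_i (a_i - a_i*) K(x_i, x) + b.
model = SVR(kernel="rbf", C=1.0, gamma=0.5, epsilon=0.1)
model.fit(X, y)

n_sv = len(model.support_)        # samples lying outside the epsilon-tube
pred = model.predict([[0.0]])[0]  # should be close to sin(0) = 0
print(n_sv, pred)
```

Enlarging epsilon widens the insensitive tube and typically reduces the number of support vectors, matching the later discussion of the ε parameter.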

SVR BASED ON PARALLEL HIERARCHICAL CUBE SEARCH (SVR-PHCS)
In this section, the details of the proposed PCA and SVR-PHCS method are introduced. The process starts by extracting a proper number of features using PCA, and then attempts to obtain the optimal parameter values of SVR, which uses the RBF kernel function. The method consists of the following steps (Fig. 1). Each of these steps is explained in detail below.

Feature Selection
Biological datasets are generally very large, high-dimensional and noisy. KUPS is one such dataset: it is created by aggregating multiple data sources, and not all of its features are effective for prediction. Therefore, feature extraction methods are usually employed for dimensionality reduction [47,48]. In this way, irrelevant and redundant features are removed from the dataset to reduce its dimensionality. This lowers the complexity of the data, increases the search speed and consequently improves classification performance. PCA is one of the most widely used algorithms for this problem. PCA applies a linear transformation that changes the coordinate system of the data (feature vector) to a new one, such that the new features are linear functions of the original features and are uncorrelated. The greatest variance of any projection of the data lies on the first coordinate, the second greatest variance on the second coordinate, and so on. After applying PCA, the number of features leading to the best accuracy was selected.
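As a hedged sketch of this step, PCA can be applied with scikit-learn. The random matrix below merely stands in for the real KUPS feature vectors; only the dimensionality (400 features per protein pair) and the 250-component choice follow the text:

```python
# Sketch: PCA-based dimensionality reduction of a 400-feature matrix.
# The random data is an illustrative assumption, not the real dataset.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 400))   # 500 protein pairs, 400 features

pca = PCA(n_components=250)           # 250 components, as selected in Table 1
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                                # (500, 250)
print(round(pca.explained_variance_ratio_.sum(), 3))  # fraction of variance kept
```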
Table 1 shows the average and variance of the accuracy over ten runs when the number of features changes from 50 to 400 on the KUPS dataset. As can be seen, the best accuracy is obtained when the number of features is 250.

Scaling Data
Variables often have considerably different numerical ranges. When a variable has a large range, its variance becomes large, and vice versa. Since PCA is a maximum variance method, a variable with a large variance is more likely to dominate the model. Therefore, all the data are scaled in advance so that every variable makes the same contribution to the model. Another advantage is avoiding numerical difficulties during the calculation: since kernel values usually depend on the inner products of feature vectors (e.g., the linear kernel and the polynomial kernel), large attribute values might cause numerical problems [45]. For this purpose, Eq. (8) is used for linear scaling:

X_Normalized = (X − X_min) / (X_max − X_min)   (8)

where X indicates the original data, X_Normalized is the normalized data, and X_max and X_min are the maximum and minimum values of X, respectively.
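A minimal sketch of Eq. (8), applied column-wise so that each feature is scaled to [0, 1] independently:

```python
# Sketch of Eq. (8): linear min-max scaling of each feature to [0, 1].
import numpy as np

def min_max_scale(X):
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    # X_Normalized = (X - X_min) / (X_max - X_min)
    return (X - x_min) / (x_max - x_min)

X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
print(min_max_scale(X))
# each column maps to 0.0, 0.5, 1.0, regardless of its original range
```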

From Regression to Classification
While in the machine learning literature classification and regression are addressed as two different problems, differentiated by a categorical or continuous dependent variable, there have been some attempts to use regression methods to solve classification problems and vice versa [34].
In this paper, the support vector regression method is used to solve the classification problem (PPI prediction). Since in the regression setting the class labels are real-valued rather than binary-valued, a method is needed to map the real-valued outputs to binary-valued labels for classification. If a suitable mapping is applied, the classification problem can be solved by regression methods. The most important aspect of rounding the values is the selection of the mapping point (MP).
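The paper's original pseudocode for choosing MP is not reproduced here. The following is a plausible sketch, under the assumption that MP is chosen by scanning candidate thresholds and keeping the one that maximizes accuracy on held-out data; function names and the candidate grid are hypothetical:

```python
# Hedged sketch: choosing a mapping point (MP) that converts real-valued
# SVR outputs into binary class labels. The threshold-scan strategy and
# the candidate grid are assumptions, not the paper's exact procedure.
import numpy as np

def choose_mapping_point(y_true, y_pred,
                         candidates=(0.1, 0.2, 0.3, 0.4, 0.5,
                                     0.6, 0.7, 0.8, 0.9, 1.0)):
    """Return the MP in `candidates` that maximizes accuracy."""
    best_mp, best_acc = candidates[0], -1.0
    for mp in candidates:
        labels = (y_pred >= mp).astype(int)   # map real values to {0, 1}
        acc = np.mean(labels == y_true)
        if acc > best_acc:
            best_mp, best_acc = mp, acc
    return best_mp, best_acc

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0.2, 0.4, 0.7, 0.9])
mp, acc = choose_mapping_point(y_true, y_pred)
print(mp, acc)   # MP = 0.5 separates this toy example perfectly
```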

Parallel Hierarchical Cube Search (PHCS)
The PHCS method is employed to tune the SVR kernel function parameters. It should be noted that selecting the best values for the kernel function parameters is an NP-complete problem, so the selected parameter values are not necessarily globally optimal.
There are several methods for finding these parameters properly; they mostly differ in the way they search the parameter space. Among them, greedy search, pattern search and genetic algorithms (GA) are the most widely used in different applications. PHCS is an extended version of the PHGS method introduced in [50]. PHGS is used for tuning the SVM kernel function parameters (C, γ) and therefore has a two-dimensional grid search space, while PHCS is applied to tune the SVR kernel function parameters (C, γ, ε), with a three-dimensional search space. Moreover, it is able to find the Mapping Point (MP), by which the real-valued class labels are mapped into binary-valued ones. In this work, the Cross Validation Score (CVS) is used to validate the hierarchical cube search effectively. Let A_θ denote the support vector regression learning algorithm, where θ = (C, γ, ε) is the vector of SVR parameters with the RBF kernel function. A_θ is employed on dataset D, and the result A_θ(D) is a classifier. Given a set of candidate parameter vectors, the goal is to assess the CVS of the best available classifier A_θ*(D), where θ* is the best assignment for D.
To calculate the CVS, the following k-fold cross validation procedure is applied, which returns the cross validation scores of k different classifiers learned by the algorithm on different folds of the dataset. The procedure consists of the following steps: 1. Data permutation and split. Randomly permute the whole dataset and then split it into k non-overlapping, equally sized subsets D_i, called folds. Each time, k−1 folds are used for training and one fold for validation.
2. Train classifiers over the folds. The algorithm repeats k times; in each iteration, one subset is tested using the classifier trained on the remaining k−1 subsets. In the end, every instance of the whole training set has been predicted exactly once.
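The two steps above can be sketched with scikit-learn's built-in k-fold utilities, as an illustrative substitute for the paper's own CVS routine; the synthetic data and the R² scoring choice are assumptions:

```python
# Sketch: computing a cross validation score (CVS) for one SVR
# parameter triple (C, gamma, epsilon). Data and scoring are assumptions.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # binary labels as real values

model = SVR(kernel="rbf", C=2.0, gamma=0.5, epsilon=0.1)
cv = KFold(n_splits=5, shuffle=True, random_state=0)  # permute, then split
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print(scores.shape)   # one score per fold
print(scores.mean())  # the CVS for this (C, gamma, epsilon)
```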
K-fold cross validation minimizes the bias associated with the random sampling of the training data; because of this property, it is widely used among researchers. The proposed PHCS method is now described in detail. There are three main parameters for the SVR kernel function: C, γ and ε.
The C parameter trades off misclassification of training examples against simplicity of the decision surface. A low value of C makes the decision surface smooth, while a high value of C aims at classifying all training examples correctly. As a result, the parameter C controls the balance between the complexity of the machine and the number of separable points. The γ parameter defines how far the influence of a single training example reaches, with low values meaning 'far' and high values meaning 'close'. The value of the ε parameter, in turn, is crucial to the support vector condition and hence to the model performance. Choosing a large value for ε decreases the number of support vectors; the ε-tube becomes wider and the range of acceptable error increases. Conversely, a very small value of ε produces more support vectors and increases the risk of over-training.
The best values of these parameters depend on the nature of the problem. Selecting the best parameter values is a vital step that has a direct effect on the performance and overall capability of the SVR learning algorithm. Grid search is one of the most popular techniques for finding the optimal values of SVM kernel function parameters, and it is a reliable method for selecting the best value over given parameter ranges. However, this approach suffers from the curse of dimensionality, grid granularity and high computational time [49,51].
In SVR with the RBF kernel function, three parameters must be found in a 3D search space. A hierarchical cube search method is used to find the best parameter values. Although this method saves time, it is still time consuming. Since all points on the cube are independent of each other, the hierarchical cube search can be implemented in parallel, which significantly reduces the time required to find the best parameters.
In this paper, exponentially growing sequences of C, γ and ε are used to search for the optimal values of these parameters in the three-dimensional space. In order to find the best values of (C, γ, ε) within user-defined boundaries, the whole search space must be explored.
Let N be the number of available CPUs; the C, γ and ε ranges are each divided into N intervals, and each interval is assigned to one CPU. The interval division task and the assignment of intervals to CPUs are managed by one CPU acting as master. Each CPU performs the cube search on the part of the space that belongs to it: for each triple (C, γ, ε) in its interval, the CPU calculates the CVS. Then, based on the maximum CVS value, the best triple is selected as the local optimum for that CPU. Fig. (2) presents the N CPUs finding the best local values in parallel and independently; the triples (C, γ, ε) with the maximum CVS in each CPU are marked.
When the candidate values of (C, γ, ε) have been found for each CPU, all N candidates are compared and the best one is chosen. The local search procedure then continues around the chosen candidate point: the neighborhood of the selected candidate is searched with smaller steps to find the best possible result.
In the next iteration, in order to find better values of (C, γ, ε), a virtual cube is defined around the best local optimum point of the last iteration. This virtual cube denotes the new search space, which is divided into N new intervals. Each CPU then searches its part of the new space to find a better triple (C, γ, ε), and the best CVS in the new space is again used to find the optimum values. Fig. (3) represents the hierarchical construction of the virtual cube and the discovery of the new best local (C, γ, ε), which is marked with a star.
Increasing the number of iterations of the parallel hierarchical cube search increases the accuracy; however, it also requires more processing time. Therefore, a trade-off between accuracy and processing time should be considered.
Finally, the N best local CVS values are compared to select the best triple as the global optimum. The SVR algorithm is then run on the training and test datasets using the best global values.
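A simplified, single-process sketch of the two-level cube search follows. The distribution of intervals across N CPUs is omitted here; the coarse grids, the refinement factor of 2 and the synthetic data are all assumptions:

```python
# Hedged sketch of the hierarchical cube search over (C, gamma, epsilon).
# A coarse exponential grid is scanned first; a finer "virtual cube" is
# then searched around the best triple. Grids and the shrink factor are
# assumptions; the real method also distributes intervals across N CPUs.
import itertools
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

def cvs(X, y, C, gamma, epsilon):
    """Cross validation score for one parameter triple."""
    model = SVR(kernel="rbf", C=C, gamma=gamma, epsilon=epsilon)
    return cross_val_score(model, X, y, cv=5, scoring="r2").mean()

def cube_search(X, y, Cs, gammas, epsilons):
    """Exhaustively score every triple on the cube; keep the best."""
    best, best_score = None, -np.inf
    for triple in itertools.product(Cs, gammas, epsilons):
        score = cvs(X, y, *triple)
        if score > best_score:
            best, best_score = triple, score
    return best, best_score

rng = np.random.default_rng(3)
X = rng.standard_normal((80, 4))
y = X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(80)

# Level 1: coarse, exponentially growing grid.
coarse = cube_search(X, y,
                     [2.0**k for k in (-2, 0, 2)],
                     [2.0**k for k in (-2, 0, 2)],
                     [0.01, 0.1, 0.5])
(C0, g0, e0), _ = coarse

# Level 2: finer virtual cube around the level-1 optimum.
fine, fine_score = cube_search(X, y,
                               [C0 / 2, C0, C0 * 2],
                               [g0 / 2, g0, g0 * 2],
                               [e0 / 2, e0, e0 * 2])
print(fine, fine_score)
```

Because the finer cube contains the level-1 optimum itself, the refined score can never be worse than the coarse one; in the real method, each CPU would score its own slice of each cube and the master would merge the local optima.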
The overall process of the proposed SVR-PHCS method is illustrated in Fig. (4).

Metrics
The confusion matrix contains information about the actual and predicted classes of the samples classified by a classification method, and it can be used to evaluate the performance of supervised machine learning techniques. The entries of the confusion matrix are:
TP: The number of interacting protein pairs that are correctly classified as interacting.
FN: The number of interacting protein pairs that are wrongly classified as non-interacting.
TN: The number of non-interacting protein pairs that are correctly classified.
FP: The number of non-interacting protein pairs that are incorrectly classified as interacting.
In the following, the evaluation metrics used in this work are presented:

Recall = TP / (TP + FN)   (12)

F-Measure = (2 × Precision × Recall) / (Precision + Recall)   (13)
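The metrics of Eqs. (12) and (13) follow directly from the confusion-matrix counts; the counts below are illustrative, not results from the paper:

```python
# Sketch: Precision, Recall and F-Measure from confusion-matrix counts.
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                                    # Eq. (12)
    f_measure = 2 * precision * recall / (precision + recall)  # Eq. (13)
    return precision, recall, f_measure

p, r, f = prf(tp=80, fp=20, fn=20)   # illustrative counts
print(p, r, f)                        # 0.8 0.8 0.8 (approximately)
```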

Experimental Result
Table 2 shows the 10 best combinations of C, γ and ε values obtained from SVR-PHCS.
The results of the performance evaluation metrics using SVR and its extension (SVR-PHCS) are presented in Tables 3 and 4, respectively. As indicated, the performance of SVR-PHCS is better than that of the classical SVR. In order to examine the effect of the MP value on the performance evaluation metrics, the predicted outputs of the test samples were mapped into two binary classes using various MP values. Fig. (8) shows the Accuracy, Precision, Recall and F-Measure values while MP changes from 0.1 to 1 with an interval of 0.1.

Comparison with other Works
The proposed method is compared with other well-known prediction methods on the KUPS (The University of Kansas Proteomics Service) dataset. This dataset contains PPIs of various organisms aggregated from seven data sources: MINT, IntAct, HPRD, Gene Ontology, Uniprot, AAindex and PSSM [28]. The dataset is composed of training and testing sets, where the training set has 10518 protein pairs and the testing set has 10516 protein pairs. Each protein pair in KUPS is described by 400 features. Accuracy and F-Measure have been used as the metrics for comparison. The results of the proposed method and other classification methods on the KUPS dataset are shown in Table 5.
Precision measures the exactness of a classifier, whereas Recall measures its completeness, or sensitivity. Improving Recall often decreases Precision and vice versa. Precision and Recall are combined into a single metric known as the F-Measure, which is the weighted harmonic mean of Precision and Recall. In this paper, the results are compared with other works using the accuracy and F-Measure metrics.

CONCLUSION
There are many classification techniques for predicting Protein-Protein Interactions in the literature. Using regression methods is a new approach for solving classification problems. In this paper, a new approach is proposed using PCA and Support Vector Regression (SVR), improved by a new Parallel Hierarchical Cube Search (PHCS) method. The major challenges of applying SVR are how to tune and set its parameters (to achieve the best performance) for a given dataset and how to map the regression output to a classification label. In this regard, PHCS is applied to tune the SVR parameters (C, γ and ε) and to select the mapping point. The proposed method has been applied to the KUPS dataset, which is an aggregation of multiple data sources and is highly dimensional. Some features of the dataset may have no effect at all, or may contain a high level of noise; deleting such features increases the search speed and the accuracy rate. Therefore, PCA has been used to select the appropriate features.
According to the experimental results, the SVR-PHCS prediction system obtains very promising results in classifying protein pairs. The results indicate an accuracy of 74.505%, which is one of the best results reported for this dataset.

Fig. (5). Cross Validation Score (CVS) changes for all combinations of γ and ε at the best C value (C = 2).