Prediction of polyproline II secondary structure propensity in proteins

Background: The polyproline II helix (PPIIH) is an extended, left-handed protein secondary structure that usually, but not necessarily, involves prolines. Short PPIIHs are frequently, but not exclusively, found in disordered protein regions, where they may interact with peptide-binding domains. However, no readily usable software is available to predict this state. Results: We developed PPIIPRED to predict polyproline II helix secondary structure from protein sequences, using bidirectional recurrent neural networks trained on known three-dimensional structures with dihedral angle filtering. The performance of the method was evaluated in an external validation set. In addition to proline, PPIIPRED favours amino acids whose side chains extend from the backbone (Leu, Met, Lys, Arg, Glu, Gln), as well as Ala and Val. Utility for individual residue predictions is restricted by the rarity of the PPIIH feature compared to structurally common features. Conclusion: The software, available at http://bioware.ucd.ie/PPIIPRED, is useful in large-scale studies, such as evolutionary analyses of PPIIH, or for computationally reducing large datasets of candidate binding peptides for further experimental validation.


Introduction
Polyproline II helices (PPIIHs) are an important class of secondary structure which makes up approximately 2% of the protein structure database (PDB) and are enriched in protein binding regions [1,2]. PPIIH conformations are adopted by peptides when binding to SH3, WW, EVH1, GYF, UEV and profilin domains [3,4]. They play roles in a wide variety of contexts [5]. The absence of hydrogen bonding interactions that characterize alpha helices has led to suggestions that water molecule interactions may play a role in stabilizing the helix. However,

Methods
The predictor was implemented using a bidirectional recurrent neural network (BRNN) architecture [20]. We trained five separate models on different partitions between the training set and validation set (which is used to assess training progression but not for tuning the model parameters). These were then ensembled to predict PPII instances in an independent test dataset. Performance was evaluated by measuring the true positive rate and false-positive rate (FPR). Receiver operating characteristic (ROC) curves were then used to plot the performance at different cut-offs ranging from 0 to 1. The count of true positives (TPs) versus the log10 of the false positives (FPs) was also visualized to evaluate performance, since the total number of FPs is much larger than TPs. See electronic supplementary material for more details.
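The evaluation described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names `roc_points` and `auc_trapezoid` are hypothetical, a residue is assumed to be called PPIIH when its score is at or above the cut-off, and the area under the curve is approximated with the trapezoid rule.

```python
import numpy as np

def roc_points(scores, labels, cutoffs):
    """Return (FPR, TPR) arrays with one point per cut-off.

    Assumption: a residue is predicted PPIIH when score >= cut-off.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n_pos = labels.sum()
    n_neg = (~labels).sum()
    fpr, tpr = [], []
    for c in cutoffs:
        called = scores >= c
        tpr.append((called & labels).sum() / n_pos)   # true positive rate
        fpr.append((called & ~labels).sum() / n_neg)  # false-positive rate
    return np.array(fpr), np.array(tpr)

def auc_trapezoid(fpr, tpr):
    """Approximate the area under the ROC curve with the trapezoid rule."""
    order = np.lexsort((tpr, fpr))  # sort by FPR, break ties by TPR
    x, y = fpr[order], tpr[order]
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))
```

Because non-PPIIH residues vastly outnumber PPIIH residues, the same per-cut-off counts can also be plotted as true positives against the log10 of false positives, as the text describes.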

Training and test datasets
Protein structures were obtained from the PDB and PISCES databases [21][22][23]. PISCES was used to extract redundancy-reduced PDB structures (percentage identity cut-off of 30%), filtered for high-quality structures only (resolution ≤2.5 Å, R-value ≤0.25). We used the DSSP program [24] to assign dihedral angles, and removed sequences for which DSSP did not produce an output due to missing entries or formatting errors. We defined the set of PPIIHs by applying filtering rules used in the literature [1]. We investigated both 'strict' and 'less strict' definitions.
The less strict definition was identical to the strict definition, except that the requirement of −105 < Φ < −45 was removed. Thus, dihedral angle filtering constructed a set of known PPIIH structures, using either the 'strict' or 'less strict' criteria. Each residue of every sequence in the datasets was labelled as either a PPIIH residue or a non-PPIIH residue (table 1). The number of sequences in the dataset used in training the less strict definition is larger because we required every sequence to contain at least one PPIIH region (three or more residues) for inclusion, and more sequences satisfy this requirement when the Φ constraint is dropped.
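The labelling step can be illustrated with a short sketch. Only the Φ window (−105 to −45) and the minimum region length of three residues come from the text; the Ψ window used below is a placeholder assumption, since the full filtering rules of [1] are not reproduced here, and the function name `label_ppiih` is hypothetical.

```python
def label_ppiih(phi, psi, phi_range=(-105.0, -45.0),
                psi_range=(115.0, 175.0), min_len=3):
    """Label each residue 1 (PPIIH) or 0 (non-PPIIH).

    phi_range follows the 'strict' requirement quoted in the text.
    psi_range is an ASSUMED placeholder, not the published rule.
    Passing phi_range=(-180.0, 180.0) mimics dropping the phi constraint.
    Only runs of >= min_len consecutive in-range residues are kept.
    """
    in_range = [
        phi_range[0] < p < phi_range[1] and psi_range[0] < s < psi_range[1]
        for p, s in zip(phi, psi)
    ]
    labels = [0] * len(in_range)
    start = None
    for j, ok in enumerate(in_range + [False]):  # sentinel closes a trailing run
        if ok and start is None:
            start = j
        elif not ok and start is not None:
            if j - start >= min_len:
                labels[start:j] = [1] * (j - start)
            start = None
    return labels
```

A sequence would then be retained for training only if its label list contains at least one 1.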
In addition to the amino acid sequence under investigation, the BRNN also considered alignments of sequences related to it, and statistics derived for each residue of the protein sequence. Multiple sequence alignments (MSA) were extracted from the NR database (UniRef90) available in March 2014. The alignments were generated by three runs of PSI-BLAST [25] with parameter e = 10⁻³ (expectation of a random hit).
IUPRED was used to calculate a 'long' disorder prediction score [26] for each residue, and espritz [27] was used to calculate the 'NMR' disorder score. We included these two disorder predictions for every residue as input. Predicted disorder may provide information not only about the protein structural state, but also about the context of the residue, since PPII helices are enriched in disordered regions [28].
Thus, the inputs to the BRNN for each protein sequence were the sequence itself, the length of the sequence, the sequence alignment, and for each residue the IUPRED (long) disorder prediction score, the espritz-NMR disorder score, and an input representing an explicit indication of the charge of the residue (1 for R or K, −1 for D or E, and 0 otherwise). Each residue is labelled as either PPIIH or non-PPIIH. PPIIPRED predicts a score between 0 and 1 for each residue indicating the propensity for PPIIH formation. Higher scores indicate a higher probability of PPIIH formation.
The PPIIH dataset was split into training and test datasets, where every 10th sequence was assigned to the independent test dataset, as shown in table 1. All the tests reported in this paper were run in fivefold cross-validation, where assignment to each fold was random. The fivefold datasets were of roughly equal sizes. The training and test datasets are available in electronic supplementary material.
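The split described above might be sketched as below. This is an illustrative sketch only: the function name `split_dataset`, the choice of the 10th, 20th, ... sequences for the test set, and the fixed random seed are assumptions rather than details given in the text.

```python
import random

def split_dataset(seq_ids, n_folds=5, seed=0):
    """Hold out every 10th sequence, then assign the rest to random folds.

    Returns (test_ids, folds), where folds is a list of n_folds lists of
    roughly equal size.
    """
    test = seq_ids[9::10]  # 10th, 20th, 30th, ... sequence (assumed offset)
    rest = [s for i, s in enumerate(seq_ids) if (i + 1) % 10 != 0]
    rng = random.Random(seed)
    rng.shuffle(rest)      # random assignment to folds
    folds = [rest[k::n_folds] for k in range(n_folds)]
    return test, folds
```

In each round of cross-validation, one fold would serve as the validation set and the remaining four as the training set, while the held-out test set is never touched during training.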

Algorithms
We used a BRNN to learn the mapping between inputs I and outputs O (protein sequence to a PPIIH score per residue). BRNNs have been used successfully to predict protein secondary structure [16], binding within disordered protein regions [29], bioactive peptides [30] and short linear protein binding regions [31]. They have the advantage over standard feed-forward neural networks that they can automatically find the optimal context on which to base a prediction, i.e. the number of residues that are informative to determine a property. Because of their recursive nature, BRNNs also have a relatively low number of free parameters compared to other neural networks with similar input size. See Baldi et al. [20] for a detailed explanation of the BRNN model, and electronic supplementary material, figure S1 which illustrates the topology.
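These networks take the form

$$o_j = N^{(O)}\left(i_j, h_j^{(F)}, h_j^{(B)}\right)$$

$$h_j^{(F)} = N^{(F)}\left(i_j, h_{j-1}^{(F)}\right)$$

$$h_j^{(B)} = N^{(B)}\left(i_j, h_{j+1}^{(B)}\right), \qquad j = 1, \ldots, N$$

where $i_j$ (respectively, $o_j$) is the input (respectively, output) of the network in position $j$, and $h_j^{(F)}$ and $h_j^{(B)}$ are forward and backward chains of hidden vectors with $h_0^{(F)} = h_{N+1}^{(B)} = 0$. We parametrize the output update, forward update and backward update functions (respectively, $N^{(O)}$, $N^{(F)}$ and $N^{(B)}$) using three two-layered feed-forward neural networks.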
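A forward pass through a network of this kind can be sketched in NumPy. This is a toy illustration of the general BRNN recurrence, not the PPIIPRED implementation: the tanh hidden units, the sigmoid output, the layer sizes and the helper names `mlp`, `init_mlp` and `brnn_forward` are all assumptions.

```python
import numpy as np

def init_mlp(rng, n_in, n_hidden, n_out, scale=0.5):
    """Random weights for a two-layer feed-forward network (assumed sizes)."""
    return (rng.normal(0.0, scale, (n_hidden, n_in)), np.zeros(n_hidden),
            rng.normal(0.0, scale, (n_out, n_hidden)), np.zeros(n_out))

def mlp(x, params):
    """Two-layer network: tanh hidden layer followed by a linear layer."""
    w1, b1, w2, b2 = params
    return w2 @ np.tanh(w1 @ x + b1) + b2

def brnn_forward(inputs, p_f, p_b, p_o, n_state):
    """Return one score in (0, 1) per position of `inputs` (shape N x n_in)."""
    n = len(inputs)
    h_f = np.zeros((n + 2, n_state))  # h_f[0] = 0 is the forward boundary
    h_b = np.zeros((n + 2, n_state))  # h_b[n + 1] = 0 is the backward boundary
    for j in range(1, n + 1):         # forward chain, left to right
        h_f[j] = np.tanh(mlp(np.concatenate([inputs[j - 1], h_f[j - 1]]), p_f))
    for j in range(n, 0, -1):         # backward chain, right to left
        h_b[j] = np.tanh(mlp(np.concatenate([inputs[j - 1], h_b[j + 1]]), p_b))
    out = np.empty(n)
    for j in range(1, n + 1):         # output combines input and both chains
        z = mlp(np.concatenate([inputs[j - 1], h_f[j], h_b[j]]), p_o)[0]
        out[j - 1] = 1.0 / (1.0 + np.exp(-z))
    return out
```

Because the same three networks are reused at every position, the number of free parameters does not grow with sequence length, and the two hidden chains let each output depend on residues on both sides of position j.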

Encoding sequence and disorder information
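The input $i_j$ associated with the $j$th residue contains protein sequence information and predicted disorder information:

$$i_j = \left(i_j^{(E)}, i_j^{(T)}\right)$$

where, assuming that $e$ units are devoted to sequence and $t$ to disorder information,

$$i_j^{(E)} = \left(i_{j,1}^{(E)}, \ldots, i_{j,e}^{(E)}\right)$$

and

$$i_j^{(T)} = \left(i_{j,1}^{(T)}, \ldots, i_{j,t}^{(T)}\right)$$

Hence $i_j$ contains a total of $e + t$ components.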
We used e = 22: the first 20 inputs encode the 20 standard amino acids (unknown or non-standard amino acids were represented as a vector of zeroes), the 21st input encodes the length of the sequence, and the 22nd input encodes the charge.
In a second set of tests, we used e = 43 where, alongside the previous 22 inputs, we had a further 21 representing the frequency profile of the MSA for the protein: 20 inputs for the frequencies of the 20 amino acids, and a 21st for the frequency of gaps, which provides information about the conservation of a site and proved helpful in preliminary tests.
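The 21 alignment-derived inputs per residue could be computed along these lines. This is a minimal sketch: the function names are hypothetical, the alignment is assumed to be given as equal-length gapped strings with '-' as the gap character, and non-standard symbols are assumed to count towards the column total without receiving a frequency of their own.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_profile(column):
    """Return 21 frequencies for one alignment column.

    The first 20 values are amino acid frequencies (in AMINO_ACIDS order);
    the 21st is the gap frequency.
    """
    n = len(column)
    freqs = [column.count(a) / n for a in AMINO_ACIDS]
    freqs.append(column.count("-") / n)
    return freqs

def msa_profile(aligned_seqs):
    """Return one 21-component profile per alignment column."""
    return [column_profile("".join(col)) for col in zip(*aligned_seqs)]
```

A column with a high gap frequency indicates a poorly conserved or frequently deleted site, which is the conservation signal the text refers to.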
In both cases, we used t = 2 for representing disorder information as predicted by IUPRED and espritz. Hence the total number of inputs for a given residue is e + t = 24 in the first representation (sequence and disorder) and e + t = 45 in the second representation (sequence, MSA and disorder). The output is the predicted probability of the j-th residue belonging to a PPIIH.
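The 24-component representation (sequence and disorder) can be sketched as follows. Several details are assumptions made for illustration: the amino acid ordering, the scaling of the length input by `length_scale`, and a charge input of +1 for R or K, −1 for D or E and 0 otherwise. The function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CHARGE = {"R": 1.0, "K": 1.0, "D": -1.0, "E": -1.0}  # assumed charge coding

def encode_residue(aa, seq_len, iupred_long, espritz_nmr, length_scale=1000.0):
    """Return the e + t = 22 + 2 = 24 inputs for one residue."""
    # Inputs 1-20: amino acid identity; all zeroes for unknown/non-standard.
    one_hot = [1.0 if aa == a else 0.0 for a in AMINO_ACIDS]
    return one_hot + [
        seq_len / length_scale,  # input 21: sequence length (scaling assumed)
        CHARGE.get(aa, 0.0),     # input 22: charge
        iupred_long,             # disorder input 1
        espritz_nmr,             # disorder input 2
    ]

def encode_sequence(seq, iupred_long, espritz_nmr):
    """Encode every residue of `seq` given per-residue disorder scores."""
    return [encode_residue(aa, len(seq), d1, d2)
            for aa, d1, d2 in zip(seq, iupred_long, espritz_nmr)]
```

The 45-component representation would insert the 21 MSA frequency inputs between the charge input and the two disorder inputs.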

Training, Ensembling
Training was conducted by fivefold cross-validation: five sets of training were performed, in each of which a different fifth of the overall set was reserved for validation, i.e. to monitor the progress of the training on data not used for tuning the parameters of the models. The training set was used to learn the free parameters of the network by stochastic gradient descent. Two thousand passes through the entire training set (epochs) were performed for each fold, with 1920 weight updates per epoch. The learning rate (which controls how fast the algorithm converges) started from an initial value of 0.005 and was halved whenever we did not observe a reduction of the error for more than 1000 epochs.
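The learning-rate rule can be sketched as below. This is an illustration under stated assumptions: 'reduction of the error' is interpreted as an improvement on the best error seen so far, and `run_epoch` is a hypothetical callback that performs one epoch of stochastic gradient descent at the given learning rate and returns the resulting error.

```python
def train_with_halving(run_epoch, n_epochs=2000, lr=0.005, patience=1000):
    """Run n_epochs epochs, halving lr after > patience epochs without improvement.

    Returns the final learning rate and the learning rate used at each epoch.
    """
    best = float("inf")
    stale = 0                  # epochs since the error last improved
    history = []
    for _ in range(n_epochs):
        err = run_epoch(lr)
        history.append(lr)
        if err < best:
            best, stale = err, 0
        else:
            stale += 1
            if stale > patience:
                lr, stale = lr / 2.0, 0
    return lr, history
```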
By the end of training, the five models of the network had errors of less than 4.1% on the validation set, indicating that the networks had converged to find good local optima. We averaged the results on the five validation sets to get the overall fivefold cross-validation result. Alongside the fivefold cross-validation results, we also tested the ensemble of all five models on the independent test set which we had set aside, to get an unbiased estimate of its performance. This ensemble is the final implementation of PPIIPRED.
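One simple way to combine the five models is to average their per-residue scores, as in the sketch below. The text does not state how the ensemble combines its members, so unweighted averaging is an assumption here, and the function name `ensemble_scores` is hypothetical.

```python
def ensemble_scores(per_model_scores):
    """Average per-residue scores across models (assumed combination rule).

    per_model_scores: one list of scores per model, all of equal length.
    """
    n_models = len(per_model_scores)
    return [sum(column) / n_models for column in zip(*per_model_scores)]
```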

Neural network outperforms a proline window
Since prolines dominate many but not all PPII helices, we were interested in whether the machine learning approach clearly outperformed a much simpler 'proline window' method that uses the frequency of prolines in a fixed window around each residue as a direct prediction of the PPIIH state. We applied this using a sliding window with variable window sizes. The ROC curve in figure 1a shows the performance of the 'proline window' predictions. Since the rarity of PPII residues results in a typical excess of false positives over true positives at many predictor thresholds, we also investigated plots of true positives versus log FPR (figure 1b). Although it may appear that the proline windows of size one, two or three are reasonable classifiers, the significant imbalance between TPs and true negatives (TNs) will result in a large number of false predictions when using this approach (figure 1b).
royalsocietypublishing.org/journal/rsos R. Soc. open sci. 7: 191239
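The 'proline window' baseline can be sketched in a few lines. The exact window definition is not given in the text, so this sketch assumes a symmetric window of `half_width` residues on either side of the central residue, truncated at the sequence ends; the function name is hypothetical.

```python
def proline_window_scores(seq, half_width=1):
    """Score each residue by the fraction of prolines in its window."""
    scores = []
    for j in range(len(seq)):
        window = seq[max(0, j - half_width): j + half_width + 1]
        scores.append(window.count("P") / len(window))
    return scores
```

These scores can be thresholded and evaluated in the same way as the neural network output, which makes the two approaches directly comparable on a ROC curve.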
PPIIHs share some properties with disordered regions of proteins, and are often found embedded within them. Proline can interrupt alpha-helical or beta-sheet regions, and thus contribute to protein disorder. We were interested to see whether a disorder scoring method alone could provide reasonable power to predict the tendency to form PPIIHs. However, one standard method of disorder prediction, IUPRED, had only very weak predictive power, only modestly exceeding random predictions (figure 1).
The PPIIH predictor trained on the strict dataset and including PSI-BLAST alignments (from here on termed PPIIPRED) performed very substantially better than the disorder or proline windowing methodologies (figure 1), with an area under the curve (AUC) of 0.91. At a cut-off of 0.5, it had a sensitivity of 0.86, a Matthews correlation coefficient (MCC) of 0.26 and an accuracy (Q) of 82.3. MCC and accuracy were higher at a cut-off of 0.2 (table 2). This performance compares favourably with the previous methods of [14,15], although these predictors were each evaluated on different validation sets and their software is not available, so no direct comparison is possible. We noted that the network trained on the strict definition of PPIIH, despite having a smaller training set of true positives, performed somewhat better than the network trained on the less strict definition (table 2). We therefore focused all further attention on the networks trained using the strict definition of PPIIH.

Alignments improve predictive power
We explored the performance of the method when the alignment data are absent. While sequences alone do have reasonable predictive power, the addition of alignments improves the predictions: a sensitivity of 0.23 without alignments rises to 0.38 when the alignments are included (table 2). While alignments generated on the fly by PSI-BLAST offer flexibility of analysis, in some cases better alignment accuracy can be obtained by using pre-computed alignment datasets. We wanted to check that PPIIPRED would be relatively robust to alternative means of calculating alignments. We took a set of pre-computed alignments generated for each sequence using the GOPHER approach [32]. These pre-computed alignments performed well when substituted for the PSI-BLAST alignments in the assessment of PPIIPRED, with an accuracy of 96.7 compared to 97.2 (table 2), suggesting that the approach is relatively insensitive to the particular alignment strategy adopted. However, users studying disordered protein regions, which are often difficult to align well, should check by eye that the alignment is believable; otherwise the conservation information will only mislead the predictor, and a prediction performed without an alignment may be more accurate.
Both options are available to users on the website. The user may submit multiple sequences in FASTA format within a single file, allowing a large number of predictions to be returned within one submission. The output provides the user with the numeric PPIIPRED scores for each residue of the submitted proteins. In addition, for individually submitted sequences, there is a graphic output (electronic supplementary material, figure S3) which allows the user to easily compare the findings of PPIIPRED against the backdrop of the predicted disorder in the sequence, using the IUPRED predictor.

High-ranking regions among human proteins
We used PPIIPRED to predict the highest scoring regions in the human proteome (cut-off = 0.5 and region length >3). As expected, the top-scoring results are dominated by proline-rich regions (table 3). However, these highest scores cannot be explained by proline composition alone, since PPPAE and PPPPP have almost identical scores. The score provided is dependent also on sequence context and evolutionary conservation, so that, for example, different scores were observed for PPPA in TM175 and LKAM1.
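Extracting such regions from per-residue scores can be sketched as follows. The cut-off of 0.5 and the requirement of more than three residues come from the text; treating scores at or above the cut-off as passing, and ranking regions by their mean score, are assumptions, as is the function name.

```python
def high_scoring_regions(seq, scores, cutoff=0.5, min_len=4):
    """Return (start, end, peptide, mean_score) for each qualifying region.

    start and end are 1-based and inclusive; min_len=4 means 'length > 3'.
    """
    scores = list(scores)
    regions = []
    start = None
    for j, s in enumerate(scores + [float("-inf")]):  # sentinel closes last run
        if s >= cutoff and start is None:
            start = j
        elif s < cutoff and start is not None:
            if j - start >= min_len:
                segment = scores[start:j]
                regions.append((start + 1, j, seq[start:j],
                                sum(segment) / len(segment)))
            start = None
    return regions
```

Sorting the returned tuples by their mean score would give a ranked list of candidate regions for one protein.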
We were interested to explore higher-confidence predicted PPII helices that were not markedly dominated by proline. While PPIIH has been proposed not to propagate beyond one sequential non-proline residue [33], we noted that one of the highest scoring proposed helices (table 3) terminates in Ala-Glu. Table 4 shows the top-scoring regions where proline is less than 40% of the motif.
Among this set of top-ranking peptides in tables 1-3, there are representations of hydrophobic (A, L, V, I, M), charged (E, K, R, D) and small polar (S, T) amino acids. By contrast, there is no representation of amino acids with bulky side-chain rings (H, F, Y, W), which may disfavour PPIIH formation. Table 5 shows top-scoring proline-free predictions. It will be of interest to experimentally examine the conformations of some of these proline-free predicted helices, particularly in the context of their larger containing proteins and protein complexes, to determine if the predictions at this edge of the comfort zone of the method have good predictive utility.

Discussion
The PPIIPRED tool offers support to those seeking to make sense of functional and evolutionary change in sequences that are likely to form polyproline II helices. This is superior to simply scanning a protein sequence by eye to identify proline-rich regions, since PPIIPRED clearly outperforms a simple proline windowing approach. It is useful to consider how reliable or interpretable the predictions may be in a typical analysis. Residues with a PPIIPRED score greater than 0.2 include 37% of true PPIIH residues and 1.3% of non-PPIIH residues (figure 1b). In our test set, this translates to 1828 true positive residues and 2778 false positives. Thus, assuming that a researcher interested in predicting PPIIH within a protein was investigating a dataset similar in structural composition to the PDB test set, roughly three false positives may be expected for every two true positives at this cut-off, and a little over a third of the true PPIIH residues would be detected. In a practical setting, there may be a greater proportion of true positives, since many researchers interested in PPIIH are already focusing on regions of disorder, where the frequency of PPIIH is relatively high. Nevertheless, these statistics give a realistic indication of the utility of applying the predictive method to proteins in conditions of interest to biologists. While this highlights the difficulties of interpreting predictions of a relatively rare structural state with modest predictive power, these predictions are of value in many biological contexts, so long as users remain aware of the limitations of the predictions, in terms of how many false positives are typically expected for every true positive. It is of interest to evaluate to what extent there are regions that have a high predicted PPII propensity and also a high alpha-helical or beta-sheet propensity.
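The practical meaning of the counts quoted above at the 0.2 cut-off can be checked with a few lines of arithmetic. The counts (1828 true positives, 2778 false positives) and the rates (37% and 1.3%) are taken from the text; the implied class sizes are back-calculated under the assumption that those percentages are the true positive rate and false-positive rate, and the function name is hypothetical.

```python
def precision_summary(tp, fp, tpr, fpr):
    """Summarize what the counts imply for a user reading the predictions."""
    precision = tp / (tp + fp)      # fraction of predictions that are correct
    fp_per_tp = fp / tp             # false positives per true positive
    implied_positives = tp / tpr    # approximate number of PPIIH residues
    implied_negatives = fp / fpr    # approximate number of non-PPIIH residues
    prevalence = implied_positives / (implied_positives + implied_negatives)
    return precision, fp_per_tp, prevalence

precision, fp_per_tp, prevalence = precision_summary(1828, 2778, 0.37, 0.013)
print(f"precision = {precision:.3f}, FP per TP = {fp_per_tp:.2f}, "
      f"implied PPIIH prevalence = {prevalence:.3f}")
```

Under these assumptions, about two in five residues predicted at this cut-off are true PPIIH residues, and the implied prevalence of roughly 2% is in line with the rarity of PPIIH in the structure database noted in the Introduction.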
Interpretation of such findings from a machine learning predictor of PPIIH states is complex, since the dataset used in training comprises fixed structures rather than structural ensembles, so that each residue is only found in one state. A potential consequence is that the method may to some extent use information from a lack of alpha-helical or beta-sheet propensity to increase the likelihood that a residue is PPIIH. Thus, it is not clear to what extent the training algorithm of PPIIPRED may militate against residues with an alpha-helical or beta-sheet propensity, simply as a consequence of the training set provided. Careful analyses of results from structural ensembles would be required in order to tease apart these questions, and give insights into the more detailed behaviour of the PPIIPRED predictor.
Electronic supplementary material, figure S2 gives an indication of the contribution of different amino acids to human proteome predictions at various cut-offs of PPIIPRED. Clearly, the most highly predicted residues are almost all prolines, but at the cut-off of 0.2 discussed above, there is a very substantial contribution of different amino acids. The preferred amino acids in these predictions match to some extent the previously known information regarding PPII propensity, with a preference among negatively charged residues for E over D, as previously noted for PPIIHs [7]. However, the preference for methionine over leucine noted by Kentsis et al. [7] is not seen here, suggesting that their experimental investigation of different amino acids in the context GGxGG may not have general relevance to PPIIH formation in all contexts. In comparing other similar pairs of amino acids, a preference is also seen (electronic supplementary material, figure S2) for lysine over arginine, and for glutamine over asparagine.
While glycine has a key role in PPIIH formation in the context of triple-helical collagens, it is avoided in the predicted PPIIH helices (electronic supplementary material, figure S2). Triple-helical collagen structures are not well represented in the structure databases or in this training set, and are best predicted by other prediction approaches looking for the strong triplet periodicity of extended collagen regions. All PPIIHs have an exact triplet periodicity. However, the bulk of regions in this training set are short, so while it is possible that the BRNN may have used and incorporated some signal relating to short-range periodicity in refining the prediction, this is hard to assess. One potential application of PPIIPRED may be in defining candidate PPIIH regions, in which triplet periodicities of possible functional importance may be assessed, such as any amphiphilic tendencies of the helices.
Data accessibility. The 'data' is essentially the software contained in the neural networks. These are provided as electronic supplementary material, allowing researchers to download the code and run it on a Linux platform.