Journal of The Royal Society Interface
You have accessResearch articles

Coarse-grain simulations on NMR conformational ensembles highlight functional residues in proteins

Sophie Sacquin-Mora

Sophie Sacquin-Mora

Laboratoire de Biochimie Théorique, CNRS UPR9080, Institut de Biologie Physico-Chimique, 13 rue Pierre et Marie Curie, 75005 Paris, France

[email protected]

Google Scholar

Find this author on PubMed

    Abstract

    Dynamics are a key feature of protein function, and this is especially true of gating residues, which occupy cavity or tunnel lining positions in the protein structure, and will reversibly switch between open and closed conformations in order to control the diffusion of small molecules within a protein's internal matrix. Earlier work on globins and hydrogenases have shown that these gating residues can be detected using a multiscale scheme combining all-atom classic molecular dynamics simulations and coarse-grain calculations of the resulting conformational ensemble mechanical properties. Here, we show that the structural variations observed in the conformational ensembles produced by NMR spectroscopy experiments are sufficient to induce noticeable mechanical changes in a protein, which in turn can be used to identify residues important for function and forming a mechanical nucleus in the protein core. This new approach, which combines experimental data and rapid coarse-grain calculations and no longer needs to resort to time-consuming all-atom simulations, was successfully applied to five different protein families.

    1. Introduction

    Far from being static, rigid objects, proteins are highly flexible molecules, and understanding this flexibility is essential for understanding protein function [1,2]. Interestingly, protein conformational diversity and dynamics do not always involve large-scale protein motions, and function can arise from small structural changes [3]. In their work investigating the conformational changes between the apo and ligand-bound state in 60 enzymes, Gutteridge & Thornton [4] showed that over 90% of the studied proteins presented root mean square deviations (RMSDs) below 2 Å. These small conformational variations are nonetheless sufficient to support enzymatic activity, and inversely, slightly disrupting a protein's structure can greatly affect its biological function [5,6]. As a consequence, current de novo protein design strategies do not only involve the production of a particular fold, but also need to take into account protein dynamics, if one wants to achieve a specific function [79].

    Molecular gates represent a particular class of dynamic structures that play an important role in protein activity. They can be defined as a system, involving individual residues, loops or secondary structure elements, that can reversibly switch between open and closed conformations, and that control the passage of small molecules into and out of the protein structure [10,11]. Gating residues, lining internal channels connecting a buried active site to the surface, or two distant active sites within a single protein, are a common feature in the protein world, with more than 60% of the enzymes in the Catalytic Site Atlas [12] presenting channels at least 15 Å long [13], and as shown by the 70 examples discussed in a review by Gora et al. [11]. The dynamic properties of these tunnel lining residues play a key role in controlling catalytic efficiency and enzyme substrate specificity [14,15], and they can be seen as a new category of functional residues that take a part in regulating biological processes without being directly involved in the catalytic reaction. As a consequence, they have attracted growing interest among the drug-design community over recent years [16].

    From a theoretical point of view, numerous tools are nowadays available for investigating protein dynamics [14,17]. Classic all-atom molecular dynamics (MD) simulations represent a common strategy, sometimes combined with experimental approaches [18,19], which can, however, be computationally costly when working on large systems or modelling events taking place on a long timescale [20]. Another option is to use coarse-grain models, which combine simplified protein representations and energy functions [2123]. Elastic network models (ENMs) [24,25] in particular have led to many results regarding protein mechanics and dynamics [2630].

    In that perspective, we initially developed the ProPHet (Probing Protein Heterogeneity) program, which models proteins as an elastic network, to investigate protein mechanical properties on the residue level [31]. The rigidity profiles produced by ProPHet then turned out to be a remarkable tool for linking a protein structure to its dynamic properties and its function, showing how small conformational changes within a protein could induce large, non-trivial, mechanical effects and in turn impact its activity [3234]. In earlier studies on globins [35] and [NiFe]-hydrogenases [36], we also showed that the mechanical variations (calculated with ProPHET) observed in the conformational ensemble produced by classical equilibrium MD simulations can be used to identify a mechanical nucleus (MN) in the protein, which is a cluster formed by cavity lining residues occupying conserved positions along the protein sequence. In both cases, MN residues appear to be gating residues, oscillating between rigid and flexible states along the MD trajectory and controlling ligand migration in the vicinity of the active site.

    Since its initial developments a few decades ago, NMR spectroscopy has turned out to be a powerful tool for exploring the structure and dynamics of proteins in solution [18,3739]. In particular, proteic entries in the Protein Data Bank [40] (PDB) produced by solution NMR often provide a conformational ensemble containing between 10 and 30 structural models instead of a single structure. In this work, we used these conformational ensembles to investigate the mechanical properties in five globular protein families using CG simulations performed by ProPHet. The conformational variations in the experimental NMR ensembles are sufficient to induce large mechanical changes between models from a single PDB entry. Combined with sequence alignments, the coarse-grain calculations based on the NMR structural models permit us to identify MN residues in the proteins without resorting to computationally costly all-atom MD simulations, thus highlighting potential mutational targets in a drug-design perspective.

    2. Material and methods

    2.1. NMR conformational ensembles

    We searched the PDB for proteins where NMR conformational ensembles, comprising a minimum of 10 models, are available for at least three different species. This led to the selection of 19 PDB entries belonging to five protein families with various folds listed in table 1: globins, dihydrofolate reductase (DHFR), cyclophilin A (CPA), peptidyl-tRNA hydrolase (PTH), and parvulin. Each NMR ensemble comprises between 10 and 40 models, and the most representative ones (between two and seven models) were selected using the OLDERADO webserver, http://www.ebi.ac.uk/pdbe/nmr/olderado/ [41]. If the NMR structure was not listed online, we used the Ensemble cluster tool, which employs the same method as OLDERADO [42], in the UCSF-Chimera program, https://www.cgl.ucsf.edu/chimera/ [43], to determine the most representative models. Both the ChannelsDB database, https://webchemdev.ncbr.muni.cz/ChannelsDB/index.html [44], and the CASTP webserver, http://sts.bioe.uic.edu/castp/index.html?2r7g [45], were used to identify tunnel and cavity lining residues within all these protein structures.

    Table 1. List of the NMR structural ensembles used in this work. In the last column, cavity or tunnel lining residues are italicized and MN residues are highlighted in bold.

    protein name PDB NMR structures number of NMR models residues number representative NMR models (starting with the most representative one) maximum Cα RMSD between two representative models (in Å) MS residues
    globins 1myf 12 153 2,5,1,6,8 2.10 Val17, Ile28, Ala53, Met55, Leu69, Leu72, Glu83, Leu104, Ile107, Ile111, Leu135
    sequences identity range: 15–45% 2m6z-A 20 141 16,2,13,15,18 2.35 Val62, Leu66, Phe98, Leu101, Ser102, Leu105, Val132
    MN with 3MS positions/4 2m6z-B 20 146 16,2,13,15,18 1.60 Trp15, Ala27, Gly64, Val67, Phe71, Asn102, Leu106, Gly107, Leu110, Val134
    1mwb 20 123 17,15,19 2.65 Ser2, Phe21, Val25, His46, Phe50, Ala54, Val87, His117
    dihydrofolate reductase 2l28 25 162 1,6,22,23,25 2.00 Phe3, Leu4, Trp5, Phe30, Gln33, Thr34, Val95, Ala97, Gly98, Phe103
    sequences identity range: 23–38% 2kgk 15 162 1,7,5,6,9,12 2.40 Phe5, Met6, Val7, Asn19, Leu21, Phe96, Gly97, Tyr102
    MN with 3MS positions/4 2ith 20 162 7,9,12,20 1.52 Ser5, Val6, Ala7, Glu20, Val99, Ile100, Gly101
    1yho 25 186 3,6,10,13,19,21,24 0.90 Cys6, Ile7, Val8, Trp113, Ile114, Val115, Gly116, Tyr121
    cyclophilin A 2mvz 20 146 5,7,8,14,15 1.80 Phe31, Gly57, Met87, Phe101, Ile103, Phe118
    sequences identity range: 34–41% 1oca 20 165 6.18 0.74 Phe8, Leu24, Asn35, Phe36, Ser40, Phe 53, Gly65, Leu98, Ser99, Met100, Phe112, Phe113,Ile114
    MN with 2MS positions/3 1clh 12 166 1,2,8,9 2.30 Leu18, Phe32, Gln56, Gly57, Gly59, Ile88, Ala91, Gln102, Phe103, Phe104, Phe125, Gly126
    peptidyl-tRNA hydrolase 2lgj 10 191 10,8,7,6,9 2.13 Gly9, Leu24, Gly25, Val92, Ile93, His94
    sequences identity range: 39–82% 2jrc 40 191 5,26,29 1.62 Gly25, Ile93
    MN with 2MS positions/3 2mjl 10 197 6,4,2 1.34 Leu9, Val10, Gly11, Ala13, Ala26, Gly27,Val30, Met93, Ala95, His96, Leu136, Leu138
    parvulin 1jns 18 92 15,12,11,16,3 1.64 His8, Ile9, Gly46, Phe61, Val64, Ile85, Ile86
    sequences identity range: 31–55% 1nmw 10 114 7,3,4,10 1.52 His59, Leu60, Leu61, Cys113, Ile158, Ile159
    MN with 4MS positions/5 1j6y 20 120 19,18 2.60 His12, Ile13, Leu14, Ser68, Ser72, Asp105, His114, Ile115
    2rqs 20 97 3,20,8,19 2.91 Cys12, His14, Phe36, Leu56, Phe59, Phe68, Val92, Ile93
    2jzv 25 111 6,11,12 3.27 His146, Ile147, His239, Ile240, Ile241

    2.2. Coarse-grain simulation

    Coarse-grain Brownian dynamics (BD) simulations were performed using the ProPHet program, http://bioserv.rpbs.univ-paris-diderot.fr/services/ProPHet/ [33,34,46], on all the representative models listed in table 1. Diverging from most common CG models, where each residue is described by a single pseudo-atom [47], ProPHet uses a more detailed model enabling different residues to be distinguished, in which the amino acids are represented by one pseudo-atom located at the Cα position, and either one or two (for larger residues) pseudo-atoms replacing the side chain (with the exception of Gly) [48]. Interactions between the pseudo-atoms are treated according to the standard ENM [47], that is, pseudo-atoms closer than the cut-off parameter, Rc = 9 Å, are joined by Gaussian springs that all have identical spring constants of γ = 0.42 N m−1 (0.6 kcal mol−1 Å−2). The springs are taken to be relaxed for the starting conformation of the protein, i.e. its crystallographic structure. Mechanical properties are obtained from 200 000 BD steps at 300 K. The simulations are analysed in terms of the fluctuations of the mean distance between each pseudo-atom belonging to a given amino acid and the pseudo-atoms belonging to the remaining protein residues. The inverse of these fluctuations yields an effective force constant ki that describes the ease of moving a pseudo-atom i with respect to the overall protein structure:

    ki=3kBT(didi)2,
    where 〈 〉 denotes an average taken over the whole simulation and di=dijj is the average distance from particle i to the other particles j in the protein (the sum over j* implies the exclusion of the pseudo-atoms belonging to the same residue as i). The distances between the Cα pseudo-atom of residue k and the Cα pseudo-atoms of the adjacent residues k − 1 and k + 1 are excluded, since the corresponding distances are virtually constant. The force constant for each residue k in the protein is the average of the force constants for all its constituent pseudo-atoms i. We will use the term rigidity profile to describe the ordered set of force constants for all the residues in a given protein. Note that, following earlier studies, which showed how prosthetic groups, such as haemes, or small ligands had little influence on calculated force constants [31,49], we chose not to include these in the protein representations. This enables us to study the proteins intrinsic mechanical properties independently of the nature and position of any bound ligand.

    Following a procedure developed during earlier work on globins [35] and hydrogenases [36], we compared the rigidity profiles obtained for the selected representatives models. The models extracted from the NMR conformational ensembles do not present important structural variations, with Cα RMSDs between two models from the same PDB entry being always inferior to 3 Å (except for the structure of parvulin domain from Staphylococcus aureus, 2jzv, which presents both a flexible N-terminal tail and a flexible loop). However, these small structural changes are nonetheless sufficient to induce noticeable variations in the mechanical properties of a limited number of residues. We made a systematic pairwise comparison of all the rigidity profiles produced for the representative models from a single PDB entry and, for each residue in the protein, kept the maximum value that could be observed for its force constant variation. We then use the resulting max(|Δk|) profiles to determine mechanically sensitive (MS) residues, i.e. residues presenting the largest force constant variations between two structural models. MS residues are defined as residues whose maximum force constant variation satisfies the following criterion

    |Δk|max(i)|Δk|max+1.5σ(|Δk|max),
    with 〈|Δk|max〉 being the average value of the maximum force constant variation over the whole protein sequence and σ(|Δk|max) its variance. This procedure leads to the selection of roughly 5–10% of the residues for each one of the NMR structures that are listed in table 1.

    2.3. Sequence alignments

    Sequences from PDB entries belonging to the same protein family were aligned using the online MUSCLE tool, https://www.ebi.ac.uk/Tools/msa/muscle/ [50], with the default parameters. Despite the important mechanical similarity that can be observed in the rigidity profiles within a single protein family, the corresponding sequences display relatively low identities ranging from 15 to 82% (table 1). MS residues that were identified using the protocol described in the previous section are highlighted in bold and colour along the protein sequences in figure 3a and all throughout the electronic supplementary material.

    3. Results

    3.1. Globins

    Four NMR structures were selected in the PDB for the globin family: myoglobin (Mgb) from Physeter catodon (PDB entry 1myf), truncated haemoglobin (Tr. Hb) from S. sp. PCC 6803 (1mwb) and the α and β chains from human haemoglobin, Hb (2m6z). The force constant profiles for the main representative model of each protein are shown in figure 1. Similar to what has been observed in previous studies [35,46], the analogous aspects of the profiles reflect the common globin fold, with α-helices appearing as grouped rigidity peaks along the protein sequence. The max(Δk) profiles resulting from the systematic pairwise comparison of the force constants calculated for different representative models for a given NMR conformational ensemble are shown in figure 2 and residues with max(Δk) values above the threshold given in the Material and Methods section (indicated by the red horizontal line) are defined as MS residues, since their mechanical properties are the most impacted by the small conformational variations observed between the NMR models. The MS residue positions are highlighted in red on the multiple sequence alignment shown in figure 3a and we can see that most of these positions are common to one or more globins. Four positions in particular, which are framed and highlighted in bold green in figure 3a, E11, E15, G8 and G12 (using the standard numbering for globins that indicates the residue position along helices E and G) correspond to MS residues in three chains out of four. We can see from figure 3b that these residues form a central cluster right at the heart of the globin fold, with their side chain tightly interacting, and they all correspond to tunnel or pocket lining residues. These four positions also exactly concur with the MN of globins which was determined in our earlier work using a similar approach, with the alignment of six globin sequences, and the MS residues determination based on clusterized representative structures from trajectories produced by all-atom classical molecular dynamic simulations [35]. A search in the literature shows that the MN residues play an important part in protein function by controlling the ligand diffusion process. Positions E11 and G8 were the object of numerous mutational studies in myoglobin [5254], E11 and E15 are thought to regulate ligand migration in tr. Hb by acting as gate-opening molecular switches [5558], and in human Hb, gating movements of the leucine residue in G12 govern the hopping of gaseous ligand from and to different binding sites [59]. In neuroglobin, the mechanical properties for all four MN residues have also been shown to be particularly sensitive to pressure changes in high-pressure crystallography [60].

    Figure 1.

    Figure 1. Rigidity profiles (in kcal mol−1 Å−2) of the first NMR representative model for the four globin chains under study. (a) Mgb, (b) Tr. Hb, (c) αHb and (d) βHb.

    Figure 2.

    Figure 2. Maximum variation (in kcal mol−1 Å−2) of the force constant observed upon pairwise comparison of the NMR representative models. (a) Mgb, (b) Tr. Hb, (c) αHb and (d) βHb. The red horizontal line indicates the threshold value used (following the criterion given in the Material and methods section) for the selection of MS residues listed in table 1. (Online version in colour.)

    Figure 3.

    Figure 3. (a) Multiple sequence alignment for the globin structures; MS residues are shown in bold red, and residues belonging to the MN are framed and highlighted in bold green. (b) Cartoon representation of myoglobin with the haem group shown in black and the MN residues as green van der Waals spheres. a and figures 47 are prepared using Visual Molecular Dynamics [51]. (Online version in colour.)

    3.2. Dihydrofolate reductase

    Four NMR conformational ensembles were found in the PDB for dihydrolate reductases (DHFR) from Homo sapiens (1yho), Lactobacillus casei (2l28), Bacillus anthracis (2kgk) and Haloferax volcanii (2ith). The combination of coarse-grain mechanical calculations and multiple sequence alignment leads to the identification, when keeping only MS residues that are present in at least three out of four sequences (see the alignment in electronic supplementary material, figure SI-1), of a six-residue MN comprising Cys6, Ile7, Val8, Val115, Gly116 and Try121 (using the human sequence numbering) shown in figure 4. All residues from the MN are listed as tunnel lining or cavity lining in ChannelsDB or CASTP and lie on the interface between the subdomains defined in [61]. A search in the experimental literature shows residues 6–8 belong to the protein initial β-strand, which contains residues critical for ligand and cofactor binding [62,63], Phe96 from the BaDHFR (corresponding to Val115 in the human DHFR) is also involved in ligand binding [63] and so is Ile94, its equivalent in the E. coli DHFR variant [64]. Interestingly, the F98Y mutation in DHFR from S. aureus (which corresponds to a mutation of Tyr121 in human DHFR) is responsible for an important loss in TMP affinity, due to Tyr98 forming a hydrogen bond with Leu5 (Ile7 in human), thus preventing it from binding the TMP ligand [65]. The same mutation has also been investigated in DHFR from Mycobacterium tuberculosis, with the Y100F mutant displaying low affinity for classical DHFR inhibitors [66].

    Figure 4.

    Figure 4. (a) Upper black line, rigidity profile of the first NMR representative model of human DHFR. Lower black line, maximum variation of the force constant observed upon pairwise comparison of the NMR representative models for human DHFR, the red horizontal line indicates the threshold value used for the selection of MS residues listed in table 1. (b) Cartoon representation of human DHFR with the NDP and TRR ligands shown in black and the MN residues as green van der Waals spheres. (Online version in colour.)

    3.3. Cyclophilin A

    Coarse-grain calculations were made on three NMR conformational ensembles for CPA from H. sapiens (1oca), Geobacillus kaustophilus (2mvz) and E. coli (1clh). The consensus MN is formed by six residues (Phe36, Gly65, Leu98, Phe112, Phe113 and Phe129, using the human sequence numbering) that correspond to MS positions in at least two out of the three sequences (see electronic supplementary material, figure SI-1 for the alignment) and is shown in figure 5. The ChannelsDB entry for 3rdd, which corresponds to a crystallographic structure for human CPA, indicates that Phe113 is a tunnel lining residue, and CASTP calculations identify all the MS residues in the E. coli CPA as cavity lining residues. From the experimental literature, we find that the ligand-binding pocket in human CPA comprises residues Gly65, Leu98, Phe112 and Phe113, and Phe36, Phe112 and Phe129 belong to the protein aromatic core [67]. In a study using NMR relaxation experiments [68], Eisenmesser et al. showed how the protein motions play an essential part in its catalytic activity, and this catalytic activity is severely diminished for the F113 W mutant. In addition, the large variations in rigidity observed for Leu98, Ser99 and Phe113 in human CPA concur with the fact that these residues display the largest differences in NMR chemical shifts between the enzyme's major and minor conformers. A mutational study on Ser99 (which lies outside of the catalytic site), designed to stabilize the previously hidden minor conformation, causes a large decrease in the catalytic rate [69]. In the E. coli CPA, residues Ile88, Phe103 and Phe125 belong to the protein hydrophobic core, but are not directly involved in its isomerase activity [70]. Interestingly, MS residues Ala91 and Phe103 are located on both extremities of the B5–B6 surface loop, which overlooks the active site. As a consequence, mechanical variations in these two residues are likely to impact the loop conformational variability, and therefore the ligand's access to the catalytic site. A similar effect is also observed in the CPA from G. kaustophilus, with residues Gly57, Met87 and Phe101 lying on the edge of active site loops [71].

    Figure 5.

    Figure 5. (a) Upper black line, rigidity profile of the first NMR representative model of human CPA. Lower black line, maximum variation of the force constant observed upon pairwise comparison of the NMR representative models for human CPA; the red horizontal line indicates the threshold value used for the selection of MS residues listed in table 1. (b) Cartoon representation of human CPA with the MN residues as green van der Waals spheres. (Online version in colour.)

    3.4. Peptidyl-tRNA hydrolase

    Three NMR conformational ensembles were found in the PDB for PTH from M. tuberculosis (2jrc), M. smegmatis (2lgj) and Vibrio cholerae (2mjl), which lead to the identification of a five-residues MN formed by Gly11, Ala26, Gly27, Ala95 and His96 (using the V. cholerae sequence numbering) shown in figure 6, with MN residue occupying MS positions in at least two out of the three sequences (see electronic supplementary material, figure SI-1 for the sequence alignment). All the MS residues listed in table 1 for the 2jrc and the 2lgj structures are either tunnel lining or cavity lining residues in ChannelsDB or CASTP. The α/β globular fold for PTH comprises an active site channel connecting the surface of the molecule to the central β-sheet core [72]. In the M. tuberculosis variant, MN residues are not directly involved in the catalytic activity of PTH (unlike the neighbouring His24 and Asp97); however, Gly11 and Gly27 are strictly conserved in this proteic family, while Ala95 occupies and hydrophobic positions, and all residues are lining the substrate binding channel [73]. In most PTH variants, Ala95 and His96 lie at the base of the gate loop, which controls the entrance to the substrate binding cleft [7476], and has to move away to permit ligand binding in the enzyme's active site, which means that a perturbation of their mechanical properties is likely to disrupt the protein catalytic activity. In addition, NMR studies on the E. coli PTH show that Ala 26, Gly27 and Ala95 are affected by RNA binding to PTH, despite their central position in the protein core [77], due to the interplay between the tRNA binding region and the peptide binding channel [72]. Ala95 also forms van der Waals contacts with docked substrates in the V. cholerae PTH [76].

    Figure 6.

    Figure 6. (a) Upper black line, rigidity profile of the first NMR representative model of PTH from V. cholerae. Lower black line, maximum variation of the force constant observed upon pairwise comparison of the NMR representative models for PTH from V. cholerae; the red horizontal line indicates the threshold value used for the selection of MS residues listed in table 1. (b) Cartoon representation of PTH from V. cholerae with the MN residues as green van der Waals spheres and the catalytic residues (at the entrance of the peptide binding cleft) as red van der Waals spheres. (Online version in colour.)

    3.5. Parvulin

    Five NMR conformational ensembles were found in the PDB for human parvulin (1nmw), and variants from E. coli (1jns), Arabidopsis thaliana (1j6y), Cenarchaeum symbiosum (2rqs) and S. aureus (2jzv), thus leading to the identification of a four-residues MN formed by His59, Ile60, Il158 and Ile159 (using human sequence numbering) shown in figure 7 (see electronic supplementary material, figure SI-1 for the sequence alignment). In parvulin from E. coli, all the MN residues are listed as tunnel binding by ChannelsDB and cavity lining by CasTP (table 1). Structural studies on human parvulin highlight the functional importance of the two conserved histidines His59, His157 and Ile159, which are located in the enzyme's active centre [78,79,80]. Interestingly, later mutational studies on the His59/His157 pair showed that these residues are not directly involved in the parvulin isomerase activity, but more likely contribute to the enzyme's stability [80]. His8 in E. coli parvulin and His14 from the C. symbiosum parvulin (which both correspond to His59 in human parvulin) are located in the protein hydrophobic core and belong to the putative binding pocket [81,82], in parvulin from A. thaliana, His12 displays no chemical shift after binding of the peptide ligand, thus suggesting that this residue is not directly involved in the catalytic cycle [83]. The NMR conformational ensemble obtained for S. aureus parvulin highlights the structural variability of the catalytic histidines (His146 and His239), and the protein hydrophobic core also comprises residue Ile147, Ile240 and Ile241 [84]. A later study combining MD simulation and experimentally available S2 order parameter for three parvulins suggests that the conserved histidines play a key role in the modulation of the enzyme's internal dynamics [85].

    Figure 7.

    Figure 7. (a) Upper black line, rigidity profile of the first NMR representative model of human parvulin. Lower black line, maximum variation of the force constant observed upon pairwise comparison of the NMR representative models for human parvulin; the red horizontal line indicates the threshold value used for the selection of MS residues listed in table 1. (b) Cartoon representation of human parvulin with the MN residues as green van der Waals spheres. (Online version in colour.)

    4. Discussion and conclusion

    As shown in earlier work on globins and hydrogenases, protein mechanical properties are a dynamic feature [35,36]. Due to the robustness of the ENM approach [86], a protein's rigidity profile remains qualitatively the same over time, with its main peaks associated with a given set of residues (that are often likely to be catalytic residues). However, it will also present notable variations from one equilibrium structure to the other for another limited number of residues (termed as MS, residues), even though the conformational variations between these structures remain negligible. We have shown previously that one could use conformational ensembles produced by classic all-atom MD trajectories to identify MS residues in a protein, and repeating this approach on several homologous proteins will lead to the definition of an MN for this protein family, i.e. a set of conserved positions along the protein sequence that are occupied by MS residues. In the present work, we use protein conformational ensembles obtained by NMR spectroscopy for five distinct protein families. While not directly describing protein dynamics, the NMR conformational ensembles suggest the presence of dynamic regions within the proteins, and we show that the structural sampling provided by these ensembles is sufficient to identify an MN comprising between four and six residues in each group. In the case of globins, this MN concurs perfectly with the globin MN determined using trajectories from MD simulations. Its four constitutive residues have been highlighted as playing a key role in substrate diffusion in the protein matrix in numerous experimental approaches. The MN residues identified in the four other protein families also appear to be excellent candidates for playing a part in protein function as gating residues. Most of them are tunnel or cavity lining residues (italicized residues in table 1), and for each protein family, experimental evidence, from point mutation studies or evolutionary data, supports the hypothesis that, while not being directly involved in the chemical reaction taking place in the protein active site, the MN residues are nonetheless important for protein function. Altogether, our procedure leads to the selection of 25 MN residues for the five protein families, and all but one (Gly116 in DHFR) were highlighted as functionally relevant by earlier experimental data. In addition, researching the literature led to identifying only two cases of MS residues important for function that were not selected to be part of the MN, namely Ser99 in CPA (which is an MS residue for the human variant only) and His157 in parvulin (which is an MS residue in the variants from A. thaliana and S. aureus only).

    In particular, by modulating the shape and size of the cavities and tunnels leading to the protein active site, MN residues can impact an enzyme's substrate selectivity and catalytic efficiency. Therefore, they represent an interesting target for the fine tuning of protein function.

    Conformational ensembles built from crystallographic structures can bring a lot of information regarding protein dynamics [8789]. In order to see whether these could also be used with this approach, we performed mechanical calculations and a pairwise comparison of the rigidity profiles on crystallographic structures of DHFR from three variants, using the PDB IDs listed in [90]: DHFR from B. anthraci: chain A from 4alb, 4ale, 4elf, 4elg, 4elh (five structures), human DHFR, chains A and B from 2w3a, 2w3b and 2w3 m (six structures), DHFR from E. coli, chain a from 1rx1, 1rx2, 1rx3, 1rx4, 1rx5, 1rx6, 1rx7, 1rx8, 1rx9 (nine structures). The resulting force constant variation profiles lead to the same MN as when using NMR ensembles, except for Ty121 (using the human sequence numbering), which is no longer an MS residue for any of the three species (see electronic supplementary material, figure SI-2). This loss in the MN definition is not negligible, since this position has been shown to be functionally important for DHFR from S. aureus and M. tuberculosis [65,66].

    Another issue with ensembles of crystallographic structures is the choice of representative structures which will be used for mechanical calculations (ideally, one should retain around four to six models to limit the computational cost of the approach). In the dataset listed in [90], at least 45 structures are retained for each protein. However, selecting structures that are to mechanically too close can lead to a poor discrimination of MS residues, resulting in a poorly defined MN.

    As an example, we performed mechanical calculations on 10 crystallographic structures from human CPA (PDB 5noq, 5nor, 5nos, 5not, 5nou, 5nov, 5now, 5nox, 5noy, 5noz). The pairwise comparison of all 10 rigidity profiles leads to the same selection of MS residues as when using the NMR conformational ensemble (see the lower black line in electronic supplementary material, figure SI-3a). However, if one selects structures that are too close from a mechanical point of view (namely 5noq, 5nor, 5nou and 5nov), the residues' force constant variations are always below 10 kcal mol−1 Å−2, thus making the selection of MS residues more difficult (see the red line in electronic supplementary material, figure SI-3a), a situation that was never observed for any of the 19 NMR ensembles used in this work. Moreover, in the case of structures with low RMSDs (below 2.5 Å), the RMSD and the maximum force constant variation between two structures are not correlated, as can be seen on electronic supplementary material, figures SI-3b and SI-4b. Which means that selecting structures on the basis of their RMSD is no guarantee that they will be sufficiently different from a mechanical point of view.

    So all in all, it seems that our method can also use crystallographic structures. But this will require more work from a potential user, as one has to build an ensemble from several PDB structures instead of using a single NMR ensemble, and be careful that the mechanical properties of the selected structures vary sufficiently in order to clearly define the MS residues.

    Finally, our use of NMR structural ensembles, which are already publicly available in the PDB, and coarse-grain calculations provides us with a simple and efficient tool for identifying key spots in the protein structure where point mutations could help modulate the enzymatic activity or specificity of the protein. While the current study has been limited to small proteins, comprising less than 200 residues, recent progress in NMR spectroscopy on large systems [39] should ensure that experimental conformational ensembles for larger proteins will be accessible in the near future.

    Data accessibility

    All sequences alignments are available in the supplementary information and all mechanical calculations were performed on the open ProPHet webserver: http://bioserv.rpbs.univ-paris-diderot.fr/services/ProPHet/ (and the data are available upon request). A preliminary version of this work, https://doi.org/10.1101/532507, was deposited in bioRxiv.

    Competing interests

    I declare I have no competing interests.

    Funding

    This work was supported by the ‘Initiative d'Excellence’ programme from the French State (grant ‘DYNAMO’, ANR-11-LABX-0011-01).

    Footnotes

    Electronic supplementary material is available online at https://dx.doi.org/10.6084/m9.figshare.c.4546865.

    Published by the Royal Society. All rights reserved.