Abstract
Motivation: The mutation of amino acids often impacts protein function and structure. Mutations without negative effect sustain evolutionary pressure. We study a particular aspect of structural robustness with respect to mutations: regular protein secondary structure and natively unstructured (intrinsically disordered) regions. Is the formation of regular secondary structure an intrinsic feature of amino acid sequences, or is it a feature that is lost upon mutation and is maintained by evolution against the odds? Similarly, is disorder an intrinsic sequence feature or is it difficult to maintain? To tackle these questions, we in silico mutated native protein sequences into random sequence-like ensembles and monitored the change in predicted secondary structure and disorder.
Results: We established that by our coarse-grained measures for change, predictions and observations were similar, suggesting that our results were not biased by prediction mistakes. Changes in secondary structure and disorder predictions were linearly proportional to the change in sequence. Surprisingly, neither the content nor the length distribution for the predicted secondary structure changed substantially. Regions with long disorder behaved differently in that significantly fewer such regions were predicted after a few mutation steps. Our findings suggest that the formation of regular secondary structure is an intrinsic feature of random amino acid sequences, while the formation of long-disordered regions is not an intrinsic feature of proteins with disordered regions. Put differently, helices and strands appear to be maintained easily by evolution, whereas maintaining disordered regions appears difficult. Neutral mutations with respect to disorder are therefore very unlikely.
Contact: schaefer@rostlab.org
Supplementary Information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
Random, undirected mutation is a major driving force for change in nature. In the protein universe, selection is realized through function: mutations leading to loss of function are rarely observed. As protein structure determines protein function, it is also subjected to evolutionary selection. Most problematic single nucleotide polymorphisms (SNPs) that alter the amino acid sequence (non-synonymous SNPs) appear to impact the stability of protein structure (Yue et al., 2005; Yue et al., 2006).
Helices and strands constitute the major macromolecular building blocks of all ‘well-ordered’ proteins (Benner et al., 1997; Kabsch and Sander, 1983; Levitt and Chothia, 1976; Morea et al., 1998; Pauling and Corey, 1951a; Pauling and Corey, 1951b). The particular 3D structure of a protein is assumed to correspond to the global minimum free energy and hence defines the unique fold of an amino acid polymer (Anfinsen and Scheraga, 1975; Dill, 1993; Karplus and Petsko, 1990; Levitt and Warshel, 1975; Liwo et al., 1999; Reva et al., 1995; Sippl, 1993). Another essential feature of protein structure is the unique interplay between well-ordered and flexible regions (Alexov and Gunner, 1997; Cavasotto and Abagyan, 2004; Claussen et al., 2001; Daniel et al., 2003; Gu et al., 2006; Morea et al., 2000; Radivojac et al., 2004; Schlessinger et al., 2006). One particular aspect of this interplay is that between what we may loosely refer to as ‘order’ and ‘disorder’ (Dunker and Obradovic, 2001; Dunker et al., 2008; Radivojac et al., 2004; Uversky, 2003).
Many proteins have regions that remain ‘unstructured’ unless bound to a substrate: they do not adopt a unique stable conformation in isolation. Such regions are also referred to as intrinsically disordered or simply as disordered. Our operational definition for this vague term is: we consider as disorder whatever is predicted as such. Proteins with long-disorder regions have unique biophysical traits that enable the binding to different substrates, often at different cellular conditions (Wright and Dyson, 2009). Very long regions without regular secondary structure (loosely referred to as ‘loops’) may resemble disorder (Liu et al., 2002); nevertheless, we can clearly distinguish between disorder-like and well-structured loops (Schlessinger et al., 2007a; Schlessinger et al., 2009). Disorder is an important ‘building block’ for the increase in complexity in the evolution from unicellular prokaryotes to multi-cellular eukaryotes.
Our two hypotheses were: (i) we assumed that regular secondary structure is difficult to maintain evolutionarily, i.e. single residue mutations are likely to impact helices and strands and that we would lose regular secondary structure and transit into ‘loopy’ polypeptide chains with increasing random mutations away from the native state. (ii) We assumed, furthermore, that disordered regions provide a means to become robust against mutations because most mutations would rather increase than decrease disorder by increasing the non-regular secondary structure. Here, we present results that falsify both hypotheses as clearly as possible without investing tens of millions of dollars.
2 METHODS
2.1 Datasets
We used protein sequences from two databases for the in silico mutation. First, we assessed the robustness of secondary structure through globular proteins from the Protein Data Bank (PDB) (Berman et al., 2000). Secondly, we assessed the robustness of disordered regions through proteins from DisProt (Vucetic et al., 2005) (version 4.9). We applied UniqueProt (Mika and Rost, 2003) to reduce the redundancy in both sets filtering at a sequence similarity threshold of HVAL >10 (Rost, 1999; Sander and Schneider, 1991) (this corresponds to ∼30% pairwise sequence identity—PIDE—for alignments over 250 residues). The redundancy-reduced sets comprised 1369 (PDB) and 374 (DisProt) proteins.
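The relation between the HVAL threshold and pairwise sequence identity can be illustrated with a short sketch. It assumes the commonly quoted form of the HSSP curve from Rost (1999), in which HVAL is the distance of PIDE from a length-dependent threshold curve; the function names and the exact constants are assumptions of this sketch and should be checked against that reference.

```python
import math

def hssp_curve(length):
    """Length-dependent identity threshold (in %), assumed form after Rost (1999)."""
    if length <= 11:
        return 100.0
    if length > 450:
        return 19.5
    return 480.0 * length ** (-0.32 * (1.0 + math.exp(-length / 1000.0)))

def hval(pide, length):
    """Distance of a pairwise identity (in %) from the assumed HSSP curve."""
    return pide - hssp_curve(length)

# PIDE at which HVAL reaches 10 for an alignment of 250 residues
pide_at_threshold = hssp_curve(250) + 10.0
```

Under these assumptions, `pide_at_threshold` comes out slightly above 30%, which is in line with the ∼30% PIDE quoted above for alignments over 250 residues.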
For each of the two datasets (PDB and DisProt), we also created random sequences that had the same amino acid composition, same length distribution and same number of sequences as the natives. The random sets served as convergence control: if we mutate enough to ‘lose all memory’ (convergence), the random sets will not differ from the mutated sets.
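One simple way to build such a control set is sketched below. It is not necessarily the procedure we used: it assumes that composition is matched at the level of the whole dataset, by pooling all residues, shuffling them and cutting the pool into pieces with the native lengths. The function name `random_control_set` is hypothetical.

```python
import random

def random_control_set(native_seqs, seed=0):
    """Shuffle all residues of a dataset and re-cut them into the native lengths.

    The result has the same number of sequences, the same lengths and exactly
    the same pooled amino acid composition as the native set.
    """
    rng = random.Random(seed)
    pool = list("".join(native_seqs))
    rng.shuffle(pool)
    controls, start = [], 0
    for seq in native_seqs:
        controls.append("".join(pool[start:start + len(seq)]))
        start += len(seq)
    return controls

natives = ["MKTAYIAKQRQISFVK", "GSSGSSGLEVLFQ", "MADEEKLPPGWEKRMS"]
controls = random_control_set(natives)
```

Sampling each residue independently from the background frequencies would be an alternative; it matches the composition only approximately but does not require pooling.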
To shed light on potential biases from the chosen databases, we additionally predicted the secondary structure in 33 812 proteins, representing the entire human proteome as taken from RefSeq 2006.
Finally, we sub-sampled a set of sequences from the PDB set with the same size, amino acid and length distribution as that of the DisProt set to examine the ability of ordered proteins to retain or lose their ordered state.
2.2 Mutation protocol
We gradually mutated native protein sequences into quasi-random strings of amino acids by the following iterative procedure.
2.2.1 One mutation step
One mutation step consisted of two moves: (i) select a particular residue position, i.e. the site in the sequence to mutate, and (ii) replace the amino acid X at that position by amino acid Y with probability pXY (Y may equal X, in which case the residue is maintained). For technical reasons (limited CPU, because after each step we had to apply several prediction methods), we repeated these two moves N/10 times (N: number of residues in the protein). Effectively, we thereby touched 10% of all residues in one mutation step.
2.2.2 Sixty-nine mutation steps
We carried out 69 mutation steps (with 69 × N/10 mutations) for each protein. Any other, sufficiently large, number would have worked. We chose 69 because we had reached convergence in all the cases that we looked at in detail after 65 steps.
Effectively, we applied a Markovian-like model for evolution, i.e. assuming that each residue mutates independently of all others and that the mutation depends only on the amino acid type. We applied three alternative substitution schemes: (i) we mutated according to the PAM120 probability (Dayhoff, 1978). (ii) PAM120 is valid for great evolutionary distance. In order to also cover closer relations, we also implemented BLOSUM62 (Henikoff and Henikoff, 1992). (iii) Finally, we took the underlying amino acid distribution in the database (PDB, DisProt; ordered and disordered regions in DisProt not distinguished) as substitution probabilities. Note that for most PAM120 and BLOSUM62 mutations, the most likely ‘mutation step’ was the maintenance of the current amino acid, as the diagonals are typically highest in these matrices. We did not consider mutations that led to insertions or deletions. BLOSUM62 and PAM120 behaved identically with respect to our results. For readability, we confined the BLOSUM62 results to the Supplementary Material.
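The protocol of Sections 2.2.1 and 2.2.2 can be sketched as follows. This is a minimal illustration rather than our actual implementation: the names `mutation_step`, `trajectory` and `pide` are hypothetical, positions are assumed to be drawn with replacement, and a uniform toy distribution stands in for the substitution probabilities (a real run would plug in rows derived from PAM120, BLOSUM62 or the database background).

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutation_step(seq, sub_probs, rng):
    """Apply N/10 moves: pick a site, then draw its new residue Y from row X."""
    residues = list(seq)
    for _ in range(len(residues) // 10):  # note: 0 moves if N < 10
        pos = rng.randrange(len(residues))
        row = sub_probs[residues[pos]]  # dict Y -> p_XY (Y may equal X)
        targets = list(row)
        weights = [row[y] for y in targets]
        residues[pos] = rng.choices(targets, weights=weights, k=1)[0]
    return "".join(residues)

def pide(a, b):
    """Percentage of identical residues; valid here because there are no indels."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

def trajectory(native, sub_probs, n_steps=69, seed=0):
    """Return [(mutant, PIDE to native)] after each mutation step."""
    rng = random.Random(seed)
    seq, path = native, []
    for _ in range(n_steps):
        seq = mutation_step(seq, sub_probs, rng)
        path.append((seq, pide(seq, native)))
    return path

# Toy scheme (assumption): every row is the same uniform background
uniform = {x: {y: 1.0 / 20 for y in AMINO_ACIDS} for x in AMINO_ACIDS}
path = trajectory("ACDEFGHIKL" * 5, uniform)
```

Because a site may be drawn more than once and Y may equal X, fewer than 10% of the residues actually change in a given step, which is why the sequence identity to the native is monitored explicitly after each step.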
2.2.3 Single trajectory versus ensemble
The ‘mutation path’ for each native sequence constitutes a single unique trajectory in the space of all possible mutations. We created five different such single paths (five different mutants) in order to investigate the divergence from the native of an ensemble of evolutionary paths. From these five, we compiled a consensus by per-residue averaging over each of the five predictions (secondary structure/disorder). Note that by default, we reported the results for single trajectories and added the ensemble comparison only where explicitly stated.
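A per-residue consensus of this kind can be sketched as below. The sketch assumes that each of the five predictions provides, for every residue, one probability-like score per state (here helix H, strand E and other L), that the scores are averaged per state and that the state with the highest average is taken; the function name `consensus` and this input format are assumptions of the illustration.

```python
def consensus(score_tracks, states="HEL"):
    """Average per-residue state scores over several predictions.

    score_tracks: list of predictions; each prediction is a list with one
    tuple of scores (one score per state) for every residue.
    Returns the consensus state string (ties go to the earlier state).
    """
    n_tracks = len(score_tracks)
    n_residues = len(score_tracks[0])
    result = []
    for i in range(n_residues):
        means = [sum(track[i][k] for track in score_tracks) / n_tracks
                 for k in range(len(states))]
        result.append(states[means.index(max(means))])
    return "".join(result)

tracks = [
    [(0.6, 0.2, 0.2), (0.1, 0.2, 0.7)],
    [(0.5, 0.3, 0.2), (0.2, 0.5, 0.3)],
    [(0.2, 0.1, 0.7), (0.1, 0.3, 0.6)],
]
consensus_states = consensus(tracks)
```

The same function applies to disorder if the states are reduced to two (disordered, ordered).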
2.3 Secondary structure
We predicted secondary structure through PROFsec (Rost, 2005). Secondary structure prediction methods improve when using evolutionary information (Liu and Rost, 2001; Rost, 1996; Rost and Sander, 1993). Without this information, PROFsec reaches a sustained single-sequence level of ∼68% three-state per-residue accuracy (Q3 is the percentage of residues predicted correctly in one of the three states helix, strand and other). We had to use this single-sequence mode to monitor the effect of point mutations. Prediction mistakes might invalidate the generality of our findings. One way in which we addressed this concern was by monitoring the parameters that we plotted for our mutants also for the experimental observations from the native proteins as taken from DSSP (Kabsch and Sander, 1983) with the usual conversion of eight into three ‘states’ (Andersen et al., 2002; Rost, 1996; Rost and Sander, 1993). For each mutation step (i.e. after each step of 10% change), we monitored the sequence similarity compared with the native sequence, the relative content of residues predicted in helix and strand and the average length of predicted helices and strands.
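The monitored quantities are simple to compute from state strings, as the following sketch shows. It assumes the widely used reduction of the eight DSSP states to three (H, G, I to helix; E, B to strand; everything else to other); the function names are hypothetical.

```python
from itertools import groupby

# Assumed eight-to-three mapping; all remaining DSSP symbols become 'L' (other)
DSSP_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}

def to_three_states(dssp):
    return "".join(DSSP_TO_3.get(symbol, "L") for symbol in dssp)

def q3(predicted, observed):
    """Percentage of residues predicted in the correct one of three states."""
    hits = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * hits / len(observed)

def content(states, state):
    """Percentage of residues in the given state."""
    return 100.0 * states.count(state) / len(states)

def mean_segment_length(states, state):
    """Average length of consecutive runs of the given state."""
    lengths = [len(list(run)) for s, run in groupby(states) if s == state]
    return sum(lengths) / len(lengths) if lengths else 0.0

observed = to_three_states("HHHHGGTTEEEEBS-")
predicted = "HHHHLLLLEEEEELL"
```

For this toy example, `observed` is `HHHHHHLLEEEEELL`, 13 of 15 residues agree, the observed helix content is 40% and the single observed strand has length 5.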
2.4 Disordered regions
We predicted disordered regions by three methods: IUPred (Dosztányi et al., 2005), MD (Schlessinger et al., 2009) and VSL2 (Obradovic et al., 2005; Peng et al., 2006) and compared the predictions to the experimental annotations in DisProt. IUPred has three options (long, short and glob); we chose short for short and long for long disorder. MD (Meta Disorder predictor) combines independent methods through machine learning. We used it without alignments. VSL2 is a collection of eight methods. We used the VSL2B variant that uses only single sequences as input.
The three methods focus on different aspects of disorder and have different strengths and weaknesses. We did not combine methods and, for simplicity, focused only on IUPred. The results from the other methods that were crucial to rule out method-specific findings are given in the Supplementary Material. We chose IUPred because it is accurate, fast and set up to work only with single sequences.
For each mutation step (i.e. after each step of 10% change), we monitored sequence similarity to native, the relative content of residues predicted in short/long-disordered regions and the length of the regions (SOM).
2.5 Box plots to present results
Box plots (McGill et al., 1978; Tukey, 1977) present our results concisely. The lower and upper box edges depict the first and third quartile, respectively. The length of a box is the interquartile range of the distribution. The bold bar inside the box represents the median, while dashed lines reach to the most extreme data point that is no more than 1.5 times the interquartile range away from the upper or lower box edge. Average (mean) values are connected through solid lines and intersect with box plots.
Median and mean are related to the protein level, i.e. summarize the specific feature of all sequences that fall within the same interval of PIDE.
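The quantities shown in each box can be computed as sketched below. The quartile convention is an assumption (linear interpolation, `method="inclusive"` in Python's `statistics.quantiles`); plotting packages may use a slightly different one. The function name `box_plot_summary` is hypothetical.

```python
import statistics

def box_plot_summary(values):
    """Quartiles, whiskers (within 1.5 x IQR of the box edges) and mean."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # whiskers end at the most extreme data points inside the fences
        "whisker_low": min(v for v in data if v >= low_fence),
        "whisker_high": max(v for v in data if v <= high_fence),
        "mean": statistics.fmean(data),
    }

summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
```

In this example the value 30 lies beyond the upper fence (7.75 + 1.5 × 4.5 = 14.5), so the upper whisker ends at 9, while the mean (7.5) is pulled above the median (5.5).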
3 RESULTS AND DISCUSSION
3.1 Secondary structure surprisingly robust
Comparisons of pairs of evolutionarily related protein structures reveal two major results (Abagyan and Batalov, 1997; Chothia and Lesk, 1986; Chung and Subbiah, 1996; Sander and Schneider, 1991): first, the less similar their sequences, the less similar their 3D structures [as well as their secondary structures (Rost et al., 1994; Rost et al., 1997)]; and second, the transition from the regime of ‘similar structure’ to ‘non-similar structure’ is highly non-linear and characterized by sigmoids indicative of phase transitions in physics. Our mutation protocol yielded a very different outcome.
Secondary structure diverged to almost random levels over the course of our mutation protocol. We compared this divergence to what is observed between naturally occurring homologues. Towards this end, we used the HSSP database (Sander and Schneider, 1991) and compared homologues at the corresponding levels of PIDE (Supplementary Fig. SOM_5). The change of secondary structure on random mutation was much more dramatic than that for homologous proteins (Fig. 1A), e.g. at 30% PIDE, natural homologues still had levels of Q3 ∼ 63%, while the random mutants reached Q3 ∼ 45% (Supplementary Fig. SOM_5). This result is not surprising: evolution feels the pressure to enrich neutral mutations, i.e. those that do not alter structure, while no such incentive was built into our in silico mutation protocol. Nevertheless, secondary structure was surprisingly robust under mutation. The consensus over ensembles of five different mutation trajectories (Fig. 1C and D) diverged much more dramatically from wild type than any single mutant (Fig. 1A and B).
Another important difference between our in silico mutation and natural evolution pertained to the shape of the transition: instead of a sigmoidal phase transition, we observed an almost linear transition from native wild-type to almost random mutant. This was true for both the single trajectory (Fig. 1A) and the ensemble (Fig. 1C), although the signal was clearer for the ensemble.
We observed that some regions did not alter secondary structure even at the end of our protocol at which the mutant was as similar to the wild type as to any other sequence in our dataset (Fig. 1B). For the ensemble, in contrast, the consensus secondary structure had changed almost completely from the native (Fig. 1D). Nevertheless, the Q3 levels converged to the same level in both cases.
3.2 Helix and strand intrinsic to random sequences
Our most surprising finding was that neither the overall content (Fig. 2A and B) nor the length (Fig. 2C and D) of predicted helices and strands was altered during the course of our mutation protocol. The average helix content remained ∼30%, whereas the average strand content remained around 20%; the average helix was about 10 residues long (2–3 helix turns), and the average strand extended over about five residues. In other words, regular secondary structure was predicted to be robust under extreme mutation. In this respect, we observed no significant difference between choosing mutations according to the background distribution and PAM120, although the latter tends to follow the evolutionarily more accepted mutations (mutations according to BLOSUM62 gave similar results; Supplementary Fig. SOM_6).
After the 69 mutation steps (Section 2), we reached a point at which the mutant was as similar to the native as to any other sequence. This was reflected by the similarity in the prediction of helix/strand content/length between the final mutant and randomly created sequences (Fig. 2: two rightmost bars almost identical).
Our results were based on predictions rather than on observations. Prediction methods make mistakes. One might hypothesize that rather than shedding light on protein features, our results are caused by those prediction mistakes. As no large-scale experiments establish structure for random sequences, we cannot refute this view. However, we could provide evidence that prediction mistakes might not matter for the aspects of structure that we monitored. In fact, by the measures that we used to report our results, predictions and observations were almost identical (Fig. 2: left gray bars in each panel). The precise levels of helix/strand content and length differed indeed more between different datasets (PDB subset versus entire set of human proteins) than between observation and prediction for any set for which we have experimental information. In other words, prediction mistakes appeared not to matter for all the proteins for which we could verify this statement.
Our findings that random and wild-type sequences were predicted to have similar content of regular secondary structure along with the observation that mistakes in predicting this were negligible suggest that the formation of helices and strands is an intrinsic feature of amino acid sequences. Neither helices nor strands were predicted to be significantly shortened during our drastic in silico mutation protocol. Note that this is not a consequence of the fact that PROFsec is trained to predict a particular length distribution, because predicted length distributions deviate substantially between all-helical and coiled-coil proteins. The maintenance of such regular secondary structure elements would then appear to come at seemingly low costs, i.e. mutations that are neutral with respect to structure might be more likely than might have been anticipated. Finally, we verified that the reliability of the predictions did not change during mutation (Supplementary Fig. SOM_10).
3.3 Long regions of disorder sensitive, short not
Arguably, there are two different regimes of disorder (Dosztányi et al., 2005; Liu et al., 2002; Obradovic et al., 2005; Peng et al., 2006; Schlessinger et al., 2007b; Schlessinger et al., 2009): very short and very long regions. No threshold distinguishes between these two regimes in a biophysically meaningful way.
In particular, there likely exists an intermediate range that might belong to both regimes. Here, we followed the typical ‘convention’ in the field and defined as short disorder regions with eight or fewer consecutive residues and as long disorder regions with 30 or more consecutive residues. Thereby, we ignored the uncertain regime in between these two extremes. In order to establish that our results did not crucially depend on the particular threshold, we also tested other thresholds for long disorder, namely 20, 40 and 50. We found that the trend of loss during in silico mutation is independent of the chosen cut-off and is even clearer for larger thresholds (40 and 50) (Supplementary Fig. SOM_09).
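The length-based definition translates directly into a small routine. The sketch below assumes that a predictor's output has already been reduced to one flag per residue (`D` for disordered, `-` for ordered); the function names and this encoding are assumptions of the illustration.

```python
from itertools import groupby

def disorder_regions(flags):
    """Return (start, length) for every run of consecutive 'D' residues."""
    regions, position = [], 0
    for state, run in groupby(flags):
        length = len(list(run))
        if state == "D":
            regions.append((position, length))
        position += length
    return regions

def disorder_content(flags, short_max=8, long_min=30):
    """Percentage of residues in short (<= short_max) and long (>= long_min) regions."""
    regions = disorder_regions(flags)
    short = sum(n for _, n in regions if n <= short_max)
    long_ = sum(n for _, n in regions if n >= long_min)
    return 100.0 * short / len(flags), 100.0 * long_ / len(flags)

native = "D" * 5 + "-" * 20 + "D" * 35 + "-" * 40
mutant = "D" * 5 + "-" * 23 + "D" * 29 + "-" * 43
```

With the default thresholds, `native` has 5% short and 35% long disorder, whereas `mutant` has 5% short and 0% long disorder: its 29-residue region falls into the ignored intermediate regime. Passing `long_min=20`, `40` or `50` reproduces the alternative cut-offs.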
First, we observed that regions of short disorder behaved like regular secondary structure in that their content (Fig. 3B, D and F; Supplementary Fig. SOM_2D and E) and length (Supplementary Fig. SOM_2A–C) did not alter on mutation. In stark contrast, long regions with predicted disorder gradually diminished over the course of our mutation protocol (Fig. 3A, C and E; by definition, a prediction of 29 disordered residues for some mutant implies that for that mutant the long disordered region seemingly ‘disappeared’, e.g. Fig. 3E middle; Supplementary Fig. SOM_1). The loss on mutation was much more dramatic for mutations according to PAM120 (yellow in Fig. 3C) than for those according to the background distribution (green in Fig. 3C). This is understandable because disordered regions are abundant in polar residues, and these are more likely to be chosen if mutation probability is ‘skewed’ toward this abundance. Put differently, PAM120-driven mutations drifted toward sequences that resembled regular well-structured proteins and as such had no disorder, while background-driven mutations yielded sequences that were as abundant in disorder as the native wild types and therefore had many long regions with predicted disorder.
The actual numbers in terms of content of predicted long disorder decreased from ∼18% for the native to ∼9% for the final mutant by using the background mutation protocol (Fig. 3C, green). This reflected the fact that a considerable fraction of the residues in our DisProt dataset was polar: for mutations according to PAM120 (Fig. 3C, yellow) or BLOSUM62 (Supplementary Fig. SOM_7), the content dropped to 0. However, at this level of mutations, almost no single residue predicted as long disorder in the native was predicted as disorder in the mutant (Fig. 3A). For some, this might appear to PAM120.
Studies of particular mutation paths revealed that long disorder might just appear to vanish suddenly (Fig. 3E). This was partially a threshold issue: assume a region with 35 consecutive ‘disordered’ residues and assume the mutant loses three on each side (six in total); we will no longer consider this as long disorder (35 − 6 = 29 < 30). This also explains how additional mutations may recover the long disorder (Fig. 3E: after solid block of red bars, suddenly one mutant has disorder again as seen by a single bar below this block).
Another observation reflects one of the important aspects when studying short disorder: a considerable fraction of the short disorder is predicted (and observed) near the protein termini (Fig. 3F). Short disorder ‘comes and goes’ during mutation (middle region in Fig. 3F). Although this effect is biologically relevant and dominates the study of disorder in otherwise well-ordered proteins (Bordoli et al., 2007; Jin and Dunbrack, 2005), it again underlines the problem of not differentiating between long and short disorder.
Our analyses of regular secondary structure and disorder are based on very different datasets. PDB is biased in many ways (Liu and Rost, 2001); one of those biases pertains to disorder (Liu and Deber, 1999; Peng et al., 2004). One reason simply is that proteins with disordered regions pose extreme challenges to structure determination (Burley et al., 2008; Dunker et al., 2008; Graslund et al., 2008; Liu et al., 2004; Nair et al., 2009; Romier et al., 2006). To address this difference, we predicted disorder also for the dataset of well-ordered proteins from the PDB. As expected, the level of both long and short disorder for both of those was very low (Supplementary Figs SOM_3 and 4); given the lack of disorder in these proteins, we could therefore not observe any significant difference between close-to-zero in the wild type and close-to-zero in the mutants.
IUPred is arguably one of the best disorder prediction methods (Bordoli et al., 2007; Le Gall et al., 2007; Schlessinger et al., 2007b; Schlessinger et al., 2009; Shimizu et al., 2007); however, it is still only one of many and it has specific strengths and weaknesses. Therefore, we also predicted disorder with two other state-of-the-art prediction methods, namely VSL2 (Obradovic et al., 2005; Peng et al., 2006) and MD (Schlessinger et al., 2009). Although the predictions for those two differed slightly from those for IUPred, by the measures we reported here, they revealed exactly the same trend: while predicted long disorder disappeared on mutation, the content and length distribution of predicted short disorder remained largely unaffected by the mutation.
We addressed the impact of incorrect predictions by randomly introducing errors. At any significant error rate, long disorder disappeared in the native. This highlights the high prediction accuracy of today's methods. For short disorder, the added error did not alter the content over the course of our mutation protocol (Supplementary Fig. SOM_8).
As short and long disorders have different physical traits, we need length thresholds. However, we can drop these thresholds while monitoring the disappearance of disorder. Toward this end, we began with all native regions longer than N (chosen in steps of between 20 and 50), and monitored the percentage of disorder predicted after mutation irrespective of the length of the predicted regions. We found that long disordered regions indeed get decomposed into shorter ones and that disorder disappears throughout (Supplementary Figs SOM_11 and 12).
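This threshold-free monitoring can be sketched as follows: the regions are selected once in the native, and afterwards only per-residue disorder is counted in the mutant, irrespective of how the predicted regions fragment. The per-residue encoding (`D` for disordered, `-` for ordered) and the function name `retained_disorder` are assumptions of this illustration.

```python
import re

def retained_disorder(native_flags, mutant_flags, min_len):
    """Percentage of residues in native disordered regions longer than min_len
    that are still predicted as disordered in the mutant.

    Returns None if the native has no region longer than min_len.
    """
    kept = total = 0
    for match in re.finditer(r"D+", native_flags):
        start, end = match.span()
        if end - start > min_len:
            total += end - start
            kept += mutant_flags[start:end].count("D")
    return 100.0 * kept / total if total else None

native = "-" * 10 + "D" * 40 + "-" * 10
mutant = "-" * 10 + "D" * 10 + "-" * 5 + "D" * 10 + "-" * 15 + "-" * 10
retained = retained_disorder(native, mutant, min_len=30)
```

In this toy case the 40-residue native region has decomposed into two 10-residue pieces, so half of its residues are retained as disorder although no long region survives.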
4 CONCLUSIONS
We addressed the general question whether or not well-ordered regular secondary structure and disordered regions sustain random mutations. Is it likely or unlikely that any mutation affects this particular coarse-grained feature of protein structure (and, through it, function)? Do random sequences have different content in secondary structure and disorder than native proteins that have evolved to satisfy many constraints? Our analysis clearly suggests two different answers for regular secondary structure and long disorder. On the one hand, the maintenance of regular secondary structure might not be too challenging because its formation appears to be an intrinsic feature of random sequences. It, therefore, appears surprisingly likely to transit from helix to strand and back. In fact, this is exactly what we dynamically observed during the course of our mutations (Fig. 4). On the other hand, regions of long disorder do not appear to be robust under mutation. Random changes likely disrupt this feature that thereby appears volatile and unique. This has important impact on how we picture the role of long disorder in proteins: it is not ‘easy’ to acquire. Prokaryotes have only ∼10–25% of the disorder observed in multi-cellular eukaryotes (Dunker et al., 2008; Ekman et al., 2005; Liu et al., 2002; Oldfield et al., 2005; Romero et al., 2004; Schlessinger et al., 2009; Ward et al., 2004). Our observation of how volatile long disorder is provides further evidence for the importance of this feature for the transition from prokaryotes to eukaryotes.
Many SNPs that alter the protein sequence (nsSNPs) appear to be deleterious. Is this a bias in the experimental technique (more likely to be observed/reported if deleterious), or is it a genuine feature of proteins imposed by the sensitivity of protein structure to mutations? Although our work neither addresses nor answers this question, the surprising robustness of regular secondary structure might support the view that protein structure is more flexible and adaptable than the intricate details of the concert of interacting residues in protein 3D structures might suggest.
Supplementary Material
ACKNOWLEDGEMENTS
The authors would like to thank the following for valuable discussions: Zsuzsanna Dosztanyi (Eötvös Loránd University Budapest and Columbia University in the City of New York), Dietlind Gerloff (UCSC Santa Cruz), Marco Punta (Columbia University in the City of New York and TUM Munich), Reinhard Schneider (EMBL Heidelberg), Anna Tramontano (La Sapienza Rome and KAUST); the anonymous reviewers for very constructive and helpful suggestions that helped shape this work; and all those who deposit their experimental data in public databases and those who maintain these databases, in particular those who contribute to PDB and DisProt.
Funding: National Institute of General Medical Sciences (NIGMS) at the National Institutes of Health (NIH) (grant number R01-GM079767).
Conflict of Interest: none declared.
REFERENCES
- Abagyan RA, Batalov S. Do aligned sequences share the same fold? J. Mol. Biol. 1997;273:355–368. doi: 10.1006/jmbi.1997.1287.
- Alexov EG, Gunner MR. Incorporating protein conformational flexibility into the calculation of pH-dependent protein properties. Biophys. J. 1997;72:2075–2093. doi: 10.1016/S0006-3495(97)78851-9.
- Andersen CAF, et al. Continuum secondary structure captures protein flexibility. Structure. 2002;10:175–184. doi: 10.1016/s0969-2126(02)00700-1.
- Anfinsen CB, Scheraga HA. Experimental and theoretical aspects of protein folding. Adv. Prot. Chem. 1975;29:205–300. doi: 10.1016/s0065-3233(08)60413-1.
- Benner SA, et al. Bona fide predictions of protein secondary structure using transparent analyses of multiple sequence alignments. Chem. Rev. 1997;97:2725–2844. doi: 10.1021/cr940469a.
- Berman H, et al. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242. doi: 10.1093/nar/28.1.235.
- Bordoli L, et al. Assessment of disorder predictions in CASP7. Prot. Struct. Funct. Genet. 2007;69(Suppl. 8):129–136. doi: 10.1002/prot.21671.
- Burley SK, et al. Contributions to the NIH-NIGMS protein structure initiative from the PSI production centers. Structure. 2008;16:5–11. doi: 10.1016/j.str.2007.12.002.
- Cavasotto CN, Abagyan RA. Protein flexibility in ligand docking and virtual screening to protein kinases. J. Mol. Biol. 2004;337:209–225. doi: 10.1016/j.jmb.2004.01.003.
- Chothia C, Lesk AM. The use of sequence homologies to predict protein structures. In: Robert F, Mark Z, editors. Computer Graphics and Molecular Modeling. New York: Cold Spring Harbor Laboratory; 1986. pp. 33–37.
- Chung SY, Subbiah S. A structural explanation for the twilight zone of protein sequence homology. Structure. 1996;4:1123–1127. doi: 10.1016/s0969-2126(96)00119-0.
- Claussen H, et al. FlexE: efficient molecular docking considering protein structure variations. J. Mol. Biol. 2001;308:377–395. doi: 10.1006/jmbi.2001.4551.
- Daniel RM, et al. The role of dynamics in enzyme activity. Annu. Rev. Biophys. Biomol. Struct. 2003;32:69–92. doi: 10.1146/annurev.biophys.32.110601.142445.
- Dayhoff MO. Atlas of Protein Sequence and Structure. Silver Spring, MD: National Biomedical Research Foundation; 1978. pp. 345–358.
- Dill KA. Folding proteins: finding a needle in a haystack. Curr. Opin. Struct. Biol. 1993;3:99–103.
- Dosztányi Z, et al. The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins. J. Mol. Biol. 2005;347:827–839. doi: 10.1016/j.jmb.2005.01.071.
- Dunker AK, Obradovic Z. The protein trinity-linking function and disorder. Nat. Biotechnol. 2001;19:805–806. doi: 10.1038/nbt0901-805.
- Dunker AK, et al. Function and structure of inherently disordered proteins. Curr. Opin. Struct. Biol. 2008;18:756–764. doi: 10.1016/j.sbi.2008.10.002.
- Ekman D, et al. Multi-domain proteins in the three kingdoms of life: orphan domains and other unassigned regions. J. Mol. Biol. 2005;348:231–243. doi: 10.1016/j.jmb.2005.02.007.
- Graslund S, et al. Protein production and purification. Nat. Methods. 2008;5:135–146. doi: 10.1038/nmeth.f.202.
- Gu J, et al. Wiggle-predicting functionally flexible regions from primary sequence. PLoS Comput. Biol. 2006;2:e90. doi: 10.1371/journal.pcbi.0020090.
- Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA. 1992;89:10915–10919. doi: 10.1073/pnas.89.22.10915.
- Jin Y, Dunbrack RL Jr. Assessment of disorder predictions in CASP6. Proteins. 2005;61(Suppl. 7):167–175. doi: 10.1002/prot.20734.
- Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983;22:2577–2637. doi: 10.1002/bip.360221211.
- Karplus M, Petsko GA. Molecular dynamics simulations in biology. Nature. 1990;347:631–639. doi: 10.1038/347631a0. [DOI] [PubMed] [Google Scholar]
- Le Gall T, et al. Intrinsic disorder in the Protein Data Bank. J. Biomol. Struct. Dyn. 2007;24:325–342. doi: 10.1080/07391102.2007.10507123. [DOI] [PubMed] [Google Scholar]
- Levitt M, Chothia C. Structural patterns in globular proteins. Nature. 1976;261:552–558. doi: 10.1038/261552a0. [DOI] [PubMed] [Google Scholar]
- Levitt M, Warshel A. Computer simulation of protein folding. Nature. 1975;253:694–698. doi: 10.1038/253694a0. [DOI] [PubMed] [Google Scholar]
- Liu J, et al. Automatic target selection for structural genomics on eukaryotes. Prot. Struct., Funct., Bioinform. 2004;56:188–200. doi: 10.1002/prot.20012. [DOI] [PubMed] [Google Scholar]
- Liu J, Rost B. Comparing function and structure between entire proteomes. Protein Sci. 2001;10:1970–1979. doi: 10.1110/ps.10101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Liu J, et al. Loopy proteins appear conserved in evolution. J. Mol. Biol. 2002;322:53–64. doi: 10.1016/s0022-2836(02)00736-2. [DOI] [PubMed] [Google Scholar]
- Liu LP, Deber CM. Combining hydrophobicity and helicity: a novel approach to membrane protein structure prediction. Bioorg. Med. Chem. 1999;7:1–7. doi: 10.1016/s0968-0896(98)00233-8. [DOI] [PubMed] [Google Scholar]
- Liwo A, et al. Protein structure prediction by global optimization of a potential energy function. Proc. Natl Acad. Sci. USA. 1999;96:5482–5485. doi: 10.1073/pnas.96.10.5482. [DOI] [PMC free article] [PubMed] [Google Scholar]
- McGill R, et al. Variations of box plots. Am Statistician. 1978;32:12–16. [Google Scholar]
- Mika S, Rost B. UniqueProt: creating representative protein sequence sets. Nucleic Acids Res. 2003;31:3789–3791. doi: 10.1093/nar/gkg620.
- Morea V, et al. Protein structure prediction and design. Biotechnol. Annu. Rev. 1998;4:177–214. doi: 10.1016/s1387-2656(08)70070-x.
- Morea V, et al. Antibody modeling: implications for engineering and design. Methods. 2000;20:267–279. doi: 10.1006/meth.1999.0921.
- Murzin AG, et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 1995;247:536–540. doi: 10.1006/jmbi.1995.0159.
- Nair R, et al. Structural genomics is the largest contributor of novel structural leverage. J. Struct. Funct. Genomics. 2009;10:181–191. doi: 10.1007/s10969-008-9055-6.
- Obradovic Z, et al. Exploiting heterogeneous sequence properties improves prediction of protein disorder. Proteins. 2005;61(Suppl. 7):176–182. doi: 10.1002/prot.20735.
- Oldfield CJ, et al. Comparing and combining predictors of mostly disordered proteins. Biochemistry. 2005;44:1989–2000. doi: 10.1021/bi047993o.
- Pauling L, Corey RB. Configurations of polypeptide chains with favored orientations around single bonds: two new pleated sheets. Proc. Natl Acad. Sci. USA. 1951a;37:729–740. doi: 10.1073/pnas.37.11.729.
- Pauling L, Corey RB. The pleated sheet, a new layer configuration of polypeptide chains. Proc. Natl Acad. Sci. USA. 1951b;37:251–256. doi: 10.1073/pnas.37.5.251.
- Peng K, et al. Exploring bias in the Protein Data Bank using contrast classifiers. Pac. Symp. Biocomput. 2004;9:435–446. doi: 10.1142/9789812704856_0041.
- Peng K, et al. Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics. 2006;7:208. doi: 10.1186/1471-2105-7-208.
- Pettersen EF, et al. UCSF Chimera: a visualization system for exploratory research and analysis. J. Comput. Chem. 2004;25:1605–1612. doi: 10.1002/jcc.20084.
- Radivojac P, et al. Protein flexibility and intrinsic disorder. Protein Sci. 2004;13:71–80. doi: 10.1110/ps.03128904.
- Reva BA, et al. Constructing lattice models of protein chains with side groups. J. Comput. Biol. 1995;2:527–535. doi: 10.1089/cmb.1995.2.527.
- Romero P, et al. Natively disordered proteins: functions and predictions. Appl. Bioinform. 2004;3:105–113. doi: 10.2165/00822942-200403020-00005.
- Romier C, et al. Co-expression of protein complexes in prokaryotic and eukaryotic hosts: experimental procedures, database tracking and case studies. Acta Crystallogr. D Biol. Crystallogr. 2006;62:1232–1242. doi: 10.1107/S0907444906031003.
- Rost B. PHD: predicting one-dimensional protein structure by profile based neural networks. Methods Enzymol. 1996;266:525–539. doi: 10.1016/s0076-6879(96)66033-9.
- Rost B. Twilight zone of protein sequence alignments. Protein Eng. 1999;12:85–94. doi: 10.1093/protein/12.2.85.
- Rost B. How to use protein 1-D structure predicted by PROFphd. In: Walker JM, editor. The Proteomics Protocols Handbook. Humana Press; 2005. pp. 875–901.
- Rost B, Sander C. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. 1993;232:584–599. doi: 10.1006/jmbi.1993.1413.
- Rost B, et al. Redefining the goals of protein secondary structure prediction. J. Mol. Biol. 1994;235:13–26. doi: 10.1016/s0022-2836(05)80007-5.
- Rost B, et al. Protein fold recognition by prediction-based threading. J. Mol. Biol. 1997;270:471–480. doi: 10.1006/jmbi.1997.1101.
- Sander C, Schneider R. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins. 1991;9:56–68. doi: 10.1002/prot.340090107.
- Schlessinger A, et al. Natively unstructured loops differ from other loops. PLoS Comput. Biol. 2007a;3:e140. doi: 10.1371/journal.pcbi.0030140.
- Schlessinger A, et al. Natively unstructured regions in proteins identified from contact predictions. Bioinformatics. 2007b;23:2376–2384. doi: 10.1093/bioinformatics/btm349.
- Schlessinger A, et al. Improved disorder prediction by combination of orthogonal approaches. PLoS ONE. 2009;4:e4433. doi: 10.1371/journal.pone.0004433.
- Schlessinger A, et al. PROFbval: predict flexible and rigid residues in proteins. Bioinformatics. 2006;22:891–893. doi: 10.1093/bioinformatics/btl032.
- Shimizu K, et al. Predicting mostly disordered proteins by using structure-unknown protein data. BMC Bioinformatics. 2007;8:78. doi: 10.1186/1471-2105-8-78.
- Sippl MJ. Boltzmann's principle, knowledge based mean fields and protein folding. An approach to the computational determination of protein structures. J. Comput.-Aided Mol. Des. 1993;7:473–501. doi: 10.1007/BF02337562.
- Tukey JW. Exploratory Data Analysis. Reading, MA: Addison-Wesley Pub. Co.; 1977.
- Uversky VN. Protein folding revisited. A polypeptide chain at the folding-misfolding-nonfolding cross-roads: which way to go? Cell. Mol. Life Sci. 2003;60:1852–1871. doi: 10.1007/s00018-003-3096-6.
- Vucetic S, et al. DisProt: a database of protein disorder. Bioinformatics. 2005;21:137–140. doi: 10.1093/bioinformatics/bth476.
- Ward JJ, et al. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J. Mol. Biol. 2004;337:635–645. doi: 10.1016/j.jmb.2004.02.002.
- Wright PE, Dyson HJ. Linking folding and binding. Curr. Opin. Struct. Biol. 2009;19:31–38. doi: 10.1016/j.sbi.2008.12.003.
- Yue P, et al. Loss of protein structure stability as a major causative factor in monogenic disease. J. Mol. Biol. 2005;353:459–473. doi: 10.1016/j.jmb.2005.08.020.
- Yue P, et al. SNPs3D: candidate gene and SNP selection for association studies. BMC Bioinformatics. 2006;7:166. doi: 10.1186/1471-2105-7-166.