Sci Rep. 2024 Oct 29;14(1):25886.
doi: 10.1038/s41598-024-77056-8.

Positions of cysteine residues reveal local clusters and hidden relationships to Sequons and Transmembrane domains in Human proteins


Manthan Desai et al. Sci Rep. 2024.

Abstract

Membrane proteins often possess critical structural features, such as transmembrane domains (TMs), N-glycosylation, and disulfide bonds (SS bonds), which are essential to their structure and function. Here, we extend the study of the motifs carrying N-glycosylation, i.e., the sequons, and the Cys residues supporting the SS bonds, to the whole human proteome, with a particular focus on the Cys positions in human proteins with respect to those of sequons and TMs. Cys is the least abundant amino acid residue in protein sequences, and the positions of Cys residues in proteins are not random but rather selected through evolution. We discovered that the frequency of Cys residues in proteins is length-dependent, and that the frequency of CC gaps formed between adjacent Cys residues can be used as a classifier to distinguish proteins with special structures and functions, such as keratin-associated proteins (KAPs), extracellular proteins with EGF-like domains, and nuclear proteins with zinc finger C2H2 domains. Most importantly, by comparing the positions of Cys residues to those of sequons and TMs, we discovered that these structural features can form dense clusters in highly repeated and mutually exclusive modalities in protein sequences. The evolutionary advantages of such complementarity among the three structural features are discussed, particularly in light of the structural dynamics in proteins that are missing from computational predictions. The discoveries made here highlight the sequence-structure-function axis in biological organisms, which can be utilized in future protein engineering toward synthetic biology.
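To make the position-based comparison in the abstract more concrete, the following is a minimal sketch of how Cys and sequon positions might be extracted from a sequence. It assumes the canonical sequon definition N-X-S/T with X not proline; the paper may use a broader or narrower definition. The function names and the toy sequence are hypothetical and do not come from the authors' pipeline.

```python
import re

# Assumption: a sequon is N-X-S/T where X is any residue except proline.
# The lookahead lets overlapping motifs (e.g. N-N-S-S) each be reported.
SEQUON = re.compile(r"N(?=[^P][ST])")


def cys_positions(seq):
    """Return the 1-based positions of every Cys (C) in the sequence."""
    return [i + 1 for i, aa in enumerate(seq) if aa == "C"]


def sequon_positions(seq):
    """Return the 1-based position of the Asn that starts each sequon."""
    return [m.start() + 1 for m in SEQUON.finditer(seq)]


# Hypothetical toy sequence, not a real protein.
toy = "MCNGTACNPSCCNNSSK"
print(cys_positions(toy))     # [2, 7, 11, 12]
print(sequon_positions(toy))  # [3, 13, 14]
```

Note that the Asn at position 8 is skipped because it is followed by proline. Transmembrane domains are not covered here, since they are predicted rather than read directly from the sequence.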

Keywords: Cysteine residues; Disulfide bonds; N-glycosylation; Posttranslational modifications; Protein sequence; Protein structure and function; Sequons; Transmembrane domains.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Protein length distribution of cysteine-containing proteins (A) and cysteine-free proteins (B) in the human proteome, as well as (C) the UniProt keyword enrichment analysis of the cysteine-free proteins, in which the x axis displays the -log10 transformed enrichment p value.
Fig. 2
Count and density analysis of Cys residues in human proteins. (A) Distribution of the average fragment length of a Cys residue in a protein with a bin width of 10. The open bars represent proteins with average fragment lengths shorter than 60 residues, and the filled bars represent the remaining cysteine-containing proteins. (B) Distribution of Cys counts in proteins with a bin width of 5 counts. The filled bars indicate proteins with fewer than 15 Cys residues (C-low proteins); the rest are C-high proteins. The pie chart displays the proportions and percentages of C-free, C-low, and C-high proteins in the human proteome. (C) Feature counts and average protein length in the first 10 bins in Fig. 2A. The analyzed features included cysteine residues, annotated disulfide bonds, sequons, and predicted transmembrane domains. (D) Expanded view of Fig. 2B with a bin width of a single Cys count.
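The grouping described in this caption can be sketched as follows. The threshold of 15 Cys residues for C-low versus C-high comes from the caption; the definition of average fragment length as protein length divided by Cys count is an assumption, since the caption does not spell it out. Function names are hypothetical.

```python
def average_fragment_length(seq):
    """Protein length divided by Cys count (assumed definition).

    Returns None for cysteine-free sequences, where the ratio is undefined.
    """
    n_cys = seq.count("C")
    return len(seq) / n_cys if n_cys else None


def cys_class(seq, threshold=15):
    """Label a sequence as C-free, C-low (< threshold Cys), or C-high."""
    n_cys = seq.count("C")
    if n_cys == 0:
        return "C-free"
    return "C-low" if n_cys < threshold else "C-high"


# Hypothetical toy sequences, not real proteins.
for seq in ("MKTAYIAKQR", "MCKAAAAAAA", "MKCAAC" * 10):
    print(cys_class(seq), average_fragment_length(seq))
# C-free None
# C-low 10.0
# C-high 3.0
```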
Fig. 3
Interpro-annotated SCO-spondin domains are shown, and the red boxes highlight the regions containing free cysteine residues. The length of each highlighted region is labeled above the corresponding box. Only 2 out of 357 free cysteine residues are outside of these boxed regions.
Fig. 4
Distribution of CC gap frequency and the corresponding predicted structures for selected C-dense and C-rich proteins. A & B, Keratin-associated protein 28-5 (KAP); C & D, Metallothionein-1B (MT); E & F, Late cornified envelope protein 2A (LCE); G & H, SCO-spondin.
Fig. 5
Distributions of gap/loop lengths among adjacent cysteine residues and disulfide (SS) bonds. A, Distance between two cysteine residues forming an SS bond (SS loop); B, distance between two adjacent SS bonds (SS gap); C, absolute SS gap; D, distance between two adjacent cysteine residues (CC gap).
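The four distances named in this caption can be illustrated with a short sketch. The exact conventions are assumptions: distances are taken as differences between 1-based residue positions, the SS gap is measured from the second Cys of one bond to the first Cys of the next bond (so it is negative when bonds overlap or nest), and the absolute SS gap is taken as the absolute value of that number. The paper may define these quantities differently. All names and inputs are hypothetical.

```python
def ss_loops(bonds):
    """Distance between the two Cys residues of each SS bond."""
    ordered = sorted(tuple(sorted(b)) for b in bonds)
    return [j - i for i, j in ordered]


def ss_gaps(bonds):
    """Signed distance from the end of one bond to the start of the next.

    Assumed convention: negative values mean the two bonds overlap or nest.
    """
    ordered = sorted(tuple(sorted(b)) for b in bonds)
    return [nxt[0] - prev[1] for prev, nxt in zip(ordered, ordered[1:])]


def absolute_ss_gaps(bonds):
    """Absolute value of each SS gap (assumed meaning of 'absolute SS gap')."""
    return [abs(g) for g in ss_gaps(bonds)]


def cc_gaps(seq):
    """Distance between each pair of adjacent Cys residues in the sequence."""
    pos = [i + 1 for i, aa in enumerate(seq) if aa == "C"]
    return [b - a for a, b in zip(pos, pos[1:])]


# Hypothetical disulfide annotations as (Cys position, Cys position) pairs.
bonds = [(5, 20), (12, 30), (40, 55)]
print(ss_loops(bonds))          # [15, 18, 15]
print(ss_gaps(bonds))           # [-8, 10]
print(absolute_ss_gaps(bonds))  # [8, 10]
print(cc_gaps("ACCAAC"))        # [1, 3]
```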
Fig. 6
Clustering of C-high proteins (open bars in Fig. 2B) and their corresponding frequencies of CC gaps. (A) Comparison of unsupervised hierarchical clustering with CC gaps progressively reduced from a gap length of 100 to 10, benchmarked by the EGF-like domain highlighted by the white circle. Quantitative evaluation of the clustering efficiency is represented by the pie chart above each corresponding heatmap. (B) 3D principal component analysis of the CC gap frequency considering 25 CC gaps. Cluster 1 includes Gaps 1 and 4; Cluster 2 includes Gaps 3 and 25; Cluster 3 includes Gaps 2, 5–7, and 9; and Cluster 4 includes the remaining gaps. (C) The corresponding 2D PCA of proteins in the analysis of 25 CC gaps. Green indicates proteins with zinc finger C2H2 domains, blue indicates proteins with EGF-like domains, red indicates KAP proteins, and gray indicates the remaining proteins.
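A per-protein feature vector of the kind that could feed such a clustering or PCA might be built as below. This is a sketch under stated assumptions: the 25 CC gaps are taken to be gap lengths 1 through 25, a gap is the difference between adjacent Cys positions, and frequencies are normalized by the total number of CC gaps in the protein. The authors' actual feature construction and normalization may differ, and the function name is hypothetical.

```python
from collections import Counter


def cc_gap_frequencies(seq, max_gap=25):
    """Return the fraction of CC gaps at each length 1..max_gap.

    Assumption: normalization uses all CC gaps in the sequence, including
    those longer than max_gap, so the vector may sum to less than 1.
    """
    pos = [i for i, aa in enumerate(seq) if aa == "C"]
    gaps = [b - a for a, b in zip(pos, pos[1:])]
    counts = Counter(g for g in gaps if g <= max_gap)
    total = len(gaps)
    return [counts[g] / total if total else 0.0 for g in range(1, max_gap + 1)]


# Hypothetical toy sequence: CC gaps are 1, 2, 1, 3.
vec = cc_gap_frequencies("CCACCAAC")
print(vec[:3])  # [0.5, 0.25, 0.25]
```

Stacking one such vector per protein gives a matrix with 25 columns, which is the shape a hierarchical clustering or PCA routine would typically expect.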
Fig. 7
Comparisons of sequon and transmembrane domains (TMs) for verification of the observations of cysteine residues in terms of the lengths of fragments (A & B), gaps (E and G), and loops (F), corresponding feature counts (C & D) and enriched protein functions (H). The height of the bar denotes the population percentage of the specified proteins relative to the total proteins analyzed. KAP, keratin-associated protein; Ig, immunoglobulin; FT III, fibronectin type III; LRR, leucine-rich repeat; MFS, major facilitator superfamily.
