PLoS Comput Biol. 2023 Nov 27;19(11):e1011655.
doi: 10.1371/journal.pcbi.1011655. eCollection 2023 Nov.

GENERALIST: A latent space based generative model for protein sequence families

Hoda Akl et al. PLoS Comput Biol. 2023.

Abstract

Generative models of protein sequence families are an important tool in the repertoire of protein scientists and engineers alike. However, state-of-the-art generative approaches face inference, accuracy, and overfitting-related obstacles when modeling moderately sized to large proteins and/or protein families with low sequence coverage. Here, we present a simple-to-learn, tunable, and accurate generative model, GENERALIST: GENERAtive nonLInear tenSor-factorizaTion for protein sequences. GENERALIST accurately captures several high-order summary statistics of amino acid covariation. GENERALIST also predicts conservative locally optimal sequences that are likely to fold into stable 3D structures. Importantly, unlike current methods, the density of sequences in GENERALIST-modeled sequence ensembles closely resembles that of the corresponding natural ensembles. Finally, GENERALIST embeds protein sequences in an informative latent space. GENERALIST will be an important tool to study protein sequence variability.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Schematic of the GENERALIST approach.
Sequences are modeled as arising from sequence-specific Gibbs-Boltzmann distributions over categorical variables. The inferred probabilities are used to generate new sequences.
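The generation step described in this caption can be sketched as follows. This is a minimal illustration, assuming each position carries its own vector of energies and that residues are drawn independently per position once those energies are fixed; the energy table `position_energies`, the helper names, and the three-letter alphabet are hypothetical and not taken from the paper.

```python
import math
import random


def gibbs_boltzmann(energies):
    """Convert energies into probabilities p_a proportional to exp(-E_a)."""
    lowest = min(energies)  # shift energies for numerical stability
    weights = [math.exp(-(e - lowest)) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]


def sample_sequence(position_energies, alphabet, rng):
    """Draw one residue per position from that position's categorical distribution."""
    residues = []
    for energies in position_energies:
        probs = gibbs_boltzmann(energies)
        residues.append(rng.choices(alphabet, weights=probs, k=1)[0])
    return "".join(residues)


# Hypothetical toy input: 4 positions over a 3-letter alphabet.
alphabet = ["A", "C", "D"]
position_energies = [
    [0.0, 2.0, 2.0],
    [1.0, 0.0, 3.0],
    [2.0, 2.0, 0.0],
    [0.5, 0.5, 0.5],
]
rng = random.Random(0)
generated = [sample_sequence(position_energies, alphabet, rng) for _ in range(5)]
```

Lower energy means higher probability, so position 1 in this toy table favors `A`, while position 4 is uniform over the alphabet.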
Fig 2. Optimization for the latent dimension for BPTI (left panel) and EGFR (right panel).
The optimized value $\Delta^2 = \left(\langle H_{\min}\rangle_{\mathrm{generated}} - \langle H_{\min}\rangle_{\mathrm{natural}}\right)^2$ (y-axis) is plotted against the latent dimension $K$ (x-axis). For each sequence in an ensemble (natural or generated), $H_{\min}$ is the minimum fractional Hamming distance to the natural sequences. Each box plot represents 10 runs, each with a different random initialization, for a given latent dimension. The optimal latent dimension is the one that minimizes the average value of $\Delta^2$ across the 10 runs. The optimal dimension is 42 for BPTI and 19 for EGFR.
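The selection criterion in this caption can be sketched in a few lines. This is an illustrative sketch, not the authors' code: it assumes that the fractional Hamming distance is the number of mismatched positions divided by the alignment length, and that a natural sequence is excluded from its own nearest-neighbor search (otherwise its minimum distance would trivially be zero). All function names and the toy data are hypothetical.

```python
def fractional_hamming(a, b):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same aligned length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def h_min(query, natural, skip_index=None):
    """Minimum fractional Hamming distance from query to the natural sequences."""
    return min(
        fractional_hamming(query, seq)
        for i, seq in enumerate(natural)
        if i != skip_index  # assumption: skip self-comparison for natural queries
    )


def mean_h_min(ensemble, natural, is_natural=False):
    """Ensemble average of H_min."""
    values = [
        h_min(seq, natural, skip_index=i if is_natural else None)
        for i, seq in enumerate(ensemble)
    ]
    return sum(values) / len(values)


def delta_squared(generated, natural):
    """Squared difference between the generated and natural averages of H_min."""
    gen_mean = mean_h_min(generated, natural)
    nat_mean = mean_h_min(natural, natural, is_natural=True)
    return (gen_mean - nat_mean) ** 2


def best_latent_dimension(delta2_by_k):
    """Pick the K whose run-averaged delta-squared is smallest."""
    return min(delta2_by_k, key=lambda k: sum(delta2_by_k[k]) / len(delta2_by_k[k]))


# Hypothetical toy data.
natural = ["AAAA", "AAAC", "CCCC"]
generated = ["AAAD", "CCCA"]
d2 = delta_squared(generated, natural)

# Hypothetical per-run values for three candidate latent dimensions.
chosen_k = best_latent_dimension({10: [0.04, 0.06], 20: [0.01, 0.02], 30: [0.03, 0.01]})
```

In the toy data the generated sequences sit closer to the natural set (mean 0.25) than the natural sequences sit to each other (mean 5/12), which yields a nonzero value; a well-tuned latent dimension would bring the two averages together.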
Fig 3
Panels A and B. Comparison of amino acid frequencies and cumulants up to order 4 calculated from GENERALIST-generated in silico ensembles (y-axis) and from the natural sequences (x-axis) for BPTI (panel A) and EGFR (panel B). Panels C and D. 1 − Pearson correlation coefficient versus 1 − slope of the best-fit line for the comparison of amino acid frequencies and cumulants up to order 4, shown for GENERALIST, ArDCA, adabmDCA, and VAEs for BPTI (panel C) and EGFR (panel D). Panels E and F. The Pearson correlation coefficient between frequencies of the top 20 amino acid combinations of order n (x-axis), averaged across different combinations (y-axis), shown for GENERALIST, ArDCA, adabmDCA, and VAEs for BPTI (panel E) and EGFR (panel F).
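The two summary numbers plotted in panels C and D can be computed as in the sketch below. It assumes an ordinary least-squares line with an intercept, with natural statistics on the x-axis and generated statistics on the y-axis as in panels A and B; the function name and the toy values are hypothetical.

```python
import math


def pearson_and_slope(x, y):
    """Return the Pearson correlation and the least-squares slope of y against x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    return r, slope


# Hypothetical toy statistics: the generated values track the natural ones
# perfectly in rank and shape but are half as large.
natural_stats = [0.1, 0.2, 0.3, 0.4]
generated_stats = [0.05, 0.10, 0.15, 0.20]
r, slope = pearson_and_slope(natural_stats, generated_stats)
point = (1 - r, 1 - slope)  # coordinates in a plot like panels C and D
```

The toy case shows why both axes are useful: the correlation is perfect, yet the slope of 0.5 reveals that the generated statistics underestimate the natural magnitudes. A model that reproduces the statistics exactly would sit at the origin.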
Fig 4
Panel A. The distribution of distances to the nearest natural neighbor for sequences optimized using probabilities modeled by GENERALIST, ArDCA, and adabmDCA. Panel B. The log-fold improvement in probability between the starting sequence and the local optimum. Panel C. Sequence-averaged pLDDT scores of AlphaFold2-predicted structures for the locally optimal sequences from all models and for the starting natural sequences, labeled “MSA”.
Fig 5
Panels A and B. Distribution of fractional Hamming distances between random pairs of sequences within an ensemble, shown as the fraction of pairs (y-axis) for which the within-ensemble Hamming distance $h_{\mathrm{rand}}$ is greater than or equal to the value $h$ (x-axis). Panel A: BPTI, Panel B: EGFR. Panels C and D. Distribution of fractional Hamming distances to the closest sequence within an ensemble for different models, shown as the fraction of sequences (y-axis) for which the within-ensemble minimum Hamming distance $h_{\min}$ is greater than or equal to the value $h$ (x-axis). Panel C: BPTI, Panel D: EGFR. Panels E and F. Distribution of fractional Hamming distances to the closest natural sequence for different models, shown as the fraction of sequences (y-axis) for which the minimum Hamming distance $h_{\min}$ to the MSA is greater than or equal to the value $h$ (x-axis). Panel E: BPTI, Panel F: EGFR.
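Each curve in this figure is a complementary cumulative distribution: for a threshold on the x-axis, the y-value is the fraction of distances at or above that threshold. A minimal sketch, with hypothetical function names and toy distances:

```python
def fraction_at_least(distances, h):
    """Fraction of distances that are greater than or equal to h."""
    return sum(d >= h for d in distances) / len(distances)


def survival_curve(distances, thresholds):
    """List of (h, fraction of distances >= h) pairs, one per threshold."""
    return [(h, fraction_at_least(distances, h)) for h in thresholds]


# Hypothetical toy distances, e.g. minimum fractional Hamming distances to an MSA.
toy_distances = [0.0, 0.1, 0.1, 0.3, 0.5]
curve = survival_curve(toy_distances, [0.0, 0.1, 0.2, 0.4, 0.6])
```

The curve starts at 1 for a threshold of 0 and decreases monotonically; an ensemble whose curve drops faster than the natural one is more tightly packed than the natural sequences.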
Fig 6. Statistics of the cluster assignments for BPTI.
Pearson correlation coefficient between higher order statistics of amino acid occurrences in sequences belonging to the two clusters. Cluster sizes are 12646 and 3923.
