Nat Commun. 2024 Oct 30;15(1):9392.
doi: 10.1038/s41467-024-53759-4.

A long-context language model for deciphering and generating bacteriophage genomes

Bin Shao et al. Nat Commun. 2024.

Abstract

Inspired by the success of large language models (LLMs), we develop a long-context generative model for genomes. Our multiscale transformer model, megaDNA, is pre-trained on unannotated bacteriophage genomes with nucleotide-level tokenization. We demonstrate the foundational capabilities of our model including the prediction of essential genes, genetic variant effects, regulatory element activity and taxonomy of unannotated sequences. Furthermore, it generates de novo sequences up to 96 K base pairs, which contain potential regulatory elements and annotated proteins with phage-related functions.
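The abstract notes that megaDNA is pre-trained with nucleotide-level tokenization, i.e. one token per base rather than k-mer or BPE chunks. A minimal sketch of what that means in practice is below; the vocabulary and helper names are illustrative assumptions, not the model's actual tokenizer specification (which may include special or ambiguity tokens).

```python
# Minimal sketch of nucleotide-level tokenization (one token per base).
# VOCAB is an assumed mapping; the real model's vocabulary may differ.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}

def tokenize(seq: str) -> list[int]:
    """Map each nucleotide to an integer token id, one id per base."""
    return [VOCAB[base] for base in seq.upper()]

def detokenize(ids: list[int]) -> str:
    """Invert the mapping to recover the nucleotide string."""
    inv = {i: b for b, i in VOCAB.items()}
    return "".join(inv[i] for i in ids)

tokens = tokenize("atgGCT")
assert detokenize(tokens) == "ATGGCT"  # lossless round trip
```

Because every base is its own token, sequence length in tokens equals sequence length in base pairs, which is why a ~96 kb genome requires a long-context architecture.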


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Foundational capacities of our language model.
a Overview of the model applications. b In silico mutagenesis analysis to identify essential genes in the bacteriophage genome. c Model loss variation across the lambda phage genome in the mutagenesis analysis. Upper: essential and non-essential genes in the genome. Lower: changes in model loss for 50 bp non-overlapping windows across the genome (blue). The step size is 50 bp, and moving averages of model loss across 5000 bp windows are denoted in red. d Zero-shot prediction of essential genes by calculating the effects of mutations in the gene coding region (blue), start codon (orange) and stop codon (green). Area under the ROC curve (AUROC) scores are reported. e Prediction of mutational effects on protein functions using model embeddings. f Prediction of mutational effects for the deep mutational scanning experiment of the infA gene. Spearman correlation coefficients of the predicted and reported fitness from fivefold cross-validation tests are reported (blue: megaDNA; gray: DeepSequence). n is the number of training samples. g Prediction of the impacts of Single Nucleotide Polymorphisms (SNPs) in the T7 bacteriophage genome. Spearman correlation of the predicted and reported fitness from fivefold cross-validation tests is reported. h Prediction of regulatory element activity using model embeddings. i Prediction of translation efficiencies for non-model organisms and high-throughput gene expression libraries. For K. oxytoca, P. protegens, and E. coli DH10B, we evaluated the model performance on endogenous genes. Fivefold cross-validation tests were used for all calculations. j Classifying taxonomies of unannotated sequences using model embeddings. k UMAP visualization of model embeddings for sequences from bacteriophages, bacteria, and archaea (model middle layer, sample size: n = 5000 per group). For f, g, and i, data are presented as mean values ± SEM from fivefold cross-validation tests (n = 5 folds). Source data are provided as a Source Data file.
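Panel c describes aggregating per-position loss changes into 50 bp non-overlapping windows and then smoothing with a moving average over 5000 bp. A sketch of that aggregation step is below; the per-position scores here are random placeholders standing in for the megaDNA model's loss changes, and the function names are mine, not the authors'.

```python
# Sketch of the windowed aggregation in Fig. 1c: per-position loss changes
# -> mean per 50 bp non-overlapping window -> moving average for smoothing.
# delta_loss is a placeholder; the paper derives it from model loss under
# in silico mutagenesis.
import random

def window_scores(delta_loss, window=50):
    """Mean loss change per non-overlapping window of `window` positions."""
    return [sum(delta_loss[i:i + window]) / window
            for i in range(0, len(delta_loss) - window + 1, window)]

def moving_average(values, span):
    """Moving average over up to `span` neighboring values (edge-clipped)."""
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - span // 2), min(len(values), i + span // 2 + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

random.seed(0)
delta_loss = [random.random() for _ in range(5000)]  # placeholder genome scores
windows = window_scores(delta_loss)        # 100 windows of 50 bp each
smooth = moving_average(windows, span=100)  # ~5000 bp smoothing, in window units
```

Windows with large positive loss changes under mutation are the candidates scored as essential in panel d.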
Fig. 2
Fig. 2. Language model generates sequences with functional genomic structures.
a The workflow schematic. b Comparison of gene length distributions between randomly sampled subsets of predicted genes in generated sequences and training dataset (sample size: n = 2000). Two-sided Kolmogorov–Smirnov test: p value = 0.15. c Comparison of the predicted virus scores for generated sequences (sample size: n = 1024) and the training dataset (sample size: n = 99,429). Median virus scores are indicated by white dots. Black bars denote the interquartile ranges (25th to 75th percentiles). d Predicted taxonomy for the generated sequences predicted as viral. Only taxonomies with >1 sequence are shown. geNomad was used to produce results in (c, d). e Functional annotation of a selected sequence fragment (generated sequence #87). f Predicted promoter activity for all the 5′UTRs in the generated sequence #87 (sample size: n = 44), along with the promoter activity of the random sequences with the same length. Promoter activities were calculated using the Promoter Calculator. Two-sided Kolmogorov–Smirnov test: p value = 6.3 × 10−8. g Proportions of adenine (A) and guanine (G) nucleotides preceding the start codon of all the predicted genes in the generated sequence #87. h Mean predicted pLDDT scores for the ESMFold predicted structures. We focused on proteins with geNomad markers from generated sequences (sample size: n = 343; median value: 28) against random peptide sequences of the same lengths (sample size: n = 343; median value: 18). A sample generated protein is shown on the right. Two-sided Kolmogorov–Smirnov test: p value = 6.7 × 10−42. i Top 10 predicted functions of proteins derived from the generated sequences, as identified by phold. Source data are provided as a Source Data file.
