MMseqs software suite for fast and deep clustering and searching of large protein sequence sets

Bioinformatics. 2016 May 1;32(9):1323-30.

doi: 10.1093/bioinformatics/btw006. Epub 2016 Jan 6.

Maria Hauser¹, Martin Steinegger², Johannes Söding³

Affiliations

¹ Gene Center, Ludwig-Maximilians-Universität München, Munich 81377, Germany.
² Gene Center, Ludwig-Maximilians-Universität München, Munich 81377, Germany, Computational Biology, Max Planck Institute for Biophysical Chemistry, Göttingen 37077, Germany and TUM, Department of Informatics, Bioinformatics & Computational Biology-I12, Garching 85748, Germany.
³ Gene Center, Ludwig-Maximilians-Universität München, Munich 81377, Germany, Computational Biology, Max Planck Institute for Biophysical Chemistry, Göttingen 37077, Germany.

PMID: 26743509
DOI: 10.1093/bioinformatics/btw006

Maria Hauser et al. Bioinformatics. 2016.

Abstract

Motivation: Sequence databases are growing fast, challenging existing analysis pipelines. Reducing the redundancy of sequence databases by similarity clustering improves the speed and sensitivity of iterative searches. However, existing tools cannot efficiently cluster databases the size of UniProt to 50% maximum pairwise sequence identity or below. Furthermore, in metagenomics experiments, large fractions of reads typically can no longer be matched to any known sequence, because searching through comprehensive databases such as UniProt with sensitive but relatively slow tools (e.g. BLAST or HMMER3) is becoming too costly.

Results: MMseqs (Many-against-Many sequence searching) is a software suite for fast and deep clustering and searching of large datasets, such as UniProt, or 6-frame translated metagenomics sequencing reads. MMseqs contains three core modules: a fast and sensitive prefiltering module that sums up the scores of similar k-mers between query and target sequences, an SSE2- and multi-core-parallelized local alignment module, and a clustering module. In our homology detection benchmarks, MMseqs is much more sensitive and 4-30 times faster than UBLAST and RAPsearch, respectively, although it does not reach BLAST sensitivity yet. Using its cascaded clustering workflow, MMseqs can cluster large databases down to ∼30% sequence identity at hundreds of times the speed of BLASTclust and much deeper than CD-HIT and USEARCH. MMseqs can also update a database clustering in linear instead of quadratic time. Its much improved sensitivity-speed trade-off should make MMseqs attractive for a wide range of large-scale sequence analysis tasks.
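The prefiltering idea described above, summing the scores of similar k-mers shared by a query and a target, can be illustrated with a small sketch. This is not the MMseqs implementation: the residue groups, the scoring values (+4, +1, -2), the k-mer length, the threshold, and all function names are assumptions chosen only to keep the example short. A real tool would use a substitution matrix such as BLOSUM62 and a far more efficient index.

```python
from collections import defaultdict

# Toy residue groups standing in for a real substitution matrix (assumption).
SIMILAR_GROUPS = ["ILVM", "FWY", "KRH", "DE", "ST", "NQ", "AG"]
GROUP_OF = {aa: i for i, group in enumerate(SIMILAR_GROUPS) for aa in group}


def residue_score(a, b):
    """Toy score: +4 for identical residues, +1 within a group, -2 otherwise."""
    if a == b:
        return 4
    ga, gb = GROUP_OF.get(a), GROUP_OF.get(b)
    if ga is not None and ga == gb:
        return 1
    return -2


def kmer_score(x, y):
    """Ungapped similarity score of two k-mers of equal length."""
    return sum(residue_score(a, b) for a, b in zip(x, y))


def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(targets, k):
    """Map each k-mer to the target names containing it (one entry per occurrence)."""
    index = defaultdict(list)
    for name, seq in targets.items():
        for kmer in kmers(seq, k):
            index[kmer].append(name)
    return index


def prefilter_scores(query, targets, k=3, min_kmer_score=9):
    """Sum, per target, the scores of k-mer pairs scoring at least min_kmer_score."""
    index = build_index(targets, k)
    totals = defaultdict(int)
    for q in kmers(query, k):
        # Brute-force scan over indexed k-mers; fine for a toy, too slow at scale.
        for t, names in index.items():
            s = kmer_score(q, t)
            if s >= min_kmer_score:
                for name in names:
                    totals[name] += s
    return dict(totals)


if __name__ == "__main__":
    targets = {"close": "MKILAAG", "far": "PPPPPPP"}
    print(prefilter_scores("MKVLAAG", targets))
```

Targets that accumulate a high total would be passed on to the more expensive alignment step, while targets with no similar k-mers never appear in the result at all.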

Availability and implementation: MMseqs is open-source software available under GPL at https://github.com/soedinglab/MMseqs

Contact: martin.steinegger@mpibpc.mpg.de, soeding@mpibpc.mpg.de

Supplementary information: Supplementary data are available at Bioinformatics online.
