Centrifuge: rapid and sensitive classification of metagenomic sequences

Genome Res. 2016 Dec;26(12):1721-1729. doi: 10.1101/gr.210641.116. Epub 2016 Oct 17.

Daehwan Kim¹, Li Song^{1,2}, Florian P Breitwieser¹, Steven L Salzberg^{1,2,3}

Affiliations

¹ Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland 21205, USA.
² Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA.
³ Departments of Biomedical Engineering and Biostatistics, Johns Hopkins University, Baltimore, Maryland 21205, USA.

PMID: 27852649
PMCID: PMC5131823
DOI: 10.1101/gr.210641.116

Abstract

Centrifuge is a novel microbial classification engine that enables rapid, accurate, and sensitive labeling of reads and quantification of species on desktop computers. The system uses an indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, optimized specifically for the metagenomic classification problem. Centrifuge requires a relatively small index (4.2 GB for 4078 bacterial and 200 archaeal genomes) and classifies sequences at very high speed, allowing it to process the millions of reads from a typical high-throughput DNA sequencing run within a few minutes. Together, these advances enable timely and accurate analysis of large metagenomics data sets on conventional desktop computers. Because of its space-optimized indexing schemes, Centrifuge also makes it possible to index the entire NCBI nonredundant nucleotide sequence database (a total of 109 billion bases) with an index size of 69 GB, in contrast to k-mer-based indexing schemes, which require far more extensive space.
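
To make the indexing idea concrete, the following is a minimal sketch of exact-match counting with a BWT/FM index. It is a toy illustration under stated assumptions, not Centrifuge's implementation: the suffix array is built by naive sorting, the occurrence table is stored uncompressed, and the function names `build_fm_index` and `count_matches` are hypothetical.

```python
def build_fm_index(text):
    """Build a toy FM index (naive construction, for illustration only)."""
    text += "$"  # sentinel that sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to '$' for i == 0)
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # first[c]: number of characters in the text strictly smaller than c
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return first, occ, len(bwt)


def count_matches(index, pattern):
    """Count exact occurrences of pattern by backward search (right to left)."""
    first, occ, n = index
    lo, hi = 0, n  # current suffix-array interval [lo, hi)
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("ACGTACGTAC")
print(count_matches(index, "ACG"))  # occurrences of ACG in the toy text
```

The search consumes the pattern from its last character to its first, narrowing a suffix-array interval at each step; the memory cost is governed by the index tables rather than by a table of all k-mers.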


Figures

**Figure 1.**
Compression of genome sequences before building the Centrifuge index. All genomes are compared and similarities are computed based on shared 53-mers. In the figure, genomes G₁ and G₂ are the most similar pair. Sequences of G₂ that are ≥99% identical to G₁ are discarded, and the remaining “unique” sequences from G₂ are added to genome G₁, creating a merged genome, G₁₊₂. Similarity between all genomes is recomputed using the merged genomes. Sequences <99% identical in genome G₃ are then added to the merged genome, creating genome G₁₊₂₊₃. This process repeats for the entire Centrifuge database until each merged genome has no sequences ≥99% identical to any other genome.
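
The first step of the compression described above, picking the most similar pair of genomes by shared k-mers, could be sketched as follows. This assumes similarity is the raw count of shared k-mers, which the caption does not specify. The helper names are hypothetical, and the merging step with its 99% identity threshold is not implemented here.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def most_similar_pair(genomes, k=53):
    """Return the pair of genome names sharing the most k-mers.

    genomes: dict mapping a genome name to its sequence.
    Assumption: similarity is the raw shared k-mer count.
    """
    sets = {name: kmers(seq, k) for name, seq in genomes.items()}
    return max(combinations(sorted(sets), 2),
               key=lambda pair: len(sets[pair[0]] & sets[pair[1]]))


# Toy sequences with a small k so the example stays readable.
toy = {"G1": "ACGTACGTTGCA", "G2": "ACGTACGTTGCC", "G3": "TTTTGGGGCCCC"}
print(most_similar_pair(toy, k=4))
```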

**Figure 2.**
Classification of reads. (A) The figure shows how the score for a candidate at the species level is calculated. Given a 100-bp read, both the read (forward) and its reverse complement from *right* to *left* are searched. Centrifuge first identifies a short exact match, then continues until reaching a mismatch: The first 40-bp segment exactly matches six species (A, B, C, D, E, F), followed by a mismatch at the 41st base; the second 26-bp segment matches two species (G and H), followed by a mismatch at the 68th base; and the third 32-bp segment matches only species G. This procedure is repeated for the reverse complement of the read. Centrifuge assigns the highest score (625) to species A, B, C, D, E, and F. (B) Centrifuge then traverses up the taxonomic tree to reduce the number of assignments, first by considering the genus that includes the largest number of species, genus I, which covers species A, B, and C, and then replacing these three species with the genus. This procedure results in reducing the number of assignments to four (genus I plus species D, E, and F).
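
The scoring in panel A can be reproduced with a short sketch. It assumes each matched segment contributes (length − 15)² to every species it hits, which is the scoring rule given in the full paper and is consistent with the 625 quoted above (40 − 15 = 25, and 25² = 625), although the caption itself does not restate it. Only the forward read is scored here. The genus labels for species D, E, and F are hypothetical, since the caption names only genus I.

```python
def hit_score(length, offset=15):
    """Score of one exact-match segment: (length - offset)^2, floored at 0."""
    return max(0, length - offset) ** 2


def species_scores(segments):
    """Sum segment scores per species.

    segments: list of (segment_length, iterable_of_species_hit).
    """
    scores = {}
    for length, species in segments:
        for s in species:
            scores[s] = scores.get(s, 0) + hit_score(length)
    return scores


def collapse_largest_genus(species, genus_of):
    """Replace the species of the most populous genus with the genus itself."""
    groups = {}
    for s in species:
        groups.setdefault(genus_of[s], []).append(s)
    largest = max(groups, key=lambda g: len(groups[g]))
    return [largest] + [s for s in species if genus_of[s] != largest]


# Segments from the caption: 40 bp (A-F), 26 bp (G, H), 32 bp (G only).
scores = species_scores([(40, "ABCDEF"), (26, "GH"), (32, "G")])
best = max(scores.values())
top = sorted(s for s, v in scores.items() if v == best)

# Genera II and III are placeholders; only genus I appears in the caption.
genus_of = {"A": "I", "B": "I", "C": "I", "D": "II", "E": "II", "F": "III"}
print(best, collapse_largest_genus(top, genus_of))
```

Under this rule species G scores 11² + 17² = 410 and species H scores 121, so species A through F remain the top candidates, and collapsing genus I leaves four assignments, as in panel B.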

**Figure 3.**
Results on 530 sequencing data sets from bacterial genomes retrieved from the Sequence Read Archive at NCBI. Each dot represents the results for one genome, with Centrifuge shown in orange and Kraken in teal. The *upper left* plot shows sensitivity, computed as the percentage of reads classified as the correct genus. The *upper right* plot shows precision, computed as the percentage of genus-level classifications made by a program that were correct. The *lower left* plot shows runtime measured in seconds.
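
The two accuracy measures defined above can be written out as a small sketch. The function name and the input representation (one predicted genus per read, with `None` for an unclassified read) are assumptions made for illustration, and the example labels are invented rather than taken from the study.

```python
def genus_metrics(true_genus, predictions):
    """Return (sensitivity, precision) in percent for one data set.

    predictions: one entry per read, either a genus name or None
    when the read was left unclassified.
    """
    total_reads = len(predictions)
    classified = [p for p in predictions if p is not None]
    correct = sum(p == true_genus for p in classified)
    # Sensitivity: correct genus-level calls over all reads.
    sensitivity = 100.0 * correct / total_reads if total_reads else 0.0
    # Precision: correct genus-level calls over all calls made.
    precision = 100.0 * correct / len(classified) if classified else 0.0
    return sensitivity, precision


# Invented example: 6 correct, 2 wrong, 2 unclassified reads.
calls = ["Escherichia"] * 6 + ["Shigella"] * 2 + [None] * 2
print(genus_metrics("Escherichia", calls))
```

Unclassified reads lower sensitivity but not precision, which is why the two plots can diverge for the same program.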

**Figure 4.**
Heat map of the most abundant species in Ebola samples. The color scale encodes species abundance (the number of unique reads normalized by genome size), ranging from yellow (<0.1% of the normalized read count) to red (100%), with white representing an abundance of zero. All species that have a normalized read count over 1% in any of the samples are shown. Zaire ebolavirus dominates the samples; however, there is also a signal for other viruses in some of the patients—namely GB virus C and Hepatitis B virus.
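
The abundance measure in this caption can be illustrated with a brief sketch. It assumes the percentage is each species' reads-per-base value divided by the sum over all species in the sample, which the caption implies but does not spell out. The species names and numbers below are invented for illustration.

```python
def normalized_abundance(unique_reads, genome_size):
    """Percent abundance per species after normalizing by genome size.

    unique_reads: dict of species -> number of uniquely classified reads.
    genome_size:  dict of species -> genome length in bases.
    """
    per_base = {s: unique_reads[s] / genome_size[s] for s in unique_reads}
    total = sum(per_base.values())
    if total == 0:
        return {s: 0.0 for s in per_base}
    return {s: 100.0 * v / total for s, v in per_base.items()}


# Invented counts: equal read numbers, but a fourfold difference in genome size.
reads = {"virus_X": 1000, "virus_Y": 1000}
sizes = {"virus_X": 10_000, "virus_Y": 40_000}
print(normalized_abundance(reads, sizes))
```

With equal read counts, the species with the smaller genome receives the larger normalized share, which is the point of dividing by genome size.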


Cited by

Metagenomics: A viable tool for reconstructing herbivore diet.
Chua PYS, Crampton-Platt A, Lammers Y, Alsos IG, Boessenkool S, Bohmann K. Mol Ecol Resour. 2021 Oct;21(7):2249-2263. doi: 10.1111/1755-0998.13425. Epub 2021 May 25. PMID: 33971086
The Impact of Migration on the Gut Metagenome of South Asian Canadians.
Copeland JK, Chao G, Vanderhout S, Acton E, Wang PW, Benchimol EI, El Sohami A, Croitoru K, Gommerman JL, Guttman DS; GEMINI Research Team. Gut Microbes. 2021 Jan-Dec;13(1):1-29. doi: 10.1080/19490976.2021.1902705. PMID: 33794735
Liver transcriptome resources of four commercially exploited teleost species.
Machado AM, Muñoz-Merida A, Fonseca E, Veríssimo A, Pinto R, Felício M, da Fonseca RR, Froufe E, Castro LFC. Sci Data. 2020 Jul 7;7(1):214. doi: 10.1038/s41597-020-0565-9. PMID: 32636445
Microbial functional pathways based on metatranscriptomic profiling enable effective saliva-based health assessments for precision wellness.
Patridge E, Gorakshakar A, Molusky MM, Ogundijo O, Janevski A, Julian C, Hu L, Vuyisich M, Banavar G. Comput Struct Biotechnol J. 2024 Jan 29;23:834-842. doi: 10.1016/j.csbj.2024.01.018. eCollection 2024 Dec. PMID: 38328005
DNA Thermo-Protection Facilitates Whole-Genome Sequencing of Mycobacteria Direct from Clinical Samples.
George S, Xu Y, Rodger G, Morgan M, Sanderson ND, Hoosdally SJ, Thulborn S, Robinson E, Rathod P, Walker AS, Peto TEA, Crook DW, Dingle KE. J Clin Microbiol. 2020 Sep 22;58(10):e00670-20. doi: 10.1128/JCM.00670-20. Print 2020 Sep 22. PMID: 32719032

