Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2016 Dec;26(12):1721-1729.
doi: 10.1101/gr.210641.116. Epub 2016 Oct 17.

Centrifuge: rapid and sensitive classification of metagenomic sequences

Affiliations

Centrifuge: rapid and sensitive classification of metagenomic sequences

Daehwan Kim et al. Genome Res. 2016 Dec.

Abstract

Centrifuge is a novel microbial classification engine that enables rapid, accurate, and sensitive labeling of reads and quantification of species on desktop computers. The system uses an indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, optimized specifically for the metagenomic classification problem. Centrifuge requires a relatively small index (4.2 GB for 4078 bacterial and 200 archaeal genomes) and classifies sequences at very high speed, allowing it to process the millions of reads from a typical high-throughput DNA sequencing run within a few minutes. Together, these advances enable timely and accurate analysis of large metagenomics data sets on conventional desktop computers. Because of its space-optimized indexing schemes, Centrifuge also makes it possible to index the entire NCBI nonredundant nucleotide sequence database (a total of 109 billion bases) with an index size of 69 GB, in contrast to k-mer-based indexing schemes, which require far more extensive space.

PubMed Disclaimer

Figures

Figure 1.
Figure 1.
Compression of genome sequences before building the Centrifuge index. All genomes are compared and similarities are computed based on shared 53-mers. In the figure, genomes G1 and G2 are the most similar pair. Sequences of G2 that are ≥99% identical to G1 are discarded, and the remaining “unique” sequences from G2 are added to genome G1, creating a merged genome, G1+2. Similarity between all genomes is recomputed using the merged genomes. Sequences <99% identical in genome G3 are then added to the merged genome, creating genome G1+2+3. This process repeats for the entire Centrifuge database until each merged genome has no sequences ≥99% identical to any other genome.
Figure 2.
Figure 2.
Classification of reads. (A) The figure shows how the score for a candidate at the species level is calculated. Given a 100-bp read, both the read (forward) and its reverse complement from right to left are searched. Centrifuge first identifies a short exact match, then continues until reaching a mismatch: The first 40-bp segment exactly matches six species (A, B, C, D, E, F), followed by a mismatch at the 41st base; the second 26-bp segment matches two species (G and H), followed by a mismatch at the 68th base; and the third 32-bp segment matches only species G. This procedure is repeated for the reverse complement of the read. Centrifuge assigns the highest score (625) to species A, B, C, D, E, and F. (B) Centrifuge then traverses up the taxonomic tree to reduce the number of assignments, first by considering the genus that includes the largest number of species, genus I, which covers species A, B, and C, and then replacing these three species with the genus. This procedure results in reducing the number of assignments to four (genus I plus species D, E, and F).
Figure 3.
Figure 3.
Results on 530 sequencing data sets from bacterial genomes retrieved from the Sequence Read Archive at NCBI. Each dot represents the results for one genome, with Centrifuge shown in orange and Kraken in teal. The upper left plot shows sensitivity, computed as the percentage of reads classified as the correct genus. The upper right plot shows precision, computed as the percentage of genus-level classifications made by a program that were correct. The lower left plot shows runtime measured in seconds.
Figure 4.
Figure 4.
Heat map of the most abundant species in Ebola samples. The color scale encodes species abundance (the number of unique reads normalized by genome size), ranging from yellow (<0.1% of the normalized read count) to red (100%), with white representing an abundance of zero. All species that have a normalized read count over 1% in any of the samples are shown. Zaire ebolavirus dominates the samples; however, there is also a signal for other viruses in some of the patients—namely GB virus C and Hepatitis B virus.

Similar articles

Cited by

References

    1. Amann RI, Ludwig W, Schleifer KH. 1995. Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiol Rev 59: 143–169. - PMC - PubMed
    1. Baize S, Pannetier D, Oestereich L, Rieger T, Koivogui L, Magassouba N, Soropogui B, Sow MS, Keita S, De Clerck H, et al. 2014. Emergence of Zaire Ebola virus disease in Guinea. N Engl J Med 371: 1418–1425. - PubMed
    1. Brady A, Salzberg SL. 2009. Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat Methods 6: 673–676. - PMC - PubMed
    1. Brady A, Salzberg S. 2011. PhymmBL expanded: confidence scores, custom databases, parallelization and more. Nat Methods 8: 367. - PMC - PubMed
    1. Burrows M, Wheeler DJ. 1994. A block-sorting lossless data compression algorithm. Technical Report 124 Digital Equipment Corporation, Palo Alto, CA.

Publication types