Bioinformatics. 2024 Feb 1;40(2):btae071.
doi: 10.1093/bioinformatics/btae071.

Zero is not absence: censoring-based differential abundance analysis for microbiome data

Lap Sum Chan et al. Bioinformatics. 2024.

Abstract

Motivation: Microbiome data analysis faces the challenge of sparsity, with many entries recorded as zeros. In differential abundance analysis, the presence of excessive zeros in data violates distributional assumptions and creates ties, leading to an increased risk of type I errors and reduced statistical power.

Results: We developed a novel normalization method, called censoring-based analysis of microbiome proportions (CAMP), for microbiome data by treating zeros as censored observations, transforming raw read counts into tie-free time-to-event-like data. This enables the use of survival analysis techniques, like the Cox proportional hazards model, for differential abundance analysis. Extensive simulations demonstrate that CAMP achieves proper type I error control and high power. Applying CAMP to a human gut microbiome dataset, we identify 60 new differentially abundant taxa across geographic locations, showcasing its usefulness. CAMP overcomes sparsity challenges, enabling improved statistical analysis and providing valuable insights into microbiome data in various contexts.
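The idea of treating zeros as censored observations can be illustrated with a minimal Python sketch. It is not the CAMP package's code (the package is written in R). It assumes that a zero count is censored at a detection limit of one read, so its "time" is the negative log of 1 / library size; that limit and the function name `censoring_normalize` are assumptions made for illustration only.

```python
import math


def censoring_normalize(counts):
    """Turn one sample's read counts into (time, observed) pairs.

    Assumption (not confirmed by the abstract): a zero count is treated as
    censored at a detection limit of one read.
    """
    library_size = sum(counts)
    if library_size <= 0:
        raise ValueError("library size must be positive")

    pairs = []
    for count in counts:
        if count > 0:
            # Observed taxon: negative log of its proportion, event observed.
            pairs.append((-math.log(count / library_size), True))
        else:
            # Zero count: censored at the assumed one-read detection limit.
            pairs.append((-math.log(1 / library_size), False))
    return pairs


# Hypothetical sample with three taxa; the first taxon has a zero count.
sample = [0, 5, 95]
for time, observed in censoring_normalize(sample):
    print(round(time, 3), "observed" if observed else "censored")
```

Under this assumption a more abundant taxon gets a shorter "time", and a zero becomes a censored time that depends on the sample's sequencing depth rather than a tied value of zero.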

Availability and implementation: The R package is available at https://github.com/lapsumchan/CAMP.

Conflict of interest statement

None declared.

Figures

Figure 1.
Example displaying two populations, each composed of 6 distinct microbial taxa. Each population has a total volume of 20 microbes, with each taxon represented by a unique shape and color. As shown in the table, the taxa of primary interest are the blue circle with a relative abundance of 2/20=0.1, the red pentagon with a relative abundance of 3/20=0.15, and the orange square with a relative abundance of 1/20=0.05 in both populations. The table also provides the ratio of the abundance of each of these three taxa compared to the geometric mean abundance of all taxa within their respective populations. Notice that none of the calculated ratios are identical across the two populations. GM: geometric mean; All: all taxa (blue circle + red pentagon + orange square + green triangle + yellow star + purple ellipse).
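The arithmetic in this caption can be reproduced with a short Python sketch. The counts for the blue circle, red pentagon, and orange square (2, 3, and 1 out of 20) come from the caption. The counts for the other three taxa are not stated there, so the values below are hypothetical and chosen only so that each population sums to 20.

```python
import math


def relative_abundance(population):
    """Proportion of each taxon within its population."""
    total = sum(population.values())
    return {taxon: count / total for taxon, count in population.items()}


def ratio_to_geometric_mean(population):
    """Abundance of each taxon divided by the geometric mean of all taxa."""
    counts = list(population.values())
    gm = math.exp(sum(math.log(c) for c in counts) / len(counts))
    return {taxon: count / gm for taxon, count in population.items()}


# First three counts are from the caption; the last three are hypothetical.
population_1 = {"blue_circle": 2, "red_pentagon": 3, "orange_square": 1,
                "green_triangle": 5, "yellow_star": 5, "purple_ellipse": 4}
population_2 = {"blue_circle": 2, "red_pentagon": 3, "orange_square": 1,
                "green_triangle": 12, "yellow_star": 1, "purple_ellipse": 1}

for taxon in ("blue_circle", "red_pentagon", "orange_square"):
    print(taxon,
          relative_abundance(population_1)[taxon],
          relative_abundance(population_2)[taxon],
          ratio_to_geometric_mean(population_1)[taxon],
          ratio_to_geometric_mean(population_2)[taxon])
```

With these made-up counts the relative abundances of the three taxa match across populations, while their ratios to the geometric mean do not, because the geometric mean depends on how the remaining taxa are distributed.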
Figure 2.
Schematic representation of the CAMP workflow, divided into three key components: input, preprocessing, and DAA. The input consists of read count data from n individuals, depicted as bar plots. During preprocessing (censoring normalization), the read count data go through three steps: censoring, library size normalization, and a negative log transformation. After censoring normalization, the data for each individual taxon can be represented as censored time-to-event data. The final stage, DAA, involves the creation of a presence/absence (P/A) table, computed for the log-rank test, with results visualized via a Kaplan–Meier curve. DiP(k) and DiA(k): the number of samples present and absent in condition i for a given taxon at cutoff k, respectively.
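The DAA stage can be sketched in Python as a textbook two-group log-rank test on censored time-to-event data. This is a generic implementation under stated assumptions, not the CAMP package's code: the function name `logrank_test` and the toy data are hypothetical, and the 2×2 table formed at each event time is only analogous to the P/A table in the caption, whose exact construction may differ.

```python
import math


def logrank_test(times, observed, groups):
    """Two-group log-rank test; groups must be coded 0 or 1.

    Returns (chi-square statistic with 1 degree of freedom, p-value).
    """
    # Distinct times at which at least one event was observed.
    event_times = sorted({t for t, e in zip(times, observed) if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        # Samples still at risk at time t, overall and in group 1.
        n = sum(1 for x in times if x >= t)
        n1 = sum(1 for x, g in zip(times, groups) if x >= t and g == 1)
        # Events occurring exactly at time t, overall and in group 1.
        d = sum(1 for x, e in zip(times, observed) if x == t and e)
        d1 = sum(1 for x, e, g in zip(times, observed, groups)
                 if x == t and e and g == 1)
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            # Hypergeometric variance of d1 given the table margins.
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Toy data: group 1 has shorter "times" (higher abundance) than group 0.
toy_times = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
toy_observed = [True] * 8
toy_groups = [1, 1, 1, 1, 0, 0, 0, 0]
print(logrank_test(toy_times, toy_observed, toy_groups))
```

Censored samples (observed set to `False`) stay in the risk set up to their censoring time but never count as events, which is how the test uses the information in zero counts without creating ties at zero.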
Figure 3.
Median type I error and power for six methods compared in simulations 1 and 2 across 1000 replicates: (A) type I error for simulation 1; (B) power for simulation 1; (C) type I error for simulation 2; (D) power for simulation 2. The red horizontal line in panels (A) and (C) indicates 5% type I error control.
Figure 4.
Type I error and power for six methods (excluding MetagenomeSeq) across 1000 replicates: (A) type I error for simulation 3; (B) power for simulation 3. The red horizontal line in panel (A) indicates 5% type I error control.
Figure 5.
Differential abundance analysis results for gut microbiome dataset: (A) number of discoveries given by four methods (excluding corncob and DESeq2) in the Malawi vs Venezuela comparison; (B) proportion distribution for the Raoultella genus in Malawi and Venezuela; (C) Kaplan–Meier curve for the Raoultella genus comparing Malawi vs Venezuela.
