Zero is not absence: censoring-based differential abundance analysis for microbiome data

doi:10.1093/bioinformatics/btae071

Bioinformatics. 2024 Feb 1;40(2):btae071.

Lap Sum Chan¹, Gen Li¹

PMID: 38331411
PMCID: PMC10885211
DOI: 10.1093/bioinformatics/btae071

Lap Sum Chan et al. Bioinformatics. 2024.

Affiliation

¹ Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, United States.

Abstract

Motivation: Microbiome data analysis faces the challenge of sparsity, with many entries recorded as zeros. In differential abundance analysis, the presence of excessive zeros in data violates distributional assumptions and creates ties, leading to an increased risk of type I errors and reduced statistical power.

Results: We developed a novel normalization method, called censoring-based analysis of microbiome proportions (CAMP), for microbiome data by treating zeros as censored observations, transforming raw read counts into tie-free time-to-event-like data. This enables the use of survival analysis techniques, like the Cox proportional hazards model, for differential abundance analysis. Extensive simulations demonstrate that CAMP achieves proper type I error control and high power. Applying CAMP to a human gut microbiome dataset, we identify 60 new differentially abundant taxa across geographic locations, showcasing its usefulness. CAMP overcomes sparsity challenges, enabling improved statistical analysis and providing valuable insights into microbiome data in various contexts.
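The idea of treating zeros as censored observations can be illustrated with a minimal Python sketch. It is not the CAMP R package. It assumes that a zero count is censored at a detection limit of one read (a proportion of 1 / library size), which this abstract does not state, and the function name `censoring_normalize` is hypothetical.

```python
import math


def censoring_normalize(counts):
    """Turn one sample's read counts into (time, event) pairs, one per taxon.

    Assumption (not stated in the abstract): a zero count is censored at a
    detection limit of one read, i.e. a proportion of 1 / library_size.
    """
    library_size = sum(counts)
    if library_size <= 0:
        raise ValueError("sample has no reads")
    pairs = []
    for count in counts:
        observed = count > 0                      # False marks a censored zero
        effective_count = count if observed else 1
        proportion = effective_count / library_size
        # The negative log maps large proportions to short "times", so a
        # proportion below the detection limit becomes a right-censored time.
        pairs.append((-math.log(proportion), observed))
    return pairs


sample = [0, 5, 20, 75]  # made-up read counts for four taxa
for taxon, (time, event) in enumerate(censoring_normalize(sample)):
    print(f"taxon {taxon}: time={time:.3f}, event={event}")
```

Under this assumption, a zero yields a censored time of log(library size), so samples sequenced at different depths are censored at different times instead of sharing one tied value.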

Availability and implementation: The R package is available at https://github.com/lapsumchan/CAMP.

Conflict of interest statement

None declared.

Figures

**Figure 1.**
Example displaying two populations, each composed of 6 distinct microbial taxa. Each population has a total volume of 20 microbes, with each taxon represented by a unique shape and color. As shown in the table, the taxa of primary interest are the blue circle with a relative abundance of $2/20 = 0.1$, the red pentagon with a relative abundance of $3/20 = 0.15$, and the orange square with a relative abundance of $1/20 = 0.05$ in both populations. The table also provides the ratio of the abundance of each of these three taxa to the geometric mean abundance of all taxa within its population. Notice that none of the calculated ratios are identical across the two populations. GM: geometric mean; All: all taxa (blue circle + red pentagon + orange square + green triangle + yellow star + purple ellipse).
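The arithmetic behind this caption can be reproduced with a short Python sketch. The counts for the blue circle, red pentagon, and orange square (2, 3, and 1 out of 20) come from the caption. The counts for the other three taxa are hypothetical, chosen only so that each population sums to 20, because the figure's table is not reproduced here.

```python
from statistics import geometric_mean

# Counts for the three focal taxa follow the caption; the remaining three
# counts are hypothetical and only need to bring each total to 20.
population_1 = {"blue circle": 2, "red pentagon": 3, "orange square": 1,
                "green triangle": 4, "yellow star": 5, "purple ellipse": 5}
population_2 = {"blue circle": 2, "red pentagon": 3, "orange square": 1,
                "green triangle": 10, "yellow star": 2, "purple ellipse": 2}


def summarize(population):
    """Return {taxon: (relative abundance, ratio to the geometric mean)}."""
    total = sum(population.values())
    gm = geometric_mean(list(population.values()))
    return {taxon: (count / total, count / gm)
            for taxon, count in population.items()}


summary_1 = summarize(population_1)
summary_2 = summarize(population_2)
for taxon in ("blue circle", "red pentagon", "orange square"):
    rel_1, ratio_1 = summary_1[taxon]
    rel_2, ratio_2 = summary_2[taxon]
    print(f"{taxon}: relative abundance {rel_1:.2f} vs {rel_2:.2f}, "
          f"ratio to GM {ratio_1:.3f} vs {ratio_2:.3f}")
```

The relative abundances agree across the two populations, while the ratios to the geometric mean shift because the geometric mean depends on the taxa that are not of primary interest.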

**Figure 2.**
Schematic representation of the CAMP workflow, divided into three key components: input, preprocessing, and differential abundance analysis (DAA). The input consists of read count data from n individuals, depicted as bar plots. Preprocessing (censoring normalization) applies three steps to the read counts: censoring, library size normalization, and a negative log transformation. After censoring normalization, the data for each individual taxon can be represented as censored time-to-event data. The final stage, DAA, involves the creation of a presence/absence (P/A) table used to compute the log-rank test, with results visualized via a Kaplan–Meier curve. $D_{iP}^{(k)}$ and $D_{iA}^{(k)}$: the number of samples present and absent in condition $i$ for a given taxon at cutoff $k$, respectively.
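The DAA stage described in this caption relies on the log-rank test. The Python sketch below implements the standard two-group log-rank test on censored time-to-event data. It is a generic textbook version, not the CAMP implementation, and the function name `logrank_test` and the example data are hypothetical. Each distinct event time plays the role of a cutoff at which a 2x2 table of the two conditions is formed.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Standard two-group log-rank test; returns (chi-square, p-value).

    times_*  : censored or observed times (e.g. negative log proportions)
    events_* : True if the time is observed, False if it is censored
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    obs_minus_exp = 0.0
    variance = 0.0
    # One 2x2 table per distinct event time (the "cutoff").
    for cutoff in sorted({t for t, e, _ in data if e}):
        n = sum(1 for t, _, _ in data if t >= cutoff)              # at risk
        n_a = sum(1 for t, _, g in data if t >= cutoff and g == 0)
        d = sum(1 for t, e, _ in data if t == cutoff and e)        # events
        d_a = sum(1 for t, e, g in data if t == cutoff and e and g == 0)
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up example: group A has shorter times (higher abundance) than group B.
times_a, events_a = [1.0, 2.0, 3.0, 4.0, 5.0], [True] * 5
times_b, events_b = [6.0, 7.0, 8.0, 9.0, 10.0], [True, True, True, False, False]
chi2, p_value = logrank_test(times_a, events_a, times_b, events_b)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
```

In practice an established survival analysis library would be preferable to hand-written code. The sketch is only meant to show how censored times and event indicators feed the test.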

**Figure 3.**
Median type I error and power for six methods compared in simulations 1 and 2 across 1000 replicates: (A) type I error for simulation 1; (B) power for simulation 1; (C) type I error for simulation 2; (D) power for simulation 2. The red horizontal line in panels (A) and (C) indicates 5% type I error control.

**Figure 4.**
Type I error and power for six methods (excluding MetagenomeSeq) across 1000 replicates: (A) type I error for simulation 3; (B) power for simulation 3. Red horizontal line in panel (A) indicates 5% type I error control.

**Figure 5.**
Differential abundance analysis results for the gut microbiome dataset: (A) number of discoveries given by four methods (excluding corncob and DESeq2) in the Malawi vs Venezuela comparison; (B) proportion distribution for the *Raoultella* genus in Malawi and Venezuela; (C) Kaplan–Meier curve for the *Raoultella* genus comparing Malawi vs Venezuela.

Cited by

ADAPT: Analysis of Microbiome Differential Abundance by Pooling Tobit Models.
Wang M, Fontaine S, Jiang H, Li G. bioRxiv [Preprint]. 2024 May 17:2024.05.14.594186. doi: 10.1101/2024.05.14.594186. PMID: 38798558.

Grants and funding

R03 DE027773/DE/NIDCR NIH HHS/United States
