Bioinformatics. 2024 Jul 1;40(7):btae431.
doi: 10.1093/bioinformatics/btae431.

A multi-bin rarefying method for evaluating alpha diversities in TCR sequencing data

Mo Li et al. Bioinformatics.

Abstract

Motivation: T cell receptors (TCRs) constitute a major component of our adaptive immune system, governing the recognition of and response to internal and external antigens. Studying TCR diversity via sequencing technology is critical for a deeper understanding of immune dynamics. However, library sizes differ substantially across samples, hindering the accurate estimation and comparison of alpha diversities. To address this, researchers frequently use an overall rarefying approach in which all samples are sub-sampled to an even depth. Despite the pervasive application of this approach, its efficacy has never been rigorously assessed.
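
To make the overall rarefying step concrete, here is a minimal sketch in Python. It assumes each sample is a table mapping clonotype labels to read counts; the function names `rarefy`, `richness`, and `shannon` are illustrative and are not taken from the authors' code. The two alpha diversities shown, unique sequence counts and the Shannon index, are the ones named in the figure captions.

```python
import math
import random
from collections import Counter

def rarefy(clone_counts, depth, rng):
    """Sub-sample `depth` reads without replacement from a clonotype count table."""
    total = sum(clone_counts.values())
    if depth > total:
        raise ValueError("rarefying depth exceeds the sample's library size")
    # Expand the table to one label per read, then draw without replacement.
    reads = [clone for clone, n in clone_counts.items() for _ in range(n)]
    return Counter(rng.sample(reads, depth))

def richness(clone_counts):
    """Number of unique sequences observed (count > 0)."""
    return sum(1 for n in clone_counts.values() if n > 0)

def shannon(clone_counts):
    """Shannon index (natural log) of the clonotype frequency distribution."""
    total = sum(clone_counts.values())
    return -sum((n / total) * math.log(n / total)
                for n in clone_counts.values() if n > 0)

# Overall rarefying: every sample is cut down to the same depth (toy data).
rng = random.Random(0)
samples = [{"A": 50, "B": 30, "C": 20}, {"A": 5, "B": 5, "D": 30}]
common_depth = min(sum(s.values()) for s in samples)
rarefied = [rarefy(s, common_depth, rng) for s in samples]
diversities = [(richness(r), shannon(r)) for r in rarefied]
```

Expanding the table to one entry per read is simple but memory-hungry; for real library sizes a count-based sampler would be a better fit.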

Results: In this paper, we develop an innovative "multi-bin" rarefying approach that partitions samples into multiple bins according to their library sizes, conducts rarefying within each bin for alpha diversity calculations, and performs meta-analysis across bins. Extensive simulations using real-world data highlight the inadequacy of the overall rarefying approach in controlling the confounding effect of library size. Our method proves robust in addressing library size confounding, outperforming competing normalization strategies by achieving better-controlled type-I error rates and enhanced statistical power in association tests.
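
The three steps described above (binning by library size, rarefying within each bin, meta-analysis across bins) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation, which is in the repository linked below. Assumptions: bins are equal-count groups of samples sorted by library size; each bin is rarefied to its own smallest library size (as the Figure 4 caption describes); the per-bin association is a Pearson correlation between unique sequence count and a numeric phenotype; and bins are combined with a fixed-effect, inverse-variance pooling of Fisher z-transformed correlations. The abstract does not specify the binning rule, the association test, or the meta-analysis model, and all function names here are hypothetical.

```python
import math
import random
from collections import Counter

def rarefy(counts, depth, rng):
    """Sub-sample `depth` reads without replacement from a clonotype count table."""
    reads = [clone for clone, n in counts.items() for _ in range(n)]
    return Counter(rng.sample(reads, depth))

def richness(counts):
    """Number of unique sequences with a positive count."""
    return sum(1 for n in counts.values() if n > 0)

def library_size(sample):
    return sum(sample["counts"].values())

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)

def split_into_bins(samples, n_bins):
    """Assumed binning rule: sort by library size, cut into equal-count bins."""
    ordered = sorted(samples, key=library_size)
    size = math.ceil(len(ordered) / n_bins)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def multibin_test(samples, n_bins, rng):
    """Rarefy within each bin, correlate within each bin, pool across bins."""
    weighted_sum = total_weight = 0.0
    for bin_samples in split_into_bins(samples, n_bins):
        n = len(bin_samples)
        if n < 4:  # the Fisher z variance 1/(n-3) needs n > 3
            continue
        depth = min(library_size(s) for s in bin_samples)
        alpha = [richness(rarefy(s["counts"], depth, rng)) for s in bin_samples]
        pheno = [s["phenotype"] for s in bin_samples]
        r = max(min(pearson(alpha, pheno), 0.999999), -0.999999)
        weight = n - 3  # inverse of the Fisher z variance
        weighted_sum += weight * math.atanh(r)
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no bin had enough samples")
    z = weighted_sum / math.sqrt(total_weight)  # pooled z statistic
    p = math.erfc(abs(z) / math.sqrt(2))        # two-sided normal P-value
    return z, p
```

Each sample here is a dict with a `counts` table and a numeric `phenotype`, for example `{"counts": {"A": 12, "B": 3}, "phenotype": 0.7}`. Because every within-bin comparison uses samples of similar library size, the rarefying depth in each bin stays close to the original depths, which is the intuition behind limiting library size confounding.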

Availability and implementation: The code is available at https://github.com/mli171/MultibinAlpha. The datasets are freely available at https://doi.org/10.21417/B7001Z and https://doi.org/10.21417/AR2019NC.

Conflict of interest statement

None declared.

Figures

Figure 1.
Type I error rates for Simulation A. Panels (A) and (B): type I error rates for all simulated data (|ρ|>0) and for subsets restricted to datasets in which |ρ|>0.01, |ρ|>0.02, ..., |ρ|>0.1, for the unique sequence counts (A) and Shannon index (B). Panels (C) and (D): QQ plots of the empirical P-values against the expected P-values (both −log10 transformed), for the unique sequence counts (C) and Shannon index (D).
Figure 2.
Data generation process for Simulation B.
Figure 3.
Type I error rates and power in Simulation B. Panels (A) and (B) show simulations in which library size is not a confounder between alpha diversity and phenotype. Panels (C) and (D) show simulations in which library size confounds the relationship between alpha diversity and phenotype. Note that when p = 0 we are evaluating the type I error.
Figure 4.
Spearman correlation analyses with scatter plots and LOESS curve fits for alpha diversities within six bins. Panels (A) and (B) analyze unique sequence counts and Shannon index, respectively, against library sizes rarefied to 1e5 across all samples. Panels (C) and (D) follow the same analyses but with samples rarefied to the lowest library size in each bin using the “multi-bin” approach. P-values were calculated to test the significance of Spearman correlations between alpha diversities and the library sizes across all samples (black) and within each bin (colored).
Figure 5.
Correlations between covariates and alpha diversity calculated from overall-rarefied samples with L*=1e6. Left column: unique sequence counts plotted against the clinical variables of interest. Middle column: Shannon index plotted against the clinical variables of interest. Right column: log10-transformed library size plotted against the clinical variables of interest. The curves in the first row are LOWESS fits for the “Age” variable. P-values at the bottom of each sub-figure are obtained from Spearman rank correlation tests. The tests in the right column assess associations between initial library size and the clinical variables.
