Genome Biol. 2019 Dec 23;20(1):296.
doi: 10.1186/s13059-019-1874-1.

Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression

Christoph Hafemeister et al. Genome Biol. 2019.

Abstract

Single-cell RNA-seq (scRNA-seq) data exhibit significant cell-to-cell variation due to technical factors, including the number of molecules detected in each cell, which can confound biological heterogeneity with technical effects. To address this, we present a modeling framework for the normalization and variance stabilization of molecular count data from scRNA-seq experiments. We propose that the Pearson residuals from "regularized negative binomial regression," where cellular sequencing depth is used as a covariate in a generalized linear model, successfully remove the influence of technical characteristics from downstream analyses while preserving biological heterogeneity. Importantly, we show that an unconstrained negative binomial model may overfit scRNA-seq data, and we overcome this by pooling information across genes with similar abundances to obtain stable parameter estimates. Our procedure removes the need for heuristic steps such as pseudocount addition or log-transformation and improves common downstream analytical tasks such as variable gene selection, dimensional reduction, and differential expression. Our approach can be applied to any UMI-based scRNA-seq dataset and is freely available as part of the R package sctransform, with a direct interface to our single-cell toolkit Seurat.
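The two ideas in the abstract, pooling parameter estimates across genes of similar abundance and computing Pearson residuals under a negative binomial (NB) model with depth as a covariate, can be illustrated with a minimal Python sketch. This is not the sctransform implementation. The function names, the Gaussian-kernel smoother, the use of log10 depth as the covariate, the smoothing of dispersion on a log10 scale, and all numeric values are assumptions made for illustration.

```python
import math


def kernel_smooth(x_values, y_values, bandwidth):
    """Nadaraya-Watson regression with a Gaussian kernel, evaluated at each x."""
    smoothed = []
    for x0 in x_values:
        weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in x_values]
        weighted_sum = sum(w * y for w, y in zip(weights, y_values))
        smoothed.append(weighted_sum / sum(weights))
    return smoothed


def nb_pearson_residual(count, depth, beta0, beta1, theta):
    """Pearson residual of one UMI count under an NB model with a log link."""
    # Expected count given the cell's sequencing depth (assumed log10 covariate).
    mu = math.exp(beta0 + beta1 * math.log10(depth))
    # NB variance: Poisson term plus overdispersion controlled by theta.
    variance = mu + mu * mu / theta
    return (count - mu) / math.sqrt(variance)


# Hypothetical per-gene values, ordered by log10 mean abundance.
log_mean = [-2.0, -1.5, -1.0, -0.5, 0.0]
raw_log_theta = [0.2, 1.4, 0.5, 1.1, 0.9]  # noisy per-gene fits (made up)

# Regularize by sharing information across genes with similar abundance.
reg_log_theta = kernel_smooth(log_mean, raw_log_theta, bandwidth=0.5)

# Residual for one cell and one gene, using the regularized dispersion.
theta = 10 ** reg_log_theta[2]
residual = nb_pearson_residual(count=3, depth=2000, beta0=-7.0, beta1=2.3, theta=theta)
print(round(residual, 3))
```

Because each smoothed value is a weighted average of the per-gene estimates, a single noisy fit has less influence on the dispersion used for the residuals, which is the intuition behind pooling across genes.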

Keywords: Normalization; Single-cell RNA-seq.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
33,148 PBMC dataset from 10X Genomics. (a) Distribution of total UMI counts per cell (“sequencing depth”). (b) We placed genes into six groups, based on their average expression in the dataset. (c) For each gene group, we examined the average relationship between observed counts and cell sequencing depth. We fit a smooth line for each gene individually and combined results based on the groupings in (b). Black line shows mean, colored region indicates interquartile range. (d) Same as in (c), but showing scaled log-normalized values instead of UMI counts. Values were scaled (z-scored) so that a single Y-axis range could be used. (e) Relationship between gene variance and cell sequencing depth; cells were placed into five equal-sized groups based on total UMI counts (group 1 has the greatest depth), and we calculated the total variance of each gene group within each bin. For effectively normalized data, each cell bin should contribute 20% to the variance of each gene group.
Fig. 2
We fit NB regression models for each gene individually and bootstrapped the process to measure uncertainty in the resulting parameter estimates. (a) Model parameters for 16,809 genes for the NB regression model, plotted as a function of average gene abundance across the 33,148 cells. The color of each point indicates a parameter uncertainty score as determined by bootstrapping (“Methods” section). Pink line shows the regularized parameters obtained via kernel regression. (b) Standard deviation (σ) of NB regression model parameters across multiple bootstraps. Red points: σ for unconstrained NB model. Blue points: σ for regularized NB model, which is substantially reduced in comparison. Black trendline shows an increase in σ for low-abundance genes, highlighting the potential for overfitting in the absence of regularization.
Fig. 3
Pearson residuals from regularized NB regression represent effectively normalized scRNA-seq data. Panels (a) and (b) are analogous to Fig. 1d and 1e, but calculated using Pearson residuals. (c) Boxplot of Pearson correlations between Pearson residuals and total cell UMI counts for each of the six gene bins. All three panels demonstrate that, in contrast to log-normalized data, the level and variance of Pearson residuals are independent of sequencing depth.
Fig. 4
Regularized NB regression removes variation due to sequencing depth, but retains biological heterogeneity. (a) Distribution of residual mean, across all genes, is centered at 0. (b) Density of residual gene variance peaks at 1, as would be expected when the majority of genes do not vary across cell types. (c) Variance of Pearson residuals is independent of gene abundance, demonstrating that the GLM has successfully captured the mean-variance relationship inherent in the data. Genes with high residual variance are exclusively cell-type markers. (d) In contrast to a regularized NB, a Poisson error model does not fully capture the variance in highly expressed genes. An unconstrained (non-regularized) NB model overfits scRNA-seq data, attributing almost all variation to technical effects. As a result, even cell-type markers exhibit low residual variance. Mean-variance trendline shown in blue for each panel.
Fig. 5
The regularized NB model is an attractive middle ground between two extremes. (a) For four genes, we show the relationship between cell sequencing depth and molecular counts. White points show the observed data. Background color represents the Pearson residual magnitude under three error models. For MALAT1 (does not vary across cell types), the Poisson error model does not account for overdispersion and incorrectly infers significant residual variation (biological heterogeneity). For S100A9 (a CD14+ monocyte marker) and CD74 (expressed in antigen-presenting cells), the non-regularized NB model overfits the data and collapses biological heterogeneity. For PPBP (a megakaryocyte marker), both non-regularized models wrongly fit a negative slope. (b) Boxplot of Pearson residuals for models shown in (a). X-axis range shown is limited to [−8, 25] for visual clarity.
Fig. 6
Downstream analyses of Pearson residuals are unaffected by differences in sequencing depth. (a) UMAP embedding of the 33,148 cell PBMC dataset using either log-normalization or Pearson residuals. Both normalization schemes lead to similar results with respect to the major and minor cell populations in the dataset. However, in analyses of log-normalized data, cells within a cluster are ordered along a gradient that is correlated with sequencing depth. (b) Within the four major cell types, the percent of variance explained by sequencing depth under both normalization schemes. (c) UMAP embedding of two groups of biologically identical CD14+ monocytes, where one group was randomly downsampled to 50% depth. (d) Results of differential expression (DE) test between the two groups shown in (c). Gray areas indicate expected group mean difference by chance and a false discovery rate cutoff of 1%. (e) Results of DE test between CD14+ and CD16+ monocytes, before and after randomly downsampling the CD16+ cells to 20% depth.
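The random downsampling described in the Fig. 6 caption can be mimicked with a short Python sketch. The caption does not say how the downsampling was performed; binomial thinning, in which each detected molecule is kept independently with a fixed probability, is an assumption here, and the function name and example counts are hypothetical. Under this scheme the retained depth equals the target fraction only in expectation.

```python
import random


def downsample_counts(counts, keep_fraction, rng):
    """Binomial thinning: keep each molecule independently with probability keep_fraction."""
    return [
        sum(1 for _ in range(count) if rng.random() < keep_fraction)
        for count in counts
    ]


# Hypothetical UMI counts for five genes in one cell.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
cell_counts = [120, 0, 37, 5, 900]
thinned = downsample_counts(cell_counts, 0.5, rng)

# Fraction of the original depth that was retained (about 0.5 on average).
print(sum(thinned) / sum(cell_counts))
```

Applying the same thinning to one of two otherwise identical groups of cells gives a simple test bed: any differential expression detected between the groups afterwards is attributable to depth rather than biology.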
