Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2015 Jun 24;11(6):e1004333.
doi: 10.1371/journal.pcbi.1004333. eCollection 2015 Jun.

BASiCS: Bayesian Analysis of Single-Cell Sequencing Data

Affiliations

BASiCS: Bayesian Analysis of Single-Cell Sequencing Data

Catalina A Vallejos et al. PLoS Comput Biol. .

Abstract

Single-cell mRNA sequencing can uncover novel cell-to-cell heterogeneity in gene expression levels in seemingly homogeneous populations of cells. However, these experiments are prone to high levels of unexplained technical noise, creating new challenges for identifying genes that show genuine heterogeneous expression within the population of cells under study. BASiCS (Bayesian Analysis of Single-Cell Sequencing data) is an integrated Bayesian hierarchical model where: (i) cell-specific normalisation constants are estimated as part of the model parameters, (ii) technical variability is quantified based on spike-in genes that are artificially introduced to each analysed cell's lysate and (iii) the total variability of the expression counts is decomposed into technical and biological components. BASiCS also provides an intuitive detection criterion for highly (or lowly) variable genes within the population of cells under study. This is formalised by means of tail posterior probabilities associated to high (or low) biological cell-to-cell variance contributions, quantities that can be easily interpreted by users. We demonstrate our method using gene expression measurements from mouse Embryonic Stem Cells. Cross-validation and meaningful enrichment of gene ontology categories within genes classified as highly (or lowly) variable supports the efficacy of our approach.

PubMed Disclaimer

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. Graphical representation of gene expression in two cells from a homogeneous population but with different total mRNA content.
(a) Three biological genes have the same expression rates in both cells, however cell 1 doubles cell 2 in terms of total mRNA content. As a result, the expression counts in cell 1 will be roughly twice as much as those from cell 2, for all genes. In terms of the cell-specific size factors ϕ j, an appropriate normalisation in this case would be e.g. ϕ 1 = 2, ϕ 2 = 1 (or any other values such that ϕ 1/ϕ 2 = 2). (b) The same cells after the addition of two molecules of a spike-in gene to each cell. Because the same number of spike-in molecules are added to each cell, the spike-in expression counts are independent of the total mRNA content of each cell. Therefore, the cell-specific size factors ϕ j are not required when modelling the technical gene.
Fig 2
Fig 2. Graphical representation of the hierarchical model implemented in BASiCS.
Diagram based on the expression counts of 2 genes (i: biological and i′: technical) at 2 cells (j and j′). Squared and circular nodes denote known observed quantities (observed expression counts and added number of spike-in mRNA molecules) and unknown elements, respectively. Whereas black circular nodes represent the random effects that play an intermediate role in our hierarchical structure, red circular nodes relate to unknown model parameters in the top layer of hierarchy in our model. Blue, green and grey areas highlight elements that are shared within a biological gene, technical gene or cell, respectively. BASiCS treats cell-specific normalising constants (ϕ j’s and s j’s) as model parameters, and estimates them by combining information across all genes. Unexplained technical noise is quantified via a single hyper-parameter θ, borrowing information across all genes and cells. Finally, BASiCS quantifies biological cell-to-cell variability via gene-specific hyper-parameters δ i, borrowing information across all cells.
Fig 3
Fig 3. Simulated performance of s j’s estimates (method described in [5] and BASiCS).
Based on 400 simulated datasets from the model implemented in BASiCS with the same structure as in the mouse ESC dataset (simulated parameter values defined as posterior medians of the original model fit) and 6 different values for θ. (a) percentage of the simulated spike-in genes (out of 46) without zero counts (i.e. those that can be used when calculating the estimator proposed in [5]) for different simulated values of θ. (b) and (c) estimates of s 1 (first cell) across all simulated datasets for different simulated values of θ using the method described in [5] and BASiCS (posterior medians), respectively. As the strength of unexplained technical noise increases (larger values of θ), estimates obtained using the approach described in [5] become highly unstable (we illustrated this using the first simulated cell, but the same conclusion can be obtained based on any other cell). This is due to a larger proportion of zeros among the simulated expression counts, i.e. less spike-in genes can be used when estimating s 1. In contrast, the stability of the BASiCS estimates is not substantially affected by the strength of unexplained technical noise.
Fig 4
Fig 4. Normalisation.
(a) and (b): for each of the 41 mouse ESCs, vertical lines represent the 95% high posterior density interval (blue dot located at the posterior median) of cell-specific normalising constants ϕ j (cellular mRNA content) and s j (interpreted in terms of capture and reverse transcription efficiency for UMI counts), respectively. While BASiCS suggests substantial heterogeneity in the total amount of molecules per cell (ϕ j), the scale of the technical counts remains stable among cells (s j). This is expected when using UMI protocols, where counts should not be affected by sequencing depth and other amplification biases. Red dots are the values estimated by the stepwise method described in [5]. There is a good agreement of the methods in terms of cellular mRNA content (ϕ j), but the estimations of s j according to [5] suggest stronger differences than what is expected when using UMI protocols. In (b), black dots represent the proportion of total spike-in molecules captured in each cell. Our estimations of the s j’s are in better agreement with these empirical measurements (suggesting BASiCS infers a more adequate reverse transcription efficiency level). (c) and (d) histogram of a Markov Chain Monte Carlo sample from s 1 and s 2, respectively. These posterior distributions are highly skewed and thence the posterior modes are a closer match to the empirical capture proportions than the corresponding posterior medians.
Fig 5
Fig 5. Cell-specific random effects linked to unexplained technical variability.
(a): for each of the 41 mouse ES cells, vertical lines represent the 95% high posterior density interval (blue dot located at the posterior median) of the random effects related to unexplained technical cell-to-cell variability (ν j). (b): histogram of a Markov Chain Monte Carlo sample from θ. Posterior inference strongly suggests the presence of unexplained technical noise in gene expression measurements. In fact, the posterior distribution of θ is concentrated away from zero and—even though the posterior distributions of the s j’s are highly homogeneous across cells (see Fig 4(b))—there is a strong heterogeneity among the posterior distributions of the ν j’s (evidenced by non-overlapping 95% high posterior density intervals).
Fig 6
Fig 6. Biological cell-to-cell heterogeneity.
(a): Histogram of the posterior medians of gene-specific biological cell-to-cell heterogeneity variance contributions σ i (defined in Eq (6)) across the 7,895 biological genes. (b): For each of the 7,895 biological genes, posterior medians of biological cell-to-cell heterogeneity term δ i (log scale) against posterior medians of expression level μ i (log scale). Red lines represent the contours in Eq (7), related to HVG (log scale) at different levels of the variance contribution threshold γ H. Blue lines represent the equivalent contours linked to LVG at different levels of the variance contribution threshold γ L. These contours were estimated based on posterior medians of ϕ j’s, s j’s and θ.
Fig 7
Fig 7. HVG and LVG detection.
(a) and (b): for each of the 7,895 biological genes, gene-specific expression rate μ i (log scale) against the probability of being HVG (πiH(γH)) and the probability of being LVG (πiL(γL)), respectively. Setting the EFDR and the EFNR equal to 10%, the corresponding variance contribution thresholds are γ H = 79% and γ L = 41%. Black dashed lines located at optimal (i.e. when EFDR and EFNR coincide) evidence thresholds α H = 0.7925 and α L = 0.7650, respectively. The 133 and 589 genes classified as HVG and LVG are highlighted in red and blue, respectively.
Fig 8
Fig 8. Comparison of HVG detection among different methods.
For each of the 7,895 biological genes, posterior medians of biological cell-to-cell heterogeneity term δ i (log scale) against posterior medians of expression level μ i (log scale). While the methods described in [16] and [5] only provide a characterisation of HVG, BASiCS is able to detect those genes whose expression rates are stable among cells.
Fig 9
Fig 9. Cross-validation experiment.
(a) true versus estimated normalised expression level μ i for each of the 46 ERCC spike-in genes (log-scale). Our estimations are gathered around the true values, except for lowly expressed technical genes, where the experiment suggests a small positive bias. Estimations according to [5] are highly correlated with the true values, however the true scale has not been recovered (not surprising as their method is not designed to estimate the right scale of the s j’s, see Fig 3). (b): Fig 8(a) superimposing (in black) estimated values for each of the 46 ERCC spike-in genes. (c) and (d): panels (c) and (d) of Fig 7 superposing (in black) estimated probabilities associated to each of the 46 ERCC spike-in genes. As expected, none of the spike-in genes were detected as HVG by our criteria. On the other hand, there is strong evidence in favour of being LVG for 21 of the technical genes (and some others are borderline according to our conservative criteria).

Similar articles

Cited by

References

    1. Tang F, Barbacioru C, Wang Y, Nordman E, Lee C, et al. (2009) mRNA-Seq whole-transcriptome analysis of a single cell. Nature Methods 6: 377–382. 10.1038/nmeth.1315 - DOI - PubMed
    1. Patel AP, Tirosh I, Trombetta JJ, Shalek AK, Gillespie SM, et al. (2014) Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344: 1396–1401. 10.1126/science.1254257 - DOI - PMC - PubMed
    1. Saliba AE, Westermann AJ, Gorski SA, Vogel J (2014) Single-cell RNA-seq: advances and future challenges. Nucleic Acids Research: gku555. - PMC - PubMed
    1. Stegle O, Teichmann SA, Marioni JC (2015) Computational and analytical challenges in single-cell transcriptomics. Nature Review Genetics 16: 133–145. 10.1038/nrg3833 - DOI - PubMed
    1. Brennecke P, Anders S, Kim JK, Kolodziejczyk AA, Zhang X, et al. (2013) Accounting for technical noise in single-cell RNA-seq experiments. Nature Methods 10: 1093–1095. 10.1038/nmeth.2645 - DOI - PubMed

Publication types