Bioinformatics. 2014 Jul 1;30(13):1867-75.
doi: 10.1093/bioinformatics/btu134. Epub 2014 Mar 10.

Probabilistic PCA of censored data: accounting for uncertainties in the visualization of high-throughput single-cell qPCR data


Florian Buettner et al. Bioinformatics. 2014.

Abstract

Motivation: High-throughput single-cell quantitative real-time polymerase chain reaction (qPCR) is a promising technique allowing for new insights into complex cellular processes. However, the PCR reaction can be detected only up to a certain detection limit; failed reactions may reflect low or absent expression, so the true expression level remains unknown. Because this censoring can affect a high proportion of the data, it is one of the main challenges when dealing with single-cell qPCR data. Principal component analysis (PCA) is an important tool for visualizing the structure of high-dimensional data as well as for identifying subpopulations of cells. However, to date it is not clear how to perform a PCA of censored data. We present a probabilistic approach that accounts for the censoring and evaluate it for two typical datasets containing single-cell qPCR data.

Results: We use the Gaussian process latent variable model framework to account for censoring by introducing an appropriate noise model and allowing a different kernel for each dimension. We evaluate this new approach for two typical qPCR datasets (of mouse embryonic stem cells and blood stem/progenitor cells, respectively) by performing linear and non-linear probabilistic PCA. Taking the censoring into account results in a 2D representation of the data, which better reflects its known structure: in both datasets, our new approach results in a better separation of known cell types and is able to reveal subpopulations in one dataset that could not be resolved using standard PCA.

Availability and implementation: The implementation was based on the existing Gaussian process latent variable model toolbox (https://github.com/SheffieldML/GPmat); extensions for noise models and kernels accounting for censoring are available at http://icb.helmholtz-muenchen.de/censgplvm.
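As an illustration of what a noise model that accounts for censoring can look like, the following is a minimal sketch and not the authors' implementation. It assumes a Gaussian density for detected values and a probit term, Phi(lam * (f - limit)), for non-detects; this parametrization, the function names and the numerical floor are assumptions made for illustration only.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gaussian_log_density(y, f, sigma):
    """Log-density of a detected value y given the latent value f."""
    z = (y - f) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def probit_censored_prob(f, limit, lam):
    """Probability that latent value f is reported as a non-detect.

    Assumed form: Phi(lam * (f - limit)). A larger lam gives a sharper
    transition around the detection limit.
    """
    return norm_cdf(lam * (f - limit))


def log_likelihood(f, y, is_censored, limit, sigma, lam):
    """Log-likelihood of one measurement under the assumed mixed noise model."""
    if is_censored:
        # Floor the probability to avoid log(0) far below the limit.
        return math.log(max(probit_censored_prob(f, limit, lam), 1e-300))
    return gaussian_log_density(y, f, sigma)
```

For a censored measurement the observed value carries no information beyond "above the limit", so only the latent value `f` enters the likelihood; for a detected measurement the usual Gaussian term is used.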


Figures

Fig. 1.
Fraction of censored data for two typical datasets: (A) fraction of non-detects in mESC data resolved by genes and (B) fractions of non-detects in blood stem cell data. Genes sorted in descending order of fraction of censored values. (C) distribution of Ct values for mESC data and (D) blood stem/progenitor cell data. The long tail of high Ct values continues until the limit of detection (LOD)
Fig. 2.
The 2D toy example (mixture of four Gaussians). In (A) the true values of Y are shown; Y1 is right censored for values >2 (shown in grey). In (B) a PCA is performed with all censored values substituted with the value 2, resulting in a biased representation of the data. In (C) a PCA taking censoring into account using an appropriate noise model is shown, resulting in a more realistic representation of the data. The uncertainty inherent in the generative model is visualized using greyscale as described in Section 2.2. This uncertainty is greatest on the far right where censoring occurs
Fig. 3.
Probit noise model for three different values of λ (A) and Gaussian noise model for three different values of its parameter (B)
Fig. 4.
(A–C) Distribution of residuals between posterior means and the normalized LODs for different approaches. (D–F) PCA with censored data from mESC dataset. Standard PCA with substitution approach (D), taking censoring into account with probit noise model and fixed λ (E) and probit noise model with λ learnt from data (F). (G–J) GPLVM with RBF kernel for mESC data. Standard GPLVM with substitution approach (G), taking censoring into account with probit noise model and fixed λ (H) and probit noise model with λ learnt from data (I). In (I) the dashed lines indicate two distinct subpopulations at the 16-cell stage and ICM
Fig. 5.
Difference in gene expression between the two subclusters at the 16-cell stage for different mappings. The error bars show the variation of gene expression within the smaller subcluster (one standard deviation in each direction). For convenience, genes with the greatest differences are labelled in the plots
Fig. 6.
(A–C) Distribution of residuals between posterior means and the normalized LODs for different approaches. (D–F) PCA with censored data from blood dataset. Standard PCA with substitution approach (D), taking censoring into account with probit noise model and fixed λ (E) and probit noise model with λ learnt from data (F). (G–J) GPLVM with RBF kernel for blood data. Standard GPLVM with substitution approach (G), taking censoring into account with probit noise model and fixed λ (H) and probit noise model with λ learnt from data (I). The background intensity indicates the relative uncertainty of the mapping with black pixels corresponding to the highest uncertainty of the mapping
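The substitution bias described for the toy example in Fig. 2 can be reproduced with a small sketch. The component centres, spread, sample size and seed below are hypothetical choices, not the values used in the paper, and the code covers only the substitution approach of panel (B), not the censoring-aware model of panel (C).

```python
import math
import random


def sample_mixture(n_per_component, centers, sd, seed=0):
    """Draw points from an equal-weight mixture of isotropic 2D Gaussians."""
    rng = random.Random(seed)
    points = []
    for cx, cy in centers:
        for _ in range(n_per_component):
            points.append((rng.gauss(cx, sd), rng.gauss(cy, sd)))
    return points


def censor_first_coordinate(points, limit):
    """Right-censor Y1 and substitute censored values with the limit."""
    return [(min(y1, limit), y2) for y1, y2 in points]


def covariance_2d(points):
    """Return (s11, s12, s22) of the sample covariance matrix."""
    n = len(points)
    m1 = sum(p[0] for p in points) / n
    m2 = sum(p[1] for p in points) / n
    s11 = sum((p[0] - m1) ** 2 for p in points) / (n - 1)
    s22 = sum((p[1] - m2) ** 2 for p in points) / (n - 1)
    s12 = sum((p[0] - m1) * (p[1] - m2) for p in points) / (n - 1)
    return s11, s12, s22


def leading_axis(s11, s12, s22):
    """Largest eigenvalue and axis angle (radians) of a symmetric 2x2 matrix."""
    diff = s11 - s22
    root = math.sqrt(diff * diff + 4.0 * s12 * s12)
    eigenvalue = 0.5 * (s11 + s22 + root)
    angle = 0.5 * math.atan2(2.0 * s12, diff)
    return eigenvalue, angle


# Hypothetical toy setup: four clusters, Y1 right censored above 2.
centers = [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0), (3.0, 3.0)]
limit = 2.0
true_points = sample_mixture(200, centers, sd=0.5, seed=0)
substituted = censor_first_coordinate(true_points, limit)

true_cov = covariance_2d(true_points)
sub_cov = covariance_2d(substituted)
fraction_censored = sum(1 for y1, _ in true_points if y1 > limit) / len(true_points)
_, sub_angle = leading_axis(*sub_cov)

print("fraction censored:", round(fraction_censored, 3))
print("Var(Y1) true vs substituted:", round(true_cov[0], 3), round(sub_cov[0], 3))
print("leading axis angle after substitution (rad):", round(sub_angle, 3))
```

Substitution collapses the censored clusters onto the limit, which shrinks the variance along Y1 and pulls the leading principal axis towards Y2. This distortion is what a censoring-aware noise model is meant to avoid.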
