Probabilistic PCA of censored data: accounting for uncertainties in the visualization of high-throughput single-cell qPCR data

doi:10.1093/bioinformatics/btu134

Bioinformatics. 2014 Jul 1;30(13):1867-75.

doi: 10.1093/bioinformatics/btu134. Epub 2014 Mar 10.

Florian Buettner¹, Victoria Moignard¹, Berthold Göttgens¹, Fabian J Theis²

Affiliations

¹ Institute of Computational Biology, Helmholtz-Zentrum München, 85764 Neuherberg, Germany, Department of Haematology, University of Cambridge, Cambridge Institute for Medical Research and Wellcome Trust & MRC Cambridge Stem Cell Institute, Cambridge CB2 0XY, UK and Department of Mathematics, TU München, 85748 Garching, Germany.
² Institute of Computational Biology, Helmholtz-Zentrum München, 85764 Neuherberg, Germany, Department of Haematology, University of Cambridge, Cambridge Institute for Medical Research and Wellcome Trust & MRC Cambridge Stem Cell Institute, Cambridge CB2 0XY, UK and Department of Mathematics, TU München, 85748 Garching, Germany.

PMID: 24618470
PMCID: PMC4071202
DOI: 10.1093/bioinformatics/btu134

Florian Buettner et al. Bioinformatics. 2014.


Abstract

Motivation: High-throughput single-cell quantitative real-time polymerase chain reaction (qPCR) is a promising technique allowing for new insights into complex cellular processes. However, the PCR reaction can be detected only up to a certain detection limit; a failed reaction could be due to low or absent expression, so the true expression level is unknown. Because this censoring can affect a high proportion of the data, it is one of the main challenges when dealing with single-cell qPCR data. Principal component analysis (PCA) is an important tool for visualizing the structure of high-dimensional data as well as for identifying subpopulations of cells. However, to date it is not clear how to perform a PCA of censored data. We present a probabilistic approach that accounts for the censoring and evaluate it for two typical datasets containing single-cell qPCR data.

Results: We use the Gaussian process latent variable model framework to account for censoring by introducing an appropriate noise model and allowing a different kernel for each dimension. We evaluate this new approach for two typical qPCR datasets (of mouse embryonic stem cells and blood stem/progenitor cells, respectively) by performing linear and non-linear probabilistic PCA. Taking the censoring into account results in a 2D representation of the data, which better reflects its known structure: in both datasets, our new approach results in a better separation of known cell types and is able to reveal subpopulations in one dataset that could not be resolved using standard PCA.

Availability and implementation: The implementation was based on the existing Gaussian process latent variable model toolbox (https://github.com/SheffieldML/GPmat); extensions for noise models and kernels accounting for censoring are available at http://icb.helmholtz-muenchen.de/censgplvm.

Figures

**Fig. 1.**
Fraction of censored data for two typical datasets: (A) fraction of non-detects in mESC data resolved by genes and (B) fractions of non-detects in blood stem cell data. Genes sorted in descending order of fraction of censored values. (C) distribution of Ct values for mESC data and (D) blood stem/progenitor cell data. The long tail of high Ct values continues until the LOD
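
The per-gene fraction of non-detects plotted in panels (A) and (B) can be computed along the following lines. This is a minimal sketch, not the authors' code: the function name `nondetect_fraction`, the small Ct matrix and the LOD value of 28 are hypothetical, and it assumes that a non-detect is stored either as a missing value or as a Ct at or above the LOD.

```python
import numpy as np

def nondetect_fraction(ct, lod):
    """Per-gene fraction of non-detects and gene order by descending fraction.

    ct  : cells x genes array of Ct values (NaN marks a failed reaction)
    lod : limit of detection (hypothetical value supplied by the caller)
    """
    ct = np.asarray(ct, dtype=float)
    # Assumption: a value counts as censored if it is missing or at/above the LOD.
    censored = np.isnan(ct) | (ct >= lod)
    frac = censored.mean(axis=0)
    # Stable sort so that ties keep their original gene order.
    order = np.argsort(-frac, kind="stable")
    return frac, order

# Hypothetical toy matrix: 4 cells x 3 genes.
ct = np.array([
    [20.1, 28.0, np.nan],
    [22.4, 28.0, 25.3],
    [28.0, 28.0, 24.8],
    [19.7, 26.5, np.nan],
])
frac, order = nondetect_fraction(ct, lod=28.0)
print(frac[order])
```

Sorting by `order` reproduces the descending arrangement of genes described in the caption.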

**Fig. 2.**
The 2D toy example (mixture of four Gaussians). In (A) the true values of Y are shown; Y₁ is right censored for values >2 (shown in grey). In (B) a PCA is performed with all censored values substituted with two resulting in a biased representation of the data. In (C) a PCA taking censoring into account using an appropriate noise model is shown resulting in a more realistic representation of the data. The uncertainty inherent in the generative model is visualized using greyscale as described in Section 2.2. This uncertainty is greatest on the far right where censoring occurs
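
The substitution approach of panel (B) can be illustrated with a short sketch. It covers only the biased baseline, not the censoring-aware model of panel (C). The component means, the standard deviation of 0.5, the sample size and the helper `pca` are assumptions chosen for illustration; only the mixture of four Gaussians, the censoring of Y₁ above 2 and the substitution with two come from the caption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed mixture of four Gaussians in 2D (means and spread are illustrative).
means = np.array([[-2.0, -2.0], [-2.0, 2.0], [3.0, -2.0], [3.0, 2.0]])
Y = np.vstack([m + 0.5 * rng.standard_normal((100, 2)) for m in means])

# Right-censor the first coordinate (Y1) at 2 and substitute the limit.
limit = 2.0
censored = Y[:, 0] > limit
Y_sub = Y.copy()
Y_sub[censored, 0] = limit

def pca(X, n_components=2):
    """Standard PCA via SVD of the centred data matrix."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T
    explained_var = s[:n_components] ** 2 / (len(X) - 1)
    return scores, Vt[:n_components], explained_var

_, _, var_true = pca(Y)
_, _, var_sub = pca(Y_sub)
print("fraction censored:", censored.mean())
print("variance per component, true data:", var_true)
print("variance per component, substituted:", var_sub)
```

Substitution collapses every censored point onto the limit, so the spread along Y₁ shrinks and the principal components are estimated from distorted data.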

**Fig. 3.**
Probit noise model for three different values of λ (A) and Gaussian noise model for three different values of its parameter (B)
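
To give a feel for how λ shapes a probit noise model, the sketch below evaluates a probit curve for three slopes. The parameterization is an assumption, since this record does not state the paper's exact form: the probability of a non-detect is taken to be Φ(λ(f − limit)), where Φ is the standard normal CDF, f is the latent value and `limit` is the censoring threshold. The function name and the numerical values are hypothetical.

```python
import math

def probit_censor_prob(f, limit, lam):
    """Assumed probit model: P(non-detect | f) = Phi(lam * (f - limit))."""
    z = lam * (f - limit)
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Three illustrative slopes; larger lam approaches a hard threshold at the limit.
for lam in (0.5, 2.0, 10.0):
    probs = [probit_censor_prob(f, limit=2.0, lam=lam) for f in (1.0, 2.0, 3.0)]
    print(lam, [round(p, 3) for p in probs])
```

Under this assumed form, a small λ expresses uncertainty about whether a value near the limit is censored, whereas a large λ treats the limit as a sharp cut-off.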

**Fig. 4.**
(**A–C**) Distribution of residuals between posterior means and the normalized LODs for different approaches. (**D–F**) PCA with censored data from mESC dataset. Standard PCA with substitution approach (D), taking censoring into account with probit noise model and fixed λ (E) and probit noise model with λ learnt from data (F). (**G–J**) GPLVM with RBF kernel for mESC data. Standard GPLVM with substitution approach (G), taking censoring into account with probit noise model and fixed λ (H) and probit noise model with λ learnt from data (I). In (I) the dashed lines indicate two distinct subpopulations at the 16-cell stage and ICM

**Fig. 5.**
Difference in gene expression between the two subclusters at the 16-cell stage for different mappings. The error bars show the variation of gene expression within the smaller subcluster (one standard deviation in each direction). For convenience, genes with the greatest differences are labelled in the plots

**Fig. 6.**
(**A–C**) Distribution of residuals between posterior means and the normalized LODs for different approaches. (**D–F**) PCA with censored data from blood dataset. Standard PCA with substitution approach (D), taking censoring into account with probit noise model and fixed λ (E) and probit noise model with λ learnt from data (F). (**G–J**) GPLVM with RBF kernel for blood data. Standard GPLVM with substitution approach (G), taking censoring into account with probit noise model and fixed λ (H) and probit noise model with λ learnt from data (I). The background intensity indicates the relative uncertainty of the mapping with black pixels corresponding to the highest uncertainty of the mapping

Cited by

qRT-PCR evaluation of the transcriptional response of zebra mussel to heavy metals.
Jaumot J, Navarro A, Faria M, Barata C, Tauler R, Piña B. BMC Genomics. 2015 May 6;16(1):354. doi: 10.1186/s12864-015-1567-4. PMID: 25943386.
MISC: missing imputation for single-cell RNA sequencing data.
Yang MQ, Weissman SM, Yang W, Zhang J, Canaann A, Guan R. BMC Syst Biol. 2018 Dec 14;12(Suppl 7):114. doi: 10.1186/s12918-018-0638-y. PMID: 30547798.
Single-cell gene expression profiling and cell state dynamics: collecting data, correlating data points and connecting the dots.
Marr C, Zhou JX, Huang S. Curr Opin Biotechnol. 2016 Jun;39:207-214. doi: 10.1016/j.copbio.2016.04.015. Epub 2016 May 23. PMID: 27152696. Review.
Revealing the vectors of cellular identity with single-cell genomics.
Wagner A, Regev A, Yosef N. Nat Biotechnol. 2016 Nov 8;34(11):1145-1160. doi: 10.1038/nbt.3711. PMID: 27824854. Review.
Dimension Reduction and Clustering Models for Single-Cell RNA Sequencing Data: A Comparative Study.
Feng C, Liu S, Zhang H, Guan R, Li D, Zhou F, Liang Y, Feng X. Int J Mol Sci. 2020 Mar 22;21(6):2181. doi: 10.3390/ijms21062181. PMID: 32235704.
