Genome Biol. 2015 Nov 2;16:241.
doi: 10.1186/s13059-015-0805-z.

ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis

Emma Pierson et al. Genome Biol.

Abstract

Single-cell RNA-seq data allow insight into normal cellular function and various disease states through molecular characterization of gene expression at the single-cell level. Dimensionality reduction of such high-dimensional data sets is essential for visualization and analysis, but single-cell RNA-seq data are challenging for classical dimensionality-reduction methods because of the prevalence of dropout events, which lead to zero-inflated data. Here, we develop a dimensionality-reduction method, (Z)ero (I)nflated (F)actor (A)nalysis (ZIFA), which explicitly models the dropout characteristics, and show that it improves modeling accuracy on simulated and biological data sets.

Figures

Fig. 1
Zero-inflation in single-cell expression data. a Illustrative distribution of expression levels for three randomly chosen genes showing an abundance of single cells exhibiting null expression [15]. b Heat maps showing the relationship between dropout rate and mean non-zero expression level for three published single-cell data sets [3, 5, 14] including an approximate double exponential model fit. c Flow diagram illustrating the data generative process used by ZIFA. d Illustrative plot showing how different values of λ in the dropout-mean expression relationship (blue lines) can modulate the latent gene expression distribution to give a range of observed zero-inflated data
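The generative process in panels c and d can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the caption only states that the dropout-mean relationship is approximately double exponential and is governed by λ, so the specific form `exp(-lam * X**2)` is an assumption here, as are the function name `simulate_zifa`, the loading matrix `A`, the gene means `mu`, and the noise level `sigma`.

```python
import numpy as np


def simulate_zifa(n_cells=200, n_genes=50, n_latent=2, lam=0.1, sigma=0.3, seed=0):
    """Draw one data set from an assumed ZIFA-style generative process."""
    rng = np.random.default_rng(seed)

    # Latent low-dimensional positions Z, one row per cell.
    Z = rng.standard_normal((n_cells, n_latent))

    # Hypothetical loadings A and gene means mu (values chosen for illustration).
    A = rng.uniform(-0.5, 0.5, size=(n_latent, n_genes))
    mu = rng.uniform(2.0, 4.0, size=n_genes)

    # Latent (pre-dropout) expression X: linear factor model plus Gaussian noise.
    X = Z @ A + mu + sigma * rng.standard_normal((n_cells, n_genes))

    # Assumed dropout probability: decays with squared latent expression,
    # so weakly expressed entries are dropped more often.
    p_dropout = np.exp(-lam * X ** 2)
    dropped = rng.random((n_cells, n_genes)) < p_dropout

    # Observed data Y: dropped entries are set to exactly zero.
    Y = np.where(dropped, 0.0, X)
    return Z, X, Y


if __name__ == "__main__":
    for lam in (0.05, 0.2, 1.0):
        _, _, Y = simulate_zifa(lam=lam)
        print(f"lambda={lam}: fraction of zeros = {np.mean(Y == 0):.2f}")
```

Under this assumed form, increasing `lam` lowers the fraction of zeros, which matches the direction described for λ in the Fig. 3 caption (larger λ, lower dropout rate).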
Fig. 2
Comparison of exact and block-based EM algorithms. Plots show the correlation between expectations computed using the exact and block-based EM algorithms for latent low-dimensional positions Z (a) and latent observations X (b). The comparison used a simulated data set with 500 genes and 200 cells. A block size of 50 was chosen for the approximate approach
Fig. 3
Performance comparison of dimensionality-reduction techniques. a Toy simulated data example illustrating the performance of ZIFA compared to standard dimensionality-reduction algorithms. b Performance on simulated data sets based on correlation score between the estimated and true latent distances as a function of λ (larger λ, lower dropout rate), number of genes and latent dimensions, and noise level used in the simulations. c Plots showing the divergence between the predictive and empirical data distributions as a function of dropout rate and mean expression level for FA, PPCA and ZIFA. Illustrative predictive performance and model fits (red, color online) on the T-cell single-cell data set (black) [3]
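The correlation score in panel b compares cell-to-cell distances in the estimated latent space with those in the true latent space. A minimal sketch of such a score follows. The caption does not say which correlation coefficient was used, so Pearson correlation is an assumption here, and the function names `pairwise_distances` and `latent_distance_score` are hypothetical.

```python
import numpy as np


def pairwise_distances(P):
    """Euclidean distances between all pairs of rows, as a flat vector."""
    diff = P[:, None, :] - P[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    # Keep each unordered pair once (upper triangle, excluding the diagonal).
    upper = np.triu_indices(P.shape[0], k=1)
    return D[upper]


def latent_distance_score(Z_true, Z_est):
    """Pearson correlation between true and estimated pairwise distances."""
    d_true = pairwise_distances(Z_true)
    d_est = pairwise_distances(Z_est)
    return np.corrcoef(d_true, d_est)[0, 1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Z_true = rng.standard_normal((100, 2))
    # A noisy copy of the true positions stands in for an estimated embedding.
    Z_est = Z_true + 0.1 * rng.standard_normal((100, 2))
    print(latent_distance_score(Z_true, Z_est))
```

Comparing distances rather than raw coordinates makes the score insensitive to rotations of the latent space, which factor-analysis-type models cannot identify.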
Fig. 4
Consistency of cell-to-cell distances. Box plots showing the correlation between distance matrices for PPCA and ZIFA from 100 gene sets selected at random from (a) differentiating T cells [3], (b) 11 populations [15], (c) myoblasts [5] and (d) bone marrow [14]. The distance matrices produced by ZIFA are more correlated with each other than are the distance matrices produced by PPCA
Fig. 5
Cell type separability. Plots show relative cell type misclassification error rates after applying PCA and ZIFA to a random subset of 500 genes sampled from the Pollen [15] and Usoskin [16] data sets. Performance was measured based on error rates from (a, c) linear and (b, d) quadratic discriminant classifiers. Positive values indicate better performance with PCA, and negative values indicate better performance with ZIFA
Fig. 6
Understanding the relationship between cell type separability and dropout rate. This is a comparison of dimensionality-reduction techniques for cell typing. These plots show cell type misclassification rates (using QDA) as a function of dropout rate when PCA or ZIFA is used for preprocessing on simulated data sets based on the (a) Pollen [15] and (b) Usoskin [16] data sets. The exact PCA results correspond to a ground-truth baseline in which PCA is applied to simulated data with no dropout events

References

    1. Shapiro E, Biezuner T, Linnarsson S. Single-cell sequencing-based technologies will revolutionize whole-organism science. Nat Rev Genet. 2013;14(9):618–630. doi: 10.1038/nrg3542.
    2. Blainey PC, Quake SR. Dissecting genomic diversity, one cell at a time. Nat Methods. 2014;11(1):19–21. doi: 10.1038/nmeth.2783.
    3. Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, et al. Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat Biotechnol. 2015;33(2):155–160. doi: 10.1038/nbt.3102.
    4. Treutlein B, Brownfield DG, Wu AR, Neff NF, Mantalas GL, Espinoza FH, et al. Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature. 2014;509(7500):371–375. doi: 10.1038/nature13173.
    5. Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M, et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol. 2014;32(4):381–386. doi: 10.1038/nbt.2859.
