Genome Res. 2021 Jan;31(1):101-109.
doi: 10.1101/gr.267047.120. Epub 2020 Nov 20.

Predictive modeling of single-cell DNA methylome data enhances integration with transcriptome data

Yasin Uzun et al. Genome Res. 2021 Jan.

Abstract

Single-cell DNA methylation data have become increasingly abundant and have uncovered many genes with a positive correlation between expression and promoter methylation, challenging the common dogma based on bulk data. However, computational tools for analyzing single-cell methylome data lag far behind. A number of tasks, including cell type calling and integration with transcriptome data, require a robust gene activity matrix as a prerequisite, and constructing one is itself challenging. The advent of multi-omics data enables measurement of both DNA methylation and gene expression in the same single cells. Although such data are rather sparse, they are sufficient to train supervised models that capture the complex relationship between DNA methylation and gene expression and predict gene activities at the single-cell level. Here, we present methylome association by predictive linkage to expression (MAPLE), a computational framework that learns the association between DNA methylation and expression using both gene- and cell-dependent statistical features. Using multiple data sets generated with different experimental protocols, we show that using predicted gene activity values significantly improves several analysis tasks, including clustering, cell type identification, and integration with transcriptome data. Application of MAPLE revealed several biological insights into the relationship between methylation and gene expression, including the asymmetric importance of methylation signals around the transcription start site for predicting gene expression, and the increased predictive power of methylation signals in promoters located outside CpG islands and shores. With the rapid accumulation of single-cell epigenomics data, MAPLE provides a general framework for integrating such data with transcriptome data.

Figures

Figure 1.
Schematic overview of the computational framework. (A) Generation of meta-cells from single-cell DNA methylation data. The gene-by-cell DNA methylation matrix is used for principal component analysis (PCA) to reduce the dimensionality. Each point in the PCA plot is a single cell. A meta-cell is the set of k-nearest cells to an individual cell in the PCA space. (B) Prediction of gene activity level by combining DNA methylation and sequence information in promoter. (TSS) Transcription start site; (ML) methylation level. Each horizontal box represents a genomic bin. Each circle represents a CpG site; a filled circle represents methylated cytosine, and an empty circle represents unmethylated cytosine in a meta-cell. CpG ratio is defined as the percentage of CpG dinucleotides in a genomic bin. CpG methylation level is defined as the ratio of the number of methylated CpG calls to all CpG calls. (C) Integration of single-cell methylation and single-cell RNA-seq data based on predicted gene activity scores using single-cell DNA methylation data.
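The definitions in the Figure 1 caption (meta-cell, CpG ratio, CpG methylation level) can be illustrated with a minimal sketch. This is not the MAPLE implementation. The function names, the use of Euclidean distance, counting a cell among its own k nearest cells, the denominator of the CpG ratio (all dinucleotide positions in the bin), and pooling CpG calls over the members of a meta-cell are all assumptions made for illustration. The PCA coordinates are taken as given.

```python
import math

def cpg_ratio(sequence):
    """Percentage of dinucleotide positions in a bin that are CpG.
    Assumption: the denominator is every overlapping dinucleotide position."""
    seq = sequence.upper()
    n_positions = len(seq) - 1
    if n_positions <= 0:
        return 0.0
    n_cpg = sum(1 for i in range(n_positions) if seq[i:i + 2] == "CG")
    return 100.0 * n_cpg / n_positions

def methylation_level(n_methylated, n_total):
    """Ratio of methylated CpG calls to all CpG calls; None if there are no calls."""
    if n_total == 0:
        return None
    return n_methylated / n_total

def meta_cell(index, pca_coords, k):
    """Indices of the k cells nearest to cell `index` in PCA space.
    Assumption: the cell itself (distance 0) counts as one of the k."""
    origin = pca_coords[index]
    order = sorted(range(len(pca_coords)),
                   key=lambda j: math.dist(origin, pca_coords[j]))
    return order[:k]

def meta_cell_methylation(index, pca_coords, calls, k):
    """Pool (methylated, total) CpG calls for one genomic bin over the
    members of a meta-cell, then compute the methylation level.
    `calls[j]` is a hypothetical (methylated, total) pair for cell j."""
    members = meta_cell(index, pca_coords, k)
    methylated = sum(calls[j][0] for j in members)
    total = sum(calls[j][1] for j in members)
    return methylation_level(methylated, total)
```

Pooling calls over neighbouring cells is one way to reduce the sparsity of per-cell methylation calls: a bin with no calls in a single cell can still receive a value from its neighbours.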
Figure 2.
Prediction accuracy of gene expression using DNA methylation data and ensemble learning. (A) Heatmap showing global Spearman's correlation coefficients between observed gene expression and predicted gene activity for all genes across all cells in a data set. Rows represent training data sets, and columns represent test data sets. Row and column Roman numerals correspond to the data sets shown. (B) Distribution of Spearman's correlation coefficients across genes using the data set of Clark et al. (2018) as the training set. Each data point represents one cell. MAPLE (EL) correlations are significantly higher than those of MPD (P < 10⁻¹⁶, one-sided t-test for all three comparisons).
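Spearman's correlation between observed expression and predicted gene activity, as reported in Figure 2, is the Pearson correlation of the two rank vectors. The sketch below is a plain-Python illustration under stated assumptions: ties receive their average rank, and the function names are hypothetical rather than taken from MAPLE.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; NaN when either input is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)

def spearman(observed, predicted):
    """Spearman's rho: Pearson correlation of the average ranks."""
    if len(observed) != len(predicted):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(observed), average_ranks(predicted))
```

For a global coefficient as in panel A, the two inputs would be the flattened gene-by-cell matrices; for a per-cell coefficient as in panel B, they would be one cell's values across genes.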
Figure 3.
Predicted gene activity using methylome data improves cell subtype identification for neurons. (A) UMAP of clustering result generated using mean promoter demethylation (MPD) as the input. (B) Violin plot of MPD values for marker genes for excitatory (Tyro3, Slc17a7, Tbr1, Itpka) and inhibitory (Slc6a1, Erbb4) neurons. (C) Same as A but using predicted gene activity (PGA) as the input. (D) Same as B but using predicted gene activity as the input.
Figure 4.
Predictive modeling improves integration with transcriptome data of cell lines. (A) Cell heterogeneity based on transcriptome and DNA methylome data. (Left) UMAP using RNA-seq data as the input. Color scale represents the log-normalized (using Seurat) expression level (read counts) of Esrrb for EBs. (Middle) UMAP using mean promoter demethylation as the input. Color scale represents the MPD (1 − mean methylation level) of the Esrrb gene. (Right) UMAP using MAPLE-predicted gene activity based on DNA methylation data as the input. Color scale represents the MAPLE-predicted gene activity levels of Esrrb. (B) Same as A, but for the T gene. (C) UMAP based on integrated RNA-seq and DNA methylation data. Mean promoter demethylation (MPD) was used as the input for data integration using Seurat. (EB) embryoid body; (ESC) embryonic stem cell. (D) Density clustering of the data shown in the UMAP in C. (E) Confusion matrix plot based on the clustering result shown in D, illustrating the agreement between cell type assignment based on clustering and true cell type. Size of each quadrant is proportional to the number of cells classified. (F) Same as C, but using predicted gene activity as the input. (G) Same as D, but using predicted gene activity as the input. (H) Same as E, but using predicted gene activity as the input. χ² test P-value for the confusion matrices in G and H is 0.002.
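The baseline score named in this caption, MPD = 1 − mean methylation level, and the confusion matrix of cluster assignment against true cell type can be sketched as follows. This is an illustration only: the function names are hypothetical, and averaging over per-site methylation levels while skipping sites without calls is an assumption, since the caption does not say how the mean is taken.

```python
from collections import Counter

def mean_promoter_demethylation(methylation_levels):
    """MPD = 1 - mean methylation level of a promoter.
    Assumption: `methylation_levels` holds one value in [0, 1] per CpG site,
    with None for sites that have no calls; those sites are skipped."""
    observed = [m for m in methylation_levels if m is not None]
    if not observed:
        return None
    return 1.0 - sum(observed) / len(observed)

def confusion_counts(true_labels, cluster_labels):
    """Count cells for every (true cell type, assigned cluster) pair."""
    if len(true_labels) != len(cluster_labels):
        raise ValueError("inputs must have the same length")
    return Counter(zip(true_labels, cluster_labels))
```

A fully unmethylated promoter gets an MPD of 1 and a fully methylated one gets 0, so MPD treats demethylation as a direct proxy for activity.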
Figure 5.
Predictive modeling improves integration with transcriptome data of primary tissues. (A) UMAP plots for integrated expression and DNA methylation data. Mean promoter demethylation (MPD) was used as the input for data integration using Seurat. (B) Same as A, but using MAPLE-predicted gene activity as the input. (C) Pie chart showing the percentage of correctly and misclassified cells using scDNA-methylation data, based on k-nearest neighbor (k-NN) classification on the scRNA-seq cells for the MPD-based UMAP in A. (D) Same as C, but for the MAPLE-based UMAP in B. χ² test P-value for the comparison between correct and misclassifications in C and D is 3.6 × 10⁻¹⁰.
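The evaluation described in this caption has two steps: label each methylome cell by k-NN vote among the scRNA-seq cells in the shared embedding, then compare correct and incorrect counts between the two inputs with a χ² test. A minimal sketch follows, under stated assumptions: Euclidean distance and majority vote are used, the test is Pearson's χ² on a 2 × 2 table without continuity correction (the caption does not say whether a correction was applied), and all function names are hypothetical.

```python
import math
from collections import Counter

def knn_label(query, reference_points, reference_labels, k):
    """Majority label among the k reference cells nearest to `query`."""
    order = sorted(range(len(reference_points)),
                   key=lambda i: math.dist(query, reference_points[i]))
    votes = Counter(reference_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for the table [[a, b], [c, d]].
    Returns (statistic, p_value) with 1 degree of freedom and
    no continuity correction (an assumption)."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a nonzero total")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value
```

Here the rows of the table would be the two inputs (MPD and predicted gene activity) and the columns the numbers of correctly classified and misclassified cells.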

References

    1. Alipanahi B, Delong A, Weirauch MT, Frey BJ. 2015. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol 33: 831–838. doi: 10.1038/nbt.3300
    2. Angermueller C, Clark SJ, Lee HJ, Macaulay IC, Teng MJ, Hu TX, Krueger F, Smallwood S, Ponting CP, Voet T, et al. 2016a. Parallel single-cell sequencing links transcriptional and epigenetic heterogeneity. Nat Methods 13: 229–232. doi: 10.1038/nmeth.3728
    3. Angermueller C, Pärnamaa T, Parts L, Stegle O. 2016b. Deep learning for computational biology. Mol Syst Biol 12: 878. doi: 10.15252/msb.20156651
    4. Angermueller C, Lee HJ, Reik W, Stegle O. 2017. DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning. Genome Biol 18: 67. doi: 10.1186/s13059-017-1189-z
    5. Antequera F. 2003. Structure, function and evolution of CpG island promoters. Cell Mol Life Sci 60: 1647–1658. doi: 10.1007/s00018-003-3088-6
