Abstract
Genomic imprinting results in parental-specific gene expression. Imprinted genes are involved in the etiology of rare syndromes and have been associated with common diseases such as diabetes and cancer. Standard RNA bulk cell sequencing applied to whole-tissue samples has been used to detect imprinted genes in human and mouse models. However, lowly expressed genes cannot be detected by using RNA bulk approaches. Here, we report an original and robust method that combines single-cell RNA-seq and whole-genome sequencing into an optimized statistical framework to analyze genomic imprinting in specific cell types and in different individuals. Using samples from the probands of 2 family trios and 3 unrelated individuals, 1,084 individual primary fibroblasts were RNA sequenced and more than 700,000 informative heterozygous single-nucleotide variations (SNVs) were genotyped. The allele-specific coverage per gene of each SNV in each single cell was used to fit a beta-binomial distribution to model the likelihood of a gene being expressed from one and the same allele. Genes presenting a significant aggregate allelic ratio (between 0.9 and 1) were retained to identify of the allelic parent of origin. Our approach allowed us to validate the imprinting status of all of the known imprinted genes expressed in fibroblasts and the discovery of nine putative imprinted genes, thereby demonstrating the advantages of single-cell over bulk RNA-seq to identify imprinted genes. The proposed single-cell methodology is a powerful tool for establishing a cell type-specific map of genomic imprinting.
Keywords: imprinting, allele-specific expression, single cell
Introduction
Variations in allelic expression in the genome contribute to human diseases and traits.1 Allele-specific expression (ASE) is controlled by inherited or de novo genomic variants in regulatory regions, such as enhancers, insulators, or promoters.2 Notable exceptions to the laws of Mendelian inheritance due to ASE include genomic imprinting, X chromosome inactivation (XCI), and uniparental disomy (UPD). Specifically, genomic imprinting induces the silencing of one of the alleles when it is transmitted by one of the parents, resulting in a parent-of-origin-dependent allelic transcription.3, 4, 5 Owing to their monoallelic status, imprinted genes are easily affected by heterozygous deleterious mutations, CNVs, and alterations in expression dosage. For instance, some imprinted genes play a decisive role in growth control and developmental processes.6 The disruption of these genes causes severe genetic syndromes including Prader-Willi syndrome (PWS [MIM: 176270]), Angelman syndrome (AS [MIM: 105830]),7, 8, 9 Beckwith-Wiedemann syndrome (BWS [MIM: 130650]),10, 11, 12 Silver-Russell syndrome (SRS [MIM: 180860]),13, 14 Temple syndrome (TS14 [MIM: 616222]), and Kagami-Ogata syndrome (KOS14 [MIM: 608149]).15
In the early 1990s, Igf2 (paternally expressed) and H19 (maternally expressed) were the first two imprinted genes to be discovered in mice.16, 17, 18 To date, approximatively 150 human imprinted genes have been discovered,19, 20 but the quest to identify of the whole repertoire of imprinted genes is still ongoing. Recent studies have concluded that cell type-specific imprinted transcription is pervasive and underestimated.20 Novel approaches are thus required to analyze genomic imprinting in specific cell types.
Here, we took advantage of the maintenance of imprinting marks through mitosis and the recent advancements in single-cell genomics to devise a method that combines single-cell RNA (RNA-seq) and whole-genome sequencing to accurately assess ASE and parent-of-origin effects on gene transcription. Previously, our lab and others demonstrated that single-cell RNA-seq facilitates the identification of ASE patterns.21, 22, 23, 24 In this study, we sequenced the RNA of 1,084 single fibroblasts from 5 individuals. For 2 of these individuals, the parental DNA was available. We determined the transcripts expressed exclusively from the same allele across cells by applying a statistical framework that accounts for heterogeneous coverage, sampling bias, and allelic dropout. We identified the following genes in fibroblasts: 9 imprinted genes (ADTRP, ATP5EP2, SMOC1, SNORD114-11 / SNORD114-10, SNORD114-17 / SNORD114-16, SNORD114-19 / SNORD114-18, SNORD114-21 / SNORD114-20, SNORD114-26 / SNORD114-25, and PRIM2) with parent-of-origin effects and 33 genes with a strong skewed monoallelic imbalance for which we could not identify parental origin-dependent expression.
In this study, we reported the accuracy of a single-cell method to assess the imprinted expression of genes with unprecedented resolution. The method applies to genes expressed in a few cells, and thus unrecoverable using a standard RNA-seq.
Material and Methods
Ethical Statement
The study was approved by the ethics committee of the University Hospitals of Geneva, and written informed consent was obtained from both parents of each individual prior to the study.
Samples
We studied single cells from six primary fibroblast cell lines as summarized in Table S1. For each twin, we obtained one cell line from skin biopsies. Single-cell captures and bulk RNA sequencing were conducted independently on each primary cell line. For the ASE analysis, the single-cell RNA-seq data from the two twins were processed together and considered as one individual because they shared the same genotype. RNA-seq bulk samples were taken from the GenCord collection.25
Cell Growth
Cells were cultured in DMEM GlutaMAX (Life Technologies) supplemented with 10% fetal bovine serum (Life Technologies) and 1% penicillin/streptomycin/fungizone mix (Amimed, BioConcept) at 37°C in a 5% CO2 atmosphere. The day before the single-cell capture experiment, the cells were trypsinized (Trypsin 0.05%-EDTA, Life Technologies) and plated at a density of 0.3 × 106 cells/100-mm dish.
Single-Cell Capture
Single-cell captures were performed on the C1 single-cell auto prep system (Fluidigm) using the cell load script (1772×/1773×). Trypsinized cells were counted and sized using the cell counter and analyzer system Casy one (Shärfe system). The average size of human primary fibroblasts was 20 μm (15–25 μm) and 4,500–6,000 dissociated live cells were loaded into the assay well of a primed microfluidic array (C1 single-cell auto prep array for mRNA-seq [17–25 μm], 96 chambers, Fluidigm) according to the manufacturer’s instructions. After 30 min of the capture procedure, we visualized one-by-one the 96 chambers using an inverted phase contrast microscope to annotate the chamber contents. We considered chambers that contained one single cell by visual inspection in further analyses; those chambers that contained debris, multiple cells, and damaged cells and empty chambers were discarded.
Single-Cell RNA Sequencing
All of the cDNA preparations were performed on a C1 single-cell array for mRNA-seq using the C1 single-cell auto prep system (Fluidigm). We used the SMARTer Ultra Low RNA kit for Illumina Sequencing (version 2, Clontech) for cell lysis and cDNA synthesis according to the manufacturer’s instructions. We assessed cDNA quality using a 2100 Bioanalyzer (Agilent) with high sensitivity DNA chips (Agilent) and quantified the cDNA using the Qubit dsDNA BR assay kit (Invitrogen). Sequencing libraries were prepared with 0.3 ng of pre-amplified cDNA using the Nextera XT DNA kit (Illumina) according to the manufacturer’s instructions. Libraries were sequenced on an Illumina HiSeq2000 machine as 100 bp single-end reads.
Bulk RNA Sequencing
Libraries were prepared with 1 ng of total RNA using the TruSeq RNA kit (Illumina) according to the manufacturer’s instructions. Libraries were sequenced on an Illumina HiSeq2000 machine as 100 bp single-end reads.
Whole-Genome Sequencing
Cells were harvested on the day of the single-cell capture. Genomic DNA was extracted using the QIAamp DNA Blood Mini Kit (QIAGEN) according to the manufacturer’s instructions including the RNase treatment. Purified genomic DNA (100 ng) was electrophoresed on a 0.8% agarose gel to assess its quality. We quantified the gDNA concentration using the Qubit dsDNA BR assay kit (Invitrogen). Genomic DNA libraries were prepared using the TruSeq DNA kit (Illumina). The starting amount of material was 1 μg of genomic DNA sheared with Covaris S2 to 300–400 bp in size. The libraries were sequenced using an Illumina HiSeq 2000 machine, and 100 bp paired-end reads were generated. For parents of the trio family, the procedure was the same except that genomic DNA was extracted from blood samples.
Sequencing Data Processing
Raw whole-genome DNA sequences were analyzed using an in-house pipeline that utilizes published algorithms in a sequential manner (BWA26 for mapping the reads over the hg19 reference, SAMtools26 for detecting heterozygous sites, and ANNOVAR27 for annotation). Reads with a mapping quality >0 (uniquely mapping reads) were used for SNV calling. Of note, annotated Gencode pseudogenes are not excluded from our analysis. However, we excluded multiple mapped reads and intronic SNVs that confound pseudogenes with their parental genes.28 Variants falling inside repeated regions such as segmental duplications or repeats (according to RepeatMasker) were filtered out (Table S2). All experiments were performed using the manufacturer’s recommended protocols without modifications.
The RNA sequences were mapped with tophat229 and GEM.30 Uniquely mapping reads were extracted by filtering for mapping quality (MQ ≥ 50 for tophat2 and MQ ≥ 150 for GEM) and processed using an in-house pipeline to calculate reference and alternative allelic expression for each covered SNV site in each single cell (Tables S2 and S3).
Allele-Specific Expression
For each gene, the aggregate monoallelic ratio (AR) was calculated by averaging the allelic ratio calculated per SNV site (sum of #reads from the dominant allele [i.e., (the most expressed)/total SNV reads]).
To reduce the effect of allele dropout, we considered cells with ≥5 reads for each SNV site. To account for sampling bias,22 an SNV site was included in the analysis if it was expressed in more than one single cell (Table S2). Notably, sampling selection was not a critical issue in our approach. Indeed, considering more than one cell per heterozygous site, if both alleles of the gene are expressed, the probability to capture the same allele decreases exponentially with the number of cells.
We excluded chromosome X from the analysis. For each gene, the aggregated values are reported for simplicity (Figures 1 and 2).
Figure 1.
Transcript Detection in Bulk and Single Cells and Relative Allelic Expression of 52 Genes with Significant Allelic Imbalance in 1,084 Single-Cell Fibroblasts
(A) The cumulative number of detected genes is plotted according to increasing random subsets of samples (RPKM > 1; 24,753 genes in total; Gencode v.10). 165 bulk RNA-seq samples from 165 Gencord fibroblasts (blue) are compared with 165 single-cell RNA-seq samples from one GenCord fibroblast (red, UCF1014).
(B) Detection rate (number of samples in which a gene > 1 RPKM) is plotted for each gene for single-cell RNA-seq (n = 165, UCF1014, red) and bulk RNA-seq from GenCord fibroblast (n = 165, blue). Genes are ranked on the x axis according to the detection rate. The inset is a zoom-in around the intersection point. A total of 4,542 genes are detectable in more SC than bulk samples, and 2,738 genes are detected exclusively in single cells.
(C) SNVs per gene per cell are colored and plotted according to their respective allelic ratio (AR), measured as an aggregated value scaled from 0 to 1. For each site, the size of the dot is proportional to the respective number of reads. Negative log10-transformed p values are displayed in pink. The vertical dashed blue line reflects a statistical significance level of 0.05. The known imprinted genes are in green and the putative imprinted genes with validated parent-of-origin effects are in red.
Figure 2.
Karyogram with the Genomic Map of the 52 Selected Genes with Allelic Imbalance Compatible with Imprinting that Have Been Identified in 1,084 Single-Cell Fibroblasts
A zoom-in view (chr14: 100,179,741–101,091,294; hg38) of the 14q32 imprinted region is shown with skewed monoallelic genes in red. Chromosome X and Y were not tested. Asterisk (∗) indicates known imprinted genes.
For GenCord analysis (i.e., bulk samples), we manually evaluated that the reported SNV detected was a single gene and/or a single isoform and was not overlapping a repetitive element.
Statistical Analysis
We defined the gene g as characterized by S heterozygous sites. Each j-esim site is expressed in nj single cells and each i-esim informative cell expresses one of the two alleles or both {aij, bij} (a is the number of reads presenting with the reference allele, and b is the number of reads presenting with the alternative allele). The allelic ratio per cell ARij is calculated according to the dominant allele, i.e., the one that express more reads per site . According to a threshold T, a cell with ARij ≥ T is defined as a monoallelic expressor of the dominant allele for the site j. To define the allelic expression of the gene g, we pooled together all of the single cell observations in the S SNV sites and calculated an aggregate allelic ratio AR = mean(ARj). More formally, for a gene g with a total of N observations , the probability of having k monoallelic expressors for the gene g can be calculated using the binomial distribution
(Equation 1) |
where p is the unknown probability of expressing the dominant allele in each cell.
A natural choice is to model p with a Beta distribution:
(Equation 2) |
where B is the Beta function, a is the fraction of cells proportional to the aggregate allelic ratio a = AR ⋅ N and b = (1 − AR) ⋅ N. Combining Equations 2 with 1 we obtain the compound beta-binomial distribution:
(Equation 3) |
The null probability of obtaining the same configuration assuming an equal probability of getting reads from the two alleles is:
(Equation 4) |
We then applied the likelihood-log ratio test to statistically compare Equations 3 with 4 and obtain a measure of the significance for the AR of the gene g.
Of note, this derivation is valid under the assumption that the measurements of the allelic expression of two adjacent heterozygous sites on the same gene do not influence each other; in other words, there are no reads encompassing both sites. With a read length of 100 nt, this assumption is supported by the fact that less than 7% of the adjacent SNPs are closer than ±50 nt (Figure S1). After the discovery phase, we repeated the analysis by eliminating all reads that encompassed two adjacent SNPs for the imprinted genes and obtained the same results (data not shown). Estimation of false-negative error rates is provided in Figure S2.
Results
Single Cells Reveal Transcripts of Low Abundance
To empirically validate the single-cell RNA-seq approach for its sensitivity to detect more genes than standard RNA-seq from bulk cell populations, we compared the number of detected genes between 165 bulk samples obtained from the GenCord fibroblast collection31 and 165 single-cell fibroblasts from one individual from the same collection (Figure 1). The number of detected genes reached a plateau at approximately 14,000 genes (RPKM > 1) after analyzing RNA-seq data from 120 GenCord individuals. No plateau was reached with an analysis of 165 single cells, which allowed the detection of an additional 2,738 genes in single cells (RPKM > 1). Moreover, 4,542 genes were detectable in more single cells than in bulk samples (Figure 1). The reason for this result was that the gene expression profile detected in bulk samples was always dominated by the average of the highest transcribed genes and was substantially similar among individuals. Conversely, each single cell is sequenced at a specific time point in its life cycle and revealed a snapshot of the dynamic transcriptional process. According to the transcriptional bursting model,32 there are some temporal windows during which it is possible to detect less-abundant transcripts. Therefore, single-cell RNA-seq of many individual cells enables the exploration of the whole transcriptional dynamics of a tissue (Table S3), whereas low gene expression profiles are averaged out by RNA-seq bulk transcriptomes.
Detection of Significant Monoallelic Expression
To robustly identify skewed monoallelically expressed genes in 1,084 single-cell fibroblasts from 5 individuals, we sequenced and mapped the RNA of isolated single fibroblast using TopHat29 and GEM mapper.30 In parallel, we sequenced the whole genome of each individual from bulk cells (>1 million cells). To identify heterozygous variants, we performed DNA variant calling with SAMtools26 and considered SNVs inside uniquely mapping reads. ASE was then estimated using read counts at each expressed heterozygous site using a custom pipeline (see Table S1 and Material and Methods for details). We obtained 14 million uniquely mapped reads on average per single cell. After filtering out repeated regions of the genome (segmental duplications, complex and simple repeats), on average, approximately 700,000 (range 340,000–106,000) heterozygous single nucleotide variants (SNV) were interrogated (Table S2).
To classify a gene as putative imprinted, we implemented a statistical framework that included the number of reads on each allele per site and the number of informative single cells per site. In brief, we modeled the likelihood of a gene to be monoallelically expressed with a beta-binomial distribution and evaluated the significance of the aggregate allelic ratio (AR = reads sum of the most frequent allele per SNV site/total reads) with the log-likelihood test (see Material and Methods for details). Genes presenting with a significant AR (FDR corrected p value < 0.05, AR > 0.9) in at least two of the five individuals and not presenting with a biallelic state (AR < 0.9) in the others were retained (for results obtained in one individual, see Table S4). In addition, we manually excluded strong reported GTEx cis-expression quantitative trait loci (eQTLs) that might explain the observed skewed monoallelic expression (Table S5).
From this stringent analysis, we obtained a total of 52 genes that exhibited a strong and significant allelic bias in transcription for one and the same allele in primary fibroblasts of the same individual (Figures 1 and 2, Table S6).
Characterization of Known Imprinted Genes in Single-Cell Fibroblasts
These 52 genes mapped throughout the genome, except for an imprinted gene cluster located within the 14q32 imprinted region33, 34 (Figure 2), which contained the following nine described imprinted genes: SNRPN-SNURF, MEG3, MEG9, KCNQ1OT, NAP1L5, PEG10, PLAGL1, IPW, and ZDBF2. The 14q32 region has been associated with imprinting disorders35 and tumorigenesis.36 In particular, this genomic area harbors three differentially methylated regions37, 38 described to coordinately regulate (1) the paternally expressed genes DLK1 and RTL1 (retrosposon-like1) and (2) the maternally expressed genes MEG3 (long intergenic non-protein coding RNA 23 or GTL2), RTL-antisense transcript, MEG8, MEG9, and several polyadenylated C/D box small nucleolar (sno)RNAs39, 40 and microRNAs20, 40, 41, 42, 43 (Figure 2, Tables S6 and S7). From our analysis, we validated the known imprinting statuses of MEG3, MEG8, MEG9, DLK1, and RTL-antisense transcripts (Tables S7), but the DIO3 and RTL1 genes were not detectable in fibroblasts. Therefore, our results complement and support the status of this 14q32 imprinted region in single-cell fibroblasts.
To further validate our method, we conducted a specific analysis of 93 known imprinted genes annotated in the Geneimprint database (Table S7). First, 43 genes were absent from our dataset either because they were not detected in primary fibroblasts or because of a lack of informative heterozygous SNVs. Of the remaining 50 imprinted genes, 45 were validated by our single-cell analysis. However, 5 known imprinted genes showed clear evidence of biallelic expression in single-cell fibroblasts (ATP10A, GLIS3, GNAS, SLC4A2, and TFPI2) (Figure 3). These genes have been previously reported as showing a variable allelic expression among individuals and tissue-specific imprinting20 (Table S7). To validate our results, we analyzed the ASE of these 5 biallelic genes in 179 bulk samples from GenCord primary fibroblasts.31 All genes were detectable and displayed an unambiguous biallelic expression in fibroblast cell types, thus confirming our initial finding with the single-cell based method (Figure 3). Moreover, we examined in detail the 42 imprinted genes that were recently detected in 1,582 human primary tissue samples.20 Four of these genes (PPIEL, ZNF331, LPAR6, MEG9) were expressed in the fibroblasts. We observed that all of the genes presented with an AR > 0.9 in single-cell fibroblasts even if they did not achieve statistical significance after multiple test corrections for their low expression (Figure 3). We conducted the same analysis for 24 imprinted genes that were recently discovered in 45 human tissue samples and cell lines.19 Eleven of these genes have been reported to be imprinted genes and expressed in fibroblasts or skin tissues19 (Table S7). In our single-cell setting, we were also able to confirm their imprinting status.
Figure 3.
Allele-Specific Expression of 60 Known Imprinted Genes in 1,084 Single-Cell Fibroblasts
(A) Sixty known imprinted genes from the Geneimprint database detected in single-cell fibroblasts. The SNVs per gene are colored according to their respective allelic ratio (AR) and measured as an aggregated value scaled from 0 to 1. The size of the dots is proportional to the number of reads, and the pink curve displays the negative log10-transformed p values. The vertical dashed blue line reflects a statistical significance level of 0.05.
(B) Scatterplot of the aggregated AR value and negative log10-transformed p values. 60 known imprinted genes are in green, 10 putative imprinted genes are in red, and 33 genes with strong allelic imbalance in single-cell fibroblasts are in black.
(C) Distributions of allelic ratios (reference reads/total reads) represented as boxplots in 179 GenCord fibroblast cell lines of previously reported imprinted genes showing evidence of biallelic expression in single-cell fibroblasts (see A).
In conclusion, a single-cell approach enables the identification of known imprinted genes as long as the gene shows detectable expression and contains informative SNVs.
Identification of Putative Imprinted Genes in Primary Single-Cell Fibroblasts
To identify genes with reliable marks of imprinting, the parental origin of the single expressed allele was determined from two families in which the parental DNA was available. We thus sequenced the whole genomes of the probands and the parents. After extracting all of the heterozygous sites, we used the SHAPEIT software44 to reconstruct the allelic phases and derive, where possible, the parent of origin of each allele (see Material and Methods for details). We identified imprinted genes with the same parent-of-origin inheritance in both family trios (Table S6). As expected, this gene list contained nine known imprinted genes (SNRPN-SNURF, MEG3, MEG9, KCNQ1OT1, NAP1L5, PEG10, PLAGL1, IPW, and ZDBF2) (Figure 1 and Table S6). Although we confirmed the correct parent-of-origin for those genes (Table S6), we also found unexpectedly discordant parent-of-origin for inherited alleles in the two trios for four validated known imprinted genes (AIM1, MAGI2, ANO1 and DHFR) (Table S7). Additionally, we identified from this gene list ten novel putative imprinted genes in primary fibroblasts: ADTRP, ATP5EP2, SMOC1, SNORD114-11 / SNORD114-10, SNORD114-17 / SNORD114-16, SNORD114-19 / SNORD114-18, SNORD114-21 / SNORD114-20, SNORD114-26 / SNORD114-25, ERAP2, and PRIM2 (Table S6). Two genes are paternally expressed (PRIM2 and ERAP2) and eight are maternally expressed (Table S6). Importantly, PRIM2 has been previously described as being imprinted but conflicting data were reported in the literature45, 46 mainly due to a mis-annotation of PRIM2 with its pseudogene. In our study, we confirmed the maternally imprinted status of PRIM2 (ASE: 0.99, adj p value < 1 × 10−300) in primary fibroblasts because we filtered out multi-mapped reads. Consistent with our single-cell analysis, we validated the skewed monoallelic expression of the 19 putative imprinted genes in bulk samples from the same individuals (Figure 4). The allelic ratios were greater than 0.8 for all of these genes. However, several studies have reported the alteration of ERAP2 expression by nonsense-mediated decay47, 48 which may adversely affect the imprinting status for this gene. Ultimately, we identified four putative imprinted coding genes (ADTRP, ATP5EP2, SMOC1, PRIM2) and five polyadenylated (sno)RNAs in the known imprinted 14q32 region.
Figure 4.
Allelic-Specific Expression across the GenCord Bulk Samples of Known or Putative Imprinted Genes in Single-Cell Fibroblasts
Distributions of allelic ratios represented as boxplots in (A) GenCord RNA-seq fibroblast bulk samples and (B) RNA-seq bulk samples from three different GenCord cell types (179 primary fibroblasts, 179 lymphoblastoid cells [LCLs], 181 T cells). Asterisk (∗) indicates known imprinted genes.
Advantages of Single-Cell RNA-Seq over Bulk RNA-Seq to Identify Imprinted Genes
To demonstrate the improved sensitivity of the single-cell RNA-seq methodology for imprinted gene identification compared with the previous bulk-based approaches, we examined the ASE of the 52 selected genes (Table S8) in 539 RNA-seq bulk samples from 3 different GenCord cell types (179 primary fibroblasts, 179 lymphoblastoid cells [LCLs], and 181 T cells).25 Six genes were detectable (EDARADD, ERAP2, PEG10, ZDBF2, MEG3, and MEG9) in the GenCord bulk fibroblasts; nevertheless, these genes presented with the single-cell predicted imprinting status (Figure 4). As expected, the known imprinted genes (PEG10, ZDBF2, MEG3, MEG9) showed imprinting signatures in two different cell types: ZDBF2 (fibroblasts, T cells) and PEG10 (fibroblasts, LCLs) (Figure 4). Remarkably, EDARADD showed a similar imprinting profile among all three cell types, whereas ERAP2 seemed to be more restricted to one cell type (fibroblasts) (Figure 4). EDARADD is a promising putative imprinted gene but no information on parental expression was available from the two trios. Taking these results together, we concluded that the single-cell approach is advantageous to RNA-seq bulk methods for the identification of imprinted genes.
Discussion
We have presented several lines of evidence that combining single-cell RNA-seq with WGS and a robust analysis framework is suitable for assessing imprinted genes. This method provides a solution to the inherent limitation of standard RNA sequencing underlined by recent studies19, 20 that were unable to identify imprinted genes present in low abundance. Our data indicate that single-cell RNA-seq revealed genes detected in a limited subset of cells, accounting for a small percentage of the cell population. We have developed an analysis framework that provides statistical support to genes transcribed exclusively from one and the same allele and enables the detection of parent-of-origin effects on transcription. We tested this framework on 1,084 single primary fibroblasts from five individuals including two trios (parent-offspring). First, we demonstrated that this framework enables the detection of known imprinted genes expressed in fibroblasts, rendering the method suitable for imprinted gene discovery. Second, we identified nine putative previously undescribed imprinted genes in fibroblasts for which parent-of-origin effects were demonstrated. These nine putative imprinted genes were untraceable in all bulk samples that were processed using a standard RNA-seq procedure. In particular, we revealed the maternal expression of five polyadenylated (sno)RNAs belonging to the extensively studied human imprinted domain at 14q32. Our finding corroborate the tissue-specific maternal expression of (sno)RNAs encoded at the orthologous murin region on mouse distal 12 chromosome.40 The four remaining genes were not associated with any previously reported differentially methylated regions.49, 50
In addition, this study revealed 33 transcripts with a significant monoallelic imbalance in primary fibroblasts (Table S6). Given our limited number of trios, we were unable to determine an exclusive parental expression of the expressed allele in these cases. Indeed, the ascertainment of the imprinting status of a gene presenting a skewed allelic expression profile has to pass through the verification of the parental origin of the expressed allele. The probability to observe the same parental origin by chance decreases exponentially with the number of trios. For example, with ten informative trios, the probability of expressing the same parental allele by chance is about 0.001 (2−10). However, given the average occurrence of one heterozygous variant per kilobase,51 the probability of sequencing a trio informative for a specific gene (i.e., the gene contains at least one heterozygous SNP) is related to the respective gene length. We modeled this probability with a Poisson distribution and calculated the number of trios needed to achieve a significance p = 0.001 in function of the gene length with a binomial distribution. Eventually 20 probands have to be sequenced for genes with lengths greater than 1 kbp (90% of the total number of transcripts), 10 probands are sufficient for genes of 9 kbp or more (64%), while for smaller transcripts (<500 bp), a considerably higher number of probands is required (>50) (Figure S3). Nevertheless, some of these 33 genes are of clinical interest because they can be important contributors of phenotypic variability or may confer selective advantages to the cells and promote tumorigenesis. Among these were the genes edar-associated death domain (EDARADD [MIM: 606603]), neurexin 1 (NRXN1 [MIM: 600565]), lysine-specific methyltransferase 2C (MLL3 [MIM: 606833]), adenosine deaminase 2 (CECR1 [MIM: 607575]), phosphatase and tensin homolog pseudogene (PTENP1 [MIM: 613531]), fatty aldehyde dehydrogenase (ALDH3A2 [MIM: 609523]), adenylate kinase 2 (AK2 [MIM: 103020]), and anaplastic lymphoma kinase (ALK [MIM: 105590]), which have been reported to be associated with syndromes or cancer-related disease.
In conclusion, a full accounting of cell-type-imprinted genes using a single-cell-based approach is practically conceivable. This framework is easily scalable to a broad spectrum of tissues and a large cohort of samples from family trios and shows the potential to comprehensively analyze genomic imprinting and its impact on the phenotypic variability and penetrance of genetic traits.
Acknowledgments
We thank Olivier Delaneau for help with the SHAPEIT software. This work was supported by grants from the Swiss National Science Foundation (SNF-144082), the European Research Council (ERC-249968), and the ChildCare foundation to S.E.A.. The computations were performed at the Vital-IT Center for high-performance computing of the Swiss Institute of Bioinformatics (SIB).
Published: February 9, 2017
Footnotes
Supplemental Data include three figures and eight tables and can be found with this article online at http://dx.doi.org/10.1016/j.ajhg.2017.01.028.
Contributor Information
Federico A. Santoni, Email: federico.santoni@unige.ch.
Stylianos E. Antonarakis, Email: stylianos.antonarakis@unige.ch.
Accession Numbers
Newly generated RNA and DNA sequencing data are deposited in the European Genome-phenome Archive for controlled accesses; the study accession number is EGAS00001002187.
Bulk datasets from individual 3 were uploaded to the Gene Expression Omnibus (GEO) database under accession number GSE55426.
GenCord datasets are deposited in the European Genome-phenome Archive for controlled accesses; the study accession number is EGAS00001000446.
Web Resources
European Genome-phenome Archive (EGA), https://www.ebi.ac.uk/ega
Geneimprint, http://www.geneimprint.com/
OMIM, http://www.omim.org/
Vital-IT, http://www.vital-it.ch
Supplemental Data
References
- 1.Albert F.W., Kruglyak L. The role of regulatory variation in complex traits and disease. Nat. Rev. Genet. 2015;16:197–212. doi: 10.1038/nrg3891. [DOI] [PubMed] [Google Scholar]
- 2.Pastinen T. Genome-wide allele-specific analysis: insights into regulatory variation. Nat. Rev. Genet. 2010;11:533–538. doi: 10.1038/nrg2815. [DOI] [PubMed] [Google Scholar]
- 3.Surani M.A., Barton S.C., Norris M.L. Development of reconstituted mouse eggs suggests imprinting of the genome during gametogenesis. Nature. 1984;308:548–550. doi: 10.1038/308548a0. [DOI] [PubMed] [Google Scholar]
- 4.McGrath J., Solter D. Completion of mouse embryogenesis requires both the maternal and paternal genomes. Cell. 1984;37:179–183. doi: 10.1016/0092-8674(84)90313-1. [DOI] [PubMed] [Google Scholar]
- 5.Deng X., Berletch J.B., Nguyen D.K., Disteche C.M. X chromosome regulation: diverse patterns in development, tissues and disease. Nat. Rev. Genet. 2014;15:367–378. doi: 10.1038/nrg3687. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Plasschaert R.N., Bartolomei M.S. Genomic imprinting in development, growth, behavior and stem cells. Development. 2014;141:1805–1813. doi: 10.1242/dev.101428. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Donlon T.A. Similar molecular deletions on chromosome 15q11.2 are encountered in both the Prader-Willi and Angelman syndromes. Hum. Genet. 1988;80:322–328. doi: 10.1007/BF00273644. [DOI] [PubMed] [Google Scholar]
- 8.Knoll J.H., Nicholls R.D., Magenis R.E., Graham J.M., Jr., Lalande M., Latt S.A. Angelman and Prader-Willi syndromes share a common chromosome 15 deletion but differ in parental origin of the deletion. Am. J. Med. Genet. 1989;32:285–290. doi: 10.1002/ajmg.1320320235. [DOI] [PubMed] [Google Scholar]
- 9.Nicholls R.D., Knoll J.H., Butler M.G., Karam S., Lalande M. Genetic imprinting suggested by maternal heterodisomy in nondeletion Prader-Willi syndrome. Nature. 1989;342:281–285. doi: 10.1038/342281a0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Schinzel A. NICHD conference. Genomic imprinting: consequences of uniparental disomy for human disease. Am. J. Med. Genet. 1993;46:683–684. doi: 10.1002/ajmg.1320460616. [DOI] [PubMed] [Google Scholar]
- 11.Reik W., Brown K.W., Schneid H., Le Bouc Y., Bickmore W., Maher E.R. Imprinting mutations in the Beckwith-Wiedemann syndrome suggested by altered imprinting pattern in the IGF2-H19 domain. Hum. Mol. Genet. 1995;4:2379–2385. doi: 10.1093/hmg/4.12.2379. [DOI] [PubMed] [Google Scholar]
- 12.Weksberg R., Teshima I., Williams B.R., Greenberg C.R., Pueschel S.M., Chernos J.E., Fowlow S.B., Hoyme E., Anderson I.J., Whiteman D.A. Molecular characterization of cytogenetic alterations associated with the Beckwith-Wiedemann syndrome (BWS) phenotype refines the localization and suggests the gene for BWS is imprinted. Hum. Mol. Genet. 1993;2:549–556. doi: 10.1093/hmg/2.5.549. [DOI] [PubMed] [Google Scholar]
- 13.Eggerding F.A., Schonberg S.A., Chehab F.F., Norton M.E., Cox V.A., Epstein C.J. Uniparental isodisomy for paternal 7p and maternal 7q in a child with growth retardation. Am. J. Hum. Genet. 1994;55:253–265. [PMC free article] [PubMed] [Google Scholar]
- 14.Abu-Amero S., Monk D., Frost J., Preece M., Stanier P., Moore G.E. The genetic aetiology of Silver-Russell syndrome. J. Med. Genet. 2008;45:193–199. doi: 10.1136/jmg.2007.053017. [DOI] [PubMed] [Google Scholar]
- 15.Ogata T., Kagami M., Ferguson-Smith A.C. Molecular mechanisms regulating phenotypic outcome in paternal and maternal uniparental disomy for chromosome 14. Epigenetics. 2008;3:181–187. doi: 10.4161/epi.3.4.6550. [DOI] [PubMed] [Google Scholar]
- 16.Barlow D.P., Stöger R., Herrmann B.G., Saito K., Schweifer N. The mouse insulin-like growth factor type-2 receptor is imprinted and closely linked to the Tme locus. Nature. 1991;349:84–87. doi: 10.1038/349084a0. [DOI] [PubMed] [Google Scholar]
- 17.DeChiara T.M., Robertson E.J., Efstratiadis A. Parental imprinting of the mouse insulin-like growth factor II gene. Cell. 1991;64:849–859. doi: 10.1016/0092-8674(91)90513-x. [DOI] [PubMed] [Google Scholar]
- 18.Bartolomei M.S., Zemel S., Tilghman S.M. Parental imprinting of the mouse H19 gene. Nature. 1991;351:153–155. doi: 10.1038/351153a0. [DOI] [PubMed] [Google Scholar]
- 19.Babak T., DeVeale B., Tsang E.K., Zhou Y., Li X., Smith K.S., Kukurba K.R., Zhang R., Li J.B., van der Kooy D. Genetic conflict reflected in tissue-specific maps of genomic imprinting in human and mouse. Nat. Genet. 2015;47:544–549. doi: 10.1038/ng.3274. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Baran Y., Subramaniam M., Biton A., Tukiainen T., Tsang E.K., Rivas M.A., Pirinen M., Gutierrez-Arcelus M., Smith K.S., Kukurba K.R., GTEx Consortium The landscape of genomic imprinting across diverse adult human tissues. Genome Res. 2015;25:927–936. doi: 10.1101/gr.192278.115. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Borel C., Ferreira P.G., Santoni F., Delaneau O., Fort A., Popadin K.Y., Garieri M., Falconnet E., Ribaux P., Guipponi M. Biased allelic expression in human primary fibroblast single cells. Am. J. Hum. Genet. 2015;96:70–80. doi: 10.1016/j.ajhg.2014.12.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Deng Q., Ramsköld D., Reinius B., Sandberg R. Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science. 2014;343:193–196. doi: 10.1126/science.1245316. [DOI] [PubMed] [Google Scholar]
- 23.Marinov G.K., Williams B.A., McCue K., Schroth G.P., Gertz J., Myers R.M., Wold B.J. From single-cell to cell-pool transcriptomes: stochasticity in gene expression and RNA splicing. Genome Res. 2014;24:496–510. doi: 10.1101/gr.161034.113. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Kim J.K., Kolodziejczyk A.A., Ilicic T., Teichmann S.A., Marioni J.C. Characterizing noise structure in single-cell RNA-seq distinguishes genuine from technical stochastic allelic expression. Nat. Commun. 2015;6:8687. doi: 10.1038/ncomms9687. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Gutierrez-Arcelus M., Lappalainen T., Montgomery S.B., Buil A., Ongen H., Yurovsky A., Bryois J., Giger T., Romano L., Planchon A. Passive and active DNA methylation and the interplay with genetic variation in gene regulation. eLife. 2013;2:e00523. doi: 10.7554/eLife.00523. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R., 1000 Genome Project Data Processing Subgroup The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Ye M., Zhang H., Amabile G., Yang H., Staber P.B., Zhang P., Levantini E., Alberich-Jordà M., Zhang J., Kawasaki A., Tenen D.G. C/EBPa controls acquisition and maintenance of adult haematopoietic stem cell quiescence. Nat. Cell Biol. 2013;15:385–394. doi: 10.1038/ncb2698. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Zheng D., Frankish A., Baertsch R., Kapranov P., Reymond A., Choo S.W., Lu Y., Denoeud F., Antonarakis S.E., Snyder M. Pseudogenes in the ENCODE regions: consensus annotation, analysis of transcription, and evolution. Genome Res. 2007;17:839–851. doi: 10.1101/gr.5586307. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Trapnell C., Pachter L., Salzberg S.L. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009;25:1105–1111. doi: 10.1093/bioinformatics/btp120. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Marco-Sola S., Sammeth M., Guigó R., Ribeca P. The GEM mapper: fast, accurate and versatile alignment by filtration. Nat. Methods. 2012;9:1185–1188. doi: 10.1038/nmeth.2221. [DOI] [PubMed] [Google Scholar]
- 31.Dimas A.S., Deutsch S., Stranger B.E., Montgomery S.B., Borel C., Attar-Cohen H., Ingle C., Beazley C., Gutierrez Arcelus M., Sekowska M. Common regulatory variation impacts gene expression in a cell type-dependent manner. Science. 2009;325:1246–1250. doi: 10.1126/science.1174148. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Suter D.M., Molina N., Gatfield D., Schneider K., Schibler U., Naef F. Mammalian genes are transcribed with widely different bursting kinetics. Science. 2011;332:472–474. doi: 10.1126/science.1198817. [DOI] [PubMed] [Google Scholar]
- 33.Kagami M., Sekita Y., Nishimura G., Irie M., Kato F., Okada M., Yamamori S., Kishimoto H., Nakayama M., Tanaka Y. Deletions and epimutations affecting the human 14q32.2 imprinted region in individuals with paternal and maternal upd(14)-like phenotypes. Nat. Genet. 2008;40:237–242. doi: 10.1038/ng.2007.56. [DOI] [PubMed] [Google Scholar]
- 34.da Rocha S.T., Edwards C.A., Ito M., Ogata T., Ferguson-Smith A.C. Genomic imprinting at the mammalian Dlk1-Dio3 domain. Trends Genet. 2008;24:306–316. doi: 10.1016/j.tig.2008.03.011. [DOI] [PubMed] [Google Scholar]
- 35.Temple I.K., Cockwell A., Hassold T., Pettay D., Jacobs P. Maternal uniparental disomy for chromosome 14. J. Med. Genet. 1991;28:511–514. doi: 10.1136/jmg.28.8.511. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Benetatos L., Vartholomatos G., Hatzimichael E. MEG3 imprinted gene contribution in tumorigenesis. Int. J. Cancer. 2011;129:773–779. doi: 10.1002/ijc.26052. [DOI] [PubMed] [Google Scholar]
- 37.Kagami M., O’Sullivan M.J., Green A.J., Watabe Y., Arisaka O., Masawa N., Matsuoka K., Fukami M., Matsubara K., Kato F. The IG-DMR and the MEG3-DMR at human chromosome 14q32.2: hierarchical interaction and distinct functional properties as imprinting control centers. PLoS Genet. 2010;6:e1000992. doi: 10.1371/journal.pgen.1000992. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Court F., Tayama C., Romanelli V., Martin-Trujillo A., Iglesias-Platas I., Okamura K., Sugahara N., Simón C., Moore H., Harness J.V. Genome-wide parent-of-origin DNA methylation analysis reveals the intricacies of human imprinting and suggests a germline methylation-independent mechanism of establishment. Genome Res. 2014;24:554–569. doi: 10.1101/gr.164913.113. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Weber M.J. Mammalian small nucleolar RNAs are mobile genetic elements. PLoS Genet. 2006;2:e205. doi: 10.1371/journal.pgen.0020205. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Cavaillé J., Seitz H., Paulsen M., Ferguson-Smith A.C., Bachellerie J.P. Identification of tandemly-repeated C/D snoRNA genes at the imprinted human 14q32 domain reminiscent of those at the Prader-Willi/Angelman syndrome region. Hum. Mol. Genet. 2002;11:1527–1538. doi: 10.1093/hmg/11.13.1527. [DOI] [PubMed] [Google Scholar]
- 41.Wylie A.A., Murphy S.K., Orton T.C., Jirtle R.L. Novel imprinted DLK1/GTL2 domain on human chromosome 14 contains motifs that mimic those implicated in IGF2/H19 regulation. Genome Res. 2000;10:1711–1718. doi: 10.1101/gr.161600. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Charlier C., Segers K., Wagenaar D., Karim L., Berghmans S., Jaillon O., Shay T., Weissenbach J., Cockett N., Gyapay G., Georges M. Human-ovine comparative sequencing of a 250-kb imprinted domain encompassing the callipyge (clpg) locus and identification of six imprinted transcripts: DLK1, DAT, GTL2, PEG11, antiPEG11, and MEG8. Genome Res. 2001;11:850–862. doi: 10.1101/gr.172701. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Seitz H., Royo H., Bortolin M.L., Lin S.P., Ferguson-Smith A.C., Cavaillé J. A large imprinted microRNA gene cluster at the mouse Dlk1-Gtl2 domain. Genome Res. 2004;14:1741–1748. doi: 10.1101/gr.2743304. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Delaneau O., Marchini J., Zagury J.F. A linear complexity phasing method for thousands of genomes. Nat. Methods. 2011;9:179–181. doi: 10.1038/nmeth.1785. [DOI] [PubMed] [Google Scholar]
- 45.Chung J., Tsai S., James A.H., Thames B.H., Shytle S., Piedrahita J.A. Lack of genomic imprinting of DNA primase, polypeptide 2 (PRIM2) in human term placenta and white blood cells. Epigenetics. 2012;7:429–431. doi: 10.4161/epi.19777. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Pant P.V., Tao H., Beilharz E.J., Ballinger D.G., Cox D.R., Frazer K.A. Analysis of allelic differential expression in human white blood cells. Genome Res. 2006;16:331–339. doi: 10.1101/gr.4559106. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Andrés A.M., Dennis M.Y., Kretzschmar W.W., Cannons J.L., Lee-Lin S.Q., Hurle B., Schwartzberg P.L., Williamson S.H., Bustamante C.D., Nielsen R., NISC Comparative Sequencing Program Balancing selection maintains a form of ERAP2 that undergoes nonsense-mediated decay and affects antigen presentation. PLoS Genet. 2010;6:e1001157. doi: 10.1371/journal.pgen.1001157. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Kuiper J.J., Van Setten J., Ripke S., Van 'T Slot R., Mulder F., Missotten T., Baarsma G.S., Francioli L.C., Pulit S.L., De Kovel C.G. A genome-wide association study identifies a functional ERAP2 haplotype associated with birdshot chorioretinopathy. Hum. Mol. Genet. 2014;23:6081–6087. doi: 10.1093/hmg/ddu307. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Joshi R.S., Garg P., Zaitlen N., Lappalainen T., Watson C.T., Azam N., Ho D., Li X., Antonarakis S.E., Brunner H.G. DNA methylation profiling of uniparental disomy subjects provides a map of parental epigenetic bias in the human genome. Am. J. Hum. Genet. 2016;99:555–566. doi: 10.1016/j.ajhg.2016.06.032. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Sanchez-Delgado M., Court F., Vidal E., Medrano J., Monteagudo-Sánchez A., Martin-Trujillo A., Tayama C., Iglesias-Platas I., Kondova I., Bontrop R. Human oocyte-derived methylation differences persist in the placenta revealing widespread transient imprinting. PLoS Genet. 2016;12:e1006427. doi: 10.1371/journal.pgen.1006427. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Abecasis G.R., Altshuler D., Auton A., Brooks L.D., Durbin R.M., Gibbs R.A., Hurles M.E., McVean G.A., 1000 Genomes Project Consortium A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073. doi: 10.1038/nature09534. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.