- Split View
-
Views
-
Cite
Cite
Xu Zhang, Erika L. Moen, Cong Liu, Wenbo Mu, Eric R. Gamazon, Shannon M. Delaney, Claudia Wing, Lucy A. Godley, M. Eileen Dolan, Wei Zhang, Linking the genetic architecture of cytosine modifications with human complex traits, Human Molecular Genetics, Volume 23, Issue 22, 15 November 2014, Pages 5893–5905, https://doi.org/10.1093/hmg/ddu313
- Share Icon Share
Interindividual variation in cytosine modifications could contribute to heterogeneity in disease risks and other complex traits. We assessed the genetic architecture of cytosine modifications at 283 540 CpG sites in lymphoblastoid cell lines (LCLs) derived from independent samples of European and African descent. Our study suggests that cytosine modification variation was primarily controlled in local by single major modification quantitative trait locus (mQTL) and additional minor loci. Local genetic epistasis was detectable for a small proportion of CpG sites, which were enriched by more than 9-fold for CpG sites mapped to population-specific mQTL. Genetically dependent CpG sites whose modification levels negatively (repressive sites) or positively (facilitative sites) correlated with gene expression levels significantly co-localized with transcription factor binding, with the repressive sites predominantly associated with active promoters whereas the facilitative sites rarely at active promoters. Genetically independent repressive or facilitative sites preferentially modulated gene expression variation by influencing local chromatin accessibility, with the facilitative sites primarily antagonizing H3K27me3 and H3K9me3 deposition. In comparison with expression quantitative trait loci (eQTL), mQTL detected from LCLs were enriched in associations for a broader range of disease categories including chronic inflammatory, autoimmune and psychiatric disorders, suggesting that cytosine modification variation, while possesses a degree of cell linage specificity, is more stably inherited over development than gene expression variation. About 11% of unique single-nucleotide polymorphisms reported in the Genome-Wide Association Study Catalog were annotated, 78% as mQTL and 31% as eQTL in LCLs, which covered 37% of the investigated diseases/traits and provided insights to the biological mechanisms.
INTRODUCTION
Interindividual gene expression variation underlies many cellular and organism-level phenotypes; therefore, study of genetic regulation of gene expression enhances our understanding of the genetic basis of disease susceptibility and other complex traits. Previous studies utilizing the International HapMap Project (1,2) lymphoblastoid cell lines (LCLs) derived from apparently healthy individuals of major ethnic groups have demonstrated genetic contribution to gene expression in the form of single-nucleotide polymorphisms (SNPs) (3–7), known as expression quantitative trait loci (eQTL). Furthermore, eQTL have been shown to be enriched among SNPs associated with a broad spectrum of complex traits (8) including pharmacologic traits (9) identified from genome-wide association studies (GWAS). Some eQTL are known to affect the binding of transcription factors to regulate gene expression (10). However, in general, the functional basis and regulatory mechanisms of the identified eQTL and most GWAS loci remain to be fully characterized.
Both genetic and environmental regulation of gene expression can be mediated through epigenetic mechanisms. Genetic effects on cytosine modification were first assessed in human brain tissues to identify cytosine methylation quantitative trait loci (11,12). Local or cis-acting genetic variants were also found in LCLs that contribute to within- and between-population epigenetic variations (13,14) and that in turn correlated with gene expression (15,16). On the other hand, the relative flexibility of the epigenetic systems in response to environmental and physiological changes, for example exposure to drugs (17), development (18) and aging (19), indicates that they play a critical role in regulating gene expression beyond DNA sequence variations, which may be subject to more functional constraints and selective pressure (20). Epigenetic and genetic dissection of gene expression variation therefore will aid in our understanding of the environmental, developmental and genetic contributions to gene regulation (21). The DNA methylome is known to maintain genome integrity (22) and X chromosome inactivation (23). Cytosine methylation in cis silences gene expression and is thought to be a global repressive mechanism (24). Given the importance of cytosine modifications (primarily CpG methylation) in gene regulation, the cytosine modification quantitative trait loci (mQTL) represent a novel layer of annotations for genetic variants associated with complex traits (25).
We profiled cytosine modification levels in a panel of LCLs derived from Western African (YRI: Yoruba people from Ibadan, Nigeria) and Northern/Western European (CEU: Caucasian residents from Utah, USA) ancestry using the Illumina Infinium HumanMethylation450 BeadChip (450K array), which covers genome-wide CpG sites with high density (>480 000 CpGs) (26). The 450K array requires bisulfite conversion of genomic DNA in order to distinguish modified from unmodified cytosines. This technique, however, cannot distinguish between 5-methylcytosine and 5-hydroxymethylcytosine (27,28), a recently rediscovered cytosine modification type whose functional consequence is not fully understood (29,30). Therefore, we chose to use the term ‘cytosine modification’ to avoid any potential bias. In this study, we assessed the genetic architecture of cytosine modifications to gain novel biological insights into the genetic basis of human complex traits including disease predispositions, and characterized the epigenetic and genetic contributions to gene expression variation.
RESULTS
Genetic architecture of cytosine modifications
Association scans of SNPs <1 Mb away from target CpG sites revealed that mQTL with strong effects were predominantly located within the ±100 kb regions (Fig. 1C and D), on which our final analysis of local genetic regulation was focused. At 5% FDR, 72 526 associations for 5240 CpG sites were identified in the CEU (Supplementary Data), and 55 066 associations for 7306 CpG sites in the YRI samples (Supplementary Data), representing 1.8 and 2.6% of the analyzed CpG sites, respectively (Table 1). Combining the results from the two populations, which were similar in summary statistics, the association r2 ranged from 0.23 to 0.97 with a median of 0.35. Less than 13% of mQTL associated with multiple CpG sites, potentially reflecting regional cytosine modifications. CpG sites mapped to mQTL occurred more frequently in gene bodies than in CpG islands (2.0 versus 1.3% in CEU, 2.7 versus 1.6% in YRI; Supplementary Data). mQTL detected in this study highly overlapped previously published mQTL in the CEU (14) and YRI (13,14) samples using the Illumina 27K array (>200-fold enrichment, see Supplementary Data). Among the CpG sites mapped to mQTL, we chose three CpG sites and measured their modification levels using bisulfite sequencing (Supplementary Data and Supplementary Data): SNP rs7641545 associated with modification levels of CpG site cg01566999 (P = 0.001) located in the gene body of NIPAL2; SNP rs11346248 associated with modification levels of cg11346248 (P = 0.02) located in the promoter region of SLFN12; SNP rs2238542 associated with modification levels of cg03233876 (P = 0.04) located in the gene body of BSG. The mQTL mapping results have been deposited into our integrated SCAN Database (32) (http://www.scandb.org) developed to provide novel annotations for genetic variants.
Type of association . | Population . | No. of association . | No. of CpGa . | No. of Genea . | %CpGb . | %Geneb . |
---|---|---|---|---|---|---|
mQTL | CEU | 72 526 | 5240 | 1.8% | ||
YRI | 55 066 | 7306 | 2.6% | |||
meQTL | CEU | 5472 | 390 | 4.5% | ||
YRI | 2184 | 289 | 2.7% | |||
CpG-gene expression | CEU | 1277 | 1232 | 614 | 0.47% | 3.7% |
YRI | 1120 | 1090 | 535 | 0.41% | 3.2% |
Type of association . | Population . | No. of association . | No. of CpGa . | No. of Genea . | %CpGb . | %Geneb . |
---|---|---|---|---|---|---|
mQTL | CEU | 72 526 | 5240 | 1.8% | ||
YRI | 55 066 | 7306 | 2.6% | |||
meQTL | CEU | 5472 | 390 | 4.5% | ||
YRI | 2184 | 289 | 2.7% | |||
CpG-gene expression | CEU | 1277 | 1232 | 614 | 0.47% | 3.7% |
YRI | 1120 | 1090 | 535 | 0.41% | 3.2% |
aThe number of unique CpG sites (or genes) mapped.
bThe percentage of unique CpG sites (or genes) mapped among the unique CpG sites (or genes) analyzed.
Type of association . | Population . | No. of association . | No. of CpGa . | No. of Genea . | %CpGb . | %Geneb . |
---|---|---|---|---|---|---|
mQTL | CEU | 72 526 | 5240 | 1.8% | ||
YRI | 55 066 | 7306 | 2.6% | |||
meQTL | CEU | 5472 | 390 | 4.5% | ||
YRI | 2184 | 289 | 2.7% | |||
CpG-gene expression | CEU | 1277 | 1232 | 614 | 0.47% | 3.7% |
YRI | 1120 | 1090 | 535 | 0.41% | 3.2% |
Type of association . | Population . | No. of association . | No. of CpGa . | No. of Genea . | %CpGb . | %Geneb . |
---|---|---|---|---|---|---|
mQTL | CEU | 72 526 | 5240 | 1.8% | ||
YRI | 55 066 | 7306 | 2.6% | |||
meQTL | CEU | 5472 | 390 | 4.5% | ||
YRI | 2184 | 289 | 2.7% | |||
CpG-gene expression | CEU | 1277 | 1232 | 614 | 0.47% | 3.7% |
YRI | 1120 | 1090 | 535 | 0.41% | 3.2% |
aThe number of unique CpG sites (or genes) mapped.
bThe percentage of unique CpG sites (or genes) mapped among the unique CpG sites (or genes) analyzed.
As LCLs, especially the CEU cell lines, have been in culture for many years, there is a concern about the extent of genomic correlation between LCLs and primary lymphocytes. To address this, we compared the eQTL between LCLs and non-transformed peripheral blood samples (33). Local eQTL detected in peripheral blood samples were enriched by more than 2-fold with those detected in LCLs (Supplementary Data). Furthermore, we compared the mQTL between LCLs and cerebellum samples from a Caucasian cohort using 27K array (11). We found >5-fold enrichment of local cerebellum mQTL with those detected in LCLs (Supplementary Data). These observations demonstrated significant genomic correlations between LCLs and primary cells/tissues, suggesting that LCL is a biologically relevant cell model for human genetic studies.
Local polygenic control and genetic epistasis
Although the recognition motifs of cytosine modification enzymes, for instance, DNA methyltransferases, are simple, structural variations of the target DNA strands may generally modulate the enzyme–substrate interaction (34). Cytosine modification levels of a CpG site could therefore be affected by several local polymorphisms with distinct combinatory effects. We first assessed additive local polygenic control, assuming that polymorphisms contributed independently to the variation of cytosine modification levels. For each CpG site mapped to mQTL at 5% FDR, we fitted cytosine modification levels on allele dosages of all mQTL detected for the site, and defined the optimal number of mQTL by forward–backward model selection. We found that >30% of CpG sites were associated with multiple mQTL. For these CpG sites, however, the majority of the genetically explained variation was frequently attributed to one single major locus. Figure 1D shows the comparison of the adjusted association r2 between the most significant mQTL and multiple mQTL for these CpG sites, suggesting that, in general, additive polygenic control could be interpreted as the combinatorial effects of one major mQTL and several minor loci.
Polygenic control can be due to genetic epistasis as well, where polymorphisms at one site modify the effects of polymorphisms at the other sites. We searched for simple two-locus interactions within ±100 kb regions of target CpG sites. To be statistically sound, we focused on SNP pairs that were common, with minor allele frequency (MAF) >0.2 and genotype frequency >0.1, and were not in linkage disequilibrium (LD) of each other (r2 < 0.3). We detected a small number of CpG sites with local genetic interactions (104 in the CEU and 60 in the YRI samples). Figure 1E shows an interacting SNP pair, rs1050816 and rs4674390, in the YRI samples, which explained no variation in cytosine modification levels at CpG site cg00680696 by additive effect, whereas explained 47% of variation when allowing interaction effect. Large epistatic effect, however, was not common (Supplementary Data). Several studies have observed prominent population specificity of mQTL effects between the CEU and YRI samples, and hypothesized that genetic epistasis could be one explanation (14,16). To test this hypothesis, we identified 6726 CpG sites mapped to common mQTL and 1216 CpG sites mapped to population-specific mQTL (Supplementary Data; also see Materials and Methods). Compared with CpG sites mapped to common mQTL, CpG sites mapped to population-specific mQTL were enriched for CpG sites under epistatic regulation in the CEU samples by 9.4-fold (binomial test P < 4 × 10−29), suggesting that local genetic epistasis may contribute to population specificity of mQTL effects. Supplementary Data illustrates two possible scenarios including population-specific epistasis and different genetic heterogeneity of the interacting loci.
Genetic and epigenetic dissection of gene expression variation
Genetic and non-genetic gene regulation mediated through cytosine modifications
Similar to the genetically dependent repressive sites, genetically independent repressive sites were significantly enriched for transcription factor binding sites (Fig. 4C, lower panel) and H3K4me3 peaks (Fig. 4D, lower panel) in the 5′ end of genes. In contrast, genetically independent facilitative sites showed little enrichment for transcription factor binding sites (Fig. 4C, upper panel), but significantly overlapped with H3K27me3 peaks in gene boundaries, which mark polycomb repressive complex 2 (PRC2) targeting (Fig. 4D, upper panel). Interestingly, H3K9me3 peaks that mark heterochromatin and repetitive elements were significantly enriched in genetically independent repressive and facilitative sites, in the 5′ end of genes and gene boundaries, respectively (Fig. 4D). Similar patterns were observed in the YRI samples (Supplementary Data).
We reasoned that as chromatin accessibility to transcription apparatus is a global feature, whereas transcription factor binding is gene-specific, the strength of correlation between absolute cytosine modification levels and absolute gene expression levels across CpG-gene pairs may reflect the underlying regulatory mechanism of the CpG sites. The correlation would be strong if CpG sites modulate the accessibility of chromatin to transcription apparatus, but would be weak if CpG sites affect expression levels by interfering with transcription factor binding. Compared with genetically dependent CpG sites, genetically independent CpG sites showed stronger correlations of absolute cytosine modification levels with absolute gene expression levels across CpG-gene pairs (Spearman's ρ = −0.24 versus ρ = −0.06 for repressive sites; ρ = 0.29 versus ρ = 0.11 for facilitative sites), suggesting that genetically independent CpG sites were more likely to modulate chromatin accessibility. Such influence on chromatin was highly localized, as only 3% of genetically independent CpG sites correlated (across samples) with expression levels of multiple (≥2) genes, compared with 16% of genetically dependent sites. Notably, none of the genetically independent CpG sites, whereas 25 of the genetic dependent CpG sites, correlated with expressions of multiple genes in opposite directions (test of equal proportions, P = 0.001).
Annotations of GWAS loci by mQTL and eQTL
We further carried out a systematic annotation of these GWAS SNPs, by associating them with cytosine modification levels and gene expression levels in the CEU samples. Here, only the GWAS SNPs < 100 kb away from CpG sites or gene starts/stops were considered. At 5% FDR, we detected 361 unique GWAS SNPs, covering 193 (34%) unique diseases/traits, as mQTL associated with cytosine modification levels of 316 CpG sites (Supplementary Data). In addition, we detected 144 GWAS SNPs, covering 83 (15%) diseases/traits, as eQTL associated with expression levels of 92 genes (Supplementary Data). Overall, 11% of the unique GWAS SNPs were annotated, with 78% being mQTL and 31% being eQTL, covering 37% of the investigated diseases/traits. The annotated GWAS loci shed light into the genetic associations. In Table 2, we present diseases/traits whose GWAS loci were simultaneously mapped to mQTL and eQTL. The annotations frequently narrowed down to one of the candidate genes reported by the original GWAS, but occasionally suggested alternative or additional mechanisms. For example, SNPs associated with progressive supranuclear palsy (38), Parkinson's disease (39) and interstitial lung disease (40) were thought to alter MAPT (microtubule-associated protein tau) function. We found that these SNPs did associate with cytosine modifications at CpG sites within MAPT, but also with expression of the downstream KANSL1 (KAT8 regulatory NSL complex subunit 1) gene, suggesting more complicated mechanisms underlying the disease associations, as pointed out by a previous study (41).
Category . | Disease/trait . | GWAS SNP . | Associated CpG sites (−log10P)a . | Associated genes (−log10P)b . |
---|---|---|---|---|
Autoimmune disorder | Rheumatoid arthritis | rs6457620 | cg23464743 (6.1) | HLA-DQA2 (3.6) |
Rheumatoid arthritis | rs6457617 | cg23464743 (6.1) | HLA-DQA2 (3.6) | |
Systemic sclerosis | ||||
Graves’ disease | ||||
Rheumatoid arthritis | rs2736340 | cg21701351 (4.3) | BLK (4.5), FAM167A (3.9) | |
Kawasaki disease | ||||
Crohn's disease | rs3764147 | cg00423124 (6.0) | LACC1 (6.6) | |
Ulcerative colitis | rs798502 | cg01139696 (4.0), cg08377331 (8.5), cg15092213 (5.0), cg17393140 (10), cg24648384 (12) | GNA12 (4.7) | |
Systemic lupus erythematosus | rs13277113 | cg21701351 (4.1) | BLK (4.3) | |
Systemic lupus erythematosus | rs2618476 | cg21701351 (4.0) | BLK (4.0), FAM167A (4.1) | |
Type 1 diabetes | rs2647044 | HLA-DQB2 (3.2) | ||
Inflammatory bowel disease | rs2382817 | cg00012203 (5.5), cg05991184 (9.5) | TMBIM1 (5.1) | |
Bacterial infection | Helicobacter pylori serologic status | rs10004195 | cg09316306 (4.1) | TLR6 (4.6) |
Leprosy | rs3764147 | cg00423124 (6.0) | LACC1 (6.6) | |
Cancer | Hepatocellular carcinoma (hepatitis B virus related) | rs9275319 | cg04418355 (4.2), cg14255617 (4.9) | HLA-DQA2 (3.6) |
Multiple myeloma (hyperdiploidy) | rs2272007 | cg05589743 (5.1) | ULK4 (9.3) | |
Multiple myeloma | rs1052501 | cg05589743 (5.1) | ULK4 (9.3) | |
Nasopharyngeal carcinoma | rs3129055 | cg02157626 (4.1), cg02355653 (6.4), cg03449857 (6.4), cg06458771 (4.6), cg13835168 (4.0), cg15570656 (6.8), cg15708526 (3.9), cg16994534 (5.2), cg24100841 (5.3), cg24444631 (4.2), cg26021304 (10), cg27230769 (5.3) | ZFP57 (4.8) | |
Prostate cancer | rs130067 | cg07414487 (7.1), cg21334609 (4.4) | CCHCR1 (4.6), TCF19 (7.3) | |
Wilms tumor | rs3755132 | cg12888861 (4.6), cg24674087 (5.1) | DDX1 (5.0) | |
Chronic inflammatory disease | COPD-related biomarkers | rs2074488 | cg04710276 (5.2), cg18923388 (3.9) | HLA-B (3.7) |
COPD-related biomarkers | rs1265093 | cg07414487 (13) | TCF19 (4.3) | |
Asthma | rs9273349 | cg19864342 (4.2), cg23464743 (7.5) | HLA-DQB2 (6.0) | |
Coagulation | End-stage coagulation | rs10665, rs2181540, rs6041 | cg19086603 (19, 19, 22) | LOC100506079 (4.1, 3.5, 3.2) |
Prothrombin time | rs561241 | cg19086603 (23) | LOC100506079 (3.2) | |
Factor VII | ||||
Metabolism syndrome related | Palmitoleic acid (16:1n-7) plasma levels | rs11190604 | cg07080220 (4.4), cg07570687 (4.7) | SEC31B (3.5) |
Phospholipid levels (plasma) | rs1077989 | cg16224230 (4.8) | TMEM229B (3.9) | |
Metabolic syndrome | rs782590 | cg18811423 (4.2) | PNPT1 (4.8) | |
Body mass index | rs4771122 | cg22138327 (8.9) | GTF3A (4.6) | |
Neurological disorder | Progressive supranuclear palsy | rs8070723 | cg07368061 (4.9), cg18228076 (7.3) | KANSL1 (5.2) |
Parkinson's disease | ||||
Other | Height | rs1182188 | cg01139696 (5.5), cg08377331 (11), cg15092213 (4.6), cg24648384 (14) | GNA12 (5.8) |
Height | rs798489 | cg01139696 (4.6), cg08377331 (10), cg15092213 (4.5), cg17393140 (9.1), cg24648384 (15) | GNA12 (4.5) | |
Height | rs2665838 | cg23090583 (4.6) | FTSJ3 (3.9) | |
Height | rs6457620 | cg23464743 (6.1) | HLA-DQA2 (3.6) | |
Height | rs12694997 | cg17240976 (33) | FARP2 (3.4) | |
Height | rs2941551 | cg23090583 (4.6) | FTSJ3 (4.0) | |
Diastolic blood pressure | rs9815354 | cg05589743 (4.3) | ULK4 (9.0) | |
Menopause (age at onset) | rs2303369 | cg04845466 (4.2), cg12000995 (7.2), cg12648201 (6.8), cg17158414 (4.0), cg24768116 (4.5) | NRBP1 (3.4) | |
Liver enzyme levels (gamma-glutamyl transferase) | rs2739330 | cg04234412 (9.2), cg09033563 (6.8), cg17293426 (4.1), cg18538332 (4.2), cg24846343 (4.9) | GSTT1 (9.7) | |
Metabolic traits | rs2403254 | cg24872173 (5.6) | GTF2H1 (4.4) | |
Mean corpuscular hemoglobin concentration | rs4580814 | cg04380229 (6.0), cg04467119 (4.2), cg08344943 (3.9), cg08702805 (4.8), cg12199905 (4.0), cg22955595 (4.3), cg25388971 (4.5), cg25970471 (4.1) | SLC12A7 (5.8) | |
Response to temozolomide | rs477692 | cg05714579 (6.8) | MGMT (3.3) | |
Renal function-related traits (BUN) | rs11123170 | cg07594247 (5.0), cg17445212 (5.0), cg23564664 (5.0) | PAX8 (4.8) | |
Multiple sclerosis (OCB status) | rs3129720 | cg23464743 (6.9) | HLA-DQB2 (5.6) | |
Multiple sclerosis | rs2523393 | cg02355653 (11), cg15570656 (4.3), cg17221604 (4.2), cg25046571 (5.8), cg26021304 (7.3), cg27230769 (24) | HLA-F (3.3), ZFP57 (3.9) | |
Interstitial lung disease | rs1981997 | cg07368061 (4.9), cg18228076 (7.3), cg24801230 (11) | KANSL1 (5.2) |
Category . | Disease/trait . | GWAS SNP . | Associated CpG sites (−log10P)a . | Associated genes (−log10P)b . |
---|---|---|---|---|
Autoimmune disorder | Rheumatoid arthritis | rs6457620 | cg23464743 (6.1) | HLA-DQA2 (3.6) |
Rheumatoid arthritis | rs6457617 | cg23464743 (6.1) | HLA-DQA2 (3.6) | |
Systemic sclerosis | ||||
Graves’ disease | ||||
Rheumatoid arthritis | rs2736340 | cg21701351 (4.3) | BLK (4.5), FAM167A (3.9) | |
Kawasaki disease | ||||
Crohn's disease | rs3764147 | cg00423124 (6.0) | LACC1 (6.6) | |
Ulcerative colitis | rs798502 | cg01139696 (4.0), cg08377331 (8.5), cg15092213 (5.0), cg17393140 (10), cg24648384 (12) | GNA12 (4.7) | |
Systemic lupus erythematosus | rs13277113 | cg21701351 (4.1) | BLK (4.3) | |
Systemic lupus erythematosus | rs2618476 | cg21701351 (4.0) | BLK (4.0), FAM167A (4.1) | |
Type 1 diabetes | rs2647044 | HLA-DQB2 (3.2) | ||
Inflammatory bowel disease | rs2382817 | cg00012203 (5.5), cg05991184 (9.5) | TMBIM1 (5.1) | |
Bacterial infection | Helicobacter pylori serologic status | rs10004195 | cg09316306 (4.1) | TLR6 (4.6) |
Leprosy | rs3764147 | cg00423124 (6.0) | LACC1 (6.6) | |
Cancer | Hepatocellular carcinoma (hepatitis B virus related) | rs9275319 | cg04418355 (4.2), cg14255617 (4.9) | HLA-DQA2 (3.6) |
Multiple myeloma (hyperdiploidy) | rs2272007 | cg05589743 (5.1) | ULK4 (9.3) | |
Multiple myeloma | rs1052501 | cg05589743 (5.1) | ULK4 (9.3) | |
Nasopharyngeal carcinoma | rs3129055 | cg02157626 (4.1), cg02355653 (6.4), cg03449857 (6.4), cg06458771 (4.6), cg13835168 (4.0), cg15570656 (6.8), cg15708526 (3.9), cg16994534 (5.2), cg24100841 (5.3), cg24444631 (4.2), cg26021304 (10), cg27230769 (5.3) | ZFP57 (4.8) | |
Prostate cancer | rs130067 | cg07414487 (7.1), cg21334609 (4.4) | CCHCR1 (4.6), TCF19 (7.3) | |
Wilms tumor | rs3755132 | cg12888861 (4.6), cg24674087 (5.1) | DDX1 (5.0) | |
Chronic inflammatory disease | COPD-related biomarkers | rs2074488 | cg04710276 (5.2), cg18923388 (3.9) | HLA-B (3.7) |
COPD-related biomarkers | rs1265093 | cg07414487 (13) | TCF19 (4.3) | |
Asthma | rs9273349 | cg19864342 (4.2), cg23464743 (7.5) | HLA-DQB2 (6.0) | |
Coagulation | End-stage coagulation | rs10665, rs2181540, rs6041 | cg19086603 (19, 19, 22) | LOC100506079 (4.1, 3.5, 3.2) |
Prothrombin time | rs561241 | cg19086603 (23) | LOC100506079 (3.2) | |
Factor VII | ||||
Metabolism syndrome related | Palmitoleic acid (16:1n-7) plasma levels | rs11190604 | cg07080220 (4.4), cg07570687 (4.7) | SEC31B (3.5) |
Phospholipid levels (plasma) | rs1077989 | cg16224230 (4.8) | TMEM229B (3.9) | |
Metabolic syndrome | rs782590 | cg18811423 (4.2) | PNPT1 (4.8) | |
Body mass index | rs4771122 | cg22138327 (8.9) | GTF3A (4.6) | |
Neurological disorder | Progressive supranuclear palsy | rs8070723 | cg07368061 (4.9), cg18228076 (7.3) | KANSL1 (5.2) |
Parkinson's disease | ||||
Other | Height | rs1182188 | cg01139696 (5.5), cg08377331 (11), cg15092213 (4.6), cg24648384 (14) | GNA12 (5.8) |
Height | rs798489 | cg01139696 (4.6), cg08377331 (10), cg15092213 (4.5), cg17393140 (9.1), cg24648384 (15) | GNA12 (4.5) | |
Height | rs2665838 | cg23090583 (4.6) | FTSJ3 (3.9) | |
Height | rs6457620 | cg23464743 (6.1) | HLA-DQA2 (3.6) | |
Height | rs12694997 | cg17240976 (33) | FARP2 (3.4) | |
Height | rs2941551 | cg23090583 (4.6) | FTSJ3 (4.0) | |
Diastolic blood pressure | rs9815354 | cg05589743 (4.3) | ULK4 (9.0) | |
Menopause (age at onset) | rs2303369 | cg04845466 (4.2), cg12000995 (7.2), cg12648201 (6.8), cg17158414 (4.0), cg24768116 (4.5) | NRBP1 (3.4) | |
Liver enzyme levels (gamma-glutamyl transferase) | rs2739330 | cg04234412 (9.2), cg09033563 (6.8), cg17293426 (4.1), cg18538332 (4.2), cg24846343 (4.9) | GSTT1 (9.7) | |
Metabolic traits | rs2403254 | cg24872173 (5.6) | GTF2H1 (4.4) | |
Mean corpuscular hemoglobin concentration | rs4580814 | cg04380229 (6.0), cg04467119 (4.2), cg08344943 (3.9), cg08702805 (4.8), cg12199905 (4.0), cg22955595 (4.3), cg25388971 (4.5), cg25970471 (4.1) | SLC12A7 (5.8) | |
Response to temozolomide | rs477692 | cg05714579 (6.8) | MGMT (3.3) | |
Renal function-related traits (BUN) | rs11123170 | cg07594247 (5.0), cg17445212 (5.0), cg23564664 (5.0) | PAX8 (4.8) | |
Multiple sclerosis (OCB status) | rs3129720 | cg23464743 (6.9) | HLA-DQB2 (5.6) | |
Multiple sclerosis | rs2523393 | cg02355653 (11), cg15570656 (4.3), cg17221604 (4.2), cg25046571 (5.8), cg26021304 (7.3), cg27230769 (24) | HLA-F (3.3), ZFP57 (3.9) | |
Interstitial lung disease | rs1981997 | cg07368061 (4.9), cg18228076 (7.3), cg24801230 (11) | KANSL1 (5.2) |
The −log10P was displayed within parentheses for associations.
aBetween the cytosine modification levels of the CpG site and the GWAS SNP.
bBetween the gene expression levels of the gene and the GWAS SNP. COPD, chronic obstructive pulmonary disease.
Category . | Disease/trait . | GWAS SNP . | Associated CpG sites (−log10P)a . | Associated genes (−log10P)b . |
---|---|---|---|---|
Autoimmune disorder | Rheumatoid arthritis | rs6457620 | cg23464743 (6.1) | HLA-DQA2 (3.6) |
Rheumatoid arthritis | rs6457617 | cg23464743 (6.1) | HLA-DQA2 (3.6) | |
Systemic sclerosis | ||||
Graves’ disease | ||||
Rheumatoid arthritis | rs2736340 | cg21701351 (4.3) | BLK (4.5), FAM167A (3.9) | |
Kawasaki disease | ||||
Crohn's disease | rs3764147 | cg00423124 (6.0) | LACC1 (6.6) | |
Ulcerative colitis | rs798502 | cg01139696 (4.0), cg08377331 (8.5), cg15092213 (5.0), cg17393140 (10), cg24648384 (12) | GNA12 (4.7) | |
Systemic lupus erythematosus | rs13277113 | cg21701351 (4.1) | BLK (4.3) | |
Systemic lupus erythematosus | rs2618476 | cg21701351 (4.0) | BLK (4.0), FAM167A (4.1) | |
Type 1 diabetes | rs2647044 | HLA-DQB2 (3.2) | ||
Inflammatory bowel disease | rs2382817 | cg00012203 (5.5), cg05991184 (9.5) | TMBIM1 (5.1) | |
Bacterial infection | Helicobacter pylori serologic status | rs10004195 | cg09316306 (4.1) | TLR6 (4.6) |
Leprosy | rs3764147 | cg00423124 (6.0) | LACC1 (6.6) | |
Cancer | Hepatocellular carcinoma (hepatitis B virus related) | rs9275319 | cg04418355 (4.2), cg14255617 (4.9) | HLA-DQA2 (3.6) |
Multiple myeloma (hyperdiploidy) | rs2272007 | cg05589743 (5.1) | ULK4 (9.3) | |
Multiple myeloma | rs1052501 | cg05589743 (5.1) | ULK4 (9.3) | |
Nasopharyngeal carcinoma | rs3129055 | cg02157626 (4.1), cg02355653 (6.4), cg03449857 (6.4), cg06458771 (4.6), cg13835168 (4.0), cg15570656 (6.8), cg15708526 (3.9), cg16994534 (5.2), cg24100841 (5.3), cg24444631 (4.2), cg26021304 (10), cg27230769 (5.3) | ZFP57 (4.8) | |
Prostate cancer | rs130067 | cg07414487 (7.1), cg21334609 (4.4) | CCHCR1 (4.6), TCF19 (7.3) | |
Wilms tumor | rs3755132 | cg12888861 (4.6), cg24674087 (5.1) | DDX1 (5.0) | |
Chronic inflammatory disease | COPD-related biomarkers | rs2074488 | cg04710276 (5.2), cg18923388 (3.9) | HLA-B (3.7) |
COPD-related biomarkers | rs1265093 | cg07414487 (13) | TCF19 (4.3) | |
Asthma | rs9273349 | cg19864342 (4.2), cg23464743 (7.5) | HLA-DQB2 (6.0) | |
Coagulation | End-stage coagulation | rs10665, rs2181540, rs6041 | cg19086603 (19, 19, 22) | LOC100506079 (4.1, 3.5, 3.2) |
Prothrombin time | rs561241 | cg19086603 (23) | LOC100506079 (3.2) | |
Factor VII | ||||
Metabolism syndrome related | Palmitoleic acid (16:1n-7) plasma levels | rs11190604 | cg07080220 (4.4), cg07570687 (4.7) | SEC31B (3.5) |
Phospholipid levels (plasma) | rs1077989 | cg16224230 (4.8) | TMEM229B (3.9) | |
Metabolic syndrome | rs782590 | cg18811423 (4.2) | PNPT1 (4.8) | |
Body mass index | rs4771122 | cg22138327 (8.9) | GTF3A (4.6) | |
Neurological disorder | Progressive supranuclear palsy | rs8070723 | cg07368061 (4.9), cg18228076 (7.3) | KANSL1 (5.2) |
Parkinson's disease | ||||
Other | Height | rs1182188 | cg01139696 (5.5), cg08377331 (11), cg15092213 (4.6), cg24648384 (14) | GNA12 (5.8) |
Height | rs798489 | cg01139696 (4.6), cg08377331 (10), cg15092213 (4.5), cg17393140 (9.1), cg24648384 (15) | GNA12 (4.5) | |
Height | rs2665838 | cg23090583 (4.6) | FTSJ3 (3.9) | |
Height | rs6457620 | cg23464743 (6.1) | HLA-DQA2 (3.6) | |
Height | rs12694997 | cg17240976 (33) | FARP2 (3.4) | |
Height | rs2941551 | cg23090583 (4.6) | FTSJ3 (4.0) | |
Diastolic blood pressure | rs9815354 | cg05589743 (4.3) | ULK4 (9.0) | |
Menopause (age at onset) | rs2303369 | cg04845466 (4.2), cg12000995 (7.2), cg12648201 (6.8), cg17158414 (4.0), cg24768116 (4.5) | NRBP1 (3.4) | |
Liver enzyme levels (gamma-glutamyl transferase) | rs2739330 | cg04234412 (9.2), cg09033563 (6.8), cg17293426 (4.1), cg18538332 (4.2), cg24846343 (4.9) | GSTT1 (9.7) | |
Metabolic traits | rs2403254 | cg24872173 (5.6) | GTF2H1 (4.4) | |
Mean corpuscular hemoglobin concentration | rs4580814 | cg04380229 (6.0), cg04467119 (4.2), cg08344943 (3.9), cg08702805 (4.8), cg12199905 (4.0), cg22955595 (4.3), cg25388971 (4.5), cg25970471 (4.1) | SLC12A7 (5.8) | |
Response to temozolomide | rs477692 | cg05714579 (6.8) | MGMT (3.3) | |
Renal function-related traits (BUN) | rs11123170 | cg07594247 (5.0), cg17445212 (5.0), cg23564664 (5.0) | PAX8 (4.8) | |
Multiple sclerosis (OCB status) | rs3129720 | cg23464743 (6.9) | HLA-DQB2 (5.6) | |
Multiple sclerosis | rs2523393 | cg02355653 (11), cg15570656 (4.3), cg17221604 (4.2), cg25046571 (5.8), cg26021304 (7.3), cg27230769 (24) | HLA-F (3.3), ZFP57 (3.9) | |
Interstitial lung disease | rs1981997 | cg07368061 (4.9), cg18228076 (7.3), cg24801230 (11) | KANSL1 (5.2) |
Category . | Disease/trait . | GWAS SNP . | Associated CpG sites (−log10P)a . | Associated genes (−log10P)b . |
---|---|---|---|---|
Autoimmune disorder | Rheumatoid arthritis | rs6457620 | cg23464743 (6.1) | HLA-DQA2 (3.6) |
Rheumatoid arthritis | rs6457617 | cg23464743 (6.1) | HLA-DQA2 (3.6) | |
Systemic sclerosis | ||||
Graves’ disease | ||||
Rheumatoid arthritis | rs2736340 | cg21701351 (4.3) | BLK (4.5), FAM167A (3.9) | |
Kawasaki disease | ||||
Crohn's disease | rs3764147 | cg00423124 (6.0) | LACC1 (6.6) | |
Ulcerative colitis | rs798502 | cg01139696 (4.0), cg08377331 (8.5), cg15092213 (5.0), cg17393140 (10), cg24648384 (12) | GNA12 (4.7) | |
Systemic lupus erythematosus | rs13277113 | cg21701351 (4.1) | BLK (4.3) | |
Systemic lupus erythematosus | rs2618476 | cg21701351 (4.0) | BLK (4.0), FAM167A (4.1) | |
Type 1 diabetes | rs2647044 | HLA-DQB2 (3.2) | ||
Inflammatory bowel disease | rs2382817 | cg00012203 (5.5), cg05991184 (9.5) | TMBIM1 (5.1) | |
Bacterial infection | Helicobacter pylori serologic status | rs10004195 | cg09316306 (4.1) | TLR6 (4.6) |
Leprosy | rs3764147 | cg00423124 (6.0) | LACC1 (6.6) | |
Cancer | Hepatocellular carcinoma (hepatitis B virus related) | rs9275319 | cg04418355 (4.2), cg14255617 (4.9) | HLA-DQA2 (3.6) |
Multiple myeloma (hyperdiploidy) | rs2272007 | cg05589743 (5.1) | ULK4 (9.3) | |
Multiple myeloma | rs1052501 | cg05589743 (5.1) | ULK4 (9.3) | |
Nasopharyngeal carcinoma | rs3129055 | cg02157626 (4.1), cg02355653 (6.4), cg03449857 (6.4), cg06458771 (4.6), cg13835168 (4.0), cg15570656 (6.8), cg15708526 (3.9), cg16994534 (5.2), cg24100841 (5.3), cg24444631 (4.2), cg26021304 (10), cg27230769 (5.3) | ZFP57 (4.8) | |
Prostate cancer | rs130067 | cg07414487 (7.1), cg21334609 (4.4) | CCHCR1 (4.6), TCF19 (7.3) | |
Wilms tumor | rs3755132 | cg12888861 (4.6), cg24674087 (5.1) | DDX1 (5.0) | |
Chronic inflammatory disease | COPD-related biomarkers | rs2074488 | cg04710276 (5.2), cg18923388 (3.9) | HLA-B (3.7) |
COPD-related biomarkers | rs1265093 | cg07414487 (13) | TCF19 (4.3) | |
Asthma | rs9273349 | cg19864342 (4.2), cg23464743 (7.5) | HLA-DQB2 (6.0) | |
Coagulation | End-stage coagulation | rs10665, rs2181540, rs6041 | cg19086603 (19, 19, 22) | LOC100506079 (4.1, 3.5, 3.2) |
Prothrombin time | rs561241 | cg19086603 (23) | LOC100506079 (3.2) | |
Factor VII | ||||
Metabolism syndrome related | Palmitoleic acid (16:1n-7) plasma levels | rs11190604 | cg07080220 (4.4), cg07570687 (4.7) | SEC31B (3.5) |
Phospholipid levels (plasma) | rs1077989 | cg16224230 (4.8) | TMEM229B (3.9) | |
Metabolic syndrome | rs782590 | cg18811423 (4.2) | PNPT1 (4.8) | |
Body mass index | rs4771122 | cg22138327 (8.9) | GTF3A (4.6) | |
Neurological disorder | Progressive supranuclear palsy | rs8070723 | cg07368061 (4.9), cg18228076 (7.3) | KANSL1 (5.2) |
Parkinson's disease | ||||
Other | Height | rs1182188 | cg01139696 (5.5), cg08377331 (11), cg15092213 (4.6), cg24648384 (14) | GNA12 (5.8) |
Height | rs798489 | cg01139696 (4.6), cg08377331 (10), cg15092213 (4.5), cg17393140 (9.1), cg24648384 (15) | GNA12 (4.5) | |
Height | rs2665838 | cg23090583 (4.6) | FTSJ3 (3.9) | |
Height | rs6457620 | cg23464743 (6.1) | HLA-DQA2 (3.6) | |
Height | rs12694997 | cg17240976 (33) | FARP2 (3.4) | |
Height | rs2941551 | cg23090583 (4.6) | FTSJ3 (4.0) | |
Diastolic blood pressure | rs9815354 | cg05589743 (4.3) | ULK4 (9.0) | |
Menopause (age at onset) | rs2303369 | cg04845466 (4.2), cg12000995 (7.2), cg12648201 (6.8), cg17158414 (4.0), cg24768116 (4.5) | NRBP1 (3.4) | |
Liver enzyme levels (gamma-glutamyl transferase) | rs2739330 | cg04234412 (9.2), cg09033563 (6.8), cg17293426 (4.1), cg18538332 (4.2), cg24846343 (4.9) | GSTT1 (9.7) | |
Metabolic traits | rs2403254 | cg24872173 (5.6) | GTF2H1 (4.4) | |
Mean corpuscular hemoglobin concentration | rs4580814 | cg04380229 (6.0), cg04467119 (4.2), cg08344943 (3.9), cg08702805 (4.8), cg12199905 (4.0), cg22955595 (4.3), cg25388971 (4.5), cg25970471 (4.1) | SLC12A7 (5.8) | |
Response to temozolomide | rs477692 | cg05714579 (6.8) | MGMT (3.3) | |
Renal function-related traits (BUN) | rs11123170 | cg07594247 (5.0), cg17445212 (5.0), cg23564664 (5.0) | PAX8 (4.8) | |
Multiple sclerosis (OCB status) | rs3129720 | cg23464743 (6.9) | HLA-DQB2 (5.6) | |
Multiple sclerosis | rs2523393 | cg02355653 (11), cg15570656 (4.3), cg17221604 (4.2), cg25046571 (5.8), cg26021304 (7.3), cg27230769 (24) | HLA-F (3.3), ZFP57 (3.9) | |
Interstitial lung disease | rs1981997 | cg07368061 (4.9), cg18228076 (7.3), cg24801230 (11) | KANSL1 (5.2) |
The −log10P was displayed within parentheses for associations.
aBetween the cytosine modification levels of the CpG site and the GWAS SNP.
bBetween the gene expression levels of the gene and the GWAS SNP. COPD, chronic obstructive pulmonary disease.
The GWAS Catalog includes a range of diseases for which pathogenesis stemmed from distinct tissues and cell types, whereas our mQTL and eQTL were detected from LCLs. To examine the cell lineage specificity of mQTL and eQTL in disease determination, we looked at the types of diseases preferentially associated with mQTL or eQTL. We classified GWAS diseases/traits that have clear etiologies to seven major categories (Fig. 5C), and estimated the null distributions of the number of random GWAS SNPs (nominal association P-value < 5 × 10−8) annotated as mQTL or eQTL for each of the disease categories. As shown in Figure 5C, chronic inflammatory diseases and autoimmune disorders were enriched for GWAS loci annotated as mQTL (P = 0.005 and P = 0.002, respectively) as well as loci annotated as eQTL (P < 0.0001 and P = 0.03, respectively). Psychiatric disorders and cancers were relatively enriched for GWAS loci annotated as mQTL (P = 0.09 and P = 0.1, respectively) compared with eQTL (P = 0.8), suggesting both common and specific disease enrichment between mQTL and eQTL (Supplementary Data).
DISCUSSION
We assessed the genetic architecture of cytosine modifications for >280K autosomal CpG sites in two population samples. Our observations suggest that variation in local sequence context was the primary determinant of interindividual cytosine modification differences. Such local genetic regulation appeared to be simple, in that the majority of CpG sites were controlled by a single major mQTL, with 30% of CpG sites further modulated by additional mQTL that had relatively small combinatory effects. The impact of local genetic epistasis on within-population variation was also moderate, judged by the effect size of genetic interactions for the small number of detectable CpG sites (Supplementary Data). Our analysis, however, did reveal a >9-fold enrichment of CpG sites mapped to population-specific mQTL for CpG sites under local epistatic regulation, suggesting that local genetic epistasis was an important contributor to the population specificity of mQTL effects observed previously (14,16). The paucity of trans regulatory loci and the lack of master regulators found in this study could be interpreted by less complex genetic regulation for cytosine modification variations, but could also be due to a lack of power to detect the trans effects.
Our genetic and epigenetic dissection of gene expression variations revealed several interesting patterns. Genetically dependent CpG sites strongly co-localized with binding sites of canonic transcription factors. However, the repressive sites predominantly overlapped with H3K4me3 peaks at active promoters (42) while the facilitative sites overlapped few of such peaks. As the gene expression levels estimated in our study represented the averages across all exons annotated for the given reference genes, these observations suggest that greater levels of cytosine modification at active promoters suppress overall gene expression but at inactive promoters enhance overall gene expression. Cytosine modifications at transcribing promoters have been experimentally shown to suppress gene expression by disruption of transcription factor binding (43). We reasoned that cytosine modifications at inactive promoters may suppress incompetent transcriptional initiation, thereby facilitating the overall efficiency of transcriptional elongation. Indeed, a large proportion of the facilitative sites overlapped with H3K36me3 peaks (Supplementary Data) that mark transcriptional elongation (44,45). Many genetically dependent CpG sites located in intergenic regions, which is consistent with recent discoveries of intergenic CpG islands that associate with development-specific promoters (46).
Genetically independent CpG sites preferentially modulate gene expression through local chromatin accessibility. We found that the facilitative sites significantly co-localized with repressive H3K27me3 and H3K9me3 peaks around gene ends. Negative correlation between DNA methylation and H3K27me3 was observed genome wide (47,48). Profiling of H3K27me3 in DNA hypomethylated somatic cells further suggested that DNA methylome restricted PRC2 targeting (49). The positive interindividual correlations between gene expressions and cytosine modifications at H3K27me3 peaks are therefore consistent with previous observations. Expanding of repressive H3K27me3 and H3K9me3 blocks is the most prominent changes observed in lineage-committed human cells compared with pluripotent cells (50), which may explain why 73% of genetically independent CpG sites were facilitative sites. The co-localization of genetically independent repressive sites with H3K9me3 and H3K4me3 peaks may reflect the recruitment or exclusion of cytosine modifications at the already silenced or activated promoters, respectively, marked by the histone modifications (51,52).
We found that mQTL and eQTL were relatively enriched in GWAS SNPs associated with chronic inflammatory and autoimmune disorders (Supplementary Data), two disease categories related to the functions of lymphocytes from which LCLs were derived. This observation demonstrated the cell linage specificity of cytosine modifications and gene expressions in determining disease outcomes. On the other hand, mQTL were relatively enriched in GWAS SNPs associated with psychiatric disorders and cancers compared with eQTL (Supplementary Data). Cytosine modifications can be inherited over many cell generations (53). It is therefore likely that genetically regulated cytosine modifications, together with potential variations induced by mQTL × environment interaction, are more stably maintained over development than gene expression variations.
Genetic associations based on the 27K array data detected a very small number of mQTL. For example, only 37 CpG sites were mapped to mQTL in the YRI samples (13) and the numbers of CpG sites mapped to mQTL in the CEU and YRI samples in another LCL study were in a similar scale (14). Besides the 17× coverage difference, another cause for the discrepancy between results obtained from the 27K and 450K array data is the distinct genomic regions covered by the two platforms. The 27K array primarily interrogates promoter CpG sites, whereas the 450K array also interrogates a large number of CpG sites within gene bodies. As indicated by our study, the proportion of CpG sites mapped to mQTL is significantly lower in CpG islands compared with gene bodies. Different functional constrains on cytosine modifications and/or different mQTL detection sensitivity for CpG sites between these regions may contribute to the observation. Relatively comprehensive mQTL lists (including our LCL mQTL lists) will facilitate discovery and across-study comparisons (e.g. the correlation between mQTL and eQTL, and the correlation of mQTL across cell/tissue types).
Mapping of mQTL and eQTL aids in the study of complex traits in two ways. As shown in Table 2, annotations of known GWAS SNPs by mQTL and eQTL may help identify potential causal genes and often provide novel biological insights. Furthermore, in rare disease mapping with limited sample size, prioritization of SNPs with strong genetic effect and clear functional interpretation such as mQTL and eQTL could effectively reduce the burden of multiple comparisons and improve the power of the study. mQTL detected here have been deposited to the SCAN Database (32), which has gained >2 million queries from >56 000 unique IP addresses since 2009. We expect that mQTL and eQTL mapping studies using diverse cell types and treatment conditions will greatly expand our understanding of complex traits in general.
MATERIALS AND METHODS
Cell lines
Genomic DNA samples for 60 unrelated CEU (phase II) and 73 unrelated YRI (58 phase II plus 15 phase III samples) HapMap LCLs were purchased from Coriell Institute for Medical Research (Camden, NJ, USA). The cell line identities were confirmed by genotyping 47 SNPs from the Sequenom iPLEX Sample ID Plus Panel in 24 randomly chosen LCLs maintained by the Pharmacogenetics of Anticancer Agents Research Group Cell Core at The University of Chicago.
Cytosine modification data processing and validation
Cytosine modification levels were profiled with the 450K array (Illumina, Inc., San Diego, CA, USA), as described in a previous publication (16). Briefly, 150 ng DNA after bisulfite conversion was obtained for each sample, randomized by population identity and run on the 450K array plates with the Illumina HiScan System. We excluded CpG probes ambiguously mapped to the human genome (31) and CpG probes containing common SNPs (MAF >0.01) based on dbSNP (54) v135. The final dataset is comprised of 283 540 autosomal CpG sites with good hybridization quality (16). M-values, defined as the log2 ratio of the intensities of modified probe versus unmodified probe, were quantile normalized (55) across 133 samples, and adjusted for batch effect (56). Cytosine modification levels at selected CpG sites were validated by bisulfate sequencing (16). The 450K array data for three HapMap samples, NA12891, NA12892 and NA19239, highly correlated with the ENCODE (35) 450K array data (r: 0.95–0.99). The raw cytosine modification data have been deposited into the NCBI Gene Expression Omnibus (GEO Accession Number: GSE39672).
Genotype and gene expression data
SNP genotypes were downloaded from the International HapMap Project (57) (release 27 July 2009). SNPs that deviated from the Hardy–Weinberg equilibrium (P < 0.0001) within a population were removed. The analytic set was comprised of >2 million common SNPs with MAF > 0.05 that were genotyped in at least 48 samples in each population. Imputed genotypes were only used in multilocus model comparisons. The Affymetrix Human Exon Array 1.0ST was previously used to profile gene expression in the CEU and YRI samples (GEO Accession Number: GSE9703) (6). Probe sequences were aligned to human genome (GRCh37) allowing ≤1 miss match. Probes with perfect, unique alignments were further filtered by excluding probes containing common SNPs (MAF > 0.05) based on dbSNP (54) v135 and probes interrogating multiple genes. Probe intensities were log2 transformed, background corrected (58) and quantile normalized. Probe intensity was subtracted by the corresponding probe mean across samples. Gene-level expression intensities were summarized as mean probe intensity within each gene. Genes for which 90% samples had absolute expression levels less than five percentile of all expression levels were further excluded. In total, 16 476 autosomal genes with unique Entrez Gene IDs were included in the analyses. The expression levels of several selected genes that correlated with cytosine modification levels have been validated in the LCLs using quantitative RT–PCR (16).
Genome-wide and local scan of mQTL
In local scans, the top five principal components (PCs) estimated from M-values across CpG sites were regressed out for the 60 CEU samples, to account for potential confounding variables, whereas the top two PCs were regressed out for the 73 YRI samples. The number of PCs regressed out was selected to achieve the greatest detection sensitivity in each population. Cytosine modification levels were regressed on SNP allele dosages within each population. For local scans, FDR was estimated by 100 permutations. For whole-genome scan, FDR was estimated by three permutations in CEU due to the large amount of computation involved. The numbers of random associations called at a range of P-value thresholds were highly consistent across the three permutations. Furthermore, the number of significant calls at permutation-based FDR of 0.05 was close to the number of calls at Bonferroni-adjusted P = 0.05. We therefore used Bonferroni correction, corresponding to nominal P = 10−13, for both CEU and YRI samples in whole-genome scan.
Local genetic epistasis
SNPs with MAF > 0.2 and genotype frequency >0.1 within population were selected for analysis. For each CpG site, SNP pairs located within ±100 kb of the target cytosine were further selected so that LD between the two SNPs of a pair was constrained at r2 < 0.3. A linear model: cytosine modification levels ∼SNP1 + SNP2 + SNP1 × SNP2 + error was used to test for epistasis assuming an additive genetic mode. To avoid model collinearity, the correlation between the interaction term and each of the additive terms was constrained at r2 < 0.7. The significance of the interaction effect was estimated by an F-test that compared the full model with a reduced model without the interaction term. FDR was estimated by 100 permutations using an approximate test (59).
Common and population-specific mQTL
We pooled the CEU and YRI samples to test for common and population-specific mQTL. To avoid associations driven by one population, SNPs with MAF > 0.2 within each population were selected for analysis. A linear model: cytosine modification levels ∼population identity + SNP dosage + error was used to test for common genotype effect. A linear model: cytosine modification levels ∼population identity + SNP dosage + population identity × SNP dosage + error was used to test for population-specific genotype effect. The significance of common genotype effect and population-specific genotype effect was estimated by an F-test comparing a full model with a reduced model without SNP dosage term or without population identity × SNP dosage term, respectively. FDR was estimated by 100 permutations with an approximate test (59).
CpG-gene correlation and local scan of eQTL
We evaluated the within-population correlation between gene expression levels and cytosine modification levels of CpG sites that were <100 kb away from gene starts and gene stops. FDR was evaluated by 100 permutations. We mapped eQTL in 58 CEU and 59 YRI samples that overlapped with samples used in mQTL mapping. SNPs < 100 kb away from gene starts and gene stops were tested. The top 11 PCs estimated from gene expression levels across genes were regressed out for each population samples. Gene expression levels were regressed on SNP allele dosages within population. FDR was estimated by 100 permutations. To test for correlation between absolute cytosine modification levels and absolute gene expression levels across CpG-gene pairs for genetically dependent and genetically independent CpG sites, mean β values across samples and mean expression levels across samples were used.
Co-localization of CpG sites with the ENCODE peaks
We obtained uniformly processed narrow peaks for transcription factor binding sites and broad peaks for histone markers of cell line NA12878 from the ENCODE project (35). Peaks for each of the canonical transcription factors and histone modification markers were examined individually. We mapped all analyzed CpG sites to positional bins including 5 kb bins along the upstream 100 kb from transcriptional start site, 5′UTR, 10 percentile-bin along coding region, 3′UTR and 5 kb bins along the downstream 100 kb from transcriptional end site. To estimate null distributions, we randomly sampled 1000 times the same number of CpG-gene pairs as the number of true CpG-gene expression correlations, matching the positional bin distribution, and counted the number of CpG sites co-localized with the peaks for the given marker at each bin. The true numbers of co-localization were then compared with the 95 percentile thresholds of the null distributions.
Enrichment of GWAS SNPs in mQTL and eQTL
The NHGRI GWAS Catalog (37) has collected published GWAS results for complex traits including risks for common diseases. We obtained the NHGRI SNPs (released in November 6, 2013) reported to be associated with complex traits with a P-value of <5 × 10−8. To assess the enrichment of GWAS SNPs for mQTL and eQTL, we generated null distributions of the number of SNPs overlapping with GWAS SNPs, by random sampling SNPs 10 000 times according to the number and MAF distribution of mQTL and eQTL, respectively. The number of mQTL or eQTL overlapping with GWAS SNPs was then compared with the null distribution. Due to the relatively small number of GWAS SNPs selected at P-value < 5 × 10−8, mQTL and eQTL without LD pruning were used in the enrichment analyses present in Figure 5A and B. However, the enrichments observed were not due to LD, as using mQTL and eQTL pruned by r2 > 0.3 resulted in similar trend: P = 0.0002 for mQTL overlapping and P < 0.0001 for eQTL overlapping. To annotate GWAS SNPs as mQTL or eQTL, GWAS SNPs were associated with cytosine modification levels or gene expression levels in the CEU samples, if they were located within the regions ±100 kb of analyzed CpG sites or genes, respectively. FDR was estimated by 100 permutations. To assess the enrichment of mQTL and eQTL in disease categories, we again generated null distributions for each of the disease categories. For a given disease category, we randomly sampled GWAS SNPs 10 000 times according to the number and MAF distribution of the annotated mQTL or eQTL in that disease category. The true number of mQTL or eQTL annotated to the given disease category was then compared with the null distribution.
Bisulfite sequencing validation
Three CpG sites with local mQTL were selected for bisulfite sequencing validation (Supplementary Data). Bisulfite sequencing at these loci was performed on 10–24 random samples from the original DNA collection used in the 450K array profiling. ZymoTaq polymerase (Zymo Research, Irvine, CA, USA) and primers listed in Supplementary Data were used to amplify ∼250 bp fragments that included the CpG sites interrogated by the 450K array as well as adjacent CpGs.
SUPPLEMENTARY MATERIAL
FUNDING
This work was supported by National Institutes of Health (R21HG006367 to W.Z., L.A.G. and M.E.D.; U01GM061393 to M.E.D.).
ACKNOWLEDGEMENTS
We are grateful to Jennifer McQuade and Jamie Myers for technical support.
Conflict of Interest statement. None declared.
REFERENCES