Nat Commun. 2018 Oct 23;9(1):4397.
doi: 10.1038/s41467-018-06694-0.

A reference haplotype panel for genome-wide imputation of short tandem repeats

Shubham Saini et al. Nat Commun. 2018.

Abstract

Short tandem repeats (STRs) are involved in dozens of Mendelian disorders and have been implicated in complex traits. However, genotyping arrays used in genome-wide association studies focus on single nucleotide polymorphisms (SNPs) and do not readily allow identification of STR associations. We leverage next-generation sequencing (NGS) from 479 families to create a SNP + STR reference haplotype panel. Our panel enables imputing STR genotypes into SNP array data when NGS is not available for directly genotyping STRs. Imputed genotypes achieve mean concordance of 97% with observed genotypes in an external dataset compared to 71% expected under a naive model. Performance varies widely across STRs, with near perfect concordance at bi-allelic STRs vs. 70% at highly polymorphic repeats. Imputation increases power over individual SNPs to detect STR associations with gene expression. Imputing STRs into existing SNP datasets will enable the first large-scale STR association studies across a range of complex traits.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
A deep catalog of STR variation in the SSC cohort. a Number of STRs called per sample. Dashed line represents the mean of 1.14 million STRs per sample. b Call rate per locus. Dashed line represents the mean call rate of 90%. c Mendelian inheritance rate at filtered vs. unfiltered STRs. The x-axis gives the posterior genotype score (Q) returned by HipSTR. The y-axis gives the average Mendelian inheritance rate for each bin across all calls on chromosome 21. STRs that were homozygous for the reference allele in all members of a family were removed. Colors represent different motif lengths. d Per-STR expected heterozygosity in SSC vs. 1000 Genomes. Only STRs with expected heterozygosity >0.095 in SSC are included. Color scale gives the log10 number of STRs represented in each bin. e Allele frequency distributions at pathogenic STRs obtained in SSC samples vs. previously reported normal alleles. Blue = SSC, Gold = Previously reported. Boxes span the interquartile range and horizontal lines give the medians. Whiskers extend to the minimum and maximum data points. The y-axis gives the number of repeat units. Sources of previously reported allele frequencies are described in detail in Methods. HD Huntington’s disease, SCA spinocerebellar ataxia, DRPLA Dentatorubral-pallidoluysian atrophy, DM1 myotonic dystrophy type 1, HDL Huntington’s disease-like 2
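Panel d compares per-STR expected heterozygosity between cohorts. The caption does not give the formula, so the sketch below assumes the standard population-genetics definition, H = 1 - sum(p_i^2) over the allele frequencies p_i at a locus. The function name and the example allele lists are hypothetical and are not taken from the paper.

```python
from collections import Counter


def expected_heterozygosity(alleles):
    """Return 1 - sum(p_i^2) over the allele frequencies at one locus.

    `alleles` is a flat list of allele calls (here, repeat-unit counts),
    two entries per diploid sample. This definition is an assumption; the
    caption does not spell out how the statistic was computed.
    """
    counts = Counter(alleles)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no allele calls at this locus")
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# Hypothetical allele calls for illustration only.
biallelic = [10, 10, 10, 11, 10, 10, 11, 10]    # two alleles, 0.75 / 0.25
polymorphic = [10, 11, 12, 13, 14, 15, 16, 17]  # eight equally common alleles

h_bi = expected_heterozygosity(biallelic)
h_poly = expected_heterozygosity(polymorphic)
print(h_bi, h_poly)
```

Under this definition, a locus with many equally frequent alleles approaches 1, while a near-monomorphic locus approaches 0, which is why heterozygosity serves as the polymorphism axis in the concordance plots.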
Fig. 2
Creating a reference SNP-STR haplotype panel. a Schematic of phasing pipeline in the SSC cohort. To create the phased panel, STR genotypes were placed onto phased SNP haplotypes using Beagle. Any missing STR genotypes were imputed. The resulting panel was then used for downstream imputation from orthogonal SNP genotypes. Blue and red denote phased and unphased variants, respectively. Positions in gray are homozygous. b Concordance of imputed STR genotypes vs. expected heterozygosity. Blue denotes observed per-locus values, green denotes values expected under a random model and orange denotes values expected under a naive model. Solid lines give median values for each bin and filled areas span the 25th to 75th percentile of values in each bin. x-axis values were binned by 0.1. Upper gray plot gives the distribution of expected heterozygosity values in our panel. Concordance values are based on the leave-one-out analysis in the SSC cohort. c Per-locus imputation concordance in SSC vs. 1000 Genomes cohorts. Color scale gives the log10 number of STRs represented in each bin. Concordance values are based on the subset of samples from the 1000 Genomes deep WGS cohort with European ancestry. d Per-locus imputation concordance using HipSTR vs. capillary electrophoresis genotypes. Each dot represents one STR. The x-axis and y-axis give imputation concordance using capillary electrophoresis or HipSTR genotypes as a ground truth, respectively. Concordance was measured in separate sets of 1000 Genomes European samples for each technology. e Concordance of imputed vs. 10X STR genotypes in NA12878 stratified by concordance in SSC. STRs were binned by concordance value based on the leave-one-out analysis. Concordance in NA12878 was measured across all STRs in each bin. Dots give mean values for each bin and lines denote ±1 s.d. In all cases leave-one-out refers to analyses performed in the SSC cohort
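The concordance values in panels b to e compare imputed genotypes against observed ones. As a minimal sketch, the code below assumes one plausible scoring rule: each call scores the fraction of its two alleles shared with the observed genotype (0, 0.5, or 1, ignoring phase), and the per-locus value is the mean over samples with both calls present. This rule, the function names, and the example genotypes are assumptions for illustration; the caption does not define the metric.

```python
from collections import Counter


def call_concordance(observed, imputed):
    """Fraction of the two alleles shared between two unordered genotypes."""
    shared = Counter(observed) & Counter(imputed)  # multiset intersection
    return sum(shared.values()) / 2


def locus_concordance(observed_calls, imputed_calls):
    """Mean per-call concordance, skipping samples with a missing call."""
    pairs = [
        (obs, imp)
        for obs, imp in zip(observed_calls, imputed_calls)
        if obs is not None and imp is not None
    ]
    if not pairs:
        raise ValueError("no samples with both observed and imputed calls")
    return sum(call_concordance(obs, imp) for obs, imp in pairs) / len(pairs)


# Hypothetical genotypes (repeat-unit counts) for four samples at one STR.
observed = [(10, 11), (10, 10), (12, 11), None]
imputed = [(11, 10), (10, 11), (10, 10), (10, 10)]

locus_score = locus_concordance(observed, imputed)
print(locus_score)
```

Scoring alleles rather than whole genotypes gives partial credit when one of the two alleles is imputed correctly, which matters most at highly polymorphic repeats.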
Fig. 3
STR imputation improves power to detect STR associations. a Example simulated quantitative phenotype based on SSC genotypes. A quantitative phenotype was simulated assuming a causal STR (red). Power to detect the association was compared between the causal STR, imputed STR genotypes, and all common SNPs (MAF > 0.05) within a 50 kb window of the STR (gray). b Strength of association (-log10 p) is linearly related with LD with the causal variant. For SNPs, the x-axis gives the length r² calculated using observed genotypes. For the imputed STR (blue), the x-axis gives the length r² from leave-one-out analysis. c The gain in power using imputed genotypes is linearly related to the gain in length r² compared to the best tag SNP. Gray contours give the bivariate kernel density estimate. Top and right gray area gives the distribution of points along the x- and y-axes, respectively. Power was calculated based on the number of simulations out of 100 with nominal p < 0.05. d Quantile-quantile plot for eSTR association tests. Each dot represents a single STR×gene test. The x-axis gives the expected log10 p-value distribution under a null model of no eSTR associations. Red and blue dots give log10 p-values for association tests using HipSTR genotypes and imputed STR genotypes, respectively. Black dashed line gives the diagonal. e Comparison of eSTR effect sizes using observed vs. imputed genotypes. Each dot represents a single STR×gene test. The x-axis gives effect sizes obtained using imputed genotypes. Gray dots give the effect size in GTEx whole blood using HipSTR genotypes. Purple dots give effect sizes reported previously in lymphoblastoid cell lines. f, g Example putative causal eSTRs identified using imputed STR genotypes. Left, middle, and right plots give HipSTR STR dosage (red), imputed STR dosage (blue), and the best tag SNP genotype (gray) vs. normalized gene expression, respectively. STR dosage is defined as the average length difference from hg19. One dot represents one sample. P-values are obtained using linear regression of genotype vs. gene expression. STR and SNP sequence information is shown for the coding strand. Gene diagrams are not drawn to scale
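The dosage definition and the regression of genotype against expression in panels f and g can be sketched as follows. The sketch takes STR dosage as the mean of the two allele lengths minus the reference length, as the caption states, and fits an ordinary least-squares line using only the standard library. Measuring length in repeat units, the function names, and the example values are assumptions; the p-value calculation is omitted.

```python
def str_dosage(allele_lengths, ref_length):
    """Average length difference of the two alleles from the reference."""
    a, b = allele_lengths
    return ((a - ref_length) + (b - ref_length)) / 2


def ols_slope_intercept(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variance; slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: reference allele of 12 repeat units, four samples.
genotypes = [(12, 12), (12, 14), (14, 14), (10, 12)]
expression = [0.1, 0.6, 1.1, -0.4]  # normalized expression, made up

dosages = [str_dosage(g, ref_length=12) for g in genotypes]
slope, intercept = ols_slope_intercept(dosages, expression)
print(dosages, slope, intercept)
```

The slope plays the role of the eSTR effect size: the change in normalized expression per unit of dosage.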
Fig. 4
SNP haplotypes distinguish allele lengths at known pathogenic STRs. a Example SNP-STR haplotypes inferred in European samples at a polyglutamine repeat in ATN1 implicated in DRPLA. Each column represents a SNP from the founder haplotype reported by Veneziano et al. Each row represents a single haplotype inferred in 1000 Genomes Project phase 3 European samples, with gray and black boxes denoting major and minor alleles, respectively. Haplotypes are grouped by the corresponding STR allele. The number of SNP haplotypes for each group of STR alleles is annotated to the left of each box. Alleles seen fewer than 10 times in 1000 Genomes samples were excluded from the visualization. b Comparison of imputed vs. observed STR genotypes in SSC samples at the DRPLA locus. The x-axis gives the maximum likelihood genotype dosage returned by HipSTR and the y-axis gives the imputed dosage. Dosage is defined as the sum of the two allele lengths of each genotype relative to the hg19 reference genome. The bubble size represents the number of samples summarized by each data point. c Distribution of DRPLA repeat length vs. similarity to the pathogenic founder haplotype. The founder haplotype refers to the SNP haplotype reported by Veneziano, et al. on which a pathogenic expansion in ATN1 implicated in DRPLA likely originated. The x-axis gives the Hamming distance between observed haplotypes and the founder haplotype, computed as the number of positions with discordant alleles. White dots represent the median length
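The Hamming distance in panel c is, per the caption, the number of positions at which an observed haplotype and the founder haplotype carry discordant alleles. A minimal sketch follows, assuming haplotypes are encoded as equal-length strings with 0 for the major allele and 1 for the minor allele. The encoding and the example haplotypes are hypothetical.

```python
def hamming_distance(haplotype, founder):
    """Count positions where two equal-length haplotypes differ."""
    if len(haplotype) != len(founder):
        raise ValueError("haplotypes must cover the same SNP positions")
    return sum(a != b for a, b in zip(haplotype, founder))


# Hypothetical 7-SNP haplotypes; 0 = major allele, 1 = minor allele.
founder = "1101001"
observed_haplotypes = ["1101001", "1001011", "0010110"]

distances = [hamming_distance(h, founder) for h in observed_haplotypes]
print(distances)
```

A distance of 0 means the observed haplotype matches the founder haplotype at every SNP considered; larger values indicate more distantly related backgrounds.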
