Bioinformatics. 2024 Sep 1;40(Suppl 2):ii11-ii19.
doi: 10.1093/bioinformatics/btae391.

GGTyper: genotyping complex structural variants using short-read sequencing data

Tim Mirus et al. Bioinformatics. 2024.

Abstract

Motivation: Complex structural variants (SVs) are genomic rearrangements that involve multiple segments of DNA. They contribute to human diversity and have been shown to cause Mendelian disease. Nevertheless, our ability to analyse complex SVs is very limited. In contrast to deletions and other canonical SV types, no established tools have been explicitly designed for analysing complex SVs.

Results: Here, we describe a new computational approach that we specifically designed for genotyping complex SVs in short-read sequenced genomes. Given a variant description, our approach computes genotype-specific probability distributions for observing aligned read pairs with a wide range of properties. These distributions can then be used to efficiently determine the most likely genotype for any set of aligned read pairs observed in a sequenced genome. In addition, we use these distributions to compute a genotyping difficulty for a given variant, which predicts the amount of data needed to achieve a reliable call. Careful evaluation confirms that our approach outperforms other genotypers by making reliable genotype predictions across both simulated and real data. On up to 7829 human genomes, we achieve high concordance with population-genetic assumptions and expected inheritance patterns. On simulated data, we show that precision correlates well with our prediction of genotyping difficulty. This, together with low memory and time requirements, makes our approach well suited for application in biomedical studies involving small to very large numbers of short-read sequenced genomes.
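
To illustrate how genotype-specific probability distributions can be turned into a genotype call, the following minimal Python sketch scores observed read-pair counts against three genotype profiles and picks the most likely one. It is not the GGTyper implementation: the multinomial likelihood, the category names (`ref_pair`, `spanning`, `clipped`), the probability floor, the Phred-scaled quality, and all numbers are assumptions made for illustration.

```python
import math


def genotype_log_likelihoods(observed_counts, genotype_profiles, floor=1e-9):
    """Multinomial log-likelihood (up to a constant) of the observed counts per genotype.

    observed_counts: dict mapping read-pair category -> number of observed read pairs.
    genotype_profiles: dict mapping genotype -> dict of category -> probability.
    floor: hypothetical lower bound that avoids log(0) for unexpected categories.
    """
    result = {}
    for genotype, profile in genotype_profiles.items():
        total = 0.0
        for category, count in observed_counts.items():
            probability = max(profile.get(category, 0.0), floor)
            total += count * math.log(probability)
        result[genotype] = total
    return result


def call_genotype(observed_counts, genotype_profiles):
    """Return the most likely genotype and a Phred-scaled margin over the runner-up."""
    lls = genotype_log_likelihoods(observed_counts, genotype_profiles)
    ranked = sorted(lls, key=lls.get, reverse=True)
    best, second = ranked[0], ranked[1]
    quality = 10.0 * (lls[best] - lls[second]) / math.log(10)
    return best, quality


# Made-up profiles and counts, for illustration only.
profiles = {
    "REF/REF": {"ref_pair": 0.98, "spanning": 0.01, "clipped": 0.01},
    "REF/VAR": {"ref_pair": 0.60, "spanning": 0.20, "clipped": 0.20},
    "VAR/VAR": {"ref_pair": 0.20, "spanning": 0.40, "clipped": 0.40},
}
observed = {"ref_pair": 30, "spanning": 9, "clipped": 11}

best_genotype, genotype_quality = call_genotype(observed, profiles)
print(best_genotype, round(genotype_quality, 1))
```

Because the profiles depend only on the variant description and the insert-size distribution, they can be computed once and reused; scoring a sample is then a single pass over its read-pair counts.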

Availability and implementation: Source code is available at https://github.com/kehrlab/Complex-SV-Genotyping.

Conflict of interest statement

B.V.H. is an employee of deCODE genetics/Amgen Inc.

Figures

Figure 1.
Overview of our approach for calculating read-pair probabilities. We estimate an insert-size distribution from the BAM file (A). We create a reference (REF) and a variant (VAR) allele model from the variant description (B). We calculate allele-specific read-pair profiles from the insert-size distribution and the allele models (C). These profiles contain expected read-pair occurrence frequencies for all possible combinations of read-pair categories and 5′-differences. We obtain genotype profiles by mixing allele-specific read-pair profiles (D). BP, breakpoint; NJ, novel junction; clipped_xy, read pairs clipped due to novel junctions x and y; spanning_xy, read pairs spanning breakpoints x and y; FF, forward–forward read pairs; FR, forward–reverse read pairs; RR, reverse–reverse read pairs.
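
Step (D) of the caption above, mixing allele-specific profiles into genotype profiles, can be sketched as follows. This is a hedged illustration rather than the authors' code: profiles are assumed to be dictionaries keyed by hypothetical `(category, five_prime_difference)` tuples, and the heterozygous profile is assumed to be an equal-weight mixture of the REF and VAR profiles, which the caption does not specify.

```python
def normalise(counts):
    """Turn raw expected counts into occurrence frequencies that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot normalise an empty profile")
    return {key: value / total for key, value in counts.items()}


def mix_profiles(profile_a, profile_b, weight_a=0.5):
    """Weighted mixture of two profiles; keys missing from one profile count as 0."""
    keys = set(profile_a) | set(profile_b)
    return {
        key: weight_a * profile_a.get(key, 0.0)
        + (1.0 - weight_a) * profile_b.get(key, 0.0)
        for key in keys
    }


def genotype_profiles(ref_profile, var_profile):
    """Build the three diploid genotype profiles from two allele profiles.

    Assumption: both alleles of a heterozygous sample contribute equally.
    """
    return {
        "REF/REF": dict(ref_profile),
        "REF/VAR": mix_profiles(ref_profile, var_profile),
        "VAR/VAR": dict(var_profile),
    }


# Made-up allele profiles keyed by (category, 5'-difference).
ref = normalise({("FR", 300): 90, ("FR", 320): 10})
var = normalise({("FR", 300): 20, ("spanning_12", 450): 50, ("clipped_12", 0): 30})

het = genotype_profiles(ref, var)["REF/VAR"]
print(het[("FR", 300)], sum(het.values()))
```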
Figure 2.
Precision of GGTyper, Paragraph, and BayesTyper on simulated data by variant type using default filters of BayesTyper and no filters for GGTyper and Paragraph. Error bars indicate standard error across the seven simulated data sets.
Figure 3.
Genotype certainty and precision on simulated data depending on genotyping difficulty [Eq. (4), QE = 100]. Averages per variant type at 10×, 20×, 25×, and 30× coverage are shown.
Figure 4.
Population statistics on real data for 20 complex SVs. (A) Transmission rate (TR) and Mendelian inheritance error rate (MIER) in percent in 3340 Icelandic trios, and log-scaled χ2-value when testing for Hardy–Weinberg equilibrium (HWE) in 4289 Icelandic non-child genomes. Dashed lines mark the expected TR of 50% and the significance threshold (P = .001). We used default filters. (B) Sum-of-squares of the difference in genotype fractions between the Icelandic data and the three continental groups in the Polaris diversity cohort.
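
The trio statistics in panel (A) can be illustrated with a short Python sketch. It assumes that TR denotes the transmission rate and MIER the Mendelian inheritance error rate, and it uses one common set of definitions that may differ in detail from the paper's: genotypes are coded as the number of variant alleles (0, 1, 2), MIER is the fraction of trios whose child genotype is impossible given the parents, and TR is measured only in trios with one heterozygous and one homozygous-reference parent.

```python
def is_mendelian_consistent(father, mother, child):
    """True if the child genotype can arise from one allele of each parent."""
    gametes = {0: {0}, 1: {0, 1}, 2: {1}}
    return any(a + b == child for a in gametes[father] for b in gametes[mother])


def mendelian_error_rate(trios):
    """Fraction of (father, mother, child) trios that violate Mendelian inheritance."""
    trios = list(trios)
    if not trios:
        raise ValueError("no trios given")
    errors = sum(not is_mendelian_consistent(f, m, c) for f, m, c in trios)
    return errors / len(trios)


def transmission_rate(trios):
    """Fraction of informative trios in which the variant allele was transmitted.

    Informative here means one parent is heterozygous and the other is
    homozygous reference; inconsistent children are skipped. Returns None
    if no trio is informative.
    """
    informative = 0
    transmitted = 0
    for father, mother, child in trios:
        if sorted((father, mother)) != [0, 1]:
            continue
        if child not in (0, 1):
            continue  # Mendelian error, counted by mendelian_error_rate instead
        informative += 1
        transmitted += child
    if informative == 0:
        return None
    return transmitted / informative


# Made-up genotype calls for six trios.
example_trios = [(0, 1, 1), (1, 0, 0), (0, 0, 1), (2, 2, 2), (0, 1, 0), (1, 0, 1)]
print(mendelian_error_rate(example_trios), transmission_rate(example_trios))
```

Under these definitions, a well-behaved genotyper should yield a TR close to 0.5 and a MIER close to 0.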
Figure 5.
Status of HWE for 20 complex SVs in the Polaris diversity cohort. Point size indicates log-scaled χ2-value; colour indicates whether HWE could be rejected at P = .001. BayesTyper called four variants as homozygous reference in all samples (black crosses) and removed V7 when disabling all filters with BayesTyperTools filter.
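
A χ2 test for Hardy-Weinberg equilibrium on genotype counts can be sketched as below. This is the standard textbook goodness-of-fit test with one degree of freedom, given as a plausible reading of the caption rather than the authors' exact procedure; the function name and the example counts are hypothetical, and only the threshold P = .001 is taken from the caption.

```python
import math


def hwe_chi_square(n_ref_ref, n_ref_var, n_var_var):
    """Chi-square statistic and P-value (1 degree of freedom) for HWE.

    Returns (0.0, 1.0) for monomorphic sites, where the test is undefined.
    """
    n = n_ref_ref + n_ref_var + n_var_var
    if n == 0:
        raise ValueError("no genotypes given")
    p = (2 * n_var_var + n_ref_var) / (2 * n)  # variant allele frequency
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 0.0, 1.0
    expected = (n * q * q, 2 * n * p * q, n * p * p)
    observed = (n_ref_ref, n_ref_var, n_var_var)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Made-up counts: the first set matches HWE exactly, the second lacks heterozygotes.
chi2_ok, p_ok = hwe_chi_square(360, 480, 160)
chi2_bad, p_bad = hwe_chi_square(500, 0, 500)
print(p_ok >= 0.001, p_bad < 0.001)
```

A complete absence of heterozygotes at intermediate allele frequency, as in the second example, is the kind of pattern that systematic genotyping errors tend to produce.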
Figure 6.
TR across 20 complex SVs in both Icelandic and Polaris data for GGTyper, Paragraph with graph or VCF input, and BayesTyper, filtered by decreasing genotype quality or certainty. Only 15 variants contribute to BayesTyper’s TR.
