Nat Biotechnol. 2019 Aug;37(8):907-915.
doi: 10.1038/s41587-019-0201-4. Epub 2019 Aug 2.

Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype

Daehwan Kim et al. Nat Biotechnol. 2019 Aug.

Abstract

The human reference genome represents only a small number of individuals, which limits its usefulness for genotyping. We present a method named HISAT2 (hierarchical indexing for spliced alignment of transcripts 2) that can align both DNA and RNA sequences using a graph Ferragina Manzini index. We use HISAT2 to represent and search an expanded model of the human reference genome in which over 14.5 million genomic variants in combination with haplotypes are incorporated into the data structure used for searching and alignment. We benchmark HISAT2 using simulated and real datasets to demonstrate that our strategy of representing a population of genomes, together with a fast, memory-efficient search algorithm, provides more detailed and accurate variant analyses than other methods. We apply HISAT2 for HLA typing and DNA fingerprinting; both applications form part of the HISAT-genotype software that enables analysis of haplotype-resolved genes or genomic regions. HISAT-genotype outperforms other computational methods and matches or exceeds the performance of laboratory-based assays.

Figures

Figure 1.
Graph representation of indels and mutations and its tabular representation. Starting with a 6-bp reference sequence, GAGCTG (a), the lower graph (b) incorporates three variants: a single nucleotide variant (A/T), a 1-bp deletion (T), and a 1-bp insertion (A). The prefix-sorted version of this graph (c) has 11 nodes and 14 edges. Each node has a unique numerical node ID, shown in blue, that indicates its lexicographical order (1 being the first) with respect to the other nodes in the graph. The node labeled ‘Z’ demarcates the end of the reference sequence. The table on the right (d) has two columns under Outgoing edge(s) that show the node IDs and their labels, repeated according to the number of outgoing edges (e.g., node 3, labeled C, is repeated three times because it has 3 outgoing edges, to nodes 7, 8, and 10). The table also has two columns under Incoming edge(s) that show the node IDs and the 14 labels of the preceding nodes (e.g., G is the preceding label for node 1, and A and T for node 5). The table is more compact in memory than the graph representation.
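As a rough illustration of what the variant graph in panel (b) encodes, the sketch below builds a small labeled graph for GAGCTG and lists every sequence it can spell. The caption does not give the exact variant positions, so the layout here is an assumption: the SNV replaces the A at the second base, the deletion skips the T, and the A is inserted right after the C. This is the plain graph, not the prefix-sorted graph of panel (c), and it is not HISAT2's data structure.

```python
from typing import Dict, List

# Node labels. Nodes 0-5 spell the reference GAGCTG.
# Node 6 (T) is the assumed SNV alternative to node 1 (A).
# Node 7 (A) is the assumed 1-bp insertion placed after C.
LABELS: Dict[int, str] = {0: "G", 1: "A", 2: "G", 3: "C", 4: "T", 5: "G", 6: "T", 7: "A"}

# Directed edges. The edge 3 -> 5 skips node 4 and models the 1-bp deletion of T.
EDGES: Dict[int, List[int]] = {
    0: [1, 6],
    1: [2],
    6: [2],
    2: [3],
    3: [4, 5, 7],  # C has three outgoing edges: reference, deletion, insertion
    7: [4],
    4: [5],
    5: [],
}


def spell_paths(node: int, sink: int) -> List[str]:
    """Return every sequence spelled by a path from `node` to `sink`."""
    if node == sink:
        return [LABELS[node]]
    sequences = []
    for successor in EDGES[node]:
        for suffix in spell_paths(successor, sink):
            sequences.append(LABELS[node] + suffix)
    return sequences


HAPLOTYPES = sorted(spell_paths(0, 5))
for sequence in HAPLOTYPES:
    print(sequence)
```

Under these assumptions, eight nodes and ten edges encode six sequences, including the reference itself, which gives a sense of why a graph is a compact way to carry many variant combinations.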
Figure 2.
Overview of HISAT2’s indexes and alignment output. (a) Hierarchical indexing in the hierarchical graph FM index (HGFM). Hierarchical indexing consists of two types of indexes: (1) a global index that represents the entire human genome and (2) 55,172 overlapping local indexes that collectively cover the genome plus all variants. When both are graph FM indexes, a genome plus a large collection of variants can be searched simultaneously. (b) A repeat index represents genomic sequences that are identical. (c) A read matching repeat sequences (e.g., Read3 and Read4) is aligned to just one location (the repeat sequence). (d) The corresponding genomic locations of repeat-aligned reads are retrieved via APIs.
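The idea in panels (b) to (d), aligning a read once to a repeat and then expanding that hit to all genomic copies, can be sketched with a toy dictionary. Everything below is hypothetical: the function names, the fixed window length, and the toy genome are illustrative assumptions and do not reflect HISAT2's repeat index format or its actual API.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Location = Tuple[str, int]  # (chromosome name, 0-based position)


def build_repeat_index(genome: Dict[str, str], length: int) -> Dict[str, List[Location]]:
    """Group identical windows of a fixed length; keep those seen more than once."""
    windows: Dict[str, List[Location]] = defaultdict(list)
    for chrom, seq in genome.items():
        for start in range(len(seq) - length + 1):
            windows[seq[start:start + length]].append((chrom, start))
    return {seq: locs for seq, locs in windows.items() if len(locs) > 1}


def expand_repeat_alignment(index: Dict[str, List[Location]],
                            repeat_seq: str, offset: int) -> List[Location]:
    """Translate one alignment at `offset` within a repeat into genomic locations."""
    return [(chrom, start + offset) for chrom, start in index[repeat_seq]]


# Toy genome (made up): the 5-mer CGTAC occurs once on each chromosome.
GENOME = {"chr1": "AAACGTACCC", "chr2": "TTCGTACGG"}
REPEATS = build_repeat_index(GENOME, length=5)

# A read aligned at offset 1 inside the repeat maps to one location per copy.
HITS = expand_repeat_alignment(REPEATS, "CGTAC", offset=1)
print(REPEATS)
print(HITS)
```

The aligner in this sketch only ever searches the single repeat sequence; the per-copy coordinates are recovered afterwards by a lookup, which is the division of labor the caption describes.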
Figure 3.
Construction of the Graph Human Reference, i.e. a Genotype Genome. The figure illustrates how HISAT-genotype extends the human reference genome (GRCh38) by incorporating known genomic variants from several well-studied genes, DNA fingerprinting loci, and common small variants (i.e. variants with minor allele frequencies of ≥1%) from the dbSNP database. In a, the process begins with analyzing information found in the selected databases to construct consensus sequences. The IMGT/HLA database includes over 15,500 allele sequences for 26 HLA genes. A consensus sequence for each HLA gene is constructed from the most frequent base at each position of the multiple sequence alignment. The NIST STRBase database contains allele sequences for 13 DNA fingerprinting loci. Because the sequences of the 13 loci are short tandem repeats, HISAT-genotype chooses the longest allele for each locus as a consensus sequence. In b, the human reference is extended by replacing the HLA genes and 13 DNA fingerprinting loci with their consensus sequences. In c, the known genomic variants are then incorporated into the extended reference using HISAT2’s graph data structure. Common small variants from dbSNP, such as single nucleotide polymorphisms, deletions, and insertions, are also incorporated into the extended reference. In HISAT-genotype, this graph reference is called a Genotype genome.
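The two consensus rules described in panel a can be sketched in a few lines. This is a minimal illustration under stated assumptions, not HISAT-genotype's code: the function names are hypothetical, the input sequences are made up, gaps are written as "-" and dropped from the consensus when they are the most frequent symbol, and ties are broken by whichever symbol appears first in the column.

```python
from collections import Counter
from typing import List


def majority_consensus(aligned: List[str]) -> str:
    """Pick the most frequent symbol in each column of a multiple sequence alignment."""
    if len({len(row) for row in aligned}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*aligned):
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != "-":  # assumption: a majority gap contributes no base
            consensus.append(symbol)
    return "".join(consensus)


def longest_allele(alleles: List[str]) -> str:
    """For a short tandem repeat locus, use the longest known allele as consensus."""
    return max(alleles, key=len)


# Made-up alignment for a gene: column-wise majority vote.
GENE_CONSENSUS = majority_consensus(["ACGT", "ACGA", "TCGT"])

# Made-up STR alleles with 2, 1, and 3 copies of the GATA unit.
STR_CONSENSUS = longest_allele(["GATAGATA", "GATA", "GATAGATAGATA"])

print(GENE_CONSENSUS, STR_CONSENSUS)
```

One plausible reading of the longest-allele choice is that shorter repeat alleles can then be expressed as deletions relative to the consensus, though the caption itself does not state the rationale.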
Figure 4.
HISAT-genotype’s assembly of two HLA-A alleles through a guided k-mer assembly graph. The figure shows an abridged example of HISAT-genotype’s assembly output (see Supplementary File 1 for the full assembly output for NA12878). The first two bands are two alleles predicted by HISAT-genotype, in this case A*01:01:01:01 in dark green and A*11:01:01:01 in dark yellow. Each blue stripe indicates where there is a specific genomic variant with respect to the consensus sequence of the HLA-A gene. (a) Shorter bands indicate read alignments, whose color is determined by their degree of compatibility with either of the initially predicted alleles. Reads equally compatible with both alleles are shown in white. Some reads can be locally aligned, i.e. aligned to virtually the same location with just different variants, such as when reads are aligned with or without deletions near their ends; these are displayed in gray. (b) Since the two predicted (in fact true/known) alleles share a large common sequence, read pair information is insufficient to fully separate the alleles. HISAT-genotype splits aligned reads into fixed-length k-mers. In this simplified case, reads are 5 nucleotides long and k is 3. A pair of reads is aligned at the 3rd and 10th locations, respectively, of the graph representation for the HLA gene. When reads have divergent k-mers, the graph has a corresponding number of branches. One path traversing the graph from left to right constitutes one potential allele sequence. We call this a guided k-mer assembly graph, with "guided" emphasizing that k-mers are placed according to their aligned locations. The algorithmic details are given in the main text. (c) In addition, HISAT-genotype uses the predicted alleles to enable full-length assembly of both.
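The placement step in panel (b) can be sketched as follows, using the caption's simplified setting of 5-nucleotide reads and k = 3. This is only an illustration of placing k-mers by aligned position, not HISAT-genotype's assembler (the algorithmic details are in the main text); the function name, the read sequences, and the second read at position 3 are made up so that a branch appears.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Node = Tuple[int, str]  # (aligned position of the k-mer, k-mer sequence)


def place_kmers(aligned_reads: List[Tuple[int, str]], k: int):
    """Split each aligned read into k-mers and place them at their aligned positions.

    Returns the k-mers seen at each position and the edges that link
    consecutive k-mers coming from the same read.
    """
    kmers_at: Dict[int, Set[str]] = defaultdict(set)
    edges: Dict[Node, Set[Node]] = defaultdict(set)
    for start, seq in aligned_reads:
        previous = None
        for offset in range(len(seq) - k + 1):
            node = (start + offset, seq[offset:offset + k])
            kmers_at[node[0]].add(node[1])
            if previous is not None:
                edges[previous].add(node)
            previous = node
    return kmers_at, edges


# Hypothetical reads: two overlap at position 3 and differ by one base,
# and a third is aligned at position 10.
READS = [(3, "ACGTA"), (3, "ACCTA"), (10, "GGTCA")]
KMERS_AT, EDGES = place_kmers(READS, k=3)

# Positions holding more than one distinct k-mer are where the graph branches.
BRANCH_POSITIONS = sorted(pos for pos, kmers in KMERS_AT.items() if len(kmers) > 1)
print(BRANCH_POSITIONS)
```

Because every k-mer carries its aligned position, two reads that share a k-mer sequence at different locations are never merged, which is the sense in which the graph is guided.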
Figure 5.
Comparisons of HISAT2, Bowtie2, BWA-mem, and VG using 10 million simulated read pairs that include SNPs. Alignment sensitivity is defined as the number of correctly aligned read pairs divided by the total number of read pairs. C: alignment sensitivity calculated based on any one of multiple alignments being correct. UC: alignment sensitivity calculated based on pairs being uniquely aligned. SC: alignment sensitivity similar to C, but calculated only for pairs with at least one read that includes one or more SNPs. SUC: alignment sensitivity similar to UC, but calculated only for pairs with at least one read that includes one or more SNPs. PPS: number of pairs processed per second. The suffixes following the program names are as follows: D for default alignment settings, S for sensitive alignment settings, L for linear genome alignment, and G for graph genome alignment. We ran the programs on the same computer as described in Supplementary Table 7.
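The four sensitivity measures can be made concrete with a small sketch. The caption leaves some details open, so two readings are assumed here and should be checked against the paper: UC counts a pair only when exactly one alignment is reported and it is correct, and SC and SUC use the SNP-containing pairs as the denominator. The `PairResult` type and the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PairResult:
    true_position: int             # simulated origin of the read pair
    reported_positions: List[int]  # positions reported by the aligner
    has_snp: bool                  # at least one read in the pair covers a SNP


def sensitivity(pairs: List[PairResult], unique_only: bool = False,
                snp_only: bool = False) -> float:
    """Correctly aligned pairs divided by the number of pairs considered."""
    if snp_only:
        pairs = [p for p in pairs if p.has_snp]
    if not pairs:
        return 0.0
    correct = 0
    for pair in pairs:
        if unique_only and len(pair.reported_positions) != 1:
            continue  # assumption: multi-mapped or unaligned pairs do not count for UC
        if pair.true_position in pair.reported_positions:
            correct += 1
    return correct / len(pairs)


# Made-up results for four simulated pairs.
PAIRS = [
    PairResult(100, [100], has_snp=False),      # unique and correct
    PairResult(200, [200, 900], has_snp=True),  # correct among multiple alignments
    PairResult(300, [350], has_snp=True),       # unique but wrong
    PairResult(400, [], has_snp=False),         # unaligned
]

C = sensitivity(PAIRS)
UC = sensitivity(PAIRS, unique_only=True)
SC = sensitivity(PAIRS, snp_only=True)
SUC = sensitivity(PAIRS, unique_only=True, snp_only=True)
print(C, UC, SC, SUC)
```

With this toy data, C is 2/4, UC is 1/4, SC is 1/2, and SUC is 0/2, which shows how the unique-alignment requirement can only lower the score relative to its C or SC counterpart.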
Figure 6.
A novel HLA-A allele identified with strong computational evidence. This figure shows an abridged example of HISAT-genotype's assembly output. The two initially predicted alleles, which are the best matches of the data to previously known HLA-A alleles, are shown at the top. The green assembled allele at the bottom, which was generated de novo by HISAT-genotype’s assembler, differs by one variant from the predicted allele, A*24:02:01:01. Two reads shown in green support the variant. See Supplementary File 6 for more detailed output from a similar case found in LP6005093-DNA_E03 (a CAAPA genome) at the 2,780th base.
