Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype

Nat Biotechnol. 2019 Aug;37(8):907-915. doi: 10.1038/s41587-019-0201-4. Epub 2019 Aug 2.

Daehwan Kim¹, Joseph M Paggi², Chanhee Park³, Christopher Bennett³, Steven L Salzberg⁴ ⁵

Affiliations

¹ Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, TX, USA. daehwan.kim@utsouthwestern.edu.
² Department of Computer Science, Stanford University, Stanford, CA, USA.
³ Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, TX, USA.
⁴ Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, School of Medicine, Johns Hopkins University, Baltimore, MD, USA.
⁵ Departments of Biomedical Engineering, Computer Science, and Biostatistics, Johns Hopkins University, Baltimore, MD, USA.

PMID: 31375807
PMCID: PMC7605509
DOI: 10.1038/s41587-019-0201-4

Daehwan Kim et al. Nat Biotechnol. 2019 Aug.

Abstract

The human reference genome represents only a small number of individuals, which limits its usefulness for genotyping. We present a method named HISAT2 (hierarchical indexing for spliced alignment of transcripts 2) that can align both DNA and RNA sequences using a graph Ferragina Manzini index. We use HISAT2 to represent and search an expanded model of the human reference genome in which over 14.5 million genomic variants in combination with haplotypes are incorporated into the data structure used for searching and alignment. We benchmark HISAT2 using simulated and real datasets to demonstrate that our strategy of representing a population of genomes, together with a fast, memory-efficient search algorithm, provides more detailed and accurate variant analyses than other methods. We apply HISAT2 for HLA typing and DNA fingerprinting; both applications form part of the HISAT-genotype software that enables analysis of haplotype-resolved genes or genomic regions. HISAT-genotype outperforms other computational methods and matches or exceeds the performance of laboratory-based assays.
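The FM index named in the abstract can be illustrated on a plain linear string. The sketch below is not HISAT2's graph FM index: it builds an ordinary FM index over a short text and runs backward search to locate a pattern. The function names `build_fm_index` and `locate` are hypothetical, and the naive suffix sorting is only suitable for toy inputs.

```python
from typing import Dict, List, Tuple


def build_fm_index(text: str) -> Tuple[Dict[str, int], Dict[str, List[int]], List[int]]:
    """Build a toy FM index (linear text only, not a graph)."""
    text += "$"  # sentinel that sorts before A, C, G, T
    # Naive suffix array: fine for a toy example, too slow for genomes.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # first[c] = number of characters in the text strictly smaller than c.
    first: Dict[str, int] = {}
    total = 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ: Dict[str, List[int]] = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return first, occ, sa


def locate(pattern: str, first: Dict[str, int],
           occ: Dict[str, List[int]], sa: List[int]) -> List[int]:
    """Backward search: return sorted start positions of pattern in the text."""
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


first, occ, sa = build_fm_index("GAGCTG")
hits_g = locate("G", first, occ, sa)
hits_ag = locate("AG", first, occ, sa)
hits_tt = locate("TT", first, occ, sa)
```

Each pattern character narrows a range of the suffix array, so the search cost depends on the pattern length rather than the text length.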

Figures

**Figure 1.**
Graph representation of indels and mutations and its tabular representation. Starting with a 6-bp reference sequence, GAGCTG (a), the lower graph (b) incorporates three variants: a single nucleotide variant (A/T), a 1-bp deletion (T), and a 1-bp insertion (A). The prefix-sorted version of this graph (c) has 11 nodes and 14 edges. Each node has a unique numerical node ID, shown in blue, that indicates its lexicographical order (1 being the first) with respect to the other nodes in the graph. The node labeled ‘Z’ demarcates the end of the reference sequence. The table on the right (d) has two columns under Outgoing edge(s) that show the node IDs and their labels, repeated according to the number of outgoing edges (e.g., node 3, labeled C, is repeated three times because it has three outgoing edges, to nodes 7, 8, and 10). The table also has two columns under Incoming edge(s) that show the node IDs and the 14 labels of the preceding nodes (e.g., G is the preceding label for node 1; A and T are the preceding labels for node 5). The table is more compact in memory usage than the graph representation.
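As a rough illustration of the variant graph in panel (b), the sketch below encodes the reference GAGCTG plus the three variants as an adjacency list and enumerates the sequences the graph spells. It models the plain variant graph, not the prefix-sorted graph or its table. The node numbering is arbitrary, and the placement of the inserted A between the T and the final G is an assumption inferred from the caption rather than stated in it.

```python
from typing import Dict, List, Tuple

# node id -> (base label, successor node ids); ids are arbitrary, not the
# lexicographic node IDs used in the prefix-sorted graph of panel (c).
VARIANT_GRAPH: Dict[int, Tuple[str, List[int]]] = {
    0: ("G", [1, 2]),
    1: ("A", [3]),      # reference base
    2: ("T", [3]),      # SNV alternative (A/T)
    3: ("G", [4]),
    4: ("C", [5, 7]),   # edge 4 -> 7 skips the T (1-bp deletion)
    5: ("T", [6, 7]),   # edge 5 -> 6 enters the inserted base
    6: ("A", [7]),      # 1-bp insertion (assumed position)
    7: ("G", []),       # last reference base
}


def spell_paths(graph: Dict[int, Tuple[str, List[int]]], node: int = 0) -> List[str]:
    """Return every sequence spelled by a path from `node` to a sink."""
    label, successors = graph[node]
    if not successors:
        return [label]
    return [label + rest for nxt in successors for rest in spell_paths(graph, nxt)]


sequences = spell_paths(VARIANT_GRAPH)
```

Under these assumptions the graph spells six sequences: the reference itself and every combination of the SNV with the deletion or the insertion.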

**Figure 2.**
Overview of HISAT2’s indexes and alignment output. (a) Hierarchical indexing in the hierarchical graph FM index (HGFM). Hierarchical indexing consists of two types of indexes: (1) a global index that represents the entire human genome and (2) 55,172 overlapping local indexes that collectively cover the genome plus all variants. When both are graph FM indexes, a genome plus a large collection of variants can be searched simultaneously. (b) A repeat index represents genomic sequences that are identical. (c) A read matching repeat sequences (e.g., *Read3* and *Read4*) is aligned to just one location (the repeat sequence). (d) The corresponding genomic locations of repeat-aligned reads are retrieved via APIs.
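The idea behind panels (b) to (d), aligning a read once to a collapsed repeat and expanding to genomic coordinates afterwards, can be sketched with a dictionary. This is a simplified stand-in, not HISAT2's repeat index or its API: it treats exact duplicate k-mers as the repeats, and the names `build_repeat_index` and `expand_to_genome` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def build_repeat_index(genome: str, k: int) -> Dict[str, List[int]]:
    """Map each k-mer that occurs more than once to all of its start positions."""
    positions: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(genome) - k + 1):
        positions[genome[i:i + k]].append(i)
    # Keep only sequences that are truly repeated.
    return {kmer: locs for kmer, locs in positions.items() if len(locs) > 1}


def expand_to_genome(repeat_index: Dict[str, List[int]], read: str) -> List[int]:
    """Look a read up once in the repeat index, then list its genomic locations."""
    return repeat_index.get(read, [])


toy_genome = "ACGTTTACGTAA"
repeat_index = build_repeat_index(toy_genome, k=4)
locations = expand_to_genome(repeat_index, "ACGT")
```

The read is matched against a single entry, and the list of genomic positions is only materialized when it is requested.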

**Figure 3.**
Construction of the Graph Human Reference, i.e., a *Genotype* Genome. The figure illustrates how HISAT-genotype extends the human reference genome (GRCh38) by incorporating known genomic variants from several well-studied genes, DNA fingerprinting loci, and common small variants (i.e., variants with minor allele frequencies of ≥1%) from the dbSNP database. In a, the process begins with analyzing information found in the selected databases to construct consensus sequences. The IMGT/HLA database includes over 15,500 allele sequences for 26 HLA genes. A consensus sequence for each HLA gene is constructed based on the most frequent bases that occur in each position of the multiple sequence alignments. The NIST STRBase database contains allele sequences for 13 DNA fingerprinting loci. Because the sequences of the 13 loci are short tandem repeats, HISAT-genotype chooses the longest allele for each locus as a consensus sequence. In b, the human reference is extended by replacing the HLA genes and 13 DNA fingerprinting loci with their consensus sequences. In c, the known genomic variants are then incorporated into the extended references using HISAT2’s graph data structure. Common small variants from dbSNP, such as single nucleotide polymorphisms, deletions, and insertions, are also incorporated into the extended reference. In HISAT-genotype this graph reference is called a *Genotype* genome.
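The two consensus rules described in panel a can be sketched in a few lines. This is a minimal illustration under stated assumptions, not HISAT-genotype's code: the input alleles are assumed to be already aligned to equal length, gap characters are treated like any other symbol, ties are not handled specially, and the function names are hypothetical.

```python
from collections import Counter
from typing import List


def msa_consensus(aligned_alleles: List[str]) -> str:
    """Most frequent symbol per column of a multiple sequence alignment."""
    if len({len(seq) for seq in aligned_alleles}) != 1:
        raise ValueError("aligned alleles must all have the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_alleles)
    )


def str_consensus(alleles: List[str]) -> str:
    """For a short tandem repeat locus, pick the longest allele."""
    return max(alleles, key=len)


hla_like = msa_consensus(["ACGT", "ACGA", "TCGA"])
str_like = str_consensus(["GATA" * 3, "GATA" * 5, "GATA" * 4])
```

Choosing the longest STR allele means every shorter allele can be expressed as a deletion relative to the consensus, which fits the graph's variant encoding.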

**Figure 4.**
HISAT-genotype’s assembly of two HLA-A alleles through a guided k-mer assembly graph. The figure shows an abridged example of HISAT-genotype’s assembly output; see Supplementary File 1 for the full assembly output for NA12878. The first two bands are two alleles predicted by HISAT-genotype, in this case A*01:01:01:01 in dark green and A*11:01:01:01 in dark yellow. Each blue stripe indicates where there is a specific genomic variant with respect to the consensus sequence of the HLA-A gene. (a) Shorter bands indicate read alignments, whose color is determined by their degree of compatibility with either of the initially predicted alleles. Reads equally compatible with both alleles are shown in white. Some reads can be locally aligned, i.e., aligned to virtually the same location with just different variants, such as when reads are aligned with or without deletions near their ends; these are displayed in gray. (b) Since the two predicted (in fact true/known) alleles share a large common sequence, read pair information is insufficient to fully separate the alleles. HISAT-genotype splits aligned reads into fixed-length k-mers. In this simplified case, reads are 5 nucleotides long and k is 3. A pair of reads is aligned at the 3rd location and the 10th location of the graph representation for the HLA gene, respectively. When reads have divergent k-mers, the graph has a corresponding number of branches. One path traversing the graph from left to right constitutes one potential allele sequence. We call this a guided k-mer assembly graph, with *guided* emphasizing that k-mers are placed according to their aligned locations. The algorithmic details are given in the main text. (c) In addition, HISAT-genotype uses the predicted alleles to enable full-length assembly of both.
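The k-mer placement described in panel (b) can be sketched as follows. This is a toy illustration, not HISAT-genotype's assembler: it places each k-mer at its aligned position, links k-mers at consecutive positions when they overlap by k-1 bases, and enumerates every left-to-right path without the allele-separation steps described in the main text. The reads, positions, and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def place_kmers(aligned_reads: List[Tuple[int, str]], k: int) -> Dict[int, Set[str]]:
    """Split each (start position, sequence) read into k-mers keyed by position."""
    by_pos: Dict[int, Set[str]] = defaultdict(set)
    for start, seq in aligned_reads:
        for offset in range(len(seq) - k + 1):
            by_pos[start + offset].add(seq[offset:offset + k])
    return dict(by_pos)


def spell_paths(by_pos: Dict[int, Set[str]], k: int) -> List[str]:
    """Enumerate sequences spelled by left-to-right paths through the k-mers."""
    if k < 2:
        raise ValueError("k must be at least 2")
    if not by_pos:
        return []
    positions = sorted(by_pos)
    if positions != list(range(positions[0], positions[-1] + 1)):
        raise ValueError("k-mer positions must be contiguous in this sketch")
    paths = sorted(by_pos[positions[0]])
    for pos in positions[1:]:
        # Extend a path only when its last k-1 bases match the k-mer's prefix.
        paths = [
            path + kmer[-1]
            for path in paths
            for kmer in sorted(by_pos[pos])
            if path[-(k - 1):] == kmer[:-1]
        ]
    return paths


# Hypothetical 5-nt reads from two alleles that differ at one base; k = 3.
reads = [(0, "ACGTA"), (1, "CGTAC"), (0, "ACCTA"), (1, "CCTAC")]
kmers = place_kmers(reads, k=3)
alleles = spell_paths(kmers, k=3)
```

Divergent k-mers at the same position create a branch, and each complete path corresponds to one candidate allele sequence.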

**Figure 5.**
Comparisons of HISAT2, Bowtie2, BWA-mem, and VG using 10 million simulated read pairs that include SNPs. Alignment sensitivity is defined as the number of correctly aligned read pairs divided by the total number of read pairs. C: alignment sensitivity calculated based on any one of multiple alignments being correct. UC: alignment sensitivity calculated based on pairs being uniquely aligned. SC: alignment sensitivity similar to C, but calculated only for pairs with at least one read that includes one or more SNPs. SUC: alignment sensitivity similar to UC, but calculated only for pairs with at least one read that includes one or more SNPs. PPS: number of pairs processed per second. The suffixes that follow program names have these meanings: D for default alignment settings, S for sensitive alignment settings, L for linear genome alignment, and G for graph genome alignment. We ran the programs on the same computer, as described in Supplementary Table 7.
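The four sensitivity measures can be computed from per-pair outcomes as sketched below. The `PairResult` record and its fields are hypothetical, and two readings of the caption are assumed: UC counts pairs that are both uniquely and correctly aligned, and SC and SUC use the number of SNP-containing pairs as their denominator.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PairResult:
    any_correct: bool  # at least one reported alignment is correct
    unique: bool       # the pair has exactly one reported alignment
    has_snp: bool      # at least one read of the pair overlaps a SNP


def sensitivities(results: List[PairResult]) -> Dict[str, float]:
    """Compute C, UC, SC and SUC as fractions of the relevant pair totals."""

    def fraction(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else float("nan")

    snp_pairs = [r for r in results if r.has_snp]
    return {
        "C": fraction(sum(r.any_correct for r in results), len(results)),
        "UC": fraction(sum(r.any_correct and r.unique for r in results), len(results)),
        "SC": fraction(sum(r.any_correct for r in snp_pairs), len(snp_pairs)),
        "SUC": fraction(sum(r.any_correct and r.unique for r in snp_pairs), len(snp_pairs)),
    }


toy_results = [
    PairResult(any_correct=True, unique=True, has_snp=True),
    PairResult(any_correct=True, unique=False, has_snp=False),
    PairResult(any_correct=False, unique=False, has_snp=True),
    PairResult(any_correct=True, unique=True, has_snp=False),
]
metrics = sensitivities(toy_results)
```

PPS is a separate throughput figure: the number of pairs divided by the wall-clock seconds of the run.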

**Figure 6.**
A novel HLA-A allele identified with strong computational evidence. This figure shows an abridged example of HISAT-genotype's assembly output. At the top are the two initially predicted alleles, which are the best matches of the data to previously known HLA-A alleles. The green assembled allele at the bottom, which was generated *de novo* by HISAT-genotype’s assembler, differs by one variant from the predicted allele, A*24:02:01:01. Two reads shown in green support the variant. See Supplementary File 6 for more detailed output from a similar case found in LP6005093-DNA_E03 (a CAAPA genome) at the 2,780th base.

Cited by

Molecular cloning of PRD-like homeobox genes expressed in bovine oocytes and early IVF embryos.
Yaşar B, Boskovic N, Ivask M, Weltner J, Jouhilahti EM, Vill P, Skoog T, Jaakma Ü, Kere J, Bürglin TR, Katayama S, Org T, Kurg A. BMC Genomics. 2024 Nov 6;25(1):1048. doi: 10.1186/s12864-024-10969-w. PMID: 39506635
Chromosome-level genome assembly of the ivory shell Babylonia areolata.
Zou Y, Fu J, Liang Y, Luo X, Shen M, Huang M, Chen Y, You W, Ke C. Sci Data. 2024 Nov 6;11(1):1201. doi: 10.1038/s41597-024-04001-9. PMID: 39505919
Discovering a novel glycosyltransferase gene CmUGT1 enhances main metabolites production of Cordyceps militaris.
He RA, Huang C, Zheng CH, Wang J, Yuan SW, Chen BX, Feng K. Front Microbiol. 2024 Oct 22;15:1437963. doi: 10.3389/fmicb.2024.1437963. eCollection 2024. PMID: 39502416
HES1 revitalizes the functionality of aged adipose-derived stem cells by inhibiting the transcription of STAT1.
Li C, Ren S, Yan C, Wang C, Jiang T, Kang Y, Chen J, Xiong H, Guo J, Jiang G, Liu S, Nie P, Chen Z. Stem Cell Res Ther. 2024 Nov 5;15(1):399. doi: 10.1186/s13287-024-04002-w. PMID: 39501364
