Viruses. 2020 Jul 14;12(7):758.
doi: 10.3390/v12070758.

Validation of Variant Assembly Using HAPHPIPE with Next-Generation Sequence Data from Viruses

Keylie M Gibson et al. Viruses. 2020.

Abstract

Next-generation sequencing (NGS) offers a powerful opportunity to identify low-abundance, intra-host viral sequence variants, yet the focus of many bioinformatic tools on consensus sequence construction has precluded a thorough analysis of intra-host diversity. To take full advantage of the resolution of NGS data, we developed HAplotype PHylodynamics PIPEline (HAPHPIPE), an open-source tool for the de novo and reference-based assembly of viral NGS data, with both consensus sequence assembly and a focus on the quantification of intra-host variation through haplotype reconstruction. We validate and compare the consensus sequence assembly methods of HAPHPIPE to those of two alternative software packages, HyDRA and Geneious, using simulated HIV and empirical HIV, HCV, and SARS-CoV-2 datasets. Our validation methods included read mapping, genetic distance, and genetic diversity metrics. In simulated NGS data, HAPHPIPE generated pol consensus sequences significantly closer to the true consensus sequence than those produced by HyDRA and Geneious and performed comparably to Geneious for HIV gp120 sequences. Furthermore, using empirical data from multiple viruses, we demonstrate that HAPHPIPE can analyze larger sequence datasets due to its greater computational speed. Therefore, we contend that HAPHPIPE provides a more user-friendly platform for users with and without bioinformatics experience to implement current best practices for viral NGS assembly than other currently available options.

Keywords: HCV; HIV; SARS-CoV-2; bioinformatics; consensus; haplotypes; simulation; validation; viruses.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Methods overview. Sequencing reads were simulated for each simulation sample, while reads for the empirical data were retrieved from the NCBI SRA database. Reads for each sample were assembled and a consensus sequence was generated through the de novo pipelines for HAPHPIPE (HP01) and Geneious and the reference-based pipelines for HAPHPIPE (HP02), Geneious, and HyDRA. Only HIV samples were analyzed with HyDRA, because HyDRA is HIV-specific. All resulting consensus sequences were analyzed using a variety of metrics, including assembly statistics, genetic distance from the reference or true sequence, and diversity estimates such as nucleotide diversity.
Figure 2
Adjusted genetic p-distance (displayed as a difference from 1) between the consensus sequence and the true sequence for all pipelines for the simulated HIV (A) subtype B dataset and (B) non-subtype B dataset. Ambiguous nucleotides were accounted for by giving fractional weight in the alignment. A value closer to 1.00 indicates that the consensus sequence is more genetically similar to the true sequence. The x-axis order from left to right for an individual panel: adjusted genetic p-distance between the true sequence and (i) the initial assembled sequence followed by (ii) the final assembled sequence for the haphpipe_assemble_01 pipeline (de novo assembly); (iii) the initial assembled sequence followed by (iv) the final assembled sequence for the haphpipe_assemble_02 pipeline (reference-based assembly); the final consensus sequence for the Geneious (v) de novo workflow and (vi) reference-based workflow; and finally, (vii) the average of the final two sequences (one for each read file) for HyDRA. The three amplicons are shown, as well as a combination of the PRRT and int amplicons into pol. There are no results for HyDRA in the gp120 gene because HyDRA only analyzes the pol gene.
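The caption does not spell out how the fractional weight for ambiguous nucleotides was computed. As a rough illustration only, the sketch below assumes one plausible scheme: each IUPAC ambiguity code is treated as an equal mixture of the bases it represents, and a site contributes the probability that the two codes denote the same base. The function names `site_match` and `adjusted_similarity` are hypothetical and are not part of HAPHPIPE.

```python
# IUPAC nucleotide codes mapped to the bases they can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def site_match(a, b):
    """Probability that two codes denote the same base.

    Assumption: every base covered by an ambiguity code is equally likely.
    """
    bases_a, bases_b = set(IUPAC[a]), set(IUPAC[b])
    return len(bases_a & bases_b) / (len(bases_a) * len(bases_b))


def adjusted_similarity(seq1, seq2):
    """Return 1 - p-distance over aligned columns, skipping gap columns."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matched = 0.0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == "-" or b == "-":
            continue  # columns with a gap in either sequence are not compared
        compared += 1
        matched += site_match(a, b)
    if compared == 0:
        raise ValueError("no comparable sites")
    return matched / compared


if __name__ == "__main__":
    # One of four sites is R (A or G) against A, so it counts as half a match.
    print(adjusted_similarity("ACGR", "ACGA"))
```

Under this assumption, a value of 1.00 means the two sequences agree at every compared site, which matches how the panels are read. Other weighting schemes are possible and would give different values at ambiguous sites.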
Figure 3
Adjusted genetic p-distance (displayed as a difference from 1) between the consensus sequence and HXB2, the reference sequence for HIV, for all pipelines for the simulated HIV (A) subtype B dataset and (B) non-subtype B dataset. Ambiguous nucleotides were accounted for by giving fractional weight in the alignment. A value closer to 1.00 indicates that the consensus sequence is more genetically similar to the reference sequence. The x-axis order from left to right for an individual panel: adjusted genetic p-distance between the reference sequence and (i) the initial assembled sequence followed by (ii) the final assembled sequence for the haphpipe_assemble_01 pipeline (de novo assembly); (iii) the initial assembled sequence followed by (iv) the final assembled sequence for the haphpipe_assemble_02 pipeline (reference-based assembly); the final consensus sequence for the Geneious (v) de novo workflow and (vi) reference-based workflow; and finally, (vii) the average of the final two sequences (one for each read file) for HyDRA. The three amplicons are shown, as well as a combination of the PRRT and int amplicons into pol. There are no results for HyDRA in the gp120 gene because HyDRA only analyzes the pol gene.
Figure 4
Difference between the genetic diversity estimated from the true sequences and from each pipeline (calculated as the estimate from the true sequences minus the estimate from the pipeline consensus sequences) for the simulated HIV (A) subtype B dataset and (B) non-subtype B dataset. A positive value indicates that the pipeline consensus sequences underestimate the genetic diversity, and a negative value indicates that they overestimate it. PRRT = protease and reverse transcriptase, int = integrase, gp120 = gene within the envelope gene region, Pi = nucleotide diversity, Theta = Watterson's genetic diversity.
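For readers unfamiliar with the two diversity estimators named in this caption, the sketch below computes them from a small alignment using the standard textbook definitions: Pi as the mean proportion of differing sites over all sequence pairs, and Watterson's Theta as the number of segregating sites divided by the harmonic number a_n and the alignment length. It is an illustration under simplifying assumptions (gaps and ambiguity codes are treated as ordinary characters), not the software the authors used, and the function names are hypothetical.

```python
from itertools import combinations


def _check_alignment(seqs):
    """Require at least two sequences of identical, non-zero length."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    return length


def nucleotide_diversity(seqs):
    """Pi: mean pairwise proportion of differing sites (per site)."""
    length = _check_alignment(seqs)
    pairs = list(combinations(seqs, 2))
    total_diffs = sum(
        sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs
    )
    return total_diffs / (len(pairs) * length)


def watterson_theta(seqs):
    """Theta: segregating sites / (a_n * length), a_n = sum of 1/i for i < n."""
    length = _check_alignment(seqs)
    segregating = sum(len(set(column)) > 1 for column in zip(*seqs))
    a_n = sum(1.0 / i for i in range(1, len(seqs)))
    return segregating / (a_n * length)


if __name__ == "__main__":
    true_seqs = ["AAAA", "AAAT", "AATT"]      # toy "true" sequences
    pipeline_seqs = ["AAAA", "AAAA", "AAAT"]  # toy pipeline consensus sequences
    # Same sign convention as the figure: true estimate minus pipeline estimate.
    diff = nucleotide_diversity(true_seqs) - nucleotide_diversity(pipeline_seqs)
    print(diff)  # positive here, i.e. the toy pipeline underestimates Pi
```

With the toy data above, the pipeline sequences are less variable than the true sequences, so the difference is positive, which corresponds to underestimation in the figure's sign convention.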
Figure 5
Adjusted genetic p-distance (displayed as a difference from 1) between the consensus sequence and HXB2, the reference sequence for HIV, for all pipelines for the empirical (A) HIV dataset and (B) HCV dataset. Ambiguous nucleotides were accounted for by giving fractional weight in the alignment. A value closer to 1.00 indicates that the consensus sequence is more genetically similar to the reference sequence. The y-axes differ between HIV and HCV, with HCV showing greater variance between samples. The x-axis order from left to right for an individual panel: adjusted genetic p-distance between the reference sequence and (i) the initial assembled sequence, (ii) the final assembled sequence, and (iii) the reconstructed haplotypes for the haphpipe_assemble_01 pipeline (de novo assembly); (iv) the initial assembled sequence, (v) the final assembled sequence, and (vi) the reconstructed haplotypes for the haphpipe_assemble_02 pipeline (reference-based assembly); the final consensus sequence for the Geneious (vii) de novo workflow and (viii) reference-based workflow; and finally, (ix) the average of the final two sequences (one for each read file) for HyDRA. The three amplicons are shown for both empirical datasets (HIV: PRRT, int, gp120 and HCV: core, E1, E2). There are no results for HyDRA in the gp120 gene for HIV or for any HCV genes because HyDRA only analyzes the pol gene region of HIV.
