New alignment-based sequence extraction software (ALiBaSeq) and its utility for deep level phylogenetics

PeerJ. 2021 Mar 31:9:e11019. doi: 10.7717/peerj.11019. eCollection 2021.

Alexander Knyshov¹, Eric R L Gordon², Christiane Weirauch¹

Affiliations

¹ Department of Entomology, University of California, Riverside, Riverside, CA, USA.
² Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, CT, USA.

PMID: 33850647
PMCID: PMC8019319
DOI: 10.7717/peerj.11019

Alexander Knyshov et al. PeerJ. 2021.

Abstract

Despite many bioinformatic solutions for analyzing sequencing data, few options exist for targeted sequence retrieval from whole genomic sequencing (WGS) data with the ultimate goal of generating a phylogeny. Available tools especially struggle at deep phylogenetic levels and necessitate amino-acid space searches, which may increase rates of false positive results. Many tools are also difficult to install and may lack adequate user resources. Here, we describe a program that uses freely available similarity search tools to find homologs in assembled WGS data with unparalleled freedom to modify parameters. We evaluate its performance compared to other commonly used bioinformatics tools on two divergent insect species (>200 My) for which annotated genomes exist, and on one large set each of highly conserved and more variable loci. Our software is capable of retrieving orthologs from well-curated or unannotated, low or high depth shotgun, and target capture assemblies as well as or better than other software, as assessed by recovering the most genes with maximal coverage and with a low rate of false positives throughout all datasets. When assessing this combination of criteria, ALiBaSeq is frequently the best evaluated tool for gathering the most comprehensive and accurate phylogenetic alignments on all types of data tested. The software (implemented in Python), tutorials, and manual are freely available at https://github.com/AlexKnyshov/alibaseq.

Keywords: Alignment; BLAST; HMMER; OrthoDB; Orthology; Phylogenomics; UCE.
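
The abstract describes finding homologs by running similarity search tools against an assembly. As a rough illustration only, the sketch below parses standard 12-column BLAST tabular output and picks, for each bait, the contig with the highest summed bit score. The `HSP` record, the function names, the example hits, and the summed-bit-score rule are assumptions made for this example; they are not taken from ALiBaSeq, whose actual scoring and contig-joining logic is documented in its manual.

```python
from collections import defaultdict
from typing import NamedTuple


class HSP(NamedTuple):
    """One high-scoring segment pair from a BLAST tabular line (hypothetical record)."""
    bait: str        # query id (bait locus)
    contig: str      # subject id (assembly contig)
    identity: float  # percent identity
    q_start: int     # start of the match on the bait, 1-based
    q_end: int       # end of the match on the bait, 1-based inclusive
    bitscore: float


def parse_tabular(lines):
    """Parse 12-column BLAST tabular lines, skipping blanks and comments."""
    hsps = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        q_start, q_end = sorted((int(f[6]), int(f[7])))
        hsps.append(HSP(f[0], f[1], float(f[2]), q_start, q_end, float(f[11])))
    return hsps


def best_contig_per_bait(hsps):
    """Sum bit scores per (bait, contig) and keep the top contig for each bait.

    Summing bit scores is a simplification chosen for this sketch.
    """
    totals = defaultdict(float)
    for h in hsps:
        totals[(h.bait, h.contig)] += h.bitscore
    best = {}
    for (bait, contig), score in totals.items():
        if bait not in best or score > best[bait][1]:
            best[bait] = (contig, score)
    return best


# Made-up hits: bait B1 matches contig C1 twice and contig C2 once.
example = [
    "B1\tC1\t92.0\t100\t8\t0\t1\t100\t500\t599\t1e-40\t150.0",
    "B1\tC1\t90.0\t80\t8\t0\t120\t199\t700\t779\t1e-30\t110.0",
    "B1\tC2\t85.0\t100\t15\t0\t1\t100\t50\t149\t1e-20\t120.0",
]
hsps = parse_tabular(example)
best = best_contig_per_bait(hsps)
```

In this toy input, the two hits on C1 together outscore the single hit on C2, so C1 is kept for B1. A real workflow would also have to separate orthologs from paralogs and handle loci split across contigs, which this sketch does not attempt.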

PubMed Disclaimer

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Figure 1. Software workflow.**
The workflow is illustrated by a hypothetical example of a bait for which both an ortholog and several paralogs are present in the sample. For terminology see Table 1. White boxes show contig boundaries, colored boxes represent similarity regions (shown both for the bait and the sample), shading denotes different paralogous genes. Abbreviations are as follows: B1, bait sequence for the locus 1; C1-3, assembly contigs 1-3; PC1-2, pseudocontigs 1-2; SC1-3, supercontigs 1-3; HSP, high-scoring segment pair.

**Figure 2. Performance on the UCE dataset.**
(A) Number of loci with a high sequence identity to the *R. prolixus* bait region sequence. (B) Number of loci with a low sequence identity to the *R. prolixus* bait region sequence. (C) The average percentage of the bait region recovered. Vertical panels refer to different datasets (see text for details). Abbreviations are as follows: bln, blastn; blnWC9, blastn with word length 9 bp; dcmgbl, discontinuous megablast; tbln, tblastn; tblx, tblastx; nhmm, nhmmer; phmm, phmmer; phyluceC, capture pipeline of phyluce; assexonR, read-based version of Assexon.

**Figure 3. Speed of the processing of the UCE dataset.**
(A) The search and sequence retrieval time (including assembly in the case of the read-based tools). (B) The assembly stage time for the read-based tools or the database creation time for aTRAM. (C) The total time elapsed. The x-axis is log-transformed.

**Figure 4. Performance on the ODB SCO dataset.**
(A) Number of loci with a high sequence identity to the *R. prolixus* bait region sequence. (B) Number of loci with a low sequence identity to the *R. prolixus* bait region sequence. (C) The average percentage of the bait region recovered. Vertical panels refer to different datasets (see text for details). Abbreviations are as follows: bln, blastn; blnWC9, blastn with word length 9 bp; dcmgbl, discontinuous megablast; tbln, tblastn; tblx, tblastx; nhmm, nhmmer; phmm, phmmer.

**Figure 5. Speed of the processing of the ODB SCO dataset.**
(A) The search and sequence retrieval time (including assembly in the case of the read-based tools). (B) The assembly stage time for the read-based tools or the database creation time for aTRAM. (C) The total time elapsed. The x-axis is log-transformed.

**Figure 6. Relationship between the number of introns in a locus and its coverage in the software output.**
Points represent correct loci found by each program using the 40x ODB SCO dataset. (A) ALiBaSeq with blastn. (B) ALiBaSeq with blastn with word length 9 bp. (C) ALiBaSeq with discontinuous megablast. (D) ALiBaSeq with tblastn. (E) ALiBaSeq with tblastx. (F) ALiBaSeq with nhmmer. (G) ALiBaSeq with phmmer. (H) Assexon. (I) FortyTwo. (J) phyluce. (K) Nucleotide-based aTRAM. (L) Protein-based aTRAM. (M) Protein-based HybPiper. (N) kollector.

**Figure 7. Relationship between the number of contigs that a locus is represented by in the assembly and its coverage in the software output.**
Only assembly-based programs are shown. Points represent correct loci found by each program using the 40x ODB SCO dataset. (A) ALiBaSeq with blastn. (B) ALiBaSeq with blastn with word length 9 bp. (C) ALiBaSeq with discontinuous megablast. (D) ALiBaSeq with tblastn. (E) ALiBaSeq with tblastx. (F) ALiBaSeq with nhmmer. (G) ALiBaSeq with phmmer. (H) Assexon. (I) FortyTwo. (J) phyluce.
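
Panel C of Figures 2 and 4 reports the average percentage of the bait region recovered. One plausible way to compute such a percentage for a single locus is to merge the bait-coordinate intervals of all matching segments and divide the covered length by the bait length. The sketch below does this under stated assumptions: intervals are 1-based and inclusive, and the function names and example numbers are hypothetical. It is not necessarily how the authors computed the values in the figures.

```python
def merged_length(intervals):
    """Total bait positions covered by 1-based inclusive intervals.

    Overlapping or directly adjacent intervals are merged so that each
    position is counted once.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end + 1:
            # Close the previous merged block, if any, and start a new one.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Interval overlaps or touches the current block: extend it.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return covered


def percent_recovered(intervals, bait_length):
    """Percentage of a bait of the given length covered by the intervals."""
    if bait_length <= 0:
        raise ValueError("bait_length must be positive")
    return 100.0 * merged_length(intervals) / bait_length


# Made-up example: three matches on a 300 bp bait, two of them overlapping.
matches = [(1, 100), (81, 150), (201, 250)]
covered = merged_length(matches)
percent = percent_recovered(matches, 300)
```

Here the first two matches merge into positions 1 to 150 and the third adds 50 more, so 200 of 300 bait positions are covered. Averaging this per-locus value over all recovered loci would give a summary comparable in spirit to panel C.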

Cited by

Chromosome-Aware Phylogenomics of Assassin Bugs (Hemiptera: Reduvioidea) Elucidates Ancient Gene Conflict.
Knyshov A, Gordon ERL, Masonick PK, Castillo S, Forero D, Hoey-Chamberlain R, Hwang WS, Johnson KP, Lemmon AR, Moriarty Lemmon E, Standring S, Zhang J, Weirauch C. Mol Biol Evol. 2023 Aug 3;40(8):msad168. doi: 10.1093/molbev/msad168. PMID: 37494292. Free PMC article.
Genomic Approaches to Uncovering the Coevolutionary History of Parasitic Lice.
Johnson KP. Life (Basel). 2022 Sep 16;12(9):1442. doi: 10.3390/life12091442. PMID: 36143478. Free PMC article. Review.
Low hybridization temperatures improve target capture success of invertebrate loci: a case study of leaf-footed bugs (Hemiptera: Coreoidea).
Forthman M, Gordon ERL, Kimball RT. R Soc Open Sci. 2023 Jun 28;10(6):230307. doi: 10.1098/rsos.230307. eCollection 2023 Jun. PMID: 37388308. Free PMC article.
Patchwork: Alignment-Based Retrieval and Concatenation of Phylogenetic Markers from Genomic Data.
Thalén F, Köhne CG, Bleidorn C. Genome Biol Evol. 2023 Dec 1;15(12):evad227. doi: 10.1093/gbe/evad227. PMID: 38085033. Free PMC article.
Mitonuclear compatibility is maintained despite relaxed selection on male mitochondrial DNA in bivalves with doubly uniparental inheritance.
Smith CH, Mejia-Trujillo R, Havird JC. Evolution. 2024 Oct 28;78(11):1790-1803. doi: 10.1093/evolut/qpae108. PMID: 38995057.

Grants and funding

This work was supported by the National Science Foundation (grant number 1655769 to Christiane Weirauch). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
