Database indexing for production MegaBLAST searches
- PMID: 18567917
- PMCID: PMC2696921
- DOI: 10.1093/bioinformatics/btn322
Erratum in
- Bioinformatics. 2008 Dec 15;24(24):2942
Abstract
Motivation: The BLAST software package for sequence comparison speeds up homology search by preprocessing a query sequence into a lookup table. Numerous research studies have suggested that preprocessing the database instead would give better performance. However, production usage of sequence comparison methods that preprocess the database has been limited to programs such as BLAT and SSAHA that are designed to find matches when query and database subsequences are highly similar.
Results: We developed a new version of the MegaBLAST module of BLAST that does the initial phase of finding short seeds for matches by searching a database index. We also developed a program makembindex that preprocesses the database into a data structure for rapid seed searching. We show that the new 'indexed MegaBLAST' is faster than the 'non-indexed' version for most practical uses. We show that indexed MegaBLAST is faster than miBLAST, another implementation of BLAST nucleotide searching with a preprocessed database, for most of the 200 queries we tested. To deploy indexed MegaBLAST as part of NCBI's Web BLAST service, the storage of databases and the queueing mechanism were modified, so that some machines are now dedicated to serving queries for a specific database. The response time for such Web queries is now faster than it was when each computer handled queries for multiple databases.
Availability: The code for indexed MegaBLAST is part of the blastn program in the NCBI C++ toolkit. The preprocessor program makembindex is also in the toolkit. Indexed MegaBLAST has been used in production on NCBI's Web BLAST service to search one version of the human and mouse genomes since October 2007. The Linux command-line executables for blastn and makembindex, documentation, and some query sets used to carry out the tests described below are available in the directory: ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/indexed_megablast [corrected]
Supplementary information: Supplementary data are available at Bioinformatics online.
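Illustrative sketch (not from the paper): the idea of finding seeds by searching a preprocessed database, rather than a lookup table built from the query, can be shown with a toy k-mer index. The code below is a minimal Python sketch under stated assumptions; it does not reproduce the data structure that makembindex builds or the MegaBLAST seed logic. The function names `build_index` and `find_seeds`, the toy sequences, and the word size k=8 are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_index(db, k):
    """Preprocess the database once: map each k-mer to its (seq_id, offset) hits.

    Toy stand-in for a database index; assumes plain strings held in memory.
    """
    index = defaultdict(list)
    for seq_id, seq in db.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((seq_id, i))
    return index


def find_seeds(index, query, k):
    """Return (query_offset, seq_id, db_offset) for every exact k-mer match."""
    seeds = []
    for q in range(len(query) - k + 1):
        # .get avoids inserting empty entries into the defaultdict
        for seq_id, d in index.get(query[q:q + k], ()):
            seeds.append((q, seq_id, d))
    return seeds


# Hypothetical toy database and query
db = {"chr_toy": "ACGTACGTTTGACCA", "contig_toy": "GGGACGTTTGA"}
index = build_index(db, k=8)          # done once, reused for many queries
seeds = find_seeds(index, "TTACGTTTGACC", k=8)
```

The index is built once and reused across queries, so the per-query cost depends on the query length and the number of hits rather than on a scan of the whole database. A real search would then extend each seed into an alignment, which this sketch omits.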