NCBI.BLAST2DT is an R package that lets you submit DNA sequences to NCBI BLAST servers directly from the console, retrieve potential hits on a genome or sequence database, and collect all results in an R data.table.
It uses the R package hoardeR to submit sequences to the NCBI BLAST API, then parses the returned XML BLAST results and loads them into an R data.table, making it easier to query, sort, order and subset the resulting hits.
Author: PAGEAUD Y.1
1- DKFZ - Division of Applied Bioinformatics, Germany.
How to cite: Pageaud Y. et al., NCBI.BLAST2DT - Submit DNA sequences to NCBI BLAST and get results in an R data.table.
NCBI.BLAST2DT provides 2 types of functions:
- submit_NCBI_BLAST() and get.NCBI.BLAST2DT() to submit DNA sequences to NCBI and BLAST them against a sequence database. These functions take either DNA sequences as character strings, or GenBank accession IDs and the coordinates of the sequences to extract from them.
- NCBI_BLAST_XML2DT() and aggregate_NCBI_BLAST_XMLs2DT() to load, gather, and order all your BLAST results from NCBI submissions.
In R do:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c('bamsignals', 'Biostrings', 'GenomicRanges', 'GenomicTools.fileHandler', 'httr', 'IRanges', 'KernSmooth', 'knitr', 'R.utils', 'RCurl', 'rmarkdown', 'Rsamtools', 'S4Vectors', 'seqinr', 'stringr', 'XML'))
inst.pkgs = c('data.table', 'devtools', 'parallel', 'xml2')
install.packages(inst.pkgs)
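Before installing, it can help to see which of the listed packages are already present. The following is a minimal base R sketch; the helper name `missing_pkgs` is illustrative and not part of NCBI.BLAST2DT.

```r
# Illustrative helper (not part of NCBI.BLAST2DT): return the packages from
# `pkgs` that are not found in the current R library paths.
missing_pkgs <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Same CRAN package names as in the install command above.
cran_pkgs <- c("data.table", "devtools", "parallel", "xml2")
to_install <- missing_pkgs(cran_pkgs)

if (length(to_install) > 0) {
  message("Still to install: ", paste(to_install, collapse = ", "))
} else {
  message("All listed CRAN packages are already installed.")
}
```

The same helper can be applied to the Bioconductor package list to shorten the BiocManager::install() call.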
- In the Git repository click on "Clone or Download".
- Copy the HTTPS link.
- Open a terminal and type:
git clone https://github.com/YoannPa/NCBI.BLAST2DT.git
- Open the folder NCBI.BLAST2DT and open the "NCBI.BLAST2DT.Rproj" file in RStudio.
- In the RStudio console, type:
devtools::install()
For any questions not related to bugs or development, please check the section "Known Issues" below. If the issue you experience is not addressed in the known issues, you can write to me at y.pageaud@dkfz.de.
❎ submit_NCBI_BLAST() not responding
Sometimes submit_NCBI_BLAST() can stop responding, or crash, while waiting for a BLAST submission result from NCBI servers. If so:
- Check the log displayed in the console to identify the failing submission.
- Stop R execution.
- Manually delete the last result folder in the result directory (no XML file should be visible in it).
- Restart R.
- Run the same submit_NCBI_BLAST() command again: sequences for which results have already been generated are skipped automatically, and submission restarts from the last failed submission.
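The manual cleanup step above can be sketched in base R. This is only an illustration under assumptions: it supposes the result directory holds one sub-folder per submission and that finished results end in ".xml". The helper `find_empty_result_dirs` is hypothetical and not part of NCBI.BLAST2DT, and it lists every folder without an XML file, not only the last one.

```r
# Hypothetical helper (not part of NCBI.BLAST2DT): list the sub-folders of
# `res_dir` that contain no XML file, i.e. candidates for manual deletion.
find_empty_result_dirs <- function(res_dir) {
  sub_dirs <- list.dirs(res_dir, full.names = TRUE, recursive = FALSE)
  has_xml <- vapply(sub_dirs, function(d) {
    length(list.files(d, pattern = "\\.xml$", ignore.case = TRUE)) > 0
  }, logical(1))
  unname(sub_dirs[!has_xml])
}

# Demo on a throwaway directory: "seq1" has a result, "seq2" has none.
res_dir <- file.path(tempdir(), "blast_results_demo")
dir.create(file.path(res_dir, "seq1"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(res_dir, "seq2"), showWarnings = FALSE)
file.create(file.path(res_dir, "seq1", "result.xml"))

stale <- find_empty_result_dirs(res_dir)
print(basename(stale))

# After reviewing the list, the folders could be removed with:
# unlink(stale, recursive = TRUE)
```

Printing the candidates before deleting anything keeps the step as deliberate as the manual procedure.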
⚠ In min(which(seqInfo$seqRID == 0)) : no non-missing arguments to min; returning Inf
This warning can arise from submit_NCBI_BLAST() and get.NCBI.BLAST2DT() when NCBI BLAST terminates the request being processed. The NCBI BLAST server can terminate a request for several reasons. To find out which one applies, go to the web interface here and paste the run ID from your logs into the field "Request ID". An error message may then be displayed explaining why your request was terminated (e.g. CPU usage limit was exceeded. You may need to change your search strategy.[...])
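The warning text itself is what base R emits whenever min() receives an empty vector. The sketch below reproduces that behaviour with a made-up seqInfo table; the real content of seqInfo inside the package is not described here, so the values are assumptions for illustration only.

```r
# Stand-in for the table named in the warning; the values are made up.
seqInfo <- data.frame(seqRID = c("ABC123", "DEF456"), stringsAsFactors = FALSE)

# No entry equals 0, so which() returns an empty integer vector...
idx <- which(seqInfo$seqRID == 0)

# ...and min() on an empty vector warns and returns Inf
# (the warning is suppressed here only to keep the demo quiet).
res <- suppressWarnings(min(idx))

# A defensive alternative: test the length before calling min().
first_match <- if (length(idx) > 0) min(idx) else NA_integer_
```

Seeing this warning therefore means that no element matched the condition, which fits a request that ended without the expected result.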
NCBI database names are not well defined anywhere: it can be tricky to find the right one.
For example, to BLAST sequences against the hg19 version of the human genome assembly, one must specify db = "genomic/9606/GCF_000001405.25" in submit_NCBI_BLAST(), which is not an obvious name for a genome database.
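In the example above, 9606 is the NCBI Taxonomy ID for human and GCF_000001405.25 has the form of a RefSeq assembly accession. Assuming other genome databases follow the same "genomic/<taxonomy ID>/<assembly accession>" layout, which is inferred from this single example and not confirmed, the string could be built as follows. The helper `make_genomic_db` is hypothetical and not part of NCBI.BLAST2DT.

```r
# Hypothetical helper (not part of NCBI.BLAST2DT). The layout
# "genomic/<taxonomy ID>/<assembly accession>" is inferred from one example
# and may not hold for every NCBI database.
make_genomic_db <- function(tax_id, assembly_accession) {
  paste("genomic", tax_id, assembly_accession, sep = "/")
}

db <- make_genomic_db(9606, "GCF_000001405.25")
print(db)
```

The resulting value is what would be passed as the db argument of submit_NCBI_BLAST().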
If you encounter issues, or if a feature you expect is not available in an NCBI.BLAST2DT function, please check whether an existing issue addresses your point here. If not, create a new issue here.
- hoardeR: Collect and Retrieve Annotation Data for Various Genomic Data Using Different Webservices.
- Johnson, M. et al. NCBI BLAST: a better web interface. Nucleic Acids Research 36, W5–W9 (2008).
- Paradis E. & Schliep K. 2019. ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics 35: 526–528. doi:10.1093/bioinformatics/bty633. HAL: ird-01920132.
- xml2: Parse XML.