Hi All,
I'm trying to pull the sequences for a list of 3-400,000 accessions from the NCBI protein database via the biopython entrez api. For my specific use case, I'm processing them in 20,000 groups of around 1000 sequences (for context, each group is an smcog, but that doesn't really matter here), but for the purposes of this question I'm keen to just get as many as possible in one go. For example, if I can find a way to extract 100,000 sequences, I could get the sequences in one go and split them into groups later.
I've heard people talking about extracting e.g. 100,000 sequences with EFetch (https://www.biostars.org/p/280872/), but I'm struggling to get more than 200 sequences at a time. I've tried various methods, including making a http POST as suggested by the docs, but they're all hitting similar limits beyond around 200 accessions. I think this is because there seems to be a limit on the length of URLs that can be taken by browsers (https://stackoverflow.com/questions/417142/what-is-the-maximum-length-of-a-url-in-different-browsers), which presumably impacts the length of the URLs that can be accepted by http POST calls. But if this is true, how can someone manage to get 100,000 sequences?
My latest attempt, below, is iteratively calling EPost to make a list of query keys for a single web env, with each key corresponding to 200 accessions (the limit I've been running up against). So for a group of 1000 accessions, the final output will be a web env and a list of query keys e.g. ['1','2','3','4','5']. My question is, can I join these query keys into a single value for the query_key parameter in EFetch? This would be more efficient than going through each key in a for loop I think, and the url size being generated should be fairly low as it is a list of keys rather than 1000 accessions. I've looked here https://dataguide.nlm.nih.gov/eutilities/history.html, and have tried building a query key string term with %23 or # with the AND key word, but neither work (HTTP Error 500: Internal Server Error), although the link implies what I want to do is possible.
Also, please let me know if this all sounds hideously inefficient, most of this is coming from the docs so there is probably a much better way of doing this. Due to the nature of the accessions (they're from 2012 and NCBI has since done a massive re-org with non-redundant ref seqs, so the local databases may not have these accessions) and my crappy laptop, I'm trying to avoid downloading the database locally, although this might be better. I'd also really like to get this api stuff working if I can, as a learning experience, rather than give up.
This is all part of a bigger script, but the issues described above are in this bit:
#make a web environment and add query keys for each accession_list,
#which together make accession_list_whole
webenv_assigned=False
webenv=''
query_keys=[]
server_limit=200
#accession_list_whole is a big list of accessions e.g. ['accession 1', 'accession 2',...] for >1000 accessions.
for index in range(0,len(accession_list_whole), server_limit):
#make useable accession list
accession_list=accession_list_whole[index:index+server_limit]
#feed into epost.
teststr=','.join(accession_list)
if webenv_assigned:
#build up a list of query keys for a single web env
epost=Entrez.epost(db="protein", id=','.join(accession_list), WebEnv=webenv)
search_results = Entrez.read(epost)
else:
#for first loop, assign a web env
epost=Entrez.epost(db="protein", id=','.join(accession_list))
search_results = Entrez.read(epost)
webenv = search_results["WebEnv"]
query_keys.append(search_results["QueryKey"])
webenv_assigned=True
'''I now have a web env (MCID_5fa80e1470e80f0dc50d04fb) and a list of query keys [1,2,3,4,5]. The next bit of code feeds the web env and query keys to efetch. I could do this in a for loop and iteratively call efetch for each query key, appending the resultant seqRecords to object_new, but if I can do this all at once i think it will be more efficient.'''
for attempt in range(5):
#for loop = this is just in case there's a weird server error, probably unnecessary
try:
object_new=[]
################################
#What is the correct query_key parameter format here?
#I have tied %23<key> and #<key> with AND, neither work.
################################
join='#'
bool='+AND+'
queryKey=join+query_keys[0]+bool
for key in query_keys[1:-1]:
queryKey+=join+key+bool
queryKey+=join+query_keys[-1]
################################
#--> 1+AND+#2+AND+#3+AND+#4+AND+#5+AND+#6+AND+#7+AND+#8+AND+#9
################################
fetch=Entrez.efetch(
db="protein",
rettype='fasta',#rettype="gb",
retmax=len(accession_list_whole),
webenv=webenv,
#the annoying param:
query_key=queryKey
)
object_new=list(SeqIO.parse(fetch, "fasta"))
fetched=True
break
except Exception as e:
print(e)
#HTTP Error 500: Internal Server Error
print('str used to generate web env and query key: ',teststr)
print('query_key used to generate web env and query key: ',queryKey)
fetched=False
continue
#I then write the id and seq for each seqRecord in object_new to a txt file in fasta format
Hi genomax, thanks for the reply.
My problem with the blastdbcmd option is that a lot of the accessions I'm using are not easily available now. They're taken from a paper that used data extracted in 2012, and shortly after this NCBI removed a load of redundant sequences and replaced them with non-redundant ref seqs instead, which have different accessions beginning with WP_. The api seems to be finding them OK despite this, but I'm guessing it's doing things behind the scenes which may not happen with whatever I'm trying to do locally. So my worry is that these old accessions, or links to them, wont be in whatever database I download. Does that make sense/is this correct?
As an example, if I search for YP_005532033.1 in the NCBI web page, it doesn't come up in any of the databases as the record is removed, but it still shows a link to the details of the removed record: https://www.ncbi.nlm.nih.gov/search/all/?term=YP_005532033.1. The web page is linking to what I'm guessing is an archived record in the protein database, and I'm not sure if this archived record would be present in whatever databases I downloaded locally.
Also, I've had a look at the NCBI FTP site before but I can't see an option for the 'protein' database, would this just be nr?
Yes, I've got an NCBI API key, it's further up the script this is all part of but isn't shown here. Entezdirect looks good, but unfortunately I've not got a unix OS - this might sound thick but would it work on the linux subsystem for windows, linux and unix are meant to be pretty similar?
Thanks again! Tim
I guess that is a valid issue and would not be solved by using
nr
db.I am going to show an example of the accession you posted above using Entrezdirect on command line (truncated to show just 2 lines) :
massa.kassa.sc3na has some good suggestions that you should consider as well.
That is the superset of all protein accessions.
Ah yes that does look like it would be ideal. I'll have a play and see if I can get it working on windows (seems possible - http://home.cc.umanitoba.ca/~psgendb/birchhomedir/doc/NCBI/edirect/chapter6.pdf). Thanks a lot for your help with this!