Hi All,
I have a list of BioProject IDs and would like to retrieve the corresponding sequences. I am following the steps below:
1. Using the BioProject ID, I get the GI ID with elink:
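from Bio import Entrez, SeqIO
handle = Entrez.elink(dbfrom="bioproject", db="nuccore", id=bioprojecID, linkname="bioproject_nuccore_wgsmaster")
record = Entrez.read(handle)
# "LinkSetDb" and "Link" are lists, so take the first entry of each
GI_ID = record[0]["LinkSetDb"][0]["Link"][0]["Id"]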
2. Then I try to fetch the sequence for GI_ID (using efetch and the SeqIO module in Biopython):
handle = Entrez.efetch(db="nucleotide", id=GI_ID, rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
But this gives an unknown sequence when I print the record.
Can anyone advise if this is the right way to do it or is there a better way to obtain related sequences from bioproject IDs?
Thanks in advance!
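On step 1: the parsed elink result is a list of link sets, and both "LinkSetDb" and "Link" inside it are lists, so a project can link to more than one nuccore record. Here is a minimal sketch for collecting all of them, assuming the usual nested structure returned by Entrez.read. The helper name extract_linked_ids is made up for illustration, and the example record is hand-written with invented IDs so it runs without a network call.

```python
def extract_linked_ids(elink_record, linkname=None):
    """Collect every linked ID from a parsed elink result.

    elink_record is expected to look like the output of Entrez.read():
    a list of link sets, each holding a "LinkSetDb" list whose entries
    hold a "Link" list of {"Id": ...} items.
    """
    ids = []
    for linkset in elink_record:
        for linksetdb in linkset.get("LinkSetDb", []):
            # Optionally keep only one link type
            if linkname is not None and linksetdb.get("LinkName") != linkname:
                continue
            ids.extend(link["Id"] for link in linksetdb.get("Link", []))
    return ids


# Hand-written stand-in for Entrez.read(handle); the IDs are invented
example = [
    {
        "DbFrom": "bioproject",
        "IdList": ["1"],
        "LinkSetDb": [
            {
                "DbTo": "nuccore",
                "LinkName": "bioproject_nuccore_wgsmaster",
                "Link": [{"Id": "111"}, {"Id": "222"}],
            }
        ],
    }
]

print(extract_linked_ids(example))
```

An empty list back from a real query would mean the project has no links of that type, which is worth checking before calling efetch.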
Hi, thanks for replying. I tried printing record.seq but it gives weird output (multiple 'N' characters).
It is very common to have runs of 'N' characters at the start of a sequence. A chromosome may begin with a stretch of Ns that is hundreds or thousands of bases long. Scroll further down into your sequence.
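Rather than scrolling, you could count the Ns. This is only a rough sketch: n_summary is a hypothetical helper, and it assumes you already have the bases as a plain Python string.

```python
def n_summary(sequence):
    """Return (leading_n, total_n, n_fraction) for a sequence string."""
    seq = sequence.upper()
    # Length of the run of Ns at the very start
    leading_n = len(seq) - len(seq.lstrip("N"))
    total_n = seq.count("N")
    # Guard against division by zero for an empty sequence
    n_fraction = total_n / len(seq) if seq else 0.0
    return leading_n, total_n, n_fraction


# Toy sequence, not real data
leading, total, fraction = n_summary("NNNNACGTNACGT")
print(leading, total, round(fraction, 3))
```

If the fraction comes out as 1.0, the whole sequence is N, so scrolling down will not help and the record probably carries no actual sequence data.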