Hello!
I now have some protein sequences of a number of species (chicken, human, mouse, ...). I want to use BLASTP to find the homologous sequence of a small peptide "KPWLRVALCPG" in these species. For the convenience of searching directly for multiple species, I have combined the protein sequences of these species into a single file called "merge.fa" and created a blast database.
I try to filter out peptides with a bit-score > 20 and I use a formula (E = query_sequence_length * total_database_length / 2^bit-score) to calculate the corresponding e_value for a bit-score of 20.
I first searched for merge.fa, using the following command.
blastp -db merge.fa -query test.fa -out test.merge.0 -task blastp-short -outfmt 0 -evalue 5283 -word_size 2 -matrix PAM30
Another search was carried out on chicken.fa, using the following command.
blastp -db chicken.fa -query test.fa -out test.chicken.0 -task blastp-short -outfmt 0 -evalue 44 -word_size 2 -matrix PAM30
However, I found two problems.
- Firstly, the bit-score threshold was not the expected "20", but "18".
- Secondly, I compared the search results for "chicken.fa" with those for "merge.fa" and found that many of the results for "chicken.fa" had disappeared!
I am very confused, can anyone give me some advice? Thanks.
What happens if you use the command:
OLD_FSC=1 blastp -db merge.fa -query test.fa [...]
related: Blast E Value Calculation
I'm sorry, it doesn't work. The BLASTP threw an exception:
the
OLD_FSC=1
should come beforeblastp
as you're setting an environment variable. Seehttps://www.ncbi.nlm.nih.gov/books/NBK131777/#_Blast_ReleaseNotes_BLAST_2_2_26_January_
Thanks a million. It works.
I set the environment variable by typing the following command:
This is the result of the search for "merge.fa". The lowest bit-score was 19.7 instead of 20, but its e_value of "5281" was close enough to my expectation of "5283", so it is reasonable to assume that the problem has been solved.
It is also working in the search results for "chicken.fa":
I would like to add a few words to the second problem.
The image below (Fig.1) shows the rows containing "chicken" in the search results for "merge.fa". There are three columns, indicating the sequence name, bit-score and e_value.
The image below (Fig.2) shows the search results for "chicken.fa". Red boxes indicate the protein sequences that appeared in the search results for "merge.fa", and red arrows indicate the protein sequences that satisfy the bit-score requirement but don't appear in the search results for "merge.fa"
Why do the protein sequences that satisfy the bit-score requirements not appear in the search results for "merge.fa"?