Hello,
I am working with R and Bioconductor packages. I have set big sets of sequences such as:
A = c("TATTCCTAGGTTCGCCT", "TTCCCGTTGCCCAGTGA", ...)  # length ~500
B = c("CTCCACACCAAAGCATC", "AACTGTGAGATTAATCT", ...)  # length ~10,000,000
What I would like to know, ultimately, is which sequences from B match each sequence of A with at most 5 mismatches, e.g. something like:
res$A1
5, 5000, 8000000, ...
res$A2
3005, 7560, 5003542, ...
I could use loops or some "apply" functions, but that takes ages.
I also looked at PDict, matchPDict, and vwhichPDict. They are much more efficient, but my sequences are too short: PDict will not let me set the max.mismatch parameter to 5.
Since the sequences in A and B are all exactly the same length, I do not need searches or alignments; I probably just need to count the mismatches directly. But I cannot find a way of doing that really efficiently and quickly.
Any ideas please?
Many thanks
Thanks a lot for your suggestions! I wanted to double-check with experts that I was not missing anything obvious already implemented in R. I finally wrote a C function that does what I need and called it from R. It takes around 400 seconds to run. Amazing compared to R or Python!
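For the record, a rough sketch of the kind of C helper this could be (the names `n_mismatches` and `scan_subjects` are my own assumptions, not the poster's actual code): count the mismatching positions between two equal-length sequences, and for one query from A collect the indices of the B entries within the allowed distance.

```c
#include <stddef.h>

/* Plain Hamming distance between two equal-length sequences.
   Illustrative sketch; names and interface are assumptions. */
int n_mismatches(const char *a, const char *b, size_t len)
{
    int mm = 0;
    for (size_t i = 0; i < len; i++)
        mm += (a[i] != b[i]);
    return mm;
}

/* For one query `a`, collect the indices of the subjects within
   max_mm mismatches; returns the number of hits written to `hits`. */
size_t scan_subjects(const char *a, const char **subjects, size_t n,
                     size_t len, int max_mm, size_t *hits)
{
    size_t nhits = 0;
    for (size_t j = 0; j < n; j++)
        if (n_mismatches(a, subjects[j], len) <= max_mm)
            hits[nhits++] = j;
    return nhits;
}
```

From R, a function like this would typically be hooked up through the .C or .Call interface; the details depend on how the character vectors are passed in.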
Well done! For the record, a couple of comments.
1) The Python implementation is not as bad as I thought. For 500 * 10,000,000 comparisons it should take around half an hour or less to compute the Levenshtein distance (~1500 s, see test below):
2) You probably realized this already: since you say that only substitutions are allowed, the Levenshtein distance is unnecessarily expensive to compute. The Hamming distance seems to be what you need, and it should be much faster; I'd guess ~10x faster?
Possibly more than 10x, since what you need is really a bounded Hamming distance (i.e., the function can simply return once the distance exceeds 5).
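A minimal sketch of that bounded Hamming distance in C (illustrative, not anyone's actual code; assumes equal-length sequences):

```c
#include <stddef.h>

/* Bounded Hamming distance: stop counting as soon as the mismatch
   count exceeds max_mm, since the caller only needs to know whether
   the two sequences are within max_mm of each other. */
int hamming_bounded(const char *a, const char *b, size_t len, int max_mm)
{
    int mm = 0;
    for (size_t i = 0; i < len; i++) {
        if (a[i] != b[i]) {
            mm++;
            if (mm > max_mm)
                return mm;   /* early exit: caller can reject now */
        }
    }
    return mm;
}
```

For random sequences most comparisons exceed the threshold within the first few positions, so the early exit can skip most of each scan.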
400 seconds still seems a bit on the slow end, though I guess it's fast enough.
Are you allowing insertions/deletions or only substitutions?
Hello! Only substitutions.