-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add syncmers at ends of reads #452
base: main
Are you sure you want to change the base?
Conversation
Increases the number of seeds per read by 2*w_min seeds
(This ensures that unmappable.41 in the test data really is unmappable) Randstrobes that have no downstream partner at least w_min syncmers away get their second hash set to 0, which means in the case of multi-context seeds that the primary/main hash is also zero because it is chosen as the smaller of the two. When this is done both for the reference and for queries, we get spurious hits for all randstrobes towards the ends of queries, which get mapped to the end of the reference. Using the hash of the primary syncmer also as hash for the second syncmer gets rid of the problem.
We do not need to handle the case that we have fewer than w_min remaining syncmers because the for loop will iterate zero times and we get the desired result anyway.
970c6bf
to
de291f3
Compare
In #426 (comment), I measured that the commit "Make RandstrobeIterator/-Generator yield the same results" caused a significant slowdown in strobealign when multi-context seeds are enabled. I’ve tried to analyze that commit only, but doing so doesn’t make sense because it contains the buggy hashing (where the second hash was always set to zero), which was fixed in a subsequent commit. This PR here contains that particular commit and the fix, so I think it is better to just look at this entire PR and discuss the slowdown here. First, the slowdown when multi-context seeds are disabled is relatively small, about 2%, which is presumably just because more randstrobes are generated per read. Diff (main vs this PR): fruit fly 100 bp reads:
fruit fly 200 bp reads:
I don’t think there should be any gains from generating these extra randstrobes when we are not using multi-context seeds, so we may want to not generate them in that case. Second, the slowdown when multi-context seeds are enabled is about 10%. Profiling shows that strobealign spends 36% more in the I made another observation (which may be obvious to Igor and you), but which I wanted to write down. Regarding these extra randstrobes at the end of queries that this PR adds.
|
Where do you get these numbers from? Is it from data not presented? From the stats you paste it looks to be 14% and 20%. I agree that it could probably also be disabled when not using mcs because of the reasons you mention. Additional to your comments, some of these 'end-syncmers' may also still be base hashes in previously created strobes, so it's likely that very few 'end-syncmers' will be useful. Still, I guess this will come with a decrease in accuracy? (I was the one added them so I guess I observed slightly better results 🤔). Could we test on SIM5 before abandoning the end-syncmers? |
Split out from #426.
To Do