Add syncmers at ends of reads #452

marcelm · 2024-10-07T13:39:32Z

Split out from #426.

To Do

Maybe add these syncmers only when mcs is enabled
Fix tests

Increases the number of seeds per read by 2*w_min seeds

(This ensures that unmappable.41 in the test data really is unmappable) Randstrobes that have no downstream partner at least w_min syncmers away get their second hash set to 0, which means in the case of multi-context seeds that the primary/main hash is also zero because it is chosen as the smaller of the two. When this is done both for the reference and for queries, we get spurious hits for all randstrobes towards the ends of queries, which get mapped to the end of the reference. Using the hash of the primary syncmer also as hash for the second syncmer gets rid of the problem.

We do not need to handle the case that we have fewer than w_min remaining syncmers because the for loop will iterate zero times and we get the desired result anyway.

marcelm · 2024-11-27T14:45:41Z

In #426 (comment), I measured that the commit "Make RandstrobeIterator/-Generator yield the same results" caused a significant slowdown in strobealign when multi-context seeds are enabled. I’ve tried to analyze that commit only, but doing so doesn’t make sense because it contains the buggy hashing (where the second hash was always set to zero), which was fixed in a subsequent commit. This PR here contains that particular commit and the fix, so I think it is better to just look at this entire PR and discuss the slowdown here.

First, the slowdown when multi-context seeds are disabled is relatively small, about 2%, which is presumably just because more randstrobes are generated per read. Diff (main vs this PR):

fruit fly 100 bp reads:

-Number of randstrobes: 56477476 total. Per read: 28.24
+Number of randstrobes: 64477476 total. Per read: 32.24

fruit fly 200 bp reads:

-Number of randstrobes: 59178572 total. Per read: 59.18
+Number of randstrobes: 71178572 total. Per read: 71.18

I don’t think there should be any gains from generating these extra randstrobes when we are not using multi-context seeds, so we may want to not generate them in that case.

Second, the slowdown when multi-context seeds are enabled is about 10%. Profiling shows that strobealign spends 36% more in the find_nams() function and 22% more in merge_matches_into_nams(). This is a bit more than what I had expected given that there aren’t 36% more randstrobes.

I made another observation (which may be obvious to Igor and you), but which I wanted to write down. Regarding these extra randstrobes at the end of queries that this PR adds.

These randstrobes are usually not found in full in the index because the strobes are closer together than w_min (wich was used when indexing).
That means the full lookup is theoretically unnecessary. They will always trigger a partial lookup.
The partial lookup only succeeds when the strobe that we see in the read was chosen as main during indexing, so it fails half the time.

ksahlin · 2024-12-10T14:21:38Z

about 2%, ... Second, the slowdown when multi-context seeds are enabled is about 10%.

Where do you get these numbers from? Is it from data not presented? From the stats you paste it looks to be 14% and 20%.

I agree that it could probably also be disabled when not using mcs because of the reasons you mention. Additional to your comments, some of these 'end-syncmers' may also still be base hashes in previously created strobes, so it's likely that very few 'end-syncmers' will be useful.

Still, I guess this will come with a decrease in accuracy? (I was the one added them so I guess I observed slightly better results 🤔). Could we test on SIM5 before abandoning the end-syncmers?

This was referenced Oct 7, 2024

Use optimized parameters according to optimization in issue #423 #453

Draft

Multi-context seeds plus fixes and optimized parameters #426

Open

ksahlin and others added 4 commits November 25, 2024 15:46

Add syncmers at the ends of reads as seeds without auxiliary hash

59546e7

Increases the number of seeds per read by 2*w_min seeds

Make RandstrobeIterator and -Generator yield the same results

f41a55c

Simplify: Iterating zero times gives same result

de291f3

We do not need to handle the case that we have fewer than w_min remaining syncmers because the for loop will iterate zero times and we get the desired result anyway.

marcelm force-pushed the mcs-read-ends branch from 970c6bf to de291f3 Compare November 25, 2024 16:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add syncmers at ends of reads #452

Add syncmers at ends of reads #452

marcelm commented Oct 7, 2024

marcelm commented Nov 27, 2024

ksahlin commented Dec 10, 2024 •

edited

Loading

Add syncmers at ends of reads #452

Are you sure you want to change the base?

Add syncmers at ends of reads #452

Conversation

marcelm commented Oct 7, 2024

marcelm commented Nov 27, 2024

ksahlin commented Dec 10, 2024 • edited Loading

ksahlin commented Dec 10, 2024 •

edited

Loading