Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add syncmers at ends of reads #452

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from
Draft

Add syncmers at ends of reads #452

wants to merge 4 commits into from

Conversation

marcelm
Copy link
Collaborator

@marcelm marcelm commented Oct 7, 2024

Split out from #426.

To Do

  • Maybe add these syncmers only when mcs is enabled
  • Fix tests

ksahlin and others added 4 commits November 25, 2024 15:46
Increases the number of seeds per read by 2*w_min seeds
(This ensures that unmappable.41 in the test data really is unmappable)

Randstrobes that have no downstream partner at least w_min syncmers away get
their second hash set to 0, which means in the case of multi-context seeds
that the primary/main hash is also zero because it is chosen as the smaller
of the two.

When this is done both for the reference and for queries, we get spurious
hits for all randstrobes towards the ends of queries, which get mapped to the
end of the reference.

Using the hash of the primary syncmer also as hash for the second syncmer
gets rid of the problem.
We do not need to handle the case that we have fewer than w_min remaining
syncmers because the for loop will iterate zero times and we get the desired
result anyway.
@marcelm
Copy link
Collaborator Author

marcelm commented Nov 27, 2024

In #426 (comment), I measured that the commit "Make RandstrobeIterator/-Generator yield the same results" caused a significant slowdown in strobealign when multi-context seeds are enabled. I’ve tried to analyze that commit only, but doing so doesn’t make sense because it contains the buggy hashing (where the second hash was always set to zero), which was fixed in a subsequent commit. This PR here contains that particular commit and the fix, so I think it is better to just look at this entire PR and discuss the slowdown here.

First, the slowdown when multi-context seeds are disabled is relatively small, about 2%, which is presumably just because more randstrobes are generated per read. Diff (main vs this PR):

fruit fly 100 bp reads:

-Number of randstrobes: 56477476 total. Per read: 28.24
+Number of randstrobes: 64477476 total. Per read: 32.24

fruit fly 200 bp reads:

-Number of randstrobes: 59178572 total. Per read: 59.18
+Number of randstrobes: 71178572 total. Per read: 71.18

I don’t think there should be any gains from generating these extra randstrobes when we are not using multi-context seeds, so we may want to not generate them in that case.

Second, the slowdown when multi-context seeds are enabled is about 10%. Profiling shows that strobealign spends 36% more in the find_nams() function and 22% more in merge_matches_into_nams(). This is a bit more than what I had expected given that there aren’t 36% more randstrobes.

I made another observation (which may be obvious to Igor and you), but which I wanted to write down. Regarding these extra randstrobes at the end of queries that this PR adds.

  • These randstrobes are usually not found in full in the index because the strobes are closer together than w_min (wich was used when indexing).
  • That means the full lookup is theoretically unnecessary. They will always trigger a partial lookup.
  • The partial lookup only succeeds when the strobe that we see in the read was chosen as main during indexing, so it fails half the time.

@ksahlin
Copy link
Owner

ksahlin commented Dec 10, 2024

about 2%, ... Second, the slowdown when multi-context seeds are enabled is about 10%.

Where do you get these numbers from? Is it from data not presented? From the stats you paste it looks to be 14% and 20%.

I agree that it could probably also be disabled when not using mcs because of the reasons you mention. Additional to your comments, some of these 'end-syncmers' may also still be base hashes in previously created strobes, so it's likely that very few 'end-syncmers' will be useful.

Still, I guess this will come with a decrease in accuracy? (I was the one added them so I guess I observed slightly better results 🤔). Could we test on SIM5 before abandoning the end-syncmers?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants