Are there known good data structures/algorithms for finding the nearest region, like bedtools closest?
Off the top of my head:
If I have a data structure for interval lookup (like an interval tree), then for each range (a_start, a_end) in A I can do:
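    hits = []
    i = 0
    slack = 1000
    # widen the query window by `slack` on each side until something overlaps it
    while not hits:
        hits = interval_tree.find(a_start - slack * i, a_end + slack * i)
        i += 1
    # pick the closest range among the candidates
    find_nearest_in_hits(a_start, a_end, hits)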
But even in C this might be slow, depending on what the data looks like.
I could use a larger slack, but that would just mean more work in find_nearest_in_hits,
since it would get a larger result set most of the time.
(Here the interval tree contains the ranges in B, while the ranges in A are used to query it.)
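To make the idea above concrete, here is a minimal runnable sketch in Python. It is only an illustration under some assumptions: find_overlapping is a hypothetical brute-force stand-in for interval_tree.find (a real interval tree would replace it), ranges are (start, end) tuples treated as closed intervals, and distance and nearest are names I made up for this example.

```python
def find_overlapping(intervals, start, end):
    # Hypothetical stand-in for interval_tree.find: linear scan for
    # every range that overlaps the closed window [start, end].
    return [(s, e) for s, e in intervals if s <= end and e >= start]


def distance(a_start, a_end, b_start, b_end):
    # Gap between two closed ranges; 0 if they overlap or touch.
    return max(0, b_start - a_end, a_start - b_end)


def find_nearest_in_hits(a_start, a_end, hits):
    # Pick the candidate with the smallest gap to the query range.
    return min(hits, key=lambda h: distance(a_start, a_end, h[0], h[1]))


def nearest(intervals_b, a_start, a_end, slack=1000):
    # Widen the window by `slack` per side until something overlaps it.
    if not intervals_b:
        return None  # guard: the loop below would never end on empty B
    hits = []
    i = 0
    while not hits:
        hits = find_overlapping(intervals_b,
                                a_start - slack * i,
                                a_end + slack * i)
        i += 1
    return find_nearest_in_hits(a_start, a_end, hits)


B = [(100, 200), (5000, 6000), (9000, 9500)]
print(nearest(B, 2500, 2600))
```

Because the window grows by the same amount on both sides, every range that is not in hits is farther away than every range that is, so the minimum over hits is the overall nearest. The cost concern is unchanged: sparse data means many repeated queries, and a large slack means more candidates to scan.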
If you could give some examples, it would help us understand your problem much more easily. Also, you didn't explain what 'interval_tree' and 'find_nearest_in_hits' in your code are.
Agreed. Sorry about that.