Question

Algorithms for finding nearest range given two lists of ranges?

0

Entering edit mode

6.5 years ago

endrebak ▴ 970

Are there known good datastructures/algorithms for finding the nearest region like bedtools closest?

From the top of my head:

If I have a datastructure to do interval lookup (like an intervaltree) I can for each range in A: (a_start, a_end) do

hits = []
i = 0
slack = 1000
while not hits:
    hits = interval_tree.find(start - slack * i, end + slack * i) 
    i += 1

find_nearest_in_hits(start, end, hits)

But even in C this is might be slooow, depending on how the data looks.

I can use a large slack, but that would just require me to do more work in find_nearest_in_hits since it would get a larger result set most of the time.

(Here the intervaltree contains the ranges in B, while I use A to query it).

algorithms • 2.2k views

ADD COMMENT • link updated 17 months ago by Ram 44k • written 6.5 years ago by endrebak ▴ 970

0

Entering edit mode

If you could give some examples, it will help us understand your problems much easier... And in your code, you didn't explain what the 'interval_tree' and 'find_nearest_in_hits' variables are

ADD REPLY • link 6.5 years ago by shoujun.gu ▴ 380

0

Entering edit mode

Agreed. Sorry about that.

ADD REPLY • link 6.5 years ago by endrebak ▴ 970

score 1 · Answer 1 · 2018-05-11

1

Entering edit mode

6.5 years ago

chrchang523 11k

You can use a kd-tree with k=2.

ADD COMMENT • link 6.5 years ago by chrchang523 11k

0

Entering edit mode

Brilliant. There is both a scipy implementation: https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.spatial.KDTree.html#scipy.spatial.KDTree and a C library that seems like it'll be easy to wrap: https://github.com/jtsiomb/kdtree

ADD REPLY • link 6.5 years ago by endrebak ▴ 970

score 1 · Answer 2 · 2018-05-11

1

Entering edit mode

6.5 years ago

Alex Reynolds 35k

Sort your lists of intervals. You can do a closest-features search more quickly on sorted lists.

For working code and paper, see:

ADD COMMENT • link 6.5 years ago by Alex Reynolds 35k

0

Entering edit mode

https://github.com/bedops/bedops/blob/master/applications/bed/closestfeats/src/ClosestFeature.cpp Thanks. Iff sorting is much faster than KD-tree construction, I'll try to use your method (I suspect it'll be the other way around though).

ADD REPLY • link 6.5 years ago by endrebak ▴ 970

1

Entering edit mode

Sorting and KD-tree time complexity are both O(n log n). Storage of a KD-tree in memory will cost O(n). Comparison of intervals from walking through a pair of sorted lists should generally take much, much less memory, in practice. I suspect using sorted data will work much better with whole-genome scale datasets, but this may depend on your application and what you're trying to do.

ADD REPLY • link 6.5 years ago by Alex Reynolds 35k

0

Entering edit mode

Mostly depends on the length distribution of the intervals, I think. When using sorted lists, I’ve had good results dividing the intervals into two buckets (one large bucket containing all intervals with length < threshold, and one small bucket containing the very long intervals), and keeping track of the maximum interval length to minimize the number of intervals in the second bucket that need to be scanned per query.

ADD REPLY • link 6.5 years ago by chrchang523 11k