-
Notifications
You must be signed in to change notification settings - Fork 689
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
remove TooManyClauses limitation when optimizable #1426
remove TooManyClauses limitation when optimizable #1426
Conversation
Observation: When a search does not perform scoring and does not need location information, disjunction queries are optimized by directly OR'ing the underlying bitset. This avoids all the usual multi-iterator disjunction logic. However, in it's traditional form, we still have a TooManyClauses limit, and this makes sense as all the underlying iterators are still in memory at one time. Observation: The MultiTerm search is unique in that we have a flat list of terms that are used to build the disjunction. This is significant because it means we can ensure that all the underlying searchers are optimizable. By combining these two observations we can introduce a new mode of operation for the MultiTerm search. When it does not perform scoring and does not need location information, we can do a new optimization where we create smaller batches of disjunctions which are immediately optimizable into a single term searcher. By repeating this process across all terms, we end up with the correct searcher, and we never had more than the batch size iterators built in memory at one time. UnadornedPostingsIteratorBitmap was refactored to also implement OptimizablePostingsIterator, this allows us to keep the in-progress final iterator in each batch, simplifying the logic. A new optimization mode "disjunction:unadorned-force" was introduced. It behaves exacdtly the same as "disjunction:unadorned" only it always performs the optimization without regard for the cardinality of the underlying iterators.
I dislike all the duplicated code, but working with |
} | ||
batch, err := makeBatchSearchers(indexReader, batchTerms, field, boost, options) | ||
if err != nil { | ||
return nil, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one doubt here - in case of any subsequent iterations of this loop, we would have a non nil finalSearcher, which would remain un closed if we return on line:101 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, no quick fix here, I have to go back and review how closing is handled when optimizing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I have reviewed the way the existing optimization code works and concluded that it never closes the searchers it optimizes. The reason (as I understand it) is that these searchers were already existing, and essentially owned by searcher that was invoking the optimization (disjunction or conjunction). By "owned" what I mean is that they are member variables "searchers" in the struct, and they are cleaned up by the regular Close() call on those parent searchers.
However, our use of the optimization in MultiTermSearch is somewhat different. The searchers don't yet exist (just slice of terms), we create them ourselves. Additionally, in our case, we know they are no longer needed after the optimization has been attempted (either we got an error or failed to optimize and are returning, or we have a single new searcher that is the only one we need to keep around).
So, with that in mind I have changed the code to always invoke cleanup() after optimizing, which should close all searchers in the batch (both the term searchers, and any previous optimized searcher from previous round)
Action item for @mschoch, review if UnadornedPostingsIterator1Hit should be enhanced to satisfy OptimizablePostingsIterator as well (probably). |
It tuns out it isn't possible for UnadornedPostingsIterator1Hit to implement OptimizablePostingsIterator, as it cannot represent more than 1 hit (as required by ReplaceActualBitmap). This actually exposes a failure mode for this optimization and more work is required. |
OK, I have now concluded that it isn't an actual problem, but I'd like to include a description of the original concern, and why I've concluded it isn't an issue. Concern: Since Why I no longer think this is a valid concern: The type of optimization we're performing Further, since the MultiTermSearcher creates all the child searchers directly from a list of terms, there is no possibility of the result of any other optimization being a part of this optimization. |
Please review my last comment @sreekanth-cb |
@mschoch , yes these comments make sense. |
we now assert that the expected number of searchers are started and that all started searchers are closed optimization code adjusted to correct increment stats and one missing close added to the testcase itself
@sreekanth-cb based on your feedback:
|
@sreekanth-cb ping for re-review as github isn't showing that option. |
👍 |
) Observation: When a search does not perform scoring and does not need location information, disjunction queries are optimized by directly OR'ing the underlying bitset. This avoids all the usual multi-iterator disjunction logic. However, in it's traditional form, we still have a TooManyClauses limit, and this makes sense as all the underlying iterators are still in memory at one time. Observation: The MultiTerm search is unique in that we have a flat list of terms that are used to build the disjunction. This is significant because it means we can ensure that all the underlying searchers are optimizable. By combining these two observations we can introduce a new mode of operation for the MultiTerm search. When it does not perform scoring and does not need location information, we can do a new optimization where we create smaller batches of disjunctions which are immediately optimizable into a single term searcher. By repeating this process across all terms, we end up with the correct searcher, and we never had more than the batch size iterators built in memory at one time. UnadornedPostingsIteratorBitmap was refactored to also implement OptimizablePostingsIterator, this allows us to keep the in-progress final iterator in each batch, simplifying the logic. A new optimization mode "disjunction:unadorned-force" was introduced. It behaves exacdtly the same as "disjunction:unadorned" only it always performs the optimization without regard for the cardinality of the underlying iterators.
…vesearch#1426) This is a backport of: blevesearch@1f15e1d. Observation: When a search does not perform scoring and does not need location information, disjunction queries are optimized by directly OR'ing the underlying bitset. This avoids all the usual multi-iterator disjunction logic. However, in it's traditional form, we still have a TooManyClauses limit, and this makes sense as all the underlying iterators are still in memory at one time. Observation: The MultiTerm search is unique in that we have a flat list of terms that are used to build the disjunction. This is significant because it means we can ensure that all the underlying searchers are optimizable. By combining these two observations we can introduce a new mode of operation for the MultiTerm search. When it does not perform scoring and does not need location information, we can do a new optimization where we create smaller batches of disjunctions which are immediately optimizable into a single term searcher. By repeating this process across all terms, we end up with the correct searcher, and we never had more than the batch size iterators built in memory at one time. UnadornedPostingsIteratorBitmap was refactored to also implement OptimizablePostingsIterator, this allows us to keep the in-progress final iterator in each batch, simplifying the logic. A new optimization mode "disjunction:unadorned-force" was introduced. It behaves exacdtly the same as "disjunction:unadorned" only it always performs the optimization without regard for the cardinality of the underlying iterators.
Observation: When a search does not perform scoring and does not
need location information, disjunction queries are optimized by
directly OR'ing the underlying bitset. This avoids all the
usual multi-iterator disjunction logic. However, in it's
traditional form, we still have a TooManyClauses limit, and
this makes sense as all the underlying iterators are still
in memory at one time.
Observation: The MultiTerm search is unique in that we have a
flat list of terms that are used to build the disjunction.
This is significant because it means we can ensure that
all the underlying searchers are optimizable.
By combining these two observations we can introduce a new mode
of operation for the MultiTerm search. When it does not perform
scoring and does not need location information, we can do a new
optimization where we create smaller batches of disjunctions
which are immediately optimizable into a single term searcher.
By repeating this process across all terms, we end up with
the correct searcher, and we never had more than the batch
size iterators built in memory at one time.
UnadornedPostingsIteratorBitmap was refactored to also
implement OptimizablePostingsIterator, this allows us
to keep the in-progress final iterator in each batch,
simplifying the logic.
A new optimization mode "disjunction:unadorned-force" was
introduced. It behaves exacdtly the same as
"disjunction:unadorned" only it always performs the
optimization without regard for the cardinality of the
underlying iterators.