Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove TooManyClauses limitation when optimizable #1426

Merged

Conversation

mschoch
Copy link
Contributor

@mschoch mschoch commented Jun 30, 2020

Observation: When a search does not perform scoring and does not
need location information, disjunction queries are optimized by
directly OR'ing the underlying bitset. This avoids all the
usual multi-iterator disjunction logic. However, in it's
traditional form, we still have a TooManyClauses limit, and
this makes sense as all the underlying iterators are still
in memory at one time.

Observation: The MultiTerm search is unique in that we have a
flat list of terms that are used to build the disjunction.
This is significant because it means we can ensure that
all the underlying searchers are optimizable.

By combining these two observations we can introduce a new mode
of operation for the MultiTerm search. When it does not perform
scoring and does not need location information, we can do a new
optimization where we create smaller batches of disjunctions
which are immediately optimizable into a single term searcher.
By repeating this process across all terms, we end up with
the correct searcher, and we never had more than the batch
size iterators built in memory at one time.

UnadornedPostingsIteratorBitmap was refactored to also
implement OptimizablePostingsIterator, this allows us
to keep the in-progress final iterator in each batch,
simplifying the logic.

A new optimization mode "disjunction:unadorned-force" was
introduced. It behaves exacdtly the same as
"disjunction:unadorned" only it always performs the
optimization without regard for the cardinality of the
underlying iterators.

Observation: When a search does not perform scoring and does not
need location information, disjunction queries are optimized by
directly OR'ing the underlying bitset.  This avoids all the
usual multi-iterator disjunction logic.  However, in it's
traditional form, we still have a TooManyClauses limit, and
this makes sense as all the underlying iterators are still
in memory at one time.

Observation: The MultiTerm search is unique in that we have a
flat list of terms that are used to build the disjunction.
This is significant because it means we can ensure that
all the underlying searchers are optimizable.

By combining these two observations we can introduce a new mode
of operation for the MultiTerm search.  When it does not perform
scoring and does not need location information, we can do a new
optimization where we create smaller batches of disjunctions
which are immediately optimizable into a single term searcher.
By repeating this process across all terms, we end up with
the correct searcher, and we never had more than the batch
size iterators built in memory at one time.

UnadornedPostingsIteratorBitmap was refactored to also
implement OptimizablePostingsIterator, this allows us
to keep the in-progress final iterator in each batch,
simplifying the logic.

A new optimization mode "disjunction:unadorned-force" was
introduced.  It behaves exacdtly the same as
"disjunction:unadorned" only it always performs the
optimization without regard for the cardinality of the
underlying iterators.
@mschoch
Copy link
Contributor Author

mschoch commented Jun 30, 2020

I dislike all the duplicated code, but working with []string and [][]byte up and down the stack is a preexisting condition.

search/searcher/search_multi_term.go Show resolved Hide resolved
}
batch, err := makeBatchSearchers(indexReader, batchTerms, field, boost, options)
if err != nil {
return nil, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one doubt here - in case of any subsequent iterations of this loop, we would have a non nil finalSearcher, which would remain un closed if we return on line:101 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, no quick fix here, I have to go back and review how closing is handled when optimizing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I have reviewed the way the existing optimization code works and concluded that it never closes the searchers it optimizes. The reason (as I understand it) is that these searchers were already existing, and essentially owned by searcher that was invoking the optimization (disjunction or conjunction). By "owned" what I mean is that they are member variables "searchers" in the struct, and they are cleaned up by the regular Close() call on those parent searchers.

However, our use of the optimization in MultiTermSearch is somewhat different. The searchers don't yet exist (just slice of terms), we create them ourselves. Additionally, in our case, we know they are no longer needed after the optimization has been attempted (either we got an error or failed to optimize and are returning, or we have a single new searcher that is the only one we need to keep around).

So, with that in mind I have changed the code to always invoke cleanup() after optimizing, which should close all searchers in the batch (both the term searchers, and any previous optimized searcher from previous round)

@mschoch mschoch requested a review from sreekanth-cb July 7, 2020 14:58
@mschoch mschoch requested a review from sreekanth-cb July 13, 2020 17:45
@mschoch
Copy link
Contributor Author

mschoch commented Jul 22, 2020

Action item for @mschoch, review if UnadornedPostingsIterator1Hit should be enhanced to satisfy OptimizablePostingsIterator as well (probably).

@mschoch
Copy link
Contributor Author

mschoch commented Jul 23, 2020

It tuns out it isn't possible for UnadornedPostingsIterator1Hit to implement OptimizablePostingsIterator, as it cannot represent more than 1 hit (as required by ReplaceActualBitmap). This actually exposes a failure mode for this optimization and more work is required.

@mschoch
Copy link
Contributor Author

mschoch commented Jul 23, 2020

OK, I have now concluded that it isn't an actual problem, but I'd like to include a description of the original concern, and why I've concluded it isn't an issue.

Concern:

Since UnadornedPostingsIterator1Hit doesn't implement OptimizablePostingsIterator, while the optimizeCompositeSearcher attempts to combine N searchers down into 1 searcher, it could fail because one of the underlying Postings iterators from a previous round of optimization was using UnadornedPostingsIterator1Hit, thus causing the attempted optimization to fail.

Why I no longer think this is a valid concern:

The type of optimization we're performing "disjunction:unadorned-force" never creates instances of UnadornedPostingsIterator1Hit, it always uses UnadornedPostingsIteratorBitmap here: https://github.com/blevesearch/bleve/blob/master/index/scorch/optimize.go#L392

Further, since the MultiTermSearcher creates all the child searchers directly from a list of terms, there is no possibility of the result of any other optimization being a part of this optimization.

@mschoch
Copy link
Contributor Author

mschoch commented Jul 23, 2020

Please review my last comment @sreekanth-cb

@mschoch mschoch requested a review from sreekanth-cb July 23, 2020 16:30
@sreekanth-cb
Copy link
Contributor

@mschoch , yes these comments make sense.
does it make sense to have a UT which may explore that single finalSearcher coming out of another set of finalSearchers built from previous levels..

we now assert that the expected number of searchers are started
and that all started searchers are closed

optimization code adjusted to correct increment stats
and one missing close added to the testcase itself
@mschoch
Copy link
Contributor Author

mschoch commented Aug 6, 2020

@sreekanth-cb based on your feedback:

  1. You asked that the test perform multiple rounds batches of optimization to verify that each time we reduce to a single searcher, and the logic is sound for multiple iterations. It turns out the test was already doing 2 batches. There are 4 term searchers, and the max clauses is adjusted 2, thus requiring 2 loop iterations.

  2. I added code to assert that the expected number of searchers is started (6, the 4 original term searchers, one optimized searcher from the first batch, and then the final search resulting from the second batch).

  3. I added code to assert that all started searchers were closed.

  4. Item 3 reveal failures due to 2 reasons, first the final searcher was not closed in the unit test (fixed). Second, the optimize.go did not bump stats when creating new optimized term searchers (fixed).

@mschoch
Copy link
Contributor Author

mschoch commented Aug 6, 2020

@sreekanth-cb ping for re-review as github isn't showing that option.

@sreekanth-cb
Copy link
Contributor

👍

@mschoch mschoch merged commit 1f15e1d into master Aug 13, 2020
@mschoch mschoch deleted the remove-max-clause-limitation-when-multi-term-search-optimizable branch August 13, 2020 15:23
abhinavdangeti pushed a commit to abhinavdangeti/bleve that referenced this pull request Aug 13, 2020
)

Observation: When a search does not perform scoring and does not
need location information, disjunction queries are optimized by
directly OR'ing the underlying bitset.  This avoids all the
usual multi-iterator disjunction logic.  However, in it's
traditional form, we still have a TooManyClauses limit, and
this makes sense as all the underlying iterators are still
in memory at one time.

Observation: The MultiTerm search is unique in that we have a
flat list of terms that are used to build the disjunction.
This is significant because it means we can ensure that
all the underlying searchers are optimizable.

By combining these two observations we can introduce a new mode
of operation for the MultiTerm search.  When it does not perform
scoring and does not need location information, we can do a new
optimization where we create smaller batches of disjunctions
which are immediately optimizable into a single term searcher.
By repeating this process across all terms, we end up with
the correct searcher, and we never had more than the batch
size iterators built in memory at one time.

UnadornedPostingsIteratorBitmap was refactored to also
implement OptimizablePostingsIterator, this allows us
to keep the in-progress final iterator in each batch,
simplifying the logic.

A new optimization mode "disjunction:unadorned-force" was
introduced.  It behaves exacdtly the same as
"disjunction:unadorned" only it always performs the
optimization without regard for the cardinality of the
underlying iterators.
abhinavdangeti pushed a commit to abhinavdangeti/bleve that referenced this pull request Aug 18, 2020
…vesearch#1426)

This is a backport of:
blevesearch@1f15e1d.

Observation: When a search does not perform scoring and does not
need location information, disjunction queries are optimized by
directly OR'ing the underlying bitset.  This avoids all the
usual multi-iterator disjunction logic.  However, in it's
traditional form, we still have a TooManyClauses limit, and
this makes sense as all the underlying iterators are still
in memory at one time.

Observation: The MultiTerm search is unique in that we have a
flat list of terms that are used to build the disjunction.
This is significant because it means we can ensure that
all the underlying searchers are optimizable.

By combining these two observations we can introduce a new mode
of operation for the MultiTerm search.  When it does not perform
scoring and does not need location information, we can do a new
optimization where we create smaller batches of disjunctions
which are immediately optimizable into a single term searcher.
By repeating this process across all terms, we end up with
the correct searcher, and we never had more than the batch
size iterators built in memory at one time.

UnadornedPostingsIteratorBitmap was refactored to also
implement OptimizablePostingsIterator, this allows us
to keep the in-progress final iterator in each batch,
simplifying the logic.

A new optimization mode "disjunction:unadorned-force" was
introduced.  It behaves exacdtly the same as
"disjunction:unadorned" only it always performs the
optimization without regard for the cardinality of the
underlying iterators.
@mschoch mschoch added this to the 1.0.10 milestone Aug 23, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants