remove TooManyClauses limitation when optimizable #1426

mschoch · 2020-06-30T14:32:17Z

Observation: When a search does not perform scoring and does not
need location information, disjunction queries are optimized by
directly OR'ing the underlying bitset. This avoids all the
usual multi-iterator disjunction logic. However, in its
traditional form, we still have a TooManyClauses limit, and
this makes sense as all the underlying iterators are still
in memory at one time.

Observation: The MultiTerm search is unique in that we have a
flat list of terms that are used to build the disjunction.
This is significant because it means we can ensure that
all the underlying searchers are optimizable.

By combining these two observations we can introduce a new mode
of operation for the MultiTerm search. When it does not perform
scoring and does not need location information, we can do a new
optimization where we create smaller batches of disjunctions
which are immediately optimizable into a single term searcher.
By repeating this process across all terms, we end up with
the correct searcher, and we never have more than one batch's
worth of iterators built in memory at one time.
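
A minimal sketch of the batching idea described above, not the actual bleve code. It assumes a map-based set as a stand-in for the real bitmap type; `bitmap`, `orInto`, `batchedUnion`, and `load` are hypothetical names used only for illustration.

```go
package main

import "fmt"

// bitmap is a hypothetical stand-in for a compressed bitmap:
// a set of document numbers.
type bitmap map[uint64]struct{}

// orInto adds every document number in src to dst.
func orInto(dst, src bitmap) {
	for doc := range src {
		dst[doc] = struct{}{}
	}
}

// batchedUnion ORs the postings of all terms together while building
// at most batchSize term postings at a time. The running result plays
// the role of the in-progress final iterator carried between batches.
func batchedUnion(terms []string, batchSize int, load func(term string) bitmap) bitmap {
	if batchSize < 1 {
		batchSize = 1
	}
	result := bitmap{}
	for start := 0; start < len(terms); start += batchSize {
		end := start + batchSize
		if end > len(terms) {
			end = len(terms)
		}
		// Build only the postings for this batch.
		batch := make([]bitmap, 0, end-start)
		for _, term := range terms[start:end] {
			batch = append(batch, load(term))
		}
		// Collapse the batch into the running result; the batch is
		// then unreferenced and can be reclaimed.
		for _, b := range batch {
			orInto(result, b)
		}
	}
	return result
}

func main() {
	postings := map[string]bitmap{
		"apple":  {1: {}, 4: {}},
		"banana": {2: {}},
		"cherry": {4: {}, 7: {}},
	}
	load := func(term string) bitmap { return postings[term] }
	got := batchedUnion([]string{"apple", "banana", "cherry"}, 2, load)
	fmt.Println(len(got))
}
```

The point of the sketch is only the shape of the loop: the number of term postings alive at once is bounded by the batch size, regardless of how many terms there are.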

UnadornedPostingsIteratorBitmap was refactored to also
implement OptimizablePostingsIterator. This allows us
to keep the in-progress final iterator in each batch,
simplifying the logic.

A new optimization mode "disjunction:unadorned-force" was
introduced. It behaves exactly the same as
"disjunction:unadorned", except that it always performs the
optimization without regard for the cardinality of the
underlying iterators.

mschoch · 2020-06-30T14:33:28Z

I dislike all the duplicated code, but working with []string and [][]byte up and down the stack is a preexisting condition.

sreekanth-cb · 2020-07-06T07:34:41Z

search/searcher/search_multi_term.go

+		}
+		batch, err := makeBatchSearchers(indexReader, batchTerms, field, boost, options)
+		if err != nil {
+			return nil, err


One doubt here: on any subsequent iteration of this loop we would have a non-nil finalSearcher, which would remain unclosed if we return on line 101?

Yeah, no quick fix here, I have to go back and review how closing is handled when optimizing.

OK, I have reviewed the way the existing optimization code works and concluded that it never closes the searchers it optimizes. The reason (as I understand it) is that these searchers already existed and were essentially owned by the searcher that was invoking the optimization (disjunction or conjunction). By "owned" I mean that they are held in the member variable "searchers" of the struct, and they are cleaned up by the regular Close() call on those parent searchers.

However, our use of the optimization in MultiTermSearch is somewhat different. The searchers don't yet exist (we only have a slice of terms), so we create them ourselves. Additionally, in our case, we know they are no longer needed after the optimization has been attempted (either we got an error or failed to optimize and are returning, or we have a single new searcher that is the only one we need to keep around).

So, with that in mind, I have changed the code to always invoke cleanup() after optimizing, which should close all searchers in the batch (both the term searchers and the optimized searcher from the previous round, if any).
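
A rough sketch of the "always clean up the batch" pattern described above, under stated assumptions: `searcher`, `optimizeBatch`, and `reduceBatch` are hypothetical stand-ins, not the types or functions in search_multi_term.go, and the `fail` flag exists only to simulate an optimization error.

```go
package main

import (
	"errors"
	"fmt"
)

// searcher is a hypothetical stand-in for a searcher that must be closed.
type searcher struct {
	name   string
	closed bool
}

// Close marks the searcher as closed.
func (s *searcher) Close() error {
	s.closed = true
	return nil
}

// optimizeBatch is a hypothetical stand-in for the optimization step.
// It either fails or returns one new searcher replacing the batch.
func optimizeBatch(batch []*searcher, fail bool) (*searcher, error) {
	if fail {
		return nil, errors.New("optimization failed")
	}
	return &searcher{name: "optimized"}, nil
}

// reduceBatch folds the previous round's searcher (if any) into the
// batch, attempts the optimization, and closes every input searcher
// whether or not the optimization succeeded.
func reduceBatch(prev *searcher, batch []*searcher, fail bool) (*searcher, error) {
	if prev != nil {
		batch = append(batch, prev)
	}
	cleanup := func() {
		for _, s := range batch {
			_ = s.Close()
		}
	}
	// The inputs are never needed after the attempt, so close them
	// on both the success path and the error path.
	defer cleanup()
	return optimizeBatch(batch, fail)
}

func main() {
	prev := &searcher{name: "previous-final"}
	term := &searcher{name: "term"}
	_, err := reduceBatch(prev, []*searcher{term}, true)
	fmt.Println(err != nil, prev.closed, term.closed)
}
```

The key difference from the existing optimization callers is ownership: here the batch function created its inputs, so it is also the one responsible for closing them.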

mschoch · 2020-07-22T18:13:50Z

Action item for @mschoch, review if UnadornedPostingsIterator1Hit should be enhanced to satisfy OptimizablePostingsIterator as well (probably).

mschoch · 2020-07-23T12:56:48Z

It turns out it isn't possible for UnadornedPostingsIterator1Hit to implement OptimizablePostingsIterator, as it cannot represent more than 1 hit (as required by ReplaceActualBitmap). This actually exposes a failure mode for this optimization and more work is required.

mschoch · 2020-07-23T16:29:42Z

OK, I have now concluded that it isn't an actual problem, but I'd like to include a description of the original concern, and why I've concluded it isn't an issue.

Concern:

Since UnadornedPostingsIterator1Hit doesn't implement OptimizablePostingsIterator, while the optimizeCompositeSearcher attempts to combine N searchers down into 1 searcher, it could fail because one of the underlying Postings iterators from a previous round of optimization was using UnadornedPostingsIterator1Hit, thus causing the attempted optimization to fail.

Why I no longer think this is a valid concern:

The type of optimization we're performing, "disjunction:unadorned-force", never creates instances of UnadornedPostingsIterator1Hit. It always uses UnadornedPostingsIteratorBitmap here: https://github.com/blevesearch/bleve/blob/master/index/scorch/optimize.go#L392

Further, since the MultiTermSearcher creates all the child searchers directly from a list of terms, there is no possibility of the result of any other optimization being a part of this optimization.
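
To make the original concern concrete, here is a simplified sketch. It assumes a reduced, hypothetical version of the interface (the real OptimizablePostingsIterator has a different method set and works on bitmaps rather than slices); `optimizable`, `bitmapIterator`, `oneHitIterator`, and `canCombine` are illustrative names only.

```go
package main

import "fmt"

// optimizable is a hypothetical, reduced stand-in for an interface
// that lets an optimizer swap in a combined set of hits.
type optimizable interface {
	ReplaceActualBitmap(docs []uint64)
}

// bitmapIterator can hold any number of hits, so it can accept a
// replacement set.
type bitmapIterator struct {
	docs []uint64
}

func (it *bitmapIterator) ReplaceActualBitmap(docs []uint64) {
	it.docs = docs
}

// oneHitIterator holds exactly one hit. It deliberately has no
// ReplaceActualBitmap method, because it cannot represent more than
// one document.
type oneHitIterator struct {
	doc uint64
}

// canCombine reports whether every iterator satisfies optimizable.
// A single non-optimizable iterator makes the whole batch ineligible.
func canCombine(iterators []interface{}) bool {
	for _, it := range iterators {
		if _, ok := it.(optimizable); !ok {
			return false
		}
	}
	return true
}

func main() {
	onlyBitmaps := []interface{}{&bitmapIterator{}, &bitmapIterator{}}
	mixed := []interface{}{&bitmapIterator{}, &oneHitIterator{doc: 7}}
	fmt.Println(canCombine(onlyBitmaps), canCombine(mixed))
}
```

In this sketch the mixed batch is the failure mode that was originally feared; the argument above is that the forced optimization only ever produces the bitmap-backed kind, so a later batch never contains a one-hit iterator from an earlier round.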

mschoch · 2020-07-23T16:30:15Z

Please review my last comment @sreekanth-cb

sreekanth-cb · 2020-07-27T12:32:36Z

@mschoch, yes, these comments make sense.
Does it make sense to have a unit test that exercises a single finalSearcher coming out of another set of finalSearchers built at previous levels?

mschoch · 2020-08-06T21:05:22Z

@sreekanth-cb based on your feedback:

1. You asked that the test perform multiple rounds (batches) of optimization to verify that each time we reduce to a single searcher, and that the logic is sound for multiple iterations. It turns out the test was already doing 2 batches. There are 4 term searchers, and the max clauses is adjusted to 2, thus requiring 2 loop iterations.
2. I added code to assert that the expected number of searchers is started (6: the 4 original term searchers, one optimized searcher from the first batch, and then the final searcher resulting from the second batch).
3. I added code to assert that all started searchers were closed.

Item 3 revealed failures for 2 reasons. First, the final searcher was not closed in the unit test (fixed). Second, optimize.go did not bump stats when creating new optimized term searchers (fixed).
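
The arithmetic behind the expected count of 6 can be sketched as follows. This is an illustration only, assuming hypothetical `stats`, `searcher`, and `reduce` helpers rather than the real test code or stats fields.

```go
package main

import "fmt"

// stats counts started and closed searchers, loosely mirroring what
// the test asserts on. The field names are hypothetical.
type stats struct {
	started int
	closed  int
}

// searcher is a hypothetical stand-in that reports to stats.
type searcher struct {
	st *stats
}

// newSearcher creates a searcher and bumps the started count.
func newSearcher(st *stats) *searcher {
	st.started++
	return &searcher{st: st}
}

// Close bumps the closed count.
func (s *searcher) Close() {
	s.st.closed++
}

// reduce simulates batched optimization over numTerms terms. Each
// round starts the batch's term searchers, starts one optimized
// searcher to replace them (plus the previous round's result), and
// closes everything that was replaced.
func reduce(st *stats, numTerms, batchSize int) *searcher {
	if batchSize < 1 {
		batchSize = 1
	}
	var final *searcher
	for start := 0; start < numTerms; start += batchSize {
		end := start + batchSize
		if end > numTerms {
			end = numTerms
		}
		batch := make([]*searcher, 0, end-start+1)
		for i := start; i < end; i++ {
			batch = append(batch, newSearcher(st))
		}
		if final != nil {
			batch = append(batch, final)
		}
		final = newSearcher(st) // the optimized searcher for this round
		for _, s := range batch {
			s.Close()
		}
	}
	return final
}

func main() {
	st := &stats{}
	final := reduce(st, 4, 2)
	final.Close() // the caller owns and closes the final searcher
	fmt.Println(st.started, st.closed)
}
```

With 4 terms and a batch size of 2, each of the two rounds starts 2 term searchers and 1 optimized searcher, which gives the 6 started searchers; the final one is only closed once the caller closes it, which matches the missing close found in the test.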

mschoch · 2020-08-06T21:08:22Z

@sreekanth-cb ping for re-review as github isn't showing that option.

sreekanth-cb · 2020-08-10T11:16:22Z

👍

…vesearch#1426) This is a backport of: blevesearch@1f15e1d.

mschoch requested review from steveyen, abhinavdangeti and sreekanth-cb June 30, 2020 14:32

sreekanth-cb reviewed Jul 6, 2020

mschoch added 3 commits July 6, 2020 08:54

address review comments, improve logic structure

3260dec

properly cleanup searchers in batch loop

6a747d2

fix cleanup in other path as well

cef78d9

mschoch requested a review from sreekanth-cb July 7, 2020 14:58

sreekanth-cb reviewed Jul 13, 2020

simplify logic in Bytes method as well

7066aad

mschoch requested a review from sreekanth-cb July 13, 2020 17:45

sreekanth-cb approved these changes Jul 14, 2020

mschoch added the do not merge label Jul 23, 2020

mschoch removed the do not merge label Jul 23, 2020

mschoch requested a review from sreekanth-cb July 23, 2020 16:30

update test based on review feedback

b65b6f0

We now assert that the expected number of searchers are started and that all started searchers are closed. The optimization code was adjusted to correctly increment stats, and one missing close was added to the test case itself.

sreekanth-cb approved these changes Aug 10, 2020

mschoch merged commit 1f15e1d into master Aug 13, 2020

mschoch deleted the remove-max-clause-limitation-when-multi-term-search-optimizable branch August 13, 2020 15:23

mschoch added this to the 1.0.10 milestone Aug 23, 2020
