Skip to content
This repository has been archived by the owner on Jan 2, 2025. It is now read-only.

Query planer doesn't work for multiple alternations in a row #1267

Open
hsqStephenZhang opened this issue May 7, 2024 · 1 comment
Open

Comments

@hsqStephenZhang
Copy link

High level details

I added a unittest of plan("[ab][cd][ef]"), which is a query parse testcase in google's code search project(I know it's pretty old), which is expected to expand into "ace, acf, ... bdf", but the test output failed to inline the query strings

Current situation

  • how it's insufficient: After reading the code, I found that current optimizer won't handle the scenario that has over 3 alternations in a row. I'm not sure if it's in anticipation
  • motivations: I'm learning code search techs and comparing google's code search engine's with bloop, the boolean query builder seems to have some diffs, so I'm figuring it out.
  • why we need to change this: To make the query optimization more sufficient and cover more cases.

Proposal

I've implemented some changes, and have passed the test( plan("(a|b)(c|d)(e|f)(g|h)")) added by my self.

in server/bleep/src/query/planner.rs, The Fragment:::add need to handle the following match arm:

             // TODO: do we need to handle cases where the children are not all literals?
            (Fragment::Literal(lit), Fragment::Dense(Op::Or, mut rhs)) => {
                if rhs.iter().all(|x| matches!(x, Fragment::Literal(_))) {
                    for x in &mut rhs {
                        if let Fragment::Literal(s) = x {
                            *s = lit.clone() + s;
                        }
                    }
                    Fragment::Dense(Op::Or, rhs)
                } else {
                    Fragment::Dense(
                        Op::And,
                        vec![Fragment::Literal(lit), Fragment::Dense(Op::Or, rhs)],
                    )
                }
            }

            // TODO: avoid explosion by adding some constraints
            // TODO: do we need to handle cases where the children are not all literals?
            (Fragment::Dense(Op::Or, lhs), Fragment::Dense(Op::Or, rhs)) => {
                if lhs.iter().all(|x| matches!(x, Fragment::Literal(_)))
                    && lhs.len() <= 100
                    && rhs.iter().all(|x| matches!(x, Fragment::Literal(_)))
                    && rhs.len() <= 100
                {
                    let mut cross = Vec::with_capacity(lhs.len() * rhs.len());
                    for x in &lhs {
                        for y in &rhs {
                            let x = x.as_literal().unwrap();
                            let y = y.as_literal().unwrap();
                            cross.push(Fragment::Literal(x.to_owned() + y));
                        }
                    }
                    Fragment::Dense(Op::Or, cross)
                } else {
                    Fragment::Dense(
                        Op::And,
                        vec![Fragment::Dense(Op::Or, lhs), Fragment::Dense(Op::Or, rhs)],
                    )
                }
            }

and the function optimize::run should call flattern_or after the calling of inline

Next steps
Are there any next steps after we implement these changes?

@hsqStephenZhang
Copy link
Author

hsqStephenZhang commented May 7, 2024

Given the content of 'acgf', the regex query "[ab][cd][ef]" will recall this doc, and it's unexpected, since this regex query has a determined set of matched trigrams.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant