Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement hash join executor support #494

Merged
merged 3 commits into from
Mar 17, 2022
Merged

Implement hash join executor support #494

merged 3 commits into from
Mar 17, 2022

Conversation

panarch
Copy link
Member

@panarch panarch commented Mar 12, 2022

Implement hash join executor support which can boost execution speed of some join queries with key = value shape constraints.
Hash join executor loads all data to the memory by building hash map based on evaluating pre-planned key expr.
It can dramatically reduce the overhead of accessing storage and join constraint evaluation time.

  • Add JoinExecutor field to ast::Join.
    key_expr: It used as key of HashMap, evaluation is done when executor builds the map.
    value_expr: value of HashMap, evaluated in the loop of join execution.
    where_clause: filtering expr which can reduce the size of the HashMap.

  • Add join planner module.
    plan/mock.rs - MockStorage impl which is only for testing planners.
    plan/schema.rs - Analyze AST and returns all related table schema list.
    plan/evaluable.rs - Check whether the AST (query) is evaluable of not.
    plan/join.rs - Join Planner, currently the only role is to find HASH JOIN.

plan/index now does not require async, it only finds indexes using pre-fetched schema info
* Add JoinExecutor field to ast::Join.
key_expr: It used as key of HashMap, evaluation is done when executor builds the map.
value_expr: value of HashMap, evaluated in the loop of join execution.
where_clause: filtering expr which can reduce the size of the HashMap.

* Add join planner module.
plan/mock.rs - MockStorage impl which is only for testing planners.
plan/schema.rs - Analyze AST and returns all related table schema list.
plan/evaluable.rs - Check whether the AST (query) is evaluable of not.
plan/join.rs - Join Planner, currently the only role is to find HASH JOIN.
@panarch panarch added the enhancement New feature or request label Mar 12, 2022
@panarch panarch requested review from ever0de and 24seconds March 12, 2022 07:18
@panarch panarch self-assigned this Mar 12, 2022
@codecov-commenter
Copy link

codecov-commenter commented Mar 12, 2022

Codecov Report

Merging #494 (8df33b6) into main (c9d15ea) will increase coverage by 0.60%.
The diff coverage is 90.47%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #494      +/-   ##
==========================================
+ Coverage   91.63%   92.23%   +0.60%     
==========================================
  Files         160      173      +13     
  Lines        9393    10745    +1352     
==========================================
+ Hits         8607     9911    +1304     
- Misses        786      834      +48     
Impacted Files Coverage Δ
core/src/translate/mod.rs 100.00% <ø> (ø)
core/src/parse_sql.rs 70.00% <66.66%> (-1.43%) ⬇️
core/src/executor/join/mod.rs 78.04% <78.04%> (ø)
core/src/plan/expr/mod.rs 84.07% <84.07%> (ø)
core/src/plan/schema.rs 85.20% <85.20%> (ø)
core/src/plan/join.rs 90.65% <90.65%> (ø)
core/src/executor/join/hash_key.rs 96.87% <96.87%> (ø)
core/src/plan/evaluable.rs 98.03% <98.03%> (ø)
core/src/plan/context.rs 100.00% <100.00%> (ø)
core/src/plan/expr/aggregate.rs 100.00% <100.00%> (ø)
... and 16 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c9d15ea...8df33b6. Read the comment docs.

@panarch panarch merged commit 863e5a9 into main Mar 17, 2022
@panarch panarch deleted the hash-join-support branch March 17, 2022 11:36
Copy link
Member

@ever0de ever0de left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I leave a review (nit-picking) that has been merged.

Timestamp(NaiveDateTime),
Time(NaiveTime),
Interval(Interval),
Uuid(u128),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using the Uuid type?
Is there a reason you specified u128?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is simple. Value enum also contains Uuid with u128 type.
UUID can be described as [u8; 16] which is identical to u128.

Comment on lines +63 to +64
.map(move |(i, join_clause)| {
let join_columns = Rc::clone(&self.join_columns[i]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Clear if present under Confirm later)
You might need to check the lengths of join_columns and join_clauses.

self,
rows: impl Stream<Item = Result<BlendContext<'a>>> + 'a,
) -> Result<Joined<'a>> {
let init_rows: Joined<'a> = Box::pin(rows.map(|row| row.map(Rc::new)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let init_rows: Joined<'a> = Box::pin(rows.map(|row| row.map(Rc::new)));
let init_rows: Joined = Box::pin(rows.map(|row| row.map(Rc::new)));

When applying a type alias with a lifetime as the default, you do not need to specify the lifetime.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know that lifetime in here can be omitted, thanks!

Comment on lines +59 to +67
let joins = self
.join_clauses
.iter()
.enumerate()
.map(move |(i, join_clause)| {
let join_columns = Rc::clone(&self.join_columns[i]);

Ok::<_, Error>((join_clause, join_columns))
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about code like this:

Suggested change
let joins = self
.join_clauses
.iter()
.enumerate()
.map(move |(i, join_clause)| {
let join_columns = Rc::clone(&self.join_columns[i]);
Ok::<_, Error>((join_clause, join_columns))
});
let joins = self.join_clauses.iter().zip(self.join_columns.iter()).map(
|(join_clasue, join_columns)| {
let join_columns = Rc::clone(join_columns);
Ok::<_, Error>((join_clasue, join_columns))
},
);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I wasn't aware of this. zip looks much better.
Thanks a lot. I already merged this so I'll make a new branch to apply this, thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I cleaned a bit more on it.

..
let joins = self                                              
    .join_clauses                                             
    .iter()                                                   
    .zip(self.join_columns.iter().map(Rc::clone));            
                                                              
stream::iter(joins)                                           
    .map(Ok)                                                  
    .try_fold(init_rows, |rows, (join_clause, join_columns)| {
..

Comment on lines +30 to +37
PlanExpr::TwoExprs(expr, expr2) => {
check_expr(context.as_ref().map(Rc::clone), expr) && check_expr(context, expr2)
}
PlanExpr::ThreeExprs(expr, expr2, expr3) => {
check_expr(context.as_ref().map(Rc::clone), expr)
&& check_expr(context.as_ref().map(Rc::clone), expr2)
&& check_expr(context, expr3)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason not to integrate with MultiExprs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MultiExprs always uses Vec.
This enums are for avoiding unnecessary heap uses, to use less boxes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants