-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-14785][SQL] Support correlated scalar subqueries #12822
Conversation
Test build #57478 has finished for PR 12822 at commit
|
Test build #57485 has finished for PR 12822 at commit
|
query match { | ||
case a: Aggregate => checkAggregate(a) | ||
case Project(_, a: Aggregate) => checkAggregate(a) | ||
case fail => failAnalysis(s"Correlated scalar subqueries must be Aggregated: $fail") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can it have an Filter on top of Aggregate (HAVING clause)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure I'll add it.
LGTM |
Test build #57532 has finished for PR 12822 at commit
|
@hvanhovell When I tested this patch with TPCDS Q32 and Q92, the optimizer became not stable, it will reach 100 iterations, and the logical plan become huge. Could you fix it before merging? |
@davies something is up with the optimizer. Working on it. |
# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
@davies the TPCDS queries should work now. Could you take another look? |
@hvanhovell They works well now. Could you also update Filter to not create constraint from predicate that has correlated subquery? |
@@ -109,7 +109,7 @@ case class Filter(condition: Expression, child: LogicalPlan) | |||
|
|||
override protected def validConstraints: Set[Expression] = { | |||
val predicates = splitConjunctivePredicates(condition) | |||
.filterNot(PredicateSubquery.hasPredicateSubquery) | |||
.filterNot(SubqueryExpression.hasCorrelatedSubquery) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davies I changed the filter to prevent any correlated subquery from being propagated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see, I missed this one from latest changes.
LGTM, Will merge this one once it pass the tests. |
Test build #57561 has finished for PR 12822 at commit
|
## What changes were proposed in this pull request? In this PR we add support for correlated scalar subqueries. An example of such a query is: ```SQL select * from tbl1 a where a.value > (select max(value) from tbl2 b where b.key = a.key) ``` The implementation adds the `RewriteCorrelatedScalarSubquery` rule to the Optimizer. This rule plans these subqueries using `LEFT OUTER` joins. It currently supports rewrites for `Project`, `Aggregate` & `Filter` logical plans. I could not find a well defined semantics for the use of scalar subqueries in an `Aggregate`. The current implementation currently evaluates the scalar subquery *before* aggregation. This means that you either have to make scalar subquery part of the grouping expression, or that you have to aggregate it further on. I am open to suggestions on this. The implementation currently forces the uniqueness of a scalar subquery by enforcing that it is aggregated and that the resulting column is wrapped in an `AggregateExpression`. ## How was this patch tested? Added tests to `SubquerySuite`. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12822 from hvanhovell/SPARK-14785.
What changes were proposed in this pull request?
In this PR we add support for correlated scalar subqueries. An example of such a query is:
The implementation adds the
RewriteCorrelatedScalarSubquery
rule to the Optimizer. This rule plans these subqueries usingLEFT OUTER
joins. It currently supports rewrites forProject
,Aggregate
&Filter
logical plans.I could not find a well defined semantics for the use of scalar subqueries in an
Aggregate
. The current implementation currently evaluates the scalar subquery before aggregation. This means that you either have to make scalar subquery part of the grouping expression, or that you have to aggregate it further on. I am open to suggestions on this.The implementation currently forces the uniqueness of a scalar subquery by enforcing that it is aggregated and that the resulting column is wrapped in an
AggregateExpression
.How was this patch tested?
Added tests to
SubquerySuite
.