Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-14785][SQL] Support correlated scalar subqueries #12822

Closed
wants to merge 7 commits into from

Conversation

hvanhovell
Copy link
Contributor

@hvanhovell hvanhovell commented May 1, 2016

What changes were proposed in this pull request?

In this PR we add support for correlated scalar subqueries. An example of such a query is:

select * from tbl1 a where a.value > (select max(value) from tbl2 b where b.key = a.key)  

The implementation adds the RewriteCorrelatedScalarSubquery rule to the Optimizer. This rule plans these subqueries using LEFT OUTER joins. It currently supports rewrites for Project, Aggregate & Filter logical plans.

I could not find a well defined semantics for the use of scalar subqueries in an Aggregate. The current implementation currently evaluates the scalar subquery before aggregation. This means that you either have to make scalar subquery part of the grouping expression, or that you have to aggregate it further on. I am open to suggestions on this.

The implementation currently forces the uniqueness of a scalar subquery by enforcing that it is aggregated and that the resulting column is wrapped in an AggregateExpression.

How was this patch tested?

Added tests to SubquerySuite.

@hvanhovell
Copy link
Contributor Author

cc @rxin @davies

@SparkQA
Copy link

SparkQA commented May 1, 2016

Test build #57478 has finished for PR 12822 at commit 1827075.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 1, 2016

Test build #57485 has finished for PR 12822 at commit d189424.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

query match {
case a: Aggregate => checkAggregate(a)
case Project(_, a: Aggregate) => checkAggregate(a)
case fail => failAnalysis(s"Correlated scalar subqueries must be Aggregated: $fail")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it have an Filter on top of Aggregate (HAVING clause)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure I'll add it.

@davies
Copy link
Contributor

davies commented May 2, 2016

LGTM

@SparkQA
Copy link

SparkQA commented May 2, 2016

Test build #57532 has finished for PR 12822 at commit 84fff35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Copy link
Contributor

davies commented May 2, 2016

@hvanhovell When I tested this patch with TPCDS Q32 and Q92, the optimizer became not stable, it will reach 100 iterations, and the logical plan become huge. Could you fix it before merging?

@hvanhovell
Copy link
Contributor Author

@davies something is up with the optimizer. Working on it.

hvanhovell added 3 commits May 2, 2016 22:07
# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
@hvanhovell
Copy link
Contributor Author

@davies the TPCDS queries should work now. Could you take another look?

@davies
Copy link
Contributor

davies commented May 2, 2016

@hvanhovell They works well now. Could you also update Filter to not create constraint from predicate that has correlated subquery?

@@ -109,7 +109,7 @@ case class Filter(condition: Expression, child: LogicalPlan)

override protected def validConstraints: Set[Expression] = {
val predicates = splitConjunctivePredicates(condition)
.filterNot(PredicateSubquery.hasPredicateSubquery)
.filterNot(SubqueryExpression.hasCorrelatedSubquery)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davies I changed the filter to prevent any correlated subquery from being propagated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, I missed this one from latest changes.

@davies
Copy link
Contributor

davies commented May 2, 2016

LGTM, Will merge this one once it pass the tests.

@SparkQA
Copy link

SparkQA commented May 2, 2016

Test build #57561 has finished for PR 12822 at commit 831eaa8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request May 2, 2016
## What changes were proposed in this pull request?
In this PR we add support for correlated scalar subqueries. An example of such a query is:
```SQL
select * from tbl1 a where a.value > (select max(value) from tbl2 b where b.key = a.key)
```
The implementation adds the `RewriteCorrelatedScalarSubquery` rule to the Optimizer. This rule plans these subqueries using `LEFT OUTER` joins. It currently supports rewrites for `Project`, `Aggregate` & `Filter` logical plans.

I could not find a well defined semantics for the use of scalar subqueries in an `Aggregate`. The current implementation currently evaluates the scalar subquery *before* aggregation. This means that you either have to make scalar subquery part of the grouping expression, or that you have to aggregate it further on. I am open to suggestions on this.

The implementation currently forces the uniqueness of a scalar subquery by enforcing that it is aggregated and that the resulting column is wrapped in an `AggregateExpression`.

## How was this patch tested?
Added tests to `SubquerySuite`.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12822 from hvanhovell/SPARK-14785.
@asfgit asfgit closed this in f362363 May 2, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants