[SPARK-14785][SQL] Support correlated scalar subqueries #12822

hvanhovell · 2016-05-01T11:32:20Z

What changes were proposed in this pull request?

In this PR we add support for correlated scalar subqueries. An example of such a query is:

select * from tbl1 a where a.value > (select max(value) from tbl2 b where b.key = a.key)

The implementation adds the RewriteCorrelatedScalarSubquery rule to the Optimizer. This rule plans these subqueries using LEFT OUTER joins. It currently supports rewrites for Project, Aggregate & Filter logical plans.
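To illustrate why a LEFT OUTER join can stand in for the correlated subquery in the example query, here is a minimal sketch using plain Scala collections rather than Catalyst plans. The `Row` type, the object name, and both function names are hypothetical and exist only for this illustration; the actual rule works on logical plans and is not reproduced here.

```scala
object CorrelatedScalarSubquerySketch {
  // Hypothetical row type standing in for both tbl1 and tbl2 from the example query.
  final case class Row(key: Int, value: Int)

  // Direct evaluation: re-run the subquery for every outer row.
  def naive(tbl1: Seq[Row], tbl2: Seq[Row]): Seq[Row] =
    tbl1.filter { a =>
      val matching = tbl2.filter(_.key == a.key).map(_.value)
      // max over an empty input is NULL in SQL, so the comparison is not true.
      matching.nonEmpty && a.value > matching.max
    }

  // Join-based evaluation: aggregate tbl2 once per key, left outer join, then filter.
  def viaLeftOuterJoin(tbl1: Seq[Row], tbl2: Seq[Row]): Seq[Row] = {
    val maxByKey: Map[Int, Int] =
      tbl2.groupBy(_.key).map { case (k, rows) => k -> rows.map(_.value).max }
    tbl1
      .map(a => (a, maxByKey.get(a.key))) // None models the NULL-padded side of the join
      .collect { case (a, Some(m)) if a.value > m => a }
  }

  def main(args: Array[String]): Unit = {
    val tbl1 = Seq(Row(1, 10), Row(2, 5), Row(3, 7))
    val tbl2 = Seq(Row(1, 3), Row(1, 8), Row(2, 9))
    println(naive(tbl1, tbl2))
    println(viaLeftOuterJoin(tbl1, tbl2))
  }
}
```

Because the subquery side is aggregated by the correlation key, each outer row matches at most one joined row, so the join cannot duplicate outer rows.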

I could not find well-defined semantics for the use of scalar subqueries in an Aggregate. The current implementation evaluates the scalar subquery before aggregation. This means that you either have to make the scalar subquery part of the grouping expressions or aggregate it further on. I am open to suggestions on this.

The implementation currently guarantees that a scalar subquery yields a single value per outer row by requiring that it is aggregated and that the resulting column is wrapped in an AggregateExpression.

How was this patch tested?

Added tests to SubquerySuite.

hvanhovell · 2016-05-01T11:32:34Z

cc @rxin @davies

SparkQA · 2016-05-01T13:00:49Z

Test build #57478 has finished for PR 12822 at commit 1827075.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-05-01T22:33:52Z

Test build #57485 has finished for PR 12822 at commit d189424.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2016-05-02T07:36:42Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

+            query match {
+              case a: Aggregate => checkAggregate(a)
+              case Project(_, a: Aggregate) => checkAggregate(a)
+              case fail => failAnalysis(s"Correlated scalar subqueries must be Aggregated: $fail")


Can it have a Filter on top of the Aggregate (a HAVING clause)?

Sure, I'll add it.
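A rough sketch of how the shape check could be widened to accept a HAVING-style Filter above the Aggregate. This is a toy plan ADT, not Catalyst: the node classes below carry no expressions, and `isAggregated` is a hypothetical helper that only mirrors the pattern match in the diff above. The `Project` over `Filter` case is an assumption about which combinations one would want to allow.

```scala
object AggregatedSubqueryCheckSketch {
  // Toy logical plan nodes (hypothetical; real Catalyst nodes also carry expressions).
  sealed trait Plan
  final case class Relation(name: String) extends Plan
  final case class Aggregate(child: Plan) extends Plan
  final case class Project(child: Plan) extends Plan
  final case class Filter(child: Plan) extends Plan

  // True when the subquery plan has one of the accepted "aggregated" shapes.
  def isAggregated(query: Plan): Boolean = query match {
    case _: Aggregate                  => true
    case Project(_: Aggregate)         => true
    case Filter(_: Aggregate)          => true // HAVING clause on top of the aggregate
    case Project(Filter(_: Aggregate)) => true // assumed: projection over a HAVING clause
    case _                             => false
  }

  def main(args: Array[String]): Unit = {
    println(isAggregated(Filter(Aggregate(Relation("tbl2")))))
    println(isAggregated(Filter(Relation("tbl2"))))
  }
}
```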

davies · 2016-05-02T16:48:32Z

LGTM

SparkQA · 2016-05-02T17:07:23Z

Test build #57532 has finished for PR 12822 at commit 84fff35.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2016-05-02T17:29:45Z

@hvanhovell When I tested this patch with TPCDS Q32 and Q92, the optimizer became unstable: it reaches the 100-iteration limit and the logical plan becomes huge. Could you fix it before merging?
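For readers unfamiliar with the symptom, the following toy fixed-point loop shows how a rule that is not idempotent runs into an iteration cap while the plan keeps growing. It is a sketch under assumptions: plans are modelled as strings, and `runToFixedPoint`, `stable`, and `unstable` are hypothetical names, not Catalyst APIs.

```scala
object FixedPointSketch {
  // Apply `rule` until the plan stops changing or `maxIterations` is reached.
  // Returns the final plan together with the number of iterations used.
  def runToFixedPoint[A](plan: A, maxIterations: Int)(rule: A => A): (A, Int) = {
    var current = plan
    var iteration = 0
    var changed = true
    while (changed && iteration < maxIterations) {
      val next = rule(current)
      changed = next != current
      current = next
      iteration += 1
    }
    (current, iteration)
  }

  // Idempotent rule: wraps the plan once, then leaves it alone, so the loop converges.
  val stable: String => String = p => if (p.startsWith("Join(")) p else s"Join($p)"

  // Non-idempotent rule: matches its own output again, so the plan grows every pass.
  val unstable: String => String = p => s"Join($p)"

  def main(args: Array[String]): Unit = {
    val (_, stableIterations) = runToFixedPoint("Filter", 100)(stable)
    val (hugePlan, unstableIterations) = runToFixedPoint("Filter", 100)(unstable)
    println(s"stable rule: $stableIterations iterations")
    println(s"unstable rule: $unstableIterations iterations, plan length ${hugePlan.length}")
  }
}
```

A rewrite rule avoids this when its output no longer matches its own pattern, for example because the subquery expression has been replaced by a plain attribute reference after the join is introduced.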

hvanhovell · 2016-05-02T17:53:54Z

@davies something is up with the optimizer. Working on it.


hvanhovell · 2016-05-02T21:48:25Z

@davies the TPCDS queries should work now. Could you take another look?

davies · 2016-05-02T21:55:54Z

@hvanhovell They work well now. Could you also update Filter so that it does not create constraints from predicates that contain a correlated subquery?

hvanhovell · 2016-05-02T21:59:13Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

@@ -109,7 +109,7 @@ case class Filter(condition: Expression, child: LogicalPlan)

  override protected def validConstraints: Set[Expression] = {
    val predicates = splitConjunctivePredicates(condition)
-      .filterNot(PredicateSubquery.hasPredicateSubquery)
+      .filterNot(SubqueryExpression.hasCorrelatedSubquery)


@davies I changed the filter to prevent any correlated subquery from being propagated.

Oh, I see. I missed this one in the latest changes.
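The effect of the changed `filterNot` can be sketched with a toy expression tree. Everything below is hypothetical and simplified: the expression classes, `splitConjuncts`, and this `hasCorrelatedSubquery` only imitate the shape of the diff above, assuming that a subquery node knows whether it is correlated.

```scala
object ConstraintFilterSketch {
  // Toy expression nodes (hypothetical stand-ins for Catalyst expressions).
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class IsNotNull(child: Expr) extends Expr
  final case class GreaterThan(left: Expr, right: Expr) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr
  final case class ScalarSubquery(description: String, correlated: Boolean) extends Expr

  // Flatten a conjunction `a AND b AND c` into Seq(a, b, c).
  def splitConjuncts(e: Expr): Seq[Expr] = e match {
    case And(l, r) => splitConjuncts(l) ++ splitConjuncts(r)
    case other     => Seq(other)
  }

  // True when the expression tree contains a correlated subquery anywhere.
  def hasCorrelatedSubquery(e: Expr): Boolean = e match {
    case ScalarSubquery(_, correlated) => correlated
    case IsNotNull(c)                  => hasCorrelatedSubquery(c)
    case GreaterThan(l, r)             => hasCorrelatedSubquery(l) || hasCorrelatedSubquery(r)
    case And(l, r)                     => hasCorrelatedSubquery(l) || hasCorrelatedSubquery(r)
    case _: Attr                       => false
  }

  // Only conjuncts without a correlated subquery are kept as constraints.
  def validConstraints(condition: Expr): Set[Expr] =
    splitConjuncts(condition).filterNot(hasCorrelatedSubquery).toSet

  def main(args: Array[String]): Unit = {
    val condition = And(
      IsNotNull(Attr("a.key")),
      GreaterThan(Attr("a.value"), ScalarSubquery("max(b.value)", correlated = true)))
    println(validConstraints(condition))
  }
}
```

Dropping such predicates keeps an expression that refers to outer attributes from being propagated to other operators as a constraint.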

davies · 2016-05-02T22:07:30Z

LGTM. Will merge this one once it passes the tests.

SparkQA · 2016-05-02T23:15:02Z

Test build #57561 has finished for PR 12822 at commit 831eaa8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12822 from hvanhovell/SPARK-14785.

Add correlated scalar subqueries.

1827075

hvanhovell mentioned this pull request May 1, 2016

[SPARK-14785][SQL] Support correlated scalar subqueries [WIP] #12815

Closed

Only allow aggregated correlated scalar subqueries.

d189424

davies reviewed May 2, 2016

hvanhovell added 2 commits May 2, 2016 14:43

Merge remote-tracking branch 'apache-github/master' into SPARK-14785

28e0878

Add more checks & tests.

84fff35

hvanhovell added 3 commits May 2, 2016 22:07

Add hasCorrelatedSubquery.

d9f1bc8

Merge remote-tracking branch 'apache-github/master' into SPARK-14785

0ae7dee

# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

Fix TPCDS Q32 & Q92

831eaa8

hvanhovell reviewed May 2, 2016

asfgit closed this in f362363 May 2, 2016

peter-toth mentioned this pull request Jun 21, 2020

[SPARK-29375][SPARK-28940][SPARK-32041][SQL] Whole plan exchange and subquery reuse #28885

Closed
