-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf(sql): speed up parallel sample by with fill #5143
base: master
Are you sure you want to change the base?
Conversation
Not really code you've directly touched, just stumbled on it via your tests. The following error message is rather cryptic. private void guardAgainstFromToWithKeyedSampleBy(boolean isFromTo) throws SqlException {
if (isFromTo) {
throw SqlException.$(0, "FROM-TO intervals are not supported for keyed SAMPLE BY queries");
}
} I have no idea what a keyed query is. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Thanks for the review! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @amunra
core/src/main/java/io/questdb/griffin/engine/groupby/FillRangeRecordCursorFactory.java
Outdated
Show resolved
Hide resolved
core/src/main/java/io/questdb/griffin/engine/groupby/FillRangeRecordCursorFactory.java
Outdated
Show resolved
Hide resolved
core/src/main/java/io/questdb/griffin/engine/groupby/FillRangeRecordCursorFactory.java
Outdated
Show resolved
Hide resolved
core/src/main/java/io/questdb/griffin/engine/groupby/FillRangeRecordCursorFactory.java
Outdated
Show resolved
Hide resolved
core/src/main/java/io/questdb/griffin/engine/groupby/FillRangeRecordCursorFactory.java
Show resolved
Hide resolved
[PR Coverage check]😍 pass : 28 / 29 (96.55%) file detail
|
core/src/main/java/io/questdb/griffin/engine/groupby/FillRangeRecordCursorFactory.java
Outdated
Show resolved
Hide resolved
core/src/main/java/io/questdb/griffin/engine/groupby/FillRangeRecordCursorFactory.java
Show resolved
Hide resolved
… timestamp i.e ascending sorted by timestamp. the result is that we generate faster sorts in some cases, and also sort smaller amounts of data (only unfilled) still an outstanding issue with set operations, where a sort is not being appropriately pushed down, causing the fill to be lost
# Conflicts: # core/src/main/java/io/questdb/griffin/model/QueryModel.java
WIP
Changelist
Parallel sample by:
UnsupportedOperationException
and a500
response code. Now it will return a useful message:select timestamp, count from trades sample by 1d from '2008-12-28' to '2009-01-05' FILL(1.0)
invalid fill value, cannot cast DOUBLE to LONG
Benchmarks
Table of 100m rows.
Schema - (TIMESTAMP, DOUBLE)
Data is over a 1 day period.
With 100m rows
Query:
Plan before
``` Sort keys: [timestamp] Fill Range stride: '1T' values: [null] Async Group By workers: 8 keys: [timestamp] values: [avg(price)] filter: null PageFrame Row forward scan Frame forward scan on: bench ````Plan after
Before:
timeout (60s)
After:
22.75s
With 10m rows:
Query:
Before plan:
``` Limit lo: -1 Sort keys: [timestamp] Fill Range stride: '1T' values: [null] Async Group By workers: 8 keys: [timestamp] values: [avg(price)] filter: null PageFrame Row forward scan Frame forward scan on: bench4 ```After plan:
``` Limit lo: -1 Fill Range stride: '1T' values: [null] Radix sort light keys: [timestamp] Async Group By workers: 8 keys: [timestamp] values: [avg(price)] filter: null PageFrame Row forward scan Frame forward scan on: bench4 ```Before: 13.4s, 12.51s, 14.4s
After: 1.9s, 1.43s, 1.33s
Queries which do not require the
sort
will be slower, for example:This is because in the new version, a sort will be generated, and in the old, it would not. However, this was at the cost of the fill not necessarily being correct or filling on the right timestamp.
Alternative
If we allowed group by to return a
timestampIndex
in this case, even though the result will not be sorted (i.e(timestamp, SCAN_DIRECTION_OTHER)
, then we could still use parallel fill in its original form. This requires relaxing the rule thattimestampIndex
meansASC
(designated) timestamp.Notes
This PR reworks the parallel sample by fill in an effort to improve performance and reliability. Previously, the fill would be generated after group by, and before order by. This meant we had to first exhaust the group by result, recording timestamps. Then we filled the rest of the timestamps.
Unfortunately, group by does not guarantee a designated timestamp, making it problematic to identify the correct column to fill. Furthermore, this would cause both filled and unfilled data would be subsequently sorted, so sparse datasets were as expensive to sort as dense datasets.
By moving it to after order by, we ensure that we have a designated timestamp to orient the fill. We also generate
radix sort
more frequently, which can be 2-10x faster than our tree-based sort. We also only have to sort the unfilled data, leading to a much quicker an dmore efficient sort.