Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(sql): speed up parallel sample by with fill #5143

Draft
wants to merge 23 commits into
base: master
Choose a base branch
from

Conversation

nwoolmer
Copy link
Contributor

@nwoolmer nwoolmer commented Nov 6, 2024

WIP

Changelist

Parallel sample by:

  • Parallel sample by:
    • now supports fill with multiple values
    • correctly uses all fill values instead of just the first
    • will now generate radix sorts more frequently, which are 2x-10x faster than our tree-based sort
    • will now only sort unfilled data, instead of sorting both filled and real data
  • a mismatch between column and fill value in from-to queries previously caused UnsupportedOperationException and a 500 response code. Now it will return a useful message:
    • Example query: select timestamp, count from trades sample by 1d from '2008-12-28' to '2009-01-05' FILL(1.0)
    • Error: invalid fill value, cannot cast DOUBLE to LONG

Benchmarks

Table of 100m rows.
Schema - (TIMESTAMP, DOUBLE)
Data is over a 1 day period.

With 100m rows

Query:

select timestamp, avg(price) from bench
sample by 1T fill(null)
Plan before ``` Sort keys: [timestamp] Fill Range stride: '1T' values: [null] Async Group By workers: 8 keys: [timestamp] values: [avg(price)] filter: null PageFrame Row forward scan Frame forward scan on: bench ````
Plan after
Fill Range
  stride: '1T'
  values: [null]
    Radix sort light
      keys: [timestamp]
        Async Group By workers: 8
          keys: [timestamp]
          values: [avg(price)]
          filter: null
            PageFrame
                Row forward scan
                Frame forward scan on: bench

Before: timeout (60s)
After: 22.75s

With 10m rows:

Query:

select timestamp, price from (
select timestamp, avg(price) as price from bench4
sample by 1T fill(null)
) limit -1
Before plan: ``` Limit lo: -1 Sort keys: [timestamp] Fill Range stride: '1T' values: [null] Async Group By workers: 8 keys: [timestamp] values: [avg(price)] filter: null PageFrame Row forward scan Frame forward scan on: bench4 ```
After plan: ``` Limit lo: -1 Fill Range stride: '1T' values: [null] Radix sort light keys: [timestamp] Async Group By workers: 8 keys: [timestamp] values: [avg(price)] filter: null PageFrame Row forward scan Frame forward scan on: bench4 ```

Before: 13.4s, 12.51s, 14.4s
After: 1.9s, 1.43s, 1.33s

Queries which do not require the sort will be slower, for example:

select max(price) from (
    select timestamp, avg(price) from bench
    sample by 1T fill(null)
)

This is because in the new version, a sort will be generated, and in the old, it would not. However, this was at the cost of the fill not necessarily being correct or filling on the right timestamp.

Alternative

If we allowed group by to return a timestampIndex in this case, even though the result will not be sorted (i.e (timestamp, SCAN_DIRECTION_OTHER), then we could still use parallel fill in its original form. This requires relaxing the rule that timestampIndex means ASC (designated) timestamp.

Notes

This PR reworks the parallel sample by fill in an effort to improve performance and reliability. Previously, the fill would be generated after group by, and before order by. This meant we had to first exhaust the group by result, recording timestamps. Then we filled the rest of the timestamps.

Unfortunately, group by does not guarantee a designated timestamp, making it problematic to identify the correct column to fill. Furthermore, this would cause both filled and unfilled data would be subsequently sorted, so sparse datasets were as expensive to sort as dense datasets.

By moving it to after order by, we ensure that we have a designated timestamp to orient the fill. We also generate radix sort more frequently, which can be 2-10x faster than our tree-based sort. We also only have to sort the unfilled data, leading to a much quicker an dmore efficient sort.

@nwoolmer nwoolmer added Bug Incorrect or unexpected behavior SQL Issues or changes relating to SQL execution labels Nov 6, 2024
@nwoolmer nwoolmer marked this pull request as draft November 6, 2024 13:42
@nwoolmer nwoolmer marked this pull request as ready for review November 6, 2024 13:43
@amunra amunra self-requested a review November 14, 2024 16:06
@amunra
Copy link
Contributor

amunra commented Nov 14, 2024

Not really code you've directly touched, just stumbled on it via your tests. The following error message is rather cryptic.

    private void guardAgainstFromToWithKeyedSampleBy(boolean isFromTo) throws SqlException {
        if (isFromTo) {
            throw SqlException.$(0, "FROM-TO intervals are not supported for keyed SAMPLE BY queries");
        }
    }

I have no idea what a keyed query is.

amunra
amunra previously approved these changes Nov 15, 2024
Copy link
Contributor

@amunra amunra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@nwoolmer
Copy link
Contributor Author

Thanks for the review!

Copy link
Member

@bluestreak01 bluestreak01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @amunra

@glasstiger
Copy link
Contributor

[PR Coverage check]

😍 pass : 28 / 29 (96.55%)

file detail

path covered line new line coverage
🔵 io/questdb/griffin/engine/groupby/FillRangeRecordCursorFactory.java 16 17 94.12%
🔵 io/questdb/griffin/SqlCodeGenerator.java 12 12 100.00%

@nwoolmer nwoolmer changed the title fix(sql): add better error message when sample by fill type is mismatched perf(sql): speed up parallel sample by with fill Jan 7, 2025
… timestamp i.e ascending sorted by timestamp. the result is that we generate faster sorts in some cases, and also sort smaller amounts of data (only unfilled)

still an outstanding issue with set operations, where a sort is not being appropriately pushed down, causing the fill to be lost
@bluestreak01 bluestreak01 added the DO NOT MERGE These changes should not be merged to main branch label Jan 13, 2025
@bluestreak01 bluestreak01 marked this pull request as draft January 13, 2025 12:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Incorrect or unexpected behavior DO NOT MERGE These changes should not be merged to main branch SQL Issues or changes relating to SQL execution
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants