Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify the semantics of "query timeouts" #13156

Open
macobo opened this issue Dec 6, 2022 · 7 comments
Open

Clarify the semantics of "query timeouts" #13156

macobo opened this issue Dec 6, 2022 · 7 comments
Labels
concept Ideas that need some shaping up still performance Has to do with performance. For PRs, runs the clickhouse query performance suite

Comments

@macobo
Copy link
Contributor

macobo commented Dec 6, 2022

Background context

When talking about query performance, a topic that frequently comes up is "timeouts". However I argue that the current system of timeouts is hap-hazard and ill-specified.

We should clarify what we want out of "timeouts" in our product and how to action them.

Notes on the existing system

  • We've hard-coded max_execution_time=180 into our settings on US and EU cloud: 1, 2.
    • This setting means that as soon as ClickHouse thinks that a query will take longer than 180s it will raise an error. Note that this estimation may (and has historically) misfired.
  • On self-hosted instances this same setting is unset.
  • The same setting is applied for both cache refreshes in celery and user interactions equally.
  • In the backend we have a 15 second timer for every clickhouse SQL query (not the same as an API endpoint load time) emitting a metric when query takes longer than 15 seconds (but not if it's killed earlier).
  • We've historically had issues in load balancer/gateway configurations: If query from a user takes longer than 60 seconds, the API call fails, causing users to see "something went wrong" error. This also happens on self-hosted instances.
  • In the frontend, after 15 seconds after an API call to insights is started, we show an animation for "things are a little slow"
  • In the frontend, if after 15 seconds the query fails sometimes we show "There was an error completing this query". In theory there's also a "Your query took too long to complete" response if response code was not 500, but I didn't manage to hit this case locally - I believe it depends on the gateway behavior described above.

Which of these is a "timeout"? Should we throw all of this out and rethink it all from scratch? I propose so.

Additional context

Not proposing a solution now - this is enough of a mess.

cc @lharries and @pauldambra - we need to clarify all of these semantics before we settle on any OKRs.

@macobo macobo added concept Ideas that need some shaping up still performance Has to do with performance. For PRs, runs the clickhouse query performance suite labels Dec 6, 2022
@lharries
Copy link
Contributor

lharries commented Dec 6, 2022

Good question - I think we should define a timeout from the user side as this gives us maximum flexibility for the solution

So explicitly: a timeout is when a user starts a query but doesn't get a result back due to the query being too slow or clickhouse cancelling the query as it believes it will be too slow. The user very likely ends up seeing "There was an error completing this query"

If we don't already differentiate whether a query failed because it was too slow vs other potential causes of failure we'd likely want to track both, but for this OKR only focus on the failures due to things being too slow

@lharries
Copy link
Contributor

lharries commented Dec 6, 2022

On a side note: A quick win I've seen is if a user sees "There was an error completing this query" we should likely guide them on how to reduce the query complexity to speed it up

In reproducing the timeout, I was able to "fix it" by shorting the time from "All time" to "past 180 days" - it's likely many users wouldn't consider e.g. they aren't familiar with SQL or aren't thinking about the mechanics under the hood

@lharries
Copy link
Contributor

lharries commented Dec 7, 2022

@macobo how are we doing with deciding on this OKR? Would be great to get this finalized by EOD tomorrow (Thursday) if feasible

@lharries
Copy link
Contributor

lharries commented Dec 8, 2022

Discussed sync and the conclusion: "Aligned that timeouts are very important and a cause of frustration - but we don't need to have this at an OKR level as will take a sprint to properly set up all the analytics for it"

@lharries
Copy link
Contributor

lharries commented Dec 8, 2022

@macobo up to you if you want to keep this issue open or close it

@macobo
Copy link
Contributor Author

macobo commented Dec 8, 2022

I'll keep this open - we should definitely fix the problems listed here in the upcoming sprints. I'd love to get help from someone more UI/UX focussed than I am though on it.

@macobo
Copy link
Contributor Author

macobo commented Jan 9, 2023

Raw notes from chat with Paul:

Proposed query timeout semantics (cloud):
- We return a special http error code to frontend if clickhouse timeout occurs
    524 A Timeout Occurred similar to cloudflare
- We set `timeout_before_checking_execution_speed`
- For user-initiated queries, we:
    - timeout at 180s
    - run them with default (max) no of cores
    - We set `timeout_before_checking_execution_speed` to 30 seconds on all queries
        - Clickhouse can mis-estimate query execution times
- For cache refreshes, we:
    - timeout at 9 minutes
    - run the queries using 1/3rd of the max no of cores
    - Set `timeout_before_checking_execution_speed` to 90s
- On the frontend:
    - React to both 524, 408, and 504 errors with timeout screens. Track count of each
    - Show slow query warning after 15 seconds (if haven't received a timeout screen yet)
On self-hosted:
- Make all of this configurable.
- Default to no automatic timeouts, max thread configurations etc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
concept Ideas that need some shaping up still performance Has to do with performance. For PRs, runs the clickhouse query performance suite
Projects
None yet
Development

No branches or pull requests

2 participants