Clarify the semantics of "query timeouts" #13156

macobo · 2022-12-06T09:38:06Z

Background context

When talking about query performance, a topic that frequently comes up is "timeouts". However I argue that the current system of timeouts is hap-hazard and ill-specified.

We should clarify what we want out of "timeouts" in our product and how to action them.

Notes on the existing system

We've hard-coded max_execution_time=180 into our settings on US and EU cloud: 1, 2.
- This setting means that as soon as ClickHouse thinks that a query will take longer than 180s it will raise an error. Note that this estimation may (and has historically) misfired.
On self-hosted instances this same setting is unset.
The same setting is applied for both cache refreshes in celery and user interactions equally.
In the backend we have a 15 second timer for every clickhouse SQL query (not the same as an API endpoint load time) emitting a metric when query takes longer than 15 seconds (but not if it's killed earlier).
We've historically had issues in load balancer/gateway configurations: If query from a user takes longer than 60 seconds, the API call fails, causing users to see "something went wrong" error. This also happens on self-hosted instances.
In the frontend, after 15 seconds after an API call to insights is started, we show an animation for "things are a little slow"
In the frontend, if after 15 seconds the query fails sometimes we show "There was an error completing this query". In theory there's also a "Your query took too long to complete" response if response code was not 500, but I didn't manage to hit this case locally - I believe it depends on the gateway behavior described above.

Which of these is a "timeout"? Should we throw all of this out and rethink it all from scratch? I propose so.

Additional context

Not proposing a solution now - this is enough of a mess.

cc @lharries and @pauldambra - we need to clarify all of these semantics before we settle on any OKRs.

The text was updated successfully, but these errors were encountered:

lharries · 2022-12-06T15:30:45Z

Good question - I think we should define a timeout from the user side as this gives us maximum flexibility for the solution

So explicitly: a timeout is when a user starts a query but doesn't get a result back due to the query being too slow or clickhouse cancelling the query as it believes it will be too slow. The user very likely ends up seeing "There was an error completing this query"

If we don't already differentiate whether a query failed because it was too slow vs other potential causes of failure we'd likely want to track both, but for this OKR only focus on the failures due to things being too slow

lharries · 2022-12-06T15:31:41Z

On a side note: A quick win I've seen is if a user sees "There was an error completing this query" we should likely guide them on how to reduce the query complexity to speed it up

In reproducing the timeout, I was able to "fix it" by shorting the time from "All time" to "past 180 days" - it's likely many users wouldn't consider e.g. they aren't familiar with SQL or aren't thinking about the mechanics under the hood

lharries · 2022-12-07T22:29:46Z

@macobo how are we doing with deciding on this OKR? Would be great to get this finalized by EOD tomorrow (Thursday) if feasible

lharries · 2022-12-08T10:25:47Z

Discussed sync and the conclusion: "Aligned that timeouts are very important and a cause of frustration - but we don't need to have this at an OKR level as will take a sprint to properly set up all the analytics for it"

lharries · 2022-12-08T10:26:02Z

@macobo up to you if you want to keep this issue open or close it

macobo · 2022-12-08T10:46:23Z

I'll keep this open - we should definitely fix the problems listed here in the upcoming sprints. I'd love to get help from someone more UI/UX focussed than I am though on it.

macobo · 2023-01-09T13:07:37Z

Raw notes from chat with Paul:

Proposed query timeout semantics (cloud):
- We return a special http error code to frontend if clickhouse timeout occurs
    524 A Timeout Occurred similar to cloudflare
- We set `timeout_before_checking_execution_speed`
- For user-initiated queries, we:
    - timeout at 180s
    - run them with default (max) no of cores
    - We set `timeout_before_checking_execution_speed` to 30 seconds on all queries
        - Clickhouse can mis-estimate query execution times
- For cache refreshes, we:
    - timeout at 9 minutes
    - run the queries using 1/3rd of the max no of cores
    - Set `timeout_before_checking_execution_speed` to 90s
- On the frontend:
    - React to both 524, 408, and 504 errors with timeout screens. Track count of each
    - Show slow query warning after 15 seconds (if haven't received a timeout screen yet)
On self-hosted:
- Make all of this configurable.
- Default to no automatic timeouts, max thread configurations etc

macobo added concept Ideas that need some shaping up still performance Has to do with performance. For PRs, runs the clickhouse query performance suite labels Dec 6, 2022

macobo mentioned this issue Dec 6, 2022

2023 Q1 Product analytics PostHog/posthog.com#4747

Merged

4 tasks

macobo mentioned this issue Dec 14, 2022

Tune max_threads for clickhouse queries #13324

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarify the semantics of "query timeouts" #13156

Clarify the semantics of "query timeouts" #13156

macobo commented Dec 6, 2022 •

edited

Loading

lharries commented Dec 6, 2022

lharries commented Dec 6, 2022 •

edited

Loading

lharries commented Dec 7, 2022

lharries commented Dec 8, 2022

lharries commented Dec 8, 2022

macobo commented Dec 8, 2022

macobo commented Jan 9, 2023

Clarify the semantics of "query timeouts" #13156

Clarify the semantics of "query timeouts" #13156

Comments

macobo commented Dec 6, 2022 • edited Loading

Background context

Notes on the existing system

Additional context

lharries commented Dec 6, 2022

lharries commented Dec 6, 2022 • edited Loading

lharries commented Dec 7, 2022

lharries commented Dec 8, 2022

lharries commented Dec 8, 2022

macobo commented Dec 8, 2022

macobo commented Jan 9, 2023

macobo commented Dec 6, 2022 •

edited

Loading

lharries commented Dec 6, 2022 •

edited

Loading