Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

app/vmselect/promql: limit staleness detection for increase/increase_pure/delta #8052

Merged
merged 3 commits into from
Jan 16, 2025

Conversation

hagen1778
Copy link
Collaborator

doInternal has adaptive staleness detection mechanism. It is calculated using timestamp distance between samples in selected list of samples. It is dynamic because VM can store signals from many sources with different samples resolution. And while it works for most of cases, there are edge cases for rollup functions that are comparing values between windows: increase, increase_pure, delta.

The edge case 1.
There was a gap between series because of the missed scrape or two. In this case staleness will trigger and increase-like functions will assume the value they need to compare with is 0. In result, this could produce spikes for a flappy data - see #894 This problem was solved by introducing a realPrevValue field - 1f19c16. It stores the closest real sample value on selected interval and is used for comparison when samples start after the gap.

The edge case 2.
realPrevValue doesn't respect staleness interval. In result, even if gap between samples is huge (hours), the increase-like functions will not consider it as a new series that started from 0. See #8002.

Covering both edge cases is tricky, because realPrevValue has to respect and not respect the staleness interval in the same time. In other words, it should be able to ignore periodic missing gaps, but reset if the gap is too big. While "too big gap" can't be figured out empirically, I suggest using -search.maxStalenessInterval for this purpose. If -search.maxStalenessInterval is set to 0 (default), then realPrevValue ignores staleness interval. If -search.maxStalenessInterval is > 0, then realPrevValue respects it as a staleness interval.

Checklist

The following checks are mandatory:

…pure/delta

doInternal has adaptive staleness detection mechanism. It is calculated using timestamp distance
between samples in selected list of samples. It is dynamic because VM can store signals from
many sources with different samples resolution. And while it works for most of cases,
there are edge cases for rollup functions that are comparing values between windows:
increase, increase_pure, delta.

The edge case 1.
There was a gap between series because of the missed scrape or two. In this case staleness will trigger
and increase-like functions will assume the value they need to compare with is 0. In result, this
could produce spikes for a flappy data - see #894
This problem was solved by introducing a `realPrevValue` field - 1f19c16.
It stores the closest real sample value on selected interval and is used for comparison when samples start after the gap.

The edge case 2.
`realPrevValue` doesn't respect staleness interval. In result, even if gap between samples is huge (hours),
the increase-like functions will not consider it as a new series that started from 0.
See #8002.

Covering both edge cases is tricky, because `realPrevValue` has to respect and not respect the staleness interval in the same time.
In other words, it should be able to ignore periodic missing gaps, but reset if the gap is too big. While "too big gap" can't
be figured out empirically, I suggest using `-search.maxStalenessInterval` for this purpose.
If `-search.maxStalenessInterval` is set to 0 (default), then  `realPrevValue` ignores staleness interval.
If `-search.maxStalenessInterval` is > 0, then  `realPrevValue` respects it as a staleness interval.

This also covers #8045

Signed-off-by: hagen1778 <roman@victoriametrics.com>
@hagen1778 hagen1778 requested review from valyala and rtm0 January 16, 2025 08:57
Signed-off-by: hagen1778 <roman@victoriametrics.com>
testRowsEqual(t, gotValues, rc.Timestamps, valuesExpected, timestampsExpected)
})
// even if LookbackDelta < gap
t.Run("step>gapLookbackDelta<gap", func(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and in all other sub-tests: Please use the same sub-test names in both tests (TestRollupDeltaWithStaleness and TestRollupIncreasePureWithStaleness). Up to you which one to choose.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
Copy link
Collaborator

@valyala valyala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix!

@valyala valyala merged commit 7d2a676 into master Jan 16, 2025
10 checks passed
@valyala valyala deleted the promql-increase-staleness branch January 16, 2025 14:01
hagen1778 added a commit that referenced this pull request Jan 16, 2025
…pure/delta (#8052)

`doInternal` has adaptive staleness detection mechanism. It is
calculated using timestamp distance between samples in selected list of
samples. It is dynamic because VM can store signals from many sources
with different samples resolution. And while it works for most of cases,
there are edge cases for rollup functions that are comparing values
between windows: increase, increase_pure, delta.

The edge case 1.
There was a gap between series because of the missed scrape or two. In
this case staleness will trigger and increase-like functions will assume
the value they need to compare with is 0. In result, this could produce
spikes for a flappy data - see
#894 This
problem was solved by introducing a `realPrevValue` field -
1f19c16.
It stores the closest real sample value on selected interval and is used
for comparison when samples start after the gap.

The edge case 2.
`realPrevValue` doesn't respect staleness interval. In result, even if
gap between samples is huge (hours), the increase-like functions will
not consider it as a new series that started from 0. See
#8002.

Covering both edge cases is tricky, because `realPrevValue` has to
respect and not respect the staleness interval in the same time. In
other words, it should be able to ignore periodic missing gaps, but
reset if the gap is too big. While "too big gap" can't be figured out
empirically, I suggest using `-search.maxStalenessInterval` for this
purpose. If `-search.maxStalenessInterval` is set to 0 (default), then
`realPrevValue` ignores staleness interval. If
`-search.maxStalenessInterval` is > 0, then `realPrevValue` respects it
as a staleness interval.

### Checklist

The following checks are **mandatory**:

- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).

---------

Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit 7d2a676)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants