Description
Edit: duplicate of #961.
Requirement - what kind of business use case are you trying to solve?
The current behavior of clock skew adjustments is surprising and, why well-meant, not beneficial.
Problem - what in Jaeger blocks you from solving the requirement?
Excuse the pun, I'm aware of the existing issues.
The clock skew adjuster currently makes the following assumptions:
- Spans are from different hosts, with parent and child span having roughly the same duration.
- Clock Skew is something unknown to people using monitoring
- It is best to hide this fact and instead of relying on administrators to fix this and offer a best guess workaround as an alternative, it's best to make things look “nice”
Considering 1.:
The algorithm uses the not well documented ip
tag of the process to distinguish between hosts and assumes a missing tag indicates a different host.
People using OpenTracing or OpenCensus usually only know about the service name, so a missing ip
tag is expected, meaning the algorithm compensates nearly all spans. See
- Clock skew adjuster and spans from web/mobile #722
- Make clock skew adjustment transparent #961
- Adjuster fixes. #606
Then it compares durations, which makes only sense one assumes that spans roughly have the same size.
Most child spans would be shorter, so this is rarely the case. If a child span is a few nanoseconds longer I don't care, if it is really longer it should be flagged, but not shifted. It's wrong data and really hard to diagnose if it is processed afterwards.
Then the algorithm checks if the end time of the child span is after the end time of the parent. While technically wrong, it is pretty easy to close the child span a couple of nanoseconds too late - probably a non-issue considering tracing, and at most something that should be a warning instead of moving the span around.
Lastly it assumes that span duration somehow correlates to network latency.
Proposal - what do you suggest to solve the problem or improve the existing situation?
If clock skew is an issue, the span context should try to compensate.
Considering 2. and 3. it would IMHO be best to not mess with the data by default and offer beautification as an opt-in workaround, which is different than the currently solutions by turning it on by default and searching for a way to get rid of it.
Also, it might be useful to see clock skew by default, because it may affect other areas too.
Any open questions to address
I saw very few comments where this skew adjustent is actually wanted. Maybe some of the supporters can chime in?
Activity
jpkrohling commentedon Apr 5, 2019
+1
As someone looking for the root cause of a problem, how would you like this info to be presented to you?
From what I remember, this is particularly useful for mobile clients, where the clock might be really off. Perhaps clients where the clock might be off should be expected to send a hint within the span?
eikemeier commentedon Apr 7, 2019
In Java for example you would have all kinds of problems. So i.e. having a span in red would be fine. This should be rare, which means that I wouldn't probably care about nanosecond differences, especially spans that are closed slightly too late.
It's like having log entries (which usually have timestamps) and you see that B is happening before A, although impossible. Pointing this out is usually sufficient.
Yes, that would be “opt-in”. One could also add a “ping collector” command, which could give an idea about the clock difference and latency, assuming there is no jitter or other factors. I'm unsure if this is really worth the effort.
Still auto-fixing might be strange too, since you'll always run into the “B before A” case. And the algorithm does not seem to be very successful on mobile clients, see issue #722.
TBBle commentedon Feb 5, 2020
After jaegertracing/jaeger-operator#871, we now have a process tag
host.ip
added by Jaeger Operator-injected Agents, which would be a better input for the Clock Skew Adjustor. The process-localip
tag added by the Jaeger client is within a kubernetes container, and hence won't match other containers sharing a realtime clock, leading to Clock Skew Adjustor misbehaviour.So it might be beneficial to this improvement direction to make it possible to configure the tag used to identify "same host as parent".
eikemeier commentedon Feb 5, 2020
We keep getting comments like “I spent a few hours trying to figure out ...”. The point of this issue is that the default behaviour is unexpected, confusing and harmful.
If anyone wants adjustments, fine. But this should be opt-in and off by default. “Better” unexpected and a little less confusing adjustments are still harmful…
pavolloffay commentedon Feb 5, 2020
It seems like a controversial feature. We should consider making it configurable.
Would it also help people if we were labeling the spans which are candidates for clock adjusting?
TBBle commentedon Feb 5, 2020
I was definitely one of those "I spent a few hours" people. In my case, I did have Clock Skew, but lack of an
ip
tag (using the OpenCensus library) meant that instead of seeing the two hosts's spans shift, the whole pile shifted. #787 made this very hard to diagnose, as even the raw data was showing time-travel effects.I'd very-much like to see it turn off-able, but I don't feel strongly that it should be burned to the ground. And if it's not removed entirely, making it more useful for environments where
gethostinfo('0.0.0.0')
doesn't actually give a unique 1:1 mapping with a real-time clock, would be useful too.And probably lower-hanging fruit in terms of getting a PR up and delivered for disabling the feature entirely, or changing it to off-by-default? Or maybe not, I haven't really looked at the Jaeger codebase.
I also have a use-case for the Clock Skew Adjustor, since I have a down-the-track need to collect spans from web-based javascript pages, and even other random PC users. And experience suggests "random PC users" are awful at keeping their PC clocks accurate.
eikemeier commentedon Feb 5, 2020
Again, I do not vote for “burning to the ground”. It might help people, and thanks to the guys developing and improving it.
The point i'm trying to make is that a (changing) heuristic how to improve faulty data can be helpful, but should be off by default and opt-in, nothing more.
You have some machines with broken clocks which you can't fix - fine. The problem is that you don't know how broken they are. Instead of a simple skew it might be jumping - this happens.
So, for people who are sure it is a simple skew, but are not able to fix that - turn it on, it might help.
For everyone else: I might hide your problem or damage otherwise perfectly correct data. And this is what I'm seeing in most of the cases.
Also, as mentioned above, it destroys data when it's just a programming error. the heuristic could be improved, but this experience should tell you that using it by default is not a very good idea - it is hard enough to understand a system as is, but this adjustment makes it even harder.
flixr commentedon Feb 5, 2020
I also want to add that we ran into this problem, even though all the services were running on the same host (in containers). We initially had the IP field empty and it was adjusted, but using the IP from inside the container also would not have helped. As a workaround we have for now hardcoded spans to all report the same IP.
It seems that in our case the clock skew adjustment happened because:
bobrik commentedon Feb 5, 2020
How about the following?
This way it's a no-op for good spans, and bad ones get the choice, which either points at the issue in their tracing or at clock skew.
jpkrohling commentedon Feb 6, 2020
I added this issue as a topic for the next bi-weekly. Perhaps we can move forward while discussing this with everyone in the same "room": https://docs.google.com/document/d/1ZuBAwTJvQN7xkWVvEFXj5WU9_JmS5TPiNbxCJSvPqX0/edit#heading=h.u5x27f1oevg9
yurishkuro commentedon Feb 6, 2020
We're currently limited by front end. A simple temporary solution, for people who are particularly annoyed by the adjustment, is to add a cli flag to permanently disable adjustment, until someone contributes a respective UI change to toggle it on and off.
jpkrohling commentedon Feb 7, 2020
Given that this feature breaks most people's expectation, shouldn't it be the other way around, even though it means a change in the current behavior?
yurishkuro commentedon Feb 7, 2020
Might be ok too. I think if we go with such change, we need to still run the adjustment logic but only record a warning saying "had the adjustment been enabled, it would've moved the timestamps by (Delta)".
12 remaining items