-
-
Notifications
You must be signed in to change notification settings - Fork 5.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds a basic implementation of Recheck Interval into the health check service. #10610
base: master
Are you sure you want to change the base?
Conversation
Please let me know if I made the PR correctly, or if any changes are needed. |
Hi @kevinpollet , I'm encountering issues with failing tests due to the interface change from I'd appreciate your input on how best to address this. One option is to modify the interface to eliminate the need for the getter method, perhaps by having SetStatus return the original status before modification. Your suggestions on ensuring compatibility with other services implementing this interface would be invaluable. Thank you. |
Another possibility is to consider expanding the tests to include this new field, which I believe will become necessary regardless. Your guidance on this matter would be greatly appreciated, @kevinpollet. |
c5d2c2a
to
74a7bf6
Compare
Hi @kevinpollet , |
Hi @kevinpollet, |
Thank you @ldez. @kevinpollet Please do let me know if any changes are necessary on this PR |
Hello @sswastik02 and sorry for the delay, We plan to review your pull request soon, is it ok for you if we push some changes if necessary? |
Hi @kevinpollet , Thanks. |
Hello @sswastik02, After looking at your modifications we have a couple of comments, The recheck interval cannot have a default value not equal to the default check interval value this is a breaking change. Redefining the health-check timer in the loop where it is use does not seem to be a usual pattern, and if we are not mistaken this interval will also be modified for healthy backends. WDYT? |
Hi @kevinpollet , I did not fully understand your point about redefining the health-check loop. Are you suggesting that redefining the health-check loop is not the standard approach for implementing the recheck feature? If so, could you guide me on the best practices for this implementation? I'll be happy to make the necessary modifications. Also, as mentioned, the interval changes to the recheck interval when the service is unhealthy and reverts to the normal check interval once the service is healthy again. |
Hi @sswastik02,
What @kevinpollet meant was that having a default value for this recheck interval would cause a change in the behavior of Traefik. The health (re)checks would happen more often, and we do not want to introduce such a breaking behavior.
It is not a "non-standard approach" per se, but, at least, it is not the best design for readability.
The problem is that this affects every target in the load balancer, and the recheck will apply to healthy targets as well. As mentioned earlier, this raises some questions about how to implement this recheck feature. Should the recheck interval be a strict interval applied per unhealthy target, or just a global interval for all unhealthy targets? Currently, the health check is assumed to be "derivative". With only one go routine, one loop, the algorithm remains simple, but is sequential, and the total time spent checking the targets could exceed the healthcheck interval. Is this clearer? |
Hi @rtribotte, Thanks for your explanation. I now understand the potential issues. Setting the default recheck would indeed disrupt existing applications unaware of this feature upon release. Therefore, it's essential to set the default recheck to match the current interval to maintain existing behavior. I agree we should first discuss whether the recheck should be on a per-target basis or for the load balancer as a whole. I believe it should be on a per-target basis, with separate tickers for each service. This way, we can set the default to the Am I going in the right direction ? |
We are not sure we should drop the single ticker approach. If having only a second ticker for all unhealthy targets is manageable, we think it would be preferable. |
Hi @rtribotte ,
Do you mean that we maintain just one additional timer that sequentially performs health checks on all unhealthy targets, similar to how we handle healthy targets? |
Hi @sswastik02,
Yes you are right. |
Hi @kevinpollet, traefik/pkg/healthcheck/healthcheck.go Lines 43 to 55 in 87db330
What are your thoughts ? |
Hi @sswastik02,
We have no strong opinion on this. At a glance, it is hard to reason about it, so it is up to you and we will discuss the final implementation. As a side note, we were wondering if we need another goroutine, and if that's the case how everything will be synchronized if we keep only one map which will be mutated by the two goroutines. |
Hi @kevinpollet ,
Here’s the idea I have: The recheck is essentially a subset of the main check. To avoid breaking changes, the health check must be executed for all targets at the given To implement this, here is a rough plan:
traefik/pkg/healthcheck/healthcheck.go Lines 104 to 141 in 87db330
Let's say the above code is Then we would effectively be doing case <-ticker.C:
mutex.Lock()
go healthcheck(all_targets)
mutex.Unlock()
case <-recheckTicker.C:
if !mutex.IsLocked():
go healthcheck(unhealthy_targets) // we will also check if mutex is locked for every iteration of the loop inside healthcheck The main thread will handle the ticker channels, while the goroutine will manage the respective health checks. |
Hi @sswastik02, Sorry for the late reply. We are not convinced that such an approach would work as expected. Implementing this recheck interval feature is not trivial, and we are not convinced that iterating here in the discussion is the easiest way to proceed. Working out a solution to share with you here in this discussion would require iterating on the code ourselves, which we can do, but it would mean taking on PR. We encourage you to iterate directly on the code so that we can review it in context. |
Hi @rtribotte,
Before proceeding with the code, I’d like to clarify: should the solution focus solely on adding a new ticker for unhealthy targets, or can we also explore the possibility of replacing the single ticker approach with target-level goroutines? |
Hello @sswastik02, Sure, feel free to try an alternate solution to the single ticker. |
…ntially Signed-off-by: sswastik02 <40518186+sswastik02@users.noreply.github.com>
Signed-off-by: sswastik02 <40518186+sswastik02@users.noreply.github.com>
Hi @rtribotte , Could you review this rough implementation and let me know if I’m on the right track? If so, I’ll refine the code further, including adding configuration options for rechecks. |
What does this PR do?
This PR Addresses the issue where the interval for health checks remains static even when a service starts failing. This lack of adaptability may result in unreliable detection of issues on time. By implementing a recheck interval, we ensure that the health status of a service is re-evaluated at regular intervals, especially when it fails, thereby improving the accuracy and responsiveness of issue detection.
Motivation
Aims to improve the reliability and responsiveness of issue detection, particularly in scenarios where services may intermittently fail or experience fluctuations in health status.
More
Additional Notes
Utilization of providers is yet to be implemented. Future work will focus on integrating this functionality to further enhance the flexibility and extensibility of the health check system.
Fixes #6550