
Adds a basic implementation of Recheck Interval into the health check service. #10610

Open · wants to merge 2 commits into base: master

Conversation

@sswastik02 commented Apr 12, 2024

What does this PR do?

This PR addresses the issue where the interval for health checks remains static even when a service starts failing. This lack of adaptability may result in issues not being detected in a timely manner. By implementing a recheck interval, we ensure that the health status of a service is re-evaluated at a dedicated interval when it fails, thereby improving the accuracy and responsiveness of issue detection.
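
For illustration, the option could surface in the dynamic configuration roughly as follows. This is a hedged, abridged sketch: the RecheckInterval field name is an assumption of this description, not the final API, while the other fields mirror existing ones in dynamic.ServerHealthCheck.

package dynamic // sketch only, not the actual Traefik source

import ptypes "github.com/traefik/paerser/types"

type ServerHealthCheck struct {
    Path     string          `json:"path,omitempty"`
    Interval ptypes.Duration `json:"interval,omitempty"`
    Timeout  ptypes.Duration `json:"timeout,omitempty"`
    // RecheckInterval would control how often unhealthy targets are
    // re-probed; defaulting it to Interval preserves current behavior.
    RecheckInterval ptypes.Duration `json:"recheckInterval,omitempty"`
}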

Motivation

Aims to improve the reliability and responsiveness of issue detection, particularly in scenarios where services may intermittently fail or experience fluctuations in health status.

More

  • Added/updated tests
  • Added/updated documentation

Additional Notes

Utilization of providers is yet to be implemented. Future work will focus on integrating this functionality to further enhance the flexibility and extensibility of the health check system.

Fixes #6550

@sswastik02 (Author)

Please let me know if I set up the PR correctly, or if any changes are needed.

@sswastik02 (Author)

Hi @kevinpollet,

I'm encountering issues with failing tests due to the interface change from StatusSetter to StatusHandler. Specifically, for the recheck interval feature to function properly, I need a getter method on the status to determine when to switch to the recheck interval and when to revert.

I'd appreciate your input on how best to address this. One option is to modify the interface to eliminate the need for the getter method, perhaps by having SetStatus return the original status before modification. Your suggestions on ensuring compatibility with other services implementing this interface would be invaluable.

Thank you.
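
For concreteness, a minimal sketch of the option described above, in which SetStatus reports the previous status; the return value is the proposed addition, not Traefik's actual API:

package healthcheck // sketch only

import "context"

// StatusSetter variant where SetStatus reports the status a child had
// before the update, so the health checker can detect up/down transitions
// without a separate getter method.
type StatusSetter interface {
    SetStatus(ctx context.Context, childName string, up bool) (previousUp bool)
}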

@sswastik02 (Author)

Another possibility is to consider expanding the tests to include this new field, which I believe will become necessary regardless. Your guidance on this matter would be greatly appreciated, @kevinpollet.

@sswastik02 force-pushed the master branch 4 times, most recently from c5d2c2a to 74a7bf6 on May 22, 2024 at 13:48
@sswastik02 (Author)

Hi @kevinpollet ,
Apologies for the delay on this PR; could you please approve the workflows so they can run?

@sswastik02 (Author)

Hi @kevinpollet,
I have made some changes to the integration tests; could you please approve the workflows?

@ldez reopened this Jun 2, 2024
@sswastik02 (Author) commented Jun 3, 2024

Thank you, @ldez. @kevinpollet, please let me know if any changes are necessary on this PR.

@kevinpollet (Member)

Hello @sswastik02 and sorry for the delay,

We plan to review your pull request soon. Is it ok with you if we push some changes if necessary?

@sswastik02 (Author)

Hi @kevinpollet,
Sure, I am ok with you pushing changes to this PR if necessary, but if you tell me what changes are needed, I can make them too. Let me know whatever works for you. I am happy to help.

Thanks.

@kevinpollet (Member) commented Jul 17, 2024

Hello @sswastik02,

After looking at your modifications, we have a couple of comments:

The recheck interval cannot have a default value different from the default check interval value; that would be a breaking change.

Redefining the health-check timer in the loop where it is used does not seem to be a usual pattern, and, if we are not mistaken, this interval will also be modified for healthy backends.

WDYT?

@sswastik02 (Author)

Hi @kevinpollet,
Sure, we can have the default value of the recheck interval match the default check interval. Could you please explain why it is breaking, though?

I did not fully understand your point about redefining the health-check loop. Are you suggesting that redefining the health-check loop is not the standard approach for implementing the recheck feature? If so, could you guide me on the best practices for this implementation? I'll be happy to make the necessary modifications.

Also, as mentioned, the interval changes to the recheck interval when the service is unhealthy and reverts to the normal check interval once the service is healthy again.

@rtribotte (Member)

Hi @sswastik02,

> Sure, we can have the default value of the recheck interval match the default check interval. Could you please explain why it is breaking, though?

What @kevinpollet meant was that having a default value for this recheck interval would cause a change in the behavior of Traefik. The health (re)checks would happen more often, and we do not want to introduce such a breaking behavior.

> I did not fully understand your point about redefining the health-check loop. Are you suggesting that redefining the health-check loop is not the standard approach for implementing the recheck feature? If so, could you guide me on the best practices for this implementation? I'll be happy to make the necessary modifications.

It is not a "non-standard approach" per se, but, at least, it is not the best design for readability.
There are no best practices for this. The best pattern would also depend on the solution we want to bring.

> Also, as mentioned, the interval changes to the recheck interval when the service is unhealthy and reverts to the normal check interval once the service is healthy again.

The problem is that this affects every target in the load balancer, and the recheck will apply to healthy targets as well.

As mentioned earlier, this raises some questions about how to implement this recheck feature. Should the recheck interval be a strict interval applied per unhealthy target, or just a global interval for all unhealthy targets?

Currently, the health check is sequential: with only one goroutine and one loop, the algorithm remains simple, but the total time spent checking the targets could exceed the health-check interval.

Is this clearer?
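
For reference, the single-ticker shape being discussed is roughly the following. This is a simplified sketch, not the exact Traefik source; checkTarget is a hypothetical helper standing in for the loop body quoted later in this thread.

func (shc *ServiceHealthChecker) Launch(ctx context.Context) {
    ticker := time.NewTicker(shc.interval)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            // Targets are checked sequentially in one goroutine, so one
            // slow pass over the targets can exceed shc.interval.
            for proxyName, target := range shc.targets {
                shc.checkTarget(ctx, proxyName, target) // hypothetical helper
            }
        }
    }
}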

@sswastik02 (Author)

Hi @rtribotte,

Thanks for your explanation. I now understand the potential issues. Shipping a shorter default recheck interval would indeed disrupt existing applications unaware of this feature upon release. Therefore, it's essential to set the default recheck interval to match the current check interval to maintain existing behavior.

I agree we should first discuss whether the recheck should be on a per-target basis or for the load balancer as a whole. I believe it should be on a per-target basis, with separate tickers for each service. This way, we can set the default to the shc.interval. While a service is unhealthy, we switch to a separate ticker using shc.recheck. This approach avoids unnecessary rechecks for services that are healthy from the load balancer's perspective.

Am I going in the right direction?

@rtribotte (Member)

> I agree we should first discuss whether the recheck should be on a per-target basis or for the load balancer as a whole. I believe it should be on a per-target basis, with separate tickers for each service. This way, we can set the default to the shc.interval. While a service is unhealthy, we switch to a separate ticker using shc.recheck. This approach avoids unnecessary rechecks for services that are healthy from the load balancer's perspective.

We are not sure we should drop the single ticker approach. If having only a second ticker for all unhealthy targets is manageable, we think it would be preferable.

@sswastik02 (Author)

Hi @rtribotte,
Sorry for the delay.

> If having only a second ticker for all unhealthy targets is manageable, we think it would be preferable.

Do you mean that we maintain just one additional timer that sequentially performs health checks on all unhealthy targets, similar to how we handle healthy targets?

@kevinpollet (Member)

Hi @sswastik02,

> Do you mean that we maintain just one additional timer that sequentially performs health checks on all unhealthy targets, similar to how we handle healthy targets?

Yes, you are right.
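
A minimal sketch of that two-ticker variant; shc.recheckInterval and the checkTargets, allTargets, and unhealthyTargets helpers are assumptions for illustration, not existing Traefik code:

ticker := time.NewTicker(shc.interval)
defer ticker.Stop()

recheckTicker := time.NewTicker(shc.recheckInterval) // hypothetical field
defer recheckTicker.Stop()

for {
    select {
    case <-ctx.Done():
        return
    case <-ticker.C:
        // The main pass checks every target, healthy or not.
        shc.checkTargets(ctx, shc.allTargets())
    case <-recheckTicker.C:
        // The recheck pass only revisits targets currently marked down.
        shc.checkTargets(ctx, shc.unhealthyTargets())
    }
}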

@sswastik02 (Author)

Hi @kevinpollet,
I believe implementing the second ticker for unhealthy targets is feasible. However, we also need to determine whether a target was previously healthy or not, so we can assign it to the appropriate ticker. To achieve this, we could maintain a separate map that stores the status of each target by name, likely of type map[string]bool. Alternatively, we could modify the targets variable to accommodate the status value.

type ServiceHealthChecker struct {
    balancer StatusSetter
    info     *runtime.ServiceInfo
    config   *dynamic.ServerHealthCheck
    interval time.Duration
    timeout  time.Duration
    metrics  metricsHealthCheck
    client   *http.Client
    targets  map[string]*url.URL
}

What are your thoughts?

@kevinpollet (Member)

Hi @sswastik02,

> To achieve this, we could maintain a separate map that stores the status of each target by name, likely of type map[string]bool. Alternatively, we could modify the targets variable to accommodate the status value.

We have no strong opinion on this. At a glance, it is hard to reason about, so it is up to you, and we will discuss the final implementation. As a side note, we were wondering if we need another goroutine, and, if that's the case, how everything will be synchronized if we keep only one map that is mutated by the two goroutines.
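
One common way to handle that synchronization concern is to guard the shared map with a mutex. A sketch with illustrative names; statusMap is not an existing Traefik type:

package healthcheck // sketch only

import "sync"

// statusMap guards the shared target statuses when two goroutines
// (the main check and the recheck) both read and write them.
type statusMap struct {
    mu       sync.RWMutex
    statuses map[string]bool // target name -> currently healthy?
}

func (s *statusMap) set(name string, up bool) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.statuses[name] = up
}

func (s *statusMap) unhealthy() []string {
    s.mu.RLock()
    defer s.mu.RUnlock()
    var names []string
    for name, up := range s.statuses {
        if !up {
            names = append(names, name)
        }
    }
    return names
}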

@sswastik02 (Author)

Hi @kevinpollet,

> we were wondering if we need another goroutine

Here’s the idea I have:

The recheck is essentially a subset of the main check. To avoid breaking changes, the health check must still be executed for all targets at the configured interval. The recheck executes every recheck period, but it should not run while the main check is running or about to be triggered, as the main check handles health checks for all targets regardless of their previous health.

To implement this, here is a rough plan:

  1. Whenever we receive data from the ticker, we will launch a goroutine that first acquires a mutex, then performs a health check on all targets sequentially, as it does currently, and finally releases the mutex.

  2. Whenever we receive data from the recheck ticker, we will check whether the mutex is held. If it is, a health check is in progress and we do not proceed with the recheck. If the mutex is not held, we proceed with the recheck, and during each target iteration we check whether the mutex has been acquired and exit if it has.

for proxyName, target := range shc.targets {
    select {
    case <-ctx.Done():
        return
    default:
    }

    up := true
    serverUpMetricValue := float64(1)

    if err := shc.executeHealthCheck(ctx, shc.config, target); err != nil {
        // The context is canceled when the dynamic configuration is refreshed.
        if errors.Is(err, context.Canceled) {
            return
        }

        log.Ctx(ctx).Warn().
            Str("targetURL", target.String()).
            Err(err).
            Msg("Health check failed.")

        up = false
        serverUpMetricValue = float64(0)
    }

    shc.balancer.SetStatus(ctx, proxyName, up)

    statusStr := runtime.StatusDown
    if up {
        statusStr = runtime.StatusUp
    }

    shc.info.UpdateServerStatus(target.String(), statusStr)

    shc.metrics.ServiceServerUpGauge().
        With("service", proxyName, "url", target.String()).
        Set(serverUpMetricValue)
}

Let's say the above code is healthcheck(all_targets)

Then we would effectively be doing

case <-ticker.C:
    mutex.Lock()
    go healthcheck(all_targets)
    mutex.Unlock()
case <-recheckTicker.C:
    if !mutex.IsLocked():
        go healthcheck(unhealthy_targets) // we will also check if mutex is locked for every iteration of the loop inside healthcheck

The main thread will handle the ticker channels, while the goroutine will manage the respective health checks.
What do you think about this approach?
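
A side note on the pseudocode above: Go's sync.Mutex has no IsLocked method, so a runnable approximation of the same intent would use TryLock (available since Go 1.18). This only translates the sketch; it does not address the design concerns raised in the next comment.

// Inside the select loop; mu is a sync.Mutex shared by both branches.
case <-ticker.C:
    go func() {
        mu.Lock() // waits for any in-flight recheck to finish
        defer mu.Unlock()
        healthcheck(allTargets)
    }()
case <-recheckTicker.C:
    if mu.TryLock() { // skip this recheck if a check already holds the lock
        go func() {
            defer mu.Unlock() // unlocking from another goroutine is valid for sync.Mutex
            healthcheck(unhealthyTargets)
            // Note: the plan's per-iteration "is the mutex locked" check has
            // no mutex equivalent; a context or channel would be needed.
        }()
    }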

@kevinpollet modified the milestones: 3.2, next on Oct 3, 2024
@kevinpollet modified the milestones: 3.3, next on Dec 17, 2024
@rtribotte (Member)

Hi @sswastik02,

Sorry for the late reply.

We are not convinced that such an approach would work as expected.
The mutex does not serve the purpose you mention.

Implementing this recheck interval feature is not trivial, and we are not convinced that iterating here in the discussion is the easiest way to proceed.

Working out a solution to share with you here in this discussion would require iterating on the code ourselves, which we can do, but it would mean taking over the PR.

We encourage you to iterate directly on the code so that we can review it in context.
If at some point we need to provide you with an alternative solution, and if you agree, we can always push a review commit.

@sswastik02 (Author)

Hi @rtribotte,
Thank you for your response!

> We are not sure we should drop the single ticker approach. If having only a second ticker for all unhealthy targets is manageable, we think it would be preferable.

Before proceeding with the code, I’d like to clarify: should the solution focus solely on adding a new ticker for unhealthy targets, or can we also explore the possibility of replacing the single ticker approach with target-level goroutines?

@rtribotte (Member)

Hello @sswastik02,

Sure, feel free to try an alternate solution to the single ticker.

@sswastik02 (Author)

Hi @rtribotte,
I’ve implemented a basic approach to split the health check loop into goroutines at the target level, synchronized using a ticker within the loop. This ensures that health checks occur at specified intervals while preventing redundant rechecks.
Rechecks are handled on a per-target basis, unlike the primary health checks.

Could you review this rough implementation and let me know if I’m on the right track? If so, I’ll refine the code further, including adding configuration options for rechecks.
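
For readers following the thread, the per-target shape described above could look roughly like this. This is a hedged sketch: checkOnce and shc.recheckInterval are assumed names, and the PR's actual code may differ.

for proxyName, target := range shc.targets {
    go func(name string, u *url.URL) {
        interval := shc.interval
        timer := time.NewTimer(interval)
        defer timer.Stop()

        for {
            select {
            case <-ctx.Done():
                return
            case <-timer.C:
                // checkOnce is a hypothetical helper that probes one target
                // and updates the balancer status; it reports health.
                healthy := shc.checkOnce(ctx, name, u)

                // Probe failing targets on the (shorter) recheck interval,
                // and revert once the target is healthy again.
                if healthy {
                    interval = shc.interval
                } else {
                    interval = shc.recheckInterval // hypothetical field
                }
                timer.Reset(interval)
            }
        }
    }(proxyName, target)
}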

Successfully merging this pull request may close the issue: Support recheck interval for health check.