
Adds a basic implementation of Recheck Interval into the health check service. #10610

Open · wants to merge 2 commits into base: master

Conversation

@sswastik02 commented Apr 12, 2024

What does this PR do?

This PR addresses the issue where the interval for health checks remains static even when a service starts failing. This lack of adaptability may result in issues not being detected in a timely manner. By implementing a recheck interval, we ensure that the health status of a service is re-evaluated at a dedicated interval when it fails, thereby improving the accuracy and responsiveness of issue detection.
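
For illustration, the option could surface in the dynamic configuration roughly as follows. This is a hedged, abridged sketch: the RecheckInterval field name is an assumption of this description, not the final API, while the other fields mirror existing ones in dynamic.ServerHealthCheck.

package dynamic // sketch only, not the actual Traefik source

import ptypes "github.com/traefik/paerser/types"

type ServerHealthCheck struct {
    Path     string          `json:"path,omitempty"`
    Interval ptypes.Duration `json:"interval,omitempty"`
    Timeout  ptypes.Duration `json:"timeout,omitempty"`
    // RecheckInterval would control how often unhealthy targets are
    // re-probed; defaulting it to Interval preserves current behavior.
    RecheckInterval ptypes.Duration `json:"recheckInterval,omitempty"`
}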

Motivation

Aims to improve the reliability and responsiveness of issue detection, particularly in scenarios where services may intermittently fail or experience fluctuations in health status.

More

  • Added/updated tests
  • Added/updated documentation

Additional Notes

Utilization of providers is yet to be implemented. Future work will focus on integrating this functionality to further enhance the flexibility and extensibility of the health check system.

Fixes #6550

@sswastik02 (Author)

Please let me know if I set up the PR correctly, or if any changes are needed.

@sswastik02 (Author)

Hi @kevinpollet,

I'm encountering issues with failing tests due to the interface change from StatusSetter to StatusHandler. Specifically, for the recheck interval feature to function properly, I need a getter method on the status to determine when to switch to the recheck interval and when to revert.

I'd appreciate your input on how best to address this. One option is to modify the interface to eliminate the need for the getter method, perhaps by having SetStatus return the original status before modification. Your suggestions on ensuring compatibility with other services implementing this interface would be invaluable.

Thank you.
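
For concreteness, a minimal sketch of the option described above, in which SetStatus reports the previous status; the return value is the proposed addition, not Traefik's actual API:

package healthcheck // sketch only

import "context"

// StatusSetter variant where SetStatus reports the status a child had
// before the update, so the health checker can detect up/down transitions
// without a separate getter method.
type StatusSetter interface {
    SetStatus(ctx context.Context, childName string, up bool) (previousUp bool)
}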

@sswastik02 (Author)

Another possibility is to consider expanding the tests to include this new field, which I believe will become necessary regardless. Your guidance on this matter would be greatly appreciated, @kevinpollet.

@sswastik02 force-pushed the master branch 4 times, most recently from c5d2c2a to 74a7bf6 on May 22, 2024 at 13:48
@sswastik02 (Author)

Hi @kevinpollet ,
Apologies for the delay on this PR; could you please approve the workflows so they can run?

@sswastik02 (Author)

Hi @kevinpollet,
I have made some changes to the integration tests; could you please approve the workflows?

@ldez reopened this Jun 2, 2024
@sswastik02 (Author) commented Jun 3, 2024

Thank you, @ldez. @kevinpollet, please let me know if any changes are necessary on this PR.

@kevinpollet (Member)

Hello @sswastik02 and sorry for the delay,

We plan to review your pull request soon. Is it ok with you if we push some changes if necessary?

@sswastik02 (Author)

Hi @kevinpollet,
Sure, I am ok with you pushing changes to this PR if necessary, but if you tell me what changes are needed, I can make them too. Let me know whatever works for you. I am happy to help.

Thanks.

@kevinpollet (Member) commented Jul 17, 2024

Hello @sswastik02,

After looking at your modifications, we have a couple of comments:

The recheck interval cannot have a default value different from the default check interval value; that would be a breaking change.

Redefining the health-check timer in the loop where it is used does not seem to be a usual pattern, and, if we are not mistaken, this interval will also be modified for healthy backends.

WDYT?

@sswastik02 (Author)

Hi @kevinpollet,
Sure, we can have the default value of the recheck interval match the default check interval. Could you please explain why it is breaking, though?

I did not fully understand your point about redefining the health-check loop. Are you suggesting that redefining the health-check loop is not the standard approach for implementing the recheck feature? If so, could you guide me on the best practices for this implementation? I'll be happy to make the necessary modifications.

Also, as mentioned, the interval changes to the recheck interval when the service is unhealthy and reverts to the normal check interval once the service is healthy again.

@rtribotte (Member)

Hi @sswastik02,

> Sure, we can have the default value of the recheck interval match the default check interval. Could you please explain why it is breaking, though?

What @kevinpollet meant was that having a default value for this recheck interval would cause a change in the behavior of Traefik. The health (re)checks would happen more often, and we do not want to introduce such a breaking behavior.

> I did not fully understand your point about redefining the health-check loop. Are you suggesting that redefining the health-check loop is not the standard approach for implementing the recheck feature? If so, could you guide me on the best practices for this implementation? I'll be happy to make the necessary modifications.

It is not a "non-standard approach" per se, but, at least, it is not the best design for readability.
There are no best practices for this. The best pattern would also depend on the solution we want to bring.

> Also, as mentioned, the interval changes to the recheck interval when the service is unhealthy and reverts to the normal check interval once the service is healthy again.

The problem is that this affects every target in the load balancer, and the recheck will apply to healthy targets as well.

As mentioned earlier, this raises some questions about how to implement this recheck feature. Should the recheck interval be a strict interval applied per unhealthy target, or just a global interval for all unhealthy targets?

Currently, the health check is sequential: with only one goroutine and one loop, the algorithm remains simple, but the total time spent checking the targets could exceed the health-check interval.

Is this clearer?
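
For reference, the single-ticker shape being discussed is roughly the following. This is a simplified sketch, not the exact Traefik source; checkTarget is a hypothetical helper standing in for the loop body quoted later in this thread.

func (shc *ServiceHealthChecker) Launch(ctx context.Context) {
    ticker := time.NewTicker(shc.interval)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            // Targets are checked sequentially in one goroutine, so one
            // slow pass over the targets can exceed shc.interval.
            for proxyName, target := range shc.targets {
                shc.checkTarget(ctx, proxyName, target) // hypothetical helper
            }
        }
    }
}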

@sswastik02 (Author)

Hi @rtribotte,

Thanks for your explanation. I now understand the potential issues. Shipping a shorter default recheck interval would indeed disrupt existing applications unaware of this feature upon release. Therefore, it's essential to set the default recheck interval to match the current check interval to maintain existing behavior.

I agree we should first discuss whether the recheck should be on a per-target basis or for the load balancer as a whole. I believe it should be on a per-target basis, with separate tickers for each service. This way, we can set the default to the shc.interval. While a service is unhealthy, we switch to a separate ticker using shc.recheck. This approach avoids unnecessary rechecks for services that are healthy from the load balancer's perspective.

Am I going in the right direction?

@rtribotte (Member)

> I agree we should first discuss whether the recheck should be on a per-target basis or for the load balancer as a whole. I believe it should be on a per-target basis, with separate tickers for each service. This way, we can set the default to the shc.interval. While a service is unhealthy, we switch to a separate ticker using shc.recheck. This approach avoids unnecessary rechecks for services that are healthy from the load balancer's perspective.

We are not sure we should drop the single ticker approach. If having only a second ticker for all unhealthy targets is manageable, we think it would be preferable.

@sswastik02 (Author)

Hi @rtribotte,
Sorry for the delay.

> If having only a second ticker for all unhealthy targets is manageable, we think it would be preferable.

Do you mean that we maintain just one additional timer that sequentially performs health checks on all unhealthy targets, similar to how we handle healthy targets?

@kevinpollet (Member)

Hi @sswastik02,

> Do you mean that we maintain just one additional timer that sequentially performs health checks on all unhealthy targets, similar to how we handle healthy targets?

Yes, you are right.
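
A minimal sketch of that two-ticker variant; shc.recheckInterval and the checkTargets, allTargets, and unhealthyTargets helpers are assumptions for illustration, not existing Traefik code:

ticker := time.NewTicker(shc.interval)
defer ticker.Stop()

recheckTicker := time.NewTicker(shc.recheckInterval) // hypothetical field
defer recheckTicker.Stop()

for {
    select {
    case <-ctx.Done():
        return
    case <-ticker.C:
        // The main pass checks every target, healthy or not.
        shc.checkTargets(ctx, shc.allTargets())
    case <-recheckTicker.C:
        // The recheck pass only revisits targets currently marked down.
        shc.checkTargets(ctx, shc.unhealthyTargets())
    }
}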

@sswastik02 (Author)

Hi @kevinpollet,
I believe implementing the second ticker for unhealthy targets is feasible. However, we also need to determine whether a target was previously healthy or not, so we can assign it to the appropriate ticker. To achieve this, we could maintain a separate map that stores the status of each target by name, likely of type map[string]bool. Alternatively, we could modify the targets variable to accommodate the status value.

type ServiceHealthChecker struct {
    balancer StatusSetter
    info     *runtime.ServiceInfo
    config   *dynamic.ServerHealthCheck
    interval time.Duration
    timeout  time.Duration
    metrics  metricsHealthCheck
    client   *http.Client
    targets  map[string]*url.URL
}

What are your thoughts?

@kevinpollet (Member)

Hi @sswastik02,

> To achieve this, we could maintain a separate map that stores the status of each target by name, likely of type map[string]bool. Alternatively, we could modify the targets variable to accommodate the status value.

We have no strong opinion on this. At a glance, it is hard to reason about, so it is up to you, and we will discuss the final implementation. As a side note, we were wondering if we need another goroutine, and, if that's the case, how everything will be synchronized if we keep only one map that is mutated by the two goroutines.
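
One common way to handle that synchronization concern is to guard the shared map with a mutex. A sketch with illustrative names; statusMap is not an existing Traefik type:

package healthcheck // sketch only

import "sync"

// statusMap guards the shared target statuses when two goroutines
// (the main check and the recheck) both read and write them.
type statusMap struct {
    mu       sync.RWMutex
    statuses map[string]bool // target name -> currently healthy?
}

func (s *statusMap) set(name string, up bool) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.statuses[name] = up
}

func (s *statusMap) unhealthy() []string {
    s.mu.RLock()
    defer s.mu.RUnlock()
    var names []string
    for name, up := range s.statuses {
        if !up {
            names = append(names, name)
        }
    }
    return names
}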

@sswastik02 (Author)

Hi @kevinpollet,

> we were wondering if we need another goroutine

Here’s the idea I have:

The recheck is essentially a subset of the main check. To avoid breaking changes, the health check must still be executed for all targets at the configured interval. The recheck executes every recheck period, but it should not run while the main check is running or about to be triggered, as the main check handles health checks for all targets regardless of their previous health.

To implement this, here is a rough plan:

  1. Whenever we receive data from the ticker, we will launch a goroutine that first acquires a mutex, then performs a health check on all targets sequentially, as it does currently, and finally releases the mutex.

  2. Whenever we receive data from the recheck ticker, we will check whether the mutex is held. If it is, a health check is in progress and we do not proceed with the recheck. If the mutex is not held, we proceed with the recheck, and during each target iteration we check whether the mutex has been acquired and exit if it has.

for proxyName, target := range shc.targets {
    select {
    case <-ctx.Done():
        return
    default:
    }

    up := true
    serverUpMetricValue := float64(1)

    if err := shc.executeHealthCheck(ctx, shc.config, target); err != nil {
        // The context is canceled when the dynamic configuration is refreshed.
        if errors.Is(err, context.Canceled) {
            return
        }

        log.Ctx(ctx).Warn().
            Str("targetURL", target.String()).
            Err(err).
            Msg("Health check failed.")

        up = false
        serverUpMetricValue = float64(0)
    }

    shc.balancer.SetStatus(ctx, proxyName, up)

    statusStr := runtime.StatusDown
    if up {
        statusStr = runtime.StatusUp
    }

    shc.info.UpdateServerStatus(target.String(), statusStr)

    shc.metrics.ServiceServerUpGauge().
        With("service", proxyName, "url", target.String()).
        Set(serverUpMetricValue)
}

Let's say the above code is healthcheck(all_targets)

Then we would effectively be doing

case <-ticker.C:
    mutex.Lock()
    go healthcheck(all_targets)
    mutex.Unlock()
case <-recheckTicker.C:
    if !mutex.IsLocked():
        go healthcheck(unhealthy_targets) // we will also check if mutex is locked for every iteration of the loop inside healthcheck

The main thread will handle the ticker channels, while the goroutine will manage the respective health checks.
What do you think about this approach?
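
A side note on the pseudocode above: Go's sync.Mutex has no IsLocked method, so a runnable approximation of the same intent would use TryLock (available since Go 1.18). This only translates the sketch; it does not address the design concerns raised in the next comment.

// Inside the select loop; mu is a sync.Mutex shared by both branches.
case <-ticker.C:
    go func() {
        mu.Lock() // waits for any in-flight recheck to finish
        defer mu.Unlock()
        healthcheck(allTargets)
    }()
case <-recheckTicker.C:
    if mu.TryLock() { // skip this recheck if a check already holds the lock
        go func() {
            defer mu.Unlock() // unlocking from another goroutine is valid for sync.Mutex
            healthcheck(unhealthyTargets)
            // Note: the plan's per-iteration "is the mutex locked" check has
            // no mutex equivalent; a context or channel would be needed.
        }()
    }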

@kevinpollet modified the milestones: 3.2, next on Oct 3, 2024
@kevinpollet modified the milestones: 3.3, next on Dec 17, 2024
@rtribotte (Member)

Hi @sswastik02,

Sorry for the late reply.

We are not convinced that such an approach would work as expected.
The mutex does not serve the purpose you mention.

Implementing this recheck interval feature is not trivial, and we are not convinced that iterating here in the discussion is the easiest way to proceed.

Working out a solution to share with you here in this discussion would require iterating on the code ourselves, which we can do, but it would mean taking over the PR.

We encourage you to iterate directly on the code so that we can review it in context.
If at some point we need to provide you with an alternative solution, and if you agree, we can always push a review commit.

@sswastik02 (Author)

Hi @rtribotte,
Thank you for your response!

> We are not sure we should drop the single ticker approach. If having only a second ticker for all unhealthy targets is manageable, we think it would be preferable.

Before proceeding with the code, I’d like to clarify: should the solution focus solely on adding a new ticker for unhealthy targets, or can we also explore the possibility of replacing the single ticker approach with target-level goroutines?

@rtribotte (Member)

Hello @sswastik02,

Sure, feel free to try an alternate solution to the single ticker.

@sswastik02 (Author)

Hi @rtribotte,
I’ve implemented a basic approach to split the health check loop into goroutines at the target level, synchronized using a ticker within the loop. This ensures that health checks occur at specified intervals while preventing redundant rechecks.
Rechecks are handled on a per-target basis, unlike the primary health checks.

Could you review this rough implementation and let me know if I’m on the right track? If so, I’ll refine the code further, including adding configuration options for rechecks.
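
For readers following the thread, the per-target shape described above could look roughly like this. This is a hedged sketch: checkOnce and shc.recheckInterval are assumed names, and the PR's actual code may differ.

for proxyName, target := range shc.targets {
    go func(name string, u *url.URL) {
        interval := shc.interval
        timer := time.NewTimer(interval)
        defer timer.Stop()

        for {
            select {
            case <-ctx.Done():
                return
            case <-timer.C:
                // checkOnce is a hypothetical helper that probes one target
                // and updates the balancer status; it reports health.
                healthy := shc.checkOnce(ctx, name, u)

                // Probe failing targets on the (shorter) recheck interval,
                // and revert once the target is healthy again.
                if healthy {
                    interval = shc.interval
                } else {
                    interval = shc.recheckInterval // hypothetical field
                }
                timer.Reset(interval)
            }
        }
    }(proxyName, target)
}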

Successfully merging this pull request may close the issue: Support recheck interval for health check.