Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document and (partially) handle interactions involving k8s terminationGracePeriodSeconds #32487

Merged
merged 4 commits into from
Aug 29, 2024

Conversation

leviramsey
Copy link
Contributor

By default Kubernetes will send SIGTERM, wait 30 seconds, and then send SIGKILL.

The SIGTERM will normally trigger CoordinatedShutdown, which triggers a shard rebalance. In the default configurations, the total CoordinatedShutdown timeout is greater than the k8s terminationGracePeriodSeconds and less than the sharding handoff-timeout.

Adjusting the various phase timeouts to comport with the Kubernetes setting (or even just the default) seems dicey, so this just adds documentation to CoordinatedShutdown suggesting that one consider extending terminationGracePeriodSeconds. It likewise documents that in the shutdown case, the handoff-timeout may not apply.

The behavior of the handoff now changes slightly to use a shorter effective handoff timeout when CoordinatedShutdown is in effect. The visible effect is that, when CoordinatedShutdown triggered the handoff, the warning about how long before entities will be forcibly stopped will be logged and the forcible stop may in fact happen at cluster sharding's hand rather than from the actor system itself.

Copy link
Member

@patriknw patriknw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Note that if running in Kubernetes, a SIGKILL will be issued after a set amount of time has passed
since SIGTERM. By default this time is 30 seconds ([`terminationGracePeriodSeconds`](https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#hook-handler-execution)):
it may be worth adjusting the Kubernetes configuration or the phase timeouts to make `CoordinatedShutdown`
more likely to completely exectue before SIGKILL is received.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be worth changing the default-phase-timeout to 4 seconds so that our default totalTimeout is less than 30 seconds?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not in akka-actor, which would apply to everyone wherever they're running. But it could be a responsibility of the Akka Management modules which imply k8s is in use set the default-phase-timeout like that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Crazy idea could be to have a mechanism to read the terminationGracePeriodSeconds from k8s and set the default timeout to floor(terminationGracePeriodSeconds / 7).seconds, but that would require setting the default timeout after starting the actorsystem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My read of reference.conf is that we have 11 levels of phases before before-actor-system-terminate (two of which have 10s phase timeouts). So only a default-phase-timeout of 1s would keep this below 30 seconds?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On library (akka-management) can't override reference.conf of another library (akka-actor).
My point is that the value of default-phase-timeout was originally set to something arbitrary, and I can't imagine it would break any existing to change it to 4 seconds. We could do that separately for Akka 2.10.0 (and include note about config change in migration guide).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, makes sense to do it for the next release.

@johanandren johanandren added this to the 2.9.6 milestone Aug 29, 2024
@johanandren johanandren merged commit 1ef4421 into akka:main Aug 29, 2024
5 checks passed
@leviramsey leviramsey deleted the k8s-shutdown-timeout branch August 29, 2024 19:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants