Improve waiting behaviour in case of potential container restarts #2933
Description
Component: salt, scripts, kubelet
Why this is needed:
In #2928 we introduce a `sleep 20` to the upgrade script, after the local kubelet is upgraded, to make sure any container restart is complete (especially for the Salt master).
This kind of hardcoded waiting time is however problematic:
- if the selected value is too large for the system, we waste time and degrade the user experience
- if the selected value is too small for the system (e.g. in CI), we risk having unwanted failures while we could have waited a little longer
- in any case, we cannot optimize for both extremes with this approach
What should be done:
The issue at hand is a case of "waiting for something that may happen": a kubelet restart may or may not happen (e.g. when this script is re-run after a flaky failure), and if kubelet does restart, Pods may or may not change (if kubelet is upgraded, it may add labels/annotations, but maybe in other situations as well?).
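As a point of comparison with the fixed `sleep 20`, here is a minimal sketch of a bounded polling helper. It returns as soon as a condition holds and gives up after a timeout. The name `wait_until` and its parameters are hypothetical and not part of the existing scripts; the default timeout and interval are arbitrary placeholders.

```python
import time
from typing import Callable


def wait_until(
    condition: Callable[[], bool],
    timeout: float = 120.0,   # arbitrary placeholder, to be tuned
    interval: float = 2.0,    # arbitrary placeholder, to be tuned
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Returns True if the condition was met, False on timeout.
    The condition is always evaluated at least once.
    """
    deadline = clock() + timeout
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

With such a helper, a fast system stops waiting as soon as the condition is met, while a slow one (e.g. CI) can be given a generous timeout without penalizing everyone else.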
Implementation proposal:
Here's a wild suggestion:
- Determine whether kubelet has restarted or not
  An option could be to parse the output of `state.sls metalk8s.kubernetes.kubelet.standalone` to check whether the `service.running` state for kubelet reports changes (not sure if that's enough, maybe we should check differently)
- If kubelet has restarted, determine if it has reconciled the Pod of interest
  We can look at `status.startTime` on the Pod, which is updated on restart of kubelet. Not sure if that's enough either, but I'd expect a single reconciliation pass for the Pod to include whatever new labels/annotations it needs
- Once the Pod is reconciled, check if it changed
  We can compare the Pod's hash with its previous one (visible via `crictl` if needed). We would need to remember it from before the attempt to update kubelet
- If the Pod changed, wait for the container to be up
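The decision flow above could be sketched roughly as follows. This is only an illustration under several assumptions: the Salt return is taken to be the per-minion dictionary from JSON output, keyed by state IDs of the form `<module>_|-<id>_|-<name>_|-<function>`; the kubelet service state is assumed to have `kubelet` as its name; and the Pod hashes are treated as opaque strings obtained elsewhere (e.g. via `crictl`). All function names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def kubelet_restarted(state_return: Dict[str, Any]) -> bool:
    """Step 1: True if a `service.running` state named "kubelet" reports changes.

    Assumption: the state's name is exactly "kubelet".
    """
    for key, result in state_return.items():
        parts = key.split("_|-")
        if len(parts) != 4:
            continue
        module, _state_id, name, function = parts
        if (module, function) == ("service", "running") and name == "kubelet":
            return bool(result.get("changes"))
    return False


def parse_time(value: str) -> datetime:
    """Parse a Kubernetes-style UTC timestamp such as 2020-11-20T10:00:05Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def pod_reconciled(pod: Dict[str, Any], kubelet_restart: datetime) -> bool:
    """Step 2: True if status.startTime is at or after the kubelet restart."""
    start = pod.get("status", {}).get("startTime")
    return start is not None and parse_time(start) >= kubelet_restart


def next_action(
    state_return: Dict[str, Any],
    pod: Dict[str, Any],
    kubelet_restart: datetime,
    previous_hash: str,  # remembered from before the kubelet update attempt
    current_hash: str,   # opaque value, e.g. read through crictl
) -> str:
    """Combine the four steps into a single decision."""
    if not kubelet_restarted(state_return):
        return "done"
    if not pod_reconciled(pod, kubelet_restart):
        return "wait-for-reconciliation"
    if current_hash == previous_hash:
        return "done"  # steps 3 and 4: Pod unchanged, nothing to wait for
    return "wait-for-container"
```

Keeping each step as a pure function over already-collected data makes the open questions (is `service.running` enough? is `status.startTime` enough?) easy to revisit independently of how the data is fetched.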
Test plan: