Improve waiting behaviour in case of potential container restarts

**Component**: salt, scripts, kubelet



**Why this is needed**:

In #2928 we introduced a `sleep 20` in the upgrade script, right after the local kubelet is upgraded, to make sure any container restart is complete (especially for the Salt master).

However, this kind of hardcoded waiting time is problematic:
- if the selected value is too large for the system, we waste time and degrade the user experience
- if the selected value is too small for the system (e.g. in CI), we risk unwanted failures that waiting a little longer would have avoided
- in any case, a single fixed value cannot be optimal for both extremes

**What should be done**:

Rather than sleeping for a fixed duration, the script should wait only when needed, and only for as long as needed.

The issue at hand is a case of "waiting for something that **may** happen": a kubelet restart may or may not happen (e.g. when this script is re-run after a flaky failure), and if kubelet does restart, Pods may or may not change (an upgraded kubelet may add labels/annotations, and there may be other situations as well).
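As an illustration of replacing a fixed sleep with a bounded wait, here is a minimal sketch of a polling helper. The name `wait_until` and its parameters are hypothetical (nothing like it is claimed to exist in the upgrade script); the `condition` callable stands in for whatever check ends up being used.

```python
import time
from typing import Callable


def wait_until(
    condition: Callable[[], bool],
    timeout: float = 60.0,
    interval: float = 1.0,
) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Returns True as soon as the condition holds, False on timeout.
    Hypothetical helper, not part of the existing scripts.
    """
    deadline = time.monotonic() + timeout
    while True:
        # Check first, so we return immediately when nothing needs waiting.
        if condition():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(interval, remaining))
```

With such a helper, a fast system returns after the first successful check, while a slow one (e.g. CI) gets up to `timeout` seconds before the script gives up.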

**Implementation proposal**:

Here's a wild suggestion:

1. _Determine whether kubelet has restarted or not_

   One option is to parse the output of `state.sls metalk8s.kubernetes.kubelet.standalone` and check whether the `service.running` state for kubelet reports changes (not sure if that's enough, maybe we should check differently).

2. _If kubelet has restarted, determine if it has reconciled the Pod of interest_

   We can look at `status.startTime` on the Pod, which is updated when kubelet restarts. Not sure if that's enough either, but I'd expect a single reconciliation pass for the Pod to include whatever new labels/annotations it needs.

3. _Once the Pod is reconciled, check if it changed_

   We can compare the Pod's hash with its previous one (visible via `crictl` if needed) - we would need to remember it from before the attempt to update kubelet

4. _If the Pod changed, wait for the container to be up_
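The four steps above could be sketched as pure functions over already-parsed JSON, which keeps the decision logic testable without a cluster. This is only a sketch under several assumptions: all function names are hypothetical; the state results are assumed to be the per-minion dictionary from Salt's JSON output, with keys of the form `module_|-id_|-name_|-function`; the kubelet service state's name is assumed to contain `kubelet`; and the Pod hash is assumed to be read from the `kubernetes.io/config.hash` annotation rather than from `crictl`.

```python
from datetime import datetime, timezone
from typing import Optional

STATE_KEY_SEP = "_|-"  # separator used in Salt state result keys
# Assumption: the Pod hash is taken from this annotation, not from crictl.
CONFIG_HASH = "kubernetes.io/config.hash"


def kubelet_restarted(state_results: dict) -> bool:
    """Step 1: did a `service.running` state for kubelet report changes?"""
    for key, result in state_results.items():
        parts = key.split(STATE_KEY_SEP)
        if len(parts) != 4:
            continue
        module, _state_id, name, function = parts
        # Assumption: the service name contains "kubelet".
        if module == "service" and function == "running" and "kubelet" in name:
            if result.get("changes"):
                return True
    return False


def pod_reconciled(pod: dict, restarted_at: datetime) -> bool:
    """Step 2: is `status.startTime` at or after the kubelet restart?"""
    start_time = pod.get("status", {}).get("startTime")
    if start_time is None:
        return False
    parsed = datetime.strptime(start_time, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc) >= restarted_at


def pod_changed(pod: dict, previous_hash: Optional[str]) -> bool:
    """Step 3: does the Pod hash differ from the one remembered earlier?"""
    annotations = pod.get("metadata", {}).get("annotations", {})
    return annotations.get(CONFIG_HASH) != previous_hash


def containers_ready(pod: dict) -> bool:
    """Step 4: are all containers of the Pod reported as ready?"""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    return bool(statuses) and all(s.get("ready") for s in statuses)


def must_keep_waiting(
    state_results: dict,
    pod: dict,
    previous_hash: Optional[str],
    restarted_at: datetime,
) -> bool:
    """Combine the four steps into a single "keep polling?" decision."""
    if not kubelet_restarted(state_results):
        return False  # nothing restarted, nothing to wait for
    if not pod_reconciled(pod, restarted_at):
        return True  # kubelet has not processed the Pod yet
    if not pod_changed(pod, previous_hash):
        return False  # Pod untouched, its containers keep running
    return not containers_ready(pod)
```

The script would record `previous_hash` and a timestamp before applying the kubelet state, then re-fetch the Pod and call `must_keep_waiting` in a loop with an overall timeout, instead of using `sleep 20`.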

**Test plan**:


Improve waiting behaviour in case of potential container restarts #2933
