Improve waiting behaviour in case of potential container restarts #2933

Open
@gdemonet

Description

Component: salt, scripts, kubelet

Why this is needed:

In #2928 we introduce a `sleep 20` in the upgrade script, right after the local kubelet is upgraded, to make sure any container restart has completed (especially for the Salt master).

A hardcoded wait like this is problematic, however:

  • if the chosen value is too large for the system, we waste time and slow down the user experience
  • if the chosen value is too small for the system (e.g. in CI), we risk unwanted failures that waiting a little longer would have avoided
  • in any case, no single value can be optimal for both extremes
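
For illustration, the usual alternative to a fixed sleep is a bounded poll: return as soon as a readiness check passes, and give up after a timeout. The sketch below is not code from the upgrade script. `wait_until`, its parameters, and the default values are hypothetical names chosen for this example, and the readiness check itself is left to the caller.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout: float = 60.0,
    interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Returns True as soon as the predicate passes, False on timeout.
    `clock` and `sleep` are injectable so the helper can be tested
    without actually waiting.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(interval, remaining))
```

On a fast system this returns after the first successful check, and on a slow one it keeps trying up to the timeout, so the timeout can be generous without penalizing the common case.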

What should be done:

The issue at hand is a case of "waiting for something that may happen": a kubelet restart may or may not occur (e.g. when this script is re-run after a flaky failure), and if kubelet does restart, Pods may or may not change (an upgraded kubelet may add labels/annotations, and this could possibly happen in other situations too).

Implementation proposal:

Here's a wild suggestion:

  1. Determine whether kubelet has restarted or not

    One option is to parse the output of state.sls metalk8s.kubernetes.kubelet.standalone and check whether the service.running state for kubelet reports changes (not sure that's enough; maybe we should check differently)

  2. If kubelet has restarted, determine if it has reconciled the Pod of interest

    We can look at status.startTime on the Pod, which is updated when kubelet restarts. Not sure this is enough either, but I'd expect a single reconciliation pass for the Pod to include whatever new labels/annotations it needs

  3. Once the Pod is reconciled, check if it changed

    We can compare the Pod's hash with its previous value (visible via crictl if needed). This requires recording the hash before attempting to update kubelet

  4. If the Pod changed, wait for the container to be up
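
The decision flow in steps 1 to 3 can be sketched as pure functions, under several assumptions. It assumes the Salt state return is available as the per-minion dictionary produced with JSON output, whose keys follow Salt's `module_|-id_|-name_|-function` convention, and that the service is named `kubelet`. `PodSnapshot`, `Action`, `kubelet_restarted` and `next_action` are hypothetical names. How the Pod hash is obtained is deliberately left open, and whether status.startTime is a sufficient signal remains the open question raised above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Mapping


class Action(Enum):
    NOTHING_TO_WAIT_FOR = "nothing to wait for"
    WAIT_FOR_RECONCILIATION = "wait for reconciliation"
    WAIT_FOR_CONTAINER = "wait for container"


@dataclass(frozen=True)
class PodSnapshot:
    start_time: str  # value of status.startTime on the Pod
    pod_hash: str    # opaque hash, recorded however is convenient


def kubelet_restarted(state_return: Mapping[str, Any]) -> bool:
    """Step 1: did the service.running state for kubelet report changes?"""
    for key, result in state_return.items():
        parts = key.split("_|-")
        if len(parts) != 4:
            continue
        module, _state_id, name, function = parts
        # Assumption: the service managed by the state is named "kubelet".
        if (module, name, function) == ("service", "kubelet", "running"):
            return bool(result.get("changes"))
    return False


def next_action(restarted: bool, before: PodSnapshot, after: PodSnapshot) -> Action:
    """Steps 2 and 3: decide what, if anything, is still worth waiting for."""
    if not restarted:
        return Action.NOTHING_TO_WAIT_FOR
    if after.start_time == before.start_time:
        # kubelet restarted but has not yet reconciled this Pod.
        return Action.WAIT_FOR_RECONCILIATION
    if after.pod_hash == before.pod_hash:
        # Reconciled, and the Pod did not change.
        return Action.NOTHING_TO_WAIT_FOR
    # Reconciled and changed: step 4, wait for the container to be up.
    return Action.WAIT_FOR_CONTAINER
```

A caller would record `before` prior to applying the kubelet state, then poll a fresh `after` snapshot until `next_action` stops returning `WAIT_FOR_RECONCILIATION`, with an upper bound on the total wait.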

Test plan:

Metadata

Labels

  • complexity:medium (Something that requires one or few days to fix)
  • kind:debt (Technical debt)
  • topic:flakiness (Some test are flaky and cause CI to do transient failing)
  • topic:lifecycle (Issues related to upgrade or downgrade of MetalK8s)