Upgrades to 123.0.0 can fail after unneeded kubelet restart #3827

Open
@gdemonet

Description

Component: salt

What happened:

On a 3-node upgrade where multiple registry "replicas" are configured but only the bootstrap node (192.168.1.100 in this example) has the 123.0.0 archive, the rolling update of kube-apiserver fails on node-2 (192.168.1.102) with:

salt.exceptions.CommandExecutionError: Check availability of package container-selinux failed:
[...]
http://192.168.1.102:8080/metalk8s-123.0.0/redhat/7/metalk8s-epel-el7/repodata/repomd.xml: [Errno 14] HTTP Error 404 - Not Found
[...]
http://192.168.1.101:8080/metalk8s-123.0.0/redhat/7/metalk8s-epel-el7/repodata/repomd.xml: [Errno 14] HTTP Error 404 - Not Found
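
The URLs above are the per-node repository mirrors: only the bootstrap node (192.168.1.100) actually holds the 123.0.0 archive, so the other replicas return 404 for its repodata. As an illustration only (this helper is hypothetical and not part of MetalK8s), one way to see which configured mirrors can actually serve a given version's repodata, using the node IPs from this example:

```python
# Hypothetical helper (not part of MetalK8s): probe each registry replica for the
# repodata of a given MetalK8s version, to see which mirrors can actually serve it.
import urllib.error
import urllib.request

MIRRORS = ["192.168.1.100", "192.168.1.101", "192.168.1.102"]  # node IPs from this example
REPOMD = "/metalk8s-123.0.0/redhat/7/metalk8s-epel-el7/repodata/repomd.xml"


def check_mirrors(mirrors, path, port=8080, timeout=5):
    """Return, per mirror, the HTTP status (or error) for the target repodata."""
    results = {}
    for host in mirrors:
        url = "http://{}:{}{}".format(host, port, path)
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                results[host] = resp.status   # 200: this replica has the archive
        except urllib.error.HTTPError as exc:
            results[host] = exc.code          # 404: archive missing on this replica
        except OSError as exc:
            results[host] = "unreachable ({})".format(exc)
    return results


if __name__ == "__main__":
    for host, status in check_mirrors(MIRRORS, REPOMD).items():
        print("{}: {}".format(host, status))
```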

Analysis:

The issue is caused by two main problems:

  • when running metalk8s.orchestrate.apiserver to perform the rolling upgrade, kubelet restarts because of incomplete logic in cri.wait_pod (fixed in salt: Handle duplicates in cri.wait_pod #3828). This restart marks the repositories-bootstrap Pod as not ready, hence removing it from the endpoints, before the upgrade runs on node-2. At that point, node-2 sees no mirror serving the 123.0.0 version of metalk8s-epel, which causes the failure (see the sketch after this list)
  • the registry "HA setup" should never be in an incoherent state before an upgrade runs; we need to implement proper management of the replicas
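
On the first point, here is a minimal sketch of the kind of duplicate handling #3828 is about, assuming pod sandbox state is read through crictl; the function name, polling loop and JSON handling below are illustrative, not the actual salt/_modules/cri.py code:

```python
# Illustrative sketch only, not the actual salt/_modules/cri.py code: wait for the
# most recent pod sandbox matching a name, ignoring duplicate (older) sandboxes
# such as a not-yet-garbage-collected one left over from a previous restart.
import json
import subprocess
import time


def wait_pod(name, timeout=60, sleep=5):
    """Return True once the newest sandbox matching `name` is ready."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = subprocess.check_output(
            ["crictl", "pods", "--name", name, "-o", "json"]
        )
        sandboxes = json.loads(out).get("items", [])
        if sandboxes:
            # Deduplicate: only the most recently created sandbox matters; an old
            # duplicate in SANDBOX_NOTREADY state must not mask the ready one.
            newest = max(sandboxes, key=lambda s: int(s["createdAt"]))
            if newest["state"] == "SANDBOX_READY":
                return True
        time.sleep(sleep)
    return False
```

Without that deduplication, picking an arbitrary matching entry can report a Pod as not ready even though its replacement is running fine; the actual fix for the module is in #3828.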

Labels

  • kind:bug: Something isn't working
  • topic:lifecycle: Issues related to upgrade or downgrade of MetalK8s
  • topic:salt: Everything related to SaltStack in our product