Upgrades to 123.0.0 can fail after unneeded kubelet restart #3827
Description
Component: salt
What happened:
On a 3-node upgrade, where multiple registry "replicas" are configured but only the bootstrap node (192.168.1.100 in this example) has the 123.0.0 archive, the rolling update of kube-apiserver fails on node-2 (192.168.1.102) with:
salt.exceptions.CommandExecutionError: Check availability of package container-selinux failed:
[...]
http://192.168.1.102:8080/metalk8s-123.0.0/redhat/7/metalk8s-epel-el7/repodata/repomd.xml: [Errno 14] HTTP Error 404 - Not Found
[...]
http://192.168.1.101:8080/metalk8s-123.0.0/redhat/7/metalk8s-epel-el7/repodata/repomd.xml: [Errno 14] HTTP Error 404 - Not Found
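The 404s above can be reproduced directly from node-2 by probing each mirror for the target version's repodata. Below is a minimal sketch of such a probe, assuming the node IPs, port and URL layout taken from the error output (adjust for another deployment):

```python
import urllib.error
import urllib.request

# Mirror IPs from the report: bootstrap (.100), node-1 (.101), node-2 (.102).
MIRRORS = ["192.168.1.100", "192.168.1.101", "192.168.1.102"]
REPOMD = "http://{ip}:8080/metalk8s-123.0.0/redhat/7/metalk8s-epel-el7/repodata/repomd.xml"

for ip in MIRRORS:
    try:
        with urllib.request.urlopen(REPOMD.format(ip=ip), timeout=5) as resp:
            print(f"{ip}: HTTP {resp.status}")  # only the bootstrap node should answer 200
    except urllib.error.HTTPError as exc:
        print(f"{ip}: HTTP {exc.code}")  # replicas without the 123.0.0 archive answer 404
    except urllib.error.URLError as exc:
        print(f"{ip}: unreachable ({exc.reason})")
```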
Analysis:
The issue is caused by two main problems:
- when running metalk8s.orchestrate.apiserver to perform the rolling upgrade, kubelet restarts because of incomplete logic in cri.wait_pod (fixed in "salt: Handle duplicates in cri.wait_pod" #3828), which ends up marking the repositories-bootstrap Pod as not ready and removing it from the endpoints before the upgrade runs on node-2; at this point, node-2 sees no mirror with the 123.0.0 version of metalk8s-epel, which causes the failure
- there should not be a situation where this registry "HA setup" is in an incoherent state prior to running an upgrade: we need to implement something to manage the replicas properly (a sketch of one possible pre-flight check follows below)