Upgrades to 123.0.0 can fail after unneeded kubelet restart #3827
Description
Component: salt
What happened:
On a 3-node upgrade, where multiple registry "replicas" are configured but only the bootstrap node (192.168.1.100 in this example) has the 123.0.0 archive, the rolling update of kube-apiserver fails on node-2 (192.168.1.102) with:
salt.exceptions.CommandExecutionError: Check availability of package container-selinux failed:
[...]
http://192.168.1.102:8080/metalk8s-123.0.0/redhat/7/metalk8s-epel-el7/repodata/repomd.xml: [Errno 14] HTTP Error 404 - Not Found
[...]
http://192.168.1.101:8080/metalk8s-123.0.0/redhat/7/metalk8s-epel-el7/repodata/repomd.xml: [Errno 14] HTTP Error 404 - Not Found
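The 404s above can be reproduced directly from node-2 by probing each mirror for the target version's repodata. Below is a minimal sketch of such a probe, assuming the node IPs, port and URL layout taken from the error output (adjust for another deployment):

```python
import urllib.error
import urllib.request

# Mirror IPs from the report: bootstrap (.100), node-1 (.101), node-2 (.102).
MIRRORS = ["192.168.1.100", "192.168.1.101", "192.168.1.102"]
REPOMD = "http://{ip}:8080/metalk8s-123.0.0/redhat/7/metalk8s-epel-el7/repodata/repomd.xml"

for ip in MIRRORS:
    try:
        with urllib.request.urlopen(REPOMD.format(ip=ip), timeout=5) as resp:
            print(f"{ip}: HTTP {resp.status}")  # only the bootstrap node should answer 200
    except urllib.error.HTTPError as exc:
        print(f"{ip}: HTTP {exc.code}")  # replicas without the 123.0.0 archive answer 404
    except urllib.error.URLError as exc:
        print(f"{ip}: unreachable ({exc.reason})")
```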
Analysis:
The issue is caused by two main problems:
- when running metalk8s.orchestrate.apiserver to perform the rolling upgrade, kubelet restarts because of incomplete logic in cri.wait_pod (fixed in "salt: Handle duplicates in cri.wait_pod" #3828), which ends up marking the repositories-bootstrap Pod as not ready and removing it from the endpoints before the upgrade runs on node-2; at this point, node-2 sees no mirror with the 123.0.0 version of metalk8s-epel, which causes the failure
- there should not be a situation where this registry "HA setup" is in an incoherent state prior to running an upgrade: we need to implement something to manage the replicas properly (a sketch of one possible pre-flight check follows below)