[BUG] Resources such as replicas are somehow not mutated when network is unstable #5762
Description
Describe the bug (🐛 if you encounter this issue)
There are two tickets, #5582 and #5613 (comment), reporting the symptom.
The issue surfaced after introducing #3917. According to the support bundles, the network was unstable in both cases. The replicas that were created or updated did not have the longhornvolume label, which prevented the listReplicas function from detecting and listing them and led to the replicas being unmanageable.
The root cause has not yet been identified. In the meantime, we can improve resilience by:
- Validating the labels of resources, as they play a critical role in the control plane
- Always mutating the labels when updating resources
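The two hardening ideas above could be sketched roughly as follows. This is only an illustration and not Longhorn's actual webhook code: the function names ensureReplicaLabels and validateReplicaLabels are hypothetical, and it assumes the longhornvolume label carries the volume name and the longhornnode label carries a node identifier.

```go
package main

import (
	"errors"
	"fmt"
)

// Label keys mentioned in this issue. What their values hold is an assumption.
const (
	labelVolume = "longhornvolume"
	labelNode   = "longhornnode"
)

// ensureReplicaLabels (hypothetical helper) returns a copy of labels with the
// required keys always set, so an update can never drop them.
func ensureReplicaLabels(labels map[string]string, volumeName, nodeID string) map[string]string {
	out := make(map[string]string, len(labels)+2)
	for k, v := range labels {
		out[k] = v
	}
	out[labelVolume] = volumeName
	out[labelNode] = nodeID
	return out
}

// validateReplicaLabels (hypothetical helper) rejects a resource whose volume
// label is missing or does not match the expected volume name.
func validateReplicaLabels(labels map[string]string, volumeName string) error {
	got, ok := labels[labelVolume]
	if !ok {
		return errors.New("missing required label " + labelVolume)
	}
	if got != volumeName {
		return fmt.Errorf("label %s is %q, want %q", labelVolume, got, volumeName)
	}
	return nil
}

func main() {
	labels := ensureReplicaLabels(map[string]string{"app": "demo"}, "vol-1", "node-1")
	fmt.Println(validateReplicaLabels(labels, "vol-1"))
	fmt.Println(validateReplicaLabels(map[string]string{}, "vol-1"))
}
```

Running the mutation on every update, rather than only on create, means a resource that slipped through once without labels would be repaired the next time it is written.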
cc @weizhe0422
To Reproduce
Steps to reproduce the behavior:
- Go to '...'
- Click on '....'
- Perform '....'
- See error
Expected behavior
A clear and concise description of what you expected to happen.
Log or Support bundle
If applicable, add the Longhorn managers' log or support bundle when the issue happens.
You can generate a Support Bundle using the link at the footer of the Longhorn UI.
Environment
- Longhorn version:
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
- Number of management node in the cluster:
- Number of worker node in the cluster:
- Node config
- OS type and version:
- CPU per node:
- Memory per node:
- Disk type(e.g. SSD/NVMe):
- Network bandwidth between the nodes:
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
- Number of Longhorn volumes in the cluster:
Workaround
You can delete the stopped replicas whose label contains longhornnode: "null" with the kubectl command.
These stopped replicas are not managed by the Longhorn system because of #5762. Because they lack the label, Longhorn cannot recognize and delete them.
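The selection rule described above (stopped replicas whose longhornnode label is "null") can be expressed as a small filter. This is a sketch under assumptions: the replicaInfo type and the orphanedReplicas function are hypothetical, and the state string "stopped" is assumed to be how a stopped replica is reported.

```go
package main

import "fmt"

// replicaInfo is a hypothetical, simplified view of a replica resource.
type replicaInfo struct {
	Name   string
	State  string // assumed to be "stopped" for a stopped replica
	Labels map[string]string
}

// orphanedReplicas returns the names of replicas that match the workaround's
// criteria: stopped, and labeled longhornnode: "null".
func orphanedReplicas(replicas []replicaInfo) []string {
	var names []string
	for _, r := range replicas {
		if r.State == "stopped" && r.Labels["longhornnode"] == "null" {
			names = append(names, r.Name)
		}
	}
	return names
}

func main() {
	replicas := []replicaInfo{
		{Name: "r-1", State: "stopped", Labels: map[string]string{"longhornnode": "null"}},
		{Name: "r-2", State: "running", Labels: map[string]string{"longhornnode": "node-1"}},
		{Name: "r-3", State: "stopped", Labels: map[string]string{"longhornnode": "node-2"}},
	}
	// Only candidates are listed here; deletion itself is left to kubectl.
	fmt.Println(orphanedReplicas(replicas))
}
```

Listing the candidates first and reviewing them before deleting anything keeps the cleanup from touching replicas that are stopped for legitimate reasons.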
- To avoid running into the issue again, run
kubectl edit mutatingwebhookconfigurations longhorn-webhook-mutator
Then change failurePolicy from Ignore to Fail.
- For existing wrong replicas, you have to delete them manually; see "[BUG] Longhorn show different counts of replicas for the same node" #6179 (comment).
One user shares a way to delete them in #6179 (comment). You can check and use it.
Additional context
Add any other context about the problem here.
Status
Closed