[IMPROVEMENT] Longhorn Manager failed to sync disk status for the beginning minutes #10098
Description
Describe the bug
The Longhorn manager keeps complaining about mismatching disks during the first minutes after it starts.
[pod/longhorn-manager-w7t6s/longhorn-manager] time="2024-12-30T08:47:18Z" level=info msg="Failed to sync with disk monitor due to mismatching disks" func="controller.(*NodeController).syncNode" file="node_controller.go:453" controller=longhorn-node error="mismatching disks in node resource object and monitor collected data" node=ubuntu-k3s-worker1
[pod/longhorn-manager-w7t6s/longhorn-manager] time="2024-12-30T08:47:18Z" level=info msg="Failed to sync with disk monitor due to mismatching disks" func="controller.(*NodeController).syncNode" file="node_controller.go:453" controller=longhorn-node error="mismatching disks in node resource object and monitor collected data" node=ubuntu-k3s-worker1
[pod/longhorn-manager-w7t6s/longhorn-manager] time="2024-12-30T08:47:22Z" level=info msg="Failed to sync with disk monitor due to mismatching disks" func="controller.(*NodeController).syncNode" file="node_controller.go:453" controller=longhorn-node error="mismatching disks in node resource object and monitor collected data" node=ubuntu-k3s-worker1
[pod/longhorn-manager-w7t6s/longhorn-manager] time="2024-12-30T08:47:23Z" level=info msg="Failed to sync with disk monitor due to mismatching disks" func="controller.(*NodeController).syncNode" file="node_controller.go:453" controller=longhorn-node error="mismatching disks in node resource object and monitor collected data" node=ubuntu-k3s-worker1
[pod/longhorn-manager-w7t6s/longhorn-manager] time="2024-12-30T08:47:23Z" level=info msg="Failed to sync with disk monitor due to mismatching disks" func="controller.(*NodeController).syncNode" file="node_controller.go:453" controller=longhorn-node error="mismatching disks in node resource object and monitor collected data" node=ubuntu-k3s-worker1
[pod/longhorn-manager-w7t6s/longhorn-manager] time="2024-12-30T08:47:44Z" level=info msg="Failed to sync with disk monitor due to mismatching disks" func="controller.(*NodeController).syncNode" file="node_controller.go:453" controller=longhorn-node error="mismatching disks in node resource object and monitor collected data" node=ubuntu-k3s-worker1
The root cause is that the node controller runs two threads: the reconciling thread launches the disk monitor and immediately tries to aggregate its data, expecting status entries that correspond to the disks listed in the Longhorn Node resource, but it keeps receiving an empty result from the monitor. The monitor actually schedules its initial collection only after 30 seconds. The mismatch error then breaks the subsequent node status sync logic. The problem recovers automatically about 30 seconds after startup. A minimal sketch of this timing race follows.
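The Go sketch below is illustrative only, not Longhorn's actual code: it assumes a monitor whose first collection is deferred by 30 seconds, while the controller reconciles immediately after launching it, which reproduces the "mismatching disks" error until the first collection lands. All names (diskMonitor, syncNode, "default-disk") are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type diskMonitor struct {
	mu   sync.Mutex
	data map[string]string // disk name -> status; empty until first collection
}

// newDiskMonitor schedules the initial collection 30 seconds after start,
// mirroring the delay described in the root cause.
func newDiskMonitor() *diskMonitor {
	m := &diskMonitor{data: map[string]string{}}
	time.AfterFunc(30*time.Second, func() {
		m.mu.Lock()
		defer m.mu.Unlock()
		m.data["default-disk"] = "ready"
	})
	return m
}

// syncNode compares the disks listed in the Node resource against the
// monitor's collected data and fails when they do not match.
func syncNode(nodeDisks []string, m *diskMonitor) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if len(m.data) != len(nodeDisks) {
		return errors.New("mismatching disks in node resource object and monitor collected data")
	}
	return nil
}

func main() {
	m := newDiskMonitor()
	// Reconciling right after launching the monitor hits the empty result
	// and reproduces the error for the first 30 seconds.
	if err := syncNode([]string{"default-disk"}, m); err != nil {
		fmt.Println("sync failed:", err)
	}
}
```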
To Reproduce
Restart the Longhorn manager:
kubectl -n longhorn-system rollout restart ds/longhorn-manager
Then check the logs of the longhorn-manager pods:
kubectl -n longhorn-system logs ds/longhorn-manager -f --all-pods=true | grep 'mismatching disks'
Expected behavior
The Longhorn manager's disk check should work during the initial minutes after startup so that the subsequent node status checks are not broken by the disk mismatch error. One possible direction is sketched below.
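A hedged sketch of one possible mitigation (not the project's decided fix): treat an empty monitor result during startup as "initial collection pending" and requeue the node instead of reporting a mismatch. The function and parameter names are illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// syncWithDiskMonitor returns requeue=true when the monitor has produced no
// data yet, so the caller can retry later without logging a mismatch error.
func syncWithDiskMonitor(nodeDisks []string, monitorData map[string]string) (requeue bool, err error) {
	if len(monitorData) == 0 && len(nodeDisks) > 0 {
		return true, nil // initial collection pending; retry on next reconcile
	}
	if len(monitorData) != len(nodeDisks) {
		return false, errors.New("mismatching disks in node resource object and monitor collected data")
	}
	return false, nil
}

func main() {
	// During the first ~30 seconds the monitor data is empty, so the node is
	// simply requeued instead of failing the status sync.
	requeue, err := syncWithDiskMonitor([]string{"default-disk"}, map[string]string{})
	fmt.Println(requeue, err) // true <nil>
}
```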
Environment
- Longhorn version: 1.7.2, 1.8.0-rc1
- Impacted volume (PV):
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.31.4+k3s1
- Number of control plane nodes in the cluster: 1
- Number of worker nodes in the cluster: 3
- Node config
- OS type and version: Ubuntu 24.04
- Kernel version: 6.8.0
- CPU per node: 3
- Memory per node: 8GB
- Disk type (e.g. SSD/NVMe/HDD):
- Network bandwidth between the nodes (Gbps):
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Vagrant VirtualBox
- Number of Longhorn volumes in the cluster: 0
Additional context
Workaround and Mitigation