[IMPROVEMENT] Longhorn Manager failed to sync disk status for the beginning minutes #10098

Open
@COLDTURNIP

Description

Describe the bug

Longhorn manager keeps complaining about mismatching disks during the first minutes after it starts.

[pod/longhorn-manager-w7t6s/longhorn-manager] time="2024-12-30T08:47:18Z" level=info msg="Failed to sync with disk monitor due to mismatching disks" func="controller.(*NodeController).syncNode" file="node_controller.go:453" controller=longhorn-node error="mismatching disks in node resource object and monitor collected data" node=ubuntu-k3s-worker1
[pod/longhorn-manager-w7t6s/longhorn-manager] time="2024-12-30T08:47:18Z" level=info msg="Failed to sync with disk monitor due to mismatching disks" func="controller.(*NodeController).syncNode" file="node_controller.go:453" controller=longhorn-node error="mismatching disks in node resource object and monitor collected data" node=ubuntu-k3s-worker1
[pod/longhorn-manager-w7t6s/longhorn-manager] time="2024-12-30T08:47:22Z" level=info msg="Failed to sync with disk monitor due to mismatching disks" func="controller.(*NodeController).syncNode" file="node_controller.go:453" controller=longhorn-node error="mismatching disks in node resource object and monitor collected data" node=ubuntu-k3s-worker1
[pod/longhorn-manager-w7t6s/longhorn-manager] time="2024-12-30T08:47:23Z" level=info msg="Failed to sync with disk monitor due to mismatching disks" func="controller.(*NodeController).syncNode" file="node_controller.go:453" controller=longhorn-node error="mismatching disks in node resource object and monitor collected data" node=ubuntu-k3s-worker1
[pod/longhorn-manager-w7t6s/longhorn-manager] time="2024-12-30T08:47:23Z" level=info msg="Failed to sync with disk monitor due to mismatching disks" func="controller.(*NodeController).syncNode" file="node_controller.go:453" controller=longhorn-node error="mismatching disks in node resource object and monitor collected data" node=ubuntu-k3s-worker1
[pod/longhorn-manager-w7t6s/longhorn-manager] time="2024-12-30T08:47:44Z" level=info msg="Failed to sync with disk monitor due to mismatching disks" func="controller.(*NodeController).syncNode" file="node_controller.go:453" controller=longhorn-node error="mismatching disks in node resource object and monitor collected data" node=ubuntu-k3s-worker1

The root cause is a timing gap between two threads in the node controller. The reconciling thread launches the disk monitor and immediately aggregates its data, expecting the collected status to correspond to the disks listed in the lhn (Longhorn node) resource, but it keeps receiving an empty result. The monitor actually schedules its initial collection 30 seconds after it starts. The resulting mismatch error then breaks the node status sync logic that follows it. The problem recovers automatically about 30 seconds after the manager starts.

To Reproduce

Restart the Longhorn manager:

kubectl -n longhorn-system rollout restart ds/longhorn-manager

Then check the logs of longhorn-manager pods.

kubectl -n longhorn-system logs ds/longhorn-manager -f --all-pods=true | grep 'mismatching disks'

Expected behavior

The Longhorn manager's disk check should work during the first minutes after startup, so that the environment checks that follow are not broken by the disk mismatch error.

Environment

  • Longhorn version: 1.7.2, 1.8.0-rc1
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.31.4+k3s1
    • Number of control plane nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: Ubuntu 24.04
    • Kernel version: 6.8.0
    • CPU per node: 3
    • Memory per node: 8GB
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Vagrant VirtualBox
  • Number of Longhorn volumes in the cluster: 0

Additional context

Workaround and Mitigation

Metadata

Labels

  • area/user-experience: Improving user experience
  • component/longhorn-manager: Longhorn manager (control plane)
  • kind/improvement: Request for improvement of existing function
  • require/backport: Require backport. Only used when the specific versions to backport have not been defined.
  • require/qa-review-coverage: Require QA to review coverage
