[BUG] All volumes degraded at the same time #10091

Open
Hr46ph opened this issue Dec 28, 2024 · 16 comments
Labels
area/environment-issue User-specific related issues, ex: network, DNS, host packages, etc. kind/bug require/backport Require backport. Only used when the specific versions to backport have not been defined. require/qa-review-coverage Require QA to review coverage

Comments

@Hr46ph

Hr46ph commented Dec 28, 2024

Describe the bug

Without any apparent reason, all volumes suddenly became degraded: some with one failed replica, others with two. Several pods are left without data locality.

To Reproduce

I have no idea how to reproduce it; it just happened (not for the first time, I might add).

Expected behavior

Reliability. This stuff scares me.

Support bundle for troubleshooting

Will add after posting, as I need a ticket number.

Environment

  • Longhorn version: 1.7.2
  • Impacted volume (PV): All of them
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: Talos Linux 1.8.0
    • Number of control plane nodes in the cluster: 3
    • Number of worker nodes in the cluster: 6 including control plane nodes
  • Node config
    • OS type and version: talos 1.8.0
    • Kernel version: 6.6.60-talos
    • CPU per node: 6
    • Memory per node: 64GB
    • Disk type (e.g. SSD/NVMe/HDD): NVMe
    • Network bandwidth between the nodes (Gbps): 1Gbps
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
  • Number of Longhorn volumes in the cluster: 19

Additional context

I restarted a Postgres pod and went to check its actual size in the Longhorn GUI, and all volumes went degraded at once.

Workaround and Mitigation

Patience, I hope. It's still recovering.

@Hr46ph Hr46ph added kind/bug require/backport Require backport. Only used when the specific versions to backport have not been defined. require/qa-review-coverage Require QA to review coverage labels Dec 28, 2024
@github-project-automation github-project-automation bot moved this to New Issues in Longhorn Sprint Dec 28, 2024
@derekbit
Member

Can you provide a support bundle?

@longhorn-io-github-bot longhorn-io-github-bot moved this from New to In Progress in Community Review Sprint Dec 28, 2024
@derekbit derekbit moved this from In Progress to Waiting For Reporter in Community Review Sprint Dec 28, 2024
@Hr46ph
Author

Hr46ph commented Dec 28, 2024

Can you provide a support bundle?

Yes, I just did. I wanted to complete the bug form first so I could include the GitHub bug report link with the support bundle. I sent it to longhorn-support-bundle at suse dot com.

@longhorn-io-github-bot longhorn-io-github-bot moved this from Waiting For Reporter to In Progress in Community Review Sprint Dec 28, 2024
@derekbit
Member

One quick question: are the failed replicas on the same node?

@Hr46ph
Author

Hr46ph commented Dec 28, 2024

One quick question: are the failed replicas on the same node?

No, they were all over the place, mostly on wk01, 02 and 03 (I assume you can see the hostnames in the support bundle).

It did recover fairly quickly; only one volume is still rebuilding. Maybe I panicked too quickly, but I still don't think this should happen, so I really hope you can see why it did.

Thanks for the quick responses.

@derekbit
Member

Network bandwidth between the nodes (Gbps): 1Gbps

BTW, the network bandwidth is not suitable for Longhorn; a 10 Gbps NIC is recommended. Please check the best practices: https://longhorn.io/docs/1.7.2/best-practices/#volume-performance-optimization.

@derekbit derekbit moved this from In Progress to Waiting For Reporter in Community Review Sprint Dec 28, 2024
@Hr46ph
Author

Hr46ph commented Dec 28, 2024

Network bandwidth between the nodes (Gbps): 1Gbps

BTW, the network bandwidth is not suitable for Longhorn; a 10 Gbps NIC is recommended. Please check the best practices: https://longhorn.io/docs/1.7.2/best-practices/#volume-performance-optimization.

It's not a production cluster; it's one of the compromises to keep costs to a minimum. It's also a single disk, which is not recommended. Dedicated disks will be added in the near future, though.

@longhorn-io-github-bot longhorn-io-github-bot moved this from Waiting For Reporter to In Progress in Community Review Sprint Dec 28, 2024
@derekbit derekbit moved this from In Progress to Waiting For Reporter in Community Review Sprint Dec 28, 2024
@derekbit
Member

@Hr46ph Has the support bundle been provided?

@longhorn-io-github-bot longhorn-io-github-bot moved this from Waiting For Reporter to In Progress in Community Review Sprint Dec 29, 2024
@derekbit derekbit moved this from In Progress to Waiting For Reporter in Community Review Sprint Dec 29, 2024
@Hr46ph
Author

Hr46ph commented Dec 29, 2024

@Hr46ph Has the support bundle been provided?

Yes, I did so when I replied yesterday.

@longhorn-io-github-bot longhorn-io-github-bot moved this from Waiting For Reporter to In Progress in Community Review Sprint Dec 29, 2024
@derekbit
Member

Thanks @Hr46ph.
I've checked the support bundle. There are lots of I/O timeout errors. Besides Longhorn, other services were experiencing errors. I think it is a temporary network issue in the environment.

@Hr46ph
Author

Hr46ph commented Dec 29, 2024

Thanks @Hr46ph. I've checked the support bundle. There are lots of I/O timeout errors. Besides Longhorn, other services were experiencing errors. I think it is a temporary network issue in the environment.

In which log do you see those, so I can look into them further? The network is professional equipment with certified cabling. In the network management software I don't see errors (which doesn't mean there aren't other issues).

If I grep -ri over the extracted support bundle I see many hits on timeouts, but most look harmless.

If you see anything specific, could you please share it so I can look into it?

Thanks!

@derekbit derekbit added the area/environment-issue User-specific related issues, ex: network, DNS, host packages, etc. label Dec 29, 2024
@derekbit
Member

derekbit commented Dec 30, 2024

@Hr46ph

You can check messages like:
... level=error msg="R/W Timeout. No response received in 8s" func="dataconn.(*Client).loop" file="client.go:168"....
The latencies between the engines and their downstream replicas exceeded 8 seconds, which led to the failed replicas. Were there any I/O spikes during this period? This phenomenon usually happens in low-bandwidth NIC environments.
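
For reference, one rough way to locate those messages in an extracted support bundle (a sketch only; the exact directory layout inside the bundle can differ between bundle versions):

# search every log in the extracted bundle for the 8s dataconn timeout
grep -ri "R/W Timeout" <extracted-bundle-dir> | grep dataconn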

@innobead
Member

For this kind of low-spec/non-prod environment, you might tune the timeout via https://longhorn.io/docs/1.7.2/references/settings/#engine-replica-timeout.
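
For example, a minimal sketch of raising that setting with kubectl, assuming the default longhorn-system namespace (the value is in seconds; the same change can also be made from the Longhorn UI under Settings):

# raise the engine-to-replica I/O timeout from the default 8s to 30s
kubectl -n longhorn-system patch settings.longhorn.io engine-replica-timeout \
  --type merge -p '{"value": "30"}'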

@innobead innobead moved this from In Progress to Waiting For Reporter in Community Review Sprint Dec 30, 2024
@longhorn-io-github-bot longhorn-io-github-bot moved this from Waiting For Reporter to In Progress in Community Review Sprint Dec 30, 2024
@Hr46ph
Author

Hr46ph commented Dec 30, 2024

I made a couple of changes; I think I have been asking too much of Longhorn.

The timeouts you are referring to were all towards a single longhorn manager instance.

Around 9 AM that particular Longhorn manager pod crashed:

longhorn-manager-kg4vl
    State:          Running
      Started:      Sat, 28 Dec 2024 08:53:56 +0100
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Sat, 30 Nov 2024 15:03:49 +0100
      Finished:     Sat, 28 Dec 2024 08:52:29 +0100

From the logs I can see that before it crashed, it was attempting to hash snapshots but gave warnings that another node was already doing that.

In Grafana I noticed several instance-manager pods were capped at just below 1 core of CPU usage, which more or less matches the 12% default guaranteed CPU in the Longhorn settings (12% of a node's 6 cores is about 0.72 of a core, i.e. roughly 75% of a single core, so it's close enough I suppose).

I can't really see a reason for the termination of that longhorn-manager pod, but I do think my recurring jobs might have been asking too much of Longhorn.

So I made the following changes (a rough kubectl sketch of two of them follows this list):
  • Increased the Guaranteed Instance Manager CPU for the V1 Data Engine to 33% (about 2 cores).
  • Increased the engine replica timeout to 30 seconds, as suggested.
  • Decreased concurrency on the snapshot, trim and delete cron jobs from 5 to 2.
  • Decreased the worker threads for recurring jobs from 3 to 2.
  • Created a new cron job that takes a snapshot every 4 hours instead of every hour.
  • Removed the snapshot cron job from several volumes that don't actually need hourly snapshotting.
  • Changed several volumes from hourly snapshots to 4-hour snapshots.
  • Disabled filesystem freeze during snapshots. I enabled that because I thought it was a good idea to ensure data consistency, but that might have been asking too much of this setup.

All in all, I think I might have caused this issue myself by configuring things too aggressively and asking too much of Longhorn 😳.

With these more relaxed settings in place I will monitor. Thanks for the assistance, I appreciate it!

@Hr46ph Hr46ph closed this as completed Dec 30, 2024
@github-project-automation github-project-automation bot moved this from New Issues to Closed in Longhorn Sprint Dec 30, 2024
@longhorn-io-github-bot longhorn-io-github-bot moved this from In Progress to Closed in Community Review Sprint Dec 30, 2024
@derekbit derekbit moved this from Closed to Waiting For Reporter in Community Review Sprint Dec 30, 2024
@derekbit derekbit moved this from Waiting For Reporter to In Progress in Community Review Sprint Dec 30, 2024
@derekbit
Member

@Hr46ph

I cannot find any clues related to the termination of the longhorn-manager-kg4vl pod. I also found that the kubelet.log and k3s logs are all empty. cc @c3y1huang Do you know the root cause?

longhorn-manager-kg4vl is a control-plane pod, so its termination won't impact the data plane's I/O.
You mentioned high CPU usage due to snapshot checksum hashing. I'm wondering if the full story is:

snapshot checksum hashing (background I/O) -> disk is busy -> disk cannot serve requests from applications (foreground I/O) within 8 seconds -> failed I/O

For the CPU usage caused by snapshot checksum hashing, we will improve the background I/O throttling.
If the issue is triggered again, you can try disabling snapshot checksum hashing and see if the issue remains.
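
If it comes to that, a sketch of one way to disable it globally via the snapshot-data-integrity setting (valid values include disabled, enabled and fast-check; it can also be overridden per volume):

# turn off snapshot checksum hashing cluster-wide
kubectl -n longhorn-system patch settings.longhorn.io snapshot-data-integrity \
  --type merge -p '{"value": "disabled"}'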

@derekbit derekbit moved this from Closed to New Issues in Longhorn Sprint Dec 30, 2024
@derekbit derekbit reopened this Dec 30, 2024
@github-project-automation github-project-automation bot moved this from New Issues to Implement in Longhorn Sprint Dec 30, 2024
@derekbit
Member

Let's keep the ticket open to track the background I/O throttling improvement.

@Hr46ph
Author

Hr46ph commented Dec 30, 2024

snapshot checksum hashing (background I/O) -> disk is busy -> disk cannot serve requests from applications (foreground I/O) within 8 seconds -> failed I/O

Possibly, because I had 5 concurrent jobs (now reduced to 2). I didn't realize that hashing snapshots, or checking the hashes, could cause a lot of disk I/O.

Having said that, I don't really see that much IOPS or throughput, but it might have been a combination of things, such as the limiting of CPU usage.

For the CPU usage caused by snapshot checksum hashing, we will improve the background I/O throttling. If the issue is triggered again, you can try disabling snapshot checksum hashing and see if the issue remains.

75% of a single core isn't that much, hence I increased it a bit. In combination with fewer concurrent actions this should hopefully help the situation. I will keep an eye on it!

@derekbit derekbit moved this from In Progress to Waiting For Reporter in Community Review Sprint Dec 30, 2024