[BUG] All volumes degraded at the same time #10091

Open
Hr46ph opened this issue Dec 28, 2024 · 16 comments
Labels
area/environment-issue User-specific related issues, ex: network, DNS, host packages, etc. kind/bug require/backport Require backport. Only used when the specific versions to backport have not been defined. require/qa-review-coverage Require QA to review coverage

Comments

@Hr46ph

Hr46ph commented Dec 28, 2024

Describe the bug

Without any apparent reason, all volumes suddenly became degraded: some with one failed replica, others with two. Several pods are left without data locality.

To Reproduce

I have no idea how to reproduce it; it just happened (not for the first time, I might add).

Expected behavior

Reliability. This stuff scares me.

Support bundle for troubleshooting

Will add after posting, as I need a ticket number.

Environment

  • Longhorn version: 1.7.2
  • Impacted volume (PV): All of them
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: Talos Linux 1.8.0
    • Number of control plane nodes in the cluster: 3
    • Number of worker nodes in the cluster: 6 including control plane nodes
  • Node config
    • OS type and version: talos 1.8.0
    • Kernel version: 6.6.60-talos
    • CPU per node: 6
    • Memory per node: 64GB
    • Disk type (e.g. SSD/NVMe/HDD): NVMe
    • Network bandwidth between the nodes (Gbps): 1Gbps
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
  • Number of Longhorn volumes in the cluster: 19

Additional context

I restarted a Postgres pod and went to check its actual size in the Longhorn GUI, and all volumes went degraded at once.

Workaround and Mitigation

Patience, I hope. It's still recovering.

@Hr46ph Hr46ph added kind/bug require/backport Require backport. Only used when the specific versions to backport have not been defined. require/qa-review-coverage Require QA to review coverage labels Dec 28, 2024
@github-project-automation github-project-automation bot moved this to New Issues in Longhorn Sprint Dec 28, 2024
@derekbit
Member

Can you provide a support bundle?

@longhorn-io-github-bot longhorn-io-github-bot moved this from New to In Progress in Community Review Sprint Dec 28, 2024
@derekbit derekbit moved this from In Progress to Waiting For Reporter in Community Review Sprint Dec 28, 2024
@Hr46ph
Author

Hr46ph commented Dec 28, 2024

Can you provide a support bundle?

Yes, I just did. I wanted to complete the bug form first so I could include the GitHub bug report link with the support bundle. I sent it to longhorn-support-bundle at suse dot com.

@longhorn-io-github-bot longhorn-io-github-bot moved this from Waiting For Reporter to In Progress in Community Review Sprint Dec 28, 2024
@derekbit
Member

One quick question: are the failed replicas on the same node?

@Hr46ph
Author

Hr46ph commented Dec 28, 2024

One quick question: are the failed replicas on the same node?

No, they were all over the place, mostly on wk01, 02 and 03 (I assume you can see the hostnames in the support bundle).

It did recover fairly quickly; only one volume is still rebuilding. Maybe I panicked too quickly, but I still don't think this should happen, so I really hope you can see why it did.

Thanks for the quick responses.

@derekbit
Member

Network bandwidth between the nodes (Gbps): 1Gbps

BTW, the network bandwidth is not suitable for Longhorn; a 10 Gbps NIC is recommended. Please check the best practices: https://longhorn.io/docs/1.7.2/best-practices/#volume-performance-optimization.

@derekbit derekbit moved this from In Progress to Waiting For Reporter in Community Review Sprint Dec 28, 2024
@Hr46ph
Author

Hr46ph commented Dec 28, 2024

Network bandwidth between the nodes (Gbps): 1Gbps

BTW, the network bandwidth is not suitable for Longhorn; a 10 Gbps NIC is recommended. Please check the best practices: https://longhorn.io/docs/1.7.2/best-practices/#volume-performance-optimization.

It's not a production cluster; it's one of the compromises to keep costs to a minimum. It's also a single disk, which is not recommended. Dedicated disks will be added in the near future, though.

@longhorn-io-github-bot longhorn-io-github-bot moved this from Waiting For Reporter to In Progress in Community Review Sprint Dec 28, 2024
@derekbit derekbit moved this from In Progress to Waiting For Reporter in Community Review Sprint Dec 28, 2024
@derekbit
Member

@Hr46ph Has the support bundle been provided?

@longhorn-io-github-bot longhorn-io-github-bot moved this from Waiting For Reporter to In Progress in Community Review Sprint Dec 29, 2024
@derekbit derekbit moved this from In Progress to Waiting For Reporter in Community Review Sprint Dec 29, 2024
@Hr46ph
Author

Hr46ph commented Dec 29, 2024

@Hr46ph Has the support bundle been provided?

Yes, I did so when I replied yesterday.

@longhorn-io-github-bot longhorn-io-github-bot moved this from Waiting For Reporter to In Progress in Community Review Sprint Dec 29, 2024
@derekbit
Member

Thanks @Hr46ph.
I've checked the support bundle. There are lots of I/O timeout errors. Besides Longhorn, other services were experiencing errors. I think it is a temporary network issue in the environment.

@Hr46ph
Author

Hr46ph commented Dec 29, 2024

Thanks @Hr46ph. I've checked the support bundle. There are lots of I/O timeout errors. Besides Longhorn, other services were experiencing errors. I think it is a temporary network issue in the environment.

In which log do you see those, so I can look into them further? The network is professional equipment with certified cabling. In the network management software I don't see errors (which doesn't mean there aren't other issues).

If I grep -ri over the extracted support bundle I see many hits on timeouts, but most look harmless.

If you see anything specific, could you please share it so I can look into it?

Thanks!

@derekbit derekbit added the area/environment-issue User-specific related issues, ex: network, DNS, host packages, etc. label Dec 29, 2024
@derekbit
Member

derekbit commented Dec 30, 2024

@Hr46ph

You can check messages like:
... level=error msg="R/W Timeout. No response received in 8s" func="dataconn.(*Client).loop" file="client.go:168"....
The latencies between the engines and their downstream replicas exceeded 8 seconds, which led to the failed replicas. Were there any I/O spikes during this period? This phenomenon usually happens in low-bandwidth NIC environments.
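
For reference, one rough way to locate those messages in an extracted support bundle (a sketch only; the exact directory layout inside the bundle can differ between bundle versions):

# search every log in the extracted bundle for the 8s dataconn timeout
grep -ri "R/W Timeout" <extracted-bundle-dir> | grep dataconn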

@innobead
Member

For this kind of low-spec/non-prod environment, you might tune the timeout via https://longhorn.io/docs/1.7.2/references/settings/#engine-replica-timeout.
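
For example, a minimal sketch of raising that setting with kubectl, assuming the default longhorn-system namespace (the value is in seconds; the same change can also be made from the Longhorn UI under Settings):

# raise the engine-to-replica I/O timeout from the default 8s to 30s
kubectl -n longhorn-system patch settings.longhorn.io engine-replica-timeout \
  --type merge -p '{"value": "30"}'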

@innobead innobead moved this from In Progress to Waiting For Reporter in Community Review Sprint Dec 30, 2024
@longhorn-io-github-bot longhorn-io-github-bot moved this from Waiting For Reporter to In Progress in Community Review Sprint Dec 30, 2024
@Hr46ph
Author

Hr46ph commented Dec 30, 2024

I made a couple of changes; I think I have been asking too much of Longhorn.

The timeouts you are referring to were all towards a single longhorn manager instance.

Around 9 AM that particular Longhorn manager pod crashed:

longhorn-manager-kg4vl
    State:          Running
      Started:      Sat, 28 Dec 2024 08:53:56 +0100
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Sat, 30 Nov 2024 15:03:49 +0100
      Finished:     Sat, 28 Dec 2024 08:52:29 +0100

From the logs I can see that before it crashed, it was attempting to hash snapshots but gave warnings that another node was already doing that.

In Grafana I noticed several instance-manager pods were capped at just below 1 core of CPU usage, which more or less matches the 12% default guaranteed CPU in the Longhorn settings (12% of a node's 6 cores is about 0.72 of a core, i.e. roughly 75% of a single core, so it's close enough I suppose).

I can't really see a reason for the termination of that longhorn-manager pod, but I do think my recurring jobs might have been asking too much of Longhorn.

So I made the following changes (a rough kubectl sketch of two of them follows this list):
  • Increased the Guaranteed Instance Manager CPU for the V1 Data Engine to 33% (about 2 cores).
  • Increased the engine replica timeout to 30 seconds, as suggested.
  • Decreased concurrency on the snapshot, trim and delete cron jobs from 5 to 2.
  • Decreased the worker threads for recurring jobs from 3 to 2.
  • Created a new cron job that takes a snapshot every 4 hours instead of every hour.
  • Removed the snapshot cron job from several volumes that don't actually need hourly snapshotting.
  • Changed several volumes from hourly snapshots to 4-hour snapshots.
  • Disabled filesystem freeze during snapshots. I enabled that because I thought it was a good idea to ensure data consistency, but that might have been asking too much of this setup.

All in all, I think I might have caused this issue myself by configuring things too aggressively and asking too much of Longhorn 😳.

With these more relaxed settings in place I will monitor. Thanks for the assistance, I appreciate it!

@Hr46ph Hr46ph closed this as completed Dec 30, 2024
@github-project-automation github-project-automation bot moved this from New Issues to Closed in Longhorn Sprint Dec 30, 2024
@longhorn-io-github-bot longhorn-io-github-bot moved this from In Progress to Closed in Community Review Sprint Dec 30, 2024
@derekbit derekbit moved this from Closed to Waiting For Reporter in Community Review Sprint Dec 30, 2024
@derekbit derekbit moved this from Waiting For Reporter to In Progress in Community Review Sprint Dec 30, 2024
@derekbit
Member

@Hr46ph

I cannot find any clues related to the termination of the longhorn-manager-kg4vl pod. I also found that the kubelet.log and k3s logs are all empty. cc @c3y1huang Do you know the root cause?

longhorn-manager-kg4vl is a control-plane pod, so its termination won't impact the data plane's I/O.
You mentioned high CPU usage due to snapshot checksum hashing. I'm wondering if the full story is:

snapshot checksum hashing (background I/O) -> disk is busy -> disk cannot serve requests from applications (foreground I/O) within 8 seconds -> failed I/O

For the CPU usage caused by snapshot checksum hashing, we will improve the background I/O throttling.
If the issue is triggered again, you can try disabling snapshot checksum hashing and see if the issue remains.
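
If it comes to that, a sketch of one way to disable it globally via the snapshot-data-integrity setting (valid values include disabled, enabled and fast-check; it can also be overridden per volume):

# turn off snapshot checksum hashing cluster-wide
kubectl -n longhorn-system patch settings.longhorn.io snapshot-data-integrity \
  --type merge -p '{"value": "disabled"}'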

@derekbit derekbit moved this from Closed to New Issues in Longhorn Sprint Dec 30, 2024
@derekbit derekbit reopened this Dec 30, 2024
@github-project-automation github-project-automation bot moved this from New Issues to Implement in Longhorn Sprint Dec 30, 2024
@derekbit
Member

Let's keep the ticket open to track the background I/O throttling improvement.

@Hr46ph
Author

Hr46ph commented Dec 30, 2024

snapshot checksum hashing (background I/O) -> disk is busy -> disk cannot serve requests from applications (foreground I/O) within 8 seconds -> failed I/O

Possibly, because I had 5 concurrent jobs (now reduced to 2). I didn't realize that hashing snapshots, or checking the hashes, could cause a lot of disk I/O.

Having said that, I don't really see that much IOPS or throughput, but it might have been a combination of things, such as the limiting of CPU usage.

For the CPU usage caused by snapshot checksum hashing, we will improve the background I/O throttling. If the issue is triggered again, you can try disabling snapshot checksum hashing and see if the issue remains.

75% of a single core isn't that much, hence I increased it a bit. In combination with fewer concurrent actions this should hopefully help the situation. I will keep an eye on it!

@derekbit derekbit moved this from In Progress to Waiting For Reporter in Community Review Sprint Dec 30, 2024