Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] CSI components CrashLoopBackOff, failed to connect to unix://csi/csi.sock after cluster restart #7116

Closed
yangchiu opened this issue Nov 16, 2023 · 16 comments
Assignees
Labels
area/csi CSI related like control/node driver, sidecars area/stability System or volume stability area/upstream Upstream related like tgt upstream library backport/1.5.4 kind/bug priority/1 Highly recommended to implement or fix in this release (managed by PO) reproduce/rare < 50% reproducible require/backport Require backport. Only used when the specific versions to backport have not been definied.
Milestone

Comments

@yangchiu
Copy link
Member

Describe the bug (🐛 if you encounter this issue)

After cluster restart (reboot all k8s nodes including control plane node), all csi components got stuck in CrashLoopBackOff:

╰─$ kl get pods
NAME                                                     READY   STATUS             RESTARTS          AGE
longhorn-manager-pxlcx                                   1/1     Running            16 (10h ago)      15h
longhorn-manager-gr26g                                   1/1     Running            16 (10h ago)      15h
engine-image-ei-b907910b-m6gr4                           1/1     Running            16 (10h ago)      15h
instance-manager-bb7e4a2035b50ec3896800262c56800c        1/1     Running            0                 10h
engine-image-ei-b907910b-7vfnt                           1/1     Running            16 (10h ago)      15h
instance-manager-dca4e951d12ab6ab6f714f42f1a2416e        1/1     Running            0                 10h
csi-attacher-67f8c99bcd-9fqwj                            1/1     Running            20 (10h ago)      15h
csi-snapshotter-6f6bf9c757-t68d7                         1/1     Running            19 (10h ago)      15h
csi-resizer-545b7c64f5-7l2hv                             1/1     Running            19 (10h ago)      15h
csi-attacher-67f8c99bcd-2chq8                            1/1     Running            19 (10h ago)      15h
csi-provisioner-557c7f7c44-g2x77                         1/1     Running            20 (10h ago)      15h
longhorn-driver-deployer-7f94bb668f-s7zwg                1/1     Running            16 (10h ago)      15h
longhorn-csi-plugin-h5xqw                                3/3     Running            72 (10h ago)      15h
longhorn-ui-646f4bc8df-tbmtt                             1/1     Running            29 (10h ago)      15h
csi-resizer-545b7c64f5-gzm2z                             1/1     Running            19 (10h ago)      15h
csi-provisioner-557c7f7c44-2p6lc                         1/1     Running            19 (10h ago)      15h
engine-image-ei-b907910b-9ztgm                           1/1     Running            16 (10h ago)      15h
csi-snapshotter-6f6bf9c757-jqqp9                         1/1     Running            19 (10h ago)      15h
longhorn-csi-plugin-4sqkv                                3/3     Running            69 (10h ago)      15h
longhorn-manager-dfn6d                                   1/1     Running            16 (10h ago)      15h
instance-manager-0435193fed60526c87fd3a53fb03ba39        1/1     Running            0                 10h
share-manager-pvc-911b2b67-1e45-4f26-914f-23f2090af998   1/1     Running            0                 10h
share-manager-pvc-81c99bd1-b424-4be2-a075-80e8b17334ab   1/1     Running            0                 10h
longhorn-ui-646f4bc8df-c9tqj                             1/1     Running            35 (10h ago)      15h
csi-attacher-67f8c99bcd-jsj54                            0/1     CrashLoopBackOff   141 (2m49s ago)   15h
longhorn-csi-plugin-kxtqn                                0/3     CrashLoopBackOff   535 (2m18s ago)   15h
csi-snapshotter-6f6bf9c757-xwmmn                         0/1     CrashLoopBackOff   141 (2m14s ago)   15h
csi-provisioner-557c7f7c44-wd57k                         0/1     CrashLoopBackOff   141 (2m5s ago)    15h
csi-resizer-545b7c64f5-8jxk4                             0/1     CrashLoopBackOff   141 (61s ago)     15h

They are all unable to connect to unix://csi/csi.sock:

╰─$ kl logs csi-attacher-67f8c99bcd-jsj54
I1116 01:28:03.328201       1 main.go:97] Version: v4.4.0
W1116 01:28:13.329951       1 connection.go:183] Still connecting to unix:///csi/csi.sock
W1116 01:28:23.329624       1 connection.go:183] Still connecting to unix:///csi/csi.sock
W1116 01:28:33.330048       1 connection.go:183] Still connecting to unix:///csi/csi.sock
E1116 01:28:33.330124       1 main.go:136] context deadline exceeded

To Reproduce

Run negative test case Restart Cluster While Workload Heavy Writing repeatedly.

Expected behavior

Support bundle for troubleshooting

supportbundle_ae1d1892-8da7-4733-97d7-16326976bb0e_2023-11-16T03-17-09Z.zip

worker nodes logs:
worker_nodes_log.txt

Environment

  • Longhorn version: v1.5.x-head
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.27.1+k3s1
    • Number of management node in the cluster:
    • Number of worker node in the cluster:
  • Node config
    • OS type and version:
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type(e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:
  • Impacted Longhorn resources:
    • Volume names:

Additional context

https://suse.slack.com/archives/C02DR3N5T24/p1700095454722879?thread_ts=1699951700.806219&cid=C02DR3N5T24

@yangchiu yangchiu added kind/bug reproduce/rare < 50% reproducible require/backport Require backport. Only used when the specific versions to backport have not been definied. labels Nov 16, 2023
@yangchiu yangchiu added this to the v1.6.0 milestone Nov 16, 2023
@ejweber
Copy link
Contributor

ejweber commented Dec 4, 2023

I cannot find any clue in the support bundle. Since the longhorn-csi-plugin is being repeatedly restarted, even if it fails to listen on csi.sock one time, it should succeed the next time, and the system should stabilize.

My immediate guess was SELinux. However, SELinux should have been set to permissive given the Jenkins parameters (I am not able to confirm or deny this from any of the logs we have). Additionally, it should have affected all nodes similarly.

I kicked off https://ci.longhorn.io/job/private/job/longhorn-e2e-test/129/ to try to reproduce it for live debugging. If that doesn't work, we will likely need to wait for another occurence.

@innobead innobead added the priority/1 Highly recommended to implement or fix in this release (managed by PO) label Dec 6, 2023
@PhanLe1010
Copy link
Contributor

I hit this one when restarting a one-node cluster. I was planning to create a ticket but looks like one already exist here.
The issue in my case is that longhorn-csi-plugin stays in crashing loop for about 4-5 mins then it recovers. Not sure if this is just slowing issue

@ejweber
Copy link
Contributor

ejweber commented Dec 11, 2023

I kicked off https://ci.longhorn.io/job/private/job/longhorn-e2e-test/129/ to try to reproduce it for live debugging. If that doesn't work, we will likely need to wait for another occurrence.

This didn't help to reproduce the issue.

I hit this one when restarting a one-node cluster. I was planning to create a ticket but looks like one already exist here.
The issue in my case is that longhorn-csi-plugin stays in crashing loop for about 4-5 mins then it recovers. Not sure if this is just slowing issue

I'll look into this a bit when I have time to see if there is an improvement we can make. It seems different from the original event, since, in that case, the CSI components never recovered.

@yaleman
Copy link

yaleman commented Dec 16, 2023

I'm having this same issue, what kind of troubleshooting can we actually perform to identify the cause? This is on a five node cluster and it was working fine until a power issue took it out, now one node's looping on this error:

Still connecting to unix:///csi/csi.sock

@ejweber
Copy link
Contributor

ejweber commented Dec 18, 2023

Hello @yaleman. Could we have a support bundle please to confirm the identical symptoms and look for additional clues?

Please upload a it (or a paste a link) here, or send it to longhorn-support-bundle@suse.com.

Also, per the analysis below:

  • What version of longhornio/livenessprobe is the longhorn-csi-plugin pod running in your cluster?
  • After you have captured a support bundle, does terminating the offending longhorn-csi-plugin pod resolve the issue? This worked in my reproduce.

@ejweber
Copy link
Contributor

ejweber commented Dec 18, 2023

  • A 30 second timeout was introduced to github.com/kubernetes-csi/csi-lib-utils/connection.Connect in this commit. The commit is included in github.com/kubernetes-csi/csi-lib-utils >= v1.14.0.
  • The latest version (v.2.11.0) of github.com/kubernetes-csi/livenessprobe picked up that change in this commit.
  • In the support bundle provided by @longhorn/qa, we were running longhornio/livenessprobe:v2.11.0. I'm not sure yet why this was the case, since we deploy longhornio/livenessprobe:v2.9.0 by default still. Maybe we were actively testing an improvement?

Here are the latest longhorn-csi-plugin logs from the support bundle:

2023-11-16T03:13:13.665926468Z time="2023-11-16T03:13:13Z" level=info msg="CSI Driver: driver.longhorn.io version: v1.6.0-dev, manager URL http://longhorn-backend:9500/v1" func="csi.(*Manager).Run" file="manager.go:23"
2023-11-16T03:13:13.687607928Z time="2023-11-16T03:13:13Z" level=info msg="Enabling node service capability: GET_VOLUME_STATS" func=csi.getNodeServiceCapabilities file="node_server.go:756"
2023-11-16T03:13:13.687626652Z time="2023-11-16T03:13:13Z" level=info msg="Enabling node service capability: STAGE_UNSTAGE_VOLUME" func=csi.getNodeServiceCapabilities file="node_server.go:756"
2023-11-16T03:13:13.687648727Z time="2023-11-16T03:13:13Z" level=info msg="Enabling node service capability: EXPAND_VOLUME" func=csi.getNodeServiceCapabilities file="node_server.go:756"
2023-11-16T03:13:13.687698615Z time="2023-11-16T03:13:13Z" level=info msg="Enabling controller service capability: CREATE_DELETE_VOLUME" func=csi.getControllerServiceCapabilities file="controller_server.go:1264"
2023-11-16T03:13:13.687704255Z time="2023-11-16T03:13:13Z" level=info msg="Enabling controller service capability: PUBLISH_UNPUBLISH_VOLUME" func=csi.getControllerServiceCapabilities file="controller_server.go:1264"
2023-11-16T03:13:13.687707962Z time="2023-11-16T03:13:13Z" level=info msg="Enabling controller service capability: EXPAND_VOLUME" func=csi.getControllerServiceCapabilities file="controller_server.go:1264"
2023-11-16T03:13:13.687711070Z time="2023-11-16T03:13:13Z" level=info msg="Enabling controller service capability: CREATE_DELETE_SNAPSHOT" func=csi.getControllerServiceCapabilities file="controller_server.go:1264"
2023-11-16T03:13:13.687713239Z time="2023-11-16T03:13:13Z" level=info msg="Enabling controller service capability: CLONE_VOLUME" func=csi.getControllerServiceCapabilities file="controller_server.go:1264"
2023-11-16T03:13:13.687792201Z time="2023-11-16T03:13:13Z" level=info msg="Enabling volume access mode: SINGLE_NODE_WRITER" func=csi.getVolumeCapabilityAccessModes file="controller_server.go:1280"
2023-11-16T03:13:13.687797371Z time="2023-11-16T03:13:13Z" level=info msg="Enabling volume access mode: MULTI_NODE_MULTI_WRITER" func=csi.getVolumeCapabilityAccessModes file="controller_server.go:1280"
2023-11-16T03:13:13.688131494Z time="2023-11-16T03:13:13Z" level=info msg="Listening for connections on address: &net.UnixAddr{Name:\"//csi/csi.sock\", Net:\"unix\"}" func="csi.(*NonBlockingGRPCServer).serve" file="server.go:100"

There is no evidence of anything wrong with longhorn-csi-plugin in this latest run.

Here are the latest livenessprobe logs from the support bundle:

2023-11-16T03:15:31.406665514Z W1116 03:15:31.406497   17969 connection.go:183] Still connecting to unix:///csi/csi.sock
2023-11-16T03:15:41.406718665Z W1116 03:15:41.406526   17969 connection.go:183] Still connecting to unix:///csi/csi.sock
2023-11-16T03:15:51.406020666Z W1116 03:15:51.405855   17969 connection.go:183] Still connecting to unix:///csi/csi.sock
2023-11-16T03:15:51.406094710Z F1116 03:15:51.405889   17969 main.go:146] failed to establish connection to CSI driver: context deadline exceeded

30 seconds after livenessprobe starts, it fails. From the code livenessprobe is expected to loop here indefinitely instead of failing. Due to the changes mentioned above, that does not happen.

The following sequence of events seems possible (though I have not managed to reproduce it):

  • longhorn-csi-plugin fails early and often. This is speculation, but maybe the longhorn-backend service is not up yet, so it cannot connect as it initializes.
  • Instead of waiting forever to connect to longhorn-csi-plugin via csi.sock, livenessprobe fails every 30 seconds.
  • Both longhorn-csi-plugin and livenessprobe fall into exponential restart backoff. As the backoff grows, it becomes possible for livenessprobe to start, fail to connect for 30 seconds, and die while longhorn-csi-plugin is waiting to start.
  • Eventually, longhorn-csi-plugin is able to consistently start successfully (as seen in the logs above). However, it always manages to do so at a time when livenessprobe is down.
  • Without livenessprobe, the liveness probe configured on longhorn-csi-plugin is guaranteed to fail. Kubernetes kills it.
  • Without longhorn-csi-plugin, livenessprobe cannot start. It fails (as seen in the logs above), and Kubernetes restarts it.

Additional evidence:

  • From the support bundle, longhorn-csi-plugin is in 5m0s exponential backoff. It last finished at 2023-11-16T03:15:51Z after having been up for 15s. Its exit code is 143, which is associated with graceful termination at the request of Kubernetes.
  • From the support bundle, livenessprobe is in 5m0s exponential backoff. It last finished at 2023-11-16T03:15:51Z.
  • If we continue this pattern, longhorn-csi-plugin and livenessprobe may eventually run successfully at the same time, but clearly the next many iterations will result in failure.

@ejweber
Copy link
Contributor

ejweber commented Dec 18, 2023

  • In the support bundle provided by @longhorn/qa, we were running longhornio/livenessprobe:v2.11.0. I'm not sure yet why this was the case, since we deploy longhornio/livenessprobe:v2.9.0 by default still. Maybe we were actively testing an improvement?

One mystery solved. #6920 bumped the version of livenessprobe (and other CSI components) in deployment manifests. This change hasn't made it to any released versions yet, though. It is expected to see it in v1.5.x and master test builds, but not out in the wild.

@ejweber
Copy link
Contributor

ejweber commented Dec 18, 2023

Reproduce steps:

  • kubectl edit -n longhorn-system service longhorn-backend and change the port from 9500 to 9501. This effectively breaks the longhorn-backend service temporarily.
  • kubectl delete -n longhorn-system pod longhorn-csi-plugin-<xxxxxx>.
  • Wait some length of time. All three containers go into CrashLoopBackoff.
  • kubectl edit -n longhorn-system service longhorn-backend and restore the correct port.
  • Monitor longhorn-csi-plugin. It can never recover (or at least not for a long time) and exhibits the behavior described above.

In the example below, I kept the longhorn-backend service broken for 6m30s. The situation was not improved 7m30s after that.

longhorn-csi-plugin-6zxtn                           0/3     CrashLoopBackOff   16 (1s ago)     6m36s
longhorn-csi-plugin-6zxtn                           2/3     CrashLoopBackOff   18 (2m11s ago)   8m46s
longhorn-csi-plugin-6zxtn                           0/3     CrashLoopBackOff   18 (2m41s ago)   9m16s
longhorn-csi-plugin-6zxtn                           0/3     CrashLoopBackOff   18 (1s ago)      9m17s
longhorn-csi-plugin-6zxtn                           1/3     CrashLoopBackOff   19 (2m27s ago)   11m
longhorn-csi-plugin-6zxtn                           1/3     CrashLoopBackOff   20 (2s ago)      11m
longhorn-csi-plugin-6zxtn                           0/3     CrashLoopBackOff   20 (1s ago)      12m
longhorn-csi-plugin-6zxtn                           2/3     CrashLoopBackOff   22 (2m16s ago)   14m
longhorn-csi-plugin-6zxtn                           0/3     CrashLoopBackOff   22 (2m46s ago)   14m
longhorn-csi-plugin-6zxtn                           0/3     CrashLoopBackOff   22 (2s ago)      14m

Deleting longhorn-csi-plugin-6zxtn immediately resolved the issue.

@ejweber
Copy link
Contributor

ejweber commented Dec 18, 2023

I replaced livenessprobe v2.11.0 with v2.10.0. While the other two containers went into a crash loop (as expected) when the longhorn-backend service was broken, the livenessprobe container did not.

W1218 22:53:48.550210  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:53:58.549453  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:54:08.549543  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:54:18.550311  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:54:28.550369  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:54:38.550120  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:54:48.549873  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:54:58.549510  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:55:08.549488  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:55:18.550327  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:55:28.549449  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:55:38.552287  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:55:48.549367  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:55:58.549464  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:56:08.549487  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:56:18.549671  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:56:28.549355  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:56:38.549645  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:56:48.549588  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:56:58.549452  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:57:08.549906  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:57:18.549325  114434 connection.go:173] Still connecting to unix:///csi/csi.sock
W1218 22:57:28.549934  114434 connection.go:173] Still connecting to unix:///csi/csi.sock

Since the containers were deep into backoff when I restored the service, it took still took a while to recover (had to wait for the longhorn-csi-plugin container to be allowed to restart). However, full recovery eventually occured.

longhorn-csi-plugin-dgrgh                           2/3     CrashLoopBackOff   10 (92s ago)    5m36s
longhorn-csi-plugin-dgrgh                           1/3     CrashLoopBackOff   10 (2m2s ago)   6m6s
longhorn-csi-plugin-dgrgh                           1/3     CrashLoopBackOff   10 (15s ago)    6m20s
longhorn-csi-plugin-dgrgh                           2/3     CrashLoopBackOff   11 (30s ago)    6m35s
longhorn-csi-plugin-dgrgh                           3/3     Running            12 (2m53s ago)   8m58s

@ejweber
Copy link
Contributor

ejweber commented Dec 18, 2023

Ideas to fix (in order of preference?):

  1. Patch upstream livenessprobe to restore previous behavior. Revert to v2.10.0 (or v2.9.0, since we already have a Longhorn version of it) in the meantime.
  2. Add a startup probe to the longhorn-csi-plugin container with a 5m window or configure its livenessProbe.initialDelaySeconds to 5m. Kubernetes will not kill this container for the first five minutes it is alive. This will keep it alive long enough for livenessprobe to connect with it whenever it comes back, but maintain the existing probe behavior after startup.
  3. Rework longhorn-csi-plugin startup so that it does not crash if it cannot connect to the longhorn-backend service. This may prevent the offset CrashLoopBackoffs from occurring in the first place. Kubernetes liveness probes will still fail, since the container is not ready. This only works as a potential addendum to 2.

@yaleman
Copy link

yaleman commented Dec 18, 2023

Mine seems to have been sorted by restarting the node, it's a transient weirdness unfortunately.. will send a support bundle once I get one.

@ejweber
Copy link
Contributor

ejweber commented Dec 19, 2023

Mine seems to have been sorted by restarting the node, it's a transient weirdness unfortunately.. will send a support bundle once I get one.

Yes, I think that fits with the analysis I've done. Even without the bundle, can you confirm which version of the livenessprobe image you are running in your cluster? For example:

eweber@laptop:~> kl get pod -l app=longhorn-csi-plugin -oyaml | grep livenessprobe
      image: longhornio/livenessprobe:v2.9.0
      image: docker.io/longhornio/livenessprobe:v2.9.0
      imageID: docker.io/longhornio/livenessprobe@sha256:555e27888f3692ab4130616ce4de62f950bdac16da3af4c432e13a71604e0261
      image: longhornio/livenessprobe:v2.9.0
      image: docker.io/longhornio/livenessprobe:v2.9.0
      imageID: docker.io/longhornio/livenessprobe@sha256:555e27888f3692ab4130616ce4de62f950bdac16da3af4c432e13a71604e0261
      image: longhornio/livenessprobe:v2.9.0
      image: docker.io/longhornio/livenessprobe:v2.9.0
      imageID: docker.io/longhornio/livenessprobe@sha256:555e27888f3692ab4130616ce4de62f950bdac16da3af4c432e13a71604e0261

@ejweber
Copy link
Contributor

ejweber commented Dec 22, 2023

I reviewed #6916 and it looks like we really do need to go to livenessprobe v2.11.0 if we want to mitigate the CVEs it mentions.

The upstream bug has some traction, but it's unlikely we'll see a version we can grab and test before v1.6.0 releases. I'll move forward with my mitigation PR.

  • Create an issue to revert the mitigation when we upgrade livenessprobe if the mitigation changes are approved and merged.

@longhorn-io-github-bot
Copy link

longhorn-io-github-bot commented Dec 22, 2023

Pre Ready-For-Testing Checklist

@yangchiu
Copy link
Member Author

I reviewed #6916 and it looks like we really do need to go to livenessprobe v2.11.0 if we want to mitigate the CVEs it mentions.

The upstream bug has some traction, but it's unlikely we'll see a version we can grab and test before v1.6.0 releases. I'll move forward with my mitigation PR.

* [x]  Create an issue to revert the mitigation when we upgrade livenessprobe if the mitigation changes are approved and merged.

ref: #7428

@yangchiu
Copy link
Member Author

Verified passed on master-head (longhorn-manager 859d438) following test steps #7116 (comment) and longhorn/longhorn-manager#2388 (comment). longhorn-csi-plugin pods can run stable after backend service restored.

Also ran negative test case Restart Cluster While Workload Heavy Writing for 60 times, didn't observe this issue. Test results:
https://ci.longhorn.io/job/private/job/longhorn-e2e-test/133/
https://ci.longhorn.io/job/private/job/longhorn-e2e-test/134/
https://ci.longhorn.io/job/private/job/longhorn-e2e-test/135/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/csi CSI related like control/node driver, sidecars area/stability System or volume stability area/upstream Upstream related like tgt upstream library backport/1.5.4 kind/bug priority/1 Highly recommended to implement or fix in this release (managed by PO) reproduce/rare < 50% reproducible require/backport Require backport. Only used when the specific versions to backport have not been definied.
Projects
Status: Closed
Development

No branches or pull requests

6 participants