
[BUG] Persistent volume is not ready for workloads #6776

Closed
ajoskowski opened this issue Sep 25, 2023 · 40 comments
Assignees
Labels
area/stability System or volume stability area/volume-attach-detach Volume attach & detach related area/volume-rwx Volume RWX related backport/1.5.4 kind/bug priority/0 Must be implemented or fixed in this release (managed by PO) require/backport Require backport. Only used when the specific versions to backport have not been defined. require/qa-review-coverage Require QA to review coverage
Milestone

Comments

@ajoskowski

ajoskowski commented Sep 25, 2023

Describe the bug (🐛 if you encounter this issue)

Sometimes we encounter issues where we are not able to mount a Longhorn volume to a pod.
The pod fails to start and the following errors are visible:

  • Kubernetes events for failing pod:
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: chdir to cwd ("/src") set in config.json failed: stale NFS file handle: unknown
AttachVolume.Attach failed for volume "pvc-506e824d-414f-43ce-af59-5821b2b9accf" : rpc error: code = Aborted desc = volume pvc-506e824d-414f-43ce-af59-5821b2b9accf is not ready for workloads
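
A quick way to inspect the underlying volume state when this error appears is via the Longhorn custom resources, for example (a minimal sketch; the label selector is an assumption and may differ by Longhorn version):

# Sketch: check why the volume is reported as "not ready for workloads"
kubectl -n longhorn-system get volumes.longhorn.io pvc-506e824d-414f-43ce-af59-5821b2b9accf -o yaml
# The longhornvolume label selector is an assumption; listing all replicas and filtering by name also works
kubectl -n longhorn-system get replicas.longhorn.io -l longhornvolume=pvc-506e824d-414f-43ce-af59-5821b2b9accf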

To Reproduce

The problem cannot be easily reproduced - it occurs randomly.

Expected behavior

Volumes work correctly and can be mounted to the pods.

Support bundle for troubleshooting

We cannot send a support bundle due to security reasons, but we can provide logs and details - see below:

  • Kubernetes (and Longhorn) nodes:
    • ip-X-X-X-57.compute.internal
    • ip-X-X-X-142.compute.internal
    • ip-X-X-X-140.compute.internal
  • Pod name: ci-state-pr-2463-env-doaks-prod6-uaenorth-v2wnw-override-refs-in-tf-modules-407269516
  • Pod events:
AttachVolume.Attach failed for volume "pvc-506e824d-414f-43ce-af59-5821b2b9accf" : rpc error: code = Aborted desc = volume pvc-506e824d-414f-43ce-af59-5821b2b9accf is not ready for workloads
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: chdir to cwd ("/src") set in config.json failed: stale NFS file handle: unknown
  • instance-manager on ip-X-X-X-142.compute.internal node:
time="2023-09-18T03:08:01Z" level=error msg="I/O error" error="no backend available"
[pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a] time="2023-09-18T03:08:01Z" level=error msg="I/O error" error="no backend available"
response_process: Receive error for response 3 of seq 310
tgtd: bs_longhorn_request(111) fail to read at 0 for 4096
tgtd: bs_longhorn_request(210) io error 0xc27700 28 -14 4096 0, Success
[pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a] time="2023-09-18T03:08:01Z" level=error msg="I/O error" error="no backend available"
response_process: Receive error for response 3 of seq 311
tgtd: bs_longhorn_request(111) fail to read at 0 for 4096
tgtd: bs_longhorn_request(210) io error 0xc27700 28 -14 4096 0, Success
response_process: Receive error for response 3 of seq 312
tgtd: bs_longhorn_request(111) fail to read at 0 for 4096
tgtd: bs_longhorn_request(210) io error 0xc27700 28 -14 4096 0, Success
response_process: Receive error for response 3 of seq 313
tgtd: bs_longhorn_request(97) fail to write at 10737352704 for 65536
tgtd: bs_longhorn_request(210) io error 0xc27700 2a -14 65536 10737352704, Success
[pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a] time="2023-09-18T03:08:01Z" level=error msg="I/O error" error="no backend available"
time="2023-09-18T03:08:01Z" level=error msg="I/O error" error="no backend available"
response_process: Receive error for response 3 of seq 314
tgtd: bs_longhorn_request(97) fail to write at 4337664 for 4096
tgtd: bs_longhorn_request(210) io error 0xc27700 2a -14 4096 4337664, Success
response_process: Receive error for response 3 of seq 315
tgtd: bs_longhorn_request(97) fail to write at 37912576 for 4096
tgtd: bs_longhorn_request(210) io error 0xc27700 2a -14 4096 37912576, Success
[pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a] time="2023-09-18T03:08:01Z" level=error msg="I/O error" error="no backend available"
time="2023-09-18T03:08:20Z" level=error msg="Error syncing Longhorn engine" controller=longhorn-engine engine=longhorn-system/pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a error="failed to sync engine for longhorn-system/pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a: failed to start rebuild for pvc-506e824d-414f-43ce-af59-5821b2b9accf-r-6093cefb of pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a: timed out waiting for the condition" node=ip-10-44-45-142.eu-central-1.compute.internal
  • longhorn-csi-plugin on ip-X-X-X-142.compute.internal node:
time="2023-09-18T03:07:34Z" level=error msg="ControllerPublishVolume: err: rpc error: code = Aborted desc = volume pvc-506e824d-414f-43ce-af59-5821b2b9accf is not ready for workloads"
  • csi-attacher on ip-X-X-X-142.compute.internal node:
I0918 03:07:34.632251       1 csi_handler.go:234] Error processing "csi-635290b8ff08b07c1e7e1bdf2434aec2d8e8ef39dd611f725f8f3da595713bf5": failed to attach: rpc error: code = Aborted desc = volume pvc-506e824d-414f-43ce-af59-5821b2b9accf is not ready for workloads
  • longhorn-manager on ip-X-X-X-142.compute.internal node:
time="2023-09-18T03:08:00Z" level=error msg="Failed to rebuild replica X.X.X.245:10205" controller=longhorn-engine engine=pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a error="proxyServer=X.X.X.201:8501 destination=X.X.X.201:10079: failed to add replica tcp://X.X.X.245:10205 for volume: rpc error: code = Unknown desc = failed to create replica tcp://X.X.X.245:10205 for volume X.X.X.201:10079: rpc error: code = Unknown desc = cannot get valid result for remain snapshot" node=ip-X-X-X-142.eu-central-1.compute.internal volume=pvc-506e824d-414f-43ce-af59-5821b2b9accf
time="2023-09-18T03:08:00Z" level=error msg="Failed to sync Longhorn volume longhorn-system/pvc-506e824d-414f-43ce-af59-5821b2b9accf" controller=longhorn-volume error="failed to sync longhorn-system/pvc-506e824d-414f-43ce-af59-5821b2b9accf: failed to reconcile volume state for pvc-506e824d-414f-43ce-af59-5821b2b9accf: no healthy or scheduled replica for starting" node=ip-X-X-X-142.eu-central-1.compute.internal

Environment

  • Longhorn version: v1.5.1
  • Installation method: helm
  • Kubernetes distro and version: AWS EKS, version v1.26.6
    • Number of worker node in the cluster: 3
    • Machine type: m5.4xlarge
  • Number of Longhorn volumes in the cluster: tens of volumes created dynamically as temporary storage for CICD builds (Longhorn + Argo Workflows)
  • Impacted Longhorn resources:
    • Volume names: pvc-506e824d-414f-43ce-af59-5821b2b9accf (only example)

Additional context

Cluster autoscaler is enabled on the cluster - Kubernetes Cluster Autoscaler Enabled (Experimental) is enabled in Longhorn configuration

@ajoskowski ajoskowski added kind/bug require/backport Require backport. Only used when the specific versions to backport have not been defined. require/qa-review-coverage Require QA to review coverage labels Sep 25, 2023
@derekbit
Member

derekbit commented Sep 25, 2023

Can you provide a support bundle to our e-mail (not public) longhorn-support-bundle@suse.com?

@ajoskowski
Author

@derekbit We are not able to do that due to processes in our company. I put all the details in the ticket. If you need anything more, please tell me and I will try to help you :)

@derekbit
Member

Questions for clarification:

  1. ...config.json failed: stale NFS file handle: unknown.. => What's the purpose of the NFS filesystem in your system?
  2. What's the accessMode of the problematic volume?
  3. How many replicas does the problematic volume have?
  4. ..level=error msg="I/O error" error="no backend available"... => All replicas failed. Is it because of an i/o timeout? Can you help check why the replicas failed?

@ajoskowski
Author

Questions for clarification:

  1. ...config.json failed: stale NFS file handle: unknown.. => What's the purpose of the NFS filesystem in your system?
  2. What's the accessMode of the problematic volume?
  3. How many replicas does the problematic volume have?
  4. ..level=error msg="I/O error" error="no backend available"... => All replicas failed. Is it because of an i/o timeout? Can you help check why the replicas failed?
  1. It is an error message from an Argo Workflows step - we do not have anything related to NFS except Longhorn.
  2. The access mode is RWX - we have several steps in our pipeline. The first step prepares data (example: git clone); subsequent steps can run in parallel and use the already prepared data - this is the reason why we use RWX.
  3. We have 3 worker nodes and 3 replicas of Longhorn volumes. We also have the cluster autoscaler enabled on the cluster, but in this specific case we had only the 3 initially created nodes.
  4. Where can I get some details about the reason for this state?

@derekbit
Member

Where can I get some details about the reason for this state?

Check the instance-manager logs and see why the replicas cannot be added to the engine or any i/o timeout error.
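
A minimal sketch of pulling those logs (the pod name is a placeholder and the grep pattern is only an example):

kubectl -n longhorn-system get pods -o wide | grep instance-manager
kubectl -n longhorn-system logs <instance-manager-pod-name> --since=24h | grep -iE "i/o error|timeout|fail"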

@karolkieglerski

This issue is related to my issue #6641

@ajoskowski
Author

Where can I get some details about the reason for this state?

Check the instance-manager logs and see why the replicas cannot be added to the engine or any i/o timeout error.

Logs from instance manager on ip-X-X-X-142.compute.internal node -
ip-X-X-X-142.compute.internal-instance-manager.log

@derekbit
Member

I see the replicas were removed when creating an engine. Can you check why longhorn-manager deleted them at this moment?

[longhorn-instance-manager] time="2023-09-18T03:07:50Z" level=info msg="Removing replica" engineName=pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a replicaAddress="tcp://X.X.X.100:10215" replicaName= serviceURL="X.X.X.201:10079"
[longhorn-instance-manager] time="2023-09-18T03:07:50Z" level=info msg="Removing replica" engineName=pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a replicaAddress="tcp://X.X.X.245:10195" replicaName= serviceURL="X.X.X.201:10079"
[pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a] time="2023-09-18T03:07:50Z" level=info msg="Removing backend: tcp://X.X.X.245:10195"
time="2023-09-18T03:07:50Z" level=info msg="Monitoring stopped tcp://X.X.X.245:10195"
[pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a] time="2023-09-18T03:07:50Z" level=info msg="Removing backend: tcp://X.X.X.100:10215"

@ajoskowski
Author

I see the replicas were removed when creating an engine. Can you check why longhorn-manager deleted them at this moment?

[longhorn-instance-manager] time="2023-09-18T03:07:50Z" level=info msg="Removing replica" engineName=pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a replicaAddress="tcp://X.X.X.100:10215" replicaName= serviceURL="X.X.X.201:10079"
[longhorn-instance-manager] time="2023-09-18T03:07:50Z" level=info msg="Removing replica" engineName=pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a replicaAddress="tcp://X.X.X.245:10195" replicaName= serviceURL="X.X.X.201:10079"
[pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a] time="2023-09-18T03:07:50Z" level=info msg="Removing backend: tcp://X.X.X.245:10195"
time="2023-09-18T03:07:50Z" level=info msg="Monitoring stopped tcp://X.X.X.245:10195"
[pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a] time="2023-09-18T03:07:50Z" level=info msg="Removing backend: tcp://X.X.X.100:10215"

Longhorn manager logs from all (3) instances: all-longhorn-manager.log

@derekbit
Member

RWX volume pvc-506e824d-414f-43ce-af59-5821b2b9accf received many requests from API in a short period
...
-Requested to be detached at 18/Sep/2023:03:06:01 +0000
-Requested to be attached at 18/Sep/2023:03:06:38 +0000
-Requested to be detached at 18/Sep/2023:03:06:50 +0000
-Requested to be detached at 18/Sep/2023:03:07:03 +0000
...

From the log, it looks like a race condition between the attachments and the detachments, but what is the purpose of such frequent requests?

@ajoskowski
Author

As I said before - we use Longhorn as storage for our CI/CD stack. We have many steps which do something with the data. Some steps take several seconds, some several minutes. In the case we are investigating, we have several steps which do not take much time - so a lot of attachments/detachments are expected here. The question is: is this valid for Longhorn? Should it support such cases? We have used Longhorn this way for ~2 years and we did not observe such problems in the past.

@derekbit
Member

The question is: is this valid for Longhorn? Should it support such cases? We have used Longhorn this way for ~2 years and we did not observe such problems in the past.

Got it. Longhorn introduced a new attachment/detachment mechanism in v1.5.0. Not sure if it is related; it is still under investigation. Ref: #3715

cc @PhanLe1010

@derekbit
Member

cc @james-munson

@innobead innobead added the investigation-needed Identified the issue but require further investigation for resolution (won't be stale) label Sep 25, 2023
@derekbit
Member

Longhorn manager logs from all (3) instances: all-longhorn-manager.log

@ajoskowski Sorry, I got confused when reading the log file. Does the all-longhorn-manager.log include all longhorn-manager pods' logs or only one longhorn-manager pod?

@ajoskowski
Author

Longhorn manager logs from all (3) instances: all-longhorn-manager.log

@ajoskowski Sorry, I got confused when reading the log file. Does the all-longhorn-manager.log include all longhorn-manager pods' logs or only one longhorn-manager pod?

These are all logs from all instances of longhorn-manager - that is, a combined set of logs from the 3 instances.
Do you need them in separate files?

@derekbit
Member

derekbit commented Sep 26, 2023

Longhorn manager logs from all (3) instances: all-longhorn-manager.log

@ajoskowski Sorry, I got confused when reading the log file. Does the all-longhorn-manager.log include all longhorn-manager pods' logs or only one longhorn-manager pod?

These are all logs from all instances of longhorn-manager - that is, a combined set of logs from the 3 instances. Do you need them in separate files?

Yeah, some messages are mixed together. Can you help provide separate files? Thank you.

@c3y1huang c3y1huang moved this from New to In progress in Community Review Sprint Sep 26, 2023
@james-munson
Contributor

Does the CI/CD process involve restarting nodes or pods? The share manager recovery backend logs that it is removing NFS client entries on multiple occasions.

@PhanLe1010
Contributor

PhanLe1010 commented Sep 27, 2023

@ajoskowski Could we have:

  • the yaml output of kubectl get volumes.longhorn.io,volumeattachments.longhorn.io,engines.longhorn.io,replicas.longhorn.io -n longhorn-system -oyaml
  • logs from longhorn-csi-plugin-xxx pods in longhorn-system namespace
  • Is the workload pod ci-state-pr-2463-env-doaks-prod6-uaenorth-v2wnw-override-refs-in-tf-modules-407269516 stuck right now?

@PhanLe1010
Contributor

PhanLe1010 commented Sep 27, 2023

In an effort to reproduce, do you think the following steps are similar to your CI pipeline, @ajoskowski?

  1. Create a RWX PVC
  2. Create a deployment using the PVC
  3. Repeatedly and quickly scale the deployment up from 0 to 5 replicas and back down to 0
  4. Verify whether any pod gets stuck coming up
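
A minimal loop along these lines (the deployment name, label, and iteration count are placeholder assumptions, not taken from this issue):

# Sketch: repeatedly scale an RWX-backed deployment between 0 and 5 replicas
for i in $(seq 1 50); do
    kubectl scale deployment rwx-test --replicas=5
    kubectl rollout status deployment rwx-test --timeout=120s || echo "iteration $i: pods did not come up"
    kubectl scale deployment rwx-test --replicas=0
    # wait for the old pods to terminate before the next iteration
    kubectl wait --for=delete pod -l app=rwx-test --timeout=300s || true
done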

@ajoskowski
Author

Does the CI/CD process involve restarting nodes or pods? The share manager recovery backend logs that it is removing NFS client entries on multiple occasions.

Each step in the pipeline is a separate pod with the shared Longhorn volume. It means that if you have 10 steps, you can think of them as 10 pods with the shared Longhorn volume mounted.

@ajoskowski Could we have:

  • the yaml output of kubectl get volumes.longhorn.io,volumeattachments.longhorn.io,engines.longhorn.io,replicas.longhorn.io -n longhorn-system -oyaml
  • logs from longhorn-csi-plugin-xxx pods in longhorn-system namespace
  • Is the workload pod ci-state-pr-2463-env-doaks-prod6-uaenorth-v2wnw-override-refs-in-tf-modules-407269516 stuck right now?
  • Regarding kubectl get volumes.longhorn.io,volumeattachments.longhorn.io,engines.longhorn.io,replicas.longhorn.io -n longhorn-system -oyaml - no problem, but I was only able to do it for the current cluster shape (the problematic volume was already removed) - volumes_attachments_engines_replicas.log
  • Logs from longhorn-csi-plugin-xxx pods in the longhorn-system namespace - I do not see any logs containing information about pvc-506e824d-414f-43ce-af59-5821b2b9accf
  • Pod ci-state-pr-2463-env-doaks-prod6-uaenorth-v2wnw-override-refs-in-tf-modules-407269516 was not able to start due to the problem described in the bug description, and it has already been removed

In an effort to reproduce, do you think the following steps are similar to your CI pipeline, @ajoskowski?

  1. Create a RWX PVC
  2. Create a deployment using the PVC
  3. Repeatedly and quickly scale the deployment up from 0 to 5 replicas and back down to 0
  4. Verify whether any pod gets stuck coming up

Yeah, it's similar. On our side we create new pod definitions (new steps) instead of scaling a deployment, but the logic is the same - creating and removing pods that mount/unmount a Longhorn volume.

@PhanLe1010
Contributor

PhanLe1010 commented Sep 28, 2023

Thanks @ajoskowski !
The provided yaml https://github.com/longhorn/longhorn/files/12734412/volumes_attachments_engines_replicas.log doesn't show anything abnormal, as it was taken when the problem was not occurring.

We will try to reproduce the issue in our lab.

@innobead innobead added the priority/0 Must be implemented or fixed in this release (managed by PO) label Oct 3, 2023
@innobead innobead added area/volume-rwx Volume RWX related and removed investigation-needed Identified the issue but require further investigation for resolution (won't be stale) labels Oct 13, 2023
@shuo-wu
Contributor

shuo-wu commented Oct 20, 2023

As we discussed last time, this part is problematic and may lead to unexpected and unnecessary detachment after a node is temporarily unavailable (kubelet/network down). In fact, keeping volume.Spec.NodeID the same as ShareManager.Status.OwnerID is unnecessary.

The share-manager-controller workflow can be like the following:

  1. Start the share manager pod scheduled by Kubernetes.
  2. Set the volume attachment ticket to the pod node
  3. Unset the volume attachment ticket when the volume or the share manager pod is in an error state or is stopping/stopped
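
For context, the attachment tickets this workflow manipulates can be inspected on the Longhorn VolumeAttachment custom resource (a sketch; the exact field layout depends on the Longhorn version):

# Sketch: show the attachment tickets (e.g. from the CSI attacher or the share manager controller) for a volume
kubectl -n longhorn-system get volumeattachments.longhorn.io <volume-name> -o yaml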

@james-munson
Contributor

Using a similar script, I was unable to duplicate a replica being deleted. After detach, they were stopped, but that's all.
The attach and detach sequences take from 30 to 50 seconds each to reach the expected state. I wonder whether that is too slow for the CI/CD apparatus.
If the line is removed, the attachment never happens, and the workload pod is stuck in ContainerCreating. The volume itself shows a status of "attached", but the volumeattachment resource shows attached=false with an attachError of "failed to attach.... Waiting for volume share to be available."

I'm going to focus on trimming the workflow as described above.
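
The Kubernetes-side view described above can be read straight from the VolumeAttachment objects, for example (a sketch using standard storage.k8s.io/v1 fields):

# Sketch: list attach status and any attach error per VolumeAttachment
kubectl get volumeattachments.storage.k8s.io -o custom-columns=NAME:.metadata.name,PV:.spec.source.persistentVolumeName,ATTACHED:.status.attached,ERROR:.status.attachError.message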

@derekbit
Member

derekbit commented Nov 3, 2023

Hello @james-munson, what's your environment for the reproduction?

@james-munson
Contributor

It's a 4-node (1 control-plane, 3 worker) cluster running Ubuntu 20.04 on Kubernetes v1.25.12+rke2r1, running 1.6-dev (current master-head).

@james-munson
Contributor

I also tried a repro with @PhanLe1010's idea of using a deployment with an RWX volume and scaling it up and down quickly.
Specifically, I used the rwx example from https://github.com/longhorn/longhorn/examples/rwx/rwx-nginx-deployment.yaml, although I modified the container slightly to do this

      containers:
        - image: ubuntu:xenial
          imagePullPolicy: IfNotPresent
          command: [ "/bin/sh", "-c" ]
          args:
            - sleep 10; touch /data/index.html; while true; do  echo "`hostname` `date`" >> /data/index.html; sleep 1; done;

to include the hostname in the periodic writes to the shared volume.

Even with scaling up to 3 and down to 0 at 10-second intervals (far faster than the attach and detach can be accomplished) no containers crashed, and no replicas were broken. Kubernetes and Longhorn are untroubled by the fact that creating and terminating resources overlap.
In fact, I revised the time after scale up to 60 seconds, and the time after scale down to 0, so new pods were created immediately, and that just had the effect of attaching and writing from the new pods while the old ones were still detaching. So for some interval, there were 6 pods writing to the volume without trouble.
I conclude from that test that this is likely not a representative repro of the situation in this issue.

@james-munson
Contributor

I did the same with a script that left the PV intact, but deleted and recreated the service & deployment at intervals, rather than just scaling up and down. It still behaved itself.
I assume from the lack of activity from the issue filer that the symptom has been resolved or worked around, perhaps by turning off the cluster autoscaler setting.

@longhorn-io-github-bot

longhorn-io-github-bot commented Nov 27, 2023

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    We have not been able to reproduce the exact customer situation (all replicas down, volume not ready for workloads) but see above in this issue for some scripts to repeatedly apply and delete a load. The sequence to attach and reattach has been simplified.

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at: None known.

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at: n/a
    The PR for the chart change is at: n/a

  • Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at: n/a

  • Which areas/issues this PR might have potential impacts on?
    Area: RWX volume attachment
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at: n/a

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at: n/a

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at: n/a

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at: n/a
    The automation test case PR is at: n/a
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at: n/a

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at: n/a

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

@james-munson
Contributor

I think this is a good candidate for backport to 1.4 and 1.5. @innobead do you agree?

@roger-ryao

roger-ryao commented Dec 14, 2023

Verified on master-head 20231213

The test steps

Test Method 1: #6776 (comment)
Test Method 2: #6776 (comment)

  1. Create the deployment using the provided YAML.
deployment_rwx.yaml
apiVersion: v1
kind: Service
metadata:
  name: deployment-rwx
  labels:
    app: deployment-rwx
spec:
  ports:
    - port: 3306
  selector:
    app: deployment-rwx
  clusterIP: None
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: deployment-rwx-pvc
spec:
  accessModes:
    - ReadWriteMany    
  storageClassName: longhorn
  resources:
    requests:
      storage: 2Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-rwx
  labels:
    app: deployment-rwx
spec:
  selector:
    matchLabels:
      app: deployment-rwx # has to match .spec.template.metadata.labels
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: deployment-rwx
    spec:
      restartPolicy: Always
        #      nodeSelector:
        #        kubernetes.io/hostname: ryao-13x-w2-60671ae2-qpfpr  # worker node      
      containers:
      - image: ubuntu
        name: deployment-rwx
        command: ["/bin/sleep", "3650d"]
        volumeMounts:
        - name: deployment-rwx-volume
          mountPath: "/data/"
        env:
        - name: MYSQL_ROOT_PASSWORD
          value: "rancher"
      volumes:
      - name: deployment-rwx-volume
        persistentVolumeClaim:
          claimName: deployment-rwx-pvc
  2. Scale up the replicas to 10.
  3. Check if all workloads are in the "Running" state.
  4. Scale down the replicas to 0.
  5. Wait for detachment.
  6. Verify that all pods are deleted.
  7. Verify that all pods are terminated.
    We can test steps 2-7 using the following shell script.
deployment_rwx_test.sh
#!/bin/bash
# This script is for GitHub issue #6776.
# It assumes that deployment_rwx.yaml has already been applied.

# Define the deployment name
DEPLOYMENT_NAME="deployment-rwx"
VOLUME_NAME=$1

if [[ $# -lt 1 ]]; then
    echo "Please provide the volume and kubeconfig path arguments."
    echo "Usage: ./test.sh <volume name> [<kubeconfig path>]"
    echo "Examples:"
    echo "  ./test.sh pvc-6a507027-3101-408d-86ad-bc7e18faa061"
    echo "  ./test.sh pvc-6a507027-3101-408d-86ad-bc7e18faa061 kubeconfig.yaml"
    exit 1
fi

if [[ $# -ne 2 ]]; then
    echo "KUBECONFIG=~/.kube/config"
    KUBECONFIG=~/.kube/config # Set the default kubeconfig path
else
    echo "KUBECONFIG=$2"
    KUBECONFIG=$2 # Use the provided kubeconfig path
fi

ATTACH_WAIT_SECONDS=120
DETACH_WAIT_SECONDS=300

for ((i=0; i<200; i++)); do
    # Scale deployment to 10 replicas
    echo "Scale deployment up to 10 replicas."
    kubectl --kubeconfig=$KUBECONFIG scale deployment $DEPLOYMENT_NAME --replicas=10

    # Wait for the deployment to have 10 ready replicas
    until [[ "$(kubectl --kubeconfig=$KUBECONFIG get deployment $DEPLOYMENT_NAME -o=jsonpath='{.status.readyReplicas}')" == "10" ]]; do
        ready_replicas=$(kubectl --kubeconfig=$KUBECONFIG get deployment $DEPLOYMENT_NAME -o=jsonpath='{.status.readyReplicas}')
        echo "Iteration #$i: $DEPLOYMENT_NAME has $ready_replicas ready replicas"
        sleep 1
    done

    # Check if all pods are in the "Running" state within a time limit
    c=0
    while [ $c -lt $ATTACH_WAIT_SECONDS ]
    do
        phase=`kubectl --kubeconfig=$KUBECONFIG get pods -l=app=$DEPLOYMENT_NAME -o=jsonpath="{.items[*].status.phase}" 2>/dev/null`
        if [ x"$phase" == x"Running Running Running Running Running Running Running Running Running Running" ]; then
            echo "All pods are ready."
            break
        fi

        sleep 1
        c=$((c+1))

        if [ $c -ge $ATTACH_WAIT_SECONDS ]; then
            echo "Timeout: Not all pods are in the 'Running' state. Elapsed time: $ATTACH_WAIT_SECONDS seconds."
            exit 1
        fi
    done

    # Scale deployment down to 0 replicas
    echo "Scale deployment down to 0 replicas."
    kubectl --kubeconfig=$KUBECONFIG scale deployment $DEPLOYMENT_NAME --replicas=0

    # Wait for the volume to be detached after scaling down to 0 replicas
    c=0
    while [ $c -lt $DETACH_WAIT_SECONDS ]
    do
        phase=`kubectl --kubeconfig=$KUBECONFIG -n longhorn-system get volumes $VOLUME_NAME -o=jsonpath="{.status.state}" 2>/dev/null`
        if [ x"$phase" == x"detached" ]; then
            echo "Successfully detached"
            break
        fi

        sleep 1
        c=$((c+1))

        if [ $c -ge $DETACH_WAIT_SECONDS ]; then
            echo "Failed to detach"
            exit 1
        fi
    done

    # Wait until all pods are terminated
    while [[ $(kubectl --kubeconfig=$KUBECONFIG get pods -l=app=$DEPLOYMENT_NAME -o=jsonpath='{.items[*].status.phase}') != "" ]]; do
        pod_status=$(kubectl --kubeconfig=$KUBECONFIG get pods -l=app=$DEPLOYMENT_NAME -o=jsonpath='{.items[*].status.phase}')
        echo "Waiting for pods of $DEPLOYMENT_NAME to be terminated. Current pod status: $pod_status"
        sleep 5
    done
    echo "All pods of $DEPLOYMENT_NAME are terminated."

done

Result: Passed

  1. I did not observe the issue with either Method 1 or Method 2.
  2. @james-munson , could you please review my test Method 2? If you have no concerns, I suggest building a private image for @ajoskowski. Perhaps the users can help verify whether your commit effectively fixes the issue.

@james-munson
Contributor

@roger-ryao, that looks good. I would be willing to build a 1.5.1-based private build (this fix is also being backported to 1.5.4) if @ajoskowski is willing to pre-test it before 1.5.4 releases.

@roger-ryao

Since we haven't received a response from the user, let's close this issue for now. If the user reports the issue again, we can reopen it.

@ajoskowski
Author

Thanks guys, we will verify this fix in 1.5.4

@derekbit
Member

Hello @ajoskowski ,
Has the fix in version 1.5.4+ resolved the issue? Looking forward to receiving your feedback.

@slotdawg

I am seeing this same behavior with blockmode RWX PVCs in Longhorn v1.6.2 in Harvester 1.3.1. When we attempt to export a volume for backup using Kasten K10, we consistently see FailedAttachVolume errors:

AttachVolume.Attach failed for volume "pvc-51e0acd0-4152-4714-8b6d-ec4e40326c5a" : rpc error: code = Aborted desc = volume pvc-51e0acd0-4152-4714-8b6d-ec4e40326c5a is not ready for workloads

@innobead
Member

I am seeing this same behavior with blockmode RWX PVCs in Longhorn v1.6.2 in Harvester 1.3.1. When we attempt to export a volume for backup using Kasten K10, we consistently see FailedAttachVolume errors:

AttachVolume.Attach failed for volume "pvc-51e0acd0-4152-4714-8b6d-ec4e40326c5a" : rpc error: code = Aborted desc = volume pvc-51e0acd0-4152-4714-8b6d-ec4e40326c5a is not ready for workloads

Harvester uses RWX migratable volumes, which are different from traditional RWX volumes. Please create an issue with the reproduction steps and provide the support bundle for the team to check further. Thanks.
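
If it helps to distinguish the two cases, the migratable mode is driven by a Longhorn StorageClass parameter; a sketch of checking it (the StorageClass name here is an assumption from a default Harvester setup, check the PVC's actual StorageClass):

# Sketch: check whether the StorageClass enables Longhorn's migratable mode (used by Harvester)
kubectl get storageclass harvester-longhorn -o jsonpath='{.parameters.migratable}{"\n"}'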

Projects
Status: Resolved
Status: Closed
Development

No branches or pull requests