[BUG] Unable to create snapshot: cannot get engine client because it isn't deployed #7438

Closed
yangchiu opened this issue Dec 25, 2023 · 6 comments
Labels
kind/bug priority/0 Must be implement or fixed in this release (managed by PO) reproduce/rare < 50% reproducible require/backport Require backport. Only used when the specific versions to backport have not been definied. require/qa-review-coverage Require QA to review coverage

@yangchiu
Member

yangchiu commented Dec 25, 2023

Describe the bug (🐛 if you encounter this issue)

Run the test case test_engine_image_daemonset_restart repeatedly. After the engine image DaemonSet has restarted, the engine image status.conditions[0].status becomes True:

status:
    buildDate: "2023-12-05T06:21:33+00:00"
    cliAPIMinVersion: 3
    cliAPIVersion: 9
    conditions:
    - lastProbeTime: ""
      lastTransitionTime: "2023-12-25T09:44:10Z"
      message: Engine image ei-b907910b (longhornio/longhorn-engine:master-head) is
        fully deployed on all ready nodes
      reason: ""
      status: "True"
      type: ready

Even so, the test fails at the create_snapshot step, with a reproducibility of about 20%, with this error message:

longhorn.ApiError: (ApiError(...), '500 : 
failed to create snapshot: cannot get client for volume longhorn-testvol-rayxu0 on node ip-10-0-2-245: cannot get engine client with image longhornio/longhorn-engine:master-head because it isn\'t deployed\n
{\'code\': 500, \'detail\': \'\', \'message\': "failed to create snapshot: cannot get client for volume longhorn-testvol-rayxu0 on node ip-10-0-2-245: cannot get engine client with image longhornio/longhorn-engine:master-head because it isn\'t deployed", \'status\': 500}')

https://ci.longhorn.io/job/private/job/longhorn-tests-regression/5560/

Not sure if it's a bug or the test case needs to be refined.

To Reproduce

Run the test case test_engine_image_daemonset_restart repeatedly.

Expected behavior

Support bundle for troubleshooting

Environment

  • Longhorn version: master-head

Additional context

@yangchiu yangchiu added kind/bug reproduce/rare < 50% reproducible require/qa-review-coverage Require QA to review coverage require/backport Require backport. Only used when the specific versions to backport have not been definied. labels Dec 25, 2023
@yangchiu yangchiu added this to the v1.6.0 milestone Dec 25, 2023
@innobead
Member

From master? This should be a transient issue while some v2 changes are being merged. Let's revisit this issue after all related PRs are merged.

cc @derekbit @shuo-wu

@innobead innobead added the priority/0 Must be implement or fixed in this release (managed by PO) label Dec 25, 2023
@PhanLe1010
Contributor

Hi @yangchiu, how did you capture the state of the engine image at the time the test failed?

status:
    buildDate: "2023-12-05T06:21:33+00:00"
    cliAPIMinVersion: 3
    cliAPIVersion: 9
    conditions:
    - lastProbeTime: ""
      lastTransitionTime: "2023-12-25T09:44:10Z"
      message: Engine image ei-b907910b (longhornio/longhorn-engine:master-head) is
        fully deployed on all ready nodes
      reason: ""
      status: "True"
      type: ready

@yangchiu
Member Author

The test case will check the status before taking a snapshot:

# Wait for the restart complete
common.wait_for_engine_image_condition(client, default_img.name, "True")

# Longhorn is still able to use the corresponding engine binary to
# operate snapshot
check_volume_data(volume, snap1_data)
snap2_data = write_volume_random_data(volume)
create_snapshot(client, volume_name)

This ensures image['conditions'][0]['status'] == "True" before the snapshot is taken.

@PhanLe1010
Contributor

PhanLe1010 commented Dec 28, 2023

Analysis

When deleting the engineimage daemonset (as in the test step of test_engine_image_daemonset_restart), watching the status of the engineimage CR shows that status.state normally becomes deploying and then deployed. However, it can sometimes transition deploying -> deployed -> deploying -> deployed, like this:

---
apiVersion: longhorn.io/v1beta2
kind: EngineImage
metadata:
  creationTimestamp: "2023-12-27T00:20:43Z"
  finalizers:
  - longhorn.io
  generation: 1
  labels:
    longhorn.io/component: engine-image
    longhorn.io/engine-image: ei-b907910b
    longhorn.io/managed-by: longhorn-manager
  name: ei-b907910b
  namespace: longhorn-system
  resourceVersion: "609096"
  uid: 0bfb1dca-b8d9-4d3a-88b8-206369986dee
spec:
  image: longhornio/longhorn-engine:master-head
status:
  buildDate: "2023-12-05T06:21:33+00:00"
  cliAPIMinVersion: 3
  cliAPIVersion: 9
  conditions:
  - lastProbeTime: ""
    lastTransitionTime: "2023-12-27T21:26:55Z"
    message: engine binary check failed
    reason: daemonSet
    status: "False"
    type: ready
  controllerAPIMinVersion: 3
  controllerAPIVersion: 5
  dataFormatMinVersion: 1
  dataFormatVersion: 1
  gitCommit: 014125a20fd26e78be607365b6d090fe93d1f00f
  noRefSince: "2023-12-27T21:10:00Z"
  nodeDeploymentMap:
    phan-v607-engine-image-pool2-e1b8ff09-cmchx: false
    phan-v607-engine-image-pool2-e1b8ff09-ktjfk: true
    phan-v607-engine-image-pool2-e1b8ff09-nk75k: true
  ownerID: phan-v607-engine-image-pool2-e1b8ff09-nk75k
  refCount: 0
  state: deploying
  version: 014125a2
---
apiVersion: longhorn.io/v1beta2
kind: EngineImage
metadata:
  creationTimestamp: "2023-12-27T00:20:43Z"
  finalizers:
  - longhorn.io
  generation: 1
  labels:
    longhorn.io/component: engine-image
    longhorn.io/engine-image: ei-b907910b
    longhorn.io/managed-by: longhorn-manager
  name: ei-b907910b
  namespace: longhorn-system
  resourceVersion: "609150"
  uid: 0bfb1dca-b8d9-4d3a-88b8-206369986dee
spec:
  image: longhornio/longhorn-engine:master-head
status:
  buildDate: "2023-12-05T06:21:33+00:00"
  cliAPIMinVersion: 3
  cliAPIVersion: 9
  conditions:
  - lastProbeTime: ""
    lastTransitionTime: "2023-12-27T21:26:59Z"
    message: Engine image ei-b907910b (longhornio/longhorn-engine:master-head) is
      fully deployed on all ready nodes
    reason: ""
    status: "True"
    type: ready
  controllerAPIMinVersion: 3
  controllerAPIVersion: 5
  dataFormatMinVersion: 1
  dataFormatVersion: 1
  gitCommit: 014125a20fd26e78be607365b6d090fe93d1f00f
  noRefSince: "2023-12-27T21:10:00Z"
  nodeDeploymentMap:
    phan-v607-engine-image-pool2-e1b8ff09-cmchx: true
    phan-v607-engine-image-pool2-e1b8ff09-ktjfk: true
    phan-v607-engine-image-pool2-e1b8ff09-nk75k: true
  ownerID: phan-v607-engine-image-pool2-e1b8ff09-nk75k
  refCount: 0
  state: deployed
  version: 014125a2
---
apiVersion: longhorn.io/v1beta2
kind: EngineImage
metadata:
  creationTimestamp: "2023-12-27T00:20:43Z"
  finalizers:
  - longhorn.io
  generation: 1
  labels:
    longhorn.io/component: engine-image
    longhorn.io/engine-image: ei-b907910b
    longhorn.io/managed-by: longhorn-manager
  name: ei-b907910b
  namespace: longhorn-system
  resourceVersion: "609161"
  uid: 0bfb1dca-b8d9-4d3a-88b8-206369986dee
spec:
  image: longhornio/longhorn-engine:master-head
status:
  buildDate: "2023-12-05T06:21:33+00:00"
  cliAPIMinVersion: 3
  cliAPIVersion: 9
  conditions:
  - lastProbeTime: ""
    lastTransitionTime: "2023-12-27T21:27:01Z"
    message: 'Engine image is not fully deployed on all nodes: 2 of 3'
    reason: daemonSet
    status: "False"
    type: ready
  controllerAPIMinVersion: 3
  controllerAPIVersion: 5
  dataFormatMinVersion: 1
  dataFormatVersion: 1
  gitCommit: 014125a20fd26e78be607365b6d090fe93d1f00f
  noRefSince: "2023-12-27T21:10:00Z"
  nodeDeploymentMap:
    phan-v607-engine-image-pool2-e1b8ff09-cmchx: true
    phan-v607-engine-image-pool2-e1b8ff09-ktjfk: true
    phan-v607-engine-image-pool2-e1b8ff09-nk75k: false
  ownerID: phan-v607-engine-image-pool2-e1b8ff09-nk75k
  refCount: 0
  state: deploying
  version: 014125a2
---
apiVersion: longhorn.io/v1beta2
kind: EngineImage
metadata:
  creationTimestamp: "2023-12-27T00:20:43Z"
  finalizers:
  - longhorn.io
  generation: 1
  labels:
    longhorn.io/component: engine-image
    longhorn.io/engine-image: ei-b907910b
    longhorn.io/managed-by: longhorn-manager
  name: ei-b907910b
  namespace: longhorn-system
  resourceVersion: "609162"
  uid: 0bfb1dca-b8d9-4d3a-88b8-206369986dee
spec:
  image: longhornio/longhorn-engine:master-head
status:
  buildDate: "2023-12-05T06:21:33+00:00"
  cliAPIMinVersion: 3
  cliAPIVersion: 9
  conditions:
  - lastProbeTime: ""
    lastTransitionTime: "2023-12-27T21:27:01Z"
    message: 'Engine image is not fully deployed on all nodes: 2 of 3'
    reason: daemonSet
    status: "False"
    type: ready
  controllerAPIMinVersion: 3
  controllerAPIVersion: 5
  dataFormatMinVersion: 1
  dataFormatVersion: 1
  gitCommit: 014125a20fd26e78be607365b6d090fe93d1f00f
  noRefSince: "2023-12-27T21:10:00Z"
  nodeDeploymentMap:
    phan-v607-engine-image-pool2-e1b8ff09-cmchx: true
    phan-v607-engine-image-pool2-e1b8ff09-ktjfk: true
    phan-v607-engine-image-pool2-e1b8ff09-nk75k: false
  ownerID: phan-v607-engine-image-pool2-e1b8ff09-ktjfk
  refCount: 0
  state: deploying
  version: 014125a2
---
apiVersion: longhorn.io/v1beta2
kind: EngineImage
metadata:
  creationTimestamp: "2023-12-27T00:20:43Z"
  finalizers:
  - longhorn.io
  generation: 1
  labels:
    longhorn.io/component: engine-image
    longhorn.io/engine-image: ei-b907910b
    longhorn.io/managed-by: longhorn-manager
  name: ei-b907910b
  namespace: longhorn-system
  resourceVersion: "609165"
  uid: 0bfb1dca-b8d9-4d3a-88b8-206369986dee
spec:
  image: longhornio/longhorn-engine:master-head
status:
  buildDate: "2023-12-05T06:21:33+00:00"
  cliAPIMinVersion: 3
  cliAPIVersion: 9
  conditions:
  - lastProbeTime: ""
    lastTransitionTime: "2023-12-27T21:27:01Z"
    message: Engine image ei-b907910b (longhornio/longhorn-engine:master-head) is
      fully deployed on all ready nodes
    reason: ""
    status: "True"
    type: ready
  controllerAPIMinVersion: 3
  controllerAPIVersion: 5
  dataFormatMinVersion: 1
  dataFormatVersion: 1
  gitCommit: 014125a20fd26e78be607365b6d090fe93d1f00f
  noRefSince: "2023-12-27T21:10:00Z"
  nodeDeploymentMap:
    phan-v607-engine-image-pool2-e1b8ff09-cmchx: true
    phan-v607-engine-image-pool2-e1b8ff09-ktjfk: true
    phan-v607-engine-image-pool2-e1b8ff09-nk75k: true
  ownerID: phan-v607-engine-image-pool2-e1b8ff09-ktjfk
  refCount: 0
  state: deployed
  version: 014125a2
---

The flapping flow deploying -> deployed -> deploying -> deployed may be caused by the engine image pods being restarted and not coming up at exactly the same time, for example:

  • Start with 3 healthy engine image pods (pod-1, pod-2, pod-3) -> engineimage.status.state is deployed (and the ready condition is true)
  • Kill the engineimage daemonset
  • pod-1 is being terminated -> engineimage.status.state becomes deploying (and the ready condition is false)
  • pod-1-new becomes healthy -> engineimage.status.state becomes deployed (and the ready condition is true)
  • pod-2 is being terminated -> engineimage.status.state becomes deploying (and the ready condition is false)
  • pod-2-new becomes healthy -> engineimage.status.state becomes deployed (and the ready condition is true)
  • ...
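
The sequence above can be modelled with a toy simulation. This is only an illustrative sketch: `derive_state` and `simulate_rolling_restart` are hypothetical helpers, and the rule that the state is deployed only when every entry in nodeDeploymentMap is true is an assumption inferred from the status dumps above, not taken from the longhorn-manager source.

```python
def derive_state(node_deployment_map):
    """Assumed rule: 'deployed' only if every node reports the engine as deployed."""
    return "deployed" if all(node_deployment_map.values()) else "deploying"


def simulate_rolling_restart(nodes):
    """Return the observed state after each pod termination and each recovery."""
    deployment = {node: True for node in nodes}
    states = [derive_state(deployment)]
    for node in nodes:
        deployment[node] = False  # the old pod on this node is being terminated
        states.append(derive_state(deployment))
        deployment[node] = True   # the replacement pod becomes healthy
        states.append(derive_state(deployment))
    return states


states = simulate_rolling_restart(["node-1", "node-2", "node-3"])
print(" -> ".join(states))
```

Under that assumption, a single check that sees deployed can land on any of the intermediate deployed entries, while a later pod is still waiting to be restarted.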

When the engineimage flaps between deploying -> deployed -> deploying -> deployed, the e2e test assumes it is safe to create a volume snapshot the first time the engineimage becomes deployed. However, by the time the API call is made, the engineimage might already have transitioned back to deploying, so the call fails with the error: failed to create snapshot: cannot get client for volume longhorn-testvol-rayxu0 on node ip-10-0-2-245: cannot get engine client with image longhornio/longhorn-engine:master-head because it isn't deployed (status 500).

I think this behavior is not a bug in Longhorn. We just need to adjust the e2e test so that it doesn't expect engineimage.status to strictly follow the flow deploying -> deployed when the engineimage daemonset is deleted. The e2e test should expect that engineimage.status may sometimes flap: deploying -> deployed -> deploying -> deployed. One approach is to make the e2e test wait until engineimage.status has remained stably in deployed for a certain period of time before continuing.
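
A minimal sketch of that "wait until stable" idea is below. It is not the change that was actually merged into longhorn-tests; `wait_for_stable_state`, its parameters, and the default durations are all hypothetical, and `get_state` stands for whatever callable the test suite uses to read the current engine image state.

```python
import time


def wait_for_stable_state(get_state, expected="deployed", stable_seconds=10.0,
                          timeout=300.0, interval=1.0,
                          clock=time.monotonic, sleep=time.sleep):
    """Block until get_state() has returned `expected` continuously for
    `stable_seconds`. Raise TimeoutError if that does not happen in `timeout`.

    `clock` and `sleep` are injectable so the helper can be tested without
    real waiting.
    """
    deadline = clock() + timeout
    stable_since = None
    while clock() < deadline:
        now = clock()
        if get_state() == expected:
            if stable_since is None:
                stable_since = now  # start of a new stability window
            if now - stable_since >= stable_seconds:
                return
        else:
            stable_since = None  # the state flapped: restart the window
        sleep(interval)
    raise TimeoutError(
        f"state did not stay {expected!r} for {stable_seconds}s within {timeout}s"
    )
```

The key design choice is that any observation other than the expected state resets the window, so a brief deployed reading in the middle of a rolling restart is not enough to let the test proceed. The window length is a trade-off: it has to be longer than the gap between consecutive pod restarts, which this sketch does not try to measure.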

@longhorn-io-github-bot

longhorn-io-github-bot commented Dec 28, 2023

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at: issue description

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at:
    The PR for the chart change is at:

  • Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at

  • Which areas/issues this PR might have potential impacts on?
    Area Test
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at longhorn-tests#1642 (Fix flaky test test_engine_image_daemonset_restart)
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

@yangchiu
Member Author

Verified as passing on master-head (longhorn-tests 9b36237) by running the test case test_engine_image_daemonset_restart 10 times.

Test results: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/5733/ ~ https://ci.longhorn.io/job/private/job/longhorn-tests-regression/5742/
