[IMPROVEMENT] Do not count the failure replica reuse failure caused by the disconnection #1923

Closed
shuo-wu opened this issue Oct 29, 2020 · 4 comments
Comments

@shuo-wu
Contributor

shuo-wu commented Oct 29, 2020

Is your feature request related to a problem? Please describe.
This is an enhancement for #1304
A failed-replica reuse failure caused by a disconnection should not be counted toward replica.Spec.RebuildRetryCount. Typically, the reuse retry is designed for data transmission failures during rebuilding.

Describe the solution you'd like
Longhorn can check the failure reason before modifying replica.Spec.RebuildRetryCount. To do this, Longhorn first needs to record the failure reason for each replica.
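
A minimal sketch of that idea (not Longhorn's actual implementation; the types, field set, and helper name below are made up for illustration, while the RebuildFailed condition type and Disconnection reason come from the verification output further down): once the failure reason is recorded as a condition on the replica, the retry-count update can skip failures whose recorded reason is a disconnection.

package main

import "fmt"

// Condition mirrors only the replica-condition fields that matter for this sketch.
type Condition struct {
    Type   string
    Status string
    Reason string
}

// Replica carries just the fields needed here.
type Replica struct {
    Conditions        []Condition
    RebuildRetryCount int
}

// countRebuildFailure increments the retry counter only when the recorded
// RebuildFailed reason is not a disconnection (node down / unstable network).
func countRebuildFailure(r *Replica) {
    for _, c := range r.Conditions {
        if c.Type == "RebuildFailed" && c.Status == "True" && c.Reason == "Disconnection" {
            return // disconnection-caused reuse failure: do not count it
        }
    }
    r.RebuildRetryCount++
}

func main() {
    r := &Replica{Conditions: []Condition{
        {Type: "RebuildFailed", Status: "True", Reason: "Disconnection"},
    }}
    countRebuildFailure(r)
    fmt.Println(r.RebuildRetryCount) // 0: the disconnection failure was not counted
}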

@shuo-wu shuo-wu added kind/feature Feature request, new feature component/longhorn-manager Longhorn manager (control plane) labels Oct 29, 2020
@shuo-wu shuo-wu self-assigned this Oct 29, 2020
@yasker yasker modified the milestones: v1.1.0, v1.1.1 Oct 29, 2020
@yasker yasker added priority/1 Highly recommended to implement or fix in this release (managed by PO) require/automation-engine labels Oct 29, 2020
@yasker yasker modified the milestones: v1.1.1, v1.1.2 Dec 22, 2020
@innobead innobead changed the title [FEATURE]Do not count the failure replica reuse failure caused by the disconnection [FEATURE] Do not count the failure replica reuse failure caused by the disconnection Apr 26, 2021
@innobead innobead modified the milestones: v1.1.2, v1.2.0 Apr 29, 2021
@yasker yasker added the reprioritization-needed Need to reconsider to re-prioritize in another milestone instead of the current one label May 24, 2021
@yasker
Member

yasker commented May 24, 2021

Considering moving this out of v1.2.0.

@innobead innobead modified the milestones: v1.2.0, v1.3.0 May 25, 2021
@innobead innobead added investigation-needed Identified the issue but require further investigation for resolution (won't be stale) and removed investigation-needed Identified the issue but require further investigation for resolution (won't be stale) reprioritization-needed Need to reconsider to re-prioritize in another milestone instead of the current one labels May 25, 2021
@innobead
Member

Hey team! Please add your planning poker estimate with ZenHub @jenting @joshimoo @PhanLe1010 @shuo-wu

@innobead innobead modified the milestones: v1.3.0, v1.4.0 Mar 31, 2022
@innobead innobead added priority/0 Must be implement or fixed in this release (managed by PO) and removed priority/1 Highly recommended to implement or fix in this release (managed by PO) labels Nov 7, 2022
@derekbit derekbit changed the title [FEATURE] Do not count the failure replica reuse failure caused by the disconnection [IMPROVEMENT] Do not count the failure replica reuse failure caused by the disconnection Nov 24, 2022
@derekbit derekbit added kind/improvement Request for improvement of existing function and removed kind/feature Feature request, new feature labels Nov 24, 2022
@longhorn-io-github-bot

longhorn-io-github-bot commented Feb 16, 2023

Pre Ready-For-Testing Checklist

  • Where are the reproduce steps/test steps documented?
    The reproduce steps/test steps are:
  1. Create a volume with 3 replicas and attach it to a node.
  2. Fail replica A so that it starts rebuilding (by bringing the node down or disconnecting the node's network).
  3. Fail replica A again during the rebuild by bringing the node down.
  4. Status.Conditions of replica A should record the failure reason in the condition of type RebuildFailed (see the sketch after these steps).
  5. Spec.rebuildRetryCount should not be increased, because the failure was caused by the node being down or an unstable network.
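
A minimal sketch of the recording step expected in (4) (hypothetical helper, not Longhorn's actual API; the Reason value matches the verification output below, and the Message is shortened for illustration): the rebuild failure reason is upserted into the replica's conditions as a RebuildFailed condition, which is what the retry-count decision in (5) then inspects.

package main

import "fmt"

// Condition mirrors the condition fields shown in the verified kubectl output below.
type Condition struct {
    Type    string
    Status  string
    Reason  string
    Message string
}

// setCondition upserts a condition by type, the way a controller might record
// the rebuild failure reason on the replica status.
func setCondition(conds []Condition, c Condition) []Condition {
    for i := range conds {
        if conds[i].Type == c.Type {
            conds[i] = c
            return conds
        }
    }
    return append(conds, c)
}

func main() {
    var conds []Condition
    // Record a disconnection-caused rebuild failure (shortened message for illustration).
    conds = setCondition(conds, Condition{
        Type:    "RebuildFailed",
        Status:  "True",
        Reason:  "Disconnection",
        Message: "failed to sync files from the source replica: error reading from server: EOF",
    })
    fmt.Printf("%+v\n", conds)
}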

@yangchiu
Member

Verified passed on master-head (longhorn-manager 2976b0d) following the test steps.

Replica A status after the rebuild is triggered at step (2); the rebuildRetryCount is 1:

$ kubectl get replicas.longhorn.io test-1-r-efc67bb4 -n longhorn-system -o yaml
apiVersion: longhorn.io/v1beta2
kind: Replica
metadata:
  creationTimestamp: "2023-02-24T05:39:06Z"
  finalizers:
  - longhorn.io
  generation: 9
  labels:
    longhorn.io/backing-image: ""
    longhorndiskuuid: fb74b396-7faf-433c-9d89-236d9cf21a9b
    longhornnode: ip-10-0-1-164
    longhornvolume: test-1
  name: test-1-r-efc67bb4
  namespace: longhorn-system
  ownerReferences:
  - apiVersion: longhorn.io/v1beta2
    kind: Volume
    name: test-1
    uid: d6bd0a90-1491-4b5c-a625-99502b392a79
  resourceVersion: "13249"
  uid: 879cbddd-cebb-4553-a4d1-cb8bc0ca5982
spec:
  active: true
  backingImage: ""
  baseImage: ""
  dataDirectoryName: test-1-cdd23ac2
  dataPath: ""
  desireState: running
  diskID: fb74b396-7faf-433c-9d89-236d9cf21a9b
  diskPath: /var/lib/longhorn/
  engineImage: longhornio/longhorn-engine:master-head
  engineName: test-1-e-6d206df9
  failedAt: ""
  hardNodeAffinity: ""
  healthyAt: ""
  logRequested: false
  nodeID: ip-10-0-1-164
  rebuildRetryCount: 1
  revisionCounterDisabled: false
  salvageRequested: false
  unmapMarkDiskChainRemovedEnabled: false
  volumeName: test-1
  volumeSize: "21474836480"
status:
  conditions:
  - lastProbeTime: ""
    lastTransitionTime: "2023-02-24T05:39:06Z"
    message: ""
    reason: ""
    status: "True"
    type: InstanceCreation
  - lastProbeTime: ""
    lastTransitionTime: "2023-02-24T05:51:35Z"
    message: ""
    reason: ""
    status: "False"
    type: RebuildFailed
  currentImage: longhornio/longhorn-engine:master-head
  currentState: running
  evictionRequested: false
  instanceManagerName: instance-manager-r-ca8cb45455d676b270031e350c9d67fe
  ip: 10.42.3.29
  logFetched: false
  ownerID: ip-10-0-1-164
  port: 10000
  salvageExecuted: false
  started: true
  storageIP: 10.42.3.29

Replica A status after bringing the node down again at step (3); the rebuildRetryCount is still 1, and RebuildFailed is marked True:

$ kubectl get replicas.longhorn.io test-1-r-efc67bb4 -n longhorn-system -o yaml
apiVersion: longhorn.io/v1beta2
kind: Replica
metadata:
  creationTimestamp: "2023-02-24T05:39:06Z"
  finalizers:
  - longhorn.io
  generation: 10
  labels:
    longhorn.io/backing-image: ""
    longhorndiskuuid: fb74b396-7faf-433c-9d89-236d9cf21a9b
    longhornnode: ip-10-0-1-164
    longhornvolume: test-1
  name: test-1-r-efc67bb4
  namespace: longhorn-system
  ownerReferences:
  - apiVersion: longhorn.io/v1beta2
    kind: Volume
    name: test-1
    uid: d6bd0a90-1491-4b5c-a625-99502b392a79
  resourceVersion: "13309"
  uid: 879cbddd-cebb-4553-a4d1-cb8bc0ca5982
spec:
  active: true
  backingImage: ""
  baseImage: ""
  dataDirectoryName: test-1-cdd23ac2
  dataPath: ""
  desireState: stopped
  diskID: fb74b396-7faf-433c-9d89-236d9cf21a9b
  diskPath: /var/lib/longhorn/
  engineImage: longhornio/longhorn-engine:master-head
  engineName: test-1-e-6d206df9
  failedAt: "2023-02-24T05:52:08Z"
  hardNodeAffinity: ""
  healthyAt: ""
  logRequested: false
  nodeID: ip-10-0-1-164
  rebuildRetryCount: 1
  revisionCounterDisabled: false
  salvageRequested: false
  unmapMarkDiskChainRemovedEnabled: false
  volumeName: test-1
  volumeSize: "21474836480"
status:
  conditions:
  - lastProbeTime: ""
    lastTransitionTime: "2023-02-24T05:39:06Z"
    message: ""
    reason: ""
    status: "True"
    type: InstanceCreation
  - lastProbeTime: ""
    lastTransitionTime: "2023-02-24T05:52:08Z"
    message: 'proxyServer=10.42.2.10:8501 destination=10.42.2.10:10000: failed to
      add replica tcp://10.42.3.29:10000 for volume: rpc error: code = Unknown desc
      = failed to sync files [{FromFileName:volume-snap-37116073-b982-4ba5-a735-54d78e0e3f01.img
      ToFileName:volume-snap-37116073-b982-4ba5-a735-54d78e0e3f01.img ActualSize:10737426432}
      {FromFileName:volume-snap-37116073-b982-4ba5-a735-54d78e0e3f01.img.meta ToFileName:volume-snap-37116073-b982-4ba5-a735-54d78e0e3f01.img.meta
      ActualSize:0}] from tcp://10.42.2.11:10000: rpc error: code = Unavailable desc
      = error reading from server: EOF'
    reason: Disconnection
    status: "True"
    type: RebuildFailed
  currentImage: longhornio/longhorn-engine:master-head
  currentState: running
  evictionRequested: false
  instanceManagerName: instance-manager-r-ca8cb45455d676b270031e350c9d67fe
  ip: 10.42.3.29
  logFetched: false
  ownerID: ip-10-0-1-164
  port: 10000
  salvageExecuted: false
  started: true
  storageIP: 10.42.3.29

Labels
backport/1.4.1; component/longhorn-manager (Longhorn manager, control plane); kind/improvement (Request for improvement of existing function); priority/1 (Highly recommended to implement or fix in this release, managed by PO)