[IMPROVEMENT] Do not count the failure replica reuse failure caused by the disconnection #1923

Closed
shuo-wu opened this issue Oct 29, 2020 · 4 comments
Comments

@shuo-wu
Contributor

shuo-wu commented Oct 29, 2020

Is your feature request related to a problem? Please describe.
This is an enhancement for #1304
A failed-replica reuse failure caused by a disconnection should not be counted toward replica.Spec.RebuildRetryCount. Typically, the reuse retry is designed for data transmission failures during rebuilding.

Describe the solution you'd like
Longhorn can check the failure reason before modifying replica.Spec.RebuildRetryCount. To do this, Longhorn first needs to record the failure reason for each replica.
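
A minimal sketch of that idea (not Longhorn's actual implementation; the types, field set, and helper name below are made up for illustration, while the RebuildFailed condition type and Disconnection reason come from the verification output further down): once the failure reason is recorded as a condition on the replica, the retry-count update can skip failures whose recorded reason is a disconnection.

package main

import "fmt"

// Condition mirrors only the replica-condition fields that matter for this sketch.
type Condition struct {
    Type   string
    Status string
    Reason string
}

// Replica carries just the fields needed here.
type Replica struct {
    Conditions        []Condition
    RebuildRetryCount int
}

// countRebuildFailure increments the retry counter only when the recorded
// RebuildFailed reason is not a disconnection (node down / unstable network).
func countRebuildFailure(r *Replica) {
    for _, c := range r.Conditions {
        if c.Type == "RebuildFailed" && c.Status == "True" && c.Reason == "Disconnection" {
            return // disconnection-caused reuse failure: do not count it
        }
    }
    r.RebuildRetryCount++
}

func main() {
    r := &Replica{Conditions: []Condition{
        {Type: "RebuildFailed", Status: "True", Reason: "Disconnection"},
    }}
    countRebuildFailure(r)
    fmt.Println(r.RebuildRetryCount) // 0: the disconnection failure was not counted
}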

@shuo-wu shuo-wu added kind/feature Feature request, new feature component/longhorn-manager Longhorn manager (control plane) labels Oct 29, 2020
@shuo-wu shuo-wu self-assigned this Oct 29, 2020
@yasker yasker modified the milestones: v1.1.0, v1.1.1 Oct 29, 2020
@yasker yasker added priority/1 Highly recommended to implement or fix in this release (managed by PO) require/automation-engine labels Oct 29, 2020
@yasker yasker modified the milestones: v1.1.1, v1.1.2 Dec 22, 2020
@innobead innobead changed the title [FEATURE]Do not count the failure replica reuse failure caused by the disconnection [FEATURE] Do not count the failure replica reuse failure caused by the disconnection Apr 26, 2021
@innobead innobead modified the milestones: v1.1.2, v1.2.0 Apr 29, 2021
@yasker yasker added the reprioritization-needed Need to reconsider to re-prioritize in another milestone instead of the current one label May 24, 2021
@yasker
Member

yasker commented May 24, 2021

Considering moving this out of v1.2.0.

@innobead innobead modified the milestones: v1.2.0, v1.3.0 May 25, 2021
@innobead innobead added investigation-needed Identified the issue but require further investigation for resolution (won't be stale) and removed investigation-needed Identified the issue but require further investigation for resolution (won't be stale) reprioritization-needed Need to reconsider to re-prioritize in another milestone instead of the current one labels May 25, 2021
@innobead
Member

Hey team! Please add your planning poker estimate with ZenHub @jenting @joshimoo @PhanLe1010 @shuo-wu

@innobead innobead modified the milestones: v1.3.0, v1.4.0 Mar 31, 2022
@innobead innobead added priority/0 Must be implement or fixed in this release (managed by PO) and removed priority/1 Highly recommended to implement or fix in this release (managed by PO) labels Nov 7, 2022
@derekbit derekbit changed the title [FEATURE] Do not count the failure replica reuse failure caused by the disconnection [IMPROVEMENT] Do not count the failure replica reuse failure caused by the disconnection Nov 24, 2022
@derekbit derekbit added kind/improvement Request for improvement of existing function and removed kind/feature Feature request, new feature labels Nov 24, 2022
@longhorn-io-github-bot

longhorn-io-github-bot commented Feb 16, 2023

Pre Ready-For-Testing Checklist

  • Where are the reproduce steps/test steps documented?
    The reproduce steps/test steps are:
  1. Create a volume with 3 replicas and attach it to a node.
  2. Fail replica A so that it starts rebuilding (by bringing the node down or disconnecting the node's network).
  3. Fail replica A again during the rebuild by bringing the node down.
  4. Status.Conditions of replica A should record the failure reason in the condition of type RebuildFailed (see the sketch after these steps).
  5. Spec.rebuildRetryCount should not be increased, because the failure was caused by the node being down or an unstable network.
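
A minimal sketch of the recording step expected in (4) (hypothetical helper, not Longhorn's actual API; the Reason value matches the verification output below, and the Message is shortened for illustration): the rebuild failure reason is upserted into the replica's conditions as a RebuildFailed condition, which is what the retry-count decision in (5) then inspects.

package main

import "fmt"

// Condition mirrors the condition fields shown in the verified kubectl output below.
type Condition struct {
    Type    string
    Status  string
    Reason  string
    Message string
}

// setCondition upserts a condition by type, the way a controller might record
// the rebuild failure reason on the replica status.
func setCondition(conds []Condition, c Condition) []Condition {
    for i := range conds {
        if conds[i].Type == c.Type {
            conds[i] = c
            return conds
        }
    }
    return append(conds, c)
}

func main() {
    var conds []Condition
    // Record a disconnection-caused rebuild failure (shortened message for illustration).
    conds = setCondition(conds, Condition{
        Type:    "RebuildFailed",
        Status:  "True",
        Reason:  "Disconnection",
        Message: "failed to sync files from the source replica: error reading from server: EOF",
    })
    fmt.Printf("%+v\n", conds)
}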

@yangchiu
Member

Verified passed on master-head (longhorn-manager 2976b0d) following the test steps.

Replica A status after the rebuild is triggered at step (2); the rebuildRetryCount is 1:

$ kubectl get replicas.longhorn.io test-1-r-efc67bb4 -n longhorn-system -o yaml
apiVersion: longhorn.io/v1beta2
kind: Replica
metadata:
  creationTimestamp: "2023-02-24T05:39:06Z"
  finalizers:
  - longhorn.io
  generation: 9
  labels:
    longhorn.io/backing-image: ""
    longhorndiskuuid: fb74b396-7faf-433c-9d89-236d9cf21a9b
    longhornnode: ip-10-0-1-164
    longhornvolume: test-1
  name: test-1-r-efc67bb4
  namespace: longhorn-system
  ownerReferences:
  - apiVersion: longhorn.io/v1beta2
    kind: Volume
    name: test-1
    uid: d6bd0a90-1491-4b5c-a625-99502b392a79
  resourceVersion: "13249"
  uid: 879cbddd-cebb-4553-a4d1-cb8bc0ca5982
spec:
  active: true
  backingImage: ""
  baseImage: ""
  dataDirectoryName: test-1-cdd23ac2
  dataPath: ""
  desireState: running
  diskID: fb74b396-7faf-433c-9d89-236d9cf21a9b
  diskPath: /var/lib/longhorn/
  engineImage: longhornio/longhorn-engine:master-head
  engineName: test-1-e-6d206df9
  failedAt: ""
  hardNodeAffinity: ""
  healthyAt: ""
  logRequested: false
  nodeID: ip-10-0-1-164
  rebuildRetryCount: 1
  revisionCounterDisabled: false
  salvageRequested: false
  unmapMarkDiskChainRemovedEnabled: false
  volumeName: test-1
  volumeSize: "21474836480"
status:
  conditions:
  - lastProbeTime: ""
    lastTransitionTime: "2023-02-24T05:39:06Z"
    message: ""
    reason: ""
    status: "True"
    type: InstanceCreation
  - lastProbeTime: ""
    lastTransitionTime: "2023-02-24T05:51:35Z"
    message: ""
    reason: ""
    status: "False"
    type: RebuildFailed
  currentImage: longhornio/longhorn-engine:master-head
  currentState: running
  evictionRequested: false
  instanceManagerName: instance-manager-r-ca8cb45455d676b270031e350c9d67fe
  ip: 10.42.3.29
  logFetched: false
  ownerID: ip-10-0-1-164
  port: 10000
  salvageExecuted: false
  started: true
  storageIP: 10.42.3.29

Replica A status after bringing the node down again at step (3); the rebuildRetryCount is still 1, and RebuildFailed is marked True:

$ kubectl get replicas.longhorn.io test-1-r-efc67bb4 -n longhorn-system -o yaml
apiVersion: longhorn.io/v1beta2
kind: Replica
metadata:
  creationTimestamp: "2023-02-24T05:39:06Z"
  finalizers:
  - longhorn.io
  generation: 10
  labels:
    longhorn.io/backing-image: ""
    longhorndiskuuid: fb74b396-7faf-433c-9d89-236d9cf21a9b
    longhornnode: ip-10-0-1-164
    longhornvolume: test-1
  name: test-1-r-efc67bb4
  namespace: longhorn-system
  ownerReferences:
  - apiVersion: longhorn.io/v1beta2
    kind: Volume
    name: test-1
    uid: d6bd0a90-1491-4b5c-a625-99502b392a79
  resourceVersion: "13309"
  uid: 879cbddd-cebb-4553-a4d1-cb8bc0ca5982
spec:
  active: true
  backingImage: ""
  baseImage: ""
  dataDirectoryName: test-1-cdd23ac2
  dataPath: ""
  desireState: stopped
  diskID: fb74b396-7faf-433c-9d89-236d9cf21a9b
  diskPath: /var/lib/longhorn/
  engineImage: longhornio/longhorn-engine:master-head
  engineName: test-1-e-6d206df9
  failedAt: "2023-02-24T05:52:08Z"
  hardNodeAffinity: ""
  healthyAt: ""
  logRequested: false
  nodeID: ip-10-0-1-164
  rebuildRetryCount: 1
  revisionCounterDisabled: false
  salvageRequested: false
  unmapMarkDiskChainRemovedEnabled: false
  volumeName: test-1
  volumeSize: "21474836480"
status:
  conditions:
  - lastProbeTime: ""
    lastTransitionTime: "2023-02-24T05:39:06Z"
    message: ""
    reason: ""
    status: "True"
    type: InstanceCreation
  - lastProbeTime: ""
    lastTransitionTime: "2023-02-24T05:52:08Z"
    message: 'proxyServer=10.42.2.10:8501 destination=10.42.2.10:10000: failed to
      add replica tcp://10.42.3.29:10000 for volume: rpc error: code = Unknown desc
      = failed to sync files [{FromFileName:volume-snap-37116073-b982-4ba5-a735-54d78e0e3f01.img
      ToFileName:volume-snap-37116073-b982-4ba5-a735-54d78e0e3f01.img ActualSize:10737426432}
      {FromFileName:volume-snap-37116073-b982-4ba5-a735-54d78e0e3f01.img.meta ToFileName:volume-snap-37116073-b982-4ba5-a735-54d78e0e3f01.img.meta
      ActualSize:0}] from tcp://10.42.2.11:10000: rpc error: code = Unavailable desc
      = error reading from server: EOF'
    reason: Disconnection
    status: "True"
    type: RebuildFailed
  currentImage: longhornio/longhorn-engine:master-head
  currentState: running
  evictionRequested: false
  instanceManagerName: instance-manager-r-ca8cb45455d676b270031e350c9d67fe
  ip: 10.42.3.29
  logFetched: false
  ownerID: ip-10-0-1-164
  port: 10000
  salvageExecuted: false
  started: true
  storageIP: 10.42.3.29

Labels
backport/1.4.1; component/longhorn-manager (Longhorn manager, control plane); kind/improvement (Request for improvement of existing function); priority/1 (Highly recommended to implement or fix in this release, managed by PO)