[BUG] During volume live engine upgrade, delete replica with old engine image will make volume degraded forever #7012

Closed
chriscchien opened this issue Oct 31, 2023 · 10 comments
Assignees
Labels
area/resilience System or volume resilience area/stability System or volume stability area/v1-data-engine v1 data engine (iSCSI tgt) backport/1.5.4 kind/bug priority/0 Must be implement or fixed in this release (managed by PO) reproduce/always 100% reproducible require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated require/backport Require backport. Only used when the specific versions to backport have not been definied. require/qa-review-coverage Require QA to review coverage severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade)
Milestone

Comments

@chriscchien
Contributor

Describe the bug (🐛 if you encounter this issue)

When performing a volume live engine upgrade and immediately deleting any old replica (a replica with the old engine image), the volume stays degraded forever and is stuck in the upgrading process.

Longhorn cannot perform a volume engine upgrade when the volume is degraded, but it does allow a replica to be deleted during a volume live engine upgrade. This is a corner case and may need a developer to clarify whether it is expected, thanks.

To Reproduce

  1. Deploy Longhorn master.
  2. Deploy the previous version of the engine image (for example, longhornio/longhorn-engine:v1.5.1).
  3. Create a volume, change its engine image to the previous one, and attach it to a node.
    (Or, instead of the previous steps, upgrade Longhorn from the previous stable version, with a volume attached, to master-head.)
  4. Upgrade the engine image to longhornio/longhorn-engine:master-head.
  5. Immediately delete any replica with the previous version of the engine image (longhornio/longhorn-engine:v1.5.1).
  6. The volume stays in the degraded state and the upgrading state forever.

Expected behavior

Either prevent replica deletion during an engine upgrade, or have the volume become healthy after the reproduce steps are performed.

Support bundle for troubleshooting

Replica status (3 replicas with the new engine image, 2 with the old engine image (1 was deleted before)); all are in the running state:

root@ip-172-31-37-125:/home/ubuntu# k get replicas -A
NAMESPACE         NAME              STATE     NODE               DISK                                   INSTANCEMANAGER                                     IMAGE                                    AGE
longhorn-system   vol1-r-09d1657e   running   ip-172-31-39-5     7534dafa-7aa8-4a38-ab36-6489d4816df8   instance-manager-5a717c7cbbeb5e5be5256f400c06cefa   longhornio/longhorn-engine:v1.5.1        18m
longhorn-system   vol1-r-ceeca1de   running   ip-172-31-33-252   ef5818b7-7a71-4a3d-8a83-ded1db249781   instance-manager-8885f2af73361cc6339316476675ac80   longhornio/longhorn-engine:v1.5.1        18m
longhorn-system   vol1-r-55945773   running   ip-172-31-33-252   ef5818b7-7a71-4a3d-8a83-ded1db249781   instance-manager-8885f2af73361cc6339316476675ac80   longhornio/longhorn-engine:master-head   14m
longhorn-system   vol1-r-6b8a86f7   running   ip-172-31-39-5     7534dafa-7aa8-4a38-ab36-6489d4816df8   instance-manager-5a717c7cbbeb5e5be5256f400c06cefa   longhornio/longhorn-engine:master-head   14m
longhorn-system   vol1-r-6299d11f   running   ip-172-31-37-125   131d1525-237e-4539-8de5-2f182968a0fc   instance-manager-e74b6159a4e44c6a1da0cef333baa415   longhornio/longhorn-engine:master-head   14m
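
As a quick way to check the mix of engine images in a listing like the one above, the table can be tallied by its IMAGE column. This is a minimal sketch that assumes the whitespace-separated column layout shown here with no empty cells; `count_running_replicas_by_image` is a hypothetical helper, not part of Longhorn or kubectl.

```python
from collections import Counter

# Rows copied from the `k get replicas -A` output above.
SAMPLE = """\
NAMESPACE         NAME              STATE     NODE               DISK                                   INSTANCEMANAGER                                     IMAGE                                    AGE
longhorn-system   vol1-r-09d1657e   running   ip-172-31-39-5     7534dafa-7aa8-4a38-ab36-6489d4816df8   instance-manager-5a717c7cbbeb5e5be5256f400c06cefa   longhornio/longhorn-engine:v1.5.1        18m
longhorn-system   vol1-r-ceeca1de   running   ip-172-31-33-252   ef5818b7-7a71-4a3d-8a83-ded1db249781   instance-manager-8885f2af73361cc6339316476675ac80   longhornio/longhorn-engine:v1.5.1        18m
longhorn-system   vol1-r-55945773   running   ip-172-31-33-252   ef5818b7-7a71-4a3d-8a83-ded1db249781   instance-manager-8885f2af73361cc6339316476675ac80   longhornio/longhorn-engine:master-head   14m
longhorn-system   vol1-r-6b8a86f7   running   ip-172-31-39-5     7534dafa-7aa8-4a38-ab36-6489d4816df8   instance-manager-5a717c7cbbeb5e5be5256f400c06cefa   longhornio/longhorn-engine:master-head   14m
longhorn-system   vol1-r-6299d11f   running   ip-172-31-37-125   131d1525-237e-4539-8de5-2f182968a0fc   instance-manager-e74b6159a4e44c6a1da0cef333baa415   longhornio/longhorn-engine:master-head   14m
"""


def count_running_replicas_by_image(table: str) -> Counter:
    """Count running replicas per engine image in a kubectl-style table."""
    lines = [line for line in table.splitlines() if line.strip()]
    header = lines[0].split()
    state_col = header.index("STATE")
    image_col = header.index("IMAGE")

    counts = Counter()
    for row in lines[1:]:
        fields = row.split()
        if fields[state_col] == "running":
            counts[fields[image_col]] += 1
    return counts


counts = count_running_replicas_by_image(SAMPLE)
for image, n in sorted(counts.items()):
    print(f"{image}: {n}")
```

For the listing above this gives two running replicas on the old image and three on the new one, matching the summary line.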

longhorn-manager log (it shows the "Engine has been upgraded" info message):

time="2023-10-31T09:40:04Z" level=info msg="Cloned a new matching replica vol1-r-6299d11f from vol1-r-5bcc401a" func="controller.(*VolumeController).createAndStartMatchingReplicas" file="volume_controller.go:3748" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-172-31-33-252 owner=ip-172-31-33-252 state=attached volume=vol1
time="2023-10-31T09:40:05Z" level=warning msg="Instance vol1-r-55945773 starts running, Storage IP 10.42.1.110" func="controller.(*InstanceHandler).syncStatusWithInstanceManager" file="instance_handler.go:158"
time="2023-10-31T09:40:05Z" level=warning msg="Instance vol1-r-55945773 starts running, IP 10.42.1.110" func="controller.(*InstanceHandler).syncStatusWithInstanceManager" file="instance_handler.go:163"
time="2023-10-31T09:40:05Z" level=warning msg="Instance vol1-r-55945773 starts running, Port 10011" func="controller.(*InstanceHandler).syncStatusWithInstanceManager" file="instance_handler.go:167"
time="2023-10-31T09:40:05Z" level=info msg="Upgrading engine from longhornio/longhorn-engine:v1.5.1 to longhornio/longhorn-engine:master-head" func="controller.(*EngineController).Upgrade" file="engine_controller.go:1969" controller=longhorn-engine engine=vol1-e-0 node=ip-172-31-33-252
time="2023-10-31T09:40:07Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"vol1\", UID:\"ebe894c6-74c2-44c3-a518-8c70d8b75473\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"34338\", FieldPath:\"\"}): type: 'Normal' reason: 'Degraded' volume vol1 became degraded" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:298"
time="2023-10-31T09:40:08Z" level=info msg="Engine has been upgraded from longhornio/longhorn-engine:v1.5.1 to longhornio/longhorn-engine:master-head" func="controller.(*EngineController).Upgrade" file="engine_controller.go:1974" controller=longhorn-engine engine=vol1-e-0 node=ip-172-31-33-252
time="2023-10-31T09:40:08Z" level=warning msg="Instance vol1-e-0 starts running, Port 10010" func="controller.(*InstanceHandler).syncStatusWithInstanceManager" file="instance_handler.go:167"
time="2023-10-31T09:40:08Z" level=error msg="Failed to run engine live upgrade" func="controller.(*EngineController).syncEngine" file="engine_controller.go:323" controller=longhorn-engine engine=vol1-e-0 error="failed to live upgrade image for vol1-e-0: proxyServer=10.42.1.110:8501 destination=10.42.1.110:10010: failed to get server version: rpc error: code = Unknown desc = failed to get version detail: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.42.1.110:10010: connect: connection refused\"" node=ip-172-31-33-252
time="2023-10-31T09:40:08Z" level=info msg="Updating engine current replica address map to map[vol1-r-09d1657e:10.42.3.116:10000 vol1-r-ceeca1de:10.42.1.110:10000]" func="controller.(*EngineController).syncEngine" file="engine_controller.go:331" controller=longhorn-engine engine=vol1-e-0 node=ip-172-31-33-252
time="2023-10-31T09:40:09Z" level=error msg="Failed to run engine live upgrade" func="controller.(*EngineController).syncEngine" file="engine_controller.go:323" controller=longhorn-engine engine=vol1-e-0 error="failed to live upgrade image for vol1-e-0: proxyServer=10.42.1.110:8501 destination=10.42.1.110:10010: failed to get server version: rpc error: code = Unknown desc = failed to get version detail: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.42.1.110:10010: connect: connection refused\"" node=ip-172-31-33-252
time="2023-10-31T09:40:09Z" level=error msg="Failed to run engine live upgrade" func="controller.(*EngineController).syncEngine" file="engine_controller.go:323" controller=longhorn-engine engine=vol1-e-0 error="failed to live upgrade image for vol1-e-0: proxyServer=10.42.1.110:8501 destination=10.42.1.110:10010: failed to get server version: rpc error: code = Unknown desc = failed to get version detail: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.42.1.110:10010: connect: connection refused\"" node=ip-172-31-33-252
time="2023-10-31T09:40:09Z" level=warning msg="Instance vol1-e-0 starts running, Port 10021" func="controller.(*InstanceHandler).syncStatusWithInstanceManager" file="instance_handler.go:167"
time="2023-10-31T09:40:09Z" level=info msg="Upgrading engine from longhornio/longhorn-engine:v1.5.1 to longhornio/longhorn-engine:master-head" func="controller.(*EngineController).Upgrade" file="engine_controller.go:1969" controller=longhorn-engine engine=vol1-e-0 node=ip-172-31-33-252
time="2023-10-31T09:40:09Z" level=info msg="The existing engine instance already has the new engine image longhornio/longhorn-engine:master-head" func="controller.(*EngineController).UpgradeEngineInstance" file="engine_controller.go:2025" controller=longhorn-engine engine=vol1-e-0 node=ip-172-31-33-252
time="2023-10-31T09:40:09Z" level=info msg="Engine has been upgraded from longhornio/longhorn-engine:v1.5.1 to longhornio/longhorn-engine:master-head" func="controller.(*EngineController).Upgrade" file="engine_controller.go:1974" controller=longhorn-engine engine=vol1-e-0 node=ip-172-31-33-252
10.42.0.1 - - [31/Oct/2023:09:38:14 +0000] "GET /v1/ws/1s/nodes HTTP/1.1" 200 0 "" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36"
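
To pick the failing entries out of a log like this, the `key=value` fields can be split apart. The following is a rough sketch, assuming the `key="quoted value"` / `key=bare` layout seen in these lines; `parse_line` and `PAIR` are hypothetical names and this is not a complete parser for every log format.

```python
import re

# Two lines copied from the longhorn-manager log above.
LOG_LINES = r"""
time="2023-10-31T09:40:08Z" level=warning msg="Instance vol1-e-0 starts running, Port 10010" func="controller.(*InstanceHandler).syncStatusWithInstanceManager" file="instance_handler.go:167"
time="2023-10-31T09:40:08Z" level=error msg="Failed to run engine live upgrade" func="controller.(*EngineController).syncEngine" file="engine_controller.go:323" controller=longhorn-engine engine=vol1-e-0 error="failed to live upgrade image for vol1-e-0: proxyServer=10.42.1.110:8501 destination=10.42.1.110:10010: failed to get server version: rpc error: code = Unknown desc = failed to get version detail: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.42.1.110:10010: connect: connection refused\"" node=ip-172-31-33-252
""".strip().splitlines()

# key="value with \"escaped\" quotes"  or  key=bare-token
PAIR = re.compile(r'(\w+)=(?:"((?:[^"\\]|\\.)*)"|(\S+))')


def parse_line(line: str) -> dict:
    """Split one log line into a dict of its key=value fields."""
    fields = {}
    for key, quoted, bare in PAIR.findall(line):
        # Unescape inner quotes for quoted values; bare values are used as-is.
        fields[key] = quoted.replace('\\"', '"') if quoted else bare
    return fields


errors = [f for f in map(parse_line, LOG_LINES) if f.get("level") == "error"]
for entry in errors:
    print(entry["time"], entry["engine"], entry["msg"])
```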

supportbundle_e6e3e73a-e898-4617-81ad-e21ad5fa3be4_2023-10-31T09-53-33Z.zip

Environment

  • Longhorn version: master-head
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.28.2+k3s1
  • Volume names: vol1

Additional context

Can reproduce on v1.5.x-head

@chriscchien chriscchien added kind/bug area/v1-data-engine v1 data engine (iSCSI tgt) reproduce/always 100% reproducible severity/3 Function working but has a major issue w/ workaround require/qa-review-coverage Require QA to review coverage require/backport Require backport. Only used when the specific versions to backport have not been definied. labels Oct 31, 2023
@chriscchien chriscchien added this to the v1.6.0 milestone Oct 31, 2023
@innobead innobead added the priority/0 Must be implement or fixed in this release (managed by PO) label Oct 31, 2023
@innobead
Member

@chriscchien Is this a regression from 1.5.1/1.4.3? or an existing issue?

@innobead
Member

@PhanLe1010 Please help check this.

@chriscchien chriscchien added kind/regression Regression which has worked before severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade) and removed severity/3 Function working but has a major issue w/ workaround labels Oct 31, 2023
@chriscchien
Contributor Author

chriscchien commented Oct 31, 2023

@chriscchien Is this a regression from 1.5.1/1.4.3? or an existing issue?

It's a regression; we had a manual test case that includes this scenario.

@PhanLe1010
Contributor

PhanLe1010 commented Oct 31, 2023

Looks like this is not a regression as I am able to reproduce it in v1.4.3 by:

  1. Deploy Longhorn v1.4.3.
  2. Deploy the previous version of the engine image (for example, longhornio/longhorn-engine:v1.4.2).
  3. Create a volume, change its engine image to the previous one, and attach it to a node.
  4. Upgrade the engine image to longhornio/longhorn-engine:v1.4.3.
  5. Immediately delete any replica with the previous version of the engine image (longhornio/longhorn-engine:v1.4.2).
  6. The volume stays in the degraded state and the upgrading state forever.

In the current implementation, we don't continue the live engine upgrade when the volume is unhealthy (degraded); see https://github.com/longhorn/longhorn-manager/blob/b810121b33789d145f220bfd0e41102a7801a354/controller/volume_controller.go#L2735C1-L2739C1. The user would need to detach and reattach the volume to get out of this situation.

Maybe we can keep this ticket to see if we can make improvements, but I think this one is not a regression or a release blocker.
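
The guard described above can be pictured with a small model. This is only an illustrative sketch, not the longhorn-manager code at the linked lines: the `Volume` class, its fields, and `should_continue_live_upgrade` are hypothetical names, and the single robustness check is an assumption based on the description in this comment.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    """Hypothetical, simplified view of a volume during a live upgrade."""
    current_image: str
    target_image: str
    robustness: str = "healthy"  # "healthy" or "degraded"


def should_continue_live_upgrade(vol: Volume) -> bool:
    """A pending upgrade only proceeds while the volume is healthy."""
    upgrade_pending = vol.current_image != vol.target_image
    return upgrade_pending and vol.robustness == "healthy"


vol = Volume(
    current_image="longhornio/longhorn-engine:v1.4.2",
    target_image="longhornio/longhorn-engine:v1.4.3",
)
healthy_result = should_continue_live_upgrade(vol)

# Deleting an old-image replica mid-upgrade leaves the volume degraded.
vol.robustness = "degraded"
degraded_result = should_continue_live_upgrade(vol)

print(healthy_result, degraded_result)
```

In this model nothing moves the volume back to healthy while the upgrade is still pending, which is consistent with the reported symptom of a volume stuck in both the degraded and upgrading states.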

@PhanLe1010
Contributor

Regarding the error:

[longhorn-manager-fxtsp] time="2023-10-31T23:25:30Z" level=error msg="Failed to run engine live upgrade" func="controller.(*EngineController).syncEngine" file="engine_controller.go:323" controller=longhorn-engine engine=testvol-e-0 error="failed to live upgrade image for testvol-e-0: proxyServer=10.42.251.72:8501 destination=10.42.251.72:10010: failed to get server version: rpc error: code = Unknown desc = failed to get version detail: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.42.251.72:10010: connect: connection refused\"" node=phan-v500-pool2-d3e1c5d8-9qxsq

After instance-manager successfully replaced the engine process on the old port with one on a new port, it looks like the engine controller resynced the engine CR and retried the upgrade, but it wasn't aware that the engine had already moved to a new port (port 10021 in my case), so the retry failed. After some time, the engine monitor updates the port to the new one, so the engine controller eventually realizes that it has already successfully upgraded the engine.

[longhorn-manager-fxtsp] time="2023-10-31T23:25:30Z" level=info msg="The existing engine instance already has the new engine image longhornio/longhorn-engine:master-head" func="controller.(*EngineController).UpgradeEngineInstance" file="engine_controller.go:2025" controller=longhorn-engine engine=testvol-e-0 node=phan-v500-pool2-d3e1c5d8-9qxsq
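
The sequence described in this comment (retry against a stale port, then success once the port is refreshed) can be mimicked with a toy model. Everything here is a hypothetical stand-in: `FakeInstanceManager`, `get_version`, and the `cached_port` variable are illustrative names, not Longhorn APIs, and the ports are simply the ones that appear in the logs.

```python
class FakeInstanceManager:
    """Hypothetical stand-in that tracks where the engine process listens."""

    def __init__(self, port: int, image: str):
        self.live_port = port
        self.image = image

    def replace_engine(self, new_port: int, new_image: str) -> None:
        # The upgraded engine process comes up on a different port.
        self.live_port = new_port
        self.image = new_image

    def get_version(self, port: int) -> str:
        if port != self.live_port:
            raise ConnectionRefusedError(
                f"dial tcp :{port}: connect: connection refused"
            )
        return self.image


im = FakeInstanceManager(10010, "longhornio/longhorn-engine:v1.5.1")
cached_port = 10010  # what the controller still believes

# The upgrade succeeds and the engine moves to the new port.
im.replace_engine(10021, "longhornio/longhorn-engine:master-head")

# The controller retries with the stale port and is refused.
try:
    im.get_version(cached_port)
    retry_failed = False
except ConnectionRefusedError:
    retry_failed = True

# Later the monitor refreshes the port, and the retry sees the new image.
cached_port = im.live_port
already_upgraded = (
    im.get_version(cached_port) == "longhornio/longhorn-engine:master-head"
)
print(retry_failed, already_upgraded)
```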

@PhanLe1010
Contributor

PhanLe1010 commented Dec 14, 2023

Test plan:

  1. Deploy Longhorn master-head.
  2. Deploy the previous version of the engine image (for example, longhornio/longhorn-engine:v1.5.3).
  3. Create a 5GB volume, change its engine image to the previous one, and attach it to a node.
  4. Write 1GB of random data to the volume and compute the checksum of the data.
  5. Upgrade the engine image to longhornio/longhorn-engine:master-head.
  6. Immediately delete any replica with the previous version of the engine image (longhornio/longhorn-engine:v1.5.3).
  7. The volume should eventually finish the live upgrade.
  8. Verify the checksum of the data.
  9. Repeat the test 5 times.
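
The write-and-checksum steps in this plan could be scripted along the following lines. This is a sketch under assumptions: the plan does not say which checksum or tool is used, so SHA-256 is an arbitrary choice, and `write_random_data` and `sha256_of` are hypothetical helper names. The demo writes a small temporary file rather than 1GB to a mounted volume.

```python
import hashlib
import os
import tempfile


def write_random_data(path: str, size_bytes: int, chunk_size: int = 1024 * 1024) -> None:
    """Write size_bytes of random data to path in chunks."""
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk_size, remaining)
            f.write(os.urandom(n))
            remaining -= n


def sha256_of(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        data_file = os.path.join(workdir, "data.bin")
        write_random_data(data_file, 4 * 1024 * 1024)
        before = sha256_of(data_file)
        # ... the engine upgrade and replica deletion would happen here ...
        after = sha256_of(data_file)
        print("checksum unchanged:", before == after)
```

Reading in fixed-size chunks keeps memory use flat, which matters once the file is the 1GB the plan calls for.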

@longhorn-io-github-bot

longhorn-io-github-bot commented Dec 14, 2023

Pre Ready-For-Testing Checklist

@PhanLe1010 PhanLe1010 added require/backport Require backport. Only used when the specific versions to backport have not been definied. backport/1.5.4 and removed require/backport Require backport. Only used when the specific versions to backport have not been definied. backport/1.5.4 labels Dec 14, 2023
@PhanLe1010
Contributor

I recommend backporting this to v1.5.4.
I am not sure whether backporting to 1.4.5 is needed, @innobead?

@PhanLe1010 PhanLe1010 added the require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated label Dec 14, 2023
@chriscchien chriscchien self-assigned this Dec 15, 2023
@PhanLe1010
Contributor

Hi @chriscchien, this one depends on the new issue #7396. Let's wait for that one to merge first, since it fixes a regression.

@chriscchien
Contributor Author

Verified as passing on Longhorn master (longhorn-manager cc7f12) with the test steps.

When a replica with the old engine image is deleted during a volume live engine upgrade, the engine upgrade succeeds and the volume becomes healthy. In addition, the data in the volume is correct.

4 participants