[BUG] During volume live engine upgrade, delete replica with old engine image will make volume degraded forever #7012

chriscchien · 2023-10-31T10:21:51Z

Describe the bug (🐛 if you encounter this issue)

If you start a volume live engine upgrade and immediately delete any old replica (a replica with the old engine image), the volume stays degraded forever and remains stuck in the upgrading process.

Longhorn cannot continue a volume engine upgrade while the volume is degraded, but it does allow deleting a replica during a live engine upgrade. This is a corner case and may need a developer to clarify whether the behavior is expected. Thanks.

To Reproduce

Deploy Longhorn master
Deploy previous version of engine image (For example longhornio/longhorn-engine:v1.5.1)
Create a volume, change its engine image to the previous one, and attach it to a node.
(Alternatively, instead of the previous steps, upgrade Longhorn from the previous stable version, with a volume attached, to master-head.)
Upgrade engine image to longhornio/longhorn-engine:master-head
Immediately delete any replica with previous version of engine image (longhornio/longhorn-engine:v1.5.1)
The volume stays in the degraded state and the upgrading state forever.

Expected behavior

Either replica deletion should be prevented during an engine upgrade, or the volume should become healthy after the reproduce steps.

Support bundle for troubleshooting

Replica status (3 replicas with the new engine image, 2 with the old engine image; 1 old replica was deleted earlier). All are in the running state.

root@ip-172-31-37-125:/home/ubuntu# k get replicas -A
NAMESPACE         NAME              STATE     NODE               DISK                                   INSTANCEMANAGER                                     IMAGE                                    AGE
longhorn-system   vol1-r-09d1657e   running   ip-172-31-39-5     7534dafa-7aa8-4a38-ab36-6489d4816df8   instance-manager-5a717c7cbbeb5e5be5256f400c06cefa   longhornio/longhorn-engine:v1.5.1        18m
longhorn-system   vol1-r-ceeca1de   running   ip-172-31-33-252   ef5818b7-7a71-4a3d-8a83-ded1db249781   instance-manager-8885f2af73361cc6339316476675ac80   longhornio/longhorn-engine:v1.5.1        18m
longhorn-system   vol1-r-55945773   running   ip-172-31-33-252   ef5818b7-7a71-4a3d-8a83-ded1db249781   instance-manager-8885f2af73361cc6339316476675ac80   longhornio/longhorn-engine:master-head   14m
longhorn-system   vol1-r-6b8a86f7   running   ip-172-31-39-5     7534dafa-7aa8-4a38-ab36-6489d4816df8   instance-manager-5a717c7cbbeb5e5be5256f400c06cefa   longhornio/longhorn-engine:master-head   14m
longhorn-system   vol1-r-6299d11f   running   ip-172-31-37-125   131d1525-237e-4539-8de5-2f182968a0fc   instance-manager-e74b6159a4e44c6a1da0cef333baa415   longhornio/longhorn-engine:master-head   14m
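
To see the mix of engine images at a glance, the listing above can be tallied per image. This is a minimal Python sketch, assuming the whitespace-separated column layout shown above with no empty cells; the helper name `count_replicas_by_image` is hypothetical and not part of Longhorn.

```python
from collections import Counter

# Listing copied from the `k get replicas -A` output above.
LISTING = """\
NAMESPACE         NAME              STATE     NODE               DISK                                   INSTANCEMANAGER                                     IMAGE                                    AGE
longhorn-system   vol1-r-09d1657e   running   ip-172-31-39-5     7534dafa-7aa8-4a38-ab36-6489d4816df8   instance-manager-5a717c7cbbeb5e5be5256f400c06cefa   longhornio/longhorn-engine:v1.5.1        18m
longhorn-system   vol1-r-ceeca1de   running   ip-172-31-33-252   ef5818b7-7a71-4a3d-8a83-ded1db249781   instance-manager-8885f2af73361cc6339316476675ac80   longhornio/longhorn-engine:v1.5.1        18m
longhorn-system   vol1-r-55945773   running   ip-172-31-33-252   ef5818b7-7a71-4a3d-8a83-ded1db249781   instance-manager-8885f2af73361cc6339316476675ac80   longhornio/longhorn-engine:master-head   14m
longhorn-system   vol1-r-6b8a86f7   running   ip-172-31-39-5     7534dafa-7aa8-4a38-ab36-6489d4816df8   instance-manager-5a717c7cbbeb5e5be5256f400c06cefa   longhornio/longhorn-engine:master-head   14m
longhorn-system   vol1-r-6299d11f   running   ip-172-31-37-125   131d1525-237e-4539-8de5-2f182968a0fc   instance-manager-e74b6159a4e44c6a1da0cef333baa415   longhornio/longhorn-engine:master-head   14m
"""


def count_replicas_by_image(listing: str) -> Counter:
    """Count running replicas per engine image in a replica listing.

    Assumes every cell is non-empty, so splitting on whitespace keeps
    the columns aligned with the header.
    """
    lines = [line for line in listing.splitlines() if line.strip()]
    header = lines[0].split()
    state_col = header.index("STATE")
    image_col = header.index("IMAGE")
    counts = Counter()
    for line in lines[1:]:
        fields = line.split()
        if fields[state_col] == "running":
            counts[fields[image_col]] += 1
    return counts
```

Calling `count_replicas_by_image(LISTING)` gives 2 running replicas on `longhornio/longhorn-engine:v1.5.1` and 3 on `longhornio/longhorn-engine:master-head`, which matches the summary above.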

longhorn-manager log (note the "Engine has been upgraded" info messages)

time="2023-10-31T09:40:04Z" level=info msg="Cloned a new matching replica vol1-r-6299d11f from vol1-r-5bcc401a" func="controller.(*VolumeController).createAndStartMatchingReplicas" file="volume_controller.go:3748" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-172-31-33-252 owner=ip-172-31-33-252 state=attached volume=vol1
time="2023-10-31T09:40:05Z" level=warning msg="Instance vol1-r-55945773 starts running, Storage IP 10.42.1.110" func="controller.(*InstanceHandler).syncStatusWithInstanceManager" file="instance_handler.go:158"
time="2023-10-31T09:40:05Z" level=warning msg="Instance vol1-r-55945773 starts running, IP 10.42.1.110" func="controller.(*InstanceHandler).syncStatusWithInstanceManager" file="instance_handler.go:163"
time="2023-10-31T09:40:05Z" level=warning msg="Instance vol1-r-55945773 starts running, Port 10011" func="controller.(*InstanceHandler).syncStatusWithInstanceManager" file="instance_handler.go:167"
time="2023-10-31T09:40:05Z" level=info msg="Upgrading engine from longhornio/longhorn-engine:v1.5.1 to longhornio/longhorn-engine:master-head" func="controller.(*EngineController).Upgrade" file="engine_controller.go:1969" controller=longhorn-engine engine=vol1-e-0 node=ip-172-31-33-252
time="2023-10-31T09:40:07Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"vol1\", UID:\"ebe894c6-74c2-44c3-a518-8c70d8b75473\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"34338\", FieldPath:\"\"}): type: 'Normal' reason: 'Degraded' volume vol1 became degraded" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:298"
time="2023-10-31T09:40:08Z" level=info msg="Engine has been upgraded from longhornio/longhorn-engine:v1.5.1 to longhornio/longhorn-engine:master-head" func="controller.(*EngineController).Upgrade" file="engine_controller.go:1974" controller=longhorn-engine engine=vol1-e-0 node=ip-172-31-33-252
time="2023-10-31T09:40:08Z" level=warning msg="Instance vol1-e-0 starts running, Port 10010" func="controller.(*InstanceHandler).syncStatusWithInstanceManager" file="instance_handler.go:167"
time="2023-10-31T09:40:08Z" level=error msg="Failed to run engine live upgrade" func="controller.(*EngineController).syncEngine" file="engine_controller.go:323" controller=longhorn-engine engine=vol1-e-0 error="failed to live upgrade image for vol1-e-0: proxyServer=10.42.1.110:8501 destination=10.42.1.110:10010: failed to get server version: rpc error: code = Unknown desc = failed to get version detail: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.42.1.110:10010: connect: connection refused\"" node=ip-172-31-33-252
time="2023-10-31T09:40:08Z" level=info msg="Updating engine current replica address map to map[vol1-r-09d1657e:10.42.3.116:10000 vol1-r-ceeca1de:10.42.1.110:10000]" func="controller.(*EngineController).syncEngine" file="engine_controller.go:331" controller=longhorn-engine engine=vol1-e-0 node=ip-172-31-33-252
time="2023-10-31T09:40:09Z" level=error msg="Failed to run engine live upgrade" func="controller.(*EngineController).syncEngine" file="engine_controller.go:323" controller=longhorn-engine engine=vol1-e-0 error="failed to live upgrade image for vol1-e-0: proxyServer=10.42.1.110:8501 destination=10.42.1.110:10010: failed to get server version: rpc error: code = Unknown desc = failed to get version detail: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.42.1.110:10010: connect: connection refused\"" node=ip-172-31-33-252
time="2023-10-31T09:40:09Z" level=error msg="Failed to run engine live upgrade" func="controller.(*EngineController).syncEngine" file="engine_controller.go:323" controller=longhorn-engine engine=vol1-e-0 error="failed to live upgrade image for vol1-e-0: proxyServer=10.42.1.110:8501 destination=10.42.1.110:10010: failed to get server version: rpc error: code = Unknown desc = failed to get version detail: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.42.1.110:10010: connect: connection refused\"" node=ip-172-31-33-252
time="2023-10-31T09:40:09Z" level=warning msg="Instance vol1-e-0 starts running, Port 10021" func="controller.(*InstanceHandler).syncStatusWithInstanceManager" file="instance_handler.go:167"
time="2023-10-31T09:40:09Z" level=info msg="Upgrading engine from longhornio/longhorn-engine:v1.5.1 to longhornio/longhorn-engine:master-head" func="controller.(*EngineController).Upgrade" file="engine_controller.go:1969" controller=longhorn-engine engine=vol1-e-0 node=ip-172-31-33-252
time="2023-10-31T09:40:09Z" level=info msg="The existing engine instance already has the new engine image longhornio/longhorn-engine:master-head" func="controller.(*EngineController).UpgradeEngineInstance" file="engine_controller.go:2025" controller=longhorn-engine engine=vol1-e-0 node=ip-172-31-33-252
time="2023-10-31T09:40:09Z" level=info msg="Engine has been upgraded from longhornio/longhorn-engine:v1.5.1 to longhornio/longhorn-engine:master-head" func="controller.(*EngineController).Upgrade" file="engine_controller.go:1974" controller=longhorn-engine engine=vol1-e-0 node=ip-172-31-33-252
10.42.0.1 - - [31/Oct/2023:09:38:14 +0000] "GET /v1/ws/1s/nodes HTTP/1.1" 200 0 "" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36"

supportbundle_e6e3e73a-e898-4617-81ad-e21ad5fa3be4_2023-10-31T09-53-33Z.zip

Environment

Longhorn version: master-head
Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.28.2+k3s1
Volume names: vol1

Additional context

Can reproduce on v1.5.x-head

innobead · 2023-10-31T11:06:25Z

@chriscchien Is this a regression from 1.5.1/1.4.3? or an existing issue?

innobead · 2023-10-31T11:46:55Z

@PhanLe1010 Please help check this.

chriscchien · 2023-10-31T12:30:47Z

> @chriscchien Is this a regression from 1.5.1/1.4.3? or an existing issue?

It's a regression; we had a manual test case that included this scenario.

PhanLe1010 · 2023-10-31T22:57:58Z

Looks like this is not a regression, as I am able to reproduce it in v1.4.3 with these steps:

Deploy Longhorn v1.4.3
Deploy previous version of engine image (For example longhornio/longhorn-engine:v1.4.2)
Create a volume, change its engine image to the previous one, and attach it to a node.
Upgrade engine image to longhornio/longhorn-engine:v1.4.3
Immediately delete any replica with previous version of engine image (longhornio/longhorn-engine:v1.4.2)
The volume stays in the degraded state and the upgrading state forever.

In the current implementation, we don't continue the live engine upgrade when the volume is unhealthy (degraded): https://github.com/longhorn/longhorn-manager/blob/b810121b33789d145f220bfd0e41102a7801a354/controller/volume_controller.go#L2735C1-L2739C1. The user would need to detach and reattach the volume to get out of this situation.
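
The stuck state can be illustrated with a toy model. The sketch below is not Longhorn code: the `Volume` class, its fields, and `reconcile` are hypothetical. It assumes only what is described in this thread, namely that the upgrade is skipped while the volume is degraded, plus the labeled assumption (inferred from the observed behavior) that no replica is rebuilt while an upgrade is still pending.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    """Hypothetical, simplified stand-in for a Longhorn volume."""
    desired_replicas: int
    healthy_replicas: int
    current_image: str
    target_image: str

    @property
    def robustness(self) -> str:
        if self.healthy_replicas >= self.desired_replicas:
            return "healthy"
        return "degraded"

    @property
    def upgrading(self) -> bool:
        return self.current_image != self.target_image


def reconcile(vol: Volume) -> str:
    """Run one toy reconcile pass and report what it did."""
    if vol.upgrading:
        if vol.robustness != "healthy":
            # The check discussed above: no live upgrade while degraded.
            return "upgrade skipped: volume is degraded"
        vol.current_image = vol.target_image
        return "upgrade finished"
    if vol.robustness != "healthy":
        # Assumption: rebuilding only happens when no upgrade is pending.
        vol.healthy_replicas += 1
        return "rebuilt one replica"
    return "nothing to do"
```

With `Volume(3, 2, "longhornio/longhorn-engine:v1.5.1", "longhornio/longhorn-engine:master-head")`, every call to `reconcile` takes the skip branch, so under these assumptions the model never leaves the degraded, upgrading state.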

Maybe we can keep this ticket open to see if we can make an improvement, but I think this one is not a regression or a release blocker.

PhanLe1010 · 2023-10-31T23:52:30Z

Regarding the error:

[longhorn-manager-fxtsp] time="2023-10-31T23:25:30Z" level=error msg="Failed to run engine live upgrade" func="controller.(*EngineController).syncEngine" file="engine_controller.go:323" controller=longhorn-engine engine=testvol-e-0 error="failed to live upgrade image for testvol-e-0: proxyServer=10.42.251.72:8501 destination=10.42.251.72:10010: failed to get server version: rpc error: code = Unknown desc = failed to get version detail: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.42.251.72:10010: connect: connection refused\"" node=phan-v500-pool2-d3e1c5d8-9qxsq

After instance-manager successfully replaced the engine process on the old port with one on a new port, it looks like the engine controller resynced the engine CR and retried the upgrade, but it wasn't aware that the engine had already moved to the new port (port 10021 in my case). So the retry failed. After some time, the engine monitor updates the port to the new one, so the engine controller eventually realizes that the engine was already upgraded successfully.

[longhorn-manager-fxtsp] time="2023-10-31T23:25:30Z" level=info msg="The existing engine instance already has the new engine image longhornio/longhorn-engine:master-head" func="controller.(*EngineController).UpgradeEngineInstance" file="engine_controller.go:2025" controller=longhorn-engine engine=testvol-e-0 node=phan-v500-pool2-d3e1c5d8-9qxsq
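
The port sequence described above can be read straight out of the longhorn-manager log. The following is a small Python sketch, assuming the logrus-style `key=value` text format seen in the log lines quoted in this thread; `parse_log_line` and `port_timeline` are hypothetical helpers, not part of Longhorn.

```python
import re
import shlex

PORT_RE = re.compile(r"Instance (\S+) starts running, Port (\d+)")
DIAL_RE = re.compile(r"dial tcp (\S+):(\d+): connect: connection refused")


def parse_log_line(line: str) -> dict:
    """Split one key=value log line into a dict.

    shlex handles the double-quoted values and their escaped quotes;
    tokens without '=' (such as a pod-name prefix) are ignored.
    """
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def port_timeline(lines):
    """Return (event, instance, port) tuples in log order.

    "running" means an instance reported a port; "refused" means a
    dial to that port failed with "connection refused".
    """
    events = []
    for line in lines:
        if "time=" not in line:
            continue  # skip non-logrus lines such as HTTP access logs
        fields = parse_log_line(line)
        match = PORT_RE.search(fields.get("msg", ""))
        if match:
            events.append(("running", match.group(1), int(match.group(2))))
        match = DIAL_RE.search(fields.get("error", ""))
        if match:
            events.append(("refused", fields.get("engine", ""), int(match.group(2))))
    return events
```

Fed the log from the original report, this yields "refused" events for `vol1-e-0` on port 10010 followed by a "running" event on port 10021, which is consistent with the stale-port retry described above.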

PhanLe1010 · 2023-12-14T22:37:34Z

Test plan:

Deploy Longhorn master-head
Deploy the previous version of the engine image (For example longhornio/longhorn-engine:v1.5.3)
Create a 5GB volume, change the engine image to previous one, attach it to a node.
Write 1GB of random data to the volume and compute the checksum of the data
Upgrade engine image to longhornio/longhorn-engine:master-head
Immediately delete any replica with the previous version of engine image (longhornio/longhorn-engine:v1.5.3)
The volume should eventually finish the live upgrade
Verify the checksum of the data
Repeat the test 5 times
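
The write-and-checksum steps in the plan could be scripted roughly as follows. This is a sketch under stated assumptions: the test plan does not name a checksum tool, so SHA-256 is an assumption, and `write_random_data` and `file_checksum` are hypothetical helpers. The target path would be a file on the filesystem mounted from the attached volume.

```python
import hashlib
import os


def write_random_data(path: str, size_bytes: int, chunk_size: int = 1 << 20) -> str:
    """Write size_bytes of random data to path.

    Returns the SHA-256 hex digest of the bytes that were written.
    """
    digest = hashlib.sha256()
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            chunk = os.urandom(min(chunk_size, remaining))
            f.write(chunk)
            digest.update(chunk)
            remaining -= len(chunk)
        # Flush to the device so the data is on the volume before the upgrade.
        f.flush()
        os.fsync(f.fileno())
    return digest.hexdigest()


def file_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Record the value returned by `write_random_data(path, 1 << 30)` before the upgrade, then compare it with `file_checksum(path)` once the live upgrade has finished.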

longhorn-io-github-bot · 2023-12-14T22:37:54Z

Pre Ready-For-Testing Checklist

PhanLe1010 · 2023-12-14T22:50:14Z

Recommending to backport to v1.5.4.
I am not sure if backporting to 1.4.5 is needed, @innobead ?

PhanLe1010 · 2023-12-21T01:09:18Z

Hi @chriscchien, this one depends on the new issue #7396. Let's wait for that one to merge first to fix a regression.

chriscchien · 2023-12-22T02:52:27Z

Verified as passing on Longhorn master (longhorn-manager cc7f12) with the test steps.

When a replica with the old engine image is deleted during a volume live engine upgrade, the engine upgrade succeeds and the volume becomes healthy. In addition, the data in the volume is correct.

chriscchien added this to the v1.6.0 milestone Oct 31, 2023

innobead added the priority/0 Must be implement or fixed in this release (managed by PO) label Oct 31, 2023

innobead assigned PhanLe1010 Oct 31, 2023

chriscchien added kind/regression Regression which has worked before severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade) and removed severity/3 Function working but has a major issue w/ workaround labels Oct 31, 2023

innobead removed the kind/regression Regression which has worked before label Nov 1, 2023

PhanLe1010 mentioned this issue Dec 14, 2023

Fix bug volume stuck in live engine upgrading forever if it was degraded during live engine upgrade longhorn/longhorn-manager#2363

Merged

PhanLe1010 added require/backport Require backport. Only used when the specific versions to backport have not been definied. backport/1.5.4 and removed require/backport Require backport. Only used when the specific versions to backport have not been definied. backport/1.5.4 labels Dec 14, 2023

github-actions bot mentioned this issue Dec 14, 2023

[BACKPORT][v1.5.4][BUG] During volume live engine upgrade, delete replica with old engine image will make volume degraded forever #7334

Closed

PhanLe1010 added the backport/1.5.4 label Dec 14, 2023

PhanLe1010 added the require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated label Dec 14, 2023

chriscchien self-assigned this Dec 15, 2023

github-actions bot mentioned this issue Dec 15, 2023

[TEST][BUG] During volume live engine upgrade, delete replica with old engine image will make volume degraded forever #7346

Open

innobead added area/stability System or volume stability area/resilience System or volume resilience labels Dec 19, 2023

roger-ryao mentioned this issue Dec 20, 2023

[BUG] Failed to check_volume_data after volume engine upgrade/migration #7396

Closed

chriscchien closed this as completed Dec 22, 2023

derekbit added this to Longhorn Sprint Aug 3, 2024

derekbit moved this to Closed in Longhorn Sprint Aug 3, 2024
