
[BUG] Instance manager may not update instance status for a minute after starting #5809

Closed
ejweber opened this issue Apr 25, 2023 · 4 comments

ejweber commented Apr 25, 2023

Describe the bug (🐛 if you encounter this issue)

When a node is cold rebooted, Longhorn eventually notices. After a little while, Longhorn decides all instance processes on that node should be stopped before they can be started again. However, the instance manager controller monitor is no longer running, so InstanceProcessStatuses are not updated.

When the instance manager comes back online, the replica controller tries to kill replica processes. However, for a full minute, InstanceProcessStatuses are still not updated, so Longhorn is not aware that the replica processes have already been killed. Full recovery cannot begin until this minute is up.

To Reproduce

Steps to reproduce the behavior:

  1. Create at least one volume with at least two replicas.
  2. Attach the volume to one node.
  3. Reboot the other node (e.g., SSH in and run `reboot now`).
  4. When instance-manager-r comes back online, monitor the logs. You will see something like the following:
    • One full minute of attempts to kill a replica process that is definitely not running before anything else happens.
    • One full minute in which instanceManager.status.instances[].status.state = running, even though the process is not running.
    • One full minute in which replica.spec.desiredState = stopped but replica.status.currentState = running, even though the replica is not running.
2023-04-24T18:20:56.318449496Z time="2023-04-24T18:20:56Z" level=info msg="Storing process logs at path: /var/log/instances"
2023-04-24T18:20:56.318529601Z [longhorn-instance-manager] time="2023-04-24T18:20:56Z" level=warning msg="Failed to add TLS key pair from /tls-files/" error="open /tls-files/tls.crt: no such file or directory"
2023-04-24T18:20:56.318536614Z [longhorn-instance-manager] time="2023-04-24T18:20:56Z" level=info msg="Creating gRPC server with no auth"
2023-04-24T18:20:56.318541016Z [longhorn-instance-manager] time="2023-04-24T18:20:56Z" level=info msg="Instance Manager proxy gRPC server listening to 0.0.0.0:8501"
2023-04-24T18:20:56.318545116Z [longhorn-instance-manager] time="2023-04-24T18:20:56Z" level=info msg="Instance Manager process gRPC server listening to 0.0.0.0:8500"
2023-04-24T18:20:57.110671536Z [longhorn-instance-manager] time="2023-04-24T18:20:57Z" level=info msg="Started new process manager update watch"
2023-04-24T18:20:57.176274620Z [longhorn-instance-manager] time="2023-04-24T18:20:57Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:20:57.343966728Z [longhorn-instance-manager] time="2023-04-24T18:20:57Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:07.761683795Z [longhorn-instance-manager] time="2023-04-24T18:21:07Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:07.772749817Z [longhorn-instance-manager] time="2023-04-24T18:21:07Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:07.935662832Z [longhorn-instance-manager] time="2023-04-24T18:21:07Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:07.965271521Z [longhorn-instance-manager] time="2023-04-24T18:21:07Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:08.141594846Z [longhorn-instance-manager] time="2023-04-24T18:21:08Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:08.147821855Z [longhorn-instance-manager] time="2023-04-24T18:21:08Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:08.193646014Z [longhorn-instance-manager] time="2023-04-24T18:21:08Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:16.015204433Z [longhorn-instance-manager] time="2023-04-24T18:21:16Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:27.843741381Z [longhorn-instance-manager] time="2023-04-24T18:21:27Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:38.184150101Z [longhorn-instance-manager] time="2023-04-24T18:21:38Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:46.021051790Z [longhorn-instance-manager] time="2023-04-24T18:21:46Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:57.304418332Z [longhorn-instance-manager] time="2023-04-24T18:21:57Z" level=info msg="Process Manager: prepare to create process 4gb-r-768bdb99"
2023-04-24T18:21:57.304791715Z [longhorn-instance-manager] time="2023-04-24T18:21:57Z" level=debug msg="Process Manager: validate process path: /host/var/lib/longhorn/engine-binaries/rancher-mirrored-longhornio-longhorn-engine-v1.4.1/longhorn dir: /host/var/lib/longhorn/engine-binaries/ image: rancher-mirrored-longhornio-longhorn-engine-v1.4.1 binary: longhorn"
2023-04-24T18:21:57.305052318Z [longhorn-instance-manager] time="2023-04-24T18:21:57Z" level=info msg="Process Manager: created process 4gb-r-768bdb99"
2023-04-24T18:21:57.335766556Z [4gb-r-768bdb99] time="2023-04-24T18:21:57Z" level=info msg="Listening on data server 0.0.0.0:10046"
2023-04-24T18:21:57.335798653Z time="2023-04-24T18:21:57Z" level=info msg="Listening on gRPC Replica server 0.0.0.0:10045"
2023-04-24T18:21:57.335804247Z time="2023-04-24T18:21:57Z" level=info msg="Listening on sync agent server 0.0.0.0:10047"

Expected behavior

Statuses are correct almost immediately after instance-manager-r starts, and processes actually start soon after that.
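The expected behavior can be sketched as the general pattern the linked PR title suggests (illustrative Go only, not Longhorn's actual code): perform one status sync immediately when the monitor (re)starts, then fall back to the periodic resync, instead of waiting a full period for the first update. `syncLoop` and `countSyncs` are hypothetical names.

```go
package main

import "fmt"

// syncLoop runs one sync immediately at startup, then one per periodic
// resync tick, so statuses are refreshed right away after a restart.
func syncLoop(ticks <-chan struct{}, sync func()) {
	sync() // startup sync: don't wait for the first tick
	for range ticks {
		sync() // periodic resync
	}
}

// countSyncs drives syncLoop with n periodic ticks and reports how many
// syncs happened in total.
func countSyncs(n int) int {
	ticks := make(chan struct{}, n)
	for i := 0; i < n; i++ {
		ticks <- struct{}{}
	}
	close(ticks)
	count := 0
	syncLoop(ticks, func() { count++ })
	return count
}

func main() {
	// Even with zero periodic ticks, one sync happens at startup.
	fmt.Println(countSyncs(0), countSyncs(3)) // 1 4
}
```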

Log or Support bundle

See above.

Environment

  • Longhorn version: v1.4.1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
    • Number of management node in the cluster:
    • Number of worker node in the cluster:
  • Node config
    • OS type and version:
    • CPU per node:
    • Memory per node:
    • Disk type(e.g. SSD/NVMe):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

Ran across this while trying to solve #5709.

@innobead (Member) commented:

@ejweber Please remember to move the issue to the right pipeline to reflect the status. I assume this is in review?


longhorn-io-github-bot commented Apr 27, 2023

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps are at: [BUG] Instance manager may not update instance status for a minute after starting #5809 (comment).
    The expected output is at: Don't wait a minute after starting instance manager to update instance status longhorn-manager#1865 (comment) (attempt the reproduce steps after merging the changes and check the output against the expected output).

  • Is there a workaround for the issue? If so, where is it documented?

  • Does the PR include the explanation for the fix or the feature?

  • ~~Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?~~

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at Don't wait a minute after starting instance manager to update instance status longhorn-manager#1865

  • Which areas/issues this PR might have potential impacts on?
    This shouldn't have any drastic effects. In specific reboot/restart scenarios, status updates will be more responsive.

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only a test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)?

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?

  • If labeled: require/manual-test-plan Has the manual test plan been documented?

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label?

@innobead (Member) commented:

@ejweber move this to ready-for-testing and provide the testing info for @longhorn/qa. Thanks.

Also, update the backport issues accordingly.

@innobead innobead modified the milestones: v1.5.0, v1.6.0 May 3, 2023
@innobead innobead assigned chriscchien and unassigned roger-ryao May 8, 2023
@chriscchien (Contributor) commented:

Verified in longhorn master (longhorn-manager b2f7a6) with the test steps.
Result: Pass

The instance-manager pod no longer waits one minute to update its status after restarting.

root@ip-172-31-81-81:/home/ubuntu/k9s/longhorn# k logs instance-manager-bee795b7c8c4dcbaf35a4c810908b1f4 -n longhorn-system
tgtd: work_timer_start(146) use timer_fd based scheduler
tgtd: bs_init(387) use signalfd notification
time="2023-05-09T08:46:03Z" level=info msg="Storing process logs at path: /var/log/instances"
[longhorn-instance-manager] time="2023-05-09T08:46:03Z" level=warning msg="Failed to add TLS key pair from /tls-files/" error="open /tls-files/tls.crt: no such file or directory"
[longhorn-instance-manager] time="2023-05-09T08:46:03Z" level=info msg="Creating gRPC server with no auth"
[longhorn-instance-manager] time="2023-05-09T08:46:03Z" level=info msg="Instance Manager proxy gRPC server listening to 0.0.0.0:8501"
[longhorn-instance-manager] time="2023-05-09T08:46:03Z" level=info msg="Instance Manager process gRPC server listening to 0.0.0.0:8500"
[longhorn-instance-manager] time="2023-05-09T08:46:05Z" level=info msg="Started new process manager update watch"
[longhorn-instance-manager] time="2023-05-09T08:46:08Z" level=info msg="Process Manager: prepare to create process pvc-b57aaafd-9afa-4b2a-a9cc-ed2922bb9965-r-71d1b6a5"
[longhorn-instance-manager] time="2023-05-09T08:46:08Z" level=debug msg="Process Manager: validate process path: /host/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn dir: /host/var/lib/longhorn/engine-binaries/ image: longhornio-longhorn-engine-master-head binary: longhorn"
[longhorn-instance-manager] time="2023-05-09T08:46:08Z" level=info msg="Process Manager: created process pvc-b57aaafd-9afa-4b2a-a9cc-ed2922bb9965-r-71d1b6a5"
[pvc-b57aaafd-9afa-4b2a-a9cc-ed2922bb9965-r-71d1b6a5] time="2023-05-09T08:46:08Z" level=info msg="Listening on data server 0.0.0.0:10001"
[pvc-b57aaafd-9afa-4b2a-a9cc-ed2922bb9965-r-71d1b6a5] time="2023-05-09T08:46:08Z" level=info msg="Listening on gRPC Replica server 0.0.0.0:10000"
time="2023-05-09T08:46:08Z" level=info msg="Listening on sync agent server 0.0.0.0:10002"
[pvc-b57aaafd-9afa-4b2a-a9cc-ed2922bb9965-r-71d1b6a5] time="2023-05-09T08:46:08Z" level=info msg="Listening on sync 0.0.0.0:10002"
