
[BUG] Instance manager may not update instance status for a minute after starting #5809

Closed
ejweber opened this issue Apr 25, 2023 · 4 comments

ejweber commented Apr 25, 2023

Describe the bug (🐛 if you encounter this issue)

When a node is cold rebooted, Longhorn eventually notices. After a little while, Longhorn decides all instance processes on that node should be stopped before they can be started again. However, the instance manager controller monitor is no longer running, so InstanceProcessStatuses are not updated.

When the instance manager comes back online, the replica controller tries to kill replica processes. However, for a full minute, InstanceProcessStatuses are still not updated, so Longhorn is not aware that the replica processes have already been killed. Full recovery cannot begin until this minute is up.

To Reproduce

Steps to reproduce the behavior:

  1. Create at least one volume with at least two replicas.
  2. Attach the volume to one node.
  3. Reboot the other node (e.g., SSH in and run `reboot now`).
  4. When instance-manager-r comes back online, monitor the logs. You will see something like the following:
    • One full minute of attempts to kill a replica process that is definitely not running before anything else happens.
    • One full minute in which instanceManager.status.instances[].status.state = running, even though the process is not running.
    • One full minute in which replica.spec.desiredState = stopped but replica.status.currentState = running, even though the replica is not running.
2023-04-24T18:20:56.318449496Z time="2023-04-24T18:20:56Z" level=info msg="Storing process logs at path: /var/log/instances"
2023-04-24T18:20:56.318529601Z [longhorn-instance-manager] time="2023-04-24T18:20:56Z" level=warning msg="Failed to add TLS key pair from /tls-files/" error="open /tls-files/tls.crt: no such file or directory"
2023-04-24T18:20:56.318536614Z [longhorn-instance-manager] time="2023-04-24T18:20:56Z" level=info msg="Creating gRPC server with no auth"
2023-04-24T18:20:56.318541016Z [longhorn-instance-manager] time="2023-04-24T18:20:56Z" level=info msg="Instance Manager proxy gRPC server listening to 0.0.0.0:8501"
2023-04-24T18:20:56.318545116Z [longhorn-instance-manager] time="2023-04-24T18:20:56Z" level=info msg="Instance Manager process gRPC server listening to 0.0.0.0:8500"
2023-04-24T18:20:57.110671536Z [longhorn-instance-manager] time="2023-04-24T18:20:57Z" level=info msg="Started new process manager update watch"
2023-04-24T18:20:57.176274620Z [longhorn-instance-manager] time="2023-04-24T18:20:57Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:20:57.343966728Z [longhorn-instance-manager] time="2023-04-24T18:20:57Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:07.761683795Z [longhorn-instance-manager] time="2023-04-24T18:21:07Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:07.772749817Z [longhorn-instance-manager] time="2023-04-24T18:21:07Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:07.935662832Z [longhorn-instance-manager] time="2023-04-24T18:21:07Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:07.965271521Z [longhorn-instance-manager] time="2023-04-24T18:21:07Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:08.141594846Z [longhorn-instance-manager] time="2023-04-24T18:21:08Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:08.147821855Z [longhorn-instance-manager] time="2023-04-24T18:21:08Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:08.193646014Z [longhorn-instance-manager] time="2023-04-24T18:21:08Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:16.015204433Z [longhorn-instance-manager] time="2023-04-24T18:21:16Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:27.843741381Z [longhorn-instance-manager] time="2023-04-24T18:21:27Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:38.184150101Z [longhorn-instance-manager] time="2023-04-24T18:21:38Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:46.021051790Z [longhorn-instance-manager] time="2023-04-24T18:21:46Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:57.304418332Z [longhorn-instance-manager] time="2023-04-24T18:21:57Z" level=info msg="Process Manager: prepare to create process 4gb-r-768bdb99"
2023-04-24T18:21:57.304791715Z [longhorn-instance-manager] time="2023-04-24T18:21:57Z" level=debug msg="Process Manager: validate process path: /host/var/lib/longhorn/engine-binaries/rancher-mirrored-longhornio-longhorn-engine-v1.4.1/longhorn dir: /host/var/lib/longhorn/engine-binaries/ image: rancher-mirrored-longhornio-longhorn-engine-v1.4.1 binary: longhorn"
2023-04-24T18:21:57.305052318Z [longhorn-instance-manager] time="2023-04-24T18:21:57Z" level=info msg="Process Manager: created process 4gb-r-768bdb99"
2023-04-24T18:21:57.335766556Z [4gb-r-768bdb99] time="2023-04-24T18:21:57Z" level=info msg="Listening on data server 0.0.0.0:10046"
2023-04-24T18:21:57.335798653Z time="2023-04-24T18:21:57Z" level=info msg="Listening on gRPC Replica server 0.0.0.0:10045"
2023-04-24T18:21:57.335804247Z time="2023-04-24T18:21:57Z" level=info msg="Listening on sync agent server 0.0.0.0:10047"

Expected behavior

Statuses are correct almost immediately after instance-manager-r starts, and processes actually start soon after that.
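The expected behavior can be sketched as the general pattern the linked PR title suggests (illustrative Go only, not Longhorn's actual code): perform one status sync immediately when the monitor (re)starts, then fall back to the periodic resync, instead of waiting a full period for the first update. `syncLoop` and `countSyncs` are hypothetical names.

```go
package main

import "fmt"

// syncLoop runs one sync immediately at startup, then one per periodic
// resync tick, so statuses are refreshed right away after a restart.
func syncLoop(ticks <-chan struct{}, sync func()) {
	sync() // startup sync: don't wait for the first tick
	for range ticks {
		sync() // periodic resync
	}
}

// countSyncs drives syncLoop with n periodic ticks and reports how many
// syncs happened in total.
func countSyncs(n int) int {
	ticks := make(chan struct{}, n)
	for i := 0; i < n; i++ {
		ticks <- struct{}{}
	}
	close(ticks)
	count := 0
	syncLoop(ticks, func() { count++ })
	return count
}

func main() {
	// Even with zero periodic ticks, one sync happens at startup.
	fmt.Println(countSyncs(0), countSyncs(3)) // 1 4
}
```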

Log or Support bundle

See above.

Environment

  • Longhorn version: v1.4.1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
    • Number of management node in the cluster:
    • Number of worker node in the cluster:
  • Node config
    • OS type and version:
    • CPU per node:
    • Memory per node:
    • Disk type(e.g. SSD/NVMe):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

Ran across this while trying to solve #5709.

@innobead (Member) commented:

@ejweber Please remember to move the issue to the right pipeline to reflect the status. I assume this is in review?


longhorn-io-github-bot commented Apr 27, 2023

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps are at: [BUG] Instance manager may not update instance status for a minute after starting #5809 (comment).
    The expected output is at: Don't wait a minute after starting instance manager to update instance status longhorn-manager#1865 (comment) (attempt the reproduce steps after merging the changes and check the output against the expected output).

  • Is there a workaround for the issue? If so, where is it documented?

  • Does the PR include the explanation for the fix or the feature?

  • ~~Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?~~

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at Don't wait a minute after starting instance manager to update instance status longhorn-manager#1865

  • Which areas/issues this PR might have potential impacts on?
    This shouldn't have any drastic effects. In specific reboot/restart scenarios, status updates will be more responsive.

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only a test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)?

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?

  • If labeled: require/manual-test-plan Has the manual test plan been documented?

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label?

@innobead (Member) commented:

@ejweber move this to ready-for-testing and provide the testing info for @longhorn/qa. Thanks.

Also, update the backport issues accordingly.

@innobead innobead modified the milestones: v1.5.0, v1.6.0 May 3, 2023
@innobead innobead assigned chriscchien and unassigned roger-ryao May 8, 2023
@chriscchien (Contributor) commented:

Verified in longhorn master (longhorn-manager b2f7a6) with the test steps.
Result: Pass

The instance-manager pod no longer waits one minute to update its status after restarting.

root@ip-172-31-81-81:/home/ubuntu/k9s/longhorn# k logs instance-manager-bee795b7c8c4dcbaf35a4c810908b1f4 -n longhorn-system
tgtd: work_timer_start(146) use timer_fd based scheduler
tgtd: bs_init(387) use signalfd notification
time="2023-05-09T08:46:03Z" level=info msg="Storing process logs at path: /var/log/instances"
[longhorn-instance-manager] time="2023-05-09T08:46:03Z" level=warning msg="Failed to add TLS key pair from /tls-files/" error="open /tls-files/tls.crt: no such file or directory"
[longhorn-instance-manager] time="2023-05-09T08:46:03Z" level=info msg="Creating gRPC server with no auth"
[longhorn-instance-manager] time="2023-05-09T08:46:03Z" level=info msg="Instance Manager proxy gRPC server listening to 0.0.0.0:8501"
[longhorn-instance-manager] time="2023-05-09T08:46:03Z" level=info msg="Instance Manager process gRPC server listening to 0.0.0.0:8500"
[longhorn-instance-manager] time="2023-05-09T08:46:05Z" level=info msg="Started new process manager update watch"
[longhorn-instance-manager] time="2023-05-09T08:46:08Z" level=info msg="Process Manager: prepare to create process pvc-b57aaafd-9afa-4b2a-a9cc-ed2922bb9965-r-71d1b6a5"
[longhorn-instance-manager] time="2023-05-09T08:46:08Z" level=debug msg="Process Manager: validate process path: /host/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn dir: /host/var/lib/longhorn/engine-binaries/ image: longhornio-longhorn-engine-master-head binary: longhorn"
[longhorn-instance-manager] time="2023-05-09T08:46:08Z" level=info msg="Process Manager: created process pvc-b57aaafd-9afa-4b2a-a9cc-ed2922bb9965-r-71d1b6a5"
[pvc-b57aaafd-9afa-4b2a-a9cc-ed2922bb9965-r-71d1b6a5] time="2023-05-09T08:46:08Z" level=info msg="Listening on data server 0.0.0.0:10001"
[pvc-b57aaafd-9afa-4b2a-a9cc-ed2922bb9965-r-71d1b6a5] time="2023-05-09T08:46:08Z" level=info msg="Listening on gRPC Replica server 0.0.0.0:10000"
time="2023-05-09T08:46:08Z" level=info msg="Listening on sync agent server 0.0.0.0:10002"
[pvc-b57aaafd-9afa-4b2a-a9cc-ed2922bb9965-r-71d1b6a5] time="2023-05-09T08:46:08Z" level=info msg="Listening on sync 0.0.0.0:10002"
