mds: fix liveness probe timeout #14798

Merged: 1 commit, Oct 7, 2024

Conversation

BlaineEXE
Member

When the MDS liveness probe times out, the timeout should not fail the probe. If the cluster has a network partition or another issue that makes the Ceph mon cluster unstable, ceph ... commands can hang and cause a timeout. In that case, the MDS should not be restarted, to avoid causing cascading cluster disruption.

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

@BlaineEXE BlaineEXE added the ceph-mds (Relating to Ceph filesystem's (CephFS's) MDS daemon), backport-release-1.14, and backport-release-1.15 labels Oct 3, 2024
Comment on lines -78 to -79
"timeout",
fmt.Sprintf("%d", mdsCmdTimeout),
Member Author

@BlaineEXE BlaineEXE Oct 3, 2024

timeout keeps the script from running too long, but it exits with an error code when the time limit is hit, which we don't want. To resolve this, move the timeout config from here to the new --connect-timeout=<mdsCmdTimeout> option in the script.

Member

@parth-gr parth-gr left a comment

lgtm

@BlaineEXE BlaineEXE marked this pull request as draft October 3, 2024 16:36
@BlaineEXE
Member Author

Converting to draft while I test what Travis suggested in huddle

@BlaineEXE BlaineEXE force-pushed the mds-liveness-probe-fix-timeout branch 2 times, most recently from bd1feca to 27207af Compare October 4, 2024 22:19
When the MDS liveness probe times out, it should not fail the probe. If
the cluster has a network partition or other issue that causes the Ceph
mon cluster to become unstable, `ceph ...` commands can hang and cause
a timeout. In this case, the MDS should not be restarted so as to not
cause cascading cluster disruption.

Signed-off-by: Blaine Gardner <blaine.gardner@ibm.com>
@BlaineEXE BlaineEXE force-pushed the mds-liveness-probe-fix-timeout branch from 27207af to ad1bae9 Compare October 4, 2024 22:20
@BlaineEXE BlaineEXE marked this pull request as ready for review October 4, 2024 22:21
@BlaineEXE
Member Author

Tested well manually in 2 different environments 👍

@subhamkrai subhamkrai merged commit 681f38b into rook:master Oct 7, 2024
55 of 56 checks passed
mergify bot added a commit that referenced this pull request Oct 7, 2024
mergify bot added a commit that referenced this pull request Oct 7, 2024
@BlaineEXE BlaineEXE deleted the mds-liveness-probe-fix-timeout branch October 7, 2024 15:27