mds: fix liveness probe timeout #14798

BlaineEXE · 2024-10-03T14:42:03Z

When the MDS liveness probe times out, the timeout should not fail the probe. If a network partition or another issue makes the Ceph mon cluster unstable, `ceph ...` commands can hang until they time out. In that case, the MDS should not be restarted, because a restart could cause cascading cluster disruption.

Checklist:

Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
Reviewed the developer guide on Submitting a Pull Request
Pending release notes updated with breaking and/or notable changes for the next minor release.
Documentation has been updated, if necessary.
Unit tests have been added, if necessary.
Integration tests have been added, if necessary.

BlaineEXE · 2024-10-03T14:42:53Z

pkg/operator/ceph/file/mds/livenessprobe.go

-					"timeout",
-					fmt.Sprintf("%d", mdsCmdTimeout),


`timeout` keeps the script from running too long, but it exits with an error code when the time limit expires, which we don't want. To resolve this, move the timeout configuration from here to the new `--connect-timeout=<mdsCmdTimeout>` option in the script.

parth-gr

lgtm

BlaineEXE · 2024-10-03T16:37:02Z

Converting to draft while I test what Travis suggested in huddle

When the MDS liveness probe times out, it should not fail the probe. If the cluster has a network partition or other issue that causes the Ceph mon cluster to become unstable, `ceph ...` commands can hang and cause a timeout. In this case, the MDS should not be restarted so as to not cause cascading cluster disruption.

Signed-off-by: Blaine Gardner <blaine.gardner@ibm.com>

BlaineEXE · 2024-10-04T22:21:27Z

Tested well manually in 2 different environments 👍

BlaineEXE added the ceph-mds, backport-release-1.14, and backport-release-1.15 labels Oct 3, 2024

BlaineEXE requested review from travisn, subhamkrai and parth-gr October 3, 2024 14:42

parth-gr approved these changes Oct 3, 2024

BlaineEXE marked this pull request as draft October 3, 2024 16:36

BlaineEXE force-pushed the mds-liveness-probe-fix-timeout branch 2 times, most recently from bd1feca to 27207af Compare October 4, 2024 22:19

BlaineEXE force-pushed the mds-liveness-probe-fix-timeout branch from 27207af to ad1bae9 Compare October 4, 2024 22:20

BlaineEXE marked this pull request as ready for review October 4, 2024 22:21

subhamkrai approved these changes Oct 7, 2024

subhamkrai merged commit 681f38b into rook:master Oct 7, 2024
55 of 56 checks passed

This was referenced Oct 7, 2024

mds: fix liveness probe timeout (backport #14798) #14806

Merged

mds: fix liveness probe timeout (backport #14798) #14807

Merged

mergify bot added a commit that referenced this pull request Oct 7, 2024

Merge pull request #14807 from rook/mergify/bp/release-1.15/pr-14798

a043ee9

mds: fix liveness probe timeout (backport #14798)

mergify bot added a commit that referenced this pull request Oct 7, 2024

Merge pull request #14806 from rook/mergify/bp/release-1.14/pr-14798

77ef31f

mds: fix liveness probe timeout (backport #14798)

BlaineEXE deleted the mds-liveness-probe-fix-timeout branch October 7, 2024 15:27

parth-gr added the backport-release-1.13 label Oct 24, 2024

mergify bot mentioned this pull request Oct 24, 2024

mds: fix liveness probe timeout (backport #14798) #14903

Closed

6 tasks
