Add extended health checking of pods/containers #66

Closed
jbeda opened this issue Jun 11, 2014 · 4 comments
Labels
area/kubelet, sig/node

Comments

@jbeda
Contributor

jbeda commented Jun 11, 2014

We should have the kubelet do HTTP-based health checking.

We could also use container "run in" support to execute a script in the context of a container to do the health checking.

@bgrant0607
Member

I view per-container liveness probes as having 4 main parts:

  1. Probe control parameters: At minimum, there needs to be a probe interval (in seconds is probably fine) and timeout period (in same units as probe interval), with reasonable defaults for both. An initial (post-(re)start) delay is also typically needed, to allow for non-trivial application startup times. We could also support a threshold for the number of failures to allow before action is taken (called unhealthy_threshold in the load-balancing context). This would cover retry in the case of spurious failure. If we do, we may also want to support a number of successes before we reset this failure count (healthy_threshold).
  2. Probe mechanism.
    • HTTP GET includes at least port, path, and perhaps URL parameters. 200==success is easy to implement and to understand, but would mean that it could not share the same handler as load-balancer health (i.e., readiness) checks. Consequently, we may want to treat 404, 500, and 503 as success, also. Intentional failure would be indicated by not responding. Non-standard success/failure criteria and/or more complex logic could be implemented using commands (e.g., wget or curl).
    • Command. Exit 0 would imply success. Agree that "run in" would be lighter-weight than a separate container.
  3. Action control parameters: The main one is the grace period -- how long to wait before using SIGKILL. We could support configuration of a default grace period for all stop operations on the container, but it is also useful to use different grace periods for different kinds of stop reasons.
  4. Action mechanism.
    • SIGTERM. Convenient in many languages but hard to pass other information, such as termination reason and grace period.
    • HTTP POST / web hook.
    • Command, again using "run in".
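As a rough illustration of part 1, the threshold logic could be sketched in Go as below. This is only a sketch under stated assumptions: the type and field names (ProbeParams, Tracker, Observe) are hypothetical and not an existing kubelet API, and it assumes the failure count is reset only after healthy_threshold successes in a row.

```go
package main

import "fmt"

// ProbeParams holds the probe control parameters from item 1.
// All names here are illustrative assumptions, not a kubelet API.
type ProbeParams struct {
	IntervalSeconds     int // time between probes
	TimeoutSeconds      int // how long a single probe may take
	InitialDelaySeconds int // delay after (re)start before the first probe
	UnhealthyThreshold  int // failures allowed before action is taken
	HealthyThreshold    int // successes in a row before the failure count resets
}

// Tracker applies the two thresholds to a stream of probe results.
type Tracker struct {
	params    ProbeParams
	failures  int
	successes int
}

// Observe records one probe result and reports whether the
// unhealthy threshold has been reached, i.e. whether to act.
func (t *Tracker) Observe(ok bool) bool {
	if ok {
		t.successes++
		if t.successes >= t.params.HealthyThreshold {
			t.failures = 0 // enough successes: forget earlier failures
		}
		return false
	}
	t.successes = 0 // a failure breaks the success streak
	t.failures++
	return t.failures >= t.params.UnhealthyThreshold
}

func main() {
	t := &Tracker{params: ProbeParams{
		IntervalSeconds:     10,
		TimeoutSeconds:      1,
		InitialDelaySeconds: 30,
		UnhealthyThreshold:  3,
		HealthyThreshold:    2,
	}}
	for _, ok := range []bool{false, true, false, false} {
		fmt.Println("probe ok:", ok, "act:", t.Observe(ok))
	}
}
```

With a healthy threshold above 1, a single success between failures does not clear the count, so an isolated lucky probe cannot mask a mostly failing container.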

We also want it to be easy to disable/re-enable these checks, such as for attaching a debugger and stopping at a breakpoint.

@bgrant0607
Member

It's worth noting that docker stop sends SIGTERM, waits for a parameterized grace period, and then sends SIGKILL, which is basically the behavior we want:
POST /containers/(id)/stop?t=(seconds)
http://docs.docker.com/reference/api/docker_remote_api_v1.12/

FWIW, some do not like SIGKILL:
moby/moby#6446
It was pointed out that kill takes a signal parameter, which they may also want stop to support, but I think a longer grace period is mostly what they need.

@bgrant0607
Member

FWIW, here's a description of Marathon's liveness checks:
https://github.com/mesosphere/marathon/wiki/Health-Checks

HTTP responses with status codes 200-399 are considered live. The maximum number of consecutive failures is configurable (as with GCE's LB readiness checks).

Aurora's are similar:
http://aurora.incubator.apache.org/documentation/latest/configuration-tutorial/

@brendandburns
Contributor

I believe this is now fixed.

dchen1107 added the sig/node label Feb 4, 2015