Add extended health checking of pods/containers #66

jbeda · 2014-06-11T23:14:22Z

We should have the kubelet do HTTP-based health checking.

We could also use container "run in" support to execute a script in the context of a container to do the health checking.

bgrant0607 · 2014-06-17T01:01:10Z

I view per-container liveness probes as having 4 main parts:

1. Probe control parameters. At minimum, there needs to be a probe interval (in seconds is probably fine) and timeout period (in same units as probe interval), with reasonable defaults for both. An initial (post-(re)start) delay is also typically needed, to allow for non-trivial application startup times. We could also support a threshold for the number of failures to allow before action is taken (called unhealthy_threshold in the load-balancing context). This would cover retry in the case of spurious failure. If we do, we may also want to support a number of successes before we reset this failure count (healthy_threshold).
2. Probe mechanism.
   - HTTP GET includes at least port, path, and perhaps URL parameters. 200==success is easy to implement and to understand, but would mean that it could not share the same handler as load-balancer health (i.e., readiness) checks. Consequently, we may want to treat 404, 500, and 503 as success, also. Intentional failure would be indicated by not responding. Non-standard success/failure criteria and/or more complex logic could be implemented using commands (e.g., wget or curl).
   - Command. Exit 0 would imply success. Agree that "run in" would be lighter-weight than a separate container.
3. Action control parameters. The main one is the grace period -- how long to wait before using SIGKILL. We could support configuration of a default grace period for all stop operations on the container, but it is also useful to use different grace periods for different kinds of stop reasons.
4. Action mechanism.
   - SIGTERM. Convenient in many languages but hard to pass other information, such as termination reason and grace period.
   - HTTP POST / web hook.
   - Command, again using "run in".
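
To make the probe control parameters and the command mechanism above concrete, here is a minimal Python sketch. It is an illustration only, not the kubelet's implementation: the names `ProbeConfig`, `ProbeState`, and `command_probe` are hypothetical, the default values are arbitrary assumptions, and the command runs on the local host rather than through container "run in" support.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class ProbeConfig:
    # All defaults are illustrative assumptions, not values from the proposal.
    interval_s: float = 10.0        # time between probes
    timeout_s: float = 1.0          # per-probe timeout, same units as interval
    initial_delay_s: float = 0.0    # delay after (re)start before probing
    unhealthy_threshold: int = 3    # failures tolerated before acting
    healthy_threshold: int = 1      # consecutive successes that reset failures


class ProbeState:
    """Tracks probe results and decides when action should be taken."""

    def __init__(self, config: ProbeConfig) -> None:
        self.config = config
        self.failures = 0
        self.successes = 0

    def record(self, ok: bool) -> bool:
        """Record one probe result; return True once the failure threshold is hit."""
        if ok:
            self.successes += 1
            # Only reset the failure count after enough consecutive successes.
            if self.successes >= self.config.healthy_threshold:
                self.failures = 0
            return False
        self.successes = 0
        self.failures += 1
        return self.failures >= self.config.unhealthy_threshold


def command_probe(argv: list, timeout_s: float) -> bool:
    """Run a command; exit status 0 means success, anything else is failure."""
    try:
        result = subprocess.run(
            argv,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A timeout or a command that cannot be started counts as a failure.
        return False
    return result.returncode == 0
```

A caller would sleep for `initial_delay_s`, then invoke a probe every `interval_s` and pass each result to `ProbeState.record`, taking the configured action when it returns True.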

We also want it to be easy to disable/reenable these checks, such as for attaching a debugger and stopping at a breakpoint.

bgrant0607 · 2014-06-17T19:48:36Z

It's worth noting that docker stop sends SIGTERM, waits for a parameterized grace period, and then sends SIGKILL, which is basically the behavior we want:
POST /containers/(id)/stop?t=(seconds)
http://docs.docker.com/reference/api/docker_remote_api_v1.12/
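
The SIGTERM, grace period, SIGKILL sequence described above can be sketched against an ordinary local process. This is a minimal Python illustration using the standard `subprocess` module, not Docker's implementation; the function name `stop_gracefully` is hypothetical, and the signal mapping noted in the comments applies to POSIX systems.

```python
import subprocess


def stop_gracefully(proc: subprocess.Popen, grace_period_s: float) -> bool:
    """Ask a process to exit, then force it after the grace period.

    Returns True if the process had to be killed, False if it exited
    on its own within the grace period.
    """
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_period_s)
        return False
    except subprocess.TimeoutExpired:
        proc.kill()   # sends SIGKILL on POSIX
        proc.wait()   # reap the process so it does not linger as a zombie
        return True
```

A longer `grace_period_s` gives the application more time to clean up after SIGTERM, which is the knob the comment above suggests most users need.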

FWIW, some do not like SIGKILL:
moby/moby#6446
It was pointed out that kill takes a signal parameter, which maybe they also want to support in stop, but I think a longer grace period is mostly what they need.

bgrant0607 · 2014-07-09T15:58:44Z

FWIW, here's a description of Marathon's liveness checks:
https://github.com/mesosphere/marathon/wiki/Health-Checks

HTTP responses between 200-399 are considered live. The max # of consecutive failures is configurable (as with GCE's LB readiness checks).
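
As a rough illustration of the 200-399 rule, here is a minimal Python sketch using only the standard library. It is not Marathon's code: `status_is_live` and `http_probe` are hypothetical names, the default timeout is an assumption, and note that `urllib` follows redirects by default, so a 3xx response is normally replaced by the status of the final target.

```python
import http.client
import urllib.error
import urllib.request


def status_is_live(status: int) -> bool:
    """Treat any HTTP status from 200 through 399 as live."""
    return 200 <= status <= 399


def http_probe(url: str, timeout_s: float = 1.0) -> bool:
    """Issue an HTTP GET and classify the response; any error counts as failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return status_is_live(resp.status)
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx; classify the status code it carries.
        return status_is_live(err.code)
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, malformed response, and similar.
        return False
```

Each `http_probe` result could then be fed to a consecutive-failure counter, so that a single spurious failure does not trigger a restart.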

Aurora's are similar:
http://aurora.incubator.apache.org/documentation/latest/configuration-tutorial/

brendandburns · 2014-09-03T18:21:10Z

I believe this is now fixed.


jbeda added the enhancement label Jun 11, 2014

This was referenced Jun 17, 2014

More comprehensive reporting of termination reasons #137

Closed

PreStart and PostStop event hooks #140

Closed

bgrant0607 mentioned this issue Jun 25, 2014

Configurable restart behavior #127

Closed

brendandburns self-assigned this Jul 3, 2014

bgrant0607 mentioned this issue Jul 7, 2014

add http health checks. #365

Merged

erictune added the kubelet label Jul 24, 2014

erictune mentioned this issue Aug 9, 2014

Enhancement to the health-checking system #761

Closed

smarterclayton mentioned this issue Aug 11, 2014

Add TCP socket based health checking. #735

Merged

brendandburns closed this as completed Sep 3, 2014

dchen1107 added the sig/node label Feb 4, 2015

jbeda unassigned brendandburns Aug 12, 2015

bgrant0607 mentioned this issue Aug 18, 2015

Add more probe control parameters #12866

Closed

feiskyer added a commit to feiskyer/kubernetes that referenced this issue Jan 22, 2016

Merge pull request kubernetes#66 from feiskyer/fix-hyper-cid

21aff77

Fix hyper container id

vishh pushed a commit to vishh/kubernetes that referenced this issue Apr 6, 2016

Merge pull request kubernetes#66 from monnand/interference

0e5cf4a

Interference detector interface

dlorenc pushed a commit to dlorenc/kubernetes that referenced this issue May 13, 2016

Merge pull request kubernetes#66 from dlorenc/buildlocalkube

6865446

Bundle localkube in the minikube binary as a blob, send that to the VM.

michelleN mentioned this issue Oct 27, 2016

kubeadm join doesn't set kube config #35729

Closed

ligc mentioned this issue Oct 28, 2016

dnsPolicy=ClusterFirst does not work when hostNetwork=true #35761

Closed

lazypower pushed a commit to lazypower/kubernetes that referenced this issue Oct 28, 2016

Merge pull request kubernetes#66 from chuckbutler/conformance-selinux

2eed204

Remove the explicit SecurityContextDeny due to failures in e2e

druidsbane mentioned this issue Nov 10, 2016

kube-controller-manager with AWS cloud provider crash loop if using http_proxy using "kubeadm init --cloud-provider aws" #36573

Closed

CaoShuFeng mentioned this issue Nov 24, 2016

"make test-integration" fails #37445

Closed

smenon78 mentioned this issue Dec 12, 2016

AWS-Kubeadm install, kube-dns is stuck at containerCreating. #38653

Closed

xingzhou pushed a commit to xingzhou/kubernetes that referenced this issue Dec 15, 2016

Merge pull request kubernetes#66 from philips/fix-slack-link

be76d2d

README: fix slack link

lichen2013 mentioned this issue Dec 16, 2016

Install k8s using vagrant do not work #38856

Closed

euank pushed a commit to euank/kubernetes that referenced this issue Jan 20, 2017

Merge pull request kubernetes#66 from euank/net-host

17792a5

rktlet: host network support

carsonoid mentioned this issue Jan 25, 2017

kubelet flapping in and out of Ready state. #40442

Closed

kchitrapu mentioned this issue Feb 28, 2017

nodePort not responding on all nodeIPs #42265

Closed

jsloyer mentioned this issue Mar 28, 2017

Addon Manager returns "error retrieving RESTMappings to prune" #43755

Closed

PiotrProkop pushed a commit to PiotrProkop/kubernetes that referenced this issue May 19, 2017

Merge pull request kubernetes#66 from PiotrProkop/cpu-manager

3cd4a55

Fixing bugs

iaguis pushed a commit to kinvolk/kubernetes that referenced this issue Feb 6, 2018

Merge pull request kubernetes#66 from pbx0/coreos-hyperkube-v1.3.3

d8dcde9

coreos hyperkube v1.3.3

whypro pushed a commit to whypro/kubernetes that referenced this issue Nov 13, 2018

Merge pull request kubernetes#66 from cofyc/upgradeetcdclient

6260799

Upgrade etcd client to 3.2.25 for release-1.9

yujuhong added a commit to yujuhong/kubernetes that referenced this issue Feb 12, 2019

Merge pull request kubernetes#66 from yujuhong/more-toleration

f2558b1

Add toleration to yet another test pod

ry4nz pushed a commit to ry4nz/kubernetes that referenced this issue Feb 19, 2019

Merge pull request kubernetes#66 from ry4nz/bump-v1.14.0-alpha.3

5ea8b37

Bump v1.14.0 alpha.3

seans3 pushed a commit to seans3/kubernetes that referenced this issue Apr 10, 2019

Merge pull request kubernetes#66 from goltermann/exceptions

1b34f6a

Add update to 1.4 feature complete date, and feature complete exception process

john-delivuk mentioned this issue Jan 22, 2020

Length restrictions on HPA External Metrics #87469

Closed

sttts added a commit to sttts/kubernetes that referenced this issue May 12, 2022

Merge pull request kubernetes#66 from sttts/sttts-apiextensions-reusa…

0a4e92b

…ble-cr-registry UPSTREAM: <carry>: apiextensions: make CR registry reusable with different store

thockin pushed a commit to thockin/kubernetes that referenced this issue Dec 5, 2024

Merge pull request kubernetes#66 from thockin/validation-gen-schema-u…

341ff22

…nittest Add a test for schema validation
