WIP: pod preconditions #4245

Closed · wants to merge 1 commit

Conversation

@pmorie (Member, Author) commented Feb 9, 2015

Very rough WIP exploring another factoring for preconditions as discussed in #1768 after my highly unsatisfactory (to me) stab in #3911.

This is just a sketch; looking for feedback on generalities of this approach before fleshing out further.

@pmorie (Member, Author) commented Feb 9, 2015

@bgrant0607

@bgrant0607 (Member)

Thanks @pmorie. I'll ponder this. I have to say my initial reaction is fairly Pavlovian, though.

@pmorie (Member, Author) commented Feb 9, 2015

@bgrant0607 salivation or nausea? 🐩 🐩 🐩

@smarterclayton (Contributor)

This looks a lot cleaner - it definitely complicates the kubelet startup code, but it keeps that logic in the same place where health checks are invoked. How would the kubelet communicate a failure due to a precondition? Does the kubelet give up after a while?

@pmorie (Member, Author) commented Feb 10, 2015

@smarterclayton Next steps I had in mind around the concerns you mentioned:

  1. If a precondition isn't satisfied, the kubelet should produce an event
  2. There should be configurable retries / timeout for precondition satisfaction
  3. Not sure about the boundary condition for when retries / timeout have elapsed and the preconditions are not met; you should definitely get an event. Should the pod be scheduled off the node?
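To make points 1 and 2 concrete, here is a minimal sketch (not from this PR) of what a per-precondition retry loop in the kubelet could look like; the Precondition type, its fields, and the event callback are assumptions for illustration only:

    package kubelet

    import (
        "fmt"
        "time"
    )

    // Precondition is a hypothetical description of something that must hold
    // before the kubelet starts a pod's containers.
    type Precondition struct {
        Name    string        // human-readable name, e.g. "service/db exists"
        Check   func() error  // returns nil once the precondition is satisfied
        Retries int           // how many re-checks before giving up
        Backoff time.Duration // delay between checks
    }

    // waitForPrecondition polls a single precondition, emitting an event each
    // time it is found unsatisfied, and returns an error once retries run out.
    func waitForPrecondition(p Precondition, recordEvent func(reason, msg string)) error {
        var err error
        for attempt := 0; attempt <= p.Retries; attempt++ {
            if err = p.Check(); err == nil {
                return nil
            }
            // "PreconditionUnsatisfied" is a made-up event reason for this sketch.
            recordEvent("PreconditionUnsatisfied",
                fmt.Sprintf("precondition %q not satisfied: %v", p.Name, err))
            time.Sleep(p.Backoff)
        }
        return fmt.Errorf("precondition %q still unsatisfied after %d attempts: %v", p.Name, p.Retries+1, err)
    }

How the kubelet should surface the terminal failure in point 3 (event only, or also rescheduling) is exactly the open question above; the sketch just returns the error to the caller.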

@smarterclayton (Contributor)

Something would probably watch for that event at the master level and delete the pod / scale down the rc. But I don't know that it's a hard and fast rule that it gets scheduled off. It's kind of like a pull failing (a pull is a precondition).


@pmorie (Member, Author) commented Feb 10, 2015

@smarterclayton good point about hard and fast. I think the control surface for preconditions is something like:

  1. retries (optional)
  2. timeout (optional)
  3. failure policy

...with the granularity being the individual precondition.
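For illustration, a hypothetical Go sketch of that control surface at the granularity of a single precondition; the type and field names are mine, not from the patch:

    package api

    import "time"

    // PreconditionFailurePolicy is a hypothetical knob for what happens when a
    // precondition is still unsatisfied after retries/timeout have elapsed.
    type PreconditionFailurePolicy string

    const (
        // Keep the pod pending on the node and keep emitting events.
        PreconditionFailureWait PreconditionFailurePolicy = "Wait"
        // Fail the pod so a higher-level controller can reschedule or delete it.
        PreconditionFailureFail PreconditionFailurePolicy = "Fail"
    )

    // PodPrecondition sketches the control surface discussed above:
    // optional retries, an optional timeout, and a failure policy.
    type PodPrecondition struct {
        // Name of the object (e.g. a service) that must exist before start.
        ServiceName string

        // Retries is the optional number of re-checks before the failure
        // policy applies; nil means retry indefinitely.
        Retries *int

        // Timeout is the optional total time to wait; zero means no timeout.
        Timeout time.Duration

        // FailurePolicy says what to do once retries/timeout are exhausted.
        FailurePolicy PreconditionFailurePolicy
    }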

@bgrant0607 (Member)

I have to admit that it's not very obtrusive in the API, and I like that it's in PodSpec rather than Container. I'm not keen on adding a bunch of parameters to it.

It also has a number of potential issues.

What happens if the objects disappear after the pod is started? Should the pod be stopped? Do they need to exist on restart of the pod? On update of the pod? Certainly when replication controller creates a pod they'll need to exist. Would the same policies apply if we had a ReadinessPrecondition?

Would other components be unhappy in the case of a significant delay between scheduling a pod and its existence on the node? We should at least set the reason for why the pod is still pending.

Should replication controller peek at the dependence in order to avoid creating lots of pods that will fail due to a non-existent dependency?

Users could exploit this to trigger the simultaneous start of pods on many nodes. Is that desirable/ok?

We probably would want to factor out the image pulls, so they could happen before waiting.

However, do we even need this? If the pod uses DNS, it can retry until the DNS name resolves. If it uses link env. vars., it will need to pre-declare them, as per #1768. Now that Kubelet populates the env. vars., it could wait until it has values for the variables, or simply restart the containers when their values change. Is there another use case for this?

Sort-of related: #1996 (comment)

@smarterclayton (Contributor)

Replying inline to @bgrant0607's comment above:

> Should replication controller peek at the dependence in order to avoid creating lots of pods that will fail due to a non-existent dependency?

I think not - users use this to delay start only.

> Users could exploit this to trigger the simultaneous start of pods on many nodes. Is that desirable/ok?

I think that may be very desirable for existing applications that can't tolerate cross-version cache use, people doing green/blue deployments with a schema migration, or apps that have problems with starting before their dependencies. Administrators could disallow this use via admission control. The attack vector seems no worse than today's replication behavior in terms of parallel starts (if the scheduler is fast enough).

> We probably would want to factor out the image pulls, so they could happen before waiting.

Yes.

> However, do we even need this? [...] Is there another use case for this?

I think the two guiding use cases are new, simple applications from users taking their first steps into cloud-designed software, and larger existing applications that don't gracefully tolerate out-of-order startup. In those cases the containers (like JVMs) could take minutes to start up, fail, and then take minutes to start up again.

On the other side, the vast majority of frameworks and simple applications do not gracefully tolerate missing dbs without extra work. We've started experiencing this as we create on-ramp tutorial apps (the canonical web app is one web pod and one db pod) and get new users involved - the first problem they hit was service ordering (now fixed), and now it is out-of-order starts where the db starts slower than their simple example web apps. For apps and frameworks that don't fail fast, the user experience is very disappointing (it doesn't work, and it doesn't necessarily start working).

I think of a solution for this problem as a critical on-ramp problem for Kube for simple and complex apps alike - the "user experience of kicking the tires". While yes, the problem can be fixed in code, the solution "hey, write better code / fix your upstream framework" is user hostile. Like adaptation of env, this is a bridge step that is less focused on cloud-designed software and more on the much larger category of people moving to cloud environments.

Core use case: "it should be easy to craft a json definition of a two-tier web app with two pods or two rc's such that the web app code starts after the db is running, and I can repeatedly deploy that to different namespaces and environments". I believe that represents as high as 95% of the incoming user population to Kube and cloud platforms in general.

Ideally we accomplish this in a way that has value for more complex apps, or for coordination, and it should definitely be unobtrusive to the kubelet. The key part is knowledge and ordering, a after b - I like elements of this proposal because it distributes that knowledge close to the pods. It does require that we have an easy way for the probe to parameterize the url of something that doesn't exist yet (so DNS is one approach, but it requires DNS to be working). It also seems to have the advantage over the service dependency work that arbitrary conditions can be defined by user code (the db pod can expose a ready endpoint that is very sophisticated). Other approaches seem to couple too closely to kubelet behavior, and this one has the advantage of enabling emergent ordering (or delegating a complex ordering to a central coordinator). Certainly open to other alternatives.

@davidopp added the sig/node label (Categorizes an issue or PR as relevant to SIG Node) on Feb 15, 2015
@thockin (Member) commented Feb 17, 2015

I concur with Brian - as light as this patch is, I am very concerned about slippery slopes, and I want to understand why we just can't live without this. Is this really something users can not approximate without help from "the kernel"?


@bgrant0607 (Member)

Sorry, was out today.

I'm very sympathetic to the use cases -- thank you for explaining them @smarterclayton.

But I would like to consider alternatives, and whether this solves enough of the problem. For instance, this wouldn't work for intra-pod dependencies, nor would it wait for applications to actually start up and become ready to respond, as discussed in #3312.

@smarterclayton (Contributor)

I think we have lenaic's intra-pod cases mostly covered with behavior that can be done in pods - i.e. an advisory lock in a shared volume. I also think that particular app example sits pretty far out on the intra-pod complexity spectrum, so I do think it might be on the other side of the "generic tool" threshold.

Would be happy to talk through alternatives.


@pmorie (Member, Author) commented Feb 24, 2015

@bgrant0607 This is not intended to solve the intra-pod ordering, just a coarse-grained ordering of pods with respect to services. Did you have other alternative approaches in mind?


@pmorie (Member, Author) commented Mar 4, 2015

@bgrant0607 After talking with some folks about this, I want to clarify my above comment. This is intended only to solve the use-case of: I don't want my pod to start until some service exists.

@vmarmol (Contributor) commented Mar 6, 2015

@pmorie - just merged #5094, which makes changes in syncPod, so this PR will need a rebase.

@pmorie (Member, Author) commented Mar 6, 2015

@vmarmol thanks for the heads up

@bgrant0607 (Member)

I have the following concerns:

  • Service object existence doesn't imply the service is up and usable -- it seems like we'd want the service to aggregate readiness from the pods
  • Doesn't solve the problem for intra-pod dependencies
  • Users will have arbitrary dependencies, not just service dependencies
  • This is sort of like a probe, but doesn't fit with the existing probe structures, which are on containers

I do want to find a solution to this, however, and not block progress forever. Sorry, I've been overloaded.

Looking at #5093 made me wonder if a pod-level pre-start hook container wouldn't be a better solution. I want that mechanism for other reasons. Admittedly it seems somewhat heavyweight and less transparent, however.

(Excerpt from the diff under review:)

    err = kl.checkPodPreconditions(pod)
    if err != nil {
        glog.Errorf("Pod has failed preconditions: %v; Skipping pod %q", err, podFullName)
Review comment (Member) on the excerpt above:

Assuming we go forward with this approach:

This can't log an error. It needs to be at a high verbosity level. Even that won't be useful, however. Instead, we should create a PodCondition entry for this case and add LastProbeTime, LastTransitionTime, Reason, and Message to PodCondition.
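For reference, a sketch of a PodCondition carrying the fields named here; the exact types and the condition type name are illustrative, not the actual API:

    package api

    import "time"

    // PodConditionType names a particular condition of a pod, e.g. a
    // hypothetical "PreconditionsSatisfied".
    type PodConditionType string

    // ConditionStatus is "True", "False", or "Unknown".
    type ConditionStatus string

    // PodCondition extended with the fields suggested in the review comment:
    // last probe/transition times plus a machine-readable Reason and a
    // human-readable Message explaining why the pod is still pending.
    type PodCondition struct {
        Type               PodConditionType
        Status             ConditionStatus
        LastProbeTime      time.Time
        LastTransitionTime time.Time
        Reason             string
        Message            string
    }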

@smarterclayton (Contributor)

Replying inline to @bgrant0607's concerns above:

> Service object existence doesn't imply the service is up and usable -- it seems like we'd want service to aggregate readiness from the pods

Agree - but a probe could in the future point to a web endpoint on the service that indicates readiness, like /api/v1beta1/services/foo/ready (returns 200).

> Doesn't solve the problem for intra-pod dependencies

Agreed - I'm on the fence as to whether they're the same problem (could we use precondition probes on containers in prestart and have that work?).

> Users will have arbitrary dependencies, not just service dependencies

With probes as arbitrary urls, this becomes more tractable.

> This is sort of like a probe, but doesn't fit with the existing probe structures, which are on containers

The only benefit to the pod level check was terseness. If we moved this to a new type of container probe, does it become more reasonable? Simple pods (1-2 containers) aren't that much harder to set up, and if you need complex dependency ordering you can emulate it with individual container prestarts or probes?

That does cause the structure to grow a bit.

@bgrant0607 (Member)

Did we discuss putting this into ReplicationController instead? This kind of orchestration seems more appropriate there, since it's deciding when to create pods and it's cluster-level. Earlier drafts of the description in #3058 contained other start/stop kinds of controls.

@bgrant0607 (Member)

ReplicationController is also currently much simpler than both Pods and Services, so it could accommodate some additional complexity, IMO.

Furthermore, this kind of dependence would make less sense for daemons, which shouldn't have such dependencies, and batch workloads, which would want post-execution or data dependencies instead, so putting it in ReplicationController would correctly scope the feature to just service pods.

@bgrant0607 (Member)

I could imagine adding an arbitrary URL probe, also, but using an ObjectReference for the Service dependency use case enables use of watch.

@bgrant0607 (Member)

If this were moved to ReplicationController and we added ReplicationControllerCondition to ReplicationControllerStatus, then I'm in favor.

@bgrant0607 (Member)

The ReplicationController approach would sacrifice image preloading while waiting upon the dependence but would free resources on the nodes in the meantime. Seems like a fine tradeoff, and we'll want to develop another, more general approach to reduce image loading time, anyway.

@bgrant0607 (Member)

@smarterclayton I agree we should add a /ready subresource to services, pods, and nodes.

@bgrant0607 (Member)

Or just add readiness information to /endpoints.

@bgrant0607 (Member)

Putting the preconditions in ReplicationController would make it clear that intra-pod dependencies are not considered to be the same problem.

@pmorie (Member, Author) commented Mar 9, 2015

@smarterclayton I'm not sure I understand this comment, can you clarify at all?

> The only benefit to the pod level check was terseness. If we moved this to a new type of container probe, does it become more reasonable? Simple pods (1-2 containers) aren't that much harder to set up, and if you need complex dependency ordering you can emulate it with individual container prestarts or probes?

@smarterclayton (Contributor)

This pull proposed pod-level preconditions. Instead, add a new type of container probe, a precondition, which blocks container start (the same way this pull blocked pod start). However, since most pods have only one or two containers, the two are effectively the same for the use case this was targeting (I'm waiting for an external condition), while for complex pods you could express whatever interrelationships you need (or have a coordinator service).
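A hedged sketch of what "precondition as a new type of container probe" could look like next to the existing per-container probes; the stub Probe type and the PreconditionProbe field are hypothetical:

    package api

    // Probe stands in for the existing per-container probe type (HTTP GET,
    // TCP socket, or exec action plus timing parameters); details elided.
    type Probe struct {
        // e.g. HTTPGet / TCPSocket / Exec handler, InitialDelaySeconds, TimeoutSeconds ...
    }

    // Container with a hypothetical additional probe kind: a precondition
    // probe that must succeed once before the container is created, the way
    // this PR's pod-level check gated pod start.
    type Container struct {
        Name  string
        Image string

        LivenessProbe  *Probe // existing: restart the container on failure
        ReadinessProbe *Probe // readiness, as discussed in #3312
        // PreconditionProbe (hypothetical): polled before first start; the
        // container is not started until it succeeds.
        PreconditionProbe *Probe
    }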


@bgrant0607 (Member)

@smarterclayton At that point, I think what you're asking for is similar to a pre-start hook, which I totally agree we need. The difference in behavior would be that a hook would block until the precondition were satisfied, while a probe would be polled and would fail until the precondition were satisfied. If we used the hook approach, we'd need to extend normal request timeouts.

@smarterclayton (Contributor)

Could the hook be treated as a probe? I don't know that implementing a blocking hook is any easier than implementing a probe (the probe can be /my/bin/bash.sh which tries mysql -p 453 $SERVICE_IP).
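As a self-contained illustration of that kind of user-supplied check (the env var, port, deadline, and polling interval below are placeholders I picked, not anything defined by the thread or by Kubernetes):

    package main

    import (
        "fmt"
        "os"
        "os/exec"
        "time"
    )

    // Poll a database the way a precondition script might: keep trying a
    // trivial query until it succeeds or a deadline passes.
    func main() {
        deadline := time.Now().Add(2 * time.Minute)
        for time.Now().Before(deadline) {
            // Rough equivalent of "/my/bin/bash.sh" probing mysql on the service IP.
            cmd := exec.Command("mysql", "-h", os.Getenv("SERVICE_IP"), "-P", "3306", "-e", "SELECT 1")
            if err := cmd.Run(); err == nil {
                fmt.Println("precondition satisfied: database is answering")
                os.Exit(0)
            }
            time.Sleep(5 * time.Second)
        }
        fmt.Println("precondition not satisfied before deadline")
        os.Exit(1)
    }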

@bgrant0607 (Member)

Implementing a blocking hook probably isn't easier than a probe, but the mechanism is useful for other purposes.

I could get on board with a probe. It has the same implementation challenge as prestart hooks at the moment, though, which is that the container won't exist, so an exec probe would require some kind of trickery. We could say exec is not supported for precondition probes yet.

@smarterclayton (Contributor)

Good point on exec. I originally was thinking more of api preconditions, but a lot of traditional networked apps will have client-specific rules (what if you need a client cert to check your prereq?). If hooks can also probe, it means a more clearly encapsulated logic chain.


@pmorie (Member, Author) commented Mar 24, 2015

We started discussing pre-start hooks in #4890 (comment)

@pmorie (Member, Author) commented Mar 24, 2015

@bgrant0607 @vmarmol @smarterclayton It seems like the discussion we had started there, beginning with @vmarmol's comment, would be applicable here.

@piosz (Member) commented Apr 8, 2015

Is this PR active? If not, please close it according to https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/devel/pull-requests.md

@bgrant0607 (Member)

Hopefully.

@bgrant0607 (Member)

@pmorie said we could close this one.
