WIP: pod preconditions #4245
Conversation
Thanks @pmorie. I'll ponder this. I have to say my initial reaction is fairly Pavlovian, though. |
@bgrant0607 salivation or nausea? 🐩 🐩 🐩 |
This looks a lot cleaner - it definitely complicates the kubelet startup code, but it keeps that logic in the same place where health checks are invoked. How would the kubelet communicate a failure due to a precondition? Does the kubelet give up after a while? |
@smarterclayton Next steps I had in mind around the concerns you mentioned:
|
Something would probably watch for that event at the master level and delete the pod / scale down the rc. But I don't know that it's a hard and fast rule that the pod gets scheduled off. It's kind of like pull failing (pull is a precondition).
|
@smarterclayton good point about hard and fast. I think the control surface for preconditions is something like:
...with the granularity being precondition level. |
I have to admit that it's not very obtrusive in the API, and I like that it's in PodSpec rather than Container. I'm not keen on adding a bunch of parameters to it.

It also has a number of potential issues. What happens if the objects disappear after the pod is started? Should the pod be stopped? Do they need to exist on restart of the pod? On update of the pod? Certainly when replication controller creates a pod they'll need to exist. Would the same policies apply if we had a ReadinessPrecondition?

Would other components be unhappy in the case of a significant delay between scheduling a pod and its existence on the node? We should at least set the reason for why the pod is still pending. Should replication controller peek at the dependence in order to avoid creating lots of pods that will fail due to a non-existent dependency?

Users could exploit this to trigger the simultaneous start of pods on many nodes. Is that desirable/ok? We probably would want to factor out the image pulls, so they could happen before waiting.

However, do we even need this? If the pod uses DNS, it can retry until the DNS name resolves. If it uses link env. vars., it will need to pre-declare them, as per #1768. Now that Kubelet populates the env. vars., it could wait until it has values for the variables, or simply restart the containers when their values change. Is there another use case for this?

Sort-of related: #1996 (comment) |
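For reference, a minimal Go sketch of the kind of PodSpec-level field being debated here; the field and type names (Preconditions, Exists) are illustrative assumptions, not necessarily what the patch itself uses:

```go
package api

// ObjectReference stands in for the existing API type; only the fields
// relevant to this sketch are shown.
type ObjectReference struct {
	Kind      string
	Namespace string
	Name      string
}

// PodPrecondition is a hypothetical shape for a pod-level precondition: a
// condition that must hold before the kubelet starts the pod's containers.
type PodPrecondition struct {
	// Exists blocks pod startup until the referenced object (e.g. a
	// Service) can be found. Using an ObjectReference would also allow a
	// watch-based implementation rather than polling.
	Exists *ObjectReference
}

// PodSpec is abbreviated; existing fields (Containers, Volumes,
// RestartPolicy, ...) are omitted.
type PodSpec struct {
	// Preconditions, if non-empty, must all be satisfied before any
	// container in the pod is started.
	Preconditions []PodPrecondition
}
```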
On the other side, the vast majority of frameworks and simple applications do not gracefully tolerate missing dbs without existing work. We've started experiencing this as we create on-ramp tutorial apps (the canonical web app is one web pod and one db pod) and get new users involved - the first problem they hit was service ordering (now fixed), and now it is out-of-order starts, where the db starts slower than their simple example web apps. For those apps and frameworks that don't fail fast, the user experience is very disappointing (it doesn't work, and it doesn't necessarily start working).

I think of a solution for this problem as a critical onramp problem for Kube for simple and complex apps alike - the "user experience of kicking the tires". While yes, the problem can be fixed in code, the solution "hey, write better code / fix your upstream framework" is user hostile. Like adaptation of env, this is a bridge step that is less focused on cloud-designed software and more on the much larger category of "people moving to cloud environments."

Core use case: "it should be easy to craft a json definition of a two tier web app with two pods or two rc's such that the web app code starts after the db is running, and I can repeatedly deploy that to different namespaces and environments". I believe that represents as high as 95% of the incoming user population to Kube and cloud platforms in general.

Ideally we accomplish this in a way with value for more complex apps, or for coordination, and definitely it should be unobtrusive to the kubelet. The key part is knowledge and ordering (a after b) - I like elements of this proposal because it distributes that knowledge close to the pods. It does require that we have an easy way for the probe to parameterize the url of something that doesn't exist yet (so DNS is one approach, but requires DNS to be working). It also seems to have the advantage over the service dependency work that arbitrary conditions can be defined by user code (db pod can expose a ready endpoint that is very sophisticated). Other approaches seem to couple too closely to kubelet behavior, and this has the advantage of enabling emergent ordering (or delegating a complex ordering to a central coordinator). Certainly open to other alternatives.
|
I concur with Brian - as light as this patch is, I am very concerned about
|
Sorry, was out today. I'm very sympathetic to the use cases -- thank you for explaining them @smarterclayton. But I would like to consider alternatives, and whether this solves enough of the problem. For instance, this wouldn't work for intra-pod dependencies, nor would it wait for applications to actually start up and become ready to respond, as discussed in #3312. |
I think we have lenaic's intra-pod cases mostly covered with behavior that can be done in pods - i.e. an advisory lock in a shared volume. I also think that particular app example sits pretty far out on the intra-pod complexity spectrum, so I do think it might be on the other side of the "generic tool" threshold. Would be happy to talk through alternatives.
|
@bgrant0607 This is not intended to solve the intra-pod ordering, just a coarse-grained ordering of pods with respect to services. Did you have other alternative approaches in mind? |
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project, in which case you'll need to sign a Contributor License Agreement (CLA) at https://cla.developers.google.com/. If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check the information on your CLA or see this help article on setting the email on your git commits. Once you've done that, please reply here to let us know. If you signed the CLA as a corporation, please let us know the company's name. |
@bgrant0607 After talking with some folks about this, I want to clarify my above comment. This is intended only to solve the use-case of: I don't want my pod to start until some service exists. |
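To make that narrowed use case concrete, here is a minimal sketch of the kind of check the kubelet could run before starting a pod. The serviceLister interface, the type names, and the retry-on-next-sync behavior described in the comments are assumptions for illustration; this is not the patch's actual implementation.

```go
package kubelet

import "fmt"

// objectReference and podPrecondition mirror the hypothetical API fields
// sketched earlier; abbreviated here so the example stands alone.
type objectReference struct {
	Kind, Namespace, Name string
}

type podPrecondition struct {
	Exists *objectReference
}

// serviceLister abstracts however the kubelet would look up Services; the
// real client plumbing is not shown.
type serviceLister interface {
	HasService(namespace, name string) (bool, error)
}

// checkPodPreconditions returns a non-nil error while any precondition is
// unmet; the caller treats that as "do not start this pod yet" and retries
// on the next sync, rather than failing the pod permanently.
func checkPodPreconditions(preconditions []podPrecondition, services serviceLister) error {
	for _, pre := range preconditions {
		if pre.Exists == nil {
			continue
		}
		found, err := services.HasService(pre.Exists.Namespace, pre.Exists.Name)
		if err != nil {
			return fmt.Errorf("checking precondition %s/%s: %v", pre.Exists.Namespace, pre.Exists.Name, err)
		}
		if !found {
			return fmt.Errorf("precondition not met: %s %s/%s does not exist",
				pre.Exists.Kind, pre.Exists.Namespace, pre.Exists.Name)
		}
	}
	return nil
}
```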
@vmarmol thanks for the heads up |
I have the following concerns:
I do want to find a solution to this, however, and not block progress forever. Sorry, I've been overloaded. Looking at #5093 made we wonder if a pod-level pre-start hook container wouldn't be a better solution. I want that mechanism for other reasons. Admittedly it seems somewhat heavyweight and less transparent, however. |
//
err = kl.checkPodPreconditions(pod)
if err != nil {
	glog.Errorf("Pod has failed preconditions: %v; Skipping pod %q", err, podFullName)
Assuming we go forward with this approach:
This can't log an error. It needs to be at a high verbosity level. Even that won't be useful, however. Instead, we should create a PodCondition entry for this case and add LastProbeTime, LastTransitionTime, Reason, and Message to PodCondition.
That does cause the structure to grow a bit.
|
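A rough sketch of the PodCondition extension suggested above; time.Time is used here for brevity where the real API would use its own timestamp type, and the example Reason value is illustrative:

```go
package api

import "time"

type PodConditionType string

type ConditionStatus string

// PodCondition as it might look with the proposed additions: the existing
// Type/Status pair plus fields recording when the condition was last
// checked, when it last changed, and why (e.g. Reason "PreconditionNotMet"
// with a human-readable Message naming the missing dependency).
type PodCondition struct {
	Type   PodConditionType
	Status ConditionStatus

	// Proposed additions:
	LastProbeTime      time.Time
	LastTransitionTime time.Time
	Reason             string
	Message            string
}
```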
Did we discuss putting this into ReplicationController instead? This kind of orchestration seems more appropriate there, since it's deciding when to create pods and it's cluster-level. Earlier drafts of the description in #3058 contained other start/stop kinds of controls. |
ReplicationController is also currently much simpler than both Pods and Services, so it could accommodate some additional complexity, IMO. Furthermore, this kind of dependence would make less sense for daemons, which shouldn't have such dependencies, and batch workloads, which would want post-execution or data dependencies instead, so putting it in ReplicationController would correctly scope the feature to just service pods. |
I could imagine adding an arbitrary URL probe, also, but using an ObjectReference for the Service dependency use case enables use of watch. |
If this were moved to ReplicationController and we added ReplicationControllerCondition to ReplicationControllerStatus, then I'm in favor. |
The ReplicationController approach would sacrifice image preloading while waiting upon the dependence but would free resources on the nodes in the meantime. Seems like a fine tradeoff, and we'll want to develop another, more general approach to reduce image loading time, anyway. |
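For illustration, a sketch of what moving this to ReplicationController might look like, with a ReplicationControllerCondition added to status as suggested; all names here beyond the existing abbreviated types are assumptions:

```go
package api

import "time"

// ObjectReference is the existing API type, abbreviated.
type ObjectReference struct {
	Kind, Namespace, Name string
}

// ReplicationControllerSpec is abbreviated; existing fields (Replicas,
// Selector, Template, ...) are omitted.
type ReplicationControllerSpec struct {
	// Preconditions that must hold before the controller creates any pods,
	// e.g. "this Service exists". Checked centrally rather than on every
	// node, which also avoids creating pods that would just sit pending.
	Preconditions []ObjectReference
}

// ReplicationControllerCondition mirrors the PodCondition shape discussed
// above so that status reporting is consistent across the two objects.
type ReplicationControllerCondition struct {
	Type               string // e.g. "PreconditionsSatisfied"
	Status             string // "True", "False", or "Unknown"
	LastProbeTime      time.Time
	LastTransitionTime time.Time
	Reason             string
	Message            string
}

type ReplicationControllerStatus struct {
	Replicas   int
	Conditions []ReplicationControllerCondition
}
```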
@smarterclayton I agree we should add a /ready subresource to services, pods, and nodes. |
Or just add readiness information to /endpoints. |
Putting the preconditions in ReplicationController would make it clear that intra-pod dependencies are not considered to be the same problem. |
@smarterclayton I'm not sure I understand this comment - can you clarify at all?
|
This pull proposed pod level preconditions. Instead, add a new type of container probe, precondition, which blocks container start (the same as this pull blocked pod start). However, since most pods have only one or two containers, effectively they're the same for the use case this was targeting (I'm waiting for an external condition), while for complex pods you could do interrelationships as you needed (or have a coordinator service).
|
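A sketch of that container-level variant: a third probe type alongside liveness and readiness that gates container start. The PreconditionProbe field name and the abbreviated Probe type are assumptions for illustration:

```go
package api

// Probe is abbreviated; handler details (HTTPGet, TCPSocket, Exec) are
// omitted. Exec would not initially be supported for preconditions, since
// the container does not exist yet when the probe runs.
type Probe struct {
	InitialDelaySeconds int64
	TimeoutSeconds      int64
}

// Container is abbreviated to the probe-related fields.
type Container struct {
	Name  string
	Image string

	LivenessProbe  *Probe
	ReadinessProbe *Probe

	// PreconditionProbe, if set, is polled and must succeed before the
	// kubelet starts this container.
	PreconditionProbe *Probe
}
```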
@smarterclayton At that point, I think what you're asking for is similar to a pre-start hook, which I totally agree we need. The difference in behavior would be that a hook would block until the precondition were satisfied, while a probe would be polled and would fail until the precondition were satisfied. If we used the hook approach, we'd need to extend normal request timeouts. |
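A minimal sketch of the behavioral difference being drawn here, with placeholder functions standing in for whatever handler the probe or hook would actually run:

```go
package kubelet

import (
	"fmt"
	"time"
)

// Probe semantics: the check is polled and simply keeps failing until the
// precondition is satisfied (or the caller gives up).
func waitViaProbe(check func() error, interval time.Duration, attempts int) error {
	for i := 0; i < attempts; i++ {
		if err := check(); err == nil {
			return nil
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("precondition probe still failing after %d attempts", attempts)
}

// Hook semantics: a single blocking call that does not return until the
// precondition holds, which is why normal request timeouts would need to
// be extended.
func waitViaHook(block func() error) error {
	return block()
}
```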
Could the hook be treated as a probe? I don't know that implementing a blocking hook is any easier than implementing a probe (the probe can be /my/bin/bash.sh which tries |
Implementing a blocking hook probably isn't easier than a probe, but the mechanism is useful for other purposes. I could get on board with a probe. It has the same implementation challenge as prestart hooks at the moment, though, which is that the container won't exist, so an exec probe would require some kind of trickery. We could say exec is not supported for precondition probes yet. |
Good point on exec. I originally was thinking more of api preconditions, but a lot of traditional networked apps will have client specific rules (what if you need a client cert to check your prereq?). If hooks can also probe, it means a more clearly encapsulated logic chain.
|
We started discussing pre-start hooks in #4890 (comment) |
@bgrant0607 @vmarmol @smarterclayton It seems like the discussion we had started with @vmarmol's comment above would be applicable here. |
Is this PR active? If not please close it according to https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/devel/pull-requests.md |
Hopefully. |
@pmorie said we could close this one. |
Very rough WIP exploring another factoring for preconditions as discussed in #1768 after my highly unsatisfactory (to me) stab in #3911.
This is just a sketch; looking for feedback on generalities of this approach before fleshing out further.