Philosophy: when and how should kubelet reject assigned pods? #5335

Closed
lavalamp opened this issue Mar 11, 2015 · 4 comments
Labels
kind/design, sig/api-machinery

Comments

@lavalamp
Member

Summarizing an IRL conversation.

Bug report: kube-scheduler doesn't see its own assignments for [latency amount], so it commonly schedules incompatible pods together. We're removing the atomic check (#5320), which exposes this. I (lavalamp) will try to make a PR to fix this.
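
As a concrete illustration, here's a minimal sketch (hypothetical names, not the actual scheduler code) of the kind of fix I have in mind: keep an in-memory cache of bindings the scheduler has just issued, and count those assumed pods when evaluating predicates, so the scheduler doesn't double-book a node while waiting for its watch to catch up.

```go
// Hypothetical sketch, not the actual kube-scheduler code: one way to close
// the window during which the scheduler can't see its own assignments is to
// record each binding in a local "assumed" cache and consult that cache
// alongside the watch-driven view of the cluster.
package scheduler

import "sync"

// assumedPods tracks bindings the scheduler has issued but has not yet
// observed back from the apiserver. Entries would be dropped once the
// watch confirms (or contradicts) them; expiry is omitted here.
type assumedPods struct {
	mu     sync.Mutex
	byNode map[string][]string // node name -> names of pods assumed there
}

func newAssumedPods() *assumedPods {
	return &assumedPods{byNode: map[string][]string{}}
}

// assume records a binding immediately after the bind request succeeds,
// before the watch reflects it.
func (a *assumedPods) assume(node, pod string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.byNode[node] = append(a.byNode[node], pod)
}

// podsOn merges the watch-derived pod list for a node with locally assumed
// pods, so scheduling predicates (host ports, resources) also account for
// the scheduler's own recent decisions.
func (a *assumedPods) podsOn(node string, observed []string) []string {
	a.mu.Lock()
	defer a.mu.Unlock()
	merged := append([]string{}, observed...)
	return append(merged, a.byNode[node]...)
}
```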

The philosophical question is, should this ever be allowed to happen? It's a bad user experience to have your pod mis-scheduled and dropped on the floor.

@brendanburns wanted these constraints to be atomically checked (as they are in boundPods). However, there are two main concerns with this: a) if an error creeps in, the system isn't self-healing, and b) write contention over boundPods slows the system drastically.

Instead, I think we've agreed on a tiered approach, where the goal is that scheduler gets it right 99%+ of the time, but kubelet is able to reject a pod that isn't compatible if necessary. And for the usability concern, we'll discuss letting kubelet unassign incompatible pods instead of setting them to failed.
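
To make the kubelet backstop concrete, here's a minimal sketch (illustrative types, not the Kubernetes API) of the admission check kubelet could run on each assigned pod:

```go
// Hypothetical sketch of the kubelet-side backstop: before accepting an
// assigned pod, check it against node capacity and already-admitted pods,
// and reject (or, per the discussion above, unassign) on conflict. All
// names here are illustrative; this is not the Kubernetes API.
package kubelet

import "fmt"

type pod struct {
	Name      string
	MilliCPU  int64
	HostPorts []int
}

type node struct {
	CapacityMilliCPU int64
	admitted         []pod
}

// canAdmit returns nil if the pod fits, or an error describing the conflict
// so the kubelet can surface a reason when it rejects the pod.
func (n *node) canAdmit(p pod) error {
	usedCPU := int64(0)
	usedPorts := map[int]bool{}
	for _, a := range n.admitted {
		usedCPU += a.MilliCPU
		for _, hp := range a.HostPorts {
			usedPorts[hp] = true
		}
	}
	if usedCPU+p.MilliCPU > n.CapacityMilliCPU {
		return fmt.Errorf("pod %s: insufficient cpu (%dm requested, %dm free)",
			p.Name, p.MilliCPU, n.CapacityMilliCPU-usedCPU)
	}
	for _, hp := range p.HostPorts {
		if usedPorts[hp] {
			return fmt.Errorf("pod %s: host port %d already in use", p.Name, hp)
		}
	}
	return nil
}
```

On a non-nil error, kubelet would either mark the pod failed or, per the discussion above, clear the assignment so the scheduler can place it elsewhere.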

@erictune @alex-mohr

@lavalamp added the kind/design, priority/design, and sig/api-machinery labels on Mar 11, 2015
@dchen1107
Member

cc/ @bgrant0607

@bgrant0607
Member

Dupe of #5334.

@bgrant0607
Member

But, yes, we should fix the scheduler if it screws up often.

@bgrant0607
Member

In case it wasn't clear from #5334:

Pods should never be treated as durable pets. In general, users shouldn't be creating pods directly. They should almost always use controllers, not unlike auto-scaling groups on AWS (AIUI) or our internal "job" abstraction.

Pod is exposed as a primitive for several reasons:

- to facilitate writing schedulers, controllers, etc.
- to facilitate bootstrapping
- for separation of concerns (Kubelet vs. cluster-level components)
- to facilitate decoupling of replication controllers and services
- so that replication controllers and other similar controllers don't need to proxy instance-level operations

The right solution for pets is something like nominal services #260. In the not-too-distant future, controllers will need to be able to replace instances in advance of their termination and certainly in advance of deletion (e.g., for planned evictions, image prefetching, unidling, or live pod migration #3949).

I'll update pods.md to clarify this.
