Job controller proposal #11746

<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

<strong>
The latest 1.0.x release of this document can be found
[here](http://releases.k8s.io/release-1.0/docs/proposals/job.md).

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>

--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

# Job Controller

## Abstract

A proposal for implementing a new controller - the Job controller - which will be
responsible for managing pod(s) that need to run once to completion, even if the
machine the pod is running on fails, in contrast to what the ReplicationController
currently offers.

Several existing issues and PRs have already been created regarding this particular
subject:
* Job Controller [#1624](https://github.com/GoogleCloudPlatform/kubernetes/issues/1624)
* New Job resource [#7380](https://github.com/GoogleCloudPlatform/kubernetes/pull/7380)

## Use Cases

1. Be able to start one or several pods tracked as a single entity.
1. Be able to run batch-oriented workloads on Kubernetes.
1. Be able to get the job status.
1. Be able to specify the number of instances performing a job at any one time.
1. Be able to specify the number of successfully finished instances required to finish a job.

## Motivation

Jobs are needed for executing multi-pod computation to completion; a good example
here would be the ability to implement any type of batch-oriented task.

## Implementation

The Job controller is similar to the replication controller in that both manage pods.
This implies it will follow the same controller framework that replication
controllers already define. The biggest difference between a `Job` and a
`ReplicationController` object is their purpose: a `ReplicationController`
ensures that a specified number of Pods is running at any one time, whereas
a `Job` is responsible for keeping the desired number of Pods running until a
task runs to completion. This difference will be represented by the
`RestartPolicy`, which is required to always take the value `RestartPolicyNever`
or `RestartPolicyOnFailure`.

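As a rough sketch of how this constraint might be checked, consider the hypothetical validation helper below; the `RestartPolicy` type and the `validateJobRestartPolicy` function are illustrative assumptions for this sketch, not part of the proposal itself:

```go
package main

import "fmt"

// RestartPolicy stands in for the pod-level restart policy; the string values
// below mirror the policies mentioned in this proposal.
type RestartPolicy string

const (
	RestartPolicyAlways    RestartPolicy = "Always"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
	RestartPolicyNever     RestartPolicy = "Never"
)

// validateJobRestartPolicy rejects any policy other than Never or OnFailure,
// which is the constraint described above for Job pod templates.
func validateJobRestartPolicy(p RestartPolicy) error {
	switch p {
	case RestartPolicyNever, RestartPolicyOnFailure:
		return nil
	default:
		return fmt.Errorf("restart policy %q is not allowed for a Job", p)
	}
}

func main() {
	fmt.Println(validateJobRestartPolicy(RestartPolicyOnFailure)) // <nil>
	fmt.Println(validateJobRestartPolicy(RestartPolicyAlways))    // error
}
```
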
The new `Job` object will have the following content:

```go
// Job represents the configuration of a single job.
type Job struct {
	TypeMeta
	ObjectMeta

	// Spec is a structure defining the expected behavior of a job.
	Spec JobSpec

	// Status is a structure describing the current status of a job.
	Status JobStatus
}

// JobList is a collection of jobs.
type JobList struct {
	TypeMeta
	ListMeta

	Items []Job
}
```

The `JobSpec` structure contains all the information describing how the actual job
execution will look.

```go
// JobSpec describes how the job execution will look.
type JobSpec struct {

	// Parallelism specifies the maximum desired number of pods the job should
	// run at any given time. The actual number of pods running in steady state will
	// be less than this number when ((.spec.completions - .status.successful) < .spec.parallelism),
	// i.e. when the work left to do is less than max parallelism.
	Parallelism *int

	// Completions specifies the desired number of successfully finished pods the
	// job should be run with. Defaults to 1.
	Completions *int

	// Selector is a label query over pods running a job.
	Selector map[string]string

	// Template is the object that describes the pod that will be created when
	// executing a job.
	Template *PodTemplateSpec
}
```

*Review discussion from the PR (inline comments on this struct):*

> If one of the use cases is "Be able to specify the number of instances performing a job", shouldn't this struct have a field to specify a max parallelism (i.e. the number of tasks to run at any given time)? This could be targeted by a resize verb.

> Yeah, I was generally confused about this issue too. Earlier the doc says "Job is responsible for keeping the desired number of Pods to a completion of a task", which seems to simultaneously imply a static and a dynamic number of Pods. It seems there are two choices.

> In my thinking I went with the first choice, i.e. specify the initial number of Pods created and proceed until successful completion; only failing pods will be restarted. Since we are trying to address the use case of performing a certain batch task, my understanding was that we want a static number of Pods running the task. Can you provide examples where you would be interested in resizing? I don't see Jobs as a candidate for that; I see resizing as an option for an RC, which is targeted to run all the time, where restarting one would mean undesirable downtime, which is totally different from what Jobs are for.

> Choice 1 is restrictive. The total number of completions required should not be conflated with the total number of tasks running. Here is an example: I have a data refresh job that runs over a database and each task is assigned a partition. Partition assignment is built into the application, so a Job works for me. There are 500 or 1000 completions to finish this job. During peak hours I only have resource capacity to run at most 5 tasks in parallel, but during off-peak hours I want to resize the job to run at most 25 tasks in parallel. At no point do I want to run a task for every completion at the same time.

> @mikedanese You might be able to get that behavior with Choice 1 plus something running on top that dynamically adjusts the number of tasks in the Spec?

> It sounds like you are trying to implement admission control - shouldn't the apiserver do that? Obviously we don't want to bury the apiserver, but its purpose in life is to figure out what can be run where and when; shouldn't we just let it do its thing? At least until we have evidence that it can't.

> I don't think it can do its thing without a way to preempt job pods when I need to scale up a replication controller. IIUC, the only admission control mechanism that could conceivably do this now is ResourceQuota. Going with implementing this in admission control, assume I restrict batch workloads to a specific namespace. If I update the namespace's ResourceQuota, will pods be killed or will no new pods be admitted? Suppose pods are killed (which I don't know is the case): which ones? Do I need a namespace per batch job? Suppose I'm at the limit of my ResourceQuota: which job's pods get admitted? The first request to go through as resources become available? Implementation-wise this also doesn't work well with the current ControllerExpectation mechanism (think spinlock vs. notify). There's a lot to figure out for something that could be solved with a MaxParallelism in the controller manager. I feel that MaxParallelism attached to a job is the correct place for this responsibility. Can't any argument for admission control apply to the replication controller as well? I could be overruled/persuaded by the API experts on this, and I'm also happy to leave this out of the first draft of the Job proposal.

> In light of @mikedanese's example I think there are two use cases to consider: one is the actual control over a Job (i.e. the MaxParallelism) and the other is the ResourceQuota being applied. In my understanding the RQ is a mechanism for a cluster admin to restrict user capabilities, whereas the MaxParallelism is my possibility as a user to drive the job execution, similarly to what you do with an RC today.

> Another good argument for MaxParallelism vs. Admission Control :)

> Admission control and RQ would just result in a bunch of forbidden messages being returned. +1 for MaxParallelism.

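As an illustration of the formula in the `Parallelism` comment above, a hypothetical helper (an assumption for this sketch, not part of the proposed API) could compute the number of pods the controller keeps running as the smaller of the remaining completions and the parallelism limit:

```go
package main

import "fmt"

// desiredActive illustrates the steady-state behavior described for Parallelism:
// run at most `parallelism` pods, but never more than the work left to do
// (completions - successful).
func desiredActive(parallelism, completions, successful int) int {
	remaining := completions - successful
	if remaining < parallelism {
		return remaining
	}
	return parallelism
}

func main() {
	// 8 completions requested, 6 already succeeded, parallelism capped at 5:
	// only 2 pods are still needed.
	fmt.Println(desiredActive(5, 8, 6)) // 2
}
```
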
The `JobStatus` structure holds information about the pods currently executing the
specified job.

```go
// JobStatus represents the current state of a Job.
type JobStatus struct {
	Conditions []JobCondition

	// CreationTime represents time when the job was created.
	CreationTime util.Time

	// StartTime represents time when the job was started.
	StartTime util.Time

	// CompletionTime represents time when the job was completed.
	CompletionTime util.Time

	// Active is the number of actively running pods.
	Active int

	// Successful is the number of pods that successfully completed their job.
	Successful int

	// Unsuccessful is the number of pod failures; this applies only to jobs
	// created with RestartPolicyNever, otherwise this value will always be 0.
	Unsuccessful int
}

type JobConditionType string

// These are valid conditions of a job.
const (
	// JobSucceeded means the job has successfully completed its execution.
	JobSucceeded JobConditionType = "Complete"
)

// JobCondition describes current state of a job.
type JobCondition struct {
	Type               JobConditionType
	Status             ConditionStatus
	LastHeartbeatTime  util.Time
	LastTransitionTime util.Time
	Reason             string
	Message            string
}
```

*Review discussion from the PR (inline comments on the condition constants):*

> @soltysh Having only a single condition reads a little weird to me. Do we at least need another constant for an in-progress job?

> I don't mind this. I like the idea that someone could define a state machine (i.e. derive Phase) on top of orthogonal conditions. If we add "in-progress" then the condition is just a repackaging of Phase.

> My initial proposal after switching to Conditions was exactly mapping all of the Phases. But after reading #7856 and #12015 I agree with @mikedanese that having just a single termination condition is sufficient.

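To make the single-condition model above concrete, here is a hedged sketch of the check a Job controller might perform on each sync; the `markSucceededIfDone` helper and the mirrored types are assumptions for illustration, not the proposed API itself:

```go
package main

import (
	"fmt"
	"time"
)

// Minimal stand-ins for the proposed types, defined here only so the
// example is self-contained.
type JobConditionType string

const JobSucceeded JobConditionType = "Complete"

type JobCondition struct {
	Type               JobConditionType
	Status             string // mirrors ConditionStatus, e.g. "True"
	LastTransitionTime time.Time
	Reason             string
	Message            string
}

// markSucceededIfDone appends the termination condition once the number of
// successfully finished pods reaches the desired number of completions.
func markSucceededIfDone(conds []JobCondition, successful, completions int) []JobCondition {
	if successful < completions {
		return conds // job still in progress; no condition is recorded
	}
	return append(conds, JobCondition{
		Type:               JobSucceeded,
		Status:             "True",
		LastTransitionTime: time.Now(),
		Reason:             "CompletionsReached",
		Message:            fmt.Sprintf("%d of %d pods completed successfully", successful, completions),
	})
}

func main() {
	conds := markSucceededIfDone(nil, 5, 5)
	fmt.Println(conds[0].Type, conds[0].Status) // Complete True
}
```
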
## Events

The Job controller will emit the following events:
* JobStart
* JobFinish

## Future evolution

Below are the possible future extensions to the Job controller:
* Be able to limit the execution time for a job, similarly to ActiveDeadlineSeconds for Pods.
* Be able to create a chain of jobs dependent one on another.
* Be able to specify the work each of the workers should execute (see type 1 from
  [this comment](https://github.com/GoogleCloudPlatform/kubernetes/issues/1624#issuecomment-97622142)).
* Be able to inspect Pods running a Job, especially after a Job has finished, e.g.
  by providing pointers to Pods in the JobStatus ([see comment](https://github.com/kubernetes/kubernetes/pull/11746/files#r37142628)).

*Review discussion from the PR (inline comments on this section):*

> I think something else for the future would be to think about how a user can easily understand what happened to each pod, especially in the RestartPolicyNever case. I think there are multiple ways to do this. One is to keep a list of pointers to pods in the Job (the job controller would make sure not to delete them prematurely; it's creating the pods so it should also be the one that deletes them). Another is to store active and failed (and maybe also successful) in []JobCondition, not just active.

> Good idea! Thanks, will add them.

> Yeah, and that should be easily doable, imho.

> I agree with @erictune. I don't want any O(N) data in JobStatus itself.

<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/job.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

*Trailing review comment from the PR:*

> Should remove "any" b/c workflow DAGs or graphs are not supported.