Job controller proposal #11746

Merged 1 commit on Aug 17, 2015.
docs/proposals/job.md: 191 additions & 0 deletions
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

<strong>
The latest 1.0.x release of this document can be found
[here](http://releases.k8s.io/release-1.0/docs/proposals/job.md).

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

# Job Controller

## Abstract

A proposal for implementing a new controller - the Job controller - responsible
for managing pods that must run once to completion, even if the machine
the pod is running on fails, in contrast to what the ReplicationController currently offers.

Several issues and PRs have already been created regarding this subject:
* Job Controller [#1624](https://github.com/GoogleCloudPlatform/kubernetes/issues/1624)
* New Job resource [#7380](https://github.com/GoogleCloudPlatform/kubernetes/pull/7380)


## Use Cases

1. Be able to start one or several pods tracked as a single entity.
1. Be able to run batch-oriented workloads on Kubernetes.
1. Be able to get the job status.
1. Be able to specify the number of instances performing a job at any one time.
1. Be able to specify the number of successfully finished instances required to finish a job.


## Motivation

Jobs are needed for executing multi-pod computation to completion; a good example
is the ability to implement any type of batch-oriented task.
**Member:** should remove "any" b/c workflow DAGs or graphs are not supported.

## Implementation

Job controller is similar to replication controller in that they manage pods.
This implies they will follow the same controller framework that replication
controllers already defined. The biggest difference between a `Job` and a
`ReplicationController` object is the purpose; `ReplicationController`
ensures that a specified number of Pods are running at any one time, whereas
`Job` is responsible for keeping the desired number of Pods to a completion of
a task. This difference will be represented by the `RestartPolicy` which is
required to always take value of `RestartPolicyNever` or `RestartOnFailure`.
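
As a rough illustration of that restriction, validation of a Job's pod template could reject any other restart policy. This is a minimal sketch; the helper name, the locally defined `RestartPolicy` type, and the error handling are assumptions for illustration, not part of the proposal:

```go
import "fmt"

// RestartPolicy mirrors the pod-level restart policy the proposal refers to
// (defined locally here only so the sketch is self-contained).
type RestartPolicy string

const (
	RestartPolicyNever     RestartPolicy = "Never"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
)

// validateJobRestartPolicy is a hypothetical helper illustrating the rule above:
// a Job's pod template may only use the two restart policies a Job allows.
func validateJobRestartPolicy(policy RestartPolicy) error {
	switch policy {
	case RestartPolicyNever, RestartPolicyOnFailure:
		return nil
	default:
		return fmt.Errorf("job pod template must use RestartPolicyNever or RestartPolicyOnFailure, got %q", policy)
	}
}
```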


The new `Job` object will have the following content:

```go
// Job represents the configuration of a single job.
type Job struct {
	TypeMeta
	ObjectMeta

	// Spec is a structure defining the expected behavior of a job.
	Spec JobSpec

	// Status is a structure describing current status of a job.
	Status JobStatus
}

// JobList is a collection of jobs.
type JobList struct {
	TypeMeta
	ListMeta

	Items []Job
}
```

The `JobSpec` structure contains all the information describing how the actual job
execution will look. Its definition follows the review discussion below.
**Member:** If one of the use cases is "Be able to specify the number of instances performing a job", shouldn't this struct have a field to specify a max parallelism (i.e. the number of tasks to run at any given time)? This could be targeted by a resize verb.

**Member:** Yeah, I was generally confused about this issue too. Earlier the doc says "Job is responsible for keeping the desired number of Pods to a completion of a task", which seems to simultaneously imply a static and a dynamic number of Pods.

It seems there are two choices:

* specify a count that is the initial number of Pods created; they are restarted on failure (of the container or the machine) but not restarted when they exit successfully, so the total number of Pods goes from that count to zero over time
* specify a max parallelism (as @mikedanese suggests; also see the "Completions" field in New Job resource #7380) with the understanding that the system will decide how many replicas should be running at any time (bounded by that number)

**Contributor Author:** In my thinking I went with the first choice, in other words specify the initial number of Pods created and proceed until successful completion; only failing pods will be restarted. Since we are trying to address the use case of performing a certain batch task, my understanding was that we want a static number of Pods running the task. Can you provide examples where you'd be interested in resizing? I don't see Jobs as a candidate for that; I see resizing as an option for an RC, which is meant to run all the time, and restarting one would mean undesirable downtime, which is totally different from what Jobs are for.

**Member:** Choice 1 is restrictive. The total number of completions required should not be conflated with the total number of tasks running.

Here is an example. I have a data refresh job that runs over a database, and each task is assigned a partition. Partition assignment is built into the application, so a Job works for me. There are 500 or 1000 completions to finish this job. During peak hours I only have resource capacity to run at most 5 tasks in parallel, but during off-peak hours I want to resize the job to run at most 25 tasks in parallel. At no point do I want to run a task for every completion at the same time.

**Member:** @mikedanese You might be able to get that behavior with Choice 1 plus something running on top that dynamically adjusts the number of tasks in the Spec?

**Member:** It sounds like you are trying to implement admission control - shouldn't the apiserver do that? Obviously we don't want to bury the apiserver, but its purpose in life is to figure out what can be run where and when, so shouldn't we just let it do its thing? At least until we have evidence that it can't.

**Member:** I don't think it can do its thing without a way to preempt job pods when I need to scale up a replication controller. IIUC, the only admission control mechanism that could conceivably do this now is ResourceQuota. Going with implementing this in admission control, assume I restrict batch workloads to a specific namespace. If I update the namespace's ResourceQuota, will pods be killed or will no new pods be admitted? Suppose pods are killed (which I don't know is the case) - which ones? Do I need a namespace per batch job? Suppose I'm at the limit of my ResourceQuota - which job's pods get admitted? The first request to go through as resources become available? Implementation-wise this also doesn't work well with the current ControllerExpectation mechanism (think spinlock vs. notify). There's a lot to figure out for something that could be solved with a MaxParallelism in the controller manager.

I feel that a MaxParallelism attached to a job is the correct place for this responsibility. Couldn't any argument for admission control apply to the replication controller as well? I could be overruled/persuaded by the API experts on this, and I'm also happy to leave this out of the first draft of the Job proposal.

**Contributor Author:** In light of @mikedanese's example I think there are two use cases to consider: one is the actual control over a Job (i.e. MaxParallelism) and the other is the ResourceQuota being applied. In my understanding, RQ is a mechanism for a cluster admin to restrict user capabilities, whereas MaxParallelism is my ability as a user to drive the job execution, similarly to what you do with an RC today.

**Member:**

> RQ is a mechanism for a cluster admin

Another good argument for MaxParallelism vs. Admission Control :)

**Member:** Admission control and RQ would just result in a bunch of forbidden messages being returned. +1 for MaxParallelism.

```go
// JobSpec describes how the job execution will look.
type JobSpec struct {
	// Parallelism specifies the maximum desired number of pods the job should
	// run at any given time. The actual number of pods running in steady state will
	// be less than this number when ((.spec.completions - .status.successful) < .spec.parallelism),
	// i.e. when the work left to do is less than max parallelism.
	Parallelism *int

	// Completions specifies the desired number of successfully finished pods the
	// job should be run with. Defaults to 1.
	Completions *int

	// Selector is a label query over pods running a job.
	Selector map[string]string

	// Template is the object that describes the pod that will be created when
	// executing a job.
	Template *PodTemplateSpec
}
```
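
To make the Parallelism/Completions semantics above concrete, a controller loop could size the active pod set roughly as follows. This is a minimal sketch; the helper name and signature are illustrative, not part of the proposal:

```go
// desiredActivePods is a hypothetical helper: it returns how many pods the
// controller should keep running, never exceeding Parallelism and never
// exceeding the number of completions still outstanding.
func desiredActivePods(parallelism, completions, successful int) int {
	remaining := completions - successful
	if remaining < 0 {
		remaining = 0
	}
	if remaining < parallelism {
		return remaining
	}
	return parallelism
}
```

With @mikedanese's example above, a job with 500 completions and parallelism 5 would run 5 pods at a time until fewer than 5 completions remain; resizing parallelism to 25 off-peak would simply raise that cap.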

The `JobStatus` structure contains information about the pods currently executing
the specified job.

```go
// JobStatus represents the current state of a Job.
type JobStatus struct {
	Conditions []JobCondition

	// CreationTime represents the time when the job was created.
	CreationTime util.Time

	// StartTime represents the time when the job was started.
	StartTime util.Time

	// CompletionTime represents the time when the job was completed.
	CompletionTime util.Time

	// Active is the number of actively running pods.
	Active int

	// Successful is the number of pods that successfully completed their job.
	Successful int

	// Unsuccessful is the number of pod failures; this applies only to jobs
	// created with RestartPolicyNever, otherwise this value will always be 0.
	Unsuccessful int
}

type JobConditionType string

// These are valid conditions of a job.
const (
	// JobSucceeded means the job has successfully completed its execution.
	JobSucceeded JobConditionType = "Complete"
)
```
**Member:** @soltysh Having only a single condition reads a little weird to me. Do we at least need another constant for an in-progress job?

**Member:** I don't mind this. I like the idea that someone could define a state machine (i.e. derive Phase) on top of orthogonal conditions. If we add "in-progress" then the condition is just a repackaging of Phase.

**Contributor Author:** My initial proposal after switching to Conditions was exactly mapping all of the Phases. But after reading #7856 and #12015 I agree with @mikedanese that having just a single termination condition is sufficient.

```go
// JobCondition describes the current state of a job.
type JobCondition struct {
	Type               JobConditionType
	Status             ConditionStatus
	LastHeartbeatTime  util.Time
	LastTransitionTime util.Time
	Reason             string
	Message            string
}
```
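
To illustrate how a caller might consume the single termination condition discussed above, here is a minimal sketch; the helper name is an assumption, and `ConditionTrue` is assumed to be the usual `ConditionStatus` value:

```go
// isJobComplete is a hypothetical helper: it reports whether the JobSucceeded
// condition has been set to true on a job's status.
func isJobComplete(status JobStatus) bool {
	for _, c := range status.Conditions {
		if c.Type == JobSucceeded && c.Status == ConditionTrue {
			return true
		}
	}
	return false
}
```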

## Events

The Job controller will emit the following events:
* JobStart
* JobFinish
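
A sketch of where these events might be emitted is shown below; the recorder interface is a stand-in defined only for illustration, not an API from the proposal:

```go
// eventRecorder is a minimal stand-in for whatever event-recording facility the
// controller ends up using; it exists here purely for illustration.
type eventRecorder interface {
	Eventf(reason, messageFmt string, args ...interface{})
}

// emitJobEvents shows the two lifecycle points at which the controller could
// record the JobStart and JobFinish events listed above.
func emitJobEvents(r eventRecorder, jobName string, started, finished bool) {
	if started {
		r.Eventf("JobStart", "Job %s started", jobName)
	}
	if finished {
		r.Eventf("JobFinish", "Job %s finished", jobName)
	}
}
```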

## Future evolution
**Member:** I think something else for the future would be to think about how a user can easily understand what happened to each pod, especially in the RestartPolicyNever case. I think there are multiple ways to do this. One is to keep a list of pointers to pods in the Job (the job controller would make sure not to delete them prematurely; it's creating the pods, so it should also be the one that deletes them). Another is to store active and failed (and maybe also successful) pods in []JobCondition, not just active.

**Contributor Author:** Good idea! Thanks, will add them.

**Member:** I think kubectl describe job could join a list of the Pods created by the Job controller with events from those pods, which removes the need to keep the Pods themselves.

**Contributor Author:** Yeah, and that should be easily doable, imho.

**Member:** I agree with @erictune. I don't want any O(N) data in JobStatus itself.


Below are the possible future extensions to the Job controller:
* Be able to limit the execution time for a job, similarly to ActiveDeadlineSeconds for Pods.
* Be able to create a chain of jobs dependent on one another.
* Be able to specify the work each of the workers should execute (see type 1 from
[this comment](https://github.com/GoogleCloudPlatform/kubernetes/issues/1624#issuecomment-97622142))
* Be able to inspect Pods running a Job, especially after a Job has finished, e.g.
by providing pointers to Pods in the JobStatus ([see comment](https://github.com/kubernetes/kubernetes/pull/11746/files#r37142628)).


<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/job.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->