Add KEP for Indexed Job in implementable status #2245
Conversation
> annotation. In the event of a kube-controller-manager upgrade, some running or
> succeeded Pods might exist without a completion index.
>
> <<[UNRESOLVED handling Jobs with existing Pods without completion index ]>>
Introducing the annotation shouldn't impact existing workloads unless they explicitly use it. I don't expect that users would create an indexed Job that is tolerant of not having the index, so I don't really see the need for addressing this scenario.
This is more about the controller itself. It needs to do something with the existing Pods in terms of completion index tracking.
Looks good, Aldo. Added some optional suggestions.
/assign @janetkuo
/label api-review
/approve
/lgtm
@wojtek-t friendly reminder for PRR
> ### Job scaling
>
> Jobs can be scaled up and down by varying `.spec.parallelism` (note that
> `.spec.completions` is an immutable field).
Please avoid using "scale" here, as Jobs don't follow the normal semantics for scaling. Ref kubernetes/kubernetes#60139
Fixed
/lgtm
I added some comments, but mostly fairly minor ones.
> The Job controller doesn't add the environment variable if there is a name
> conflict with an existing environment variable in the Job's Pod template or if
> the user already specified another environment variable for the same annotation.
Why? What's the problem with having the value referenced by two separate env vars?
I think it makes sense to give the user control over the names of the variables. A binary could always have some behavior defined for JOB_COMPLETION_INDEX which is different from the one provided by k8s.
I guess I wasn't clear. I wasn't asking to touch the customer-defined variable.
I was just asking why we don't want to provide `JOB_COMPLETION_INDEX` unconditionally (either with the user-provided value if set by the user, or as above if not).
I don't see a problem with having two env vars, say `FOO` and `JOB_COMPLETION_INDEX`, both set to the value of that annotation.
I understood you correctly :)
What I'm saying is: what if a JOB_COMPLETION_INDEX variable causes an undesirable effect on the user binary?
I guess in this case they could override that environment variable with another value.
Make the env name unique, `KUBERNETES_JOB_INDEX` for example. The user can always map that annotation to another pre-defined name, but we should always inject it.
I don't have a strong opinion about the name. But I'm still not convinced we should always inject the value if the user is already adding it.
We should always inject, unless the user explicitly overwrites that name.
The behavior should be consistent: if your job is indexed, var X (I don't have a strong opinion about the name) will always be set. In your version, the var may or may not be set, which is super confusing.
Sounds good. Removed the second condition.
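For illustration, a minimal sketch of the behavior this thread settles on: always inject the index env var via the downward API, unless the user already defines a variable with that name. The annotation key, variable name, and helper here are assumptions for the sketch, not the final implementation:

```go
package controller

import (
	corev1 "k8s.io/api/core/v1"
)

// Assumed annotation key; the KEP defines the authoritative name.
const completionIndexAnnotation = "batch.kubernetes.io/job-completion-index"

// addCompletionIndexEnvVar injects JOB_COMPLETION_INDEX via the downward API,
// unless the Pod template already defines a variable with that name, in which
// case the user's definition wins.
func addCompletionIndexEnvVar(c *corev1.Container) {
	for _, env := range c.Env {
		if env.Name == "JOB_COMPLETION_INDEX" {
			return // explicit user override
		}
	}
	c.Env = append(c.Env, corev1.EnvVar{
		Name: "JOB_COMPLETION_INDEX",
		ValueFrom: &corev1.EnvVarSource{
			FieldRef: &corev1.ObjectFieldSelector{
				FieldPath: "metadata.annotations['" + completionIndexAnnotation + "']",
			},
		},
	})
}
```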
> is resolved:
>
> The Job controller keeps track of completed indexes in the Job status in a
> base64 encoded bitmask.
Can you explicitly describe how JobStatus is updated then (for completeness)?
Another question: with the bitmask field and the proposal above, will we end up with two separate fields for that?
Ohh, or are you saying that we don't solve it for Alpha at all and just rely on the existence of all the pods for now?
Correct. This would be a beta graduation criterion, pending #2307.
OK - if you're not touching JobStatus for now at all, that's fine.
I agree that tracking completed pods is orthogonal to this proposal.
> // The Job is considered complete when there is one successful Pod for each
> // index in the range 0 to (.spec.completions - 1).
> // When value is `Indexed`, .spec.completions must have a non-zero positive
> // value and `.spec.parallelism` must be less than or equal to 10^6.
I'm assuming that the reasoning behind this 10^6 is the underlying bitmask, right?
i.e. we're saying that tracking 10^6 items in a bitmask takes O(125 kB) and this is fine, right?
That sounds fine to me, but I would like to ensure that somewhere in the design section we add a requirement that such big objects won't be updated frequently (e.g. not more often than once every couple of seconds or so).
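For context, the O(125 kB) figure follows from one bit per completion index; a back-of-the-envelope sketch, assuming the bitmask is stored base64-encoded in a string field:

```go
package main

import (
	"encoding/base64"
	"fmt"
)

func main() {
	const completions = 1_000_000 // one bit per completion index
	rawBytes := completions / 8   // 125,000 bytes of raw bitmask

	// base64 expands the payload by roughly 4/3 when stored in a string field.
	encodedLen := base64.StdEncoding.EncodedLen(rawBytes)

	// Prints: raw=125000 bytes, base64=166668 chars (roughly 163 KiB of status).
	fmt.Printf("raw=%d bytes, base64=%d chars\n", rawBytes, encodedLen)
}
```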
Yes, this is about being able to maintain the bitmask. Although, if we are going to use it to track completion, we actually need to cap `.spec.completions` too. The idea is that in the future we don't depend on lingering pods. @erictune thoughts?

> e.g. not more often than once every couple of seconds or so

Should we be doing some kind of client-side throttling? In the current design, Job status updates would come after removing finalizers from finishing Pods, which already throttles Job updates in a way. But if one Pod finishes every 100ms (for example), we might still do more Job updates than what you are suggesting.
I changed the limit to be on completions instead of parallelism.
> If, instead of a downgrade, the feature is disabled:
>
> - kube-apiserver sets `.spec.completionMode=NonIndexed` for new Jobs.
I don't understand how this is done (and how you distinguish downgrade from disabling exactly). Can you be more specific?
This is standard logic in API strategies for disabled feature gates. Reworded a bit.
I don't exactly distinguish a downgrade, as I cannot control the code of version N-1. So the paragraphs above describe how the code behaves in that scenario.
But what I can control is a feature being disabled, in which case I prefer to ignore Indexed Jobs that were already created.
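To make "standard logic in API strategies" concrete, here is a sketch of what the Job strategy could do on create when the gate is off. The feature-gate, field, and constant names follow the KEP draft and are assumptions, not the merged code:

```go
package job

import (
	utilfeature "k8s.io/apiserver/pkg/util/feature"

	"k8s.io/kubernetes/pkg/apis/batch"
	"k8s.io/kubernetes/pkg/features"
)

// dropDisabledCompletionMode runs in PrepareForCreate: with the IndexedJob
// gate disabled, any client-supplied completion mode is overridden to
// NonIndexed, so no new Indexed Jobs appear while the feature is off.
func dropDisabledCompletionMode(job *batch.Job) {
	if !utilfeature.DefaultFeatureGate.Enabled(features.IndexedJob) {
		job.Spec.CompletionMode = batch.NonIndexedCompletion
	}
}
```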
> * **Are there any tests for feature enablement/disablement?**
>
>   Yes, unit and integration test for the feature enabled and disabled.
If we're going to have something fancy for feature disablement, then we should have (at least) a unit test for disablement.
You can see an example of doing that here (under review, but the idea is there):
https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246
That was the intention of my text here. Reworded.
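A sketch of the kind of unit test referenced above, toggling the gate per case. The feature-gate name is an assumption, and the exact signature of the test helper has varied across releases:

```go
package job

import (
	"fmt"
	"testing"

	utilfeature "k8s.io/apiserver/pkg/util/feature"
	featuregatetesting "k8s.io/component-base/featuregate/testing"

	"k8s.io/kubernetes/pkg/features"
)

func TestCompletionModeFeatureEnablement(t *testing.T) {
	for _, enabled := range []bool{true, false} {
		t.Run(fmt.Sprintf("gate=%v", enabled), func(t *testing.T) {
			// Flip the gate for the duration of this subtest only.
			defer featuregatetesting.SetFeatureGateDuringTest(t, utilfeature.DefaultFeatureGate, features.IndexedJob, enabled)()

			// Run the strategy's PrepareForCreate on an Indexed Job and assert
			// that completionMode is preserved when the gate is on and reset
			// to NonIndexed when it is off.
		})
	}
}
```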
> // The Job is considered complete when there is one successful Pod for each
> // index in the range 0 to (.spec.completions - 1).
> // When value is `Indexed`, .spec.completions must have a value between 1
> // and 10^6.
> Yes, this is about being able to maintain the bitmask. Although, if we are going to use it to track completion, we actually need to cap `.spec.completions` too. The idea is that in the future we don't depend on lingering pods. @erictune thoughts?

I don't think you need to limit completions. The whole point Eric made in one of the other comments is that it's enough to limit parallelism: if you store not a bitmask but an `[a-b,c-d,e-f,...]` format, there will be at most `parallelism` such pairs, so the length of that is actually bounded by parallelism.
Bounding completions is much more limiting than limiting parallelism to something as high as 10^6.

> Should we be doing some kind of client-side throttling? In the current design, Job status updates would come after removing finalizers from finishing Pods, which already throttles Job updates in a way. But if one Pod finishes every 100ms (for example), we might still do more Job updates than what you are suggesting.

Yes, we should batch those and send 1 update per X seconds. We can tune X later.
> I don't think you need to limit completions. The whole point Eric made in one of the other comments is that it's enough to limit parallelism: if you store not a bitmask but an `[a-b,c-d,e-f,...]` format, there will be at most `parallelism` such pairs, so the length of that is actually bounded by parallelism.

So it depends on the implementation. If we use a bitmask, then we have to limit completions. If we limit parallelism, we would have to use the compressed list format, but then the limit might have to be lower than 10^6, as we require more characters. We don't have to be concerned about display space, as we can just limit `kubectl describe` to print fewer characters.
@soltysh any thoughts?
Running 10^6 pods at the same time is far from supported at this point.
Having more than 10^6 pods within a single job sounds like a valid use case to me.
So I would be much more comfortable saying "parallelism <= 10^5" (so decreasing that by an order of magnitude even) than setting a limit on completions.
Moved back to limiting parallelism and using a compressed list format instead of a bitmask.
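To make the compressed list format concrete, here is a minimal sketch of rendering a sorted set of completed indexes as intervals; per the argument cited above, the number of disjoint intervals (and so the field's size) is bounded by parallelism rather than completions. The helper name is hypothetical:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// formatIntervals renders sorted completed indexes as "a-b,c,d-e".
func formatIntervals(indexes []int) string {
	if len(indexes) == 0 {
		return ""
	}
	var parts []string
	start, prev := indexes[0], indexes[0]
	flush := func() {
		if start == prev {
			parts = append(parts, strconv.Itoa(start))
		} else {
			parts = append(parts, fmt.Sprintf("%d-%d", start, prev))
		}
	}
	for _, i := range indexes[1:] {
		if i == prev+1 { // extend the current interval
			prev = i
			continue
		}
		flush() // close the interval and start a new one
		start, prev = i, i
	}
	flush()
	return strings.Join(parts, ",")
}

func main() {
	fmt.Println(formatIntervals([]int{0, 1, 2, 3, 5, 7, 8, 9})) // "0-3,5,7-9"
}
```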
> - kube-apiserver sets `.spec.completionMode=NonIndexed` for new Jobs at
>   creation time.
> - kube-controller-manager ignores existing Indexed Jobs and emits a warning
>   event.
How exactly are those ignored? Are we killing the existing pods? Are we just not starting new ones? Are we stacking pods being finished?
Added specificity.
> // More completion modes can be added in the future. If a Job controller
> // observes a mode that it doesn't recognize, it manages the Job as in
> // `NonIndexed`.
> CompletionMode string
I'd propose creating a new `CompletionModeType` with pre-defined constants, like we do in other places such as this.
Maybe something more generic like `JobType`, which we could re-use in the future if we decide to implement different behaviour of the job controller. I'm worried that `CompletionMode` might be limiting. Other ideas: `WorkMode`, `JobMode`.
I think `CompletionMode` has a good balance of specificity and generality. We already have one in mind for the future (see Alternatives): `IndexedAndUnique`.
The point is that all of them refer to how completion is handled. If there is a mode that refers to something other than completion, it should have its own field.
I think I agree with @soltysh on making a dedicated type, but I think this will get figured out during API review.
I already gave it a type :)
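For reference, the typed-string shape this thread converges on could look like the following sketch; the constant names are illustrative, with the final ones settled during API review:

```go
// CompletionMode specifies how Pod completions of a Job are tracked.
type CompletionMode string

const (
	// NonIndexedCompletion: the Job is complete when there have been
	// .spec.completions successfully completed Pods, in no particular order.
	NonIndexedCompletion CompletionMode = "NonIndexed"

	// IndexedCompletion: Pods get an associated completion index and the Job
	// is complete when there is one succeeded Pod for each index in
	// 0..(.spec.completions - 1).
	IndexedCompletion CompletionMode = "Indexed"
)
```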
> Completed Indexes: [1-25,28,30-32]
>
> The command crops the list of indexes once it reaches 200 characters.
200 seems like a lot. I don't think we have a hard limit in kubectl, but I'm positive this is too much. Also, it's not relevant to this proposal; I'd drop it from here. It's an implementation detail.
Removed specificity.
> - kube-apiserver sets `.spec.completionMode=NonIndexed` for new Jobs at
>   creation time.
> - kube-controller-manager ignores existing Indexed Jobs and emits a warning
>   event.
Warn and run as non-indexed? Warn and ignore? Can you be more specific?
Added specificity.
> If, instead of a downgrade, the cluster administrator disables the feature gate:
>
> - kube-apiserver sets `.spec.completionMode=NonIndexed` for new Jobs at
>   creation time.
Will there be validation in kube-apiserver not to allow setting indexed jobs when this feature is disabled?
The value is overridden. Reworded for clarity.
@alculquicondor - this LGTM, please squash the commits
squashed
/approve
I will give @soltysh a chance to take another look.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: alculquicondor, erictune, janetkuo, wojtek-t. The full list of commands accepted by this bot can be found here. The pull request process is described here.
sorry, I missed that it was already approved :)
/lgtm since I reviewed it earlier.
Yeah - I realized after writing this comment :) I think that was the last non-addressed one. So let me cancel the hold - it's in very good shape now.
/hold cancel
Was one day late, but 👍 thanks for that 🎉
Ref #2214 kubernetes/kubernetes#97169
/sig apps