
Add KEP for Indexed Job in implementable status #2245

Merged
merged 1 commit into kubernetes:master from the array-job branch on Jan 27, 2021

Conversation

alculquicondor
Member

@alculquicondor alculquicondor commented Jan 8, 2021

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 8, 2021
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Jan 8, 2021
@alculquicondor
Member Author

cc @ahg-g @soltysh @erictune

@alculquicondor alculquicondor changed the title from "Add KEP for Array Job in provisional status" to "Add KEP for Indexed Job in implementable status" on Jan 12, 2021
annotation. In the event of a kube-controller-manager upgrade, some running or
succeeded Pods might exist without a completion index.

<<[UNRESOLVED handling Jobs with existing Pods without completion index ]>>
Member

Introducing the annotation shouldn't impact existing workloads unless they explicitly use it. I don't expect that users would create an indexed job that is tolerant to not having the index. So I don't really see the need for addressing this scenario.

Member Author

This is more about the controller itself. It needs to do something with the existing Pods in terms of completion index tracking.

Member

@erictune erictune left a comment

Looks good, Aldo. Added some optional suggestions.

@alculquicondor alculquicondor force-pushed the array-job branch 2 times, most recently from b292a1e to c4575c8 on January 14, 2021 at 22:11
@alculquicondor
Member Author

/assign @janetkuo

@alculquicondor
Member Author

/label api-review

@k8s-ci-robot k8s-ci-robot added the api-review Categorizes an issue or PR as actively needing an API review. label Jan 19, 2021
Member

@erictune erictune left a comment

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 20, 2021
@alculquicondor
Member Author

@wojtek-t friendly reminder for PRR

### Job scaling

Jobs can be scaled up and down by varying `.spec.parallelism` (note that
`.spec.completions` is an immutable field).
Member

Please avoid using "scale" here, as Jobs don't follow the normal semantics for scaling. Ref kubernetes/kubernetes#60139

Member Author

Fixed

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 21, 2021
@janetkuo
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 22, 2021
@wojtek-t wojtek-t self-assigned this Jan 22, 2021
Member

@wojtek-t wojtek-t left a comment

I added some comments, but mostly fairly minor ones.


The Job controller doesn't add the environment variable if there is a name
conflict with an existing environment variable in the Job's Pod template or if
the user already specified another environment variable for the same annotation.
Member

Why? What's the problem with having the value referenced by two separate env vars?

Member Author

I think it makes sense to give the user control over the names of the variables. A binary could always have some behavior defined for JOB_COMPLETION_INDEX which is different from the one provided by k8s.

Member

I guess I wasn't clear. I wasn't asking to touch the customer-defined variable.

I was just asking: why don't we want to provide JOB_COMPLETION_INDEX unconditionally (either with the user-propagated value if set by the user, or as above if not)?

I don't see a problem if you have two env vars, say FOO and JOB_COMPLETION_INDEX, with the same value set to the value of that annotation.

Member Author

I understood you correctly :)

What I'm saying is: what if a JOB_COMPLETION_INDEX variable causes an undesirable effect on the user binary?

I guess in this case they could override that environment variable with another value.

Contributor

Make the env name unique, KUBERNETES_JOB_INDEX, for example. The user can always map that annotation to another pre-defined name, but we should always inject it.

Member Author

I don't have a strong opinion about the name. But I'm still not convinced we should always inject the value if the user is already adding it.

Member

We should always inject, unless the user explicitly overwrites that name.

The behavior should be consistent: if your job is indexed, var X (I don't have a strong opinion about the name) will always be set. In your behavior you have "the var may or may not be set", which is super confusing.

Member Author

Sounds good. Removed second condition.
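To make the resolved behaviour concrete, here is a minimal sketch (not taken from the KEP text) of the env var the Job controller could inject through the downward API; the annotation key shown is an assumption for illustration, and the variable name is the JOB_COMPLETION_INDEX discussed above:

```go
import corev1 "k8s.io/api/core/v1"

// Sketch only: the env var the Job controller could add to each container,
// exposing the completion-index annotation via the downward API.
// The annotation key below is an assumption for illustration.
var completionIndexEnvVar = corev1.EnvVar{
	Name: "JOB_COMPLETION_INDEX",
	ValueFrom: &corev1.EnvVarSource{
		FieldRef: &corev1.ObjectFieldSelector{
			FieldPath: "metadata.annotations['batch.kubernetes.io/job-completion-index']",
		},
	},
}
```

Per the thread's conclusion, the variable would always be injected unless the Pod template already defines one with the same name.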

is resolved:

The Job controller keeps track of completed indexes in the Job status in a
base64 encoded bitmask.
Member

Can you explicitly describe how JobStatus is updated then (for completeness)?

Another question: with the bitmask field and the proposal above, will we end up with two separate fields for that?

Member

Ohh - or you're saying that "we don't solve it for Alpha at all" - just rely on existence of all pods for now?

Member Author

Correct. This would be a beta graduation criterion, pending #2307

Member

OK - if you're not touching JobStatus for now at all, that's fine.

Contributor

I agree that tracking completed pods is orthogonal to this proposal.

// The Job is considered complete when there is one successful Pod for each
// index in the range 0 to (.spec.completions - 1).
// When value is `Indexed`, .spec.completions must have a non-zero positive
// value and `.spec.parallelism` must be less than or equal to 10^6.
Member

I'm assuming that the reasoning behind this 10^6 is the underlying bitmask, right?
i.e. we're saying that tracking 10^6 items in a bitmask takes O(125kB) and this is fine, right?

That sounds fine to me, but I would like to ensure that somewhere in the design section we add a requirement that such big objects won't be updated frequently (e.g. not more often than once every couple of seconds or something like that).
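(For scale: 10^6 bits is 125,000 bytes, and base64 encoding inflates that by a factor of 4/3, so the encoded bitmask alone would be roughly 167 kB per status update.)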

Member Author

Yes, this is about being able to maintain the bitmask. Although, if we are going to use it to track completion, we actually need to cap .spec.completions too. The idea is that in the future we don't depend on lingering pods. @erictune thoughts?

> e.g. not more often than once every couple of seconds or something like that

Should we be doing some kind of client-side throttling? In the current design, Job status updates would come after removing finalizers from finishing Pods, which already throttles Job updates in a way. But if one Pod finishes every 100ms (for example), we might still do more Job updates than what you are suggesting.

Member Author

I changed the limit to be on completions, instead of parallelism


If, instead of a downgrade, the feature is disabled:

- kube-apiserver sets `.spec.completionMode=NonIndexed` for new Jobs.
Member

I don't understand how this is done (and how you distinguish downgrade from disabling exactly). Can you be more specific?

Member Author

This is standard logic in API strategies for disabled feature gates. Reworded a bit.

I don't exactly distinguish a downgrade, as I cannot control the code of version N-1. So the paragraphs above describe how the code behaves in that scenario.

But what I can control is a feature being disabled, in which case I prefer to ignore Indexed Jobs that were already created.
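As a rough sketch of that standard strategy logic (using a local stand-in for the JobSpec field from the hunks above; in kube-apiserver the check would go through the feature-gate machinery, e.g. utilfeature.DefaultFeatureGate.Enabled, and the gate name is an assumption):

```go
// Sketch only: stand-in for the field shown in the API hunks above.
type JobSpec struct {
	CompletionMode string
	// ... other Job fields elided
}

// dropDisabledCompletionMode mimics the standard PrepareForCreate behaviour
// for a disabled feature gate: whatever completionMode the client set is
// overridden, so new Jobs are always stored as NonIndexed.
func dropDisabledCompletionMode(spec *JobSpec, indexedJobGateEnabled bool) {
	if !indexedJobGateEnabled {
		spec.CompletionMode = "NonIndexed"
	}
}
```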


* **Are there any tests for feature enablement/disablement?**

Yes, unit and integration test for the feature enabled and disabled.
Member

If we're going to have something fancy for feature disablement, then we should have (at least) a unit test for disablement.

You can see an example of doing that here (under review, but the idea is there):
https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246

Member Author

That was the intention of my text here. Reworded.
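A sketch of the kind of enablement/disablement test meant here, assuming a feature gate named IndexedJob and the usual featuregatetesting helper (both are assumptions, mirroring the linked example rather than quoting it):

```go
import (
	"fmt"
	"testing"

	utilfeature "k8s.io/apiserver/pkg/util/feature"
	featuregatetesting "k8s.io/component-base/featuregate/testing"
	"k8s.io/kubernetes/pkg/features"
)

// Sketch only: run the same scenario with the gate on and off and check that
// Indexed behaviour only appears when the gate is enabled.
func TestIndexedJobFeatureGate(t *testing.T) {
	for _, enabled := range []bool{true, false} {
		t.Run(fmt.Sprintf("gate=%t", enabled), func(t *testing.T) {
			defer featuregatetesting.SetFeatureGateDuringTest(t, utilfeature.DefaultFeatureGate, features.IndexedJob, enabled)()

			// Create a Job with completionMode: Indexed, then assert that the
			// field is kept when enabled and overridden to NonIndexed when not.
		})
	}
}
```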

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 25, 2021


// The Job is considered complete when there is one successful Pod for each
// index in the range 0 to (.spec.completions - 1).
// When value is `Indexed`, .spec.completions must have a value between 1
// and 10^6.
Member

> Yes, this is about being able to maintain the bitmask. Although, if we are going to use it to track completion, we actually need to cap .spec.completions too. The idea is that in the future we don't depend on lingering pods. @erictune thoughts?

I don't think you need to limit completions. The point Eric made in one of the other comments is that it's enough to limit parallelism: if you store not a bitmask but rather an [a-b,c-d,e-f,...] format, there will be at most `parallelism` such pairs, so the length of that is actually bounded by parallelism.
Bounding completions is much more limiting than limiting parallelism to something as high as 10^6.

> Should we be doing some kind of client-side throttling? In the current design, Job status updates would come after removing finalizers from finishing Pods, which already throttles Job updates in a way. But if one Pod finishes every 100ms (for example), we might still do more Job updates than what you are suggesting.

Yes - we should batch those, and send 1 update per X seconds. We can tune X later.

Member Author

> I don't think you need to limit completions. The point Eric made in one of the other comments is that it's enough to limit parallelism: if you store not a bitmask but rather an [a-b,c-d,e-f,...] format, there will be at most `parallelism` such pairs, so the length of that is actually bounded by parallelism.

So it depends on the implementation. If we use a bitmask, then we have to limit completions. If we limit parallelism, we would have to use the compressed list format, but then the limit might have to be lower than 10^6, as we require more characters. We don't have to be concerned about display space, as we can just limit kubectl describe to print fewer characters.

Member Author

@soltysh any thoughts?

Member

Running 10^6 pods at the same time is far from supported at this point.
Having more than 10^6 pods within a single job sounds like a valid use case to me.

So I would be much more comfortable saying "parallelism <= 10^5" (so decreasing that by an order of magnitude even) than setting a limit on completions.

Member Author

Moved back to limiting parallelism and using a compressed list format instead of a bitmask.
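To make the compressed list format concrete, a minimal sketch (illustrative only, not the controller's actual code) of collapsing a sorted set of completed indexes into the [a-b,c-d,...] form used in the kubectl output quoted further down:

```go
import (
	"fmt"
	"strings"
)

// formatCompletedIndexes collapses a sorted slice of completed indexes into
// the compressed interval form, e.g. 1..25, 28, 30..32 -> "1-25,28,30-32".
func formatCompletedIndexes(indexes []int) string {
	if len(indexes) == 0 {
		return ""
	}
	var parts []string
	start, prev := indexes[0], indexes[0]
	flush := func() {
		if start == prev {
			parts = append(parts, fmt.Sprintf("%d", start))
		} else {
			parts = append(parts, fmt.Sprintf("%d-%d", start, prev))
		}
	}
	for _, i := range indexes[1:] {
		if i != prev+1 {
			flush()
			start = i
		}
		prev = i
	}
	flush()
	return strings.Join(parts, ",")
}
```

With running pods tracked this way, the number of intervals stays bounded by parallelism, which is the argument made above for limiting parallelism rather than completions.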

- kube-apiserver sets `.spec.completionMode=NonIndexed` for new Jobs at
creation time.
- kube-controller-manager ignores existing Indexed Jobs and emits a warning
event.
Member

How exactly are those ignored? Are we killing the existing pods? Are we just not starting new ones? Are we stacking pods being finished?

Member Author

Added specificity.

// More completion modes can be added in the future. If a Job controller
// observes a mode that it doesn't recognize, it manages the Job as in
// `NonIndexed`.
CompletionMode string
Contributor

I'd propose creating a new CompletionModeType with pre-defined constants, like we do in other places such as this.

Contributor

Maybe something more generic like JobType, which we could re-use in the future if we decide to implement different behaviour of the job controller. I'm worried that CompletionMode might be limiting. Other ideas: WorkMode, JobMode

Member Author

I think CompletionMode has a good balance of specificity and generality. We already have one in mind for the future (see Alternatives): IndexedAndUnique.

The point is that they all refer to how completion is handled. If there is a mode that refers to something other than completion, it should have its own field.

Member

I think I agree with @soltysh on making a dedicated type, but I think this will get figured out during API review.

Member Author

I already gave it a type :)
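For reference, a sketch of the typed form settled on in this thread, using the NonIndexed and Indexed values from the hunks above (the exact constant names are illustrative):

```go
// CompletionMode describes how Pod completions are counted towards Job
// completion.
type CompletionMode string

const (
	// NonIndexedCompletion: the Job is complete when there have been
	// .spec.completions successful Pods; Pods are interchangeable.
	NonIndexedCompletion CompletionMode = "NonIndexed"
	// IndexedCompletion: each Pod gets a completion index, and the Job is
	// complete when there is a successful Pod for every index in
	// [0, .spec.completions-1].
	IndexedCompletion CompletionMode = "Indexed"
)
```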


Completed Indexes: [1-25,28,30-32]
```

The command crops the list of indexes once it reaches 200 characters.
Contributor

200 seems like a lot. I don't think we have a hard limit in kubectl, but I'm positive this is too much. Also, it's not relevant to this proposal; I'd drop it from here. It's an implementation detail.

Member Author

Removed specificity.


- kube-apiserver sets `.spec.completionMode=NonIndexed` for new Jobs at
creation time.
- kube-controller-manager ignores existing Indexed Jobs and emits a warning
Contributor

Warn and run as non-indexed? Warn and ignore? Can you be more specific?

Member Author

Added specificity.


If, instead of a downgrade, the cluster administrator disables the feature gate:

- kube-apiserver sets `.spec.completionMode=NonIndexed` for new Jobs at
Contributor

Will there be validation in kube-apiserver not to allow setting indexed jobs when this feature is disabled?

Member Author

The value is overridden. Reworded for clarity.

@wojtek-t
Member

@alculquicondor - this LGTM, please squash the commits

@alculquicondor
Member Author

squashed

@wojtek-t
Member

wojtek-t commented Jan 26, 2021

/approve

I will give @soltysh a chance to take another look.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, erictune, janetkuo, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 26, 2021
@alculquicondor
Member Author

/hold
for LGTM from @soltysh

@janetkuo could you approve?

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 26, 2021
@alculquicondor
Member Author

sorry, I missed that it was already approved :)

@ahg-g
Member

ahg-g commented Jan 26, 2021

/lgtm

since I reviewed it earlier.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 26, 2021
@wojtek-t
Member

> I already gave it a type :)

Yeah - I realized that after writing this comment :) I think that was the last unaddressed one. So let me cancel the hold - it's in very good shape now.

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 27, 2021
@k8s-ci-robot k8s-ci-robot merged commit 9106c8f into kubernetes:master Jan 27, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Jan 27, 2021
@soltysh
Contributor

soltysh commented Jan 28, 2021

I was one day late, but 👍 thanks for that 🎉

Labels
api-review Categorizes an issue or PR as actively needing an API review.
approved Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
sig/apps Categorizes an issue or PR as relevant to SIG Apps.
size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
Status: API review completed, 1.21
Development

Successfully merging this pull request may close these issues.

8 participants