
Make indices of IndexedJob pods configurable #109131

Open
isibeni opened this issue Mar 29, 2022 · 30 comments
Labels
area/batch kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. sig/apps Categorizes an issue or PR as relevant to SIG Apps. wg/batch Categorizes an issue or PR as relevant to WG Batch.

Comments

@isibeni

isibeni commented Mar 29, 2022

What would you like to be added?

As a user, I would like to be able to configure the indices with which the pods of an IndexedJob are created.

By introducing a new field to the job spec API (possible name: indices, TBD), we could influence which indices the controller creates pods for.

This field would be mutually exclusive with the completions field.

Ideally, we could also have a new status field called failedIndices which contains the indices of the failed pods.
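
For illustration, a minimal sketch of what a manifest using such a field might look like; the indices field name is only the placeholder proposed above and does not exist in the current Job API:

```yaml
# Hypothetical sketch; `indices` is only the field name proposed in this
# issue and is NOT part of the current Job API.
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-job-subset
spec:
  completionMode: Indexed
  indices: "1,3-5,7"   # proposed: create pods only for these indices
  parallelism: 3       # completions is omitted, as it would be mutually exclusive
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo processing index $JOB_COMPLETION_INDEX"]
```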

Why is this needed?

This is especially helpful when running large batch jobs in which some pods fail.
The user would then have an easy interface to recreate the failed job and rerun only the failed indices by specifying them via the newly introduced field.
If we also introduced the status field failedIndices, the user could easily pick the failed indices from that status and recreate the job with those indices. Introducing this status field would also make it possible to automate this workflow.

@isibeni isibeni added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 29, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 29, 2022
@k8s-ci-robot
Contributor

@isibeni: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@isibeni
Author

isibeni commented Mar 29, 2022

/wg batch

@k8s-ci-robot
Contributor

@isibeni: The label(s) wg/batch cannot be applied, because the repository doesn't have them.

In response to this:

/wg batch

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@isibeni
Author

isibeni commented Mar 29, 2022

/sig apps

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 29, 2022
@pacoxu
Member

pacoxu commented Mar 30, 2022

/area batch

@alculquicondor
Member

cc @soltysh

The field probably needs to be a string and it should have some limit for the number of characters.

As for the status, it might be harder to limit. In Indexed Jobs we limit parallelism to indirectly limit the number of characters in completedIndexes. We need to make a similar analysis to guarantee that the field doesn't grow indefinitely.

https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/job-v1/#lifecycle

IMO, the feature would be a nice addition to Indexed Jobs. Let's see what others say.

@isibeni
Author

isibeni commented Mar 30, 2022

The field probably needs to be a string and it should have some limit for the number of characters.

Agreed. I would propose that it have the same format as the status.completedIndexes field, e.g. 1,3-5,7 for 1, 3, 4, 5 and 7.
If we then also introduced a status.failedIndexes field, an operator would be able to simply copy the string and insert it into the indexes field to trigger a new job.
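
For illustration, a sketch of how the two values would line up; failedIndexes here is only the status field proposed in this thread, not an existing API field:

```yaml
# Hypothetical status of a failed Indexed Job:
status:
  completedIndexes: 0,2,6,8   # existing field: indices 0, 2, 6 and 8 completed
  failedIndexes: "1,3-5,7"    # proposed field: indices 1, 3, 4, 5 and 7 failed
```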

@isibeni
Author

isibeni commented Mar 30, 2022

In Indexed Jobs we limit parallelism to indirectly limit the number of characters in completedIndexes. We need to make a similar analysis to guarantee that the field doesn't grow indefinitely.

@alculquicondor Could you elaborate a bit on what the problem with that is? Performance and memory consumption problems in scenarios where we can't compress the string (e.g. when all pods with even indexes complete first)?

@alculquicondor
Member

Not just performance; the etcd database itself has size limits. You have identified the correct challenging scenario.

@liggitt liggitt added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Mar 31, 2022
@isibeni
Author

isibeni commented Apr 4, 2022

You have identified the correct challenging scenario.

And with the new field introduced, the challenging scenario would be when every other index fails right away (e.g. 1,3,5,7,9,11,13,15); we would then have the same value in completedIndexes and failedIndexes. Another one would be when every other index fails and every fourth index succeeds: completedIndexes could then contain the value 1-3,5-7,9-11,13-15 while failedIndexes contains 1,3,5,7,9,11,13,15.

Based on that, we would need to check whether we need to reduce the max value for .spec.parallelism even further.

@alculquicondor
Member

If an index fails, but succeeds with a follow up pod, we shouldn't list it in failedIndexes.

For completedIndexes, .spec.parallelism limits how many non-contiguous indexes can complete, because we create the indexes in order. Can the same rationale be applied to failedIndexes?

@isibeni
Author

isibeni commented Apr 5, 2022

Sorry, I just figured out that I had a small misconception about the behavior of backoffLimit on indexed jobs.
I thought backoffLimit was applied per index, so that e.g. with a backoffLimit of 2 each failing index would be restarted up to two times; indexes that reached the limit would be considered failed, while the other indexes would be allowed to run to completion. The overall job would still be considered failed.
If indexed jobs behaved like that, the failedIndexes field would make sense. But with the current implementation I don't see the point of this field, and to be honest, the proposed feature then also makes less sense.

Or am I missing something?

Maybe we can consider introducing a behavior like the one I just described. This behavior would also help with this issue: #107891

@alculquicondor
Member

The backoffLimit is for the entire Job.

I think the feature request still makes sense. The job overall is marked as Failed, but some indexes might have completed.

@alculquicondor
Member

Maybe we can consider introducing a behavior like the one I just described.

That would be a separate feature request. And we would need a new field, as we cannot change the behavior of existing fields, for backwards compatibility.

@isibeni
Author

isibeni commented Apr 8, 2022

That would be a separate feature request. And we would need a new field, as we cannot change the behavior of existing fields, for backwards compatibility.

Agreed on that.
I will open another issue in the coming days.

I think the feature request still makes sense. The job overall is marked as Failed, but some indexes might have completed.

Yeah, you are right. But then I would propose that we don't need the failedIndexes status field for now. It only becomes helpful with the other feature I just proposed.

@soltysh
Contributor

soltysh commented Apr 20, 2022

This was brought to my attention during the last wg-batch call. A few ideas worth considering:

  1. Do we actually need a net-new field to express failed indices? Given completions of 100, for example, and completedIndexes being 1-40,60-100, do we actually need that new field if we should be able to calculate the failed indices from the existing fields?
  2. I'd be curious to hear more about the use cases for this particular request. @alculquicondor provided one during the call, but I'm not sure that's sufficient. I'd like to better understand the scenarios in which and why a job (which by definition is meant to run to completion) fails, and would rather focus on improving overall stability, reporting the failures, or similar.

@ahg-g
Member

ahg-g commented Apr 28, 2022

I created #109712 to track the idea of a job execution mode where every index is allowed to execute.

@isibeni
Author

isibeni commented May 31, 2022

Do we actually need a net-new field to express failed indices? Given completions of 100, for example, and completedIndexes being 1-40,60-100, do we actually need that new field if we should be able to calculate the failed indices from the existing fields?

Not necessarily; as of now it would just be for convenience. For some, implementing the calculation might not be so straightforward, and if we provided it, clients/users wouldn't need to reimplement it on their own. But for this feature request it is not urgent; I could also live with keeping it out for now.

I'd be curious to hear more about the use cases for this particular request. @alculquicondor provided one during the call, but I'm not sure that's sufficient. I'd like to better understand the scenarios in which and why a job (which by definition is meant to run to completion) fails, and would rather focus on improving overall stability, reporting the failures, or similar.

Two things on that:

  1. This feature (making the indices configurable) isn't only about the use case where a user wants to retry a failed job. There are potentially other use cases where a user wants finer-grained control over which indices a job gets. Imagine an indexed job that processes some records: for the first 100 records you may want a different configuration for your job than for the next 100 records. If you can configure the indices accordingly, you can have two jobs, one with indices 0-99 and one with indices 100-199 (see the sketch after this list).
  2. Regarding your comment on improving stability: the job does not always fail due to some instability of the system itself. It could be failing because of an external dependency (e.g. a server outage, corrupt data, or a misconfiguration) that causes the job to fail for some indices but not all. In some cases human intervention might be needed before the job can be retriggered; only when everything is "back to normal" can the user go ahead and rerun the job for the remaining indices. I guess for such scenarios there isn't much we can do to improve overall stability.
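
A minimal sketch of the record-partitioning idea from point 1, again assuming the proposed (non-existent) indices field; the image names are made up for illustration:

```yaml
# Hypothetical: `indices` is the proposed field; images are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: process-records-0-99
spec:
  completionMode: Indexed
  indices: "0-99"        # first batch of records
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: record-processor:v1   # hypothetical configuration for this range
---
apiVersion: batch/v1
kind: Job
metadata:
  name: process-records-100-199
spec:
  completionMode: Indexed
  indices: "100-199"     # second batch, with a different configuration
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: record-processor:v2-highmem   # hypothetical configuration for this range
```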

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 29, 2022
@alculquicondor
Member

/remove-lifecycle stale

We will have to tackle this in 1.27 or later.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 29, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 27, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 27, 2022
@ahg-g
Member

ahg-g commented Dec 27, 2022

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 27, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 27, 2023
@alculquicondor
Member

/lifecycle frozen
This is blocked on #109712

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 28, 2023
@alculquicondor
Member

/wg batch

@k8s-ci-robot k8s-ci-robot added the wg/batch Categorizes an issue or PR as relevant to WG Batch. label Apr 11, 2023
@vsoch

vsoch commented Jun 14, 2023

@alculquicondor what specifically is blocking this? Why couldn't it be the case that you can configure the indices of the job, but not allow every index to execute?

@alculquicondor
Member

In a way, the features are independent, but doing #109712 first makes this feature more useful. So we started with #109712.

The idea is that #109712 will include a field failedIndexes: "1-5,7".
After you adjust whatever parameters could have caused the failures, you can simply copy-paste "1-5,7" into a new Job that has a field .spec.indexes: "1-5,7".
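
Sketched end to end, assuming both proposed fields (the failedIndexes status field from #109712 and the .spec.indexes field discussed here), neither of which exists in the current API:

```yaml
# Hypothetical status of the original, failed Job (field proposed in #109712):
# status:
#   completedIndexes: 0,6,8-9
#   failedIndexes: "1-5,7"
#
# Hypothetical retry Job using the field proposed in this issue:
apiVersion: batch/v1
kind: Job
metadata:
  name: my-job-retry
spec:
  completionMode: Indexed
  indexes: "1-5,7"   # copied verbatim from status.failedIndexes above
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: my-worker:latest   # hypothetical image
```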

@vsoch

vsoch commented Jun 14, 2023

Gotcha - that does make sense. I'm very interested in this feature, and if they can be kept separate (and eventually integrated in the way you described), would it be OK if I started on it? I haven't contributed to Kubernetes proper and would like to try that out (so I'm more empowered to do so in the future). But if you have someone in mind and/or absolutely want to wait, I can respect that too! I was able to get a prototype for bursting with the Flux Operator, although I used a hack to get around this particular issue of not being able to control the indices. Let me know what you think, and if it's something I could help with, I'd want to be pointed to step 1. :)

@alculquicondor
Member

I don't think there were any takers previously. Note that for a feature like this, you need to follow the KEP process: https://github.com/kubernetes/enhancements/tree/master/keps/sig-architecture/0000-kep-process

Ideally, before starting, it's helpful to introduce yourself to SIG Apps (https://github.com/kubernetes/community/tree/master/sig-apps#meetings) and announce your intent and high-level ideas, so that you can clear up possible questions and roadblocks early.

@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Sep 29, 2023