
Make indices of IndexedJob pods configurable #109131

Open
isibeni opened this issue Mar 29, 2022 · 30 comments
Labels
area/batch kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. sig/apps Categorizes an issue or PR as relevant to SIG Apps. wg/batch Categorizes an issue or PR as relevant to WG Batch.

Comments

@isibeni

isibeni commented Mar 29, 2022

What would you like to be added?

As a user, I would like to be able to configure the indices with which the pods of an IndexedJob are created.

By introducing a new field to the job spec API (possible name: indices, TBD), we could influence which indices the controller creates pods for.

This field would be mutually exclusive with the completions field.

Ideally, we could also have a new status field called failedIndices which contains the indices of the failed pods.
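
For illustration, a minimal sketch of what a manifest using such a field might look like; the indices field name is only the placeholder proposed above and does not exist in the current Job API:

```yaml
# Hypothetical sketch; `indices` is only the field name proposed in this
# issue and is NOT part of the current Job API.
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-job-subset
spec:
  completionMode: Indexed
  indices: "1,3-5,7"   # proposed: create pods only for these indices
  parallelism: 3       # completions is omitted, as it would be mutually exclusive
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo processing index $JOB_COMPLETION_INDEX"]
```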

Why is this needed?

This is especially helpful when running large batch jobs in which some pods fail.
The user would then have an easy interface to recreate the failed job and rerun only the failed indices by specifying them via the newly introduced field.
If we also introduced the status field failedIndices, the user could easily pick the failed indices from that status and recreate the job with those indices. Introducing this status field would also make it possible to automate this workflow.

@isibeni isibeni added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 29, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 29, 2022
@k8s-ci-robot
Contributor

@isibeni: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@isibeni
Author

isibeni commented Mar 29, 2022

/wg batch

@k8s-ci-robot
Contributor

@isibeni: The label(s) wg/batch cannot be applied, because the repository doesn't have them.

In response to this:

/wg batch

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@isibeni
Author

isibeni commented Mar 29, 2022

/sig apps

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 29, 2022
@pacoxu
Member

pacoxu commented Mar 30, 2022

/area batch

@alculquicondor
Member

cc @soltysh

The field probably needs to be a string and it should have some limit for the number of characters.

As for the status, it might be harder to limit. In Indexed Jobs we limit parallelism to indirectly limit the number of characters in completedIndexes. We need to make a similar analysis to guarantee that the field doesn't grow indefinitely.

https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/job-v1/#lifecycle

IMO, the feature would be a nice addition to Indexed Jobs. Let's see what others say.

@isibeni
Author

isibeni commented Mar 30, 2022

The field probably needs to be a string and it should have some limit for the number of characters.

Agreed. I would propose that it have the same format as the status.completedIndexes field, e.g. 1,3-5,7 for 1, 3, 4, 5 and 7.
If we then also introduced a status.failedIndexes field, an operator would be able to simply copy the string and insert it into the indexes field to trigger a new job.
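
For illustration, a sketch of how the two values would line up; failedIndexes here is only the status field proposed in this thread, not an existing API field:

```yaml
# Hypothetical status of a failed Indexed Job:
status:
  completedIndexes: 0,2,6,8   # existing field: indices 0, 2, 6 and 8 completed
  failedIndexes: "1,3-5,7"    # proposed field: indices 1, 3, 4, 5 and 7 failed
```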

@isibeni
Author

isibeni commented Mar 30, 2022

In Indexed Jobs we limit parallelism to indirectly limit the number of characters in completedIndexes. We need to make a similar analysis to guarantee that the field doesn't grow indefinitely.

@alculquicondor Could you elaborate a bit on what the problem with that is? Performance and memory consumption problems in scenarios where we can't compress the string (e.g. when all pods with even indexes complete first)?

@alculquicondor
Member

Not just performance; the etcd database itself has size limits. You have identified the correct challenging scenario.

@liggitt liggitt added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Mar 31, 2022
@isibeni
Author

isibeni commented Apr 4, 2022

You have identified the correct challenging scenario.

And with the new field introduced, the challenging scenario would be when every other index fails right away (e.g. 1,3,5,7,9,11,13,15); we would then have the same value in completedIndexes and failedIndexes. Another one would be when every other index fails and every fourth index succeeds: completedIndexes could then contain the value 1-3,5-7,9-11,13-15 while failedIndexes contains 1,3,5,7,9,11,13,15.

Based on that, we would need to check whether we need to reduce the max value for .spec.parallelism even further.

@alculquicondor
Member

If an index fails, but succeeds with a follow up pod, we shouldn't list it in failedIndexes.

For completedIndexes, .spec.parallelism limits how many non-contiguous indexes can complete, because we create the indexes in order. Can the same rationale be applied to failedIndexes?

@isibeni
Author

isibeni commented Apr 5, 2022

Sorry, I just figured out that I had a small misconception about the behavior of backoffLimit on indexed jobs.
I thought backoffLimit was applied per index, so that e.g. with a backoffLimit of 2 each failing index would be restarted up to two times; indexes that reached the limit would be considered failed, while the other indexes would be allowed to run to completion. The overall job would still be considered failed.
If indexed jobs behaved like that, the failedIndexes field would make sense. But with the current implementation I don't see the point of this field, and to be honest, the proposed feature then also makes less sense.

Or am I missing something?

Maybe we can consider introducing a behavior like the one I just described. This behavior would also help with this issue: #107891

@alculquicondor
Member

The backoffLimit is for the entire Job.

I think the feature request still makes sense. The job overall is marked as Failed, but some indexes might have completed.

@alculquicondor
Member

Maybe we can consider introducing a behavior like the one I just described.

That would be a separate feature request. And we would need a new field, as we cannot change the behavior of existing fields, for backwards compatibility.

@isibeni
Author

isibeni commented Apr 8, 2022

That would be a separate feature request. And we would need a new field, as we cannot change the behavior of existing fields, for backwards compatibility.

Agreed on that.
I will open another issue in the coming days.

I think the feature request still makes sense. The job overall is marked as Failed, but some indexes might have completed.

Yeah, you are right. But then I would propose that we don't need the failedIndexes status field for now. It only becomes helpful with the other feature I just proposed.

@soltysh
Contributor

soltysh commented Apr 20, 2022

This was brought to my attention during the last wg-batch call. A few ideas worth considering:

  1. Do we actually need a net-new field to express failed indices? Given completions of 100, for example, and completedIndexes being 1-40,60-100, do we actually need that new field if we should be able to calculate the failed indices from the existing fields?
  2. I'd be curious to hear more about the use cases for this particular request. @alculquicondor provided one during the call, but I'm not sure that's sufficient. I'd like to better understand the scenarios in which and why a job (which by definition is meant to run to completion) fails, and would rather focus on improving overall stability, reporting the failures, or similar.

@ahg-g
Member

ahg-g commented Apr 28, 2022

I created #109712 to track the idea of a job execution mode where every index is allowed to execute.

@isibeni
Author

isibeni commented May 31, 2022

Do we actually need a net-new field to express failed indices? Given completions of 100, for example, and completedIndexes being 1-40,60-100, do we actually need that new field if we should be able to calculate the failed indices from the existing fields?

Not necessarily; as of now it would just be for convenience. For some, implementing the calculation might not be so straightforward, and if we provided it, clients/users wouldn't need to reimplement it on their own. But for this feature request it is not urgent; I could also live with keeping it out for now.

I'd be curious to hear more about the use cases for this particular request. @alculquicondor provided one during the call, but I'm not sure that's sufficient. I'd like to better understand the scenarios in which and why a job (which by definition is meant to run to completion) fails, and would rather focus on improving overall stability, reporting the failures, or similar.

Two things on that:

  1. This feature (making the indices configurable) isn't only about the use case where a user wants to retry a failed job. There are potentially other use cases where a user wants finer-grained control over which indices a job gets. Imagine an indexed job that processes some records: for the first 100 records you may want a different configuration for your job than for the next 100 records. If you can configure the indices accordingly, you can have two jobs, one with indices 0-99 and one with indices 100-199 (see the sketch after this list).
  2. Regarding your comment on improving stability: the job does not always fail due to some instability of the system itself. It could be failing because of an external dependency (e.g. a server outage, corrupt data, or a misconfiguration) that causes the job to fail for some indices but not all. In some cases human intervention might be needed before the job can be retriggered; only when everything is "back to normal" can the user go ahead and rerun the job for the remaining indices. I guess for such scenarios there isn't much we can do to improve overall stability.
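
A minimal sketch of the record-partitioning idea from point 1, again assuming the proposed (non-existent) indices field; the image names are made up for illustration:

```yaml
# Hypothetical: `indices` is the proposed field; images are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: process-records-0-99
spec:
  completionMode: Indexed
  indices: "0-99"        # first batch of records
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: record-processor:v1   # hypothetical configuration for this range
---
apiVersion: batch/v1
kind: Job
metadata:
  name: process-records-100-199
spec:
  completionMode: Indexed
  indices: "100-199"     # second batch, with a different configuration
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: record-processor:v2-highmem   # hypothetical configuration for this range
```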

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 29, 2022
@alculquicondor
Member

/remove-lifecycle stale

We will have to tackle this in 1.27 or later.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 29, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 27, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 27, 2022
@ahg-g
Member

ahg-g commented Dec 27, 2022

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 27, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 27, 2023
@alculquicondor
Member

/lifecycle frozen
This is blocked on #109712

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 28, 2023
@alculquicondor
Member

/wg batch

@k8s-ci-robot k8s-ci-robot added the wg/batch Categorizes an issue or PR as relevant to WG Batch. label Apr 11, 2023
@vsoch

vsoch commented Jun 14, 2023

@alculquicondor what specifically is blocking this? Why couldn't it be the case that you can configure the indices of the job, but not allow every index to execute?

@alculquicondor
Member

In a way, the features are independent, but doing #109712 first makes this feature more useful. So we started with #109712.

The idea is that #109712 will include a field failedIndexes: "1-5,7".
After you adjust whatever parameters could have caused the failures, you can simply copy-paste "1-5,7" into a new Job that has a field .spec.indexes: "1-5,7".
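
Sketched end to end, assuming both proposed fields (the failedIndexes status field from #109712 and the .spec.indexes field discussed here), neither of which exists in the current API:

```yaml
# Hypothetical status of the original, failed Job (field proposed in #109712):
# status:
#   completedIndexes: 0,6,8-9
#   failedIndexes: "1-5,7"
#
# Hypothetical retry Job using the field proposed in this issue:
apiVersion: batch/v1
kind: Job
metadata:
  name: my-job-retry
spec:
  completionMode: Indexed
  indexes: "1-5,7"   # copied verbatim from status.failedIndexes above
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: my-worker:latest   # hypothetical image
```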

@vsoch

vsoch commented Jun 14, 2023

Gotcha - that does make sense. I'm very interested in this feature, and if they can be kept separate (and eventually integrated in the way you described), would it be OK if I started on it? I haven't contributed to Kubernetes proper and would like to try that out (so I'm more empowered to do so in the future). But if you have someone in mind and/or absolutely want to wait, I can respect that too! I was able to get a prototype for bursting with the Flux Operator, although I used a hack to get around this particular issue of not being able to control the indices. Let me know what you think, and if it's something I could help with, I'd want to be pointed to step 1. :)

@alculquicondor
Member

I don't think there were any takers previously. Note that for a feature like this, you need to follow the KEP process: https://github.com/kubernetes/enhancements/tree/master/keps/sig-architecture/0000-kep-process

Ideally, before starting, it's helpful to introduce yourself to SIG Apps (https://github.com/kubernetes/community/tree/master/sig-apps#meetings) and announce your intent and high-level ideas, so that you can clear up possible questions and roadblocks early.

@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Sep 29, 2023