Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retriable and non-retriable Pod failures for Jobs #3329

Open
8 tasks done
alculquicondor opened this issue Jun 1, 2022 · 84 comments
Open
8 tasks done

Retriable and non-retriable Pod failures for Jobs #3329

alculquicondor opened this issue Jun 1, 2022 · 84 comments
Assignees
Labels
sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. stage/beta Denotes an issue tracking an enhancement targeted for Beta status wg/batch Categorizes an issue or PR as relevant to WG Batch.

Comments

@alculquicondor
Copy link
Member

alculquicondor commented Jun 1, 2022

Enhancement Description

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jun 1, 2022
@alculquicondor
Copy link
Member Author

/sig apps
/wg batch

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. wg/batch Categorizes an issue or PR as relevant to WG Batch. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 1, 2022
@alculquicondor
Copy link
Member Author

/assign

@mimowo
Copy link
Contributor

mimowo commented Jun 2, 2022

/assign

@alculquicondor alculquicondor changed the title Retriable and non-retriable Job pod failures Retriable and non-retriable Pod failures for Jobs Jun 6, 2022
@alculquicondor
Copy link
Member Author

/sig scheduling
/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Jun 9, 2022
@Priyankasaggu11929
Copy link
Member

Hello @alculquicondor 👋, 1.25 Enhancements team here.

Just checking in as we approach enhancements freeze on 18:00 PT on Thursday June 23, 2022, which is just over 2 days from now.

For note, This enhancement is targeting for stage alpha for 1.25 (correct me, if otherwise)

Here's where this enhancement currently stands:

  • KEP file using the latest template has been merged into the k/enhancements repo.
  • KEP status is marked as implementable
  • KEP has a updated detailed test plan section filled out
  • KEP has up to date graduation criteria
  • KEP has a production readiness review that has been completed and merged into k/enhancements.

The open PR #3374 is addressing all the listed criteria above. We would just require getting it merged by the Enhancements Freeze.

For note, the status of this enhancement is marked as at risk. Please keep the issue description up-to-date with appropriate stages as well. Thank you!

@Priyankasaggu11929 Priyankasaggu11929 added the tracked/yes Denotes an enhancement issue is actively being tracked by the Release Team label Jun 22, 2022
@Priyankasaggu11929 Priyankasaggu11929 added this to the v1.25 milestone Jun 22, 2022
@Priyankasaggu11929
Copy link
Member

With KEP PR #3374 merged, the enhancement is ready for the 1.25 Enhancements Freeze.

For note, the status is now marked as tracked. Thank you so much!

@kcmartin
Copy link

Hello @alculquicondor 👋, 1.25 Release Docs Lead here.
This enhancement is marked as ‘Needs Docs’ for 1.25 release.

Please follow the steps detailed in the documentation to open a PR against dev-1.25 branch in the k/website repo. This PR can be just a placeholder at this time, and must be created by August 4.
 Also, take a look at Documenting for a release to familiarize yourself with the docs requirement for the release.

Thank you!

@Atharva-Shinde
Copy link
Contributor

Atharva-Shinde commented Jul 25, 2022

Hi @alculquicondor @mimowo, Enhancements team here again 👋

Checking in as we approach Code Freeze at 01:00 UTC on Wednesday, 3rd August 2022.

Please ensure that the following items are completed before the code-freeze:

Let me know if there are any additional k/k PRs besides the ones listed above

Currently, the status of the enhancement is marked as at-risk

Thanks :)

@mimowo
Copy link
Contributor

mimowo commented Jul 29, 2022

@Atharva-Shinde @alculquicondor there is one more PR that should be included before the code freeze: kubernetes/kubernetes#111475

@Atharva-Shinde
Copy link
Contributor

thank you @mimowo,
I have updated my comment with the PR and have also tagged you for future reference :)

@alculquicondor
Copy link
Member Author

Yes, that is my proposal

I don't want to stand in the way of solving real problems, but I worry that this becomes conceptual debt that we will never pay off (or even remember!)

@kerthcet already has an open WIP KEP :)
And we have already received good user feedback about the failure policy at the job level.

@thockin
Copy link
Member

thockin commented Jun 2, 2023 via email

@kerthcet
Copy link
Member

kerthcet commented Jun 2, 2023

This could also play nicely with #3967 . Also currently limited to restartPolicy=Never, but we could extend to restartPolicy=OnFailure, and define it in this case to skip counting individual container restarts

That's the point I think Job API can leverage the maxRestartTimesOnFailure to achieve Backoff Limit Per Index.

The max-restarts KEP isn't the same as restart-rules, though, right? They
all seem complementary but not the same.

Yes, currently max-restarts KEP only accounts for the restartPolicy=OnFailure, because restartPolicy=Never & maxRestartTimes=x looks conflict.

@mimowo
Copy link
Contributor

mimowo commented Jun 2, 2023

The max-restarts KEP isn't the same as restart-rules, though, right? They
all seem complementary but not the same.

Yes, these are the pieces of work that need to be done to support fully restart policy Never and OnFailure:

  1. restartPolicy=Never (this KEP, currently Beta without blocking issues open)
  2. maxRestartTimes (started as a new KEP)
  3. enable restartPolicy=OnFailure (not started, but seems a small follow-up KEP); plus change to Job's backoffLimit when maxRestartTimes is specified
  4. containerRestartPolicy in pod spec (not started, AFAIK purely node changes) (enabling API similar to the one in Retriable and non-retriable Pod failures for Jobs #3329 (comment))

Note that:

  • (2.) is a prerequisite for (3.)
  • (3.) and (4.) and independent technically

IMO doing (1.) - (4.) under this KEP would prolong its graduation substantially, and different points have different priorities, so it is useful to de-couple. Recently we got this slack ask to support OnFailure. We will need to keep track of how much demand there is for (3.) and (4.) and prioritize accordingly, but de-coupling allows that.

@johnbelamaric
Copy link
Member

Is this planning to go to beta in 1.28? It's on the board for PRR but I don't see an open KEP PR linked here?

@mimowo
Copy link
Contributor

mimowo commented Jun 14, 2023

@johnbelamaric the KEP is already in Beta.

There are two updates for the third iteration of Beta:

  1. https://github.com/kubernetes/enhancements/pull/3965/files (just reflecting a bugfix that is already merged, and status update)
  2. https://github.com/kubernetes/enhancements/pull/3940/files/8eb45d267c0885159d53ed9355d5b7995eefdd39#diff-01c599fe2c09bcb8104563c0fd5162bd711cc04814d049357c6dda0ac7bf2a2a (and update currently coupled with the KEP-3939) to delay pod recreation until we have analyzed with pod failure policy the failed pod.

Do you think (2.) should be decoupled as a dedicated PR for this KEP -3329 or can stay as is?

@mimowo
Copy link
Contributor

mimowo commented Jun 14, 2023

Do you think (2.) should be decoupled as a dedicated PR for this KEP -3329 or can stay as is?

Also asking @wojtek-t who is PRR reviewer of the other KEP (3939).

@mimowo
Copy link
Contributor

mimowo commented Jun 14, 2023

The other KEP is close to merge so seems safe to keep the update coupled there.

@alculquicondor
Copy link
Member Author

I discussed this with @deads2k and he was ok having a single PR for changes in both KEPs, because changes in 3329 are minimal and highly related to the other KEP

@johnbelamaric
Copy link
Member

Sure that's fine

@wojtek-t
Copy link
Member

Sure that's fine

I reviewed that integration part from the PRR POV carefully too.

@npolshakova
Copy link

Hi @mimowo and @alculquicondor, just checking in again before the enhancement freeze coming up on 01:00 UTC Friday, 16th June 2023. This enhancement is marked as at risk, but it looks like the only thing left is for #3940 to go in. Let me know if I've missed anything. Thanks!

@alculquicondor
Copy link
Member Author

#3940 just merged 😄

@Atharva-Shinde
Copy link
Contributor

With all the requirements fulfilled this enhancement is now marked as tracked for the upcoming enhancements freeze🚀

@AdminTurnedDevOps
Copy link

Hi @alculquicondor

1.28 Docs Shadow here.

Does this enhancement work planned for 1.28 require any new docs or modification to existing docs?

If so, please follows the steps here to open a PR against dev-1.28 branch in the k/website repo. This PR can be just a placeholder at this time and must be created before Thursday 20th July 2023.

Also, take a look at Documenting for a release to get yourself familiarize with the docs requirement for the release.

Thank you!

@npolshakova
Copy link

npolshakova commented Jul 11, 2023

Hey again @alculquicondor 👋

Just checking in as we approach Code freeze at 01:00 UTC Friday, 19th July 2023 .

Here’s the enhancement’s state for the upcoming code freeze:

For this enhancement, it looks like the following code related PR/s are open and they need to be merged or should be in merge-ready state before the code freeze commences :

Also please let me know if there are other PRs in k/k we should be tracking for this KEP.
As always, we are here to help if any questions come up. Thanks!

@mimowo
Copy link
Contributor

mimowo commented Jul 12, 2023

We will also have an improvement to this KEP as part of: kubernetes/kubernetes#117015 in 1.28.

@Rishit-dagli
Copy link
Member

Hey @alculquicondor , could you please create a docs PR even if it is a draft PR with no content yet against dev-1.28 branch in the k/website repo. The deadline to create this draft PR is Thursday 20th July 2023.

@alculquicondor
Copy link
Member Author

Thanks @Rishit-dagli, that's gonna be kubernetes/website#41745 (yes, the enhancements are tightly coupled).

I added the necessary PRs to the issue description above.

@AdminTurnedDevOps
Copy link

AdminTurnedDevOps commented Jul 25, 2023

removed.

@npolshakova
Copy link

/remove-label lead-opted-in

@k8s-ci-robot k8s-ci-robot removed the lead-opted-in Denotes that an issue has been opted in to a release label Aug 27, 2023
@mimowo
Copy link
Contributor

mimowo commented Oct 20, 2023

@alculquicondor please update the list of code changes for Beta with this PR: kubernetes/kubernetes#121103.

@salehsedghpour
Copy link
Contributor

Hello 👋 1.30 Enhancements Lead here,

I'm closing milestone 1.28 now,
If you wish to progress this enhancement in v1.30, please follow the instructions here to opt in the enhancement and make sure the lead-opted-in label is set so it can get added to the tracking board and finally add /milestone v1.30. Thanks!

/milestone clear

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. stage/beta Denotes an issue tracking an enhancement targeted for Beta status wg/batch Categorizes an issue or PR as relevant to WG Batch.
Projects
Status: Graduating
Status: Tracked
Status: Tracked
Status: Needs Triage
Status: In Progress
Development

No branches or pull requests