DRA: device taints and tolerations #5055

Open
4 of 8 tasks
pohly opened this issue Jan 20, 2025 · 26 comments
Assignees
Labels
lead-opted-in Denotes that an issue has been opted in to a release sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status wg/device-management Categorizes an issue or PR as relevant to WG Device Management.

Comments

@pohly
Contributor

pohly commented Jan 20, 2025

Enhancement Description

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jan 20, 2025
@pohly
Contributor Author

pohly commented Jan 20, 2025

/sig node
/sig scheduling
/wg device-management

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 20, 2025
@k8s-ci-robot k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Jan 20, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling Jan 20, 2025
@pohly pohly moved this from 🆕 New to 🏗 In progress in Dynamic Resource Allocation Jan 21, 2025
@kannon92
Contributor

I think sig-scheduling lead needs to opt-in for this feature for it to be tracked for 1.33.

@pohly
Contributor Author

pohly commented Feb 3, 2025

/assign

@Huang-Wei
Member

/label lead-opted-in

@k8s-ci-robot k8s-ci-robot added the lead-opted-in Denotes that an issue has been opted in to a release label Feb 6, 2025
@dipesh-rawat dipesh-rawat added this to the v1.33 milestone Feb 6, 2025
@dipesh-rawat
Member

Hello @pohly 👋, v1.33 Enhancements team here.

Just checking in as we approach enhancements freeze at 02:00 UTC Friday 14th February 2025 / 19:00 PDT Thursday 13th February 2025.

This enhancement is targeting stage alpha for v1.33 (correct me, if otherwise)
/stage alpha

Here's where this enhancement currently stands:

  • KEP readme using the latest template has been merged into the k/enhancements repo.
  • KEP status is marked as implementable for latest-milestone: v1.32.
  • KEP readme has up-to-date graduation criteria
  • KEP has a production readiness review that has been completed and merged into k/enhancements. (For more information on the PRR process, check here). If your production readiness review is not completed yet, please make sure to fill the production readiness questionnaire in your KEP by the PRR Freeze deadline on Thursday 6th February 2025 so that the PRR team has enough time to review your KEP.

For this KEP, we would need to update the following:

  • Create the KEP readme using the latest template and merge it in the k/enhancements repo.
  • Ensure that the KEP has undergone a production readiness review and has been merged into k/enhancements.

It looks like #5034 will address most of these issues.

The status of this enhancement is marked as At risk for enhancements freeze. Please keep the issue description up-to-date with the appropriate stages as well.

If you anticipate missing enhancements freeze, you can file an exception request in advance. Thank you!

@dipesh-rawat dipesh-rawat moved this to At risk for enhancements freeze in 1.33 Enhancements Tracking Feb 6, 2025
@k8s-ci-robot k8s-ci-robot added the stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status label Feb 6, 2025
@haircommander haircommander moved this to Proposed for consideration in SIG Node 1.33 KEPs planning Feb 6, 2025
@dipesh-rawat
Member

Hi @pohly 👋, 1.33 Enhancements team here,

Just a quick friendly reminder as we approach the enhancements freeze later this week, at 02:00 UTC Friday 14th February 2025 / 19:00 PDT Thursday 13th February 2025.

The current status of this enhancement is marked as At risk for enhancement freeze. There are a few requirements mentioned in the comment #5055 (comment) that still need to be completed.

If you anticipate missing enhancements freeze, you can file an exception request in advance. Thank you!

@dipesh-rawat
Member

Hello @pohly 👋, 1.33 Enhancements team here,

Now that PR #5034 has been merged, all the KEP requirements are in place and merged into k/enhancements.

Before the enhancement freeze, it would be appreciated if the following nits could be addressed:

Aside from the minor nits mentioned above, this enhancement is all good for the upcoming enhancements freeze. 🚀

The status of this enhancement is now marked as tracked for enhancement freeze. Please keep the issue description up-to-date with appropriate stages as well. Thank you!

/label tracked/yes

@dipesh-rawat
Member

Thanks for the update and for clarifying the changes for KEP-5055. I’ll go ahead and mark this as tracked again, since it was previously marked as removed from the milestone on the 1.33 project board after the docs PR was closed.

(cc Docs Lead: @rayandas, we might need to track the reopened Docs PR kubernetes/website#49822)

@dipesh-rawat dipesh-rawat moved this from Removed from Milestone to Tracked for code freeze in 1.33 Enhancements Tracking Mar 25, 2025
@Urvashi0109 Urvashi0109 moved this from Tracked for code freeze to At Risk for Docs Freeze in 1.33 Enhancements Tracking Apr 3, 2025
@rayandas rayandas moved this from At Risk for Docs Freeze to Exception Required in 1.33 Enhancements Tracking Apr 9, 2025
@rayandas rayandas moved this from Exception Required to Tracked for Docs Freeze in 1.33 Enhancements Tracking Apr 9, 2025
@towca

towca commented Apr 23, 2025

@pohly After reading the KEP, I have a couple of questions from the perspective of Node autoscaling. Do we expect the device taints to behave/be interacted with similarly to Node taints?

Cluster Autoscaler uses existing Nodes to reason about what a new Node from the same NodeGroup would look like. While doing that, it has to ignore Node taints that won't be present on a new Node. This includes well-known status/transient taints, but CA also allows the user to configure which taints are "startup" (can be present on every Node at the beginning, then are expected to disappear), and which are "status" (can be present on only some Nodes; new Nodes don't automatically get them). Any other taint will be copied to the NodeGroup template and prevent CA from scaling up for pods that don't tolerate it (even if the taint were to disappear later on).

Should we just extend these mechanisms and the general taint handling to device taints and essentially treat them the same as Node taints? Or are there any considerations that make device taints different from Node taints wrt the logic above?
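For readers less familiar with the CA behavior described above, here is a minimal sketch of the filtering step. It is not Cluster Autoscaler code: the `Taint` and `TaintConfig` types, the `TemplateTaints` function, and the `example.com/...` keys are all hypothetical names made up for illustration, and matching on the taint key alone is an assumption.

```go
package main

import "fmt"

// Taint is a simplified stand-in for a node taint (hypothetical type).
type Taint struct {
	Key    string
	Value  string
	Effect string
}

// TaintConfig is a hypothetical user configuration naming the taint keys
// that must not be carried over to a NodeGroup template.
type TaintConfig struct {
	StartupKeys map[string]bool // on every new node at first, expected to disappear
	StatusKeys  map[string]bool // on only some nodes, not inherited by new nodes
}

// TemplateTaints returns the taints a simulated new node would keep:
// every taint that is configured as neither "startup" nor "status".
func TemplateTaints(existing []Taint, cfg TaintConfig) []Taint {
	var kept []Taint
	for _, t := range existing {
		if cfg.StartupKeys[t.Key] || cfg.StatusKeys[t.Key] {
			continue // ignored when building the template
		}
		kept = append(kept, t)
	}
	return kept
}

func main() {
	cfg := TaintConfig{
		StartupKeys: map[string]bool{"example.com/initializing": true},
		StatusKeys:  map[string]bool{"example.com/degraded": true},
	}
	node := []Taint{
		{Key: "example.com/initializing", Effect: "NoSchedule"},
		{Key: "example.com/degraded", Effect: "NoSchedule"},
		{Key: "example.com/dedicated", Value: "gpu", Effect: "NoSchedule"},
	}
	// Only the taint that is neither startup nor status survives.
	for _, t := range TemplateTaints(node, cfg) {
		fmt.Printf("%s=%s:%s\n", t.Key, t.Value, t.Effect)
	}
}
```

Extending this to device taints would presumably mean running the same kind of filter over the taints of each device in the template, but that is exactly the open question here.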

@pohly
Contributor Author

pohly commented Apr 23, 2025

For taints applied by the admin we probably should literally do what the admin says: if the DeviceTaintRule applies to all devices of a driver regardless of the pool, it still applies also to the fictional node.

For taints published by the driver it's as ambiguous as node taints. They may or may not be "startup", so I suppose a configuration option will be needed.
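One way to read "literally do what the admin says" is to evaluate the rule's selector against the devices of the fictional node. The sketch below assumes a simplified selector with only optional driver, pool, and device fields; the `Selector` and `Device` types and both methods are hypothetical, and the real DeviceTaintRule API may offer further selection criteria that this sketch ignores.

```go
package main

import "fmt"

// Selector is a simplified, hypothetical stand-in for the device selector of
// a DeviceTaintRule. A nil field means "no restriction on this attribute".
type Selector struct {
	Driver *string
	Pool   *string
	Device *string
}

// Device identifies a device by driver, pool, and name (hypothetical type).
type Device struct {
	Driver, Pool, Name string
}

// Matches reports whether the selector applies to the given device.
func (s Selector) Matches(d Device) bool {
	if s.Driver != nil && *s.Driver != d.Driver {
		return false
	}
	if s.Pool != nil && *s.Pool != d.Pool {
		return false
	}
	if s.Device != nil && *s.Device != d.Name {
		return false
	}
	return true
}

// IsDriverWide reports whether the selector names a driver but restricts
// neither pool nor device, i.e. it covers all devices of that driver.
func (s Selector) IsDriverWide() bool {
	return s.Driver != nil && s.Pool == nil && s.Device == nil
}

func ptr(s string) *string { return &s }

func main() {
	wide := Selector{Driver: ptr("gpu.example.com")}
	narrow := Selector{Driver: ptr("gpu.example.com"), Pool: ptr("node-1")}

	// A device on a node that does not exist yet.
	fictional := Device{Driver: "gpu.example.com", Pool: "new-node", Name: "gpu-0"}

	fmt.Println(wide.Matches(fictional), wide.IsDriverWide())
	fmt.Println(narrow.Matches(fictional), narrow.IsDriverWide())
}
```

Under these assumptions, a rule scoped to a specific pool would not carry over to the fictional node unless the simulated pool happens to have the same name.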

@towca

towca commented Apr 25, 2025

For taints applied by the admin we probably should literally do what the admin says: if the DeviceTaintRule applies to all devices of a driver regardless of the pool, it still applies also to the fictional node.

Are you saying we should treat some DeviceTaintRules as special cases? Can we easily detect if a DeviceTaintRule "applies to all devices of a driver regardless of the pool"?

FYI, I updated kubernetes/autoscaler#7947 accordingly.

@nojnhuh

nojnhuh commented Apr 25, 2025

I think Patrick is suggesting that in general, taints defined by DeviceTaintRules are more likely to be like "startup" taints (i.e. can be extrapolated to new nodes), whereas taints defined in ResourceSlices could very well be either "startup" or "status." There's nothing stopping some kind of controller from managing "status"-like taints in DeviceTaintRules though, so that's not a perfect heuristic.

One issue CA might have making that distinction though is that the ResourceSlice tracker doesn't record the origin of taints, so it's impossible to tell only by looking at a ResourceSlice from the tracker whether or not a taint came from a DeviceTaintRule. Maybe that's only a sign that "from DeviceTaintRule => 'startup'" isn't the right line to draw in the sand. IMO if some user configuration to make the startup/status distinction would be required for taints defined by drivers in ResourceSlices, that same configuration should also be required and apply the same way for taints in DeviceTaintRules.
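To make that last point concrete, here is a rough sketch of a classification that never looks at where a taint came from. Everything in it is hypothetical (the `TaintClass`, `DeviceTaint`, and `Classifier` names, and the idea of keying the configuration on the taint key); it only illustrates applying one user-provided configuration uniformly to taints from ResourceSlices and from DeviceTaintRules.

```go
package main

import "fmt"

// TaintClass says how a taint is treated when simulating a new node.
type TaintClass int

const (
	Persistent TaintClass = iota // copied to the template of a new node
	Startup                      // expected on new nodes, then removed
	Status                       // present on only some devices
)

func (c TaintClass) String() string {
	switch c {
	case Startup:
		return "startup"
	case Status:
		return "status"
	default:
		return "persistent"
	}
}

// DeviceTaint is a simplified taint as seen after merging ResourceSlice and
// DeviceTaintRule data. It deliberately has no origin field.
type DeviceTaint struct {
	Key, Value, Effect string
}

// Classifier is a hypothetical user configuration mapping taint keys to a class.
type Classifier map[string]TaintClass

// Classify returns the configured class. Unconfigured keys yield the zero
// value, Persistent.
func (c Classifier) Classify(t DeviceTaint) TaintClass {
	return c[t.Key]
}

func main() {
	cfg := Classifier{
		"example.com/warming-up": Startup,
		"example.com/unhealthy":  Status,
	}
	for _, t := range []DeviceTaint{
		{Key: "example.com/warming-up", Effect: "NoSchedule"},
		{Key: "example.com/unhealthy", Effect: "NoExecute"},
		{Key: "example.com/reserved", Effect: "NoSchedule"},
	} {
		fmt.Printf("%s -> %s\n", t.Key, cfg.Classify(t))
	}
}
```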

@jenshu
Contributor

jenshu commented May 16, 2025

Hi @pohly 👋, 1.34 Enhancements Lead here.

I am closing the v1.33 milestone now.

If you'd like to work on this enhancement in v1.34, please have the SIG lead opt-in by adding the lead-opted-in label, which ensures it gets added to the tracking board. Also, please set the milestone to v1.34 using /milestone v1.34.

Thanks!

/remove-label lead-opted-in
/remove-label tracked/yes
/milestone clear

@k8s-ci-robot k8s-ci-robot removed this from the v1.33 milestone May 16, 2025
@k8s-ci-robot k8s-ci-robot removed lead-opted-in Denotes that an issue has been opted in to a release tracked/yes Denotes an enhancement issue is actively being tracked by the Release Team labels May 16, 2025
@haircommander haircommander moved this from Triage to Proposed for consideration in SIG Node 1.34 KEPs planning May 20, 2025
@haircommander haircommander moved this from Proposed for consideration to Triage in SIG Node 1.34 KEPs planning May 20, 2025
@sanposhiho sanposhiho added the lead-opted-in Denotes that an issue has been opted in to a release label May 21, 2025
@jenshu
Contributor

jenshu commented May 21, 2025

@pohly is this targeting alpha or beta for 1.34?

@pohly
Contributor Author

pohly commented May 28, 2025

This will remain in alpha for 1.34. No KEP update is planned for this cycle, but there might be some minor code PRs.

@katcosgrove
Contributor

Hi, @pohly! To be clear, any code changes that introduce behavioral changes or user-facing changes, no matter how minor, still require tracking by the release team even if you will not be graduating to the next stage. Just making sure that's understood, since there was significant confusion around this in the last cycle. If they're refactors or bug fixes, that's totally fine and we don't need to track it.

@pohly
Contributor Author

pohly commented May 29, 2025

In this case, "behavioral changes or user-facing changes" would imply updating the KEP upfront. I can't think of anything right now, so I suppose it's fine to not track it.
