DRA: device taints and tolerations #5055
Comments
/sig node |
I think sig-scheduling lead needs to opt-in for this feature for it to be tracked for 1.33. |
/assign |
/label lead-opted-in |
Hello @pohly 👋, v1.33 Enhancements team here. Just checking in as we approach enhancements freeze on 02:00 UTC Friday 14th February 2025 / 19:00 PDT Thursday 13th February 2025. This enhancement is targeting stage Here's where this enhancement currently stands:
For this KEP, we would need to update the following:
It looks like #5034 will address most of these issues.
The status of this enhancement is marked as If you anticipate missing enhancements freeze, you can file an exception request in advance. Thank you! |
Hi @pohly 👋, 1.33 Enhancements team here, Just a quick friendly reminder as we approach the enhancements freeze later this week, at 02:00 UTC Friday 14th February 2025 / 19:00 PDT Thursday 13th February 2025. The current status of this enhancement is marked as If you anticipate missing enhancements freeze, you can file an exception request in advance. Thank you! |
Hello @pohly 👋, 1.33 Enhancements team here. Now that PR #5034 has been merged, all the KEP requirements are in place and merged into k/enhancements. Before the enhancements freeze, it would be appreciated if the following nits could be addressed:
Aside from the minor nits mentioned above, this enhancement is all good for the upcoming enhancements freeze. 🚀 The status of this enhancement is now marked as /label tracked/yes |
Thanks for the update and for clarifying the changes for KEP-5055. I’ll go ahead and mark this as tracked again, since it was previously marked as removed from the milestone on the 1.33 project board after the docs PR was closed. (cc Docs Lead: @rayandas, we might need to track the reopened Docs PR kubernetes/website#49822) |
@pohly After reading the KEP, I have a couple of questions from the perspective of Node autoscaling. Do we expect device taints to behave, and be interacted with, similarly to Node taints? Cluster Autoscaler uses existing Nodes to reason about what a new Node from the same NodeGroup would look like. While doing that, it has to ignore Node taints that won't be present on a new Node. This includes well-known status/transient taints, but CA also allows the user to configure which taints are "startup" (can be present on every Node at the beginning, then are expected to disappear), and which are "status" (can be present on only some Nodes, new Nodes don't automatically get them). Any other taint will be copied to the NodeGroup template and prevent CA from scaling up for pods that don't tolerate it (even if the taint were to disappear later on). Should we just extend these mechanisms and the general taint handling to device taints and essentially treat them the same as Node taints? Or are there any considerations that make device taints different from Node taints wrt the logic above? |
For taints applied by the admin we probably should literally do what the admin says: if the For taints published by the driver it's as ambiguous as node taints. They may or may not be "startup", so I suppose a configuration option will be needed. |
Are you saying we should treat some DeviceTaintRules as special cases? Can we easily detect if a DeviceTaintRule "applies to all devices of a driver regardless of the pool"? FYI, I updated kubernetes/autoscaler#7947 accordingly. |
I think Patrick is suggesting that in general, taints defined by DeviceTaintRules are more likely to be like "startup" taints (i.e. can be extrapolated to new nodes), whereas taints defined in ResourceSlices could very well be either "startup" or "status." There's nothing stopping some kind of controller from managing "status"-like taints in DeviceTaintRules though, so that's not a perfect heuristic. One issue CA might have making that distinction though is that the ResourceSlice tracker doesn't record the origin of taints, so it's impossible to tell only by looking at a ResourceSlice from the tracker whether or not a taint came from a DeviceTaintRule. Maybe that's only a sign that "from DeviceTaintRule => 'startup'" isn't the right line to draw in the sand. IMO if some user configuration to make the startup/status distinction would be required for taints defined by drivers in ResourceSlices, that same configuration should also be required and apply the same way for taints in DeviceTaintRules. |
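To make the startup/status discussion above concrete, here is a minimal sketch of the filtering being described, applied uniformly to device taints regardless of whether they came from a ResourceSlice or a DeviceTaintRule. It is not Cluster Autoscaler code: the `Taint` type, the function names, and the `example.com/...` keys are all hypothetical, and it assumes that both "startup" and "status" taints are simply dropped when building the template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    # Hypothetical stand-in for a node or device taint.
    key: str
    value: str = ""
    effect: str = "NoSchedule"


def template_taints(observed, startup_keys, status_keys):
    """Return the taints that would be copied to a NodeGroup template.

    Assumption: taints configured as "startup" or "status" are not
    expected on a fresh node, so they are left out; everything else
    is kept, whatever its origin.
    """
    ignored = set(startup_keys) | set(status_keys)
    return [t for t in observed if t.key not in ignored]


def blocks_scale_up(template, tolerated_keys):
    """True if any remaining template taint is not tolerated by the pod."""
    tolerated = set(tolerated_keys)
    return any(t.key not in tolerated for t in template)


# Taints observed on the devices of an existing node (made-up keys).
observed = [
    Taint("example.com/initializing"),  # configured as "startup"
    Taint("example.com/unhealthy"),     # configured as "status"
    Taint("example.com/reserved"),      # neither: copied to the template
]

template = template_taints(
    observed,
    startup_keys={"example.com/initializing"},
    status_keys={"example.com/unhealthy"},
)

# Only the unclassified taint survives, so a pod that does not tolerate
# it would not trigger a scale-up in this model.
print([t.key for t in template])
print(blocks_scale_up(template, tolerated_keys=set()))
print(blocks_scale_up(template, tolerated_keys={"example.com/reserved"}))
```

The point of the sketch is that the filter only looks at the taint key and the user's configuration, so it does not need to know the taint's origin, which matches the observation that the ResourceSlice tracker does not record it.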
Hi @pohly 👋, 1.34 Enhancements Lead here. I am closing the v1.33 milestone now. If you'd like to work on this enhancement in v1.34, please have the SIG lead opt-in by adding the Thanks! /remove-label lead-opted-in |
@pohly is this targeting alpha or beta for 1.34? |
This will remain in alpha for 1.34. No KEP update is planned for this cycle, but there might be some minor code PRs. |
Hi, @pohly! To be clear, any code changes that introduce behavioral changes or user-facing changes, no matter how minor, still require tracking by the release team even if you will not be graduating to the next stage. Just making sure that's understood, since there was significant confusion around this in the last cycle. If they're refactors or bug fixes, that's totally fine and we don't need to track it. |
In this case, "behavioral changes or user-facing changes" would imply updating the KEP upfront. I can't think of anything right now, so I suppose it's fine to not track it. |
Enhancement Description
(k/enhancements) update PR(s):
(k/k) update PR(s): device taints and tolerations (KEP 5055) kubernetes#130447
(k/website) update PR(s): DRA: device taints and tolerations website#49822
(k/enhancements) update PR(s): none
(k/k) update PR(s):
(k/website) update PR(s):
Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.