dynamic resource allocation #3063
Comments
/assign @pohly
Do we have a discussion issue on this enhancement?
@ahg-g: by "discussion issue" do you mean a separate issue in some repo (where?) in which arbitrary comments are welcome? No, not at the moment. I've also not seen that done elsewhere before. IMHO, at this point the open KEP PR is a good place to collect feedback and questions. I also intend to come to the next SIG-Scheduling meeting.
Yeah, this is what I was looking for; the issue would be under the k/k repo.
That is actually the common practice: one starts a feature request issue where the community discusses initial ideas and the merits of the request (look for issues with the kind/feature label).
But the community has no idea what this is about yet, so it is better to have an issue beforehand that discusses "What would you like to be added?" and "Why is this needed?". Also, meetings are attended by fairly small groups of contributors, so having an issue tracking the discussion is important IMO.
In my work in SIG-Storage I've not seen much use of such a discussion issue. Instead, my impression is that the use of "kind/feature" is discouraged nowadays; https://github.com/kubernetes/kubernetes/issues/new?assignees=&labels=kind%2Ffeature&template=enhancement.yaml explicitly says:
This proposal was discussed with various people beforehand; now we are in the formal KEP phase. But I agree, it is hard to provide a good link to those prior discussions.
We use that in SIG-Scheduling, and it does serve as a very good place for initial rounds of discussion; discussions on Slack and in meetings are hard to reference, as you pointed out. I still have no idea what this is proposing, and I may not attend the next SIG meeting, for example...
Hi @ ! 1.24 Enhancements team here.
The status of this enhancement is tracked as
The Enhancements Freeze is now in effect and this enhancement is removed from the release. /milestone clear
Hi Patrick, `tracked/yes` will be applied when the KEP is merged and all
requirements for Enhancement Freeze are met. We are definitely keeping an
eye on this! Feel free to ping me once it's merged
…On Tue, Mar 1, 2022 at 11:17 AM Patrick Ohly ***@***.***> wrote:
@gracenng <https://github.com/gracenng> : an exception was requested and
granted for this enhancement to move to GA in 1.24:
https://groups.google.com/g/kubernetes-sig-release/c/sUpd2H1wxnk/m/lL_I6GT-BwAJ
Can you label it again as "tracked/yes"?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Hello @pohly 👋, 1.25 Enhancements team here. Just checking in as we approach the Enhancements Freeze at 18:00 PST on Thursday, June 16, 2022. For note, this enhancement is targeting stage alpha. Here's where this enhancement currently stands:
It looks like #3064 will address everything in this list. For note, the status of this enhancement is marked as
I appreciate the breakdown. That said, beta doesn't really exist: there's alpha (off by default), GA with low confidence, and GA with high(er) confidence. I'm very reluctant to "beta" (GA with low confidence) this if we don't have a plan for how it will evolve to support autoscaling.
The template KEP is that plan.
I will keep reading.
I don't find that a "minimum viable product". No one is going to use this, so we are not going to get more feedback even if we do promote this subset to beta. It also sounds like we would need to implement new functionality that never was available as alpha, so how can we go to beta with it straight away?

The other downside is that we have to start adding more feature gate checks for specific fields, with all the associated logic (drop alpha fields, but only if not already set). This adds work and complexity, and thus a risk of introducing new bugs.

If we have to reduce the scope for beta, then I would slice up the KEP differently, if (and only if) needed. But I am not going to dive into the how because of this: I asked in #4384 (comment) how many different feature gates we need in 1.30 when everything is still alpha. Let me repeat the key point: perhaps we don't need to decide now? We could continue to use the existing feature gate.

The practical advantage is that for 1.30 we can skip the entire discussion around how to promote this and instead have that discussion later, for example in a working session at the KubeCon EU 2024 contributor summit (I have submitted a session proposal). It also makes the 1.30 implementation simpler (no additional feature gate checks).
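For readers less familiar with the "drop alpha fields, but only if not already set" logic mentioned above, here is a minimal sketch of that pattern. The type and field names are hypothetical, not the actual DRA API; it only illustrates why each separately gated field adds extra handling.

```go
package main

import "fmt"

// ClaimSpec is a hypothetical stand-in for an API type with one field that
// is guarded by its own feature gate.
type ClaimSpec struct {
	ResourceClassName string
	AlphaKnob         *string // gated field (hypothetical)
}

// dropDisabledFields clears gated fields on create/update, but only if the
// old object did not already use them -- otherwise an update made while the
// gate is off would silently wipe data written while it was on.
func dropDisabledFields(newSpec, oldSpec *ClaimSpec, gateEnabled bool) {
	if gateEnabled {
		return
	}
	if oldSpec != nil && oldSpec.AlphaKnob != nil {
		return // field already in use, keep it
	}
	newSpec.AlphaKnob = nil
}

func main() {
	v := "on"
	spec := &ClaimSpec{ResourceClassName: "example", AlphaKnob: &v}
	dropDisabledFields(spec, nil, false)
	fmt.Println(spec.AlphaKnob == nil) // true: the gated field was dropped on create
}
```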
The ResourceClaimTemplate is not what enables autoscaling. It solves the problem of per-pod resource claims when pods get generated by an app controller. This part also doesn't seem to be controversial, at least not anymore after I changed to dynamically generated names 😉. My plan for supporting autoscaling is numeric parameters.
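To make the template/claim distinction concrete, here is a minimal sketch of a pod spec that asks for a per-pod generated claim. It assumes the alpha API shape from roughly Kubernetes 1.26-1.30 (core/v1 `PodSpec.ResourceClaims` with a `ClaimSource`); exact field names have changed across releases, so treat this as illustrative only.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func podWithGeneratedClaim() corev1.PodSpec {
	templateName := "gpu-claim-template" // name of a ResourceClaimTemplate, example only
	return corev1.PodSpec{
		Containers: []corev1.Container{{
			Name:  "workload",
			Image: "registry.example.com/app",
			Resources: corev1.ResourceRequirements{
				// The container opts in to the claim declared at pod level.
				Claims: []corev1.ResourceClaim{{Name: "gpu"}},
			},
		}},
		ResourceClaims: []corev1.PodResourceClaim{{
			Name: "gpu",
			// Referencing a template (rather than a fixed claim name) means
			// every pod stamped out by a Deployment or Job gets its own
			// generated ResourceClaim instead of all sharing one.
			Source: corev1.ClaimSource{ResourceClaimTemplateName: &templateName},
		}},
	}
}

func main() {
	spec := podWithGeneratedClaim()
	fmt.Println(*spec.ResourceClaims[0].Source.ResourceClaimTemplateName)
}
```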
Yes - in the breakdown, the template and numerical parameter functionality is combined into one KEP. That's what I meant when I said that KEP is the plan. What's "controversial" isn't the template API per se, but the way it introduces complexity with scheduling. The numerical parameters will reduce that considerably.

I agree it was too aggressive to suggest even the scoped-down thing for beta in 1.30. You may be right that we can postpone the debate since we are staying all in alpha. But if we want a chance of delivering the solution in smaller, digestible chunks, I think we have to work out the right API now, and I don't think it is quite there yet even for the basic ResourceClaim.

My suggestion is that the user-owned ResourceClaim API is under-specified as written, because instead of the user specifying the node, one is picked at random during scheduling. So it's sort of unusable in the manual flow except for network-attached resources. Before we automate something (i.e., add templating and automatic scheduling), we need the manual flow to work. And I do think that if you give people an API that solves their use case, even with a little more manual prep-work / client-side work, people will use it.

Along those lines, the change is small. You just need to require the user to pick a node during the creation of the ResourceClaim (for non-network-attached resources); then users can pre-provision nodes with pools of associated resources and label those sets of nodes. This makes it an actually usable API, and makes the functionality composable: the automation (templates) builds directly on top of the manual process. In fact, I think we can even push delayed allocation out of scope for the MVP and still have something very useful. Typical UX would be:
This is a reasonable UX which will certainly be used. The scope of this is much, much smaller and simpler than the current base DRA KEP.
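As a rough illustration of that node-pinned manual flow, here is what the data might look like. The NodeName field and the surrounding types are hypothetical; the current ResourceClaim API has no such field, so this is only a sketch of the suggestion, not an existing API.

```go
package main

import "fmt"

// Hypothetical stand-ins for the scoped-down API being discussed.
type ManualClaimSpec struct {
	ResourceClassName string
	// NodeName pins the claim to a node the admin has pre-provisioned and
	// labeled, so no scheduler-driven allocation is needed.
	NodeName string
}

type ManualClaim struct {
	Name string
	Spec ManualClaimSpec
}

func main() {
	// 1. Admin pre-provisions node "gpu-node-1" with GPUs and labels it.
	// 2. User creates a claim pinned to that node.
	claim := ManualClaim{
		Name: "training-gpu",
		Spec: ManualClaimSpec{ResourceClassName: "example.com/gpu", NodeName: "gpu-node-1"},
	}
	// 3. The pod references the claim and uses a nodeSelector/affinity for
	//    the same set of labeled nodes, so scheduling stays simple; the
	//    template-based automation can later build on this manual flow.
	fmt.Printf("claim %s pinned to %s\n", claim.Name, claim.Spec.NodeName)
}
```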
We can build on #3063 (comment) with a focused follow-up change to PodSchedulingContext: one that allows a kubelet to demur and decline to accept the Pod for arbitrary reasons. In other words, a kubelet could look at the existing attached resources, and the node as it's running right now, and inform the control plane that there's no such GPU, or that a different Pod is already using that NUMA partition, or that the phase of the moon is wrong…

At that stage, this doesn't need to mean clever scheduling and doesn't actually count as dynamically allocating any resources. Maybe all the candidate nodes decline and the scheduler eventually gives up trying. Cluster autoscalers wouldn't be trying to make new nodes because the

It's basic. However, just as @johnbelamaric explained, it's useful to some folk. The ability for a kubelet to demur through an update to PodSchedulingContext would support a bunch of related user stories, even if there are many others that still need work. If we go this route, where's a good place to take that discussion?
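A hedged sketch of what "a kubelet demurs" could look like as data. The types below are made up for illustration, loosely modelled on the idea of PodSchedulingContext; they are not the real API and not a concrete proposal.

```go
package main

import "fmt"

// NodeDecision records that a kubelet declined (or accepted) a pod, and why,
// so the scheduler can try another candidate node. Hypothetical type.
type NodeDecision struct {
	NodeName string
	Accepted bool
	Reason   string // free-form, e.g. "no such GPU", "NUMA partition busy"
}

type SchedulingContextStatus struct {
	Decisions []NodeDecision
}

// nextCandidate returns the first potential node that has not declined yet.
func nextCandidate(potential []string, status SchedulingContextStatus) (string, bool) {
	declined := map[string]bool{}
	for _, d := range status.Decisions {
		if !d.Accepted {
			declined[d.NodeName] = true
		}
	}
	for _, n := range potential {
		if !declined[n] {
			return n, true
		}
	}
	return "", false // every candidate demurred; the scheduler gives up or retries later
}

func main() {
	status := SchedulingContextStatus{Decisions: []NodeDecision{
		{NodeName: "node-a", Accepted: false, Reason: "no such GPU"},
	}}
	if n, ok := nextCandidate([]string{"node-a", "node-b"}, status); ok {
		fmt.Println("try", n) // try node-b
	}
}
```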
I don't get how templates add complexity for scheduling. The scheduler needs to wait for the created ResourceClaim, but that's all. That's the same as "wait for the user to create a ResourceClaim"; it doesn't make the scheduling more complex. Templates are not related to which node is picked.
The "I want this claim for node xyz" doesn't need to be in the
So when a deployment is used, all pods reference the same ResourceClaim? Then all pods run on the same node, using the same hardware resource. I don't see how you intend to handle this. This will require some new kind of API, one which will become obsolete once we have what people really want (automatic scheduling). If you think that this is doable, then this deserves a separate KEP which explains all the details and what that API would look like. It's not just some reduced DRA KEP.
Who are those folks? This seems very speculative to me.
PodSchedulingContext is what people are trying to avoid...
Write a provisional KEP, submit it. We can then meet at KubeCon EU to discuss face-to-face or set up online meetings.
Yeah, I think you're right, this doesn't quite work, and templates are probably the fix. The goal, as you said, is to avoid PodSchedulingContext, not templates really. I still think it's possible to create a scoped-down but still useful API that accomplishes that.
Today people solve this by grabbing the whole node and/or running privileged pods. This API avoids that, allowing an administrator to pre-allocate resources via the node-side (privileged) drivers, without requiring the user pod to have those privileges. Those would be the users of this initial API.
This is a pretty big statement. I worry that the things we need for manual selection may be the things we DON'T WANT for automation. Giving the user too much control can be an attractive nuisance which makes the "real" solution harder. Pre-provisioned volumes are a lesson I took to heart.
In this model, what is the value of ResourceClaims above simple device plugins? Or maybe I misunderstand the proposal? I read this as: label nodes with GPUs, use node selectors, use device plugins to get access to GPUs. Which I think is what people already do, right? IOW, it uses node management as a substitute for (coarse) resource management. I am certainly seeing a trend of people running (effectively) one workload pod per node, which fits this model. If we tidy that up and codify it (even just saying "here's a pattern you should not feel bad about"), does it relieve some of the pressure?
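For reference, the existing pattern described here looks roughly like this: coarse placement via a node label plus a counted extended resource advertised by a device plugin. The label key and resource name below are examples, not established conventions.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func gpuPodSpec() corev1.PodSpec {
	return corev1.PodSpec{
		// Coarse placement: only land on nodes labeled as having this GPU type.
		NodeSelector: map[string]string{"example.com/gpu-model": "a100"},
		Containers: []corev1.Container{{
			Name:  "trainer",
			Image: "registry.example.com/trainer",
			Resources: corev1.ResourceRequirements{
				// Counted resource exposed by the device plugin; the scheduler
				// can reason about it without any DRA machinery.
				Limits: corev1.ResourceList{
					"example.com/gpu": resource.MustParse("1"),
				},
			},
		}},
	}
}

func main() {
	fmt.Println(len(gpuPodSpec().Containers))
}
```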
A manual (external) flow could work for things that you can attach to nodes, especially if they are then dedicated to the thing they attach to. Device plugins provide the ability for the attached thing to work, but they don't attach it for you. Something like https://cloud.google.com/compute/docs/gpus/add-remove-gpus - something triggers making a ResourceClaim, and automation fulfils it by attaching a GPU to a VM. NFD runs and labels the node. There is probably a device plugin in this story; anyway, we end up with a node that has some GPU capacity available.

Next, someone runs a Job that selects for that kind of GPU, and it works. However, what's possibly missing at this stage is the ability to reserve that GPU resource from the claim and avoid it being used by other Pods. If we want to run a second Pod, maybe we're able to attach a second GPU, avoiding the one-pod-per-node problem. We aren't doing DRA, but we have helped some stories and narrowed what still needs delivering.
Those were also what came to my mind when I read @johnbelamaric's outline. Pre-provisioned volumes have been replaced by CSI volume provisioning, but now that also is stuck having to support the complicated "volume binding" between PVC and PV. The result is that there are still race conditions that can lead to leaked volumes. Let's not repeat this for DRA.
Here's why I think it could be different.
However:
Let's say you're attaching GPUs to a node, and you make a ResourceClaim to specify there should be 2 GPUs attached. The helper finds there's already one GPU manually / previously attached. How about we specify that this is not only undefined behaviour, but that we expect drivers to taint the node if they see it? No need to have complex logic around manual and automatic provisioning; something is Wrong and the node might not be any good now. If the helper is told to add 2 GPUs for a total of 2 attached GPUs and finds 0 attached GPUs: great! We get nodes with GPUs, no need to taint, and other consequences such as NFD and device plugin registration can all happen. Does that work?
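A small sketch of that "taint on unexpected state" behaviour. The taint key and the helper function are invented for illustration; they are not part of any existing driver.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// reconcileAttachedGPUs compares what the claim asked for with what the
// helper actually found on the node and decides whether to taint.
func reconcileAttachedGPUs(requested, alreadyAttached int, node *corev1.Node) error {
	if alreadyAttached != 0 {
		// Someone attached GPUs outside of this flow; rather than trying to
		// merge manual and automatic provisioning, mark the node as suspect.
		node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
			Key:    "example.com/unexpected-gpu-state", // hypothetical taint key
			Effect: corev1.TaintEffectNoSchedule,
		})
		return fmt.Errorf("expected 0 pre-attached GPUs before adding %d, found %d", requested, alreadyAttached)
	}
	// Nothing pre-attached: attach the requested GPUs and let NFD plus the
	// device plugin pick them up afterwards.
	return nil
}

func main() {
	node := &corev1.Node{}
	if err := reconcileAttachedGPUs(2, 1, node); err != nil {
		fmt.Println("tainted node:", err)
	}
}
```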
Fair point. But I think the trouble comes in more when you don't have a clear sense of ownership - that is, mixed manual and automated flows. If the automated flows have full ownership (overriding anything the user may have done), much of that trouble goes away.
I am not sure there is one, except that we are rebuilding a model that can be further extended to full support. Another alternative may be to extend those existing mechanisms rather than invent a new one. My main goal with spitballing some alternatives is to see if we can deliver incremental scope in a useful but digestible way. My thinking is:
2 and 3 may have to go together, but I was hoping to find a solution to 1. With this approach, the functionality delivered in 1 can be done earlier because it is simpler, and it does not need to change materially as we implement 2 and 3, reducing risk. Admittedly I may be cutting things up wrong... but I think there is merit to the approach.
This can be done with the current API minus

Perhaps some hardware vendor will even write such a DRA driver for you, if some of their customers want to use it like this - I can't speak for either of them. This is why the KEP has "collect feedback" as one of the graduation criteria. This might even give you options 2 and 3. I am not sure whether I grasp the difference between them 🤷
I am not sure there is one!
Immediate provisioning doesn't ensure that the node on which the resource was provisioned can fit the pod's other dimensions, but maybe that's OK? The advantage of simple, stupid, counted resources is that the scheduler has all the information it needs about all requests.

What I am trying to get to, and I think John is aiming at the same goal (not that you're not, Patrick :), is to say what is the biggest problem people really experience today, and how can we make that better? A year ago I would have said it was GPU sharing. Now I understand (anecdotally) that sharing is far less important than simply getting out of the way for people who want to use whole GPUs.

Here's my uber-concern: k8s is the distillation of more than a decade of real-world experience running serving workloads (primarily) on mostly-fungible hardware. The game has changed, and we don't have a decade of experience. Anything we do right now has better than even odds of being wrong within a short period of time. The hardware is changing. Training vs. inference is changing. Capacity is crunched everywhere, and there's a sort of "gold rush" going on.

What can we do to make life better for people? How can we help them improve their efficiency and their time to market? Everything else seems secondary. I ACK that this is GPU-centric and that DRA does more than just GPUs.
I'm okay with promoting "core DRA minus PodSchedulingContext" to beta in 1.30, if that helps someone, somewhere. @klueska, @byako: I'll punt this to you. Just beware that it would mean that we need to rush another KEP for "PodSchedulingContext" for 1.30 and add a feature gate for that - I'm a bit worried that we are stretching ourselves too thin when we do that, and we also skip all of the usual "gather feedback" steps for beta. I'd much rather focus on numeric parameters...
That sounds like "numeric parameters", which we cannot promote to beta in 1.30. When using "numeric parameters", PodSchedulingContext is indeed not needed.
Right. The numeric parameters approach keeps emerging as a potentially better path forward, but I really don't think we know enough to proclaim any of this as beta. If there are things we can do that are smaller and more focused, which would solve some of the problems, I am eager to explore that. If we were starting from scratch RIGHT NOW, with no baggage, what would we be trying to achieve in 1.30?
Yeah. I concede: nothing in beta in 1.30. That was my original position anyway, but I thought perhaps if we scoped something way down but kept continuity, we could do it. But it's clearly a no-go.
Great question; that is what I am looking for along the lines of an MVP. What we really need to do is go back to the use cases for that, which is pretty hard on this tight timeline. The better option may be to ask "what would we be trying to achieve", cutting out the "in 1.30", and defer that question to 1.31. In the meantime, maybe we make some incremental steps in the direction we need in 1.30 based on what we know so far - something like @pohly is saying here plus numerical models.
Can you expand on this? What becomes core DRA?
I think it's clear at this point that nothing is going beta in 1.30. We will make sure to avoid the issue you are describing. I think we should err on the side of "failing to schedule" rather than "scheduling and failing".
/milestone clear |
Enhancement Description
One-line enhancement description: dynamic resource allocation
Kubernetes Enhancement Proposal: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3063-dynamic-resource-allocation
Discussion Link: CNCF TAG Runtime Container Device Interface (CDI) Working Group meeting(s)
Primary contact (assignee): @pohly
Responsible SIGs: SIG Node
Enhancement target (which target equals to which milestone):
- Alpha (1.26)
  - `k/enhancements` update PR(s):
  - `k/k` update PR(s): dynamic resource allocation kubernetes#111023
  - `k/website` update PR(s):
- Alpha (1.27)
  - `k/enhancements` update PR(s):
  - `k/k` update PR(s):
  - `k/website` update PR(s):
- Alpha (1.28)
  - `k/enhancements` update PR(s):
  - `k/k` update PR(s):
  - `k/website` update PR(s): dra: update for Kubernetes 1.28 website#41856
- Alpha (1.29)
  - `k/enhancements` update PR(s):
  - `k/k` update PR(s):
  - `k/website` update PR(s):
- Alpha (1.30)
  - `k/enhancements` update PR(s):
  - `k/k` update PR(s):
  - `k/website` update PR(s):