
KEP 2999: adding new KEP for host network pod IPs #3000

Closed
wants to merge 1 commit

Conversation

aojea
Member

@aojea aojea commented Oct 4, 2021

  • One-line PR description: adding new KEP
  • Other comments:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 4, 2021
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/network Categorizes an issue or PR as relevant to SIG Network. labels Oct 4, 2021
@aojea aojea mentioned this pull request Oct 4, 2021
@aojea
Member Author

aojea commented Oct 4, 2021

/assign @danwinship @thockin @dcbw @khenidak

should be approved by the remaining approvers and/or the owning SIG (or
SIG Architecture for cross-cutting KEPs).
-->
# KEP-NNNN: Your short, descriptive title
Contributor

^^^


## Motivation

Kubelet reports the updated IP in node.Status, but host-network pods keep reporting an IP that no longer exists, which is misleading and creates confusion.
Contributor

There is also some confusion / inconsistency about where host-network pod IPs come from; kubelet will fill them in from node.status.addresses if CRI leaves them unset, but if CRI sets the pod's IPs itself, kubelet won't touch them. (This is bad because, eg, the runtime might not know whether the node is supposed to be dual-stack or not, so it may fill in a single-stack IP, and then kubelet won't add the second node IP to that.)

cri-o leaves the pod IPs unset for host-network pods but I'm not sure what other CRI implementations do
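For illustration, a minimal sketch of the current behavior described above (a hypothetical helper, not the actual kubelet code): IPs reported by the runtime win, and only when the runtime leaves them empty does kubelet fall back to the node's addresses.

```go
package main

import (
	"fmt"
	"net"
)

// hostNetworkPodIPs mirrors the behavior described above: if the runtime
// (CRI) reported IPs, kubelet keeps them untouched; only when the runtime
// left them empty does kubelet fall back to the node's primary IPs
// (from node.status.addresses). Hypothetical helper, for illustration only.
func hostNetworkPodIPs(runtimeIPs, nodeHostIPs []net.IP) []net.IP {
	if len(runtimeIPs) > 0 {
		// Runtime wins: even on a dual-stack node, a single-stack IP
		// reported by the runtime is not completed with the second family.
		return runtimeIPs
	}
	return nodeHostIPs
}

func main() {
	node := []net.IP{net.ParseIP("10.0.0.100"), net.ParseIP("fd00::100")}
	fmt.Println(hostNetworkPodIPs(nil, node))                                 // dual-stack, taken from the node
	fmt.Println(hostNetworkPodIPs([]net.IP{net.ParseIP("10.0.0.100")}, node)) // runtime value kept as-is
}
```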

Contributor

+1
This is an excellent point. I do believe that PodIPs for host-net pods should always match node.status.addresses, irrespective of what the CRI reports.

Member Author

@aojea aojea Oct 5, 2021

TIL, but that is not a goal of the KEP; the goal is only to update the PodIP whenever it changes, whether by the CRI or the kubelet.

Contributor

Sorry, I'm somewhat confused about what the final state of this proposal will be.

What will be the source of truth for Pod.Status.PodIP and Pod.Status.PodIPs[]? Will they be populated from the Kubelet? From CRI? From both?

If both... what is the reconciliation logic? How would we manage a delta?

If this is already implemented somewhere in the kubelet, I am assuming we just defer to that logic? Which begs the question: is that logic worth revisiting?

Member Author

The proposed implementation is here kubernetes/kubernetes#105410
@danwinship I can only see that the IPs are obtained from the Node.Status.Addresses
https://github.com/kubernetes/kubernetes/blob/master/pkg/util/node/node.go#L99

but if CRI sets the pod's IPs itself, kubelet won't touch them.

how can that happen?

Contributor

TIL, but is not a goal of the KEP, the goal is only to update the PodIP whenever it changes, CRI or Kubelet

But right now, if CRI sets the pod IP itself, then kubelet never touches it. If we unconditionally change it when it changes then that means that with some runtimes, if you start a hostNetwork pod on a dual-stack host, it will be single-stack, but if you then change the IPv4 address of the node, then the pod would become dual-stack. That's obviously wrong, so either we need to say that kubelet won't update the pod IP of pods where it didn't initially set the pod IP itself (which would mean we would have to remember somewhere whether we had set it ourselves or not), or else (as Kal suggested) we should override CRI in this case if it did try to set the pod IP itself.
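To make the two alternatives above concrete, a minimal sketch (hypothetical helper and naming, not the KEP's proposal or kubelet code) of the "override CRI for host-network pods" option:

```go
package main

import (
	"fmt"
	"net"
)

// reconcileHostNetworkPodIPs sketches the "override CRI" option discussed
// above: for host-network pods the node's current primary IPs always win, so
// a node IP change (or a single-stack -> dual-stack change) shows up in
// pod.status.podIPs. Hypothetical helper, for illustration only.
func reconcileHostNetworkPodIPs(hostNetwork bool, runtimeIPs, nodeHostIPs []net.IP) []net.IP {
	if hostNetwork {
		// node.status.addresses is the single source of truth for
		// host-network pods, regardless of what the CRI reported.
		return nodeHostIPs
	}
	// Pod-network pods keep whatever the runtime/CNI reported.
	return runtimeIPs
}

func main() {
	node := []net.IP{net.ParseIP("10.0.0.200"), net.ParseIP("fd00::200")}
	criReported := []net.IP{net.ParseIP("10.0.0.100")} // stale / single-stack value
	fmt.Println(reconcileHostNetworkPodIPs(true, criReported, node)) // node addresses win
}
```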

Contributor

@danwinship danwinship Oct 6, 2021

@aojea https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_pods.go#L1522
If podIP was set by the runtime, then kubelet never calls GetNodeHostIPs, i.e. it doesn't copy the host IP(s) to the pod IP(s).

Member Author

Why would the CRI set IPs on a host-network pod? There are no CNI calls at all, so that doesn't seem like a realistic scenario; a hostNetwork pod doesn't report IPs in either CRI-O or containerd:
https://github.com/cri-o/cri-o/blob/8d4d158935929800c4300b82eb4b5a83ded400f4/server/sandbox_network.go#L34-L36

https://github.com/containerd/containerd/blob/4921fb6b63e5476e54fb7e2cf93c3b505f1c09cb/pkg/cri/store/sandbox/sandbox.go#L40-L41


## Summary

Host-network pods inherit their Pod IPs from the node where they are running. Nodes may change their IPs during their lifecycle, and kubelet updates node.Status accordingly; however, host-network pod IPs are not updated and keep reporting the old IPs until the pods are restarted/recreated.
Contributor

(line-wrap!)

Contributor

s/Pod IPs/PodIPs/g

(there is no standard, but I noticed that is how we usually refer to them).


### Risks and Mitigations

Applications running in a host-network pod that depend on the Downward API or on Pod.Status.PodIPs for auto-configuration should take into account that the value can change. However, such applications would most probably not be working correctly in that situation anyway.
Contributor

Well, there are two cases here:

  1. a node IP is removed (which is what you're talking about)
  2. the node's "primary node IP(s)" change in some way but the old primary node IP(s) still exist
    • A. the node goes from single-stack to dual-stack and so starts using dual-stack podIPs for new host-network pods.
    • B. the node goes from dual-stack to single-stack and so stops using dual-stack podIPs, although presumably the secondary-IP-family node IP still exists even though kubelet is ignoring it.
    • C. the node's .Status.Addresses gets reordered in a way that causes a different IP to become primary. (eg, kubelet gets restarted with a different --node-ip value, or the user modifies the cloud configuration for the node to, eg, swap two virtual NICs)

In all of the case 2 scenarios, the pod's old IP continues to be at least partially valid/usable.

Member Author

good point, worth clarifying

Contributor

@khenidak khenidak left a comment

There is a missing user story around static pods and how they are racy. They can start before CloudProvider sets Node.Status.Addresses.

I also want - for posterity - to outline the idea that host-net pods should always listen on ::0 if they want to avoid node IP change issues.
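As a small, generic illustration of that mitigation (not from the KEP): a host-network server that binds the wildcard address is reachable on every node IP, so it does not depend on which node IP kubelet publishes or whether that IP later changes.

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "served regardless of which node IP the client used")
	})
	// ":8080" binds the unspecified/wildcard address; on a dual-stack host
	// this typically accepts connections on every IPv4 and IPv6 node IP,
	// so a node IP change does not affect the listener.
	if err := http.ListenAndServe(":8080", nil); err != nil {
		panic(err)
	}
}
```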


## Summary

Host-network pods inherit their Pod IPs from the node where they are running. Nodes may change their IPs during their lifecycle, and kubelet updates node.Status accordingly; however, host-network pod IPs are not updated and keep reporting the old IPs until the pods are restarted/recreated.
Contributor

@krisnova krisnova Oct 5, 2021

Host-network pods inherit their Pod IPs from the node where they are running.

I have never found good documentation on this, but there are some considerations for a host with multiple NICs with regard to hostNet: true pods. Most Linux network managers (tested: systemd-networkd and NetworkManager) enable you to define a default route. So far I have only seen pods pull the same IP based on the default route, although I can't back this up with paperwork. I haven't looked that far into the code (yet?).

However I am wondering how this change would respond in the following situations given this observation.

Given a node with the following criteria

  • eno1 10.0.0.100 (default route)
  • eno2 192.168.1.1

  • What is the expected Pod behavior if the default route changes from eno1 -> eno2?
  • What is the expected Pod behavior if eno1 is disabled (deactivated) and eno2 has a valid connection, but is not the default route?
  • What is the expected Pod behavior if a new NIC is added, but the default route remains?
  • What is the expected Pod behavior if eno1 at 10.0.0.100 remains the default route in the route table, but is assigned a NEW IP address 172.0.100.100?
  • What is the expected Pod behavior if eno1 at 10.0.0.100 remains the default route in the route table, but is assigned a SECONDARY IP address 172.0.100.100?

tldr; that default route though

Contributor

@khenidak khenidak Oct 5, 2021

There are two types of node IP changes in this context:

  1. Soft change: this happens because of the delta between what kubelet sees (the default first IPv4, iirc), and thus what static pods see (if they start before the cloud provider's set-IP step), and what something like the cloud provider sets, which can be dual-stack, in a different order, or even a different IP.
  2. Hard change: like the one you described, or what @aojea described (e.g., DHCP resets); to be more specific, a reconfiguration of the networking stack of the node.

The first type has no bearing on anything. It just says the pod is reachable using a different set of addresses (assuming the pod listens on ::0 or whatever the equivalent is in other languages). The node itself didn't change its own network stack configuration. To be more specific: the node only changed the IPs it reports to Kubernetes from {<IP>} to !{<IP>}, but the node owned and still owns the joined set of IPs. Please do note that the srcIP for a host-net pod does not have to == the pod IP; this is subject to how the node networking stack is configured (similar to the srcIP for pod-net pods, which is subject to the CNI and its SNAT rules). This is the most common change we see.

The second type also, arguably, should result in no change to the pod (again assuming it listens on ::0 and shares the host network namespace). To present it differently, let us assume the node retooled its entire networking stack (assuming the CNI is smart enough to deal with this and there are no problems with pods in pod-net): what happens to kubelet? In practice kubelet runs as a pod in host-net (or as any other app in the host's default net namespace). Depending on the severity of the change, existing connections may need to be terminated and re-established, but it does not require kubelet to restart (again assuming it listens on ::0).

Maybe we should limit the KEP to the first type? @aojea

Member Author

Although I can't back this up with paperwork. I haven't looked that far into the code (yet?).

the code is here if you want to have some fun :) https://github.com/kubernetes/kubernetes/blob/6ee6251fea608ff64ff69314624d26fc6b3b61ac/pkg/kubelet/nodestatus/setters.go#L60-L255

There are different scenarios; let's discard the cloud provider one and focus on bare metal / non-cloud-provider. The code says:


 			// 1) Use nodeIP if set (and not "0.0.0.0"/"::")
			// 2) If the user has specified an IP to HostnameOverride, use it
			// 3) Lookup the IP from node name by DNS
			// 4) Try to get the IP from the network interface used as default gateway
			//
			// For steps 3 and 4, IPv4 addresses are preferred to IPv6 

if we are in step 4, we execute https://github.com/kubernetes/kubernetes/blob/6ee6251fea608ff64ff69314624d26fc6b3b61ac/staging/src/k8s.io/apimachinery/pkg/util/net/interface.go#L463-L468

// ResolveBindAddress returns the IP address of a daemon, based on the given bindAddress:
// If bindAddress is unset, it returns the host's default IP, as with ChooseHostInterface().
// If bindAddress is unspecified or loopback, it returns the default IP of the same
// address family as bindAddress.
// Otherwise, it just returns bindAddress.
func ResolveBindAddress(bindAddress net.IP) (net.IP, error) {
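For illustration, a minimal (hedged) way to exercise that function outside of kubelet, assuming the k8s.io/apimachinery/pkg/util/net import path quoted above; per its doc comment, an unset bindAddress returns the host's default IP, i.e. the IP of the default-route interface that step 4 ends up publishing:

```go
package main

import (
	"fmt"

	utilnet "k8s.io/apimachinery/pkg/util/net"
)

func main() {
	// With bindAddress unset (nil), ResolveBindAddress returns the host's
	// default IP, as with ChooseHostInterface(): the IP on the interface
	// that holds the default route.
	ip, err := utilnet.ResolveBindAddress(nil)
	if err != nil {
		panic(err)
	}
	fmt.Println("default host IP (step 4 candidate):", ip)
}
```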

It is important to mention that these are host-network pods, so they have ALL the IPs of the host and they run in the host network namespace; we are talking only about the pod.Status.PodIPs fields, and there is no influence on the node network at all.

@khenidak I narrowed the scope of the KEP in goals/non-goals to avoid any change in the kubelet. This is just to set the nodeIP reported by the kubelet in the PodIPs.

You can see Dan's comments too; the whole IP dance inside of the kubelet requires a big, big amount of work if we want to make it perfect:
#3000 (comment)
#3000 (comment)

Contributor

It is important to mention that these are host-network pods, so they have ALL the IPs of the host and they run in the host network namespace; we are talking only about the pod.Status.PodIPs fields, and there is no influence on the node network at all.

This is what I came here for.

I haven't always detected this behavior, which is why I think I am confused. I might be able to offer reproduction steps when I find time. However, I know I have had a node with both a publicly routable IP and an internal one, and have seen the host-net pods only route through the internal IP.

No more action needed. My comment is resolved (although it might be wise to add some more language about scope and IP changes, as mentioned above). If I have other questions on the code I can bring them up later. Thanks for the great response.

@endocrimes
Member

/cc

@thockin
Member

thockin commented Jan 25, 2022

Are we going to push for this in 1.24?

@aojea
Member Author

aojea commented Jan 25, 2022

Are we going to push for this in 1.24?

We have a simpler solution for the reported problem: kubernetes/kubernetes#106715

The dynamic hostIPs use case is still a breaking change for node behavior, without a strong use case that justifies it.

@thockin
Member

thockin commented Jan 26, 2022

ping @aojea - deadline looming

@aojea
Member Author

aojea commented Jan 26, 2022

ping @aojea - deadline looming

let's go with the bug fix and, once we have more evidence and feedback, we'll revisit
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 26, 2022
@thockin thockin modified the milestones: v1.24, v1.25 Jan 26, 2022
@aojea aojea closed this Jan 27, 2022
@aojea aojea reopened this Jan 27, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aojea
To complete the pull request process, please ask for approval from dcbw and additionally assign ehashman after the PR has been reviewed.
You can assign the PR to them by writing /assign @ehashman in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 27, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 27, 2022
@thockin
Member

thockin commented Jun 10, 2022

Are we touching this for 1.25?

@aojea
Member Author

aojea commented Jun 11, 2022

Are we touching this for 1.25?

I haven't heard more about this problem since we merged kubernetes/kubernetes#106715

I'm going to set it in draft mode just in case we need to revisit it

@aojea aojea marked this pull request as draft June 11, 2022 10:46
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 11, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
