
KEP 2999: adding new KEP for host network pod IPs #3000

Closed
wants to merge 1 commit

Conversation

aojea
Member

@aojea aojea commented Oct 4, 2021

  • One-line PR description: adding new KEP
  • Other comments:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 4, 2021
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/network Categorizes an issue or PR as relevant to SIG Network. labels Oct 4, 2021
@aojea aojea mentioned this pull request Oct 4, 2021
@aojea
Member Author

aojea commented Oct 4, 2021

/assign @danwinship @thockin @dcbw @khenidak

should be approved by the remaining approvers and/or the owning SIG (or
SIG Architecture for cross-cutting KEPs).
-->
# KEP-NNNN: Your short, descriptive title
Contributor

^^^


## Motivation

Kubelet reports the updated IP in node.Status, but host-network pods keep reporting an IP that no longer exists, which is misleading and creates confusion.
Contributor

There is also some confusion / inconsistency about where host-network pod IPs come from; kubelet will fill them in from node.status.addresses if CRI leaves them unset, but if CRI sets the pod's IPs itself, kubelet won't touch them. (This is bad because, eg, the runtime might not know whether the node is supposed to be dual-stack or not, so it may fill in a single-stack IP, and then kubelet won't add the second node IP to that.)

cri-o leaves the pod IPs unset for host-network pods but I'm not sure what other CRI implementations do
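For illustration, a minimal sketch of the current behavior described above (a hypothetical helper, not the actual kubelet code): IPs reported by the runtime win, and only when the runtime leaves them empty does kubelet fall back to the node's addresses.

```go
package main

import (
	"fmt"
	"net"
)

// hostNetworkPodIPs mirrors the behavior described above: if the runtime
// (CRI) reported IPs, kubelet keeps them untouched; only when the runtime
// left them empty does kubelet fall back to the node's primary IPs
// (from node.status.addresses). Hypothetical helper, for illustration only.
func hostNetworkPodIPs(runtimeIPs, nodeHostIPs []net.IP) []net.IP {
	if len(runtimeIPs) > 0 {
		// Runtime wins: even on a dual-stack node, a single-stack IP
		// reported by the runtime is not completed with the second family.
		return runtimeIPs
	}
	return nodeHostIPs
}

func main() {
	node := []net.IP{net.ParseIP("10.0.0.100"), net.ParseIP("fd00::100")}
	fmt.Println(hostNetworkPodIPs(nil, node))                                 // dual-stack, taken from the node
	fmt.Println(hostNetworkPodIPs([]net.IP{net.ParseIP("10.0.0.100")}, node)) // runtime value kept as-is
}
```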

Contributor

+1
This is an excellent point. I do believe that PodIPs for host-net pods should always match node.status.addresses, irrespective of what the CRI reports.

Member Author

@aojea aojea Oct 5, 2021

TIL, but that is not a goal of the KEP; the goal is only to update the PodIP whenever it changes, whether by the CRI or the kubelet.

Contributor

Sorry, I'm somewhat confused about what the final state of this proposal will be.

What will be the source of truth for Pod.Status.PodIP and Pod.Status.PodIPs[]? Will they be populated from the Kubelet? From CRI? From both?

If both... what is the reconciliation logic? How would we manage a delta?

If this is already implemented somewhere in the kubelet, I am assuming we just defer to that logic? Which begs the question: is that logic worth revisiting?

Member Author

The proposed implementation is here kubernetes/kubernetes#105410
@danwinship I can only see that the IPs are obtained from the Node.Status.Addresses
https://github.com/kubernetes/kubernetes/blob/master/pkg/util/node/node.go#L99

but if CRI sets the pod's IPs itself, kubelet won't touch them.

how can that happen?

Contributor

TIL, but is not a goal of the KEP, the goal is only to update the PodIP whenever it changes, CRI or Kubelet

But right now, if CRI sets the pod IP itself, then kubelet never touches it. If we unconditionally change it when it changes then that means that with some runtimes, if you start a hostNetwork pod on a dual-stack host, it will be single-stack, but if you then change the IPv4 address of the node, then the pod would become dual-stack. That's obviously wrong, so either we need to say that kubelet won't update the pod IP of pods where it didn't initially set the pod IP itself (which would mean we would have to remember somewhere whether we had set it ourselves or not), or else (as Kal suggested) we should override CRI in this case if it did try to set the pod IP itself.
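To make the two alternatives above concrete, a minimal sketch (hypothetical helper and naming, not the KEP's proposal or kubelet code) of the "override CRI for host-network pods" option:

```go
package main

import (
	"fmt"
	"net"
)

// reconcileHostNetworkPodIPs sketches the "override CRI" option discussed
// above: for host-network pods the node's current primary IPs always win, so
// a node IP change (or a single-stack -> dual-stack change) shows up in
// pod.status.podIPs. Hypothetical helper, for illustration only.
func reconcileHostNetworkPodIPs(hostNetwork bool, runtimeIPs, nodeHostIPs []net.IP) []net.IP {
	if hostNetwork {
		// node.status.addresses is the single source of truth for
		// host-network pods, regardless of what the CRI reported.
		return nodeHostIPs
	}
	// Pod-network pods keep whatever the runtime/CNI reported.
	return runtimeIPs
}

func main() {
	node := []net.IP{net.ParseIP("10.0.0.200"), net.ParseIP("fd00::200")}
	criReported := []net.IP{net.ParseIP("10.0.0.100")} // stale / single-stack value
	fmt.Println(reconcileHostNetworkPodIPs(true, criReported, node)) // node addresses win
}
```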

Contributor

@danwinship danwinship Oct 6, 2021

@aojea https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_pods.go#L1522
If podIP was set by the runtime, then kubelet never calls GetNodeHostIPs, i.e. it doesn't copy the host IP(s) to the pod IP(s).

Member Author

Why would the CRI set IPs on a host-network pod? There are no CNI calls at all, so that doesn't seem like a realistic scenario; a hostNetwork pod doesn't report IPs in either CRI-O or containerd:
https://github.com/cri-o/cri-o/blob/8d4d158935929800c4300b82eb4b5a83ded400f4/server/sandbox_network.go#L34-L36

https://github.com/containerd/containerd/blob/4921fb6b63e5476e54fb7e2cf93c3b505f1c09cb/pkg/cri/store/sandbox/sandbox.go#L40-L41


## Summary

Host-network pods inherit their Pod IPs from the node where they are running. Nodes may change their IPs during their lifecycle, and kubelet updates node.Status accordingly; however, host-network pod IPs are not updated and keep reporting the old IPs until the pods are restarted/recreated.
Contributor

(line-wrap!)

Contributor

s/Pod IPs/PodIPs/g

(there is no standard, but I noticed that is how we usually refer to them).


### Risks and Mitigations

Applications running in a host-network pod that depend on the Downward API or on Pod.Status.PodIPs for auto-configuration should take into account that the value can change. However, such applications would most probably not be working correctly in that situation anyway.
Contributor

Well, there are two cases here:

  1. a node IP is removed (which is what you're talking about)
  2. the node's "primary node IP(s)" change in some way but the old primary node IP(s) still exist
    • A. the node goes from single-stack to dual-stack and so starts using dual-stack podIPs for new host-network pods.
    • B. the node goes from dual-stack to single-stack and so stops using dual-stack podIPs, although presumably the secondary-IP-family node IP still exists even though kubelet is ignoring it.
    • C. the node's .Status.Addresses gets reordered in a way that causes a different IP to become primary. (eg, kubelet gets restarted with a different --node-ip value, or the user modifies the cloud configuration for the node to, eg, swap two virtual NICs)

In all of the case 2 scenarios, the pod's old IP continues to be at least partially valid/usable.

Member Author

good point, worth clarifying

Contributor

@khenidak khenidak left a comment

There is a missing user story around static pods and how they are racy. They can start before CloudProvider sets Node.Status.Addresses.

I also want - for posterity - to outline the idea that host-net pods should always listen on ::0 if they want to avoid node IP change issues.
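As a small, generic illustration of that mitigation (not from the KEP): a host-network server that binds the wildcard address is reachable on every node IP, so it does not depend on which node IP kubelet publishes or whether that IP later changes.

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "served regardless of which node IP the client used")
	})
	// ":8080" binds the unspecified/wildcard address; on a dual-stack host
	// this typically accepts connections on every IPv4 and IPv6 node IP,
	// so a node IP change does not affect the listener.
	if err := http.ListenAndServe(":8080", nil); err != nil {
		panic(err)
	}
}
```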


## Summary

Host-network pods inherit their Pod IPs from the node where they are running. Nodes may change their IPs during their lifecycle, and kubelet updates node.Status accordingly; however, host-network pod IPs are not updated and keep reporting the old IPs until the pods are restarted/recreated.
Contributor

@krisnova krisnova Oct 5, 2021

Host-network pods inherit their Pod IPs from the node where they are running.

I have never found good documentation on this, but there are some considerations for a host with multiple NICs with regard to hostNet: true pods. Most Linux network managers (tested: systemd-networkd and NetworkManager) enable you to define a default route. So far I have only seen pods pull the same IP based on the default route, although I can't back this up with paperwork. I haven't looked that far into the code (yet?).

However I am wondering how this change would respond in the following situations given this observation.

Given a node with the following criteria

  • eno1 10.0.0.100 (default route)
  • eno2 192.168.1.1

  • What is the expected Pod behavior if the default route changes from eno1 -> eno2?
  • What is the expected Pod behavior if eno1 is disabled (deactivated) and eno2 has a valid connection, but is not the default route?
  • What is the expected Pod behavior if a new NIC is added, but the default route remains?
  • What is the expected Pod behavior if eno1 at 10.0.0.100 remains the default route in the route table, but is assigned a NEW IP address 172.0.100.100?
  • What is the expected Pod behavior if eno1 at 10.0.0.100 remains the default route in the route table, but is assigned a SECONDARY IP address 172.0.100.100?

tldr; that default route though

Contributor

@khenidak khenidak Oct 5, 2021

There are two types of node IP changes in this context:

  1. Soft change: this happens because of the delta between what kubelet sees (the default first IPv4, iirc), and thus what static pods see (if they start before the cloud provider's set-IP step), and what something like the cloud provider sets, which can be dual-stack, in a different order, or even a different IP.
  2. Hard change: like the one you described, or what @aojea described (e.g., DHCP resets); to be more specific, a reconfiguration of the networking stack of the node.

The first type has no bearing on anything. It just says the pod is reachable using a different set of addresses (assuming the pod listens on ::0 or whatever the equivalent is in other languages). The node itself didn't change its own network stack configuration. To be more specific: the node only changed the IPs it reports to Kubernetes from {<IP>} to !{<IP>}, but the node owned and still owns the joined set of IPs. Please do note that the srcIP for a host-net pod does not have to == the pod IP; this is subject to how the node networking stack is configured (similar to the srcIP for pod-net pods, which is subject to the CNI and its SNAT rules). This is the most common change we see.

The second type also, arguably, should result in no change to the pod (again assuming it listens on ::0 and shares the host network namespace). To present it differently, let us assume the node retooled its entire networking stack (assuming the CNI is smart enough to deal with this and there are no problems with pods in pod-net): what happens to kubelet? In practice kubelet runs as a pod in host-net (or as any other app in the host's default net namespace). Depending on the severity of the change, existing connections may need to be terminated and re-established, but it does not require kubelet to restart (again assuming it listens on ::0).

Maybe we should limit the KEP to the first type? @aojea

Member Author

Although I can't back this up with paperwork. I haven't looked that far into the code (yet?).

the code is here if you want to have some fun :) https://github.com/kubernetes/kubernetes/blob/6ee6251fea608ff64ff69314624d26fc6b3b61ac/pkg/kubelet/nodestatus/setters.go#L60-L255

There are different scenarios; let's discard the cloud provider one and focus on bare metal / non-cloud-provider. The code says:


 			// 1) Use nodeIP if set (and not "0.0.0.0"/"::")
			// 2) If the user has specified an IP to HostnameOverride, use it
			// 3) Lookup the IP from node name by DNS
			// 4) Try to get the IP from the network interface used as default gateway
			//
			// For steps 3 and 4, IPv4 addresses are preferred to IPv6 

if we are in step 4, we execute https://github.com/kubernetes/kubernetes/blob/6ee6251fea608ff64ff69314624d26fc6b3b61ac/staging/src/k8s.io/apimachinery/pkg/util/net/interface.go#L463-L468

// ResolveBindAddress returns the IP address of a daemon, based on the given bindAddress:
// If bindAddress is unset, it returns the host's default IP, as with ChooseHostInterface().
// If bindAddress is unspecified or loopback, it returns the default IP of the same
// address family as bindAddress.
// Otherwise, it just returns bindAddress.
func ResolveBindAddress(bindAddress net.IP) (net.IP, error) {
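For illustration, a minimal (hedged) way to exercise that function outside of kubelet, assuming the k8s.io/apimachinery/pkg/util/net import path quoted above; per its doc comment, an unset bindAddress returns the host's default IP, i.e. the IP of the default-route interface that step 4 ends up publishing:

```go
package main

import (
	"fmt"

	utilnet "k8s.io/apimachinery/pkg/util/net"
)

func main() {
	// With bindAddress unset (nil), ResolveBindAddress returns the host's
	// default IP, as with ChooseHostInterface(): the IP on the interface
	// that holds the default route.
	ip, err := utilnet.ResolveBindAddress(nil)
	if err != nil {
		panic(err)
	}
	fmt.Println("default host IP (step 4 candidate):", ip)
}
```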

It is important to mention that these are host-network pods, so they have ALL the IPs of the host and they run in the host network namespace; we are talking only about the pod.Status.PodIPs fields, and there is no influence on the node network at all.

@khenidak I narrowed the scope of the KEP in goals/non-goals to avoid any change in the kubelet. This is just to set the nodeIP reported by the kubelet in the PodIPs.

You can see Dan's comments too; the whole IP dance inside of the kubelet requires a big, big amount of work if we want to make it perfect:
#3000 (comment)
#3000 (comment)

Contributor

It is important to mention that these are host-network pods, so they have ALL the IPs of the host and they run in the host network namespace; we are talking only about the pod.Status.PodIPs fields, and there is no influence on the node network at all.

This is what I came here for.

I haven't always detected this behavior, which is why I think I am confused. I might be able to offer reproduction steps when I find time. However, I know I have had a node with both a publicly routable IP and an internal one, and have seen the host-net pods only route through the internal IP.

No more action needed. My comment is resolved (although it might be wise to add some more language about scope and IP changes, as mentioned above). If I have other questions on the code I can bring them up later. Thanks for the great response.

@endocrimes
Member

/cc

@thockin
Member

thockin commented Jan 25, 2022

Are we going to push for this in 1.24?

@aojea
Member Author

aojea commented Jan 25, 2022

Are we going to push for this in 1.24?

We have a simpler solution for the reported problem: kubernetes/kubernetes#106715

The dynamic hostIPs use case is still a breaking change for node behavior, without a strong use case that justifies it.

@thockin
Member

thockin commented Jan 26, 2022

ping @aojea - deadline looming

@aojea
Member Author

aojea commented Jan 26, 2022

ping @aojea - deadline looming

let's go with the bug fix and, once we have more evidence and feedback, we'll revisit
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 26, 2022
@thockin thockin modified the milestones: v1.24, v1.25 Jan 26, 2022
@aojea aojea closed this Jan 27, 2022
@aojea aojea reopened this Jan 27, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aojea
To complete the pull request process, please ask for approval from dcbw and additionally assign ehashman after the PR has been reviewed.
You can assign the PR to them by writing /assign @ehashman in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 27, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 27, 2022
@thockin
Member

thockin commented Jun 10, 2022

Are we touching this for 1.25?

@aojea
Member Author

aojea commented Jun 11, 2022

Are we touching this for 1.25?

I haven't heard more about this problem since we merged kubernetes/kubernetes#106715

I'm going to set it in draft mode just in case we need to revisit it

@aojea aojea marked this pull request as draft June 11, 2022 10:46
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 11, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
