
Kubeadm operator #2505

Open · 4 tasks
shekhar-rajak opened this issue Feb 12, 2021 · 21 comments

Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.

Comments

@shekhar-rajak
Contributor

shekhar-rajak commented Feb 12, 2021

Enhancement Description

  • One-line enhancement description (can be used as a release note): Kubeadm operator
  • Kubernetes Enhancement Proposal:

https://github.com/kubernetes/enhancements/tree/master/keps/sig-cluster-lifecycle/kubeadm/2505-Kubeadm-operator

Summary

The kubeadm operator aims to enable declarative control of kubeadm workflows, automating the execution and
orchestration of such tasks across existing nodes in a cluster.

Motivation

The kubeadm binary can execute operations only on the machine where it is running; for example, it is not possible to execute
operations on other nodes or to copy files across nodes.

As a consequence, most kubeadm workflows, like kubeadm upgrade,
consist of a complex sequence of tasks that must be manually executed and orchestrated across all the existing nodes
in the cluster.

Such a user experience is not ideal, given how error-prone it is for humans to run long command sequences. The manual approach
can be considered a blocker for implementing more complex workflows such as rotating certificate authorities,
modifying the settings of an existing cluster, or any task that requires coordination across more than one Kubernetes node.

This KEP aims to address such problems by applying the operator pattern to kubeadm workflows.

  • Discussion Link:
  • Primary contact (assignee): @fabriziopandini
  • Responsible SIGs: sig-cluster-lifecycle
  • Enhancement target (which target equals to which milestone):
    • Alpha release target (x.y): TODO
    • Beta release target (x.y): TODO
    • Stable release target (x.y): TODO
  • Alpha
    • KEP (k/enhancements) update PR(s):
    • Code (k/k) update PR(s):
    • Docs (k/website) update PR(s):

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Feb 12, 2021
@neolit123 neolit123 added kind/feature Categorizes issue or PR as related to a new feature. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. labels Feb 13, 2021
@k8s-ci-robot k8s-ci-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Feb 13, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 9, 2021
@neolit123
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 9, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 7, 2021
@neolit123
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 8, 2021
@pacoxu
Member

pacoxu commented Sep 14, 2021

Any plan for this feature?

@neolit123
Member

neolit123 commented Sep 14, 2021 via email

@fabriziopandini
Member

I have a prototype and many ideas around this and the kubeadm library...

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 14, 2021
@fabriziopandini
Member

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 14, 2021
@chendave
Member

/cc

I will evaluate this feature and see if we can move this forward.

@pacoxu
Member

pacoxu commented Jun 6, 2022

pacoxu/kubeadm-operator#53 — I did some initial implementation for upgrade and certificate renewal.

I ran into problems implementing the kubeadm operator:

  1. How to restart the kubelet? Running systemctl restart kubelet inside a pod is not recommended for security reasons.
  2. Replacing the kubelet binary while the kubelet is running fails ("file busy"); see the note after this comment.

If the kubelet is not restarted, the apiserver will be on the new version while the kubelet stays on the n-1 version.

  • As a workaround, I can start a daemon process on every node that checks whether the kubelet versions of /usr/bin/kubelet and /usr/bin/kubelet-new differ. If they do, it executes systemctl stop kubelet && /usr/bin/cp /usr/bin/kubelet-new /usr/bin/kubelet && systemctl restart kubelet. This is not very convenient.

Some other thoughts: can we kill the kubelet from inside a pod? Or can we add a flag file on a hostPath to let the kubelet know it should be restarted? I don't know whether there is a simple way to restart the kubelet via a kubelet API call or some other method.
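For reference on problem 2 above: on Linux, overwriting a binary that is currently executing fails with "Text file busy" (ETXTBSY), which is why a plain cp onto a running kubelet fails; replacing the directory entry via a rename avoids this even while the old binary keeps running. A minimal shell sketch of that replacement step (an illustration only, not code from any of the linked operators):

    # Fails while the kubelet is running, because cp truncates the existing inode:
    cp /usr/bin/kubelet-new /usr/bin/kubelet         # cp: ... Text file busy

    # Works: copy to a temp file on the same filesystem, then rename over the target.
    # rename(2) swaps the directory entry; the running kubelet keeps its old inode
    # until it is restarted.
    cp /usr/bin/kubelet-new /usr/bin/kubelet.tmp
    chmod 0755 /usr/bin/kubelet.tmp
    mv -f /usr/bin/kubelet.tmp /usr/bin/kubelet
    systemctl restart kubelet                        # pick up the new binary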

@chendave
Member

chendave commented Jun 6, 2022

@pacoxu I am also working on this; all the work on my side is still maintained internally. It's okay for you to join in, but we'd better sync with each other to avoid duplicated effort on this.

Our work is based on @fabriziopandini's POC; how about yours?

If the kubelet is not restarted, the apiserver will be on the new version while the kubelet stays on the n-1 version.

This is indeed an issue. So far it is fine, since kubelet 1.24 will continue to work with apiserver 1.25, but it must be taken care of because of the changes around CRI support for Docker.

@pacoxu
Member

pacoxu commented Jun 6, 2022

Our work is based on @fabriziopandini's POC; how about yours?

The same (pacoxu/kubeadm-operator#2). The POC was removed from the kubeadm code base in kubernetes/kubeadm#2342.

@chendave
Member

chendave commented Jun 6, 2022

cc @ruquanzhao for awareness; we need to figure out a solution for the kubelet upgrade as well.

@neolit123
Member

neolit123 commented Jun 6, 2022

big +1 for multiple people collaborating on this. happy to help with more ideas / review when needed.

Some other thoughts: can we kill the kubelet from inside a pod? Or can we add a flag file on a hostPath to let the kubelet know it should be restarted? I don't know whether there is a simple way to restart the kubelet via a kubelet API call or some other method.

i don't think there is an API to restart kubelet.
then again, that would restart a particular kubelet and we want to upgrade the binary too.

As a workaround, I can start a daemon process on every node that checks whether the kubelet versions of /usr/bin/kubelet and /usr/bin/kubelet-new differ. If they do, it executes systemctl stop kubelet && /usr/bin/cp /usr/bin/kubelet-new /usr/bin/kubelet && systemctl restart kubelet. This is not very convenient.

this sounds like one of the ways to do it. i don't think i have any significantly better ideas right now.
e.g. we could (somehow) deploy scripts on the hosts that manage the kubelet restart cycle and upgrade even if the pods that the operator DS deploys are killed due to a kubelet restart.

How to restart the kubelet? Running systemctl restart kubelet inside a pod is not recommended for security reasons.

the operator will have to have super powers on the hosts, so it will be considered as a trusted "actor"...there is no other way to manage component upgrade (kubeadm, kubelet) and cert rotation, etc.

@pacoxu
Member

pacoxu commented Jun 6, 2022

I can build the workaround into a script.

the operator will have to have super powers on the hosts, so it will be considered as a trusted "actor"...there is no other way to manage component upgrade (kubeadm, kubelet) and cert rotation, etc.

To restart the kubelet from inside a pod, I tried mounting /run/systemd, /var/run/dbus/system_bus_socket, and /sys/fs/cgroup into the agent and setting privileged/hostPID to true, but systemctl restart kubelet still failed. I am not sure what I am missing.
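For what it's worth, a commonly used pattern for running systemctl against the host from a privileged, hostPID pod is to enter the host's namespaces through PID 1 with nsenter, rather than talking to systemd over mounted sockets. This is only a sketch of that general pattern, not what any of the operators discussed here do:

    # Requires hostPID: true and securityContext.privileged: true on the pod.
    # Enter the host's mount/UTS/IPC/net/PID namespaces via PID 1 and run systemctl there.
    nsenter --target 1 --mount --uts --ipc --net --pid -- systemctl restart kubelet

    # Verify the unit afterwards the same way:
    nsenter --target 1 --mount --uts --ipc --net --pid -- systemctl is-active kubelet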

@pacoxu
Member

pacoxu commented Jun 17, 2022

I wrote a simple kubelet-reloader:

  • kubelet-reloader watches /usr/bin/kubelet-new.
  • once there is a different version at kubelet-new, the reloader replaces /usr/bin/kubelet and restarts the kubelet (sketched below).

Currently kubeadm-operator v0.1.0 supports upgrades across versions, e.g. v1.22 to v1.24.

  • the kubeadm operator downloads kubectl/kubelet/kubeadm and performs the upgrade.
  • the new kubelet is placed at /usr/bin/kubelet-new for the kubelet-reloader.

See quick-start.
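A minimal sketch of what such a reloader loop might look like, assuming the /usr/bin/kubelet-new path described above (an illustration, not the actual kubelet-reloader code):

    #!/usr/bin/env bash
    # Illustrative reloader loop; assumes the operator drops the candidate
    # binary at /usr/bin/kubelet-new on each node.
    set -euo pipefail

    current=/usr/bin/kubelet
    candidate=/usr/bin/kubelet-new

    while true; do
      # Only act when a candidate exists and reports a different version.
      if [[ -x "$candidate" ]] && \
         [[ "$("$candidate" --version)" != "$("$current" --version)" ]]; then
        systemctl stop kubelet
        cp -f "$candidate" "$current"
        systemctl restart kubelet
      fi
      sleep 10
    done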

@neolit123
Member

That's great. I think we should have our discussion on the k/kubeadm issue to keep it in one place. Also, cross-coordinate with @chendave to avoid duplicated work.

@pacoxu
Member

pacoxu commented Jun 30, 2022

kubernetes/kubeadm#2317 may be the right place.

@chendave
Member

FYI - we have implemented all of the initial scope of KEP #1239 here: https://github.com/chendave/kubeadm-operator

  • kubeadm upgrade
  • certificate renewal
  • certificate authority rotation (NEW)
  • change configuration in an existing cluster (NEW)

But it is still just a POC.

@pacoxu @neolit123 @ruquanzhao

@pacoxu
Member

pacoxu commented May 10, 2023

Earlier this year we asked whether the kubeadm operator could become a sig-cluster-lifecycle subproject. Some context can be found in the SIG Cluster Lifecycle weekly meeting notes (https://docs.google.com/document/d/1Gmc7LyCIL_148a9Tft7pdhdee0NBHdOfHS1SAF0duI4/edit#heading=h.xm2jvfwtcfuz) and in the cluster-api feedback-gathering issue kubernetes-sigs/cluster-api#7044.

  • declarative vs. imperative: the kubeadm operator is declarative.
  • the immutable-node topic in CAPI is complicated, and there is no clear way to tell whether this operator will ever be used in CAPI; use cases are not clear.
