Forensic Container Checkpointing #2008

Open
17 of 21 tasks
adrianreber opened this issue Sep 23, 2020 · 74 comments
Labels
lead-opted-in  Denotes that an issue has been opted in to a release
sig/node  Categorizes an issue or PR as relevant to SIG Node.
stage/beta  Denotes an issue tracking an enhancement targeted for Beta status

Comments

@adrianreber
Contributor

adrianreber commented Sep 23, 2020

Enhancement Description

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Sep 23, 2020
@adrianreber
Contributor Author

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 23, 2020
@kikisdeliveryservice
Member

Discussion Link: N/A (or... at multiple conferences over the last years, when presenting CRIU and container migration, there was always the question of when we will see container migration in Kubernetes)

Responsible SIGs: maybe node

We recommend actively socializing your KEP with the appropriate SIG to gain visibility and consensus, and also for scheduling. Also, as you are not sure which SIG will sponsor this, reaching out to the SIGs to get clarity on that will help move your KEP forward.

@kikisdeliveryservice
Member

Hi @adrianreber

Any updates on whether this will be included in 1.20?

Enhancements Freeze is October 6th and by that time we require:

The KEP must be merged in an implementable state
The KEP must have test plans
The KEP must have graduation criteria
The KEP must have an issue in the milestone

Best,
Kirsten

@adrianreber
Contributor Author

Hello @kikisdeliveryservice

Any updates on whether this will be included in 1.20?

Sorry, but how would I decide this? There has not been a lot of feedback on the corresponding KEP, which makes it really difficult for me to answer that question. On the other hand, maybe the missing feedback is a sign that it will take some more time. So this will probably not be included in 1.20.

@kikisdeliveryservice kikisdeliveryservice added tracked/no Denotes an enhancement issue is NOT actively being tracked by the Release Team stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status labels Sep 28, 2020
@kikisdeliveryservice
Member

Normally the SIG would give a clear signal that it would be included, by reviewing the KEP, agreeing to the milestone proposals in the KEP, etc. I'd encourage you to keep in touch with them and start the 1.21 conversation early if this does not end up getting reviewed/merged properly by October 6th.

Best,
Kirsten

@adrianreber
Contributor Author

@kikisdeliveryservice Thanks for the guidance. Will do.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 27, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 26, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@adrianreber
Contributor Author

/reopen
/remove-lifecycle rotten

@k8s-ci-robot
Contributor

@adrianreber: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Feb 25, 2021
@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 25, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 26, 2021
@adrianreber
Contributor Author

/remove-lifecycle stale

Still working on it.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 27, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

adrianreber added a commit to adrianreber/kubernetes that referenced this issue Sep 27, 2023
Kubernetes 1.25 introduced the possibility to checkpoint a container.

For details please see the KEP 2008 Forensic Container Checkpointing
kubernetes/enhancements#2008

The initial implementation only provided a kubelet API endpoint to
trigger a checkpoint. The main reason for not extending it to the API
server and kubectl was that checkpointing is a completely new concept.

Although the result of the checkpointing, the checkpoint archive, is only
accessible by root, it is important to remember that it contains all
memory pages and thus possibly passwords, private keys and random
numbers. With the checkpoint archive being accessible only by root, it
does not directly make it easier to access this potentially confidential
information, as root would be able to retrieve that information anyway.

Now, at least three Kubernetes releases later, we have not heard any
negative feedback about the checkpoint archive and its data. There were,
however, many questions about being able to create a checkpoint via
kubectl and not just via the kubelet API endpoint.

This commit adds 'checkpoint' support to kubectl. The 'checkpoint'
command is heavily influenced by the code of the 'exec' and 'logs'
commands. The checkpoint command is only available behind the 'alpha'
sub-command, as the "Forensic Container Checkpointing" KEP is still
marked as Alpha.

Example output:

 $ kubectl alpha checkpoint test-pod -c container-2
 Node:                  127.0.0.1/127.0.0.1
 Namespace:             default
 Pod:                   test-pod-1
 Container:             container-2
 Checkpoint Archive:    /var/lib/kubelet/checkpoints/checkpoint-archive.tar

The tests are implemented so that they handle a CRI implementation with
and without an implementation of the CRI RPC call 'ContainerCheckpoint'.

Signed-off-by: Adrian Reber <areber@redhat.com>
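For reference, the kubelet endpoint mentioned in the commit message above can also be called directly on a node. The following is only a minimal sketch: the pod/container names and the certificate paths are illustrative assumptions, based on the alpha feature's /checkpoint/{namespace}/{pod}/{container} route, not output taken from this issue.

 # requires the ContainerCheckpoint feature gate to be enabled on the kubelet;
 # certificate paths and pod/container names below are assumptions
 $ curl -sk -X POST \
     --cert /var/lib/kubelet/pki/kubelet-client-current.pem \
     --key /var/lib/kubelet/pki/kubelet-client-current.pem \
     "https://localhost:10250/checkpoint/default/test-pod/container-2"
 # on success, the kubelet answers with the path of the checkpoint archive
 # it wrote under /var/lib/kubelet/checkpoints/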
@adrianreber
Contributor Author

First attempt to provide checkpoint via kubectl: kubernetes/kubernetes#120898

@Tobeabellwether

Hi @adrianreber, I created a simple microservice pod and tried to migrate it. I found that when I had just started it and used a counter-like function, I was able to checkpoint it, but when I used it to connect to the message broker and send messages, checkpointing it raised the following error. Is there any way to solve it?

checkpointed: checkpointing of default/order-service-7c69b4d88b-n56xq/order-service failed
(rpc error: code = Unknown desc = failed to checkpoint container b0ae35f30db3124341035041d02ce85bd83be2b38180081ab919a9f89e16c3af:

running "/usr/local/bin/runc"
["checkpoint"
"--image-path" "/var/lib/containers/storage/overlay-containers/b0ae35f30db3124341035041d02ce85bd83be2b38180081ab919a9f89e16c3af/userdata/checkpoint"
"--work-path" "/var/lib/containers/storage/overlay-containers/b0ae35f30db3124341035041d02ce85bd83be2b38180081ab919a9f89e16c3af/userdata"
"--leave-running" "b0ae35f30db3124341035041d02ce85bd83be2b38180081ab919a9f89e16c3af"]

failed: /usr/local/bin/runc --root /run/runc --systemd-cgroup checkpoint --image-path /var/lib/containers/storage/overlay-containers/b0ae35f30db3124341035041d02ce85bd83be2b38180081ab919a9f89e16c3af/userdata/checkpoint --work-path /var/lib/containers/storage/overlay-containers/b0ae35f30db3124341035041d02ce85bd83be2b38180081ab919a9f89e16c3af/userdata --leave-running b0ae35f30db3124341035041d02ce85bd83be2b38180081ab919a9f89e16c3af

failed: time="2023-10-24T16:40:11Z"
level=error
msg="criu failed: type NOTIFY errno 0\nlog file: /var/lib/containers/storage/overlay-containers/b0ae35f30db3124341035041d02ce85bd83be2b38180081ab919a9f89e16c3af/userdata/dump.log"

@adrianreber
Contributor Author

@Tobeabellwether Please open a bug at CRI-O with the dump.log attached.

@Tobeabellwether

@Tobeabellwether Please open a bug at CRI-O with the dump.log attached.

@adrianreber Thanks for the tip, I checked that dump.log file and found:

(00.134562) Error (criu/sk-inet.c:191): inet: Connected TCP socket, consider using --tcp-established option.
(00.134634) ----------------------------------------
(00.134654) Error (criu/cr-dump.c:1669): Dump files (pid: 1533879) failed with -1

So, I tried to forcefully interrupt the TCP connection between the pod and the message broker, and the checkpoint was successfully created. Should this still be considered a bug?

@adrianreber
Contributor Author

Ah, good to know. No, this is not a real bug. We will probably at some point need the ability to pass different parameters down from Kubernetes to CRIU. But this is something for the far future.

You can also control CRIU options with a CRIU configuration file. Handling established TCP connections could be configured there.
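As an illustration of that suggestion, a CRIU configuration file enabling checkpointing of established TCP connections could look like the sketch below. The file location is an assumption (CRIU reads /etc/criu/default.conf, and runc-based runtimes typically also read /etc/criu/runc.conf), so check which configuration file your setup actually uses.

 # one long CRIU option per line, without the leading "--"
 # file location (/etc/criu/runc.conf or /etc/criu/default.conf) depends on the setup
 tcp-established

With tcp-established set, CRIU dumps established TCP connections instead of aborting the checkpoint as in the error above.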

@Tobeabellwether

Ah, good to know. No, this is not a real bug. We will probably at some point need the ability to pass different parameters down from Kubernetes to CRIU. But this is something for the far future.

You can also control CRIU options with a CRIU configuration file. Handling established TCP connections could be configured there.

Hi @adrianreber again, when I try to restore the checkpoint of the pod with a TCP connection into a new pod, I encounter a new problem:

[screenshot]

So I tried to check the log file under /var/run/containers/storage/overlay-containers/order-service/userdata/restore.log, but I only found these folders, none with the name of the container to restore:

[screenshot]

and the restoration for pods without TCP connections works fine.

@adrianreber
Contributor Author

@Tobeabellwether You can redirect the CRIU log file to another file using the CRIU configuration file (log-file /tmp/restore.log) and then have a look at that file.
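In the configuration-file form, that suggestion is just one more line; the target path is only an example:

 # redirect the CRIU restore log to a known location
 log-file /tmp/restore.log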

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 31, 2024
@adrianreber
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 31, 2024
@mrunalp
Contributor

mrunalp commented Feb 5, 2024

/stage beta
/milestone v1.30
/label lead-opted-in

@k8s-ci-robot k8s-ci-robot removed the stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status label Feb 5, 2024
@k8s-ci-robot k8s-ci-robot modified the milestones: v1.25, v1.30 Feb 5, 2024
@k8s-ci-robot k8s-ci-robot added stage/beta Denotes an issue tracking an enhancement targeted for Beta status lead-opted-in Denotes that an issue has been opted in to a release labels Feb 5, 2024
@sreeram-venkitesh
Member

Hello @adrianreber 👋, v1.30 Enhancements team here.

Just checking in as we approach the enhancements freeze at 02:00 UTC on Friday, 9th February 2024.

This enhancement is targeting stage beta for v1.30 (correct me if otherwise).

Here's where this enhancement currently stands:

  • KEP readme using the latest template has been merged into the k/enhancements repo.
  • KEP status is marked as implementable for latest-milestone: 1.30. KEPs targeting stable will need to be marked as implemented after code PRs are merged and the feature gates are removed.
  • KEP readme has up-to-date graduation criteria
  • KEP has a production readiness review that has been completed and merged into k/enhancements. (For more information on the PRR process, check here).

Everything is done in #4288 and #4305. Please make sure that these PRs are merged before the enhancements freeze.

The status of this enhancement is currently marked as at risk for the enhancements freeze. Please make sure your PRs are merged in time.

@sreeram-venkitesh
Member

Hello 👋, v1.30 Enhancements team here.

Unfortunately, this enhancement did not meet the requirements for the enhancements freeze.

#4288 is merged, but #4305 is still open. Please file an exception request to get this PR merged.

If you still wish to progress this enhancement in v1.30, please file an exception request. Thanks!

@salehsedghpour
Contributor

/milestone clear

@HFourier

Ah, good to know. No, this is not a real bug. We will probably at some point need the ability to pass different parameters down from Kubernetes to CRIU. But this is something for the far future.
You can also control CRIU options with a CRIU configuration file. Handling established TCP connections could be configured there.

Hi @adrianreber again, when I try to restore the checkpoint of the pod with a TCP connection into a new pod, I encounter a new problem:

[screenshot] So I tried to check the log file under /var/run/containers/storage/overlay-containers/order-service/userdata/restore.log, but I only found these folders, none with the name of the container to restore: [screenshot] and the restoration for pods without TCP connections works fine.

I have the same problem. Have you solved it?

@Tobeabellwether

Ah, good to know. No, this is not a real bug. We will probably at some point need the ability to pass different parameters down from Kubernetes to CRIU. But this is something for the far future.
You can also control CRIU options with a CRIU configuration file. Handling established TCP connections could be configured there.

Hi @adrianreber again, when I try to restore the checkpoint of the pod with a TCP connection into a new pod, I encounter a new problem:
[screenshot]
So I tried to check the log file under /var/run/containers/storage/overlay-containers/order-service/userdata/restore.log, but I only found these folders, none with the name of the container to restore:
[screenshot]
and the restoration for pods without TCP connections works fine.

I have the same problem. Have you solved it?

I've read CRIU's documentation. I think checkpointing a container with data flowing in and out is in general tricky and not safe, so I just always close its connections before checkpointing.

adrianreber added a commit to adrianreber/kubernetes that referenced this issue Mar 26, 2024