checkpoint restore: netns is not configured when a custom userns is used #18502

Open
Luap99 opened this issue May 8, 2023 · 12 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@Luap99
Member

Luap99 commented May 8, 2023

Issue Description

Restoring a checkpoint of a container that uses a custom userns does not work correctly. The netns is not set up at all.

Also, checkpointing only works with runc; with crun the checkpoint fails in this case. (I will file a crun bug.)

Steps to reproduce the issue

$ sudo podman --runtime runc run -d --name test --uidmap 0:0:1000 quay.io/libpod/testimage:20221018 top
20075287fa50180654a6cce41bdd75a770190985a7a8d67e429b725c6ef5212d
$ sudo bin/podman exec test ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0@if58: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether fa:bc:a0:fe:2a:88 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.88.0.9/16 brd 10.88.255.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::f8bc:a0ff:fefe:2a88/64 scope link 
       valid_lft forever preferred_lft forever
$ sudo podman container checkpoint test
test
$ sudo podman container restore test
test
$ sudo podman exec test ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

Describe the results you received

The netns was not set up on restore.

1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

Describe the results you expected

Netns set up with the same MAC and IP.

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0@if58: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether fa:bc:a0:fe:2a:88 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.88.0.9/16 brd 10.88.255.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::f8bc:a0ff:fefe:2a88/64 scope link 
       valid_lft forever preferred_lft forever
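
The difference between the received and expected output can also be checked mechanically. The following is a minimal sketch, not part of podman: `netns_configured` is a hypothetical helper name, and it assumes that "configured" means at least one non-loopback interface has an IPv4 address in the `ip addr` output.

```shell
#!/bin/sh
# Hypothetical helper (not a podman command): reads `ip addr` output on stdin
# and succeeds only if a non-loopback interface carries an IPv4 address.
netns_configured() {
    awk '
        # Interface header line, e.g. "2: eth0@if58: <BROADCAST,...>"
        /^[0-9]+: / { iface = $2; sub(/:$/, "", iface); sub(/@.*/, "", iface) }
        # IPv4 address line belonging to the most recent interface header
        $1 == "inet" && iface != "lo" { found = 1 }
        END { exit !found }
    '
}
```

It could be used after a restore as `sudo podman exec test ip addr | netns_configured || echo "netns not configured"`.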

podman info output

host:
  arch: amd64
  buildahVersion: 1.31.0-dev
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  - misc
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.7-2.fc37.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.7, commit: '
  cpuUtilization:
    idlePercent: 88.72
    systemPercent: 2.17
    userPercent: 9.11
  cpus: 12
  databaseBackend: boltdb
  distribution:
    distribution: fedora
    variant: workstation
    version: "37"
  eventLogger: journald
  hostname: pholzing-fedora
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 6.2.14-200.fc37.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 9195008000
  memTotal: 33384812544
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: crun-1.8.4-1.fc37.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.8.4
      commit: 5a8fa99a5e41facba2eda4af12fa26313918805b
      rundir: /run/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-8.fc37.x86_64
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.3
  swapFree: 8589930496
  swapTotal: 8589930496
  uptime: 4h 31m 12.00s (Approximately 0.17 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  - host-device-plugin
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /usr/share/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 510389125120
  graphRootUsed: 145230790656
  graphStatus:
    Backing Filesystem: btrfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 1
  runRoot: /run/containers/storage
  transientStore: false
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.6.0-dev
  Built: 1683548648
  BuiltTime: Mon May  8 14:24:08 2023
  GitCommit: fb034432743d4b4522c93056bf4580743fe539a0
  GoVersion: go1.19.8
  Os: linux
  OsArch: linux/amd64
  Version: 4.6.0-dev

Podman in a container

No

Privileged Or Rootless

None

Upstream Latest Release

Yes

Additional environment details

No response

Additional information

Found while working on #18468.
The problem is that I try to use the userns code path all the time, which is causing test failures.

@Luap99 Luap99 added the kind/bug Categorizes issue or PR as related to a bug. label May 8, 2023
@Luap99

This comment was marked as resolved.

@Luap99
Member Author

Luap99 commented May 8, 2023

OK, I think #18468 should address it; however, I still have problems with crun. crun can checkpoint, but it fails to restore because the netns link is missing. That is expected, as podman will now only configure the netns after the OCI runtime creates the container.

(00.048662)      1: Try to restore a link 10:2:eth0
(00.048676)      1: Restoring link eth0 type 2
(00.048691)      1: Restoring netdev eth0 idx 2
(00.048705)      1: Restore ll addr (8e:../6) for device
(00.048712)      1: Error (criu/net.c:1462): Unknown peer net namespace
(00.063166)      1: Error (criu/libnetlink.c:54): -16 reported by netlink: Device or resource busy
(00.063205)      1: Error (criu/net.c:1816): Can't restore link: -16
(00.063270)      1: Error (criu/util.c:1411): Can't wait or bad status: errno=0, status=65280
(00.064717) Error (criu/cr-restore.c:2536): Restoring FAILED.

I am not sure exactly how checkpoint/restore is supposed to work with the netns. Does podman have to create the netns before we restore? I assume not, because the same thing works with runc; only crun complains.
@giuseppe @adrianreber PTAL
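
When the restore log is longer than the excerpt above, the error lines can be pulled out with a small filter. This is only a sketch: `criu_errors` is a hypothetical helper name, and it assumes the log uses the `Error (file:line): message` shape seen in the excerpt.

```shell
#!/bin/sh
# Hypothetical helper: reads a CRIU log on stdin and prints only the error
# lines, reduced to "source-location: message".
criu_errors() {
    sed -n 's/^.*Error (\([^)]*\)): \(.*\)$/\1: \2/p'
}
```

Feed it whichever restore log your runtime writes, for example `criu_errors < restore.log`, where the file name is a placeholder.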

@adrianreber
Collaborator

I think at this point nobody tested if checkpoint/restore works in combination with user namespaces.

CRIU has support for user namespaces, but I am not aware of how far it has been enabled in runc/crun. What you describe seems to be as expected: it is not implemented in crun, and the runc implementation exists but has been used little, if at all.

If used without user namespaces Podman has to create the network namespace and tell CRIU which network namespace is used and CRIU will restore the processes into that network namespace.

@Luap99
Member Author

Luap99 commented May 8, 2023

> If used without user namespaces Podman has to create the network namespace and tell CRIU which network namespace is used and CRIU will restore the processes into that network namespace.

That seems to be the case for crun but not runc. runc works fine with adding an empty netns and restore into that. Then podman configures the netns later. As this is the only way to ensure the netns is owned by the right userns, I would love if we could make crun work the same?

@adrianreber
Collaborator

>> If used without user namespaces Podman has to create the network namespace and tell CRIU which network namespace is used and CRIU will restore the processes into that network namespace.

> That seems to be the case for crun but not runc. runc works fine with adding an empty netns and restore into that.

In the case of runc is it in combination with a user namespace or without? Not sure if you are talking about the situation with or without a user namespace. Because for crun there is just no code at all to handle user namespace with checkpoint/restore.

> Then podman configures the netns later.

Also for a restore? The container starts to run after the restore and the network namespace is changed while the container is running? That does sound like it will break things.

> As this is the only way to ensure the netns is owned by the right userns, I would love if we could make crun work the same?

As mentioned before, crun has no user namespace support in combination with checkpoint/restore and the runc user namespace implementation is not really used.

@Luap99
Member Author

Luap99 commented May 8, 2023

>>> If used without user namespaces Podman has to create the network namespace and tell CRIU which network namespace is used and CRIU will restore the processes into that network namespace.

>> That seems to be the case for crun but not runc. runc works fine with adding an empty netns and restore into that.

> In the case of runc is it in combination with a user namespace or without? Not sure if you are talking about the situation with or without a user namespace. Because for crun there is just no code at all to handle user namespace with checkpoint/restore.

Both. In #18468 I try to make podman use only one code path for the network setup, and obviously that means I have to go with the userns path every time.

>> Then podman configures the netns later.

> Also for a restore? The container starts to run after the restore and the network namespace is changed while the container is running? That does sound like it will break things.

Yes, OK, this is a big blocker. I did not know restore would start it right away. So that means userns with netns + restore is impossible to support then.

>> As this is the only way to ensure the netns is owned by the right userns, I would love if we could make crun work the same?

> As mentioned before, crun has no user namespace support in combination with checkpoint/restore and the runc user namespace implementation is not really used.

Yes, understood. With #18468 we do not use a userns but still try to set up the netns after the OCI runtime. So based on the statement above, this is something we should never do. I guess I have to revert to the previous behaviour for restore then.

@adrianreber
Collaborator

> Yes, understood. With #18468 we do not use a userns but still try to set up the netns after the OCI runtime. So based on the statement above, this is something we should never do. I guess I have to revert to the previous behaviour for restore then.

CRIU has the ability to restore a process into a stopped state. This is not exposed in runc/crun, but theoretically you could restore a process using --leave-stopped (see the CRIU man page), set up the network namespace, and then SIGCONT the processes in the container.

A combination with the cgroup freezer could also be possible, although I am not sure if that will work. I have never tried it, and I am not sure you can restore a process into a frozen cgroup. If the cgroup is frozen, can CRIU run? CRIU uses the cgroup freezer (if available) during checkpointing to stop all processes at the same time. Another idea would be to extend CRIU so that, instead of leaving all restored processes stopped, it puts them in a frozen cgroup which Podman could then unfreeze.

All these ideas, except the existing --leave-stopped, require additional effort in CRIU and runc/crun. --leave-stopped only requires changes to crun/runc, and even that is just passing a parameter through.
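
The ordering behind the --leave-stopped idea (task stopped, network configured, then SIGCONT) can be illustrated with an ordinary process. This is an analogy only and makes several assumptions: `sleep` stands in for a restored container process, no CRIU, runc or crun is involved, the netns setup is a placeholder comment, and the helper names `proc_state` and `wait_for_state` are made up for this sketch. It relies on Linux `/proc`.

```shell
#!/bin/sh
# Illustration only: `sleep` stands in for a restored container process.

# Print the one-letter state (S = sleeping, T = stopped, ...) of a PID,
# taken from the "State:" line of /proc/<pid>/status (Linux only).
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status"
}

# Poll until the PID reaches the wanted state, giving up after about 5 seconds,
# then print whatever state the process is in.
wait_for_state() {
    tries=0
    while [ "$(proc_state "$1")" != "$2" ] && [ "$tries" -lt 50 ]; do
        sleep 0.1
        tries=$((tries + 1))
    done
    proc_state "$1"
}

sleep 30 &
pid=$!

# Analogous to a restore that leaves the task stopped.
kill -STOP "$pid"
state_stopped=$(wait_for_state "$pid" T)

# Placeholder: the network namespace would be configured here, while nothing
# inside the "container" is running.

# Resume once setup is complete.
kill -CONT "$pid"
state_resumed=$(wait_for_state "$pid" S)

echo "stopped=$state_stopped resumed=$state_resumed"

# Clean up the stand-in process.
kill "$pid"
wait "$pid" 2>/dev/null || true
```

The point of the sketch is only that nothing in the stopped task can observe the unconfigured network, because it does not run until SIGCONT arrives.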

@Luap99
Member Author

Luap99 commented May 8, 2023

OK, thanks for the info. So far all I care about is #18468. I definitely have little interest in addressing this in crun and runc. But it is good to know that CRIU could do it.

@github-actions

github-actions bot commented Jun 8, 2023

A friendly reminder that this issue had no activity for 30 days.

@github-actions

github-actions bot commented Aug 2, 2023

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Aug 3, 2023

@Luap99 What should we do with this issue?

@Luap99
Member Author

Luap99 commented Aug 3, 2023

Well, it is up to the OCI runtimes to support this correctly; so far crun does not seem to support it (containers/crun#1207).
And then there is the general problem, as mentioned above, that a restore will immediately start the process, so there is no time for us to configure the netns afterwards. And we cannot create the netns ourselves, as the namespace must be owned by the correct userns in order to function correctly.

I only found this while working on the network code; so far it looks like no real users have complained, so I don't think this is a priority.
For now maybe we should just update the docs to state that checkpoint/restore with userns is currently unsupported.
