Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add NixOS image for bare-metal Kata #1019

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

Conversation

msanft
Copy link
Contributor

@msanft msanft commented Nov 19, 2024

This switches the image used in our bare-metal Kata uses (e.g. non-AKS and non-peerpods) to a NixOS image that we build in-tree as a MicroVM image (e.g. separated kernel, initrd, cmdline and rootfs).

I tried to split this up into digestible commits as much as possible, but a large part still has to be done in one commit in order to ensure bisectability through all commits. I also tried to write meaningful commit messages, so please consult these to find out about the specific changes.

@msanft msanft added the no changelog PRs not listed in the release notes label Nov 19, 2024
@msanft msanft force-pushed the msanft/kata/nixos-image branch 6 times, most recently from 1140233 to 304fca3 Compare November 21, 2024 14:38
Copy link

github-actions bot commented Nov 21, 2024

PR Preview Action v1.4.8
🚀 Deployed preview to https://edgelesssys.github.io/contrast/pr-preview/pr-1019/
on branch gh-pages at 2024-11-29 14:01 UTC

@msanft msanft force-pushed the msanft/kata/nixos-image branch 2 times, most recently from c7dbe2c to ad338fa Compare November 22, 2024 09:45
@msanft msanft requested a review from 3u13r November 22, 2024 16:09
Copy link
Contributor

@burgerdev burgerdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for properly organizing the commits, that indeed improves the review experience! I went through up-to-including 6ded3e8 for now.

@@ -49,8 +49,11 @@ runCommandLocal "ociLayer"
# Copy files into the tree (./root/)
for i in ''${!srcs[@]}; do
# resolve symlinks
src=$(readlink -f ''${srcs[i]})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is sound. Shouldn't the link target be part of the sources, too? Or, the other way round, why can't I package a symlink pointing to something in another layer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see the reasoning I made in the PR description. symlinks generally are fine, but the way we implement this as of now just is not. And rather than making a complicated solution that would enable symlinks, this is sufficient for the use of ociLayerTar as of now. I might enable symlinks in another PR if we see a reason to do so.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find any reasoning about this in the PR description. I agree with @burgerdev comment. What symlink isn't working for this fix to be needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit description, sorry. It's about kata-kernel, which in the NixOS toplevel is a symlink to the bzImage in a separate derivation for the kernel.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't understand: the "derivation for the kernel" should also be installed in the system, shouldn't it, making the symlink resolve just fine?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not about stuff in the system. We pack this into an OCI image layer, which is why I think adding support for resolving symlinks would be too complex compared to what we actually use this for. We select the kernel file from the "top-level" derivation to be included in the OCI layer. But it's just a symlink to a file in the derivation of the kernel. The OCI builder we have just copies the file without checking what it is nor dereferencing it. Thus, this - as of now - is an easy-to-trigger footgun for people using this. As we specify the target location in the layer for each file anyway, I think the proposed solution is just the best we can get with low complexity. I'm very strongly against keeping it as-is, just because the status quo is not sound imo.

packages/by-name/kata/kata-runtime/package.nix Outdated Show resolved Hide resolved
@@ -144,5 +147,10 @@ buildGoModule rec {
in
[ "-skip=^${builtins.concatStringsSep "$|^" skippedTests}$" ];

passthru.cmdline = {
default = "tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp reboot=k cryptomgr.notests net.ifnames=0 pci=lastbus=0 root=/dev/vda1 rootflags=ro rootfstype=erofs console=hvc0 console=hvc1 quiet systemd.show_status=false panic=1 nr_cpus=1 selinux=0 systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket scsi_mod.scan=none";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some param choices that I found surprising:

  • reboot=k: why not efi?
  • quiet: don't we want these log messages?
  • cryptomgr.notests: these tests sound helpful - why disable them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not created by us but the default command line used by Kata (hence I moved it to this Nix package). I do not understand the choices as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Describing this in a comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Could you please add a source link, too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many places in the Kata code influence the generation of this value, so adding a single source is hard.

packages/by-name/boot-microvm/package.nix Show resolved Hide resolved
@msanft msanft force-pushed the msanft/kata/nixos-image branch 3 times, most recently from 4d4e74c to 4148507 Compare November 26, 2024 12:57
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

package comment seems outdated. What part of this is microvm specific at this point? Looks like it's just an image in parts compared to packaging it as uki.

Copy link
Contributor Author

@msanft msanft Nov 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's still a valid name for a non-bootable image like this, as you can boot it without a rootfs.

But I'm also open for other suggestions here.

Many Nix packages contain symlinks (e.g. packages aggregated into one with `symlinkJoin`) consist of symlinks. Our current implementation of the OCI layer builder does not handle symlinks at all and just copies them without copying the object they're pointing to. A sane configuration-less implementation would probably copy symlinks recursively until the target is found and copy that as well. For now, this just switches to an implementation that resolves all symlinks recursively, which is fine for all of our use-cases at the moment.
This adds a Nix builder to build a micro VM image for direct Linux boot, specifically for the bare-metal Kata image where this is necessary to satisfy Contrast's security assumptions made on the SNP launch digest computation.
Using the Kata kernel with a baremetal NixOS image requires some additional config options to specify NixOS' sanity checks, so add them here.
Kata has a check to see if only image OR initrd are supplied, which is not needed for our use-case. So add a patch to remove that. This should probably be brought upstream in a usable fashion later on.
This adds a little helper script to boot a Micro VM, as we build them for Kata bare-metal, via QEMU.
This switches the image used in our bare-metal Kata uses (e.g. non-AKS and non-peerpods) to a NixOS image that we build in-tree as a MicroVM image (e.g. separated kernel, initrd, cmdline and rootfs).
@@ -144,5 +147,10 @@ buildGoModule rec {
in
[ "-skip=^${builtins.concatStringsSep "$|^" skippedTests}$" ];

passthru.cmdline = {
default = "tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp reboot=k cryptomgr.notests net.ifnames=0 pci=lastbus=0 root=/dev/vda1 rootflags=ro rootfstype=erofs console=hvc0 console=hvc1 quiet systemd.show_status=false panic=1 nr_cpus=1 selinux=0 systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket scsi_mod.scan=none";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Could you please add a source link, too?

Comment on lines +42 to +43
It needs to be noted that setting these values is unsupported, and doing so may lead to unexpected
behaviour, as Contrast isn't tested against all possible configuration combinations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather phrase this positively, listing the options that are expected to work. Initial list:

  • runtime.create_container_timeout
  • hypervisor.default_memory (only upwards?)
  • hypervisor.default_vcpus
  • agent.kernel_modules (?)

Reading through the linked doc, I noticed that we're not setting container_annotations. Maybe doesn't make a difference now, but we might want to add it.

This doc change could be in a separate commit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can hardly list things that are expected to work, as we don't test any of these. What do you think?
I'm also fine with specifying that some may work, and when they are useful.

I'm not sure about how agent.kernel_modules should work exactly, but besides the NVIDIA driver for GPU images, our images don't actually contain loadable modules.

Re container_annotations: Yeah, I think we'll want to set these.

nodeinstaller/internal/constants/constants.go Show resolved Hide resolved
packages/by-name/image-podvm/package.nix Show resolved Hide resolved
@@ -17,7 +17,7 @@
OVMF-SNP,
OVMF-TDX,

debugRuntime ? false,
debugRuntime ? true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revert to false

Comment on lines +35 to +36
# required for local booting, but no boot logs in kata with this
boot.kernelParams = lib.optionals config.contrast.azure.enable [ "console=ttyS0" ];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand: the serial console worked fine on BM, why do we disable it?

Copy link
Contributor Author

@msanft msanft Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QEMU, locally, without specifying a plethora of virtconsole options, will write to ttyS0. Kata reads the guest logs from hvc*, which doesn't work anymore when specifying ttyS0 here. (Even though I think it should, but maybe more than 2 consoles are not supported by the kernel.)

Why do you think we disable it?

Comment on lines +96 to +99
environment.etc."kata-opa/default-policy.rego".source = pkgs.writeText "default-policy.rego" ''
package agent_policy
default SetPolicyRequest := true
'';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use ${kata.kata-runtime.src}/src/kata-opa/allow-set-policy.rego?

Comment on lines +41 to +43
// Comment out the following lines for debugging of the event log.
// sum := sha512.Sum384(bytes)
// fmt.Println(hex.EncodeToString(sum[:]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider removing.

Comment on lines +252 to +260
// TODO(msanft): Find out what goes wrong with the kernel hash calculation.
// For now, hard-code the kernel hash from a booted image.
// https://github.com/tianocore/edk2/blob/0f3867fa6ef0553e26c42f7d71ff6bdb98429742/OvmfPkg/Tcg/TdTcg2Dxe/TdTcg2Dxe.c#L1381
// kernelHash := "51877d12bdb51b50d99b213545042c6bbacde98a27ca52786cae8f4b55d71824d7c7aa3cf75ef8ed23543f2f2828b4a5"
// var buffer [48]byte
// if _, err := hex.Decode(buffer[:], []byte(kernelHash)); err != nil {
// panic(err)
// }
// rtmr.Extend(buffer)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that still relevant?

Comment on lines +10 to +12
to the address it gets loaded to for Contrast's minimum VM memory, regardless
of if the VM has more memory, as it would otherwise happen with QEMU's 4Gi
constraint.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That 4Gi constraint is mentioned a few times, but I believe it's not explained anywhere.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no changelog PRs not listed in the release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants