-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add NixOS image for bare-metal Kata #1019
base: main
Are you sure you want to change the base?
Conversation
1140233
to
304fca3
Compare
|
c7dbe2c
to
ad338fa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for properly organizing the commits, that indeed improves the review experience! I went through up-to-including 6ded3e8 for now.
@@ -49,8 +49,11 @@ runCommandLocal "ociLayer" | |||
# Copy files into the tree (./root/) | |||
for i in ''${!srcs[@]}; do | |||
# resolve symlinks | |||
src=$(readlink -f ''${srcs[i]}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is sound. Shouldn't the link target be part of the sources, too? Or, the other way round, why can't I package a symlink pointing to something in another layer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see the reasoning I made in the PR description. symlinks generally are fine, but the way we implement this as of now just is not. And rather than making a complicated solution that would enable symlinks, this is sufficient for the use of ociLayerTar
as of now. I might enable symlinks in another PR if we see a reason to do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't find any reasoning about this in the PR description. I agree with @burgerdev comment. What symlink isn't working for this fix to be needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commit description, sorry. It's about kata-kernel, which in the NixOS toplevel is a symlink to the bzImage in a separate derivation for the kernel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't understand: the "derivation for the kernel" should also be installed in the system, shouldn't it, making the symlink resolve just fine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not about stuff in the system. We pack this into an OCI image layer, which is why I think adding support for resolving symlinks would be too complex compared to what we actually use this for. We select the kernel file from the "top-level" derivation to be included in the OCI layer. But it's just a symlink to a file in the derivation of the kernel. The OCI builder we have just copies the file without checking what it is nor dereferencing it. Thus, this - as of now - is an easy-to-trigger footgun for people using this. As we specify the target location in the layer for each file anyway, I think the proposed solution is just the best we can get with low complexity. I'm very strongly against keeping it as-is, just because the status quo is not sound imo.
@@ -144,5 +147,10 @@ buildGoModule rec { | |||
in | |||
[ "-skip=^${builtins.concatStringsSep "$|^" skippedTests}$" ]; | |||
|
|||
passthru.cmdline = { | |||
default = "tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp reboot=k cryptomgr.notests net.ifnames=0 pci=lastbus=0 root=/dev/vda1 rootflags=ro rootfstype=erofs console=hvc0 console=hvc1 quiet systemd.show_status=false panic=1 nr_cpus=1 selinux=0 systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket scsi_mod.scan=none"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some param choices that I found surprising:
reboot=k
: why notefi
?quiet
: don't we want these log messages?cryptomgr.notests
: these tests sound helpful - why disable them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not created by us but the default command line used by Kata (hence I moved it to this Nix package). I do not understand the choices as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Describing this in a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Could you please add a source link, too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many places in the Kata code influence the generation of this value, so adding a single source is hard.
4d4e74c
to
4148507
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
package comment seems outdated. What part of this is microvm specific at this point? Looks like it's just an image in parts compared to packaging it as uki.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's still a valid name for a non-bootable image like this, as you can boot it without a rootfs.
But I'm also open for other suggestions here.
4148507
to
d14b544
Compare
d14b544
to
963b695
Compare
Many Nix packages contain symlinks (e.g. packages aggregated into one with `symlinkJoin`) consist of symlinks. Our current implementation of the OCI layer builder does not handle symlinks at all and just copies them without copying the object they're pointing to. A sane configuration-less implementation would probably copy symlinks recursively until the target is found and copy that as well. For now, this just switches to an implementation that resolves all symlinks recursively, which is fine for all of our use-cases at the moment.
This adds a Nix builder to build a micro VM image for direct Linux boot, specifically for the bare-metal Kata image where this is necessary to satisfy Contrast's security assumptions made on the SNP launch digest computation.
Using the Kata kernel with a baremetal NixOS image requires some additional config options to specify NixOS' sanity checks, so add them here.
Kata has a check to see if only image OR initrd are supplied, which is not needed for our use-case. So add a patch to remove that. This should probably be brought upstream in a usable fashion later on.
This adds a little helper script to boot a Micro VM, as we build them for Kata bare-metal, via QEMU.
This switches the image used in our bare-metal Kata uses (e.g. non-AKS and non-peerpods) to a NixOS image that we build in-tree as a MicroVM image (e.g. separated kernel, initrd, cmdline and rootfs).
963b695
to
2d6b621
Compare
@@ -144,5 +147,10 @@ buildGoModule rec { | |||
in | |||
[ "-skip=^${builtins.concatStringsSep "$|^" skippedTests}$" ]; | |||
|
|||
passthru.cmdline = { | |||
default = "tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp reboot=k cryptomgr.notests net.ifnames=0 pci=lastbus=0 root=/dev/vda1 rootflags=ro rootfstype=erofs console=hvc0 console=hvc1 quiet systemd.show_status=false panic=1 nr_cpus=1 selinux=0 systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket scsi_mod.scan=none"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Could you please add a source link, too?
It needs to be noted that setting these values is unsupported, and doing so may lead to unexpected | ||
behaviour, as Contrast isn't tested against all possible configuration combinations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather phrase this positively, listing the options that are expected to work. Initial list:
runtime.create_container_timeout
hypervisor.default_memory
(only upwards?)hypervisor.default_vcpus
agent.kernel_modules
(?)
Reading through the linked doc, I noticed that we're not setting container_annotations
. Maybe doesn't make a difference now, but we might want to add it.
This doc change could be in a separate commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can hardly list things that are expected to work, as we don't test any of these. What do you think?
I'm also fine with specifying that some may work, and when they are useful.
I'm not sure about how agent.kernel_modules
should work exactly, but besides the NVIDIA driver for GPU images, our images don't actually contain loadable modules.
Re container_annotations
: Yeah, I think we'll want to set these.
@@ -17,7 +17,7 @@ | |||
OVMF-SNP, | |||
OVMF-TDX, | |||
|
|||
debugRuntime ? false, | |||
debugRuntime ? true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revert to false
# required for local booting, but no boot logs in kata with this | ||
boot.kernelParams = lib.optionals config.contrast.azure.enable [ "console=ttyS0" ]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand: the serial console worked fine on BM, why do we disable it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QEMU, locally, without specifying a plethora of virtconsole
options, will write to ttyS0
. Kata reads the guest logs from hvc*
, which doesn't work anymore when specifying ttyS0
here. (Even though I think it should, but maybe more than 2 consoles are not supported by the kernel.)
Why do you think we disable it?
environment.etc."kata-opa/default-policy.rego".source = pkgs.writeText "default-policy.rego" '' | ||
package agent_policy | ||
default SetPolicyRequest := true | ||
''; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use ${kata.kata-runtime.src}/src/kata-opa/allow-set-policy.rego
?
// Comment out the following lines for debugging of the event log. | ||
// sum := sha512.Sum384(bytes) | ||
// fmt.Println(hex.EncodeToString(sum[:])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider removing.
// TODO(msanft): Find out what goes wrong with the kernel hash calculation. | ||
// For now, hard-code the kernel hash from a booted image. | ||
// https://github.com/tianocore/edk2/blob/0f3867fa6ef0553e26c42f7d71ff6bdb98429742/OvmfPkg/Tcg/TdTcg2Dxe/TdTcg2Dxe.c#L1381 | ||
// kernelHash := "51877d12bdb51b50d99b213545042c6bbacde98a27ca52786cae8f4b55d71824d7c7aa3cf75ef8ed23543f2f2828b4a5" | ||
// var buffer [48]byte | ||
// if _, err := hex.Decode(buffer[:], []byte(kernelHash)); err != nil { | ||
// panic(err) | ||
// } | ||
// rtmr.Extend(buffer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that still relevant?
to the address it gets loaded to for Contrast's minimum VM memory, regardless | ||
of if the VM has more memory, as it would otherwise happen with QEMU's 4Gi | ||
constraint. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That 4Gi constraint is mentioned a few times, but I believe it's not explained anywhere.
This switches the image used in our bare-metal Kata uses (e.g. non-AKS and non-peerpods) to a NixOS image that we build in-tree as a MicroVM image (e.g. separated kernel, initrd, cmdline and rootfs).
I tried to split this up into digestible commits as much as possible, but a large part still has to be done in one commit in order to ensure bisectability through all commits. I also tried to write meaningful commit messages, so please consult these to find out about the specific changes.