-
Notifications
You must be signed in to change notification settings - Fork 27
container: support splitting inputs #69
Comments
There's obviously nothing ostree specific about this really. It should be possible to write a tool which accepts an arbitrary OCI/Docker image and reworks the layering such that e.g. large binaries are moved into separate content-addressed layers - right? Maybe it's a bit ostree specific in that it would be most effective when applied to a "fat" operating system that has lots of embedded binaries from multiple sources. In contrast a simple standard multi-stage build that links to a base image is already fairly optimal. It's doubtful that it'd be valuable to split up most standard distribution base images. (But that'd be interesting to verify) |
There are issues with splitting into too many layers. I recall hearing about performance suffering as layer count increased, but I can't recall details and that's likely solvable. If it does increase too much, however, you run into some fairly fundamental limits on registries and in many image-handling libraries. Some of our source image work produced images with >256 layers, and those were effectively unusable - the kernel parameter for the Overlay mount was simply too large to pass to the kernel, for example. |
Thinking about this more, a notable wrinkle here is things like the dpkg/rpm database, which today are single-file databases. That means that when one writes such a tool to split out e.g. the openssl or glibc libraries into their own content addressed layer, the final image layer will still need to contain the updated package metadata. This is another way of noting that single file package databases defeat the ability to do true dynamic linking - i.e. when just |
NixOS does have a database: https://nixos.org/guides/nix-pills/install-on-your-running-system.html#idm140737320795232 I've discussed trying to work around the layers issue with @giuseppe in order to allow sharing packages between host and guest as well as between containers. My best current shot at a solution is to have packages installed somewhere central (like |
Can you elaborate on "layers issue"?
Let's avoid the use of the term "guest" in the context of containers as it's kind of owned by virtualization. The term "host" is still quite relevant to both, but it's also IMO important that containers really are just processes from the perspective of the host. Unless here you do mean "host" and "guest" in the virtualization sense? Now I think let's be a bit more precise here and when you say "containers" you really mean "container image layers", correct? (e.g. "image layers" for short).
This seems to be predicated on the idea of a single "package" system, whether that's nix or RPM or dpkg or whatever inside the containers. I don't think that's going to realistically happen. |
Sorry the issue of having too many container layers leading to decreased performance.
Yep hopefully this is more precise: I mean sharing between host and container image layers and between different container image layers (if some sort of linking within the layer is used) or between containers (if the entire layer is shared)
I don't think it has to - just think of it as subdividing OCI layers into noncolliding chunks with checksums. For our purposes, it would probably be easiest if each chunk was a single RPM, and that would make sharing with rpm-based hosts easier. But the single packaging system is already provided by containerization - we already have |
I am still very wary of intersecting host and container content and updates. That's not opposition exactly, but I think we need to do it very carefully and thoughtfully. I touched on this in containers/image#1084 (comment)
(I wouldn't quite call this a "packaging system"; when one says "package" I think more of apt/yum) Ultimately I think a key question here is: Does the container stack attempt to dynamically (client side) perform deduplication, or do we try to create a "postprocessing tool" that transforms an image as I'm suggesting? Or, a variant of this is we do something like teach edit1: edit2: |
That makes a lot of sense. Would you say there's already a hole in the firewall between containers because of shared layers? Would doing this make that hole worse, or is it the same hole?
Agreed |
Returning to the issue of too many layers decreasing performance: |
I’m afraid I really can’t remember if I brought up something like that or if so, what it was. I’m not aware of layer count scaling horribly, at least. Do note I read @mheon ’s comment above as performance being solvable up to the ~128-layer limit for a container. Right now, I wouldn’t recommend any design that can get anywhere close to that value, without having a way forward if the limit is hit. (That 128-layer limit is also why I don’t think it’s too likely we would get horrible performance: there would have to be an exponential algorithm or something like that for O(f(N)) to be unacceptable with N≤128; at these scales, I’d be much more worried about scaling to large file counts, in hundreds of thousands or whatever.) |
@mtrmac thanks!
The article @cgwalters linked discusses how to combine layers once that limit is hit |
Some initial code in #123 |
One thing I'd note here: I think we should be able to separate logical layers from physical layers. I think it significantly simplifies things if we remove support for whiteouts etc. in these layers. IOW, the layers are "pure union" sources. That means that given layers L1, L2, L3 that are known to be union-able and also not a source, instead of using overlayfs at runtime, the layers are physically re-unioned by hardlink/reflinking. (Perhaps a corollary here really is that these union-only layers should also not care about their hardlink count) |
I wanted to comment on https://www.scrivano.org/posts/2021-10-26-compose-fs/ and https://github.com/giuseppe/composefs briefly. Broadly speaking, I think we're going to get a lot of mileage out of better splitting up container blobs as is discussed here. For one thing, such an approach benefits container runtimes today without any kernel/etc changes. (I still have a TODO item to more deeply investigate buildkit because I think they may already do this) One major downside of ostree (and this proposal) is that garbage collection is much more expensive. I think what we probably want is to at least only do deduplication inside related images to start. It seems very unlikely to me that the benefits of e.g. sharing files across a rhel8 image and a fedora34 image are worth the overhead. (But, it'd be interesting to be proven wrong) |
The only issue that post identifies with overlayfs is that it can't deduplicate the same file present in multiple layers, right? Would there be any advantage to composefs compared to overlayfs with files split into separate layers? Would it scale better for large numbers of layers? |
I once saw an overlayfs maintainer describe it as a "just in time" That said, I suspect the vast majority of images and use cases would be completely fine seeing a higher |
Isn't another |
Well...containers (by default) know which version of the kernel the host is running, which I would say is far more security sensitive. But OTOH, generalizing this into leaking any arbitrary file (executable/shared library) does seems like a potentially valid concern. |
Not sure if comments about sharing layers between the host and container image layers belong on a separate issue, but if that was possible would Wondering since we'll be using |
I think that is probably a separate issue. It's certainly related to this, but it would be a rather profound change from the current system architecture. One thing I do want to make more erogonomic though is pulling from to |
Prep for ostreedev#69 where we'll split up the input ostree commit into content-addressed blobs. We want to inject something useful into the the `history` in the config git that describes each chunk, so add support for that into our OCI writer. Change the default description for the (currently single) layer to include the commit subject, if present; otherwise the commit hash. The description of the layer shouldn't change as this tool changes, just as the input changes. (Side note; today rpm-ostree isn't adding a subject description, but hey, maybe someone else is)
This came out of some prep work on ostreedev#69 Right now it's confusing, the layering code ended up re-implementing the "fetch and unpack tarball" logic from the unencapsulation path unnecessarily. I think it's much clearer if the layering path just calls down into the unencapsulation path first. Among other things this will also ensure we're honoring the image verification string.
This came out of some prep work on ostreedev#69 Right now it's confusing, the layering code ended up re-implementing the "fetch and unpack tarball" logic from the unencapsulation path unnecessarily. I think it's much clearer if the layering path just calls down into the unencapsulation path first. Among other things this will also ensure we're honoring the image verification string.
Closes: ostreedev#69 This is initial basic support for splitting files (objects) from a commit into separate container image layers, and reassembling those layers into a commit on the client. We retain our present logic around e.g. GPG signature verification. There's a new `chunking.rs` file which has logic to automatically factor out things like the kernel/initramfs and large files. In order to fetch these images client side, we now heavily intermix/cross the previous code for fetching non-ostree layers.
Closes: ostreedev#69 This is initial basic support for splitting files (objects) from a commit into separate container image layers, and reassembling those layers into a commit on the client. We retain our present logic around e.g. GPG signature verification. There's a new `chunking.rs` file which has logic to automatically factor out things like the kernel/initramfs and large files. In order to fetch these images client side, we now heavily intermix/cross the previous code for fetching non-ostree layers.
Closes: ostreedev#69 This is initial basic support for splitting files (objects) from a commit into separate container image layers, and reassembling those layers into a commit on the client. We retain our present logic around e.g. GPG signature verification. There's a new `chunking.rs` file which has logic to automatically factor out things like the kernel/initramfs and large files. In order to fetch these images client side, we now heavily intermix/cross the previous code for fetching non-ostree layers.
In ostree we aim to provide generic mechanisms that can be consumed by any package or build system. Hence we often use the term "component" instead of "package". This new `objectsource` module is an abstraction over basic metadata for a component/package, currently name, identifier, and last change time. This will be used for splitting up a single ostree commit back into "chunks" or container image layers, grouping objects that come from the same component together. ostreedev#69
Closes: ostreedev#69 This is initial basic support for splitting files (objects) from a commit into separate container image layers, and reassembling those layers into a commit on the client. We retain our present logic around e.g. GPG signature verification. There's a new `chunking.rs` file which has logic to automatically factor out things like the kernel/initramfs and large files. In order to fetch these images client side, we now heavily intermix/cross the previous code for fetching non-ostree layers.
Closes: ostreedev#69 This is initial basic support for splitting files (objects) from a commit into separate container image layers, and reassembling those layers into a commit on the client. We retain our present logic around e.g. GPG signature verification. There's a new `chunking.rs` file which has logic to automatically factor out things like the kernel/initramfs and large files. In order to fetch these images client side, we now heavily intermix/cross the previous code for fetching non-ostree layers.
Closes: ostreedev#69 This is initial basic support for splitting files (objects) from a commit into separate container image layers, and reassembling those layers into a commit on the client. We retain our present logic around e.g. GPG signature verification. There's a new `chunking.rs` file which has logic to automatically factor out things like the kernel/initramfs and large files. In order to fetch these images client side, we now heavily intermix/cross the previous code for fetching non-ostree layers.
Closes: ostreedev#69 This is initial basic support for splitting files (objects) from a commit into separate container image layers, and reassembling those layers into a commit on the client. We retain our present logic around e.g. GPG signature verification. There's a new `chunking.rs` file which has logic to automatically factor out things like the kernel/initramfs and large files. In order to fetch these images client side, we now heavily intermix/cross the previous code for fetching non-ostree layers.
Closes: ostreedev#69 This is initial basic support for splitting files (objects) from a commit into separate container image layers, and reassembling those layers into a commit on the client. We retain our present logic around e.g. GPG signature verification. There's a new `chunking.rs` file which has logic to automatically factor out things like the kernel/initramfs and large files. In order to fetch these images client side, we now heavily intermix/cross the previous code for fetching non-ostree layers.
Closes: ostreedev#69 This is initial basic support for splitting files (objects) from a commit into separate container image layers, and reassembling those layers into a commit on the client. We retain our present logic around e.g. GPG signature verification. There's a new `chunking.rs` file which has logic to automatically factor out things like the kernel/initramfs and large files. In order to fetch these images client side, we now heavily intermix/cross the previous code for fetching non-ostree layers.
Closes: ostreedev#69 This is initial basic support for splitting files (objects) from a commit into separate container image layers, and reassembling those layers into a commit on the client. We retain our present logic around e.g. GPG signature verification. There's a new `chunking.rs` file which has logic to automatically factor out things like the kernel/initramfs and large files. In order to fetch these images client side, we now heavily intermix/cross the previous code for fetching non-ostree layers.
Closes: ostreedev#69 This is initial basic support for splitting files (objects) from a commit into separate container image layers, and reassembling those layers into a commit on the client. We retain our present logic around e.g. GPG signature verification. There's a new `chunking.rs` file which has logic to automatically factor out things like the kernel/initramfs and large files. In order to fetch these images client side, we now heavily intermix/cross the previous code for fetching non-ostree layers.
Closes: ostreedev#69 This is initial basic support for splitting files (objects) from a commit into separate container image layers, and reassembling those layers into a commit on the client. We retain our present logic around e.g. GPG signature verification. There's a new `chunking.rs` file which has logic to automatically factor out things like the kernel/initramfs and large files. In order to fetch these images client side, we now heavily intermix/cross the previous code for fetching non-ostree layers.
Closes: ostreedev#69 This is initial basic support for splitting files (objects) from a commit into separate container image layers, and reassembling those layers into a commit on the client. We retain our present logic around e.g. GPG signature verification. There's a new `chunking.rs` file which has logic to automatically factor out things like the kernel/initramfs and large files. In order to fetch these images client side, we now heavily intermix/cross the previous code for fetching non-ostree layers.
Closes: ostreedev#69 This is initial basic support for splitting files (objects) from a commit into separate container image layers, and reassembling those layers into a commit on the client. We retain our present logic around e.g. GPG signature verification. There's a new `chunking.rs` file which has logic to automatically factor out things like the kernel/initramfs and large files. In order to fetch these images client side, we now heavily intermix/cross the previous code for fetching non-ostree layers.
Closes: ostreedev#69 This is initial basic support for splitting files (objects) from a commit into separate container image layers, and reassembling those layers into a commit on the client. We retain our present logic around e.g. GPG signature verification. There's a new `chunking.rs` file which has logic to automatically factor out things like the kernel/initramfs and large files. In order to fetch these images client side, we now heavily intermix/cross the previous code for fetching non-ostree layers.
Closes: ostreedev#69 This is initial basic support for splitting files (objects) from a commit into separate container image layers, and reassembling those layers into a commit on the client. We retain our present logic around e.g. GPG signature verification. There's a new `chunking.rs` file which has logic to automatically factor out things like the kernel/initramfs and large files. In order to fetch these images client side, we now heavily intermix/cross the previous code for fetching non-ostree layers.
Closes: ostreedev#69 This is initial basic support for splitting files (objects) from a commit into separate container image layers, and reassembling those layers into a commit on the client. We retain our present logic around e.g. GPG signature verification. There's a new `chunking.rs` file which has logic to automatically factor out things like the kernel/initramfs and large files. In order to fetch these images client side, we now heavily intermix/cross the previous code for fetching non-ostree layers.
Closes: ostreedev#69 This is initial basic support for splitting files (objects) from a commit into separate container image layers, and reassembling those layers into a commit on the client. We retain our present logic around e.g. GPG signature verification. There's a new `chunking.rs` file which has logic to automatically factor out things like the kernel/initramfs and large files. In order to fetch these images client side, we now heavily intermix/cross the previous code for fetching non-ostree layers.
Closes: ostreedev#69 This is initial basic support for splitting files (objects) from a commit into separate container image layers, and reassembling those layers into a commit on the client. We retain our present logic around e.g. GPG signature verification. There's a new `chunking.rs` file which has logic to automatically factor out things like the kernel/initramfs and large files. In order to fetch these images client side, we now heavily intermix/cross the previous code for fetching non-ostree layers.
Optimizing ostree-native containers into layers
This project started by creating a simple and straightforward mechanism to bidirectionally map between an ostree commit object and an OCI/Docker container - we serialize the whole commit/image into a single tarball, which is then wrapped as a container layer.
In other words, we generate a base image with one big layer. This was simple and straightforward to implement, relatively speaking. (Note that derived containers naturally do have distinct layers. But this issue is about the "base image".)
The core problem is that any change to the base image (e.g. kernel security update) requires each client to redownload the entire base image which could be 1GB or more. See e.g. this spreadsheet for Fedora CoreOS content size.
Prior art and related projects
See https://grahamc.com/blog/nix-and-layered-docker-images for a Nix based build system that knows how to intelligently generate a container image with multiple layers.
There is also work in the baseline container ecosystem for optimizing transport of containers.
Copying in bits of this comment:
estargz has some design points worth looking at, but largely speaking I think few people want to have their base operating system lazily fetched. (Although, there is clearly an intersection with estargz's per-file digests and ostree)
Proposed initial solution: support for splitting layers
I think right now, we cannot rely on any container-level deltas. As a baseline, we should support splitting ostree commits into distinct layers, because that works with every container runtime and registry today. Further, splitting layers optimizes both pulls and pushes.
In addition, split layers means that any other "per layer" deltas actually just work more efficiently and better, whether that's zstd-chunked or layer-bsdiff.
Aside: today with Dockerfile, one cannot easily do this. Discussion: containers/podman#12605
Implementing split layers
There is an initial PR for this here: #123
Conceptually there are two parts:
Generating split layers
There are multiple competing things we need to optimize. A first theoretically obvious thing to do is have a single layer per e.g. RPM/deb package. But we can't do this because there are simply too many packages and container limits are around 100. See the above nix-related blog.
And even if we wanted to go close to a theroetical limit of 100, because we want to support people generating derived images, we must reserve space for that. At least 30 layers. Conservatively, let's say we shouldn't go beyond 50 layers.
In the general case, because ostree is a low-level library, we need to support higher level software telling us how to split things - or at a minimum, which files are likely to change together.
Principles/constraints:
The "over time" problem
We should also be robust to changes in the content set itself. For example, if a component which was previously small grew large, we want to avoid that change "cascading" into affecting how we change multiple layers. Another way to say this is that any major change in how we chunk things also implies clients will need to redownload many layers.
Initial proposal
In the initial code, ostree has some hardcoded knowledge of things like
/usr/lib/firmware
from linux-firmware, which is by far the largest single chunk of the OS. It is large, and while it doesn't change too often, it clearly makes sense to chunk by itself.The current auto-chunking logic also handles the kernel/initramfs in
/usr/lib/modules
.Past that, we have code which then tries to cherry pick large files (as a percentage of the total remaining size) which captures things like large statically linked Go binaries.
Supporting injected mappings
What would clearly aid this is support for having e.g. rpm-ostree inject the mapping between file paths and RPM name - or really most generically, assign a stable identifier to a set of file paths. Also supporting some sort of indication of relative change frequency would be useful.
Something like:
Some interesting subtleties here around e.g. "what happens if a file from different packages/components was deduplicated by ostree to one object?". Probably it needs to be promoted to the source with the greatest change frequency.
Current status
See https://quay.io/repository/cgwalters/fcos-chunked for an image that was generated this way. Specifically, try viewing the manifest and you can see the separate chunks - for example, there's a ~220MB chunk just for the kernel. If a kernel security update happens, you just download that chunk and the rpm database chunk.
The text was updated successfully, but these errors were encountered: