A way to output just a diff of ./target
#76
Comments
There's definitely room for improvement here! The current implementation was easy to get up and running without making it too hard to juggle all the artifacts. Perhaps we can try something where we symlink the artifacts to earlier ones instead of copying them over completely after decompressing. Not sure if that will play nicely with cargo, but it's worth the experiment.
Yes. :) I was thinking this will be particularly worthwhile for projects that have large workspaces and instead of all-deps + whole-workspace builds, want all-deps + app1 + app2 + app3.
Smart. Worth giving a try. This works as long as
@dpc I've opened #150 to use artifact symlinking by default which should bring down space usage in the Nix store when lots of derivations are chained!
Unfortunately it does look like rustc gets spooked if the artifacts are symlinks to read-only files (it doesn't try to unlink the files first) so we're still forced to fully copy the artifacts in the build directory, but at least we can dedup them in the Nix store.
Sweet. I'll give it a try later today.
Well... I'm afraid I don't have good news. It seems that in the debug build, which we are now using by default, everything got significantly larger and takes longer. The lack of compression seems like the biggest issue. Before, the dependencies would build to:
and the workspace itself to:
Now the dependencies are:
and the workspace build:
I've checked and On top of it some parts seem to take a long, long while now, like:
and the For reference I'm trying
Release build metrics Release Before: Deps:
Workspace:
After: Deps:
Workspace:
A bit better, but still rather bad. :)
@ipetkov Some thoughts: Compression of the artifacts with zstd is too good to give up. That probably pushes towards the "layers" approach, where instead of symlinking particular files, we would link to the "previous-layer.zstd" (possibly daisy chained). I'm not sure what is being used for deduplication now, but it seems slow AFAICT (didn't measure, but it looked like between 1 and 3 minutes).
I wonder if both the tarball and the zstd compression are finding ways to dedup common parts across all files. The current deduping strategy only dedups if the same file exists with the same contents as the previous build and that's pretty much it. When I was thinking about using symlinking instead of compressing I kept rejecting the approach of combining the two together because compressing after deduping would hide the symlinks from Nix, so that it won't be able to automatically track what outputs are chained to what. Except I totally didn't consider the fact that we can drop a "manifest file" which contains the paths to any previously built artifacts. I think that may give us a best-of-both-worlds approach where we can first dedup files via symlink where we can, then pack and compress the results! I'll work on implementing the idea above, but in the meantime, feel free to put
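One way to picture the "manifest file" idea above is a small plain-text pointer stored next to each set of artifacts, naming the artifacts it was built on. The sketch below is only an illustration under that assumption: the file name `previous-layer` and the function `resolve_chain` are hypothetical and not part of crane. It walks the pointers back to the base and returns the layers in the order they would need to be unpacked.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical pointer file name; nothing in crane defines it.
const POINTER: &str = "previous-layer";

/// Follows pointer files from `last` back to the base layer and returns
/// the layers ordered base first, so they can be unpacked in sequence.
pub fn resolve_chain(last: &Path) -> io::Result<Vec<PathBuf>> {
    let mut chain = vec![last.to_path_buf()];
    loop {
        let pointer = chain.last().unwrap().join(POINTER);
        match fs::read_to_string(&pointer) {
            Ok(prev) => {
                let prev = PathBuf::from(prev.trim());
                // Guard against a cycle, which would otherwise loop forever.
                if chain.contains(&prev) {
                    return Err(io::Error::new(io::ErrorKind::InvalidData, "layer cycle"));
                }
                chain.push(prev);
            }
            // No pointer file means this layer is the base.
            Err(e) if e.kind() == io::ErrorKind::NotFound => break,
            Err(e) => return Err(e),
        }
    }
    chain.reverse();
    Ok(chain)
}
```

Because the pointer is uncompressed text, the path it contains stays visible as a plain string in the output, which is the property the manifest idea relies on. Whether that is enough for Nix to track the chain in practice would need to be tested.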
I'm confused. "tarball" is "zstd compressed". Aren't these two the same thing?
I'm confused about everything here. In my imagination the output of a workspace build would be:
If another package were to build on top of this incrementally, crane would: notice there's As for deduplication: After extracting
We do both here.
Heh, sorry just thinking out loud. Actually that idea I wrote up might not fully work since we don't have individual files to link to so it maybe needs more thought with a fresh head on my part 😉
That's an interesting idea! I'll do some experimentation
Thought: absent other ideas, a simple Rust program:
that uses The program could have deep vs mtime modes, and use mtime and stats optimizations even when in byte-for-byte (deep) comparison to make it fast, and zstd under the hood shouldn't have any overhead compared to using the zstd binary, AFAICT.
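A minimal sketch of what the comparison step of such a program could look like, using only the standard library. It makes no claim about what was eventually implemented: `CompareMode` and `file_changed` are hypothetical names. It shows the two modes mentioned above, with a cheap size check used as a shortcut even in deep mode.

```rust
use std::fs::{self, File};
use std::io::{self, Read};
use std::path::Path;

/// How strictly two files are compared (hypothetical names, not crane's API).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum CompareMode {
    /// Trust metadata: same length and same modification time means "unchanged".
    Mtime,
    /// Compare contents byte for byte, using metadata only to bail out early.
    Deep,
}

/// Returns true if `new` should be treated as changed relative to `old`.
pub fn file_changed(old: &Path, new: &Path, mode: CompareMode) -> io::Result<bool> {
    let old_meta = match fs::metadata(old) {
        Ok(m) => m,
        // A file missing from the previous artifacts is new by definition.
        Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(true),
        Err(e) => return Err(e),
    };
    let new_meta = fs::metadata(new)?;

    // Cheap check that is valid in both modes: different sizes never match.
    if old_meta.len() != new_meta.len() {
        return Ok(true);
    }

    match mode {
        CompareMode::Mtime => Ok(old_meta.modified()? != new_meta.modified()?),
        CompareMode::Deep => {
            // Sizes are equal here, so zipping the two byte streams covers both files.
            let a = io::BufReader::new(File::open(old)?).bytes();
            let b = io::BufReader::new(File::open(new)?).bytes();
            for (x, y) in a.zip(b) {
                if x? != y? {
                    return Ok(true);
                }
            }
            Ok(false)
        }
    }
}
```

The mtime mode only helps where timestamps are preserved between the previous artifacts and the new build, so deep mode is the safer default when that is in doubt.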
Implemented in #398!
I was admiring the results of the `crane`-based CI pipeline in Fedimint: https://github.com/fedimint/fedimint/actions/runs/2853966200 . After adding all the proper `src` filtering, when no source files were modified, nothing gets rebuilt - not even tests are being re-run (since they've already passed successfully, there's no point in re-running them).
Which is frankly super awesome, and thank you so much for working on `crane`! I can't overstate how well it works and how helpful you are. :)
However, as you can see, the build still takes ~2m, and it's all downloading stuff from cachix. There is something off there - as 2GB of stuff shouldn't take 2 minutes to download, IMO, but that's something I'm going to investigate on my own.
I started looking into why it is 2GB of data, and I realized that part of the reason is that the build contains two `target/` outputs:
The 420MB one is the `./target` after building the dependencies only, and the 490MB one is the full workspace build.
And it made me realize - 80% of that 490MB is redundant, isn't it? It is exactly the same data that the 420MB one contains.
So I wonder - if `crane` could somehow store in `$out` only the files of `./target` that differ from the input `cargoArtifacts`, then steps that need that output could just restore the first version, then the diff, and get the same data without storing anything twice.
The diffing itself might be slightly slower, but I think it will be more than made up for by the storage savings, in particular if network transfers are involved. And given that in the cloud CPU power is relatively cheap, yet storage is expensive, it could be a big optimization.
This could possibly be all optional. The step producing `./target` would have some `doInstallCargoArtifactsDiffOnly = true;`, and then downstream users would do `cargoArtifacts = [ workspaceDeps workspaceFull ];`. I wonder if it's possible to somehow write the reference to the base `cargoArtifacts` along the diff-only `$out`, so that the users don't even need to specify the `[ base diff1 diff2 ... ]` list, but I kind of doubt it (unless there's some Nix magic that I'm not aware of).
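To make the proposal concrete, here is a rough sketch of the two halves: writing only the changed files, and later restoring a full `./target` from a `[ base diff1 diff2 ... ]` list. This is an illustration only, not how crane works; `write_diff` and `restore` are hypothetical names, the comparison is a plain byte-for-byte read, and files deleted since the base are not tracked.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Copies into `out` only the files under `target` that are missing from
/// `base` or whose contents differ. Returns the number of files written.
pub fn write_diff(base: &Path, target: &Path, out: &Path) -> io::Result<usize> {
    // Directories are always recreated so `out` mirrors the layout of `target`.
    fs::create_dir_all(out)?;
    let mut written = 0;
    for entry in fs::read_dir(target)? {
        let entry = entry?;
        let name = entry.file_name();
        let (b, t, o) = (base.join(&name), entry.path(), out.join(&name));
        if entry.file_type()?.is_dir() {
            written += write_diff(&b, &t, &o)?;
        } else {
            // Any failure to read the base copy (for example, it does not
            // exist) is treated as "changed", so the file is kept.
            let unchanged = match fs::read(&b) {
                Ok(old) => old == fs::read(&t)?,
                Err(_) => false,
            };
            if !unchanged {
                fs::copy(&t, &o)?;
                written += 1;
            }
        }
    }
    Ok(written)
}

/// Rebuilds a full directory by applying layers in order (base first, then diffs).
/// Later layers overwrite files from earlier ones.
pub fn restore(layers: &[&Path], dest: &Path) -> io::Result<()> {
    for layer in layers {
        copy_tree(layer, dest)?;
    }
    Ok(())
}

fn copy_tree(src: &Path, dest: &Path) -> io::Result<()> {
    fs::create_dir_all(dest)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let to = dest.join(entry.file_name());
        if entry.file_type()?.is_dir() {
            copy_tree(&entry.path(), &to)?;
        } else {
            // Note: `fs::copy` also copies permissions, so read-only inputs
            // may need to be made writable before a later layer overwrites them.
            fs::copy(entry.path(), &to)?;
        }
    }
    Ok(())
}
```

With the numbers from this issue, the diff layer would hold roughly the part of the 490MB workspace output that is not already in the 420MB dependency output, at the cost of one extra comparison pass over `./target`.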