research area: parallel zip creation #2158
Comments
Have you explored …
No, thanks so much! Will be playing around with pex3 for a bit now...
I am confident that either of the methods you described above is a preferable solution for the use case I described. I will probably still play around with parallel zip creation just to see whether it demonstrates any significant perf improvement at all without sacrificing much compatibility. It would be neat to demonstrate an X% speedup if there were zero compatibility issues, and less useful if it produces a wonky zip. I would also like this to work on all platforms. Serializing a few thoughts: …
Given the above, "naively parallelizing" the reading & compressing of individual file entries from the chroot, then synchronizing on adding each compressed entry to the zip output file, seems like the natural first step (the output would then have a nondeterministic file entry ordering, but would hopefully not require extensive hacks to `zipfile.ZipFile`); see the sketch below. It might also be interesting to try "virtualizing" …

As of now, I do not expect this investigation to be extremely useful for the pex project, and will prioritize it accordingly. But it's a fun idea.
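A minimal sketch of that shape, assuming stdlib-only Python: workers do the expensive read + deflate (zlib releases the GIL while compressing, so threads genuinely overlap), and a single writer appends finished entries in completion order. It hand-rolls just enough of the zip format (no zip64, no real timestamps or permissions, no extra fields) to show where the synchronization point sits; nothing here is pex API.

```python
# Sketch only: parallel read + compress, serialized writes of finished entries.
import os
import struct
import zlib
from binascii import crc32
from concurrent.futures import ThreadPoolExecutor, as_completed

DOS_DATE, DOS_TIME = 0x21, 0  # 1980-01-01 00:00; real code would use each mtime

def compress_entry(root, relpath):
    # Runs on the pool: the expensive read + raw-deflate work.
    data = open(os.path.join(root, relpath), "rb").read()
    deflater = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate, as zip expects
    return relpath, crc32(data), len(data), deflater.compress(data) + deflater.flush()

def naive_parallel_zip(root, relpaths, out_path):
    central = []
    with open(out_path, "wb") as out, ThreadPoolExecutor() as pool:
        futures = [pool.submit(compress_entry, root, p) for p in relpaths]
        for fut in as_completed(futures):  # completion order: nondeterministic!
            name, crc, usize, cdata = fut.result()
            nb, offset = name.encode("utf-8"), out.tell()
            # Local file header: sig, version, flags, method=deflate, time, date,
            # crc, compressed size, uncompressed size, name len, extra len.
            out.write(struct.pack("<IHHHHHIIIHH", 0x04034B50, 20, 0, 8,
                                  DOS_TIME, DOS_DATE, crc, len(cdata), usize,
                                  len(nb), 0))
            out.write(nb)
            out.write(cdata)
            central.append((nb, crc, len(cdata), usize, offset))
        cd_start = out.tell()
        for nb, crc, csize, usize, offset in central:
            # Central directory record pointing back at each local header.
            out.write(struct.pack("<IHHHHHHIIIHHHHHII", 0x02014B50, 20, 20, 0, 8,
                                  DOS_TIME, DOS_DATE, crc, csize, usize,
                                  len(nb), 0, 0, 0, 0, 0, offset))
            out.write(nb)
        # End-of-central-directory record.
        out.write(struct.pack("<IHHHHIIH", 0x06054B50, 0, 0, len(central),
                              len(central), out.tell() - cd_start, cd_start, 0))
```

The result opens fine with `zipfile.ZipFile`, but the entry ordering, and hence the output hash, varies run to run — exactly the tradeoff noted above.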
Hmmmm: the particular parallelization/synchronization operations necessary for this make it seem like a good fit for pants, actually, especially since pants is much more able to control the build inputs and outputs than pex itself can afford to. I think I'll leave this open because I can still see potential for a perf improvement without introducing too many new code paths, for use cases where the output hash doesn't matter (particularly because pex provides a …).
Since this would only be used by the Pex CLI (build-time), a Rust implementation is actually how I'd approach this. It's the right language for the job, the problem really has nothing to do with Pex (it has much wider applicability), and there are great tools for turning a crate into an easily consumable, platform-specific Python distribution.
In terms of interest from Pex, the resulting zip would definitely need to be consumable as a vanilla zip via the `zipfile` module / the Python zipimport facility. In terms of feature flags, that's really a non-issue: Pex follows the rule of not breaking consumers, so no new feature is ever on by default unless it's totally transparent to the user.
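That bar is easy to smoke-test with stdlib consumers alone; a minimal check along these lines (not pex's actual test suite):

```python
# Sketch only: a produced zip must satisfy both stdlib readers to count as
# "vanilla consumable".
import zipfile
import zipimport

def assert_vanilla_consumable(path):
    with zipfile.ZipFile(path) as zf:
        assert zf.testzip() is None  # CRC-checks every entry
    # Parses the zip directory; raises ZipImportError if the layout is off.
    zipimport.zipimporter(path)
```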
This makes perfect sense, thanks! I totally hadn't considered this.
Created project at https://github.com/cosmicexplorer/medusa-zip, will update here if the prototype demonstrates any value. |
The project has two features: crawling for files in parallel, and converting a crawl list into a zip in parallel. Given that pex currently expects to have all the …, I'm probably going to try exposing these two features as subcommands and executing …
After deciding that asyncifying and parallelizing the read/compress process was likely to contribute the strongest performance improvement, I was able to avoid any modifications to the Rust …
Ok, so even just limiting the parallelism to reading source files + compressing their contents, we have produced quite a significant speedup. The current pex diff is at https://github.com/pantsbuild/pex/compare/main...cosmicexplorer:pex:medusa-zip?expand=1, and the pex command I'm testing it against went from 2m40s to 1m16s, which is 2.1x as fast (and that's the entire pex command, including resolution and preparing the chroot beforehand). As mentioned, I didn't get into any zip file hacks, so I would be surprised if this produced compatibility issues with …

TODO
I had assumed we would need to cut at the API surface of …
I think maybe you missed copy mode symlink. That's why the chroot setup is fast today.
Yes, but I was confused as to why it only checks for …
If you do a cursory search for LINK, I think you can avoid raising an issue; all three modes are handled, and LINK gives a hard link farm.
It's true that all three modes are handled, but it just seemed counterintuitive from a passing glance that …
I have been making more fun changes to zip file creation:
This might have been discouraging, but after diving into the zip format again earlier today, I believe that we can extend the current method (creating many in-memory single-file zips and copying over data) to produce chunked intermediate zips, which can be created in parallel, written to file, then copied over to the final output zip with what is essentially one massive byte-for-byte copy.

To reiterate: …
For pex's use case, we can even do one better, creating and caching an "intermediate zip" for each dist itself, so we can avoid traversing the directory hierarchy for dists at all and instead largely just copy bytes.

We still need to determine whether merging intermediate zips works the way I think it does, but I am pretty sure that we can essentially rip the central directory headers from the end of the source zip file, copy-paste the entire body of the source zip into the current zip, then append the source's central directory headers (with their entry offsets rebased) onto our existing central directory headers after writing the rest of the file; see the sketch after this paragraph. You'll note this can be formulated recursively, although I'm not sure whether that's helpful yet.

In order to do this, I'm going to implement: …
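A minimal sketch of that merge, assuming no zip64 and that each intermediate zip's entries run contiguously from offset 0 up to its central directory (true of zips we produce ourselves). It leans on `zipfile`'s undocumented `start_dir` attribute to find where the body ends, and drops per-entry extra fields and comments when rewriting the central directory:

```python
# Sketch only: concatenate the bodies of several zips, then write one combined
# central directory whose header offsets are rebased to the new positions.
import struct
import zipfile

def dos_time(dt):
    # Pack a (y, mo, d, h, mi, s) tuple MS-DOS style: (time, date).
    return (dt[3] << 11 | dt[4] << 5 | dt[5] // 2,
            (dt[0] - 1980) << 9 | dt[1] << 5 | dt[2])

def append_zip_body(out, src_path, entries):
    base = out.tell()
    with zipfile.ZipFile(src_path) as src:
        remaining = src.start_dir  # undocumented: offset of src's central directory
        src.fp.seek(0)
        while remaining:  # the "massive byte copy" of every local entry
            chunk = src.fp.read(min(1 << 20, remaining))
            out.write(chunk)
            remaining -= len(chunk)
        for info in src.infolist():
            info.header_offset += base  # rebase each entry to its new position
            entries.append(info)

def write_central_directory(out, entries):
    cd_start = out.tell()
    for info in entries:
        name = info.filename.encode("utf-8")
        tm, date = dos_time(info.date_time)
        out.write(struct.pack(
            "<IHHHHHHIIIHHHHHII", 0x02014B50,
            info.create_version, info.extract_version, info.flag_bits,
            info.compress_type, tm, date, info.CRC, info.compress_size,
            info.file_size, len(name), 0, 0, 0, info.internal_attr,
            info.external_attr, info.header_offset))
        out.write(name)
    out.write(struct.pack("<IHHHHIIH", 0x06054B50, 0, 0, len(entries),
                          len(entries), out.tell() - cd_start, cd_start, 0))

# Merging is then just (hypothetical input names):
entries = []
with open("merged.zip", "wb") as out:
    for part in ("chunk-0.zip", "chunk-1.zip", "chunk-2.zip"):
        append_zip_body(out, part, entries)
    write_central_directory(out, entries)
```

This is also where the recursive formulation falls out: the merged output is itself a valid input to the same procedure.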
Ok so that seems to work great!

Results
TODO: caching intermediate zips for 3rdparty dists

So this alone would be great (and would also minimize the changes to pex), but as detailed above, I would like to additionally make use of the zip merging breakthrough to add further caching to this process: …
With the caching of intermediate zip files for 3rdparty dists, I suspect that the final zip production can be made into a matter of several seconds, as it completely avoids even traversing the filesystem for the myriad 3rdparty dists after they are first installed and cached. For pex files with vendored code, or simply quite a lot of first-party code, we wouldn't try to perform any caching, but would simply make use of the normal faster parallel zipping process (which internally uses the zip merge operation).

Changes to pex code

Current pex changes are visible at main...cosmicexplorer:pex:medusa-zip, currently at +46/-18. As of right now, before implementing the caching of intermediate zips for 3rdparty dists, we only modify …
My goals with the caching of intermediate zips for 3rdparty dists are:
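To make the caching direction concrete, here is a minimal sketch of the per-dist flow. `build_intermediate_zip` is a stand-in (a plain `zipfile` walk here, where the real thing would shell out to the Rust tool), and the `.intermediate.zip` naming is illustrative, not pex's actual cache layout:

```python
# Sketch only: build each installed dist's intermediate zip once, publish it
# atomically beside the installed dist's directory, and reuse it thereafter.
import os
import zipfile

def build_intermediate_zip(dist_dir, out_path):
    # Stand-in for the Rust tool: zip up one installed dist's directory tree.
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _, files in os.walk(dist_dir):
            for f in files:
                full = os.path.join(root, f)
                zf.write(full, os.path.relpath(full, dist_dir))

def intermediate_zip_for(installed_wheel_dir):
    cached = installed_wheel_dir.rstrip(os.sep) + ".intermediate.zip"
    if not os.path.exists(cached):
        tmp = "{}.{}.tmp".format(cached, os.getpid())
        build_intermediate_zip(installed_wheel_dir, tmp)
        os.rename(tmp, cached)  # atomic publish keeps concurrent builds safe
    return cached
```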
Compatibility

If merging is performed in this way, all of the …
Ok, this all seems to work great, and the resulting zip is generated in several hundred milliseconds, which rocks. I've been transferring more functionality from `.zip_entry_for_file()`, like making sure the modification times are taken from the file on disk (https://github.com/cosmicexplorer/medusa-zip).

Really notably, I've actually switched the default mechanism for zipping up source files to run entirely synchronously, so I can focus on the compatibility part. The parallel approach is absolutely still useful, but the speedup I care about is coming from merging those intermediate zips for all 3rdparty dists.

Once the file perms and mtime are done correctly, I'm going to fix up the generation of intermediate zips for dists so that it's done as part of the parallel `InstallRequest` jobs (the caching is currently performed when `._add_dist_dir()` is called). (It would probably be better to make the generation of the intermediate zip into its own parallel job at some point, though.) After that, I think it'll be ready for review.

I'm not sure exactly how we would incorporate a Rust tool into the pex build; I looked at the great tooling we have for vendored code, but that seems very Python-specific. @jsirois what format would be most useful for pex to consume a Rust tool with? It works great as a subprocess.
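For reference, the stdlib version of carrying mtime and permission bits into an entry looks roughly like this (a sketch; pex's actual `.zip_entry_for_file()` differs in detail):

```python
# Sketch only: derive a ZipInfo whose timestamp and unix mode come from disk.
import os
import time
import zipfile

def zip_info_for(path, arcname):
    st = os.stat(path)
    # zipfile wants (year, month, day, hour, minute, second); note the
    # 2-second granularity and 1980 floor of DOS timestamps.
    info = zipfile.ZipInfo(arcname, date_time=time.localtime(st.st_mtime)[:6])
    info.external_attr = (st.st_mode & 0xFFFF) << 16  # unix perms in the high bits
    info.compress_type = zipfile.ZIP_DEFLATED
    return info

# Usage: ZipFile.writestr(zip_info_for(src, name), open(src, "rb").read())
```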
See #2175 for an initial implementation.
Problem
I was creating a pex file to resolve dependencies while playing around with fawkes (https://github.com/shawn-shan/fawkes), and like most modern ML projects, it contains many large binary dependencies. This meant that while resolution (with the 2020 pip resolver) was relatively fast, the creation of the zip itself took about a minute after that (without any progress indicator), which led me to prefer a venv when iterating on the fawkes source code.
Use Case Discussion
I understand pex's supported use cases revolve much more around robust, reproducible deployment scenarios, where taking a single minute to zip up a massive bundle of dependencies is more than acceptable, and where making use of the battle-tested stdlib `zipfile.ZipFile` is extremely important to ensure pex files can be executed as well as inspected on all platforms and by all applications. However, for use cases like the one described above, where the pex is going to be created strictly for the local platform, I think it would be really convenient to avoid having to set up a stateful venv.

Alternatives
I could just create a pex file for the dependencies, and use that to launch the python process that runs from source code, and indeed that is what we decided on to implement pantsbuild/pants#8793, which worked perfectly for the Twitter ML infra team's jupyter notebooks. But (assuming this is actually possible) I would still personally find a feature that zips up pex files much faster to be useful for a lot of "I really just wanna hack something together" scenarios where I don't necessarily want to have to set up a two-phase build process like that.
Implementation Strategy
After going through the proposal of pypa/pip#8448 to hack the zip format to minimize wheel downloads, which was implemented much more thoughtfully in pypa/pip#8467 as `LazyZipOverHTTP`, and then realizing I had overpromised the potential speedup at pypa/pip#7049 (comment), I am wary of just assuming that hacking around with the zip format will necessarily improve performance over the battle-tested synchronous stdlib implementation.

However, it seems plausible that the process of compressing individual entries could be parallelized. `pigz` describes its methodology at https://github.com/madler/pigz/blob/cb8a432c91a1dbaee896cd1ad90be62e5d82d452/pigz.c#L279-L340, and there is a codebase named `fastzip` that does this in Python using threads, with a great discussion of performance bottlenecks and a few notes on compatibility issues. The Apache commons library appears to have implemented this too with `ParallelScatterZipCreator`.
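As a sketch of the `pigz`-style idea in Python (minus pigz's trick of priming each chunk's dictionary with the previous 32 KB, which costs some compression ratio), independently deflated chunks can be stitched into one valid stream, because a full flush ends each piece on a byte boundary without a final-block marker:

```python
# Sketch only: deflate fixed-size chunks independently and stitch the results.
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 128 * 1024  # per-chunk work unit; pigz's default block size is also 128K

def _deflate_chunk(job):
    data, last = job
    co = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate, no zlib header
    out = co.compress(data)
    # Z_FULL_FLUSH ends the piece on a byte boundary with no final-block bit,
    # so pieces concatenate into one valid deflate stream; only the last
    # chunk is finished with Z_FINISH.
    return out + co.flush(zlib.Z_FINISH if last else zlib.Z_FULL_FLUSH)

def parallel_deflate(data):
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    jobs = [(c, i == len(chunks) - 1) for i, c in enumerate(chunks)]
    with ThreadPoolExecutor() as pool:  # zlib releases the GIL while compressing
        return b"".join(pool.map(_deflate_chunk, jobs))  # map preserves order

# zlib.decompress(parallel_deflate(blob), -15) == blob
```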
.End-User Interface
Due to the likely compatibility issues (both with executing the parallel method at all, as well as consuming the resulting zip file), it seems best to put this behind a flag, and probably to explicitly call it experimental (I like the way pip does `--use-feature=fast-deps`, for example), and possibly even to print out a warning to stderr when executing or processing any pex files created this way. To enable the warning message, or in case we want to do any other special-case processing of pex files created this way, we could put a key in `PEX-INFO`'s `.build_properties`, or perhaps add an empty sentinel file named `.parallel-zip` to the output.
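For the `PEX-INFO` option, consumption could be as simple as the following (`PEX-INFO` is JSON at the root of the pex zip; the `parallel_zip` key here is the hypothetical sentinel being proposed, not an existing pex field):

```python
# Sketch only: peek at the proposed build_properties marker without extracting.
import json
import zipfile

def was_parallel_zipped(pex_path):
    with zipfile.ZipFile(pex_path) as zf:
        pex_info = json.loads(zf.read("PEX-INFO"))
    return bool(pex_info.get("build_properties", {}).get("parallel_zip"))
```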