Parallel extraction #165
It definitely can be done, but there's one catch: since the… The only ergonomic solution I can think of is…
Also, I'm wondering: is there any reason the implementation traverses the files in the zip with a count rather than an iterator?
An Iterator could definitely be implemented; it's probably just that the maintainers haven't had time to add it themselves.
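One likely reason the library uses an index-based loop: `ZipArchive::by_index` hands out an entry that borrows the archive, so each item from a would-be iterator borrows the iterator itself, which `std::iter::Iterator` cannot express (a "lending iterator"). A minimal std-only sketch of the problem, with hypothetical `Archive`/`Entry` types standing in for the real ones:

```rust
// Hypothetical stand-ins for ZipArchive / ZipFile.
struct Archive {
    names: Vec<String>,
}

// An entry borrows the archive, as zip's ZipFile does.
struct Entry<'a> {
    name: &'a str,
}

impl Archive {
    fn len(&self) -> usize {
        self.names.len()
    }
    fn by_index(&mut self, i: usize) -> Option<Entry<'_>> {
        self.names.get(i).map(|n| Entry { name: n.as_str() })
    }
}

// A "lending iterator": next() returns an item tied to `&mut self`.
// This signature cannot implement std::iter::Iterator, because the
// associated Item type may not borrow from the iterator itself.
struct Entries<'a> {
    archive: &'a mut Archive,
    pos: usize,
}

impl<'a> Entries<'a> {
    fn next(&mut self) -> Option<Entry<'_>> {
        let i = self.pos;
        if i >= self.archive.len() {
            return None;
        }
        self.pos += 1;
        self.archive.by_index(i)
    }
}

fn main() {
    let mut a = Archive { names: vec!["a.txt".into(), "b.txt".into()] };
    let mut it = Entries { archive: &mut a, pos: 0 };
    while let Some(e) = it.next() {
        println!("{}", e.name);
    }
}
```

The index-based `for i in 0..archive.len()` loop sidesteps this entirely, at the cost of not composing with iterator adapters.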
I want to note that with the merge technique from zip-rs/zip-old#401 it becomes possible to parallelize zip creation as well as extraction. I prototyped this in a separate library.
Thanks for this explanation! Could we perhaps also employ parallelism by pipelining the decompression/decryption of each entry, while taking only a single pass over the file?
I remember that decompression/decryption is done lazily, but you would need to read the source or ask @Plecra or @zamazan4ik to confirm this.
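The pipelining idea above could look roughly like this: one thread takes the single sequential pass over the file and forwards raw entry bytes, while another thread decompresses them concurrently, connected by a channel. A std-only sketch in which `read_entries` and `decompress` are simulated stand-ins (a real version would read local file entries and feed the bytes to a DEFLATE decoder):

```rust
use std::sync::mpsc;
use std::thread;

// Simulated "compressed" entries: (name, raw bytes). In a real
// pipeline these would come from one sequential pass over the
// archive's local file entries.
fn read_entries() -> Vec<(String, Vec<u8>)> {
    vec![
        ("a.txt".into(), vec![1, 2, 3]),
        ("b.txt".into(), vec![4, 5]),
    ]
}

// Stand-in for decompression/decryption of one entry's bytes.
fn decompress(raw: &[u8]) -> Vec<u8> {
    raw.iter().map(|b| b.wrapping_mul(2)).collect()
}

fn main() {
    let (tx, rx) = mpsc::channel::<(String, Vec<u8>)>();

    // Stage 1: single pass over the file, sending raw bytes onward.
    let reader = thread::spawn(move || {
        for entry in read_entries() {
            tx.send(entry).unwrap();
        }
        // Dropping tx closes the channel, ending stage 2's loop.
    });

    // Stage 2: decompress while stage 1 keeps reading.
    for (name, raw) in rx {
        let data = decompress(&raw);
        println!("{}: {:?}", name, data);
    }

    reader.join().unwrap();
}
```

This overlaps I/O with CPU work even though the file is read strictly once, which is the property that matters for the streaming case discussed below.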
Parsing a zip file requires reading the head and tail of the file. It can be done in a streaming way where the entire file is read only once, but that's mainly useful when streaming an HTTP response and unzipping it on the fly. If the zip is stored on a filesystem, seeking is quite cheap and should be fine, though iterating only once is better for caching.
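For context, the "tail" here is the End of Central Directory (EOCD) record, which a seek-based reader locates by scanning backwards from the end of the file for the signature `0x06054b50`. A std-only sketch of that scan over an in-memory buffer:

```rust
// ZIP End of Central Directory (EOCD) signature, as it appears
// on disk in little-endian order: "PK\x05\x06".
const EOCD_SIG: [u8; 4] = [0x50, 0x4b, 0x05, 0x06];

// Scan backwards from the end of the buffer for the EOCD signature,
// as a seek-based zip reader does. The record sits near the file's
// tail, possibly followed by a comment of up to 65535 bytes, which
// is why a backwards scan (not a fixed offset) is needed.
fn find_eocd(data: &[u8]) -> Option<usize> {
    if data.len() < 4 {
        return None;
    }
    (0..=data.len() - 4)
        .rev()
        .find(|&i| data[i..i + 4] == EOCD_SIG)
}

fn main() {
    // Fake file: 100 payload bytes, then the EOCD signature and a
    // minimal record body (the full minimal EOCD is 22 bytes).
    let mut file = vec![0u8; 100];
    file.extend_from_slice(&EOCD_SIG);
    file.extend_from_slice(&[0u8; 18]);
    assert_eq!(find_eocd(&file), Some(100));
}
```

A streaming reader cannot seek to this record first, which is why it must instead walk the local file headers from the front as they arrive.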
I will try to make a benchmark to see if the idea works!
I have demonstrated one approach to improving extraction performance with rayon threadpools in zip-rs/zip-old#407, but it has a few caveats, and for multiple reasons I am going to try a separate branch that adds the…
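The general shape of a threadpool approach is to give each worker its own handle on the archive and split the entry indices between them. A std-only sketch using `std::thread::scope`, where `extract_entry` is a hypothetical stand-in for the per-entry work (a real version would give each worker its own `ZipArchive` clone and call `by_index(i)`):

```rust
use std::thread;

// Hypothetical per-entry work; returns a label instead of writing
// a file so the sketch stays self-contained.
fn extract_entry(i: usize) -> String {
    format!("entry {}", i)
}

// Split indices 0..num_entries across `workers` threads by striding,
// so worker w handles entries w, w + workers, w + 2*workers, ...
fn extract_parallel(num_entries: usize, workers: usize) -> Vec<String> {
    let mut results: Vec<Vec<String>> = Vec::new();
    thread::scope(|s| {
        let handles: Vec<_> = (0..workers)
            .map(|w| {
                s.spawn(move || {
                    (w..num_entries)
                        .step_by(workers)
                        .map(extract_entry)
                        .collect::<Vec<_>>()
                })
            })
            .collect();
        for h in handles {
            results.push(h.join().unwrap());
        }
    });
    results.into_iter().flatten().collect()
}

fn main() {
    let out = extract_parallel(8, 4);
    assert_eq!(out.len(), 8);
    println!("{:?}", out);
}
```

rayon's `into_par_iter` (used in the benchmark below) does this scheduling for you, with work stealing instead of a fixed stride.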
It took a lot of iteration, but I produced a prototype of an async API for…
With the help of the `sync_file` and `rayon` crates, on a 600 MB archive with ~3000 files, the parallel version is about 4x faster on my i3-10300T (4 cores / 8 threads). The parallel extract function:

```rust
use std::{fs, io, path::Path};

use rayon::prelude::*;

fn extract_zip_par(
    archive: zip::ZipArchive<sync_file::SyncFile>,
    directory: &Path,
) -> zip::result::ZipResult<()> {
    (0..archive.len())
        .into_par_iter()
        .try_for_each_with(archive, |archive, i| {
            let mut file = archive.by_index(i)?;
            let filepath = file
                .enclosed_name()
                .ok_or_else(|| zip::result::ZipError::InvalidArchive("Invalid file path"))?;
            let outpath = directory.join(filepath);
            if file.name().ends_with('/') {
                fs::create_dir_all(&outpath)?;
            } else {
                if let Some(p) = outpath.parent() {
                    if !p.exists() {
                        fs::create_dir_all(p)?;
                    }
                }
                let mut outfile = fs::File::create(&outpath)?;
                io::copy(&mut file, &mut outfile)?;
            }
            // Get and set unix permissions from the archive metadata.
            #[cfg(unix)]
            {
                use std::os::unix::fs::PermissionsExt;
                if let Some(mode) = file.unix_mode() {
                    fs::set_permissions(&outpath, fs::Permissions::from_mode(mode))?;
                }
            }
            Ok::<_, zip::result::ZipError>(())
        })?;
    Ok(())
}
```
I was looking to see if the library supports any multithreading; it would be cool, but I did not see anything related. Has there been any thought about that?
I'm particularly thinking of something like par_iter to iterate over the zip contents.