Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallel extraction #165

Open
horahh opened this issue Sep 9, 2023 · 11 comments
Open

Parallel extraction #165

horahh opened this issue Sep 9, 2023 · 11 comments

Comments

@horahh
Copy link

horahh commented Sep 9, 2023

I was looking if the library supports any multithreading, would be cool but did not see anything related, has there been any though of that?
particularly thinking in something like par_iter to iterate over zip contents

@NobodyXu
Copy link
Contributor

NobodyXu commented Sep 9, 2023

It definitely can be done, but there's one catch: Since the reader is shared among all zip entries, calling seek would also affect other entries reading from reader.

The only ergonomic solution I can think of is

  • either for zip to require read_at for reader, which is already implemented for File
  • or requires the reader to implement Clone by storing File to be stored inside an Arc and keep track of the curent location of cursor in the reader
  • or requires reader to implement TryClone: fn try_clone(&self) -> io::Result<Self> though this makes it harder to pass a trait object

@horahh
Copy link
Author

horahh commented Sep 14, 2023

also, wondering is there any reason the implementation to traverse the files in the zip file is done with a count rather than iterator?

@NobodyXu
Copy link
Contributor

It definitely can implement the Iterator, it's probably that the maintainers do not have time to add that themselves.

@cosmicexplorer
Copy link
Contributor

I want to note that with the merge technique from zip-rs/zip-old#401 it becomes possible to parallelize zip creation as well as extraction. I prototyped this in a separate library medusa-zip but if parallel zip creation is of general enough utility I would also love to contribute it to this library as well!

@cosmicexplorer
Copy link
Contributor

The medusa-zip prototype library is async, but I assume we would probably want to convert the parallel zip technique into idiomatic blocking rust for this library of course.

@cosmicexplorer
Copy link
Contributor

@NobodyXu:

It definitely can be done, but there's one catch: Since the reader is shared among all zip entries, calling seek would also affect other entries reading from reader.

Thanks for this explanation! Could we also perhaps employ parallelism by pipelining the decompression/decryption of each entry though and only take a single pass over the file?

@NobodyXu
Copy link
Contributor

Could we also perhaps employ parallelism by pipelining the decompression/decryption of each entry though

I remember that decompression/decryption is done lazily, but you would need to read the source of ask @Plecra or @zamazan4ik for this.

only take a single pass over the file?

Parsing zip file requires reading head and tail of the file, it can be done in a streaming way which the entire file is read only once, but it's only useful when streaming a http request and unzip it.

If you have the zip stores on fs, then seeking it is quite cheap and should be ok, though iterating once is better for caching.

@cosmicexplorer
Copy link
Contributor

I will try to make a benchmark to see if the idea works!

@cosmicexplorer
Copy link
Contributor

cosmicexplorer commented Sep 21, 2023

I have demonstrated one approach to improve extraction performance with rayon threadpools in zip-rs/zip-old#407, but it has a few caveats and for multiple reasons I am going to try a separate branch that adds the async-executor crate in order to convert some of this blocking work into async tasks.

@cosmicexplorer
Copy link
Contributor

It took a lot of iteration, but I produced a prototype of an async API for ZipArchive, including an async extract() method. This was able to produce a performance improvement over sync extraction without needing to assume file handles are clonable or any of the shortcuts taken in zip-rs/zip-old#408. I'm convinced this is the right way to go for the library now, but I'm going to extract some of the code I had to write to make zip-rs/zip-old#409 work into a separate crate before proposing this as a real change (see TODO section in zip-rs/zip-old#409).

@a1phyr
Copy link
Contributor

a1phyr commented Apr 10, 2024

With the help of crate sync_file, it is easy to use rayon to do parallel extraction.

With a 600 MB archive with ~3000 files, I get the following result:

Using ZipArchive::extract:

$ time ./target/release/zip_data test.zip zip
real	0m0.994s
user	0m0.543s
sys	0m0.451s

Using rayon + sync_file in a function adapted from ZipArchive::extract:

$ time ./target/release/zip_data test.zip zip_par
real	0m0.255s
user	0m0.753s
sys	0m0.773s

Which is about 4x faster on my i3-10300T (4 core - 8 threads)

The parallel extract function
fn extract_zip_par(
    archive: zip::ZipArchive<sync_file::SyncFile>,
    directory: &Path,
) -> zip::result::ZipResult<()> {
    (0..archive.len())
        .into_par_iter()
        .try_for_each_with(archive, |archive, i| {
            let mut file = archive.by_index(i)?;
            let filepath = file
                .enclosed_name()
                .ok_or_else(|| zip::result::ZipError::InvalidArchive("Invalid file path"))?;

            let outpath = directory.join(filepath);

            if file.name().ends_with('/') {
                fs::create_dir_all(&outpath)?;
            } else {
                if let Some(p) = outpath.parent() {
                    if !p.exists() {
                        fs::create_dir_all(p)?;
                    }
                }
                let mut outfile = fs::File::create(&outpath)?;
                io::copy(&mut file, &mut outfile)?;
            }
            // Get and Set permissions
            #[cfg(unix)]
            {
                use std::os::unix::fs::PermissionsExt;
                if let Some(mode) = file.unix_mode() {
                    fs::set_permissions(&outpath, fs::Permissions::from_mode(mode))?;
                }
            }

            Ok::<_, zip::result::ZipError>(())
        })?;

    Ok(())
}

@Pr0methean Pr0methean transferred this issue from zip-rs/zip-old Jun 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants