Parallel extraction #165

horahh · 2023-09-09T07:20:28Z

I was looking to see whether the library supports any multithreading. It would be cool, but I did not see anything related. Has there been any thought of that?
I'm particularly thinking of something like par_iter to iterate over zip contents.

NobodyXu · 2023-09-09T08:01:08Z

It definitely can be done, but there's one catch: Since the reader is shared among all zip entries, calling seek would also affect other entries reading from reader.

The only ergonomic solutions I can think of are:

either zip requires read_at for the reader, which is already implemented for File,
or it requires the reader to implement Clone, e.g. by storing the File inside an Arc and keeping track of the current cursor position in the reader,
or it requires the reader to implement TryClone: fn try_clone(&self) -> io::Result<Self>, though this makes it harder to pass a trait object.
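
To make the first two options concrete, here is a minimal sketch that combines them: a cloneable reader that shares one File through an Arc and keeps its own cursor, reading via read_at so the OS-level cursor is never touched. SharedReader is a hypothetical name, not something the zip crate provides, and the sketch is Unix-only because it relies on std::os::unix::fs::FileExt.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::os::unix::fs::FileExt; // provides File::read_at (Unix only)
use std::sync::Arc;

/// Hypothetical reader: one shared file plus a private cursor per clone.
#[derive(Clone)]
struct SharedReader {
    file: Arc<File>,
    pos: u64,
}

impl SharedReader {
    fn new(file: File) -> Self {
        Self { file: Arc::new(file), pos: 0 }
    }
}

impl Read for SharedReader {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // read_at takes an explicit offset and does not move the file's own cursor,
        // so clones cannot disturb each other.
        let n = self.file.read_at(buf, self.pos)?;
        self.pos += n as u64;
        Ok(n)
    }
}

impl Seek for SharedReader {
    fn seek(&mut self, from: SeekFrom) -> io::Result<u64> {
        // Seeking only updates the private cursor of this clone.
        let new_pos = match from {
            SeekFrom::Start(p) => Some(p),
            SeekFrom::Current(d) => self.pos.checked_add_signed(d),
            SeekFrom::End(d) => self.file.metadata()?.len().checked_add_signed(d),
        };
        self.pos = new_pos.ok_or_else(|| {
            io::Error::new(io::ErrorKind::InvalidInput, "seek out of range")
        })?;
        Ok(self.pos)
    }
}
```

Each clone can then seek and read independently, which is the property a per-thread archive handle would need. Whether the library should require this of its reader is exactly the open design question above.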

horahh · 2023-09-14T17:14:57Z

Also, I'm wondering: is there any reason the files in the zip are traversed with a count rather than an iterator?

NobodyXu · 2023-09-14T21:55:52Z

It definitely can implement Iterator; it's probably just that the maintainers have not had time to add that themselves.

cosmicexplorer · 2023-09-18T03:40:06Z

I want to note that with the merge technique from zip-rs/zip-old#401 it becomes possible to parallelize zip creation as well as extraction. I prototyped this in a separate library, medusa-zip, but if parallel zip creation is of general enough utility I would love to contribute it to this library as well!

cosmicexplorer · 2023-09-18T03:42:33Z

The medusa-zip prototype library is async, but I assume we would want to convert the parallel zip technique into idiomatic blocking Rust for this library.

cosmicexplorer · 2023-09-18T03:55:00Z

@NobodyXu:

It definitely can be done, but there's one catch: Since the reader is shared among all zip entries, calling seek would also affect other entries reading from reader.

Thanks for this explanation! Could we also perhaps employ parallelism by pipelining the decompression/decryption of each entry though and only take a single pass over the file?
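
As a rough illustration of the pipelining idea (not how the zip crate works today), the sketch below reads entries in a single sequential pass on one thread and hands the raw bytes to worker threads over a channel. The decode function is a hypothetical stand-in for real decompression/decryption, and the entries are plain byte vectors rather than real zip entries.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for decompression/decryption: it just uppercases ASCII.
fn decode(raw: &[u8]) -> Vec<u8> {
    raw.to_ascii_uppercase()
}

/// Feeds entries in order from the calling thread and decodes them on `workers` threads.
/// Returns the decoded entries in their original order.
fn pipelined_decode(entries: Vec<Vec<u8>>, workers: usize) -> Vec<Vec<u8>> {
    let total = entries.len();
    let (raw_tx, raw_rx) = mpsc::channel::<(usize, Vec<u8>)>();
    let (out_tx, out_rx) = mpsc::channel::<(usize, Vec<u8>)>();
    // std's Receiver is single-consumer, so the workers share it behind a mutex.
    let raw_rx = Arc::new(Mutex::new(raw_rx));

    let handles: Vec<_> = (0..workers.max(1))
        .map(|_| {
            let raw_rx = Arc::clone(&raw_rx);
            let out_tx = out_tx.clone();
            thread::spawn(move || loop {
                // The lock is held only while receiving, not while decoding.
                let job = raw_rx.lock().unwrap().recv();
                match job {
                    Ok((i, raw)) => out_tx.send((i, decode(&raw))).unwrap(),
                    Err(_) => break, // sender dropped: no more entries
                }
            })
        })
        .collect();
    drop(out_tx); // only the workers hold result senders now

    // The single sequential pass over the "archive".
    for (i, raw) in entries.into_iter().enumerate() {
        raw_tx.send((i, raw)).unwrap();
    }
    drop(raw_tx); // signals the workers to stop once the queue drains

    let mut decoded: Vec<Vec<u8>> = vec![Vec::new(); total];
    for (i, bytes) in out_rx {
        decoded[i] = bytes;
    }
    for handle in handles {
        handle.join().unwrap();
    }
    decoded
}
```

The reader never seeks, so no shared-cursor problem arises; the cost is that every entry's raw bytes are buffered in memory before being decoded.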

NobodyXu · 2023-09-18T04:01:13Z

Could we also perhaps employ parallelism by pipelining the decompression/decryption of each entry though

I remember that decompression/decryption is done lazily, but you would need to read the source or ask @Plecra or @zamazan4ik to confirm this.

only take a single pass over the file?

Parsing a zip file requires reading both the head and the tail of the file. It can be done in a streaming way in which the entire file is read only once, but that's only useful when the zip is streamed over HTTP and unzipped as it arrives.

If the zip is stored on the filesystem, then seeking is quite cheap and should be OK, though a single sequential pass is better for caching.
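
To show why the tail matters: the end-of-central-directory record sits at the very end of a zip file and says where the list of entries lives. The sketch below scans backwards for its signature in an in-memory buffer and parses three fields. It is a simplified illustration, not the zip crate's parser: the names Eocd and find_eocd are hypothetical, and ZIP64 and multi-disk archives are ignored.

```rust
/// Signature of the end-of-central-directory record: "PK\x05\x06".
const EOCD_SIGNATURE: [u8; 4] = [0x50, 0x4b, 0x05, 0x06];
/// Size of the record without its optional trailing comment.
const EOCD_MIN_SIZE: usize = 22;

/// Fields of interest from the end-of-central-directory record (hypothetical type).
#[derive(Debug, PartialEq)]
struct Eocd {
    total_entries: u16,
    central_dir_size: u32,
    central_dir_offset: u32,
}

/// Scans backwards from the end of `data` for the record and parses it.
fn find_eocd(data: &[u8]) -> Option<Eocd> {
    if data.len() < EOCD_MIN_SIZE {
        return None;
    }
    let last_start = data.len() - EOCD_MIN_SIZE;
    (0..=last_start).rev().find_map(|start| {
        let rec = &data[start..];
        if !rec.starts_with(&EOCD_SIGNATURE) {
            return None;
        }
        // The comment length must account for exactly the remaining bytes,
        // otherwise this is a stray signature inside other data.
        let comment_len = u16::from_le_bytes([rec[20], rec[21]]) as usize;
        if rec.len() != EOCD_MIN_SIZE + comment_len {
            return None;
        }
        Some(Eocd {
            total_entries: u16::from_le_bytes([rec[10], rec[11]]),
            central_dir_size: u32::from_le_bytes([rec[12], rec[13], rec[14], rec[15]]),
            central_dir_offset: u32::from_le_bytes([rec[16], rec[17], rec[18], rec[19]]),
        })
    })
}
```

With a file on disk, this is one seek to the end plus one seek to central_dir_offset. A pure streaming reader cannot do that and has to rely on the local headers it meets along the way instead.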

cosmicexplorer · 2023-09-18T04:02:58Z

I will try to make a benchmark to see if the idea works!

cosmicexplorer · 2023-09-21T04:28:44Z

I have demonstrated one approach to improve extraction performance with rayon thread pools in zip-rs/zip-old#407. It has a few caveats, though, and for multiple reasons I am going to try a separate branch that adds the async-executor crate in order to convert some of this blocking work into async tasks.

cosmicexplorer · 2023-09-30T09:41:04Z

It took a lot of iteration, but I produced a prototype of an async API for ZipArchive, including an async extract() method. This was able to produce a performance improvement over sync extraction without needing to assume file handles are clonable or any of the shortcuts taken in zip-rs/zip-old#408. I'm convinced this is the right way to go for the library now, but I'm going to extract some of the code I had to write to make zip-rs/zip-old#409 work into a separate crate before proposing this as a real change (see TODO section in zip-rs/zip-old#409).

a1phyr · 2024-04-10T16:22:13Z

With the help of the sync_file crate, it is easy to use rayon to do parallel extraction.

With a 600 MB archive containing ~3000 files, I get the following results:

Using ZipArchive::extract:

$ time ./target/release/zip_data test.zip zip
real	0m0.994s
user	0m0.543s
sys	0m0.451s

Using rayon + sync_file in a function adapted from ZipArchive::extract:

$ time ./target/release/zip_data test.zip zip_par
real	0m0.255s
user	0m0.753s
sys	0m0.773s

This is about 4x faster on my i3-10300T (4 cores, 8 threads).

The parallel extract function:

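use std::{fs, io, path::Path};

use rayon::prelude::*; // brings into_par_iter and try_for_each_with into scope

fn extract_zip_par(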
    archive: zip::ZipArchive<sync_file::SyncFile>,
    directory: &Path,
) -> zip::result::ZipResult<()> {
    (0..archive.len())
        .into_par_iter()
        .try_for_each_with(archive, |archive, i| {
            let mut file = archive.by_index(i)?;
            let filepath = file
                .enclosed_name()
                .ok_or_else(|| zip::result::ZipError::InvalidArchive("Invalid file path"))?;

            let outpath = directory.join(filepath);

            if file.name().ends_with('/') {
                fs::create_dir_all(&outpath)?;
            } else {
                if let Some(p) = outpath.parent() {
                    if !p.exists() {
                        fs::create_dir_all(p)?;
                    }
                }
                let mut outfile = fs::File::create(&outpath)?;
                io::copy(&mut file, &mut outfile)?;
            }
            // Get and Set permissions
            #[cfg(unix)]
            {
                use std::os::unix::fs::PermissionsExt;
                if let Some(mode) = file.unix_mode() {
                    fs::set_permissions(&outpath, fs::Permissions::from_mode(mode))?;
                }
            }

            Ok::<_, zip::result::ZipError>(())
        })?;

    Ok(())
}

cosmicexplorer mentioned this issue Sep 21, 2023

Pipelined parallel extract zip-rs/zip-old#407

Closed

cosmicexplorer mentioned this issue Sep 30, 2023

prototype async API, with demonstrable perf improvements via benchmark zip-rs/zip-old#409

Closed

Pr0methean transferred this issue from zip-rs/zip-old Jun 2, 2024
