Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limited seeking backwards when reading a ZipArchive #162

Open
XLPhere opened this issue Nov 29, 2023 · 6 comments
Open

Limited seeking backwards when reading a ZipArchive #162

XLPhere opened this issue Nov 29, 2023 · 6 comments
Labels
help wanted Extra attention is needed

Comments

@XLPhere
Copy link

XLPhere commented Nov 29, 2023

This is about the fact that ZipArchive requires the reader to implement seeking, which makes sense in many cases, but seems to have the following issue:

In the case where I'm trying to extract a file from a stream of a zip-archive which does not support seeking, i would need to store everything we have read so far, so i can support seeking backwards. For larger files this can get rather impractical when it comes to memory usage.

Is there a way to extract a file from an archive without seeking backwards or with limited seeking backwards (maximum number of bytes we can go back)?

@Pr0methean
Copy link
Member

Not without buffering the whole compressed file, since the compression is independent per file but isn't random-access.

@0xCCF4
Copy link

0xCCF4 commented Jun 21, 2024

I'm currently working on a project and came over the same problem. I've a nested zip file within a compressed stream that does not support seeking. I can't store the whole compressed stream in RAM, since the files I am dealing with are too big. Then I came across the

zip2/src/read.rs

Lines 1594 to 1610 in 27c7fa4

/// Read ZipFile structures from a non-seekable reader.
///
/// This is an alternative method to read a zip file. If possible, use the ZipArchive functions
/// as some information will be missing when reading this manner.
///
/// Reads a file header from the start of the stream. Will return `Ok(Some(..))` if a file is
/// present at the start of the stream. Returns `Ok(None)` if the start of the central directory
/// is encountered. No more files should be read after this.
///
/// The Drop implementation of ZipFile ensures that the reader will be correctly positioned after
/// the structure is done.
///
/// Missing fields are:
/// * `comment`: set to an empty string
/// * `data_start`: set to 0
/// * `external_attributes`: `unix_mode()`: will return None
pub fn read_zipfile_from_stream<'a, R: Read>(reader: &'a mut R) -> ZipResult<Option<ZipFile<'_>>> {

function that accepts a non-seekable stream as input. The problem was that I am dealing with ZIP files that have a data descriptor set. As @Pr0methean mentioned safe way would be to buffer the whole compressed file since the actual size can be found in the data descriptor after the compressed file's content.

My next idea was to do just that: buffer the stream content, providing a seekable reader and search for the first zip entry. After that the buffer can be emptied and the next entry can be searched. I implemented this (without the buffering part) in #197 . See https://github.com/0xCCF4/zip2/blob/59f1327e72390813a422e1c7486818da536e9ad7/src/read.rs#L1695-L1728

Then I had the idea, maybe one could just start "proactively" decompressing the content while looking ahead for the data descriptor and if found stop streaming the content. This would have the risk of streaming garbage data when the data descriptor can not be found and reading over the file boundary; but would not require buffering the compressed file data/stream data.

@Pr0methean
Copy link
Member

Pr0methean commented Jun 21, 2024 via email

@0xCCF4
Copy link

0xCCF4 commented Jun 22, 2024

Yes that is what I tried but this function (also used by the unstable::ZipStreamReader) does not support data descriptors and will fail to read zip files using them (see attached test below).

read_zipfile_from_stream is relying on the zip local entry block before the compressed file contents to figure out the size of the following data. That's fine but inherently does not work if the size is not given in the header but only later in zip data descriptor for the file.

I am therefore working at a change to the unstable::ZipStreamReader to also support those zip files.

#[test]
fn test_stream() {
    let mut v = Vec::new();
    v.extend_from_slice(include_bytes!("data/data_descriptor.zip"));
    let stream = Box::new(io::Cursor::new(v)) as Box<dyn Read>;

    let reader = ZipStreamReader::new(stream);

    struct TestVisitor {}

    impl ZipStreamVisitor for TestVisitor {
        fn visit_file(&mut self, file: &mut ZipFile<'_>) -> ZipResult<()> {
            println!("File: {}", file.name());
            let mut buffer = Vec::new();
            file.read_to_end(&mut buffer)?;
            match String::from_utf8(buffer) {
                Ok(text) => println!(" > {}", text),
                Err(_) => {},
            }
            Ok(())
        }

        fn visit_additional_metadata(&mut self, metadata: &ZipStreamFileMetadata) -> ZipResult<()> {
            println!("Metdata: {:?}", metadata);
            Ok(())
        }
    }

    let mut visitor = TestVisitor {};
    reader.visit(&mut visitor).expect("error");
}

@Pr0methean Pr0methean added the help wanted Extra attention is needed label Jun 23, 2024
@0xCCF4
Copy link

0xCCF4 commented Jun 23, 2024

I probably need some more days until I can finish the pull request #197 that introduces this feature.

@0xCCF4
Copy link

0xCCF4 commented Jul 1, 2024

I finished the first revision, and I am looking forward for a review. 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

3 participants