Limited seeking backwards when reading a ZipArchive #162

XLPhere · 2023-11-29T14:54:32Z

This is about the fact that ZipArchive requires the reader to implement seeking, which makes sense in many cases, but seems to have the following issue:

In the case where I'm trying to extract a file from a stream of a zip-archive which does not support seeking, i would need to store everything we have read so far, so i can support seeking backwards. For larger files this can get rather impractical when it comes to memory usage.

Is there a way to extract a file from an archive without seeking backwards or with limited seeking backwards (maximum number of bytes we can go back)?

Pr0methean · 2024-05-01T03:24:06Z

Not without buffering the whole compressed file, since the compression is independent per file but isn't random-access.

0xCCF4 · 2024-06-21T14:21:05Z

I'm currently working on a project and came over the same problem. I've a nested zip file within a compressed stream that does not support seeking. I can't store the whole compressed stream in RAM, since the files I am dealing with are too big. Then I came across the

zip2/src/read.rs

Lines 1594 to 1610 in 27c7fa4

    
           /// Read ZipFile structures from a non-seekable reader. 
        
           /// 
        
           /// This is an alternative method to read a zip file. If possible, use the ZipArchive functions 
        
           /// as some information will be missing when reading this manner. 
        
           /// 
        
           /// Reads a file header from the start of the stream. Will return `Ok(Some(..))` if a file is 
        
           /// present at the start of the stream. Returns `Ok(None)` if the start of the central directory 
        
           /// is encountered. No more files should be read after this. 
        
           /// 
        
           /// The Drop implementation of ZipFile ensures that the reader will be correctly positioned after 
        
           /// the structure is done. 
        
           /// 
        
           /// Missing fields are: 
        
           /// * `comment`: set to an empty string 
        
           /// * `data_start`: set to 0 
        
           /// * `external_attributes`: `unix_mode()`: will return None 
        
           pub fn read_zipfile_from_stream<'a, R: Read>(reader: &'a mut R) -> ZipResult<Option<ZipFile<'_>>> {

function that accepts a non-seekable stream as input. The problem was that I am dealing with ZIP files that have a data descriptor set. As @Pr0methean mentioned safe way would be to buffer the whole compressed file since the actual size can be found in the data descriptor after the compressed file's content.

My next idea was to do just that: buffer the stream content, providing a seekable reader and search for the first zip entry. After that the buffer can be emptied and the next entry can be searched. I implemented this (without the buffering part) in #197 . See https://github.com/0xCCF4/zip2/blob/59f1327e72390813a422e1c7486818da536e9ad7/src/read.rs#L1695-L1728

Then I had the idea, maybe one could just start "proactively" decompressing the content while looking ahead for the data descriptor and if found stop streaming the content. This would have the risk of streaming garbage data when the data descriptor can not be found and reading over the file boundary; but would not require buffering the compressed file data/stream data.

Pr0methean · 2024-06-21T17:13:19Z

Try using `read_zipfile_from_stream()`.

…

On Fri, Jun 21, 2024, 07:21 0xCCF4 ***@***.***> wrote: I'm currently working on a project and came over the same problem. I've a nested zip file within a compressed stream that does not support seeking. I can't store the whole compressed stream in RAM, since the files I am dealing with are too big. Then I came across the https://github.com/zip-rs/zip2/blob/27c7fa4cd408bb4cc1364cf599942883371a27fa/src/read.rs#L1594-L1610 function that accepts a non-seekable stream as input. The problem was that I am dealing with ZIP files that have a data descriptor set. As @Pr0methean <https://github.com/Pr0methean> mentioned safe way would be to buffer the whole compressed file since the actual size can be found in the data descriptor after the compressed file's content. My next idea was to do just that: buffer the stream content, providing a seekable reader and search for the first zip entry. After that the buffer can be emptied and the next entry can be searched. I implemented this (without the buffering part) in #197 <#197> . See https://github.com/0xCCF4/zip2/blob/59f1327e72390813a422e1c7486818da536e9ad7/src/read.rs#L1695-L1728 Then I had the idea, maybe one could just start "proactively" decompressing the content while looking ahead for the data descriptor and if found stop streaming the content. This would have the risk of streaming garbage data when the data descriptor can not be found and reading over the file boundary; but would not require buffering the compressed file data/stream data. — Reply to this email directly, view it on GitHub <#162 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABF3NBKKULH455HN45UMJWDZIQZGNAVCNFSM6AAAAABJWBLVYCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOBSHA2TAOBVGE> . You are receiving this because you were mentioned.Message ID: ***@***.***>

0xCCF4 · 2024-06-22T14:29:54Z

Yes that is what I tried but this function (also used by the unstable::ZipStreamReader) does not support data descriptors and will fail to read zip files using them (see attached test below).

read_zipfile_from_stream is relying on the zip local entry block before the compressed file contents to figure out the size of the following data. That's fine but inherently does not work if the size is not given in the header but only later in zip data descriptor for the file.

I am therefore working at a change to the unstable::ZipStreamReader to also support those zip files.

#[test]
fn test_stream() {
    let mut v = Vec::new();
    v.extend_from_slice(include_bytes!("data/data_descriptor.zip"));
    let stream = Box::new(io::Cursor::new(v)) as Box<dyn Read>;

    let reader = ZipStreamReader::new(stream);

    struct TestVisitor {}

    impl ZipStreamVisitor for TestVisitor {
        fn visit_file(&mut self, file: &mut ZipFile<'_>) -> ZipResult<()> {
            println!("File: {}", file.name());
            let mut buffer = Vec::new();
            file.read_to_end(&mut buffer)?;
            match String::from_utf8(buffer) {
                Ok(text) => println!(" > {}", text),
                Err(_) => {},
            }
            Ok(())
        }

        fn visit_additional_metadata(&mut self, metadata: &ZipStreamFileMetadata) -> ZipResult<()> {
            println!("Metdata: {:?}", metadata);
            Ok(())
        }
    }

    let mut visitor = TestVisitor {};
    reader.visit(&mut visitor).expect("error");
}

0xCCF4 · 2024-06-23T11:53:46Z

I probably need some more days until I can finish the pull request #197 that introduces this feature.

0xCCF4 · 2024-07-01T17:21:53Z

I finished the first revision, and I am looking forward for a review. 🙂

Pr0methean transferred this issue from zip-rs/zip-old Jun 2, 2024

0xCCF4 mentioned this issue Jun 21, 2024

feat: Interpret data descriptors when reading zip file from (read, nonseek) stream #197

Open

4 tasks

Pr0methean added the help wanted Extra attention is needed label Jun 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Limited seeking backwards when reading a ZipArchive #162

Limited seeking backwards when reading a ZipArchive #162

XLPhere commented Nov 29, 2023 •

edited

Loading

Pr0methean commented May 1, 2024

0xCCF4 commented Jun 21, 2024

Pr0methean commented Jun 21, 2024 via email

0xCCF4 commented Jun 22, 2024

0xCCF4 commented Jun 23, 2024

0xCCF4 commented Jul 1, 2024

Limited seeking backwards when reading a ZipArchive #162

Limited seeking backwards when reading a ZipArchive #162

Comments

XLPhere commented Nov 29, 2023 • edited Loading

Pr0methean commented May 1, 2024

0xCCF4 commented Jun 21, 2024

Pr0methean commented Jun 21, 2024 via email

0xCCF4 commented Jun 22, 2024

0xCCF4 commented Jun 23, 2024

0xCCF4 commented Jul 1, 2024

XLPhere commented Nov 29, 2023 •

edited

Loading