Limited seeking backwards when reading a ZipArchive #162
Not without buffering the whole compressed file: compression is independent per file, but within a single file it isn't random-access.
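Where the archive does fit in memory, the usual workaround for a non-seekable stream is to drain it into a `Vec<u8>` and wrap that in a `std::io::Cursor`, which implements `Read + Seek` and can then be handed to `ZipArchive::new`. A minimal std-only sketch of just the buffering part (the `zip` calls themselves are omitted; the sample bytes are a stand-in, not a real archive):

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom};

/// Drain a non-seekable reader into memory and return a seekable cursor.
/// Only viable when the whole archive fits in RAM.
fn buffer_to_seekable<R: Read>(mut reader: R) -> io::Result<Cursor<Vec<u8>>> {
    let mut buf = Vec::new();
    reader.read_to_end(&mut buf)?;
    Ok(Cursor::new(buf))
}

fn main() -> io::Result<()> {
    // Stand-in for a non-seekable stream (e.g. a network socket).
    let stream: Box<dyn Read> = Box::new(&b"PK\x03\x04rest-of-archive"[..]);
    let mut cursor = buffer_to_seekable(stream)?;

    // Backward seeks now work, e.g. jumping to the end (where ZipArchive
    // looks for the central directory) and back to the start.
    let len = cursor.seek(SeekFrom::End(0))?;
    cursor.seek(SeekFrom::Start(0))?;
    let mut sig = [0u8; 4];
    cursor.read_exact(&mut sig)?;
    assert_eq!(&sig, b"PK\x03\x04"); // local file header signature
    println!("buffered {} bytes; local header signature found", len);
    Ok(())
}
```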
I'm currently working on a project and ran into the same problem. I have a nested zip file inside a compressed stream that does not support seeking, and I can't store the whole compressed stream in RAM because the files I'm dealing with are too big. Then I came across the function at https://github.com/zip-rs/zip2/blob/27c7fa4cd408bb4cc1364cf599942883371a27fa/src/read.rs#L1594-L1610, which accepts a non-seekable stream as input. The problem is that I'm dealing with ZIP files that have a data descriptor set. As @Pr0methean mentioned, the safe way would be to buffer the whole compressed file, since the actual size can only be found in the data descriptor after the compressed file's content.

My next idea was to do just that: buffer the stream content, provide a seekable reader, and search for the first zip entry. After that, the buffer can be emptied and the next entry searched for. I implemented this (without the buffering part) in #197. See https://github.com/0xCCF4/zip2/blob/59f1327e72390813a422e1c7486818da536e9ad7/src/read.rs#L1695-L1728

Then I had the idea that maybe one could just start "proactively" decompressing the content while looking ahead for the data descriptor, and stop streaming the content once it is found. This would carry the risk of streaming garbage data if the data descriptor cannot be found and we read past the file boundary, but it would not require buffering the compressed file data.
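For reference, the look-ahead idea would key on the optional data descriptor signature `PK\x07\x08` (0x08074b50). Here is a toy std-only sketch of that scan; a real implementation would also have to validate the CRC-32 and size fields that follow the signature, since these four bytes can legitimately occur inside file data — exactly the garbage-data risk described above:

```rust
/// Signature that (optionally) precedes a ZIP data descriptor: "PK\x07\x08".
const DATA_DESCRIPTOR_SIG: [u8; 4] = [0x50, 0x4b, 0x07, 0x08];

/// Return the offset of the first candidate data descriptor signature, if any.
/// A real implementation would verify the CRC-32 and compressed/uncompressed
/// sizes stored after the signature before trusting the match.
fn find_data_descriptor(buf: &[u8]) -> Option<usize> {
    buf.windows(4).position(|w| w == DATA_DESCRIPTOR_SIG)
}

fn main() {
    // Fake entry data followed by a data descriptor signature.
    let mut stream = b"compressed bytes...".to_vec();
    stream.extend_from_slice(&DATA_DESCRIPTOR_SIG);
    stream.extend_from_slice(&[0; 12]); // CRC-32 + sizes (zeroed for the sketch)

    match find_data_descriptor(&stream) {
        Some(off) => println!("candidate data descriptor at offset {}", off),
        None => println!("no data descriptor found; keep streaming"),
    }
}
```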
Try using `read_zipfile_from_stream()`.
Yes, that is what I tried, but this function (also used by `unstable::ZipStreamReader`) does not support data descriptors and will fail to read zip files that use them (see the attached test below). `read_zipfile_from_stream` relies on the zip local entry block before the compressed file contents to figure out the size of the following data. That's fine, but it inherently does not work if the size is not given in the header and only appears later, in the zip data descriptor for the file. I am therefore working on a change to `unstable::ZipStreamReader` to also support those zip files.

```rust
#[test]
fn test_stream() {
    let mut v = Vec::new();
    v.extend_from_slice(include_bytes!("data/data_descriptor.zip"));
    let stream = Box::new(io::Cursor::new(v)) as Box<dyn Read>;
    let reader = ZipStreamReader::new(stream);

    struct TestVisitor {}

    impl ZipStreamVisitor for TestVisitor {
        fn visit_file(&mut self, file: &mut ZipFile<'_>) -> ZipResult<()> {
            println!("File: {}", file.name());
            let mut buffer = Vec::new();
            file.read_to_end(&mut buffer)?;
            match String::from_utf8(buffer) {
                Ok(text) => println!(" > {}", text),
                Err(_) => {}
            }
            Ok(())
        }

        fn visit_additional_metadata(&mut self, metadata: &ZipStreamFileMetadata) -> ZipResult<()> {
            println!("Metadata: {:?}", metadata);
            Ok(())
        }
    }

    let mut visitor = TestVisitor {};
    reader.visit(&mut visitor).expect("error");
}
```
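The failure mode described above comes down to bit 3 of the general-purpose bit flag in the local file header: when it is set, the size fields in the header are zero and the real values only appear in the trailing data descriptor. A std-only sketch of that check (header layout per PKWARE's APPNOTE; this is illustrative code, not from zip2):

```rust
/// Returns Some(true/false) for "uses a data descriptor" if `header` starts
/// with a ZIP local file header, or None otherwise. Layout per the APPNOTE:
/// bytes 0..4 are the signature "PK\x03\x04", bytes 6..8 are the
/// general-purpose bit flag (little-endian); bit 3 (0x0008) means the sizes
/// live in a trailing data descriptor.
fn uses_data_descriptor(header: &[u8]) -> Option<bool> {
    if header.len() < 8 || &header[..4] != b"PK\x03\x04" {
        return None;
    }
    let flags = u16::from_le_bytes([header[6], header[7]]);
    Some(flags & 0x0008 != 0)
}

fn main() {
    // Minimal fake headers: signature, "version needed" (2 bytes), flags (2 bytes).
    let streamed = [0x50, 0x4b, 0x03, 0x04, 0x14, 0x00, 0x08, 0x00];
    let plain = [0x50, 0x4b, 0x03, 0x04, 0x14, 0x00, 0x00, 0x00];
    assert_eq!(uses_data_descriptor(&streamed), Some(true));
    assert_eq!(uses_data_descriptor(&plain), Some(false));
    println!("bit 3 check ok");
}
```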
I probably need a few more days until I can finish pull request #197, which introduces this feature.
I finished the first revision, and I am looking forward to a review. 🙂
This is about the fact that `ZipArchive` requires the reader to implement seeking, which makes sense in many cases, but leads to the following issue:

When I'm trying to extract a file from a stream of a zip archive that does not support seeking, I would need to store everything read so far in order to support seeking backwards. For larger files this becomes rather impractical in terms of memory usage.

Is there a way to extract a file from an archive without seeking backwards, or with limited seeking backwards (a maximum number of bytes we can go back)?
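As an illustration of what "limited seeking backwards" could look like, here is a hypothetical std-only `Read` wrapper that retains the last `cap` bytes and allows rewinding at most that far. This is not part of the zip crate; all names here are made up for the sketch:

```rust
use std::collections::VecDeque;
use std::io::{self, Read};

/// `Read` wrapper that remembers the last `cap` bytes produced and can
/// rewind up to that many bytes. Illustrative sketch only.
struct LimitedRewind<R: Read> {
    inner: R,
    history: VecDeque<u8>, // most recently produced bytes, at most `cap`
    cap: usize,
    pending: Vec<u8>, // rewound bytes to replay before touching `inner` again
}

impl<R: Read> LimitedRewind<R> {
    fn new(inner: R, cap: usize) -> Self {
        Self { inner, history: VecDeque::new(), cap, pending: Vec::new() }
    }

    /// Step back `n` bytes; fails if `n` exceeds the retained history.
    fn rewind(&mut self, n: usize) -> io::Result<()> {
        if n > self.history.len() {
            return Err(io::Error::new(io::ErrorKind::InvalidInput, "rewind past buffer"));
        }
        let start = self.history.len() - n;
        // Rewound bytes come before any bytes still pending from an earlier rewind.
        let mut replayed: Vec<u8> = self.history.drain(start..).collect();
        replayed.extend(self.pending.drain(..));
        self.pending = replayed;
        Ok(())
    }
}

impl<R: Read> Read for LimitedRewind<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        let n = if !self.pending.is_empty() {
            // Serve replayed bytes first.
            let n = out.len().min(self.pending.len());
            out[..n].copy_from_slice(&self.pending[..n]);
            self.pending.drain(..n);
            n
        } else {
            self.inner.read(out)?
        };
        // Remember what was just produced, trimming the history to `cap`.
        self.history.extend(out[..n].iter().copied());
        while self.history.len() > self.cap {
            self.history.pop_front();
        }
        Ok(n)
    }
}

fn main() -> io::Result<()> {
    let mut r = LimitedRewind::new(&b"abcdefgh"[..], 4);
    let mut buf = [0u8; 6];
    r.read_exact(&mut buf)?; // consumed "abcdef"
    r.rewind(3)?; // step back over "def"
    let mut again = [0u8; 3];
    r.read_exact(&mut again)?;
    assert_eq!(&again, b"def");
    println!("re-read after rewind: {}", String::from_utf8_lossy(&again));
    Ok(())
}
```

The trade-off is explicit: memory usage is bounded by `cap` instead of the archive size, at the cost of only being able to step back that far.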