Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New read api #233

Draft
wants to merge 5 commits into
base: master
Choose a base branch
from
Draft

Conversation

cosmicexplorer
Copy link
Contributor

@cosmicexplorer cosmicexplorer commented Aug 11, 2024

Moved from #207 after weird github errors.

Problem

I believe there are several shortcomings that necessitate a slight reimplementation of much of src/read.rs, some involving breaking API changes:

  1. Propagating Send and/or Sync bounds from R to ZipFile<R> is literally impossible with the current public API that type-erases to dyn Read.
    • We could make two separate copies, one with dyn Read + Send, but that would be a lot of trouble and not worth it.
    • We could just assert that R is Send and always carry a dyn Read + Send, but that would be unsound and unsafe and could introduce memory unsafety.
  2. ZipFile has a lot of stateful logic, including a ZipFileReader::NoReader state, even necessitating an invalid_state() helper method for error checking.
    • Notably (and imo inexcusably), ZipFile has a Drop impl which is only used when reading stream archives (to ensure we read out the rest of the entry from the stream), but this niche use case necessitates every reader struct in the crate to provide an .into_inner() method.
  3. To underline the reason this Drop impl for ZipFile is a bad idea, there is currently a bug in ZipStreamReader which drops the first ZipStreamFileMetadata entry, because read_zipfile_from_stream() does not save the contents of the last ZipLocalEntryBlock it reads which has spec::Magic::CENTRAL_DIRECTORY_HEADER_SIGNATURE.
    • Basically, streaming requires special handling, not being bolted onto ZipFile.
    • This change means we don't need to go through the ZipStreamVisitor interface, and we can have direct control over when to iterate through each entry of the streaming zip file. That's what zip::unstable::read::streaming provides.
    • Importantly, StreamingArchive maintains the internal state necessary to retain the bytes of the first metadata entry after we go through all the local file entries, fixing the bug in ZipStreamReader which drops the first metadata entry.
  4. Because streaming requires special handling and separate structs for entries, we also need to be able to get access to all the methods for accessing ZipFileData that we have on ZipFile, but for other structs. That's why this change introduces EntryData, which implements all the methods traditionally provided as member functions of ZipFile.
    • This is generally useful, as it allows us to avoid publicly exposing ZipFileData itself, and we can instead keep that to ourselves as an internal implementation detail.
    • In general, this gives us the ability to virtualize or introduce other indirections to access entry metadata, without changing the interface expected by end users, or having to reimplement any of our convenience methods for different types of zip backing archives.
  5. Finally, ZipFileReader::Compressed(Box<Crc32Reader<Decompressor<io::BufReader<CryptoReader<'a>>>>>) always goes through the logic for CryptoReader, when encrypted zip files are an extremely niche use case. We shouldn't be cluttering up our type definitions or our stack traces with encryption logic when most users are never going to touch it!
        let final_reader = if data.encrypted {
            let crypto_variant = CryptoKeyValidationSource::from_data(data)?;
            let is_ae2_encrypted = crypto_variant.is_ae2_encrypted();
            let crypto_reader = crypto_variant.make_crypto_reader(password, content)?;
            let entry_reader =
                construct_decompressing_reader(&data.compression_method, crypto_reader)?;
            if is_ae2_encrypted {
                /* Ae2 voids crc checking: https://www.winzip.com/en/support/aes-encryption/ */
                CryptoEntryReader::Ae2Encrypted(entry_reader)
            } else {
                CryptoEntryReader::NonAe2Encrypted(NewCrc32Reader::new(entry_reader, data.crc32))
            }
        } else {
            /* Not encrypted, so do the same as in .by_index_generic(): */
            let entry_reader = construct_decompressing_reader(&data.compression_method, content)?;
            CryptoEntryReader::Unencrypted(NewCrc32Reader::new(entry_reader, data.crc32))
        };
    /// Get a contained file by index
    pub fn by_index_generic(
        &mut self,
        file_number: usize,
    ) -> ZipResult<ZipEntry<'_, impl Read + '_>> {
        let Self {
            ref mut reader,
            ref shared,
            ..
        } = self;
        let (_, data) = shared
            .files
            .get_index(file_number)
            .ok_or(ZipError::FileNotFound)?;
        let content = find_entry_content_range(data, reader)?;
        let entry_reader = construct_decompressing_reader(&data.compression_method, content)?;
        let crc32_reader = NewCrc32Reader::new(entry_reader, data.crc32);
        Ok(ZipEntry {
            data,
            reader: crc32_reader,
        })
    }

That's the entire method! No jumping into decryption code paths for the much more common case of simple decompression with crc32 checking!

So that's my case for the API changes. We can do them in pieces, but I'm proposing this intentionally breaking change that I believe will improve the internal architecture and correctness, increase our ability to perform optimizations, and generate more flexible entry structs in our public API.

More Specifically

This section was copied from #207.

As described in gyscos/zstd-rs#288 (comment), our usage of &'a mut dyn Read in ZipFileReader stops us from being able to impl Send or Sync, or generally to send a ZipFile into a separate thread than the one owning the ZipArchive it belongs to. While this is generally not an issue, as the single synchronous Read + Seek handle makes it difficult to farm out work to subthreads, for the purposes of #72, we would like to be able to reuse some code instead of creating an entirely separate implementation. Note: #72 does not actually use this code, and this is a false argument. However, it is correct that other code using this crate, especially in async contexts, could really use the ability to impose Send bounds, which ZipFile is unable to satisfy.

That brings us to the second set of issues, surrounding zipcrypto support:

  • all of our readers go through the CryptoReader enum and the make_crypto_reader() method, even though zip crypto is a very uncommon use case,
  • ZipFile itself contains some very complex mutable state with both a ZipFileReader and a CryptoReader, even though ZipFileReader itself also contains a CryptoReader,
  • Crc32Reader has to maintain a special flag to disable CRC checking in the special case of AE2 encryption, which is a sign of a leaky abstraction and may easily introduce a failure to check CRC in other cases by accident in the future.

Solution

Luckily, all of this becomes significantly easier without too many changes:

  • Create src/unstable/read.rs to reimplement read::ZipFile with the new unstable::read::ZipEntry.
  • Make EntryReader and ZipEntry with a parameterized reader type R instead of an internal &'a mut dyn Read.
  • Make new versions of e.g. ZipArchive::by_index() returning ZipResult<ZipEntry<'_, impl Read + '_>> to avoid leaking internal structs while retaining the ability to rearrange the layout of our internal structs.
  • Vastly improve readability of CryptoReader creation through CryptoVariant.

This also leads to a significantly improved refactoring of streaming support in the streaming submodule of src/unstable/read.rs.

Result

zipcrypto support doesn't muck up the non-crypto methods, ZipEntry is now Send, and src/unstable/read.rs is significantly easier to read with equivalent functionality. This will have to be a breaking change in order to achieve Send-ability, but I believe the readability/maintainability is substantially improved with this change.

TODO

  • Make ZipFile wrap a dyn Read handle produced from ZipEntry, so there's no CryptoReader type in our non-crypto code paths.
  • Make ZipStreamReader use StreamingArchive internally (generating ZipFile instances from the ZipEntrys it produces).

src/crc32.rs Dismissed Show dismissed Hide dismissed
src/unstable/read.rs Dismissed Show dismissed Hide dismissed
src/unstable/read.rs Dismissed Show dismissed Hide dismissed
src/unstable/read.rs Dismissed Show dismissed Hide dismissed
src/unstable/read.rs Dismissed Show dismissed Hide dismissed
src/unstable/read.rs Dismissed Show dismissed Hide dismissed
src/unstable/read.rs Dismissed Show dismissed Hide dismissed
@cosmicexplorer cosmicexplorer added the rust Pull requests that update Rust code label Aug 11, 2024
@cosmicexplorer
Copy link
Contributor Author

cosmicexplorer commented Aug 11, 2024

I've just added the new API to the fuzz testing in 568fcab, and benchmarks confirming it achieves the same performance as the current API in 2731e7e. These benchmarks also show that decompressing entries with the stream API is just as fast as iterating over entries from a ZipArchive. Note that I made zip::read::stream fully public in order to benchmark it against my new zip::unstable::read::streaming API, as it was both unused beforehand and marked as pub(crate).

@cosmicexplorer cosmicexplorer force-pushed the new-read-api branch 6 times, most recently from 7f56cfe to e882a72 Compare August 11, 2024 19:10
@Pr0methean
Copy link
Member

Pr0methean commented Aug 11, 2024

Re point 3: I'm confused about why this is a bug. If the block has spec::Magic::CENTRAL_DIRECTORY_HEADER_SIGNATURE, aren't we supposed to ignore it, since the metadata in streaming mode comes from local headers?

@cosmicexplorer cosmicexplorer mentioned this pull request Aug 12, 2024
3 tasks
@Pr0methean
Copy link
Member

Re point 3: I'm confused about why this is a bug. If the block has spec::Magic::CENTRAL_DIRECTORY_HEADER_SIGNATURE, aren't we supposed to ignore it, since the metadata in streaming mode comes from local headers?

@cosmicexplorer I'm not sure I can properly review this PR without an answer to this question.

@cosmicexplorer
Copy link
Contributor Author

Re point 3: I'm confused about why this is a bug. If the block has spec::Magic::CENTRAL_DIRECTORY_HEADER_SIGNATURE, aren't we supposed to ignore it, since the metadata in streaming mode comes from local headers?

That's correct, for the case in which we are only interested in local blocks, but it causes us to drop the first metadata block when we are interested in the visit_additional_metadata() method for ZipStreamVisitor from src/read/stream.rs. That is currently not an exposed API, so there's currently no harm and no bug exposed to users, but I am using it as an example of how read_zipfile_from_stream() itself cannot be used to implement ZipStreamReader without dropping that first central directory block.

This is actually not an indictment of src/read/stream.rs, as ZipStreamReader actually demonstrates exactly the interface we want--in fact, what I should have done is probably just to implement ZipStreamReader in terms of unstable::read::streaming::StreamingArchive. However StreamingArchive makes use of ZipEntry (with the parameterized reader instead of the &mut dyn Read), so it's hard to see how to do that in a compatible way.

I will think about whether ZipFile can be made to store a dyn Read handle produced from the ZipEntry code in unstable/read.rs. That would be a really interesting way to use this code internally without breaking the API.

@cosmicexplorer
Copy link
Contributor Author

I guess I should say another thing for context: this shouldn't really be considered a priority, and it doesn't block any current work on the crate (including #208 or #235). I continue to think it's a useful refactoring change, but given that it necessarily requires introducing a breaking API change, it might be worth leaving on the back burner for now and coming back to when there are other breaking changes we want to introduce with a major version bump.

I was very interested to see whether it would produce any performance improvement (maybe it would when operating over an in-memory Cursor instead of File i/o?), but a vtable call through dyn Read is really cheap, so I think the most important angle is that ZipEntry from this PR can impl Send (in fact it does so automatically), which the current ZipFile API is unable to do without introducing unsoundness. I believe this generally means a ZipFile can't be transferred across await points, but since a ZipFile also contains a mutable reference to the parent ZipArchive anyway (ZipEntry does too, since the returned impl Read takes a mutable reference to the read stream R from ZipArchive<R>), I don't believe anyone using this crate in async code is able to operate over ZipFile entries separate from parent ZipArchives anyway. So the Send impl isn't really too important in practice, especially since work like #208 (introducing more high-level abstractions to enable our own parallelism) is likely to achieve whatever people might want an async API for (this assumption may be wrong).

The part that I like the most about this (and the reason for its creation) is how the parameterized R: Read stream enables us to avoid polluting our ZipFile type with any knowledge of the CryptoReader (because I feel that that code is especially difficult to read and extend, especially e.g. requiring us to insert a boolean flag in Crc32Reader). There is a secondary benefit too that we do not need a Drop impl for ZipFile which is only used for streaming.

Given the idea I just had above (making ZipFile wrap a dyn Read handle using the code from ZipEntry), I'm going to see whether we can get most of the benefit of this change without breaking the API:

  • Make ZipFile wrap a dyn Read handle produced from ZipEntry, so there's no CryptoReader type in our non-crypto code paths.
  • Make ZipStreamReader use StreamingArchive internally (generating ZipFile instances from the ZipEntrys it produces).

I'm going to mark this as draft until I can do the above, and ping you when I've got something for review.

@cosmicexplorer cosmicexplorer marked this pull request as draft August 17, 2024 18:31
- also use git dep for zstd to pull in removal of R: BufRead bound
- move new methods into unstable/read.rs
- support getting aes verification info from a raw ZipEntry
- flesh out other methods from ZipFile -> ZipEntry
- now create a streaming abstraction
- move some methods around
- make AesModeInfo pub(crate) to make its intention clear
- make ArchiveEntry more publicly accessible
- make ZipFile impl EntryData
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rust Pull requests that update Rust code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants