-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New read api #233
base: master
Are you sure you want to change the base?
New read api #233
Conversation
ea58b90
to
dcbabcc
Compare
I've just added the new API to the fuzz testing in 568fcab, and benchmarks confirming it achieves the same performance as the current API in 2731e7e. These benchmarks also show that decompressing entries with the stream API is just as fast as iterating over entries from a |
7f56cfe
to
e882a72
Compare
Re point 3: I'm confused about why this is a bug. If the block has |
@cosmicexplorer I'm not sure I can properly review this PR without an answer to this question. |
That's correct, for the case in which we are only interested in local blocks, but it causes us to drop the first metadata block when we are interested in the This is actually not an indictment of I will think about whether |
I guess I should say another thing for context: this shouldn't really be considered a priority, and it doesn't block any current work on the crate (including #208 or #235). I continue to think it's a useful refactoring change, but given that it necessarily requires introducing a breaking API change, it might be worth leaving on the back burner for now and coming back to when there are other breaking changes we want to introduce with a major version bump. I was very interested to see whether it would produce any performance improvement (maybe it would when operating over an in-memory The part that I like the most about this (and the reason for its creation) is how the parameterized Given the idea I just had above (making
I'm going to mark this as draft until I can do the above, and ping you when I've got something for review. |
e882a72
to
f33748b
Compare
- also use git dep for zstd to pull in removal of R: BufRead bound - move new methods into unstable/read.rs - support getting aes verification info from a raw ZipEntry - flesh out other methods from ZipFile -> ZipEntry - now create a streaming abstraction - move some methods around - make AesModeInfo pub(crate) to make its intention clear - make ArchiveEntry more publicly accessible - make ZipFile impl EntryData
Moved from #207 after weird github errors.
Problem
I believe there are several shortcomings that necessitate a slight reimplementation of much of
src/read.rs
, some involving breaking API changes:Send
and/orSync
bounds fromR
toZipFile<R>
is literally impossible with the current public API that type-erases todyn Read
.dyn Read + Send
, but that would be a lot of trouble and not worth it.R
isSend
and always carry adyn Read + Send
, but that would be unsound andunsafe
and could introduce memory unsafety.ZipFile
has a lot of stateful logic, including aZipFileReader::NoReader
state, even necessitating aninvalid_state()
helper method for error checking.ZipFile
has aDrop
impl which is only used when reading stream archives (to ensure we read out the rest of the entry from the stream), but this niche use case necessitates every reader struct in the crate to provide an.into_inner()
method.Drop
impl forZipFile
is a bad idea, there is currently a bug inZipStreamReader
which drops the firstZipStreamFileMetadata
entry, becauseread_zipfile_from_stream()
does not save the contents of the lastZipLocalEntryBlock
it reads which hasspec::Magic::CENTRAL_DIRECTORY_HEADER_SIGNATURE
.ZipFile
.ZipStreamVisitor
interface, and we can have direct control over when to iterate through each entry of the streaming zip file. That's whatzip::unstable::read::streaming
provides.StreamingArchive
maintains the internal state necessary to retain the bytes of the first metadata entry after we go through all the local file entries, fixing the bug inZipStreamReader
which drops the first metadata entry.ZipFileData
that we have onZipFile
, but for other structs. That's why this change introducesEntryData
, which implements all the methods traditionally provided as member functions ofZipFile
.ZipFileData
itself, and we can instead keep that to ourselves as an internal implementation detail.ZipFileReader::Compressed(Box<Crc32Reader<Decompressor<io::BufReader<CryptoReader<'a>>>>>)
always goes through the logic forCryptoReader
, when encrypted zip files are an extremely niche use case. We shouldn't be cluttering up our type definitions or our stack traces with encryption logic when most users are never going to touch it!Crc32Reader
atzip2/src/crc32.rs
Line 15 in 3f6768e
Crc32Reader
at all, and this change means we actually don't, instead of having to insert little workarounds and bolt on flags to disable stuff; see https://github.com/cosmicexplorer/zip/blob/04f6ced5f2197d0ef790774442e6371ab904229e/src/read.rs#L1329-L1345, where we do the more complex logic to instantiate the crypto reader in this new change:That's the entire method! No jumping into decryption code paths for the much more common case of simple decompression with crc32 checking!
So that's my case for the API changes. We can do them in pieces, but I'm proposing this intentionally breaking change that I believe will improve the internal architecture and correctness, increase our ability to perform optimizations, and generate more flexible entry structs in our public API.
More Specifically
This section was copied from #207.
As described in gyscos/zstd-rs#288 (comment), our usage of
&'a mut dyn Read
inZipFileReader
stops us from being able to implSend
orSync
, or generally to send aZipFile
into a separate thread than the one owning theZipArchive
it belongs to. While this is generally not an issue, as the single synchronousRead + Seek
handle makes it difficult to farm out work to subthreads, for the purposes of #72, we would like to be able to reuse some code instead of creating an entirely separate implementation. Note: #72 does not actually use this code, and this is a false argument. However, it is correct that other code using this crate, especially inasync
contexts, could really use the ability to imposeSend
bounds, whichZipFile
is unable to satisfy.That brings us to the second set of issues, surrounding zipcrypto support:
CryptoReader
enum and themake_crypto_reader()
method, even though zip crypto is a very uncommon use case,ZipFile
itself contains some very complex mutable state with both aZipFileReader
and aCryptoReader
, even thoughZipFileReader
itself also contains aCryptoReader
,Crc32Reader
has to maintain a special flag to disable CRC checking in the special case of AE2 encryption, which is a sign of a leaky abstraction and may easily introduce a failure to check CRC in other cases by accident in the future.Solution
Luckily, all of this becomes significantly easier without too many changes:
src/unstable/read.rs
to reimplementread::ZipFile
with the newunstable::read::ZipEntry
.EntryReader
andZipEntry
with a parameterized reader typeR
instead of an internal&'a mut dyn Read
.ZipArchive::by_index()
returningZipResult<ZipEntry<'_, impl Read + '_>>
to avoid leaking internal structs while retaining the ability to rearrange the layout of our internal structs.CryptoReader
creation throughCryptoVariant
.This also leads to a significantly improved refactoring of streaming support in the
streaming
submodule ofsrc/unstable/read.rs
.Result
zipcrypto support doesn't muck up the non-crypto methods,
ZipEntry
is nowSend
, andsrc/unstable/read.rs
is significantly easier to read with equivalent functionality. This will have to be a breaking change in order to achieveSend
-ability, but I believe the readability/maintainability is substantially improved with this change.TODO
ZipFile
wrap adyn Read
handle produced fromZipEntry
, so there's noCryptoReader
type in our non-crypto code paths.ZipStreamReader
useStreamingArchive
internally (generatingZipFile
instances from theZipEntry
s it produces).