-
Notifications
You must be signed in to change notification settings - Fork 170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Huge allocations #283
Comments
Thanks @erri120 Some testing and implementation has been done through #250 to try and address (most of) the internal byte array issues. It uses a List structure to hold file byte chunks that should be small enough (< 85000 bytes) to avoid LOH allocations, and also allow for multiple reads against the same byte sequence instead of copying (which may be a lot of the allocations you see). It's still a WIP, quite a change to the library, and may or may not address all allocation scenarios. Hopefully it becomes a small step forward in the future but we'll see how it plays out. Keep in mind too that MetadataExtractor is designed to always read all possible metadata and create the internal representation. It wasn't designed to read up to a user-defined point or in a lazy manner. A few issues have been made over the years requesting a lazy and/or truncated reading feature but like you said, it would be quite a refactor. I think something along the lines of #250 would need to be in place first to at the very least address all the byte array copying and then go back to these other things. |
@erri120 thanks for your investigation and suggestions.
In general the library doesn't assume that the stream is seekable. Some users process data live off a socket, for example, where seeking is not an option without buffering. As @kwhopper calls out, he has a promising PR to unify the data access logic, which should open up some more opportunities here. However in the non-seek pattern, there's no guarantee (in general) that later data doesn't point back to earlier data. In such cases we need to buffer data that we skip, in case it's needed later on. There are some cases where we expect to skip without needing to seek back to the data being skipped (e.g. image data). I believe that would be the most promising approach to cut down on these allocations in the non-seekable case. In the seek case, it's still quite likely we need to buffer chunks of data due to random access patterns. A promising option here would be to maintain a pool of these buffers so that they are reused, reducing GC churn.
Span/ReadOnlySpan requires data to be in contiguous memory, so it still needs to be copied from the underlying stream to a buffer. While MetadataExtractor could make better use of these newer language constructs in some places (note though that we still currently support as far back as .NET Framework 3.5), it is not a free lunch. Span is great for avoiding copies of data, but in our case we can't (in general) avoid copying from a stream to a buffer. I'm not sure how the MediaInfo library you're comparing against works, but perhaps they're not pulling out as much metadata. If you know that you're only after a few specific types of metadata, you can configure a type (such as |
I'm currently in the process of figure out what library to use for getting metadata from media files and this library is definitly one of the fastes around. Only problem I have are the huge allocations it makes:
The folder I tested this on contained 144 files (136 Videos and 8 Images total 1GB) and saw allocations of around 220MB.
The next benchmark was done on a folder containing 276 files (only images) and again we see allocations way above reason.
Using the Dynamic Program Analysis build into Rider, the most allocations happen because the library reads the entire contents of a section into a byte array and often processes those later on like for PNG and JPEG.
Possible improvements could be made by using
Span<T>
and.Slice
for the chunks which returns aReadOnlySpan<T>
with no extra allocations.Then there is also the concept of binary overlays that differ from typical binary importating in that you do not read everything from file into memery upfront and then parse it but keep an open stream and only parse the bare minimum needed to know the file layout. With the layout you can then expose getters that call Lazy functions or similar which then jump to the specific position in the file stream and parse the section on demand instead of up front. This method is extremely useful as the program only needs to actually read and parse what you need so allocations will be kept to a minimum. The biggest problem this has it that it requires some not so small amount of refactoring and API changes.
The text was updated successfully, but these errors were encountered: