Fix LimitedInputStream reading too far #38

runeflobakk · 2024-07-31T10:36:41Z

Well, it fixes it sort of 99.99%...

Issue 1

The ability of the LimitedInputStream to throw an exception if reading beyond its limit introduces a small issue, and I believe this supported variance breaks the design of InputStream behavior. Reaching the limit should in all cases just be an EOF, not a thrown exception. The reason for this is that this makes it impossible to stop exactly after reading the last byte before the limit. If consuming the LimitedInputStream by exhausting it (which is the usual way), one must do at least one read beyond the limit to see if the underlying stream is also exactly at EOF, or if it contains more data (and followingly it should throw an exception because reading beyond the limit). See also commit msg of ecb3a8e

If e.g. wrapping a BufferedInputStream with a LimitedInputStream in order to initially consume a header portion of the bytes, and then afterwards .reset() the BufferedInputStream, the value passed to .mark(..) must be one more than the reading limit of LimitedInputStream, because it needs to read one more byte to determine it's next action if reaching the limit, and this breaks the set mark in the BufferedInputStream. It would be expected that these two values should be the same, that a LimitedInputStream in fact never reads beyond it's limit.

The LimitedInputStreamTest.rewindWhenReachingLimit() test demonstrates this.

Issue 2

The second issue this PR fixes is:
If reading bytes in chunks (which is most common for performance reasons), if the last chunk of read bytes reads past the (and potentially way past) the limit, the last chunk would be treated as having reached the limit and either an EOF or exception would be yielded. For example:

you limit to reading 4096 bytes
bytes are consumed in chunks of 3072 bytes
then you would get exactly 3072 bytes from the stream, and not the last 1024 bytes up to the limit, because reading the last chunk would yield an "early" EOF.
in addition, LimitedInputStream would read in total 6144 bytes from the underlying stream, because it naively delegated reading operations directly, and did the calculations wrt. the limit after each read.

Running LimitedInputStreamTest.rewindWhenReachingLimit() against the old implementation also demonstrates this, because IOUtils.toByteArray reads chunks of 8kB, and thus consuming the LimitedInputStream (limit=1024) yields zero bytes, because the first read way past the limit.

Now, the reading from the source stream is guarded by calculating how many bytes is applicable to read.

Conclusion

LimitedInputStream now guards the delegated read operations to never read more than absolutely necessary.
It is still however necessary to provide a value for BufferedInputStream.mark(..) one larger than the limit of the wrapping LimitedInputStream in order for being able to safely rewind after exhausting the LimitedInputStream. This is unfortunate, but still better than the LimitedInputStream potentially reading way past its limit.

It is implicit that we are dealing with bytes in the context of an InputStream, and we were invoking .toBytes() on every use anyway.

If using a LimitedInputStream for reading a certain amount of "header" data from an InputStream, then rewinding back to start in order to process the entire stream. E.g. to persist it somewhere. The problem with this approach is that a LimitedInputStream can be instructed to throw an exception if trying to read past the limit (i.e. it is expected to finish processing fairly early in the stream, and then end reading, and the limit is to protect spooling through a potentially huge amount of data), and in the event of actually reaching the limit, a LimitedInputStream _must_ read at least one more byte in order to determine if the underlying stream is exhausted and yields -1, or if it has more data, and followingly the LimitedInputStream must throw an exception. This reeks of a design error in LimitedInputStream. Perhaps having the variance of throwing an exception on reaching the "end" of a limited stream is flawed, and this issue demonstrates that. A LimitedInputStream introduces a potential earlier EOF, and it should perhaps strictly treat it like that and yield -1 in any case it reaches the limit (i.e. the end). If detecting if it actually reached the limit is required, it is a separate concern, and must be done externally wrt. the InputStream.

runeflobakk · 2024-08-02T12:26:14Z

I am leaving this for now, need some thinking around this.

It has occured to me that it may actually be appropriate to throw an exception in any case if trying to read beyond the limit, regardless if the underlying stream would EOF on this read attempt, and this would probably simplify the implementation and also has the potential of avoiding any read attempt past the set limit of LimitedInputStream. The need to "peek" one byte beyond the limit to distinguish if the underlying stream EOFs at exactly the limit, or if there are more data, may be a bit artificial, and if the LimitedInputStream is set to throw an exception, it should do that in the event of reaching the limit, regardless if the content of the underlying stream happens to exactly "fit" within the limit.

The issue that should be prioritized to fix is to guard the LimitedInputStream from reading too far.

Doing some thinking, and either closing or reworking this PR later. 🤔

runeflobakk added 3 commits July 30, 2024 12:52

Refactor to single test with nested variants

dd82650

Use long instead of DataSize in LimitedInputStream

67e2f78

It is implicit that we are dealing with bytes in the context of an InputStream, and we were invoking .toBytes() on every use anyway.

runeflobakk requested a review from sarawe July 31, 2024 11:06

runeflobakk marked this pull request as ready for review July 31, 2024 11:06

runeflobakk requested a review from fredrbus July 31, 2024 11:09

runeflobakk marked this pull request as draft August 1, 2024 15:20

More tests for LimitedInputStream

fb5d6c8

runeflobakk force-pushed the fix-limited-stream-reading-too-far branch from 5e5178b to fb5d6c8 Compare August 2, 2024 12:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix LimitedInputStream reading too far #38

Fix LimitedInputStream reading too far #38

runeflobakk commented Jul 31, 2024 •

edited

Loading

runeflobakk commented Aug 2, 2024 •

edited

Loading

Fix LimitedInputStream reading too far #38

Are you sure you want to change the base?

Fix LimitedInputStream reading too far #38

Conversation

runeflobakk commented Jul 31, 2024 • edited Loading

Issue 1

Issue 2

Conclusion

runeflobakk commented Aug 2, 2024 • edited Loading

runeflobakk commented Jul 31, 2024 •

edited

Loading

runeflobakk commented Aug 2, 2024 •

edited

Loading