
Fix LimitedInputStream reading too far #38

Draft
wants to merge 4 commits into base: main

Conversation

@runeflobakk (Member) commented Jul 31, 2024

Well, it fixes it sort of 99.99%...

Issue 1

The ability of the LimitedInputStream to throw an exception when reading beyond its limit introduces a subtle issue, and I believe this supported variance breaks the expected InputStream behavior. Reaching the limit should in all cases simply be an EOF, not a thrown exception, because the exception variant makes it impossible to stop exactly after reading the last byte before the limit. When consuming a LimitedInputStream by exhausting it (which is the usual way), one must do at least one read beyond the limit to see whether the underlying stream is also exactly at EOF, or whether it contains more data (in which case an exception must be thrown for reading beyond the limit). See also the commit message of ecb3a8e.

If e.g. wrapping a BufferedInputStream in a LimitedInputStream in order to first consume a header portion of the bytes and then .reset() the BufferedInputStream afterwards, the value passed to .mark(..) must be one more than the reading limit of the LimitedInputStream, because the LimitedInputStream needs to read one extra byte to determine its next action when reaching the limit, and that extra read invalidates the mark set in the BufferedInputStream. One would expect these two values to be the same, i.e. that a LimitedInputStream in fact never reads beyond its limit.

The LimitedInputStreamTest.rewindWhenReachingLimit() test demonstrates this.
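The mark/reset interaction can also be reproduced with a plain BufferedInputStream, independent of the project's code. This is a minimal sketch (not the project's test); the `canReset` helper and its parameters are hypothetical, and the extra byte read stands in for the "peek" the old LimitedInputStream had to do:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;

public class MarkBeyondLimitSketch {

    // Returns true if BufferedInputStream.reset() still works after
    // marking with markLimit and then reading bytesRead single bytes.
    static boolean canReset(int markLimit, int bytesRead) {
        // Small initial buffer (16) so the mark bookkeeping is exercised.
        BufferedInputStream in = new BufferedInputStream(
                new ByteArrayInputStream(new byte[4096]), 16);
        try {
            in.mark(markLimit);
            for (int i = 0; i < bytesRead; i++) {
                in.read();
            }
            in.reset();
            return true;
        } catch (IOException resetFailed) {
            return false;
        }
    }

    public static void main(String[] args) {
        int limit = 1024;
        // The old behavior reads limit + 1 bytes (the extra "peek"),
        // so marking with exactly the limit breaks reset():
        System.out.println(canReset(limit, limit + 1));      // false
        // Only a mark of limit + 1 survives the extra read:
        System.out.println(canReset(limit + 1, limit + 1));  // true
    }
}
```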

Issue 2

The second issue this PR fixes: when bytes are read in chunks (the common case, for performance reasons), a last chunk that reads past the limit (potentially way past it) is treated as having reached the limit, and either an EOF or an exception is yielded. For example:

  • you limit to reading 4096 bytes
  • bytes are consumed in chunks of 3072 bytes
  • then you would get exactly 3072 bytes from the stream, and not the last 1024 bytes up to the limit, because reading the last chunk would yield an "early" EOF.
  • in addition, LimitedInputStream would read a total of 6144 bytes from the underlying stream, because it naively delegated read operations directly and did the limit calculations only after each read.
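The arithmetic above can be sketched with a hypothetical reconstruction of the old delegate-then-check behavior. `NaiveLimited` is illustrative only, not the project's actual class:

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical reconstruction of the OLD behavior: delegate the read
// first, check the limit only afterwards.
class NaiveLimited extends FilterInputStream {
    private final long limit;
    private long consumed;

    NaiveLimited(InputStream in, long limit) {
        super(in);
        this.limit = limit;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        int n = in.read(b, off, len);    // may read way past the limit
        if (n > 0) consumed += n;
        if (consumed > limit) return -1; // the overshooting chunk is dropped
        return n;
    }
}

public class NaiveLimitedDemo {

    // Returns {bytes delivered to the caller, bytes consumed from source}.
    static int[] run(int sourceSize, int limit, int chunkSize) throws IOException {
        ByteArrayInputStream source = new ByteArrayInputStream(new byte[sourceSize]);
        InputStream limited = new NaiveLimited(source, limit);
        byte[] chunk = new byte[chunkSize];
        int delivered = 0;
        int n;
        while ((n = limited.read(chunk, 0, chunk.length)) != -1) {
            delivered += n;
        }
        return new int[] { delivered, sourceSize - source.available() };
    }

    public static void main(String[] args) throws IOException {
        int[] r = run(8192, 4096, 3072);
        // The caller gets only 3072 of the 4096 permitted bytes, while
        // 6144 bytes were consumed from the underlying stream.
        System.out.println(r[0] + " " + r[1]);  // 3072 6144
    }
}
```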

Running LimitedInputStreamTest.rewindWhenReachingLimit() against the old implementation also demonstrates this: IOUtils.toByteArray reads chunks of 8 kB, so consuming the LimitedInputStream (limit=1024) yields zero bytes, because the very first read went way past the limit.

Now, reading from the source stream is guarded by calculating how many bytes are permitted to be read.
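A minimal sketch of such a guard, assuming hypothetical class and field names (`GuardedLimited`, `limit`, `consumed`) rather than the project's actual implementation: the requested length is clamped so the delegated read can never cross the limit.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class GuardedLimited extends FilterInputStream {
    private final long limit;
    private long consumed;

    GuardedLimited(InputStream in, long limit) {
        super(in);
        this.limit = limit;
    }

    @Override
    public int read() throws IOException {
        if (consumed >= limit) return -1;  // limit reached: plain EOF
        int b = in.read();
        if (b != -1) consumed++;
        return b;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        long remaining = limit - consumed;
        if (remaining <= 0) return -1;     // limit reached: plain EOF
        // Clamp the chunk size before delegating, never after.
        int n = in.read(b, off, (int) Math.min(len, remaining));
        if (n > 0) consumed += n;
        return n;
    }
}

public class GuardedLimitedDemo {

    // Returns {bytes delivered to the caller, bytes consumed from source}.
    static int[] run(int sourceSize, int limit, int chunkSize) throws IOException {
        ByteArrayInputStream source = new ByteArrayInputStream(new byte[sourceSize]);
        InputStream limited = new GuardedLimited(source, limit);
        byte[] chunk = new byte[chunkSize];
        int delivered = 0;
        int n;
        while ((n = limited.read(chunk, 0, chunk.length)) != -1) {
            delivered += n;
        }
        return new int[] { delivered, sourceSize - source.available() };
    }

    public static void main(String[] args) throws IOException {
        int[] r = run(8192, 4096, 3072);
        // All 4096 permitted bytes are delivered, and the underlying
        // stream is read exactly 4096 bytes, never more.
        System.out.println(r[0] + " " + r[1]);  // 4096 4096
    }
}
```

With the guard in place, the chunked scenario from Issue 2 delivers the full 4096 bytes and leaves the rest of the source untouched.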

Conclusion

LimitedInputStream now guards the delegated read operations to never read more than absolutely necessary.
It is, however, still necessary to pass BufferedInputStream.mark(..) a value one larger than the limit of the wrapping LimitedInputStream in order to safely rewind after exhausting the LimitedInputStream. This is unfortunate, but still better than the LimitedInputStream potentially reading way past its limit.

It is implicit that we are dealing with bytes in the context of an
InputStream, and we were invoking .toBytes() on every use anyway.
If using a LimitedInputStream to read a certain amount of
"header" data from an InputStream, one typically rewinds back to the
start afterwards in order to process the entire stream, e.g. to persist
it somewhere.

The problem with this approach is that a LimitedInputStream can be
instructed to throw an exception on an attempt to read past the limit
(i.e. it is expected to finish processing fairly early in the stream and
then stop reading, the limit protecting against spooling through a
potentially huge amount of data). In the event of actually reaching the
limit, the LimitedInputStream _must_ read at least one more byte to
determine whether the underlying stream is exhausted and yields -1, or
whether it has more data, in which case the LimitedInputStream must
throw an exception.

This reeks of a design error in LimitedInputStream. Perhaps the
variance of throwing an exception on reaching the "end" of a limited
stream is flawed, and this issue demonstrates that. A LimitedInputStream
introduces a potential earlier EOF, and it should perhaps strictly treat
it as such and yield -1 in every case where it reaches the limit (i.e. the end).
If detecting whether the limit was actually reached is required, that is a
separate concern and must be handled externally to the InputStream.
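One way such external detection could look, sketched under the assumption that the caller still holds a reference to the underlying source (the `hasDataBeyondLimit` helper and its shape are hypothetical): the limited view simply stops at the limit, and the "did we hit it" question is answered by peeking the source directly.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;

public class ExternalLimitCheck {

    // Consume at most `limit` bytes from `source`, then determine
    // externally whether more data lies beyond the limit.
    static boolean hasDataBeyondLimit(ByteArrayInputStream source, int limit)
            throws IOException {
        byte[] sink = new byte[limit];
        int total = 0;
        int n;
        while (total < limit
                && (n = source.read(sink, total, limit - total)) != -1) {
            total += n;
        }
        // The peek happens on the source, outside the limited view.
        // (It consumes one byte; with mark/reset on the source that
        // byte can be recovered.)
        return source.read() != -1;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(hasDataBeyondLimit(new ByteArrayInputStream(new byte[2048]), 1024)); // true
        System.out.println(hasDataBeyondLimit(new ByteArrayInputStream(new byte[1024]), 1024)); // false
        System.out.println(hasDataBeyondLimit(new ByteArrayInputStream(new byte[512]), 1024));  // false
    }
}
```

Note the middle case: a source that is exactly as long as the limit is cleanly distinguishable here, which is precisely what the stream itself cannot do without reading one byte too many.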
@runeflobakk runeflobakk requested a review from sarawe July 31, 2024 11:06
@runeflobakk runeflobakk marked this pull request as ready for review July 31, 2024 11:06
@runeflobakk runeflobakk requested a review from fredrbus July 31, 2024 11:09
@runeflobakk runeflobakk marked this pull request as draft August 1, 2024 15:20
@runeflobakk runeflobakk force-pushed the fix-limited-stream-reading-too-far branch from 5e5178b to fb5d6c8 on August 2, 2024 12:18
@runeflobakk (Member, Author) commented Aug 2, 2024

I am leaving this for now; it needs some more thinking.

It has occurred to me that it may actually be appropriate to throw an exception in every case where a read is attempted beyond the limit, regardless of whether the underlying stream would EOF on that read attempt. This would probably simplify the implementation, and it also has the potential of avoiding any read attempt past the set limit of the LimitedInputStream. The need to "peek" one byte beyond the limit to distinguish whether the underlying stream EOFs at exactly the limit, or whether there is more data, may be a bit artificial. If the LimitedInputStream is set to throw an exception, it should do so on reaching the limit, regardless of whether the content of the underlying stream happens to exactly "fit" within the limit.
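That strict variant might look like this (a sketch with illustrative names, not a proposed final implementation): the throw happens the moment a read is attempted at the limit, so no read of the underlying stream ever goes past it, and content that fits exactly within the limit still triggers the exception on the next read attempt.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical "strict" variant: throw on any read attempt beyond the
// limit, without peeking whether the underlying stream ends exactly there.
class StrictLimited extends FilterInputStream {
    private final long limit;
    private long consumed;

    StrictLimited(InputStream in, long limit) {
        super(in);
        this.limit = limit;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (consumed >= limit) {
            // No extra read of the underlying stream is ever needed.
            throw new IOException("attempted to read beyond limit of " + limit + " bytes");
        }
        int n = in.read(b, off, (int) Math.min(len, limit - consumed));
        if (n > 0) consumed += n;
        return n;
    }
}

public class StrictLimitedDemo {
    public static void main(String[] args) throws IOException {
        // Content fits exactly within the limit: reading up to the limit
        // succeeds, but one more read attempt still throws.
        InputStream in = new StrictLimited(new ByteArrayInputStream(new byte[1024]), 1024);
        byte[] chunk = new byte[1024];
        System.out.println(in.read(chunk, 0, chunk.length));  // 1024
        try {
            in.read(chunk, 0, chunk.length);
        } catch (IOException expected) {
            System.out.println("threw as configured");
        }
    }
}
```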

The fix that should be prioritized is guarding the LimitedInputStream from reading too far.

Doing some thinking, and either closing or reworking this PR later. 🤔
