-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix LimitedInputStream reading too far #38
base: main
Are you sure you want to change the base?
Conversation
It is implicit that we are dealing with bytes in the context of an InputStream, and we were invoking .toBytes() on every use anyway.
If using a LimitedInputStream for reading a certain amount of "header" data from an InputStream, then rewinding back to start in order to process the entire stream. E.g. to persist it somewhere. The problem with this approach is that a LimitedInputStream can be instructed to throw an exception if trying to read past the limit (i.e. it is expected to finish processing fairly early in the stream, and then end reading, and the limit is to protect spooling through a potentially huge amount of data), and in the event of actually reaching the limit, a LimitedInputStream _must_ read at least one more byte in order to determine if the underlying stream is exhausted and yields -1, or if it has more data, and followingly the LimitedInputStream must throw an exception. This reeks of a design error in LimitedInputStream. Perhaps having the variance of throwing an exception on reaching the "end" of a limited stream is flawed, and this issue demonstrates that. A LimitedInputStream introduces a potential earlier EOF, and it should perhaps strictly treat it like that and yield -1 in any case it reaches the limit (i.e. the end). If detecting if it actually reached the limit is required, it is a separate concern, and must be done externally wrt. the InputStream.
5e5178b
to
fb5d6c8
Compare
I am leaving this for now, need some thinking around this. It has occured to me that it may actually be appropriate to throw an exception in any case if trying to read beyond the limit, regardless if the underlying stream would EOF on this read attempt, and this would probably simplify the implementation and also has the potential of avoiding any read attempt past the set limit of LimitedInputStream. The need to "peek" one byte beyond the limit to distinguish if the underlying stream EOFs at exactly the limit, or if there are more data, may be a bit artificial, and if the LimitedInputStream is set to throw an exception, it should do that in the event of reaching the limit, regardless if the content of the underlying stream happens to exactly "fit" within the limit. The issue that should be prioritized to fix is to guard the LimitedInputStream from reading too far. Doing some thinking, and either closing or reworking this PR later. 🤔 |
Well, it fixes it sort of 99.99%...
Issue 1
The ability of the LimitedInputStream to throw an exception if reading beyond its limit introduces a small issue, and I believe this supported variance breaks the design of InputStream behavior. Reaching the limit should in all cases just be an EOF, not a thrown exception. The reason for this is that this makes it impossible to stop exactly after reading the last byte before the limit. If consuming the
LimitedInputStream
by exhausting it (which is the usual way), one must do at least one read beyond the limit to see if the underlying stream is also exactly at EOF, or if it contains more data (and followingly it should throw an exception because reading beyond the limit). See also commit msg of ecb3a8eIf e.g. wrapping a
BufferedInputStream
with aLimitedInputStream
in order to initially consume a header portion of the bytes, and then afterwards.reset()
the BufferedInputStream, the value passed to.mark(..)
must be one more than the reading limit ofLimitedInputStream
, because it needs to read one more byte to determine it's next action if reaching the limit, and this breaks the setmark
in theBufferedInputStream
. It would be expected that these two values should be the same, that a LimitedInputStream in fact never reads beyond it's limit.The
LimitedInputStreamTest.rewindWhenReachingLimit()
test demonstrates this.Issue 2
The second issue this PR fixes is:
If reading bytes in chunks (which is most common for performance reasons), if the last chunk of read bytes reads past the (and potentially way past) the limit, the last chunk would be treated as having reached the limit and either an EOF or exception would be yielded. For example:
Running
LimitedInputStreamTest.rewindWhenReachingLimit()
against the old implementation also demonstrates this, becauseIOUtils.toByteArray
reads chunks of 8kB, and thus consuming the LimitedInputStream (limit=1024) yields zero bytes, because the first read way past the limit.Now, the reading from the source stream is guarded by calculating how many bytes is applicable to read.
Conclusion
LimitedInputStream
now guards the delegated read operations to never read more than absolutely necessary.It is still however necessary to provide a value for
BufferedInputStream.mark(..)
one larger than the limit of the wrapping LimitedInputStream in order for being able to safely rewind after exhausting the LimitedInputStream. This is unfortunate, but still better than theLimitedInputStream
potentially reading way past its limit.