Investigate use of Compact strings for TextBuffer
#910
Comments
Could be useful to prototype and we could perf test it.
The Reader-based JSON parser would probably still be faster with a char-array-based TextBuffer. Possibly we need separate byte and char TextBuffer impls.
I would be surprised if that could perform well: wouldn't it need to re-encode UTF-8 in the general case? And considering that input to decode (before being fed to …)

I mean, I am not against someone trying things out, but my first gut reaction is that this does not make much sense. Happy to be proven wrong, of course :)
It would have to re-encode to UTF-16 if there is non-latin1 input, yes. This is also what StringBuilder does, though, and I think it may be worth it; it will depend heavily on the input data, of course. Maybe it would be worth trying to implement TextBuffer with a StringBuilder as the backing store instead of the char[] array. It should be fairly easy, since StringBuilder already has all the compaction machinery, so this approach could be good for testing at least.
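To make that concrete, here is a minimal sketch of what a StringBuilder-backed buffer could look like. The class name and method set are made up for illustration and mirror only a small subset of jackson-core's TextBuffer operations:

```java
// Rough sketch of a StringBuilder-backed text buffer (illustrative only,
// not jackson-core's actual API).
public class StringBuilderTextBuffer {
    private final StringBuilder _sb = new StringBuilder();

    public void resetWithEmpty() {
        // Note: on current JDKs this does NOT reset the internal coder back
        // to latin1 (see the "degeneration" discussion later in this thread).
        _sb.setLength(0);
    }

    public void append(char c) {
        _sb.append(c);
    }

    public void append(char[] c, int start, int len) {
        _sb.append(c, start, len);
    }

    public int size() {
        return _sb.length();
    }

    public String contentsAsString() {
        // StringBuilder.toString() can copy the compact byte[] directly,
        // avoiding the latin1-compaction pass that new String(char[]) runs.
        return _sb.toString();
    }
}
```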
@yawkat The StringBuilder approach loses the BufferRecycler support that jackson-core has. Maybe we would need a StringBuilder recycler.
Really? But back to the original suggestion of reduced memory usage... I don't know. The buffer recycler does hold on to buffers, but otherwise these are transient buffers. It is also possible that …

EDIT: read about "compact strings", added in JDK 9, years ago (which I vaguely remember).
Okay, I've done some preliminary benchmarking in https://github.com/yawkat/jackson-core/tree/compact-strings . It is very interesting.

The benchmark results without StringBuilder:

…
and with StringBuilder:

…
As you can see, SB is basically the same for short strings, and it is significantly slower for non-ASCII strings (this could be mitigated, but probably not fixed 100%). However, SB is significantly faster (30% faster) for long ASCII strings.

The code is basically the same for both, with StringBuilder instead of char[] of course. There is one small change: I've replaced the ASCII copying loops, which would previously store the individual bytes into the char[], with …

One problem here is that the changes required are really annoying. What I did was basically copy TextBuffer for SB (SBTextBuffer) and copy the BufferRecycler infrastructure for SB. I then created a base class for TextBuffer and SBTextBuffer (TextBufferBase) with abstract methods for what ParserBase uses. I changed ParserBase to use the new TextBufferBase and added a getter (…).

Another technical challenge I've not looked at is that StringBuilders "degenerate" to UTF-16 if they ever contain non-latin1 characters. That means that if any input contains non-ASCII strings, the buffers in BufferRecycler become UTF-16 (even after reset) and the performance benefit will disappear. This could be fixed by a simple JDK change (setLength(0) could reset the coder of the StringBuilder to ASCII), but that'd take a while to appear in releases. Until then, we would need separate recyclers for ASCII and non-ASCII StringBuilders, and support the transformation between them in SBTextBuffer; a sketch of that follows below. Fortunately this issue is not relevant in the benchmarks (strings on the same fork are homogeneous), but it would need to be resolved.

In summary, the preliminary performance numbers look really good for the "long ASCII string" case. I think this is a common case (e.g. long base64 strings), so in my opinion, if this change were simple, it would be worth it. However, given the technical challenges involved, I don't think it's feasible.
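One possible shape for that two-pool workaround, until a JDK change lands: track "degeneration" on our side rather than reading StringBuilder's private coder field. All class and method names here are hypothetical; this is only a sketch of the idea:

```java
import java.util.ArrayDeque;

// Since setLength(0) does not shrink a StringBuilder back to the latin1
// coder, remember degeneration ourselves and pool degenerated builders
// separately from fresh latin1 ones.
final class TrackedStringBuilder {
    final StringBuilder sb = new StringBuilder();
    boolean degenerated; // true once any char above 0xFF has been appended

    void append(char c) {
        if (c > 0xFF) {
            degenerated = true;
        }
        sb.append(c);
    }
}

final class StringBuilderRecycler {
    private final ArrayDeque<TrackedStringBuilder> latin1Pool = new ArrayDeque<>();
    private final ArrayDeque<TrackedStringBuilder> utf16Pool = new ArrayDeque<>();

    TrackedStringBuilder acquireLatin1() {
        TrackedStringBuilder t = latin1Pool.poll();
        return (t != null) ? t : new TrackedStringBuilder();
    }

    void release(TrackedStringBuilder t) {
        t.sb.setLength(0);
        // A builder that ever held non-latin1 content stays UTF-16 internally,
        // so it must not be handed out where latin1 performance is expected.
        (t.degenerated ? utf16Pool : latin1Pool).offer(t);
    }
}
```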
@yawkat Very impressive investigation, thank you for doing this! It is pretty interesting that in the expected good case (long ASCII) there are tangible performance benefits. Still, it is good to be aware of this idea; maybe something could come out of it in the future.
Another option I just thought of would be to use a byte[] containing raw UTF-8, instead of a StringBuilder. The Strings would then be constructed directly from the byte array: either as UTF-8 if it contains non-ASCII characters, or as ISO_8859_1 if it contains only ASCII (we can know this, since we need to iterate anyway to detect quotes and escape sequences). This would have the same copying benefits as this test case, but would avoid the "degeneration" issue with StringBuilder. Unfortunately it still has the same issues with TextBuffer compatibility.

However, I think it may be workable at some point to replace all uses of TextBuffer with such a byte[]. This is because the String(char[]) constructors that we currently use will usually do a "compression" to a latin1 byte[] anyway. So if we instead use a byte[] on the buffer side, even for the Reader-based JsonParsers, we might not get a performance hit, because we simply move the "compression" step to latin1 to our side instead of having the String constructor do it.
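The String-construction step of that idea could look roughly like this (the helper and its asciiOnly parameter are hypothetical; the flag would come from the scan the parser already does):

```java
import java.nio.charset.StandardCharsets;

final class Utf8BufferStrings {
    // asciiOnly would be tracked while filling the buffer, in the same pass
    // that already scans for quotes and escape sequences, so no extra
    // iteration is needed here.
    static String contentsAsString(byte[] buf, int len, boolean asciiOnly) {
        if (asciiOnly) {
            // ASCII bytes decode identically as ISO-8859-1, and the JDK can
            // build a compact latin1 String from them with a single copy.
            return new String(buf, 0, len, StandardCharsets.ISO_8859_1);
        }
        // Non-ASCII content needs real UTF-8 decoding (and may end up UTF-16).
        return new String(buf, 0, len, StandardCharsets.UTF_8);
    }
}
```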
I was looking at some heap dumps and saw that TextBuffer still uses char[] internally. I think it could instead use an approach similar to the JDK's String and StringBuilder, with a byte[] that either contains latin1 encoded as one byte per char, or UTF-16 with two bytes per char. This would save a lot of memory in most standard cases, especially when the buffer becomes large.
It could also improve performance a bit when constructing String instances. Using the String charset constructor, it's possible to create a latin1 String directly from bytes with a single copy, while going through the char[] constructor needs to run compaction. I don't know if this is relevant, though; the compaction is probably already very fast.
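A condensed sketch of that dual-coder layout, mirroring the JDK's compact String/StringBuilder design. The class, fields, and growth logic are made up for illustration and ignore TextBuffer's segmented API:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Latin1 at one byte per char until a char above 0xFF arrives, then
// inflated to UTF-16 at two bytes per char (stored big-endian).
final class CompactTextBuffer {
    private byte[] bytes = new byte[64];
    private int length;            // length in chars, not bytes
    private boolean latin1 = true;

    void append(char c) {
        if (latin1 && c > 0xFF) {
            inflate();
        }
        if (latin1) {
            if (length == bytes.length) {
                bytes = Arrays.copyOf(bytes, bytes.length * 2);
            }
            bytes[length++] = (byte) c;
        } else {
            if (length * 2 == bytes.length) {
                bytes = Arrays.copyOf(bytes, bytes.length * 2);
            }
            bytes[length * 2] = (byte) (c >> 8); // high byte (big-endian)
            bytes[length * 2 + 1] = (byte) c;    // low byte
            length++;
        }
    }

    // Widen existing latin1 content to two bytes per char.
    private void inflate() {
        byte[] wide = new byte[bytes.length * 2];
        for (int i = 0; i < length; i++) {
            wide[i * 2 + 1] = bytes[i]; // low byte; high byte stays zero
        }
        bytes = wide;
        latin1 = false;
    }

    String contentsAsString() {
        if (latin1) {
            // Single-copy path: the JDK builds a compact latin1 String directly.
            return new String(bytes, 0, length, StandardCharsets.ISO_8859_1);
        }
        return new String(bytes, 0, length * 2, StandardCharsets.UTF_16BE);
    }
}
```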
What do you think @cowtowncoder @pjfanning ?