-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Stop breaking surrogate pairs in toDelta()/fromDelta() #80
base: master
Are you sure you want to change the base?
Conversation
01426e2
to
71646fb
Compare
See google/diff-match-patch#80 There have been certain cases where consecutive surrogate pairs crash `diff-match-patch` when it's building the _delta string_. This is because the diffing algorithm finds a common prefix up to and including half of the surrogate pair match when three consecutive code points share the same high surrogate. ``` // before - \ud83d\ude4b\ud83d\ue4b // after - \ud83d\ude4b\ud83d\ue4c\ud83d\ude4b // deltas - eq \ud83d\ude4b\ud83d -> add \ude4c -> eq \ud83d \ude4b ``` This crashes when trying to encode the invalid sequence of UTF-16 code units into `URIEncode`, which expects a valid Unicode string. After the fix the delta groups are normalized to prevent this situation and the `node-simperium` library should be able to process those problematic changesets.
See google/diff-match-patch#80 There have been certain cases where consecutive surrogate pairs crash `diff-match-patch` when it's building the _delta string_. This is because the diffing algorithm finds a common prefix up to and including half of the surrogate pair match when three consecutive code points share the same high surrogate. ``` // before - \ud83d\ude4b\ud83d\ue4b // after - \ud83d\ude4b\ud83d\ue4c\ud83d\ude4b // deltas - eq \ud83d\ude4b\ud83d -> add \ude4c -> eq \ud83d \ude4b ``` This crashes when trying to encode the invalid sequence of UTF-16 code units into `URIEncode`, which expects a valid Unicode string. After the fix the delta groups are normalized to prevent this situation and the `node-simperium` library should be able to process those problematic changesets.
See: Simperium/node-simperium#91 See: google/diff-match-patch#80 There have been certain cases where consecutive surrogate pairs crash `diff-match-patch` when it's building the _delta string_. This is because the diffing algorithm finds a common prefix up to and including half of the surrogate pair match when three consecutive code points share the same high surrogate. ``` // before - \ud83d\ude4b\ud83d\ue4b // after - \ud83d\ude4b\ud83d\ue4c\ud83d\ude4b // deltas - eq \ud83d\ude4b\ud83d -> add \ude4c -> eq \ud83d \ude4b ``` This crashes when trying to encode the invalid sequence of UTF-16 code units into `URIEncode`, which expects a valid Unicode string. After the fix the delta groups are normalized to prevent this situation and the `node-simperium` library should be able to process those problematic changesets. This patch updates the dependency on the patched version of `node-simperium`.
See google/diff-match-patch#80 There have been certain cases where consecutive surrogate pairs crash `diff-match-patch` when it's building the _delta string_. This is because the diffing algorithm finds a common prefix up to and including half of the surrogate pair match when three consecutive code points share the same high surrogate. ``` // before - \ud83d\ude4b\ud83d\ue4b // after - \ud83d\ude4b\ud83d\ue4c\ud83d\ude4b // deltas - eq \ud83d\ude4b\ud83d -> add \ude4c -> eq \ud83d \ude4b ``` This crashes when trying to encode the invalid sequence of UTF-16 code units into `URIEncode`, which expects a valid Unicode string. After the fix the delta groups are normalized to prevent this situation and the `node-simperium` library should be able to process those problematic changesets.
* Fix: No more crashing/skipped changes for certain changes See: Simperium/node-simperium#91 See: google/diff-match-patch#80 There have been certain cases where consecutive surrogate pairs crash `diff-match-patch` when it's building the _delta string_. This is because the diffing algorithm finds a common prefix up to and including half of the surrogate pair match when three consecutive code points share the same high surrogate. ``` // before - \ud83d\ude4b\ud83d\ue4b // after - \ud83d\ude4b\ud83d\ue4c\ud83d\ude4b // deltas - eq \ud83d\ude4b\ud83d -> add \ude4c -> eq \ud83d \ude4b ``` This crashes when trying to encode the invalid sequence of UTF-16 code units into `URIEncode`, which expects a valid Unicode string. After the fix the delta groups are normalized to prevent this situation and the `node-simperium` library should be able to process those problematic changesets. This patch updates the dependency on the patched version of `node-simperium`.
@NeilFraser is there anything I can do to help this patch/issue move along? |
fbb8c73
to
55a8779
Compare
Quick update for those following: we identified issues when the surrogate pairs were in I've updated the PR with an improved patch to the JS/Java libraries and hope to port the changes to the other libraries tomorrow. In this update I've included a custom This custom |
Q: What happens in the different languages when diffing "人" vs "中"? JavaScript should just say delete one character, insert another. But what about Python? |
Thank you so much @NeilFraser for the response!
Right now it appears like It doesn't even need to be a question about language though. Python2 is internally inconsistent because if Python is compiled in narrow mode it returns different numbers than if Python is compiled in wide mode. In my patch I'm normalizing In other words before my patch it's more different across libraries and platforms than after it. After this patch the languages which have been implemented here should all return the same delta - delete one character, insert another. In tests across JavaScript, Java, Objective-C, Python2, Python3 |
@NeilFraser is there a way to run the suite of relative performance diffs or is that simply generated from running I've been running the only real major issue is the question of what the indices should be, and I've tried to not change that other than in Python2 compiled in wide mode and in Python 3. we can back out the Python 3 change if we need to preserve the behavior. the other languages/platforms in this patch all previously returned UCS-2 code units. |
bdc4740
to
fa122d3
Compare
We also faced an issue with surrogate pairs here where the substring breaking doesn't account for surrogate pairs. |
In 1.0.2 we applied an update from google/diff-match-patch#80 to resolve the surrogate pair encoding issue. Since that time we found a bug in the fix and resolved that and therefore this patch brings those additional updates to Simeprium. There previously remained issues when decoding broken patches that already existed or which came from unpatched versions of the iOS library and also remained issues when unexpectedly receiving empty diff groups in `toDelta`
In 1.0.2 we applied an update from google/diff-match-patch#80 to resolve the surrogate pair encoding issue. Since that time we found a bug in the fix and resolved that and therefore this patch brings those additional updates to Simeprium. There previously remained issues when decoding broken patches that already existed or which came from unpatched versions of the iOS library and also remained issues when unexpectedly receiving empty diff groups in `toDelta`
@NeilFraser what can we do to be helpful here? We've been running with these changes in Simplenote for more than a month and it eliminated all of the encoding issues we were experiencing. I know that there is still a "better option" of adjusting the way that we break strings up so that we never get broken surrogate halves, but from a practicality and simplicity standpoint I think this is solid (maybe there are bugs we've silently missed). |
When reading through a Rust implementation I found that the author created a separate https://github.com/dtolnay/dissimilar/blob/master/src/lib.rs#L402 |
@dmsnell What's the status of this? Does Unicode work correctly yet in master? |
@mustafamunawar we are waiting feedback or direction from @NeilFraser. We have been using this branch successfully in Simplenote for the past nine months without issue. You are welcome to try it out and watch this PR for an eventual merge. |
this is the basic procedure in this PR as well as has been taken in the dissimilar library linked above. it's a good idea and one that solves the problem of crashing when splitting Unicode boundaries. linking here to avoid clouding your Maven PR (#2) with noise about Unicode boundaries |
@NeilFraser I ran into this problem again today through a dependency-of-a-dependency pulling in For Simplenote we're able to pull in this branch and that works, but with the transitive dependencies in other projects we don't get as much of a luxury. Would rather avoid forking the repo especially since everything in the world basically refers to this one (and you've put so much work into it), but I feel like if we don't fix Again, I'm happy to put more energy into the fix. We've been going for more than two years with this branch in Simplenote and as far as we can infer the issue is entirely eliminated and no side-effects have crept in. Our users write some very long texts too and we've had no indication of performance issues. The approach in the Rust library I think is more elegant than my solution but I think mine here is the least intrusive on the overall library, addressing the problem of splitting surrogate pairs only where it becomes a problem, that is, when URL-encoding the patch. Really appreciate all you've built here and I assume you're busy; just trying to find a way to keep this from being a constant headache. |
I've tried all 3 solutions (#13, #69, #80) to this problem and #13 is the only one that solved the encodingURI exception for me. I created a minimal repo which clearly reproduces the issue: https://github.com/gamedevsam/diff-match-patch-emoji-issue Hope it's helpful to make this solution more robust. |
@gamedevsam I think it's possible you confused It does limit itself to Unfortunately #13 has a massive performance problem particularly for long texts and likely that approach (converting strings to arrays of integers) won't be viable for a generic library like this, especially since this patch (#80) has the fix to get to the same result. You can check with the author of |
@dmsnell Thank you so much for all the work you've put into this accross all the threads! I initially tried fixing this myself, and you helped me realize my approach was way too naive! :) We use About my solution:
🙏 |
Nice work @michal-kurz. If you want to propose a PR with that code against this branch we can merge it in. I'd like to explore using your function in |
@dmsnell Thank you! I made a PR to your branch. Unfortunately there are some unwanted changes in whitespaces comming with it - I spent the last 25 minutes failing to undo them, and the Github diff keeps looking different from my local, no idea why. I'm way too frustrated to continue with that at this point 😠 As for incorporating the |
oh the pain; I feel you: been there, done that. typically I find that some IDE is adding or removing the extra whitespace. did you try using might also verify you don't have custom pre-commit hooks that are running formatting. by the way I suspect you meant to poke me, @dmsnell, and not |
@dmsnell Solved, I ended up soft-resetting, re-commiting only the desired lines and force-pushing the result. And yes, I did mess up the mention tag, also fixed. |
So unfortunately I found my solution for I sometimes end up with a single surrogate on the very edge of the whole patch (first char of first diff, or last char of last diff). For example: str1 = '. 🔧🚗\n\--X'
str2 = '. 🔧🚗\n\--Y'
dmp.patch_make(str1, str2)
// Results in three diffs:
// [
// {0: 0, 1: '\uDE97\n--'},
// {0: -1, 1: 'X'},
// {0: 1, 1: 'Y'}
// ] Such case cannot be fully "losslessly" shifted with @dmsnell 's approach - I have to decide to either expand or shrink the whole diff scope. In my case, the broken diffs were created as I patched those by extending them one character outward in case it's an unpaired surrogate, and this seems to have solved the problem (here is a PR targeted at our fork of dmp). I decided for this approach, because I'm worried that shrinking the scope of the diff could cause some unforseen problems (mainly when coverting back to patch via I hope this is the only potential source of such broken patch. If so, the |
@michal-kurz I added a patch that's almost identical to yours here (should have read yours first 🙃) but I make a couple extra bounds checks to avoid extending the patch context before or after the document boundaries. had to apply both your fix though and the one thing I wish were better is that when I moved that being said, we're still without regression in Simplenote and other places at Automattic where we've deployed this patch and it's been a few years now, so I think whatever practical concerns there may be with this approach, it's clearly better than if there are problems with the approach in this PR, they are so rare that it's not comparable to the wildly-common bugs present in |
Resolves google#69 for JavaScript Sometimes we can find a common prefix that runs into the middle of a surrogate pair and we split that pair when building our diff groups. This is fine as long as we are operating on UTF-16 code units. It becomes problematic when we start trying to treat those substrings as valid Unicode (or UTF-8) sequences. When we pass these split groups into `toDelta()` we do just that and the library crashes. In this patch we're post-processing the diff groups before encoding them to make sure that we un-split the surrogate pairs. The post-processed diffs should produce the same output when applying the diffs. The diff string itself will be different but should change that much - only by a single character at surrogate boundaries.
Resolves google#69 for Java Sometimes we can find a common prefix that runs into the middle of a surrogate pair and we split that pair when building our diff groups. This is fine as long as we are operating on UTF-16 code units. It becomes problematic when we start trying to treat those substrings as valid Unicode (or UTF-8) sequences. When we pass these split groups into `toDelta()` we do just that and the library crashes. In this patch we're post-processing the diff groups before encoding them to make sure that we un-split the surrogate pairs. The post-processed diffs should produce the same output when applying the diffs. The diff string itself will be different but should change that much - only by a single character at surrogate boundaries.
Resolves google#69 for Objective-C Sometimes we can find a common prefix that runs into the middle of a surrogate pair and we split that pair when building our diff groups. This is fine as long as we are operating on UTF-16 code units. It becomes problematic when we start trying to treat those substrings as valid Unicode (or UTF-8) sequences. When we pass these split groups into `toDelta()` we do just that and the library crashes. In this patch we're post-processing the diff groups before encoding them to make sure that we un-split the surrogate pairs. The post-processed diffs should produce the same output when applying the diffs. The diff string itself will be different but should change that much - only by a single character at surrogate boundaries.
Resolves google#69 for Python2 Sometimes we can find a common prefix that runs into the middle of a surrogate pair and we split that pair when building our diff groups. This is fine as long as we are operating on UTF-16 code units. It becomes problematic when we start trying to treat those substrings as valid Unicode (or UTF-8) sequences. When we pass these split groups into `toDelta()` we do just that and the library crashes. In this patch we're post-processing the diff groups before encoding them to make sure that we un-split the surrogate pairs. The post-processed diffs should produce the same output when applying the diffs. The diff string itself will be different but should change that much - only by a single character at surrogate boundaries.
Resolves google#69 for Python3 Sometimes we can find a common prefix that runs into the middle of a surrogate pair and we split that pair when building our diff groups. This is fine as long as we are operating on UTF-16 code units. It becomes problematic when we start trying to treat those substrings as valid Unicode (or UTF-8) sequences. When we pass these split groups into `toDelta()` we do just that and the library crashes. In this patch we're post-processing the diff groups before encoding them to make sure that we un-split the surrogate pairs. The post-processed diffs should produce the same output when applying the diffs. The diff string itself will be different but should change that much - only by a single character at surrogate boundaries.
d681ef6
to
50f1542
Compare
All: I have merged this into my fork and it's now in the Unfortunately I won't have a lot of time to actively maintain the fork, but I will do my best to be responsive and help merge in updates. |
Fixes #10, #59, #68
Alternate to #69
Fixes for Java, JavaScript, Objective C, Python2, Python3
Status
We've deployed these changes in Simplenote and are uncovering any issues we can find.We've been running the code in this patch for years now and no issues or regressions have appeared.dmsnell/diff-match-patch
instead ofgoogle/diff-match-patch
. I'd prefer merging these fixes upstream and would happily do the work required if @NeilFraser provides guidance or opinions or a simple 👍.Todo:
cleanupSplitSurrogates
in JavaScriptcleanupSplitSurrogates
in JavacleanupSplitSurrogates
in Objective CcleanupSplitSurrogates
in Python2cleanupSplitSurrogates
in Python3cleanupSplitSurrogates
in CPPcleanupSplitSurrogates
in CSharpcleanupSplitSurrogates
in DartcleanupSplitSurrogates
in Luapatch_toText
in JavaScriptpatch_toText
in Javapatch_toText
in Objective Cpatch_toText
in Python2patch_toText
in Python3patch_toText
in CPPpatch_toText
in CSharppatch_toText
in Dartpatch_toText
in LuaMajor changes
toDelta()
we no longer attempt to create URI-encoded strings with split surrogates. this was either crashing client applications or corrupting the data, depending on which platform was in use.fromDelta()
we are recovering when possible from corrupted delta strings produced by unpatched versions of this library. there are two fixes:(null)
in between two surrogate halves. this is caused by the original bug and we might receive these sequences in a delta string: (high surrogate)(null)
(low surrogate). If we leave these in here then diff-match-patch will operate fine but it will send invalid Unicode that it created onto the consuming application. in this patch I've added guards against this very specific pattern so that we can remove the unintentionally injected(null)
and re-join the surrogate halves. this means that we're mangling the input but that input would have already been mangled anyway and presumably it'd be "impossible" to send the same sequence of code units to diff-match-patch since something would have had to send invalid Unicode in the first place.Working on this bug has surfaced a bigger question about what diff-match-patch expects. All of the Simplenote applications work at the level of UTF-16 code units and a large part of that is because this is how most of the diff-match-patch platforms work.
Something seems ideal about referencing Unicode code points in diff-match-patch instead of any particular encoding, but to change now seems like it would break way more than it would fix. I'd like to propose that we think about creating a new output version which could be selected to specify that diffs do that.
Sometimes we can find a common prefix that runs into the middle of a
surrogate pair and we split that pair when building our diff groups.
This is fine as long as we are operating on UTF-16 code units. It
becomes problematic when we start trying to treat those substrings as
valid Unicode (or UTF-8) sequences.
When we pass these split groups into
toDelta()
we do just that and thelibrary crashes. In this patch we're post-processing the diff groups
before encoding them to make sure that we un-split the surrogate pairs.
The post-processed diffs should produce the same output when applying
the diffs. The diff string itself will be different but shouldn't change
that much - only by a single character at surrogate boundaries.
Notes
isSurrogate()
for all cases when we should have been usingisHighSurrogate()
andisLowSurragate()
depending on the context.Thanks for any and all help and feedback.