-
Notifications
You must be signed in to change notification settings - Fork 343
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
diff: consider uncommon words to match only if they have the same count #763
Conversation
Oh no, this doesn't actually fix the bug. I'll have to dig more. |
I think the remaining problem is that we don't shrink conflict regions by removing matching regions at the beginning and end of a non-matching region. So when diffing "a b c" and "a B c", both jj and git will find that "a" and "c" match. However, jj will stop there, while git will shrink the conflicting region (" b "/" B "). So I think this PR is still a step in the right direction, but I'll try to add this conflict-region-shrinking thing in a separate commit before this actually fixes the bug (hopefully). |
6cca609
to
90bcdbc
Compare
90bcdbc
to
525a0e0
Compare
683f950
to
72e1399
Compare
I ran into this bug when I added a similar-looking test case before an existing test case. Even though though doesn't fix #761, it's still an improvement that I think we should get in. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have expertise on diff algorithms, but this appears to generate more verbose hunks.
For example jj show --git 006c764694a2
contains
@@ -202,9 +204,35 @@
}
assert!(self.adds.is_empty());
result
- }
-}
-
+ }
+}
+
+impl<T: ContentHash> ContentHash for Conflict<T> {
...
maybe because the occurrences of empty lines differ?
Patience diff starts by lining up unique elements (e.g. lines) to find matching segments of the inputs. After that, it refines the non-matching segments by repeating the process. Histogram expands on that by not just considering unique elements but by continuing with elements of count 2, then 3, etc. Before this commit, when diffing "a b a b b" against "a b a b a b", we would match the two "a"s in the first input against the first two "a"s in the second input. After this patch, we ignore the "a"s because their counts differ, so we try to align the "b"s instead. I have had this commit lying around since I wrote the histogram diff implementation in 1e657c5. I vaguely remember thinking that the way I had implemented it (without this commit) was a bit weird, but I wasn't sure if this commit would be an improvement or not. The bug report from @chooglen today of a case where we behave differently from Git is enough to make me think that we make this change after all. Many unit tests of the diff algorithm are affected, mostly because we no longer match the leading space in " a " against the space in " b" and similar.
72e1399
to
0a678f4
Compare
Yes, I think it's the empty lines and lines the lines with just a |
Appears that our diff logic lacks the following steps:
https://blog.jcoglan.com/2017/09/28/implementing-patience-diff/ With some form of leading/trailing matches handling, the diff output looks pretty good. I'll send a patch later (maybe after the release.) |
Yes, I think that's what I meant by #763 (comment)
When I looked into it (around the time of this PR), I thought it seemed like adding that step would involve a somewhat large refactoring and I gave up. Maybe you figured out a better way, or maybe you just bit the bullet. Thanks either way :) |
I just tried inserting something like this in between recursion, but I might misunderstand the logic at all. let common = zip(left_ranges, right_ranges).take_while(..).count();
let (leading_left, rem..) = left_ranges.split(common);
let (leading_right, rem..) = right_ranges.split(common);
result = zip(leading_left, leading_right) + recurse(rem..) |
Included in #4010. |
Patience diff starts by lining up unique elements (e.g. lines) to find matching segments of the inputs. After that, it refines the non-matching segments by repeating the process. Histogram expands on that by not just considering unique elements but by continuing with elements of count 2, then 3, etc.
Before this commit, when diffing "a b a b b" against "a b a b a b", we would match the two "a"s in the first input against the first two "a"s in the second input. After this patch, we ignore the "a"s because their counts differ, so we try to align the "b"s instead.
I have had this commit lying around since I wrote the histogram diff implementation 18 months ago. I vaguely remember thinking that the way I had implemented it (without this commit) was a bit weird, but I wasn't sure if this commit would be an improvement or not. The bug report from @chooglen today of a case where we behave differently from Git is enough to make me think that we make this change after all.
Many unit tests of the diff algorithm are affected, mostly because we no longer match the leading space in " a " against the space in " b" and similar.
Checklist
If applicable:
CHANGELOG.md