diff: consider uncommon words to match only if they have the same count #763

martinvonz · 2022-11-18T06:01:10Z

Patience diff starts by lining up unique elements (e.g. lines) to find matching segments of the inputs. After that, it refines the non-matching segments by repeating the process. Histogram expands on that by not just considering unique elements but by continuing with elements of count 2, then 3, etc.

Before this commit, when diffing "a b a b b" against "a b a b a b", we would match the two "a"s in the first input against the first two "a"s in the second input. After this patch, we ignore the "a"s because their counts differ, so we try to align the "b"s instead.
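To make the before/after concrete, here is a minimal sketch of the word-selection step on the example above. It is not the code in lib/src/diff.rs: the names `histogram` and `uncommon_shared_words` and the `require_same_count` flag are made up for illustration, and the inputs are assumed to be already split into words.

```rust
use std::collections::{BTreeMap, HashMap};

/// Counts how often each word occurs in `words`.
fn histogram<'a>(words: &[&'a str]) -> HashMap<&'a str, usize> {
    let mut counts = HashMap::new();
    for &word in words {
        *counts.entry(word).or_insert(0) += 1;
    }
    counts
}

/// Returns the shared words at the lowest left-side count that has any candidates.
/// With `require_same_count`, a word only qualifies if it occurs equally often
/// on both sides (the behavior this PR proposes); otherwise it merely has to
/// appear on the right side at all (the behavior before this PR).
fn uncommon_shared_words<'a>(
    left: &[&'a str],
    right: &[&'a str],
    require_same_count: bool,
) -> Vec<&'a str> {
    let left_counts = histogram(left);
    let right_counts = histogram(right);

    // Group the left words by count; a BTreeMap iterates counts in ascending order.
    let mut by_count: BTreeMap<usize, Vec<&'a str>> = BTreeMap::new();
    for (&word, &count) in &left_counts {
        by_count.entry(count).or_default().push(word);
    }

    for (count, words) in by_count {
        let mut candidates: Vec<&'a str> = words
            .into_iter()
            .filter(|word| match right_counts.get(word) {
                Some(&right_count) => !require_same_count || right_count == count,
                None => false,
            })
            .collect();
        if !candidates.is_empty() {
            // Sort only to make the result deterministic.
            candidates.sort_unstable();
            return candidates;
        }
    }
    vec![]
}
```

With "a b a b b" and "a b a b a b" split on whitespace, calling this with `require_same_count = false` returns `["a"]` (count 2 on the left, present on the right), while `require_same_count = true` skips "a" (2 vs. 3) and returns `["b"]` (3 on both sides).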

I have had this commit lying around since I wrote the histogram diff implementation 18 months ago (in 1e657c5). I vaguely remember thinking that the way I had implemented it (without this commit) was a bit weird, but I wasn't sure if this commit would be an improvement or not. The bug report from @chooglen today of a case where we behave differently from Git is enough to make me think that we should make this change after all.

Many unit tests of the diff algorithm are affected, mostly because we no longer match the leading space in " a " against the space in " b" and similar.

martinvonz · 2022-11-18T06:07:46Z

Oh no, this doesn't actually fix the bug. I'll have to dig more.

martinvonz · 2022-11-18T06:29:04Z

> Oh no, this doesn't actually fix the bug. I'll have to dig more.

I think the remaining problem is that we don't shrink conflict regions by removing matching regions at the beginning and end of a non-matching region. So when diffing "a b c" and "a B c", both jj and git will find that "a" and "c" match. However, jj will stop there, while git will shrink the conflicting region (" b "/" B "). So I think this PR is still a step in the right direction, but I'll try to add this conflict-region-shrinking thing in a separate commit before this actually fixes the bug (hopefully).
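A rough sketch of what such a shrinking step could look like, assuming the non-matching region has already been split into tokens (words and whitespace as separate tokens, as in the " b "/" B " example). The function names are hypothetical and this is not taken from jj or Git:

```rust
/// Returns how many leading and trailing tokens `left` and `right` have in common.
/// The suffix is measured only on what remains after the prefix, so the two
/// never overlap.
fn common_affix_lens<T: PartialEq>(left: &[T], right: &[T]) -> (usize, usize) {
    let prefix = left.iter().zip(right).take_while(|(l, r)| l == r).count();
    let suffix = left[prefix..]
        .iter()
        .rev()
        .zip(right[prefix..].iter().rev())
        .take_while(|(l, r)| l == r)
        .count();
    (prefix, suffix)
}

/// Trims the common prefix and suffix, leaving only the tokens that differ.
fn shrink_region<'a, T: PartialEq>(left: &'a [T], right: &'a [T]) -> (&'a [T], &'a [T]) {
    let (prefix, suffix) = common_affix_lens(left, right);
    (
        &left[prefix..left.len() - suffix],
        &right[prefix..right.len() - suffix],
    )
}
```

For the tokens `[" ", "b", " "]` and `[" ", "B", " "]`, the common prefix and suffix are one token each, so the region shrinks to `["b"]` and `["B"]`.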

martinvonz · 2023-07-24T16:15:51Z

I ran into this bug when I added a similar-looking test case before an existing test case. Even though this doesn't fix #761, it's still an improvement that I think we should get in.

yuja

I don't have expertise on diff algorithms, but this appears to generate more verbose hunks.

For example jj show --git 006c764694a2 contains

@@ -202,9 +204,35 @@
         }
         assert!(self.adds.is_empty());
         result
-    }
-}
-
+    }
+}
+
+impl<T: ContentHash> ContentHash for Conflict<T> {
...

maybe because the occurrences of empty lines differ?

martinvonz · 2023-07-25T04:19:19Z

> I don't have expertise on diff algorithms, but this appears to generate more verbose hunks.
>
> For example jj show --git 006c764694a2 contains
>
> @@ -202,9 +204,35 @@
>          }
>          assert!(self.adds.is_empty());
>          result
> -    }
> -}
> -
> +    }
> +}
> +
> +impl<T: ContentHash> ContentHash for Conflict<T> {
> ...
>
> maybe because the occurrences of empty lines differ?

Yes, I think it's the empty lines and the lines with just a } (and no space) in this instance. That's quite annoying, so I guess I'll have to not merge this until I've fixed that too. Thanks for letting me know.

yuja · 2024-07-01T14:26:00Z

Appears that our diff logic lacks the following steps:

> In Bram Cohen’s blog post about this algorithm, he mentions two steps we’ve not covered here:
>
> Match the first lines of both if they’re identical, then match the second, third, etc. until a pair doesn’t match.
>
> Match the last lines of both if they’re identical, then match the next to last, second to last, etc. until a pair doesn’t match.
>
> [...] That’s because Git doesn’t perform these steps first, it performs them after calculating the matching lines but before recursing into each slice.

https://blog.jcoglan.com/2017/09/28/implementing-patience-diff/

With some form of leading/trailing matches handling, the diff output looks pretty good. I'll send a patch later (maybe after the release.)

martinvonz · 2024-07-01T14:31:11Z

> Appears that our diff logic lacks the following steps:

Yes, I think that's what I meant by #763 (comment)

> With some form of leading/trailing matches handling, the diff output looks pretty good. I'll send a patch later (maybe after the release.)

When I looked into it (around the time of this PR), I thought it seemed like adding that step would involve a somewhat large refactoring and I gave up. Maybe you figured out a better way, or maybe you just bit the bullet. Thanks either way :)

yuja · 2024-07-01T15:20:17Z

> it seemed like adding that step would involve a somewhat large refactoring and I gave up. Maybe you figured out a better way, or maybe you just bit the bullet.

I just tried inserting something like this in between recursion steps, but I might be misunderstanding the logic entirely.

let common = zip(left_ranges, right_ranges).take_while(..).count();
let (leading_left, rem..) = left_ranges.split(common);
let (leading_right, rem..) = right_ranges.split(common);
result = zip(leading_left, leading_right) + recurse(rem..)
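Spelled out as compilable Rust, the idea in the pseudocode above might look roughly like this. It is only a sketch under assumptions: tokens are compared directly rather than as ranges into the source text, the function name `match_leading_then` is invented, the recursive step is passed in as a closure instead of calling the real diff, and only the leading side is handled.

```rust
/// Pairs up identical leading tokens, then hands the remainders to `recurse`.
/// Returns (left index, right index) pairs of matched tokens.
fn match_leading_then<T, F>(left: &[T], right: &[T], recurse: F) -> Vec<(usize, usize)>
where
    T: PartialEq,
    F: FnOnce(&[T], &[T]) -> Vec<(usize, usize)>,
{
    // Number of leading tokens that are identical on both sides.
    let common = left.iter().zip(right).take_while(|(l, r)| l == r).count();
    let (_, rem_left) = left.split_at(common);
    let (_, rem_right) = right.split_at(common);

    // Leading tokens match one-to-one at the same index on both sides.
    let mut result: Vec<(usize, usize)> = (0..common).map(|i| (i, i)).collect();

    // Matches found in the remainders are relative to the remainders,
    // so shift them by `common` to get indices into the full inputs.
    result.extend(
        recurse(rem_left, rem_right)
            .into_iter()
            .map(|(l, r)| (l + common, r + common)),
    );
    result
}
```

For example, with `["a", " ", "b"]` and `["a", " ", "B"]` and a `recurse` closure that finds nothing, the result is `[(0, 0), (1, 1)]`. Trailing matches could be handled the same way by iterating from the end.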

yuja · 2024-07-09T11:36:20Z

Included in #4010.

martinvonz enabled auto-merge (rebase) November 18, 2022 06:01

martinvonz disabled auto-merge November 18, 2022 06:07

martinvonz force-pushed the push-5640745a446c47efb1bd819dd4c95a66 branch from 6cca609 to 90bcdbc Compare November 18, 2022 06:39

martinvonz marked this pull request as draft May 16, 2023 17:36

martinvonz force-pushed the push-5640745a446c47efb1bd819dd4c95a66 branch from 90bcdbc to 525a0e0 Compare July 24, 2023 15:38

martinvonz marked this pull request as ready for review July 24, 2023 15:39

martinvonz force-pushed the push-5640745a446c47efb1bd819dd4c95a66 branch 2 times, most recently from 683f950 to 72e1399 Compare July 24, 2023 16:15

yuja reviewed Jul 25, 2023

martinvonz force-pushed the push-5640745a446c47efb1bd819dd4c95a66 branch from 72e1399 to 0a678f4 Compare July 25, 2023 04:15

yuja mentioned this pull request Jul 2, 2024

diff: stricter uncommon lcs, match up leading/trailing ranges instead #4010

Merged

4 tasks

yuja closed this Jul 9, 2024

martinvonz deleted the push-5640745a446c47efb1bd819dd4c95a66 branch August 3, 2024 13:35
