Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

diff: consider uncommon words to match only if they have the same count #763

Closed
wants to merge 1 commit into from

Conversation

martinvonz
Copy link
Member

@martinvonz martinvonz commented Nov 18, 2022

Patience diff starts by lining up unique elements (e.g. lines) to find matching segments of the inputs. After that, it refines the non-matching segments by repeating the process. Histogram expands on that by not just considering unique elements but by continuing with elements of count 2, then 3, etc.

Before this commit, when diffing "a b a b b" against "a b a b a b", we would match the two "a"s in the first input against the first two "a"s in the second input. After this patch, we ignore the "a"s because their counts differ, so we try to align the "b"s instead.

I have had this commit lying around since I wrote the histogram diff implementation 18 months ago. I vaguely remember thinking that the way I had implemented it (without this commit) was a bit weird, but I wasn't sure if this commit would be an improvement or not. The bug report from @chooglen today of a case where we behave differently from Git is enough to make me think that we make this change after all.

Many unit tests of the diff algorithm are affected, mostly because we no longer match the leading space in " a " against the space in " b" and similar.

Checklist

If applicable:

  • I have updated CHANGELOG.md
  • I have updated the documentation (README.md, docs/, demos/)

@martinvonz martinvonz enabled auto-merge (rebase) November 18, 2022 06:01
@martinvonz
Copy link
Member Author

Oh no, this doesn't actually fix the bug. I'll have to dig more.

@martinvonz martinvonz disabled auto-merge November 18, 2022 06:07
@martinvonz
Copy link
Member Author

Oh no, this doesn't actually fix the bug. I'll have to dig more.

I think the remaining problem is that we don't shrink conflict regions by removing matching regions at the beginning and end of a non-matching region. So when diffing "a b c" and "a B c", both jj and git will find that "a" and "c" match. However, jj will stop there, while git will shrink the conflicting region (" b "/" B "). So I think this PR is still a step in the right direction, but I'll try to add this conflict-region-shrinking thing in a separate commit before this actually fixes the bug (hopefully).

@martinvonz martinvonz force-pushed the push-5640745a446c47efb1bd819dd4c95a66 branch from 6cca609 to 90bcdbc Compare November 18, 2022 06:39
@martinvonz martinvonz marked this pull request as draft May 16, 2023 17:36
@martinvonz martinvonz force-pushed the push-5640745a446c47efb1bd819dd4c95a66 branch from 90bcdbc to 525a0e0 Compare July 24, 2023 15:38
@martinvonz martinvonz marked this pull request as ready for review July 24, 2023 15:39
@martinvonz martinvonz force-pushed the push-5640745a446c47efb1bd819dd4c95a66 branch 2 times, most recently from 683f950 to 72e1399 Compare July 24, 2023 16:15
@martinvonz
Copy link
Member Author

I ran into this bug when I added a similar-looking test case before an existing test case. Even though though doesn't fix #761, it's still an improvement that I think we should get in.

Copy link
Collaborator

@yuja yuja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have expertise on diff algorithms, but this appears to generate more verbose hunks.

For example jj show --git 006c764694a2 contains

@@ -202,9 +204,35 @@
         }
         assert!(self.adds.is_empty());
         result
-    }
-}
-
+    }
+}
+
+impl<T: ContentHash> ContentHash for Conflict<T> {
...

maybe because the occurrences of empty lines differ?

lib/src/diff.rs Outdated Show resolved Hide resolved
Patience diff starts by lining up unique elements (e.g. lines) to find
matching segments of the inputs. After that, it refines the
non-matching segments by repeating the process. Histogram expands on
that by not just considering unique elements but by continuing with
elements of count 2, then 3, etc.

Before this commit, when diffing "a b a b b" against "a b a b a b", we
would match the two "a"s in the first input against the first two "a"s
in the second input. After this patch, we ignore the "a"s because
their counts differ, so we try to align the "b"s instead.

I have had this commit lying around since I wrote the histogram diff
implementation in 1e657c5. I vaguely remember thinking that the
way I had implemented it (without this commit) was a bit weird, but I
wasn't sure if this commit would be an improvement or not. The bug
report from @chooglen today of a case where we behave differently from
Git is enough to make me think that we make this change after all.

Many unit tests of the diff algorithm are affected, mostly because we
no longer match the leading space in " a " against the space in " b"
and similar.
@martinvonz martinvonz force-pushed the push-5640745a446c47efb1bd819dd4c95a66 branch from 72e1399 to 0a678f4 Compare July 25, 2023 04:15
@martinvonz
Copy link
Member Author

I don't have expertise on diff algorithms, but this appears to generate more verbose hunks.

For example jj show --git 006c764694a2 contains

@@ -202,9 +204,35 @@
         }
         assert!(self.adds.is_empty());
         result
-    }
-}
-
+    }
+}
+
+impl<T: ContentHash> ContentHash for Conflict<T> {
...

maybe because the occurrences of empty lines differ?

Yes, I think it's the empty lines and lines the lines with just a } (and no space) in this instance. That's quite annoying, so I guess I'll have to not merge this until I've fixed that too. Thanks for letting me know.

@yuja
Copy link
Collaborator

yuja commented Jul 1, 2024

Appears that our diff logic lacks the following steps:

In Bram Cohen’s blog post about this algorithm, he mentions two steps we’ve not covered here:

  1. Match the first lines of both if they’re identical, then match the second, third, etc. until a pair doesn’t match.
  2. Match the last lines of both if they’re identical, then match the next to last, second to last, etc. until a pair doesn’t match.

[...] That’s because Git doesn’t perform these steps first, it performs them after calculating the matching lines but before recursing into each slice.

https://blog.jcoglan.com/2017/09/28/implementing-patience-diff/

With some form of leading/trailing matches handling, the diff output looks pretty good. I'll send a patch later (maybe after the release.)

@martinvonz
Copy link
Member Author

Appears that our diff logic lacks the following steps:

Yes, I think that's what I meant by #763 (comment)

With some form of leading/trailing matches handling, the diff output looks pretty good. I'll send a patch later (maybe after the release.)

When I looked into it (around the time of this PR), I thought it seemed like adding that step would involve a somewhat large refactoring and I gave up. Maybe you figured out a better way, or maybe you just bit the bullet. Thanks either way :)

@yuja
Copy link
Collaborator

yuja commented Jul 1, 2024

it seemed like adding that step would involve a somewhat large refactoring and I gave up. Maybe you figured out a better way, or maybe you just bit the bullet.

I just tried inserting something like this in between recursion, but I might misunderstand the logic at all.

let common = zip(left_ranges, right_ranges).take_while(..).count();
let (leading_left, rem..) = left_ranges.split(common);
let (leading_right, rem..) = right_ranges.split(common);
result = zip(leading_left, leading_right) + recurse(rem..)

@yuja
Copy link
Collaborator

yuja commented Jul 9, 2024

Included in #4010.

@yuja yuja closed this Jul 9, 2024
@martinvonz martinvonz deleted the push-5640745a446c47efb1bd819dd4c95a66 branch August 3, 2024 13:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants