Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

diff: stricter uncommon lcs, match up leading/trailing ranges instead #4010

Merged
merged 4 commits into from
Jul 9, 2024

Conversation

yuja
Copy link
Collaborator

@yuja yuja commented Jul 2, 2024

The last patch is copied from #763 with a few modifications.

Checklist

If applicable:

  • I have updated CHANGELOG.md
  • I have updated the documentation (README.md, docs/, demos/)
  • I have updated the config schema (cli/src/config-schema.json)
  • I have added tests to cover my changes

@yuja yuja force-pushed the push-xynmywnslpsp branch from cf28fd9 to ebed4f2 Compare July 2, 2024 15:28
@jonathantanmy
Copy link
Collaborator

I haven't looked in-depth into #763 to know if this is a good change, but in any case, if we want this change, we should add a test case that tests the case where a diff result contains both the results of the leading/trailing code and the LCS code. As it is, all tests pass with this diff:

diff --git a/lib/src/diff.rs b/lib/src/diff.rs
index 57fb9cc2..af19cdb1 100644
--- a/lib/src/diff.rs
+++ b/lib/src/diff.rs
@@ -177,7 +177,8 @@ pub(crate) fn unchanged_ranges(
     if common_leading_len > 0 {
         let (left_leading_ranges, left_ranges) = left_ranges.split_at(common_leading_len);
         let (right_leading_ranges, right_ranges) = right_ranges.split_at(common_leading_len);
-        let mut result = unchanged_ranges_lcs(left, right, left_ranges, right_ranges);
+        // let mut result = unchanged_ranges_lcs(left, right, left_ranges, right_ranges);
+        let mut result = vec![];
         result.splice(
             0..0,
             iter::zip(
@@ -197,7 +198,8 @@ pub(crate) fn unchanged_ranges(
             left_ranges.split_at(left_ranges.len() - common_trailing_len);
         let (right_ranges, right_trailing_ranges) =
             right_ranges.split_at(right_ranges.len() - common_trailing_len);
-        let mut result = unchanged_ranges_lcs(left, right, left_ranges, right_ranges);
+        // let mut result = unchanged_ranges_lcs(left, right, left_ranges, right_ranges);
+        let mut result = vec![];
         result.extend(iter::zip(
             left_trailing_ranges.iter().cloned(),
             right_trailing_ranges.iter().cloned(),

The non-test code looks reasonable. Thanks for separating out the logical steps into their own commits.

@yuja yuja force-pushed the push-xynmywnslpsp branch from ebed4f2 to 421173e Compare July 3, 2024 04:38
@yuja
Copy link
Collaborator Author

yuja commented Jul 3, 2024

add a test case that tests the case where a diff result contains both the results of the leading/trailing code and the LCS code. As it is, all tests pass with this diff:

Good point. Actually there was a bug that leading/trailing matches weren't trimmed from both ends, and that was the reason we had recursion there, I think. I've added a patch that removes recursion.

@yuja yuja force-pushed the push-xynmywnslpsp branch 4 times, most recently from 71cd5e3 to a4968e1 Compare July 5, 2024 11:21
yuja and others added 4 commits July 7, 2024 22:33
The iterator might look a bit involved, but it clarifies that we never combine
words from different buckets.
This is adapted from Breezy/Python patiencediff. AFAICT, Git implementation is
slightly different (and maybe more efficient?), but it's not super easy to
integrate with our diff logic. I'm not sure which one is better overall, but I
think the result is good so long as "uncommon LCS" matching is attempted first.

https://github.com/breezy-team/patiencediff/blob/a9a3e4edc34c5820e9703727ece596c7759666da/patiencediff/_patiencediff_py.py#L108

This patch prevents some weird test changes that would otherwise be introduced
by the next patch.
Patience diff starts by lining up unique elements (e.g. lines) to find
matching segments of the inputs. After that, it refines the
non-matching segments by repeating the process. Histogram expands on
that by not just considering unique elements but by continuing with
elements of count 2, then 3, etc.

Before this commit, when diffing "a b a b b" against "a b a b a b", we
would match the two "a"s in the first input against the first two "a"s
in the second input. After this patch, we ignore the "a"s because
their counts differ, so we try to align the "b"s instead.

I have had this commit lying around since I wrote the histogram diff
implementation in 1e657c5. I vaguely remember thinking that the
way I had implemented it (without this commit) was a bit weird, but I
wasn't sure if this commit would be an improvement or not. The bug
report from @chooglen today of a case where we behave differently from
Git is enough to make me think that we make this change after all.

jj-vcs#761
…ursion

I don't think there's a possibility that uncommon_shared_words can become
non-empty by trimming the same amount of lines from both sides. Well, there's
an edge case regarding max_occurrences, but that shouldn't matter in practice.
@yuja yuja force-pushed the push-xynmywnslpsp branch from a4968e1 to be6810d Compare July 7, 2024 13:36
@jonathantanmy
Copy link
Collaborator

OK, removing the recursion makes sense and addresses my concern above.

Let me summarize my current thoughts for other reviewers:

  • This PR makes LCS more strict by requiring that if we use uncommon words, they must appear the same number of times in the LHS and RHS. But now, a diff between "a a a" and "a a" will have no matches, so this PR adds the matching prefix/suffix rule to compensate.
  • If we believe that we need the same-number-of-times rule, then this PR is good.
  • I'm not sure whether we really need the same-number-of-times rule.

So I would say that this PR is good from a code standpoint, but I'm not sure whether it's good from a feature standpoint. Maybe someone from #763 could review this PR.

@yuja
Copy link
Collaborator Author

yuja commented Jul 9, 2024

  • This PR makes LCS more strict by requiring that if we use uncommon words, they must appear the same number of times in the LHS and RHS. But now, a diff between "a a a" and "a a" will have no matches, so this PR adds the matching prefix/suffix rule to compensate.

  • If we believe that we need the same-number-of-times rule, then this PR is good.

  • I'm not sure whether we really need the same-number-of-times rule.

Thanks for summarizing it.

To add some context, I think the current implementation is a bit weird in that we choose the first min(left, right) occurrences. There's no clue that the first N matching pairs should line up, not the last N. Practically, this change tends to disable pairing unbalanced whitespace characters in color-words diffs. The resulting diff looks more readable to me.

Copy link
Member

@martinvonz martinvonz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the result is good so long as "uncommon LCS" matching is attempted first.

What happens if we don't?

@yuja
Copy link
Collaborator Author

yuja commented Jul 9, 2024

I think the result is good so long as "uncommon LCS" matching is attempted first.

What happens if we don't?

could produce inferior diffs. If I understand it correctly, Git roughly does LCS -> widen -> LCS ..., Bzr does LCS -> LCS -> ... -> widen.

image
instead of
image

Copy link
Member

@martinvonz martinvonz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved partly based on Jonathan's review. Thanks!

@yuja yuja merged commit dce3ec7 into jj-vcs:main Jul 9, 2024
17 checks passed
@yuja yuja deleted the push-xynmywnslpsp branch July 9, 2024 11:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants