diff: stricter uncommon lcs, match up leading/trailing ranges instead #4010

yuja · 2024-07-02T15:06:44Z

The last patch is copied from #763 with a few modifications.

Checklist

If applicable:

I have updated CHANGELOG.md
I have updated the documentation (README.md, docs/, demos/)
I have updated the config schema (cli/src/config-schema.json)
I have added tests to cover my changes

jonathantanmy · 2024-07-02T21:29:52Z

I haven't looked in-depth into #763 to know if this is a good change, but in any case, if we want this change, we should add a test case that tests the case where a diff result contains both the results of the leading/trailing code and the LCS code. As it is, all tests pass with this diff:

diff --git a/lib/src/diff.rs b/lib/src/diff.rs
index 57fb9cc2..af19cdb1 100644
--- a/lib/src/diff.rs
+++ b/lib/src/diff.rs
@@ -177,7 +177,8 @@ pub(crate) fn unchanged_ranges(
     if common_leading_len > 0 {
         let (left_leading_ranges, left_ranges) = left_ranges.split_at(common_leading_len);
         let (right_leading_ranges, right_ranges) = right_ranges.split_at(common_leading_len);
-        let mut result = unchanged_ranges_lcs(left, right, left_ranges, right_ranges);
+        // let mut result = unchanged_ranges_lcs(left, right, left_ranges, right_ranges);
+        let mut result = vec![];
         result.splice(
             0..0,
             iter::zip(
@@ -197,7 +198,8 @@ pub(crate) fn unchanged_ranges(
             left_ranges.split_at(left_ranges.len() - common_trailing_len);
         let (right_ranges, right_trailing_ranges) =
             right_ranges.split_at(right_ranges.len() - common_trailing_len);
-        let mut result = unchanged_ranges_lcs(left, right, left_ranges, right_ranges);
+        // let mut result = unchanged_ranges_lcs(left, right, left_ranges, right_ranges);
+        let mut result = vec![];
         result.extend(iter::zip(
             left_trailing_ranges.iter().cloned(),
             right_trailing_ranges.iter().cloned(),

The non-test code looks reasonable. Thanks for separating out the logical steps into their own commits.

yuja · 2024-07-03T04:44:01Z

add a test case that tests the case where a diff result contains both the results of the leading/trailing code and the LCS code. As it is, all tests pass with this diff:

Good point. Actually there was a bug that leading/trailing matches weren't trimmed from both ends, and that was the reason we had recursion there, I think. I've added a patch that removes recursion.

The iterator might look a bit involved, but it clarifies that we never combine words from different buckets.

This is adapted from Breezy/Python patiencediff. AFAICT, Git implementation is slightly different (and maybe more efficient?), but it's not super easy to integrate with our diff logic. I'm not sure which one is better overall, but I think the result is good so long as "uncommon LCS" matching is attempted first. https://github.com/breezy-team/patiencediff/blob/a9a3e4edc34c5820e9703727ece596c7759666da/patiencediff/_patiencediff_py.py#L108 This patch prevents some weird test changes that would otherwise be introduced by the next patch.

@chooglen

Patience diff starts by lining up unique elements (e.g. lines) to find matching segments of the inputs. After that, it refines the non-matching segments by repeating the process. Histogram expands on that by not just considering unique elements but by continuing with elements of count 2, then 3, etc. Before this commit, when diffing "a b a b b" against "a b a b a b", we would match the two "a"s in the first input against the first two "a"s in the second input. After this patch, we ignore the "a"s because their counts differ, so we try to align the "b"s instead. I have had this commit lying around since I wrote the histogram diff implementation in 1e657c5. I vaguely remember thinking that the way I had implemented it (without this commit) was a bit weird, but I wasn't sure if this commit would be an improvement or not. The bug report from @chooglen today of a case where we behave differently from Git is enough to make me think that we make this change after all. jj-vcs#761

…ursion I don't think there's a possibility that uncommon_shared_words can become non-empty by trimming the same amount of lines from both sides. Well, there's an edge case regarding max_occurrences, but that shouldn't matter in practice.

jonathantanmy · 2024-07-08T21:07:40Z

OK, removing the recursion makes sense and addresses my concern above.

Let me summarize my current thoughts for other reviewers:

This PR makes LCS more strict by requiring that if we use uncommon words, they must appear the same number of times in the LHS and RHS. But now, a diff between "a a a" and "a a" will have no matches, so this PR adds the matching prefix/suffix rule to compensate.
If we believe that we need the same-number-of-times rule, then this PR is good.
I'm not sure whether we really need the same-number-of-times rule.

So I would say that this PR is good from a code standpoint, but I'm not sure whether it's good from a feature standpoint. Maybe someone from #763 could review this PR.

yuja · 2024-07-09T03:21:21Z

This PR makes LCS more strict by requiring that if we use uncommon words, they must appear the same number of times in the LHS and RHS. But now, a diff between "a a a" and "a a" will have no matches, so this PR adds the matching prefix/suffix rule to compensate.

If we believe that we need the same-number-of-times rule, then this PR is good.

I'm not sure whether we really need the same-number-of-times rule.

Thanks for summarizing it.

To add some context, I think the current implementation is a bit weird in that we choose the first min(left, right) occurrences. There's no clue that the first N matching pairs should line up, not the last N. Practically, this change tends to disable pairing unbalanced whitespace characters in color-words diffs. The resulting diff looks more readable to me.

martinvonz

I think the result is good so long as "uncommon LCS" matching is attempted first.

What happens if we don't?

yuja · 2024-07-09T09:45:01Z

I think the result is good so long as "uncommon LCS" matching is attempted first.

What happens if we don't?

could produce inferior diffs. If I understand it correctly, Git roughly does LCS -> widen -> LCS ..., Bzr does LCS -> LCS -> ... -> widen.

instead of

martinvonz

Approved partly based on Jonathan's review. Thanks!

yuja force-pushed the push-xynmywnslpsp branch from cf28fd9 to ebed4f2 Compare July 2, 2024 15:28

yuja force-pushed the push-xynmywnslpsp branch from ebed4f2 to 421173e Compare July 3, 2024 04:38

yuja force-pushed the push-xynmywnslpsp branch 4 times, most recently from 71cd5e3 to a4968e1 Compare July 5, 2024 11:21

yuja and others added 4 commits July 7, 2024 22:33

diff: remove unneeded mutation of histogram

58d93ca

The iterator might look a bit involved, but it clarifies that we never combine words from different buckets.

yuja force-pushed the push-xynmywnslpsp branch from a4968e1 to be6810d Compare July 7, 2024 13:36

martinvonz reviewed Jul 9, 2024

View reviewed changes

martinvonz approved these changes Jul 9, 2024

View reviewed changes

yuja merged commit dce3ec7 into jj-vcs:main Jul 9, 2024
17 checks passed

yuja deleted the push-xynmywnslpsp branch July 9, 2024 11:35

yuja mentioned this pull request Jul 9, 2024

diff: consider uncommon words to match only if they have the same count #763

Closed

2 tasks

martinvonz mentioned this pull request Jan 9, 2025

jj sometimes generates conflicts where Git (GitHub) doesn't #761

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

diff: stricter uncommon lcs, match up leading/trailing ranges instead #4010

diff: stricter uncommon lcs, match up leading/trailing ranges instead #4010

yuja commented Jul 2, 2024

jonathantanmy commented Jul 2, 2024

yuja commented Jul 3, 2024

jonathantanmy commented Jul 8, 2024

yuja commented Jul 9, 2024

martinvonz left a comment

yuja commented Jul 9, 2024

martinvonz left a comment

diff: stricter uncommon lcs, match up leading/trailing ranges instead #4010

diff: stricter uncommon lcs, match up leading/trailing ranges instead #4010

Conversation

yuja commented Jul 2, 2024

Checklist

jonathantanmy commented Jul 2, 2024

yuja commented Jul 3, 2024

jonathantanmy commented Jul 8, 2024

yuja commented Jul 9, 2024

martinvonz left a comment

Choose a reason for hiding this comment

yuja commented Jul 9, 2024

martinvonz left a comment

Choose a reason for hiding this comment