-
Notifications
You must be signed in to change notification settings - Fork 343
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
diff: stricter uncommon lcs, match up leading/trailing ranges instead #4010
Conversation
I haven't looked in-depth into #763 to know if this is a good change, but in any case, if we want this change, we should add a test case that tests the case where a diff result contains both the results of the leading/trailing code and the LCS code. As it is, all tests pass with this diff:
The non-test code looks reasonable. Thanks for separating out the logical steps into their own commits. |
Good point. Actually there was a bug that leading/trailing matches weren't trimmed from both ends, and that was the reason we had recursion there, I think. I've added a patch that removes recursion. |
71cd5e3
to
a4968e1
Compare
The iterator might look a bit involved, but it clarifies that we never combine words from different buckets.
This is adapted from Breezy/Python patiencediff. AFAICT, Git implementation is slightly different (and maybe more efficient?), but it's not super easy to integrate with our diff logic. I'm not sure which one is better overall, but I think the result is good so long as "uncommon LCS" matching is attempted first. https://github.com/breezy-team/patiencediff/blob/a9a3e4edc34c5820e9703727ece596c7759666da/patiencediff/_patiencediff_py.py#L108 This patch prevents some weird test changes that would otherwise be introduced by the next patch.
Patience diff starts by lining up unique elements (e.g. lines) to find matching segments of the inputs. After that, it refines the non-matching segments by repeating the process. Histogram expands on that by not just considering unique elements but by continuing with elements of count 2, then 3, etc. Before this commit, when diffing "a b a b b" against "a b a b a b", we would match the two "a"s in the first input against the first two "a"s in the second input. After this patch, we ignore the "a"s because their counts differ, so we try to align the "b"s instead. I have had this commit lying around since I wrote the histogram diff implementation in 1e657c5. I vaguely remember thinking that the way I had implemented it (without this commit) was a bit weird, but I wasn't sure if this commit would be an improvement or not. The bug report from @chooglen today of a case where we behave differently from Git is enough to make me think that we make this change after all. jj-vcs#761
…ursion I don't think there's a possibility that uncommon_shared_words can become non-empty by trimming the same amount of lines from both sides. Well, there's an edge case regarding max_occurrences, but that shouldn't matter in practice.
OK, removing the recursion makes sense and addresses my concern above. Let me summarize my current thoughts for other reviewers:
So I would say that this PR is good from a code standpoint, but I'm not sure whether it's good from a feature standpoint. Maybe someone from #763 could review this PR. |
Thanks for summarizing it. To add some context, I think the current implementation is a bit weird in that we choose the first |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the result is good so long as "uncommon LCS" matching is attempted first.
What happens if we don't?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved partly based on Jonathan's review. Thanks!
The last patch is copied from #763 with a few modifications.
Checklist
If applicable:
CHANGELOG.md