-
Notifications
You must be signed in to change notification settings - Fork 343
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mailmap: add support for .mailmap
files
#3951
base: main
Are you sure you want to change the base?
Conversation
This comment was marked as resolved.
This comment was marked as resolved.
d61a6ef
to
3065d5d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First round, I'm not totally onboard for the *_raw()
templates but let's wait for other peoples opinion. And to fix the build failure, wrap the necessary stuff in #[cfg(git)]
and make the dependency rely on it.
3065d5d
to
e896de2
Compare
This comment was marked as outdated.
This comment was marked as outdated.
For what it’s worth, the best alternative DSL API I came up with was |
e896de2
to
6066538
Compare
This comment was marked as outdated.
This comment was marked as outdated.
f21c84b
to
0044daf
Compare
0044daf
to
04df057
Compare
Okay, I’ve dropped the canonicalizing functions from I also separated out the signature conversion, and renamed I’ve currently left the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for leaving the mechanical bits in a standalone commit.
GitHub won't let me comment on the commit message :) but in my opinion, you could say a little more - that the mailmap will introduce a concept of pre- and post-mapped identities. Right now, it's implied here.
Are committer and author the only types of identities that are here? What happens to, say, commit message footers like Signed-off-by: (which are mailmapped in Git, IIRC)? Is that handled differently?
04df057
to
90e7a18
Compare
Thank you for taking a look!
That’s not something I hear often :) I think this is a good idea and I’ve added another few paragraphs (after my editor ate the first draft…); I’ve called them raw signatures and canonical signatures, as that’s the best terminology I can think of currently and it matches the code and comments. I’ve also tweaked a few doc comments to allude to these concept more explicitly, although there isn’t an in‐depth exposition in the code currently. Possibly this should be recorded more permanently in developer documentation somewhere but hopefully that can be left to a follow‐up PR – the design docs section, perhaps? Although I wouldn’t exactly say that this is a stellar piece of de novo design to flaunt, just a pragmatic stop‐gap for arguably‐regrettable decisions the majority of DVCSes have made up to this point…
As far as I can tell authors and committers are the only places Jujutsu gets and processes signatures from currently, and there is no special support for trailers. It seems like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great to me. Thanks!
Please give @yuja a chance to do another round of reviews too if he wants to.
let commit = store.get_commit(&entry.commit_id()).unwrap(); | ||
let author = mailmap.author(&commit); | ||
pattern.matches(&author.name) || pattern.matches(&author.email) | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we transform the query by frontend?
author(foo)
-> author(foo)|author(exact:lookup_reverse(foo)[0])|author(exact:lookup_reverse(foo)[1])|..
It's not exactly the same (because the query also matches the original values), but I think it's practically okay. I think it's easier for revset engine backed by database to process query.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I mentioned this in the commit message under “The mapping to canonical signatures is one‐way, and queries only match on the canonical form”. It seems a little tricky to translate back because the functions accept glob patterns; at the very least we would have to traverse the Mailmap
structure and syntheize author_raw(exact:…)
patterns for all possible matches. But I’m not entirely sure that would be correct in all cases, even if we accept the change in semantics. I’d have to think about it.
Rather than changing the semantics/implementation here, I think in a database backend, you could have a caching table like (commit ID or hash or something of .mailmap
, raw signature that appears in commits, canonical signature from corresponding .mailmap
), and an index for lookup? I don’t really know how those backends work since there isn’t a public example of one. Maybe you could even use a materialized view or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know the implementation either, and I guess Google backend can just ignore mailmap thingy at all (because .mailmap shouldn't exist there.)
That said, it'll be nice if mailmap handling is isolated to the frontend layer. Suppose the size of the mailmap is small compared to the commit history, I think it's okay to scan all entries to build a reverse mapping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a think about this. If you wanted to avoid the additional table/view I mentioned, we can still obtain the same database performance benefits without changing semantics (which I’m a little reluctant to do, as it goes against the rubric of when raw signature content should affect UI I put in the commit message). author(a*b*)
would optimize to author(a*b*) & (author_raw(a*b*) | author_raw(exact:…) | author_raw(exact:…) | …)
based on the .mailmap
. The latter can be queried on the database efficiently, and the author(a*b*) &
is a quick post‐processing step on the results that ensures the semantics remain the same.
This wouldn’t be hard to implement, but I’m hesitant to add the complexity off the bat if it wouldn’t currently have any prospective users, especially as it would make performance strictly worse with the current backends. (Although considering the complaints I’ve heard from Googlers in the past about what a pain it is to get their username and name updated across all systems maybe they could do with having some kind of system here…)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think doing preprocessing is strictly worse in performance. If the query doesn't match any mailmap entry (which I think is common), there's no additional cost in later stage. OTOH, without preprocessing, we'll do binary search for each commit entry. I don't think the difference would be significant, though.
FWIW, if a database-backed revset engine transforms mailmap to table/view, they'll want a single mailmap resource than mailmap per expression node.
Anyway, I think this is a fundamental issue of mailmap design, and I just thought it would be nice if we can avoid spreading a mess to the backend layer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And, my point is that it's nice if mailmap handling does not cross the frontend-backend boudary.
I agree with this. At the same time, mailmap is a bit of a hack, so what Emily has currently implemented in this PR seems fine to me.
Another alternative might be to pass the Mailmap
into Index::evaluate_revset()
. What do you think about that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another alternative might be to pass the
Mailmap
intoIndex::evaluate_revset()
.
That seems slightly better.
FWIW, I also consider mailmap as a hack, and that's the reason why I insist that the query result doesn't have to be 100% accurate. It's simpler to approximate mailmap query as union of matched mailmap entries than extending backend API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strawman design:
Mailmap
could become a trait.- The
Backend
trait could gain e.g.fn read_mailmap(&self, as_of_commit: &CommitId) -> impl Mailmap
. A database backend could stub this out if it doesn’t support the feature or ignore the parameter and return an interface to a database table. Handling the queries efficiently would be the responsibility of a backend that wants to support this, and could be handled as per the desugaring I outlined in lib: add support for.mailmap
files #3951 (comment).- The in‐tree backends would share the current implementation.
Not sure if this is better or worse than the status quo, but I thought I should throw it out there.
Yes, that'd be the approach if we pursue the identity work further, but clearly is unneeded complexity for this PR. As what the current .mailmap
approach should be, there I heavily agree with Yuya and Martin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate all the feedback; I’m going to take a day to think the design space over and try and address some of the concerns here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another alternative might be to pass the
Mailmap
intoIndex::evaluate_revset()
. What do you think about that?
I tried this, and it was a mess, sadly; the state needed to be piped through the whole evaluation system and it caused a lot of churn and other problems. It basically resulted in ResolvedExpression
s not being fully resolved and having to patch that up in the evaluation layer, so in terms of these layering concerns it was a lot worse.
I wasn’t sure about how to move forward here, but then I realized that it is possible to implement the front‐end desugaring of the revset queries in a way that always gives correct .mailmap
results and yields simple, efficient RevsetExpression
s that should work great with database indexes. The backend now no longer needs to care about .mailmap
support at all. Hopefully that will make everyone happy! I’m still finishing polishing it off, so I haven’t pushed it here yet, but I’ve opened #3991 as an interim change I made along the way that resolves an old TODO and helped with the implementation.
(For what it’s worth: I do agree that .mailmap
is essentially a pragmatic hack, an inherently incomplete measure, and a symptom of a deficient underlying design. Still, Git users are unlikely to ever get anything better, and for some people it’s a hack that is the difference between a marginally workable identity system and one that is completely unfit for purpose, so I do care about implementing it at least as well as Git does. I’m principled in my pragmatism; giving up on the principles of application I listed in the commit message results in concrete scenarios where it fails to meet the basic point of supporting .mailmap
, in my view. Thankfully, it looks like it’s totally practical to address these layering concerns while still upholding them!)
7e14ec7
to
5400721
Compare
The underlying predicates now match only on the name and email, and the revset functions construct the union of the two. These are desired for the `.mailmap` work and aren’t currently surfaced in the user interface, but the possibility of exposing more specific queries in the future becomes available.
Prepare for `.mailmap` support.
Prepare for `.mailmap` support.
These live in the repository and map email addresses and names to canonical ones, as described in [`gitmailmap(5)`]. [`gitmailmap(5)`]: https://git-scm.com/docs/gitmailmap We introduce the notion of raw signatures, representing the name and email baked in to an object’s metadata, and canonical signatures, representing the up‐to‐date identity information obtained by looking up a raw signature in a `.mailmap` file. When there are no matching entries, the raw and canonical signatures are the same. Canonical signatures should be used for the majority of purposes, such as display and querying in user interfaces and automated batch processing (e.g., to collate statistics by commit author, or email committers in batch). Generally speaking, whenever you care about *who made the commit* rather than *what data happens to be encoded in the commit itself*, they are the appropriate thing to work with. Raw signatures should usually not be surfaced to users by default unless explicitly asked for. Valid reasons to work with them include low‐level processing of commits where a `.mailmap` is not accessible or would be inappropriate to use (e.g., rewriting or duplication of commits without intent to alter metadata), automated testing, forensics (examining the raw object data stored in the backend that is used to compute a commit’s cryptographic hash), and analysis and debugging of `.mailmap` files themselves. In this model, you can think of raw signatures as being like database keys for a table mapping them to canonical signatures. In an ideal world, these keys would be opaque synthetic keys with no human meaning that are only surfaced when poking into internals; the idea is to treat them as such to the greatest extent possible given the realities of the current object model. The only signatures Jujutsu currently processes are commit authors and committers, which can be obtained in raw and canonical form with `Commit::{author,committer}_raw` and `Mailmap::{author,committer}` respectively. If Jujutsu starts to store or process immutable identity data in other contexts (e.g. support for additional metadata on commits like Git’s `Co-authored-by`/`Signed-off-by`/`Reviewed-by` trailers, or detached metadata that nonetheless must remain immutable), then the notion of raw and canonical signatures will carry over to those and the same guidelines about preferring to work with and display canonical signatures whenever reasonable will apply. This is not meant to be a comprehensive solution to identity management or obsolete the discussion in jj-vcs#2957. There are many possible designs of forward‐thinking author and committer identity systems that would be a lot better than `.mailmap` files, but I don’t really want to get lost in the weeds trying to solve an open research problem here. Instead, this is just an acknowledgement that any system that treats user names and emails as immutable (as Jujutsu currently does) is going to need a mapping layer to keep them updated, and both Git and Mercurial adopted `.mailmap` files, meaning they are already in wide use to address this problem. All sufficiently large open source repositories tend to grow a substantial `.mailmap` file, e.g. [Linux], [Rust], [curl], [Mesa], [Node.js], and [Git] itself. Currently, people working on these repositories with Jujutsu see and search outdated and inconsistent authorship information that contradicts what Git queries and outputs, which is at the very least somewhere between confusing and unhelpful. Even if we had a perfect orthogonal solution in the native backend, as long as we support working on Git repositories it’s a compatibility‐relevant feature. [Linux]: https://github.com/torvalds/linux/blob/f2661062f16b2de5d7b6a5c42a9a5c96326b8454/.mailmap [Rust]: https://github.com/rust-lang/rust/blob/2c243d957008f5909f7a4af19e486ea8a3814be7/.mailmap [curl]: https://github.com/curl/curl/blob/a7ec6a76abf5e29fb3f951a09d429ce5fbff250f/.mailmap [Mesa]: https://gitlab.freedesktop.org/mesa/mesa/-/blob/cdf3228f88361410175c338704908ea74dc7b8ae/.mailmap [Node.js]: https://github.com/nodejs/node/blob/4c730aed7f825af1691740663d599e9de5958f89/.mailmap [Git]: https://github.com/git/git/blob/9005149a4a77e2d3409c6127bf4fd1a0893c3495/.mailmap That said, this is not exclusive to the Git backend. The `.mailmap` name and format is perfectly generic, already shared between Git and Mercurial, and applies to all systems that bake names and emails into commits, including the current local backend. The code uses Gitoxide, but only as a convenient implementation of the file format; in a hypothetical world where the Git backend was removed without Jujutsu changing its notion of commit signatures, `gix-mailmap` could be used standalone, or replaced with a bespoke implementation. I discussed this on the Discord server and we seemed to arrive at a consensus that this would be a good feature to have for Git compatibility and as a pragmatic stop‐gap measure for the larger identity management problem, and that I should have a crack at implementing it to see how complex it would be. Happily, it turned out to be pretty simple! No major plumbing of state is required as the users of the template and revset engines already have the working copy commit close to hand to support displaying and matching `@`; I think this should be more lightweight (but admittedly less powerful) than the commit rewriting approach @arxanas floated on Discord. ## Notes on various design decisions * The `.mailmap` file is read from the working copy commit of the current workspace. This is roughly equivalent to Git reading from `$GIT_WORK_TREE/.mailmap`, or `HEAD:.mailmap` in bare repositories, and seems like the best fit for Jujutsu’s model. I briefly looked into reading it from the actual on‐disk working copy, but it seemed a lot more complicated and I’m not sure if there’s any point. I didn’t add support for Git’s `mailmap.file` and `mailmap.blob` configuration options; unlike ignores, I don’t think I’ve ever seen this feature used other than directly in a repository, and `mailmap.blob` seems to mostly be there to keep it working in bare repositories. I can imagine something like a managed corporate multi‐repo environment with a globally‐shared `mailmap.file` so if people feel like this is important to keep consistency with I can look into implementing it. But genuinely I’ve never personally seen anybody use this. * The `author`/`committer` DSL functions respect the `.mailmap`, with `*_raw` variants to ignore it. If there’s a `.mailmap` available, signatures should be mapped through it unless there’s a specific reason not to; this matches Git’s behaviour and is the main thing that makes this feature worthwhile. There is a corresponding breaking change of the external Rust API, but hopefully the new method name and documentation will nudge people towards doing the right thing. I was initially considering a keyword argument to the template and revset functions to specify whether to map or not (and even implemented keyword arguments for template functions), but I decided it was probably overkill and settled on the current separate functions. A suggestion from Discord was to add a method on signatures to the template language, e.g. `.canonical()` or `.mailmap()`. While this seems elegant to me, I still wanted the short, simple construction to be right by default, and I couldn’t think of any immediate uses outside of `.author()` and `.committer()`. If this is added later, we will still get the elegant property that `commit.{author,committer}()` is short for `commit.{author,committer}_raw().canonical()`. * The mapping to canonical signatures is one‐way, and queries only match on the canonical form. This is the same behaviour as Git. The alternative would be to consider the mapped signatures as an equivalence set and allow a query for any member to match all of them, but this would contradict what is actually displayed for the commits, violate the principles about surfacing raw signatures detailed above, and the `*_raw` functions may be more useful in such a case anyway. * There’s currently no real caching or optimization here. The `.mailmap` file is materialized and parsed whenever a template or revset context is initialized (although it’s still O(1), not parsing it for every processed commit), and `gix-mailmap` does a binary search to resolve signatures. I couldn’t measure any kind of substantial performance hit here, maybe 1‐3% percent on some `jj log` microbenchmarks, but it could just be noise; a couple times it was actually faster.
5400721
to
9b265b5
Compare
.mailmap
files.mailmap
files
I’ve pushed a pretty big rework of this; it feels cleaner now. Headline changes:
I believe I’ve addressed all the previous review comments; hopefully this is acceptable to everyone now and we can get this merged! This turned into a bigger project than I expected when I picked it as my first public contribution to Jujutsu, but I’m grateful for all the feedback and I think this has ended up in a good place. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The direction of this PR looks good to me, thanks!
lib/src/revset.rs
Outdated
/// Commits with author field matching the pattern. | ||
Author(SignatureField, StringPattern), | ||
/// Commits with committer field matching the pattern. | ||
Committer(SignatureField, StringPattern), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe these enum variants can be simply AuthorName(_)
/AuthorEmail(_)
/..?
We're going to introduce AuthorDate(_)
, etc., so Author<Thing>
seems more consistent. And we'll probably add author_name()
/_email()
revset function if needed.
@@ -2053,12 +2164,14 @@ impl<'a> RevsetParseContext<'a> { | |||
user_email: String, | |||
extensions: &'a RevsetExtensions, | |||
workspace: Option<RevsetWorkspaceContext<'a>>, | |||
mailmap: Rc<Mailmap>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Rc<_>
isn't needed, but templater depends on it. If Mailmap
is cached by WorkspaceCommandHelper
, we can replace Rc
with &'a
. That said, this isn't important, and can be addressed later. (I just find it looked a bit odd because the revset module doesn't require Rc
.)
self.clone() | ||
} else { | ||
RevsetExpression::minus(self, &RevsetExpression::union_all(others)) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'll inline this to caller.
x & ~none()
could be rewritten to x
at later pass, but apparently we don't have a rule for that right now.
old_field_value.map(|field_value| pattern_case_insensitive.matches(field_value)); | ||
let new_field_matches = pattern.matches(new_field_value); | ||
|
||
if !new_field_matches && old_field_matches.unwrap_or(true) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't .unwrap_or(true)
mean we're building quite large negative sets? Mailmap entry doesn't usually contain old name, I suppose. Even if it's more correct, I don't think the added cost is acceptable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, my original idea was to build query without excludes
, meaning author(x)
matches x
in raw entry or mapped entry. This isn't perfect, but I considered it would be good enough. Mailmap is a bit of a hack.
If we need to guarantee that all shadowed matches are excluded, it's probably better to pass Mailmap
to evaluate_revset()
as a resource.
// } | ||
// | ||
// When matching raw signatures, the `Mailmap` is empty and this reduces to | ||
// `raw_field(pattern)`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: it's probably better to write unit tests for this query transformation function.
StringPattern::Exact(literal) => Cow::Owned(StringPattern::exact_i(literal)), | ||
StringPattern::Substring(needle) => Cow::Owned(StringPattern::substring_i(needle)), | ||
StringPattern::Glob(pattern) => Cow::Owned(StringPattern::GlobI(pattern.clone())), | ||
_ => Cow::Borrowed(self), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'll list all variants exhaustively (to catch future bugs.)
pub fn new_name(&self) -> Option<&str> { | ||
self.new_name.as_deref() | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: might be okay to simply make all fields public.
What is the status here? It'd be a shame to let it bit rot further. |
Sorry for the inactivity; I was expecting this to be a relatively quick PR and burnt out a little on the review cycles. I still intend to rebase this and respond to review comments at some point, though I am not sure if it will be possible to fully satisfy everyone here. |
I would really love to see this get merged as long as it's in a "good enough" state even if it's not perfect. There are repositories I don't want to use jujutsu with until it has mailmap support because I'd rather not see my deadname show up in the UI. |
FWIW, the display part (= templater + Commit object methods) is easy, and can be merged separately. IIRC, most of the discussion here is about |
Explanation, motivation, backstory, and design decisions are in the commit message, and I’ve included comprehensive tests and hopefully‐adequate documentation.
Pulling in the people who actively discussed this on Discord for review, hope that’s okay: @nasamuffin @arxanas @PhilipMetzger
Some things I’m unsure about and could use feedback on:
I’m happy to hear bikeshedding about the DSL interfaces here, especially the
*_raw
names (*_unmapped
?).There is a breaking change to the Rust
.author()
and.committer()
methods onCommit
; I assume this is okay due to the current level of general churn, but I thought I should point it out. I plan to look into opening a PR to make gg do the right thing here.get_wc_commit_mailmap
isn’t exactly my favourite function name in the world, and it feels a little gross to have commit/tree‐related plumbing in themailmap
module (e.g.,jj_lib::gitignore
just has a pure model). Any suggestions for better places this could go, better names it could have, types it could hang off as a method of, etc. are welcome.I addedpub(crate)
tojj_lib::git_backend: signature_{from,to}_git
, as they’re used by themailmap
module to interface withgix-mailmap
. Since this isn’t a feature specific to the Git backend, it feels wrong to have that dependency, but I’m not sure where would be good to move them to. A newjj_lib::git_util
module, perhaps?I don’t love thejj_cmd_ok_as
thing in the CLI test, but it seemed like too much work to refactor that interface right now.I don’t really know how conflicts in the
.mailmap
file should be handled. Ideally the entries not involved in a conflict should still be respected, but I’m not even sure howmaterialize_tree_value
behaves here.Should I split this into multiple commits? A lot of the changes are fairly entangled, but I could probably tease one or two more commits out of this.Should there be more permanent documentation of the
.mailmap
support, and if so, where?Checklist
If applicable:
CHANGELOG.md