Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

str_util: support case‐insensitive string patterns #3991

Merged
merged 4 commits into from
Jul 10, 2024

Conversation

emilazy
Copy link
Collaborator

@emilazy emilazy commented Jun 29, 2024

See the commit message for details. This resolves an old TODO and came up during the rework of #3951. I’m not totally sure about the defaults here, but the functionality seems good to have.

Checklist

If applicable:

  • I have updated CHANGELOG.md
  • I have updated the documentation (README.md, docs/, demos/)
  • I have updated the config schema (cli/src/config-schema.json)
  • I have added tests to cover my changes

@yuja
Copy link
Collaborator

yuja commented Jun 29, 2024

This is a function argument rather than part of the pattern syntax as it doesn’t make sense for all uses of patterns to support case‐sensitivity

I think it should be a pattern kind so that the same syntax can be used in jj branch list PATTERN command for example. It's weird if these commands took a separate --case-insensitive flag.

(e.g. Git branch settings, where the remote end has no support for case‐insensitive matching),

Perhaps, .to_glob() can be renamed to .to_case_sensitive_glob() and error out if the type was Glob { .., case_insensitive: true }?

and it’s not clear how it would best be integrated into the pattern syntax anyway. The glob and regex crates also separate matching options like this from the pattern itself.

Something like iglob: or icase-glob: seems okay (though I'm not a big fan of these.) For regex:, we won't need iregex:, but could be added for consistency.

@arxanas how do you deal with case-insensitive matches in git-branchless? Just using regex: pattern?
https://github.com/arxanas/git-branchless/wiki/Reference:-Revsets#patterns

Case‐insensitivity is on by default; a casual search for commit names or descriptions is unlikely to care about specific case distinctions, the domain part of email addresses is de jure case‐insensitive, and the local‐part is case‐insensitive in practice too. This could be changed at a later date if it turns out to be undesirable default for any of these functions (description("FIXME")?).

Good point about email addresses, but I would make it case-sensitive by default mainly for consistency with command arguments and file patterns.

This currently only handles ASCII case folding, due to the complexities of case‐insensitive Unicode comparison and the glob crate’s lack of support for it.

Sounds good to me.

Thanks!

@emilazy
Copy link
Collaborator Author

emilazy commented Jun 29, 2024

Thank you for the review!

I think it should be a pattern kind so that the same syntax can be used in jj branch list PATTERN command for example. It's weird if these commands took a separate --case-insensitive flag.

I was wavering on whether it should be part of the pattern or not, so I can switch back. Although I’m not sure how much use there is for searching branch names case‐insensitively, especially when Git remotes don’t support it. I didn’t bother adding it to the corresponding revset functions for that reason; I didn’t want to make people expect they could treat branch names case‐insensitively in all contexts. It would be nice for it to work in contexts where it does make sense without needing an explicit extension to the interface, though, I agree.

Perhaps, .to_glob() can be renamed to .to_case_sensitive_glob() and error out if the type was Glob { .., case_insensitive: true }?

Sure, that works. It feels kind of awkward for some types of pattern to be rejected in some contexts, but I guess it’ll be inevitable for those code paths to error out if/when regex patterns are supported anyway.

Something like iglob: or icase-glob: seems okay (though I'm not a big fan of these.) For regex:, we won't need iregex:, but could be added for consistency.

Taking inspiration from regex syntax (a dangerous phrase!), what about…

exact/i:foo
substring/i:foo
glob/i:foo

# Short for `substring/i:foo`
/i:foo

Too weird? I like that it clearly separates out the type from the common options, and I think that we probably want something shorter than substring/i: for quick commit searches if it’s not going to be the default, but I admit that /i:foo makes me think of DOS. Maybe there’s something in the region of this we can bikeshed. substring(i):…? I guess we could just make i: short for isubstring:.

Good point about email addresses, but I would make it case-sensitive by default mainly for consistency with command arguments and file patterns.

Makes sense; it makes this less of a backwards‐incompatible change and can always be adjusted later. How about keeping mine() case‐insensitive, since it’s always dealing with email addresses? Not that it really matters, since it’s not likely common to have commits with email addresses differing only in case (though you can find it in Nixpkgs, like every other kind of metadata strangeness in the world).

@yuja
Copy link
Collaborator

yuja commented Jun 29, 2024

Taking inspiration from regex syntax (a dangerous phrase!), what about…

exact/i:foo
substring/i:foo
glob/i:foo

# Short for `substring/i:foo`
/i:foo

Too weird? I like that it clearly separates out the type from the common options, and I think that we probably want something shorter than substring/i: for quick commit searches if it’s not going to be the default, but I admit that /i:foo makes me think of DOS. Maybe there’s something in the region of this we can bikeshed. substring(i):…? I guess we could just make i: short for isubstring:.

It's unclear to me where we set a boundary (e.g. cwd-glob: vs glob/cwd: in file pattern), so I wouldn't introduce /{option} syntax. That said, I don't have strong feeling. If people like /i, I won't be against.

i: for isubstring: sounds good to me. If we add a short form, the canonical form could be icase-substring:.

How about keeping mine() case‐insensitive, since it’s always dealing with email addresses? Not that it really matters, since it’s not likely common to have commits with email addresses differing only in case (though you can find it in Nixpkgs, like every other kind of metadata strangeness in the world).

That makes sense to me. Inline comment will help understand why it's case insensitive.

@emilazy emilazy changed the title lib: add case_sensitive option to some revsets str_util: support case‐insensitive string patterns Jun 30, 2024
@emilazy
Copy link
Collaborator Author

emilazy commented Jun 30, 2024

I’ve switched this over to being part of string patterns, and settled on {kind}-i:…/i:… syntax for now to keep the option clearly separated from the base kind without adding any new weird symbols. It was a little gnarlier than expected because of the default pattern kind being dependent on context (which I didn’t realize before and don’t entirely love, but it is what it is). I’d appreciate it if you could take another look!

lib/src/revset.rs Outdated Show resolved Hide resolved
lib/src/revset.rs Outdated Show resolved Hide resolved
lib/src/str_util.rs Outdated Show resolved Hide resolved
lib/src/str_util.rs Outdated Show resolved Hide resolved
lib/src/str_util.rs Outdated Show resolved Hide resolved
lib/src/str_util.rs Outdated Show resolved Hide resolved
Copy link
Member

@martinvonz martinvonz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mine() matches case‐insensitively
unconditionally, since email addresses are conventionally
case‐insensitive and it doesn’t take a pattern anyway.

Is that already the case or do you mean that we should make it work that way? (It sounds like you're saying the former, but it doesn't look like that's what the code does.)

@martinvonz
Copy link
Member

mine() matches case‐insensitively
unconditionally, since email addresses are conventionally
case‐insensitive and it doesn’t take a pattern anyway.

Is that already the case or do you mean that we should make it work that way? (It sounds like you're saying the former, but it doesn't look like that's what the code does.)

Ah, you mean that it behaves that way after the commit. We don't really have a convention for what present tense means in commit messages. I use present tense to describe how it behaves before the patch (not saying you have to follow that, just explaining why I was confused).

@emilazy
Copy link
Collaborator Author

emilazy commented Jun 30, 2024

I agree that it’s confusing; I’ll reword it. I try to stick to the Git quasi‐standard of imperative present tense, but it can get so awkward when going into detail, so I frequently slip…

(Thank you for the repository invite, by the way!)

emilazy added 3 commits July 8, 2024 13:58
These support more types of pattern than just substring matching.
The comments say “Can find a unique match by either name or email”,
but these weren’t checking for an email match.
@emilazy emilazy force-pushed the push-ywkwkqzkppun branch 2 times, most recently from e9ac38f to d8438ce Compare July 8, 2024 13:29
Copy link
Collaborator

@yuja yuja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

FWIW, for mailmap, maybe we could also leverage regex: pattern. It's more expressive and supports case-insensitive matches. I'm thinking of adding it as it'll be useful in diff_contains(pattern) revset #2933.

CHANGELOG.md Outdated Show resolved Hide resolved
lib/src/str_util.rs Outdated Show resolved Hide resolved
lib/src/str_util.rs Show resolved Hide resolved
Partially resolve a 1.5‐year‐old TODO comment.

Add opt‐in syntax for case‐insensitive matching, suffixing the
pattern kind with `-i`. Not every context supports case‐insensitive
patterns (e.g. Git branch fetch settings). It may make sense to make
this the default in at least some contexts (e.g. the commit signature
and description revsets), but it would require some thought to avoid
more confusing context‐sensitivity.

Make `mine()` match case‐insensitively unconditionally, since email
addresses are conventionally case‐insensitive and it doesn’t take
a pattern anyway.

This currently only handles ASCII case folding, due to the complexities
of case‐insensitive Unicode comparison and the `glob` crate’s lack
of support for it. This is unlikely to matter for email addresses,
which very rarely contain non‐ASCII characters, but is unfortunate
for names and descriptions. However, the current matching behaviour is
already seriously deficient for non‐ASCII text due to the lack of any
normalization, so this hopefully shouldn’t be a blocker to adding the
interface. An expository comment has been left in the code for anyone
who wants to try and address this (perhaps a future version of myself).
@emilazy emilazy force-pushed the push-ywkwkqzkppun branch from d8438ce to 22c29f1 Compare July 9, 2024 06:35
@emilazy
Copy link
Collaborator Author

emilazy commented Jul 9, 2024

Hmm, for the .mailmap use‐case it’d be escaping a literal string to use as a regex just to get case‐insensitivity. I guess if we don’t want to support end‐user case‐insensitive querying without using a regex that’d be fine, but it doesn’t seem like the most natural way to implement it. I would be happy to see regex querying in general though!

@yuja
Copy link
Collaborator

yuja commented Jul 9, 2024

for the .mailmap use‐case it’d be escaping a literal string to use as a regex just to get case‐insensitivity.

I thought we could leverage regex to combine multiple matches, but that might not be needed?

@emilazy
Copy link
Collaborator Author

emilazy commented Jul 9, 2024

Oh, yes, that would work. I’m not sure if matching against /(?i)^(a|b|c)$/ is better or worse for a database engine than lower(x) = "a" OR lower(x) = "b" or lower(x) = "C"; I guess it’d probably be faster for a local backend just because of the reduced indirection. Do you think that would be a better approach for all case‐insensitive matching, or should we merge this now and consider regex stuff as a potential optimization later? I’m also wondering if regex syntax compatibility issues between Rust and database engines might get in the way of making that happen soon.

@yuja
Copy link
Collaborator

yuja commented Jul 9, 2024

I don't have strong feeling. This PR looks good, but I'm not sure if we like glob-i: syntax. I don't know if we can come up with a better syntax, though.

Good point about database query. Regex version would be harder to optimize in principle.

@jennings
Copy link
Collaborator

jennings commented Jul 9, 2024

I know I'm late to the party, but as long as we've explicitly asked for bikeshedding...

  • I think case-insensitive or smart-case should be the default
  • I think this should apply to file patterns as well.
  • I think the hyphen should be omitted (globi: or iglob:)

If I could choose anything unilaterally, I would probably have three kinds (using glob as the example):

  • iglob: for case-insensitive
  • cglob: for case-sensitive (the c prefix comes from PowerShell, but I'm indifferent about which letter is used)
  • glob: for either smart-case or user configuration

Many search interfaces support smart-casing: Case-insensitive if the search term is all lowercase, and case-sensitive if the search term contains any uppercase characters. I think that's a good default for unqualified glob:.

Case sensitivity

For string patterns, I think it's more user friendly to do a case-insensitive search by default, and let the user opt-in to case-sensitivity if they want it. That way the user can usually be lazy and not worry about it, but has the option to be strict when necessary.

For file patterns, case-insensitive file systems are the default on Windows and macOS, and I think users are accustomed to not needing to be precise about casing.

Hyphen

Especially on Windows, I feel like there's a lot of typing symbols when typing revsets. PowerShell treats parens without a $ as a subshell, so most revsets need to be quoted. And then many string patterns require quotes around the value. Sometimes I feel like I'm using a chorded keyboard.

I don't think I can do anything about the quoting, but the hyphen feels unnecessary to me.

@yuja
Copy link
Collaborator

yuja commented Jul 10, 2024

Making the default case-insensitive has performance implications (especially in file matchers), so let's discuss that separately. It's rather a big change.

Smart-case seems useful, and can be introduced later. The underlying machinery wouldn't be different.

So the remaining issue is hyphen or not? Without hyphenation, root-glob: file pattern will be root-iglob:?
https://martinvonz.github.io/jj/prerelease/filesets/#file-patterns

@emilazy
Copy link
Collaborator Author

emilazy commented Jul 10, 2024

I agree that a glob of some kind would be a good default for string patterns, maybe even a case‐insensitive one. It needs separate bikeshedding though, since currently some patterns default to substring and some to exact, which is a mess. I do have an implementation locally that seems pretty good.

For file patterns I’m not so sure. Case‐sensitivity is common filesystem semantics (albeit mostly on Linux rather than Windows or macOS) that I believe Jujutsu mostly inherits; I think we’d need to enforce case‐folding in trees for that to be the right choice, as nothing stops you having both readme and README in a repository.

Two motivations for the hyphen in …-i:

  • Setting off the case‐insensitivity visually as a “modifier” to a base pattern kind, rather than its own entirely separate thing, hence my original weird proposal of glob/i. (Maybe in the future we’ll want more options? Although hopefully not.)

  • Consistency with the file pattern kinds, which use hyphens to set off their own modifiers.

It just seemed a bit cleaner to me. The ergonomics aren’t great, but I already accepted we can’t get good ergonomics here without further discussion. I’d like i: for case‐insensitive matching with the same pattern kind as whatever the default is, but it requires making patterns more consistent in general so that users aren’t surprised.

That said, I mostly just implemented this to have the internals there so that the .mailmap implementation can use it, and I’m not particularly attached to the bikeshed colour, so if people prefer dropping the hyphen I’ll do that. Anything that makes the behaviour of patterns more consistent will be a fairly significant breaking change anyway, so we can adjust the decisions made here at the same time without further penalty.

@jennings
Copy link
Collaborator

I’m so used to backwards compatibility being critical that I assumed it was important to get the naming right before the feature merges. But we see small breaking changes in most releases so clearly they’re acceptable.

I’m happy to have this merge now with any naming and discuss again later.

@emilazy
Copy link
Collaborator Author

emilazy commented Jul 10, 2024

That’s the privilege of working on pre‐1.0 software :)

Not that we should gratuitously break things, but I think we’re all agreed that the current state of string patterns isn’t ideal and that it’s too big a job to fix in this PR. I’ll just hit the merge button now so I can finally finish off the .mailmap work and hopefully we can have a discussion about patterns soon (FWIW my preference is for globs as the default format, and I’m agnostic about case handling – case‐sensitive across the board might be best, as case‐insensitivity wouldn’t work properly for naming branches and filenames).

@emilazy emilazy merged commit 93d76e5 into jj-vcs:main Jul 10, 2024
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants