support -z from GNU grep #457

Closed
stephentalley opened this issue Jan 17, 2025 · 19 comments
Labels: discuss (Feedback requested for possible enhancements), enhancement (New feature or request)

Comments

@stephentalley

I recently tried to use ugrep as a drop-in replacement for GNU grep in a shell script, but was thwarted by ugrep's lack of support for GNU grep's -z option. That option interprets an input "line" as a NUL-delineated chunk rather than a newline-delineated chunk.

Example:

mkdir -p /tmp/test
touch /tmp/test/file.txt
touch "/tmp/test/file
with
newline.txt"

find /tmp/test -print0 | /usr/bin/grep -z '\.txt$'

I see that ugrep supports -0, which affects only output. -z affects both input and output in GNU grep.

It would be nice to be able to have this feature parity with GNU grep. Thanks!
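For readers who want to see the behaviour in isolation, here is a minimal sketch. It assumes a grep that implements -z as described above (GNU grep, for instance); the `records` helper and the sample names are made up for illustration.

```shell
# Three NUL-terminated records; the first one contains embedded newlines.
records() {
  printf 'file\nwith\nnewline.txt\0file.txt\0other.log\0'
}

# With -z each NUL-terminated chunk is one "line", so $ anchors at the end
# of the chunk and the matches are written back NUL-terminated.
# tr turns the NULs into '|' only to make the record boundaries visible.
records | grep -z '\.txt$' | tr '\000' '|'
echo
```

The first record is printed with its newlines intact, because the newlines are ordinary data once NUL is the terminator.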

@AndrewDDavis
Contributor

I notice BSD grep also supports -z as a synonym for --null-data.

@genivia-inc
Member

genivia-inc commented Jan 18, 2025

While -z is now part of GNU/BSD grep it wasn't always so. Rather --null-data should be used.

Note that this flag is often not needed with ugrep, because ugrep matches newline characters explicitly and also other anchors such as end-of-file \Z can be used.

I probably should add --null-data to ugrep for compatibility. I could do that with an "encoding" (like option --encoding) that replaces newline with zero when parsing the input. There is a caveat that very large files should never be held fully in memory, so there will be a limit to the length of the "line" that is output (here a "line" is the entire file).

@genivia-inc genivia-inc added the enhancement New feature or request label Jan 18, 2025
@stephentalley
Author

A --null-data option would be much appreciated, thanks!

While -z is now part of GNU/BSD grep it wasn't always so. Rather --null-data should be used.

ugrep is in a bit of a sticky spot then.

One of its stated goals is "a true drop-in replacement for GNU grep", but if -z is to continue to mean something different between ggrep/ugrep, then that's not an entirely valid claim.

I can certainly change my own scripts to use --null-data, but if I've symlinked ugrep as grep somewhere early in my path, and I run someone else's script that relies on grep -z, then I run the risk of a (possibly silent) failure.

Maybe a deprecation/reassignment of -z is preferable?
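One way a script could guard against the silent-failure case is to probe the grep it is about to use. This is only a sketch: `grep_has_null_data` is a hypothetical helper, and how a grep with a different -z reacts to the probe input is not verified here; the function only checks for the one result that NUL-data semantics would give.

```shell
# Returns 0 if the given grep command treats -z as "NUL-separated records".
# Probe input: two records, "x" and "y". Under NUL-data semantics 'x$'
# matches exactly the first record, so -c prints 1. Any other output
# (or an error) is taken to mean that -z does something else.
grep_has_null_data() {
  count=$(printf 'x\0y\0' | "$1" -z -c 'x$' 2>/dev/null)
  [ "$count" = "1" ]
}

if grep_has_null_data grep; then
  echo "grep -z handles NUL-separated input"
else
  echo "grep -z means something else here; use --null-data where available" >&2
fi
```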

@genivia-inc
Member

One of its stated goals is "a true drop-in replacement for GNU grep", but if -z is to continue to mean something different between ggrep/ugrep, then that's not an entirely valid claim.

Yep, but I do remember that some time ago (maybe long ago, like 80s-90s?) GNU grep option -z used to support searching gz files and likewise BSD grep uses option -Z (capital Z) to search compressed files exactly like zgrep (the rationale goes that fgrep = grep -F, egrep = grep -E, zgrep = grep -Z etc). So go figure. Not even all official "grep-compatibles" are identical when it comes down to specialized features like that.

@genivia-inc
Member

genivia-inc commented Jan 19, 2025

I've implemented a proto --null-data in my dev version ready for more testing. It's behaving the same as GNU grep, but there might be edge cases that could be lurking in the dark to be discovered so I need a bit more time to test more thoroughly.

There is also a nice new trick with this, where only the input is treated as a sequence of lines terminated by zero bytes, but the output is not (--null-data affects both the input and output).

Like GNU grep we have the same results with ugrep --null-data:

$ find ~/tmp -name 'testfile*' -print0 | ugrep --null-data '\.txt$' | more -R
/Users/engelen/tmp/testfile
with
newline.txt^@/Users/engelen/tmp/testfile.txt^@

The trick to only convert the input as null data without converting back in the output:

$ find ~/tmp -name 'testfile*' -print0 | ugrep --encoding=null-data '\.txt$' | more -R
/Users/engelen/tmp/testfile^@with^@newline.txt
/Users/engelen/tmp/testfile.txt

The --encoding=null-data above specifies that the input uses zero-byte terminators, which are swapped with newline characters when the input is read. Note that the output of the latter indeed shows that NUL and LF are flipped relative to the output of find.

This might perhaps be useful to someone.

Let me add that a second trick is possible to output as --null-data but not input, i.e. output NUL for LF with ugrep --null-data --encoding=UTF-8, which might also be useful to someone.

The man page summary of --null-data that I have in mind should state this like so:

    --null-data
            Input and output are treated as sequences of lines with each line
            terminated by a zero byte instead of a newline; effectively swaps
            NUL with LF in the input and the output.  When combined with option
            --encoding=ENCODING, output each line terminated by a zero byte
            without changing the input specified by ENCODING.  Instead of
            option --null-data, option --encoding=null-data treats the input as
            a sequence of lines terminated by a zero byte without changing the
            output.  See also options --encoding and --null.
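The "swaps NUL with LF" wording can be mimicked with ordinary tools, which may help when reasoning about what the two variants produce. This is an emulation with tr and a line-oriented grep, not how ugrep implements the options; the helper names are hypothetical, and -a is assumed to be available (as in GNU grep) so that lines containing NUL are still treated as text.

```shell
records() {
  printf 'file\nwith\nnewline.txt\0file.txt\0other.log\0'
}

# Exchange NUL and LF bytes in a stream.
swap_nul_lf() {
  tr '\000\n' '\n\000'
}

# Like the described --null-data: swap on input, match line by line,
# swap back on output.
null_data_emulated() {
  swap_nul_lf | grep -a "$@" | swap_nul_lf
}

# Like the described --encoding=null-data: swap on input only, so the
# output stays newline-terminated and embedded newlines show up as NULs.
null_input_only_emulated() {
  swap_nul_lf | grep -a "$@"
}

records | null_data_emulated '\.txt$' | od -c
records | null_input_only_emulated '\.txt$' | od -c
```

Comparing the two od dumps shows the same difference as the two ugrep runs above: the first keeps NUL as the terminator, the second ends each match with a newline.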

About option -z: I really like to keep it even though it clashes, because -z and -Z are widely used with ugrep interactively, more so than the null options I believe, which are more likely used in scripts. To meet halfway, perhaps we can reassign -0 to --null-data (instead of to --null)?

@genivia-inc
Member

genivia-inc commented Jan 19, 2025

One way to get around the -z and -Z option clash with GNU/BSD grep is to use the internal ugrep "grep mode" we already have for other reasons and use it also to reassign -z and -Z to --null-data and --null at runtime, respectively. This "grep mode" automatically kicks in when the ugrep executable is renamed to one of grep, egrep, fgrep, zgrep, zegrep, zfgrep or when option -Y is explicitly specified. The caveat of doing this is that it can be confusing to users when a change of options -z and -Z is in effect.
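The general technique of switching behaviour on the executable's name can be sketched in a few lines of shell. This is an illustration of the idea only, not ugrep's actual code; `mode_for_name` and the mode labels are hypothetical.

```shell
# Hypothetical helper: map the name a program was invoked under to a mode.
# ${1##*/} strips any leading directory part, like basename would.
mode_for_name() {
  case "${1##*/}" in
    grep|egrep|fgrep|zgrep|zegrep|zfgrep) echo "grep-compat" ;;
    *) echo "native" ;;
  esac
}

mode_for_name /usr/local/bin/zgrep   # grep-compat
mode_for_name ugrep                  # native
```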

This is something that is left open to discuss.

@genivia-inc genivia-inc added the discuss Feedback requested for possible enhancements label Jan 19, 2025
@AndrewDDavis
Contributor

I see the logic of having -Y be more of a general mode-changing flag, rather than just "don't touch my patterns", but I do agree that having it so drastically change the behaviour of the other flags would be surprising, and sort of difficult to document.

Either way I think the difference should be mentioned in the ReadMe in the section on "Equivalence to GNU/BSD grep".

genivia-inc added a commit that referenced this issue Jan 21, 2025
@genivia-inc
Member

@stephentalley @AndrewDDavis I've committed an update to the source code master branch. It would be of great help if you could try this version's support for --null-data and if the man page is clear and not confusing. Some extra testing and verification is greatly appreciated before I release an official version update. I have not made a change to reassign -z and -Z, so the long options --null-data and --null are necessary, or use short option pair -00 (two zeros) for --null-data or -0 for --null.
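For anyone testing, a pipeline along these lines exercises the input and the output side at once. It is a sketch: `null_filter` is a hypothetical wrapper that defaults to grep -z so it runs without ugrep, and setting `USE_UGREP=1` assumes a ugrep build that already contains the --null-data change.

```shell
# Filter NUL-terminated records. Defaults to grep -z; set USE_UGREP=1 to
# try ugrep --null-data instead (assumes a build with the new option).
null_filter() {
  if [ "${USE_UGREP:-0}" = 1 ]; then
    ugrep --null-data "$@"
  else
    grep -z "$@"
  fi
}

tmp=$(mktemp -d)
touch "$tmp/file.txt" "$tmp/notes.log"
touch "$tmp/file
with
newline.txt"

# NUL-terminated output can be handed straight to xargs -0, so the name
# with embedded newlines arrives as a single argument.
count=$(find "$tmp" -type f -print0 | null_filter '\.txt$' |
  xargs -0 -n 1 sh -c 'test -f "$1" && echo ok' _ | wc -l)
echo "existing .txt files found: $count"
rm -rf "$tmp"
```

If the filter broke a record at a newline, `test -f` would fail for the fragments and the count would come out wrong, which makes this a cheap regression check.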

@genivia-inc
Member

genivia-inc commented Jan 21, 2025

@AndrewDDavis I see the logic of having -Y be more of a general mode-changing flag, rather than just "don't touch my patterns", but I do agree that having it so drastically change the behaviour of the other flags would be surprising, and sort of difficult to document.

It could be done with a new flag --grep that is already used internally when the ugrep executable is renamed to grep. Then we won't do this with -Y or --empty that have nothing to do with the -z and -Z reassignment, so keep -Y and --empty as they are.

@stephentalley I can certainly change my own scripts to use --null-data, but if I've symlinked ugrep as grep somewhere early in my path, and I run someone else's script that relies on grep -z, then I run the risk of a (possibly silent) failure.
Maybe a deprecation/reassignment of -z is preferable?

See my comment above. Symlinking or copying ugrep to grep, egrep, fgrep, zgrep, zegrep, zfgrep could enable a --grep internal flag that reassigns -z and -Z to the nulls. This will only happen when hard/sym linking or copying the executable. Aliases have no effect. I'm torn on this one, because I feel it is not a good practice. On the other hand, we are talking about executables with different names that do behave differently already as expected, no?
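Why links and copies can trigger such a mode while aliases cannot comes down to the name the program is started under. The sketch below uses a throwaway stand-in script rather than ugrep itself; the file names are arbitrary, and it assumes the temporary directory allows executing files.

```shell
tmp=$(mktemp -d)

# A stand-in tool that only reports the name it was invoked under.
cat > "$tmp/tool" <<'EOF'
#!/bin/sh
echo "invoked as: ${0##*/}"
EOF
chmod +x "$tmp/tool"

# A symlink (or a copy) changes the name the program sees...
ln -s "$tmp/tool" "$tmp/grep"
direct=$("$tmp/tool")
linked=$("$tmp/grep")
echo "$direct"   # invoked as: tool
echo "$linked"   # invoked as: grep

# ...whereas an alias such as `alias grep=/path/to/tool` is expanded by the
# shell into the real path, so the program would still see "tool".
rm -rf "$tmp"
```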

EDIT: #269 is related to this discussion, where -Y was changed to make regex parsing a bit more permissive, like GNU grep. I propose to not let -Y do this any longer as a side-effect, but let --grep or renamed executables grep etc only do that for GNU grep compatibility. @stdedos do you see any issues with this?

@stdedos
Contributor

stdedos commented Jan 21, 2025

EDIT: #269 is related to this discussion, where -Y was changed to make regex parsing a bit more permissive, like GNU grep. I propose to not let -Y do this any longer as a side-effect, but let --grep or renamed executables grep etc only do that for GNU grep compatibility. @stdedos do you see any issues with this?

Not rly. Most awesome / Sanest practice EVER to consider ugrep aliases to modify itself "feature-for-feature" (and not bug-for-bug "for the sake of bug-for-bug"), so that ug is an in-place replacement.


On tangents:

  1. I'd like the --no-empty feature on grep mode too. Catching grep "$foo" is nice.
  2. I'd like the smart-case feature, --max-width, etc on grep mode too.

I'd hope for ug (also when invoked via grep and family) to pick up its config file. ... Although I've just reverted to typing ug ..., auto-picking up e.g. smart-case, --max-width, "--no-empty", etc would be nice.

"I cannot use aliases", since the default bashrc file already has "colorful" grep aliases (You cannot define multiple aliases)

@genivia-inc
Member

@stdedos thanks.

Having an "enhanced grep compatibility mode" makes sense if we want -z and -Z to behave exactly as GNU/BSD grep, thus reassigning -z and -Z.

What does this mean?

Well, when ugrep is installed as GNU/BSD grep replacements grep, egrep, fgrep, zgrep, zegrep, zfgrep via symlinks or by copies of the ugrep executable, then this "enhanced grep compatibility mode" will be activated automatically.

The --help output of grep, egrep, fgrep, zgrep, zegrep, zfgrep will show the following option bindings:

    --null, -0, -Z
            Output a zero byte after the file name.  This option can be used
            with commands such as `find -print0' and `xargs -0' to process
            arbitrary file names, even those that contain newlines.  See also
            options -H or --with-filename and --null-data.
    --null-data, -00, -z
            Input and output are treated as sequences of lines with each line
            terminated by a zero byte instead of a newline; effectively swaps
            NUL with LF in the input and the output.  When combined with option
            --encoding=ENCODING, output each line terminated by a zero byte
            without affecting the input specified as per ENCODING.  Instead of
            option --null-data, option --encoding=null-data treats the input as
            a sequence of lines terminated by a zero byte without affecting the
            output.  Option --null-data is not compatible with UTF-16/32 input.
            See also options --encoding and --null.
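The two bindings are easy to mix up, so here is a small side-by-side sketch. It uses the long option --null and GNU grep's -z so that it runs with a stock grep; the file names and contents are made up, and the same commands are only assumed to carry over to ugrep in its compatibility mode.

```shell
tmp=$(mktemp -d)
printf 'needle\n' > "$tmp/a b.txt"
printf 'hay\n'    > "$tmp/c.txt"

# --null: a zero byte follows each *file name* in the output; the data
# being searched is still read line by line.
names=$(grep -l --null needle "$tmp"/*.txt | xargs -0 -n 1 basename)
echo "$names"   # a b.txt

# --null-data (-z in GNU grep): the *searched data* is NUL-separated.
hits=$(printf 'needle one\0hay\0needle two\0' | grep -z -c needle)
echo "$hits"    # 2
rm -rf "$tmp"
```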

In addition, the -z, --decompress and -Z, --fuzzy options are not displayed at all with --help. Firstly, this is because the --decompress option is enabled automatically for zgrep, zegrep and zfgrep. Secondly, --fuzzy matching is still available. Also --hidden, -. and -Y, --empty are left out from the --help page, because these options are enabled for compatibility with GNU/BSD grep (if desired, use --no-hidden to not output hidden files and --no-empty on the command line).

Any concerns, comments or suggestions?

genivia-inc added a commit that referenced this issue Jan 22, 2025
@genivia-inc
Member

genivia-inc commented Jan 22, 2025

I've tagged the dev updates, the first with --null-data added and the second and third also with -z and -Z reassignment when ugrep is symlinked or copied to grep, egrep etc:

@stdedos
Contributor

stdedos commented Jan 22, 2025

I think you have this well thought out. Going "so far" as to even "amend" the --help output in compatibility mode is above and beyond!

If I may - I would avoid "hiding" things. This is a different program than "plain grep". No need to hide it.
I'd just append "In xyz compatibility mode, --decompress is activated by default", or "... --hidden is activated by default".

Similarly, e.g. for --hidden, either write additionally "Use --no-hidden if desired", or even list --hidden and --no-hidden next to each other

--hidden, -.
...
In xyz compatibility mode, --hidden is activated by default.
--no-hidden
...


I'm struggling for the "correct" balance of DWIM, and being "too magical" even going so far as "hiding" things. I like DWIM, because, it's nice when things "just happen". But I also like to know that DWIM is at play, so, "if I feel it doesn't work for me", I can be aware that it happens.


I've tagged the two updates, the first with --null-data added and the second also with -z and -Z reassignment when ugrep is symlinked or copied to grep, egrep etc:

"Eventually" I'll "stalk" your tags and build Debian packages for each one of them. ... But that project is left aside for a couple of years by now 😓
... So for now I'm stuck on a v4 IIRC 🙃

@genivia-inc
Member

genivia-inc commented Jan 22, 2025

"Eventually" I'll "stalk" your tags and build Debian packages for each one of them. ... But that project is left aside for a couple of years by now 😓 ... So for now I'm stuck on a v4 IIRC 🙃

It would be nice to update Debian packages when 7.2.0 is released. It looks like Debian is stuck with an old 3.11 release of ugrep. Much has improved since then.

Fewer users will update ugrep from Debian and many will use this GitHub repo or the ugrep static builds instead. This is not desirable for Debian either, since the ugrep installation statistics will be lagging. At least Ubuntu has a much more recent version of ugrep, v6.4.

@stdedos
Contributor

stdedos commented Jan 22, 2025

Fewer users will update ugrep from Debian and many will use this GitHub repo or the ugrep static builds instead

Eeeh ... True. But I like "the order" from installing from the official package manager.

... and since I've done the work for it. I just need to setup CI/CD for it (i.e. build tags, track your repo), and "rationalize" how can I update from you, and keep my changes on top too.

genivia-inc added a commit that referenced this issue Jan 23, 2025
@genivia-inc
Member

... and since I've done the work for it. I just need to setup CI/CD for it (i.e. build tags, track your repo), and "rationalize" how can I update from you, and keep my changes on top too.

I appreciate all your efforts to keep software packages up to date. I understand it is time consuming. I try to be as transparent as possible by indicating what changed in a release update. I am also committing intermediate feature request changes and fixes, so the source code can be diff'ed to see what the actual source code updates are. I've now also started to tag these commits when addressing a feature request or fix. When I'm happy with the update after more reviews and testing then I will release an official update (v7.2 will be released soon). I don't know if that helps to follow the changes e.g. by package maintainers who are concerned about what they are packaging.

@stdedos
Contributor

stdedos commented Jan 23, 2025

Don't worry about all these 😅 It was my decision to do this.

... and I am doing a "bare minimum" - I don't think those changes "stand up" to serious Debian/Ubuntu maintainers.


... tbh, it's hard for me to see "what changed and why" (I'm reviewing your changes out of curiosity 😛) - but it could very well be my limitation - I don't have C++ experience, and definitely very old C experience 😓

And since I'm forking your repo - regardless of the things I add above it - there is a transparent (and minimal) chain of changes (if those are to someone's interest)

genivia-inc added a commit that referenced this issue Jan 23, 2025
@genivia-inc
Member

I've released ugrep v7.2. Thank you for your feedback!

@stdedos
Contributor

stdedos commented Jan 24, 2025

I've released ugrep v7.2. Thank you for your feedback!

Thank you for offering us a very nice product, being available and open to listen 🙃
