Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tabs are stripped #20

Open
emilyyyylime opened this issue Sep 4, 2024 · 8 comments
Open

Tabs are stripped #20

emilyyyylime opened this issue Sep 4, 2024 · 8 comments

Comments

@emilyyyylime
Copy link

The tab character \t (U+0009 Horizontal Tabulation) gets stripped by this crate (for example strip_str("\t") -> ""), which is not what I expected.

Is the tab character considered an ANSI escape sequence by this crate? If so the documentation should make it clearer.

@luser
Copy link
Owner

luser commented Sep 4, 2024

Is the tab character considered an ANSI escape sequence by this crate? If so the documentation should make it clearer.

Hrm! Not intentionally!

@luser
Copy link
Owner

luser commented Sep 4, 2024

Oh, I guess I didn't dive deeply enough into how all of this works under the hood:

fn execute(&mut self, byte: u8) {
// We only care about executing linefeeds.
if byte == b'\n' {
self.err = writeln!(self.writer).err();
}
}

perform gets called for all C0 and C1 control characters, and the tab character is in C0. TIL!

Would you like to submit a patch? I migrated to a new PC a few months ago and don't have this repo checked out in a convenient place at the moment.

@emilyyyylime
Copy link
Author

@luser which characters exactly do we need to special case here? Only tabs? What about vertical tabs, carriage-returns, NULs, and other characters that mostly lost their meaning today?

@emilyyyylime
Copy link
Author

emilyyyylime commented Sep 4, 2024

Actually could we just print every byte passed to execute? It seems the actual ESCAPE character and the following sequence isn't passed to it

@luser
Copy link
Owner

luser commented Sep 4, 2024

which characters exactly do we need to special case here? Only tabs?

Adding tabs to the special-case list so that it consists of just \n and \t is what I am suggesting, yes. I don't think any of the others are going to do anything particularly useful, honestly.

Actually could we just print every byte passed to execute?

While we could, I think that given the stated purpose of this crate it's reasonable to be judicious in what we pass through. \t feels justifiable to transmit. I would want to see a proposed use case for including any others. This crate has been in reasonably widespread use for 6 years now and you seem to be the first person who has noticed that it was omitting tabs (or at least the first to take the time to file a bug on it). It seems reasonable to state that the current state of affairs isn't causing many highly-visible problems for people. :) To be clear—this is entirely my opinion, I don't have any data to back it up.

What would probably be useful would be to document the behavior we have implemented somewhere, maybe in the top-level crate docs?

//! This can be used to take output from a program that includes escape sequences and write

You don't have to write these docs if you submit a PR, I'm just writing this down while I'm thinking about it. We could link to the ANSI escape code page on Wikipedia, and describe the categories of what this library strips:

  • C0 and C1 control characters, except for \n and \t.
  • All ESC sequences, which start with byte 0x1B, which includes all CSI and OSC sequences.

Nice catch, BTW! Thanks for taking the time to report it!

@emilyyyylime
Copy link
Author

np! Sorry for taking so long to come back to this; I mostly empathise with your concerns of introducing changes to such a widely used crate, but at the very least some characters that I think appear reasonably often are \r, \0, and possibly the rest of the common C-style escapes \a, \b, \v, \f (Bell, backspace, vertical tab, and form feed). Would you be interested in whitelisting them as well? Another option would be to provide different functions with different granularity levels of escaping, allowing for no downstream breakage (I still think the default functionality should keep \t, \r, and \0, however.)

@emilyyyylime
Copy link
Author

It seems actually that the minimal set of bytes I'd expect to come through exactly overlap with the escape sequences that Rust provides https://doc.rust-lang.org/reference/tokens.html#byte-escapes. This seems to me to indicate some universality to them, at least within the Rust ecosystem. As per \a\b\v\f I'm willing to compromise

@emilyyyylime
Copy link
Author

I also did confirm that the characters we strip are exactly the range 0..32 (and also bytes that aren't valid UTF-8, in the case of the plain strip(), not sure if this is of any concern)

frozolotl pushed a commit to frozolotl/typst-ansi-hl that referenced this issue Oct 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants