Input iterator's column data member doesn't account for multibyte characters (UTF-8) #369

catsidhe · 2024-09-06T14:35:51Z

Unless I'm mistaken, column currently counts bytes. I ended up re-calculating the column by walking back from data until I see a newline character or reach the beginning of the content, counting characters while ignoring non-first UTF-8 bytes. Would've been nicer if it was handled by PEGTL.

The text was updated successfully, but these errors were encountered:

ColinH · 2024-09-11T17:33:38Z

Am I right to assume that what you are counting are codepoints? That might be possible to add for PEGTL 4.0, though we have been apprehensive to go down that road because the next step would be to take composed characters and grapheme clusters into consideration. That's a big can of worms to open.

catsidhe · 2024-09-11T18:59:57Z

Yes, just the code points, bytes outside 0x80..0xbf range. I figured it'd cover the majority of use cases, while being simple and cheap. But I understand your concern, that can of worms might be best left closed.

Another dimension to this is double-width and non-printable characters - those with wcwidth(c) values other than 1. Looks like most editors, gcc and clang, take wcwidth into account.

ColinH · 2024-09-14T08:57:31Z

Are you using lazy or eager position tracking?

catsidhe · 2024-09-16T03:29:49Z

Sorry, I don't quite know how to answer that. Maybe because I use it in a tree? Probably incorrectly, too!

auto from_utf8(std::string_view) -> std::u32string;
inline auto line_width(std::u32string_view sv) -> int {
  return std::accumulate(sv.begin(), sv.end(), 0, [](int w, char32_t uc) {
    return w + std::max(width(uc), 0);
  });
}
class node {
  // ...
  template<typename, typename Input>
  void start(Input const &in) noexcept {
    auto i = in.iterator();
    line = i.line;
    // Can I do this? Maybe not for all inputs.
    column = line_width(from_utf8({i.data - (i.column - 1), i.data})) + 1;
  }
};

Actually, I'm no longer sure how to best calculate columns. Consider this snippet:

int /*あいうえお*/ 123;

According to VSCode, the digit 1 is on column 15. It treats each character as occupying exactly one column.

Clang thinks it's on column 25 (counting bytes, like PEGTL):

test.cpp:1:25: error: expected unqualified-id
    1 | int /*あいうえお*/ 123;
      |                    ^

GCC says 20 (presumably using wcwidth):

test.cpp:1:20: error: expected unqualified-id before numeric constant
   1 | int /*あいうえお*/ 123;
     |                    ^~~

(Arrows point to 1 in both examples above in the terminal.)

Nano: 20
Emacs: 19 (it counts from 0)
Vim shows two numbers: 25-20

I'm leaning towards using wcwidth. But I don't think PEGTL should (since it's not in the standard library).

For best flexibility, perhaps a user-supplied character traits class is needed, that gets to decide how many columns each code point takes.

ColinH self-assigned this Sep 11, 2024

ColinH added the enhancement label Sep 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Input iterator's column data member doesn't account for multibyte characters (UTF-8) #369

Input iterator's column data member doesn't account for multibyte characters (UTF-8) #369

catsidhe commented Sep 6, 2024 •

edited

Loading

ColinH commented Sep 11, 2024

catsidhe commented Sep 11, 2024 •

edited

Loading

ColinH commented Sep 14, 2024

catsidhe commented Sep 16, 2024 •

edited

Loading

Input iterator's column data member doesn't account for multibyte characters (UTF-8) #369

Input iterator's column data member doesn't account for multibyte characters (UTF-8) #369

Comments

catsidhe commented Sep 6, 2024 • edited Loading

ColinH commented Sep 11, 2024

catsidhe commented Sep 11, 2024 • edited Loading

ColinH commented Sep 14, 2024

catsidhe commented Sep 16, 2024 • edited Loading

catsidhe commented Sep 6, 2024 •

edited

Loading

catsidhe commented Sep 11, 2024 •

edited

Loading

catsidhe commented Sep 16, 2024 •

edited

Loading