Input iterator's column data member doesn't account for multibyte characters (UTF-8) #369
Comments
Am I right to assume that what you are counting are codepoints? That might be possible to add for PEGTL 4.0, though we have been apprehensive to go down that road because the next step would be to take composed characters and grapheme clusters into consideration. That's a big can of worms to open.
Yes, just the code points, i.e. bytes outside the 0x80..0xbf range. I figured it'd cover the majority of use cases, while being simple and cheap. But I understand your concern, that can of worms might be best left closed. Another dimension to this is double-width and non-printable characters - those with
Are you using lazy or eager position tracking?
Sorry, I don't quite know how to answer that. Maybe because I use it in a tree? Probably incorrectly, too!

    auto from_utf8(std::string_view) -> std::u32string;

    inline auto line_width(std::u32string_view sv) -> int {
        return std::accumulate(sv.begin(), sv.end(), 0, [](int w, char32_t uc) {
            return w + std::max(width(uc), 0);
        });
    }

    class node {
        // ...
        template<typename, typename Input>
        void start(Input const &in) noexcept {
            auto i = in.iterator();
            line = i.line;
            // Can I do this? Maybe not for all inputs.
            column = line_width(from_utf8({i.data - (i.column - 1), i.data})) + 1;
        }
    };

Actually, I'm no longer sure how to best calculate columns. Consider this snippet:

    int /*あいうえお*/ 123;

According to VSCode, the digit
Clang thinks it's on column 25 (counting bytes, like PEGTL):
GCC says 20 (presumably using

    test.cpp:1:20: error: expected unqualified-id before numeric constant
        1 | int /*あいうえお*/ 123;
          |                    ^~~

(Arrows point to
Nano: 20
I'm leaning towards using
For best flexibility, perhaps a user-supplied character traits class is needed, that gets to decide how many columns each code point takes.
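To make the traits idea concrete, here is a rough sketch of what such a policy could look like. Nothing here is PEGTL API: `byte_width`, `code_point_width`, `wide_aware_width`, and `column_of` are hypothetical names, the decoder assumes well-formed UTF-8, and the "wide" ranges are a crude approximation of East Asian wide characters rather than a real width table.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical policy: one column per UTF-8 byte of the code point.
struct byte_width {
    static constexpr int width(char32_t cp) noexcept {
        return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    }
};

// Hypothetical policy: one column per code point.
struct code_point_width {
    static constexpr int width(char32_t) noexcept { return 1; }
};

// Hypothetical policy: two columns for a few wide ranges, one otherwise.
// The ranges are a rough approximation, not a complete width table.
struct wide_aware_width {
    static constexpr int width(char32_t cp) noexcept {
        const bool wide = (cp >= 0x1100 && cp <= 0x115F) ||
                          (cp >= 0x2E80 && cp <= 0xA4CF) ||
                          (cp >= 0xAC00 && cp <= 0xD7A3) ||
                          (cp >= 0xF900 && cp <= 0xFAFF) ||
                          (cp >= 0xFF00 && cp <= 0xFF60);
        return wide ? 2 : 1;
    }
};

// 1-based column of the byte at `byte_offset` within `line`,
// where each preceding code point contributes Traits::width(cp) columns.
// Assumes `line` is valid UTF-8 and `byte_offset` is on a code point boundary.
template <typename Traits>
int column_of(std::string_view line, std::size_t byte_offset) {
    int column = 1;
    std::size_t i = 0;
    while (i < byte_offset && i < line.size()) {
        const unsigned char lead = static_cast<unsigned char>(line[i]);
        // Sequence length from the lead byte.
        const std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        // Payload bits of the lead byte, then 6 bits per continuation byte.
        char32_t cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
        for (std::size_t k = 1; k < len && i + k < line.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(line[i + k]) & 0x3F);
        }
        column += Traits::width(cp);
        i += len;
    }
    return column;
}
```

For the snippet above, with the digit `1` at byte offset 24, `column_of<byte_width>` gives 25 and `column_of<wide_aware_width>` gives 20, matching the Clang and GCC numbers, while `column_of<code_point_width>` gives 15.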
Unless I'm mistaken, column currently counts bytes. I ended up re-calculating the column by walking back from data until I see a newline character or reach the beginning of the content, counting characters while ignoring non-first UTF-8 bytes. Would've been nicer if it was handled by PEGTL.
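For reference, the workaround described above can be sketched roughly as follows. This is a minimal illustration, not PEGTL code: `code_point_column` is a hypothetical helper, and it assumes the caller can supply a pointer to the beginning of the content in addition to the current position, and that the bytes are valid UTF-8.

```cpp
#include <cstddef>

// Hypothetical helper: 1-based column of `pos`, counted in code points.
// `begin` is the start of the whole buffer, `pos` the current position.
// Walks back to the previous newline (or to `begin`), counting every byte
// that is not a UTF-8 continuation byte (10xxxxxx, i.e. 0x80..0xBF).
inline std::size_t code_point_column(const char* begin, const char* pos) noexcept {
    std::size_t column = 1;
    while (pos != begin && pos[-1] != '\n') {
        --pos;
        if ((static_cast<unsigned char>(*pos) & 0xC0) != 0x80) {
            ++column;  // lead byte or ASCII: starts a new code point
        }
    }
    return column;
}
```

The cost is linear in the length of the current line on every call, so it is probably best done only when a position is actually reported rather than on every match.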