Mismatch between WASM and native builds #82

Open
ubolonton opened this issue Apr 11, 2020 · 2 comments
@ubolonton

There seems to be a mismatch between WASM and native builds (on macOS).

I built the CLI from the latest tree-sitter master (4c0fa29) and tried this code:

macro_rules! impl_pred {}

// TODO
i
impl_pred!(foo, bar);

This is the syntax tree reported by the native binding (tree-sitter test passes for this commit):

(source_file
  (macro_definition
    name: (identifier))
  (line_comment)
  (macro_invocation
    macro: (identifier)
    (ERROR
      (identifier))
    (token_tree
      (identifier)
      (identifier))))

This is the syntax tree reported by WASM (through tree-sitter web-ui):

(source_file
  (macro_definition
    name: (identifier))
  (line_comment)
  (identifier)
  (MISSING ";")
  (macro_invocation
    macro: (identifier)
    (token_tree
      (identifier)
      (identifier))))
@maxbrunsfeld
Contributor

I think it may be because the wasm binding uses the UTF16 encoding, due to javascript’s string semantics. Do you still see a mismatch if you transcode to UTF16 in your rust test?

The reason that it matters is that certain “error costs” are calculated using nodes’ byte length. This is something I’ve been a bit unsatisfied with for a while, but I still don’t think it’s worth the memory cost to store each node’s Unicode character count.

We could make them behave more similarly by dividing the byte count by 2 when using UTF16. 😸 I’d be curious if you have any suggestions.
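
(A small illustrative sketch, not from the thread: for ASCII-only text like the example above, the UTF-16 byte length is exactly twice the UTF-8 byte length, so any cost derived from byte counts shifts between the two paths.)

// Illustration only: the same text has a different byte length in each encoding,
// so costs computed from node byte lengths differ between UTF-8 and UTF-16 parses.
fn main() {
    let text = "impl_pred!(foo, bar);";
    let utf8_bytes = text.len();                       // 21 bytes in UTF-8
    let utf16_bytes = text.encode_utf16().count() * 2; // 42 bytes in UTF-16
    println!("UTF-8: {} bytes, UTF-16: {} bytes", utf8_bytes, utf16_bytes);
}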

@ubolonton
Author

> I think it may be because the wasm binding uses the UTF16 encoding, due to javascript’s string semantics. Do you still see a mismatch if you transcode to UTF16 in your rust test?

It seems so! There's no mismatch if I change run_tests to do this:

// let tree = parser.parse(&input, None).unwrap();
// Transcode the UTF-8 test input to UTF-16 and use the UTF-16 parse entry point instead.
let utf16: Vec<u16> = str::from_utf8(&input).unwrap()
    .encode_utf16()
    .collect();
let tree = parser.parse_utf16(&utf16, None).unwrap();
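
For reference, a self-contained version of the same comparison (a sketch only, assuming the tree_sitter and tree_sitter_rust crates; exact function names differ between versions):

use tree_sitter::Parser;

fn main() {
    // The same input as above, including the incomplete `i` line.
    let source = "macro_rules! impl_pred {}\n\n// TODO\ni\nimpl_pred!(foo, bar);\n";

    let mut parser = Parser::new();
    parser
        .set_language(tree_sitter_rust::language())
        .expect("incompatible grammar version");

    // Native-style parse over the UTF-8 bytes.
    let utf8_tree = parser.parse(source, None).unwrap();

    // Transcode to UTF-16 first, mirroring what the JavaScript/WASM binding sees.
    let utf16: Vec<u16> = source.encode_utf16().collect();
    let utf16_tree = parser.parse_utf16(&utf16, None).unwrap();

    println!("utf8:  {}", utf8_tree.root_node().to_sexp());
    println!("utf16: {}", utf16_tree.root_node().to_sexp());
}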

> The reason that it matters is that certain “error costs” are calculated using nodes’ byte length. This is something I’ve been a bit unsatisfied with for a while, but I still don’t think it’s worth the memory cost to store each node’s Unicode character count.

Yeah, I think it's not worth it, at least for programming languages. Non-ASCII characters are rare and would mostly be in comments/strings. I'm not sure about markup languages though. Maybe we could let grammars override the error costs in specific places?

> We could make them behave more similarly by dividing the byte count by 2 when using UTF16. 😸 I’d be curious if you have any suggestions.

I think making them more similar would be good, but I'm not sure about dividing by 2 when it's UTF16. 😅 In this case specifically, the syntax tree for UTF16 is more desirable:

In Atom (also UTF16 I assume):
[screenshot: test2 atom rs]

In Emacs (UTF8):
[screenshot: test2 emacs rs]
