String values don't properly handle unicode escapes #58

SteveKommrusch · 2018-10-31T15:00:06Z

I am using javalang to tokenize files that include Unicode escape sequences. These are correctly tokenized as strings, but item.value does not preserve the escape as it was written in the source. Consider the two cases below:
Case 1: builder.append(text, 0, MAX_TEXT).append('\u2026');
Case 2: builder.append(text, 0, MAX_TEXT).append('…');

In both cases, item.value is identical, and I get a UnicodeEncodeError if I try to write item.value to a file. I can catch the error and print successfully in Python like this:

      if (token_type == 'String'):
          try:
              outfile.write(item.value)
          except UnicodeEncodeError:
              outfile.write(item.value.encode('unicode-escape').decode('utf-8'))

However, the Python code above prints the same value for Case 1 and Case 2. I suspect the proper fix is to use raw strings for String token values inside javalang. Below is an example of raw strings solving the problem.

>>> str1 = '…'
>>> str2 = '\u2026'
>>> print("str1: ",str1," str2:",str2)
str1:  …  str2: …
>>> str1 == str2
True
>>> str1 = r'…'
>>> str2 = r'\u2026'
>>> print("str1: ",str1," str2:",str2)
str1:  …  str2: \u2026
>>> str1 == str2
False
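
To make the loss of information concrete, here is a minimal standard-library sketch. It does not call javalang; it assumes, as described above, that both cases arrive with the escape already decoded. The names `case1_value`, `case2_value`, and `escape_fallback` are hypothetical and exist only for this illustration.

```python
# Assumption: the tokenizer has already decoded the escape, so both cases
# carry the same one-character payload. Python decodes \u2026 in the first
# literal here, which mimics that behaviour.
case1_value = "'\u2026'"
case2_value = "'…'"


def escape_fallback(value: str) -> str:
    """Workaround from the issue: escape whatever ASCII cannot represent."""
    try:
        value.encode("ascii")
        return value
    except UnicodeEncodeError:
        return value.encode("unicode-escape").decode("ascii")


out1 = escape_fallback(case1_value)
out2 = escape_fallback(case2_value)

# Both cases come out as the escaped form, so the original spelling is lost.
print(out1 == out2)

# Writing with an explicit UTF-8 encoding avoids the exception, but it
# cannot restore the distinction either: both values encode identically.
print(case1_value.encode("utf-8") == case2_value.encode("utf-8"))
```

Either way the two source spellings become indistinguishable once the value has been decoded, which is why the fix has to happen before or during tokenization rather than at output time.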

chenzimin · 2019-02-06T09:31:28Z

Hi Steve,

If you change the code at line 534 in tokenizer.py to:

#self.pre_tokenize()
self.data = ''.join(self.decode_data())
self.length = len(self.data)

The Unicode escape will then be stored as a raw string rather than converted to the character it denotes.

One more benefit is that token positions will also be correct for files containing Unicode. I found this while debugging the position error.
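
A rough sketch of why decoding escapes up front can shift positions, assuming the pre-tokenize pass replaces each escape with the single character it denotes. The regex and `decode_unicode_escapes` below are hypothetical stand-ins written for this illustration, not javalang's actual code.

```python
import re

# Simplified pattern: a backslash, one or more 'u', then four hex digits.
# It ignores the Java rule that an escaped backslash (\\u2026) is not an escape.
UNICODE_ESCAPE = re.compile(r"\\u+([0-9a-fA-F]{4})")


def decode_unicode_escapes(source: str) -> str:
    """Replace each Unicode escape with the character it denotes."""
    return UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), source)


# Raw string, so Python leaves the backslash sequence untouched.
line = r"builder.append('\u2026'); int x;"
decoded = decode_unicode_escapes(line)

raw_col = line.index("int")          # column in the file as written
decoded_col = decoded.index("int")   # column after decoding

# The six-character escape collapsed to one character, so everything
# after it on the line moved left by five columns.
print(raw_col - decoded_col)  # 5
```

A tokenizer that reports columns against the decoded text would therefore be off by five for every escape earlier on the same line, which matches the position error mentioned above.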

SteveKommrusch · 2019-02-07T17:53:29Z

Good discovery, thanks.

Regards,
Steve

On Wednesday, February 6, 2019 2:31 AM, chenzimin wrote:

> Hi Steve,
>
> If you comment out this line, https://github.com/c2nes/javalang/blob/7a4af7f5136dd4f4f4b1846b3872f5688429e5db/javalang/tokenizer.py#L489, the unicode string will be stored as raw string, not converted to characters.
>
> And one more benefit is that the position will also be correct for files containing unicode. I found this when I tried to debug the position error.

jose linked a pull request Mar 25, 2021 that will close this issue

Store the unicode string as raw string #96

Open

xmcp mentioned this issue Jun 6, 2021

Faulty unicode escape handling leads to tokenizing failure #99

Open
