Update proof of concept with respect to PEP 701 changes #22

Open
jimbaker opened this issue Apr 24, 2023 · 26 comments
@jimbaker
Owner

PEP 701, as tracked in the CPython implementation and recently merged by @pablogsal (python/cpython#102856), required changes to the tokenizer and parser for the cleaned-up implementation of f-strings. This impacts the initial proof-of-concept work in #1 written by @gvanrossum.

First, we have discussed two possible syntaxes, as seen in the examples below:

  1. html'<li>{name}</li>' (this is what we have generally done, with the tag not being dotted and no whitespace between it and the string part)
  2. html @ f'<li>{name}</li>', as with the transpiler work by @rmorshea in Interim Transpiler #20

In my reading of the PEP, the easiest/most straightforward change is to leverage PEP 701 f-strings directly, producing these tokens in either case. (So the scheme in #20 more explicitly maps to this.)

NAME - "html"
FSTRING_START - "'" 
FSTRING_MIDDLE - '<li>'
LBRACE - '{'
NAME - 'name'
RBRACE - '}'
FSTRING_MIDDLE - '</li>'
FSTRING_END - "'"

In other words, this is exactly the tokenization produced today, apart from the leading NAME token. This tokenization should then be straightforward to parse, including any necessary updates to the POC (possibly even removing more of the changes that were made, one can hope!).

The advantage of this approach (at least at first glance) is that we avoid duplicating a lot of the new work that was done for PEP 701, such as handling brace nesting and the subsequent tokenization of interpolations, which may themselves nest.
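For concreteness, the analogous stream for a plain f-string can already be inspected with the stdlib tokenizer on CPython 3.12+; a tag string as proposed above would produce the same stream with a leading NAME token. A minimal sketch for illustration (not part of the POC):

import io
import tokenize

# Tokenize an f-string with the PEP 701 tokenizer (CPython 3.12+).
# exact_type distinguishes LBRACE/RBRACE from the generic OP token.
src = "f'<li>{name}</li>'"
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.exact_type], repr(tok.string))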

@rmorshea
Collaborator

To clarify, I wasn't proposing html @ f'<li>{name}</li>' as a competing syntax. I chose that purely because I was interested in using tag strings in ReactPy before the PEP itself was accepted. In that context, picking a currently valid syntax would allow me to use existing tooling (e.g. black, flake8, mypy) without issue.

@pauleveritt
Collaborator

@rmorshea We just chatted about it, so good timing. Should we find a Discord somewhere to keep you in the loop?

@rmorshea
Collaborator

@pauleveritt, I'm happy to hop on discord.

@pauleveritt
Collaborator

@rmorshea I'm sitting in the Python Discord ATM as pauleveritt -- give me a nudge and we'll chat.

@gvanrossum
Collaborator

I am slowly working through the update (starting from scratch, but expect to start copying from #1 soon). See https://github.com/python/cpython/compare/main...gvanrossum:tag-strings-v2?expand=1

@gvanrossum
Collaborator

Ready, see python/cpython#103835. Have fun!

@jimbaker
Owner Author

My first reaction, from only looking at the PR (without trying the tag string examples), is that using the recent PEP 701 changes has very nicely simplified the code needed for tag strings. So this validates our thinking that PEP 701's formalization could help support tag strings and remove hacky code, even though it wasn't explicitly written to support them.

@jimbaker
Owner Author

jimbaker commented Apr 26, 2023

Most of the examples in this repo do work, but I saw two problems with the current PR.

First, a useful helper function:

def tag(*args):
    return args

So in general, it works as expected:

>>> tag"...{f(42)}..."
('...', (<function <lambda> at 0x7ff5202b7490>, 'f(42)', None, None), '...')
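For reference, the interpolation above arrives as a 4-tuple, apparently (getvalue, raw_text, conversion, format_spec); that shape is inferred from this output rather than from a published spec. A hypothetical unpacking:

def f(x): return x * 2

getvalue, raw, conversion, format_spec = tag"...{f(42)}..."[1]
print(raw)         # 'f(42)'
print(getvalue())  # 84 - evaluates the interpolation lazily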

But one assertion failure:

>>> tag"...{x=}..."
python: Parser/action_helpers.c:1205: decode_fstring_buffer: Assertion `tok_mode->last_expr_buffer != NULL' failed.
Aborted

And it's unclear to me how we should handle Unicode escapes now (I guess I will have to read the code here more closely):

>>> tag"...\N{GRINNING FACE}..."
  File "<stdin>", line 1
    tag"...\N{GRINNING FACE}..."
              ^^^^^^^^^^^^^
SyntaxError: invalid syntax. Perhaps you forgot a comma?
>>> tag"...\\N{GRINNING FACE}..."
  File "<stdin>", line 1
    tag"...\\N{GRINNING FACE}..."
               ^^^^^^^^^^^^^
SyntaxError: invalid syntax. Perhaps you forgot a comma?
>>> f"...\N{GRINNING FACE}..."
'...😀...'

@gvanrossum
Collaborator

Thanks for flagging those. I'll try to figure out what's going on with the crash.

For \N, what do "rf" strings do?

@gvanrossum
Collaborator

I have pushed a fix for the crash. It was a bit weird.

I'm still pondering the \N thing. It works sanely in the old tag-string branch, so I think it's another bug.

@gvanrossum
Collaborator

gvanrossum commented Apr 26, 2023

Hm, "sanely" may be an exaggeration. In the old branch tag"\N{SPACE}" treats {SPACE} as an interpolation. That matches what rf"\N{SPACE}" does, so maybe it's inevitable. It does mean that tag strings cannot easily use the \N{...} notation. At the very least that needs to be called out in PEP and documentation.

A workaround in the user-level tag function could be to check if the previous item is a constant string ending in \N, and in that case do a unicode lookup of the string field. That feels brittle though (to be correct you'd have to count backslashes). [UPDATE: That doesn't work for tag"\N{GRINNING FACE}" because GRINNING FACE is not a valid expression.]

@jimbaker
Owner Author

jimbaker commented Apr 26, 2023

Speaking of the treatment of something like \N{GRINNING FACE}, I had to consult myself from last year: #6 (comment)

There have been some changes since then. Let's look at the following:

>>> def tag(*args): return args
...
>>> def decode(raw): return raw.encode('utf-8').decode('unicode-escape')
...
>>> decode(fr'\N{{GRINNING FACE}}')
'😀'
>>> tag'...\N{{GRINNING FACE}}...'
  File "<stdin>", line 1
    tag'...\N{{GRINNING FACE}}...'
                                 ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 3-5: malformed \N character escape
>>> tag'...\\N{{GRINNING FACE}}...'
('...\\N{', 'GRINNING FACE}', '...')
>>> x = 42
>>> fr'...{x}...'
'...42...'

So at the very least we should expect it to behave like fr with '\N{{UNICODE NAME}}'. Supporting it as plain '\N{UNICODE NAME}' is trickier, as you mention, but seems closer to what a user would expect: the backslash is left unprocessed, but can then be processed later as expected. And it's still possible to just write the codepoint out as text.

As to which one is sane... it's an interesting corner case. The argument for processing it like fr is simply that there are various DSLs out there; while I cannot think of one that uses the convention \N{SOME NAME}, one could exist.

@gvanrossum
Collaborator

Pushed a fix for \N{GRINNING FACE}. It is a syntax error, because GRINNING FACE is not a valid expression:

>>> tag"\N{GRINNING FACE}"
  File "<stdin>", line 1
    tag"\N{GRINNING FACE}"
           ^^^^^^^^^^^^^
SyntaxError: invalid syntax. Perhaps you forgot a comma?
>>> 

The workaround is to double the curlies:

>>> tag"\N{{GRINNING FACE}}"
('\\N{', 'GRINNING FACE}')
>>> 

@jimbaker
Owner Author

> Pushed a fix for \N{GRINNING FACE}. It is a syntax error, because GRINNING FACE is not a valid expression:
>
> >>> tag"\N{GRINNING FACE}"
>   File "<stdin>", line 1
>     tag"\N{GRINNING FACE}"
>            ^^^^^^^^^^^^^
> SyntaxError: invalid syntax. Perhaps you forgot a comma?
> >>>
>
> The workaround is to double the curlies:
>
> >>> tag"\N{{GRINNING FACE}}"
> ('\\N{', 'GRINNING FACE}')
> >>>

This is probably the right way to do it, given that we get fr semantics for literals.
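A user-level tag function could then merge the split raw segments and cook them with the same encode/decode trick used earlier in this thread. A minimal sketch, assuming the args shape shown above:

def tag(*args):
    # Merge adjacent raw string segments (e.g. the '\\N{' /
    # 'GRINNING FACE}' split above), then decode the escapes.
    merged = []
    for a in args:
        if isinstance(a, str) and merged and isinstance(merged[-1], str):
            merged[-1] += a
        else:
            merged.append(a)
    return tuple(
        a.encode('utf-8').decode('unicode-escape') if isinstance(a, str) else a
        for a in merged
    )

Under these assumptions, tag"\N{{GRINNING FACE}}" yields ('😀',).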

@jimbaker
Owner Author

jimbaker commented Apr 26, 2023

@gvanrossum Never mind for now on the report below - I tried running on another Linux box and it does work there. It's always possible I have an out-of-date header for SQLite, or something else in the toolchain. But it's still strange that it only happens with this usage.

All examples in the examples directory now work (for some definition of "work" - I need to file a tests issue so some of them can be turned into automated tests).

The original report: try this example code - I'm getting a segmentation fault when I run it. It's moderately complex in terms of the nesting, so it might be exercising something interesting. Or maybe it's my bug, but I wouldn't expect a segmentation fault regardless:
https://github.com/jimbaker/tagstr/blob/main/examples/sql.py

(While there is also an example of usage with SQLAlchemy, it's an optional demo.)

@gvanrossum
Collaborator

> @gvanrossum try this example code - I'm getting a segmentation fault when I run it. It's moderately complex in terms of the nesting, so it might be exercising something interesting. Or maybe it's my bug, but I wouldn't expect a segmentation fault regardless:
> https://github.com/jimbaker/tagstr/blob/main/examples/sql.py

Hm, interesting. It's not in the parser or bytecode compiler, as it prints the message about pip installing sqlalchemy when I try to run it. Note: the instructions contain a bug -- you need to quote the greenlet>=2.0.0a2 argument to pip install, otherwise it creates an interesting filename. Alas, even with that fixed, greenlet fails to compile for me (on Mac). So I'm not sure how to continue debugging this.
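Something like this, with the specifier quoted so the shell doesn't treat the > as a redirection:

❯ pip install 'greenlet>=2.0.0a2'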

@gvanrossum
Collaborator

Hm, interesting that you have it working on a Linux box. Maybe sqlalchemy doesn't depend on greenlet there?

@gvanrossum
Collaborator

gvanrossum commented Apr 27, 2023

So I looked at the failing tests in my draft PR, and tried to fix them. Unfortunately I found that one test seems to be failing due to an issue with resetting the tokenizer; I'll probably have to get @pablogsal or @lysnikolaou to help me with that.

Most of the failures were of a simpler nature, which we will need to mention in the PEP and docs: where previously an invalid letter or letters immediately preceding a string literal (e.g. urf"blah") raised a SyntaxError, this will now (usually) raise a NameError. I guess if we had chosen to use backticks instead of generalizing f-strings (e.g., tag`foo{bar}`) this would not have been a problem.
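To illustrate the NameError failure mode described above (exact traceback shape assumed from the description, on the tag-strings branch):

>>> urf"blah"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'urf' is not defined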

@gvanrossum
Collaborator

@lysnikolaou Any chance you can help me with getting this last test passing? See the commit in python/cpython#103835: python/cpython@e37d679

@lysnikolaou
Contributor

Sure I can help!

From the commit message of python/cpython@e37d679:

> It seems that if there is a syntax error involving mismatched braces
> in an f-string interpolation, the parser backtracks and the tokenizer
> produces TAGSTRING_START instead of FSTRING_START, and other state
> is not initialized properly.

I don't think that the parser backtracks and the tokenizer produces TAGSTRING_START instead of FSTRING_START. My understanding is that the conversion character s is recognized as TAGSTRING_START because of a quote immediately following it. The full list of tokens for one of the offending cases:

❯ ./python.exe
Python 3.12.0a7+ (heads/tag-strings-v2:e37d679a79, Apr 26 2023, 22:15:21) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import tokenize
>>> src = "f'{3!s'"
>>> for token in tokenize._generate_tokens_from_c_tokenizer(src):
...     print(token)
... 
TokenInfo(type=61 (FSTRING_START), string="f'", start=(1, 0), end=(1, 2), line="f'{3!s'\n")
TokenInfo(type=25 (LBRACE), string='{', start=(1, 2), end=(1, 3), line="f'{3!s'\n")
TokenInfo(type=2 (NUMBER), string='3', start=(1, 3), end=(1, 4), line="f'{3!s'\n")
TokenInfo(type=54 (EXCLAMATION), string='!', start=(1, 4), end=(1, 5), line="f'{3!s'\n")
TokenInfo(type=62 (TAGSTRING_START), string="s'", start=(1, 5), end=(1, 7), line="f'{3!s'\n")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lysnikolaou/repos/python/cpython/Lib/tokenize.py", line 688, in _generate_tokens_from_c_tokenizer
    for info in c_tokenizer.TokenizerIter(source):
  File "<string>", line 1
    f'{3!s'
         ^
SyntaxError: unterminated f-string literal (detected at line 1)

I think what we will need to do is emit NAME after an exclamation point in an f-string interpolation even if it's followed by a quote. I can work on that if you want.

By the way, congrats on everyone involved. This is indeed very exciting!

@lysnikolaou
Contributor

There's also the slightly more difficult case of something like f'{x'. The error message we should produce here is not clear-cut, since both of the following fixes are valid.

  1. Adding a } will fix the SyntaxError, since the f-string f'{x}' is valid.
  2. Closing the tag string and adding a } afterwards is also okay, since quote reuse is now allowed and, thus, the following is also valid: f'{x''}'

With the current implementation the syntax error message reads SyntaxError: unterminated f-string literal, while the previous message was f-string: expecting '}'. I would suggest keeping a modified version of the old message, but I'm not really sure about it.

@gvanrossum
Collaborator

Thanks! You're likely right. Interestingly you can get the main branch to emit the unexpected error too:

>>> f'{3!f'
  File "<stdin>", line 1
    f'{3!f'
         ^^
SyntaxError: f-string: invalid conversion character
>>> 

(should be: "expecting '}'").

So your fix for this should be applied to main, not just to my branch. :-)

The same argument applies to the other problem case:

>>> f'{f'
  File "<stdin>", line 1
    f'{f'
       ^
SyntaxError: unterminated f-string literal (detected at line 1)
>>> 

And why is that treated differently than this?

>>> f'{b'
  File "<stdin>", line 1
    f'{b'
       ^
SyntaxError: f-string: expecting '}'
>>> 

Honestly I don't care that much about getting the right error in every case -- mismatched string quotes can lead to many different errors. (E.g. what if a user accidentally puts an unescaped quote inside a string.) But the tests insist on certain error messages, so something needs to be done.

@lysnikolaou
Contributor

I think that the error message on the main branch is okay, since f really is an invalid conversion character. It's a custom error implemented at the parser level, if I'm not mistaken. The one in the tag-strings branch (SyntaxError: unterminated f-string literal (detected at line 1)) is the one that I think we should change.

> And why is that treated differently than this?

The difference is that string prefixes that existed before (u, b, r) go into the path that emits STRING tokens, while all the others go into code that emits FSTRING_START or TAGSTRING_START tokens. The fact that the error message is different doesn't really make much sense though, so I'll look into that as well.


There's another point that I'd like to bring up, just in case you haven't considered it for the PEP. I had a flashback the other day to the time I tried to add a specific syntax error for invalid string prefixes. People started complaining because it was backwards-incompatible for cases like this:

def f():
    return'something'

or

'something' if expression else'something else'

Basically, all code that does not add spaces between a keyword and a string breaks. And that's true for this change as well.

On main:

❯ ./python.exe     
Python 3.12.0a7+ (heads/main:1d99e9e46e, May  3 2023, 13:07:40) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> def f():
...     return'1'
... 
>>> f()
'1'
>>> 'something' if 1 else'something else'
'something'

On the tag-string branch:

❯ ./python.exe  
Python 3.12.0a7+ (heads/tag-strings-v2:e37d679a79, May  3 2023, 13:11:32) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> def f():
...     return'1'
... 
>>> f()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in f
NameError: name 'return' is not defined
>>> 'something' if 1 else'something else'
  File "<stdin>", line 1
    'something' if 1 else'something else'
    ^^^^^^^^^^^^^^^^
SyntaxError: expected 'else' after 'if' expression

We ended up reverting that commit. There might be ways around this (one of which is moving keyword handling into the tokenizer, which probably isn't a good idea), and I'd be up for doing the work, even if it's just writing a section in the PEP, in case there isn't one already.

@gvanrossum
Collaborator

Yeah, there are endless variations on the theme of invalid conversion characters. Latest fun (on main):

>>> f'{3!r'''
... 
  File "<stdin>", line 1
    f'{3!r'''
         ^
SyntaxError: unterminated triple-quoted string literal (detected at line 1)
>>> 

(There's an invisible ^C at the ... prompt.)

Personally I think the lexer probably has to special-case the character after ! in interpolations.

About return'foo' and friends, the simplest solution would sure be to move reserved word recognition into the lexer. I personally think that's fine (we can find a way for pegen to write the list of reserved words gleaned from the grammar to a file that can be included by the lexer).
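A minimal sketch of the generation side, using the stdlib keyword list as a stand-in for the reserved words pegen would glean from the grammar (the header name and array here are hypothetical, not actual pegen output):

import keyword

# Emit a C header listing the reserved words, which the lexer could
# #include to recognize keywords before deciding a NAME followed by
# a quote is a tag string.
with open("generated_keywords.h", "w") as f:
    f.write("static const char *reserved_keywords[] = {\n")
    for kw in keyword.kwlist:
        f.write(f'    "{kw}",\n')
    f.write("    NULL,\n};\n")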

@jimbaker
Owner Author

Following up on this issue, there are several specific changes we need to make to clarify the PEP. Other details concern the implementation and can be deferred for now.

  1. The PEP should make it clear that reserved words, per https://docs.python.org/3/reference/lexical_analysis.html#keywords, are not allowed as tag names, including in dotted names. @pauleveritt let's add this restriction to https://github.com/jimbaker/tagstr/blob/main/pep.rst#valid-tag-names.

  2. We need to ensure good testing around reserved words, including producing the same errors. For example, this is seen in the recent rebasing by @lysnikolaou (https://github.com/lysnikolaou/cpython/tree/tag-strings-rebased):

>>> def f():
...     if 0:
...         pass
...     else'hi, {name}!'
...
>>> f()
Traceback (most recent call last):
  File "<python-input-22>", line 1, in <module>
    f()
    ~^^
  File "<python-input-21>", line 4, in f
    else'hi, {name}!'
    ^^^^^
NameError: name 'else' is not defined. Did you mean: 'False'?

whereas current CPython main gives the expected:

>>> def f():
...     if 0:
...         pass
...     else'hi, {name}!'
  File "<unknown>", line 4
    else'hi, {name}!'
        ^^^^^^^^^^^^^
SyntaxError: expected ':'

The straightforward solution proposed in #22 (comment) is probably what we need here.

  3. While a parser for the tag function could handle split strings like the following, as seen in the current implementation:
>>> def tag(*args): return args
...
>>> tag"\N{{GRINNING FACE}}"
('\\N{', 'GRINNING FACE}')

from a PEP perspective, we should follow what the user has written: exactly one string (specifically a Decoded) is provided for each static string in the source tag string. So Decoded.raw would actually be

'\\N{GRINNING FACE}'

and this cooks to

'\\N{GRINNING FACE}'.encode('utf-8').decode('unicode-escape')
'😀'

@pauleveritt let's add a subsection to
https://github.com/jimbaker/tagstr/blob/main/pep.rst#tag-function-arguments that states this - it feels very similar to what we already state about "No Empty Decoded String". (A sketch of this cooking rule follows the list.)

  4. The PEP is already clear with respect to ...!x, where x is not a valid conversion - "Note that as with f-strings, no other conversions are supported."
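A minimal sketch of the cooking rule from point 3, assuming a Decoded str subclass carrying the raw source text (the names follow the discussion above; the cooking uses the encode/decode trick shown earlier):

class Decoded(str):
    # The cooked text is the str value; the original source text is
    # kept on .raw for tag functions that want to re-process it.
    def __new__(cls, raw: str):
        self = super().__new__(cls, raw.encode('utf-8').decode('unicode-escape'))
        self.raw = raw
        return self

d = Decoded(r'\N{GRINNING FACE}')
assert d == '😀'
assert d.raw == '\\N{GRINNING FACE}'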

@pauleveritt
Collaborator

Just sent @jimbaker a PR for review on this.
