Update proof of concept with respect to PEP 701 changes #22

Open
jimbaker opened this issue Apr 24, 2023 · 26 comments
@jimbaker
Owner

PEP 701, as tracked in the CPython implementation and recently merged by @pablogsal (python/cpython#102856), required changes to the tokenizer and parser for the cleaned-up implementation of f-strings. This impacts the initial proof-of-concept work in #1 written by @gvanrossum.

First, we have discussed two possible syntaxes, as seen in the examples below:

  1. html'<li>{name}</li>' (this is what we have generally done, with the tag not being dotted and no whitespace between it and the string part)
  2. html @ f'<li>{name}</li>', as with the transpiler work by @rmorshea in Interim Transpiler #20

In my reading of the PEP, the easiest/most straightforward change is to leverage PEP 701 f-strings directly, producing these tokens in either case. (So the scheme in #20 more explicitly maps to this.)

NAME - "html"
FSTRING_START - "'" 
FSTRING_MIDDLE - '<li>'
LBRACE - '{'
NAME - 'name'
RBRACE - '}'
FSTRING_MIDDLE - '</li>'
FSTRING_END - "'"

In other words, this is exactly the tokenization produced today, apart from the leading NAME token. This tokenization should then be straightforward to parse, including any necessary updates to the POC (possibly even removing more of the changes that were made, one can hope!).

The advantage of this approach (at least at first glance) is that we avoid duplicating a lot of the new work that was done for PEP 701, such as handling brace nesting and the subsequent tokenization of interpolations, which may themselves nest.
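For concreteness, the analogous stream for a plain f-string can already be inspected with the stdlib tokenizer on CPython 3.12+; a tag string as proposed above would produce the same stream with a leading NAME token. A minimal sketch for illustration (not part of the POC):

import io
import tokenize

# Tokenize an f-string with the PEP 701 tokenizer (CPython 3.12+).
# exact_type distinguishes LBRACE/RBRACE from the generic OP token.
src = "f'<li>{name}</li>'"
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.exact_type], repr(tok.string))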

@rmorshea
Collaborator

To clarify, I wasn't proposing html @ f'<li>{name}</li>' as a competing syntax. I chose that purely because I was interested in using tag strings in ReactPy before the PEP itself was accepted. In that context, picking a currently valid syntax would allow me to use existing tooling (e.g. black, flake8, mypy) without issue.

@pauleveritt
Collaborator

@rmorshea We just chatted about it, so good timing. Should we find a Discord somewhere to keep you in the loop?

@rmorshea
Collaborator

@pauleveritt, I'm happy to hop on discord.

@pauleveritt
Collaborator

@rmorshea I'm sitting in the Python Discord ATM as pauleveritt -- give me a nudge and we'll chat.

@gvanrossum
Collaborator

I am slowly working through the update (starting from scratch, but expect to start copying from #1 soon). See https://github.com/python/cpython/compare/main...gvanrossum:tag-strings-v2?expand=1

@gvanrossum
Collaborator

Ready, see python/cpython#103835. Have fun!

@jimbaker
Owner Author

My first reaction, from only looking at the PR (without trying the tag string examples), is that using the recent PEP 701 changes has very nicely simplified the code needed for tag strings. So this validates our thinking that PEP 701's formalization could help support tag strings and remove hacky code, even though it wasn't explicitly written to support them.

@jimbaker
Owner Author

jimbaker commented Apr 26, 2023

Most of the examples in this repo do work, but I saw two problems with the current PR.

First, a useful helper function:

def tag(*args):
    return args

So in general, it works as expected:

>>> tag"...{f(42)}..."
('...', (<function <lambda> at 0x7ff5202b7490>, 'f(42)', None, None), '...')
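For reference, the interpolation above arrives as a 4-tuple, apparently (getvalue, raw_text, conversion, format_spec); that shape is inferred from this output rather than from a published spec. A hypothetical unpacking:

def f(x): return x * 2

getvalue, raw, conversion, format_spec = tag"...{f(42)}..."[1]
print(raw)         # 'f(42)'
print(getvalue())  # 84 - evaluates the interpolation lazily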

But one assertion failure:

>>> tag"...{x=}..."
python: Parser/action_helpers.c:1205: decode_fstring_buffer: Assertion `tok_mode->last_expr_buffer != NULL' failed.
Aborted

And it's unclear to me how we should handle Unicode escapes now (I guess I will have to read the code here more closely):

>>> tag"...\N{GRINNING FACE}..."
  File "<stdin>", line 1
    tag"...\N{GRINNING FACE}..."
              ^^^^^^^^^^^^^
SyntaxError: invalid syntax. Perhaps you forgot a comma?
>>> tag"...\\N{GRINNING FACE}..."
  File "<stdin>", line 1
    tag"...\\N{GRINNING FACE}..."
               ^^^^^^^^^^^^^
SyntaxError: invalid syntax. Perhaps you forgot a comma?
>>> f"...\N{GRINNING FACE}..."
'...😀...'

@gvanrossum
Collaborator

Thanks for flagging those. I'll try to figure out what's going on with the crash.

For \N, what do "rf" strings do?

@gvanrossum
Collaborator

I have pushed a fix for the crash. It was a bit weird.

I'm still pondering the \N thing. It works sanely in the old tag-string branch, so I think it's another bug.

@gvanrossum
Collaborator

gvanrossum commented Apr 26, 2023

Hm, "sanely" may be an exaggeration. In the old branch tag"\N{SPACE}" treats {SPACE} as an interpolation. That matches what rf"\N{SPACE}" does, so maybe it's inevitable. It does mean that tag strings cannot easily use the \N{...} notation. At the very least that needs to be called out in PEP and documentation.

A workaround in the user-level tag function could be to check if the previous item is a constant string ending in \N, and in that case do a unicode lookup of the string field. That feels brittle though (to be correct you'd have to count backslashes). [UPDATE: That doesn't work for tag"\N{GRINNING FACE}" because GRINNING FACE is not a valid expression.]

@jimbaker
Owner Author

jimbaker commented Apr 26, 2023

Speaking of the treatment of something like \N{GRINNING FACE}, I had to consult myself from last year: #6 (comment)

There have been some changes since then. Let's look at the following:

>>> def tag(*args): return args
...
>>> def decode(raw): return raw.encode('utf-8').decode('unicode-escape')
...
>>> decode(fr'\N{{GRINNING FACE}}')
'😀'
>>> tag'...\N{{GRINNING FACE}}...'
  File "<stdin>", line 1
    tag'...\N{{GRINNING FACE}}...'
                                 ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 3-5: malformed \N character escape
>>> tag'...\\N{{GRINNING FACE}}...'
('...\\N{', 'GRINNING FACE}', '...')
>>> x = 42
>>> fr'...{x}...'
'...42...'

So at the very least we should expect it to behave like fr with '\N{{UNICODE NAME}}'. Supporting it as plain '\N{UNICODE NAME}' is trickier, as you mention, but seems closer to what a user would expect: the backslash is left unprocessed, but can then be processed later as expected. And it's still possible to just write the codepoint out as text.

As to which one is sane... it's an interesting corner case. The argument for processing it like fr is simply that there are various DSLs out there; while I cannot think of one that uses the convention \N{SOME NAME}, one could exist.

@gvanrossum
Collaborator

Pushed a fix for \N{GRINNING FACE}. It is a syntax error, because GRINNING FACE is not a valid expression:

>>> tag"\N{GRINNING FACE}"
  File "<stdin>", line 1
    tag"\N{GRINNING FACE}"
           ^^^^^^^^^^^^^
SyntaxError: invalid syntax. Perhaps you forgot a comma?
>>> 

The workaround is to double the curlies:

>>> tag"\N{{GRINNING FACE}}"
('\\N{', 'GRINNING FACE}')
>>> 

@jimbaker
Owner Author

> Pushed a fix for \N{GRINNING FACE}. It is a syntax error, because GRINNING FACE is not a valid expression:
>
> >>> tag"\N{GRINNING FACE}"
>   File "<stdin>", line 1
>     tag"\N{GRINNING FACE}"
>            ^^^^^^^^^^^^^
> SyntaxError: invalid syntax. Perhaps you forgot a comma?
> >>>
>
> The workaround is to double the curlies:
>
> >>> tag"\N{{GRINNING FACE}}"
> ('\\N{', 'GRINNING FACE}')
> >>>

This is probably the right way to do it, given that we get fr semantics for literals.
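A user-level tag function could then merge the split raw segments and cook them with the same encode/decode trick used earlier in this thread. A minimal sketch, assuming the args shape shown above:

def tag(*args):
    # Merge adjacent raw string segments (e.g. the '\\N{' /
    # 'GRINNING FACE}' split above), then decode the escapes.
    merged = []
    for a in args:
        if isinstance(a, str) and merged and isinstance(merged[-1], str):
            merged[-1] += a
        else:
            merged.append(a)
    return tuple(
        a.encode('utf-8').decode('unicode-escape') if isinstance(a, str) else a
        for a in merged
    )

Under these assumptions, tag"\N{{GRINNING FACE}}" yields ('😀',).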

@jimbaker
Owner Author

jimbaker commented Apr 26, 2023

@gvanrossum Never mind for now on the report below - I tried running on another Linux box and it does work there. It's always possible I have an out-of-date header for SQLite, or something else in the toolchain. But it's still strange that it only happens with this usage.

All examples in the examples directory now work (for some definition of "work" - I need to file a tests issue so some of them can be turned into automated tests).

The original report: try this example code - I'm getting a segmentation fault when I run it. It's moderately complex in terms of the nesting, so it might be exercising something interesting. Or maybe it's my bug, but I wouldn't expect a segmentation fault regardless:
https://github.com/jimbaker/tagstr/blob/main/examples/sql.py

(While there is also an example of usage with SQLAlchemy, it's an optional demo.)

@gvanrossum
Collaborator

> @gvanrossum try this example code - I'm getting a segmentation fault when I run it. It's moderately complex in terms of the nesting, so it might be exercising something interesting. Or maybe it's my bug, but I wouldn't expect a segmentation fault regardless:
> https://github.com/jimbaker/tagstr/blob/main/examples/sql.py

Hm, interesting. It's not in the parser or bytecode compiler, as it prints the message about pip installing sqlalchemy when I try to run it. Note: the instructions contain a bug -- you need to quote the greenlet>=2.0.0a2 argument to pip install, otherwise it creates an interesting filename. Alas, even with that fixed, greenlet fails to compile for me (on Mac). So I'm not sure how to continue debugging this.
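Something like this, with the specifier quoted so the shell doesn't treat the > as a redirection:

❯ pip install 'greenlet>=2.0.0a2'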

@gvanrossum
Collaborator

Hm, interesting that you have it working on a Linux box. Maybe sqlalchemy doesn't depend on greenlet there?

@gvanrossum
Collaborator

gvanrossum commented Apr 27, 2023

So I looked at the failing tests in my draft PR, and tried to fix them. Unfortunately I found that one test seems to be failing due to an issue with resetting the tokenizer; I'll probably have to get @pablogsal or @lysnikolaou to help me with that.

Most of the failures were of a simpler nature, which we will need to mention in the PEP and docs: where previously an invalid letter or letters immediately preceding a string literal (e.g. urf"blah") raised a SyntaxError, this will now (usually) raise a NameError. I guess if we had chosen to use backticks instead of generalizing f-strings (e.g., tag`foo{bar}`) this would not have been a problem.
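To illustrate the NameError failure mode described above (exact traceback shape assumed from the description, on the tag-strings branch):

>>> urf"blah"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'urf' is not defined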

@gvanrossum
Collaborator

@lysnikolaou Any chance you can help me with getting this last test passing? See the commit in python/cpython#103835: python/cpython@e37d679

@lysnikolaou
Contributor

Sure I can help!

From the commit message of python/cpython@e37d679:

> It seems that if there is a syntax error involving mismatched braces
> in an f-string interpolation, the parser backtracks and the tokenizer
> produces TAGSTRING_START instead of FSTRING_START, and other state
> is not initialized properly.

I don't think that the parser backtracks and the tokenizer produces TAGSTRING_START instead of FSTRING_START. My understanding is that the conversion character s is recognized as TAGSTRING_START because of a quote immediately following it. The full list of tokens for one of the offending cases:

❯ ./python.exe
Python 3.12.0a7+ (heads/tag-strings-v2:e37d679a79, Apr 26 2023, 22:15:21) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import tokenize
>>> src = "f'{3!s'"
>>> for token in tokenize._generate_tokens_from_c_tokenizer(src):
...     print(token)
... 
TokenInfo(type=61 (FSTRING_START), string="f'", start=(1, 0), end=(1, 2), line="f'{3!s'\n")
TokenInfo(type=25 (LBRACE), string='{', start=(1, 2), end=(1, 3), line="f'{3!s'\n")
TokenInfo(type=2 (NUMBER), string='3', start=(1, 3), end=(1, 4), line="f'{3!s'\n")
TokenInfo(type=54 (EXCLAMATION), string='!', start=(1, 4), end=(1, 5), line="f'{3!s'\n")
TokenInfo(type=62 (TAGSTRING_START), string="s'", start=(1, 5), end=(1, 7), line="f'{3!s'\n")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lysnikolaou/repos/python/cpython/Lib/tokenize.py", line 688, in _generate_tokens_from_c_tokenizer
    for info in c_tokenizer.TokenizerIter(source):
  File "<string>", line 1
    f'{3!s'
         ^
SyntaxError: unterminated f-string literal (detected at line 1)

I think what we will need to do is emit NAME after an exclamation point in an f-string interpolation even if it's followed by a quote. I can work on that if you want.

By the way, congrats on everyone involved. This is indeed very exciting!

@lysnikolaou
Contributor

There's also the slightly more difficult case of something like f'{x'. The error message we should produce here is not clear-cut, since both of the following fixes are valid.

  1. Adding a } will fix the SyntaxError, since the f-string f'{x}' is valid.
  2. Closing the tag string and adding a } afterwards is also okay, since quote reuse is now allowed and, thus, the following is also valid: f'{x''}'

With the current implementation the syntax error message reads SyntaxError: unterminated f-string literal, while the previous message was f-string: expecting '}'. I would suggest keeping a modified version of the old message, but I'm not really sure about it.

@gvanrossum
Collaborator

Thanks! You're likely right. Interestingly you can get the main branch to emit the unexpected error too:

>>> f'{3!f'
  File "<stdin>", line 1
    f'{3!f'
         ^^
SyntaxError: f-string: invalid conversion character
>>> 

(should be: "expecting '}'").

So your fix for this should be applied to main, not just to my branch. :-)

The same argument applies to the other problem case:

>>> f'{f'
  File "<stdin>", line 1
    f'{f'
       ^
SyntaxError: unterminated f-string literal (detected at line 1)
>>> 

And why is that treated differently than this?

>>> f'{b'
  File "<stdin>", line 1
    f'{b'
       ^
SyntaxError: f-string: expecting '}'
>>> 

Honestly I don't care that much about getting the right error in every case -- mismatched string quotes can lead to many different errors. (E.g. what if a user accidentally puts an unescaped quote inside a string.) But the tests insist on certain error messages, so something needs to be done.

@lysnikolaou
Contributor

I think that the error message on the main branch is okay, since f really is an invalid conversion character. It's a custom error implemented at the parser level, if I'm not mistaken. The one in the tag-strings branch (SyntaxError: unterminated f-string literal (detected at line 1)) is the one that I think we should change.

> And why is that treated differently than this?

The difference is that string prefixes that existed before (u, b, r) go into the path that emits STRING tokens, while all the others go into code that emits FSTRING_START or TAGSTRING_START tokens. The fact that the error message is different doesn't really make much sense though, so I'll look into that as well.


There's another point that I'd like to bring up, just in case you haven't considered it for the PEP. I had a flashback the other day to the time I tried to add a specific syntax error for invalid string prefixes. People started complaining because it was backwards-incompatible for cases like this:

def f():
    return'something'

or

'something' if expression else'something else'

Basically, all code that does not add spaces between a keyword and a string breaks. And that's true for this change as well.

On main:

❯ ./python.exe     
Python 3.12.0a7+ (heads/main:1d99e9e46e, May  3 2023, 13:07:40) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> def f():
...     return'1'
... 
>>> f()
'1'
>>> 'something' if 1 else'something else'
'something'

On the tag-string branch:

❯ ./python.exe  
Python 3.12.0a7+ (heads/tag-strings-v2:e37d679a79, May  3 2023, 13:11:32) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> def f():
...     return'1'
... 
>>> f()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in f
NameError: name 'return' is not defined
>>> 'something' if 1 else'something else'
  File "<stdin>", line 1
    'something' if 1 else'something else'
    ^^^^^^^^^^^^^^^^
SyntaxError: expected 'else' after 'if' expression

We ended up reverting that commit. There might be ways around this (one of which is moving keyword handling into the tokenizer, which probably isn't a good idea), and I'd be up for doing the work, even if it's just writing a section in the PEP, in case there isn't one already.

@gvanrossum
Collaborator

Yeah, there are endless variations on the theme of invalid conversion characters. Latest fun (on main):

>>> f'{3!r'''
... 
  File "<stdin>", line 1
    f'{3!r'''
         ^
SyntaxError: unterminated triple-quoted string literal (detected at line 1)
>>> 

(There's an invisible ^C at the ... prompt.)

Personally I think the lexer probably has to special-case the character after ! in interpolations.

About return'foo' and friends, the simplest solution would sure be to move reserved word recognition into the lexer. I personally think that's fine (we can find a way for pegen to write the list of reserved words gleaned from the grammar to a file that can be included by the lexer).
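A minimal sketch of the generation side, using the stdlib keyword list as a stand-in for the reserved words pegen would glean from the grammar (the header name and array here are hypothetical, not actual pegen output):

import keyword

# Emit a C header listing the reserved words, which the lexer could
# #include to recognize keywords before deciding a NAME followed by
# a quote is a tag string.
with open("generated_keywords.h", "w") as f:
    f.write("static const char *reserved_keywords[] = {\n")
    for kw in keyword.kwlist:
        f.write(f'    "{kw}",\n')
    f.write("    NULL,\n};\n")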

@jimbaker
Owner Author

Following up on this issue, there are several specific changes we need to make to clarify the PEP. Other details concern the implementation and can be deferred for now.

  1. The PEP should make it clear that reserved words, per https://docs.python.org/3/reference/lexical_analysis.html#keywords, are not allowed as tag names, including in dotted names. @pauleveritt let's add this restriction to https://github.com/jimbaker/tagstr/blob/main/pep.rst#valid-tag-names.

  2. We need to ensure good testing around reserved words, including producing the same errors. For example, this is seen in the recent rebasing by @lysnikolaou (https://github.com/lysnikolaou/cpython/tree/tag-strings-rebased):

>>> def f():
...     if 0:
...         pass
...     else'hi, {name}!'
...
>>> f()
Traceback (most recent call last):
  File "<python-input-22>", line 1, in <module>
    f()
    ~^^
  File "<python-input-21>", line 4, in f
    else'hi, {name}!'
    ^^^^^
NameError: name 'else' is not defined. Did you mean: 'False'?

whereas current CPython main gives the expected:

>>> def f():
...     if 0:
...         pass
...     else'hi, {name}!'
  File "<unknown>", line 4
    else'hi, {name}!'
        ^^^^^^^^^^^^^
SyntaxError: expected ':'

The straightforward solution proposed in #22 (comment) is probably what we need here.

  3. While a parser for the tag function could handle split strings like the following, as seen in the current implementation:
>>> def tag(*args): return args
...
>>> tag"\N{{GRINNING FACE}}"
('\\N{', 'GRINNING FACE}')

from a PEP perspective, we should follow what the user has written: exactly one string (specifically a Decoded) is provided for each static string in the source tag string. So Decoded.raw would actually be

'\\N{GRINNING FACE}'

and this cooks to

'\\N{GRINNING FACE}'.encode('utf-8').decode('unicode-escape')
'😀'

@pauleveritt let's add a subsection to
https://github.com/jimbaker/tagstr/blob/main/pep.rst#tag-function-arguments that states this - it feels very similar to what we already state about "No Empty Decoded String". (A sketch of this cooking rule follows the list.)

  4. The PEP is already clear with respect to ...!x, where x is not a valid conversion - "Note that as with f-strings, no other conversions are supported."
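A minimal sketch of the cooking rule from point 3, assuming a Decoded str subclass carrying the raw source text (the names follow the discussion above; the cooking uses the encode/decode trick shown earlier):

class Decoded(str):
    # The cooked text is the str value; the original source text is
    # kept on .raw for tag functions that want to re-process it.
    def __new__(cls, raw: str):
        self = super().__new__(cls, raw.encode('utf-8').decode('unicode-escape'))
        self.raw = raw
        return self

d = Decoded(r'\N{GRINNING FACE}')
assert d == '😀'
assert d.raw == '\\N{GRINNING FACE}'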

@pauleveritt
Collaborator

Just sent @jimbaker a PR for review on this.
