Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python backend #485

Open
wants to merge 9 commits into
base: master
Choose a base branch
from
Open

Python backend #485

wants to merge 9 commits into from

Conversation

AiStudent
Copy link

@AiStudent AiStudent commented Aug 27, 2024

Python backend using Python Lex Yacc, PLY.

1 - Requires python3.10+ and ply.

2 - Getting started documentation added to docs/user_guide.rst.

3 - I'll not pledge to be an active maintainer for 3 years, but who really knows.

4 - Questions (especially tricky or critical) are welcome.

Testsuite results:
The example grammars work (no layout supported).
The regression tests yield:

  • Failures:
    • Parameterized tests:Python:Python:100_coercion_lists
    • Parameterized tests:Python:Python:249_unicode
    • Parameterized tests:Python:Python:70_WhiteSpaceSeparator
    • Parameterized tests:Python:Python:235_SymbolsOverlapTokens
  • Errors:
    • Parameterized tests:Python:Python:289_LexerKeywords
    • Parameterized tests:Python:Python:222_IntegerList
    • Parameterized tests:Python:Python:256_Regex
    • Parameterized tests:Python:Python:278_Keywords
    • Parameterized tests:Python:Python:358_MixFixLists
    • Parameterized tests:Python:Python:266_define

@andreasabel
Copy link
Member

andreasabel commented Oct 30, 2024

@AiStudent Thanks for this PR. My apologies for not evaluating it earlier.
I checked it out now and try to experiment with it.
But the dependency ply seems hard to install, and looks badly maintained already.

  • Homebrew discontinued support for its formula: https://formulae.brew.sh/formula/python-ply

    Error: python-ply has been disabled because it does not meet homebrew/core's requirements for Python library formulae! It was disabled on 2024-08-15.

  • In general, it looks abandoned. It last release was for python 3.11 in 2018 but now we are at 3.13. https://pypi.org/project/ply/#history
  • It might not be fully abandoned, but seems to be superseded by SLY, according to this comment by the author of PLY:
    SLY and PLY dabeaz/ply#228 (comment)

    SLY is the more modern project and has more interesting features (such as recently added support for some EBNF grammar extensions).
    PLY was originally written in 2001 ... there will be no future pip-installable version of PLY. So, if people want to use it, they'd need to copy the code here.

It does not seem wise to add a new backend for a parser generator that is already phased out.

Could you instead target a parser generator that has a more promising future?

SLY by the same author also does not seem well-maintained, there is no release to the Python package repo: https://pypi.org/project/sly/

CC @aarneranta

Copy link
Member

@andreasabel andreasabel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PLY seems outdated. Please target a well-maintained python parser generator instead.

@andreasabel
Copy link
Member

Alternatives?

@AiStudent
Copy link
Author

AiStudent commented Nov 5, 2024

Retargeted to Lark with LALR. @andreasabel

Testsuite results:
The example grammars work (no layout supported and internal has no effect).
The regression tests yield:

Failures:

  • Parameterized tests:Python:Python:100_coercion_lists
  • Parameterized tests:Python:Python:70_WhiteSpaceSeparator
  • Parameterized tests:Python:Python:358_MixFixLists
  • Parameterized tests:Python:Python:479_LabelsCaseSensitive
  • Current parameterized test:Python:479_LabelsCaseSensitive

Errors:

  • Parameterized tests:Python:Python:289_LexerKeywords
  • Parameterized tests:Python:Python:222_IntegerList
  • Parameterized tests:Python:Python:256_Regex
  • Parameterized tests:Python:Python:278_Keywords
  • Parameterized tests:Python:Python:235_SymbolsOverlapTokens
  • Parameterized tests:Python:Python:266_define

@andreasabel
Copy link
Member

Great!
So the failure list is unchanged over the PLY target.
I'll have another look.

@AiStudent
Copy link
Author

So the failure list is unchanged over the PLY target.

Note it is slightly different: 249_unicode -> 358_MixFixLists for example.

@andreasabel
Copy link
Member

@andreasabel
Copy link
Member

Testsuite results: The example grammars work (no layout supported and internal has no effect). The regression tests yield:

I went through all of the failed tests:

Parameterized tests:Python:Python:100_coercion_lists

* expected: bar 42 . 42 . (41 + 1)

* but got:  bar 42 . 42 . 41 + 1

Since bar takes an Exp2 list, an Exp1 like 41 + 1 needs to be printed in parentheses.

Parameterized tests:Python:Python:70_WhiteSpaceSeparator

Seems that whitespace in separators is not honored by printer

* expected: 1  2  3  4  5  6  7  8  9

* but got:  1 2 3 4 5 6 7 8 9

Parameterized tests:Python:Python:358_MixFixLists

The generated printer seems to emit too many spaces:

-"A" . "B" . "QED" . "line1\n\tline2" . "\"%s\" is not an int" . semi-exprs: 1 + 1;
+"A"  . "B"  . "QED"  . "line1\n\tline2"  . "\"%s\" is not an int"  . semi-exprs: 1 + 1;

Parameterized tests:Python:Python:479_LabelsCaseSensitive

* expected: 2 ^ (2 ** 2)

* but got:  2 ** (2 ** 2)

Grammar is:

Eexp.  Exp  ::= Exp "^"  Exp1 ;
EExp.  Exp  ::= Exp "**" Exp1 ;

The reason for failure is that the generated method names in ParsingDefs.py lose the case:

grammar = r"""
...
  ?exp: exp "^" exp1 -> r_eexp
  | exp "**" exp1 -> r_eexp
  | exp1
...
"""
...
#transformer
class TreeTransformer(Transformer):
  @v_args(inline=True)
  def r_eexp(self, exp_1, exp_2):
    return Eexp(exp_1, exp_2)

  @v_args(inline=True)
  def r_eexp(self, exp_1, exp_2):
    return EExp(exp_1, exp_2)

Parameterized tests:Python:Python:289_LexerKeywords

Same problem as in 278_Keywords (see below).
Clashes with pythons True and False are not avoided.

Parameterized tests:Python:Python:222_IntegerList

Missing python module name sanitization:

from bnfcPyGen222_Integer-List-1.ParsingDefs import *

Parameterized tests:Python:Python:256_Regex

The grammar has:

token SpaceInAlts '\'' ["01x "]* '\'' ;
token Triple upper letter ["]-\'"] '\'';
token Name (char - [ "(){};.@\"" ] - [ " \n\t" ]) + ;

This is translated to:

  SPACEINALTS: /\'(\ |0|1|x)*\'/
  TRIPLE: /[A-Z][A-Za-z](\'|\-|\])\'/
  NAME: /[^\t\n\ \"\(\)\.\;\@\{\}]+/

Start symbol is:

Main ::= SpaceInAlts Name "/\\" Name "and" Triple;

Input is:

' x10 0' [x:4] /\ 'funky+identifier-umlaute:äöüß//\%#FOO and Un]'

Error is:

Unexpected token Token('NAME', "'") at line 1, column 1.
Expected one of: 
	* SPACEINALTS

I can get the test accepted if I give SPACEINALTS and TRIPLE a higher prio than NAME:

  SPACEINALTS.1: /\'(\ |0|1|x)*\'/
  TRIPLE.1: /[A-Z][A-Za-z](\'|\-|\])\'/
  NAME: /[^\t\n\ \"\(\)\.\;\@\{\}]+/

That is weird, because the docs say that the order is
https://lark-parser.readthedocs.io/en/stable/grammar.html#notes-for-when-using-a-lexer

Literals are matched according to the following precedence:
1. Highest priority first (priority is specified as: TERM.number: …)
2. Length of match (for regexps, the longest theoretical match is used)
3. Length of literal / pattern definition
4. Name

So if there is just one prio, the length of the match should win. Clearly, SPACEINALTS can match a longer string (' x10 0') than NAME (').

Is this a bug in LARK or its docu?

  • Parameterized tests:Python:Python:278_Keywords

This testcase aims at name clashes with the host language:

bnfcPyGentest/Absyn.py", line 143
    class False:
          ^^^^^
SyntaxError: invalid syntax

You need to make sure you sanitise python keywords and reserved identifiers.

Parameterized tests:Python:Python:235_SymbolsOverlapTokens

Running this test reports:

Unexpected token Token('MYPOSID', 'Foo') at line 1, column 4160.
Expected one of: 
	* IDENT

The grammar has start symbol:

Init.        Main ::= ... Ident MyId MyPosId ...

and the input is:

... Foo_ Bar1 Bar! ...

The generated lexer lexes Foo_ as MYPOSID probably because the regex generated for IDENT is not correct. IDENT is valid Haskell identifiers which includes underscores.

  MYID: /[A-Za-z](\d|[A-Za-z])*/
  MYPOSID: /[A-Za-z](\!|(\d|[A-Za-z]))*/
  IDENT: /[A-Za-z]\w*/

Parameterized tests:Python:Python:266_define

This produces some ill-formed python:

ParsingDefs.py", line 104, in d2_r_esqrt4
    return where(where(exp_))
           ^^^^^
NameError: name 'where' is not defined

Copy link
Member

@andreasabel andreasabel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the port to LARK!

Looking through the failed tests, I guess some more refinements are needed concerning

  • name sanitisation (to avoid python syntax errors and name clashes),
  • case sensitivity of rule names,
  • whitespace handling in printing and
  • precedence handling in printing lists.

There is one problem with token precedence that I don't know whether it is a bug in LARK. (See test case 256_Regex.) Also, LARK claims to not respect the order of token definitions when resolving ambiguities. Usually, the earlier definitions have higher prio when the same length of input is matched. LARK allows to manually assign prios but these seem to override the "length" prio, according to the definition.
Not sure how this mismatch is resolved.

What is your take on this?

Frankly, I am a bit confused by the LARK docs, e.g. I don't know what the Name in the prio list means. Probably it means alphabetical, but it just doesn't make sense to me.

makefile pkgName optMakefileName basename = vcat
[
Makefile.mkRule "all" []
[ " " ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since no compilation is needed for python, maybe the all goal could simply explain this, e.g.

@echo "Doing nothing: No compilation of the parser needed."

-- | Entrypoint for BNFC to use the Python backend.
makePython :: SharedOptions -> CF -> MkFiles ()
makePython opts cf = do
let pkgName = "bnfcPyGen" ++ name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The failure of test 222_IntegerList is caused by the lack of filename sanitisation, the grammar file is named here 222_Integer-List-1 which yield an illegal python module name:

from bnfcPyGen222_Integer-List-1.ParsingDefs import *

@AiStudent
Copy link
Author

AiStudent commented Nov 11, 2024

Thanks for the extensive feedback. Currently there exists at least two issues.

Testsuite results:

  • Errors:
    • 222_IntegerList - Recursion error, a list with 10k elements breaks the recursive list concatenation.
    • 256_Regex - Token precedence
    • 235_SymbolsOverlapTokens - Token precedence

It seems that the length of the regex decides the precedence, not the lengths of the matched tokens. In the below test program I try to tokenize "ab" with the literals A or AB:

from lark import Lark
grammar = r"""
    start: A
     | AB
      
    A:  /a(c)*/
    AB: /ab/
"""
parser = Lark(grammar, start='start', parser='lalr', lexer='basic', priority='auto')
print(list(parser.lex("ab")))

Which yields:

lark.exceptions.UnexpectedCharacters: No terminal matches 'b' in the current parser context, at line 1 col 2

ab
 ^
Expected one of: 
	* A
	* AB

Previous tokens: Token('A', 'a')

In order to make AB have higher precedence, one may: add priority level number, such as AB.1, or make the regex of AB longer than A, like AB: /ab(c)*/ or removing (c)* from A. I have not found a way to make the lexer use the longest matched expression; one alternative, for the one who dares, seems to be to utilize Lark's custom lexer option.

@andreasabel
Copy link
Member

@AiStudent wrote:

It seems that the length of the regex decides the precedence, not the lengths of the matched tokens.

But this seems then to be a bug in LARK, at least according to the specification:
https://lark-parser.readthedocs.io/en/stable/grammar.html#notes-for-when-using-a-lexer

Literals are matched according to the following precedence:

  1. Highest priority first (priority is specified as: TERM.number: …)
  2. Length of match (for regexps, the longest theoretical match is used)

Or maybe, this is what is meant by

(for regexps, the longest theoretical match is used)

In any case, not prioritising the actual longest match seems awkward and not in line with standard lexing tools.
I wonder whether others have bumped into this issue. Is there any discussion of this issue on bug trackers etc.?

If you have really nailed the issue, it might be worthwhile raising it to the developers of LARK.

@AiStudent wrote:

Recursion error, a list with 10k elements breaks the recursive list concatenation.

Interesting, is this because Python has a limit on recursion depth?
Anyway, doing lots of concatenations is usually done not directly but via concepts like StringBuilder / StringBuffer, for performance reasons. (In Haskell, one can use difference lists, that's what the Haskell backend does.)
Is there a blessed way in Python for assembling strings from many pieces?

@AiStudent
Copy link
Author

AiStudent commented Jan 6, 2025

Hi, I re-targeted to ANTLRv4.

The parameterized test results yield one error:

  • Parameterized tests:Python:Examples:java

It appears that the BNFC example grammar for Java has mutually left recursive rules.

In order to use the generated frontend, a makefile is now required.
bnfc --python -m Calc.cf

Now one needs: the ANTLRv4 jar file, the Python package antlr4, and Python version 3.10 or higher.

In order to run the makefile the jar file has to be accessible, which currently should be exported as a variable from .profile:
export ANTLR="$HOME/Downloads/antlr/antlr-4.13.2-complete.jar"

Then:
make

And test:
echo "1+ 2" | python3 genTest.py


Using first match is apparently a design choice among several Python based parser generators, such as Lark, Parsimonious, and even PLY in retrospect, by relying on Python’s regex library.

Regarding the recursion depth: my understanding is that the nature of lists in BNFC, being defined in a recursive manner, such as (:) [C] ::= C ... [C] causes the produced tree to have an individual node for every element. The Python backend using ANTLRv4 is still susceptible to this if the rules have a recursive structure, such as the rule below:
(:) [C] ::= “a” C “b” [C] “c”
of which input may look like:
a 1 b a 2 b 3 c c

With some conditional checks both the separator and terminator pragmas produce parsing patterns that yields a single list instead of a tree, circumventing the problem. Essentially turning:
(:) [C] ::= C “b” [C]
into the pattern:
C ( “b” C )*

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants