-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python backend #485
base: master
Are you sure you want to change the base?
Python backend #485
Conversation
@AiStudent Thanks for this PR. My apologies for not evaluating it earlier.
It does not seem wise to add a new backend for a parser generator that is already phased out. Could you instead target a parser generator that has a more promising future? SLY by the same author also does not seem well-maintained, there is no release to the Python package repo: https://pypi.org/project/sly/ CC @aarneranta |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PLY seems outdated. Please target a well-maintained python parser generator instead.
Alternatives?
|
Retargeted to Lark with LALR. @andreasabel Testsuite results: Failures:
Errors:
|
Great! |
Note it is slightly different: 249_unicode -> 358_MixFixLists for example. |
N.B. The CI failures are caused by dependencies: |
I went through all of the failed tests: Parameterized tests:Python:Python:100_coercion_lists
Since Parameterized tests:Python:Python:70_WhiteSpaceSeparatorSeems that whitespace in separators is not honored by printer
Parameterized tests:Python:Python:358_MixFixListsThe generated printer seems to emit too many spaces: -"A" . "B" . "QED" . "line1\n\tline2" . "\"%s\" is not an int" . semi-exprs: 1 + 1;
+"A" . "B" . "QED" . "line1\n\tline2" . "\"%s\" is not an int" . semi-exprs: 1 + 1; Parameterized tests:Python:Python:479_LabelsCaseSensitive
Grammar is:
The reason for failure is that the generated method names in grammar = r"""
...
?exp: exp "^" exp1 -> r_eexp
| exp "**" exp1 -> r_eexp
| exp1
...
"""
...
#transformer
class TreeTransformer(Transformer):
@v_args(inline=True)
def r_eexp(self, exp_1, exp_2):
return Eexp(exp_1, exp_2)
@v_args(inline=True)
def r_eexp(self, exp_1, exp_2):
return EExp(exp_1, exp_2) Parameterized tests:Python:Python:289_LexerKeywordsSame problem as in 278_Keywords (see below). Parameterized tests:Python:Python:222_IntegerListMissing python module name sanitization:
Parameterized tests:Python:Python:256_RegexThe grammar has:
This is translated to:
Start symbol is:
Input is:
Error is:
I can get the test accepted if I give SPACEINALTS and TRIPLE a higher prio than NAME:
That is weird, because the docs say that the order is
So if there is just one prio, the length of the match should win. Clearly, SPACEINALTS can match a longer string ( Is this a bug in LARK or its docu?
This testcase aims at name clashes with the host language:
You need to make sure you sanitise python keywords and reserved identifiers. Parameterized tests:Python:Python:235_SymbolsOverlapTokensRunning this test reports: Unexpected token Token('MYPOSID', 'Foo') at line 1, column 4160.
Expected one of:
* IDENT The grammar has start symbol:
and the input is:
The generated lexer lexes
Parameterized tests:Python:Python:266_defineThis produces some ill-formed python:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the port to LARK!
Looking through the failed tests, I guess some more refinements are needed concerning
- name sanitisation (to avoid python syntax errors and name clashes),
- case sensitivity of rule names,
- whitespace handling in printing and
- precedence handling in printing lists.
There is one problem with token precedence that I don't know whether it is a bug in LARK. (See test case 256_Regex
.) Also, LARK claims to not respect the order of token definitions when resolving ambiguities. Usually, the earlier definitions have higher prio when the same length of input is matched. LARK allows to manually assign prios but these seem to override the "length" prio, according to the definition.
Not sure how this mismatch is resolved.
What is your take on this?
Frankly, I am a bit confused by the LARK docs, e.g. I don't know what the Name
in the prio list means. Probably it means alphabetical, but it just doesn't make sense to me.
source/src/BNFC/Backend/Python.hs
Outdated
makefile pkgName optMakefileName basename = vcat | ||
[ | ||
Makefile.mkRule "all" [] | ||
[ " " ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since no compilation is needed for python, maybe the all
goal could simply explain this, e.g.
@echo "Doing nothing: No compilation of the parser needed."
source/src/BNFC/Backend/Python.hs
Outdated
-- | Entrypoint for BNFC to use the Python backend. | ||
makePython :: SharedOptions -> CF -> MkFiles () | ||
makePython opts cf = do | ||
let pkgName = "bnfcPyGen" ++ name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The failure of test 222_IntegerList
is caused by the lack of filename sanitisation, the grammar file is named here 222_Integer-List-1
which yield an illegal python module name:
from bnfcPyGen222_Integer-List-1.ParsingDefs import *
…nd to the lex prio not working as expected
Thanks for the extensive feedback. Currently there exists at least two issues. Testsuite results:
It seems that the length of the regex decides the precedence, not the lengths of the matched tokens. In the below test program I try to tokenize "ab" with the literals A or AB:
Which yields:
In order to make AB have higher precedence, one may: add priority level number, such as |
@AiStudent wrote:
But this seems then to be a bug in LARK, at least according to the specification:
Or maybe, this is what is meant by
In any case, not prioritising the actual longest match seems awkward and not in line with standard lexing tools. If you have really nailed the issue, it might be worthwhile raising it to the developers of LARK. @AiStudent wrote:
Interesting, is this because Python has a limit on recursion depth? |
Hi, I re-targeted to ANTLRv4. The parameterized test results yield one error:
It appears that the BNFC example grammar for Java has mutually left recursive rules. In order to use the generated frontend, a makefile is now required. Now one needs: the ANTLRv4 jar file, the Python package In order to run the makefile the jar file has to be accessible, which currently should be exported as a variable from Then: And test: Using first match is apparently a design choice among several Python based parser generators, such as Lark, Parsimonious, and even PLY in retrospect, by relying on Python’s regex library. Regarding the recursion depth: my understanding is that the nature of lists in BNFC, being defined in a recursive manner, such as With some conditional checks both the separator and terminator pragmas produce parsing patterns that yields a single list instead of a tree, circumventing the problem. Essentially turning: |
Python backend using Python Lex Yacc, PLY.
1 - Requires python3.10+ and ply.
2 - Getting started documentation added to docs/user_guide.rst.
3 - I'll not pledge to be an active maintainer for 3 years, but who really knows.
4 - Questions (especially tricky or critical) are welcome.
Testsuite results:
The example grammars work (no layout supported).
The regression tests yield: