-
Definitely! I'm not sure what you mean, though: how is "generating a parser" different to what you have now?
-
Oh do you mean generating a Participle parser from an ANTLR .g4 file via this parser? Great question and coincidentally I had just started doing something very similar to this, though my AST isn't anywhere near as complete as yours:
However I stopped this and started on a different approach. I think the idea of generating a Participle parser from an ANTLR grammar is great. That said, the path I'd prefer it to take would be to translate it to Participle's EBNF form, then write an ebnf2participle code generator that takes that EBNF and outputs a Participle grammar. This would result in Participle's EBNF being the lingua franca and close #14. Would you be interested in doing this? The parser for Participle's EBNF already exists in participle/ebnf.
-
What I'd like to end up with is something like this:
Then of course the same could potentially be done for other parser generators, maybe yacc+lex, etc.
-
Yeah, exactly 😄
Neat! (Are those files available anywhere, or just on your local? Didn't see them in the repo.)
I do see the potential advantage to going via EBNF, however I have mixed feelings on this. I suppose my basic concern is that I'm not familiar with EBNF. I've spent the past few days studying ANTLR syntax, but I have no idea if it can cleanly map to EBNF. I would need to read up on it, I suppose. For the sake of discussion, let's take an example, the GraphQL EBNF you have in the readme:

```ebnf
File = Entry* .
Entry = Type | Schema | Enum | "scalar" ident .
Type = "type" ident ("implements" ident)? "{" Field* "}" .
Field = ident ("(" (Argument ("," Argument)*)? ")")? ":" TypeRef ("@" ident)? .
Argument = ident ":" TypeRef ("=" Value)? .
TypeRef = "[" TypeRef "]" | ident "!"? .
Value = ident .
Schema = "schema" "{" Field* "}" .
Enum = "enum" ident "{" ident* "}" .
```

What is this in ANTLR?

```antlr
file: entry* ;
entry: type | schema | enum | 'scalar' IDENT ;
type: 'type' IDENT ('implements' IDENT)? '{' field* '}' ;
field: IDENT ('(' (argument (',' argument)*)? ')')? ':' type_ref ('@' IDENT)? ;
argument: IDENT ':' type_ref ('=' value)? ;
type_ref: '[' type_ref ']' | IDENT '!'? ;
value: IDENT ;
schema: 'schema' '{' field* '}' ;
enum: 'enum' IDENT '{' IDENT* '}' ;
```

That seems pretty clean, and is legal in a `.g4` file. For a fuller example, take the ANTLR JSON grammar:

```antlr
/** Taken from "The Definitive ANTLR 4 Reference" by Terence Parr */
// Derived from http://json.org
grammar JSON;
json
: value
;
obj
: '{' pair (',' pair)* '}'
| '{' '}'
;
pair
: STRING ':' value
;
arr
: '[' value (',' value)* ']'
| '[' ']'
;
value
: STRING
| NUMBER
| obj
| arr
| 'true'
| 'false'
| 'null'
;
STRING
: '"' (ESC | SAFECODEPOINT)* '"'
;
fragment ESC
: '\\' (["\\/bfnrt] | UNICODE)
;
fragment UNICODE
: 'u' HEX HEX HEX HEX
;
fragment HEX
: [0-9a-fA-F]
;
fragment SAFECODEPOINT
: ~ ["\\\u0000-\u001F]
;
NUMBER
: '-'? INT ('.' [0-9] +)? EXP?
;
fragment INT
: '0' | [1-9] [0-9]*
;
// no leading zeros
fragment EXP
: [Ee] [+\-]? INT
;
// \- since - means "range" inside [...]
WS
: [ \t\n\r] + -> skip
;
```

Would it be possible to translate that to EBNF, and then to Participle? Does EBNF differentiate between lexer and parser tokens? Can it specify that a particular lexer token is discarded? Recursive tokens? (Not that I have figured out supporting that yet, will likely need to write a custom lexer.) Any other sharp edges you can think of?

In the meantime, I've made some progress on the direct Antlr -> Participle, for example my generator outputs a working lexer for JSON based on the above grammar. The parse objects are a bit trickier to get right.

EDIT: You can also define sublexers in Antlr, similar to Participle's stateful lexer. Can that be expressed in EBNF?
-
Good point re. lexers. I punted on that by just ignoring the lexer part, but for a generalised solution that would not suffice. The current EBNF has no concept of lexing, so it would definitely need to be extended. How does Antlr describe sub-lexers? I think rather than blocking you on that, go ahead and do the antlr2participle. In the meantime I'll extend the EBNF format so that once your code is merged I can adapt it to the new EBNF format.
-
@alecthomas I'd like your opinion on one or two points if you have a moment. Take, for example, the JSON Antlr file. If we have this rule:

```antlr
value
: STRING
| NUMBER
| obj
| arr
| 'true'
| 'false'
| 'null'
;
```

What would be the maximally correct translation to Participle?

```go
// One possibility
type Value struct {
STRING *string `@STRING`
NUMBER *string `| @NUMBER`
Obj *Obj `| @@`
Arr *Arr `| @@`
True bool `| @'true'`
False bool `| @'false'`
Null bool `| @'null'`
}
// Option 2
type Value struct {
STRING *string `@STRING`
NUMBER *string `| @NUMBER`
Obj *Obj `| @@`
Arr *Arr `| @@`
Literal *string `| @( 'true' | 'false' | 'null' )`
}
```

It's a bit interesting because literals in parser rules are often punctuation that you're not interested in capturing. Take, for example, this from the TSQL grammar:

```antlr
throw_statement
: THROW (throw_error_number ',' throw_message ',' throw_state)? ';'?
;
```

You don't care about those commas, which is quite a different scenario from the JSON rule above. I can't naively turn every literal in an Antlr parser rule into a capture.

There is a way in Antlr to specify that you care about a specific piece of a rule, by applying a "label", as in these rules:

```antlr
// Give the two sql_clauses fields different names:
try_catch_statement
: BEGIN TRY ';'? try_clauses=sql_clauses+ END TRY ';'? BEGIN CATCH ';'? catch_clauses=sql_clauses* END CATCH ';'?
;
// Capture the literal '=' into a boolean field named "eq", if the first alternate matches.
expression_elem
: leftAlias=column_alias eq='=' leftAssignment=expression
| expressionAs=expression as_column_alias?
;
```

That's fine, but in the JSON example none of the literals are labeled. So what to do? It might be worth surveying more .g4 files to see how these rules tend to be put together. My initial instinct is: if a top-level alternate of a parser rule contains nothing that would be captured (i.e. just literals), then the entire alternate should be captured as a boolean (which may involve grouping the elements of the alternate).
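To make that concrete, here's roughly the shape I'd expect a translation of throw_statement to take if punctuation literals become plain, non-capturing matches while the optional ';' (a literal-only piece) lands in a boolean. The field names, sub-rule types and exact tags are a sketch, not actual generator output:

```go
type ThrowStatement struct {
	// THROW and the commas are matched but not captured; the sub-rules are.
	ErrorNumber *ThrowErrorNumber `THROW ( @@`
	Message     *ThrowMessage     `',' @@`
	State       *ThrowState       `',' @@ )?`
	// The optional trailing semicolon is captured as presence/absence.
	Semicolon   bool              `@';'?`
}
```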
-
Got a working JSON parser generated from the Antlr grammar. Seem alright?

Source:

```antlr
/** Taken from "The Definitive ANTLR 4 Reference" by Terence Parr */
// Derived from http://json.org
grammar JSON;
json
: value
;
obj
: '{' pair (',' pair)* '}'
| '{' '}'
;
pair
: STRING ':' value
;
arr
: '[' value (',' value)* ']'
| '[' ']'
;
value
: STRING
| NUMBER
| obj
| arr
| 'true'
| 'false'
| 'null'
;
STRING
: '"' (ESC | SAFECODEPOINT)* '"'
;
fragment ESC
: '\\' (["\\/bfnrt] | UNICODE)
;
fragment UNICODE
: 'u' HEX HEX HEX HEX
;
fragment HEX
: [0-9a-fA-F]
;
fragment SAFECODEPOINT
: ~ ["\\\u0000-\u001F]
;
NUMBER
: '-'? INT ('.' [0-9] +)? EXP?
;
fragment INT
: '0' | [1-9] [0-9]*
;
// no leading zeros
fragment EXP
: [Ee] [+\-]? INT
;
// \- since - means "range" inside [...]
WS
: [ \t\n\r] + -> skip
;
```

Result:

```go
package json

import (
"github.com/alecthomas/participle/v2"
"github.com/alecthomas/participle/v2/lexer/stateful"
)
var (
Lexer = stateful.Must(stateful.Rules{
"Root": {
{"STRING", `"(\\(["\\/bfnrt]|u[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F])|[^"\\\x{0000}-\x{001F}])*"`, nil},
{"NUMBER", `-?(0|[1-9][0-9]*)(\.[0-9]+)?([Ee][+\-]?(0|[1-9][0-9]*))?`, nil},
{"ws", `[ \t\n\r]+`, nil},
{"XXX__LITERAL_,", `,`, nil},
{"XXX__LITERAL_:", `:`, nil},
{"XXX__LITERAL_[", `\[`, nil},
{"XXX__LITERAL_]", `\]`, nil},
{"XXX__LITERAL_false", `false`, nil},
{"XXX__LITERAL_null", `null`, nil},
{"XXX__LITERAL_true", `true`, nil},
{"XXX__LITERAL_{", `\{`, nil},
{"XXX__LITERAL_}", `\}`, nil},
},
})
Parser = participle.MustBuild(
&Json{},
participle.Lexer(Lexer),
participle.UseLookahead(2),
)
)
type Json struct {
Value *Value `@@`
}
type Obj struct {
Pair []*Pair `'{' @@ ( ',' @@ )* '}' | '{' '}'`
}
type Pair struct {
String *string `@STRING`
Value *Value `':' @@`
}
type Arr struct {
Value []*Value `'[' @@ ( ',' @@ )* ']' | '[' ']'`
}
type Value struct {
String *string `@STRING`
Number *string `| @NUMBER`
Obj *Obj `| @@`
Arr *Arr `| @@`
True bool `| @'true'`
False bool `| @'false'`
Null bool `| @'null'`
}
```
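If it helps to judge the output, here's a minimal usage sketch against the generated parser above (the input document is arbitrary, and this assumes the package compiles as shown):

```go
var ast Json
err := Parser.ParseString("", `{"name": "participle", "ok": true, "tags": [1, 2]}`, &ast)
if err != nil {
	panic(err)
}
// ast.Value.Obj holds the pairs; the literals true/false/null end up in the bool fields.
```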
-
Mind jumping into the Slack channel linked in the README? Discussions is not that great for this kind of back and forth.
-
I'm starting to conclude that this may be a Hard Problem™. To recap on the TSQL grammar, the last major hurdle was that it had ambiguous lexer token rules:

```antlr
// Note that this grammar is case-insensitive.
R: 'R';
ID: ( [A-Z_#] | FullWidthLetter) ( [A-Z_#$@0-9] | FullWidthLetter )*;
```

You can see how the `R` token is used in parser rules such as this one:

```antlr
alter_external_library
: ALTER EXTERNAL LIBRARY library_name=id_ (AUTHORIZATION owner_name=id_)?
(SET|ADD) ( LR_BRACKET CONTENT EQUAL (client_library=STRING | BINARY | NONE) (COMMA PLATFORM EQUAL (WINDOWS|LINUX)? RR_BRACKET)
WITH (COMMA? LANGUAGE EQUAL (R|PYTHON) | DATA_SOURCE EQUAL external_data_source_name=id_ )+ RR_BRACKET )
;
```

The problem was that the two rules overlap: a bare `R` matches both, and an identifier that merely starts with `R` could otherwise be split after its first character. To resolve this, I implemented a `MatchLongest` option for the stateful lexer (you can see it used in the snippet below).

Taking a break from TSQL, I moved on to the Dart grammar. The generator had relatively few problems and I quickly had a Participle grammar, however when I went to parse some Dart files I found a different set of problematic rules:

```antlr
NEWLINE: '\n' | '\r' | '\r\n' ;
WHITESPACE: [ \t\r\n\u000C]+ -> skip ;
```

Snippets of generated Go code:

```go
Rules = stateful.Rules{
"Root": {
{"NEWLINE", `(\n|\r|\r\n)`, nil},
{"whitespace", `[ \t\r\n\x{000C}]+`, nil},
},
}
Lexer = stateful.Must(Rules, stateful.MatchLongest())
// ...
type ScriptTag struct {
NotNewline *string `'#!' ( @!NEWLINE )*`
Newline *string `@NEWLINE`
}
```

Given the character overlap between `NEWLINE` and `WHITESPACE`, I thought of perhaps "promoting" literal lexer tokens present in the parser rules:

```go
// Instead of
type AlterExternalLibrary struct {
// ...
Python *string `( @PYTHON`
R *string `| @R )`
// ...
}
// You would have
type AlterExternalLibrary struct {
// ...
Language *string `@( 'PYTHON' | 'R' )`
// ...
}
```

That would have solved the TSQL problem, albeit by making all such tokens into plain literals. (I may still want to perform that operation, potentially as an option, since I believe it will make for much more readable structs in the case of certain grammars.)

I looked into how Antlr manages to deal with these things. From browsing the FAQ and skimming the technical paper, it seems that Antlr4 can handle any arbitrary grammar because it analyzes the input at runtime and determines possible resolution paths, caching the determination as it goes. Essentially, it's a lot more sophisticated than applying a set of regexes and seeing what matches, or even what the longest match is.

Participle isn't, and really shouldn't be, as complex as Antlr, so it's starting to become a matter of determining tradeoffs. When does an Antlr grammar need to be adjusted before transforming it to Participle? Is this something that can be computed?

I'm trying to determine how I would even resolve the ambiguity in the Dart grammar if I were hand-coding it in Participle. Likely it could be done by pushing & popping a group in the stateful lexer, possibly with a backref.
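For the Dart script-tag case specifically, here's a hand-written sketch of the push/pop idea (state and rule names are mine, not generator output, and the rest of the Dart rules are elided):

```go
// Assumes the same lexer/stateful import as the generated code above.
var ScriptTagLexer = stateful.Must(stateful.Rules{
	"Root": {
		// '#!' switches into a state where the newline is a significant token.
		{"ScriptTagStart", `#!`, stateful.Push("ScriptTag")},
		{"whitespace", `[ \t\r\n\x{000C}]+`, nil},
		// ... the remaining generated Dart rules would follow here ...
	},
	"ScriptTag": {
		// The newline terminates the script tag and returns to the normal rules.
		{"NEWLINE", `\r\n|\r|\n`, stateful.Pop()},
		{"ScriptTagBody", `[^\r\n]+`, nil},
	},
})
```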
-
Conversion question: Antlr4 supports left-recursive rules, and Participle does not. Therefore it may be prudent to look into detecting left-recursive rules and restructuring them, if this is possible. Antlr4 can also mark a recursive alternative as right-associative. You can see both kinds in this sample from the Lua grammar:

```antlr
exp
: 'nil' | 'false' | 'true'
| number
| string
| '...'
| functiondef
| prefixexp
| tableconstructor
| <assoc=right> exp operatorPower exp
| operatorUnary exp
| exp operatorMulDivMod exp
| exp operatorAddSub exp
| <assoc=right> exp operatorStrcat exp
| exp operatorComparison exp
| exp operatorAnd exp
| exp operatorOr exp
| exp operatorBitwise exp
;
```

I have not yet managed to think about associativity in grammars & ASTs without making my head hurt. I'm wondering, @alecthomas, whether you have a grasp of what I would need to do to rewrite the above rule into a set of rules that would then convert cleanly to Participle?

EDIT: Ah, this is a whole subject: https://en.wikipedia.org/wiki/Left_recursion
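For reference, my rough understanding of the textbook rewrite, sketched for just one precedence level (type and field names are mine, and the repeated tail still has to be folded into a tree with the correct associativity, which is where `<assoc=right>` would matter):

```go
// exp_addsub : exp_primary ( ('+' | '-') exp_primary )* ;   (left recursion removed)
type ExpAddSub struct {
	Left *ExpPrimary   `@@`
	Tail []*AddSubTail `@@*`
}

type AddSubTail struct {
	Op    string       `@('+' | '-')`
	Right *ExpPrimary  `@@`
}

// Stand-in for the next-higher precedence level; numbers only, for brevity,
// assuming a NUMBER token from the lexer.
type ExpPrimary struct {
	Number *string `@NUMBER`
}
```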
-
There are a lot of grammars available for Antlr4; I think something like 200 in this repo alone. However, the experience of using it with Go is pretty terrible.
So, I used Participle to write a parser for Antlr v4 grammars:
Before I set up a PR, I'm wondering if this is something you would be interested in having in the `examples` folder.

Further, I have also made some progress on generating a Participle parser based on the above AST. I'm wondering if you think something like that would have a home here.