-
Definitely! I'm not sure what you mean, though: how is "generating a parser" different to what you have now?
-
Oh do you mean generating a Participle parser from an ANTLR .g4 file via this parser? Great question and coincidentally I had just started doing something very similar to this, though my AST isn't anywhere near as complete as yours:
However I stopped this and started on a different approach. I think the idea of generating a Participle parser from an ANTLR grammar is great. That said, the path I'd prefer it to take would be to translate it to Participle's EBNF form, then write an ebnf2participle code generator that takes that EBNF and outputs a Participle grammar. This would result in Participle's EBNF being the lingua franca and close #14. Would you be interested in doing this? The parser for Participle's EBNF already exists in participle/ebnf.
-
What I'd like to end up with is something like this:
Then of course the same could potentially be done for other parser generators, maybe yacc+lex, etc.
-
Yeah, exactly 😄
Neat! (Are those files available anywhere, or just on your local? Didn't see them in the repo.)
I do see the potential advantage to going via EBNF, however I have mixed feelings on this. I suppose my basic concern is that I'm not familiar with EBNF. I've spent the past few days studying ANTLR syntax, but I have no idea if it can cleanly map to EBNF. I would need to read up on it, I suppose. For the sake of discussion, let's take an example, the GraphQL EBNF you have in the readme:

```ebnf
File = Entry* .
Entry = Type | Schema | Enum | "scalar" ident .
Type = "type" ident ("implements" ident)? "{" Field* "}" .
Field = ident ("(" (Argument ("," Argument)*)? ")")? ":" TypeRef ("@" ident)? .
Argument = ident ":" TypeRef ("=" Value)? .
TypeRef = "[" TypeRef "]" | ident "!"? .
Value = ident .
Schema = "schema" "{" Field* "}" .
Enum = "enum" ident "{" ident* "}" .
```

What is this in ANTLR?

```antlr
file: entry* ;
entry: type | schema | enum | 'scalar' IDENT ;
type: 'type' IDENT ('implements' IDENT)? '{' field* '}' ;
field: IDENT ('(' (argument (',' argument)*)? ')')? ':' type_ref ('@' IDENT)? ;
argument: IDENT ':' type_ref ('=' value)? ;
type_ref: '[' type_ref ']' | IDENT '!'? ;
value: IDENT ;
schema: 'schema' '{' field* '}' ;
enum: 'enum' IDENT '{' IDENT* '}' ;
```

That seems pretty clean, and is legal in a `.g4` file. For a fuller example, take the ANTLR JSON grammar:

```antlr
/** Taken from "The Definitive ANTLR 4 Reference" by Terence Parr */
// Derived from http://json.org
grammar JSON;
json
: value
;
obj
: '{' pair (',' pair)* '}'
| '{' '}'
;
pair
: STRING ':' value
;
arr
: '[' value (',' value)* ']'
| '[' ']'
;
value
: STRING
| NUMBER
| obj
| arr
| 'true'
| 'false'
| 'null'
;
STRING
: '"' (ESC | SAFECODEPOINT)* '"'
;
fragment ESC
: '\\' (["\\/bfnrt] | UNICODE)
;
fragment UNICODE
: 'u' HEX HEX HEX HEX
;
fragment HEX
: [0-9a-fA-F]
;
fragment SAFECODEPOINT
: ~ ["\\\u0000-\u001F]
;
NUMBER
: '-'? INT ('.' [0-9] +)? EXP?
;
fragment INT
: '0' | [1-9] [0-9]*
;
// no leading zeros
fragment EXP
: [Ee] [+\-]? INT
;
// \- since - means "range" inside [...]
WS
: [ \t\n\r] + -> skip
;
```

Would it be possible to translate that to EBNF, and then to Participle? Does EBNF differentiate between lexer and parser tokens? Can it specify that a particular lexer token is discarded? Recursive tokens? (Not that I have figured out supporting that yet, will likely need to write a custom lexer.) Any other sharp edges you can think of?

In the meantime, I've made some progress on the direct Antlr -> Participle, for example my generator outputs a working lexer for JSON based on the above grammar. The parse objects are a bit trickier to get right.

EDIT: You can also define sublexers in Antlr, similar to Participle's stateful lexer. Can that be expressed in EBNF?
-
Good point re. lexers. I punted on that by just ignoring the lexer part, but for a generalised solution that would not suffice. The current EBNF has no concept of lexing, so it would definitely need to be extended. How does Antlr describe sub-lexers? I think rather than blocking you on that, go ahead and do the antlr2participle. In the meantime I'll extend the EBNF format so that once your code is merged I can adapt it to the new EBNF format.
-
@alecthomas I'd like your opinion on one or two points if you have a moment. Take, for example, the JSON Antlr file. If we have this rule:

```antlr
value
: STRING
| NUMBER
| obj
| arr
| 'true'
| 'false'
| 'null'
;
```

What would be the maximally correct translation to Participle?

```go
// One possibility
type Value struct {
STRING *string `@STRING`
NUMBER *string `| @NUMBER`
Obj *Obj `| @@`
Arr *Arr `| @@`
True bool `| @'true'`
False bool `| @'false'`
Null bool `| @'null'`
}
// Option 2
type Value struct {
STRING *string `@STRING`
NUMBER *string `| @NUMBER`
Obj *Obj `| @@`
Arr *Arr `| @@`
Literal *string `| @( 'true' | 'false' | 'null' )`
}
```

It's a bit interesting because literals in parser rules are often punctuation that you're not interested in capturing. Take, for example, this from the TSQL grammar:

```antlr
throw_statement
: THROW (throw_error_number ',' throw_message ',' throw_state)? ';'?
;
```

You don't care about those commas, which is quite a different scenario from the JSON rule above. I can't naively turn every literal in an Antlr parser rule into a capture.

There is a way in Antlr to specify that you care about a specific piece of a rule, by applying a "label", as in these rules:

```antlr
// Give the two sql_clauses fields different names:
try_catch_statement
: BEGIN TRY ';'? try_clauses=sql_clauses+ END TRY ';'? BEGIN CATCH ';'? catch_clauses=sql_clauses* END CATCH ';'?
;
// Capture the literal '=' into a boolean field named "eq", if the first alternate matches.
expression_elem
: leftAlias=column_alias eq='=' leftAssignment=expression
| expressionAs=expression as_column_alias?
;
```

That's fine, but in the JSON example none of the literals are labeled. So what to do? It might be worth surveying more .g4 files to see how these rules tend to be put together. My initial instinct is: if a top-level alternate of a parser rule contains nothing that would be captured (i.e. just literals), then the entire alternate should be captured as a boolean (which may involve grouping the elements of the alternate).
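To make that concrete, here's roughly the shape I'd expect a translation of throw_statement to take if punctuation literals become plain, non-capturing matches while the optional ';' (a literal-only piece) lands in a boolean. The field names, sub-rule types and exact tags are a sketch, not actual generator output:

```go
type ThrowStatement struct {
	// THROW and the commas are matched but not captured; the sub-rules are.
	ErrorNumber *ThrowErrorNumber `THROW ( @@`
	Message     *ThrowMessage     `',' @@`
	State       *ThrowState       `',' @@ )?`
	// The optional trailing semicolon is captured as presence/absence.
	Semicolon   bool              `@';'?`
}
```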
-
Got a working JSON parser generated from the Antlr grammar. Seem alright?

Source:

```antlr
/** Taken from "The Definitive ANTLR 4 Reference" by Terence Parr */
// Derived from http://json.org
grammar JSON;
json
: value
;
obj
: '{' pair (',' pair)* '}'
| '{' '}'
;
pair
: STRING ':' value
;
arr
: '[' value (',' value)* ']'
| '[' ']'
;
value
: STRING
| NUMBER
| obj
| arr
| 'true'
| 'false'
| 'null'
;
STRING
: '"' (ESC | SAFECODEPOINT)* '"'
;
fragment ESC
: '\\' (["\\/bfnrt] | UNICODE)
;
fragment UNICODE
: 'u' HEX HEX HEX HEX
;
fragment HEX
: [0-9a-fA-F]
;
fragment SAFECODEPOINT
: ~ ["\\\u0000-\u001F]
;
NUMBER
: '-'? INT ('.' [0-9] +)? EXP?
;
fragment INT
: '0' | [1-9] [0-9]*
;
// no leading zeros
fragment EXP
: [Ee] [+\-]? INT
;
// \- since - means "range" inside [...]
WS
: [ \t\n\r] + -> skip
;
```

Result:

```go
package json

import (
"github.com/alecthomas/participle/v2"
"github.com/alecthomas/participle/v2/lexer/stateful"
)
var (
Lexer = stateful.Must(stateful.Rules{
"Root": {
{"STRING", `"(\\(["\\/bfnrt]|u[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F])|[^"\\\x{0000}-\x{001F}])*"`, nil},
{"NUMBER", `-?(0|[1-9][0-9]*)(\.[0-9]+)?([Ee][+\-]?(0|[1-9][0-9]*))?`, nil},
{"ws", `[ \t\n\r]+`, nil},
{"XXX__LITERAL_,", `,`, nil},
{"XXX__LITERAL_:", `:`, nil},
{"XXX__LITERAL_[", `\[`, nil},
{"XXX__LITERAL_]", `\]`, nil},
{"XXX__LITERAL_false", `false`, nil},
{"XXX__LITERAL_null", `null`, nil},
{"XXX__LITERAL_true", `true`, nil},
{"XXX__LITERAL_{", `\{`, nil},
{"XXX__LITERAL_}", `\}`, nil},
},
})
Parser = participle.MustBuild(
&Json{},
participle.Lexer(Lexer),
participle.UseLookahead(2),
)
)
type Json struct {
Value *Value `@@`
}
type Obj struct {
Pair []*Pair `'{' @@ ( ',' @@ )* '}' | '{' '}'`
}
type Pair struct {
String *string `@STRING`
Value *Value `':' @@`
}
type Arr struct {
Value []*Value `'[' @@ ( ',' @@ )* ']' | '[' ']'`
}
type Value struct {
String *string `@STRING`
Number *string `| @NUMBER`
Obj *Obj `| @@`
Arr *Arr `| @@`
True bool `| @'true'`
False bool `| @'false'`
Null bool `| @'null'`
}
```
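If it helps to judge the output, here's a minimal usage sketch against the generated parser above (the input document is arbitrary, and this assumes the package compiles as shown):

```go
var ast Json
err := Parser.ParseString("", `{"name": "participle", "ok": true, "tags": [1, 2]}`, &ast)
if err != nil {
	panic(err)
}
// ast.Value.Obj holds the pairs; the literals true/false/null end up in the bool fields.
```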
-
Mind jumping into the Slack channel linked in the README? Discussions is not that great for this kind of back and forth.
-
I'm starting to conclude that this may be a Hard Problem™. To recap on the TSQL grammar, the last major hurdle was that it had ambiguous lexer token rules:

```antlr
// Note that this grammar is case-insensitive.
R: 'R';
ID: ( [A-Z_#] | FullWidthLetter) ( [A-Z_#$@0-9] | FullWidthLetter )*;
```

You can see how the `R` token is used in parser rules such as this one:

```antlr
alter_external_library
: ALTER EXTERNAL LIBRARY library_name=id_ (AUTHORIZATION owner_name=id_)?
(SET|ADD) ( LR_BRACKET CONTENT EQUAL (client_library=STRING | BINARY | NONE) (COMMA PLATFORM EQUAL (WINDOWS|LINUX)? RR_BRACKET)
WITH (COMMA? LANGUAGE EQUAL (R|PYTHON) | DATA_SOURCE EQUAL external_data_source_name=id_ )+ RR_BRACKET )
;
```

The problem was that the two rules overlap: a bare `R` matches both, and an identifier that merely starts with `R` could otherwise be split after its first character. To resolve this, I implemented a `MatchLongest` option for the stateful lexer (you can see it used in the snippet below).

Taking a break from TSQL, I moved on to the Dart grammar. The generator had relatively few problems and I quickly had a Participle grammar, however when I went to parse some Dart files I found a different set of problematic rules:

```antlr
NEWLINE: '\n' | '\r' | '\r\n' ;
WHITESPACE: [ \t\r\n\u000C]+ -> skip ;
```

Snippets of generated Go code:

```go
Rules = stateful.Rules{
"Root": {
{"NEWLINE", `(\n|\r|\r\n)`, nil},
{"whitespace", `[ \t\r\n\x{000C}]+`, nil},
},
}
Lexer = stateful.Must(Rules, stateful.MatchLongest())
// ...
type ScriptTag struct {
NotNewline *string `'#!' ( @!NEWLINE )*`
Newline *string `@NEWLINE`
}
```

Given the character overlap between `NEWLINE` and `WHITESPACE`, I thought of perhaps "promoting" literal lexer tokens present in the parser rules:

```go
// Instead of
type AlterExternalLibrary struct {
// ...
Python *string `( @PYTHON`
R *string `| @R )`
// ...
}
// You would have
type AlterExternalLibrary struct {
// ...
Language *string `@( 'PYTHON' | 'R' )`
// ...
}
```

That would have solved the TSQL problem, albeit by making all such tokens into plain literals. (I may still want to perform that operation, potentially as an option, since I believe it will make for much more readable structs in the case of certain grammars.)

I looked into how Antlr manages to deal with these things. From browsing the FAQ and skimming the technical paper, it seems that Antlr4 can handle any arbitrary grammar because it analyzes the input at runtime and determines possible resolution paths, caching the determination as it goes. Essentially, it's a lot more sophisticated than applying a set of regexes and seeing what matches, or even what the longest match is.

Participle isn't, and really shouldn't be, as complex as Antlr, so it's starting to become a matter of determining tradeoffs. When does an Antlr grammar need to be adjusted before transforming it to Participle? Is this something that can be computed?

I'm trying to determine how I would even resolve the ambiguity in the Dart grammar if I were hand-coding it in Participle. Likely it could be done by pushing & popping a group in the stateful lexer, possibly with a backref.
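For the Dart script-tag case specifically, here's a hand-written sketch of the push/pop idea (state and rule names are mine, not generator output, and the rest of the Dart rules are elided):

```go
// Assumes the same lexer/stateful import as the generated code above.
var ScriptTagLexer = stateful.Must(stateful.Rules{
	"Root": {
		// '#!' switches into a state where the newline is a significant token.
		{"ScriptTagStart", `#!`, stateful.Push("ScriptTag")},
		{"whitespace", `[ \t\r\n\x{000C}]+`, nil},
		// ... the remaining generated Dart rules would follow here ...
	},
	"ScriptTag": {
		// The newline terminates the script tag and returns to the normal rules.
		{"NEWLINE", `\r\n|\r|\n`, stateful.Pop()},
		{"ScriptTagBody", `[^\r\n]+`, nil},
	},
})
```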
-
Conversion question: Antlr4 supports left-recursive rules, and Participle does not. Therefore it may be prudent to look into detecting left-recursive rules and restructuring them, if this is possible. Antlr4 can also mark a recursive alternative as right-associative. You can see both kinds in this sample from the Lua grammar:

```antlr
exp
: 'nil' | 'false' | 'true'
| number
| string
| '...'
| functiondef
| prefixexp
| tableconstructor
| <assoc=right> exp operatorPower exp
| operatorUnary exp
| exp operatorMulDivMod exp
| exp operatorAddSub exp
| <assoc=right> exp operatorStrcat exp
| exp operatorComparison exp
| exp operatorAnd exp
| exp operatorOr exp
| exp operatorBitwise exp
;
```

I have not yet managed to think about associativity in grammars & ASTs without making my head hurt. I'm wondering, @alecthomas, whether you have a grasp of what I would need to do to rewrite the above rule into a set of rules that would then convert cleanly to Participle?

EDIT: Ah, this is a whole subject: https://en.wikipedia.org/wiki/Left_recursion
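For reference, my rough understanding of the textbook rewrite, sketched for just one precedence level (type and field names are mine, and the repeated tail still has to be folded into a tree with the correct associativity, which is where `<assoc=right>` would matter):

```go
// exp_addsub : exp_primary ( ('+' | '-') exp_primary )* ;   (left recursion removed)
type ExpAddSub struct {
	Left *ExpPrimary   `@@`
	Tail []*AddSubTail `@@*`
}

type AddSubTail struct {
	Op    string       `@('+' | '-')`
	Right *ExpPrimary  `@@`
}

// Stand-in for the next-higher precedence level; numbers only, for brevity,
// assuming a NUMBER token from the lexer.
type ExpPrimary struct {
	Number *string `@NUMBER`
}
```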
-
There are a lot of grammars available for Antlr4; I think something like 200 in this repo alone. However, the experience of using it with Go is pretty terrible.
So, I used Participle to write a parser for Antlr v4 grammars:
Before I set up a PR, I'm wondering if this is something you would be interested in having in the `examples` folder.

Further, I have also made some progress on generating a Participle parser based on the above AST. I'm wondering if you think something like that would have a home here.