Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PEG syntax considerations #17

Merged
merged 4 commits into from
Oct 19, 2023
Merged

PEG syntax considerations #17

merged 4 commits into from
Oct 19, 2023

Conversation

jontxu
Copy link
Member

@jontxu jontxu commented Oct 18, 2023

(I added @gmantele and @msdemlei as reviewers for the time being, feel free to add whomever you feel adequate.)

Hi all,

as PEG adoption for the ADQL grammar definition draws near(er), I have been testing different parser generators recently. Namely:

I discovered by trial and error, that their definition of PEG and its syntax differ from the one defined by Bryan Ford in the original paper (cf. Figure 1 and Table 1). In other words, the original syntax looks like a formal definition rather than a standard one -- there doesn't seem to be any, actually. Other tools much more idiomatic tools exist: i.e., fitting the respective programming language.

These deviations range from aesthetical changes (separators and delimiters) to extensions and in some cases syntactic sugar -- you can see them on the links above. To give a couple of examples:

  • Mouse only allows rules to be defined in camelCase, due to the underscore character _ being an extra operator.
  • Mouse having issues with character classes containing multiple groups, that is: the formal rule <- [a-zA-Z] becoming rule = [a-z] / [A-Z]
  • Negation: peg(1), allows character classes to be negated with [^s], the Mouse counterpart being ^[s], in both cases instead of the more formal ![s]

In addition, both also define the first rule as the root of the parse tree... unlike arpeggio used by lyonetia.

I think that, for completeness purposes, following the formal definition would be the sensible decision, eventually adding some clarifications to deployers regarding the disparities between different generators and optionally offer some tools for conversion.


In this case, I have updated the adql2.1.peg file to reflect the syntax as defined by Bryan Ford, and made the required changes to the testadql.py file so the conversion to the arpeggio compliant PEG syntax does not fail. Sadly, tests fail, which we would need to check.

Finally, other errors encountered by the parsers are:

  • The rule numeric_primary is redefined fixed: see 486116d
  • The rules ANY_CHAR, unsigned_hexadecimal, numeric_expression, string_expression, string_function, bitwise_expression, geometry_function are defined but not used

@jontxu jontxu requested review from msdemlei and gmantele October 18, 2023 12:04
@Zarquan
Copy link
Member

Zarquan commented Oct 18, 2023

For reference, Brian Ford's paper has a DOI https://doi.org/10.1145/982962.964011.

@Zarquan
Copy link
Member

Zarquan commented Oct 18, 2023

I agree in the absence of a standard, it is probably better to stay close with the original paper rather than an implementation specific variant.

Given that we chose PEG because it is machine readable, would it be difficult to write a conversion tool that took a BrianFord compatible grammar and converted it into a form that the other tools could read ?

@jontxu
Copy link
Member Author

jontxu commented Oct 18, 2023

Given that we chose PEG because it is machine readable, would it be difficult to write a conversion tool that took a BrianFord compatible grammar and converted it into a form that the other tools could read ?

Yes... but it can get sort of tricky, as it depends on the tool. Take into account that we are talking about a small subset of the tools mentioned on Bryan Ford's page. The more idiomatic ones rely on the language itself and have got functions like or(...), any(...), not(...) and such which I assume are then consumed by the packrat parsers.
What we should aim for, I think, is to encourage people to avoid (whenever possible) using such idiomatic tools -- in a way that any update to the grammar does not involve a major change in their respective source codes. Or, additionally, having an appendix with a list of conversors: some IVOA-endorsed, others provided by the deployers themselves.

As for the character groups, as said above, Mouse seems to favour the precedence operator when multiple sets are involved (when the opposite is actually recommended). For arpeggio, the same character sets are defined as python regexes. Luckily, the grammar included in the pull requests works out-of-the-box with peg(1)... but it has been unmaintained since 2017 -- or at least no new releases have happened since.

In short, given a rule hex_digit <- [0-9a-fA-F]:

  • peg(1) / formal PEG: hex_digit <- [0-9a-fA-F]
  • arpeggio: hex_digit <- r'[0-9a-fA-F]'; (I am not entirely sure whether the r is necessary in this case, to be fair)
  • Mouse: hexDigit = [0-9] / [a-f] / [A-F];

By the way, I actually forgot to add one of the dealbreakers (so to speak) for Mouse, where it uses the quotation marks and apostrophes in the same way Java does: 's' for a character, "String" for a string.

@jontxu jontxu force-pushed the peg-considerations branch from bec6712 to 486116d Compare October 18, 2023 13:33
Copy link
Contributor

@msdemlei msdemlei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling these. Some minor nits, and perhaps more seriously: how can this grammar work without numeric_primary?

src/peg/adql2.1.peg Outdated Show resolved Hide resolved
@@ -482,10 +477,6 @@ user_defined_function <-
(_ value_expression (_ ',' _ value_expression)* _)?
')'

numeric_primary <-
value_expression_primary
/ numeric_value_function
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hu? How can this disapear without breaking the grammar?

Copy link
Member Author

@jontxu jontxu Oct 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was duplicated for some unknown reason: line 485 and line 558.

peg_rules = re.sub("\\[", "r'[", peg_rules)
peg_rules = re.sub("\\!r'\\[", "r'[^", peg_rules)
peg_rules = re.sub("\\]", "]'", peg_rules)
peg_rules = re.sub("EOF <-[^;]*;", "", peg_rules)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I'm too enthused by hacking the grammar syntax with regular expressions. For now, that's probably all right, but we should have a plan to parse our grammar into a tree and then serialise it in the various syntaxes, I guess. I just wonder if someone else might not already have written something like that...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's indeed a very interesting idea and much stable/reliable than regular expressions, though apparently this is enough and allowed us to quickly test this grammar with different parsers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hacking is to fit arpeggio, which asks for regular expressions: documentation on grammars.

Regex matches are given as strings with prefix r (e.g. r'\d*\.\d*|\d+').

We'd need to think about an alternate toolset which doesn't rely on them to test the ADQL examples.

@gmantele
Copy link
Member

As discussed privately, there is another parser generator called Canopy that can generate a parser in 4 languages: Java, Javascript, Python and Ruby.

@jontxu
Copy link
Member Author

jontxu commented Oct 18, 2023

As discussed privately, there is another parser generator called Canopy that can generate a parser in 4 languages: Java, Javascript, Python and Ruby.

To add to this, the major differences with the formal PEG grammar are:

  1. A declaration in the first line with the format grammar <Name>.
  2. Underscore characters needing whitespaces around non-identifier characters: _) becomes _ ), _' becomes _ ' and such.

I was successfully able to generate parsers on all languages with these changes.


EDIT: after the generated python parser complained about reserved keywords and fixing the PEG grammar accordingly (locally), it now returns a IndentationError: too many levels of indentation error.

Copy link
Contributor

@msdemlei msdemlei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's merge

@jontxu jontxu merged commit ecfeb21 into ivoa:master Oct 19, 2023
1 of 2 checks passed
@jontxu
Copy link
Member Author

jontxu commented Oct 19, 2023

Merge done.

Although I think this should be discussed further elsewhere: github issue, mailing list...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants