Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Precedence parsing #1362

Merged
merged 32 commits into from
Dec 8, 2024
Merged

Conversation

cenodis
Copy link
Contributor

@cenodis cenodis commented Aug 15, 2021

This PR introduces a new module with parsers that can handle operator precedence in expressions. It was written with the intent to allow quick and easy translation of a usual operator precedence table into a nom parser. Individual operators and operands are parsed via nom parsers allowing for easy integration with existing parsers.

Example usage (very simple integer calculator):

fn parser(i: &str) -> IResult<&str, i64> {
  precedence(
    unary_op(1, tag("-")), //prefix operators
    fail, //postfix operators
    alt(( //binary operators
      binary_op(2, Assoc::Left, tag("*")),
      binary_op(2, Assoc::Left, tag("/")),
      binary_op(3, Assoc::Left, tag("+")),
      binary_op(3, Assoc::Left, tag("-")),
    )),
    alt(( //operands
      map_res(digit1, |s: &str| s.parse::<i64>()),
      delimited(tag("("), parser, tag(")")), //subexpression handled via recursion
    )),
    |op: Operation<&str, &str, &str, i64>| { //evaluating the expression step by step
      use nom::precedence::Operation::*;
      match op {
        Prefix("-", o) => Ok(-o),
        Binary(lhs, "*", rhs) => Ok(lhs * rhs),
        Binary(lhs, "/", rhs) => Ok(lhs / rhs),
        Binary(lhs, "+", rhs) => Ok(lhs + rhs),
        Binary(lhs, "-", rhs) => Ok(lhs - rhs),
        _ => Err("Invalid combination"),
      }
    }
  )(i)
}

TODOs

  • Better documentation.
    All elements of this module are documented and I cant think of anything more to add.
    I have done my best to write good documentation but some feedback would be appreciated. I still feel like some parts could be improved but I have no concrete ideas right now.
  • Negative tests.
    Tests for parser failures now exist.
    Currently the tests for precedence only check for positive results (i.e. successful parses). I would like to get some cases with invalid input as well. I have looked at how the existing tests handle this but the current error handling in nom escapes me. Help with this would be nice.
  • Improving API.
    A "fail" parser now exists in nom and can be used to specify "no operators of this kind". I see no other significant problems with the API.
    The current API I have come up with feels solid for the most part but I still think theres some room for improvement. Especially when handling languages that may not have certain classes of operators (for example a language with no postfix operators). This necessitates the use of a parser that always fails, but a parser with that functionality does not exist in nom so workarounds like verify(tag(""), |_: &str| false) are needed.
  • Recipes/Examples.
    The tests now have an example containing AST generation, function calls and ternary operators using this parser.
    I would like to add more examples into the recipes or example section. Especially for more involved expression containing things like function calls and ternary operators.

Open Questions

  • How should this parser handle "ambiguous precedences"? (see this comment for more details about this)
    Resolution: The current behaviour is sufficient. See here for reasoning.

@cenodis

This comment has been minimized.

@cenodis cenodis mentioned this pull request Aug 15, 2021
@cenodis cenodis force-pushed the feature_precedence branch from 24ef4a1 to 7f99f1a Compare August 17, 2021 12:06
@cenodis
Copy link
Contributor Author

cenodis commented Aug 19, 2021

@Stargateur

Why put this in nom ?

There are no guidelines on what parser specifically belong in nom or not (or if it exists I havent found it in the contribution guide).
To quote Geal:

more support for text parsing [...]
handling precedence in expressions (something like https://hackage.haskell.org/package/parsec-3.1.14.0/docs/Text-Parsec-Expr.html would be nice) [...]
This could live in nom, or in a separate crate with a faster release cycle [...]

To me this seperate crate still sounds like an open decision. And since I dont know of any such crate existing as of yet I figured I would take my chances and get it into standard nom.

specially you force complete input

Could you tell me where Im forcing complete input? If one of the operand or operator parser returns Needed it should just bubble to the caller. As far as I can see this parser should be compatible with both complete and streaming parsers and which ones are used depends solely on what type of parsers are passed to precedence.
If you could give me specific lines via a review I will gladly try and fix anything that causes an incompatability with streaming parsers.

@Stargateur
Copy link
Contributor

Could you tell me where Im forcing complete input? If one of the operand or operator parser returns Needed it should just bubble to the caller. As far as I can see this parser should be compatible with both complete and streaming parsers and which ones are used depends solely on what type of parsers are passed to precedence.
If you could give me specific lines via a review I will gladly try and fix anything that causes an incompatability with streaming parsers.

nevermind I miss understand.

@eaglgenes101
Copy link

That's a lot to pass into a single function at once, especially since some of the arguments correspond to aspects of the precedence parser that not everyone will use. Maybe the API should use the builder pattern instead?

@cenodis
Copy link
Contributor Author

cenodis commented Aug 19, 2021

@eaglgenes101

That's a lot to pass into a single function at once.

5 parameters in total. 3 of which are always required.

builder pattern

I dont think making a builder pattern for 2 optional parameters would be that great. A builder pattern would also either replicate the current API pretty closely (i.e. a single parser for each category) or I would have to use dynamic dispatch (to maintain a list of parsers for each category). And I would really like to avoid dynamic dispatch if possible.
I did some small experiments with builders while developing this and ultimately abandoned it because the builder didnt really help and just added bloat to the module.

Also there is nothing preventing you from extracting the parsers for the individual operand or operator groups into their own functions and then just passing those to the precedence parser. Same with the fold function. I just put it all in one place to give a comprehensive example of how it works.

Edit: Looking over the docs 5 parameters dont seem that out of line. There are plenty of parser that take 3 or 4 parameters, not counting alt and permutation which can (kind of) take up to 21 parameters. There is even an already existing parser with 5 parameters, fold_many_m_n.

@Geal
Copy link
Collaborator

Geal commented Aug 24, 2021

hi! just fyi, I'll review that on saturday or sunday, it looks like a great addition to nom :)

@cenodis
Copy link
Contributor Author

cenodis commented Sep 1, 2021

@Geal Any update on this?

@mullr
Copy link

mullr commented Mar 2, 2022

This looks very useful, it would certainly clean up some of my parsers. @cenodis would you consider releasing this as a separate crate, if it doesn't look like it's making it into nom proper?

@LoganDark
Copy link
Contributor

Great, a feature that I need is stuck in a PR that's been sitting around for 6 months. @Geal can you please take a look at this?

@Geal
Copy link
Collaborator

Geal commented Mar 14, 2022

It's been a weird 6 months for my open source work honestly 😅
But it's time for a new major version and precedence parsing will definitely go there

@Geal Geal added this to the 8.0 milestone Mar 14, 2022
@mullr
Copy link

mullr commented Apr 19, 2022

In case anybody else needs it: I turned the contents of this PR into a little crate that works with nom 7: https://github.com/mullr/nom-7-precedence. I'm looking forward to being able to delete it in the future, once nom 8 is a thing!

@ilonachan
Copy link

ilonachan commented Jul 14, 2024

I like using parser outputs to define the possible operators, but what bothers me is that the fold function is still separate. Rather than a single closure at the end, where I need to match against the possible operators anyway but also do sanity checking, each operator should know for itself how to combine operands together (no sanity checking required, because only operator combinations that exist can be parsed in the first place). The result would be something more like this:

fn parser(i: &str) -> IResult<&str, i64> {
  precedence(
    unary_op(1, tag("-"), |_op, val| -val), //prefix operators
    fail, //postfix operators
    alt(( //binary operators
      binary_op(2, Assoc::Left, tag("*"), |_op, lhs, rhs| lhs * rhs),
      binary_op(2, Assoc::Left, tag("/"), |_op, lhs, rhs| lhs / rhs),
      binary_op(3, Assoc::Left, tag("+").or(tag("-")),
        // just an example, I know this doesn't make much sense here
        |op, lhs, rhs| if op == "+" { lhs + rhs } else { lhs - rhs }),
    )),
    alt(( //operands
      map_res(digit1, |s: &str| s.parse::<i64>()),
      delimited(tag("("), parser, tag(")")), //subexpression handled via recursion
    ))
  )(i)
}

I've written a comparable API in this gist, but backed by a Pratt Parser instead. The fold functions in that one also allow you to advance the parser and do more complex trailing syntax, but I don't know how much that would be applicable to the algorithm used here.

To allow the sharing of the fold closures I've had to wrap them in Rc<RefCell<...>>, which would obviously be an issue if we're on a system without heap allocation (or want to do multithreading), but I don't think that's much to worry about since this implementation also uses Vecs.

@cenodis
Copy link
Contributor Author

cenodis commented Jul 15, 2024

@ilonachan

but what bothers me is that the fold function is still separate

I actually consider this to be an advantage in terms of readibility. Lexing, parsing and evaluation are three seperate stages that really don't have anything in common beyond the operator/operand types they reference. So I see no need to have them bound so closely together.

You have kind of touched on this in your example already, but consider doing more complex operations during the evaluation step. Like normalisations and constant folding. In my implementation it would look something like this:

match (operation) {
  Prefix(op, Constant(e)) => /* Do constant folding here */
  Prefix(Negation, e) => /* Normalize AST node into a multiplication with -1 */
  Prefix(Identity, e) => /* Replace with e */
  Prefix(op, e) => /* handle remaining operations */
  /* similar constructs for other combinations */
}

In yours you would also need to resort to a match (or if):

unary_op(1, tag("-"), |_op, val| match (val) {
  Constant(e) => /* constant folding */
  e => /* handle generic negation */
})
unary_op(1, tag("+"), |_op, val| val)
/* additional operators, each having its own match */

except instead of having a single match that lists all the various operator/operand combinations you now need to have a match expression for each operator. While this might be a matter of taste I would not consider that more readable. Especially if parsing the operator is more complex than a simple tag.

Rc<RefCell<...>>

Its not just allocations. This runs into the same issue that caused me to reject the builder pattern mentioned above. Your implementation needs to do full dynamic dispatch over completely static information (dyn BinaryDefFn<'a, I, O, E, Op>). This additional cost is not needed in my version.

this implementation also uses Vecs

That is true, but that is because Shunting Yard fundamentally cannot work without allocations (or at least not easily).
A big strengh (and weakness) of pratt is that it runs entirely on the call stack. Since nom also targets no_std I think if a pratt parser is included it should be available to those targets as well. Otherwise we get two parsers for alloc and none for no_std.

I am not opposed to having both versions in nom since they do have some interesting runtime differences. But I would like to suggest the following requirements:

  • The pratt parser should be able to run in no_std.
    Already touched on this above. But I do think this is the defining feature that sets pratt apart from shunting yard. At least in this context.
  • Both implementations should share the same signature.
    This should be doable since both pratt and shunting yard are, to my knowledge, the same algorithm. Just with different stack structures.
    The motivation here is to enable easier benchmark testing in user code. The actual performance differences are likely to be incredibly minor on most systems and workloads. So having indentical signatures allows users to switch them out without needing any refactoring.

sanity checking

I am not entirely sure what you mean with this. Are you refering to the fact that the example in the block above needs a default case? Thats because the example is "string typed". It is really more of a toy example, meant to show how each parameter is structured.
If you haven't already please take a look at tests/expression_ast.rs. It is a much more complete example on how to do parsing for a generic programming-like language, including function calls, ternary operators and AST normalisations.
It is also fully typed and can therefore do exhaustive matching. No default case required and any new operators are checked by the compiler.

@ilonachan
Copy link

@cenodis

Those are really great points, thanks for breaking it down! I maintain that at least for my use case the "tying an operator's logic to the operator itself" API model makes more sense and is more pleasing, but I can accept that 1. that won't be the case for every usage, and therefore 2. your API model isn't just a language limitation but can be an actually advantageous choice.

The epiphany I just had is that I don't need to force my preferences onto the user here: since the operator type definition can be anything, the user can just choose to manually bundle an Rc<RefCell<dyn ...>>> with each operator if they so wish, and otherwise just use your proposal as-is.

That means I've rewritten my thing to match yours now, but left in some convenience functions that allow using the API the way I wanted out-of-the-box (they simply wouldn't be compiled in with no_std). I haven't aimed for complete API parity yet tho, for a few reasons:

  • While it does seem like Shunting-Yard is just Pratt with a manually constructed stack, there are a few differences that maybe shouldn't be abstracted away; precedence+associativity are special cases of binding power, for example, but the user might want to keep more control over the latter.

  • The aforementioned manual stack doesn't interact that well with recursion for more complex parsers, which I've made heavy use of for e.g. the ternary operator. Your API didn't expose a way to advance the parser in the folding process, which seems necessary to me; maybe I'm wrong and it can easily be worked around. Maybe it's also easy to extend your API to allow this, but as a parsing noob I don't immediately see how to do that.

  • There's also one point of bikeshedding: I prefer defining operators like

    1. What is the operator we're matching?
    2. Once matched, what are the associated properties used to control the algorithm?
    3. (optionally) If the operator is matched, what should be done to fold the lhs+rhs?

    As opposed to frontloading the precedence and only then defining the operator. That might just be a matter of taste tho, and in any case it's just swapping the order of two parameters anyway.

I'm out today so won't be able to work on this immediately, but I think there's great promise! And I'm also learning a lot, so thanks!

@ilonachan
Copy link

ilonachan commented Jul 17, 2024

@cenodis One more question (because I'm not just a noob with parsers but also with GitHub): what's the best way to give you and everyone else access to my additions to your additions? Originally I just put my own code as a one-file gist, but it's spanning multiple files now, and also referencing your stuff... Do I make a pull request to your fork? Or just another conflicting pull request on this repo? Send you a patch file by mail??

For now I've just forked the repo, pulled your branch & rebased it on main, and copied in my work. That should be enough tor you to see what I did at least, and it also keeps the commit credit for all your prior work.

@ilonachan
Copy link

I've got another question, this time a bit deeper into the algorithm but also impacting the API: in Pratt Parsing postfix and infix operators are basically the same thing, except infix operators have a builtin mechanism for recursing the parser to fetch a rhs; so any infix operator could just be implemented as a postfix operator in this way.

I've thought that prefix operators were just inherently a different thing, and they are, but they're actually kinda the same thing for operands: either the operand returns itself as a value, or it's a prefix operator and needs to recurse the parser to fetch a rhs.

For that reason, I've sorta played with the thought of whether maybe these 2x2 operator types could be paired up into just two input parsers. Reason being that this might allow the user more control over the order in which operators are attempted to parse, even across operator types (where relevant): for example in a case where both a postfix and an infix operator could kinda match at first, but we might need to try and reject the infix first before trying the postfix. (I think I'm kinda running into this situation right now)

As a mathematician I would find this very satisfying, but as a user who doesn't completely understand the algorithm I might be confused as hell. Or maybe if it's documented well it might not be an issue. So my question is, on one hand, could it make sense to change the API like that? And if so, is there any equivalent for the Shunting-Yard version? (because there probably should be, same algorithm and all, but again I don't immediately see it)

@cenodis
Copy link
Contributor Author

cenodis commented Jul 20, 2024

@ilonachan

precedence+associativity are special cases of binding power, for example, but the user might want to keep more control over the latter

I think we have different goals with our implementations.

My goal is to enable a user to take a typical precedence table like this and translate it into a parser that correctly handles precedence. Without needing them to understand anything about infix parsing.
As such things like "binding power" or other parser capabilities, that go beyond what is needed to parse standard infix, are abstracted away. This is not an oversight of mine but an intentional choice. I want to make the API as simple and straightforward as possible to keep it accessible.
As such the parser tries very hard to hide its internals and phrases everything in relation to the input expression and grammar, resulting in the declarative nature of the API. If you look at the documentation right now it doesn't even mention a stack anywhere. The documentation that touches closest to the internals is the behaviour with ambiguous grammars. And even that is described purely with regards to the precedence table+input expression.
This doesn't mean I think of a more "powerful" API as bad, it's just not what I am aiming for right now.

My stance is that if you:

  • know how to write your own infix parser and
  • you have goals that need control over the parsers internals (performance, non-standard notation, etc)
    then you are better off rolling your own solution. The further you move away from standard infix notation the less useful a general-purpose parser will be.

Case in point, if your language only has binary operators you will likely arrive at a cleaner (and possibly more performant) implementation by just ripping out everything related to unary operators. And anything that does not follow normal algebraic structure would require adjustment to the parser anyway.
My parser is designed under the assumption that these kinds of use cases are the exception, not the rule. With this I also intend to target people who for one reason or another cannot fulfill point 1.

The "table based" approach also had some influence on the API. When implementing such a table into a parser I do not switch between parsing and evaluation. Instead I first copy the table row-by-row into a parser and then deal with transforming the tokens into the representation I need for whatever project I'm working on (usually AST nodes or bytecode). This mirrors the split between the first half containing only the parsers and the second half containing only the evaluation.
I imagine your style might be better suited to a more "experimental" approach such as initially developing a new language. In this context bundling parsing and evaluation would make it be easier to switch out/enable/disable individual operators since they are all bundled into the same statement.
Again, not saying one is somehow better than the other. Just the result of different motivations and workflows.

Your API didn't expose a way to advance the parser in the folding process

It doesn't. But at the same time I can't imagine a use case where this would be needed. Are there grammars that can't be handled without this capability? And if so, would that still match the "standard" algebraic notation that this parser is aimed at? Or is that already a new form of notation?

which I've made heavy use of for e.g. the ternary operator

Again, I invite you to check tests/expression_ast.rs if you haven't already. Among the example operators you will also find an implementation of the ternary operator using this parser. And it does not require anything special from the fold function. It is just another binary operator.

as a parsing noob

Fun fact: This parser is technically also my first parser. Granted it is its ~17th revision but it traces its origin back to the original I implemented when trying to write a parser for an obscure scripting language. So I'm no expert either.

Do I make a pull request to your fork? Or just another conflicting pull request on this repo? Send you a patch file by mail??

Good question. The answer would mostly depend on the direction that @Geal wants to take this. If he is fine with having two different approaches to precedence parsing then I am not opposed to merging your stuff into my repo and keeping this as a single PR.
Alternatively we could merge this first and then you could open a second PR with your changes rebased. Might make the review process easier since the PRs will be smaller and more focused.
Otherwise we would have to have to fix stuff to bring it in line with his demands first. Demands I currently do not know because this PR is more than 2 years overdue for a review.

@Geal Could we have some guidance on this?


As a historical fun fact, tangentially related to the "more control" topic:
Its actually possible to avoid recursion with shunting yard entirely. This can be done by handling brackets as unary operators and forcing operand folding when encountering a closing bracket until its matching bracket is found on the stack. One of my earlier revisions actually tried to expose this via the API.
Ultimately I found that this required too much knowledge of how the parser worked internally. Instead I opted to implement subexpressions via recursion because code like this:

fn expression() {
  // ...
  delimited("(", expression, ")")
}

makes the structure and nature of subexpressions instantly obvious to any reader without needing any prior education on what shunting yard is.

Maybe there is a better API that allows this kind of optimization without sacrificing readability and/or abstraction. But I was unable to find one and am happy enough with the current state of things. Suggestions welcome though.

@ilonachan
Copy link

You're probably right about the different goals. I fact...

I imagine your style might be better suited to a more "experimental" approach such as initially developing a new language. In this context bundling parsing and evaluation would make it be easier to switch out/enable/disable individual operators since they are all bundled into the same statement.

You've hit the nail on the head here, because I only stumbled upon this topic because I AM trying to design a small programming language (no idea how far that'll actually go tho). I just don't believe that this style will cease to be useful once the language exists and needs to be maintained, because an operator being coupled with the operation it should represent/perform is exactly the right kind of coupling to help readability.

Then again, I also see the use in decoupling this, for example if the same parser is used in different contexts where different fold logic should be applied, eg AST Generation vs Interpretation: you could just have two different fold functions as drop-in replacements for each other. So I've definitely come around on that one, also because Rust sadly makes my preferred style so much more awkward to implement as an API.

My goal is to enable a user to take a typical precedence table like this and translate it into a parser that correctly handles precedence. Without needing them to understand anything about infix parsing.
As such things like "binding power" or other parser capabilities, that go beyond what is needed to parse standard infix, are abstracted away. This is not an oversight of mine but an intentional choice. I want to make the API as simple and straightforward as possible to keep it accessible.
As such the parser tries very hard to hide its internals and phrases everything in relation to the input expression and grammar, resulting in the declarative nature of the API.

This is definitely the most important difference in priority. I totally get this, it's probably the easiest for an inexperienced user to handle; but I don't think we have to nor should neuter the algorithm's control surface because of it. Lots of people might just want a black box they can throw their list of operators into, but I think many would also benefit from having a general Pratt Parser harness. That could ofc be a separate crate, but since we would already use it for the easy-mode API internally, why not expose it?

So again I don't think we really need to choose one or the other here! There could be a full pratt parser with complete flexibility to do specific parsing and recursing logic; and there could be a precedence_pratt wrapper that just takes a regular precedence table (and transforms it into the more general version's input). Most people would be satisfied just using the latter, but the API would probably even be similar enough that switching to the more powerful version just means adding another input argument to the fold function, or something. If Shunting-Yard has any unique features it could get the same split, and the precedence_x functions would have the exact same interface, while the internal parser functions would be a bit more specific to their algorithm.

The further you move away from standard infix notation the less useful a general-purpose parser will be.

I see that point, mostly the "making your own stripped-down version for performance reasons" thing. But in terms of more complex syntax not benefitting from the Pratt structure anymore & needing its own thing, I frankly don't see it at all! You can even make variadic operators like function calls, easily. I guess I'm just curious if there's an example where my proposed API wouldn't be general enough to handle it... or maybe generality wasn't even your point, idk.


By the way, I've put in some work to refactor your implementation a little; it's probably not any actual advantage to the algorithm, might even have worse performance in some places (so feel free to cherry-pick that out)... I just wanted to unify similar code paths into a sub-function (or closure I guess) because I couldn't sight-read your implementation, but now I can see it's actually really simple!

This also easily reveals a way in which my "parser advancing fold" thing could be implemented too; tho because recursion doesn't really play so well with the custom stack, that may not be quite as useful. It'd instantly allow defining index brackets as an infix operator that just consumes the closing bracket in its fold function tho, and same for the paren grouping. Obviously you can make both of these happen by just calling expression recursively in the operand/operator parser (as your ternary version does too) but that kinda defeats the point of using the custom stack of Shunting-Yard... idk.

Since I'm already mostly rambling, your approach of recursing expression in the operator parser itself is interesting for Pratt too, because it's kinda equivalent to doing that recursion in the fold! It's just that now the user HAS to use a custom enum datatype for the operators to pass that data through, and it also doesn't work the same for infix operators (like index brackets); you'd have to hack them as postfix operators with a manual recursion call.

I guess for now I'll just wait if there are any more insights on what @Geal wants to do with these ideas, if anything. Either way it's a super interesting topic to discuss!

@Geal Geal self-requested a review as a code owner December 8, 2024 13:36
@Geal
Copy link
Collaborator

Geal commented Dec 8, 2024

ok so long story short, this looks great and should be merged. I am wary of integrating it in nom directly though, for various reasons:

  • it is a large feature focusing on languages, and through the history of that project, language parsing has always been contentious because people have a lot of diverging opinions
  • language parsing will need a bit more focus and iterations to get right

So I'm introducing the nom-language crate that will hold tools for language development. This will give it a different lifecycle from nom, which needs to be more stable now with the number of users.
I will fix up this PR and make it target that branch, then move it to the new crate, which will be in the same repository

@Geal Geal changed the base branch from main to nom-language December 8, 2024 16:08
@Geal Geal merged commit 6c12469 into rust-bakery:nom-language Dec 8, 2024
1 of 16 checks passed
@Geal
Copy link
Collaborator

Geal commented Dec 8, 2024

ok, this is now live in cb3b5b8 !
Just a heads up, I get a warning that Assoc is not used in https://github.com/rust-bakery/nom/pull/1792/files#diff-018c984d78a84baa841262958bdb12191adcabfca2c9e020a7aabc1d2534c0bdR45, so that may be something to look into soon

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants