-
Notifications
You must be signed in to change notification settings - Fork 0
How the parser works
This is our problem: We need a parser that takes Toki Pona sentence and makes AST out of it. The parser must take account all the possibilities of how the sentence is going to be parsed: That "tawa" could be a preposition, or maybe it could be a modifier.
There are many ways to approach this problem. One problem we have is we can't use external libraries, many parsing libraries only takes account of making just one AST, we need more than one. So we can only handwrite our own parser or handwrite a library and use it. We need some flexibilities that the libraries provides so the latter option was chosen.
So, what kind of library? We've opted to making parser combinator, a very nice way to compose parser.
Note: If you know what parser combinator is, you may skip this part.
Parser combinator libraries provides parser combinator for composing parsers. Before explaining what parser combinator is, we need to define what parsers is, as parser combinator can only take certain kind of parsers.
Parsers are function that takes string (or array of anything or some kind of reader) and outputs what it just parsed as well as the rest of the string, leaving out what it have taken. Parsers can be small like it can only parse words or numbers or it can be big as well like it can parse the whole sentence.
It is best explained with black box analogy.
Parsers can output error, which is essential, but we'll ignore it for now.
Parser combinator are functions that... get this... takes parser, could be as many parser as well as other data, and outputs a parser! It takes functions and outputs a function! It's so meta! This is something that you would expect in functional programming. Parser combinator is very much a product functional programming paradigm.
Parser combinator is used to compose parsers. If we have a parser for word and a parser for number. We could make a parser for a sequence of word and number with these two parsers as well as a combinator called "sequence combinator"
This allows us to compose parser, starting from small parsers and gradually moving towards larger parsers in a very declarative way.
In the context of parser combinator, it's best not to think of parsers as functions but rather its own unique thing. Like how we think of 4 as 4 itself instead of "4 tomatoes" "4 coins" etc. Parsers are its own thing.
Note: If you know what this is, you may skip this part.
These are important combinators:
-
sequence
combinator -
choice
combinator -
map
combinator
There are more combinators but these are essential combinator that you need to know.
The sequence
combinator, well you have seen it before, it takes multiple parsers then outputs a parser that parses a sequence of what those input parsers parses.
The choice
combinator takes multiple parsers then output a parser that tries them one by one. If one fails to parse, then it will try the next one until one parses. This is useful if we want to parse things that can take multiple forms. For example, if we're parsing a modifier, we could parse a toki pona word or a capitalized proper word.
TODO: put example here
The map
combinator will take a single parser and a single mapper function. Remember the sequence
combinator? Have you wondered what the parsing result would be if we used that combinator? It would be a tuple, if the sequence
combinator took two parsers, the result would be a two-length tuple, and so on. What if we want a different result? What if we want an object instead? That's where the map
combinator is useful at! It maps the parsing result.
The map
combinator is defined as a method of the Parser
class just so it's nicer to use.
TODO: put example here
We have introduced a very huge topic. Let's not forget why we're making parser combinator. We need a parser library that can output multiple possible AST's. Remember that Toki Pona sentences can have vague constructions.
Typical parser combinator libraries only takes account of having one AST. But we can modify our own.
One difference we're going to make is that parsers will now output multiple data. Each containing what it just parsed as well as the rest of the string. See also: The Output data type.
This also makes the combinator a lot more complicated. Lucky, once we're done with them, we don't have to look at it once again. That's the beauty of abstraction.
Although, since this is an explanation page, we're going to look at it once again.
TODO
One attractive advantage of parser combinator is that it is easy to forgo making lexer and go straight into making parser itself. This could possibly make our code faster! However, ilo Token needs a lexer. The lexer will group things to avoid partial parsing.
So what is partial parsing and why is it our problem?
Partial parsing is a word that we came up to describe a specific behavior. Consider the phrases "sina pona a a". We intend this to be parsed as "sina pona (a a)" where these two "a"s are together. However, it is also possible to be parsed as "sina (pona a) a", where the first "a" emphasizes the "pona" and the second "a" emphasizes the whole sentence. The two "a"s has been partially parsed.
The parser is very permissive and that's by design! However, we don't want partial parsing in some cases. So instead, the lexer can group words and these grouped words are considered singular in the parser.
If we're being pedantic with the terms, this is more than a lexer because it groups tokens or lexemes! But eh... this is probably fine.
TODO
This is one major disadvantage of parser combinator: It is easy to introduce infinite recursion and cause stack overflow. Sadly, the only solution here is to meticulously scan the parser and the lexer and find the possible error. The troubleshooting instruction is laid down here.