mdoc reader #10225
I have only had a very cursory look, but one question that comes to mind is why you have a new lexer with a new kind of token. Is |
Part of it was just that I wanted to figure out how to implement this without having to kitbash the Roff lexer beyond recognition or keep the Man reader in sync with stuff that I changed, though I did end up extracting and reusing the escape sequences. But in a few ways the needs are fairly different. The token type used by

While the mdoc format inherits the superficial elements of roff syntax and in GNU groff is still implemented as a package of roff macros, mdoc macros themselves have a more complicated syntax. See MACRO SYNTAX in mandoc's mdoc(7) manual. The upshot is that the arguments to many macros are themselves parsed for macro calls, and in turn many macros can be called in argument position. (Cf. "Callable"/"Parsed" attributes of each macro.) So the

For example:

```
.Sy hello Em world
```

I lex this as

```haskell
parseSy = do
  macro "Sy"
  args <- manyTill lit (anyMacro <|> eol)
  return $ strong $ mconcat $ intersperse space (map toString args)

parseEm = do
  macro "Em"
  args <- manyTill lit (anyMacro <|> eol)
  return $ emph $ mconcat $ intersperse space (map toString args)
```

If my token stream were of the existing

```haskell
roffTokenToMdocTokens (ControlLine nm args) = Macro nm : map litOrMacro args <> [Eol]
  where
    litOrMacro x | isParsedMacro nm && isCallableMacro x = Macro x
                 | otherwise = Lit x
```

But! The existing roff lexer, by applying escapes, will have already destroyed some information we need if I take this approach. The following two lines will get the same lex from current

```
.Sy hello Em world
.Sy hello \&Em world
```

All of the above leaving aside the handling of delimiters required by

Finally, the

The bottom line of all this is that RoffToken and MdocToken are pretty different because the associated readers need different information from each control line.

But all that being said, I guess it's plausible to at least base the lexers on some shared code by expanding on my (misnamed) |
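The point about macros being callable in argument position, and about `\&` suppressing that, can be illustrated with a small self-contained sketch. Everything here is an assumption made for illustration: the `Tok` type, the `callable` and `parsed` lists, and `lexControlLine` are hypothetical names and are not the PR's actual `MdocToken` or lexer. The sketch splits arguments with `words`, so it ignores quoting and delimiters entirely.

```haskell
import Data.List (stripPrefix)

-- Hypothetical token type for this sketch; not the PR's actual MdocToken.
data Tok
  = Macro String   -- a macro call, at line start or in argument position
  | Lit String     -- literal text
  | Eol            -- end of a control line
  deriving (Eq, Show)

-- Assumed, tiny subset of mdoc macros that are "callable".
callable :: [String]
callable = ["Sy", "Em", "Ar", "Fl"]

-- Assumed subset of macros whose arguments are "parsed" for macro calls.
parsed :: [String]
parsed = ["Sy", "Em", "Ar", "Fl"]

-- Lex one control line such as ".Sy hello Em world".
-- Returns Nothing for lines that do not start with '.' or have no macro name.
lexControlLine :: String -> Maybe [Tok]
lexControlLine ('.' : rest) =
  case words rest of
    []          -> Nothing
    (nm : args) -> Just (Macro nm : go (nm `elem` parsed) args)
  where
    go _ [] = [Eol]
    go isParsed (w : ws)
      -- A leading \& marks the word as literal, so it is never a macro call.
      | Just w' <- stripPrefix "\\&" w = Lit w' : go isParsed ws
      -- Inside a parsed macro, a callable name starts a new macro call.
      | isParsed && w `elem` callable  = Macro w : go (w `elem` parsed) ws
      | otherwise                      = Lit w : go isParsed ws
lexControlLine _ = Nothing
```

With these assumptions, `lexControlLine ".Sy hello Em world"` yields `Just [Macro "Sy", Lit "hello", Macro "Em", Lit "world", Eol]`, while `lexControlLine ".Sy hello \\&Em world"` yields `Just [Macro "Sy", Lit "hello", Lit "Em", Lit "world", Eol]`. The two inputs stay distinguishable only because the lexer looks at `\&` before deciding whether a word is a macro.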
I'd like to understand this better. I would have thought that low level roff stuff like escapes was common currency for man, ms, and mdoc. Can you explain further why we can't handle the escapes in the lexer as we were doing? |
We do continue to handle the escapes in the lexer, and I'm reusing all the escaping code from

```
.Sy hello Em world
.Sy hello \&Em world
```

The

So if we wanted to reuse the |
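A minimal sketch of why resolving escapes before deciding what is a macro loses information. It assumes that `\&` resolves to nothing during escape processing. `resolveZeroWidth` and `eagerArgs` are hypothetical helpers, not functions from pandoc's Roff lexer.

```haskell
import Data.List (isPrefixOf)

-- Hypothetical stand-in for escape processing that resolves \& (a zero-width
-- character) to nothing before the arguments reach the parser.
resolveZeroWidth :: String -> String
resolveZeroWidth [] = []
resolveZeroWidth s@(c : cs)
  | "\\&" `isPrefixOf` s = resolveZeroWidth (drop 2 s)
  | otherwise            = c : resolveZeroWidth cs

-- Words of a control line after eager escape processing.
eagerArgs :: String -> [String]
eagerArgs = words . resolveZeroWidth
```

Under that assumption, `eagerArgs ".Sy hello Em world"` and `eagerArgs ".Sy hello \\&Em world"` produce the same list, so a later pass cannot tell whether `Em` was meant as a macro call or as literal text.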
@jgm I’ll be home in a couple days and hopefully returning to work on this very soon. My plan/goal is to complete coverage of every mdoc macro used by manual pages in the OpenBSD base system, so that |
c1071f7 to 4bbc097
c9b1f60 to f3c7b51
@jgm updated the description and marked ready, sorry it's +2000 lines of code 😅 |
@jgm I’d love to land this. Please let me know if you want me to split this PR up somehow or if I can talk you through any of it in more detail. |
I'll take a look this week. It would help if you could rebase it into logical commits (maybe just one) with the sort of commit message that could help me in crafting the changelog. |
The test failure is due to a duplicate skylighting-core in stack.yaml (my fault, now fixed in HEAD). |
2000 lines of code is a lot. Here's a thought: would it make sense to create a separate mdoc parsing library that could be used by pandoc? That's what I did with typst and commonmark; they both have independent libraries with their own types, and pandoc just includes a small interface to the pandoc types. (I don't want to imply that 2000 lines is a nonstarter. There are other writers that are that big, I think. But it's worth considering this alternative.) EDIT: I suppose that because of the sharing with the other roff based parsers, it may make sense to keep this all in pandoc. |
I wanted to at first! But I sat down with a blank page and didn't know how to start, or how to structure the AST. I only really managed to un-daunt myself by just starting it as a Pandoc reader instead. With the benefit of this experience I could probably go write a standalone mdoc library without getting instantly stuck, but I'd still feel compelled to do some new work to design an AST that retains more mdoc-specific stuff so that it has some value beyond pandoc. The org reader is 3620 lines (by wc -l) with the benefit of being organized into multiple files. I could take a stab at sorting things out a bit if it would help with maintenance, but it might be better to do that after merge. fwiw I am eager to maintain this reader over time especially if reports come in from the wild about reasonable markup I'm failing to parse. If there's refactoring to do here that will make the code better I think it's easier to do that after an initial merge of the feature-complete version, since it makes it easier to review what's actually improving. |
OK, sounds good. You're right that the org reader is much bigger! As is the LaTeX reader. When you've got this rebased into logical commits that don't recapitulate the development history so much, I will take a look; maybe this can go in the upcoming release. |
Repushed as two commits, one that extracts/parameterizes the escaping functions for the preexisting Roff lexer and one that has everything else. I made some minor tweaks to the typeclass for the Roff escapes to remove default definitions and caught a regression I had introduced; everything else is functionally the same. Some commentary has been added as comments that had been living in my commit messages. |
The existing lexRoff does some stuff I don't want to deal with in mdoc just yet, like lexing tbl, and some stuff I won't do at all, like handling macro and text string definitions and switching between modes.

Uses a typeclass with associated type families to reuse most of the escaping code between Roff (i.e. man) and Mdoc.

Future work could improve on this so that more lexing code could be shared between Man and Mdoc. Mdoc inherits Roff's surface syntax so hypothetically it makes sense to lex it into tokens that make sense for roff. But it happens that the Mdoc parser is much easier to build with an Mdoc specific token stream. Some discussion in jgm#10225 about the rationale.

Adds a test for the roff \A escape, which I accidentally dropped support for in an earlier iteration without anything complaining.
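As a rough illustration of sharing escape handling through a typeclass with an associated type family, here is a self-contained sketch. The class `EscapeSink`, its methods, the tag types `RoffLexer` and `MdocLexer`, and `MdocPiece` are all hypothetical names chosen for this example and are not taken from the PR. The sketch handles only `\&` and `\e`.

```haskell
{-# LANGUAGE TypeFamilies #-}

import Data.Kind (Type)
import Data.Proxy (Proxy (..))

-- Tag types naming the two lexers (hypothetical).
data RoffLexer
data MdocLexer

-- Each lexer chooses its own representation for pieces of escaped text.
class EscapeSink l where
  type Piece l :: Type
  literal   :: Proxy l -> String -> Piece l
  zeroWidth :: Proxy l -> Piece l          -- what \& turns into

-- Shared escape logic, written once against the class.
escapes :: EscapeSink l => Proxy l -> String -> [Piece l]
escapes p = go
  where
    go [] = []
    go ('\\' : '&' : rest) = zeroWidth p : go rest
    go ('\\' : 'e' : rest) = literal p "\\" : go rest
    go s =
      let (plain, rest) = break (== '\\') s
      in case plain of
           [] -> literal p (take 1 s) : go (drop 1 s)  -- unrecognised backslash
           _  -> literal p plain : go rest

-- A roff-style lexer can simply discard the zero-width character.
instance EscapeSink RoffLexer where
  type Piece RoffLexer = String
  literal _ s = s
  zeroWidth _ = ""

-- An mdoc-style lexer keeps a marker so the word is not taken for a macro.
data MdocPiece = Text String | NoMacro
  deriving (Eq, Show)

instance EscapeSink MdocLexer where
  type Piece MdocLexer = MdocPiece
  literal _ = Text
  zeroWidth _ = NoMacro
```

For example, `escapes (Proxy :: Proxy MdocLexer) "\\&Em"` gives `[NoMacro, Text "Em"]`, whereas `concat (escapes (Proxy :: Proxy RoffLexer) "\\&Em")` gives `"Em"`. The escape-walking code is shared and only the result type differs per lexer.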
This change introduces a reader for mdoc, a roff-derived semantic markup language for manual pages. The two relevant contemporary implementations of mdoc for manual pages are mandoc (https://mandoc.bsd.lv/), which implements the language from scratch in C, and groff (https://www.gnu.org/software/groff/), which implements it as roff macros.

mdoc has a lot of semantics specific to technical manuals that aren't representable in Pandoc's AST. I've taken a cue from the mandoc HTML output and many mdoc elements are encoded as Codes or Spans with classes named for the mdoc macro that produced them.

Much like web browsers with HTML, mandoc attempts to produce best-effort output given all kinds of weird and crappy mdoc input. Part of the reason it's able to do this is it uses a very accommodating parse tree and stateful output routines specialized to the output mode, and when it encounters some macro it wasn't expecting, it can easily give up on whatever it was outputting and output something else. I've encoded as much flexibility as I reasonably could into the mdoc reader here, but I don't know how to be as flexible as mandoc.

This branch has been developed almost exclusively against mandoc's documentation and implementation of mdoc as a reference, and the real-world manual pages tested against are those from the OpenBSD base system. Of ~3500 manuals in mdoc format shipped with a fresh OpenBSD install, 17 cause the mdoc reader to exit with a parse error. Any further chasing of edge cases is deferred to future work.

Many of the tests in test/Tests/Readers/Mdoc.hs are derived directly from mandoc's extensive regression tests.

[API change] Adds readMdoc to the public API
This looks great. One thing that is needed: |
pushed the manual update on top! |
This PR introduces a reader for mdoc, a roff-derived semantic markup language for manual pages. The two relevant contemporary implementations of mdoc for manual pages are mandoc, which implements the language from scratch in C, and groff, which implements it as roff macros.
mdoc has a lot of semantics specific to technical manuals that aren't representable in Pandoc's AST. I've taken a cue from the mandoc HTML output and many mdoc elements are encoded as Codes or Spans with classes named for the mdoc macro that produced them.
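To make the "Codes or Spans with classes named for the macro" encoding concrete, here is a small sketch. It uses toy stand-ins for a fragment of Pandoc's inline AST so that it is self-contained. The mapping in `encodeMacro` (which macros become `Code` and which become `Span`) is an assumption for illustration and may not match what the reader actually does for any given macro.

```haskell
import Data.List (intersperse)

-- Toy stand-ins for a small part of Pandoc's inline AST.
type Attr = (String, [String], [(String, String)])  -- (id, classes, key-values)

data Inline
  = Str String
  | Space
  | Code Attr String
  | Span Attr [Inline]
  deriving (Eq, Show)

-- Attribute carrying only a class, named after the originating macro.
classAttr :: String -> Attr
classAttr cls = ("", [cls], [])

-- Hypothetical mapping: command-line flags become Code, anything else
-- becomes a Span, and both keep the macro name as a class.
encodeMacro :: String -> [String] -> Inline
encodeMacro "Fl" args = Code (classAttr "Fl") (unwords (map ('-' :) args))
encodeMacro name args = Span (classAttr name) (intersperse Space (map Str args))
```

Under this toy mapping, `encodeMacro "Fl" ["v"]` is `Code ("", ["Fl"], []) "-v"`, so a writer that understands classes (HTML, for instance) can still style or select flags separately from other code.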
Much like web browsers with HTML, mandoc attempts to produce best-effort output given all kinds of weird and crappy mdoc input. Part of the reason it's able to do this is it uses an extremely stateful output routine, and if it encounters some macro it wasn't expecting, it can easily give up on whatever it was outputting and output something else. I've encoded as much flexibility as I reasonably could into the mdoc reader here, but there will probably always be documents where mandoc prints reasonable output and we give up with a parse error, unless someone comes in and reworks our parser to handle more strange scenarios.
This branch has been developed almost exclusively against mandoc's documentation and implementation of mdoc as a reference, and the real-world manual pages tested against are those from the OpenBSD base system. Of ~3500 manuals in mdoc format shipped with a fresh OpenBSD install, 17 cause the mdoc reader to exit with a parse error. If I chase any more edge cases this PR will get even bigger and worse to review.
Much could probably still be improved here, but I'm basically at a milestone where I'm confident we can parse a lot of manuals into a pretty good Pandoc representation and further enhancements can take place after merge. After the first couple of commits on this branch my changes get reasonably atomic, and might be worth reviewing step by step.
closes #9056