mdoc reader #10225

Merged
merged 3 commits into from
Dec 6, 2024

silby commented Sep 27, 2024

This PR introduces a reader for mdoc, a roff-derived semantic markup language for manual pages. The two relevant contemporary implementations of mdoc for manual pages are mandoc, which implements the language from scratch in C, and groff, which implements it as roff macros.

mdoc has a lot of semantics specific to technical manuals that aren't representable in Pandoc's AST. I've taken a cue from the mandoc HTML output and many mdoc elements are encoded as Codes or Spans with classes named for the mdoc macro that produced them.
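As a rough illustration of the encoding described above, here is a minimal sketch using toy types that only mirror the shape of Pandoc's `Code` and `Span` inlines (`String` instead of `Text`). The helper names `classed`, `flag`, and `argument` are hypothetical, and which macros the reader actually turns into `Code` versus `Span` is an assumption made for the example, not something stated in this PR.

```haskell
-- Toy mirror of the relevant part of Pandoc's AST (an assumption for
-- illustration; the real types live in pandoc-types and use Text).
type Attr = (String, [String], [(String, String)])

data Inline
  = Str String
  | Code Attr String
  | Span Attr [Inline]
  deriving (Eq, Show)

-- | Attributes carrying only a class named after the mdoc macro.
classed :: String -> Attr
classed macro = ("", [macro], [])

-- | Hypothetical encoding of `.Fl l`: a command-line flag as classed Code.
flag :: String -> Inline
flag name = Code (classed "Fl") ('-' : name)

-- | Hypothetical encoding of `.Ar file`: an argument as a classed Span.
argument :: String -> Inline
argument name = Span (classed "Ar") [Str name]
```

A writer or filter downstream can then recover the mdoc semantics by matching on the class name.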

Much like web browsers with HTML, mandoc attempts to produce best-effort output given all kinds of weird and crappy mdoc input. Part of the reason it's able to do this is it uses an extremely stateful output routine, and if it encounters some macro it wasn't expecting, it can easily give up on whatever it was outputting and output something else. I've encoded as much flexibility as I reasonably could into the mdoc reader here, but there will probably always be documents where mandoc prints reasonable output and we give up with a parse error, unless someone comes in and reworks our parser to handle more strange scenarios.

This branch has been developed almost exclusively against mandoc's documentation and implementation of mdoc as a reference, and the real-world manual pages tested against are those from the OpenBSD base system. Of ~3500 manuals in mdoc format shipped with a fresh OpenBSD install, 17 cause the mdoc reader to exit with a parse error. If I chase any more edge cases this PR will get even bigger and worse to review.

Much could probably still be improved here, but I'm basically at a milestone where I'm confident we can parse a lot of manuals into a pretty good Pandoc representation and further enhancements can take place after merge. After the first couple of commits on this branch my changes get reasonably atomic, and might be worth reviewing step by step.

closes #9056


jgm commented Sep 27, 2024

I have only had a very cursory look, but one question that comes to mind is why you have a new lexer with a new kind of token. Is lexRoff from T.P.Readers.Roff inadequate for mdoc? Why? Could it be improved instead of adding a new module that does the same thing?


silby commented Sep 27, 2024

Part of it was just that I wanted to figure out how to implement this without having to kitbash the Roff lexer beyond recognition or keep the Man reader in sync with stuff that I changed, though I did end up extracting and reusing the escape sequences. But in a few ways the needs are fairly different.

The token type used by lexRoff in T.P.R.Roff is based on roff's native syntax, where control lines start with a request or a macro and any further arguments in the control line are simply arguments to that macro. Hence the token type constructor of ControlLine Text [Arg] SourcePos where the Text is the macro or request name and each Arg is handled as either a keyword or as literal text by the macro/request.

While the mdoc format inherits the superficial elements of roff syntax and in GNU groff is still implemented as a package of roff macros, mdoc macros themselves have a more complicated syntax. See MACRO SYNTAX in mandoc's mdoc(7) manual. The upshot is that the arguments to many macros are themselves parsed for macro calls, and in turn many macros can be called in argument position. (Cf. "Callable"/"Parsed" attributes of each macro.)

So the Mdoc.Lex lexer, instead of packaging all the arguments on a roff control line together, lexes each token from the control line individually and emits a totally linear token stream, which is more amenable to recursive parsing of macro arguments/multiple macros in one line. The lexer uses the rules for callable and parsed macros to decide whether to lex a control argument as a Macro token or as a Lit (non-macro text). It's especially handy to make this determination in the lexer because it directly takes care of escaping macro names in argument position: \&No gets lexed as Lit "No", because \& isn't a legal character to start a macro name.

For example:

.Sy hello Em world

I lex this as [Macro "Sy", Lit "hello", Macro "Em", Lit "world", Eol]. So a notional parseSy and parseEm (simpler than the ones in this branch) can boil down to this:

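parseSy = do
  macro "Sy"
  -- lookAhead: stop before the next macro or Eol without consuming it,
  -- so that a following parseEm still sees Macro "Em"
  args <- manyTill lit (lookAhead (anyMacro <|> eol))
  return $ strong $ mconcat $ intersperse space (map toString args)

parseEm = do
  macro "Em"
  args <- manyTill lit (lookAhead (anyMacro <|> eol))
  return $ emph $ mconcat $ intersperse space (map toString args)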

If my token stream were of the existing RoffToken type, I would need to do an intermediate step to transform a ControlLine into a flat structure where macros are distinguished from lits. That's seemingly straightforward enough: ControlLine "Sy" ["hello", "Em", "world"] could become a list of my token type via something like

roffTokenToMdocTokens (ControlLine nm args _pos) = Macro nm : map litOrMacro args <> [Eol]
  where
    litOrMacro x | isParsedMacro nm && isCallableMacro x = Macro x
                 | otherwise = Lit x

But! The existing roff lexer, by applying escapes, will have already destroyed some information we need if I take this approach. The following two lines will get the same lex from current lexRoff:

.Sy hello Em world
.Sy hello \&Em world

All of the above leaving aside the handling of delimiters required by mdoc but irrelevant to man, which is also convenient to deal with in the lexer.
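The lexing rule argued for in this comment (decide between Macro and Lit while the escape is still visible) can be sketched as follows. This is a deliberately simplified model, not the code in Mdoc.Lex: it uses `String`, splits on whitespace, treats every macro as parsed, ignores quoting and delimiters, handles only the `\&` escape, and the `callable` list is a hypothetical subset.

```haskell
import Data.List (isPrefixOf)

-- Toy token type modelled on the description above.
data Tok = Macro String | Lit String | Eol
  deriving (Eq, Show)

-- | Hypothetical, tiny subset of callable macro names.
callable :: [String]
callable = ["Sy", "Em", "No", "Ar", "Fl"]

-- | Classify one argument word. A leading \& is dropped and forces a
-- literal, since the escape cannot start a macro name.
lexArg :: String -> Tok
lexArg w
  | "\\&" `isPrefixOf` w = Lit (drop 2 w)
  | w `elem` callable    = Macro w
  | otherwise            = Lit w

-- | Lex one input line into a flat token list ending in Eol.
lexControlLine :: String -> [Tok]
lexControlLine ('.' : rest) =
  case words rest of
    []         -> [Eol]
    (m : args) -> Macro m : map lexArg args ++ [Eol]
lexControlLine txt = [Lit txt, Eol]  -- text line: kept whole in this sketch
```

With this rule, `.Sy hello Em world` and `.Sy hello \&Em world` produce different token streams, which is exactly the distinction the parser needs.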

Finally, the Roff lexer implements roff's macro definition requests, so it will actually expand any custom macros that are defined in a manual page read by the Man reader. This is very neat but I think it is an antifeature for mdoc documents, where use of raw roff requests at all, let alone custom macros, is discouraged and hopefully vanishingly rare in the wild. Only a subset of raw roff requests are supported by mandoc, and only about 3 are in use in mdoc manuals in the OpenBSD base system. So my intention was to not include that feature in the mdoc reader.

The bottom line of all this is that RoffToken and MdocToken are pretty different because the associated readers need different information from each control line. But all that being said, I guess it's plausible to at least base the lexers on some shared code by expanding on my (misnamed) RoffMonad typeclass found in T.P.R.Roff.Escape with functions like lexControlLine, lexTextLine. I'm not sure how much code would actually end up being shared though. Ultimately the MdocToken type I introduced is proving pretty adaptable to the things I need it to do and if I did try to reuse the existing lexRoff I'd probably still translate RoffToken to MdocToken for use in the parsers.


jgm commented Sep 28, 2024

> But! The existing roff lexer, by applying escapes, will have already destroyed some information we need if I take this approach.

I'd like to understand this better. I would have thought that low level roff stuff like escapes was common currency for man, ms, and mdoc. Can you explain further why we can't handle the escapes in the lexer as we were doing?


silby commented Sep 28, 2024

We do continue to handle the escapes in the lexer, and I'm reusing all the escaping code from T.P.R.Roff, now moved to T.P.R.Roff.Escape. There's just an interaction between applying escapes and tokenizing control lines that needs to be handled differently for mdoc. I'll hopefully make my example from before clearer. Consider these two control lines:

.Sy hello Em world
.Sy hello \&Em world

The Roff lexer lexes this as (the moral equivalent of) [ControlLine "Sy" ["hello", "Em", "world"], ControlLine "Sy" ["hello", "Em", "world"]]. (There are a couple more types involved in the argument list but the contents boil down to Texts in this instance.)

Mdoc.Lex lexes this as [Macro "Sy", Lit "hello", Macro "Em", Lit "world", Eol, Macro "Sy", Lit "hello", Lit "Em", Lit "world", Eol]. The \&Em on the second line is escaped to Em, but it also tokenizes that Em as a literal rather than a macro call. (You can actually see the difference in GitHub's syntax highlighting!)

So if we wanted to reuse the RoffToken type for mdoc we might have to stop processing escapes within lexRoff, because escape characters (by convention \& for zero-width space) are needed to protect strings that happen to be macro names from mdoc macro expansion. The concern doesn't exist for man because there are no man macros that expand further macros in the same control line.
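To make the information loss concrete, here is a small self-contained model of an escape-first lexer followed by a conversion to mdoc-style tokens. It is a sketch under stated assumptions: `RoffTok` drops the `SourcePos` field and uses plain `String` arguments, `lexEscapeFirst` and `toMdocToks` are hypothetical names, only the `\&` escape is modelled, and `callable` is a made-up subset of macros.

```haskell
-- Simplified stand-ins for the two token types discussed above.
data RoffTok = ControlLine String [String]
  deriving (Eq, Show)

data MdocTok = Macro String | Lit String | Eol
  deriving (Eq, Show)

-- | Remove every zero-width \& escape, as an escape-first lexer would.
stripZeroWidth :: String -> String
stripZeroWidth ('\\' : '&' : rest) = stripZeroWidth rest
stripZeroWidth (c : rest)          = c : stripZeroWidth rest
stripZeroWidth []                  = []

-- | Apply escapes first, then split the control line into arguments.
lexEscapeFirst :: String -> Maybe RoffTok
lexEscapeFirst ('.' : rest) =
  case words (stripZeroWidth rest) of
    (m : args) -> Just (ControlLine m args)
    []         -> Nothing
lexEscapeFirst _ = Nothing

-- | Hypothetical subset of callable macros.
callable :: [String]
callable = ["Sy", "Em"]

-- | Convert after the fact; the \& protection is already gone by now.
toMdocToks :: RoffTok -> [MdocTok]
toMdocToks (ControlLine nm args) = Macro nm : map litOrMacro args ++ [Eol]
  where
    litOrMacro x
      | x `elem` callable = Macro x
      | otherwise         = Lit x
```

Both example lines yield the same `ControlLine`, so the conversion has no way to tell that the second `Em` was meant as literal text and wrongly emits `Macro "Em"` for it.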


silby commented Oct 24, 2024

@jgm I’ll be home in a couple days and hopefully returning to work on this very soon. My plan/goal is to complete coverage of every mdoc macro used by manual pages in the OpenBSD base system, so that pandoc -r mdoc can parse all those manuals without any parse errors or skipped content. If you have any more feedback on what I have so far let me know.

silby force-pushed the mdoc branch 3 times, most recently from c1071f7 to 4bbc097, November 4, 2024 00:26
silby force-pushed the mdoc branch 4 times, most recently from c9b1f60 to f3c7b51, November 14, 2024 01:35
silby changed the title from "wip: mdoc reader" to "mdoc reader", Nov 14, 2024
silby marked this pull request as ready for review, November 14, 2024 01:51

silby commented Nov 14, 2024

@jgm updated the description and marked ready, sorry it's +2000 lines of code 😅


silby commented Dec 2, 2024

@jgm I’d love to land this. Please let me know if you want me to split this PR up somehow or if I can talk you through any of it in more detail.


jgm commented Dec 2, 2024

I'll take a look this week. It would help if you could rebase it into logical commits (maybe just one) with the sort of commit message that could help me in crafting the changelog.
All API changes should be marked with [API change].


jgm commented Dec 3, 2024

The test failure is due to a duplicate skylighting-core in stack.yaml (my fault, now fixed in HEAD).


jgm commented Dec 3, 2024

2000 lines of code is a lot. Here's a thought: would it make sense to create a separate mdoc parsing library that could be used by pandoc? That's what I did with typst and commonmark; they both have independent libraries with their own types, and pandoc just includes a small interface to the pandoc types.

(I don't want to imply that 2000 lines is a nonstarter. There are other writers that are that big, I think. But it's worth considering this alternative.)

EDIT: I suppose that because of the sharing with the other roff based parsers, it may make sense to keep this all in pandoc.


silby commented Dec 3, 2024

I wanted to at first! But I sat down with a blank page and didn't know how to start, or how to structure the AST. I only really managed to un-daunt myself by just starting it as a Pandoc reader instead. With the benefit of this experience I could probably go write a standalone mdoc library without getting instantly stuck, but I'd still feel compelled to do some new work to design an AST that retains more mdoc-specific stuff so that it has some value beyond pandoc.

The org reader is 3620 lines (by wc -l) with the benefit of being organized into multiple files. I could take a stab at sorting things out a bit if it would help with maintenance, but it might be better to do that after merge.

fwiw I am eager to maintain this reader over time especially if reports come in from the wild about reasonable markup I'm failing to parse. If there's refactoring to do here that will make the code better I think it's easier to do that after an initial merge of the feature-complete version, since it makes it easier to review what's actually improving.


jgm commented Dec 3, 2024

OK, sounds good. You're right that the org reader is much bigger! As is the LaTeX reader.

When you've got this rebased into logical commits that don't recapitulate the development history so much, I will take a look; maybe this can go in the upcoming release.


silby commented Dec 5, 2024

Repushed as two commits: one that extracts and parameterizes the escaping functions for the preexisting Roff lexer, and one that has everything else. I made some minor tweaks to the typeclass for the Roff escapes to remove default definitions and caught a regression I had introduced; everything else is functionally the same. Some commentary that had been living in my commit messages has been moved into code comments.

silby requested a review from jgm, December 5, 2024 19:36

silby added 2 commits, December 5, 2024 15:26

The existing lexRoff does some stuff I don't want to deal with in mdoc
just yet, like lexing tbl, and some stuff I won't do at all, like
handling macro and text string definitions and switching between modes.
Uses a typeclass with associated type families to reuse most of the
escaping code between Roff (i.e. man) and Mdoc.

Future work could improve on this so that more lexing code could be
shared between Man and Mdoc. Mdoc inherits Roff's surface syntax so
hypothetically it makes sense to lex it into tokens that make sense for
roff. But it happens that the Mdoc parser is much easier to build with
an Mdoc specific token stream. Some discussion in jgm#10225 about
the rationale.

Adds a test for the roff \A escape, which I accidentally dropped support
for in an earlier iteration without anything complaining.

This change introduces a reader for mdoc, a roff-derived semantic markup
language for manual pages. The two relevant contemporary implementations
of mdoc for manual pages are mandoc (https://mandoc.bsd.lv/), which
implements the language from scratch in C, and groff
(https://www.gnu.org/software/groff/), which implements it as roff macros.

mdoc has a lot of semantics specific to technical manuals that aren't
representable in Pandoc's AST. I've taken a cue from the mandoc HTML
output and many mdoc elements are encoded as Codes or Spans with classes
named for the mdoc macro that produced them.

Much like web browsers with HTML, mandoc attempts to produce best-effort
output given all kinds of weird and crappy mdoc input. Part of the
reason it's able to do this is it uses a very accommodating parse tree
and stateful output routines specialized to the output mode, and when it
encounters some macro it wasn't expecting, it can easily give up on
whatever it was outputting and output something else. I've encoded as
much flexibility as I reasonably could into the mdoc reader here, but I
don't know how to be as flexible as mandoc.

This branch has been developed almost exclusively against mandoc's
documentation and implementation of mdoc as a reference, and the
real-world manual pages tested against are those from the OpenBSD base
system. Of ~3500 manuals in mdoc format shipped with a fresh OpenBSD
install, 17 cause the mdoc reader to exit with a parse error. Any
further chasing of edge cases is deferred to future work.

Many of the tests in test/Tests/Readers/Mdoc.hs are derived directly
from mandoc's extensive regression tests.

[API change] Adds readMdoc to the public API

jgm commented Dec 6, 2024

This looks great. One thing that is needed: mdoc needs to be added to the list of legitimate input formats (under --from) in MANUAL.txt.


silby commented Dec 6, 2024

pushed the manual update on top!

jgm merged commit 5f35aa6 into jgm:main, Dec 6, 2024
9 of 12 checks passed
jgm pushed a commit that referenced this pull request Dec 6, 2024