Support loading the Plain TeX format #5

Open
7 of 23 tasks
jamespfennell opened this issue Jun 22, 2023 · 2 comments

jamespfennell commented Jun 22, 2023

I've been working on this project on-and-off for 2 years in a kind of scattershot way, mostly doing projects that interest me like the recent serializable VMs work (#3). I think it would be interesting to change tack, and instead work on the large goal of making Texcraft able to parse the plain TeX format.

The format is essentially just a large TeX file and can be downloaded from CTAN. What makes it interesting is that it uses a lot of different TeX features, so supporting it necessarily means making a lot of progress on the project.

I've audited appendix A of the TeXbook, which describes the format, and come up with this list of tasks which seem necessary for it to work and for it to be testable. The tasks are ordered based on where they appear in plain.tex, so as more tasks are completed the Texcraft interpreter can get further in the file.

Preamble

  • Support ^^ in the lexer (Figure out the ^^ situation #4).
  • Implement \chardef. This involves adding a new kind of command which just contains a character. In the main VM loop the character handler is invoked. In contexts outside of the loop the command does different things; e.g. it is interpreted as an integer. Plain TeX uses this as a more performant way of writing \def\one{1 }.
  • Support active characters everywhere. If there is no performance cost, I think the best way to do this is for the token value type to have a dedicated Command variant which is itself either a control sequence or an active character. Then in many places where we accept a control sequence (like the commands map) we will instead accept a command. The catch here is that this will make the token type bigger and this could have performance implications. There are perhaps variations on this where we make the command type a 32-bit value capable of holding either a control sequence or a Unicode code point.
  • Implement \message - this is trivial, just writes to the log.
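A minimal sketch of the 32-bit variation mentioned above, assuming control sequence names are interned to a `u32` index. The names `Command`, `CommandKind` and `ACTIVE_TAG` are hypothetical and not part of Texcraft's current API. It relies on the fact that a Unicode scalar value is at most 0x10FFFF, so the top bit is free to use as a tag.

```rust
/// Hypothetical 32-bit command identifier: either an interned control
/// sequence name or an active character.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct Command(u32);

/// Decoded view of a `Command`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CommandKind {
    /// Index into an (assumed) string interner for control sequence names.
    ControlSequence(u32),
    /// An active character such as `~`.
    ActiveChar(char),
}

/// The top bit marks an active character.
const ACTIVE_TAG: u32 = 1 << 31;

impl Command {
    /// Returns `None` if the interner index does not fit in 31 bits.
    pub fn control_sequence(interned: u32) -> Option<Command> {
        if interned & ACTIVE_TAG == 0 {
            Some(Command(interned))
        } else {
            None
        }
    }

    pub fn active_char(c: char) -> Command {
        // A `char` is at most 0x10FFFF, so the tag bit never collides.
        Command(ACTIVE_TAG | c as u32)
    }

    pub fn kind(self) -> CommandKind {
        if self.0 & ACTIVE_TAG == 0 {
            CommandKind::ControlSequence(self.0)
        } else {
            // Only `active_char` sets the tag bit, so the payload is a valid `char`.
            CommandKind::ActiveChar(char::from_u32(self.0 & !ACTIVE_TAG).unwrap())
        }
    }
}
```

Because `Command` derives `Hash` and `Eq`, it could be used directly as the key of the commands map, and it stays 4 bytes wide, the same as a bare interned index.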

Codes

  • Implement \mathcode. This just seems to be a register-style variable where the indices are characters and the values are integers (but need to check the integer bounds).
  • Implement \sfcode. Seems similar to \mathcode.
  • Implement \delcode. Seems similar to \mathcode.
  • Implement \mathchardef. Seems similar to \chardef and will require a new command variant.
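As a rough sketch of the character-indexed code tables above: one table type with a per-table valid range. `CodeTable` and the constructor functions are hypothetical names. The bounds shown (0 to 0x8000 for \mathcode, 0 to 0x7FFF for \sfcode, anything up to 0xFFFFFF including negatives for \delcode) are my reading of TeX82 and still need to be checked against the Pascal source.

```rust
use std::collections::HashMap;
use std::ops::RangeInclusive;

/// Hypothetical character-indexed table of integer codes.
pub struct CodeTable {
    name: &'static str,
    valid: RangeInclusive<i32>,
    values: HashMap<char, i32>,
}

impl CodeTable {
    pub fn new(name: &'static str, valid: RangeInclusive<i32>) -> Self {
        CodeTable { name, valid, values: HashMap::new() }
    }

    /// Stores `value` for `c`, rejecting values outside the table's range.
    pub fn set(&mut self, c: char, value: i32) -> Result<(), String> {
        if !self.valid.contains(&value) {
            return Err(format!("invalid \\{} value {}", self.name, value));
        }
        self.values.insert(c, value);
        Ok(())
    }

    /// Returns the stored code, or `None` if nothing was set explicitly.
    /// (Real TeX has per-table defaults; they are omitted here.)
    pub fn get(&self, c: char) -> Option<i32> {
        self.values.get(&c).copied()
    }
}

// Assumed bounds; verify against the Pascal source before relying on them.
pub fn mathcode_table() -> CodeTable {
    CodeTable::new("mathcode", 0..=0x8000)
}

pub fn sfcode_table() -> CodeTable {
    CodeTable::new("sfcode", 0..=0x7FFF)
}

pub fn delcode_table() -> CodeTable {
    CodeTable::new("delcode", i32::MIN..=0xFF_FFFF)
}
```

A failed `set` leaves the previous value untouched, so an out-of-range assignment can be reported as an error without corrupting the table.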

Registers

  • Implement \dimen and \dimendef. The main work here will be adding a new variable type "dimension", adding parsing logic, updating the math commands, stuff like that. The control sequences themselves will likely be trivial and will have the exact same implementation as \count and \countdef except with a different generic pattern on the component.
  • Implement \skip and \skipdef - same as dimension, except glue.
  • Implement \muskip and \muskipdef - same as dimension, except muglue.
  • Implement \toks and \toksdef - same as dimension, except token lists.
  • Implement \relax, trivial.
  • Implement \write and the prefix command \immediate.
  • Implement \string - turns a control sequence into a list of character tokens.
  • Implement \errmessage - like \message except errors out the VM.
  • Implement \csname and \endcsname - the inverse of \string essentially.
  • Implement the uppercase and lowercase transformation primitives: \uppercase, \lowercase, \lccode (a table mapping characters to their lowercase equivalents) and \uccode (the same for uppercase).
  • Implement \edef. This may be straightforward, but I read an annoying thing in the TeXbook about how, when this command reads expanded tokens, it doesn't expand \the. Will need to read the actual Pascal source to see what's really going on.
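To illustrate the point that \dimen, \skip, \muskip and \toks can share the \count implementation with a different generic parameter, here is a sketch of a register bank that is generic over the value type. `Registers`, `Dimen` and `Glue` are hypothetical names, and the glue type is simplified (no infinite stretch orders).

```rust
/// Hypothetical register bank shared by \count, \dimen, \skip, \toks, ...
pub struct Registers<T> {
    values: Vec<T>,
}

impl<T: Clone + Default> Registers<T> {
    /// Creates `len` registers, each holding the type's default value.
    pub fn new(len: usize) -> Self {
        Registers { values: vec![T::default(); len] }
    }

    pub fn get(&self, index: usize) -> Option<&T> {
        self.values.get(index)
    }

    pub fn set(&mut self, index: usize, value: T) -> Result<(), String> {
        match self.values.get_mut(index) {
            Some(slot) => {
                *slot = value;
                Ok(())
            }
            None => Err(format!("register index {index} out of range")),
        }
    }
}

/// Dimension in scaled points (65536sp = 1pt), as in TeX.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct Dimen(pub i32);

/// Simplified glue: finite stretch and shrink only.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct Glue {
    pub width: Dimen,
    pub stretch: Dimen,
    pub shrink: Dimen,
}
```

With this shape, `Registers<i32>` would back \count, `Registers<Dimen>` would back \dimen, `Registers<Glue>` would back \skip, and a `Registers<Vec<Token>>` (for whatever the token type is) would back \toks. The real work is in parsing each value type, not in the register storage.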

Parameters

There's essentially nothing to do here. This section sets default values for tens of parameters. These parameters would generally be implemented in Texcraft at the same time as the associated feature. For testing we could just implement them in a big throwaway component.

Font information

  • Implement \font. It seems that font is another type of command or maybe a variable; I'm not sure. It reads in a font file and then stores it in an internal data structure. There are places in TeX where a font is expected. Of course at some point the current font is selected, and presumably the typesetting algorithms plug into this.
  • Implement \skewchar - seems to operate on a thing defined by \font.
  • Perhaps implement \textfont, \scriptfont and \scriptscriptfont.

Macros for text, math and output

I don't see anything here that needs special handling; it's just a bunch of \defs.

Hyphenation

  • Implement \hyphenation and \patterns. These two commands read in certain patterns of input and use them to build a hyphenation data structure. This data structure is consulted by the line breaking algorithm if the initial attempt to line break fails.
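A small sketch of the \hyphenation half of this task: an exception dictionary mapping each word to its allowed break positions. `HyphenationExceptions` is a hypothetical name, and lowercasing with Rust's `to_lowercase` stands in for TeX's \lccode lookup. \patterns would need a separate structure and is not covered here.

```rust
use std::collections::HashMap;

/// Hypothetical store for words declared with \hyphenation{...}.
#[derive(Default)]
pub struct HyphenationExceptions {
    /// Lowercased word -> number of letters before each permitted hyphen.
    words: HashMap<String, Vec<usize>>,
}

impl HyphenationExceptions {
    /// Adds space-separated entries such as "ta-ble man-u-script".
    pub fn add(&mut self, input: &str) {
        for entry in input.split_whitespace() {
            let mut word = String::new();
            let mut breaks = Vec::new();
            let mut letters = 0;
            for c in entry.chars() {
                if c == '-' {
                    // Record how many letters precede this hyphen.
                    breaks.push(letters);
                } else {
                    word.extend(c.to_lowercase());
                    letters += 1;
                }
            }
            self.words.insert(word, breaks);
        }
    }

    /// Returns the permitted break positions, if the word was declared.
    pub fn lookup(&self, word: &str) -> Option<&[usize]> {
        self.words.get(&word.to_lowercase()).map(|v| v.as_slice())
    }
}
```

The line breaker would call `lookup` for a word only after its first pass fails, matching the description above.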

@jamespfennell
Owner Author

Just some more information on the \edef/\xdef situation in which \the is handled specially when reading tokens to define the macro. From looking at the Pascal code and experimenting, it seems the rule is the following: expansion happens normally except if the command to be expanded is \the and the target of the \the command is a token list variable. (I think the only example of a token list variable in Knuth's TeX is a token list register defined using \toks. But of course Texcraft will support this as a variable type in general.)

This hack shouldn't be too difficult to support because the logic will live entirely in the standard library rather than the VM. The special parser for \edef/\xdef can just use a special tag on \the.

This special parsing is also used for \message.
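The rule described above can be sketched with a toy expander. Everything here is hypothetical and heavily simplified: macros take no parameters, variables are looked up by name, \the is recognized by its name rather than by a tag, and category codes are ignored. The only point is the asymmetry: macro replacement text is pushed back onto the input and re-expanded, while the contents of a token list variable reached via \the go straight to the output.

```rust
use std::collections::{HashMap, VecDeque};

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Token {
    Char(char),
    Cs(String),
}

/// Toy interpreter state (all hypothetical).
#[derive(Default)]
pub struct State {
    pub macros: HashMap<String, Vec<Token>>,      // parameterless macros only
    pub token_lists: HashMap<String, Vec<Token>>, // token list variables
    pub integers: HashMap<String, i32>,           // integer variables
}

/// Expands the body of an \edef-style definition.
pub fn expand_edef_body(state: &State, body: Vec<Token>) -> Vec<Token> {
    let mut input: VecDeque<Token> = body.into();
    let mut output = Vec::new();
    while let Some(token) = input.pop_front() {
        let name = match token {
            Token::Char(_) => {
                output.push(token);
                continue;
            }
            Token::Cs(name) => name,
        };
        if name == "the" {
            match input.pop_front() {
                Some(Token::Cs(target)) => {
                    if let Some(list) = state.token_lists.get(&target) {
                        // Special case: copied verbatim, never re-expanded.
                        output.extend(list.iter().cloned());
                    } else if let Some(n) = state.integers.get(&target) {
                        output.extend(n.to_string().chars().map(Token::Char));
                    } else {
                        // Unknown target: the toy just keeps both tokens.
                        output.push(Token::Cs(name));
                        output.push(Token::Cs(target));
                    }
                }
                other => {
                    output.push(Token::Cs(name));
                    output.extend(other);
                }
            }
        } else if let Some(replacement) = state.macros.get(&name) {
            // Normal expansion: the replacement goes back on the input.
            for t in replacement.iter().rev() {
                input.push_front(t.clone());
            }
        } else {
            // Unexpandable control sequence: kept as is.
            output.push(Token::Cs(name));
        }
    }
    output
}
```

For example, with a macro `\a` defined as `x` and a token list variable holding `\a`, expanding the body `\a \the<toks>` yields the character `x` followed by an unexpanded `\a`.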

@jamespfennell
Owner Author

Correction: Knuth's TeX does have other token list variables. Examples: \output, \everypar.
