I've been working on this project on-and-off for 2 years in a kind of scattershot way, mostly doing projects that interest me like the recent serializable VMs work (#3). I think it would be interesting to change tack, and instead work on the large goal of making Texcraft able to parse the plain TeX format.
The format is essentially just a large TeX file and can be downloaded from CTAN. What makes it interesting is that it uses a lot of different TeX features, so supporting it necessarily means making a lot of progress on the project.
I've audited appendix A of the TeXBook, which describes the format, and come up with this list of tasks which seem necessary for it to work and for it to be testable. The tasks are ordered based on where they appear in plain.tex, so as more tasks are completed the Texcraft interpreter can get further in the file.
Preamble
^^ notation in the lexer (Figure out the ^^ situation #4).
Implement \chardef. This involves adding a new kind of command which just contains a character. In the main VM loop the character handler is invoked. In contexts outside of the loop the command does different things; e.g. it is interpreted as an integer. Plain TeX uses this as a more performant way of writing \def\one{1 }.
Support active characters everywhere. If there is not a performance cost, I think the best way to do this is for the token value type to have a dedicated Command variant which is itself either a control sequence or an active character. Then in many places where we accept a control sequence (like the commands map) we will instead accept a command. The catch here is that this will make the token type bigger and this could have performance implications. There are perhaps variations on this where we make the command type 32 bits wide and capable of holding either a control sequence or a Unicode code point.
Implement \message - this is trivial, just writes to the log.
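A minimal sketch of the 32-bit variation mentioned above, not Texcraft's actual types. It assumes control sequence names are interned to integer indices that fit in 31 bits; `CommandRef`, `CommandKind` and `ACTIVE_CHAR_TAG` are hypothetical names. Unicode scalar values need only 21 bits, so the top bit is free to act as a tag.

```rust
/// Hypothetical 32-bit handle naming either a control sequence or an active character.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct CommandRef(u32);

/// Top bit set means "active character"; clear means "control sequence".
const ACTIVE_CHAR_TAG: u32 = 1 << 31;

/// Decoded view of a `CommandRef`.
#[derive(Debug, PartialEq, Eq)]
pub enum CommandKind {
    /// Index into an (assumed) interner of control sequence names.
    ControlSequence(u32),
    ActiveChar(char),
}

impl CommandRef {
    /// Returns `None` if the interner index does not fit in 31 bits.
    pub fn control_sequence(index: u32) -> Option<CommandRef> {
        if index & ACTIVE_CHAR_TAG == 0 {
            Some(CommandRef(index))
        } else {
            None
        }
    }

    pub fn active_char(c: char) -> CommandRef {
        CommandRef(ACTIVE_CHAR_TAG | c as u32)
    }

    pub fn kind(self) -> CommandKind {
        if self.0 & ACTIVE_CHAR_TAG == 0 {
            CommandKind::ControlSequence(self.0)
        } else {
            // The payload was produced from a `char`, so it is a valid scalar value.
            let c = char::from_u32(self.0 & !ACTIVE_CHAR_TAG).expect("valid scalar value");
            CommandKind::ActiveChar(c)
        }
    }
}
```

With this layout the commands map could be keyed on `CommandRef` directly, and the handle stays the same size as a bare `u32` control sequence index. Whether that actually avoids the performance cost would need measuring.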
Codes
Implement \mathcode. This just seems to be a register-style variable where the indices are characters and the values are integers (but need to check the integer bounds).
Implement \sfcode. Seems similar to \mathcode.
Implement \delcode. Seems similar to \mathcode.
Implement \mathchardef. Seems similar to \chardef and will require a new command variant.
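The three code tables above could share one bounds-checked implementation, sketched here under stated assumptions. `CodeTable` and its methods are hypothetical names, not Texcraft APIs. The upper bounds (0x8000 for \mathcode, 0x7FFF for \sfcode, 0xFFFFFF for \delcode, with negative \delcode values allowed) are taken from a reading of TeX82 and should be checked against the Pascal source. The default values are simplified placeholders; IniTeX actually assigns some initial codes per character.

```rust
use std::collections::HashMap;

/// A table indexed by character whose integer values are restricted to `min..=max`.
pub struct CodeTable {
    min: i32,
    max: i32,
    default: i32,
    values: HashMap<char, i32>,
}

impl CodeTable {
    pub fn new(min: i32, max: i32, default: i32) -> CodeTable {
        CodeTable { min, max, default, values: HashMap::new() }
    }

    /// Assumed bounds for \mathcode: 0 to "8000.
    pub fn mathcode() -> CodeTable {
        CodeTable::new(0, 0x8000, 0)
    }

    /// Assumed bounds for \sfcode: 0 to "7FFF.
    pub fn sfcode() -> CodeTable {
        CodeTable::new(0, 0x7FFF, 1000)
    }

    /// Assumed bounds for \delcode: any negative value up to "FFFFFF.
    pub fn delcode() -> CodeTable {
        CodeTable::new(i32::MIN, 0xFF_FFFF, -1)
    }

    /// Stores the value, or returns an error and leaves the table unchanged.
    pub fn set(&mut self, c: char, value: i32) -> Result<(), String> {
        if value < self.min || value > self.max {
            return Err(format!(
                "invalid code ({}), should be in the range {}..={}",
                value, self.min, self.max
            ));
        }
        self.values.insert(c, value);
        Ok(())
    }

    pub fn get(&self, c: char) -> i32 {
        self.values.get(&c).copied().unwrap_or(self.default)
    }
}
```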
Registers
Implement \dimen and \dimendef. The main work here will be adding a new variable type "dimension", adding parsing logic, updating the math commands, stuff like that. The control sequences themselves will likely be trivial and will have the exact same implementation as \count and \countdef except with a different generic pattern on the component.
Implement \skip and \skipdef - same as dimension, except glue.
Implement \muskip and \muskipdef - same as dimension, except muglue.
Implement \toks and \toksdef - same as dimension, except token lists.
Implement \relax, trivial.
Implement \write and the prefix command \immediate.
Implement \string - turns a control sequence into a list of character tokens.
Implement \errmessage - like \message except errors out the VM.
Implement \csname and \endcsname - the inverse of \string essentially.
Implement the uppercase and lowercase transformation primitives: \uppercase, \lowercase, \lccode (a table mapping letters to their lowercase equivalents) and \uccode (the same for uppercase).
Implement \edef. This may be straightforward but I read an annoying thing in the TeXBook about how when this command reads expanded tokens it doesn't expand \the. Will need to read the actual Pascal source to see what's really going on.
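To illustrate the point in the register tasks above that \dimen, \skip, \muskip and \toks can reuse the \count implementation with a different type parameter, here is a rough sketch. `Registers`, `Dimen` and `TokenList` are hypothetical stand-ins rather than Texcraft's real component types, and the bank size of 256 follows Knuth's TeX.

```rust
/// A bank of 256 registers holding values of type `T`.
pub struct Registers<T> {
    values: Vec<T>,
}

impl<T: Clone + Default> Registers<T> {
    pub fn new() -> Self {
        Registers { values: vec![T::default(); 256] }
    }

    /// A `u8` index can never be out of range for a bank of 256 registers.
    pub fn get(&self, index: u8) -> &T {
        &self.values[index as usize]
    }

    pub fn set(&mut self, index: u8, value: T) {
        self.values[index as usize] = value;
    }
}

/// Stand-in for a dimension, stored in scaled points.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct Dimen(pub i32);

/// Stand-in for a token list; a real one would hold tokens, not strings.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct TokenList(pub Vec<String>);
```

Under this shape the per-type work is the value type itself and its parser, which matches the expectation that the control sequences will be the trivial part.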
Parameters
There's essentially nothing to do here. This section sets default values for tens of parameters. These parameters would generally be implemented in Texcraft at the same time as the associated feature. For testing we could just implement them in a big throwaway component.
Font information
Implement \font. It seems that font is another type of command or maybe a variable; I'm not sure. It reads in a font file and then stores it in an internal data structure. There are places in TeX where a font is expected. Of course at some point the current font is selected, and presumably the typesetting algorithms plug into this.
Implement \skewchar - seems to operate on a thing defined by \font.
Perhaps implement \textfont, \scriptfont and \scriptscriptfont.
Macros for text, math and output
I don't see anything here that needs special handling, it's just a bunch of \defs.
Hyphenation
Implement \hyphenation and \patterns. These two commands read in certain patterns of input and use them to form a hyphenation data structure. This data structure is consulted by the line breaking algorithm if the initial attempt to line break fails.
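As a rough idea of the simpler half of that data structure, the sketch below parses \hyphenation-style exception words such as ta-ble into a map from the word to its allowed break positions. The function names are hypothetical, the input is assumed to be already-expanded text separated by whitespace, and ASCII lowercasing stands in for the \lccode lookup that TeX itself uses. The \patterns trie is not covered.

```rust
use std::collections::HashMap;

/// Splits a spec like "ta-ble" into the bare word and the break positions,
/// where each position is the number of letters before the hyphen.
pub fn parse_exception(spec: &str) -> (String, Vec<usize>) {
    let mut word = String::new();
    let mut breaks = Vec::new();
    let mut letters = 0;
    for c in spec.chars() {
        if c == '-' {
            breaks.push(letters);
        } else {
            // Simplification: real TeX maps each character through \lccode.
            word.push(c.to_ascii_lowercase());
            letters += 1;
        }
    }
    (word, breaks)
}

/// Builds the exception dictionary from whitespace-separated specs.
pub fn load_exceptions(input: &str) -> HashMap<String, Vec<usize>> {
    input.split_whitespace().map(parse_exception).collect()
}
```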
Just some more information on the \edef/\xdef situation in which \the is handled specially when reading tokens to define the macro. From looking at the Pascal code and experimenting it seems the rule is the following: expansion happens normally except if the command to be expanded is \the and the target of the \the command is a token list variable. (I think the only example of a token list variable in Knuth's TeX is a token list register defined using \toks. But of course Texcraft will support this as a variable type in general.)
This hack shouldn't be too difficult to support because the logic will live entirely in the standard library rather than the VM. The special parser for \edef/\xdef can just use a special tag on \the.
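The rule can be illustrated with a toy model, which is only a sketch and not how Texcraft represents tokens. The `Token` enum is hypothetical: macros carry their replacement text inline and take no parameters, and `\the` is modelled as already resolved to its target. The only point shown is that the result of `\the` on a token list variable goes straight to the output, while every other expansion result is read again.

```rust
#[derive(Clone, Debug, PartialEq)]
pub enum Token {
    Char(char),
    /// A parameterless macro with its replacement text stored inline.
    Macro(Vec<Token>),
    /// `\the` applied to a token list variable holding these tokens.
    TheTokenList(Vec<Token>),
    /// `\the` applied to an integer variable.
    TheInt(i32),
}

/// Expands the body of an \edef-style definition in this toy model.
pub fn expand_for_edef(input: Vec<Token>) -> Vec<Token> {
    let mut output = Vec::new();
    // Stack of unread tokens; the next token to read is at the end.
    let mut pending: Vec<Token> = input.into_iter().rev().collect();
    while let Some(token) = pending.pop() {
        match token {
            Token::Char(c) => output.push(Token::Char(c)),
            // Normal expansion: the result is pushed back and read again.
            Token::Macro(body) => pending.extend(body.into_iter().rev()),
            Token::TheInt(n) => {
                pending.extend(n.to_string().chars().rev().map(Token::Char))
            }
            // Special case: the contents are copied without further expansion.
            Token::TheTokenList(contents) => output.extend(contents),
        }
    }
    output
}
```

For example, a body consisting of a macro expanding to `a\the\count0` followed by `\the\toks0`, where `\toks0` holds an unexpanded macro, comes out as the characters of the first part followed by that macro left intact.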