From 6aa9f104c1d1f29b37771931d76a467a6e3b2a35 Mon Sep 17 00:00:00 2001 From: Bill Hails Date: Sat, 2 Nov 2024 14:07:23 +0000 Subject: [PATCH] fixed bug in scanner --- docs/V2.md | 90 +++++----- fn/wonderful-life.fn | 2 +- src/pratt.yaml | 1 + src/pratt_scanner.c | 408 +++++++++++-------------------------------- 4 files changed, 152 insertions(+), 349 deletions(-) diff --git a/docs/V2.md b/docs/V2.md index 9788fbc6..f39690ac 100644 --- a/docs/V2.md +++ b/docs/V2.md @@ -1,31 +1,32 @@ # CEKF Version 2 - Bytecode -Benchmarks so far have been encouraging, the `fib(35)` test with `-O2` now -takes around 5.5 seconds, but `fib(40)` still takes around 54 seconds, +Benchmarks so far have been encouraging, the `fib(35)` test with `-O2` +now takes around 5.5 seconds, but `fib(40)` still takes around 54 seconds, while Nystrom's stack-based bytecode interpreter can do that calculation -in around 5 seconds. Of course this is due to using environments instead of -a stack, and walking trees instead of stepping through bytecode. +in around 5 seconds. Of course this is due to using environments instead +of a stack, and walking trees instead of stepping through bytecode. -Another factor is that the entire AST needs to be protected, and must be -marked every time a garbage collection occurs. +Another factor is that the entire AST needs to be protected, and must +be marked every time a garbage collection occurs. -So how difficult would it be to convert the AST to bytecode and use -a local stack? It turns out to be not so hard. Of course we still need +So how difficult would it be to convert the AST to bytecode and use a +local stack? It turns out to be not so hard. Of course we still need environments, because of closures that capture them, and it's quite -possible that version 2 will actually be **slower** initially, but -I'll discuss a possible version 2.1 later that I hope will fix that. 
+possible that version 2 will actually be **slower** initially, but I'll +discuss a possible version 2.1 later that I hope will fix that. Anyway let's review the math from version 1 and see what the bytecode equivalent might look like. The new machine no longer has any $\mathcal{A}$, $applyproc$ or -$applykont$ functions, as they are subsumed into the general $step$ function. -However the basic structure and discussion is the same. +$applykont$ functions, as they are subsumed into the general $step$ +function. However the basic structure and discussion is the same. -One big difference however is of course that since there is no longer an AST, -the C regiter is now an index into an array of bytecodes. +One big difference however is of course that since there is no longer +an AST, the C regiter is now an index into an array of bytecodes. -I'll present the original math for each step, then its bytecode equivalent. +I'll present the original math for each step, then its bytecode +equivalent. > CAVEAT - NONE OF THIS IS TESTED YET. Please don't rush to implement and then blame me if > it doesn't work, If/when I get it working I'll update this document. @@ -37,11 +38,12 @@ I'll present the original math for each step, then its bytecode equivalent. ## Internal Byteodes -These first few bytecodes are the equivalent of the old $\mathcal{A}$ function, -the interpreter is stepping through the code encountering these expressions -and not changing the overall machine state, just the stack. It will continue to -iterate, incrementing the address pointer appropriately, until it hits a -state-changing bytecode. State changing bytecodes are discussed in a later section. +These first few bytecodes are the equivalent of the old $\mathcal{A}$ +function, the interpreter is stepping through the code encountering +these expressions and not changing the overall machine state, just the +stack. 
It will continue to iterate, incrementing the address pointer +appropriately, until it hits a state-changing bytecode. State changing +bytecodes are discussed in a later section. ### Variables @@ -59,16 +61,18 @@ bytecode for that: `\| VAR \| frame \| offset \|` | `push(lookup(frame, offset, env))` | -The bytecode consiste of three bytes, a `VAR` tag that identifies that a variable is coming -up, then a byte for its frame and a byte for its offset in the frame (see [Lexical -Addressing](LEXICAL_ADDRESSING.md) for details if you haven't already.) +The bytecode consists of three bytes, a `VAR` tag that identifies that +a variable is coming up, then a byte for its frame and a byte for its +offset in the frame (see [Lexical Addressing](LEXICAL_ADDRESSING.md) +for details if you haven't already.) -On seeing that, the bytecode interpreter does the lookup, and pushes the result onto the stack. +On seeing that, the bytecode interpreter does the lookup, and pushes +the result onto the stack. `env` here is the $\rho$ argument to the old $\mathcal{A}$ function. Note that evaluating an `aexp` always has a stack cost of 1, e.g. there -is always one more element on the stack after evaluation an `aexp`. +is always one more element on the stack after evaluation of an `aexp`. ### Constants @@ -102,13 +106,15 @@ $$ | bytecode | action | |----------|--------| -| `\| LAM \| nvar \| addr(after exp) \| ..exp.. \|` | `push(clo(nvar, addr(exp), env);` | +| `\| LAM \| nvar \| addr(after exp) \| ..exp.. \|` | `push(clo(nvar, addr(exp), env));` | -Where `addr(exp)` is the index of the `exp` in the bytecode array. +Where `addr(exp)` is the index of the `exp` in the bytecode array, and +`addr(after exp)` tells the bytecode interpreter where to resume after +pushing the closure. -Note the absence of explicit variable names. Again because of -lexical addressing the only thing the closure needs to know is the size -of the environment. +Note the absence of explicit variable names. 
Again because of lexical +addressing the only thing the closure needs to know is the size of +the environment. ### Primitives @@ -140,17 +146,17 @@ Consider a primitive sequence like `2 + 3 * 4`. That will parse to ```mermaid flowchart TD -plus(+) +plus(plus) plus ---- two(2) -plus --- times(×) +plus --- times(times) times --- three(3) times --- four(4) ``` -There are various ways to print out that tree, for example for each -(non-terminal) node, printing the left hand branch, then printing the -operation, then printing the right hand branch would recover the infix -notation we started with. Howver if instead we print the left-hand +There are various ways to traverse that tree, for example for each +(non-terminal) node, visiting the left hand branch, then visiting the +operation, then visiting the right hand branch would recover the infix +notation we started with. However if instead we visit the left-hand branch, then the right-hand branch, then the operation, we end up with reverse polish notation: `2 3 4 * +` which is exactly the order we need to evaluate the expressions: @@ -161,18 +167,18 @@ to evaluate the expressions: * pop the 3 and the 4, multiply them and push the result 12. * pop the 2 and the 12, add them and push the result 14. -Note that the entire operation has a stack cost of 1, preserving -that invariant. +Note that the entire operation has a stack cost of 1, preserving that +invariant. ## State Changing Bytecodes -The rest of these situations change the overall state of the machine, corresponding to -$step$ returning a new state. +The rest of these situations change the overall state of the machine, +corresponding to $step$ returning a new state. 
### Function calls -For function calls, `step` first evaluates the function, -then the arguments, then it applies the function: +For function calls, `step` first evaluates the function, then the +arguments, then it applies the function: $$ step(\mathtt{(aexp_0\ aexp_1\dots aexp_n)}, \rho, \kappa, f) = applyproc(proc,\langle val_1,\dots val_n\rangle, \kappa, f) diff --git a/fn/wonderful-life.fn b/fn/wonderful-life.fn index f06d2f72..154e41a4 100644 --- a/fn/wonderful-life.fn +++ b/fn/wonderful-life.fn @@ -58,4 +58,4 @@ let puts("}\n"); } in - printTree(generateTree(0.640)) + printTree(generateTree(0.64046)) diff --git a/src/pratt.yaml b/src/pratt.yaml index c27a665e..7dab6532 100644 --- a/src/pratt.yaml +++ b/src/pratt.yaml @@ -97,6 +97,7 @@ enums: - START - STR - ESC + - ESCS - UNI - CHR1 - CHR2 diff --git a/src/pratt_scanner.c b/src/pratt_scanner.c index 36d819b6..3c6cad63 100644 --- a/src/pratt_scanner.c +++ b/src/pratt_scanner.c @@ -31,287 +31,62 @@ # include "debugging_off.h" #endif -HashSymbol *TOK_MACRO(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("macro"); - return s; -} - -HashSymbol *TOK_LEFT(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("left"); - return s; -} - -HashSymbol *TOK_RIGHT(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("right"); - return s; -} - -HashSymbol *TOK_PREFIX(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("prefix"); - return s; -} - -HashSymbol *TOK_INFIX(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("infix"); - return s; -} - -HashSymbol *TOK_POSTFIX(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("postfix"); - return s; -} - -HashSymbol *TOK_KW_NUMBER(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("number"); - return s; -} - -HashSymbol *TOK_BACK(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("back"); - return s; -} - -HashSymbol *TOK_SWITCH(void) 
{ - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("switch"); - return s; -} - -HashSymbol *TOK_ASSERT(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("assert"); - return s; -} - -HashSymbol *TOK_KW_CHAR(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("char"); - return s; -} - -HashSymbol *TOK_IF(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("if"); - return s; -} - -HashSymbol *TOK_ELSE(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("else"); - return s; -} - -HashSymbol *TOK_PIPE(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("|"); - return s; -} - -HashSymbol *TOK_WILDCARD(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("_"); - return s; -} - -HashSymbol *TOK_LCURLY(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("{"); - return s; -} - -HashSymbol *TOK_RCURLY(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("}"); - return s; -} - -HashSymbol *TOK_LSQUARE(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("["); - return s; -} - -HashSymbol *TOK_RSQUARE(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("]"); - return s; -} - -HashSymbol *TOK_ATOM(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol(" ATOM"); // tokens with leading spaces are internal to the parser - return s; -} - -HashSymbol *TOK_NUMBER(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol(" NUMBER"); - return s; -} - -HashSymbol *TOK_EOF(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol(" EOF"); - return s; -} - -HashSymbol *TOK_STRING(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol(" STRING"); - return s; -} - -HashSymbol *TOK_ERROR(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol(" ERROR"); - return s; -} - -HashSymbol *TOK_CHAR(void) { - static HashSymbol *s = NULL; - if (s 
== NULL) s = newSymbol(" CHAR"); - return s; -} - -HashSymbol *TOK_TUPLE(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("#("); - return s; -} - -HashSymbol *TOK_OPEN(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("("); - return s; -} - -HashSymbol *TOK_CLOSE(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol(")"); - return s; -} - -HashSymbol *TOK_COMMA(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol(","); - return s; -} - -HashSymbol *TOK_ARROW(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("->"); - return s; -} - -HashSymbol *TOK_ASSIGN(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("="); - return s; -} - -HashSymbol *TOK_COLON(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol(":"); - return s; -} - -HashSymbol *TOK_HASH(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("#"); - return s; -} - -HashSymbol *TOK_BANG(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("!"); - return s; -} - -HashSymbol *TOK_PERIOD(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("."); - return s; -} - -HashSymbol *TOK_LET(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("let"); - return s; -} - -HashSymbol *TOK_IN(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("in"); - return s; -} - -HashSymbol *TOK_NAMESPACE(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("namespace"); - return s; -} - -HashSymbol *TOK_KW_ERROR(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("error"); - return s; -} - -HashSymbol *TOK_TYPEDEF(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("typedef"); - return s; -} - -HashSymbol *TOK_UNSAFE(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("unsafe"); - return s; -} - -HashSymbol *TOK_FN(void) { - static HashSymbol *s = NULL; - 
if (s == NULL) s = newSymbol("fn"); - return s; -} - -HashSymbol *TOK_LINK(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("link"); - return s; -} - -HashSymbol *TOK_AS(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("as"); - return s; -} - -HashSymbol *TOK_ALIAS(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("alias"); - return s; -} - -HashSymbol *TOK_SEMI(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol(";"); - return s; -} - -HashSymbol *TOK_PRINT(void) { - static HashSymbol *s = NULL; - if (s == NULL) s = newSymbol("print"); - return s; -} +#define TOKFN(name, string) \ +HashSymbol *TOK_ ## name(void) { \ + static HashSymbol *s = NULL; \ + if (s == NULL) s = newSymbol(string); \ + return s; \ +} + +TOKFN(MACRO,"macro") +TOKFN(LEFT,"left") +TOKFN(RIGHT,"right") +TOKFN(PREFIX,"prefix") +TOKFN(INFIX,"infix") +TOKFN(POSTFIX,"postfix") +TOKFN(KW_NUMBER,"number") +TOKFN(BACK,"back") +TOKFN(SWITCH,"switch") +TOKFN(ASSERT,"assert") +TOKFN(KW_CHAR,"char") +TOKFN(IF,"if") +TOKFN(ELSE,"else") +TOKFN(PIPE,"|") +TOKFN(WILDCARD,"_") +TOKFN(LCURLY,"{") +TOKFN(RCURLY,"}") +TOKFN(LSQUARE,"[") +TOKFN(RSQUARE,"]") +TOKFN(ATOM," ATOM") // tokens with leading spaces are internal to the parser +TOKFN(NUMBER," NUMBER") +TOKFN(EOF," EOF") +TOKFN(STRING," STRING") +TOKFN(ERROR," ERROR") +TOKFN(CHAR," CHAR") +TOKFN(TUPLE,"#(") +TOKFN(OPEN,"(") +TOKFN(CLOSE, ")") +TOKFN(COMMA, ",") +TOKFN(ARROW, "->") +TOKFN(ASSIGN, "=") +TOKFN(COLON, ":") +TOKFN(HASH, "#") +TOKFN(BANG, "!") +TOKFN(PERIOD, ".") +TOKFN(LET, "let") +TOKFN(IN, "in") +TOKFN(NAMESPACE, "namespace") +TOKFN(KW_ERROR, "error") +TOKFN(TYPEDEF, "typedef") +TOKFN(UNSAFE, "unsafe") +TOKFN(FN, "fn") +TOKFN(LINK, "link") +TOKFN(AS, "as") +TOKFN(ALIAS, "alias") +TOKFN(SEMI, ";"); +TOKFN(PRINT, "print") + +#undef TOKFN static bool isALPHA(char c) { return isalpha(c) || c == '_'; @@ -769,6 +544,7 @@ static PrattToken *parseString(PrattParser *parser, 
bool single, char sep) { state = PRATTSTRINGSTATE_TYPE_STR; break; case PRATTSTRINGSTATE_TYPE_STR: + case PRATTSTRINGSTATE_TYPE_ESCS: DEBUG("parseString %s %d (sep %c) STR: %c", lexer->bufList->filename->name, lexer->bufList->lineno, sep, buffer->start[buffer->length]); if (isTwoByteUtf8(buffer->start[buffer->length])) { pushPrattUTF8(string, buffer->start[buffer->length]); @@ -785,32 +561,53 @@ static PrattToken *parseString(PrattParser *parser, bool single, char sep) { } else if (isTrailingByteUtf8(buffer->start[buffer->length])) { parserError(parser, "Malformed UTF8"); ++buffer->length; - } else if (buffer->start[buffer->length] == sep) { - if (single) { - parserError(parser, "empty char"); - } - ++buffer->length; - state = PRATTSTRINGSTATE_TYPE_END; } else { - switch (buffer->start[buffer->length]) { - case '\\': - ++buffer->length; - state = PRATTSTRINGSTATE_TYPE_ESC; - break; - case '\n': - parserError(parser, "unexpected EOL"); + if (state == PRATTSTRINGSTATE_TYPE_STR) { + if (buffer->start[buffer->length] == sep) { + if (single) { + parserError(parser, "empty char"); + } ++buffer->length; - ++lexer->bufList->lineno; - break; - case '\0': - parserError(parser, "unexpected EOF"); state = PRATTSTRINGSTATE_TYPE_END; - break; - default: - pushPrattUTF8(string, buffer->start[buffer->length]); - ++buffer->length; - state = single ? PRATTSTRINGSTATE_TYPE_CHR1 : PRATTSTRINGSTATE_TYPE_STR; - break; + } else { + switch (buffer->start[buffer->length]) { + case '\\': + ++buffer->length; + state = PRATTSTRINGSTATE_TYPE_ESC; + break; + case '\n': + parserError(parser, "unexpected EOL"); + ++buffer->length; + ++lexer->bufList->lineno; + break; + case '\0': + parserError(parser, "unexpected EOF"); + state = PRATTSTRINGSTATE_TYPE_END; + break; + default: + pushPrattUTF8(string, buffer->start[buffer->length]); + ++buffer->length; + state = single ? 
PRATTSTRINGSTATE_TYPE_CHR1 : PRATTSTRINGSTATE_TYPE_STR; + break; + } + } + } else { // PRATTSTRINGSTATE_TYPE_ESCS + switch (buffer->start[buffer->length]) { + case '\n': + parserError(parser, "unexpected EOL"); + ++buffer->length; + ++lexer->bufList->lineno; + break; + case '\0': + parserError(parser, "unexpected EOF"); + state = PRATTSTRINGSTATE_TYPE_END; + break; + default: + pushPrattUTF8(string, buffer->start[buffer->length]); + ++buffer->length; + state = single ? PRATTSTRINGSTATE_TYPE_CHR1 : PRATTSTRINGSTATE_TYPE_STR; + break; + } } } break; @@ -877,8 +674,7 @@ static PrattToken *parseString(PrattParser *parser, bool single, char sep) { state = PRATTSTRINGSTATE_TYPE_END; break; default: - ++buffer->length; - state = single ? PRATTSTRINGSTATE_TYPE_CHR1 : PRATTSTRINGSTATE_TYPE_STR; + state = PRATTSTRINGSTATE_TYPE_ESCS; } break; case PRATTSTRINGSTATE_TYPE_UNI:
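
Note on the scanner change: the last two hunks route an escaped character through a new `ESCS` state instead of skipping it in `ESC`. In `ESCS` the character goes through the same push logic as `STR`, but the separator and the backslash lose their special meaning. The following is a minimal standalone sketch of that idea, not the code in `src/pratt_scanner.c`. The names `ScanState`, `ST_*` and `scanString` are hypothetical, UTF-8 and the `single` (character literal) handling are omitted, and the `\n` escape is assumed purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified states; the real scanner has more (UNI, CHR1, ...). */
typedef enum { ST_STR, ST_ESC, ST_ESCS, ST_END } ScanState;

/*
 * Copy the body of a quoted literal into `out`. `in` points just after the
 * opening separator. Returns true if the closing separator was found.
 */
static bool scanString(const char *in, char sep, char *out, size_t outSize) {
    assert(in != NULL && out != NULL && outSize > 0);
    ScanState state = ST_STR;
    size_t pos = 0, len = 0;
    bool ok = true;
    while (state != ST_END) {
        char c = in[pos];
        switch (state) {
            case ST_STR:
            case ST_ESCS:
                if (c == '\0') {                 /* unexpected end of input */
                    ok = false;
                    state = ST_END;
                } else if (state == ST_STR && c == sep) {
                    ++pos;                       /* closing separator */
                    state = ST_END;
                } else if (state == ST_STR && c == '\\') {
                    ++pos;                       /* start of an escape */
                    state = ST_ESC;
                } else {
                    /* ordinary char, or any char at all when in ST_ESCS */
                    if (len + 1 < outSize) out[len++] = c;
                    ++pos;
                    state = ST_STR;
                }
                break;
            case ST_ESC:
                if (c == 'n') {                  /* assumed escape, for illustration */
                    if (len + 1 < outSize) out[len++] = '\n';
                    ++pos;
                    state = ST_STR;
                } else {
                    /* do not consume: let ST_ESCS push the char literally */
                    state = ST_ESCS;
                }
                break;
            case ST_END:
                break;
        }
    }
    out[len] = '\0';
    return ok;
}
```

With this shape, scanning `a\"b"` with `"` as the separator yields `a"b`: the escaped quote is pushed in `ST_ESCS` rather than ending the string or being dropped.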