From 6aa9f104c1d1f29b37771931d76a467a6e3b2a35 Mon Sep 17 00:00:00 2001
From: Bill Hails <billhails2014@gmail.com>
Date: Sat, 2 Nov 2024 14:07:23 +0000
Subject: [PATCH] fixed scanner bug that dropped escaped characters in strings

---
 docs/V2.md           |  90 +++++-----
 fn/wonderful-life.fn |   2 +-
 src/pratt.yaml       |   1 +
 src/pratt_scanner.c  | 408 +++++++++++--------------------------------
 4 files changed, 152 insertions(+), 349 deletions(-)

diff --git a/docs/V2.md b/docs/V2.md
index 9788fbc6..f39690ac 100644
--- a/docs/V2.md
+++ b/docs/V2.md
@@ -1,31 +1,32 @@
 # CEKF Version 2 - Bytecode
 
-Benchmarks so far have been encouraging, the `fib(35)` test with `-O2` now
-takes around 5.5 seconds, but `fib(40)` still takes around 54 seconds,
+Benchmarks so far have been encouraging, the `fib(35)` test with `-O2`
+now takes around 5.5 seconds, but `fib(40)` still takes around 54 seconds,
 while Nystrom's stack-based bytecode interpreter can do that calculation
-in around 5 seconds. Of course this is due to using environments instead of
-a stack, and walking trees instead of stepping through bytecode.
+in around 5 seconds. Of course this is due to using environments instead
+of a stack, and walking trees instead of stepping through bytecode.
 
-Another factor is that the entire AST needs to be protected, and must be
-marked every time a garbage collection occurs.
+Another factor is that the entire AST needs to be protected, and must
+be marked every time a garbage collection occurs.
 
-So how difficult would it be to convert the AST to bytecode and use
-a local stack? It turns out to be not so hard. Of course we still need
+So how difficult would it be to convert the AST to bytecode and use a
+local stack? It turns out to be not so hard. Of course we still need
 environments, because of closures that capture them, and it's quite
-possible that version 2 will actually be **slower** initially, but
-I'll discuss a possible version 2.1 later that I hope will fix that.
+possible that version 2 will actually be **slower** initially, but I'll
+discuss a possible version 2.1 later that I hope will fix that.
 
 Anyway let's review the math from version 1 and see what the bytecode
 equivalent might look like.
 
 The new machine no longer has any $\mathcal{A}$, $applyproc$ or
-$applykont$ functions, as they are subsumed into the general $step$ function.
-However the basic structure and discussion is the same.
+$applykont$ functions, as they are subsumed into the general $step$
+function. However, the basic structure and discussion are the same.
 
-One big difference however is of course that since there is no longer an AST,
-the C regiter is now an index into an array of bytecodes.
+One big difference however is of course that since there is no longer
+an AST, the C register is now an index into an array of bytecodes.
 
-I'll present the original math for each step, then its bytecode equivalent.
+I'll present the original math for each step, then its bytecode
+equivalent.
 
 > CAVEAT - NONE OF THIS IS TESTED YET. Please don't rush to implement and then blame me if
 > it doesn't work, If/when I get it working I'll update this document.
@@ -37,11 +38,12 @@ I'll present the original math for each step, then its bytecode equivalent.
 
 ## Internal Byteodes
 
-These first few bytecodes are the equivalent of the old $\mathcal{A}$ function,
-the interpreter is stepping through the code encountering these expressions
-and not changing the overall machine state, just the stack. It will continue to
-iterate, incrementing the address pointer appropriately, until it hits a
-state-changing bytecode. State changing bytecodes are discussed in a later section.
+These first few bytecodes are the equivalent of the old $\mathcal{A}$
+function, the interpreter is stepping through the code encountering
+these expressions and not changing the overall machine state, just the
+stack. It will continue to iterate, incrementing the address pointer
+appropriately, until it hits a state-changing bytecode. State changing
+bytecodes are discussed in a later section.
 
 ### Variables
 
@@ -59,16 +61,18 @@ bytecode for that:
 `\| VAR \| frame \| offset \|` | `push(lookup(frame, offset, env))` |
 
 
-The bytecode consiste of three bytes, a `VAR` tag that identifies that a variable is coming
-up, then a byte for its frame and a byte for its offset in the frame (see [Lexical
-Addressing](LEXICAL_ADDRESSING.md) for details if you haven't already.)
+The bytecode consists of three bytes, a `VAR` tag that identifies that
+a variable is coming up, then a byte for its frame and a byte for its
+offset in the frame (see [Lexical Addressing](LEXICAL_ADDRESSING.md)
+for details if you haven't already.)
 
-On seeing that, the bytecode interpreter does the lookup, and pushes the result onto the stack.
+On seeing that, the bytecode interpreter does the lookup, and pushes
+the result onto the stack.
 
 `env` here is the $\rho$ argument to the old $\mathcal{A}$ function.
 
 Note that evaluating an `aexp` always has a stack cost of 1, e.g. there
-is always one more element on the stack after evaluation an `aexp`.
+is always one more element on the stack after evaluation of an `aexp`.
 
 ### Constants
 
@@ -102,13 +106,15 @@ $$
 
 | bytecode | action |
 |----------|--------|
-| `\| LAM \| nvar \| addr(after exp) \| ..exp.. \|` | `push(clo(nvar, addr(exp), env);` |
+| `\| LAM \| nvar \| addr(after exp) \| ..exp.. \|` | `push(clo(nvar, addr(exp), env));` |
 
-Where `addr(exp)` is the index of the `exp` in the bytecode array.
+Where `addr(exp)` is the index of the `exp` in the bytecode array, and
+`addr(after exp)` tells the bytecode interpreter where to resume after
+pushing the closure.
 
-Note the absence of explicit variable names. Again because of
-lexical addressing the only thing the closure needs to know is the size
-of the environment.
+Note the absence of explicit variable names. Again because of lexical
+addressing the only thing the closure needs to know is the size of
+the environment.
 
 ### Primitives
 
@@ -140,17 +146,17 @@ Consider a primitive sequence like `2 + 3 * 4`. That will parse to
 
 ```mermaid
 flowchart TD
-plus(+)
+plus(plus)
 plus ---- two(2)
-plus --- times(&times;)
+plus --- times(times)
 times --- three(3)
 times --- four(4)
 ```
 
-There are various ways to print out that tree, for example for each
-(non-terminal) node, printing the left hand branch, then printing the
-operation, then printing the right hand branch would recover the infix
-notation we started with. Howver if instead we print the left-hand
+There are various ways to traverse that tree, for example for each
+(non-terminal) node, visiting the left hand branch, then visiting the
+operation, then visiting the right hand branch would recover the infix
+notation we started with. However if instead we visit the left-hand
 branch, then the right-hand branch, then the operation, we end up with
 reverse polish notation: `2 3 4 * +` which is exactly the order we need
 to evaluate the expressions:
@@ -161,18 +167,18 @@ to evaluate the expressions:
 * pop the 3 and the 4, multiply them and push the result 12.
 * pop the 2 and the 12, add them and push the result 14.
 
-Note that the entire operation has a stack cost of 1, preserving
-that invariant.
+Note that the entire operation has a stack cost of 1, preserving that
+invariant.
 
 ## State Changing Bytecodes
 
-The rest of these situations change the overall state of the machine, corresponding to
-$step$ returning a new state.
+The rest of these situations change the overall state of the machine,
+corresponding to $step$ returning a new state.
 
 ### Function calls
 
-For function calls, `step` first evaluates the function,
-then the arguments, then it applies the function:
+For function calls, `step` first evaluates the function, then the
+arguments, then it applies the function:
 
 $$
 step(\mathtt{(aexp_0\ aexp_1\dots aexp_n)}, \rho, \kappa, f) = applyproc(proc,\langle val_1,\dots val_n\rangle, \kappa, f)
diff --git a/fn/wonderful-life.fn b/fn/wonderful-life.fn
index f06d2f72..154e41a4 100644
--- a/fn/wonderful-life.fn
+++ b/fn/wonderful-life.fn
@@ -58,4 +58,4 @@ let
             puts("}\n");
     }
 in
-    printTree(generateTree(0.640))
+    printTree(generateTree(0.64046))
diff --git a/src/pratt.yaml b/src/pratt.yaml
index c27a665e..7dab6532 100644
--- a/src/pratt.yaml
+++ b/src/pratt.yaml
@@ -97,6 +97,7 @@ enums:
         - START
         - STR
         - ESC
+        - ESCS
         - UNI
         - CHR1
         - CHR2
diff --git a/src/pratt_scanner.c b/src/pratt_scanner.c
index 36d819b6..3c6cad63 100644
--- a/src/pratt_scanner.c
+++ b/src/pratt_scanner.c
@@ -31,287 +31,62 @@
 #  include "debugging_off.h"
 #endif
 
-HashSymbol *TOK_MACRO(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("macro");
-    return s;
-}
-
-HashSymbol *TOK_LEFT(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("left");
-    return s;
-}
-
-HashSymbol *TOK_RIGHT(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("right");
-    return s;
-}
-
-HashSymbol *TOK_PREFIX(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("prefix");
-    return s;
-}
-
-HashSymbol *TOK_INFIX(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("infix");
-    return s;
-}
-
-HashSymbol *TOK_POSTFIX(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("postfix");
-    return s;
-}
-
-HashSymbol *TOK_KW_NUMBER(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("number");
-    return s;
-}
-
-HashSymbol *TOK_BACK(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("back");
-    return s;
-}
-
-HashSymbol *TOK_SWITCH(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("switch");
-    return s;
-}
-
-HashSymbol *TOK_ASSERT(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("assert");
-    return s;
-}
-
-HashSymbol *TOK_KW_CHAR(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("char");
-    return s;
-}
-
-HashSymbol *TOK_IF(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("if");
-    return s;
-}
-
-HashSymbol *TOK_ELSE(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("else");
-    return s;
-}
-
-HashSymbol *TOK_PIPE(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("|");
-    return s;
-}
-
-HashSymbol *TOK_WILDCARD(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("_");
-    return s;
-}
-
-HashSymbol *TOK_LCURLY(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("{");
-    return s;
-}
-
-HashSymbol *TOK_RCURLY(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("}");
-    return s;
-}
-
-HashSymbol *TOK_LSQUARE(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("[");
-    return s;
-}
-
-HashSymbol *TOK_RSQUARE(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("]");
-    return s;
-}
-
-HashSymbol *TOK_ATOM(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol(" ATOM"); // tokens with leading spaces are internal to the parser
-    return s;
-}
-
-HashSymbol *TOK_NUMBER(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol(" NUMBER");
-    return s;
-}
-
-HashSymbol *TOK_EOF(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol(" EOF");
-    return s;
-}
-
-HashSymbol *TOK_STRING(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol(" STRING");
-    return s;
-}
-
-HashSymbol *TOK_ERROR(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol(" ERROR");
-    return s;
-}
-
-HashSymbol *TOK_CHAR(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol(" CHAR");
-    return s;
-}
-
-HashSymbol *TOK_TUPLE(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("#(");
-    return s;
-}
-
-HashSymbol *TOK_OPEN(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("(");
-    return s;
-}
-
-HashSymbol *TOK_CLOSE(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol(")");
-    return s;
-}
-
-HashSymbol *TOK_COMMA(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol(",");
-    return s;
-}
-
-HashSymbol *TOK_ARROW(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("->");
-    return s;
-}
-
-HashSymbol *TOK_ASSIGN(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("=");
-    return s;
-}
-
-HashSymbol *TOK_COLON(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol(":");
-    return s;
-}
-
-HashSymbol *TOK_HASH(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("#");
-    return s;
-}
-
-HashSymbol *TOK_BANG(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("!");
-    return s;
-}
-
-HashSymbol *TOK_PERIOD(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol(".");
-    return s;
-}
-
-HashSymbol *TOK_LET(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("let");
-    return s;
-}
-
-HashSymbol *TOK_IN(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("in");
-    return s;
-}
-
-HashSymbol *TOK_NAMESPACE(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("namespace");
-    return s;
-}
-
-HashSymbol *TOK_KW_ERROR(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("error");
-    return s;
-}
-
-HashSymbol *TOK_TYPEDEF(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("typedef");
-    return s;
-}
-
-HashSymbol *TOK_UNSAFE(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("unsafe");
-    return s;
-}
-
-HashSymbol *TOK_FN(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("fn");
-    return s;
-}
-
-HashSymbol *TOK_LINK(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("link");
-    return s;
-}
-
-HashSymbol *TOK_AS(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("as");
-    return s;
-}
-
-HashSymbol *TOK_ALIAS(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("alias");
-    return s;
-}
-
-HashSymbol *TOK_SEMI(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol(";");
-    return s;
-}
-
-HashSymbol *TOK_PRINT(void) {
-    static HashSymbol *s = NULL;
-    if (s == NULL) s = newSymbol("print");
-    return s;
-}
+#define TOKFN(name, string) \
+HashSymbol *TOK_ ## name(void) { \
+    static HashSymbol *s = NULL; \
+    if (s == NULL) s = newSymbol(string); \
+    return s; \
+}
+
+TOKFN(MACRO, "macro")
+TOKFN(LEFT, "left")
+TOKFN(RIGHT, "right")
+TOKFN(PREFIX, "prefix")
+TOKFN(INFIX, "infix")
+TOKFN(POSTFIX, "postfix")
+TOKFN(KW_NUMBER, "number")
+TOKFN(BACK, "back")
+TOKFN(SWITCH, "switch")
+TOKFN(ASSERT, "assert")
+TOKFN(KW_CHAR, "char")
+TOKFN(IF, "if")
+TOKFN(ELSE, "else")
+TOKFN(PIPE, "|")
+TOKFN(WILDCARD, "_")
+TOKFN(LCURLY, "{")
+TOKFN(RCURLY, "}")
+TOKFN(LSQUARE, "[")
+TOKFN(RSQUARE, "]")
+TOKFN(ATOM, " ATOM") // tokens with leading spaces are internal to the parser
+TOKFN(NUMBER, " NUMBER")
+TOKFN(EOF, " EOF")
+TOKFN(STRING, " STRING")
+TOKFN(ERROR, " ERROR")
+TOKFN(CHAR, " CHAR")
+TOKFN(TUPLE, "#(")
+TOKFN(OPEN, "(")
+TOKFN(CLOSE, ")")
+TOKFN(COMMA, ",")
+TOKFN(ARROW, "->")
+TOKFN(ASSIGN, "=")
+TOKFN(COLON, ":")
+TOKFN(HASH, "#")
+TOKFN(BANG, "!")
+TOKFN(PERIOD, ".")
+TOKFN(LET, "let")
+TOKFN(IN, "in")
+TOKFN(NAMESPACE, "namespace")
+TOKFN(KW_ERROR, "error")
+TOKFN(TYPEDEF, "typedef")
+TOKFN(UNSAFE, "unsafe")
+TOKFN(FN, "fn")
+TOKFN(LINK, "link")
+TOKFN(AS, "as")
+TOKFN(ALIAS, "alias")
+TOKFN(SEMI, ";")
+TOKFN(PRINT, "print")
+
+#undef TOKFN
 
 static bool isALPHA(char c) {
     return isalpha(c) || c == '_';
@@ -769,6 +544,7 @@ static PrattToken *parseString(PrattParser *parser, bool single, char sep) {
                 state = PRATTSTRINGSTATE_TYPE_STR;
                 break;
             case PRATTSTRINGSTATE_TYPE_STR:
+            case PRATTSTRINGSTATE_TYPE_ESCS:
                 DEBUG("parseString %s %d (sep %c) STR: %c", lexer->bufList->filename->name, lexer->bufList->lineno, sep, buffer->start[buffer->length]);
                 if (isTwoByteUtf8(buffer->start[buffer->length])) {
                     pushPrattUTF8(string, buffer->start[buffer->length]);
@@ -785,32 +561,53 @@ static PrattToken *parseString(PrattParser *parser, bool single, char sep) {
                 } else if (isTrailingByteUtf8(buffer->start[buffer->length])) {
                     parserError(parser, "Malformed UTF8");
                     ++buffer->length;
-                } else if (buffer->start[buffer->length] == sep) {
-                    if (single) {
-                        parserError(parser, "empty char");
-                    }
-                    ++buffer->length;
-                    state = PRATTSTRINGSTATE_TYPE_END;
                 } else {
-                    switch (buffer->start[buffer->length]) {
-                        case '\\':
-                            ++buffer->length;
-                            state = PRATTSTRINGSTATE_TYPE_ESC;
-                            break;
-                        case '\n':
-                            parserError(parser, "unexpected EOL");
+                    if (state == PRATTSTRINGSTATE_TYPE_STR) {
+                        if (buffer->start[buffer->length] == sep) {
+                            if (single) {
+                                parserError(parser, "empty char");
+                            }
                             ++buffer->length;
-                            ++lexer->bufList->lineno;
-                            break;
-                        case '\0':
-                            parserError(parser, "unexpected EOF");
                             state = PRATTSTRINGSTATE_TYPE_END;
-                            break;
-                        default:
-                            pushPrattUTF8(string, buffer->start[buffer->length]);
-                            ++buffer->length;
-                            state = single ? PRATTSTRINGSTATE_TYPE_CHR1 : PRATTSTRINGSTATE_TYPE_STR;
-                            break;
+                        } else {
+                            switch (buffer->start[buffer->length]) {
+                                case '\\':
+                                    ++buffer->length;
+                                    state = PRATTSTRINGSTATE_TYPE_ESC;
+                                    break;
+                                case '\n':
+                                    parserError(parser, "unexpected EOL");
+                                    ++buffer->length;
+                                    ++lexer->bufList->lineno;
+                                    break;
+                                case '\0':
+                                    parserError(parser, "unexpected EOF");
+                                    state = PRATTSTRINGSTATE_TYPE_END;
+                                    break;
+                                default:
+                                    pushPrattUTF8(string, buffer->start[buffer->length]);
+                                    ++buffer->length;
+                                    state = single ? PRATTSTRINGSTATE_TYPE_CHR1 : PRATTSTRINGSTATE_TYPE_STR;
+                                    break;
+                            }
+                        }
+                    } else { // PRATTSTRINGSTATE_TYPE_ESCS
+                        switch (buffer->start[buffer->length]) {
+                            case '\n':
+                                parserError(parser, "unexpected EOL");
+                                ++buffer->length;
+                                ++lexer->bufList->lineno;
+                                break;
+                            case '\0':
+                                parserError(parser, "unexpected EOF");
+                                state = PRATTSTRINGSTATE_TYPE_END;
+                                break;
+                            default:
+                                pushPrattUTF8(string, buffer->start[buffer->length]);
+                                ++buffer->length;
+                                state = single ? PRATTSTRINGSTATE_TYPE_CHR1 : PRATTSTRINGSTATE_TYPE_STR;
+                                break;
+                        }
                     }
                 }
                 break;
@@ -877,8 +674,7 @@ static PrattToken *parseString(PrattParser *parser, bool single, char sep) {
                         state = PRATTSTRINGSTATE_TYPE_END;
                         break;
                     default:
-                        ++buffer->length;
-                        state = single ? PRATTSTRINGSTATE_TYPE_CHR1 : PRATTSTRINGSTATE_TYPE_STR;
+                        state = PRATTSTRINGSTATE_TYPE_ESCS;
                 }
                 break;
             case PRATTSTRINGSTATE_TYPE_UNI: