Merge pull request #124 from billhails/import
Import
billhails authored Oct 27, 2024
2 parents f7a9bfb + 63ac30c commit daf02e5
Showing 23 changed files with 762 additions and 389 deletions.
9 changes: 7 additions & 2 deletions README.md
@@ -68,8 +68,11 @@ the $step$ function: one to deal with `amb` and one to deal with `back`.
flowchart TD
classDef process fill:#aef;
source(Source) -->
scanner([Scanner]):::process -->
tokens(Tokens) -->
parser([Parser]):::process
parser <--> oi([Operator Inlining]):::process
parser --> oi([Operator Inlining]):::process
oi --> scanner
parser --> ast(AST) -->
lc([Lambda Conversion]):::process --> tpmc([Pattern Matching Compiler]):::process
lc <---> pg([Print Function Generator]):::process
@@ -94,12 +97,14 @@ bc --> cekf([CEKF Runtime VM]):::process
```

The various components named in the diagram above are linked to their implementation entry point here:
* Parser [parser.y](src/parser.y)
* Scanner [pratt_scanner.c](src/pratt_scanner.c)
* Parser [pratt_parser.c](src/pratt_parser.c)
* AST [ast.yaml](src/ast.yaml)
* Lambda Conversion [lambda_conversion.c](src/lambda_conversion.c)
* Tpmc [tpmc_logic.c](src/tpmc_logic.c)
* Print Function Generator [print_generator.c](src/print_generator.c)
* Variable Substitution [lambda_substitution.c](src/lambda_substitution.c)
* Macro Expansion [macro_substitution.c](src/macro_substitution.c)
* Plain Lambda Form [lambda.yaml](src/lambda.yaml)
* Type Checking [tc_analyze.c](src/tc_analyze.c)
* Print Compiler [print_compiler.c](src/print_compiler.c)
142 changes: 142 additions & 0 deletions docs/NAMESPACES.md
@@ -592,4 +592,146 @@ could use this id to find the namespace in which to lookup the

Use of scoped types in pattern matching.

# Postscript

Now that namespaces are implemented, this is a review of their
implementation and its shortcomings. Those shortcomings are crying
out for an additional `import` operation, and hopefully the namespace
implementation can be re-used.

## Shortcomings

* Operators defined in a namespace are not visible outside of it.
* Aliases defined in a namespace are not visible outside of it.
* Functions and types defined in a namespace *have* to be accessed
via the lookup operator (`.`) on the namespace id.

## Intent of an `import` command

I'm hoping I can get an `import <string>;` declaration to do two things:

1. make the environment of the import the base environment for the
rest of the file.
2. use the resulting extended parser (with potentially extra operators)
to parse the rest of the file.

## Description of the Current Implementation

### Lambda conversion etc.

Namespaces are detected (recursively) during parsing of the main file.
When a namespace is found, the referenced file is inspected with `stat`
and the resulting device id and inode numbers (on Unix) are used to
produce a unique identifier. If the namespace is deduced to be previously
unseen, it is parsed and the result placed in a namespace array; otherwise
the existing namespace is re-used. The symbol of the `link` declaration is
associated with the index of the namespace in that array.

Namespaces are parsed as if they were a `let` declaration with no
associated `in`. During lambda conversion they are converted to a true
nest, where the `in` section is a single `env` directive (not available
to the surface language) which instructs subsequent processing stages
to return the current environment after processing the `let` declarations.

The workhorse of lambda conversion in `lambda_conversion.c` is the static
`lamConvert` procedure. When called from the top level it takes an AST
of definitions, an array of namespaces, an AST of expressions and an
environment. In the top-level case the definitions are the preamble,
the nsarray is all the namespaces, and the expressions are the main
program. When called recursively on a namespace, the definitions are
the body of the namespace, the nsarray is null, and the expressions are
the single `env` directive mentioned earlier. When called recursively
on a normal nest, the definitions are the nest definitions, the nsarray
is null, and the expressions are the nest expressions.

| Argument | Top | Namespace | Nest |
|-----------------|--------------|-----------------|-------------------|
| **definitions** | preamble | ns declarations | nest declarations |
| **nsarray** | namespaces | NULL | NULL |
| **expressions** | main program | "env" | nest expressions |
| **env** | empty | preamble env | parent env |

This means it can use the environment constructed from parsing the
preamble as context for each of the namespaces and for the main file:

```mermaid
flowchart BT
pa(Preamble)
subgraph nsarray
ns1(Namespace 1)
ns2(Namespace 2)
ns3(Namespace 3)
end
pa --> ns1
pa --> ns2
pa --> ns3
pa --> main(Main Program)
```

Notice there is no nesting of namespaces, even if they were recursively
linked. This is as it should be: each namespace assumes only the preamble
as a base environment.

During subsequent processing steps (type checking, bytecode generation
etc.) the components are processed in the same order: preamble, then
namespaces, then main program. The order of namespaces in the
array *is* significant: each must be processed before it is referred
to. Luckily the parser, by parsing namespaces while parsing the file
that links them, guarantees this property, because namespaces are added
to the ns array immediately after they are parsed.

### Parsing

Because of the extensibility of the parser with user-defined operators,
the parser uses a similar environmental model to the other processing
steps. A new parser "environment" is pushed on entry to a new scope
and popped on exit. In order to capture the parser environment
constructed while parsing a namespace, the parser will need to know
that it is parsing a namespace and where to put the value. Since the
parser is returning AST elements it can't simply return the environment.
Maybe it can poke it into the AST as a new expression type?
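The environmental model described above, a frame pushed on scope entry, popped on exit, with lookup walking outward, can be sketched like this. The names (`Env`, `pushEnv`, `define`, `lookup`) are illustrative stand-ins, not the actual PrattParser API, and an `int` stands in for an operator record:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// One name-to-operator binding within a scope.
typedef struct Binding {
    const char *name;
    int value; // stands in for an operator/parser record
    struct Binding *next;
} Binding;

// A scope frame: its own bindings plus a link to the enclosing scope.
typedef struct Env {
    Binding *bindings;
    struct Env *parent;
} Env;

// Enter a new scope.
static Env *pushEnv(Env *parent) {
    Env *e = calloc(1, sizeof(Env));
    e->parent = parent;
    return e;
}

// Leave a scope (this sketch leaks the discarded frame).
static Env *popEnv(Env *e) { return e->parent; }

// Bind a name in the current (innermost) scope.
static void define(Env *e, const char *name, int value) {
    Binding *b = malloc(sizeof(Binding));
    b->name = name;
    b->value = value;
    b->next = e->bindings;
    e->bindings = b;
}

// Look a name up, innermost scope first; returns 1 if found.
static int lookup(Env *e, const char *name, int *out) {
    for (; e != NULL; e = e->parent)
        for (Binding *b = e->bindings; b != NULL; b = b->next)
            if (strcmp(b->name, name) == 0) { *out = b->value; return 1; }
    return 0;
}
```

Capturing a namespace's parser environment then amounts to keeping a reference to the frame that would otherwise be discarded by `popEnv`.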

## Problems

A single import is only slightly problematic; the bigger problem is
multiple imports, name conflict resolution etc. A simple but inefficient
solution would be to inline the contents of a namespace at the point it
is imported. This is particularly inefficient for large and commonly
used libraries like `listutils` and really isn't an option.

But if we can't merely duplicate, how can we arrange environments so
that one import does not disturb another? Each namespace must be
able to safely assume only the preamble as a basis.

## Trial and Error

First attempt, thinking out loud.

Exporting operators may be easier than exporting environments, so let's
tackle that first.

Currently the parser delegates the actual parsing of a `link` directive
(that is, parsing the linked file) to a `parseLink` procedure. `parseLink`
handles the detection of duplicate files and protection against recursive
includes, and finally delegates to `prattParseLink` to do the actual
parsing.

`prattParseLink` unwinds its argument parser to the parser that was used
to parse the preamble, then extends that with a new child set up with
a lexer to parse the linked file. When done it discards the parser and
returns the AstDefinitions from the file.

It should be possible to meld the returned parser with the parser being
used to parse the importing file, incorporating the additional tries
and parser records. Because of the way the parser "hoists" parser records,
only the top-level parser need be inspected, and the tries are similarly
functional data structures.

Hmm, of course this only works when we first parse the file; we're going
to have to keep an additional ParseMNamespaceArray of parsers for when
we're seeing the same file a second time.

So the initial steps are OK: there's now a PrattParsers array type and a
parserStack of that type, congruent with the fileIdStack of namespaces,
and the parser captures each parser instance used to parse a namespace
in that array.
114 changes: 114 additions & 0 deletions docs/OPERATORS.md
@@ -0,0 +1,114 @@
# Operators (and macros)

Some issues with the initial implementation.

I'd thought I could get away with a pure parser-only implementation of
infix operators and it basically works, but there are some issues which
make that approach quite clunky. One specific scenario is where I'm
declaring an infix addition operator in the preamble as:

```
infix left 100 "+" addition;
```

Where `addition` is the symbol for the two-argument built-in addition
operator. The trouble is that in another file I'd redefined `addition` as
a type of expression for unrelated purposes, and because e.g. `2 + 2` gets
rewritten to plain old `addition(2, 2)` unconditionally, in that context
the interpreter finds the `addition` type from the current environment
rather than the one where the infix operator was declared.

This is clearly unacceptable.

[Hygienic Macros in Wikipedia](https://en.wikipedia.org/wiki/Hygienic_macro) states:

> The basic strategy is to identify bindings in the macro definition and
> replace those names with gensyms, and to identify free variables in the
> macro definition and make sure those names are looked up in the scope
> of the macro definition instead of the scope where the macro was used.

This offers hope: if we can re-work the macro system to be hygienic by
default, then instead of generating `addition(a, b)` for `a + b` the
parser could generate:

```
macro gensym$1(a, b) { addition(a, b) }
```

at the point of the operator declaration, and generate `gensym$1(a, b)`
when `a + b` is subsequently encountered.
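Gensym generation itself is simple; a minimal sketch, where `freshGensym` is a hypothetical name, might mint names from a monotonic counter. Because `$` cannot appear in a user-written identifier, the generated names can never collide with user code:

```c
#include <stdio.h>
#include <stdlib.h>

// Mint a fresh, collision-proof symbol name: gensym$1, gensym$2, ...
// The '$' guarantees no clash with user-written identifiers.
static char *freshGensym(void) {
    static int counter = 0;
    char *buf = malloc(32);
    snprintf(buf, 32, "gensym$%d", ++counter);
    return buf;
}
```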

Firstly I now think the use of a `$` prefix to indicate a gensym in a macro
is not the best idea. Instead the lambda conversion should identify bound
`let` variables and replace them automatically. That also frees up `$` as a
potentially useful user-defined prefix operator.

The bigger problem is that we can't continue to do naive macro expansion
during the lambda conversion step, or we'd be back where we started, with
`addition(a, b)` referring to whatever `addition` happens to be the
current definition.

We may have to revert to the Scheme definition of a macro: pass the
arguments unevaluated to the macro, evaluate the macro body, then
re-evaluate the result.

But we really don't want to have the macro evaluated like that, because
F♮ is not homoiconic: "evaluating the macro body" can only mean
substitution.

What if the arguments to macros were wrapped in a closure?

```
macro AND(a, b) { if (a) { b } else { false } } => fn AND(a, b) { if (a()) { b() } else { false } }
AND(a, b) => AND(fn () { a }, fn () { b })
```

That would definitely work, though it won't be quite as efficient. It
solves both scoping problems: since `AND` is now a normal function, free
variables in the body are evaluated in the context of the function
definition, and variables in the argument expressions are evaluated in
the calling context.
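The transformation's semantics can be demonstrated with C function pointers standing in for F♮ closures. This is a sketch of the behaviour, not the compiler's output; `Thunk`, `AND` and the test thunks are illustrative names. Note the short-circuiting: the second argument is only evaluated when the first returns true.

```c
#include <stdbool.h>

// A zero-argument closure wrapping a macro argument.
typedef bool (*Thunk)(void);

// The rewritten macro body: fn AND(a, b) { if (a()) { b() } else { false } }
static bool AND(Thunk a, Thunk b) {
    return a() ? b() : false;
}

// Instrumentation: count how many argument thunks actually get forced.
static int forced = 0;
static bool falseThunk(void) { forced++; return false; }
static bool trueThunk(void)  { forced++; return true;  }
```

A call site `AND(x, y)` is rewritten to `AND(fn () { x }, fn () { y })`, so the argument expressions are delayed exactly as a macro would delay them.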

Got that working, and we're also handling local shadowing of arguments so they don't
get wrapped in an invocation unless they are the same lexical variable.

One little unnecessary inefficiency needs to be addressed: if one macro
calls another, for example

```
macro NAND(a, b) { NOT(AND(a, b)) }
```

This first gets rewritten by `lambda_conversion.c` to

```
fn NAND(a, b) { NOT(AND(fn () {a}, fn () {b})) }
```

and then subsequently by `macro_substitution.c` to

```
fn NAND(a, b) { NOT(AND(fn () {a()}, fn () {b()})) }
```

While correct, the expression `fn () {a()}` is just `a`, so we'll need
a pass to optimise away this unnecessary wrapping and unwrapping,
essentially restoring

```
fn NAND(a, b) { NOT(AND(a, b)) }
```

Two approaches:

1. Macro specific, have a special type for macro argument application
and another for macro argument wrapping, and detect the explicit
combination of the two.
2. Generic pass that would detect this wherever it occurs and optimize it.

In either case we need to be a little bit careful that we still allow
the pattern when the argument is being modified: for example, if a macro
called another with its argument modified in some way, then the pattern
`fn () { a() + 1 }` would be necessary.

Got option 1 working, but with no need for extra types: just inspect the
thunk during macro conversion; if it takes no arguments and its body is
just the invocation of a symbol, return the symbol.
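That check is essentially eta-reduction restricted to thunks. A sketch over a tiny illustrative AST (not the compiler's real LamExp structures; `Expr`, `etaReduceThunk` etc. are hypothetical names): reduce `fn () { a() }` to `a` when and only when the body is a zero-argument call whose callee is a plain symbol.

```c
#include <stddef.h>

// A deliberately tiny expression language, just enough for the pattern.
typedef enum { EXPR_SYMBOL, EXPR_CALL0, EXPR_LAMBDA0, EXPR_OTHER } ExprKind;

typedef struct Expr {
    ExprKind kind;
    const char *symbol;  // valid when kind == EXPR_SYMBOL
    struct Expr *inner;  // EXPR_CALL0: callee; EXPR_LAMBDA0: body
} Expr;

// If e has the exact shape fn () { x() } with x a plain symbol,
// return the symbol node; otherwise leave e untouched.
static Expr *etaReduceThunk(Expr *e) {
    if (e->kind == EXPR_LAMBDA0 &&
        e->inner->kind == EXPR_CALL0 &&
        e->inner->inner->kind == EXPR_SYMBOL) {
        return e->inner->inner;
    }
    return e;
}
```

A body like `fn () { a() + 1 }` fails the `EXPR_CALL0`-of-a-symbol test, so the wrapper is correctly preserved in exactly the modified-argument case discussed above.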
6 changes: 0 additions & 6 deletions src/ast.yaml
@@ -51,10 +51,6 @@ structs:
symbol: HashSymbol
expression: AstExpression

AstGensymDefine:
basename: HashSymbol
expression: AstExpression

AstAlias:
name: HashSymbol
type: AstType
@@ -194,7 +190,6 @@ unions:

AstDefinition:
define: AstDefine
gensymDefine: AstGensymDefine
typeDef: AstTypeDef
macro: AstDefMacro
alias: AstAlias
@@ -225,7 +220,6 @@ unions:
funCall: AstFunCall
lookup: AstLookup
symbol: HashSymbol
gensym: HashSymbol
number: MaybeBigInt
character: character
fun: AstCompositeFunction
13 changes: 3 additions & 10 deletions src/lambda.yaml
@@ -31,11 +31,6 @@ structs:
args: LamVarList
exp: LamExp

LamMacro:
args: LamVarList
exp: LamExp
env: LamContext

LamVarList:
var: HashSymbol
next: LamVarList
@@ -141,7 +136,6 @@ structs:

LamLetRecBindings:
var: HashSymbol
isGenSym: bool
val: LamExp
next: LamLetRecBindings

@@ -240,7 +234,6 @@ unions:
namespaces: LamNamespaceArray
lam: LamLam
var: HashSymbol
gensym: HashSymbol
stdint: int
biginteger: MaybeBigInt
prim: LamPrimApp
@@ -294,10 +287,10 @@

hashes:
LamMacroTable:
entries: LamMacro
entries: void_ptr

LamGenSymTable:
entries: HashSymbol
LamMacroArgsTable:
entries: void_ptr

LamInfoTable:
entries: LamInfo