Merge pull request #124 from billhails/import
Import
billhails authored Oct 27, 2024
2 parents f7a9bfb + 63ac30c commit daf02e5
Showing 23 changed files with 762 additions and 389 deletions.
9 changes: 7 additions & 2 deletions README.md
@@ -68,8 +68,11 @@ the $step$ function: one to deal with `amb` and one to deal with `back`.
flowchart TD
classDef process fill:#aef;
source(Source) -->
scanner([Scanner]):::process -->
tokens(Tokens) -->
parser([Parser]):::process
parser <--> oi([Operator Inlining]):::process
parser --> oi([Operator Inlining]):::process
oi --> scanner
parser --> ast(AST) -->
lc([Lambda Conversion]):::process --> tpmc([Pattern Matching Compiler]):::process
lc <---> pg([Print Function Generator]):::process
@@ -94,12 +97,14 @@ bc --> cekf([CEKF Runtime VM]):::process
```

The various components named in the diagram above are linked to their implementation entry point here:
* Parser [parser.y](src/parser.y)
* Scanner [pratt_scanner.c](src/pratt_scanner.c)
* Parser [pratt_parser.c](src/pratt_parser.c)
* AST [ast.yaml](src/ast.yaml)
* Lambda Conversion [lambda_conversion.c](src/lambda_conversion.c)
* Tpmc [tpmc_logic.c](src/tpmc_logic.c)
* Print Function Generator [print_generator.c](src/print_generator.c)
* Variable Substitution [lambda_substitution.c](src/lambda_substitution.c)
* Macro Expansion [macro_substitution.c](src/macro_substitution.c)
* Plain Lambda Form [lambda.yaml](src/lambda.yaml)
* Type Checking [tc_analyze.c](src/tc_analyze.c)
* Print Compiler [print_compiler.c](src/print_compiler.c)
142 changes: 142 additions & 0 deletions docs/NAMESPACES.md
@@ -592,4 +592,146 @@ could use this id to find the namespace in which to lookup the

Use of scoped types in pattern matching.

# Postscript

Now that namespaces are implemented, this is a review of their
implementation and its shortcomings. Those shortcomings are crying
out for an additional `import` operation, and hopefully the namespace
implementation can be re-used.

## Shortcomings

* Operators defined in a namespace are not visible outside of it.
* Aliases defined in a namespace are not visible outside of it.
* Functions and types defined in a namespace *have* to be accessed
via the lookup operator (`.`) on the namespace id.

## Intent of an `import` command

I'm hoping I can get an `import <string>;` declaration to do two things:

1. make the environment of the import the base environment for the
rest of the file.
2. use the resulting extended parser (with potentially extra operators)
to parse the rest of the file.

## Description of the Current Implementation

### Lambda conversion etc.

Namespaces are detected (recursively) during parsing of the main file.
When a namespace is found, the referenced file is inspected with `stat`
and the resulting device id and inode numbers (on Unix) are used to
produce a unique identifier. If the namespace is deduced to be previously
unseen, it is parsed and the result placed in a namespace array; otherwise
the existing namespace is re-used. The symbol of the `link` declaration is
associated with the index of the namespace in that array.

Namespaces are parsed as if they were a `let` declaration with no
associated `in`. During lambda conversion they are converted to a true
nest, where the `in` section is a single `env` directive (not available
to the surface language) which instructs subsequent processing stages
to return the current environment after processing the `let` declarations.

The workhorse of lambda conversion in `lambda_conversion.c` is the static
`lamConvert` procedure. When called from the top level it takes an AST
of definitions, an array of namespaces, an AST of expressions and an
environment. In the top-level case the definitions are the preamble,
the nsarray is all the namespaces, and the expressions are the main
program. When called recursively on a namespace, the definitions are
the body of the namespace, the nsarray is null, and the expressions are
the single `env` directive mentioned earlier. When called recursively
on a normal nest, the definitions are the nest definitions, the nsarray
is null, and the expressions are the nest expressions.

| Argument | Top | Namespace | Nest |
|-----------------|--------------|-----------------|-------------------|
| **definitions** | preamble | ns declarations | nest declarations |
| **nsarray** | namespaces | NULL | NULL |
| **expressions** | main program | "env" | nest expressions |
| **env** | empty | preamble env | parent env |

This means it can use the environment constructed from parsing the
preamble as context for each of the namespaces and for the main file:

```mermaid
flowchart BT
pa(Preamble)
subgraph nsarray
ns1(Namespace 1)
ns2(Namespace 2)
ns3(Namespace 3)
end
pa --> ns1
pa --> ns2
pa --> ns3
pa --> main(Main Program)
```

Notice there is no nesting of namespaces, even if they were recursively
linked. This is as it should be: each namespace assumes only the preamble
as a base environment.

During subsequent processing steps (type checking, bytecode generation
etc.) the components are processed in the same order: preamble, then
namespaces, then main program. The order of namespaces in the
array *is* significant: each must be processed before it is referred
to. Luckily the parser, by parsing namespaces while parsing the file
that links them, guarantees this property, because namespaces are added
to the ns array immediately after they are parsed.

### Parsing

Because of the extensibility of the parser with user-defined operators,
the parser uses a similar environmental model to the other processing
steps. A new parser "environment" is pushed on entry to a new scope
and popped on exit. In order to capture the parser environment
constructed while parsing a namespace, the parser will need to know
that it is parsing a namespace and where to put the value. Since the
parser is returning AST elements it can't simply return the environment.
Maybe it can poke it into the AST as a new expression type?
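The environmental model described above, a frame pushed on scope entry, popped on exit, with lookup walking outward, can be sketched like this. The names (`Env`, `pushEnv`, `define`, `lookup`) are illustrative stand-ins, not the actual PrattParser API, and an `int` stands in for an operator record:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// One name-to-operator binding within a scope.
typedef struct Binding {
    const char *name;
    int value; // stands in for an operator/parser record
    struct Binding *next;
} Binding;

// A scope frame: its own bindings plus a link to the enclosing scope.
typedef struct Env {
    Binding *bindings;
    struct Env *parent;
} Env;

// Enter a new scope.
static Env *pushEnv(Env *parent) {
    Env *e = calloc(1, sizeof(Env));
    e->parent = parent;
    return e;
}

// Leave a scope (this sketch leaks the discarded frame).
static Env *popEnv(Env *e) { return e->parent; }

// Bind a name in the current (innermost) scope.
static void define(Env *e, const char *name, int value) {
    Binding *b = malloc(sizeof(Binding));
    b->name = name;
    b->value = value;
    b->next = e->bindings;
    e->bindings = b;
}

// Look a name up, innermost scope first; returns 1 if found.
static int lookup(Env *e, const char *name, int *out) {
    for (; e != NULL; e = e->parent)
        for (Binding *b = e->bindings; b != NULL; b = b->next)
            if (strcmp(b->name, name) == 0) { *out = b->value; return 1; }
    return 0;
}
```

Capturing a namespace's parser environment then amounts to keeping a reference to the frame that would otherwise be discarded by `popEnv`.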

## Problems

A single import is only slightly problematic; the bigger problem is
multiple imports, name conflict resolution etc. A simple but inefficient
solution would be to inline the contents of a namespace at the point it
is imported. This is particularly inefficient for large and commonly
used libraries like `listutils` and really isn't an option.

But if we can't merely duplicate, how can we arrange environments so
that one import does not disturb another? Each namespace must be
able to safely assume only the preamble as a basis.

## Trial and Error

First attempt, thinking out loud.

Exporting operators may be easier than exporting environments, so let's
tackle that first.

Currently the parser delegates the actual parsing of a `link` directive
(that is, parsing the linked file) to a `parseLink` procedure. `parseLink`
handles the detection of duplicate files and protection against recursive
includes, and finally delegates to `prattParseLink` to do the actual
parsing.

`prattParseLink` unwinds its argument parser to the parser that was used
to parse the preamble, then extends that with a new child set up with
a lexer to parse the linked file. When done it discards the parser and
returns the AstDefinitions from the file.

It should be possible to meld the returned parser with the parser being
used to parse the importing file, incorporating the additional tries
and parser records. Because of the way the parser "hoists" parser records,
only the top-level parser need be inspected, and the tries are similarly
functional data structures.

Hmm, of course this only works when we first parse the file; we're going
to have to keep an additional ParseMNamespaceArray of parsers for when
we're seeing the same file a second time.

So the initial steps are OK: there's now a PrattParsers array type and a
parserStack of that type, congruent with the fileIdStack of namespaces,
and the parser captures each parser instance used to parse a namespace
in that array.
114 changes: 114 additions & 0 deletions docs/OPERATORS.md
@@ -0,0 +1,114 @@
# Operators (and macros)

Some issues with the initial implementation.

I'd thought I could get away with a pure parser-only implementation of
infix operators and it basically works, but there are some issues which
make that approach quite clunky. One specific scenario is where I'm
declaring an infix addition operator in the preamble as:

```
infix left 100 "+" addition;
```

Where `addition` is the symbol for the two-argument built-in addition
operator. The trouble is that in another file I'd redefined `addition` as
a type of expression for unrelated purposes, and because e.g. `2 + 2` gets
rewritten to plain old `addition(2, 2)` unconditionally, in that context
the interpreter finds the `addition` type from the current environment
rather than the one where the infix operator was declared.

This is clearly unacceptable.

[Hygienic Macros in Wikipedia](https://en.wikipedia.org/wiki/Hygienic_macro) states:

> The basic strategy is to identify bindings in the macro definition and
> replace those names with gensyms, and to identify free variables in the
> macro definition and make sure those names are looked up in the scope
> of the macro definition instead of the scope where the macro was used.

This offers hope: if we can re-work the macro system to be hygienic by
default, then instead of generating `addition(a, b)` for `a + b` the
parser could generate:

```
macro gensym$1(a, b) { addition(a, b) }
```

at the point of the operator declaration, and generate `gensym$1(a, b)`
when `a + b` is subsequently encountered.
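Gensym generation itself is simple; a minimal sketch, where `freshGensym` is a hypothetical name, might mint names from a monotonic counter. Because `$` cannot appear in a user-written identifier, the generated names can never collide with user code:

```c
#include <stdio.h>
#include <stdlib.h>

// Mint a fresh, collision-proof symbol name: gensym$1, gensym$2, ...
// The '$' guarantees no clash with user-written identifiers.
static char *freshGensym(void) {
    static int counter = 0;
    char *buf = malloc(32);
    snprintf(buf, 32, "gensym$%d", ++counter);
    return buf;
}
```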

Firstly I now think the use of a `$` prefix to indicate a gensym in a macro
is not the best idea. Instead the lambda conversion should identify bound
`let` variables and replace them automatically. That also frees up `$` as a
potentially useful user-defined prefix operator.

The bigger problem is that we can't continue to do naive macro expansion
during the lambda conversion step, or we'd be back where we started, with
`addition(a, b)` referring to whatever `addition` happens to be the
current definition.

We may have to revert to the Scheme definition of a macro: pass the
arguments unevaluated to the macro, evaluate the macro body, then
re-evaluate the result.

But we really don't want to have the macro evaluated like that, because
F♮ is not homoiconic: "evaluating the macro body" can only mean
substitution.

What if the arguments to macros were wrapped in a closure?

```
macro AND(a, b) { if (a) { b } else { false } } => fn AND(a, b) { if (a()) { b() } else { false } }
AND(a, b) => AND(fn () { a }, fn () { b })
```

That would definitely work, though it won't be quite as efficient. It
solves both scoping problems: since `AND` is now a normal function, free
variables in the body are evaluated in the context of the function
definition, and variables in the argument expressions are evaluated in
the calling context.
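The transformation's semantics can be demonstrated with C function pointers standing in for F♮ closures. This is a sketch of the behaviour, not the compiler's output; `Thunk`, `AND` and the test thunks are illustrative names. Note the short-circuiting: the second argument is only evaluated when the first returns true.

```c
#include <stdbool.h>

// A zero-argument closure wrapping a macro argument.
typedef bool (*Thunk)(void);

// The rewritten macro body: fn AND(a, b) { if (a()) { b() } else { false } }
static bool AND(Thunk a, Thunk b) {
    return a() ? b() : false;
}

// Instrumentation: count how many argument thunks actually get forced.
static int forced = 0;
static bool falseThunk(void) { forced++; return false; }
static bool trueThunk(void)  { forced++; return true;  }
```

A call site `AND(x, y)` is rewritten to `AND(fn () { x }, fn () { y })`, so the argument expressions are delayed exactly as a macro would delay them.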

Got that working, and we're also handling local shadowing of arguments so they don't
get wrapped in an invocation unless they are the same lexical variable.

One little unnecessary inefficiency needs to be addressed: if one macro
calls another, for example

```
macro NAND(a, b) { NOT(AND(a, b)) }
```

This first gets rewritten by `lambda_conversion.c` to

```
fn NAND(a, b) { NOT(AND(fn () {a}, fn () {b})) }
```

and then subsequently by `macro_substitution.c` to

```
fn NAND(a, b) { NOT(AND(fn () {a()}, fn () {b()})) }
```

While correct, the expression `fn () {a()}` is just `a`, so we'll need
a pass to optimise away this unnecessary wrapping and unwrapping,
essentially restoring

```
fn NAND(a, b) { NOT(AND(a, b)) }
```

Two approaches:

1. Macro specific, have a special type for macro argument application
and another for macro argument wrapping, and detect the explicit
combination of the two.
2. Generic pass that would detect this wherever it occurs and optimize it.

In either case we need to be a little bit careful that we still allow
the pattern when the argument is being modified: for example, if a macro
called another with its argument modified in some way, then the pattern
`fn () { a() + 1 }` would be necessary.

Got option 1 working, but with no need for extra types: just inspect the
thunk during macro conversion; if it takes no arguments and its body is
just the invocation of a symbol, return the symbol.
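That check is essentially eta-reduction restricted to thunks. A sketch over a tiny illustrative AST (not the compiler's real LamExp structures; `Expr`, `etaReduceThunk` etc. are hypothetical names): reduce `fn () { a() }` to `a` when and only when the body is a zero-argument call whose callee is a plain symbol.

```c
#include <stddef.h>

// A deliberately tiny expression language, just enough for the pattern.
typedef enum { EXPR_SYMBOL, EXPR_CALL0, EXPR_LAMBDA0, EXPR_OTHER } ExprKind;

typedef struct Expr {
    ExprKind kind;
    const char *symbol;  // valid when kind == EXPR_SYMBOL
    struct Expr *inner;  // EXPR_CALL0: callee; EXPR_LAMBDA0: body
} Expr;

// If e has the exact shape fn () { x() } with x a plain symbol,
// return the symbol node; otherwise leave e untouched.
static Expr *etaReduceThunk(Expr *e) {
    if (e->kind == EXPR_LAMBDA0 &&
        e->inner->kind == EXPR_CALL0 &&
        e->inner->inner->kind == EXPR_SYMBOL) {
        return e->inner->inner;
    }
    return e;
}
```

A body like `fn () { a() + 1 }` fails the `EXPR_CALL0`-of-a-symbol test, so the wrapper is correctly preserved in exactly the modified-argument case discussed above.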
6 changes: 0 additions & 6 deletions src/ast.yaml
@@ -51,10 +51,6 @@ structs:
symbol: HashSymbol
expression: AstExpression

AstGensymDefine:
basename: HashSymbol
expression: AstExpression

AstAlias:
name: HashSymbol
type: AstType
@@ -194,7 +190,6 @@ unions:

AstDefinition:
define: AstDefine
gensymDefine: AstGensymDefine
typeDef: AstTypeDef
macro: AstDefMacro
alias: AstAlias
@@ -225,7 +220,6 @@ unions:
funCall: AstFunCall
lookup: AstLookup
symbol: HashSymbol
gensym: HashSymbol
number: MaybeBigInt
character: character
fun: AstCompositeFunction
13 changes: 3 additions & 10 deletions src/lambda.yaml
@@ -31,11 +31,6 @@ structs:
args: LamVarList
exp: LamExp

LamMacro:
args: LamVarList
exp: LamExp
env: LamContext

LamVarList:
var: HashSymbol
next: LamVarList
@@ -141,7 +136,6 @@ structs:

LamLetRecBindings:
var: HashSymbol
isGenSym: bool
val: LamExp
next: LamLetRecBindings

@@ -240,7 +234,6 @@ unions:
namespaces: LamNamespaceArray
lam: LamLam
var: HashSymbol
gensym: HashSymbol
stdint: int
biginteger: MaybeBigInt
prim: LamPrimApp
@@ -294,10 +287,10 @@

hashes:
LamMacroTable:
entries: LamMacro
entries: void_ptr

LamGenSymTable:
entries: HashSymbol
LamMacroArgsTable:
entries: void_ptr

LamInfoTable:
entries: LamInfo