Architecture

This is a high-level architecture of Millet. Millet is a language server for Standard ML (SML), written in Rust.

A language server is a long-running, stateful process that transforms a repository of code, and the client's edits to that code, into a semantic model about that code that can be queried by the client.

SML is a functional programming language, formally defined by its Definition. It emphasizes modularity, purity, and formal reasoning about programs. It is often used in educational contexts to teach students the basics of functional programming.

Rust is a programming language with an emphasis on reliability and efficiency. Being written in Rust, Millet is split into several, somewhat modular "crates".

Type crates

These crates provide foundational types that many other crates use.

`crates/sml-syntax`

SML tokens and lossless concrete syntax trees (CST), and abstract syntax tree (AST) wrapper API. Depends on rowan, a library for CSTs.

"Lossless" means the exact input source file can be reproduced from its CST.

In a conventional compiler, source code is often first lexed into tokens, then those tokens are parsed into ASTs. Often, by design, these AST types do not allow for representing partially-formed programs. For example, we could represent 1 + 3 or foo(3, 5) - bar(4) with an AST, but not 1 + or foo(3,.

This is fine for the regular compiler use case; in fact, it is desired, since a compiler should not compile malformed, incomplete programs. But for the language server use case, this is unacceptable, because the client is often in the middle of editing their code, and therefore their code is syntactically invalid. A language server should, in spite of this, strive to understand as much about the (possibly syntactically invalid) code as possible.

Thus, since the AST wrapper API provides a "view" into a concrete syntax tree, and that concrete tree may be partial and/or malformed, pretty much all "fields" exposed by the AST API are optional.

Mostly generated from its ungrammar.

`crates/sml-hir`

High-level Intermediate Representation.

Uses arenas, and indices into those arenas, to represent recursive structure. For instance, instead of the usual

enum Exp {
  Num(u32),
  Str(String),
  Fn(Var, Box<Exp>),
  App(Box<Exp>, Box<Exp>),
}

We'd represent it (almost) like this:

/// an arena of Exp allocations
type ExpArena = Arena<Exp>;
/// an index into that arena
type ExpIdx = Idx<Exp>;

enum Exp {
  Num(u32),
  Str(String),
  Fn(Var, ExpIdx),
  App(ExpIdx, ExpIdx),
}

The "almost" is because: since we need to represent partial nodes (from a partial parse), HIR nodes need to be able to contain "empty" sub-nodes. So we make just one change to the above:

/// an index might not actually be there if the sub-node couldn't be lowered
type ExpIdx = Option<Idx<Exp>>;

`crates/sml-lab`

Lab, a label for tuples/records.

`crates/sml-scon`

SCon, a special value constructor (aka "literal") for primitive types (int, string, etc).

`crates/sml-path`

A type representing a "path", aka a non-empty sequence of names.

`crates/sml-fixity`

Types relating to SML infix operators.

`crates/sml-namespace`

Types representing various SML namespaces.

`crates/sml-statics-types`

Important types and operations on those types for static analysis.

This used to be a part of statics itself, but the types involved for statics are complicated and plentiful enough that this was made its own crate.

Passes crates

These crates are the main "passes" on SML code. Together they form essentially a SML compiler frontend.

Notably, most passes that can produce errors have the approximate shape

Input -> (Output, Vec<Error>)

instead of

Input -> Result<Output, Error>

That is to say, we always produce both as best an output we could make and as many errors as we could find, as opposed to either a "perfect" output or the first error we encountered.

The latter is more common in "regular" compilers, but we're a language server. In the case of a language server, we want to analyze as much of the code as possible as far as possible, even if the information we have is imperfect.

`crates/input`

&std::path::Path -> (Input, Vec<InputError>)

Get input from a filesystem.

The most important thing in the input is a mapping from file names to either:

raw (string) contents, for SML source files
parsed and processed contents, for CM/MLB "group" files

`crates/sml-lex`

&str -> (Vec<sml_syntax::Token<'_>>, Vec<LexError>)

Lex (aka tokenize) a string of an SML program into tokens.

Notably, we consider things like whitespace and comments as tokens, instead of ignoring/skipping them. This means that we can exactly reconstruct the input string from the output vec of tokens by concatenating all of the underlying token strings in order.

`crates/sml-parse`

&[sml_syntax::Token<'_>] -> (sml_syntax::ast::Root, Vec<ParseError>)

Parses a sequence of tokens into a sequence of "events". Events are like:

start a node
consume a token
emit an error
finish a node

Then processes those events to build a lossless CST.

`crates/sml-hir-lower`

&sml_syntax::ast::Root -> (sml_hir::Root, TwoWayPointers, Vec<LowerError>)

Lowers ("elaborates") AST into HIR.

When lowering, we turn complex constructs into more fundamental ones. For instance:

  a andalso b
= if a then b else false
= case a of true => b | false => false
= (fn true => b | false => false) a

We also construct a two-way mapping between HIR indices and "pointers" to AST nodes. Each direction has a use:

HIR -> AST: for recovering the text range for which to report an error for a HIR index.
AST -> HIR: for knowing what HIR index is e.g. being hovered over by the client.

`crates/sml-ty-var-scope`

sml_hir::Root -> sml_hir::Root

Handles adding implicitly-scoped type variables to their val declaration/specification binding sites.

`crates/sml-statics`

(State, Basis, sml_hir::Root) -> (State, Basis, Info, Vec<StaticsError>)

Does static analysis ("typechecking") on HIR, according to the statics semantics of SML from its Definition.

The other passes thus far operate on single files, but this one is meant to be run "across" many files. So we take in and give out updated state.

This crate contains the implementation of most of the rules of the static semantics for SML from the Definition. (Some trivial ones like the rules for parenthesized expressions are handled implicitly when lowering). Relevant code is tagged with the special comment @def(N) for statics rule N.

Statics errors use an abstract Idx, and this index gets turned into an actual text range with the TwoWayPointers from lower.

In the future we could add more to this Idx, like "the name of the third con bind in the second dat bind of this datatype dec".

`crates/sml-dynamics`

Runs a program, according to the dynamic semantics of SML from its Definition. The program should have been typechecked with the statics already.

SML-adjacent helper crates

These crates do interesting things related to SML files, but aren't really full "passes".

`crates/sml-naive-fmt`

Naively format SML files.

Doesn't handle comments or line length.

`crates/sml-comment`

Extract interesting comments from above SML syntax nodes, like doc comments.

`crates/sml-symbol-kind`

Given information about a value, returns what kind of symbol it is.

`crates/lex-util`

Some common lex utilities used in multiple lexing crates, like:

Handling (* ... *) style block comments with nesting.
Handling SML's definition of whitespace. (We actually consider a superset of the SML definition to be whitespace.)

`crates/slash-var-path`

Handle paths with $(FOO) or $BAR style variables in them, as in MLB and CM files.

SML group file crates

These crates are related to "group files" in the SML ecosystem, namely SML/NJ Compilation Manager and ML Basis.

`crates/cm-syntax`

Processes the syntax of SML/NJ Compilation Manager (.cm) files.

`crates/mlb-syntax`

Processes the syntax of ML Basis (.mlb) files into AST values.

`crates/mlb-hir`

HIR for MLB.

`crates/mlb-statics`

Static semantics for MLB files.

Because the semantics for MLB files determines when source (SML) files get parsed and statically analyzed, this depends on most of the crates that analyze SML.

Overall crates

These crates depend on many other crates to "unite" everything together.

`crates/sml-file`

Kinds of source files (like signature and functor files).

`crates/sml-file-syntax`

Unites all of the purely syntactical SML passes into one single API.

`crates/analysis`

Unifies ~all of the passes into one single API.

`crates/lang-srv`

Depends on analysis and a bunch of third party crates to implement a language server based on analysis. Has a fair amount of "glue code" to convert between analysis types and LSP types.

Other crates

These crates don't really fit in anywhere else.

`crates/paths-glob`

Extends the paths crate with globs (like **/*.sml).

`crates/cov-mark`

Inspired by the cov-mark crate, but doesn't use linker shenanigans.

`crates/panic-hook`

A good default panic hook that shows backtraces and exhorts the reader to file a bug report.

`crates/config`

The format for the optional Millet configuration file.

`crates/tests`

All the tests for all of the other crates. Depends on analysis, and consumes its public API to test functionality of each of the 'passes'.

No crate except this crate is expected to have tests, and this crate is expected to have nothing but tests.

Test case are usually SML programs, which contain "expectation comments" asserting that analysis should behave a certain way about a certain region of the program.

use crate::check::check;

#[test]
fn undefined() {
  check(
    r#"
val _ = nope
(**     ^^^^ undefined value: nope *)
"#,
  );
}

`crates/sml-dynamics-tests`

Tests for the WIP dynamics.

Binary crates

These crates are the only ones that produce executable binaries. Every other crate is only compiled to a library.

`crates/millet-ls`

Depends on lang-srv and actually runs it.

`crates/millet-cli`

A thin CLI wrapper around analysis. It basically does one full analysis of the input, prints any errors to stdout, and exits, much like a conventional compiler or linter.

Other code

Most of the code is contained in the Rust crates documented above, but some code lives elsewhere.

`docs`

Documentation, like this!

`editors`

Support for specific text editors via language client extensions/"glue code".

`editors/vscode`

The VS Code client extension, in TypeScript.

`xtask`

A task runner written in Rust.

Allows you to invoke cargo xtask <task> to run the <task>.

`.cargo`

Configuration for Cargo, Rust's package manager and build tool. Allows cargo xtask to work.

`Cargo.toml`

At the top level, this sets up a "Cargo workspace" of all the crates in crates and xtask. Each crate also has its own Cargo.toml defining fundamental things about the crate like its name and dependencies.

`Cargo.lock`

A lockfile for Cargo that records the exact versions of all dependencies that Cargo resolved.

`rustfmt.toml`

Configuration for rustfmt, Rust's automatic code formatter. Before checking code into git, it must be formatted by rustfmt.

`.github`

Configuration for GitHub, like:

PR templates
Issue templates
How to run CI on GitHub Actions

`.gitignore`

Configuration for git, to ignore generated files like /target (the output folder for Rust).

`.vscode`

Configuration for VS Code, like:

How to launch an instance of VS Code with a local version of the extension for testing
What files to ignore in the sidebar (e.g. /target)

Files

ARCHITECTURE.md

Latest commit

History

ARCHITECTURE.md

File metadata and controls

Architecture

Type crates

crates/sml-syntax

crates/sml-hir

crates/sml-lab

crates/sml-scon

crates/sml-path

crates/sml-fixity

crates/sml-namespace

crates/sml-statics-types

Passes crates

crates/input

crates/sml-lex

crates/sml-parse

crates/sml-hir-lower

crates/sml-ty-var-scope

crates/sml-statics

crates/sml-dynamics

SML-adjacent helper crates

crates/sml-naive-fmt

crates/sml-comment

crates/sml-symbol-kind

crates/lex-util

crates/slash-var-path

SML group file crates

crates/cm-syntax

crates/mlb-syntax

crates/mlb-hir

crates/mlb-statics

Overall crates

crates/sml-file

crates/sml-file-syntax

crates/analysis

crates/lang-srv