This is a high-level overview of the architecture of Millet. Millet is a language server for Standard ML (SML), written in Rust.
A language server is a long-running, stateful process that transforms a repository of code, and the client's edits to that code, into a semantic model about that code that can be queried by the client.
SML is a functional programming language, formally defined by its Definition. It emphasizes modularity, purity, and formal reasoning about programs. It is often used in educational contexts to teach students the basics of functional programming.
Rust is a programming language with an emphasis on reliability and efficiency. Because Millet is written in Rust, it is split into several somewhat-modular "crates".
These crates provide foundational types that many other crates use.
SML tokens, lossless concrete syntax trees (CSTs), and an abstract syntax tree (AST) wrapper API. Depends on rowan, a library for CSTs.
"Lossless" means the exact input source file can be reproduced from its CST.
In a conventional compiler, source code is often first lexed into tokens, then those tokens are parsed into ASTs. Often, by design, these AST types cannot represent partially-formed programs. For example, we could represent `1 + 3` or `foo(3, 5) - bar(4)` with an AST, but not `1 +` or `foo(3,`.
This is fine for the regular compiler use case; in fact, it is desired, since a compiler should not compile malformed, incomplete programs. But for the language server use case, this is unacceptable, because the client is often in the middle of editing their code, and therefore their code is syntactically invalid. A language server should, in spite of this, strive to understand as much about the (possibly syntactically invalid) code as possible.
Thus, since the AST wrapper API provides a "view" into a concrete syntax tree, and that concrete tree may be partial and/or malformed, pretty much all "fields" exposed by the AST API are optional.
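As an illustration, here is a toy AST in that style (hypothetical types, not the actual `sml_syntax` API): every child is optional, so a partial parse of `1 +` still yields a well-formed tree.

```rs
/// toy AST in the "everything is optional" style; illustrative stand-ins,
/// not the real sml_syntax API
#[derive(Debug)]
enum Exp {
  Num(u32),
  Infix(Box<Infix>),
}

#[derive(Debug)]
struct Infix {
  lhs: Option<Exp>,
  op: Option<String>,
  rhs: Option<Exp>,
}

fn main() {
  // roughly what a parse of the incomplete input `1 +` could produce: the
  // right-hand side is simply absent
  let partial = Exp::Infix(Box::new(Infix {
    lhs: Some(Exp::Num(1)),
    op: Some("+".to_owned()),
    rhs: None,
  }));
  println!("{partial:?}");
}
```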
Mostly generated from its ungrammar.
High-level Intermediate Representation.
Uses arenas, and indices into those arenas, to represent recursive structure. For instance, instead of the usual
```rs
enum Exp {
  Num(u32),
  Str(String),
  Fn(Var, Box<Exp>),
  App(Box<Exp>, Box<Exp>),
}
```
We'd represent it (almost) like this:
```rs
/// an arena of Exp allocations
type ExpArena = Arena<Exp>;
/// an index into that arena
type ExpIdx = Idx<Exp>;

enum Exp {
  Num(u32),
  Str(String),
  Fn(Var, ExpIdx),
  App(ExpIdx, ExpIdx),
}
```
The "almost" is because: since we need to represent partial nodes (from a partial parse), HIR nodes need to be able to contain "empty" sub-nodes. So we make just one change to the above:
```rs
/// an index might not actually be there if the sub-node couldn't be lowered
type ExpIdx = Option<Idx<Exp>>;
```
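Putting the pieces together, here is a minimal runnable sketch, with simplified stand-ins for `Arena` and `Idx` (the real crate uses a dedicated arena library):

```rs
#![allow(dead_code)]
use std::marker::PhantomData;

/// simplified stand-ins for the real arena types
struct Arena<T>(Vec<T>);
struct Idx<T>(usize, PhantomData<T>);

impl<T> Arena<T> {
  fn alloc(&mut self, x: T) -> Idx<T> {
    self.0.push(x);
    Idx(self.0.len() - 1, PhantomData)
  }
}

type Var = String;
/// an index might be absent if the sub-node couldn't be lowered
type ExpIdx = Option<Idx<Exp>>;

enum Exp {
  Num(u32),
  Str(String),
  Fn(Var, ExpIdx),
  App(ExpIdx, ExpIdx),
}

fn main() {
  let mut arena: Arena<Exp> = Arena(Vec::new());
  // `(fn x => 5) 3`, built bottom-up: children are allocated first, and
  // parents refer to them by index
  let five = arena.alloc(Exp::Num(5));
  let f = arena.alloc(Exp::Fn("x".to_owned(), Some(five)));
  let three = arena.alloc(Exp::Num(3));
  let _app = arena.alloc(Exp::App(Some(f), Some(three)));
  // a partial parse like `f ` could lower to `App(Some(f), None)`
}
```

Among other things, this representation keeps all nodes of one kind in contiguous memory and makes sub-node references plain integers instead of separate heap allocations.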
Other crates in this foundational group provide:

- `Lab`, a label for tuples/records.
- `SCon`, a special value constructor (aka "literal") for primitive types (`int`, `string`, etc).
- A type representing a "path", aka a non-empty sequence of names.
- Types relating to SML infix operators.
- Types representing various SML namespaces.
Important types and operations on those types for static analysis.
This used to be part of the statics crate itself, but the types involved are complicated and plentiful enough that they were split into their own crate.
These crates are the main "passes" on SML code. Together they essentially form an SML compiler frontend.
Notably, most passes that can produce errors have the approximate shape

```
Input -> (Output, Vec<Error>)
```

instead of

```
Input -> Result<Output, Error>
```
That is to say, we always produce both the best output we could make and as many errors as we could find, as opposed to either a "perfect" output or just the first error encountered. The latter is more common in "regular" compilers, but as a language server, we want to analyze as much of the code as possible, as far as possible, even if the information we have is imperfect.
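As a toy illustration of that shape (not code from Millet): a pass that sums a comma-separated list of integers, reporting every malformed item while still producing its best-effort sum.

```rs
/// toy pass in the "best-effort output plus all the errors" shape
fn sum_csv(input: &str) -> (i64, Vec<String>) {
  let mut sum = 0;
  let mut errors = Vec::new();
  for item in input.split(',') {
    match item.trim().parse::<i64>() {
      Ok(n) => sum += n,
      // record the error and keep going instead of bailing out
      Err(e) => errors.push(format!("bad item {item:?}: {e}")),
    }
  }
  (sum, errors)
}

fn main() {
  let (sum, errors) = sum_csv("1, 2, oops, 4");
  assert_eq!(sum, 7);
  assert_eq!(errors.len(), 1);
}
```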
```
&std::path::Path -> (Input, Vec<InputError>)
```
Gets input from the filesystem.
The most important thing in the input is a mapping from file names to either:
- raw (string) contents, for SML source files
- parsed and processed contents, for CM/MLB "group" files
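A rough sketch of that mapping's shape, with hypothetical types (`GroupFile` here is a stand-in; the real `Input` carries more information):

```rs
#![allow(dead_code)]
use std::collections::HashMap;
use std::path::PathBuf;

/// hypothetical stand-in for a parsed and processed group file
struct GroupFile {
  /// e.g. the member files the group lists
  members: Vec<PathBuf>,
}

enum FileContents {
  /// raw contents of an SML source file
  Sml(String),
  /// processed contents of a CM/MLB group file
  Group(GroupFile),
}

/// the heart of the input: file names mapped to their contents
struct Input {
  files: HashMap<PathBuf, FileContents>,
}

fn main() {
  let mut files = HashMap::<PathBuf, FileContents>::new();
  files.insert("a.sml".into(), FileContents::Sml("val x = 3".to_owned()));
  let _input = Input { files };
}
```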
```
&str -> (Vec<sml_syntax::Token<'_>>, Vec<LexError>)
```
Lexes (aka tokenizes) a string of an SML program into tokens.
Notably, we consider things like whitespace and comments as tokens, instead of ignoring/skipping them. This means that we can exactly reconstruct the input string from the output vec of tokens by concatenating all of the underlying token strings in order.
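Here is a sketch of that round-trip property with hypothetical token types (the real `sml_syntax::Token` differs in its details):

```rs
#![allow(dead_code)]

/// hypothetical token shape: every token remembers its exact source text
enum TokenKind {
  ValKw,
  Name,
  Eq,
  IntLit,
  Whitespace,
  Comment,
}

struct Token<'a> {
  kind: TokenKind,
  text: &'a str,
}

fn main() {
  let input = "val x = 3 (* three *)";
  // roughly what lexing `input` could produce; whitespace and the comment
  // are tokens, not skipped
  let tokens = [
    Token { kind: TokenKind::ValKw, text: "val" },
    Token { kind: TokenKind::Whitespace, text: " " },
    Token { kind: TokenKind::Name, text: "x" },
    Token { kind: TokenKind::Whitespace, text: " " },
    Token { kind: TokenKind::Eq, text: "=" },
    Token { kind: TokenKind::Whitespace, text: " " },
    Token { kind: TokenKind::IntLit, text: "3" },
    Token { kind: TokenKind::Whitespace, text: " " },
    Token { kind: TokenKind::Comment, text: "(* three *)" },
  ];
  // concatenating all the token texts in order rebuilds the input exactly
  let rebuilt: String = tokens.iter().map(|t| t.text).collect();
  assert_eq!(rebuilt, input);
}
```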
```
&[sml_syntax::Token<'_>] -> (sml_syntax::ast::Root, Vec<ParseError>)
```
Parses a sequence of tokens into a sequence of "events". Events are like:
- start a node
- consume a token
- emit an error
- finish a node
Then processes those events to build a lossless CST.
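For illustration, the events for parsing something like `val x = 3` might look as follows (hypothetical event and node kinds; the real types are internal to the parser):

```rs
#![allow(dead_code)]

#[derive(Debug)]
enum SyntaxKind {
  Root,
  ValDec,
}

#[derive(Debug)]
enum Event {
  StartNode(SyntaxKind),
  ConsumeToken,
  Error(String),
  FinishNode,
}

fn main() {
  let events = [
    Event::StartNode(SyntaxKind::Root),
    Event::StartNode(SyntaxKind::ValDec),
    Event::ConsumeToken, // `val`
    Event::ConsumeToken, // `x`
    Event::ConsumeToken, // `=`
    Event::ConsumeToken, // `3`
    Event::FinishNode,
    Event::FinishNode,
  ];
  // a second pass walks the events in order, opening and closing tree nodes
  // and attaching tokens, to build the lossless CST
  for ev in &events {
    println!("{ev:?}");
  }
}
```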
```
&sml_syntax::ast::Root -> (sml_hir::Root, TwoWayPointers, Vec<LowerError>)
```
Lowers ("elaborates") AST into HIR.
When lowering, we turn complex constructs into more fundamental ones. For instance:
```
a andalso b
= if a then b else false
= case a of true => b | false => false
= (fn true => b | false => false) a
```
We also construct a two-way mapping between HIR indices and "pointers" to AST nodes. Each direction has a use:
- HIR -> AST: for recovering the text range for which to report an error for a HIR index.
- AST -> HIR: for knowing what HIR index is e.g. being hovered over by the client.
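A sketch of the mapping's shape, with simplified stand-ins for the index and pointer types (the real `TwoWayPointers` maps HIR arena indices to syntax-node pointers):

```rs
#![allow(dead_code)]
use std::collections::HashMap;

/// stand-ins: a HIR arena index and a stable "pointer" to an AST node
type HirIdx = u32;
type AstPtr = u32;

struct TwoWayPointers {
  hir_to_ast: HashMap<HirIdx, AstPtr>,
  ast_to_hir: HashMap<AstPtr, HirIdx>,
}

impl TwoWayPointers {
  /// HIR -> AST: where in the source to report an error for a HIR index
  fn ast(&self, hir: HirIdx) -> Option<AstPtr> {
    self.hir_to_ast.get(&hir).copied()
  }
  /// AST -> HIR: which HIR index e.g. the client is hovering over
  fn hir(&self, ast: AstPtr) -> Option<HirIdx> {
    self.ast_to_hir.get(&ast).copied()
  }
}

fn main() {
  let mut ptrs =
    TwoWayPointers { hir_to_ast: HashMap::new(), ast_to_hir: HashMap::new() };
  ptrs.hir_to_ast.insert(0, 42);
  ptrs.ast_to_hir.insert(42, 0);
  assert_eq!(ptrs.ast(0), Some(42));
  assert_eq!(ptrs.hir(42), Some(0));
}
```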
```
sml_hir::Root -> sml_hir::Root
```
Handles adding implicitly-scoped type variables to their `val` declaration/specification binding sites.
```
(State, Basis, sml_hir::Root) -> (State, Basis, Info, Vec<StaticsError>)
```
Does static analysis ("typechecking") on HIR, according to the statics semantics of SML from its Definition.
The other passes thus far operate on single files, but this one is meant to be run "across" many files. So we take in and give out updated state.
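Here is a toy driver showing how that shape lets state accumulate across files; every type here is an empty stand-in, not the real `State` or `Basis`:

```rs
#[derive(Default)]
struct State;
#[derive(Default)]
struct Basis;
struct Info;
struct Root;

/// stand-in with the same shape as the statics pass
fn statics(st: State, bs: Basis, _root: &Root) -> (State, Basis, Info, Vec<String>) {
  // a real implementation would update the state and basis and find errors
  (st, bs, Info, Vec::new())
}

fn main() {
  // files in dependency order, e.g. as determined by the group files
  let files = [Root, Root, Root];
  let (mut st, mut bs) = (State::default(), Basis::default());
  let mut errors = Vec::new();
  for f in &files {
    let (new_st, new_bs, _info, mut errs) = statics(st, bs, f);
    (st, bs) = (new_st, new_bs);
    errors.append(&mut errs);
  }
  // the final state could seed later analyses
  let _ = (st, bs);
  assert!(errors.is_empty());
}
```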
This crate contains the implementation of most of the rules of the static semantics for SML from the Definition. (Some trivial ones, like the rules for parenthesized expressions, are handled implicitly when lowering.) Relevant code is tagged with the special comment `@def(N)` for statics rule N.
Statics errors use an abstract `Idx`, and this index gets turned into an actual text range with the `TwoWayPointers` from lower. In the future we could add more to this `Idx`, like "the name of the third con bind in the second dat bind of this datatype dec".
Runs a program, according to the dynamic semantics of SML from its Definition. The program should have been typechecked with the statics already.
These crates do interesting things related to SML files, but aren't really full "passes".
Naively format SML files.
Doesn't handle comments or line length.
Extract interesting comments from above SML syntax nodes, like doc comments.
Given information about a value, returns what kind of symbol it is.
Some common lex utilities used in multiple lexing crates, like:

- Handling `(* ... *)` style block comments with nesting (sketched below).
- Handling SML's definition of whitespace. (We actually consider a superset of the SML definition to be whitespace.)
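For instance, here is a minimal take on the nested block-comment case (an illustrative helper, not the crate's actual API):

```rs
/// skip a nested `(* ... *)` block comment at the start of `s`, returning
/// the index just past the matching `*)` (illustrative, not the real API)
fn skip_block_comment(s: &[u8]) -> Option<usize> {
  if !s.starts_with(b"(*") {
    return None;
  }
  let mut depth = 0usize;
  let mut i = 0;
  while i < s.len() {
    if s[i..].starts_with(b"(*") {
      depth += 1;
      i += 2;
    } else if s[i..].starts_with(b"*)") {
      depth -= 1;
      i += 2;
      if depth == 0 {
        return Some(i);
      }
    } else {
      i += 1;
    }
  }
  // unclosed comment: a real lexer would report an error here
  None
}

fn main() {
  let src = b"(* a (* nested *) b *) val x = 3";
  assert_eq!(skip_block_comment(src), Some(22));
}
```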
Handles paths with `$(FOO)` or `$BAR` style variables in them, as in MLB and CM files.
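A toy version of that expansion (illustrative only; the real crate's API and error handling differ):

```rs
use std::collections::HashMap;

/// expand `$(FOO)` and `$BAR` style variables in a path
fn expand(path: &str, env: &HashMap<&str, &str>) -> String {
  let mut out = String::new();
  let mut it = path.chars().peekable();
  while let Some(c) = it.next() {
    if c != '$' {
      out.push(c);
      continue;
    }
    let mut name = String::new();
    if it.peek() == Some(&'(') {
      // `$(FOO)` form: read up to the closing paren
      it.next();
      while let Some(c) = it.next() {
        if c == ')' {
          break;
        }
        name.push(c);
      }
    } else {
      // `$BAR` form: read an alphanumeric name
      while let Some(&c) = it.peek() {
        if !c.is_alphanumeric() && c != '_' {
          break;
        }
        name.push(c);
        it.next();
      }
    }
    // missing variables expand to nothing in this toy version
    out.push_str(env.get(name.as_str()).copied().unwrap_or(""));
  }
  out
}

fn main() {
  let env = HashMap::from([("SMLNJ_HOME", "/opt/smlnj"), ("ROOT", "/src")]);
  assert_eq!(expand("$(SMLNJ_HOME)/lib", &env), "/opt/smlnj/lib");
  assert_eq!(expand("$ROOT/main.sml", &env), "/src/main.sml");
}
```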
These crates are related to "group files" in the SML ecosystem, namely SML/NJ Compilation Manager and ML Basis.
Processes the syntax of SML/NJ Compilation Manager (`.cm`) files.
Processes the syntax of ML Basis (`.mlb`) files into AST values.
HIR for MLB.
Static semantics for MLB files.
Because the semantics for MLB files determines when source (SML) files get parsed and statically analyzed, this depends on most of the crates that analyze SML.
These crates depend on many other crates to "unite" everything together.
Kinds of source files (like signature and functor files).
Unites all of the purely syntactical SML passes into one single API.
Unifies ~all of the passes into one single API.
Depends on `analysis` and a bunch of third-party crates to implement a language server based on `analysis`. Has a fair amount of "glue code" to convert between `analysis` types and LSP types.
These crates don't really fit in anywhere else.
Extends the `paths` crate with globs (like `**/*.sml`).
Inspired by the `cov-mark` crate, but doesn't use linker shenanigans.
A good default panic hook that shows backtraces and exhorts the reader to file a bug report.
The format for the optional Millet configuration file.
All the tests for all of the other crates. Depends on `analysis`, and consumes its public API to test the functionality of each of the "passes".
No crate except this crate is expected to have tests, and this crate is expected to have nothing but tests.
Test cases are usually SML programs, which contain "expectation comments" asserting that `analysis` should behave a certain way about a certain region of the program.
```rs
use crate::check::check;

#[test]
fn undefined() {
  check(
    r#"
val _ = nope
(**     ^^^^ undefined value: nope *)
"#,
  );
}
```
Tests for the WIP dynamics.
These crates are the only ones that produce executable binaries. Every other crate is only compiled to a library.
Depends on `lang-srv` and actually runs it.
A thin CLI wrapper around `analysis`. It basically does one full analysis of the input, prints any errors to stdout, and exits, much like a conventional compiler or linter.
Most of the code is contained in the Rust crates documented above, but some code lives elsewhere.
Documentation, like this!
Support for specific text editors via language client extensions/"glue code".
The VS Code client extension, in TypeScript.
A task runner written in Rust. Allows you to invoke `cargo xtask <task>` to run the `<task>`.
Configuration for Cargo, Rust's package manager and build tool. Allows `cargo xtask` to work.
At the top level, this sets up a "Cargo workspace" of all the crates in `crates` and `xtask`. Each crate also has its own `Cargo.toml` defining fundamental things about the crate, like its name and dependencies.
A lockfile for Cargo that records the exact versions of all dependencies that Cargo resolved.
Configuration for rustfmt, Rust's automatic code formatter. Before checking code into git, it must be formatted by rustfmt.
Configuration for GitHub, like:
- PR templates
- Issue templates
- How to run CI on GitHub Actions
Configuration for git, to ignore generated files like `/target` (the output folder for Rust).
Configuration for VS Code, like:
- How to launch an instance of VS Code with a local version of the extension for testing
- What files to ignore in the sidebar (e.g. `/target`)