diff --git a/papers/README.md b/papers/README.md
index 9149172..303eedf 100644
--- a/papers/README.md
+++ b/papers/README.md
@@ -1,3 +1,101 @@
# emDB Papers
All academic work, related documents and experiments using emDB.

# Proposed Projects

## Faster Parallel Iterators
Implement a parallel operator implementation for `minister`.
- Efficient splitting of data into chunks for parallel computation.
- A compute-bound problem.
- Minimising overhead from work-stealing.
- Should be usable outside `minister` as a standalone competitor to rayon's
  parallel iterators.

Evaluation is as follows:
- Micro benchmarks against a baseline, defined as the current implementations.
- Macro benchmarks on the emDB benchmarks, comparing operator backends.

## Incremental View Maintenance
Implement an incremental view maintenance (IVM) backend for emDB.

Levels of difficulty:
1. Applied only to select queries, with simple inserts supported.
2. Working with mutations.
3. Working for all operators, including mixed mutations.

Optionally, a hybrid IVM-Serialized backend could be considered.

Contexts are very, very difficult, as there is a need to determine which
expressions depend on values in the context that can be changed (e.g. query
parameters).
- The implementation can simply error out for complex cases with an appropriate
  message.

Evaluation:
- Micro benchmarks to demonstrate the particular advantages for simple select
  queries.
- Macro benchmarks for the IVM backend.

## No-std
Embedded databases for embedded programming.
- Implement a backend (activated with a `no-std` feature for emDB) that
  generates no-std compatible code.
- The constraint on the memory used is important; ideally the backend generates
  a `const fn` to estimate this.
- The backend can throw errors for features not supported (in particular,
  tables that do not have a `limit` constraint).

## Persistence
Generate a fingerprint for each schema, and modify pulpit to use a custom
allocator backed by a memory-mapped file.
- WAL for commit?
- Load at start, and save on commit.
- Optional panic handler?

Evaluation:
- Microbenchmark a basic key-value schema against LMDB (integer keys, versus
  row references in emDB).
- Macrobenchmark (implement the traits generated by the `Interface` backend to
  run the benchmarks), and evaluate the effect on performance with
  flush-on-commit on & off.
- Compare the cost of the persisted versus `Serialized` backends.

## Schema Compilation for Aggregation
Modify emDB to support n-way joins (tricky due to the lack of variadic generics).
- Needs alterations to the interface for `minister` and to code generation.
- Needs to beat DuckDB on the
  [H2O.ai benchmark](https://h2oai.github.io/db-benchmark/).

## emDB Linter & Formatter
Create an emDB formatting tool (e.g. from logical plan -> AST -> emQL).
- Similar to [leptosfmt](https://github.com/bram209/leptosfmt).

Evaluation:
- Demonstrate it is performant enough for in-IDE interaction (live demo,
  `cargo --timings`).
- Demonstrate the applicability of the libraries built for use in other
  'fmt by DSL' tools & custom lints.
- Improvements to

## Combi-based Data Ingestion
Implement an efficient string- or byte-based set of combis (similar to the
current token combis).

Then use this to create a fast parser for JSON,
[BSON](https://bsonspec.org/spec.html), CSV or some other format commonly used
for storing data to be ingested by an embedded database.
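As a rough illustration of the byte-based combi idea, a minimal sketch (all names here are hypothetical, not the existing `combi` API; parsers return positions rather than borrowed slices, which keeps them easy to box and compose):

```rust
/// Hypothetical byte-level parser: reads `input` from `pos`; on success it
/// returns the new position and the parsed value. (Sketch only, not the
/// real `combi` API.)
type Parser<T> = Box<dyn Fn(&[u8], usize) -> Result<(usize, T), String>>;

/// Match a single literal byte.
fn byte(expected: u8) -> Parser<u8> {
    Box::new(move |input: &[u8], pos: usize| match input.get(pos) {
        Some(&b) if b == expected => Ok((pos + 1, b)),
        _ => Err(format!("expected {:?} at byte {}", expected as char, pos)),
    })
}

/// Run `pa` then `pb` on the remaining input, pairing their results.
fn seq<A: 'static, B: 'static>(pa: Parser<A>, pb: Parser<B>) -> Parser<(A, B)> {
    Box::new(move |input: &[u8], pos: usize| {
        let (pos, a) = pa(input, pos)?;
        let (pos, b) = pb(input, pos)?;
        Ok((pos, (a, b)))
    })
}

fn main() {
    let parens = seq(byte(b'('), byte(b')'));
    assert!(parens(b"()", 0).is_ok());
    assert!(parens(b"(x", 0).is_err());
    println!("byte combis ok");
}
```

Tracking only positions (no borrowed output) is one way to make results cheap to hand off to another thread, which matters for the parallel tokentree parsing described below.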
The key innovations expected:
- Split the combinators into a basic tokenize stage (that converts the input
  into a very, very simple tokentree, turning bracketing into a tree).
- Parse the tokentree in parallel (given `(<1>)<2>`, both `1` and `2` can be
  parsed and converted to data structures in parallel).
- Do any semantic checking needed in the combi, rather than as a separate
  traversal.
- Stream results out of the parser (i.e. into a table) before parsing is
  finished.

*Unnecessary extra: could potentially either add an `ingest` operator (take a
file path & format, add to a table), or a special constraint that allows some
tables to be filled on `Datastore` construction (file path; easier to deal with
concurrency & the choice of data structure).*

Evaluation:
- Microbenchmarks against some other Rust BSON/JSON parsers (e.g. MongoDB's
  bson parser).
- Macrobenchmark emDB using a combi-based JSON parse (with some structure to be
  transformed) into an emDB database, versus DuckDB reading from a file. This
  may well be slower; consider other possible uses, and explain why.

## Grammar Aware Fuzzing
Implement easy-to-use fuzzing for parsers expressed using combi.
- Use the combi structure to generate likely inputs (for some we cannot tell
  from the structure of the combis alone, e.g. code that panics inside a
  `mapsuc`, a pipe to a custom), then test the parser.
- Can modify the `Combi` trait for this (e.g. add a `fuzz input` method to
  generate input).

Evaluation:
- Use this to find bugs in the current emQL language frontend.
- Compare with a non-grammar-aware fuzzer.
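A minimal sketch of the "fuzz input" idea: each node in the grammar (mirroring the combinator structure) knows how to emit one input it would accept. All names below are hypothetical, not the current `Combi` trait:

```rust
use std::fmt::Write;

/// A toy grammar mirroring combinator structure (hypothetical, not `combi`).
enum Gram {
    Lit(&'static str),        // a literal token
    Seq(Vec<Gram>),           // sequence of sub-parsers
    Choice(Vec<Gram>),        // alternatives (must be non-empty)
    Repeat(Box<Gram>, usize), // repetition up to a bound
}

impl Gram {
    /// Generate one accepted input, driven by a deterministic xorshift seed
    /// so fuzzing runs are reproducible.
    fn fuzz_input(&self, seed: &mut u64, out: &mut String) {
        // xorshift step: cheap, deterministic pseudo-randomness
        *seed ^= *seed << 13;
        *seed ^= *seed >> 7;
        *seed ^= *seed << 17;
        match self {
            Gram::Lit(s) => {
                write!(out, "{} ", s).unwrap();
            }
            Gram::Seq(parts) => {
                for p in parts {
                    p.fuzz_input(seed, out);
                }
            }
            Gram::Choice(alts) => {
                let idx = (*seed as usize) % alts.len();
                alts[idx].fuzz_input(seed, out);
            }
            Gram::Repeat(inner, max) => {
                let n = (*seed as usize) % (max + 1);
                for _ in 0..n {
                    inner.fuzz_input(seed, out);
                }
            }
        }
    }
}

fn main() {
    // A hypothetical `query name { op; ... }`-shaped grammar.
    let grammar = Gram::Seq(vec![
        Gram::Lit("query"),
        Gram::Choice(vec![Gram::Lit("foo"), Gram::Lit("bar")]),
        Gram::Lit("{"),
        Gram::Repeat(Box::new(Gram::Lit("op;")), 3),
        Gram::Lit("}"),
    ]);
    let mut seed = 0x9e3779b97f4a7c15u64;
    let mut input = String::new();
    grammar.fuzz_input(&mut seed, &mut input);
    println!("{}", input);
}
```

Every generated string is accepted by construction, so these inputs exercise the semantic actions (the `mapsuc` closures, custom combis, etc.) rather than the error paths, which is exactly where structure alone cannot predict panics.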