Added other possible projects

OliverKillane · Jul 1, 2024 · 503b87a · 503b87a
1 parent ade54f7
commit 503b87a
Showing 1 changed file with 98 additions and 0 deletions.
diff --git a/papers/README.md b/papers/README.md
@@ -1,3 +1,101 @@
 # <img src="./../crates/emdb/docs/logo.drawio.svg" alt="emDB" style="vertical-align: middle;" title="emdb logo" width="100"/> Papers
 
 All academic work, related documents and experiments using <img src="./../crates/emdb/docs/logo.drawio.svg" alt="emDB" style="vertical-align: middle;" title="emdb logo" width="50"/>.
+
+# Proposed Projects
+## Faster Parallel Iterators
+Implement a parallel operator implementation for `minister`.
+- Efficient splitting of data into chunks for parallel computation.
+- A compute-bound problem.
+- Minimising overhead from work-stealing.
+- Should be usable outside minister as a standalone competitor to rayon's 
+  parallel iterators.
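
As a point of reference for the chunk-splitting goal, here is a minimal sketch of a naive baseline using only the standard library: the input is cut into roughly equal chunks and each chunk is mapped and summed on its own scoped thread. The names `parallel_map_sum` and `workers` are hypothetical and not part of `minister`; there is no work-stealing here, which is exactly the overhead trade-off a real implementation would need to explore.

```rust
use std::thread;

/// Sum `f(x)` over `data` by splitting the slice into roughly equal chunks,
/// one scoped thread per chunk. Hypothetical helper, not a `minister` API.
pub fn parallel_map_sum<T, F>(data: &[T], workers: usize, f: F) -> u64
where
    T: Sync,
    F: Fn(&T) -> u64 + Sync,
{
    if data.is_empty() {
        return 0;
    }
    // Ceiling division, so at most `workers` chunks are produced.
    let workers = workers.max(1);
    let chunk_len = (data.len() + workers - 1) / workers;
    let f = &f;
    thread::scope(|s| {
        // Spawn every worker before joining any of them.
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().map(f).sum::<u64>()))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .sum::<u64>()
    })
}
```

Static chunking like this works well when every element costs about the same; skewed workloads are where work-stealing (and its overhead) starts to matter.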
+
+Evaluation is as follows:
+- Micro benchmarks against a baseline defined as the current implementations
+- Macro benchmark on the emDB benchmarks, comparing operator backends
+
+## Incremental View Maintenance
+Implement an incremental view maintenance (IVM) backend for emDB.
+
+Levels of difficulty:
+1. Applied to only select queries, with simple inserts supported.
+2. Working with mutations.
+3. Working for all operators including mixed mutations.
+
+Optionally a hybrid IVM-Serialized backend could be considered.
+
+Contexts are very difficult to support, as the backend must determine which
+expressions depend on values in the context that can change (e.g. query
+parameters).
+- The implementation can simply error out for complex cases with an appropriate message.
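
To illustrate the core idea of incremental view maintenance, the sketch below keeps a `SUM(value) GROUP BY key` view up to date by applying per-row deltas instead of recomputing from the base table. All names (`Delta`, `SumByKey`) are hypothetical and the example is independent of emDB's operators; it covers only inserts and deletes on a single aggregate.

```rust
use std::collections::HashMap;

/// A change to the base table (hypothetical type for illustration).
pub enum Delta {
    Insert { key: u32, value: i64 },
    Delete { key: u32, value: i64 },
}

/// Incrementally maintained `SUM(value) GROUP BY key` view.
#[derive(Default)]
pub struct SumByKey {
    /// key -> (running sum, number of contributing rows)
    groups: HashMap<u32, (i64, usize)>,
}

impl SumByKey {
    /// Update the view in O(1) per change, without rescanning the table.
    pub fn apply(&mut self, delta: &Delta) {
        match *delta {
            Delta::Insert { key, value } => {
                let entry = self.groups.entry(key).or_insert((0, 0));
                entry.0 += value;
                entry.1 += 1;
            }
            Delta::Delete { key, value } => {
                let now_empty = match self.groups.get_mut(&key) {
                    Some(entry) => {
                        entry.0 -= value;
                        entry.1 -= 1;
                        entry.1 == 0
                    }
                    None => false, // deleting an unknown row is ignored
                };
                if now_empty {
                    self.groups.remove(&key);
                }
            }
        }
    }

    /// Read the current aggregate for `key`, if the group exists.
    pub fn get(&self, key: u32) -> Option<i64> {
        self.groups.get(&key).map(|entry| entry.0)
    }
}
```

Updates are the hard part in general: a mutation can be modelled as a delete followed by an insert, but operators such as joins need deltas propagated through both inputs.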
+
+Evaluation:
+- Micro benchmarks to demonstrate particular advantages for simple select queries
+- Macro benchmarks for IVM backend
+
+## No-std
+Embedded databases for embedded programming.
+- Implement a backend (activated with a `no-std` feature for emDB) that generates
+  no-std compatible code.
+- A bound on the memory used is important; ideally the backend generates a `const fn` to
+  estimate this.
+- The backend can emit errors for unsupported features (in particular tables that
+  do not have a `limit` constraint).
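
A minimal sketch of what a generated memory estimate could look like, assuming every table has a `limit` and a fixed row size. The types `TableSpec`, `UserRow` and `EventRow` are hypothetical stand-ins for generated code, and the estimate ignores index and bookkeeping overhead. Only `core` is used, so the same code would compile in a `no_std` crate.

```rust
use core::mem::size_of;

/// Hypothetical per-table description a backend could emit.
pub struct TableSpec {
    pub row_bytes: usize,
    pub limit: usize,
}

/// Hypothetical generated row types.
#[repr(C)]
pub struct UserRow {
    pub id: u64,
    pub score: u64,
}

#[repr(C)]
pub struct EventRow {
    pub at: u64,
}

/// Upper bound on row storage, computable at compile time.
pub const fn estimated_bytes(tables: &[TableSpec]) -> usize {
    let mut total = 0;
    let mut i = 0;
    // `for` loops are not available in `const fn`, so use `while`.
    while i < tables.len() {
        total += tables[i].row_bytes * tables[i].limit;
        i += 1;
    }
    total
}

pub const TABLES: [TableSpec; 2] = [
    TableSpec { row_bytes: size_of::<UserRow>(), limit: 100 },
    TableSpec { row_bytes: size_of::<EventRow>(), limit: 10 },
];

/// The estimate is a constant, so it can size a static arena with no allocator.
pub const DATASTORE_BYTES: usize = estimated_bytes(&TABLES);
pub static ARENA: [u8; DATASTORE_BYTES] = [0; DATASTORE_BYTES];
```

A table without a `limit` has no such bound, which is why rejecting it at compile time is a reasonable behaviour for this backend.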
+
+## Persistence
+Generate a fingerprint for each schema, and modify pulpit to use a custom allocator
+that is backed by a memory-mapped file.
+- WAL for commit?
+- Load at start, and save on commit.
+- Optional panic handler?
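
One possible shape for the schema fingerprint, sketched under the assumption that a schema can be flattened to a table name plus `(column, type)` pairs: hash them with 64-bit FNV-1a (chosen here only because it is simple and stable across compiler versions, unlike `std`'s default hasher) and store the result in the file header. The function names and the 8-byte header layout are hypothetical.

```rust
/// 64-bit FNV-1a parameters.
pub const FNV_OFFSET: u64 = 0xcbf29ce484222325;
pub const FNV_PRIME: u64 = 0x100000001b3;

/// Fold `bytes` into a running FNV-1a state.
pub fn fnv1a_extend(mut hash: u64, bytes: &[u8]) -> u64 {
    for &byte in bytes {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Fingerprint of one table: its name, then each column name and type.
/// A zero byte separates fields so ("ab", "c") and ("a", "bc") differ.
pub fn schema_fingerprint(table: &str, columns: &[(&str, &str)]) -> u64 {
    let mut hash = fnv1a_extend(FNV_OFFSET, table.as_bytes());
    hash = fnv1a_extend(hash, &[0]);
    for (name, ty) in columns {
        hash = fnv1a_extend(hash, name.as_bytes());
        hash = fnv1a_extend(hash, &[0]);
        hash = fnv1a_extend(hash, ty.as_bytes());
        hash = fnv1a_extend(hash, &[0]);
    }
    hash
}

/// Refuse to load a file whose stored fingerprint does not match the
/// schema the program was compiled with.
pub fn check_header(header: [u8; 8], current: u64) -> Result<(), String> {
    let stored = u64::from_le_bytes(header);
    if stored == current {
        Ok(())
    } else {
        Err(format!("schema mismatch: file {stored:#x}, program {current:#x}"))
    }
}
```

Checking the fingerprint at load time turns a schema change into a clear error rather than a misinterpreted memory-mapped file.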
+
+Evaluation:
+- Microbenchmark basic key-value schema against LMDB (integer keys, versus 
+  row-references in emDB)
+- Macrobenchmark (implement the traits generated by `Interface` backend, to run 
+  benchmarks), evaluate effect on performance with flush-on-commit on & off.
+- Compare cost of persisted versus `Serialized` backends
+
+## Schema Compilation for Aggregation
+Modify emDB to support n-way joins (tricky due to the lack of variadic generics).
+ - Need to alter the interface for `minister` and code generation.
+ - Needs to beat DuckDB on the [H2O.ai benchmark](https://h2oai.github.io/db-benchmark/).
+
+## emDB Linter & Formatter
+Create an emDB formatting tool (e.g. from logical plan -> AST -> emQL)
+ - Similar to [leptosfmt](https://github.com/bram209/leptosfmt)
+
+Evaluation:
+- Demonstrate it is performant enough for in-IDE interaction (live demo + `cargo --timings`) 
+- Demonstrate that the libraries built are applicable to other 'fmt by DSL' tools & custom lints.
+- Improvements to 
+
+## Combi-based Data Ingestion
+Implement an efficient string- or bytes-based set of combis (similar to the current
+token combis).
+
+Then use this to create a fast parser for JSON, [BSON](https://bsonspec.org/spec.html), CSV or some
+other format commonly used for storing data to be ingested by an embedded database.
+
+The key innovation expected:
+ - Split combinators into a basic tokenize stage (that converts the input to a very, very simple tokentree, i.e. converts bracketing into a tree)
+ - Parse the tokentree in parallel (for `(<1>)<2>`, both `1` and `2` can be parsed and converted to data structures in parallel)
+ - Do any semantic checking needed in the combi, rather than as a separate traversal
+ - Stream results out of the parser (e.g. into a table) before parsing is finished
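
The two-stage idea can be sketched with the standard library alone. Stage one only resolves bracketing into a tree; stage two parses each top-level subtree on its own thread. Everything here is a hypothetical illustration, not combi's API: the `TokenTree` type, the choice of `(`/`)` as the only brackets, and the toy "parser" that sums whitespace-separated integers.

```rust
use std::thread;

#[derive(Debug, PartialEq)]
pub enum TokenTree {
    Leaf(String),
    Group(Vec<TokenTree>),
}

/// Stage 1: resolve bracketing only; leaf text is left unparsed.
pub fn tokenize(input: &str) -> Result<Vec<TokenTree>, String> {
    // Invariant: `stack` is never empty at the top of the loop.
    let mut stack: Vec<Vec<TokenTree>> = vec![Vec::new()];
    let mut leaf = String::new();
    for c in input.chars() {
        if c != '(' && c != ')' {
            leaf.push(c);
            continue;
        }
        if !leaf.is_empty() {
            let text = std::mem::take(&mut leaf);
            stack.last_mut().unwrap().push(TokenTree::Leaf(text));
        }
        if c == '(' {
            stack.push(Vec::new());
        } else {
            let group = stack.pop().unwrap();
            match stack.last_mut() {
                Some(parent) => parent.push(TokenTree::Group(group)),
                None => return Err("unmatched ')'".to_string()),
            }
        }
    }
    if !leaf.is_empty() {
        stack.last_mut().unwrap().push(TokenTree::Leaf(leaf));
    }
    if stack.len() != 1 {
        return Err("unclosed '('".to_string());
    }
    Ok(stack.pop().unwrap())
}

/// Toy stage-2 parser: sum every integer inside one subtree.
fn sum_tree(tree: &TokenTree) -> Result<i64, String> {
    match tree {
        TokenTree::Leaf(text) => text
            .split_whitespace()
            .map(|w| w.parse::<i64>().map_err(|e| format!("bad integer {w:?}: {e}")))
            .sum(),
        TokenTree::Group(children) => children.iter().map(sum_tree).sum(),
    }
}

/// Stage 2: parse each top-level subtree on its own scoped thread.
pub fn parse_parallel(trees: &[TokenTree]) -> Result<Vec<i64>, String> {
    thread::scope(|s| {
        let handles: Vec<_> = trees
            .iter()
            .map(|tree| s.spawn(move || sum_tree(tree)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("parser thread panicked"))
            .collect()
    })
}
```

For example, `tokenize("(1 2 3)(4 (5 6))")` yields two top-level groups that `parse_parallel` can process independently. One thread per subtree is only sensible for large subtrees; a real implementation would likely hand subtrees to a pool.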
+
+*Unnecessary extra: Could potentially either add an `ingest` operator (taking a file path & format, and adding to a table), or a special constraint that allows some tables to be filled on `Datastore` construction (file path only; easier to deal with concurrency & choice of data structure).*
+
+Evaluation:
+ - Microbenchmarks against some other Rust BSON/JSON parsers (e.g. MongoDB's BSON parser)
+ - Macrobenchmark emDB using a combi-based JSON parser (with some structure to be transformed)
+   to load an emDB database, versus DuckDB reading from a file. This may well be slower; consider other
+   possible uses, and explain why.
+
+## Grammar Aware Fuzzing
+Implement easy-to-use fuzzing for parsers expressed using combi.
+- Use the combi structure to generate likely inputs (for some combis we cannot tell from structure alone, e.g. code that panics inside a `mapsuc`, a pipe to a custom), then test the parser
+- Can modify the `Combi` trait for this (e.g. add a 'fuzz input' method to generate input)
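
A rough sketch of how structure-driven input generation could look. It uses a separate hypothetical `FuzzInput` trait and toy combinators (`Literal`, `Seq`, `Choice`, `Many`) rather than the real `Combi` trait, whose signature is not reproduced here. Each combinator knows how to emit text it would accept, driven by a small deterministic xorshift generator so failures can be replayed from a seed.

```rust
/// Tiny deterministic xorshift64 generator (so runs are reproducible).
pub struct Rng(u64);

impl Rng {
    pub fn new(seed: u64) -> Self {
        Rng(seed.max(1)) // xorshift state must be non-zero
    }

    pub fn next_u64(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// Hypothetical trait: append one input this parser would accept.
pub trait FuzzInput {
    fn fuzz_input(&self, rng: &mut Rng, out: &mut String);
}

/// Matches exactly this text.
pub struct Literal(pub &'static str);
/// Matches `A` followed by `B`.
pub struct Seq<A, B>(pub A, pub B);
/// Matches either `A` or `B`.
pub struct Choice<A, B>(pub A, pub B);
/// Matches `A` repeated between 0 and `.1` times.
pub struct Many<A>(pub A, pub usize);

impl FuzzInput for Literal {
    fn fuzz_input(&self, _rng: &mut Rng, out: &mut String) {
        out.push_str(self.0);
    }
}

impl<A: FuzzInput, B: FuzzInput> FuzzInput for Seq<A, B> {
    fn fuzz_input(&self, rng: &mut Rng, out: &mut String) {
        self.0.fuzz_input(rng, out);
        self.1.fuzz_input(rng, out);
    }
}

impl<A: FuzzInput, B: FuzzInput> FuzzInput for Choice<A, B> {
    fn fuzz_input(&self, rng: &mut Rng, out: &mut String) {
        if rng.next_u64() % 2 == 0 {
            self.0.fuzz_input(rng, out);
        } else {
            self.1.fuzz_input(rng, out);
        }
    }
}

impl<A: FuzzInput> FuzzInput for Many<A> {
    fn fuzz_input(&self, rng: &mut Rng, out: &mut String) {
        let repeats = (rng.next_u64() % (self.1 as u64 + 1)) as usize;
        for _ in 0..repeats {
            self.0.fuzz_input(rng, out);
        }
    }
}
```

Generated strings would then be fed to the parser under test, with any rejection or panic on a structurally valid input reported together with the seed. Deliberately corrupting a valid input is a natural extension for exercising error paths.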
+
+Evaluation:
+ - Use this to find bugs in the current emQL language frontend.
+ - Compare with a non-grammar-aware fuzzer