1 parent ade54f7, commit 503b87a
Showing 1 changed file with 98 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,101 @@ | ||
# <img src="./../crates/emdb/docs/logo.drawio.svg" alt="emDB" style="vertical-align: middle;" title="emdb logo" width="100"/> Papers | ||
|
||
All academic work, related documents and experiments using <img src="./../crates/emdb/docs/logo.drawio.svg" alt="emDB" style="vertical-align: middle;" title="emdb logo" width="50"/>. | ||
|
||
# Proposed Projects | ||
## Faster Parallel Iterators | ||
Build a parallel operator implementation for `minister`. | |
- Efficient splitting of data into chunks for parallel computation. | |
- A compute-bound problem. | |
- Minimising overhead from work-stealing. | |
- Should be usable outside `minister` as a standalone competitor to rayon's | |
parallel iterators. | |
|
||
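As a rough illustration of the chunk-splitting point above, the sketch below sums a slice by handing contiguous chunks to scoped threads from the standard library. It is not `minister`'s operator interface: the name `parallel_sum` is hypothetical, and it uses static chunking rather than work-stealing, which is the kind of baseline the project would aim to beat.

```rust
use std::thread;

/// Sum `data` by splitting it into at most `workers` contiguous chunks,
/// summing each chunk on its own scoped thread, then combining the results.
/// (Hypothetical helper for illustration; static chunking, no work-stealing.)
fn parallel_sum(data: &[u64], workers: usize) -> u64 {
    if data.is_empty() {
        return 0;
    }
    let workers = workers.max(1);
    // Ceiling division, so that at most `workers` chunks are produced.
    let chunk_len = (data.len() + workers - 1) / workers;
    thread::scope(|scope| {
        // Spawn every worker before joining any of them.
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| scope.spawn(move || chunk.iter().sum::<u64>()))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker panicked"))
            .sum::<u64>()
    })
}

fn main() {
    let data: Vec<u64> = (1..=1_000).collect();
    println!("sum = {}", parallel_sum(&data, 4));
}
```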
Evaluation is as follows: | ||
- Micro benchmarks against a baseline defined as the current implementations | ||
- Macro benchmark on the emDB benchmarks, comparing operator backends | ||
|
||
## Incremental View Maintenance | ||
Implement an incremental view maintenance backend for emDB. | ||
Levels of difficulty: | |
1. Applied to only select queries, with simple inserts supported. | ||
2. Working with mutations. | ||
3. Working for all operators including mixed mutations. | ||
|
||
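To make the first level concrete, here is a minimal sketch of a select-style view that is kept up to date on insert instead of being recomputed. All names (`Table`, `FilteredSumView`) are hypothetical and unrelated to emDB's generated code. The filter threshold is fixed at construction, so the sketch does not attempt the case where the filter depends on a changeable value.

```rust
/// Materialised result of "count and sum of all rows with value > threshold".
struct FilteredSumView {
    threshold: i64,
    count: usize,
    sum: i64,
}

impl FilteredSumView {
    fn new(threshold: i64) -> Self {
        Self { threshold, count: 0, sum: 0 }
    }

    /// Apply the delta for one inserted row; no rescan of the table.
    fn on_insert(&mut self, value: i64) {
        if value > self.threshold {
            self.count += 1;
            self.sum += value;
        }
    }
}

/// Append-only table that forwards every insert to its view.
struct Table {
    rows: Vec<i64>,
    view: FilteredSumView,
}

impl Table {
    fn new(threshold: i64) -> Self {
        Self { rows: Vec::new(), view: FilteredSumView::new(threshold) }
    }

    fn insert(&mut self, value: i64) {
        self.rows.push(value);
        self.view.on_insert(value);
    }

    /// Incrementally maintained answer: O(1) per query.
    fn query(&self) -> (usize, i64) {
        (self.view.count, self.view.sum)
    }

    /// Full recomputation, kept only to check the incremental answer.
    fn recompute(&self) -> (usize, i64) {
        let threshold = self.view.threshold;
        let matching: Vec<i64> =
            self.rows.iter().copied().filter(|v| *v > threshold).collect();
        (matching.len(), matching.iter().sum())
    }
}

fn main() {
    let mut table = Table::new(10);
    for value in [5, 11, 20] {
        table.insert(value);
    }
    assert_eq!(table.query(), table.recompute());
    println!("{:?}", table.query());
}
```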
Optionally a hybrid IVM-Serialized backend could be considered. | ||
|
||
Contexts are very difficult to support, because the backend must determine which | |
expressions depend on context values that can change (e.g. query | |
parameters). | |
- The implementation can simply error out on complex cases with an appropriate message. | |
|
||
Evaluation: | ||
- Micro benchmarks to demonstrate particular advantages for simple select queries | ||
- Macro benchmarks for IVM backend | ||
|
||
## No-std | ||
Embedded databases for embedded programming. | ||
- Implement a backend (activated with a `no-std` feature for emDB) that generates | |
no-std compatible code. | |
- A bound on the memory used is important; ideally the backend generates a `const fn` to | |
estimate this. | |
- The backend can emit errors for features that are not supported (in particular, tables that | |
do not have a `limit` constraint). | |
|
||
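One possible shape for the `const fn` memory estimate is sketched below. The row type `Reading`, the limit of 128 and the 4 KiB budget are assumptions made for illustration, not anything emDB generates today. The estimate covers only inline row storage plus a length counter, and the sketch is compiled with `std` purely so that it can print the result.

```rust
use core::mem::size_of;

/// Hypothetical row type for a table declared with a `limit` constraint.
#[allow(dead_code)]
struct Reading {
    sensor_id: u16,
    value: i32,
}

/// Hypothetical row limit taken from the schema.
const READINGS_LIMIT: usize = 128;

/// Estimate, in bytes, for a table stored as `limit` inline rows plus a
/// `usize` length counter. Index structures are not accounted for.
const fn table_bytes<Row>(limit: usize) -> usize {
    size_of::<Row>() * limit + size_of::<usize>()
}

/// Evaluated at compile time, so it can size a static buffer or be checked
/// against the memory available on the target.
const READINGS_BYTES: usize = table_bytes::<Reading>(READINGS_LIMIT);

// Compile-time budget check: the build fails if the estimate exceeds 4 KiB.
const _: () = assert!(READINGS_BYTES <= 4096);

fn main() {
    println!("readings table needs about {READINGS_BYTES} bytes");
}
```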
## Persistence | ||
Generate a fingerprint for each schema, and modify `pulpit` to use a custom allocator | |
that is backed by a memory-mapped file. | |
- WAL for commit? | ||
- Load at start, and save on commit. | ||
- Optional panic handler? | ||
|
||
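The fingerprint could be as simple as a stable hash of a canonical schema description, stored in the file header and compared on load. The sketch below uses 64-bit FNV-1a written as a `const fn`. The choice of hash, the textual schema format and the `check_header` helper are all assumptions for illustration.

```rust
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// 64-bit FNV-1a over the bytes of a schema description. Unlike the standard
/// library's default hasher, the result is stable across runs and compiler
/// versions, which a value written to disk needs.
const fn fingerprint(schema: &str) -> u64 {
    let bytes = schema.as_bytes();
    let mut hash = FNV_OFFSET;
    let mut i = 0;
    while i < bytes.len() {
        hash ^= bytes[i] as u64;
        hash = hash.wrapping_mul(FNV_PRIME);
        i += 1;
    }
    hash
}

/// Hypothetical schema text; a real backend would derive it from the plan.
const SCHEMA: &str = "table users { id: u64 }";
const SCHEMA_FINGERPRINT: u64 = fingerprint(SCHEMA);

/// Refuse to open a file whose stored fingerprint does not match the schema.
fn check_header(stored: u64, schema: &str) -> Result<(), String> {
    let expected = fingerprint(schema);
    if stored == expected {
        Ok(())
    } else {
        Err(format!(
            "schema mismatch: file has {stored:#018x}, expected {expected:#018x}"
        ))
    }
}

fn main() {
    assert!(check_header(SCHEMA_FINGERPRINT, SCHEMA).is_ok());
    println!("fingerprint = {SCHEMA_FINGERPRINT:#018x}");
}
```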
Evaluation: | ||
- Microbenchmark a basic key-value schema against LMDB (integer keys, versus | |
row-references in emDB) | |
- Macrobenchmark (implement the traits generated by the `Interface` backend, to run | |
benchmarks), and evaluate the effect on performance of flush-on-commit being on & off. | |
- Compare cost of persisted versus `Serialized` backends | ||
|
||
## Schema Compilation for Aggregation | ||
Modify emDB to support n-way joins (tricky due to the lack of variadic generics). | |
- Need to alter the interface for `minister` and code generation | |
- Needs to beat DuckDB on the [h2o.ai benchmark](https://h2oai.github.io/db-benchmark/) | |
|
||
## emDB Linter & Formatter | ||
Create an emDB formatting tool (e.g. from logical plan -> AST -> emQL) | ||
- Similar to [leptosfmt](https://github.com/bram209/leptosfmt) | ||
|
||
Evaluation: | ||
- Demonstrate it is performant enough for in-IDE interaction (live demo + `cargo --timings`) | ||
- Demonstrate applicability of libraries built for use in other 'fmt by DSL' & custom lints. | ||
- Improvements to | ||
|
||
## Combi-based Data Ingestion | ||
Implement an efficient string- or byte-based set of combis (similar to the current | |
token combis). | |
|
||
Then use this to create a fast parser for JSON, [BSON](https://bsonspec.org/spec.html), CSV or some | |
other format commonly used for storing data to be ingested by an embedded database. | |
|
||
The key innovation expected: | ||
- Split combinators into a basic tokenize stage (that converts the input to a very, very simple token tree, i.e. converts bracketing into a tree) | |
- Parse the token tree in parallel (in `(<1>)<2>`, both `1` and `2` can be parsed and converted to data structures in parallel) | |
- Do any semantic checking needed in the combi, rather than as a separate traversal | |
- Stream results out of the parser (i.e. into a table) before parsing is finished | |
|
||
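The two-stage idea above can be sketched with nothing but the standard library: a first pass turns bracketing into a tree, then each top-level subtree is parsed on its own thread, with value checks done during that parse. This is not the combi API. `TokenTree`, `tokenize`, `parse_tree` and `parse_parallel` are hypothetical names, and the toy format (parenthesised, whitespace-separated integers that are summed per subtree) stands in for a real data format.

```rust
use std::thread;

#[derive(Debug, PartialEq)]
enum TokenTree {
    Text(String),          // run of non-bracket characters
    Group(Vec<TokenTree>), // contents of one matched `( ... )`
}

/// Stage 1: convert bracketing into a tree, leaving everything else as text.
fn tokenize(input: &str) -> Result<Vec<TokenTree>, String> {
    // Stack of partially built groups; the bottom entry is the top level.
    let mut stack: Vec<Vec<TokenTree>> = vec![Vec::new()];
    let mut text = String::new();
    for c in input.chars() {
        if c != '(' && c != ')' {
            text.push(c);
            continue;
        }
        if !text.is_empty() {
            let run = TokenTree::Text(std::mem::take(&mut text));
            stack.last_mut().unwrap().push(run);
        }
        if c == '(' {
            stack.push(Vec::new());
        } else {
            let group = stack.pop().unwrap();
            match stack.last_mut() {
                Some(parent) => parent.push(TokenTree::Group(group)),
                None => return Err("unmatched `)`".to_string()),
            }
        }
    }
    if !text.is_empty() {
        stack.last_mut().unwrap().push(TokenTree::Text(text));
    }
    if stack.len() != 1 {
        return Err("unclosed `(`".to_string());
    }
    Ok(stack.pop().unwrap())
}

/// Stage 2: parse one subtree, checking each integer as it is parsed.
fn parse_tree(tree: &TokenTree) -> Result<i64, String> {
    match tree {
        TokenTree::Text(text) => text
            .split_whitespace()
            .map(|w| w.parse::<i64>().map_err(|e| format!("bad integer `{w}`: {e}")))
            .sum::<Result<i64, String>>(),
        TokenTree::Group(children) => {
            children.iter().map(parse_tree).sum::<Result<i64, String>>()
        }
    }
}

/// Parse every top-level subtree on its own scoped thread.
fn parse_parallel(trees: &[TokenTree]) -> Result<Vec<i64>, String> {
    thread::scope(|scope| {
        let handles: Vec<_> = trees
            .iter()
            .map(|tree| scope.spawn(move || parse_tree(tree)))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("parser thread panicked"))
            .collect::<Result<Vec<i64>, String>>()
    })
}

fn main() {
    let trees = tokenize("(1 2)(3 (4 5))6").expect("balanced brackets");
    println!("{:?}", parse_parallel(&trees));
}
```

A thread per subtree is far too coarse for real inputs; a practical version would batch subtrees, and would stream finished values out rather than collecting them.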
*Unnecessary extra: Could potentially either add an `ingest` operator (take file path & format, add to table), or a special constraint that allows some tables to be filled on `Datastore` construction (file path, easier to deal with concurrency & choice of data structure)* | |
|
||
Evaluation: | ||
- Microbenchmarks against some other Rust BSON/JSON parsers (e.g. MongoDB's BSON parser) | |
- Macrobenchmark emDB using a combi-based JSON parser (with some structure to be transformed) | |
into an emDB database, versus DuckDB reading from a file. This may well be slower; if so, consider other | |
possible uses, and explain why. | |
|
||
## Grammar Aware Fuzzing | ||
Implement easy-to-use fuzzing for parsers expressed using combi. | |
- Use the combi structure to generate likely inputs (some behaviour cannot be inferred from the structure of the combis, e.g. code that panics inside a `mapsuc`, a pipe to a custom), then test the parser | |
- Can modify the `Combi` trait for this (e.g. add a 'fuzz input' method to generate input) | |
|
||
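The sketch below shows one way a combinator could carry its own input generator. The `Parser` trait is a small stand-in written for this example, not the real `Combi` trait, and `fuzz_input`, `Lit`, `Seq`, `Choice`, `Many` and the xorshift `Rng` are all hypothetical. Each combinator emits text that it should itself accept, so any generated input that the composed parser rejects, or that makes it panic, points at a bug.

```rust
/// Tiny deterministic xorshift64 PRNG, so the sketch needs no external crates.
struct Rng(u64);

impl Rng {
    fn next(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
    fn below(&mut self, n: u64) -> u64 {
        self.next() % n
    }
}

/// Hypothetical stand-in for a combinator trait (not the real `Combi`).
trait Parser {
    /// Parse a prefix of `input`, returning the unconsumed rest on success.
    fn parse<'a>(&self, input: &'a str) -> Option<&'a str>;
    /// Append text that this parser is expected to accept.
    fn fuzz_input(&self, rng: &mut Rng, out: &mut String);
}

struct Lit(&'static str); // exact literal
struct Seq<A, B>(A, B); // A then B
struct Choice<A, B>(A, B); // A, else B
struct Many<A>(A); // zero or more A

impl Parser for Lit {
    fn parse<'a>(&self, input: &'a str) -> Option<&'a str> {
        input.strip_prefix(self.0)
    }
    fn fuzz_input(&self, _rng: &mut Rng, out: &mut String) {
        out.push_str(self.0);
    }
}

impl<A: Parser, B: Parser> Parser for Seq<A, B> {
    fn parse<'a>(&self, input: &'a str) -> Option<&'a str> {
        self.0.parse(input).and_then(|rest| self.1.parse(rest))
    }
    fn fuzz_input(&self, rng: &mut Rng, out: &mut String) {
        self.0.fuzz_input(rng, out);
        self.1.fuzz_input(rng, out);
    }
}

impl<A: Parser, B: Parser> Parser for Choice<A, B> {
    fn parse<'a>(&self, input: &'a str) -> Option<&'a str> {
        self.0.parse(input).or_else(|| self.1.parse(input))
    }
    fn fuzz_input(&self, rng: &mut Rng, out: &mut String) {
        if rng.below(2) == 0 {
            self.0.fuzz_input(rng, out)
        } else {
            self.1.fuzz_input(rng, out)
        }
    }
}

impl<A: Parser> Parser for Many<A> {
    fn parse<'a>(&self, mut input: &'a str) -> Option<&'a str> {
        while let Some(rest) = self.0.parse(input) {
            if rest.len() == input.len() {
                break; // no progress, stop to avoid looping forever
            }
            input = rest;
        }
        Some(input)
    }
    fn fuzz_input(&self, rng: &mut Rng, out: &mut String) {
        for _ in 0..rng.below(4) {
            self.0.fuzz_input(rng, out);
        }
    }
}

fn item() -> impl Parser {
    Choice(Lit("0"), Lit("1"))
}

/// Toy grammar: `[` item (`,` item)* `]`.
fn grammar() -> impl Parser {
    Seq(Lit("["), Seq(item(), Seq(Many(Seq(Lit(","), item())), Lit("]"))))
}

/// Generate one input from the grammar's structure; the seed must be non-zero.
fn generate(parser: &impl Parser, seed: u64) -> String {
    let mut rng = Rng(seed.max(1));
    let mut out = String::new();
    parser.fuzz_input(&mut rng, &mut out);
    out
}

fn main() {
    let parser = grammar();
    for seed in 1..=5 {
        let input = generate(&parser, seed);
        assert_eq!(parser.parse(&input), Some(""));
        println!("{input}");
    }
}
```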
Evaluation: | ||
- Use this to find bugs in the current emQL language frontend. | ||
- Compare with a non-grammar-aware fuzzer | |