Commit 503b87a: Added other possible projects
OliverKillane committed Jul 1, 2024
1 parent ade54f7 commit 503b87a
# <img src="./../crates/emdb/docs/logo.drawio.svg" alt="emDB" style="vertical-align: middle;" title="emdb logo" width="100"/> Papers

All academic work, related documents and experiments using <img src="./../crates/emdb/docs/logo.drawio.svg" alt="emDB" style="vertical-align: middle;" title="emdb logo" width="50"/>.

# Proposed Projects
## Faster Parallel Iterators
Implement a parallel operator backend for `minister`.
- Efficient splitting of data into chunks for parallel computation.
- A compute-bound problem.
- Minimising overhead from work-stealing.
- Should be usable outside minister as a standalone competitor to rayon's
  parallel iterators.
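The chunking approach above can be sketched with scoped threads. This is a minimal illustration (the function name and signature are hypothetical, not the minister API): the input is split into one chunk per worker and each chunk is mapped on its own thread, trading work-stealing flexibility for lower overhead on uniformly compute-bound workloads.

```rust
use std::thread;

/// Sketch of chunked parallel mapping: one chunk per worker, no
/// work-stealing, results returned in input order.
fn par_map_chunked<T, U, F>(data: Vec<T>, workers: usize, f: F) -> Vec<U>
where
    T: Send + Sync,
    U: Send,
    F: Fn(&T) -> U + Sync,
{
    let workers = workers.max(1);
    // Ceiling division so every element lands in some chunk.
    let chunk_size = ((data.len() + workers - 1) / workers).max(1);
    let f = &f;
    let mut results = Vec::with_capacity(data.len());
    thread::scope(|s| {
        // Spawn one scoped thread per chunk; chunks borrow from `data`.
        let handles: Vec<_> = data
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().map(f).collect::<Vec<U>>()))
            .collect();
        // Join in spawn order, preserving the input ordering.
        for handle in handles {
            results.extend(handle.join().unwrap());
        }
    });
    results
}
```

A real implementation would need to handle uneven per-element costs, where static chunking loses to rayon's work-stealing; that trade-off is exactly what the micro benchmarks below should expose.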

Evaluation is as follows:
- Micro benchmarks against a baseline: the current operator implementations.
- Macro benchmarks on the emDB benchmarks, comparing operator backends.

## Incremental View Maintenance
Implement an incremental view maintenance backend for emDB.
Levels of difficulty:
1. Applied to only select queries, with simple inserts supported.
2. Working with mutations.
3. Working for all operators including mixed mutations.

Optionally a hybrid IVM-Serialized backend could be considered.

Contexts are very difficult, as there is a need to determine which
expressions depend on values in the context that can change (e.g. query
parameters).
- The implementation can simply error out for complex cases with an appropriate message.
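Difficulty level 1 can be sketched as follows (illustrative names, not emDB's generated code): instead of re-running a select query over the table, the materialised view is updated from each insert delta in O(1).

```rust
/// A materialised select view maintained incrementally from inserts.
struct SelectView<T, P: Fn(&T) -> bool> {
    predicate: P,
    rows: Vec<T>,
}

impl<T: Clone, P: Fn(&T) -> bool> SelectView<T, P> {
    fn new(predicate: P) -> Self {
        Self { predicate, rows: Vec::new() }
    }

    /// Propagate an insert delta into the view: O(1) per row,
    /// rather than O(table) to recompute the whole select.
    fn on_insert(&mut self, row: &T) {
        if (self.predicate)(row) {
            self.rows.push(row.clone());
        }
    }
}
```

The harder levels are visible even in this sketch: supporting mutations and deletes requires tracking which view rows came from which table rows, which is why they are separated out above.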

Evaluation:
- Micro benchmarks to demonstrate particular advantages for simple select queries
- Macro benchmarks for IVM backend

## No-std
Embedded databases for embedded programming.
- Implement a backend (activated with a `no-std` feature for emDB) that generates
  no-std compatible code.
- Constraining the memory used is important; ideally generate a `const fn` to
  estimate it.
- The backend can raise errors for unsupported features (in particular, tables that
  do not have a `limit` constraint).
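The `const fn` memory estimate can be sketched as below (hypothetical names, not emDB's generated code): with a `limit` constraint, a table can be a fixed-capacity array, so its memory use is computable, and checkable, at compile time without an allocator.

```rust
/// Compile-time estimate of a limit-constrained table's memory use.
const fn table_bytes(row_size: usize, limit: usize) -> usize {
    row_size * limit
}

/// A fixed-capacity, allocation-free table suitable for no-std targets.
#[allow(dead_code)]
struct FixedTable<R, const LIMIT: usize> {
    rows: [Option<R>; LIMIT],
    len: usize,
}

// The estimate can be enforced against a memory budget at compile time.
const BUDGET: usize = 64 * 1024;
const _: () = assert!(table_bytes(core::mem::size_of::<u64>(), 1024) <= BUDGET);
```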

## Persistence
Generate a fingerprint for each schema, and modify pulpit to use a custom allocator
backed by a memory-mapped file.
- A WAL for commit?
- Load at start, save on commit.
- An optional panic handler?
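The WAL-for-commit idea can be sketched as below (a minimal illustration, not pulpit's actual design): records are appended and synced before a commit is considered durable, and the log is replayed on startup to recover state.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, BufRead, BufReader, Write};
use std::path::Path;

/// Append one record to the log and sync: the flush-on-commit
/// durability point whose cost the benchmarks below would measure.
fn wal_append(path: &Path, record: &str) -> io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    writeln!(file, "{record}")?;
    file.sync_all()?;
    Ok(())
}

/// Replay the log on startup to recover committed records in order.
fn wal_replay(path: &Path) -> io::Result<Vec<String>> {
    let file = File::open(path)?;
    BufReader::new(file).lines().collect()
}
```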

Evaluation:
- Microbenchmark a basic key-value schema against LMDB (integer keys versus
  row references in emDB).
- Macrobenchmark (implement the traits generated by the `Interface` backend to run
  the benchmarks), evaluating the effect on performance with flush-on-commit on & off.
- Compare the cost of persisted versus `Serialized` backends.

## Schema Compilation for Aggregation
Modify emDB to support n-way joins (tricky due to the lack of variadic generics).
- Need to alter the interface for `minister` and code generation.
- Needs to beat DuckDB on the [H2O.ai benchmark](https://h2oai.github.io/db-benchmark/).
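One way to work around the lack of variadic generics is to express an n-way join as nested 2-way joins over nested tuples, which code generation can then flatten. A minimal sketch (illustrative functions, not the minister interface):

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// A basic 2-way hash equi-join on the key `K`.
fn hash_join<K, A, B>(left: &[(K, A)], right: &[(K, B)]) -> Vec<(K, (A, B))>
where
    K: Eq + Hash + Clone,
    A: Clone,
    B: Clone,
{
    // Build an index over the left side.
    let mut index: HashMap<&K, Vec<&A>> = HashMap::new();
    for (k, a) in left {
        index.entry(k).or_default().push(a);
    }
    // Probe with the right side.
    let mut out = Vec::new();
    for (k, b) in right {
        if let Some(matches) = index.get(k) {
            for a in matches {
                out.push((k.clone(), ((*a).clone(), b.clone())));
            }
        }
    }
    out
}

/// Without variadic generics, a 3-way join is two nested 2-way joins,
/// producing nested tuples `(A, B)` then `((A, B), C)`.
fn join3<K, A, B, C>(a: &[(K, A)], b: &[(K, B)], c: &[(K, C)]) -> Vec<(K, ((A, B), C))>
where
    K: Eq + Hash + Clone,
    A: Clone,
    B: Clone,
    C: Clone,
{
    hash_join(&hash_join(a, b), c)
}
```

The cost of this encoding is the intermediate materialisation between the nested joins, which is one reason beating DuckDB on the benchmark is non-trivial.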

## emDB Linter & Formatter
Create an emDB formatting tool (e.g. from logical plan -> AST -> emQL)
- Similar to [leptosfmt](https://github.com/bram209/leptosfmt)

Evaluation:
- Demonstrate it is performant enough for in-IDE interaction (live demo + `cargo --timings`).
- Demonstrate the applicability of the libraries built here to other 'fmt by DSL' tools & custom lints.
- Improvements to
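The logical plan -> AST -> emQL pipeline can be sketched with a toy pretty-printer (a hypothetical AST, not emQL's real one): once the plan is lowered to an AST, formatting is a recursive traversal emitting consistent indentation.

```rust
/// A toy expression AST standing in for the emQL AST.
enum Expr {
    Ident(String),
    Call { op: String, args: Vec<Expr> },
}

/// Pretty-print an expression with two-space indentation per level.
fn fmt_expr(e: &Expr, indent: usize) -> String {
    let pad = "  ".repeat(indent);
    match e {
        Expr::Ident(name) => format!("{pad}{name}"),
        Expr::Call { op, args } => {
            // Format each argument one level deeper, one per line.
            let body: Vec<String> =
                args.iter().map(|a| fmt_expr(a, indent + 1)).collect();
            format!("{pad}{op}(\n{}\n{pad})", body.join(",\n"))
        }
    }
}
```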

## Combi-based Data Ingestion
Implement an efficient string- or bytes-based set of combis (similar to the current
token combis).

Then use this to create a fast parser for JSON, [BSON](https://bsonspec.org/spec.html), CSV or some
other format commonly used for storing data to be ingested by an embedded database.

The key innovations expected:
- Split combinators into a basic tokenize stage (that converts to a very, very simple token tree: convert bracketing into a tree).
- Parse the token tree in parallel (in `(<1>)<2>`, both `1` and `2` can be parsed and converted to data structures in parallel).
- Do any semantic checking needed in the combi, rather than as a separate traversal.
- Stream results out of the parser (e.g. into a table) before parsing is finished.
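The proposed tokenize stage can be sketched as a bracket-to-tree pass (illustrative types, not the combi library's): after this pass, sibling subtrees are independent and can be handed to separate threads for parsing.

```rust
/// A very simple token tree: atoms, plus groups from bracketing.
#[derive(Debug, PartialEq)]
enum Tok {
    Atom(char),
    Group(Vec<Tok>),
}

/// Convert bracketing into a tree in one linear pass using a stack;
/// returns `None` on unbalanced brackets.
fn tokenize(src: &str) -> Option<Vec<Tok>> {
    let mut stack: Vec<Vec<Tok>> = vec![Vec::new()];
    for c in src.chars() {
        match c {
            '(' => stack.push(Vec::new()),
            ')' => {
                let group = stack.pop()?;
                stack.last_mut()?.push(Tok::Group(group));
            }
            other => stack.last_mut()?.push(Tok::Atom(other)),
        }
    }
    if stack.len() == 1 { stack.pop() } else { None }
}
```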

*Unnecessary extra: could potentially either add an `ingest` operator (taking a file path & format, and adding to a table), or a special constraint that allows some tables to be filled on `Datastore` construction (from a file path; easier to deal with concurrency & the choice of data structure).*

Evaluation:
- Microbenchmarks against other Rust BSON/JSON parsers (e.g. MongoDB's bson parser).
- Macrobenchmark emDB using a combi-based JSON parse (with some structure to be transformed)
  into an emDB database, versus DuckDB reading from a file. This may well be slower; consider other
  possible uses, and explain why.

## Grammar Aware Fuzzing
Implement easy-to-use fuzzing for parsers expressed using combi.
- Use the combi structure to generate likely inputs (some behaviour we cannot tell from the structure of the combis alone, e.g. code that panics inside a `mapsuc`, or a pipe to a custom combi), then test the parser.
- Can modify the `Combi` trait for this (e.g. add a `fuzz_input` method to generate inputs).
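Grammar-aware input generation can be sketched as below (hypothetical types, not the real `Combi` trait): walking the grammar structure yields inputs that are likely to reach deep into the parser, unlike random bytes.

```rust
/// A toy grammar standing in for the structure recoverable from combis.
enum Grammar {
    Lit(&'static str),
    Seq(Vec<Grammar>),
    Choice(Vec<Grammar>),
}

/// Generate one input, using `pick` to choose among alternatives
/// (a real fuzzer would drive `pick` from its mutation engine).
/// Assumes every `Choice` has at least one alternative.
fn generate(g: &Grammar, pick: &mut impl FnMut(usize) -> usize) -> String {
    match g {
        Grammar::Lit(s) => (*s).to_string(),
        Grammar::Seq(parts) => parts.iter().map(|p| generate(p, pick)).collect(),
        Grammar::Choice(alts) => {
            let i = pick(alts.len()) % alts.len();
            generate(&alts[i], pick)
        }
    }
}
```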

Evaluation:
- Use this to find bugs in the current emQL language frontend.
- Compare with a non-grammar-aware fuzzer.
