[RFC] Issues and limitations in DDlog-1. #1

ryzhyk · 2021-09-10T06:46:42Z

This is a laundry list of issues and limitations in the first version of DDlog. We cannot realistically address all of them, but as we are starting a new iteration of the project, this is a good time to brainstorm our options and incorporate solutions to some of the problems in the design of DDlog-2.

[The plan is to write a paragraph for each bullet point below and use separate RFCs for detailed discussions]

Language ergonomics

Confusing syntax and semantics of group_by (addressed by the design proposed in [RFC] Embedding Datalog in a functional language #2)
- group_by currently combines projection and grouping operations. This is the only time when an explicit projection is required in DDlog, as in all other situations the compiler automatically keeps the subset of variables used in the remainder of the rule and projects away everything else. Projection is easy to get wrong. Put differently, the programmer does not normally need to explicitly define the output relation produced by each ddlog operator. The compiler maintains that relation automatically as a tuple of all variables defined in the prefix of the rule and used anywhere in its suffix. In case of group_by the input relation must be constructed explicitly in order to achieve the desired behavior.
- Variables that are not part of the group key become unavailable after group_by
It is not possible to mix imperative and relational code (addressed by the design proposed in [RFC] Embedding Datalog in a functional language #2). The language currently has two subsets with very different syntax. In essence, the programmer needs to learn two languages that only share the type system. While imperative code can be embedded in a rule, relational operations cannot be used inside functions. This often complicates refactoring, e.g., when code written as a function must be broken up if it needs to operate over multiple relations in the new implementation. As a separate annoying quirk, variables in imperative code are mutable, but variables in rules aren't. Can we unify the two subsets of the language.
Relation transformers (addressed by the design proposed in [RFC] Embedding Datalog in a functional language #2). The ability to write code parameterized by input/output relations would enable new types of modularity and code reuse. This is related to the previous issue, as in a language that unifies functions and rules such transformers could be just functions that take relations as arguments.
DDlog lacks interface inheritance (or traits) (see [RFC] Traits #5). Without this, our support for generics remains very limited. Interfaces will become more important if we want to support first-class relations and relation transformers, as a transformer will typically need to know something about the type of the input relation. Of course, this is not a minor feature and will bring a lot of complexity along with it.
Cleaner syntax&semantics for streaming extensions. We currently have -1 and differentiation operators, and we support joins and group_by over streams, which behave differently than regular join and group_by. These features are not easy to use to build stream processing applications. We may need additional low-level features (e.g., builtin support for sliding windows), as well as syntactic sugar to build high-level operators on top.
Support for incremental group-by. Groups are not modified incrementally, they are atomic fat objects. This is a prerequisite for incremental aggregation.
Support for Incremental aggregates. Some aggregates can be computed incrementally. We might need separate syntax for incremental aggregation (unified syntax would of course be nice, but I'm not sure it's possible).
More powerful module system: re-exports, selective exports and imports (addresses in Added Syntax RFC #4 and [RFC] DDlog Packages and Rust integration #10).
Add support for disjunction in rules.
Simplify the development of bindings for Rust libraries; simplify writing of extern functions (see [RFC] DDlog Packages and Rust integration #10):
- Rust code lives in isolated files, which complicates the use of Rust tooling designed for code in crates (see [RFC] DDlog Packages and Rust integration #10).
- Exporting Rust types to DDlog requires implementing numerous traits (FromRecord, Serialize, flatbuf-related stuff, etc).
- Auto-generated crate tree makes it tricky to import Rust code outside of the local crate (see [RFC] DDlog Packages and Rust integration #10 ).
Automate interning operations. Calling intern() and ival() is burdensome. Consider injecting these calls automatically. The challenge is that there can exist multiple functionally correct ways to inject these calls with different performance impact.
As a limited version of this, we can only automate interning of strings, so the programmer can write regular quoted string literals ("foo") with the compiler converting them to interned strings, e.g., i"foo" where necessary.
Correct use of the Ref<> type is not obvious for most users. Ref<> is an important performance optimization that avoids copying large values between tables as well as in procedural code (e.g., it can be expensive to return a container by value from a large function). However, the need for Ref and where to best apply it is not at all obvious for most users. One alternative is to get the compiler to wrap values in Ref<> automatically using heuristics, thus completely hiding this type of smart pointer from the user.
.dl and .dat files use different syntax. There is a school of thought that they should be unified, especially as we are planning to add the interpreted mode, so the comand language could just become part of the interactive interpeter CLI.
LINQ-like syntax (addressed by the design proposed in [RFC] Embedding Datalog in a functional language #2). There is no reason we could not have an alternative syntax based on functions applied to collections.
Here is an example:
```
relation R, S, T
T :- S.map(|e| { (e.x + 2, e.y) })
      .filter(|e| { e.x > 5 })
      .join(R, |e| { e.x }, |r| { r.z }, |(e,r)| { (e.x, e.y, r.h) }
      .group_by(|t| { t.x + t.h })
      .map(|g| { g.map(|e| { e.x }).sum() }). 
```
One important benefit of LINQ is that transformers become just functions that happen to take relations. They aren't special in any way.
Souffle components have some nice ergonomic features for reusing computation, including generics. We should look at the more closely: https://souffle-lang.github.io/components
Variable introduction: it is not always clear syntactically which expressions introduce and bind a new variable and which ones use the value of a bound variable. Perhaps we could have a binding indicator, e.g., #v as a shortcut for var v which can be used anywhere, including in patterns.

Performance

DDlog generates inefficient imperative code mostly due to excessive cloning (e.g., return-by-value requires cloning). See [RFC] Expression compiler #7.
Functional data types should (probably) be used for collections. There are a couple of reasons for this. First, values in relations are immutable, so a small change requires cloning and modifying a potentially large collection. Second, the imperative fragment has pass-by-value semantics, which may also require excessive cloning even with an optimized compiler. Functional data types allow cloning collections in O(1). We already have a functional HashSet. Perhaps, all DDlog collection types should be backed by functional data structures by default. See [RFC] Expression compiler #7.
Delta queries for long join sequences. DD provides the primitives to implement delta queries and worst-case optimal joins to avoid maintaining intermediate arrangements in long sequences of joins. We had support for that in a branch that is now bit-rotted. I think this is a potentially valuable feature and we should consider supporting it.
Reliance on DD. This is both a positive and a negative. DD does a lot of things for us, but it is not easy to understand and/or debug.
Bottom-up evaluation. This is a problem for programs that instantiate potentially very large relations
Not many optimizations. The compiler could probably do more to optimize the computation (e.g. inter-rule analysis, imperative code dataflow analysis, etc.)
Multi-core scalability is weak. Could this be improved for large deltas?
Implementation is unsafe: due to both weight overflow and timestamps overflow.

Dynamic DDlog

Calling Rust functions from interpreted code. We need to compile Rust libraries as standalone .so with an interpreter-friendly interface.

Compilation speed

Serde blows up executable size and compilation speed. Consider either implementing our own serialization infrastructure (probably not a good idea) or finding a way to use dynamic dispatch with serde in a way that will improve compilation speed, possibly at the cost of a small performance hit.

New features

Persistence

Tooling

Language server
Debugger
Flatbufs are a pain in the back. Can we drop them?

Documentation

The lack of a precise specification of the semantics of the language.
The documentation structure is suboptimal. The tutorial is essentially a language reference - every little feature is documented there. There should be separate documents for teaching and specification.

The text was updated successfully, but these errors were encountered:

blp · 2021-09-14T16:10:22Z

I'm tempted to say that the FLWOR syntax is a failure. I don't think that we should have two different syntaxes.

mihaibudiu · 2021-09-14T16:11:22Z

Perhaps three is the magic number?

ryzhyk · 2021-09-14T16:30:41Z

I'm tempted to say that the FLWOR syntax is a failure. I don't think that we should have two different syntaxes.

I think our FLWOR-based syntax can be greatly improved. My main issue with it though is that I don't know how to unify it nicely with the other two. In any case, I agree that having multiple syntaxes is not great and is in itself a symptom of a problem. So the goal is to have one syntax that will combine the strengths of all alternatives.

blp · 2021-09-14T18:29:56Z

So the goal is to have one syntax that will combine the strengths of all alternatives.

I like the goal, but I wouldn't want to end up with one syntax that makes no one happy. Datalog syntax is good (with the exception of aggregation, which is suboptimal), it just takes some getting used to.

Kixiron · 2021-09-16T18:22:56Z

I talked to some ddlog users and they listed a few things

They want ddlog to keep being expressive, they should be able to focus on writing pure logic instead of boilerplate
They want it to be well optimized, the user shouldn't have to think about speed, just about writing pure logic
Some things relating to compilers utilizing ddlog, I think writing a toy compiler of some sort would be a good example (and benchmark) for people
Better interface, the current rust api is very painful
Keep the ability to drop down into hand written DDflow code
Better learning materials, once we have something we should invest in making a comprehensive, hands-on set of tutorials

mihaibudiu · 2021-09-20T18:28:56Z

I added a few more limitations by editing the issue. I am not sure that this causes notifications.

mihaibudiu · 2021-09-20T18:34:04Z

Editing the issues is a very inefficient way to discuss this topic. The editor for issues is also really primitive; some of my bullets were really in the wrong sections. This kind of discussion would sit much better as a markup document.

ryzhyk · 2021-09-20T22:16:50Z

I added a few more limitations by editing the issue. I am not sure that this causes notifications.

Thanks. I moved documentation-related items to a separate section.

ryzhyk added the rfc Request for Comments label Sep 10, 2021

ryzhyk mentioned this issue Sep 16, 2021

[RFC] Embedding Datalog in a functional language #2

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Issues and limitations in DDlog-1. #1

[RFC] Issues and limitations in DDlog-1. #1

ryzhyk commented Sep 10, 2021 •

edited

Loading

blp commented Sep 14, 2021

mihaibudiu commented Sep 14, 2021

ryzhyk commented Sep 14, 2021

blp commented Sep 14, 2021

Kixiron commented Sep 16, 2021

mihaibudiu commented Sep 20, 2021

mihaibudiu commented Sep 20, 2021

ryzhyk commented Sep 20, 2021

[RFC] Issues and limitations in DDlog-1. #1

[RFC] Issues and limitations in DDlog-1. #1

Comments

ryzhyk commented Sep 10, 2021 • edited Loading

Language ergonomics

Performance

Dynamic DDlog

Compilation speed

New features

Tooling

Documentation

blp commented Sep 14, 2021

mihaibudiu commented Sep 14, 2021

ryzhyk commented Sep 14, 2021

blp commented Sep 14, 2021

Kixiron commented Sep 16, 2021

mihaibudiu commented Sep 20, 2021

mihaibudiu commented Sep 20, 2021

ryzhyk commented Sep 20, 2021

ryzhyk commented Sep 10, 2021 •

edited

Loading