-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Issues and limitations in DDlog-1. #1
Comments
I'm tempted to say that the FLWOR syntax is a failure. I don't think that we should have two different syntaxes. |
Perhaps three is the magic number? |
I think our FLWOR-based syntax can be greatly improved. My main issue with it though is that I don't know how to unify it nicely with the other two. In any case, I agree that having multiple syntaxes is not great and is in itself a symptom of a problem. So the goal is to have one syntax that will combine the strengths of all alternatives. |
I like the goal, but I wouldn't want to end up with one syntax that makes no one happy. Datalog syntax is good (with the exception of aggregation, which is suboptimal), it just takes some getting used to. |
I talked to some ddlog users and they listed a few things
|
I added a few more limitations by editing the issue. I am not sure that this causes notifications. |
Editing the issues is a very inefficient way to discuss this topic. The editor for issues is also really primitive; some of my bullets were really in the wrong sections. This kind of discussion would sit much better as a markup document. |
Thanks. I moved documentation-related items to a separate section. |
This is a laundry list of issues and limitations in the first version of DDlog. We cannot realistically address all of them, but as we are starting a new iteration of the project, this is a good time to brainstorm our options and incorporate solutions to some of the problems in the design of DDlog-2.
[The plan is to write a paragraph for each bullet point below and use separate RFCs for detailed discussions]
Language ergonomics
Confusing syntax and semantics of
group_by
(addressed by the design proposed in [RFC] Embedding Datalog in a functional language #2)group_by
currently combines projection and grouping operations. This is the only time when an explicit projection is required in DDlog, as in all other situations the compiler automatically keeps the subset of variables used in the remainder of the rule and projects away everything else. Projection is easy to get wrong. Put differently, the programmer does not normally need to explicitly define the output relation produced by each ddlog operator. The compiler maintains that relation automatically as a tuple of all variables defined in the prefix of the rule and used anywhere in its suffix. In case ofgroup_by
the input relation must be constructed explicitly in order to achieve the desired behavior.group_by
It is not possible to mix imperative and relational code (addressed by the design proposed in [RFC] Embedding Datalog in a functional language #2). The language currently has two subsets with very different syntax. In essence, the programmer needs to learn two languages that only share the type system. While imperative code can be embedded in a rule, relational operations cannot be used inside functions. This often complicates refactoring, e.g., when code written as a function must be broken up if it needs to operate over multiple relations in the new implementation. As a separate annoying quirk, variables in imperative code are mutable, but variables in rules aren't. Can we unify the two subsets of the language.
Relation transformers (addressed by the design proposed in [RFC] Embedding Datalog in a functional language #2). The ability to write code parameterized by input/output relations would enable new types of modularity and code reuse. This is related to the previous issue, as in a language that unifies functions and rules such transformers could be just functions that take relations as arguments.
DDlog lacks interface inheritance (or traits) (see [RFC] Traits #5). Without this, our support for generics remains very limited. Interfaces will become more important if we want to support first-class relations and relation transformers, as a transformer will typically need to know something about the type of the input relation. Of course, this is not a minor feature and will bring a lot of complexity along with it.
Cleaner syntax&semantics for streaming extensions. We currently have
-1
and differentiation operators, and we support joins andgroup_by
over streams, which behave differently than regular join andgroup_by
. These features are not easy to use to build stream processing applications. We may need additional low-level features (e.g., builtin support for sliding windows), as well as syntactic sugar to build high-level operators on top.Support for incremental group-by. Groups are not modified incrementally, they are atomic fat objects. This is a prerequisite for incremental aggregation.
Support for Incremental aggregates. Some aggregates can be computed incrementally. We might need separate syntax for incremental aggregation (unified syntax would of course be nice, but I'm not sure it's possible).
More powerful module system: re-exports, selective exports and imports (addresses in Added Syntax RFC #4 and [RFC] DDlog Packages and Rust integration #10).
Add support for disjunction in rules.
Simplify the development of bindings for Rust libraries; simplify writing of
extern
functions (see [RFC] DDlog Packages and Rust integration #10):FromRecord
,Serialize
, flatbuf-related stuff, etc).Automate interning operations. Calling
intern()
andival()
is burdensome. Consider injecting these calls automatically. The challenge is that there can exist multiple functionally correct ways to inject these calls with different performance impact.As a limited version of this, we can only automate interning of strings, so the programmer can write regular quoted string literals (
"foo"
) with the compiler converting them to interned strings, e.g.,i"foo"
where necessary.Correct use of the
Ref<>
type is not obvious for most users.Ref<>
is an important performance optimization that avoids copying large values between tables as well as in procedural code (e.g., it can be expensive to return a container by value from a large function). However, the need forRef
and where to best apply it is not at all obvious for most users. One alternative is to get the compiler to wrap values inRef<>
automatically using heuristics, thus completely hiding this type of smart pointer from the user..dl
and.dat
files use different syntax. There is a school of thought that they should be unified, especially as we are planning to add the interpreted mode, so the comand language could just become part of the interactive interpeter CLI.LINQ-like syntax (addressed by the design proposed in [RFC] Embedding Datalog in a functional language #2). There is no reason we could not have an alternative syntax based on functions applied to collections.
Here is an example:
One important benefit of LINQ is that transformers become just functions that happen to take relations. They aren't special in any way.
Souffle components have some nice ergonomic features for reusing computation, including generics. We should look at the more closely: https://souffle-lang.github.io/components
Variable introduction: it is not always clear syntactically which expressions introduce and bind a new variable and which ones use the value of a bound variable. Perhaps we could have a binding indicator, e.g.,
#v
as a shortcut forvar v
which can be used anywhere, including in patterns.Performance
DDlog generates inefficient imperative code mostly due to excessive cloning (e.g., return-by-value requires cloning). See [RFC] Expression compiler #7.
Functional data types should (probably) be used for collections. There are a couple of reasons for this. First, values in relations are immutable, so a small change requires cloning and modifying a potentially large collection. Second, the imperative fragment has pass-by-value semantics, which may also require excessive cloning even with an optimized compiler. Functional data types allow cloning collections in
O(1)
. We already have a functionalHashSet
. Perhaps, all DDlog collection types should be backed by functional data structures by default. See [RFC] Expression compiler #7.Delta queries for long join sequences. DD provides the primitives to implement delta queries and worst-case optimal joins to avoid maintaining intermediate arrangements in long sequences of joins. We had support for that in a branch that is now bit-rotted. I think this is a potentially valuable feature and we should consider supporting it.
Reliance on DD. This is both a positive and a negative. DD does a lot of things for us, but it is not easy to understand and/or debug.
Bottom-up evaluation. This is a problem for programs that instantiate potentially very large relations
Not many optimizations. The compiler could probably do more to optimize the computation (e.g. inter-rule analysis, imperative code dataflow analysis, etc.)
Multi-core scalability is weak. Could this be improved for large deltas?
Implementation is unsafe: due to both weight overflow and timestamps overflow.
Dynamic DDlog
.so
with an interpreter-friendly interface.Compilation speed
New features
Tooling
Documentation
The lack of a precise specification of the semantics of the language.
The documentation structure is suboptimal. The tutorial is essentially a language reference - every little feature is documented there. There should be separate documents for teaching and specification.
The text was updated successfully, but these errors were encountered: