Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Issues and limitations in DDlog-1. #1

Open
35 tasks
ryzhyk opened this issue Sep 10, 2021 · 8 comments
Open
35 tasks

[RFC] Issues and limitations in DDlog-1. #1

ryzhyk opened this issue Sep 10, 2021 · 8 comments
Labels
rfc Request for Comments

Comments

@ryzhyk
Copy link
Contributor

ryzhyk commented Sep 10, 2021

This is a laundry list of issues and limitations in the first version of DDlog. We cannot realistically address all of them, but as we are starting a new iteration of the project, this is a good time to brainstorm our options and incorporate solutions to some of the problems in the design of DDlog-2.

[The plan is to write a paragraph for each bullet point below and use separate RFCs for detailed discussions]

Language ergonomics

  • Confusing syntax and semantics of group_by (addressed by the design proposed in [RFC] Embedding Datalog in a functional language #2)

    • group_by currently combines projection and grouping operations. This is the only time when an explicit projection is required in DDlog, as in all other situations the compiler automatically keeps the subset of variables used in the remainder of the rule and projects away everything else. Projection is easy to get wrong. Put differently, the programmer does not normally need to explicitly define the output relation produced by each ddlog operator. The compiler maintains that relation automatically as a tuple of all variables defined in the prefix of the rule and used anywhere in its suffix. In case of group_by the input relation must be constructed explicitly in order to achieve the desired behavior.
    • Variables that are not part of the group key become unavailable after group_by
  • It is not possible to mix imperative and relational code (addressed by the design proposed in [RFC] Embedding Datalog in a functional language #2). The language currently has two subsets with very different syntax. In essence, the programmer needs to learn two languages that only share the type system. While imperative code can be embedded in a rule, relational operations cannot be used inside functions. This often complicates refactoring, e.g., when code written as a function must be broken up if it needs to operate over multiple relations in the new implementation. As a separate annoying quirk, variables in imperative code are mutable, but variables in rules aren't. Can we unify the two subsets of the language.

  • Relation transformers (addressed by the design proposed in [RFC] Embedding Datalog in a functional language #2). The ability to write code parameterized by input/output relations would enable new types of modularity and code reuse. This is related to the previous issue, as in a language that unifies functions and rules such transformers could be just functions that take relations as arguments.

  • DDlog lacks interface inheritance (or traits) (see [RFC] Traits #5). Without this, our support for generics remains very limited. Interfaces will become more important if we want to support first-class relations and relation transformers, as a transformer will typically need to know something about the type of the input relation. Of course, this is not a minor feature and will bring a lot of complexity along with it.

  • Cleaner syntax&semantics for streaming extensions. We currently have -1 and differentiation operators, and we support joins and group_by over streams, which behave differently than regular join and group_by. These features are not easy to use to build stream processing applications. We may need additional low-level features (e.g., builtin support for sliding windows), as well as syntactic sugar to build high-level operators on top.

  • Support for incremental group-by. Groups are not modified incrementally, they are atomic fat objects. This is a prerequisite for incremental aggregation.

  • Support for Incremental aggregates. Some aggregates can be computed incrementally. We might need separate syntax for incremental aggregation (unified syntax would of course be nice, but I'm not sure it's possible).

  • More powerful module system: re-exports, selective exports and imports (addresses in Added Syntax RFC #4 and [RFC] DDlog Packages and Rust integration #10).

  • Add support for disjunction in rules.

  • Simplify the development of bindings for Rust libraries; simplify writing of extern functions (see [RFC] DDlog Packages and Rust integration #10):

  • Automate interning operations. Calling intern() and ival() is burdensome. Consider injecting these calls automatically. The challenge is that there can exist multiple functionally correct ways to inject these calls with different performance impact.
    As a limited version of this, we can only automate interning of strings, so the programmer can write regular quoted string literals ("foo") with the compiler converting them to interned strings, e.g., i"foo" where necessary.

  • Correct use of the Ref<> type is not obvious for most users. Ref<> is an important performance optimization that avoids copying large values between tables as well as in procedural code (e.g., it can be expensive to return a container by value from a large function). However, the need for Ref and where to best apply it is not at all obvious for most users. One alternative is to get the compiler to wrap values in Ref<> automatically using heuristics, thus completely hiding this type of smart pointer from the user.

  • .dl and .dat files use different syntax. There is a school of thought that they should be unified, especially as we are planning to add the interpreted mode, so the comand language could just become part of the interactive interpeter CLI.

  • LINQ-like syntax (addressed by the design proposed in [RFC] Embedding Datalog in a functional language #2). There is no reason we could not have an alternative syntax based on functions applied to collections.
    Here is an example:

    relation R, S, T
    T :- S.map(|e| { (e.x + 2, e.y) })
          .filter(|e| { e.x > 5 })
          .join(R, |e| { e.x }, |r| { r.z }, |(e,r)| { (e.x, e.y, r.h) }
          .group_by(|t| { t.x + t.h })
          .map(|g| { g.map(|e| { e.x }).sum() }). 
    

    One important benefit of LINQ is that transformers become just functions that happen to take relations. They aren't special in any way.

  • Souffle components have some nice ergonomic features for reusing computation, including generics. We should look at the more closely: https://souffle-lang.github.io/components

  • Variable introduction: it is not always clear syntactically which expressions introduce and bind a new variable and which ones use the value of a bound variable. Perhaps we could have a binding indicator, e.g., #v as a shortcut for var v which can be used anywhere, including in patterns.

Performance

  • DDlog generates inefficient imperative code mostly due to excessive cloning (e.g., return-by-value requires cloning). See [RFC] Expression compiler #7.

  • Functional data types should (probably) be used for collections. There are a couple of reasons for this. First, values in relations are immutable, so a small change requires cloning and modifying a potentially large collection. Second, the imperative fragment has pass-by-value semantics, which may also require excessive cloning even with an optimized compiler. Functional data types allow cloning collections in O(1). We already have a functional HashSet. Perhaps, all DDlog collection types should be backed by functional data structures by default. See [RFC] Expression compiler #7.

  • Delta queries for long join sequences. DD provides the primitives to implement delta queries and worst-case optimal joins to avoid maintaining intermediate arrangements in long sequences of joins. We had support for that in a branch that is now bit-rotted. I think this is a potentially valuable feature and we should consider supporting it.

  • Reliance on DD. This is both a positive and a negative. DD does a lot of things for us, but it is not easy to understand and/or debug.

  • Bottom-up evaluation. This is a problem for programs that instantiate potentially very large relations

  • Not many optimizations. The compiler could probably do more to optimize the computation (e.g. inter-rule analysis, imperative code dataflow analysis, etc.)

  • Multi-core scalability is weak. Could this be improved for large deltas?

  • Implementation is unsafe: due to both weight overflow and timestamps overflow.

Dynamic DDlog

  • Calling Rust functions from interpreted code. We need to compile Rust libraries as standalone .so with an interpreter-friendly interface.

Compilation speed

  • Serde blows up executable size and compilation speed. Consider either implementing our own serialization infrastructure (probably not a good idea) or finding a way to use dynamic dispatch with serde in a way that will improve compilation speed, possibly at the cost of a small performance hit.

New features

  • Persistence

Tooling

  • Language server
  • Debugger
  • Flatbufs are a pain in the back. Can we drop them?

Documentation

  • The lack of a precise specification of the semantics of the language.

  • The documentation structure is suboptimal. The tutorial is essentially a language reference - every little feature is documented there. There should be separate documents for teaching and specification.

@ryzhyk ryzhyk added the rfc Request for Comments label Sep 10, 2021
@blp
Copy link

blp commented Sep 14, 2021

I'm tempted to say that the FLWOR syntax is a failure. I don't think that we should have two different syntaxes.

@mihaibudiu
Copy link

Perhaps three is the magic number?

@ryzhyk
Copy link
Contributor Author

ryzhyk commented Sep 14, 2021

I'm tempted to say that the FLWOR syntax is a failure. I don't think that we should have two different syntaxes.

I think our FLWOR-based syntax can be greatly improved. My main issue with it though is that I don't know how to unify it nicely with the other two. In any case, I agree that having multiple syntaxes is not great and is in itself a symptom of a problem. So the goal is to have one syntax that will combine the strengths of all alternatives.

@blp
Copy link

blp commented Sep 14, 2021

So the goal is to have one syntax that will combine the strengths of all alternatives.

I like the goal, but I wouldn't want to end up with one syntax that makes no one happy. Datalog syntax is good (with the exception of aggregation, which is suboptimal), it just takes some getting used to.

@Kixiron
Copy link
Contributor

Kixiron commented Sep 16, 2021

I talked to some ddlog users and they listed a few things

  • They want ddlog to keep being expressive, they should be able to focus on writing pure logic instead of boilerplate
  • They want it to be well optimized, the user shouldn't have to think about speed, just about writing pure logic
  • Some things relating to compilers utilizing ddlog, I think writing a toy compiler of some sort would be a good example (and benchmark) for people
  • Better interface, the current rust api is very painful
  • Keep the ability to drop down into hand written DDflow code
  • Better learning materials, once we have something we should invest in making a comprehensive, hands-on set of tutorials

@mihaibudiu
Copy link

I added a few more limitations by editing the issue. I am not sure that this causes notifications.

@mihaibudiu
Copy link

Editing the issues is a very inefficient way to discuss this topic. The editor for issues is also really primitive; some of my bullets were really in the wrong sections. This kind of discussion would sit much better as a markup document.

@ryzhyk
Copy link
Contributor Author

ryzhyk commented Sep 20, 2021

I added a few more limitations by editing the issue. I am not sure that this causes notifications.

Thanks. I moved documentation-related items to a separate section.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rfc Request for Comments
Projects
None yet
Development

No branches or pull requests

4 participants