Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] RVSDG as an intermedeate IR #8

Open
Kixiron opened this issue Sep 23, 2021 · 11 comments
Open

[RFC] RVSDG as an intermedeate IR #8

Kixiron opened this issue Sep 23, 2021 · 11 comments
Labels
rfc Request for Comments

Comments

@Kixiron
Copy link
Contributor

Kixiron commented Sep 23, 2021

Background:

As an intermediate representation we should use the RVSDG to represent both the timely-level dataflow graph and the inter-operator expressions (map functions, for instance). Using this unified representation to encompass both aspects of ddlog programs brings a number of benefits:

  • Intrinsic optimization, if we know both the contents of the dataflow graph and the contents of each function within it we can perform both easy and complex dataflow (in the compiler sense, not the paradigm) optimizations such as relation.filter(|_| false) => empty_relation() and .map(|_| 10).filter(|x| x == 10) => .map(|_| 10)
  • Trivial dead code elimination on both a dataflow graph and expression level, this allows us to remove functions that are never called, relations that are never used and close inputs that are never read from
  • Trivial de-duplication of the dataflow graph to allow for minimal re-computation (this extends to both collections and arrangements)
  • Modular compiler internals. Since the RVSDG is very abstract it's very decoupled from both the frontend and the final codegen target, allowing easy swapping of both or even using the RVSDG as a target for entirely third-party languages or libraries

For those unfamiliar, the RVSDG is a way to represent programs as an abstract graph with only four distinct components:

  • Nodes, these are atomic expressions within the program such as 1, add 1, 2 or map
  • Regions, these are nodes that themselves contain sub-nodes. Regions are used for representing both control flow and effect flow, for example the phi node represents an if ... { ... } else { ... } and the theta node represents a loop { ... }
  • Value edges, these are edges between nodes that express data dependence such as an add node having two incoming value edges from a 1 and a 2 node
  • Effect edges, these edges express state dependence, such as two print() calls being sequential relative to each other (this is an important note as the RVSDG is inherently unstructured). Effect edges won't be used very much within the ddlog compiler, but they do have an important, albeit sparse, role to play in it

With these four primitives we can express programs in a very abstract and inherently normalizing way. That is to say, many different source language programs can (and will) look identical once translated into an RVSDG due to it's inherent lack of ordering. Instead of add 1, 2; add 2, 3 and add 2, 3, add 1, 2 being seen as two separate programs, the RVSDG simply sees

add_one_two

This (as mentioned before) also comes into effect when analyzing the dataflow graph as a whole since both the nodes of the dataflow graph and the expressions they contain can be unified, cutting the total program's potential runtime in half:

input relation R(x: usize)

output relation X(x: usize)
X(x) :- R(x), x + 5, x > 10.
X(x) :- R(x), x + 5, x < 4.

Ignoring the questionable nature of this program, the raw translation of this program into a graph would render this

raw_translation

From which we can see that the map operation is trivially reducible, that's duplicated effort we've done and we can remove it

reduced_map

That's a rather trivial example that could definitely be optimized further (say, by turning multiple filters on one relation that are then concatenated into a single filter), but it expresses the intent quite nicely.

Another optimization I mentioned was dead code elimination, with two simple rules we can do all the DCE on dataflow graphs that we need to

  1. Any node that cannot reach an output node is dead
  2. Any node that is not reachable by an input node is dead

no_input no_output
Both of these graphs would be considered dead code

So to wrap things up, I believe that the RVSDG is an optimal representation for compiling and optimizing our dataflow graphs and the expressions within them

@ryzhyk ryzhyk added the rfc Request for Comments label Sep 23, 2021
@mihaibudiu
Copy link

Representing and manipulating expressions is the easy part in any compiler.
The paper you sent is about very low level representations, suitable for compiler back-ends.
We need a representation in the front-end.
How do you represent type declarations, generics, functions, literals?
Are the nodes typed? What are the types of nodes supported?
How is the IR traversal done? Is there a generic infrastructure for that?

@Kixiron
Copy link
Contributor Author

Kixiron commented Sep 23, 2021

This isn't for the frontend, it's for the backend optimization of programs

@ryzhyk
Copy link
Contributor

ryzhyk commented Oct 4, 2021

@Kixiron , are you thinking of using an existing RVSDG crate?

@mihaibudiu
Copy link

I don't think that this is a good representation for use as the main IR.
The main IR should be a tree-like representation that is more abstract.
A dataflow graph representation should be built from the IR when needed, used to analyze and optimize the code by generating a new IR, and then discarded.

@Kixiron
Copy link
Contributor Author

Kixiron commented Oct 4, 2021

Are you thinking of using an existing RVSDG crate?

At present there aren't any, I'm currently making one

I don't think that this is a good representation for use as the main IR.

This is an optimization IR, it's not strictly tied to the dataflow semantics of our target. It brings along many benefits to "normal" optimization as well as our domain specific needs

@ryzhyk
Copy link
Contributor

ryzhyk commented Oct 4, 2021

Is the plan to use equality-saturation-based optimization?

@Kixiron
Copy link
Contributor Author

Kixiron commented Oct 4, 2021

That's a definite possibility, but one I need to investigate further

@ryzhyk
Copy link
Contributor

ryzhyk commented Oct 4, 2021

  • There's little question in my mind that dataflow graphs are the right IR for the relational part of the language. DDlog is just a syntax; the actual semantics of the language is the dataflow graph, so it is the most natural representation.
  • As far as representing expressions goes, there are two types of optimization we mostly care about (these are the optimizations that we cannot simply delegate to the Rust compiler):
    • Optimize the implementation of the copy semantics, e.g., by replacing cloning with sharing by reference or by ownership transfer. This has to primarily be based on dataflow analysis, so intuitively a dataflow graph is a good representation.
    • Co-optimize relational and functional parts of the program, as outlined by @Kixiron above, so unifying the two IRs makes sense

@ryzhyk
Copy link
Contributor

ryzhyk commented Oct 4, 2021

Having said that, I think we need a more detailed design before implementing anything. In particular, what would the rewrite rules and a rule engine for an optimizer based on such a hybrid graph look like?

@ryzhyk
Copy link
Contributor

ryzhyk commented Oct 4, 2021

And also what are the exact node types in the graph? E.g., our surface syntax consists of operators like filter, map, join, etc. Internally, we want additional nodes like arrange, distinct, and consolidate. In addition, we may want to split the program into multiple dataflow graphs. Do these correspond to nested regions?

@Kixiron
Copy link
Contributor Author

Kixiron commented Oct 4, 2021

And also what are the exact node types in the graph?

I'd probably use the lower leveled representation of things so that we can share as much work as possible, ex. only arranging any given stream once instead of re-arranging it multiple times

Do these correspond to nested regions?

Yep, each self-contained subgraph would be contained within its own region, ditto with recursive/iterative dataflow scopes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rfc Request for Comments
Projects
None yet
Development

No branches or pull requests

3 participants