Testing dataflow rendering + transformations #5332

benesch (Maintainer) · 2021-01-15T16:02:47Z

There are two main places in Materialize today where we optimize query plans for performance:

1. The first is what's colloquially known as the "optimizer" in the transform crate. This applies mostly logical transformations to a query plan, but it also applies some bits of physical planning, like choosing a join implementation.
2. The second is the dataflow rendering module, dataflow::render, which makes some further physical optimization decisions in the course of converting that query AST to actual timely/differential closures.

The specific concern I want to raise is that I don't know how to ensure that our various SQL integration tests (specifically, sqllogictest and testdrive) are properly and fully exercising the optimizer and the dataflow rendering code. There are probably a few ways to solve this! Here are a few thoughts, off the top of my head:

- The folks familiar with the optimizer/dataflow layer could help to come up with a checklist of classes of SQL queries that need to be written. One concrete example that came up recently was the difference between SELECT ... FROM VALUES and SELECT ... FROM tbl. The former exercises constant folding, and the latter exercises the actual dataflow code. In theory, we need to write both types of tests if we want to exercise both code paths.
- We could invest more in writing optimizer-specific tests, like in https://github.com/MaterializeInc/materialize/blob/main/src/transform/tests/testdata/join-implementation. I think this is viable today for the optimizer, but I'm not sure how we would go about doing the same for the render module, since the output of dataflow rendering (a stack of closures) is not something that easily serializes to a textual format.
  - The way I'd always envisioned solving the "it's hard to test render/mod.rs" problem was to split out a separate physical plan whose translation to differential closures was straightforward and obvious. Then you could spit that plan out into datadriven test files, like we do elsewhere. But @frankmcsherry was terrified by this idea, and he has more context on this than I do.
  - The other thing that comes to mind is to serialize a dataflow post construction to some JSON representation. Timely/differential have various tools for describing their current state. Perhaps we could hook into those tools and emit a textual dataflow graph, and emit that into datadriven.
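To make the "separate physical plan" idea a bit more concrete, here is a minimal sketch of a plan type that prints itself in a stable textual form, which is the kind of output a datadriven test file could compare against. Everything in it is an assumption made for illustration: the names PhysicalPlan and JoinStrategy, the variants, and the output format are hypothetical and are not taken from the Materialize codebase.

```rust
use std::fmt;

// Hypothetical physical plan. Each variant is meant to correspond to one
// obvious differential operator. None of these names exist in Materialize.
#[derive(Debug, Clone, PartialEq)]
enum PhysicalPlan {
    /// Read a named input collection.
    Get { name: String },
    /// Keep rows matching a predicate (kept as opaque text in this sketch).
    Filter { predicate: String, input: Box<PhysicalPlan> },
    /// Join two inputs using an explicitly chosen strategy.
    Join {
        strategy: JoinStrategy,
        left: Box<PhysicalPlan>,
        right: Box<PhysicalPlan>,
    },
}

// Illustrative join strategies; the physical decision is part of the plan,
// so it shows up in the textual output and can be asserted on.
#[derive(Debug, Clone, Copy, PartialEq)]
enum JoinStrategy {
    Differential,
    DeltaQuery,
}

impl PhysicalPlan {
    // Write one line per operator, indenting children by two spaces.
    fn write_indented(&self, f: &mut fmt::Formatter<'_>, depth: usize) -> fmt::Result {
        let pad = "  ".repeat(depth);
        match self {
            PhysicalPlan::Get { name } => writeln!(f, "{}Get {}", pad, name),
            PhysicalPlan::Filter { predicate, input } => {
                writeln!(f, "{}Filter {}", pad, predicate)?;
                input.write_indented(f, depth + 1)
            }
            PhysicalPlan::Join { strategy, left, right } => {
                writeln!(f, "{}Join strategy={:?}", pad, strategy)?;
                left.write_indented(f, depth + 1)?;
                right.write_indented(f, depth + 1)
            }
        }
    }
}

impl fmt::Display for PhysicalPlan {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        self.write_indented(f, 0)
    }
}
```

With something along these lines, a test would call to_string() on the plan and compare the result with the expected text in the test file. The actual translation from plan to closures stays untested by this approach, so it only helps if that translation really is straightforward.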

wangandi · 2021-01-25T23:12:32Z

> The folks familiar with the optimizer/dataflow layer could help to come up with a checklist of classes of SQL queries that need to be written. One concrete example that came up recently was the difference between SELECT ... FROM VALUES and SELECT ... FROM tbl. The former exercises constant folding, and the latter exercises the actual dataflow code. In theory, we need to write both types of tests if we want to exercise both code paths.

I already started on an attempt at this, which you can find at:
https://github.com/MaterializeInc/materialize/blob/main/doc/developer/sqllogictest.md#test-writing-guidelines
We should add the concrete example above to the checklist.

> We could invest more in writing optimizer-specific tests, like in https://github.com/MaterializeInc/materialize/blob/main/src/transform/tests/testdata/join-implementation. I think this is viable today for the optimizer, but I'm not sure how we would go about doing the same for the render module, since the output of dataflow rendering (a stack of closures) is not something that easily serializes to a textual format.

As someone who often has to debug at the render/mod.rs level, what I would love is a test framework that would

1. Allow me to define
   i. an arbitrary RelationExpr
   ii. collections of (row, time, diff) to bind to each RelationExpr::Get(id: Id::Global(something), typ: <...>) within (1.i)
2. Render the dataflow corresponding to (1.i)
3. Spit out the result of running the dataflow rendered in (2) using the input collections defined in (1.ii)

This test framework could be used to create correctness unit tests for each RelationExpr::SomeEnum -> dataflow transformation, each MFP{RelationExpr::Something} -> dataflow transformation, etc.

It would be incredibly helpful for debugging because it would help narrow down which operator in an often complicated plan is not rendering data in the expected manner.
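As a rough sketch of the shape such a test case could take, the code below defines a toy expression type, binds (row, time, diff) collections to its Get nodes by name, and produces a consolidated result that a test can compare against. It is only an illustration under stated assumptions: ToyExpr, eval, consolidate and run_case are hypothetical names, and eval is a direct interpreter standing in for the "render the dataflow and run it" step, which in a real framework would go through timely/differential instead.

```rust
use std::collections::BTreeMap;

type Row = Vec<i64>;
type Time = u64;
type Diff = i64;
type Update = (Row, Time, Diff);

// Toy stand-in for a relation expression. Not Materialize's RelationExpr.
enum ToyExpr {
    /// Read the input collection bound to this name.
    Get(String),
    /// Keep rows whose value in `column` equals `equals`.
    Filter { input: Box<ToyExpr>, column: usize, equals: i64 },
    /// Flip the sign of every diff.
    Negate(Box<ToyExpr>),
    /// Concatenate the updates of both inputs.
    Union(Box<ToyExpr>, Box<ToyExpr>),
}

// Stand-in for "render and run": interpret the expression directly over
// the bound input collections. Unbound names yield an empty collection.
fn eval(expr: &ToyExpr, inputs: &BTreeMap<String, Vec<Update>>) -> Vec<Update> {
    match expr {
        ToyExpr::Get(name) => inputs.get(name).cloned().unwrap_or_default(),
        ToyExpr::Filter { input, column, equals } => eval(input, inputs)
            .into_iter()
            .filter(|(row, _, _)| row.get(*column) == Some(equals))
            .collect(),
        ToyExpr::Negate(input) => eval(input, inputs)
            .into_iter()
            .map(|(row, time, diff)| (row, time, -diff))
            .collect(),
        ToyExpr::Union(left, right) => {
            let mut out = eval(left, inputs);
            out.extend(eval(right, inputs));
            out
        }
    }
}

// Sum diffs per (row, time), drop entries that cancel to zero, and return
// the rest sorted by (row, time) so that test output is deterministic.
fn consolidate(updates: Vec<Update>) -> Vec<Update> {
    let mut sums: BTreeMap<(Row, Time), Diff> = BTreeMap::new();
    for (row, time, diff) in updates {
        *sums.entry((row, time)).or_insert(0) += diff;
    }
    sums.into_iter()
        .filter(|(_, diff)| *diff != 0)
        .map(|((row, time), diff)| (row, time, diff))
        .collect()
}

// One test case: an expression plus its input bindings in, consolidated
// updates out.
fn run_case(expr: &ToyExpr, inputs: &BTreeMap<String, Vec<Update>>) -> Vec<Update> {
    consolidate(eval(expr, inputs))
}
```

A per-operator unit test would then build a small expression, bind a handful of updates to each input, and assert on the output of run_case. Consolidating before comparing matters because two dataflows can be equivalent while emitting their updates in a different order or in uncancelled form.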

wangandi · 2021-06-30T17:26:06Z

We now have arbitrary IR creation. Design doc here: https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/20210601_build_mirrelationexpr.md
