Testing dataflow rendering + transformations #5332
Replies: 2 comments
-
I already started on an attempt at this, which you can find at:
As someone who often has to debug at the
This test framework could be used to create correctness unit tests for each RelationExpr::SomeEnum -> dataflow transformation, each MFP{RelationExpr::Something} -> dataflow transformation, and so on. It would be incredibly helpful for debugging, because it would narrow down which operator in an often complicated plan is not rendering data as expected.
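As a rough illustration of what a per-operator correctness test could look like, here is a minimal sketch. Everything in it is hypothetical: ToyExpr, render, and run_sorted are stand-ins invented for this example, not Materialize's RelationExpr or its rendering code, and the "dataflow" is just a boxed closure.

```rust
// Hypothetical toy IR; NOT Materialize's RelationExpr.
type Row = Vec<i64>;

#[derive(Debug, Clone)]
enum ToyExpr {
    Constant(Vec<Row>),
    Filter { input: Box<ToyExpr>, column: usize, equals: i64 },
    Project { input: Box<ToyExpr>, columns: Vec<usize> },
}

/// Stand-in for rendering: turn an expression into a closure that yields rows.
fn render(expr: &ToyExpr) -> Box<dyn Fn() -> Vec<Row>> {
    match expr {
        ToyExpr::Constant(rows) => {
            let rows = rows.clone();
            Box::new(move || rows.clone())
        }
        ToyExpr::Filter { input, column, equals } => {
            let input = render(input);
            let (column, equals) = (*column, *equals);
            Box::new(move || {
                input()
                    .into_iter()
                    .filter(|row| row[column] == equals)
                    .collect::<Vec<Row>>()
            })
        }
        ToyExpr::Project { input, columns } => {
            let input = render(input);
            let columns = columns.clone();
            Box::new(move || {
                input()
                    .into_iter()
                    .map(|row| columns.iter().map(|&c| row[c]).collect::<Row>())
                    .collect::<Vec<Row>>()
            })
        }
    }
}

/// Run the rendered closure and sort the output so that comparisons
/// do not depend on row order.
fn run_sorted(expr: &ToyExpr) -> Vec<Row> {
    let mut rows = render(expr)();
    rows.sort();
    rows
}

#[cfg(test)]
mod tests {
    use super::*;

    // One small test per operator keeps failures easy to localize.
    #[test]
    fn filter_keeps_only_matching_rows() {
        let plan = ToyExpr::Filter {
            input: Box::new(ToyExpr::Constant(vec![vec![1, 10], vec![2, 20]])),
            column: 0,
            equals: 2,
        };
        assert_eq!(run_sorted(&plan), vec![vec![2, 20]]);
    }
}
```

The point of the shape is that each operator gets its own small input and expected output, so a wrong result points at one operator rather than at a whole plan.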
-
We now have arbitrary IR creation. Design doc here: https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/20210601_build_mirrelationexpr.md
-
There are two main places in Materialize today where we optimize query plans for performance:
- The transform crate. This applies mostly logical transformations to a query plan, but it also applies some bits of physical planning, like choosing a join implementation.
- dataflow::render, which makes some further physical optimization decisions in the course of converting that query AST to actual timely/differential closures.
The specific concern I want to raise is that I don't know how to ensure that our various SQL integration tests (specifically, sqllogictest and testdrive) are properly and fully exercising the optimizer and the dataflow rendering code. There are probably a few ways to solve this! Here are a few thoughts, off the top of my head:
The folks familiar with the optimizer/dataflow layer could help to come up with a checklist of classes of SQL queries that need to be written. One concrete example that came up recently was the difference between SELECT ... FROM VALUES and SELECT ... FROM tbl. The former exercises constant folding, and the latter exercises the actual dataflow code. In theory, we need to write both types of tests if we want to exercise both code paths.
We could invest more in writing optimizer-specific tests, like in https://github.com/MaterializeInc/materialize/blob/main/src/transform/tests/testdata/join-implementation. I think this is viable today for the optimizer, but I'm not sure how we would go about doing the same for the render module, since the output of dataflow rendering (a stack of closures) is not something that easily serializes to a textual format.
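To make the SELECT ... FROM VALUES versus SELECT ... FROM tbl distinction concrete, here is a toy sketch of why a constant input may never reach the renderer. The Expr type, fold, and reaches_filter_rendering are all hypothetical names made up for this example; the real optimizer is far more involved.

```rust
// Toy IR, hypothetical and much simpler than Materialize's.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// Roughly what SELECT ... FROM (VALUES ...) plans to.
    Constant(Vec<i64>),
    /// Roughly what SELECT ... FROM tbl plans to.
    Get(String),
    /// Keep rows whose value is at least `min`.
    Filter { input: Box<Expr>, min: i64 },
}

/// Constant folding: a Filter over a Constant is evaluated at plan time,
/// so no Filter operator is left for the renderer.
fn fold(expr: Expr) -> Expr {
    match expr {
        Expr::Filter { input, min } => match fold(*input) {
            Expr::Constant(rows) => {
                Expr::Constant(rows.into_iter().filter(|r| *r >= min).collect())
            }
            other => Expr::Filter { input: Box::new(other), min },
        },
        other => other,
    }
}

/// True if a Filter survives optimization at the root of the plan and
/// would therefore exercise the (hypothetical) Filter rendering path.
fn reaches_filter_rendering(expr: &Expr) -> bool {
    match expr {
        Expr::Filter { .. } => true,
        Expr::Constant(_) | Expr::Get(_) => false,
    }
}
```

In this toy model, a filter over constant rows folds away entirely, while the same filter over Get("tbl") survives and would be rendered. A checklist like the one described above would want at least one query of each kind per operator.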
The way I'd always envisioned solving the "it's hard to test render/mod.rs" problem was to split out a separate physical plan whose translation to differential closures was straightforward and obvious. Then you could spit that plan out into datadriven test files, like we do elsewhere. But @frankmcsherry was terrified by this idea, and he has more context on this than I do.
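For a sense of what "spit that plan out into datadriven test files" could mean, here is a minimal sketch. PhysicalPlan and explain are hypothetical names invented for this example, not an existing Materialize type or function; the only claim is that a plain enum like this prints to stable text that a test file can compare against.

```rust
use std::fmt::Write;

// Hypothetical physical plan; variant names are illustrative only.
enum PhysicalPlan {
    Get { name: String },
    Filter { predicate: String, input: Box<PhysicalPlan> },
    Join { inputs: Vec<PhysicalPlan> },
}

/// Render the plan as indented text, one operator per line.
fn explain(plan: &PhysicalPlan) -> String {
    let mut out = String::new();
    write_plan(plan, 0, &mut out);
    out
}

fn write_plan(plan: &PhysicalPlan, depth: usize, out: &mut String) {
    let indent = "  ".repeat(depth);
    match plan {
        PhysicalPlan::Get { name } => {
            // Writing to a String cannot fail, so unwrap is safe here.
            writeln!(out, "{}Get {}", indent, name).unwrap();
        }
        PhysicalPlan::Filter { predicate, input } => {
            writeln!(out, "{}Filter {}", indent, predicate).unwrap();
            write_plan(input, depth + 1, out);
        }
        PhysicalPlan::Join { inputs } => {
            writeln!(out, "{}Join", indent).unwrap();
            for input in inputs {
                write_plan(input, depth + 1, out);
            }
        }
    }
}
```

A test would then build a plan, call explain, and compare the result with the expected text stored next to the query. Whether such an intermediate plan is a good idea for render/mod.rs is exactly the open question above; the sketch only shows that the testing side is simple once the plan exists.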
The other thing that comes to mind is to serialize a dataflow post construction to some JSON representation. Timely/differential have various tools for describing their current state. Perhaps we could hook into those tools and emit a textual dataflow graph, and emit that into datadriven.
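A sketch of the "emit a textual dataflow graph" idea, under heavy assumptions: Operator, Channel, and dump_graph are hypothetical, and the code does not call into timely or differential at all. It assumes that some introspection hook can hand back a list of operators and the channels between them, and shows only the part that turns those into deterministic text.

```rust
// Hypothetical records; a real hook would have to populate these from
// whatever introspection timely/differential expose.
struct Operator {
    id: usize,
    name: String,
}

struct Channel {
    source: usize,
    target: usize,
}

/// Dump the graph as text, sorted so the output does not depend on the
/// order in which operators and channels were reported.
fn dump_graph(mut operators: Vec<Operator>, mut channels: Vec<Channel>) -> String {
    operators.sort_by_key(|op| op.id);
    channels.sort_by_key(|ch| (ch.source, ch.target));

    let mut out = String::new();
    for op in &operators {
        out.push_str(&format!("operator {} {}\n", op.id, op.name));
    }
    for ch in &channels {
        out.push_str(&format!("channel {} -> {}\n", ch.source, ch.target));
    }
    out
}
```

Sorting matters more than the exact format here: if the dump changes between runs for reasons unrelated to the plan, the datadriven files would churn and the tests would lose their value.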