-
Notifications
You must be signed in to change notification settings - Fork 465
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
design doc: MIR typechecking using representation types #27239
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,134 @@ | ||
# MIR Representation Types | ||
|
||
- Associated: | ||
+ https://github.com/MaterializeInc/materialize/issues/4171 | ||
+ https://github.com/MaterializeInc/materialize/pull/27029 | ||
+ https://github.com/MaterializeInc/materialize/pull/27109 | ||
|
||
MIR is typechecked using user-facing SQL types. | ||
SQL types make distinctions between types (e.g., `CHAR(5)`, `VARCHAR(5)`, and `TEXT`) that will not be realized at runtime (all of these are represented using `Datum::String`). | ||
|
||
This design doc proposes _representation types_---a new notion of type for MIR terms. | ||
A representation type maps to a subset of constructors/variants of the `Datum` enum. | ||
|
||
Making this change should help address issues in optimization, particularly index selection (see ["The Problem"](#the-problem)). | ||
It will also simplify and clarify what MIR is doing. | ||
|
||
## The Problem | ||
|
||
[Postgres has several types for character data:](https://www.postgresql.org/docs/current/datatype-character.html) | ||
|
||
- `char(n)`, `bpchar(n)` for fixed-length strings with a fixed-length representation (whitespace padded) | ||
- `varchar(n)` for fixed-length strings with a variable-length representation | ||
- `varchar` for arbitrary-length strings with a variable-length representation | ||
- `bpchar` for artbirary-length strings with a variable-length representation where trailing whitespace is trimmed | ||
- `text` for arbitrary-length strings with a variable-length representation | ||
|
||
(There are also `"char"` and `name`, but they are very much internal to Postgres---let's ignore them for the moment.) | ||
|
||
Materialize implements all five of these character sequence types using `Datum::String`. | ||
Materialize only implements meaningful operations on the `text` type; to operate on other character sequence types, one must convert to `text`. | ||
Conversion to `text` is a no-op at runtime, since we already have the string. | ||
We freely insert implicit coercions to `text` in the early stages of query processing. | ||
For correct error behavior, however, we need to carefully keep track of failures to meet length limits. | ||
|
||
These implicit coercions---often of the form `varchar_to_text(#c)` for some column number `c`---get in the way during join planning. | ||
Even if we have an index on column `c` as a `varchar`, we cannot use that index to plan a join on `varchar_to_text(#c)`: not only are the types different, the terms are different, too. | ||
We know these implicit coercions interfere with joins; they may also interfere with other aspects of optimization. | ||
|
||
## Success Criteria | ||
|
||
1. Plan joins without implicit coercions between equivalent representation types getting in the way. | ||
2. It should be possible for us to elide noop casts at the MIR level. | ||
3. The catalog should reflect SQL-level types. | ||
4. There is clear guidance for customers regarding the different character sequence types. | ||
5. Notices help customers avoid pitfalls with character sequence types. | ||
|
||
## Out of Scope | ||
|
||
We will not rethink other components of MIR types, like nullability or unique keys. | ||
|
||
In general, we may want a "smart index selector", which can appropriately handle mixed-type lookups, i.e., it should let us look up an `i32` in an `i64`-sized index without needing to build a separate `i32`-sized arrangement. Such an approach generalizes `eq-indx` (see ["Alternatives"](#alternatives)). | ||
|
||
## Solution Proposal | ||
|
||
Having MIR be typechecked in terms of representation types demands a few changes, which can be broken into three PRs: | ||
|
||
1. Rename the existing `ScalarType` to `HirScalarType`. (We may want to rename `ColumnType` and `RelationType`, as well.) | ||
2. Have MIR work with `MirScalarType` (see [MVP](#minimal-viable-prototype)). | ||
a. Introduce the `MirScalarType` datatype (see [MVP](#minimal-viable-prototype)). | ||
b. Change the optimizer to work with `MirScalarType` throughout. | ||
c. Change the adapter to record the `HirScalarType` of a query and update its nullability information using the `MirScalarType` at the end of optimization. | ||
3. Elide noop casts, confirming that we plan the problematic joins better. | ||
|
||
PR #2 will have a large diff. | ||
On the plus side, it should be a largely mechanical change: there is a straightforward way to project a `HirScalarType` to a `MirScalarType`, and most uses of types in the optimizer will be parametric. | ||
|
||
## Minimal Viable Prototype | ||
|
||
The representation types will look like the following: | ||
|
||
```rust | ||
pub enum MirScalarType { | ||
Bool, | ||
Int16, | ||
Int32, | ||
Int64, | ||
UInt8, // also includes HirScalarType::PgLegacyChar | ||
UInt16, | ||
UInt32, // also includes HirScalarType::{Oid,RegClass,RegProc,RegType} | ||
UInt64, | ||
Float32, | ||
Float64, | ||
Numeric, | ||
Date, | ||
Time, | ||
Timestamp { precision: Option<TimestampPrecision>, }, | ||
TimestampTz { precision: Option<TimestampPrecision>, }, | ||
MzTimestamp, | ||
Interval, | ||
Bytes, | ||
Jsonb, | ||
String, // also includes HirScalarType::{VarChar,Char,PgLegacyName} | ||
Uuid, | ||
Array(Box<MirScalarType>), // also includes HirScalarType::Int2Vector | ||
List { element_type: Box<MirScalarType> }, // also includes HirScalarType::Record | ||
Map { value_type: Box<MirScalarType> }, | ||
Range { element_type: Box<MirScalarType> }, | ||
MzAclItem, | ||
AclItem, | ||
} | ||
|
||
pub struct MirColumnType { | ||
pub scalar_type: MirScalarType, | ||
pub nullable: bool, | ||
} | ||
``` | ||
|
||
Since LIR is effectively untyped, there's no need for us to define `LirScalarType` or the like. We already use `MirScalarExpr` inside of `Mfp`s in LIR. | ||
|
||
We'll need to define a projection from `HirScalarType` to `MirScalarType` (see [`ScalarType::is_instance_of()`](https://github.com/MaterializeInc/materialize/blob/e4578164fb28a204b43c58ab8ff6c1d3e3b4427f/src/repr/src/scalar.rs#L947) for the correspondence). | ||
|
||
In `EXPLAIN PLAN`s, we will report `MirScalarType`s with `r`-prefixed names, where `r` is short for "representation", e.g., `rstring` is the `MirScalarType` corresponding to `text`, `varchar`, etc. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe this prefix should be There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure the word There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd go with |
||
|
||
## Alternatives | ||
|
||
We considered several alternatives: | ||
|
||
- (`eq-type`) Change the MIR typechecker to conflate `varchar` and `text` types (and drop calls to `varchar_to_text`). | ||
- (`eq-indx`) Change index selection to conflate `varchar` and `text` types (and leave calls to `varchar_to_text`). | ||
- (`remixvc`) Automatically turn add `varchar_to_text` to all indices on a `varchar`. | ||
- (`remix++`) Automatically wrap every use of a `varchar` with `varchar_to_text`. | ||
|
||
The `eq-type` and `eq-indx` approach came up in early discussions. | ||
These are fundamentally hacks, with the upside of being quick fixes. | ||
The downside is the addition of a tricky corner case, some risk (what if keeping the types separate matters somewhere we've forgotten about?), and some technical debt/mess-making. | ||
Comment on lines
+124
to
+125
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What makes them fundamentally hacks/tech debt? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The hacky/tech-debt/messy feeling here is that it's a very specialized way to work with But your point about numeric indices makes it clear to me that index selection will need to be a little bit special no matter what---and index creation, too (#27030). So, I'm coming around to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Something that's been nagging at me, but I may simply be confused about: the hash functions for There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not really sure how So, in addition to modifying the index selection in (Note that this problem didn't occur in #15476, because there it's only one side of the equality that is doing an index lookup, so there we just move the fiddling with the types to the other side of the equality.) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We discussed this a bit with @mgree, and the hashing issue might be worse than I thought. We are doing partitioned joins, i.e., both of the arrangements are partitioned by a hash, and pairs of rows that the join will emit need to be in the same partition. But now with a cast on one of the keys, the partitioning would need to be modified on one of the arrangements, and I'm not sure if there is any better way to do this than to rebuild the arrangement, which would completely defeat the purpose. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe @frankmcsherry and I can talk about how There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (Btw., not an ideal solution, but we could add a notice that would make it easier for users to find these implicit cast issues and manually fix them by tweaking their index keys to include the cast: https://github.com/MaterializeInc/materialize/issues/22133) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh, look what I found! On a more serious note, suddenly finding this long-forgotten, really detailed discussion of these same problems that we are discussing now really demonstrates the cost of the churn of optimizer people... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Interesting! One possibility for We could even treat everything as There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm, that sounds good to me! But let's wait for some rendering expert to say how feasible this is. |
||
|
||
The `remixvc` and `remix++` approaches address the problem at the user/adapter side. | ||
Such rewriting feels a bit risky/annoying/confusing, and I'm not sure how it would interact with sources. | ||
|
||
## Open questions | ||
|
||
It's possible that (as Nikhil worries) somewhere in the pgwire or WS layers calls `.typ()` on a `MirRelationExpr` and expects a `HirScalarType` but will now find a `MirScalarType`. | ||
I don't know those layers well, but we could derisk some of the work of PR #2 by checking for that in advance. | ||
(In any case, the Rust type checker will be eager to tell us about it.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about other similar types with multiple representations, like integers? Integers can be implicitly promoted in many contexts. That gums up planning similarly, no? Do we want to consider tackling that as part of this same line of work?
Integers seem trickier, since we truly do have multiple physical representations of integers. It's not as simple as just eliding the type information.
I don't want to cause unnecessary scope creep. Just want to make sure that we're not going to feel silly if/when implicit casts of integers become an acute problem and we have to solve it in a different but more general way. If that more general way would work for the textual types as well, we might want to focus our efforts there instead.
Way back when we introduced the concept of cast inverses to prevent implicit casts from interfering with point lookups (issue: https://github.com/MaterializeInc/materialize/issues/15525, fix PR: #15476). I imagine there's something clever you could do with these inverses during index selection for joins?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two related ideas here:
What types characterize integers?
My primary goal was to have the types at the MIR level characterize the
Datum
s the operator would deal with. All of the various string types will just beDatum::String
at runtime, so one type works for them. But we actually have different numericDatum
constructors: many different sizes of ints (in unsigned and signed flavors), two sizes of floats, and arbitrary precision numerics. So If MIR types characterizeDatum
constructors, then it seems we would probably need different numeric types.I can imagine a system where there's a unified numeric supertype
Num
which subsumes all of these individual types, i.e., a value of type num could be any possible numericDatum
. But a secondary goal is to avoid complex/automatic typing relationships; one proxy for this complexity is how much overlap there is in the mapping from MIR types toDatum
constructors. It's unfortunately not possible to have an injective mapping from MIR types toDatum
constructors, becauseDatum::True
andDatum::False
are booleans but also JSON values... but adding aNum
supertype would be fairly complicating. And, it wouldn't resolve our other issue...How do we deal with casts and their possibly-erroring inverses?
Most of our conversions have inverses---
text_to_varchar
hasvarchar_to_text
, timestamps can be converted to and from strings, and the numeric types can move into and out of each other. But in general, these are not true inverses/bijections: one side might produce an error.text_to_varchar
is a left inverse ofvarchar_to_text
, but not a right inverse. That is,text_to_varchar[k](varchar_to_text[k](s)) == s
for alls
of typevarchar(k)
, for allk
. But it is not the case thatvarchar_to_text[k](text_to_varchar[k](s))
for alls
of typetext
for allk
---it could be thats
is longer thank
andtext_to_varchar[k](s)
produces an error. Put differently,varchar_to_text
is a section andtext_to_varchar
is a retraction.In general, there's no way to remove the fallible casts and adding in a catch-all numeric type without introducing something worse: pervasive checks and coercions between numeric types.
I think the right way forward here is:
UnaryFunc::could_error()
withUnaryFunc::inverse()
---if neither can error, we can simplifyf(f_inv(x))
tox
.f(f_inv(x))
and only the outer one is fallible, we have a left inverse and we can simplify it tox
.I'm starting to see your point about index selection being a key moment; I'll address it in my response to your other comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you say "Identify left-inverses explicitly", you mean that we'd add a new function property, i.e., it's not enough to simply call
could_error
on the inner one, right? Because, in theory, the outer one might fail for some other reason besides not being able to represent the value in the original type. E.g., one might imagine a (thoroughly contrived) functionvarchar_to_text_if_not_empty
, which errors if the varchar is empty. This function would be aninverse
oftext_to_varchar
in the sense that the current code is using the wordinverse
, which is something like "if bothpreserves_uniqueness
then it's an inverse for such values for which neither errors".But maybe we could redefine our
inverse
to mean that you just have to checkcould_error
on the inner one. Probably the above contrived case is not actually occurring for any of our functions.Btw. I just found an old Slack thread where I wrote more details on what the current code means by
inverse
than what to code comment is saying. (I had it on my todo list to actually improve the code comment with this information, but somehow it got buried.)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're absolutely right that if we had a function with a declared inverse that errored for some other reason we would have a problem with that approach.
I agree that we should rethink/refine how we think about inverses. Having a single function that says "this is my inverse and here's whether or not I could error" would simplify the thinking here.