-
Notifications
You must be signed in to change notification settings - Fork 465
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
design doc: MIR typechecking using representation types #27239
base: main
Are you sure you want to change the base?
design doc: MIR typechecking using representation types #27239
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dropped in some comments! My biggest question is around whether we're overindexing on textual implicit casts and underindexing on implicit casts between integers of different widths. There may be other benefits to representation types beyond implicit casts that make this work worth doing even though it doesn't fix all implicit cast problems, but it seemed worth asking the question.
|
||
These implicit coercions---often of the form `varchar_to_text(#c)` for some column number `c`---get in the way during join planning. | ||
Even if we have an index on column `c` as a `varchar`, we cannot use that index to plan a join on `varchar_to_text(#c)`: not only are the types different, the terms are different, too. | ||
We know these implicit coercions interfere with joins; they may also interfere with other aspects of optimization. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about other similar types with multiple representations, like integers? Integers can be implicitly promoted in many contexts. That gums up planning similarly, no? Do we want to consider tackling that as part of this same line of work?
Integers seem trickier, since we truly do have multiple physical representations of integers. It's not as simple as just eliding the type information.
I don't want to cause unnecessary scope creep. Just want to make sure that we're not going to feel silly if/when implicit casts of integers become an acute problem and we have to solve it in a different but more general way. If that more general way would work for the textual types as well, we might want to focus our efforts there instead.
Way back when we introduced the concept of cast inverses to prevent implicit casts from interfering with point lookups (issue: https://github.com/MaterializeInc/materialize/issues/15525, fix PR: #15476). I imagine there's something clever you could do with these inverses during index selection for joins?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two related ideas here:
- what types characterize integers?
- how do we deal with casts and their possibly-erroring inverses?
What types characterize integers?
My primary goal was to have the types at the MIR level characterize the Datum
s the operator would deal with. All of the various string types will just be Datum::String
at runtime, so one type works for them. But we actually have different numeric Datum
constructors: many different sizes of ints (in unsigned and signed flavors), two sizes of floats, and arbitrary precision numerics. So If MIR types characterize Datum
constructors, then it seems we would probably need different numeric types.
I can imagine a system where there's a unified numeric supertype Num
which subsumes all of these individual types, i.e., a value of type num could be any possible numeric Datum
. But a secondary goal is to avoid complex/automatic typing relationships; one proxy for this complexity is how much overlap there is in the mapping from MIR types to Datum
constructors. It's unfortunately not possible to have an injective mapping from MIR types to Datum
constructors, because Datum::True
and Datum::False
are booleans but also JSON values... but adding a Num
supertype would be fairly complicating. And, it wouldn't resolve our other issue...
How do we deal with casts and their possibly-erroring inverses?
Most of our conversions have inverses---text_to_varchar
has varchar_to_text
, timestamps can be converted to and from strings, and the numeric types can move into and out of each other. But in general, these are not true inverses/bijections: one side might produce an error.
text_to_varchar
is a left inverse of varchar_to_text
, but not a right inverse. That is, text_to_varchar[k](varchar_to_text[k](s)) == s
for all s
of type varchar(k)
, for all k
. But it is not the case that varchar_to_text[k](text_to_varchar[k](s))
for all s
of type text
for all k
---it could be that s
is longer than k
and text_to_varchar[k](s)
produces an error. Put differently, varchar_to_text
is a section and text_to_varchar
is a retraction.
In general, there's no way to remove the fallible casts and adding in a catch-all numeric type without introducing something worse: pervasive checks and coercions between numeric types.
I think the right way forward here is:
- Use
UnaryFunc::could_error()
withUnaryFunc::inverse()
---if neither can error, we can simplifyf(f_inv(x))
tox
. - Identify left-inverses explicitly: if we have
f(f_inv(x))
and only the outer one is fallible, we have a left inverse and we can simplify it tox
. - Explore optimizations around numeric sizing and index selection.
I'm starting to see your point about index selection being a key moment; I'll address it in my response to your other comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you say "Identify left-inverses explicitly", you mean that we'd add a new function property, i.e., it's not enough to simply call could_error
on the inner one, right? Because, in theory, the outer one might fail for some other reason besides not being able to represent the value in the original type. E.g., one might imagine a (thoroughly contrived) function varchar_to_text_if_not_empty
, which errors if the varchar is empty. This function would be an inverse
of text_to_varchar
in the sense that the current code is using the word inverse
, which is something like "if both preserves_uniqueness
then it's an inverse for such values for which neither errors".
But maybe we could redefine our inverse
to mean that you just have to check could_error
on the inner one. Probably the above contrived case is not actually occurring for any of our functions.
Btw. I just found an old Slack thread where I wrote more details on what the current code means by inverse
than what to code comment is saying. (I had it on my todo list to actually improve the code comment with this information, but somehow it got buried.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're absolutely right that if we had a function with a declared inverse that errored for some other reason we would have a problem with that approach.
I agree that we should rethink/refine how we think about inverses. Having a single function that says "this is my inverse and here's whether or not I could error" would simplify the thinking here.
These are fundamentally hacks, with the upside of being quick fixes. | ||
The downside is the addition of a tricky corner case, some risk (what if keeping the types separate matters somewhere we've forgotten about?), and some technical debt/mess-making. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What makes them fundamentally hacks/tech debt? eq-indx
in particular seems like a narrow case of a smart index selector (let's call it conv-indx
) that can handle mixed type lookups, e.g. looking up an i32 in an i64 index or vice-versa, and that seems useful/powerful/desirable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hacky/tech-debt/messy feeling here is that it's a very specialized way to work with ScalarExpr
in a very narrow context, and I worry about duplicated work and inconsistent handling.
But your point about numeric indices makes it clear to me that index selection will need to be a little bit special no matter what---and index creation, too (#27030).
So, I'm coming around to eq-indx
/conv-indx
. An advantage of mirtype
is that it will let us remove many infallible casts wholesale while simplifying our view of types at the MIR level, which will address our immediate problem. But we'll need something like conv-indx
in general. (With conv-indx
, you could avoid doing the work of #15476. Or, alternatively, it may be possible to implement conv-indx
using the techniques from #15476.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something that's been nagging at me, but I may simply be confused about: the hash functions for i32
and i64
are not going to be the same, i.e., sign extending n::i32
to signext(n)::i64
will likely not hash the same, so I think we'd need some careful logic here. I remember bringing something similar up to @frankmcsherry (I think using an index on n::i64
to lookup n+1
or some similarly invertible operation) and him pointing out why it wouldn't work, but I can't remember what the exact issue was.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not really sure how eq-indx
would work in a join when the actual representations are different. If we have an index on t1
with a Datum::Int32
key, and an index on t2
with a Datum::Int64
key, then at what moment could we do the cast? I think our current join rendering just does exact equality on the arrangement keys as they are, and there is currently no way to give it a closure or something which would sneakily convert between Int32
and Int64
.
So, in addition to modifying the index selection in JoinImplementation
, we'd need to somehow also modify the join rendering (for Differential join it's mz_join_core.rs
, and for Delta join it's dogsdogsdogs::operators::half_join::half_join_internal_unsafe
, in the DD repository). This should involve a solution to the hashing problem mentioned above, besides changing how the keys are equated (somewhere around the seek_key
calls).
(Note that this problem didn't occur in #15476, because there it's only one side of the equality that is doing an index lookup, so there we just move the fiddling with the types to the other side of the equality.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed this a bit with @mgree, and the hashing issue might be worse than I thought. We are doing partitioned joins, i.e., both of the arrangements are partitioned by a hash, and pairs of rows that the join will emit need to be in the same partition. But now with a cast on one of the keys, the partitioning would need to be modified on one of the arrangements, and I'm not sure if there is any better way to do this than to rebuild the arrangement, which would completely defeat the purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe @frankmcsherry and I can talk about how eq-indx
would work some time this month? I agree that it seems complicated. 😔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Btw., not an ideal solution, but we could add a notice that would make it easier for users to find these implicit cast issues and manually fix them by tweaking their index keys to include the cast: https://github.com/MaterializeInc/materialize/issues/22133)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, look what I found!
https://github.com/MaterializeInc/materialize/issues/10871
It's a really well-preserved archeological specimen: haven't been touched or seen in years, and even features the old EXPLAIN output format!
On a more serious note, suddenly finding this long-forgotten, really detailed discussion of these same problems that we are discussing now really demonstrates the cost of the churn of optimizer people...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting!
One possibility for eq-idx
is to automatically coerce to "widest-possible" when doing hashing and ordering. That is, we would hash and order int16
, int32
, and so on all as if it were int128
(and we would similarly hash and order uint16
, uint32
, and on on as if it were uint128
). I don't think there's any storage cost to doing this, just a small computational cost. And then we'd be able to automatically use a wider index for narrower values.
We could even treat everything as uint128
for hashing purposes, but it probably isn't safe to treat signed ints as uint128
for ordering purposes (if we ever use ordering to express "real" ordering and not arbitrary ordering).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, that sounds good to me! But let's wait for some rendering expert to say how feasible this is.
|
||
How should we report `ReprScalarType`s to the user in `EXPLAIN PLAN`s? | ||
There's a potential for confusion if we report something as type `string` that the user expected to have type `VARCHAR(18)` or something similar. | ||
Apparently we already treat `string` as a synonym for `text`... can we somehow choose names that will be unambiguous? | ||
Say, `mzstring` or `charseq` or `utf8`? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems fine to just report them in whatever representation is convenient for us. MirScalarExpr
s aren't SQL expressions, and are hard to map to SQL expressions; I don't think we need to feel any obligation for the type information that annotates MirScalarExpr
s to map directly to SQL types, either.
Maybe just stick an r
prefix in front of it to make it clear(er) it's a lower-level "repr" type? rstring
, rint32
, etc.
1. Rename the existing `ScalarType` to `SqlScalarType`. (We may want to rename `ColumnType` and `RelationType`, as well.) | ||
2. Have MIR work with `ReprScalarType` (see [MVP](#minimal-viable-prototype)). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love that we have two parallel naming schemes: "HIR", "MIR", and "LIR" for expressions, and "Sql" and "Repr" for types. Is it right that HIR types are always SqlScalarTypes
and MIR and LIR types are always ReprScalarTypes
? Is there some intuition that suggests we'll only need the two ends of the spectrum for types, even though we needed three IRs for expressions? Or do we think it's likely we'll one day have HirScalarType
, MirScalarType
, and LirScalarType
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might also call them HirScalarType
and MirScalarType
. LIR already has a precedent of using MirScalarExpr
for its scalars. If we want to have the IR families aligned in their prefixes we might introduce type aliases
type LirScalarType = MirScalarType;
type LirScalarExpr = MirScalarExpr;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm easy on names. Right now LIR is effectively untyped. If we were to add types to LIR, they would be the same as MIR types. (@frankmcsherry, @antiguru, and I have discussed the idea of a bytecode/instruction LIR, which might end up having a finer grained type system.)
e0585ca
to
db905d2
Compare
db905d2
to
da14039
Compare
You're right that this proposed change really only addresses issues with text data. We represent textual data the same for all pg "character types", so changing the MIR type system lets us conflate them (possibly keeping some cast operations around for their errors). For numbers, the representations are all different, so this change wouldn't affect us much. I've updated the document a bit, but I guess I'd propose a multi-pronged approach here:
We could do these steps in pretty much any order. I don't have a clear sense of which will have the highest impact. |
This makes sense to me! It all seems plausibly quite valuable and as long as we feel good about having
This one I'm not sold on! In the broad universe of SQL, I don't think there's anything more correct or "natural" about That said, no qualms about doing this first as a stopgap solution, if it's substantially easier than making |
|
||
We'll need to define a projection from `HirScalarType` to `MirScalarType` (see [`ScalarType::is_instance_of()`](https://github.com/MaterializeInc/materialize/blob/e4578164fb28a204b43c58ab8ff6c1d3e3b4427f/src/repr/src/scalar.rs#L947) for the correspondence). | ||
|
||
In `EXPLAIN PLAN`s, we will report `MirScalarType`s with `r`-prefixed names, where `r` is short for "representation", e.g., `rstring` is the `MirScalarType` corresponding to `text`, `varchar`, etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this prefix should be m
or mir
if we're going with MirScalarType
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure the word mir
is shown anywhere to users in our current EXPLAIN
output. Very happy to be told what to do here (the r
came from earlier discussion).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd go with mir
(e.g., mirstring
) to start. I think it's the clearest. Since EXPLAIN
output isn't stable we have a ton of flexibility to change it if we get user complaints.
That's a good point. Especially considering that many of the casts we're seeing are from generated code, a notice would be annoying (and devalue notices).
Eh, the type reconciliation should mostly close the gap. (So long as no one ever provides limits for |
Rendered document.
Motivation
Implicit casts (
varchar_to_text
) can impede join planning; by having MIR use a type system that corresponds to the representation we'll actually use, we can type-safely elide implicit casts.Issue reference: https://github.com/MaterializeInc/database-issues/issues/1291