design doc: MIR typechecking using representation types #27239

Open
wants to merge 3 commits into base: main
Conversation


@mgree mgree commented May 22, 2024

Rendered document.

Motivation

  • This PR adds a feature that has not yet been specified.

Implicit casts (varchar_to_text) can impede join planning; by having MIR use a type system that corresponds to the representation we'll actually use, we can type-safely elide implicit casts.

Issue reference: https://github.com/MaterializeInc/database-issues/issues/1291

@mgree mgree added C-refactoring Category: replacing or reorganizing code A-optimization Area: query optimization and transformation A-compute Area: compute labels May 22, 2024
Member

@benesch benesch left a comment

Dropped in some comments! My biggest question is around whether we're overindexing on textual implicit casts and underindexing on implicit casts between integers of different widths. There may be other benefits to representation types beyond implicit casts that make this work worth doing even though it doesn't fix all implicit cast problems, but it seemed worth asking the question.


These implicit coercions---often of the form `varchar_to_text(#c)` for some column number `c`---get in the way during join planning.
Even if we have an index on column `c` as a `varchar`, we cannot use that index to plan a join on `varchar_to_text(#c)`: not only are the types different, the terms are different, too.
We know these implicit coercions interfere with joins; they may also interfere with other aspects of optimization.
Member

What about other similar types with multiple representations, like integers? Integers can be implicitly promoted in many contexts. That gums up planning similarly, no? Do we want to consider tackling that as part of this same line of work?

Integers seem trickier, since we truly do have multiple physical representations of integers. It's not as simple as just eliding the type information.

I don't want to cause unnecessary scope creep. Just want to make sure that we're not going to feel silly if/when implicit casts of integers become an acute problem and we have to solve it in a different but more general way. If that more general way would work for the textual types as well, we might want to focus our efforts there instead.

Way back when, we introduced the concept of cast inverses to prevent implicit casts from interfering with point lookups (issue: https://github.com/MaterializeInc/materialize/issues/15525, fix PR: #15476). I imagine there's something clever you could do with these inverses during index selection for joins?

Contributor Author

There are two related ideas here:

  1. what types characterize integers?
  2. how do we deal with casts and their possibly-erroring inverses?

What types characterize integers?

My primary goal was to have the types at the MIR level characterize the Datums the operator would deal with. All of the various string types will just be Datum::String at runtime, so one type works for them. But we actually have different numeric Datum constructors: many different sizes of ints (in unsigned and signed flavors), two sizes of floats, and arbitrary-precision numerics. So if MIR types characterize Datum constructors, then it seems we would probably need different numeric types.

I can imagine a system where there's a unified numeric supertype Num which subsumes all of these individual types, i.e., a value of type Num could be any possible numeric Datum. But a secondary goal is to avoid complex/automatic typing relationships; one proxy for this complexity is how much overlap there is in the mapping from MIR types to Datum constructors. It's unfortunately not possible to have an injective mapping from MIR types to Datum constructors, because Datum::True and Datum::False are booleans but also JSON values... but adding a Num supertype would complicate things considerably. And it wouldn't resolve our other issue...
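To make this concrete, here's a toy sketch of the mapping I have in mind. The variant names below are placeholders (the real ScalarType has many more variants, and varchar/char carry length metadata), and ReprScalarType is only proposed here, not something that exists today:

```rust
// Placeholder SQL-level types; a toy stand-in for a sliver of the real ScalarType.
enum SqlScalarType {
    Text,
    VarChar { max_length: Option<u32> },
    Char { length: Option<u32> },
    Int32,
    Int64,
}

// Proposed representation-level types: one string type, because every pg
// "character type" is a Datum::String at runtime, but distinct integer types,
// because the Datum constructors differ.
enum ReprScalarType {
    String,
    Int32,
    Int64,
}

fn repr_of(t: &SqlScalarType) -> ReprScalarType {
    match t {
        SqlScalarType::Text
        | SqlScalarType::VarChar { .. }
        | SqlScalarType::Char { .. } => ReprScalarType::String,
        SqlScalarType::Int32 => ReprScalarType::Int32,
        SqlScalarType::Int64 => ReprScalarType::Int64,
    }
}

fn main() {
    // varchar(18) and text collapse to the same representation type.
    assert!(matches!(
        repr_of(&SqlScalarType::VarChar { max_length: Some(18) }),
        ReprScalarType::String
    ));
    assert!(matches!(repr_of(&SqlScalarType::Text), ReprScalarType::String));
}
```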

How do we deal with casts and their possibly-erroring inverses?

Most of our conversions have inverses---text_to_varchar has varchar_to_text, timestamps can be converted to and from strings, and the numeric types can move into and out of each other. But in general, these are not true inverses/bijections: one side might produce an error.

⚠️ Unnecessary math jargon: ⚠️ text_to_varchar is a left inverse of varchar_to_text, but not a right inverse. That is, text_to_varchar[k](varchar_to_text[k](s)) == s for all s of type varchar(k), for all k. But it is not the case that varchar_to_text[k](text_to_varchar[k](s)) == s for all s of type text, for all k---it could be that s is longer than k, in which case text_to_varchar[k](s) produces an error. Put differently, varchar_to_text is a section and text_to_varchar is a retraction.
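The same statement in equation form, purely a restatement of the paragraph above:

```latex
% varchar_to_text[k] : varchar(k) -> text is total;
% text_to_varchar[k] : text -> varchar(k) may error.
\forall k.\ \forall s : \mathit{varchar}(k).\quad
  \mathit{text\_to\_varchar}[k]\bigl(\mathit{varchar\_to\_text}[k](s)\bigr) = s
  \qquad \text{(left inverse)}

% The other composition is not the identity on all of text:
\exists k.\ \exists s : \mathit{text}.\quad
  \mathit{text\_to\_varchar}[k](s)\ \text{errors}, \quad \text{namely when } |s| > k.
```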

In general, there's no way to remove the fallible casts and add a catch-all numeric type without introducing something worse: pervasive checks and coercions between numeric types.

I think the right way forward here is:

  1. Use UnaryFunc::could_error() with UnaryFunc::inverse()---if neither can error, we can simplify f(f_inv(x)) to x (see the sketch after this list).
  2. Identify left-inverses explicitly: if we have f(f_inv(x)) and only the outer one is fallible, we have a left inverse and we can simplify it to x.
  3. Explore optimizations around numeric sizing and index selection.
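Here's a self-contained sketch of steps 1 and 2 as a rewrite over unary calls. The enums below are toy stand-ins rather than the real mz_expr types; only could_error() and inverse() correspond to the methods named above, and is_left_inverse_of is the hypothetical new property from step 2:

```rust
#[derive(Clone, Debug, PartialEq)]
enum UnaryFunc {
    VarcharToText, // infallible: every varchar value is a valid text value
    TextToVarchar, // fallible: errors if the text exceeds the declared length
}

impl UnaryFunc {
    fn could_error(&self) -> bool {
        matches!(self, UnaryFunc::TextToVarchar)
    }

    fn inverse(&self) -> Option<UnaryFunc> {
        match self {
            UnaryFunc::VarcharToText => Some(UnaryFunc::TextToVarchar),
            UnaryFunc::TextToVarchar => Some(UnaryFunc::VarcharToText),
        }
    }

    // Hypothetical new property: `self` composed after `other` is the identity
    // on everything `other` can produce, so the composition cannot error.
    fn is_left_inverse_of(&self, other: &UnaryFunc) -> bool {
        matches!(
            (self, other),
            (UnaryFunc::TextToVarchar, UnaryFunc::VarcharToText)
        )
    }
}

#[derive(Clone, Debug, PartialEq)]
enum MirScalarExpr {
    Column(usize),
    CallUnary { func: UnaryFunc, expr: Box<MirScalarExpr> },
}

/// Rewrite `outer(inner(x))` to `x` when that is known to be safe.
fn simplify_inverse_pair(e: &MirScalarExpr) -> Option<MirScalarExpr> {
    let MirScalarExpr::CallUnary { func: outer, expr } = e else {
        return None;
    };
    let MirScalarExpr::CallUnary { func: inner, expr: x } = &**expr else {
        return None;
    };
    // Step 1: declared inverses where neither side can error.
    if outer.inverse().as_ref() == Some(inner) && !outer.could_error() && !inner.could_error() {
        return Some((**x).clone());
    }
    // Step 2: a declared left inverse, so the (fallible in general) outer
    // function cannot fail on anything the infallible inner one produces.
    if outer.is_left_inverse_of(inner) && !inner.could_error() {
        return Some((**x).clone());
    }
    None
}

fn main() {
    // text_to_varchar(varchar_to_text(#0)) simplifies to #0 via step 2.
    let e = MirScalarExpr::CallUnary {
        func: UnaryFunc::TextToVarchar,
        expr: Box::new(MirScalarExpr::CallUnary {
            func: UnaryFunc::VarcharToText,
            expr: Box::new(MirScalarExpr::Column(0)),
        }),
    };
    assert_eq!(simplify_inverse_pair(&e), Some(MirScalarExpr::Column(0)));
}
```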

I'm starting to see your point about index selection being a key moment; I'll address it in my response to your other comment.

Contributor

When you say "Identify left-inverses explicitly", you mean that we'd add a new function property, i.e., it's not enough to simply call could_error on the inner one, right? Because, in theory, the outer one might fail for some other reason besides not being able to represent the value in the original type. E.g., one might imagine a (thoroughly contrived) function varchar_to_text_if_not_empty, which errors if the varchar is empty. This function would be an inverse of text_to_varchar in the sense in which the current code uses the word inverse, which is something like "if both satisfy preserves_uniqueness, then it's an inverse for those values for which neither errors".

But maybe we could redefine our inverse to mean that you just have to check could_error on the inner one. Probably the above contrived case doesn't actually occur for any of our functions.

Btw., I just found an old Slack thread where I wrote up more details on what the current code means by inverse than what the code comment says. (I had it on my todo list to actually improve the code comment with this information, but somehow it got buried.)

Contributor Author

You're absolutely right that if we had a function with a declared inverse that errored for some other reason, we would have a problem with that approach.

I agree that we should rethink/refine how we think about inverses. Having a single function that says "this is my inverse and here's whether or not I could error" would simplify the thinking here.
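One possible shape for that "single function", with invented names (nothing like this exists in mz_expr today); it's just a sketch of the information we'd want to bundle:

```rust
// Hypothetical new function property bundling "who is my inverse" with
// "how much of a true inverse is it". All names here are invented.
pub enum InverseKind {
    /// f(g(x)) == x and g(f(y)) == y; neither composition can error.
    TwoSided,
    /// This function undoes its declared inverse on everything that inverse
    /// can produce, but may error on other inputs (text_to_varchar vs.
    /// varchar_to_text is the motivating example).
    LeftInverseOnly,
}

pub struct DeclaredInverse<Func> {
    pub inverse: Func, // e.g., the real UnaryFunc in mz_expr
    pub kind: InverseKind,
}
```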

Comment on lines +118 to +125
These are fundamentally hacks, with the upside of being quick fixes.
The downside is the addition of a tricky corner case, some risk (what if keeping the types separate matters somewhere we've forgotten about?), and some technical debt/mess-making.
Member

What makes them fundamentally hacks/tech debt? eq-indx in particular seems like a narrow case of a smart index selector (let's call it conv-indx) that can handle mixed type lookups, e.g. looking up an i32 in an i64 index or vice-versa, and that seems useful/powerful/desirable?

Contributor Author

The hacky/tech-debt/messy feeling here is that it's a very specialized way to work with ScalarExpr in a very narrow context, and I worry about duplicated work and inconsistent handling.

But your point about numeric indices makes it clear to me that index selection will need to be a little bit special no matter what---and index creation, too (#27030).

So, I'm coming around to eq-indx/conv-indx. An advantage of mirtype is that it will let us remove many infallible casts wholesale while simplifying our view of types at the MIR level, which will address our immediate problem. But we'll need something like conv-indx in general. (With conv-indx, you could avoid doing the work of #15476. Or, alternatively, it may be possible to implement conv-indx using the techniques from #15476.)

Contributor Author

Something that's been nagging at me, but I may simply be confused about: the hash functions for i32 and i64 are not going to be the same, i.e., sign extending n::i32 to signext(n)::i64 will likely not hash the same, so I think we'd need some careful logic here. I remember bringing something similar up to @frankmcsherry (I think using an index on n::i64 to look up n+1 or some similarly invertible operation) and him pointing out why it wouldn't work, but I can't remember what the exact issue was.
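For what it's worth, here's a toy illustration of the width mismatch using Rust's std hasher (DefaultHasher), not Materialize's actual row hashing; any hasher that consumes the key's raw bytes will see different inputs for an i32 and its sign extension to i64:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let narrow: i32 = 42;
    let wide: i64 = narrow as i64; // sign extension preserves the value...
    // ...but the hashes differ, because i32 and i64 feed different byte
    // sequences to the hasher.
    println!("hash(42_i32) = {:#x}", hash_of(&narrow));
    println!("hash(42_i64) = {:#x}", hash_of(&wide));
    assert_ne!(hash_of(&narrow), hash_of(&wide));
}
```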

Contributor

I'm not really sure how eq-indx would work in a join when the actual representations are different. If we have an index on t1 with a Datum::Int32 key, and an index on t2 with a Datum::Int64 key, then at what moment could we do the cast? I think our current join rendering just does exact equality on the arrangement keys as they are, and there is currently no way to give it a closure or something which would sneakily convert between Int32 and Int64.

So, in addition to modifying the index selection in JoinImplementation, we'd need to somehow also modify the join rendering (for Differential join it's mz_join_core.rs, and for Delta join it's dogsdogsdogs::operators::half_join::half_join_internal_unsafe, in the DD repository). This should involve a solution to the hashing problem mentioned above, besides changing how the keys are equated (somewhere around the seek_key calls).

(Note that this problem didn't occur in #15476, because there it's only one side of the equality that is doing an index lookup, so there we just move the fiddling with the types to the other side of the equality.)

Contributor

We discussed this a bit with @mgree, and the hashing issue might be worse than I thought. We are doing partitioned joins, i.e., both of the arrangements are partitioned by a hash, and pairs of rows that the join will emit need to be in the same partition. But now with a cast on one of the keys, the partitioning would need to be modified on one of the arrangements, and I'm not sure if there is any better way to do this than to rebuild the arrangement, which would completely defeat the purpose.

Contributor Author

Maybe @frankmcsherry and I can talk about how eq-indx would work some time this month? I agree that it seems complicated. 😔

Contributor

@ggevay ggevay Jun 12, 2024

(Btw., not an ideal solution, but we could add a notice that would make it easier for users to find these implicit cast issues and manually fix them by tweaking their index keys to include the cast: https://github.com/MaterializeInc/materialize/issues/22133)

Contributor

Oh, look what I found!
https://github.com/MaterializeInc/materialize/issues/10871
It's a really well-preserved archeological specimen: it hasn't been touched or seen in years, and it even features the old EXPLAIN output format!

On a more serious note, suddenly finding this long-forgotten, really detailed discussion of the same problems we're discussing now demonstrates the cost of churn among optimizer people...

Contributor Author

Interesting!

One possibility for eq-indx is to automatically coerce to the widest possible width when doing hashing and ordering. That is, we would hash and order int16, int32, and so on as if they were int128 (and we would similarly hash and order uint16, uint32, and so on as if they were uint128). I don't think there's any storage cost to doing this, just a small computational cost. And then we'd be able to automatically use a wider index for narrower values.

We could even treat everything as uint128 for hashing purposes, but it probably isn't safe to treat signed ints as uint128 for ordering purposes (if we ever use ordering to express "real" ordering and not arbitrary ordering).
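A toy sketch of that idea, again with std's hasher rather than Materialize's row hashing: widen every signed integer to int128 before hashing and comparing, so equal values agree across widths:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash an integer as if it were an i128, so the same value hashes the same
// regardless of its declared width.
fn hash_widened(value: i128) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    assert_eq!(hash_widened(42_i16 as i128), hash_widened(42_i32 as i128));
    assert_eq!(hash_widened(42_i32 as i128), hash_widened(42_i64 as i128));
    // Ordering also agrees after widening, including across signs, which is
    // why signed values shouldn't be reinterpreted as unsigned for ordering.
    assert!(((-1_i32) as i128) < (42_i64 as i128));
}
```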

Contributor

Hmm, that sounds good to me! But let's wait for some rendering expert to say how feasible this is.

Comment on lines 125 to 129

How should we report `ReprScalarType`s to the user in `EXPLAIN PLAN`s?
There's a potential for confusion if we report something as type `string` that the user expected to have type `VARCHAR(18)` or something similar.
Apparently we already treat `string` as a synonym for `text`... can we somehow choose names that will be unambiguous?
Say, `mzstring` or `charseq` or `utf8`?
Member

Seems fine to just report them in whatever representation is convenient for us. MirScalarExprs aren't SQL expressions, and are hard to map to SQL expressions; I don't think we need to feel any obligation for the type information that annotates MirScalarExprs to map directly to SQL types, either.

Maybe just stick an r prefix in front of it to make it clear(er) it's a lower-level "repr" type? rstring, rint32, etc.

Comment on lines 55 to 56
1. Rename the existing `ScalarType` to `SqlScalarType`. (We may want to rename `ColumnType` and `RelationType`, as well.)
2. Have MIR work with `ReprScalarType` (see [MVP](#minimal-viable-prototype)).
Member

I don't love that we have two parallel naming schemes: "HIR", "MIR", and "LIR" for expressions, and "Sql" and "Repr" for types. Is it right that HIR types are always SqlScalarTypes and MIR and LIR types are always ReprScalarTypes? Is there some intuition that suggests we'll only need the two ends of the spectrum for types, even though we needed three IRs for expressions? Or do we think it's likely we'll one day have HirScalarType, MirScalarType, and LirScalarType?

Contributor

I think we might also call them HirScalarType and MirScalarType. LIR already has a precedent of using MirScalarExpr for its scalars. If we want the IR families aligned in their prefixes, we might introduce type aliases:

type LirScalarType = MirScalarType;
type LirScalarExpr = MirScalarExpr;

Contributor Author

I'm easy on names. Right now LIR is effectively untyped. If we were to add types to LIR, they would be the same as MIR types. (@frankmcsherry, @antiguru, and I have discussed the idea of a bytecode/instruction LIR, which might end up having a finer grained type system.)

@mgree mgree mentioned this pull request May 29, 2024
@mgree mgree force-pushed the mir-representation-types-design-doc branch from e0585ca to db905d2 on May 29, 2024 at 17:55
@mgree mgree force-pushed the mir-representation-types-design-doc branch from db905d2 to da14039 on May 29, 2024 at 17:58
Contributor Author

mgree commented May 29, 2024

My biggest question is around whether we're overindexing on textual implicit casts and underindexing on implicit casts between integers of different widths. There may be other benefits to representation types beyond implicit casts that make this work worth doing even though it doesn't fix all implicit cast problems, but it seemed worth asking the question.

You're right that this proposed change really only addresses issues with text data. We represent textual data the same way for all pg "character types", so changing the MIR type system lets us conflate them (possibly keeping some cast operations around for their errors).

For numbers, the representations are all different, so this change wouldn't affect us much. I've updated the document a bit, but I guess I'd propose a multi-pronged approach here:

  1. Change the type system to deal with textual stuff wholesale.
  2. Improve handling of inverses across all types.
  3. Implement some form of conv-indx across all applicable types/operations.
  4. Add notices for "unnecessary" casts: user-written casts that eventually need to be converted back to their original type. (I've seen this pattern in at least two clients, both times in what I would guess was generated code.)

We could do these steps in pretty much any order. I don't have a clear sense of which will have the highest impact.

Member

benesch commented May 30, 2024

My biggest question is around whether we're overindexing on textual implicit casts and underindexing on implicit casts between integers of different widths. There may be other benefits to representation types beyond implicit casts that make this work worth doing even though it doesn't fix all implicit cast problems, but it seemed worth asking the question.

You're right that this proposed change really only addresses issues with text data. We represent textual data the same for all pg "character types", so changing the MIR type system lets us conflate them (possibly keeping some cast operations around for their errors).

For numbers, the representations are all different, so this change wouldn't affect us much. I've updated the document a bit, but I guess I'd propose a multi-pronged approach here...

We could do these steps in pretty much any order. I don't have a clear sense of which will have the highest impact.

This makes sense to me! It all seems plausibly quite valuable, and as long as we feel good about having MirScalarType and conv-indx in the limit, no concerns for me. I too don't have a clear sense of which steps will have the highest impact.

4. Add notices for "unnecessary" casts: user-written casts that eventually need to be converted back to their original type. (I've seen this pattern in at least two clients, both times in what I would guess was generated code.)

This one I'm not sold on! In the broad universe of SQL, I don't think there's anything more correct or "natural" about text or varchar. AFAICT, which you prefer just depends on what database system you were brought up on. My long term ideal UX here is that Materialize absorbs all the complexity here by making both text / varchar equally efficient, instead of pushing the complexity back up to users and asking them to rewrite their queries (or adjust the application generating queries, which might be ~impossible).

That said, no qualms about doing this first as a stopgap solution, if it's substantially easier than making text and varchar equally efficient, and then later removing the notice when we've closed the efficiency gap.


We'll need to define a projection from `HirScalarType` to `MirScalarType` (see [`ScalarType::is_instance_of()`](https://github.com/MaterializeInc/materialize/blob/e4578164fb28a204b43c58ab8ff6c1d3e3b4427f/src/repr/src/scalar.rs#L947) for the correspondence).

In `EXPLAIN PLAN`s, we will report `MirScalarType`s with `r`-prefixed names, where `r` is short for "representation", e.g., `rstring` is the `MirScalarType` corresponding to `text`, `varchar`, etc.
Member

Maybe this prefix should be m or mir if we're going with MirScalarType?

Contributor Author

I'm not sure the word mir is shown anywhere to users in our current EXPLAIN output. Very happy to be told what to do here (the r came from earlier discussion).

Member

I'd go with mir (e.g., mirstring) to start. I think it's the clearest. Since EXPLAIN output isn't stable we have a ton of flexibility to change it if we get user complaints.

@mgree mgree self-assigned this Jun 5, 2024
Contributor Author

mgree commented Jun 5, 2024

  1. Add notices for "unnecessary" casts: user-written casts that eventually need to be converted back to their original type. (I've seen this pattern in at least two clients, both times in what I would guess was generated code.)

This one I'm not sold on! In the broad universe of SQL, I don't think there's anything more correct or "natural" about text or varchar. AFAICT, which you prefer just depends on what database system you were brought up on. My long term ideal UX here is that Materialize absorbs all the complexity here by making both text / varchar equally efficient, instead of pushing the complexity back up to users and asking them to rewrite their queries (or adjust the application generating queries, which might be ~impossible).

That's a good point. Especially considering that many of the casts we're seeing are from generated code, a notice would be annoying (and devalue notices).

That said, no qualms about doing this first as a stopgap solution, if it's substantially easier than making text and varchar equally efficient, and then later removing the notice when we've closed the efficiency gap.

Eh, the type reconciliation should mostly close the gap. (So long as no one ever provides limits for varchar/bpchar, we'll be able to flatten everything.)
