design doc: MIR typechecking using representation types #27239

Open · wants to merge 3 commits into base: `main`
File: `doc/developer/design/20240522_mir_representation_types.md` (134 additions)
# MIR Representation Types

- Associated:
  - https://github.com/MaterializeInc/materialize/issues/4171
  - https://github.com/MaterializeInc/materialize/pull/27029
  - https://github.com/MaterializeInc/materialize/pull/27109

MIR is typechecked using user-facing SQL types.
SQL types make distinctions between types (e.g., `CHAR(5)`, `VARCHAR(5)`, and `TEXT`) that will not be realized at runtime (all of these are represented using `Datum::String`).

This design doc proposes _representation types_---a new notion of type for MIR terms.
A representation type maps to a subset of constructors/variants of the `Datum` enum.

Making this change should help address issues in optimization, particularly index selection (see ["The Problem"](#the-problem)).
It will also simplify and clarify what MIR is doing.

## The Problem

[Postgres has several types for character data:](https://www.postgresql.org/docs/current/datatype-character.html)

- `char(n)`, `bpchar(n)` for fixed-length strings with a fixed-length representation (whitespace padded)
- `varchar(n)` for bounded-length strings with a variable-length representation
- `varchar` for arbitrary-length strings with a variable-length representation
- `bpchar` for arbitrary-length strings with a variable-length representation where trailing whitespace is trimmed
- `text` for arbitrary-length strings with a variable-length representation

(There are also `"char"` and `name`, but they are very much internal to Postgres---let's ignore them for the moment.)

Materialize implements all five of these character sequence types using `Datum::String`.
Materialize only implements meaningful operations on the `text` type; to operate on other character sequence types, one must convert to `text`.
Conversion to `text` is a no-op at runtime, since we already have the string.
We freely insert implicit coercions to `text` in the early stages of query processing.
For correct error behavior, however, we need to carefully keep track of failures to meet length limits.

These implicit coercions---often of the form `varchar_to_text(#c)` for some column number `c`---get in the way during join planning.
Even if we have an index on column `c` as a `varchar`, we cannot use that index to plan a join on `varchar_to_text(#c)`: not only are the types different, the terms are different, too.
We know these implicit coercions interfere with joins; they may also interfere with other aspects of optimization.
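The mismatch can be sketched with a toy model (hypothetical names; `Expr` stands in for `MirScalarExpr`, and real index selection is far richer than syntactic equality):

```rust
// Toy model of the index-selection mismatch described above; these are
// hypothetical stand-in types, not Materialize's actual API.
#[derive(Debug, PartialEq)]
enum Expr {
    Column(usize),
    CallUnary { func: &'static str, arg: Box<Expr> },
}

fn main() {
    // The index was built on the raw column...
    let index_key = Expr::Column(3);
    // ...but the join equivalence mentions the implicitly coerced column.
    let join_key = Expr::CallUnary {
        func: "varchar_to_text",
        arg: Box::new(Expr::Column(3)),
    };
    // The terms differ syntactically, so the index key cannot match the join key.
    assert_ne!(index_key, join_key);
}
```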
**Comment (Contributor):**

What about other similar types with multiple representations, like integers? Integers can be implicitly promoted in many contexts. That gums up planning similarly, no? Do we want to consider tackling that as part of this same line of work?

Integers seem trickier, since we truly do have multiple physical representations of integers. It's not as simple as just eliding the type information.

I don't want to cause unnecessary scope creep. Just want to make sure that we're not going to feel silly if/when implicit casts of integers become an acute problem and we have to solve it in a different but more general way. If that more general way would work for the textual types as well, we might want to focus our efforts there instead.

Way back when, we introduced the concept of cast inverses to prevent implicit casts from interfering with point lookups (issue: https://github.com/MaterializeInc/materialize/issues/15525, fix PR: #15476). I imagine there's something clever you could do with these inverses during index selection for joins?

**Comment (Contributor, Author):**

There are two related ideas here:

  1. what types characterize integers?
  2. how do we deal with casts and their possibly-erroring inverses?

What types characterize integers?

My primary goal was to have the types at the MIR level characterize the `Datum`s the operator would deal with. All of the various string types will just be `Datum::String` at runtime, so one type works for them. But we actually have different numeric `Datum` constructors: many different sizes of ints (in unsigned and signed flavors), two sizes of floats, and arbitrary precision numerics. So if MIR types characterize `Datum` constructors, then it seems we would probably need different numeric types.

I can imagine a system where there's a unified numeric supertype `Num` which subsumes all of these individual types, i.e., a value of type `Num` could be any possible numeric `Datum`. But a secondary goal is to avoid complex/automatic typing relationships; one proxy for this complexity is how much overlap there is in the mapping from MIR types to `Datum` constructors. It's unfortunately not possible to have an injective mapping from MIR types to `Datum` constructors, because `Datum::True` and `Datum::False` are booleans but also JSON values... but adding a `Num` supertype would complicate things considerably. And, it wouldn't resolve our other issue...

How do we deal with casts and their possibly-erroring inverses?

Most of our conversions have inverses---`text_to_varchar` has `varchar_to_text`, timestamps can be converted to and from strings, and the numeric types can move into and out of each other. But in general, these are not true inverses/bijections: one side might produce an error.

⚠️ Unnecessary math jargon: ⚠️ `text_to_varchar` is a left inverse of `varchar_to_text`, but not a right inverse. That is, `text_to_varchar[k](varchar_to_text[k](s)) == s` for all `s` of type `varchar(k)`, for all `k`. But it is not the case that `varchar_to_text[k](text_to_varchar[k](s)) == s` for all `s` of type `text` for all `k`---it could be that `s` is longer than `k` and `text_to_varchar[k](s)` produces an error. Put differently, `varchar_to_text` is a section and `text_to_varchar` is a retraction.

In general, there's no way to remove the fallible casts by adding in a catch-all numeric type without introducing something worse: pervasive checks and coercions between numeric types.

I think the right way forward here is:

  1. Use `UnaryFunc::could_error()` with `UnaryFunc::inverse()`---if neither can error, we can simplify `f(f_inv(x))` to `x`.
  2. Identify left-inverses explicitly: if we have `f(f_inv(x))` and only the outer one is fallible, we have a left inverse and we can simplify it to `x`.
  3. Explore optimizations around numeric sizing and index selection.
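A sketch of steps 1 and 2, using hypothetical stand-ins for `UnaryFunc::could_error()` and `UnaryFunc::inverse()` (the real `UnaryFunc` is far larger; this just models the two string casts):

```rust
// Hypothetical miniature UnaryFunc, modeling only the two string casts.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Func {
    VarcharToText, // infallible: we already have the string
    TextToVarchar, // fallible: the input may exceed the length limit
}

impl Func {
    fn could_error(self) -> bool {
        matches!(self, Func::TextToVarchar)
    }
    /// The declared inverse, if any (mirrors UnaryFunc::inverse()).
    fn inverse(self) -> Option<Func> {
        match self {
            Func::VarcharToText => Some(Func::TextToVarchar),
            Func::TextToVarchar => Some(Func::VarcharToText),
        }
    }
}

/// Can `outer(inner(x))` be simplified to `x`?
fn cancels(outer: Func, inner: Func) -> bool {
    let inverse_pair = outer.inverse() == Some(inner);
    // Rule 1: declared inverses where neither side can error.
    let rule1 = inverse_pair && !outer.could_error() && !inner.could_error();
    // Rule 2 (left inverse): only the outer cast is fallible, and it is
    // guaranteed to succeed because the value came from its own codomain.
    let rule2 = inverse_pair && !inner.could_error();
    rule1 || rule2
}

fn main() {
    // text_to_varchar(varchar_to_text(x)) == x for any x : varchar(k).
    assert!(cancels(Func::TextToVarchar, Func::VarcharToText));
    // varchar_to_text(text_to_varchar(x)) may error, so it must not cancel.
    assert!(!cancels(Func::VarcharToText, Func::TextToVarchar));
}
```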

I'm starting to see your point about index selection being a key moment; I'll address it in my response to your other comment.

**Comment (Contributor):**

When you say "identify left-inverses explicitly", you mean that we'd add a new function property, i.e., it's not enough to simply call `could_error` on the inner one, right? Because, in theory, the outer one might fail for some other reason besides not being able to represent the value in the original type. E.g., one might imagine a (thoroughly contrived) function `varchar_to_text_if_not_empty`, which errors if the varchar is empty. This function would be an inverse of `text_to_varchar` in the sense in which the current code uses the word inverse, which is something like "if both have `preserves_uniqueness` then it's an inverse for those values for which neither errors".

But maybe we could redefine our inverse to mean that you just have to check `could_error` on the inner one. Probably the above contrived case is not actually occurring for any of our functions.

Btw. I just found an old Slack thread where I wrote more details on what the current code means by inverse than what the code comment says. (I had it on my todo list to actually improve the code comment with this information, but somehow it got buried.)

**Comment (Contributor, Author):**

You're absolutely right that if we had a function with a declared inverse that errored for some other reason we would have a problem with that approach.

I agree that we should rethink/refine how we think about inverses. Having a single function that says "this is my inverse and here's whether or not I could error" would simplify the thinking here.


## Success Criteria

1. Plan joins without implicit coercions between equivalent representation types getting in the way.
2. It should be possible for us to elide noop casts at the MIR level.
3. The catalog should reflect SQL-level types.
4. There is clear guidance for customers regarding the different character sequence types.
5. Notices help customers avoid pitfalls with character sequence types.

## Out of Scope

We will not rethink other components of MIR types, like nullability or unique keys.

In general, we may want a "smart index selector" that can appropriately handle mixed-type lookups, i.e., one that lets us look up an `i32` in an `i64`-sized index without needing to build a separate `i32`-sized arrangement. Such an approach generalizes `eq-indx` (see ["Alternatives"](#alternatives)).

## Solution Proposal

Having MIR be typechecked in terms of representation types demands a few changes, which can be broken into three PRs:

1. Rename the existing `ScalarType` to `HirScalarType`. (We may want to rename `ColumnType` and `RelationType`, as well.)
2. Have MIR work with `MirScalarType` (see [MVP](#minimal-viable-prototype)).
   a. Introduce the `MirScalarType` datatype.
   b. Change the optimizer to work with `MirScalarType` throughout.
   c. Change the adapter to record the `HirScalarType` of a query and update its nullability information using the `MirScalarType` at the end of optimization.
3. Elide noop casts, confirming that we plan the problematic joins better.

PR #2 will have a large diff.
On the plus side, it should be a largely mechanical change: there is a straightforward way to project a `HirScalarType` to a `MirScalarType`, and most uses of types in the optimizer will be parametric.

## Minimal Viable Prototype

The representation types will look like the following:

```rust
pub enum MirScalarType {
    Bool,
    Int16,
    Int32,
    Int64,
    UInt8,  // also includes HirScalarType::PgLegacyChar
    UInt16,
    UInt32, // also includes HirScalarType::{Oid,RegClass,RegProc,RegType}
    UInt64,
    Float32,
    Float64,
    Numeric,
    Date,
    Time,
    Timestamp { precision: Option<TimestampPrecision> },
    TimestampTz { precision: Option<TimestampPrecision> },
    MzTimestamp,
    Interval,
    Bytes,
    Jsonb,
    String, // also includes HirScalarType::{VarChar,Char,PgLegacyName}
    Uuid,
    Array(Box<MirScalarType>), // also includes HirScalarType::Int2Vector
    List { element_type: Box<MirScalarType> }, // also includes HirScalarType::Record
    Map { value_type: Box<MirScalarType> },
    Range { element_type: Box<MirScalarType> },
    MzAclItem,
    AclItem,
}

pub struct MirColumnType {
    pub scalar_type: MirScalarType,
    pub nullable: bool,
}
```

Since LIR is effectively untyped, there's no need for us to define `LirScalarType` or the like. We already use `MirScalarExpr` inside of `Mfp`s in LIR.

We'll need to define a projection from `HirScalarType` to `MirScalarType` (see [`ScalarType::is_instance_of()`](https://github.com/MaterializeInc/materialize/blob/e4578164fb28a204b43c58ab8ff6c1d3e3b4427f/src/repr/src/scalar.rs#L947) for the correspondence).
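A pared-down sketch of what that projection might look like, using hypothetical miniature versions of both types (the real `HirScalarType` has many more variants; the mapping follows the correspondence in `is_instance_of()`):

```rust
// Hypothetical, pared-down stand-ins for the real types; only a few variants
// are shown to illustrate the shape of the projection.
#[allow(dead_code)]
enum HirScalarType {
    Char { length: Option<u32> },
    VarChar { max_length: Option<u32> },
    String,
    PgLegacyChar,
    Oid,
}

#[derive(Debug, PartialEq)]
enum MirScalarType {
    String,
    UInt8,
    UInt32,
}

fn project(hir: &HirScalarType) -> MirScalarType {
    match hir {
        // All character-sequence types are Datum::String at runtime, so they
        // collapse into a single representation type.
        HirScalarType::Char { .. }
        | HirScalarType::VarChar { .. }
        | HirScalarType::String => MirScalarType::String,
        HirScalarType::PgLegacyChar => MirScalarType::UInt8,
        HirScalarType::Oid => MirScalarType::UInt32,
    }
}

fn main() {
    assert_eq!(
        project(&HirScalarType::VarChar { max_length: Some(5) }),
        MirScalarType::String
    );
    assert_eq!(project(&HirScalarType::String), MirScalarType::String);
}
```

Crucially, the projection is total but not injective: distinct SQL types can land on the same representation type, which is exactly what lets us elide the casts between them.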

In `EXPLAIN PLAN`s, we will report `MirScalarType`s with `r`-prefixed names, where `r` is short for "representation", e.g., `rstring` is the `MirScalarType` corresponding to `text`, `varchar`, etc.
**Comment (Contributor):**

Maybe this prefix should be `m` or `mir` if we're going with `MirScalarType`?

**Comment (Contributor, Author):**

I'm not sure the word `mir` is shown anywhere to users in our current `EXPLAIN` output. Very happy to be told what to do here (the `r` came from earlier discussion).

**Comment (Contributor):**

I'd go with `mir` (e.g., `mirstring`) to start. I think it's the clearest. Since `EXPLAIN` output isn't stable, we have a ton of flexibility to change it if we get user complaints.


## Alternatives

We considered several alternatives:

- (`eq-type`) Change the MIR typechecker to conflate `varchar` and `text` types (and drop calls to `varchar_to_text`).
- (`eq-indx`) Change index selection to conflate `varchar` and `text` types (and leave calls to `varchar_to_text`).
- (`remixvc`) Automatically add `varchar_to_text` to all indices on a `varchar`.
- (`remix++`) Automatically wrap every use of a `varchar` with `varchar_to_text`.

The `eq-type` and `eq-indx` approaches came up in early discussions.
These are fundamentally hacks, with the upside of being quick fixes.
The downside is the addition of a tricky corner case, some risk (what if keeping the types separate matters somewhere we've forgotten about?), and some technical debt/mess-making.
**Comment (Contributor, on lines +124 to +125):**

What makes them fundamentally hacks/tech debt? eq-indx in particular seems like a narrow case of a smart index selector (let's call it conv-indx) that can handle mixed type lookups, e.g. looking up an i32 in an i64 index or vice-versa, and that seems useful/powerful/desirable?

**Comment (Contributor, Author):**

The hacky/tech-debt/messy feeling here is that it's a very specialized way to work with ScalarExpr in a very narrow context, and I worry about duplicated work and inconsistent handling.

But your point about numeric indices makes it clear to me that index selection will need to be a little bit special no matter what---and index creation, too (#27030).

So, I'm coming around to eq-indx/conv-indx. An advantage of mirtype is that it will let us remove many infallible casts wholesale while simplifying our view of types at the MIR level, which will address our immediate problem. But we'll need something like conv-indx in general. (With conv-indx, you could avoid doing the work of #15476. Or, alternatively, it may be possible to implement conv-indx using the techniques from #15476.)

**Comment (Contributor, Author):**

Something that's been nagging at me, but I may simply be confused about: the hash functions for i32 and i64 are not going to be the same, i.e., sign extending n::i32 to signext(n)::i64 will likely not hash the same, so I think we'd need some careful logic here. I remember bringing something similar up to @frankmcsherry (I think using an index on n::i64 to lookup n+1 or some similarly invertible operation) and him pointing out why it wouldn't work, but I can't remember what the exact issue was.

**Comment (Contributor):**

I'm not really sure how eq-indx would work in a join when the actual representations are different. If we have an index on t1 with a Datum::Int32 key, and an index on t2 with a Datum::Int64 key, then at what moment could we do the cast? I think our current join rendering just does exact equality on the arrangement keys as they are, and there is currently no way to give it a closure or something which would sneakily convert between Int32 and Int64.

So, in addition to modifying the index selection in JoinImplementation, we'd need to somehow also modify the join rendering (for Differential join it's mz_join_core.rs, and for Delta join it's dogsdogsdogs::operators::half_join::half_join_internal_unsafe, in the DD repository). This should involve a solution to the hashing problem mentioned above, besides changing how the keys are equated (somewhere around the seek_key calls).

(Note that this problem didn't occur in #15476, because there it's only one side of the equality that is doing an index lookup, so there we just move the fiddling with the types to the other side of the equality.)

**Comment (Contributor):**

We discussed this a bit with @mgree, and the hashing issue might be worse than I thought. We are doing partitioned joins, i.e., both of the arrangements are partitioned by a hash, and pairs of rows that the join will emit need to be in the same partition. But now with a cast on one of the keys, the partitioning would need to be modified on one of the arrangements, and I'm not sure if there is any better way to do this than to rebuild the arrangement, which would completely defeat the purpose.

**Comment (Contributor, Author):**

Maybe @frankmcsherry and I can talk about how eq-indx would work some time this month? I agree that it seems complicated. 😔

**Comment (Contributor @ggevay, Jun 12, 2024):**

(Btw., not an ideal solution, but we could add a notice that would make it easier for users to find these implicit cast issues and manually fix them by tweaking their index keys to include the cast: https://github.com/MaterializeInc/materialize/issues/22133)

**Comment (Contributor):**

Oh, look what I found!
https://github.com/MaterializeInc/materialize/issues/10871
It's a really well-preserved archeological specimen: haven't been touched or seen in years, and even features the old EXPLAIN output format!

On a more serious note, suddenly finding this long-forgotten, really detailed discussion of these same problems that we are discussing now really demonstrates the cost of the churn of optimizer people...

**Comment (Contributor, Author):**

Interesting!

One possibility for `eq-indx` is to automatically coerce to "widest-possible" when doing hashing and ordering. That is, we would hash and order `int16`, `int32`, and so on all as if they were `int128` (and we would similarly hash and order `uint16`, `uint32`, and so on as if they were `uint128`). I don't think there's any storage cost to doing this, just a small computational cost. And then we'd be able to automatically use a wider index for narrower values.

We could even treat everything as uint128 for hashing purposes, but it probably isn't safe to treat signed ints as uint128 for ordering purposes (if we ever use ordering to express "real" ordering and not arbitrary ordering).
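A minimal sketch of the widest-possible idea, assuming we widen every signed integer to `i128` before hashing (illustrative only; `DefaultHasher` stands in for whatever row-hashing Materialize actually uses):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash every signed integer as i128, so a narrow value hashes identically
// to its sign-extended wider counterpart.
fn hash_widened(n: i128) -> u64 {
    let mut h = DefaultHasher::new();
    n.hash(&mut h);
    h.finish()
}

fn main() {
    let narrow: i32 = -42;
    let wide: i64 = -42;
    // Widening both to i128 before hashing makes the hashes agree, so an
    // i64-keyed index could serve lookups of i32 values.
    assert_eq!(hash_widened(narrow as i128), hash_widened(wide as i128));
}
```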

**Comment (Contributor):**

Hmm, that sounds good to me! But let's wait for some rendering expert to say how feasible this is.


The `remixvc` and `remix++` approaches address the problem at the user/adapter side.
Such rewriting feels a bit risky/annoying/confusing, and I'm not sure how it would interact with sources.

## Open questions

It's possible (as Nikhil worries) that something in the pgwire or WS layers calls `.typ()` on a `MirRelationExpr` and expects a `HirScalarType` but will now find a `MirScalarType`.
I don't know those layers well, but we could derisk some of the work of PR #2 by checking for that in advance.
(In any case, the Rust type checker will be eager to tell us about it.)