design: Unified Compute Introspection #27548
Conversation
- Whenever a replica connection is established, the compute controller installs all defined introspection subscribes on the replica, as replica-targeted subscribes.
- Whenever a batch of updates arrives for an introspection subscribe, the compute controller (a) prepends the replica ID to all updates and (b) sends the enriched updates to the storage controller, tagged with the corresponding `IntrospectionType` (see the sketch below).
- The storage controller's `CollectionManager` appends the introspection updates to the collection associated with the given `IntrospectionType`.
- Whenever a replica disconnects, the compute controller cancels its introspection subscribes and instructs the storage controller to delete all introspection data previously emitted for this replica.
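For concreteness, here is a minimal sketch of steps (a) and (b) above. All types (`Row`, `Diff`, `IntrospectionType`, `StorageHandle`) are simplified stand-ins for illustration, not the actual controller types:

```rust
// Hypothetical, simplified sketch of the update-enrichment step.
type Diff = i64;

struct Row(Vec<String>);

enum IntrospectionType {
    ComputeErrorCounts,
    ComputeHydrationTimes,
}

struct StorageHandle;

impl StorageHandle {
    fn append_introspection_updates(&self, _ty: IntrospectionType, _updates: Vec<(Row, Diff)>) {
        // In the real system this hands the updates to the storage controller,
        // which routes them to the `CollectionManager` for the collection
        // associated with the given `IntrospectionType`.
    }
}

fn forward_subscribe_batch(
    storage: &StorageHandle,
    replica_id: &str,
    ty: IntrospectionType,
    batch: Vec<(Row, Diff)>,
) {
    // (a) prepend the replica ID to every row ...
    let enriched: Vec<(Row, Diff)> = batch
        .into_iter()
        .map(|(Row(mut cols), diff)| {
            cols.insert(0, replica_id.to_string());
            (Row(cols), diff)
        })
        .collect();
    // ... and (b) forward the enriched batch, tagged with its `IntrospectionType`.
    storage.append_introspection_updates(ty, enriched);
}
```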
Just to be clear, this means it would emit retractions for the data, so a `SUBSCRIBE AS OF` would still include data for the dropped replica, correct?
Yes! In this regard the unified introspection relations behave like any system relation that contains a `replica_id` column (`mz_cluster_replicas`, `mz_cluster_replica_statuses`, ...).
Looks good, some comments around how we define the dataflows to serve subscribes.
The `DataflowDescription`, since it is defined in the compute controller without access to the catalog or the optimizer, must be written using LIR constructs and can only reference introspection indexes, not builtin sources or views that may be defined in the catalog (see the illustrative sketch below).
These are limitations that seem reasonable for an MVP.
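To give a sense of what "written using LIR constructs" means here, a rough sketch with a simplified, made-up plan type; this is not Materialize's actual LIR or `DataflowDescription` API, and the index name and column layout are invented:

```rust
/// Simplified stand-in for an LIR-level plan, only to illustrate the shape of
/// what the compute controller has to spell out by hand.
enum LirPlan {
    /// Read from an existing (introspection) index.
    Get { index: String },
    /// Keep rows whose column `column` equals `value`.
    FilterEq { input: Box<LirPlan>, column: usize, value: String },
    /// Keep only the listed columns, in order.
    Project { input: Box<LirPlan>, columns: Vec<usize> },
}

/// Roughly "SELECT col0, col2 FROM <introspection index> WHERE col1 = '0'",
/// written out as a plan because no SQL frontend or optimizer is available in
/// the controller. Index name and column positions are made up.
fn example_introspection_plan() -> LirPlan {
    LirPlan::Project {
        columns: vec![0, 2],
        input: Box::new(LirPlan::FilterEq {
            column: 1,
            value: "0".into(),
            input: Box::new(LirPlan::Get {
                index: "mz_compute_hydration_times_per_worker".into(),
            }),
        }),
    }
}
```

Even this small example suggests why anything involving reductions or joins quickly becomes unwieldy to hand-write, which is the concern raised in the comments below.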
Agree that that's OK for an MVP, but it's unlikely to be what we want for the final implementation. Expressing complicated queries, for example ones with reductions, is increasingly hard to do manually.
A different take would be that all computations on introspection data should be done in the form of logging dataflows because they're running on all replicas and could have resource impact. If we restrict the plans to essentially be subscribes to existing arrangements, this would be the lowest cost possible because it allows us to hand-optimize the dataflows.
That's true, but I doubt that handwriting a dataflow is any easier than writing the LIR definition (which you can avoid by adding a `dbg!(dataflow)` in the right place) :)
If we do this as a cost optimization, we'll want to avoid creating new indexes, so we'd have to introduce the concept of a "logging subscribe", i.e. an implicit subscribe that's installed on the replica at `CreateInstance` time. Seems feasible, but not something we should consider if we don't observe a need. In contrast to the logging dataflows, the introspection subscribes proposed in this design will show up in introspection, so we'll be able to see when they use non-negligible resources.
We propose extending the `CollectionManager`'s API to also accept deletion requests (a sketch follows below).
A deletion request contains a collection ID, as well as a `filter` closure `Box<dyn Fn(&Row) -> bool>`.
The `CollectionManager` handles a deletion request by reading back the target collection at its latest readable time, applying the `filter` closure to each read `Row`, and appending retractions for each `Row` the `filter` returned `true` for.
Append and deletion requests must be applied by the `CollectionManager` in the same order they were issued by the client, to guarantee that deletions retract all previously sent appends.
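A possible shape for this API extension, sketched with hypothetical names and channel plumbing (this is not the actual `CollectionManager` interface):

```rust
use std::sync::mpsc;

// Stand-in types for illustration.
type Diff = i64;

#[derive(Clone)]
struct Row(Vec<String>);

#[derive(Clone, Copy)]
struct GlobalId(u64);

/// Requests handled by the collection-manager task. Funneling appends and
/// deletions through one queue preserves per-client ordering, so a deletion
/// retracts exactly the appends issued before it.
enum Request {
    Append {
        id: GlobalId,
        updates: Vec<(Row, Diff)>,
    },
    Delete {
        id: GlobalId,
        /// Rows for which the filter returns `true` are retracted.
        filter: Box<dyn Fn(&Row) -> bool + Send>,
    },
}

struct CollectionManagerClient {
    tx: mpsc::Sender<Request>,
}

impl CollectionManagerClient {
    fn append(&self, id: GlobalId, updates: Vec<(Row, Diff)>) {
        let _ = self.tx.send(Request::Append { id, updates });
    }

    fn delete(&self, id: GlobalId, filter: Box<dyn Fn(&Row) -> bool + Send>) {
        let _ = self.tx.send(Request::Delete { id, filter });
    }
}

/// Inside the manager task: given a snapshot of the collection read at its
/// latest readable time, produce retractions for all rows matching the filter.
fn retractions_for(
    snapshot: &[(Row, Diff)],
    filter: &dyn Fn(&Row) -> bool,
) -> Vec<(Row, Diff)> {
    snapshot
        .iter()
        .filter(|(row, _)| filter(row))
        .map(|(row, diff)| (row.clone(), -diff))
        .collect()
}
```

A replica-disconnect cleanup would then look roughly like `client.delete(collection_id, Box::new(move |row| row.0[0] == replica_id))`, retracting every row previously emitted for that replica.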
What are the memory requirements for providing such a feature?
This depends a lot on the implementation of the persist client.
With the naive implementation (build the retractions in memory, then append them):
- In the "best case" the persist client streams in one update at a time and we only need to keep those belonging to the deleted replica, so the memory usage would be equal to the size of existing updates of that replica.
- In the realistic case, the persist client probably introduces some overhead because it gives us the updates in batches, not one at a time. We can hope that this is constant overhead, but I'm not sure.
- In the worst case we'd need to keep the whole collection snapshot in memory.
Rather than collecting retractions in memory, we can use a `BatchBuilder`, which will flush out parts to S3 when they get too large. If we make sure the unified introspection collections stay small, that's probably not an optimization we need.
Also note that the current Pv2 plan includes making the `CollectionManager` self-correcting (#27496), similar to the `persist_sink`, in which case it will need to read the whole snapshot into memory anyway.
I was slow to get to this, but LGTM! Thanks very much for writing this up. Extremely helpful context for me.
In the interest of moving fast, we opt not to follow this approach for the MVP implementation.
If we find that the limitations of the controller-based approach are too great, we can revisit this decision.
Makes sense to me.
This would allow the compute controller to mint its own collection identifiers without needing to synchronize with the coordinator.

While this would be a desirable change, also for a possible `ALTER MATERIALIZED VIEW ... SET CLUSTER` feature, the required refactoring work would be significant.
Again in the interest of moving fast, we opt for the simpler approach of sharing the `transient_id_gen` instead.
(Already discussed and approved on Slack, but recording here for posterity my 👍🏽 for this decision.)
The storage controller delegates the writing of storage-managed collections to a background task called the `CollectionManager`.
The `CollectionManager` accepts `GlobalId`s identifying storage collections and corresponding `(Row, Diff)` updates to be appended to these collections, and appends them at the current system time.
It does not yet expose a mechanism to delete previously written updates from a collection by reading back the collection contents, determining the necessary retractions, and appending them.

To remove introspection data for dropped/disconnected replicas from the unified introspection relations, we require such a deletion mechanism.
Note that the compute controller is not itself able to determine the set of necessary retractions.
While it can be taught to read the contents of storage-managed collections, it has no way of knowing the time as of which all previously emitted introspection data has been fully written to its target collection.
The compute controller would risk reading back the target collection's contents too soon and retracting only part of the introspection data previously emitted for the disconnected replica.

We propose extending the `CollectionManager`'s API to also accept deletion requests.
A deletion request contains a collection ID, as well as a `filter` closure `Box<dyn Fn(&Row) -> bool>`.
The `CollectionManager` handles a deletion request by reading back the target collection at its latest readable time, applying the `filter` closure to each read `Row`, and appending retractions for each `Row` the `filter` returned `true` for.
Append and deletion requests must be applied by the `CollectionManager` in the same order they were issued by the client, to guarantee that deletions retract all previously sent appends.
#27496 makes this a bit easier because it makes the `CollectionManager` keep the whole snapshot state in memory, so we can build the retractions from there directly. Eventually we'll still want to move to the design described here, when the `CollectionManager` learns to only keep the `desired` - `persist` diff around (a small sketch of that correction follows below).
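To illustrate what that self-correcting mode computes, here is a small hypothetical sketch of deriving the `desired` - `persist` correction from in-memory state (the data structures are stand-ins, not the #27496 implementation):

```rust
use std::collections::BTreeMap;

/// Stand-ins: keyed rows with accumulated multiplicities.
type Row = Vec<String>;
type Diff = i64;

/// Compute the updates that must still be appended so that the persisted
/// contents match the desired contents: conceptually the `desired` - `persist`
/// diff mentioned above.
fn correction(desired: &BTreeMap<Row, Diff>, persisted: &BTreeMap<Row, Diff>) -> Vec<(Row, Diff)> {
    let mut updates = Vec::new();
    for (row, &want) in desired {
        let have = persisted.get(row).copied().unwrap_or(0);
        if want != have {
            updates.push((row.clone(), want - have));
        }
    }
    for (row, &have) in persisted {
        if !desired.contains_key(row) {
            // Row should no longer exist: retract it entirely.
            updates.push((row.clone(), -have));
        }
    }
    updates
}
```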
This PR adds a design for Unified Compute Introspection (https://github.com/MaterializeInc/database-issues/issues/7898). The design is for an MVP that can be quickly implemented by the cluster team/me, and many of the design decisions reflect that, e.g., by preferring the simpler approach over the more flexible one.
Motivation
Part of MaterializeInc/database-issues#7898.
Checklist
- If this PR evolves an existing `$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.