
design: Unified Compute Introspection #27548

Merged 1 commit into MaterializeInc:main on Jun 19, 2024

Conversation

@teskje (Contributor) commented on Jun 10, 2024

This PR adds a design for Unified Compute Introspection (https://github.com/MaterializeInc/database-issues/issues/7898). The design is for an MVP that can be quickly implemented by the cluster team/me, and many of the design decisions reflect that, e.g., by preferring the simpler approach over the more flexible one.

Motivation

  • This PR adds a known-desirable feature.

Part of MaterializeInc/database-issues#7898.


@teskje requested review from antiguru and a team on June 10, 2024 at 14:07
- Whenever a replica connection is established, the compute controller installs all defined introspection subscribes on the replica, as replica-targeted subscribes.
- Whenever a batch of updates arrives for an introspection subscribe, the compute controller (a) prepends the replica ID to all updates and (b) sends the thus enriched updates to the storage controller, tagged with the corresponding `IntrospectionType`.
- The storage controller's `CollectionManager` appends the introspection updates to the collection associated with the given `IntrospectionType`.
- Whenever a replica disconnects, the compute controller cancels its introspection subscribes and instructs the storage controller to delete all introspection data previously emitted for this replica.
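
Below is a rough sketch of the controller-side flow described in the excerpt above. It uses simplified stand-in types (`Row` as a vector of strings, a placeholder `ReplicaId`, a hypothetical `StorageIntrospection` trait, and an illustrative `IntrospectionType` variant) rather than the real compute/storage controller APIs:

```rust
/// Simplified stand-ins for the real controller types; the actual
/// implementation lives in the compute and storage controllers.
type Row = Vec<String>;
type Diff = i64;

#[derive(Clone, Copy, Debug)]
struct ReplicaId(u64);

#[derive(Clone, Copy, Debug)]
enum IntrospectionType {
    // Hypothetical variant, for illustration only.
    ComputeErrorCounts,
}

/// A stand-in for the storage controller's append/delete interface.
trait StorageIntrospection {
    fn append(&mut self, ty: IntrospectionType, updates: Vec<(Row, Diff)>);
    fn delete_where(&mut self, ty: IntrospectionType, filter: Box<dyn Fn(&Row) -> bool>);
}

/// On receipt of a subscribe batch: prepend the replica ID to every row and
/// forward the enriched updates, tagged with the introspection type.
fn handle_subscribe_batch(
    storage: &mut dyn StorageIntrospection,
    replica_id: ReplicaId,
    ty: IntrospectionType,
    batch: Vec<(Row, Diff)>,
) {
    let enriched = batch
        .into_iter()
        .map(|(mut row, diff)| {
            row.insert(0, format!("{}", replica_id.0));
            (row, diff)
        })
        .collect();
    storage.append(ty, enriched);
}

/// On replica disconnect: ask the storage controller to retract everything
/// previously emitted for this replica, i.e. all rows carrying its ID.
fn handle_replica_disconnect(
    storage: &mut dyn StorageIntrospection,
    replica_id: ReplicaId,
    ty: IntrospectionType,
) {
    let id = format!("{}", replica_id.0);
    storage.delete_where(ty, Box::new(move |row| row.first() == Some(&id)));
}
```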
Contributor

Just to be clear, this means it would emit retractions for the data, so a SUBSCRIBE AS OF would still include data for the dropped replica, correct?

Contributor Author

Yes! In this regard the unified introspection relations behave like any system relation that contains a `replica_id` column (`mz_cluster_replicas`, `mz_cluster_replica_statuses`, ...).

@antiguru (Member) left a comment

Looks good; some comments around how we define the dataflows that serve subscribes.

Comment on lines +59 to +60
The `DataflowDescription`, since it is defined in the compute controller without access to the catalog or the optimizer, must be written using LIR constructs and can only reference introspection indexes, not builtin sources or views that may be defined in the catalog.
These are limitations that seem reasonable for an MVP.
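
For illustration only, here is a toy model of what a hand-written, LIR-style definition referencing an introspection index might look like. The `ToyPlan` type and its variants are invented for this sketch; the real `DataflowDescription` and LIR plan types carry far more structure:

```rust
/// A toy, heavily simplified model of an LIR-style plan. This only
/// illustrates the shape of a hand-written definition that reads from an
/// existing introspection index.
#[derive(Debug)]
enum ToyPlan {
    /// Read an existing (introspection) arrangement by its collection ID.
    GetIndex { id: u64 },
    /// Keep only rows satisfying a predicate, described textually here.
    Filter { input: Box<ToyPlan>, predicate: String },
    /// Project a subset of columns.
    Project { input: Box<ToyPlan>, columns: Vec<usize> },
}

fn main() {
    // A hand-written plan: scan a (hypothetical) dataflow-operator
    // introspection index, keep long-running operators, and project the
    // operator ID and elapsed-time columns.
    let plan = ToyPlan::Project {
        columns: vec![0, 2],
        input: Box::new(ToyPlan::Filter {
            predicate: "elapsed_ns > 1_000_000_000".into(),
            input: Box::new(ToyPlan::GetIndex { id: 42 }),
        }),
    };
    println!("{plan:#?}");
}
```

Even this small read-filter-project pipeline is verbose to spell out by hand, which is the limitation the excerpt accepts for the MVP.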
Member

Agree that that's OK for an MVP, but it's unlikely to be what we want for the final implementation. Expressing complicated queries, for example ones with reductions, quickly becomes hard to do by hand.

Member

A different take would be that all computations on introspection data should be done in the form of logging dataflows, because they run on all replicas and could have a resource impact. If we restrict the plans to essentially be subscribes to existing arrangements, this would be the lowest cost possible, because it allows us to hand-optimize the dataflows.

Contributor Author

That's true, but I doubt that handwriting a dataflow is any easier than writing the LIR definition (which you can avoid by adding a `dbg!(dataflow)` in the right place) :)

If we do this as a cost optimization, we'll want to avoid creating new indexes, so we'd have to introduce the concept of a "logging subscribe", i.e. an implicit subscribe that's installed on the replica at `CreateInstance` time. That seems feasible, but it isn't something we should pursue unless we observe a need. In contrast to the logging dataflows, the introspection subscribes proposed in this design will show up in introspection, so we'll be able to see when they use non-negligible resources.

Comment on lines +93 to +96
We propose extending the `CollectionManager`'s API to also accept deletion requests.
A deletion request contains a collection ID, as well as a `filter` closure `Box<dyn Fn(&Row) -> bool>`.
The `CollectionManager` handles a deletion request by reading back the target collection at its latest readable time, applying the `filter` closure to each read `Row`, and appending retractions for each `Row` the `filter` returned `true` for.
Append and deletion requests must be applied by the `CollectionManager` in the same order they have been issued by the client, to guarantee that deletions will retract all previously sent appends.
Member

What are the memory requirements for providing such a feature?

Contributor Author

This depends a lot on the implementation of the persist client.

With the naive implementation (build the retractions in memory, then append them):

  • In the "best case" the persist client streams in one update at a time and we only need to keep those belonging to the deleted replica, so the memory usage would be roughly the size of that replica's existing updates.
  • In the realistic case, the persist client probably introduces some overhead because it gives us the updates in batches, not one at a time. We can hope that this is constant overhead, but I'm not sure.
  • In the worst case we'd need to keep the whole collection snapshot in memory.

Rather than collecting retractions in memory, we can use a `BatchBuilder`, which will flush parts out to S3 when they get too large. If we make sure the unified introspection collections stay small, that's probably not an optimization we need.

Also note that the current Pv2 plan includes making the `CollectionManager` self-correcting (#27496), similar to the `persist_sink`, in which case it will need to read the whole snapshot into memory anyway.
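
As a concrete (and deliberately naive) sketch of the in-memory approach discussed above, assuming the snapshot has already been read through the persist client and the retractions are later appended via the `CollectionManager`:

```rust
type Row = Vec<String>;
type Diff = i64;

/// Naive deletion: materialize the collection snapshot, keep the rows the
/// filter matches, and emit them with negated diffs as retractions. Memory
/// usage is bounded by however much of the snapshot must be held at once,
/// which is the concern raised in this thread.
fn build_retractions(
    snapshot: Vec<(Row, Diff)>,
    filter: impl Fn(&Row) -> bool,
) -> Vec<(Row, Diff)> {
    snapshot
        .into_iter()
        .filter(|(row, _)| filter(row))
        .map(|(row, diff)| (row, -diff))
        .collect()
}
```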

@benesch (Member) left a comment

I was slow to get to this, but LGTM! Thanks very much for writing this up. Extremely helpful context for me.

Comment on lines +117 to +118
In the interest of moving fast we opt to not follow this approach for the MVP implementation.
If we find that the limitations of the controller-based approach are too great, we can revisit this decision.
Member

Makes sense to me.

This would allow the compute controller to mint its own collection identifiers without needing to synchronize with the coordinator.

While this would be a desirable change, also for a possible `ALTER MATERIALIZED VIEW ... SET CLUSTER` feature, the required refactoring work would be significant.
Again in the interest of moving fast, we opt for the simpler approach of sharing the `transient_id_gen` instead.
Member

(Already discussed and approved on Slack, but recording here for posterity my 👍🏽 for this decision.)

Comment on lines +84 to +96
The storage controller delegates the writing of storage-managed collections to a background task called the `CollectionManager`.
The `CollectionManager` accepts `GlobalId`s identifying storage collections and corresponding `(Row, Diff)` updates to be appended to these collections, and appends them at the current system time.
It does not yet expose a mechanism to delete previously written updates from a collection by reading back the collection contents, determining the necessary retractions, and appending them.

To remove introspection data for dropped/disconnected replicas from the unified introspection relations, we require such a deletion mechanism.
Note that the compute controller is not itself able to determine the set of necessary retractions.
While it can be taught to read the contents of storage-managed collections, it doesn't have any way of knowing the time as of which all previously emitted introspection data has been fully written to its target collection.
The compute controller would risk reading back the target collection's contents too soon and retract only part of the introspection data previously emitted for the disconnected replica.

We propose extending the `CollectionManager`'s API to also accept deletion requests.
A deletion request contains a collection ID, as well as a `filter` closure `Box<dyn Fn(&Row) -> bool>`.
The `CollectionManager` handles a deletion request by reading back the target collection at its latest readable time, applying the `filter` closure to each read `Row`, and appending retractions for each `Row` the `filter` returned `true` for.
Append and deletion requests must be applied by the `CollectionManager` in the same order they have been issued by the client, to guarantee that deletions will retract all previously sent appends.
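
A rough sketch of what the extended request interface could look like, with placeholder types standing in for the real `GlobalId`, `Row`, and `Diff`; the actual `CollectionManager` API is not pinned down by this excerpt:

```rust
type GlobalId = u64; // placeholder for the real identifier type
type Row = Vec<String>;
type Diff = i64;

/// Requests accepted by the `CollectionManager`, processed strictly in the
/// order they were issued so that a deletion retracts every append that
/// preceded it.
enum CollectionRequest {
    /// Append the given updates to the identified collection.
    Append {
        id: GlobalId,
        updates: Vec<(Row, Diff)>,
    },
    /// Read back the identified collection at its latest readable time and
    /// append retractions for every row the filter returns `true` for.
    Delete {
        id: GlobalId,
        filter: Box<dyn Fn(&Row) -> bool + Send>,
    },
}
```

Processing requests strictly in issue order is what guarantees that a `Delete` observes, and therefore retracts, every `Append` that preceded it.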
Member

I'm assuming you and @aljoscha have already chatted about this, as this has an interesting interaction with #27496. I'm +1 on the design you propose in isolation, but I have not yet thought deeply about how it interacts with the 0dt work.

Contributor Author

#27496 makes this a bit easier because it makes the `CollectionManager` keep the whole snapshot state in memory, so we can build the retractions from there directly. Eventually we'll still want to move to the design described here, once the `CollectionManager` learns to only keep the desired-persist diff around.

@teskje merged commit f1afc09 into MaterializeInc:main on Jun 19, 2024. 7 checks passed.
@teskje deleted the unified-introspection-design branch on June 19, 2024 at 08:38.