Mark drop_dataflow as stable #519
base: master
Conversation
Between 2673e8c and ff516c8, drop_dataflow appears to be working well, based on testing in Materialize (MaterializeInc/materialize#18442). So remove the "public beta" warning from the docstring. Fix TimelyDataflow#306.
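The change itself is documentation-only. Below is a rough, hypothetical sketch of its shape; `ToyWorker` and the doc wording are stand-ins for illustration, not quotes from timely's source.

```rust
// Hypothetical sketch: `ToyWorker` stands in for timely's real worker type and
// the doc text is illustrative. Stabilization amounts to deleting a
// "public beta" caveat from the doc comment; the method body is untouched.

pub struct ToyWorker {
    dataflows: Vec<Option<String>>, // stand-in for the worker's installed dataflows
}

impl ToyWorker {
    /// Drops an identified dataflow.
    ///
    /// (Before this change, the doc comment also carried a warning that the
    /// method was in "public beta" and might misbehave; stabilizing it means
    /// removing that warning rather than changing any behavior.)
    pub fn drop_dataflow(&mut self, dataflow_identifier: usize) {
        if let Some(slot) = self.dataflows.get_mut(dataflow_identifier) {
            *slot = None; // forget the dataflow; stand-in for the real teardown
        }
    }
}
```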
This is not going to merge without a great deal more attention, I'm afraid!
Great! :D I am personally quite nervous about using `drop_dataflow` until the remaining problems are sorted out.
For me, the issue is that "the remaining problems" are hard to enumerate: things in timely (and many distributed systems) work not because they have an enumeration of remaining problems that we burn down, but because the things have specific reasons that they are meant to work. We then debug any non-adherence to those specific reasons, but the reasons need to exist first. Here, `drop_dataflow` does not yet have such a reason. So, foremost of the remaining problems is the absence of an explanation of the basis on which `drop_dataflow` is meant to work.
Thanks. This is really helpful.
Here is a starting point for an explanation that we can fortify, or reject if it leaks from anywhere. To construct the equivalence, consider first a hypothetical `drop_dataflow_panic` that, instead of surgically removing one dataflow, takes the whole worker down:

```rust
// Pseudocode: `disable_network` stands in for severing this worker's connections.
fn drop_dataflow_panic() {
    disable_network();
    panic!();
}
```

We can now split our safety argument into two subgoals.

Property 1: calling `drop_dataflow_panic` is safe, in the sense that the rest of the cluster observes nothing more exotic than a worker that lost connectivity and crashed.

This gives us latitude to change `drop_dataflow_panic` into a gentler implementation, provided we maintain:

Property 2: the real `drop_dataflow` is indistinguishable, from the point of view of every other worker, from `drop_dataflow_panic`.

This split is important because changing `drop_dataflow`'s implementation then only requires re-establishing Property 2; Property 1 is about the failure model and does not depend on the code.

Timely's abstraction of inter-worker communication is provided by the channels each dataflow allocates, so the real implementation can be sketched as:

```rust
/// Disables the network of the given dataflow id and immediately drops it.
/// Panics if the dataflow is under construction.
fn drop_dataflow_real(&mut self, dataflow_id: usize) {
    // Sever this dataflow's channels so no further messages flow.
    for channel in &self.dataflow_channels[dataflow_id] {
        channel.disable();
    }
    if dataflow_under_construction {
        panic!();
    } else {
        self.dataflows.remove(dataflow_id);
    }
}
```

We have almost arrived at the conclusion that we want, but we have made a huge assumption, namely that the timely cluster is running a single dataflow. If the cluster is running more than one dataflow then the two functions are no longer equivalent: `drop_dataflow_panic` takes down every dataflow hosted by the worker, while `drop_dataflow_real` removes only the targeted one.

As far as I understand, this is the crux of the issue. If we can say that timely clusters run multiple dataflows only as an optimization, and the programmer should assume that each dataflow is its own failure domain that can crash independently, then the equivalence holds and `drop_dataflow` can be stabilized. If instead we want to guarantee to timely programmers that when a dataflow panics on a worker no other dataflow gets to run on that worker, further elaboration is needed and it's unclear whether `drop_dataflow` can provide it.

On its face, having each dataflow be its own failure domain sounds like a fine model to program against. The programmer can imagine that every time they call `drop_dataflow`, the virtual process hosting that dataflow crashes while all the others keep running.

@frankmcsherry would love to hear your thoughts on this analysis.
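To make the failure-domain model above concrete, here is a minimal sketch of a worker that installs two dataflows and then drops the first. It assumes `Worker::installed_dataflows()` as the way to learn dataflow identifiers; if the identifier-listing method differs, adjust accordingly.

```rust
use timely::dataflow::operators::{Inspect, Probe, ToStream};

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        // First dataflow: the one we will drop later.
        let probe1 = worker.dataflow::<u64, _, _>(|scope| {
            (0..10)
                .to_stream(scope)
                .inspect(|x| println!("dataflow 1 saw: {:?}", x))
                .probe()
        });

        // Second dataflow: under the failure-domain model it must be
        // unaffected by the first one being dropped.
        let probe2 = worker.dataflow::<u64, _, _>(|scope| {
            (0..10)
                .to_stream(scope)
                .inspect(|x| println!("dataflow 2 saw: {:?}", x))
                .probe()
        });

        // Run both dataflows to completion.
        while !probe1.done() || !probe2.done() {
            worker.step();
        }

        // Dropping dataflow 1 is "as if" the virtual process hosting it
        // crashed, while dataflow 2 keeps running on this worker.
        // Assumption: `installed_dataflows()` lists live dataflow identifiers.
        if let Some(&first) = worker.installed_dataflows().first() {
            worker.drop_dataflow(first);
        }

        worker.step(); // any remaining dataflows continue to make progress
    })
    .expect("timely execution failed");
}
```

The question under discussion is precisely whether the second dataflow's continued progress in a sketch like this is a guarantee the programmer may rely on, or merely the usual outcome.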
cc @teskje @antiguru @frankmcsherry
Just wanted to queue this up since MaterializeInc/materialize#18442 is inching closer to completeness. If there are remaining concerns about stabilizing `drop_dataflow` (beyond 2673e8c), I figured this would be a good way to draw them out. Marked as a draft, since we obviously shouldn't merge this until all those concerns are resolved.