-
Notifications
You must be signed in to change notification settings - Fork 122
D3log Design
D3log is an effort to distribute differential-datalog
computation among a set of compute nodes. This document acts as the
design space exploration document.
We identified two possible architectures. In each nodes are connected through TCP.
The first architecture has a global control plane that is assumed to be highly available and that is orchestrating changes to the system. The entity is assumed to have the full view of the system. High availability is ensured through replication or similar means.
+----------------------+
| Global Control Plane |
| (highly available) |
+-/------/----\--------+
/----- /- -\
/------ /- -\
/----- / -\
+---------------+ /- -\
| Node 1 | / \
|---------------| /- -\
| Local Control --- /- -\
| Plane | \--- / -\
| | /---- -\
+---------------+ /- \---- \
/ \--- -\
/- \---- -\
/ \--- -\
/- \---- -\
/- \---- +---------------+
+---------------+ \--- | Node 3 |
| Node 2 | \-----------------|
|---------------| | Local Control |
| Local Control ------------------------------------------| Plane |
| Plane | | |
| | +---------------+
+---------------+
(https://textik.com/#89fae6998c0d415f)
It is imaginable that the global control plane is implemented using a replicated state machine that runs on actual compute nodes, i.e., that resources are shared between control plane and compute nodes.
An alternative approach uses a distributed key-value store (DKVS) as the source of truth pertaining the system state. There would be no global controller in this model, rather, all nodes subscribe to updates on this store and would react to them locally. A selective subscription mechanism, i.e., one that allows subscription to a certain namespace or set of keys, would reduce the amount of chatter and unnecessary computation happening. E.g., certain nodes may only be interested in changes to certain keys and with a selective subscription approach they would only receive updates for those as opposed to also receiving updates for unrelated changes which they would have to identify as irrelevant.
The main difference to a central controller approach is that there is no central "active" entity taking care of the orchestration. Instead, that logic is split amongst the individual compute nodes.
Similar to the global control plane approach, the distributed key-value store itself could share resources with the actual computation (i.e., run on the same nodes).
For the sake of simplicity, we decided to go with the distributed controller approach. Not having a global controller perhaps most importantly eliminates cases where the controller is partitioned off from the rest of the system. With a central controller approach, it is not clear how we would be able to keep the computation running in such a case, because the central controller would interpret such a partition as failure of those nodes and recreate the relevant part of the program.
It also eliminates the need to implement a "control channel", as we would really only need a connection to the DKVS. On the other hand, now we have to deal with creating a proper schema for how to represent our state in the DKVS. We probably would need something similar anyway at some point, but for an initial version that presumably could be more ad hoc.
Multiple Phases:
- once global controller booted it waits for incoming connections from nodes OR
- it just boots and then waits for the program to arrive -> strictly speaking there is no need to get informed about nodes having booted, the controller could just retry connecting to them once it needs to -> on the other hand, in order to connect to them they need to listen on a specific port; we could cut out this dependency by making nodes connect to the global control plane
- but that would mean local control planes need to know where to find the global one (i.e., they need some configuration), which is probably something we want to avoid unless we really require it
=> Design: local nodes boot up and listen on a specific port on which the local control plane accepts connections from the global one
-
program contains addresses to nodes (IP/DNS name) and program to run on each node TODO: Perhaps "actual" node names are unnecessary in the program itself?
-
global control plane connects to each node, reporting an error if that connection fails (it could retry for a certain amount of time as well or whatever)
-
this connection constitutes some form of control channel
-
that control channel comes with a heartbeat to detect failed nodes
-
list of supported messages on the control channel:
-
Ping
&Pong
// ping the other end and expect a pong within a // certain deadline -
Adjust(state)
// set new desired state, i.e., how to connect // to other nodes etc. (can be called before and // afterRun
-
Run(prog)
// start the computation running the given // program
->
state
above essentially needs to describe the connections between nodes and what data moves from where to where- this data needs to be inferred from the program passed to the control plane
- a program could look as follows:
prog:
- (prog1, node1, out1: [p1_1_out: [node2, ...], p1_2_out: [node3], ...])
-> that is:
-
prog1
is the actual program -
node1
is the DNS name of the node this program runs on -
out1
is some description of the output relations and how they should connect to other nodes
-
- (prog1, node1, out1: [p1_1_out: [node2, ...], p1_2_out: [node3], ...])
-> that is:
-
TODO: Do we want to fit all of that into
ddlog
relations? -
2.1)
- description of the desired local configuration is pushed to each
node over the control channel (via
Adjust
message) - program to run is pushed to each node and then started (
Run
)
- Node failure:
- detected via heartbeat timeout
- that mechanism can't detect malicious behavior or anything like that which is out of scope
- heartbeat timeouts may actually also trigger on slow nodes
- or in case of network partition -> in both cases data may still be pushed to nodes; can that be a problem? I guess it should not be?
- Network partition:
- could be between global control plane and local ones, just between local ones, or anything in between
- unclear how we deal with that