D3log Design

D3log is an effort to distribute differential-datalog computation among a set of compute nodes. This page serves as the design document for that effort.

                               +----------------------+
                               | Global Control Plane |
                               | (highly available)   |
                               +-/------/----\--------+
                           /-----     /-      -\
                    /------         /-          -\
              /-----               /              -\
  +---------------+              /-                 -\
  | Node 1        |             /                     \
  |---------------|           /-                       -\
  | Local Control ---       /-                           -\
  | Plane         |  \---  /                               -\
  |               |      /----                               -\
  +---------------+    /-     \----                            \
                      /            \---                         -\
                    /-                 \----                      -\
                   /                        \---                    -\
                 /-                             \----                 -\
               /-                                    \----      +---------------+
      +---------------+                                   \---  | Node 3        |
      | Node 2        |                                       \-----------------|
      |---------------|                                         | Local Control |
      | Local Control ------------------------------------------| Plane         |
      | Plane         |                                         |               |
      |               |                                         +---------------+
      +---------------+

(https://textik.com/#89fae6998c0d415f)

At a high level, we have a global control plane that acts as the orchestrating entity. It is supposed to have the full view of the system. High availability is ensured through replication or similar means.

Connections between the global control plane and the local ones, as well as amongst the local ones themselves, are established over TCP.
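As a concrete illustration, here is a minimal sketch of how messages could be framed on such a TCP connection. The length-prefixed serde/bincode encoding is purely an assumption; the actual wire format is not specified in this document.

```rust
// Minimal framing sketch: one length-prefixed, bincode-encoded message
// per frame. The encoding is an assumption, not a design decision.
use std::io::{Read, Write};
use std::net::TcpStream;

use serde::{de::DeserializeOwned, Serialize};

/// Write one length-prefixed message to the stream.
pub fn send_msg<T: Serialize>(stream: &mut TcpStream, msg: &T) -> std::io::Result<()> {
    let bytes = bincode::serialize(msg).expect("message is serializable");
    stream.write_all(&(bytes.len() as u32).to_be_bytes())?;
    stream.write_all(&bytes)
}

/// Read one length-prefixed message from the stream.
pub fn recv_msg<T: DeserializeOwned>(stream: &mut TcpStream) -> std::io::Result<T> {
    let mut len = [0u8; 4];
    stream.read_exact(&mut len)?;
    let mut buf = vec![0u8; u32::from_be_bytes(len) as usize];
    stream.read_exact(&mut buf)?;
    Ok(bincode::deserialize(&buf).expect("message is well-formed"))
}
```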

System Instantiation

Multiple Phases:

1) "system" boot

  • once the global controller has booted, it either waits for incoming connections from nodes OR
  • it just boots and waits for the program to arrive -> strictly speaking there is no need for the controller to be informed that nodes have booted; it could simply retry connecting to them once it needs to -> on the other hand, in order to connect to them, the nodes need to listen on a specific port; we could remove this dependency by making nodes connect to the global control plane instead
  • but that would mean local control planes need to know where to find the global one (i.e., they need some configuration), which is something we probably want to avoid unless we really require it

=> Design: local nodes boot up and listen on a specific port on which the local control plane accepts connections from the global one
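A minimal sketch of this decision from the node's side, assuming a well-known control port (the port number and the message handling are placeholders):

```rust
// The local control plane binds a well-known port and waits for the
// global control plane to connect; it never dials out itself.
use std::net::{TcpListener, TcpStream};

/// Assumed well-known port for the control channel.
const CONTROL_PORT: u16 = 7999;

/// Handle Ping/Adjust/Run messages arriving from the global plane.
fn serve_control_channel(stream: TcpStream) {
    let _ = stream;
    todo!("dispatch control messages")
}

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind(("0.0.0.0", CONTROL_PORT))?;
    // Accept in a loop so the global plane can reconnect after a failure.
    for stream in listener.incoming() {
        serve_control_channel(stream?);
    }
    Ok(())
}
```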

2) program is pushed to global control plane (by user, or ...)

  • the program contains the addresses of the nodes (IP/DNS name) and the program to run on each node. TODO: Perhaps "actual" node names are unnecessary in the program itself?

  • the global control plane connects to each node, reporting an error if that connection fails (it could also retry for a certain amount of time)

    • this connection constitutes some form of control channel

    • that control channel comes with a heartbeat to detect failed nodes

    • list of supported messages on the control channel (see the sketch after this list):

      • Ping & Pong // ping the other end and expect a pong within a certain deadline
      • Adjust(state) // set the new desired state, i.e., how to connect to other nodes etc. (can be called before and after Run)
      • Run(prog) // start the computation running the given program

      -> state above essentially needs to describe the connections between nodes and what data moves from where to where

      • this data needs to be inferred from the program passed to the control plane
      • a program could look as follows:
        • prog: (prog1, node1, out1: [p1_1_out: [node2, ...], p1_2_out: [node3], ...]) -> that is:
          • prog1 is the actual program
          • node1 is the DNS name of the node this program runs on
          • out1 is some description of the output relations and how they should connect to other nodes (also covered by the sketch after this list)

    TODO: Do we want to fit all of that into ddlog relations?
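The message set and the program description above could be written down as plain Rust types roughly as follows. All type and field names are illustrative, and whether this state is better expressed as ddlog relations is exactly the open TODO above.

```rust
// Illustrative types for the control channel and the program
// description; names and representations are assumptions.
use std::collections::HashMap;

use serde::{Deserialize, Serialize};

/// How one output relation of a node's program feeds other nodes.
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct OutputSpec {
    /// Name of the output relation, e.g. "p1_1_out".
    pub relation: String,
    /// DNS names of the nodes this relation's data is sent to.
    pub consumers: Vec<String>,
}

/// One (prog1, node1, out1) entry of the program.
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct NodeProgram {
    /// The actual program (kept opaque here).
    pub program: Vec<u8>,
    /// DNS name (or IP address) of the node this program runs on.
    pub node: String,
    /// Description of the output relations and their consumers.
    pub outputs: Vec<OutputSpec>,
}

/// Desired local state: which relations flow to which peer nodes;
/// derived by the global control plane from the program.
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct State {
    pub peers: HashMap<String, Vec<String>>,
}

/// Messages exchanged on the control channel.
#[derive(Clone, Debug, Serialize, Deserialize)]
pub enum ControlMsg {
    /// Liveness probe; the peer must answer with `Pong` within a deadline.
    Ping,
    Pong,
    /// Set the new desired state (valid before and after `Run`).
    Adjust(State),
    /// Start the computation running the given program.
    Run(Vec<u8>),
}
```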

2.1)

  • description of the desired local configuration is pushed to each node over the control channel (via Adjust message)
  • program to run is pushed to each node and then started (Run)
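A sketch of this push sequence from the global control plane's side, reusing the framing helpers and types from the earlier sketches; CONTROL_PORT and the derivation of the per-node state are placeholders:

```rust
use std::net::TcpStream;

/// Phase 2.1: push the desired configuration (Adjust) and then the
/// program (Run) to every node named in the program description.
fn deploy(programs: &[NodeProgram]) -> std::io::Result<()> {
    for entry in programs {
        // Open the control channel to the node's well-known port.
        let mut chan = TcpStream::connect((entry.node.as_str(), CONTROL_PORT))?;
        // Describe the desired local configuration first ...
        let state = State { peers: Default::default() }; // would be derived from entry.outputs
        send_msg(&mut chan, &ControlMsg::Adjust(state))?;
        // ... then push the program and start it.
        send_msg(&mut chan, &ControlMsg::Run(entry.program.clone()))?;
    }
    Ok(())
}
```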

3) Failure

  • Node failure:
    • detected via heartbeat timeout (see the sketch after this list)
    • that mechanism can't detect malicious behavior or the like, but that is out of scope
    • heartbeat timeouts may also trigger on slow nodes
    • or in case of a network partition -> in both cases data may still be pushed to the nodes; can that be a problem? It probably should not be.
  • Network partition:
    • a partition could occur between the global control plane and the local ones, just between local ones, or anything in between
    • it is unclear how we deal with that
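A sketch of the heartbeat described above, again reusing the earlier helpers; the interval, the deadline, and the reaction to a missed deadline are placeholders, and, as noted, a slow or partitioned node is indistinguishable from a crashed one here:

```rust
use std::io;
use std::net::TcpStream;
use std::thread;
use std::time::Duration;

/// How often to ping and how long to wait for the answer (placeholders).
const PING_INTERVAL: Duration = Duration::from_secs(5);
const PONG_DEADLINE: Duration = Duration::from_secs(2);

/// Ping the node over its control channel until it misses a deadline,
/// at which point it is reported as failed.
fn heartbeat(node: &str, chan: &mut TcpStream) -> io::Result<()> {
    chan.set_read_timeout(Some(PONG_DEADLINE))?;
    loop {
        send_msg(chan, &ControlMsg::Ping)?;
        match recv_msg::<ControlMsg>(chan) {
            Ok(ControlMsg::Pong) => (), // node answered in time
            // Timeout, connection error, or unexpected message: treat the
            // node as failed (it may merely be slow or partitioned).
            _ => {
                eprintln!("node {} missed the heartbeat deadline", node);
                return Err(io::Error::new(io::ErrorKind::TimedOut, "heartbeat"));
            }
        }
        thread::sleep(PING_INTERVAL);
    }
}
```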