This repository has been archived by the owner on Jun 7, 2024. It is now read-only.
Justin Weber edited this page Jul 28, 2021 · 17 revisions

Welcome to the Distributed Nodeworks wiki!

Description

Support the development of a distributed workflow (i.e., coordinated execution of multiple tasks performed by different computers) by developing a prototype task management server. This server will accept a diverse range of workflows from an application like Nodeworks (https://mfix.netl.doe.gov/nodeworks/) and manage a collection of task queues that are populated based on the workflow requirements. Endpoints, or workers, running on distributed hardware will accept and execute tasks, then publish the results back to the server so that the next highest priority task in the workflow can be executed.

Recommended prerequisites: Familiarity with intermediate to advanced Python, Git, database server concepts, and other tools like Flask, plus some patience and a willingness to explore new things.

Deliverables

  • Flask server with a REST API that accepts new users, generates tokens, and accepts Nodeworks workflow files (JSON).
  • Server that populates task queues based on the individual resource requirements in the Nodeworks files, ensuring that workflows are executed in the correct order based on their dependencies.
  • Endpoints that continuously poll the task queues and execute queued jobs (Bash, Python, SLURM batch queuing, etc.) on distributed resources, both local and in the cloud (AWS, GCP, Azure, etc.).
  • React front end for displaying job status, endpoints, and queue tasks.
  • React front end that displays and allows editing of the workflow.
  • React front end that displays the estimated start time of execution, or the estimated completion time of the whole workflow, based on historical execution data and user-prescribed workflow requirements.
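
One possible way to produce the estimate in the last deliverable is sketched below. It assumes that historical run times (in seconds) are available per node and that enough runners exist for independent nodes to run in parallel; neither assumption comes from this page. The function name `estimate_completion` and the shape of `dependencies` and `history` are hypothetical, and the standard-library `graphlib` module (Python 3.9+) is used here only as a stand-in for the project's own DAG solver.

```python
from graphlib import TopologicalSorter
from statistics import mean


def estimate_completion(dependencies, history):
    """Estimate when each node finishes, in seconds after the workflow starts.

    dependencies: maps a node name to the node names it depends on (hypothetical shape).
    history: maps a node name to a list of past run times in seconds (hypothetical shape).
    """
    finish = {}
    # static_order() yields every node after all of its dependencies.
    for node in TopologicalSorter(dependencies).static_order():
        # A node can start once its slowest dependency has finished.
        start = max((finish[dep] for dep in dependencies.get(node, ())), default=0.0)
        finish[node] = start + mean(history[node])
    return finish


deps = {"load": [], "clean": ["load"], "plot": ["clean"], "stats": ["clean"]}
past = {"load": [10, 14], "clean": [30], "plot": [5, 7], "stats": [20]}
estimate = estimate_completion(deps, past)
print(max(estimate.values()))  # estimated duration of the whole workflow: 62.0
```

With a limited number of runners the real completion time would be longer, so this figure is better read as a lower bound.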

Specification

Orchestrator (Backend)

The Orchestrator is the main communication point. It stores and executes the workflow DAGs, allowing authenticated users to submit edits. Each change to the DAG is tracked using git, providing git hashes that are directly tied to workflow artifacts, enabling provenance of the workflow results.

The Orchestrator also determines the execution order and populates the queue. Tasks that have similar runner requirements are grouped together as much as possible to reduce runner overhead.
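
As a rough illustration of the grouping idea, the sketch below buckets tasks by their set of required runner tags so that tasks with identical requirements can be handed to the same runner together. The task dictionaries and their `name` and `tags` keys are hypothetical; the page does not define a task format.

```python
from collections import defaultdict


def group_by_requirements(tasks):
    """Group task names by the exact set of runner tags each task requires."""
    groups = defaultdict(list)
    for task in tasks:
        # frozenset makes the tag set hashable and independent of tag order.
        groups[frozenset(task["tags"])].append(task["name"])
    return dict(groups)


tasks = [
    {"name": "mesh", "tags": ["docker", "linux"]},
    {"name": "solve", "tags": ["slurm"]},
    {"name": "post", "tags": ["linux", "docker"]},
]
for tags, names in group_by_requirements(tasks).items():
    print(sorted(tags), names)
```

Grouping on exact tag sets is the simplest choice; a real Orchestrator would still have to respect the dependency order inside and across groups.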

Resources (similar projects)

DAG solver

A simple directed acyclic graph (DAG) solver will need to be developed to determine the order in which the nodes should be executed, based on the individual dependencies of each node.
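
A minimal sketch of such a solver, using Kahn's algorithm, is shown below. It is not the project's implementation: the function name `execution_order` and the `dependencies` mapping (node name to the set of node names it depends on) are assumptions made for illustration.

```python
from collections import deque


def execution_order(dependencies):
    """Return node names in an order that respects their dependencies."""
    # Include nodes that only ever appear as someone else's dependency.
    nodes = set(dependencies)
    for upstream in dependencies.values():
        nodes.update(upstream)

    # Count unmet dependencies per node and record which nodes wait on each node.
    remaining = {node: len(set(dependencies.get(node, ()))) for node in nodes}
    dependents = {node: [] for node in nodes}
    for node, upstream in dependencies.items():
        for parent in set(upstream):
            dependents[parent].append(node)

    # Start with the nodes that depend on nothing (sorted for a stable result).
    ready = deque(sorted(n for n, count in remaining.items() if count == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in sorted(dependents[node]):
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)

    # Any node left unscheduled is part of a cycle.
    if len(order) != len(nodes):
        raise ValueError("workflow contains a cycle; no valid execution order")
    return order


example = {"load": set(), "clean": {"load"}, "plot": {"clean"}, "report": {"clean", "plot"}}
print(execution_order(example))  # ['load', 'clean', 'plot', 'report']
```

The standard library's `graphlib.TopologicalSorter` (Python 3.9+) provides the same ordering and could replace this function if a hand-written solver turns out not to be needed.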

Workflow files

Need a better name for these files. Ideas?

The files that describe the nodes and the connections are JSON files. They contain all the information required to load a workflow. There are two primary kinds of object in the files: nodes and connections.

Nodes look like this:

{
'type':        'Node', # type: Node or Connection
'name':        <str>,
'uniquename':  <str>,
'pos':         [x position <float>,
                y position <float>],
'path':        [<str>,],  # path to the actual node; ['numpy', 'array']
'terminals':   {},  # dict of terminals
'tunnel':      <bool>,
'forcerun':    <bool>,
'hidden':      <bool>,
'customState': {},  # node specific information
'layer':       <int>
}

Connections look like this:

{
"type": "Connection",  # type: Node or Connection
"name": <str>,  
"line": "cubic",  # line style: line/cubic
"uniquename": <str>,
"input": [  # input connection
    <str>,  # node unique name
    <str>   # terminal name
],
"output": [  # output connection
    <str>,   # node unique name
    <str>    # terminal name
],
"controlpoints": [], # list of control points to move the line
"feedback": false
}
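
The sketch below shows one way a server could read such a file and check that every connection refers to nodes that exist. It assumes the file is a flat JSON list of node and connection objects, which this page does not state; the function name `split_workflow`, the example unique names, and the terminal names are likewise hypothetical.

```python
import json


def split_workflow(text):
    """Split workflow JSON text into nodes and connections keyed by unique name."""
    nodes, connections = {}, {}
    for item in json.loads(text):  # assumption: top level is a list of objects
        if item["type"] == "Node":
            nodes[item["uniquename"]] = item
        elif item["type"] == "Connection":
            connections[item["uniquename"]] = item
        else:
            raise ValueError(f"unknown object type: {item['type']!r}")

    # Both ends of every connection should point at a node present in the file.
    for name, conn in connections.items():
        for end in ("input", "output"):
            node_name, _terminal = conn[end]
            if node_name not in nodes:
                raise ValueError(
                    f"connection {name!r} references missing node {node_name!r}"
                )
    return nodes, connections


# Hypothetical, heavily trimmed workflow file.
EXAMPLE = """
[
  {"type": "Node", "name": "array", "uniquename": "array.0", "path": ["numpy", "array"]},
  {"type": "Node", "name": "sum", "uniquename": "sum.0", "path": ["numpy", "sum"]},
  {"type": "Connection", "name": "c", "uniquename": "connection.0",
   "input": ["sum.0", "a"], "output": ["array.0", "out"], "feedback": false}
]
"""
nodes, connections = split_workflow(EXAMPLE)
print(sorted(nodes), sorted(connections))
```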

Queue

The queue contains a collection of tasks that the runners need to execute. This queue is populated and managed by the Orchestrator.
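
A minimal in-memory sketch of such a queue is given below, assuming each task carries a numeric priority where a larger number means more urgent. The class name `TaskQueue` and its methods are hypothetical; a production queue would more likely live in the database or in a message broker.

```python
import heapq
import itertools


class TaskQueue:
    """Priority queue of tasks; ties are served in insertion order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker that preserves insertion order

    def push(self, task, priority=0):
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), task))

    def pop(self):
        """Return the highest-priority task, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

    def clear(self):
        self._heap.clear()

    def __len__(self):
        return len(self._heap)


queue = TaskQueue()
queue.push("post-process", priority=1)
queue.push("solve", priority=5)
queue.push("mesh", priority=5)
print(queue.pop(), queue.pop(), queue.pop())  # solve mesh post-process
```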

Resources

Database

The database will store the workflows, log files, and job artifacts (although it is recognized that job artifacts could be large).

Proposed REST API

A feature-rich web UI is not anticipated for the Orchestrator. Instead, a comprehensive API will be used so that both a web app and the Nodeworks desktop app can interact with the server.

| call | action | access |
| --- | --- | --- |
| /api/v1/status | get the status of the Orchestrator | public |
| /api/v1/workflow/get | download the selected workflow | user |
| /api/v1/workflow/publish | upload the selected workflow | user |
| /api/v1/workflow/execute | execute the selected workflow | user |
| /api/v1/workflow/cancel | stop execution of the selected workflow | user |
| /api/v1/workflow/status | get the status of the selected workflow | user |
| /api/v1/workflow/history | get the git history of the selected workflow | user |
| /api/v1/workflow/delete | delete the selected workflow | user |
| /api/v1/queue/status | get the status of the queue | user |
| /api/v1/queue/clear | remove all tasks from the queue | user |
| /api/v1/runners/status | get the status of the runners | user |
| /api/v1/runners/register | register a runner | user |
| /api/v1/runners/cancel | cancel runner job | user |
| /api/v1/runners/shutdown | shutdown a runner | user |
| /api/v1/events/status | get the status of events | user |
| /api/v1/events/push | push a new event | user |
| /api/v1/timers/status | get the status of a timer | user |
| /api/v1/timers/start | start a new timer | user |
| /api/v1/admin/cancel_all | cancel all workflows, clear all tasks from the queue | admin |
| /api/v1/admin/shutdown | cancel all workflows and shutdown | admin |
Runners

Runners are small programs that handle communication with the Orchestrator and the queue(s). Following a model similar to GitLab runners, each runner will be registered with the Orchestrator. Once registered, the runner will periodically check the queue for new jobs. If a new job is present, the runner will start the executor.
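
The polling behavior can be sketched as a small loop, shown below. The callables `fetch_job` and `execute` are hypothetical stand-ins for the call that asks the queue for work and for the executor; the `max_polls` and `sleep` parameters exist only so that the loop can be stopped and tested.

```python
import time


def run_runner(fetch_job, execute, interval=5.0, max_polls=None, sleep=time.sleep):
    """Poll for jobs and execute them; return the number of jobs completed.

    fetch_job() returns the next job, or None when the queue is empty.
    """
    completed = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        job = fetch_job()
        if job is None:
            sleep(interval)  # nothing to do; wait before checking again
            continue
        execute(job)
        completed += 1
    return completed


# Usage with an in-memory list standing in for the real queue.
pending = ["job-1", "job-2"]
finished = []
count = run_runner(
    fetch_job=lambda: pending.pop(0) if pending else None,
    execute=finished.append,
    interval=0,
    max_polls=4,
)
print(count, finished)  # 2 ['job-1', 'job-2']
```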

Runners will have a collection of tags describing the environment and the executor. These tags will be used to decide what jobs to execute with which runner.
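
One simple matching rule, assumed here rather than taken from this page, is that a runner may take a job only if it carries every tag the job requires. The function names and the example tags below are hypothetical.

```python
def can_run(runner_tags, job_tags):
    """A runner qualifies if it has every tag the job asks for."""
    return set(job_tags) <= set(runner_tags)


def pick_runner(runners, job_tags):
    """Return the name of the first registered runner that can run the job, or None."""
    for name, tags in runners.items():
        if can_run(tags, job_tags):
            return name
    return None


runners = {
    "laptop": ["shell", "linux"],
    "cluster": ["shell", "linux", "slurm"],
    "cloud": ["docker", "aws"],
}
print(pick_runner(runners, ["slurm"]))            # cluster
print(pick_runner(runners, ["docker", "linux"]))  # None
```

A job with no tags matches every runner under this rule, so the Orchestrator may want to require at least one tag per job.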

Executors

The executor will actually handle running the job. Several executors will be available to support different jobs:

  • Shell - execute a list of commands in a shell
  • Docker - start a selected docker container and execute a list of commands in that container
  • Nodeworks - create a Nodeworks conda environment and execute the node in that environment. This could also be containerized.
  • Kubernetes
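
As an illustration of the simplest case, a Shell executor could look like the sketch below. The function name `run_shell`, the result dictionary, and the stop-on-first-failure behavior are assumptions, not part of this specification.

```python
import subprocess


def run_shell(commands, timeout=60):
    """Run shell commands in order, stopping at the first one that fails."""
    results = []
    for command in commands:
        proc = subprocess.run(
            command,
            shell=True,           # the job is defined as shell command lines
            capture_output=True,  # keep stdout/stderr so they can be published
            text=True,
            timeout=timeout,
        )
        results.append(
            {
                "command": command,
                "returncode": proc.returncode,
                "stdout": proc.stdout,
                "stderr": proc.stderr,
            }
        )
        if proc.returncode != 0:
            break  # later commands usually depend on earlier ones
    return results


for result in run_shell(["echo hello", "exit 3", "echo never-runs"]):
    print(result["command"], "->", result["returncode"])
```

Because `shell=True` passes the string to the system shell, the commands must come from a trusted source, such as a workflow submitted by an authenticated user.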

GitLab seems to have a nice model of runners and executors (it deploys well in production environments) that we can use for inspiration.

React App (Front end)

A web application built with React will provide an intuitive interface for visualizing and editing workflows and for displaying their execution progress. The application will need to:

  • allow users to sign up and log in
  • get a user-specific token for API access to the Flask server (for the Nodeworks desktop app)
  • show a list of workflows
  • open and visualize a workflow (stretch: edit the workflow)
  • execute the workflow (submit it to the Flask server)
  • visualize the execution progress
  • register Executors

Home page

The home page should:

  • allow users to sign up with the Flask API (/api/v1/register)
  • allow users to log in with the Flask API (/api/v1/login)
  • show some nice graphics

List of workflows

The list of workflows page should:

  • show the list of available workflows from the Flask API (/api/v1/workflow/)
  • allow users to create new workflows
  • allow users to select a workflow to open it
  • allow users to delete workflows (/api/v1/workflow/delete/<file_id>)

Workflow

The workflow page should:

  • load the workflow from the Flask API (/api/v1/workflow/get/)
  • show the workflow
  • allow users to drag, drop, and delete nodes and edges (connections)
  • save the workflow back to the Flask API (/api/v1/workflow/publish)

React node libraries

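| metric | react-node-graph | react-digraph | reactflow |
| --- | --- | --- | --- |
| license | | | |
| cost | | | |
| documentation | | | |
| examples | | | |
| widgets in nodes | | | |
| usability | | | |
| active | | | |
| installation | | | |
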
  • documentation - is the documentation comprehensive?
  • examples - are there example use cases?
  • widgets in nodes - can you place widgets in the nodes, such as a button, slider, or spinbox?
  • usability - how intuitive is the interface? Did you need to read or see directions?
  • active - is the development active? When was the last commit?
  • installation - how easy is it to install? Is there a large list of dependencies?

Evaluation of the three libraries: https://imgur.com/pV81Qev

Resources