Resource Aware Inference Distribution - RAID

RAID is a dynamic resource management and scheduling system for inference task distribution on edge devices. It uses the Java-based scheduling platform Constellation for communication, and TensorFlow Serving for applying ML models.

The system has three types of Agents: Source, Predictor, and Target.

  • The Source produces data; it currently supports reading from the file system, but can be extended to read from an external input source, such as a camera.
  • The Target collects the results and stores them in a log file; this can be extended with whatever processing of the results is desired.
  • The Predictor steals tasks from the Source, performs the prediction and sends the result to a specified Target. Predictors will typically run on the edge devices, but they can run anywhere.

RAID supports context-aware execution, meaning that we can specify which types of tasks should be performed on which Predictor. This is done by attaching labels when starting up a Source; only Predictors with matching labels will steal those tasks.
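
For illustration, a Predictor that only steals tasks labelled A or B and a Source that submits tasks labelled A could be started roughly as follows (placeholders in angle brackets; the commands are explained in detail under Running below):

./bin/distributed/run.bash p <IP> <pool name> -context A,B
./bin/distributed/run.bash s <IP> <pool name> -context A -target <target ID> -dataDir <data dir> -modelName mnist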

The following image depicts an example execution in RAID with two Sources uploading data, five Predictors with different labels (A, B, or both), and one Target that collects the results.

RAID execution example

Requirements

Running

  • Java JRE >= 11 (Constellation supports Java 8, but since it is deprecated we made sure RAID would run on Java 11).
  • TensorFlow Serving installed on all devices where predictions will occur. Requires Docker unless you wish to build the binary yourself (a quick check of both requirements is sketched below).
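
A quick way to verify both requirements on a device (assuming Docker is used for TensorFlow Serving; tensorflow/serving is the standard upstream image, not something shipped with RAID):

java -version                    # should report version 11 or higher
docker pull tensorflow/serving   # pulls the standard TensorFlow Serving image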

Compiling

  • All dependencies of RAID (compiled with Gradle)
  • Java JDK >= 11

Currently Supported Models

  • mnist: MNIST DNN
  • mnist_cnn: MNIST CNN, slightly larger model with better accuracy than MNIST DNN
  • cifar10: CIFAR10 CNN
  • yolo: YOLO v2 full model
  • tiny_yolo: YOLO v2 smaller model

After pulling the repository, extract the models from tensorflow/tensorflow_serving/models.tgz. The tiny_yolo and yolo models are too big to fit in the tar file; if you wish to use them, manually add them to the tensorflow/tensorflow_serving/models directory.
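
One way to do the extraction (run from the repository root; the exact tar options and target directory are assumptions about how the archive was packed):

tar -xzf tensorflow/tensorflow_serving/models.tgz -C tensorflow/tensorflow_serving/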

Extending With New Models

See README in src/main/java/nl/zakarias/constellation/raid/.

Installation

To install everything and compile a distribution, run the following:

git clone https://github.com/ZakariasLaws/raid-constellation
cd raid-constellation
./gradlew installDist

This will create the distribution in build/install/raid-constellation.

Edge Devices

  • When installing on edge devices, copy only the distribution directory, the TensorFlow Serving config file and the desired models to the device. Make sure to maintain the same folder structure as if you had installed everything with Gradle.
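
A sketch of such a copy, assuming SSH access and purely illustrative host names and paths:

ssh user@edge-device 'mkdir -p raid-constellation/tensorflow/tensorflow_serving'
rsync -a build/install/raid-constellation/ user@edge-device:raid-constellation/
rsync -a tensorflow/tensorflow_serving/ModelServerConfig.conf \
    tensorflow/tensorflow_serving/models/mnist \
    user@edge-device:raid-constellation/tensorflow/tensorflow_serving/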

RAID uses TensorFlow Serving to serve models and perform predictions. This binary needs to be manually installed on each device and its location must be provided during Configuration. For AArch64 devices, you will most likely need to cross-compile it from source; This Github Tool for TF Serving on Arm might do the trick.

RAID Configuration

Environment Variables

To run this application, RAID requires the following environment variables to be set on ALL devices.

export RAID_DIR=/build/path/raid-constellation
export TENSORFLOW_SERVING_PORT=8000
export CONSTELLATION_PORT=4567
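
One common way to make these exports persist across shell sessions (assuming a bash login shell; the values shown are the placeholders from above) is to append them to a startup file such as ~/.bashrc:

cat >> ~/.bashrc <<'EOF'
export RAID_DIR=/build/path/raid-constellation
export TENSORFLOW_SERVING_PORT=8000
export CONSTELLATION_PORT=4567
EOF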

Set SSH Environment (the remote startup script uses SSH; this step is not strictly necessary)

To setup the SSH keys, see: Setup SSH keys and copy ID

In order to enable passing environment variables through SSH sessions on Linux, configure the ~/.ssh/environment file and enable PermitUserEnvironment in the sshd configuration. See this StackExchange thread.
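
A minimal sketch of that setup (assuming OpenSSH; user names, host names, paths and the sshd service name may differ on your system):

ssh-keygen -t ed25519            # create a key pair if you do not have one yet
ssh-copy-id user@edge-device     # copy the public key to the remote device
ssh user@edge-device 'echo "RAID_DIR=/build/path/raid-constellation" >> ~/.ssh/environment'
# on the remote device, set "PermitUserEnvironment yes" in /etc/ssh/sshd_config, then restart sshd:
ssh user@edge-device 'sudo systemctl restart sshd'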

TensorFlow Serving

The application uses TensorFlow Serving to support different TensorFlow ML models. When starting a Predictor with the run.bash script, the TensorFlow Serving API starts in the background and runs on localhost. The TensorFlow model config file is located at tensorflow/tensorflow_serving/ModelServerConfig.conf; it only supports absolute paths and must therefore be modified with the device's system paths on each device.

TensorFlow Serving can be run either from a binary or, more commonly, using Docker; the startup script supports both. If using Docker, type docker when creating the RAID config file. Make sure that permissions are set to allow user-level Docker commands on your system.
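
When serving from the binary, what the startup script does is roughly the following (a sketch only; the flags and log redirection are assumptions, the variables correspond to the config.RAID entries described below, and the real command lives in bin/distributed/run.bash):

$TENSORFLOW_BIN --port=$TENSORFLOW_SERVING_PORT \
    --model_config_file=$TENSORFLOW_SERVING_CONFIG \
    > $RAID_DIR/bin/tensorflow_model_serving.log 2>&1 &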

The TensorFlow Model config file should look something like this (see TensorFlow Serving Config File for more options):

model_config_list {
  config {
    name: 'mnist'
    base_path: '/path/to/model/dir/mnist'
    model_platform: 'tensorflow'
  }
  config {
    ...
  }
}

The output of the TensorFlow model server is stored in tensorflow_model_serving.log in the bin directory. If one or more agents in charge of prediction do not work for some reason at run time, view this log to see whether the error is related to TensorFlow Serving.

RAID Configuration File

Each device running an agent must have a configuration file in the location pointed to by the environment variable RAID_DIR (see Environment Variables). To create this configuration file, run the ./configuration/configure.sh script from the root directory and answer the questions.

The script will place a file named config.RAID in the RAID_DIR, looking something like this:

CONSTELLATION_PORT=4567
TENSORFLOW_BIN=/usr/bin/tensorflow_model_server
TENSORFLOW_SERVING_CONFIG=/home/username/raid-constellation/tensorflow/tensorflow_serving/ModelServerConfig.conf

NOTE that the CONSTELLATION_PORT number must be identical on all devices in order for them to connect to the server and communicate.

Running

To start up an agent, navigate to the distribution directory ($RAID_DIR) and execute the ./bin/distributed/run.bash script with the appropriate parameters for that agent (available agents are here). When starting a new execution, always start the agents in the following order:

  1. Constellation Server
  2. Target (in order to get Activity ID)
  3. Source(s) and Predictor(s)

It is possible to add another Target at runtime, but this new Target cannot receive classifications of images produced by an already running Source. Predictors, however, can process images from newly added Sources and send results to any Target specified when starting up the Source.
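
Put together, a typical startup sequence looks roughly like this (IP, pool name and paths are placeholders; each command is explained in detail in the sections below):

cd $RAID_DIR
./bin/distributed/constellation-server                          # 1. start the server, note its IP and port
./bin/distributed/run.bash t <IP> <pool name>                   # 2. start the Target, note the printed activity ID
./bin/distributed/run.bash p <IP> <pool name> -context A        # 3. start one or more Predictors
./bin/distributed/run.bash s <IP> <pool name> -context A -target <activity ID> \
    -dataDir <data dir> -modelName mnist                        #    ... and one or more Sources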

To start the server, type the following command: ./bin/distributed/constellation-server.

cd $RAID_DIR
./bin/distributed/constellation-server
Ibis server running on 172.17.0.1/10.72.152.146-4567#8a.a0.ee.40.52.7d.00.00.8f.dd.4e.46.8e.a9.36.23~zaklaw01+22
List of Services:
    Central Registry service on virtual port 302
    Management service on virtual port 304
    Bootstrap service on virtual port 303
Known hubs now: 172.17.0.1/10.72.152.146-4567#8a.a0.ee.40.52.7d.00.00.8f.dd.4e.46.8e.a9.36.23~zaklaw01+22

When executing the server, we see the IP address and the port number on which it listens; from the example above, IP=10.72.152.146 and port=4567. The port is retrieved from the RAID configuration file (see Configuration) and the IP should be provided as the second argument when starting any agent.

When starting any of the agents, the first, second and third arguments follow the same pattern. The first argument (s/t/p) specifies whether to run the Source, Target or Predictor respectively, the second is the IP and the third is the pool name. The pool name can be any name; it is used by the server to distinguish Constellation executions when multiple run simultaneously.

./bin/distributed/run.bash <s/t/p> <IP> <Pool Name> [params]

Target

When starting the target, the ID of the activity collecting the results will be printed to the screen. Use this ID when starting up a source agent.

./bin/distributed/run.bash t 10.72.152.146 test.pool.name

...
09:57:35,085 INFO  [CID:0:1] nl.zakarias.constellation.raid.collectActivities.CollectAndProcessEvents - In order to target this activity with classifications add the following as argument (exactly as printed) when initializing the new SOURCE: "0:1:0"
...

Possible parameters for the Target are:

  • -outputFile /path/to/store/output/log
    • Each target produces a log file, storing the results of the predictions
  • -profileOutput /path/to/store/profiling
    • Each target produces a Gantt log file, which can be used to visualize the scheduling of jobs in Constellation.

The -profileOutput argument MUST be the last argument provided (if provided).
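
For example, a Target started with both optional parameters might look like this (the output paths are placeholders; note that -profileOutput comes last):

./bin/distributed/run.bash t 10.72.152.146 test.pool.name -outputFile /tmp/raid-results.log -profileOutput /tmp/raid-gantt.log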

Predictor

The labels used here are A, B and C, meaning that this agent will only steal jobs with label A, B or C.

./bin/distributed/run.bash p 10.72.152.146 test.pool.name -context A,B,C

Possible parameters for the Predictor are:

  • -nrExecutors <number>
    • Set the number of executors to use (each executor runs asynchronously on a separate thread)
  • -context: Comma-separated list of strings containing at least one value, for example "label-1,test,2kb". The Predictor will only steal tasks with at least one matching label (a combined example is shown below).
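
For example, a Predictor using four executor threads and stealing tasks labelled A or B could be started like this (the executor count is purely illustrative):

./bin/distributed/run.bash p 10.72.152.146 test.pool.name -nrExecutors 4 -context A,B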

Only three models are included in this repository (mnist, mnist_cnn, and cifar10); additional ones need to be added manually. Store new models in the tensorflow/tensorflow_serving/models/ directory, using the TensorFlow SavedModel format (see TensorFlow SavedModel). Also update the TensorFlow Model Serving config file to include the newly added model.

Source

The source requires the following arguments:

  • -context: Comma-separated list of strings containing at least one value, for example "label-1,test,2kb". All submitted images will carry all of these labels, meaning they can be stolen by Predictors with one or more matching labels.
  • -target: The target activity identifier to send the result of the predictions to, printed to the screen when starting up a target agent.
  • -dataDir: The directory where the data to be transmitted is stored
  • -modelName: The type of model which should be used, see Inference Models for availability
  • -batchSize: The number of images to send in each task (default is 1)
  • -endless: If set, the Source will keep submitting images forever and batchCount will be ignored (default is false)
  • -batchCount: The number of batches to send in total before exiting, ignored if endless is set to true (default is 100)
  • -timeInterval: The time to wait between submitting two batches, in milliseconds (default is 100)
./bin/distributed/run.bash s 10.72.152.146 test.pool.name -context A -target 0:1:0 -dataDir /home/username/MNIST_data/ -modelName mnist -batchSize 1
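
A variant that sends larger batches at a slower, bounded rate might look like this (the numbers are purely illustrative):

./bin/distributed/run.bash s 10.72.152.146 test.pool.name -context A -target 0:1:0 -dataDir /home/username/MNIST_data/ -modelName mnist -batchSize 4 -batchCount 500 -timeInterval 250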

When using the MNIST models, provide as input the directory containing the following two files:

t10k-labels-idx1-ubyte
t10k-images-idx3-ubyte

Modify the src/.../models/mnist/Mnist class to read input in a different way, for example from user input.

For CIFAR-10, provide the directory containing the binary version of the dataset (the one "suitable for C programs"), downloaded from The CIFAR-10 Dataset. The directory will most likely be called cifar-10-batches-bin.

For YOLO models, provide a directory containing any number of images to predict. The model was trained on the COCO Dataset.

Production

When executing in production, everything in the log4j.properties file should be set to false, and the command line arguments supplied when starting up a Constellation agent should be reviewed. These can be found in the run.bash file, in the java invocation (java -p ...). In particular, profiling (-Dibis.constellation.profile=true) can drastically slow down execution.
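
An illustrative way to review these settings before a production run (the file locations are assumptions about a default install):

cd $RAID_DIR
grep -n "ibis.constellation.profile" bin/distributed/run.bash   # make sure profiling is not enabled
grep -n "true" log4j.properties                                 # list logging options that are still enabled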

For more Constellation-specific arguments, see the Constellation Configuration Javadoc.
