RAID is a dynamic resource management and scheduling system for inference task distribution on edge devices. It uses the Java-based scheduling platform Constellation for communication and TensorFlow Serving for running the ML models.
The system has three types of Agents: Source, Predictor, and Target.
- The Source produces data. It currently supports reading from the file system, but can be extended to take input from an external source, such as a camera.
- The Target collects the results and stores them in a log file; this can be extended with whatever functionality is desired for the results.
- The Predictor steals tasks from a Source, performs the prediction, and sends the result to a specified Target. Predictors will typically run on the edge devices, but they can be used anywhere.
RAID supports context-aware execution, meaning that we can specify which type of tasks should be performed by which Predictor. This is done by attaching labels when starting up a Source; only Predictors with matching labels will steal those tasks.
The following image depicts an example execution in RAID with two Sources uploading data, five Predictors with different labels (A, B, or both), and one Target that collects the results.
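For example (a sketch using the run.bash invocations described in detail further down; the IP, pool name, target ID, and data directory are placeholders):

./bin/distributed/run.bash s <IP> <Pool Name> -context A -target <Target ID> -dataDir /path/to/data -modelName mnist
# The tasks submitted by the Source above carry label A, so they can be stolen by this Predictor...
./bin/distributed/run.bash p <IP> <Pool Name> -context A,B
# ...but not by this one, whose labels do not include A
./bin/distributed/run.bash p <IP> <Pool Name> -context B,C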
- Java JRE >= 11 (Constellation supports Java 8, but since it is deprecated we made sure RAID would run on Java 11).
- TensorFlow Serving installed on all devices where predictions will occur. Requires Docker unless you wish to build the binary yourself.
- All dependencies of RAID, compiled with Gradle
- Java JDK >= 11
- mnist: MNIST DNN
- mnist_cnn: MNIST CNN, slightly larger model with better accuracy than MNIST DNN
- cifar10: CIFAR10 CNN
- yolo: YOLO v2 full model
- tiny_yolo: YOLO v2 smaller model
After pulling the repository, extract the models from tensorflow/tensorflow_serving/models.tgz. The tiny_yolo and yolo models are too big to fit in the tar file; if you wish to use them, manually add them to the tensorflow/tensorflow_serving/models directory.
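For example, assuming a standard tar installation and that the archive unpacks into the models directory next to it:

cd tensorflow/tensorflow_serving
tar -xzf models.tgz
ls models    # should now list the bundled models, e.g. mnist, mnist_cnn and cifar10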
See the README in src/main/java/nl/zakarias/constellation/raid/.
In order to install everything and compile a distribution, run the following in the root directory:
git clone https://github.com/ZakariasLaws/raid-constellation
cd raid-constellation
./gradlew installDist
This will create the distribution in build/install/raid-constellation.
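The resulting layout looks roughly like this (a sketch; the exact contents depend on the Gradle build):

build/install/raid-constellation/
    bin/distributed/run.bash
    bin/distributed/constellation-server
    lib/    (RAID and all of its Java dependencies as jar files)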
- When installing on edge devices, only copy the distribution directory, the TensorFlow Serving config file, and the desired models to the device (see the sketch below). Make sure to maintain the same folder structure as if you had installed everything with Gradle.
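A possible way to do this with rsync (the user, host name, and destination paths are hypothetical; adjust them to match the RAID_DIR and TensorFlow Serving paths you will configure on the device):

rsync -a build/install/raid-constellation/ user@edge-device:/build/path/raid-constellation/
rsync -aR tensorflow/tensorflow_serving/ModelServerConfig.conf \
          tensorflow/tensorflow_serving/models/mnist \
          user@edge-device:/home/username/raid-constellation/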
RAID uses TensorFlow Serving to serve models and perform predictions. This binary needs to be manually installed on each device, and its location must be provided during Configuration. For AArch64 devices, you will most likely need to cross-compile it from source; this GitHub Tool for TF Serving on Arm might do the trick.
To run this application, RAID requires the following environment variables to be set on ALL devices:
export RAID_DIR=/build/path/raid-constellation
export TENSORFLOW_SERVING_PORT=8000
export CONSTELLATION_PORT=4567
To set up the SSH keys, see: Setup SSH keys and copy ID
To enable passing environment variables through SSH sessions on Linux, configure the ~/.ssh/environment file and enable PermitUserEnvironment in the sshd configuration. See this StackExchange thread.
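A sketch of what that configuration looks like, using the three RAID variables from above (the paths are examples):

# /etc/ssh/sshd_config on the remote device (restart sshd afterwards)
PermitUserEnvironment yes

# ~/.ssh/environment on the remote device (plain NAME=value lines, no "export")
RAID_DIR=/build/path/raid-constellation
TENSORFLOW_SERVING_PORT=8000
CONSTELLATION_PORT=4567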
The application uses TensorFlow Serving to support different TensorFlow ML models. When starting a Predictor with the run.bash script, the TensorFlow Serving API will start in the background and run on localhost. The TensorFlow Model config file is located at tensorflow/tensorflow_serving/ModelServerConfig.conf; it only supports absolute paths and must therefore be modified with the device's system paths on each device.
TensorFlow Serving can be run either from a binary or, more commonly, using Docker; the startup script supports both. If using Docker, type docker when creating the RAID config file. Make sure that the permissions are set to allow user-level Docker commands on your system.
The TensorFlow Model config file should look something like this (see TensorFlow Serving Config File for more options):
model_config_list {
  config {
    name: 'mnist'
    base_path: '/path/to/model/dir/mnist'
    model_platform: 'tensorflow'
  }
  config {
    ....
  }
}
The output of the TensorFlow Serving process is stored in tensorflow_model_serving.log in the bin directory. If one or more agents in charge of prediction do not work during runtime, check this log to see if the error is related to TensorFlow Serving.
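For reference, the startup that run.bash performs is roughly equivalent to launching the server manually like this (a sketch, not the exact contents of the script; TENSORFLOW_BIN and TENSORFLOW_SERVING_CONFIG are the values from the RAID config file described below):

$TENSORFLOW_BIN \
    --port=$TENSORFLOW_SERVING_PORT \
    --model_config_file=$TENSORFLOW_SERVING_CONFIG \
    > $RAID_DIR/bin/tensorflow_model_serving.log 2>&1 &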
Each device running an agent must have a configuration file in the location pointed to by the environment variable RAID_DIR (see Environment Variable). To create this configuration file, run the ./configuration/configure.sh script from the root directory and answer the questions. The script will place a file named config.RAID in the RAID_DIR, looking something like this:
CONSTELLATION_PORT=4567
TENSORFLOW_BIN=/usr/bin/tensorflow_model_server
TENSORFLOW_SERVING_CONFIG=/home/username/raid-constellation/tensorflow/tensorflow_serving/ModelServerConfig.conf
NOTE that the CONSTELLATION_PORT number must be identical on all devices in order for them to connect to the server and communicate.
To start up an agent, navigate to the distribution directory ($RAID_DIR) and execute the ./bin/distributed/run.bash script with the appropriate parameters for that agent (available agents are listed here). When starting a new execution, always start the agents in the following order:
- Constellation Server
- Target (in order to get Activity ID)
- Source(s) and Predictor(s)
It is possible to add another target during runtime, but this new target cannot receive classifications from images produced by an already running source. Predictors, however, can process images from newly added sources and send results to any target specified when starting up the source.
To start the server, run the ./bin/distributed/constellation-server command:
cd $RAID_DIR
$ ./bin/distributed/constellation-server
Ibis server running on 172.17.0.1/10.72.152.146-4567#8a.a0.ee.40.52.7d.00.00.8f.dd.4e.46.8e.a9.36.23~zaklaw01+22
List of Services:
Central Registry service on virtual port 302
Management service on virtual port 304
Bootstrap service on virtual port 303
Known hubs now: 172.17.0.1/10.72.152.146-4567#8a.a0.ee.40.52.7d.00.00.8f.dd.4e.46.8e.a9.36.23~zaklaw01+22
When executing the server, we see the IP and the port number on which it listens; from the example above, IP=10.72.152.146 and port=4567. The port is retrieved from the RAID configuration file (see Configuration) and the IP should be provided as the second argument when starting any agent.
When starting any of the agents, the first, second, and third arguments follow the same pattern. The first argument (s/t/p) specifies whether to run a Source, Target, or Predictor respectively, the second is the IP, and the third is the pool name. The pool name can be any string; the server uses it to distinguish Constellation executions when several run simultaneously.
./bin/distributed/run.bash <s/t/p> <IP> <Pool Name> [params]
When starting the target, the ID of the activity collecting the results will be printed to the screen. Use this ID when starting up a source agent.
./bin/distributed/run.bash t 10.72.152.146 test.pool.name
...
09:57:35,085 INFO [CID:0:1] nl.zakarias.constellation.raid.collectActivities.CollectAndProcessEvents - In order to target this activity with classifications add the following as argument (exactly as printed) when initializing the new SOURCE: "0:1:0"
...
Possible parameters for the Target are:
- -outputFile /path/to/store/output/log
  - Each target produces a log file, storing the results of the predictions
- -profileOutput /path/to/store/profiling
  - Each target produces a Gantt log file, which can be used to visualize the scheduling of jobs in Constellation.
The -profileOutput argument MUST be the last argument provided (if provided).
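For example, to start a Target with both optional arguments (the output paths are hypothetical):

./bin/distributed/run.bash t 10.72.152.146 test.pool.name -outputFile /tmp/raid_results.log -profileOutput /tmp/raid_profile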
The labels used here are A, B, and C, meaning that this agent will only steal jobs having label A, B, or C.
./bin/distributed/run.bash p 10.72.152.146 test.pool.name -context A,B,C
Possible parameters for the Predictor are:
- -nrExecutors <number>
- Set the number of executors to use (each executor runs asynchronously on a separate thread)
- -context: Comma-separated list of strings, containing at least one value, for example "label-1,test,2kb". The Predictor will only steal tasks with at least one matching label
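For instance, a Predictor with two executor threads that steals the same labels as the example above (a sketch using the parameters listed here):

./bin/distributed/run.bash p 10.72.152.146 test.pool.name -nrExecutors 2 -context A,B,C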
Only three models are included in this repository (mnist, mnist_cnn, and cifar10); additional ones need to be added manually. Store new models in the tensorflow/tensorflow_serving/models/ directory, using the TensorFlow SavedModel format (see TensorFlow SavedModel). Also, update the TensorFlow Model Serving config file to include the newly added model.
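As an illustration, a hypothetical model named my_new_model would be stored in the versioned layout that TensorFlow Serving expects for SavedModels:

tensorflow/tensorflow_serving/models/my_new_model/1/saved_model.pb
tensorflow/tensorflow_serving/models/my_new_model/1/variables/

and registered in ModelServerConfig.conf with an additional config block (remember that base_path must be absolute):

config {
  name: 'my_new_model'
  base_path: '/absolute/path/to/tensorflow/tensorflow_serving/models/my_new_model'
  model_platform: 'tensorflow'
}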
The Source takes the following arguments:
- -context: Comma-separated list of strings, containing at least one value, for example "label-1,test,2kb". All submitted images will carry all of these labels, meaning they can be stolen by Predictors with one or more matching labels.
- -target: The target activity identifier to send the result of the predictions to, printed to the screen when starting up a target agent.
- -dataDir: The directory where the data to be transmitted is stored
- -modelName: The type of model which should be used, see Inference Models for availability
- -batchSize: The number of images to send in each task (default is 1)
- -endless: If set, the source will keep submitting images forever and batchCount will be ignored (default is false)
- -batchCount: The number of batches to send in total before exiting, ignored if endless is set to true (default is 100)
- -timeInterval: The time to wait between submitting two batches, in milliseconds (default is 100)
./bin/distributed/run.bash s 10.72.152.146 test.pool.name -context A -target 0:1:0 -dataDir /home/username/MNIST_data/ -modelName mnist -batchSize 1
When using the MNIST models, provide as input the directory containing the following two files:
t10k-labels-idx1-ubyte
t10k-images-idx3-ubyte
Modify the src/.../models/mnist/Mnist class to read input in a different way, for example from user input.
For CIFAR-10, provide the directory containing the binary version of the dataset (suitable for C programs), downloaded from The CIFAR-10 Dataset.
The directory will most likely be called cifar-10-batches-bin.
For YOLO models, provide a directory containing any number of images to predict. The model was trained on the COCO Dataset.
When executing in production, everything in the log4j.properties file should be set to false, and the command-line arguments supplied when starting up a Constellation agent should be reviewed. They can be found in the run.bash file and have the syntax java -p ...; in particular, profiling (-Dibis.constellation.profile=true) can drastically slow down execution.
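For instance, to make sure profiling stays disabled in a production run, the java line in run.bash could look roughly like this (a sketch; the remaining arguments are elided):

java -Dibis.constellation.profile=false ...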
For more Constellation-specific arguments, see the Constellation Configuration Javadoc.