In the following, we show an example of running MONAI-bundle configurations with NVFlare.
This example includes instructions on running FedAvg with experiment tracking using MLflow and homomorphic encryption for secure aggregation. It uses provisioning and the admin API to submit jobs, similar to how experiments would be set up for a real-world deployment.
In this example, we use an already prepared provisioning file (project.yml) to run experiments on a single machine. For real-world deployment, additional considerations must be taken into account. See here for more information.
For an example to get started with FL simulator, see here.
To execute the below commands, please open a terminal and go to the folder containing this tutorial.
Follow the setup to create a virtual environment with the MONAI-NVFlare integration installed to use in JupyterLab.
Install the required packages in your virtual environment:
pip install -r ./requirements.txt
Download the MONAI bundle as ./jobs/${JOB_NAME}/app/config/spleen_ct_segmentation:
JOB_NAME=spleen_ct_segmentation_local
python3 -m monai.bundle download --name "spleen_ct_segmentation" --version "0.5.4" --bundle_dir ./jobs/${JOB_NAME}/app/config
In this example, JOB_NAME can be either spleen_ct_segmentation_local or spleen_ct_segmentation_he, depending on the configuration you would like to run (see below).
The final folder structure under JOB_NAME will be:
.
├── app
│ └── config
│ ├── config_fed_client.json
│ ├── config_fed_server.json
│ └── spleen_ct_segmentation
│ ├── LICENSE
│ ├── configs
│ │ ├── evaluate.json
│ │ ├── inference.json
│ │ ├── logging.conf
│ │ ├── metadata.json
│ │ ├── multi_gpu_evaluate.json
│ │ ├── multi_gpu_train.json
│ │ └── train.json
│ ├── docs
│ │ ├── README.md
│ │ └── data_license.txt
│ └── models
│ ├── model.pt
│ └── model.ts
└── meta.json
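For reference, meta.json describes how the job's app is deployed to the FL system. The fragment below is a minimal illustrative sketch only; the actual file in this example may use different values or contain additional fields.

```json
{
  "name": "spleen_ct_segmentation_local",
  "resource_spec": {},
  "min_clients": 2,
  "deploy_map": {
    "app": ["@ALL"]
  }
}
```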
Download the spleen CT data from the MSD challenge and update the data path.
Note: The dataset will be saved under ./data by default by the download_spleen_dataset.py script. The sed command replaces the dataset path /workspace/data/Task09_Spleen in the bundle with the default location the dataset is downloaded to in this directory.
JOB_NAME=spleen_ct_segmentation_local
python3 download_spleen_dataset.py
sed -i "s|/workspace/data/Task09_Spleen|${PWD}/data/Task09_Spleen|g" ./jobs/${JOB_NAME}/app/config/spleen_ct_segmentation/configs/train.json
You can run FL experiments using either a POC system or scripts that start the FL server and 2 clients automatically. Skip to 4.1 below for provisioning with homomorphic encryption and using the start_fl_secure.sh script.
To run FL experiments in POC mode, create your local FL workspace with the command below. In the following experiments, we will be using 2 clients. Press y and enter when prompted.
nvflare poc prepare -n 2
By default, POC will create startup kits at /tmp/nvflare/poc.
NOTE: POC stands for "proof of concept" and is used for quick experimentation with different numbers of clients. It doesn't require any advanced configuration when provisioning the startup kits for the server and clients.
The secure workspace, on the other hand, is needed to run experiments that require encryption keys, such as the homomorphic encryption (HE) one shown below. These startup kits allow secure deployment of FL in real-world scenarios using communication channels secured with SSL certificates.
Then in another terminal start the FL system in POC mode with all provisioned clients by running:
nvflare poc start -ex [email protected]
The project file for creating a secure workspace with homomorphic encryption used in this example is shown at ./workspaces/secure_project.yml.
If you want to run the homomorphic encryption job, please install TenSEAL:
pip install tenseal
(this example was tested with tenseal==0.3.12)
Otherwise, please remove the HEBuilder section from workspaces/secure_project.yml.
To create the secure workspace, please use the following commands to build a package and copy it to secure_workspace for later experimentation.
cd ./workspaces
nvflare provision -p ./secure_project.yml
cp -r ./workspace/secure_project/prod_00 ./secure_workspace
cd ..
POC mode now also uses secure provisioning and can support homomorphic encryption if set up with:
nvflare poc prepare -he
To start the FL system with 2 clients in the secure workspace, run:
./start_fl_secure.sh 2
Here, we assume jobs are submitted and run one at a time. For details about resource management and consumption, please refer to the documentation.
Note: Full FL training could take several hours for this task. To speed up your experimentation, you can reduce the num_rounds value in config_fed_server.json, e.g., to 5 rounds.
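As a rough sketch of where that value lives (assuming the job uses NVFlare's ScatterAndGather workflow; the actual workflow path and arguments in this example's server config may differ), the round count is part of the workflow arguments:

```json
{
  "workflows": [
    {
      "id": "scatter_and_gather",
      "path": "nvflare.app_common.workflows.scatter_and_gather.ScatterAndGather",
      "args": {
        "min_clients": 2,
        "num_rounds": 5
      }
    }
  ]
}
```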
Experiment tracking for the FLARE-MONAI integration now uses NVFlareStatsHandler
to provide a set of Ignite Event-handlers to support both iteration and epoch-level events for automatic metric streaming.
In this example, the spleen_ct_segmentation_local
job is configured to automatically log metrics to MLflow through the FL server.
- The config_fed_client.conf contains the NVFlareStatsHandler, MetricsSender, and MetricRelay (with their respective pipes) to send the metrics to the server side as federated events.
- Then in config_fed_server.conf, the MLflowReceiver is configured for the server to write the results to the MLflow tracking server URI http://127.0.0.1:5000. A sketch of this receiver configuration is shown below.
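As a rough illustration (a minimal sketch, not necessarily the exact configuration shipped with this job), the MLflowReceiver appears as a component entry in the server config, with the tracking URI pointing at the local MLflow server:

```json
{
  "components": [
    {
      "id": "mlflow_receiver",
      "path": "nvflare.app_opt.tracking.mlflow.mlflow_receiver.MLflowReceiver",
      "args": {
        "tracking_uri": "http://127.0.0.1:5000"
      }
    }
  ]
}
```

Additional arguments (e.g., experiment name or run tags) can be set as described in the NVFlare experiment tracking documentation.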
We need to start the MLflow tracking server before running this job:
mlflow server
NOTE: The receiver on the server side can be easily configured to support other experiment tracking formats. In addition to the MLflowReceiver, the TBAnalyticsReceiver and WandBReceiver can also be used in config_fed_server.conf to stream experiment tracking results to the server for TensorBoard and Weights & Biases, respectively (a sketch is shown below).
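For example, streaming to TensorBoard instead could look roughly like the component entry below. This is a sketch only; the exact module path and available arguments depend on your NVFlare version, so treat it as an assumption to verify against the documentation.

```json
{
  "id": "tb_analytics_receiver",
  "path": "nvflare.app_opt.tracking.tb.tb_receiver.TBAnalyticsReceiver",
  "args": {}
}
```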
Next, we can submit the job.
To run FedAvg with the Job CLI, submit the job with:
nvflare job submit -j jobs/spleen_ct_segmentation_local
NOTE: You can always use the admin console to manually abort a running job using abort_job [JOB_ID]. For a complete list of admin commands, see here. For the secure workspace admin console, use the username "[email protected]".
After training, each client's best model will be used for cross-site validation. The results can be downloaded and shown with the admin console using download_job [JOB_ID], where [JOB_ID] is the ID assigned by the system when submitting the job.
You can use the list_jobs admin command to find the relevant JOB_ID.
The result will be downloaded to your admin workspace (the exact download path will be displayed when running the command). You should see the cross-site validation results at
[DOWNLOAD_DIR]/[JOB_ID]/workspace/cross_site_val/cross_val_results.json
Alternatively, we can run FedAvg using homomorphic encryption (HE) for secure aggregation on the server.
NOTE: For HE, we need to use the securely provisioned workspace. It will also take longer due to the additional encryption, decryption, encrypted aggregation, and increased encrypted message sizes involved.
Follow the steps above for downloading the bundle and setting up the data, using JOB_NAME=spleen_ct_segmentation_he.
Then, submit the job to run FedAvg with HE:
nvflare job submit -j jobs/spleen_ct_segmentation_he
To view the results, you can access the MLflow dashboard in your browser using the tracking URI http://127.0.0.1:5000.
Once the training is started, you can see the experiment curves for the local clients in the current run on the MLflow dashboard.
NOTE: If you prefer not to start the MLflow server before federated training, you can alternatively choose to write the metrics streaming results to the server's job workspace directory. Remove the tracking_uri argument from the MLflowReceiver configuration. After the job finishes, download the server job workspace and unzip it. You can view the results by running mlflow ui --port 5000 in the directory containing the mlruns/ directory within the server job workspace.