This is the code for the paper "Cost Effective MLaaS Federation: A Combinatorial Reinforcement Learning Approach", published at INFOCOM 2022.

We recommend setting up the environment with Anaconda:
```bash
conda create -n mfed python=3.6
conda activate mfed
git clone https://github.com/openai/spinningup.git
cd spinningup
pip install -e .
pip install -r requirements.txt
```
- Build the `data` and `results` directories:

  ```bash
  mkdir data results
  ```
- Download the preprocessed prediction dataset, which contains `train` and `test` directories, and extract them into the `data` directory. Each of them holds `aws`, `azure`, `google`, and `ground-truth` subdirectories. The `ground-truth` is derived from the COCO dataset, whose original form is a single JSON file covering all images; we transform that JSON file into a directory of per-image text files for the purpose of training RL models. `train/test/*/0.txt` holds the ground truth or predictions for the train/test image with rank 0. The mapping from an image's name to its rank can be found at `src/scripts/*rank.json`. (See the layout sketch after this list.)
- Download the COCO 2017 Train and Val images, then extract them into the `data/train/images` and `data/test/images` directories, respectively. Because the ground truth of the COCO 2017 test images has not been released, we use the validation set as the test set in our experiments.
- Replace `WORKDIR` in `src/common.py` with your current working directory.
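`WORKDIR` is a plain assignment; for illustration (the path below is only a placeholder):

```python
# src/common.py -- hypothetical placeholder path; point this at your clone.
WORKDIR = '/home/you/mlaas-federation'
```

And here is the `data` layout the steps above should produce, sketched from the description (per-image files are named by rank, e.g. `0.txt`):

```
data/
├── train/
│   ├── aws/           # AWS Rekognition predictions, one text file per image
│   ├── azure/         # Azure Computer Vision predictions
│   ├── google/        # Google Cloud Vision predictions
│   ├── ground-truth/  # per-image COCO ground truth in the same text format
│   └── images/        # COCO 2017 Train images
└── test/              # same layout, built from the COCO 2017 Val set
```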
Just enter the command below to train the RL model. You can tune the hyper-parameters in `train.py`; editing them in place works the same as an argument parser, so I didn't write one.

```bash
python train.py
```
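For illustration, the edit-in-place pattern looks like the sketch below; the variable names here are hypothetical, so check `train.py` for the actual ones:

```python
# Hypothetical hyper-parameters near the top of train.py; edit and re-run
# instead of passing command-line flags.
SEED = 0             # random seed
EPOCHS = 50          # number of training epochs
GAMMA = 0.99         # discount factor of the RL agent
LEARNING_RATE = 3e-4
```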
- We use `scapy` to parse the `.pcap` files, so install it first: `pip install scapy`
- Download the latency records in `.pcap` format and extract them anywhere you like. Then modify the input data paths (e.g., `lag_dir` or other variables) in the latency-analysis files under the `src/latency` directory so they point at where you stored the `.pcap` files. Because these files take a lot of space to store, we host them on Baidu NetDisk. Link: https://pan.baidu.com/s/15xW6xPZ5ubVJj_WmZi-HLg, key: stv8
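As a minimal sketch (the file path is hypothetical; the real analysis lives in `src/latency`), loading a capture with scapy looks like this:

```python
from scapy.all import rdpcap, TCP

# Read every packet from one capture file and keep the TCP ones.
packets = rdpcap('latency/20210508_010101.cap')  # hypothetical path
tcp_packets = [p for p in packets if p.haslayer(TCP)]
print(len(tcp_packets), 'TCP packets, first timestamp:', tcp_packets[0].time)
```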
We request AWS Rekognition (AWS), Azure Computer Vision (AZU), Google Cloud Vision AI (GCP), and Aliyun Object Detection Service (ALI) via their Python SDKs and capture the TCP packets with `tcpdump`. We divide the latency into transmission latency and inference latency. By indexing the packets with a special HTTP code, we distinguish the first packet and the last packet of a request, which gives us the full latency; the transmission latency is then calculated by indexing the first packet and the last ACK packet from the server. We also find that the IP of an MLaaS is not constant, and that GCP helps the customer locate the service with the lowest latency. You can read the measurement time of a file from its name, e.g. `20210508_010101.cap`, and `us2sg` means a request sent from a VM in the United States to Singapore. Please refer to the `src/latency` directory for more details.
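A simplified sketch of this decomposition, assuming one request per capture and a known server IP (both hypothetical; the real scripts index packets by HTTP code):

```python
from scapy.all import rdpcap, IP, TCP

SERVER_IP = '203.0.113.7'  # hypothetical MLaaS endpoint address

pkts = [p for p in rdpcap('20210508_010101.cap')
        if p.haslayer(IP) and p.haslayer(TCP)]

# Full latency: first packet of the request to the last packet of the response.
full_latency = pkts[-1].time - pkts[0].time

# Transmission latency: first packet to the server's last pure ACK before it
# starts sending response data, i.e., the end of the request upload.
last_upload_ack = None
for p in pkts:
    if p[IP].src != SERVER_IP:
        continue
    if len(p[TCP].payload) > 0:   # first response data packet: upload is done
        break
    if p[TCP].flags & 0x10:       # pure ACK from the server
        last_upload_ack = p

transmission_latency = last_upload_ack.time - pkts[0].time
inference_latency = full_latency - transmission_latency
```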
We record the packet files from AWS, AZU, and GCP with `tcpdump`. You can open these `.pcap` files with Wireshark for a better illustration. What's more, as we test the services in different data regions every hour, we leverage `crontab` to repeat these measurements for convenience. We send the requests according to the ranks in `src/deploy/event.json`. We only employ the COCO validation set to measure the latency, for the sake of saving money. For more details, please refer to the `src/deploy` directory.
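For example, an hourly entry in `crontab -e` could look like the one below (the paths are placeholders; `azure_sender.py` is one of the senders in `src/deploy`):

```
# Send a batch of requests at the top of every hour.
0 * * * * cd /path/to/repo/src/deploy && python3 azure_sender.py >> /tmp/mlaas_latency.log 2>&1
```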
If you want to repeat the measurements, please modify the `SUB_KEY` and `END_POINT` variables in `src/deploy/azure_sender.py` and configure your AWS and Google Cloud accounts. Please refer to their official documentation for more details.
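For illustration, the two variables are plain assignments; the values below are placeholders:

```python
# src/deploy/azure_sender.py -- fill in your own Azure credentials.
SUB_KEY = '<your-azure-subscription-key>'
END_POINT = 'https://<your-resource>.cognitiveservices.azure.com/'
```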
If you want the raw predictions returned by AWS, AZU, and GCP, please send an email to [email protected]. To be honest, I'm too lazy to sort through the cluttered files if no one needs them, but if you really do, feel free to ask me :).
This table is stored in `src/scripts/word2num.json`; most of its entries are taken from WordNet.
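Loading the table is a one-liner; the sketch below assumes it is a flat label-to-id mapping:

```python
import json

# Map detected label strings to numeric class ids (entries mostly from WordNet).
with open('src/scripts/word2num.json') as f:
    word2num = json.load(f)
```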
I borrowed the pre-trained models from Detectron2 and the TensorFlow Model Garden to simulate 7 MLaaSes. If you want to replicate the experiment, please transform the predictions of these models into the format used in `data/train/google`. For more details, please refer to `src/simulate`.
```python
# Pre-trained detectors used to simulate the 7 MLaaSes;
# the trailing comments are the models' COCO mAP scores.
MODELS_SELECT = {
    0: 'centernet_resnet50_v1_fpn_512x512_kpts_coco17_tpu-8',  # 29.3
    1: 'centernet_resnet50_v2_512x512_kpts_coco17_tpu-8',      # 27.6
    2: 'ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8',        # 22.2
    3: 'ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8',        # 28.2
    4: 'ssd_mobilenet_v2_320x320_coco17_tpu-8',                # 20.2
    5: 'faster_rcnn_resnet50_v1_640x640_coco17_tpu-8',         # 29.3
    6: 'ssd_mobilenet_v1_fpn_640x640_coco17_tpu-8',            # 29.1
}
```
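For instance, resolving the checkpoint name of one simulated service (illustrative):

```python
# Illustrative lookup: checkpoint name for simulated MLaaS #5.
checkpoint = MODELS_SELECT[5]  # 'faster_rcnn_resnet50_v1_640x640_coco17_tpu-8'
```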
If you use any part of our code or dataset in your research, please cite our paper:
```bibtex
@inproceedings{xie2022cost,
  title={Cost Effective MLaaS Federation: A Combinatorial Reinforcement Learning Approach},
  author={Xie, Shuzhao and Xue, Yuan and Zhu, Yifei and Wang, Zhi},
  booktitle={IEEE INFOCOM 2022 - IEEE Conference on Computer Communications},
  pages={1--10},
  year={2022},
  organization={IEEE}
}
```
Shuzhao Xie thanks Chen Tang, Jiahui Ye, Shiji Zhou, and Wenwu Zhu for their help in making this work possible. Yifei Zhu's work is funded by the SJTU Explore-X grant. This work is supported in part by NSFC (Grant No. 61872215), and Shenzhen Science and Technology Program (Grant No. RCYX20200714114523079). We would like to thank Tencent for sponsoring the research.
There are also many other interesting discoveries; I will summarize them when I get a day off. You can also find them by yourself.