Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
0 parents
commit 6345d9f
Showing 78 changed files with 3,048,770 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
MIT License | ||
|
||
Copyright (c) 2021-present NTU S-Lab | ||
|
||
Permission is hereby granted, free of charge, to any person obtaining a copy | ||
of this software and associated documentation files (the "Software"), to deal | ||
in the Software without restriction, including without limitation the rights | ||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
copies of the Software, and to permit persons to whom the Software is | ||
furnished to do so, subject to the following conditions: | ||
|
||
The above copyright notice and this permission notice shall be included in all | ||
copies or substantial portions of the Software. | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | ||
SOFTWARE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,179 @@ | ||
# Artifact for SC '21 | ||
|
||
|
||
This repository contains the artifact for the SC '21 paper "*Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters*". It includes the following four parts: | ||
|
||
+ `enviornment`: The experimental environment in ***Appendix: Artifact Description/Artifact Evaluation***. | ||
|
||
+ `data`: Helios traces downloaded from [HeliosData](https://github.com/S-Lab-System-Group/HeliosData). | ||
|
||
+ `analysis`: It contains scripts for analyzing traces. | ||
|
||
+ `framework`: It contains the `QSSF Service` and `CES Service` scripts. | ||
|
||
|
||
|
||
> **Note that only the `Venus` trace is publicly available now. The other traces are still under review, and we will release them as soon as possible.** | ||
## Detailed Introduction | ||
|
||
### `enviornment` | ||
Provides details on the experimental environment, as shown in ***Appendix: Artifact Description/Artifact Evaluation***. | ||
|
||
+ `collect_environment.sh`: Gathers execution environment information for the GPU compute node and the analysis platform. | ||
|
||
+ `env_analysis_platform`: Execution environment information for the trace analysis platform. | ||
|
||
+ `env_datacenter_node`: Execution environment information for a GPU compute node in our datacenter (from the Volta Cluster). | ||
|
||
+ ***Summary*** | ||
|
||
| | Analysis Platform | Datacenter Node | | ||
| ------- | ------------------- | ------------------------ | | ||
| System | Ubuntu 20.04 LTS | CentOS 7.4 | | ||
| CPU | Intel Core i9-10900 | 2 x Intel Xeon Gold 6146 | | ||
| Memory | 32GB DDR4 | 376GB DDR4 | | ||
| GPU | GeForce RTX 2080 Ti | 8 x Tesla V100-SXM2 | | ||
| Network | Ethernet | InfiniBand EDR | | ||
|
||
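For a quick look at the same kind of information from Python, a rough analogue can be sketched with the standard library. This is not a port of `collect_environment.sh`, whose contents are not shown here; the function name `basic_environment` is hypothetical, and it reports far less than the table above (no GPU or network details).

```python
import os
import platform
import sys


def basic_environment():
    """Collect a few basic facts about the current machine.

    Hypothetical helper for illustration; it is not part of the artifact.
    """
    return {
        "system": platform.system(),
        "release": platform.release(),
        "machine": platform.machine(),
        "python": sys.version.split()[0],
        "cpu_count": os.cpu_count(),
    }


if __name__ == "__main__":
    for key, value in basic_environment().items():
        print(f"{key}: {value}")
```
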
### `data` | ||
Initially, this folder does ***NOT exist***. You need to download and unzip the dataset from [HeliosData](https://github.com/S-Lab-System-Group/HeliosData). After that, the folder structure should be: | ||
|
||
|
||
``` | ||
📦data | ||
┣ 📂Earth | ||
┃ ┣ 📜cluster_gpu_number.csv | ||
┃ ┗ 📜cluster_log.csv | ||
┣ 📂Saturn | ||
┃ ┣ 📜cluster_gpu_number.csv | ||
┃ ┗ 📜cluster_log.csv | ||
┣ 📂Uranus | ||
┃ ┣ 📜cluster_gpu_number.csv | ||
┃ ┗ 📜cluster_log.csv | ||
┗ 📂Venus | ||
┃ ┣ 📜cluster_gpu_number.csv | ||
┃ ┗ 📜cluster_log.csv | ||
``` | ||
|
||
> **Note that only the `Venus` trace is publicly available now.** | ||
|
||
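Before running the analysis, it can help to confirm that this layout is in place. The following is a minimal sketch and not part of the artifact: the helper name `missing_trace_files` is hypothetical, and the file names are taken from the tree above. It defaults to checking `Venus` only, since that is the trace currently available.

```python
from pathlib import Path

# File names taken from the folder structure shown above.
EXPECTED_FILES = ("cluster_gpu_number.csv", "cluster_log.csv")


def missing_trace_files(data_root, clusters=("Venus",)):
    """Return the expected trace files that are absent under data_root.

    Hypothetical helper for illustration; it is not part of the artifact.
    """
    root = Path(data_root)
    missing = []
    for cluster in clusters:
        for name in EXPECTED_FILES:
            path = root / cluster / name
            if not path.is_file():
                missing.append(path)
    return missing


if __name__ == "__main__":
    for path in missing_trace_files("./data"):
        print(f"missing: {path}")
```
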
### `analysis` | ||
Contains parsing and plotting code to analyze traces. | ||
|
||
+ **compare with Philly trace**: Figure 1: Comparisons of job characteristics between Helios and Philly. | ||
|
||
+ **cluster characterization**: Figure 2: Daily pattern of the cluster usage in Helios. | ||
|
||
Figure 3: Monthly trends of cluster activities in Helios. | ||
|
||
Figure 4: The boxplot of utilization distributions for the top 10 largest VCs of Earth in May (sorted by size). | ||
|
||
+ **job characterization**: Figure 5: CDF of GPU (a) and CPU (b) job duration. | ||
|
||
Figure 6: The CDFs of job sizes (in GPU number) with the number of jobs (a) and GPU time (b). | ||
|
||
Figure 7: Distribution of jobs by their final statuses. | ||
|
||
|
||
|
||
+ **user characterization**: Figure 8: The CDFs of users that consume the cluster resources in terms of (a) GPU Time (b) CPU Time. | ||
|
||
Figure 9: (a) CDFs of users w.r.t. GPU job queuing delay. (b) Distributions of user GPU job completion ratios. | ||
|
||
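Several of the figures listed above are empirical CDFs. As a reminder of what such a curve represents, here is a minimal sketch using only the standard library. It is not the notebooks' plotting code, the function name `empirical_cdf` is hypothetical, and the sample durations are made up for illustration.

```python
def empirical_cdf(values):
    """Return (sorted_values, cumulative_fractions) for an empirical CDF.

    For distinct values, the i-th fraction is the share of samples that are
    less than or equal to the i-th sorted value.
    """
    ordered = sorted(values)
    n = len(ordered)
    fractions = [(i + 1) / n for i in range(n)]
    return ordered, fractions


# Made-up job durations in seconds, for illustration only.
durations = [30, 5, 600, 120, 45]
xs, ys = empirical_cdf(durations)
for x, y in zip(xs, ys):
    print(f"{y:.0%} of jobs ran for at most {x} s")
```
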
|
||
|
||
### `framework` | ||
A prediction-based GPU resource management framework. | ||
|
||
This folder contains `QSSF Service` and `CES Service` scripts and related data. | ||
|
||
|
||
|
||
## Quick Start | ||
These scripts have been tested on Ubuntu 20.04 with Python 3.8 (on the analysis platform). | ||
|
||
Here are the ***step-by-step*** instructions for the artifact. | ||
### Preparing | ||
|
||
1. Download the Helios artifact and data repositories. | ||
```bash | ||
git clone git@github.com:S-Lab-System-Group/HeliosArtifact.git | ||
cd HeliosArtifact | ||
|
||
git clone git@github.com:S-Lab-System-Group/HeliosData.git | ||
mv ./HeliosData/data ./ | ||
``` | ||
|
||
2. Check software dependencies: | ||
|
||
For the `analysis` part, JupyterLab or Jupyter Notebook is required. | ||
|
||
For the other Python libraries used in this project, see `requirements.txt`. | ||
|
||
|
||
### Reproducing `analysis` | ||
|
||
3. Prepare and parse the trace files for analysis. | ||
```bash | ||
cd analysis | ||
python ./trace_parser.py --cluster-list 'Venus' | ||
``` | ||
4. After generating all required data, you can analyze the traces through the `.ipynb` files in the four sub-folders of `analysis`: **1_compare with Philly trace**, **2_cluster characterization**, **3_job characterization**, **4_user characterization**. | ||
|
||
These Jupyter Notebook scripts are used for generating figures of the trace analysis part of the paper. | ||
|
||
> **Note that only the `Venus` trace is publicly available now. Thus, some generated figures are incomplete compared with the paper version.** | ||
|
||
|
||
### Reproducing `framework` | ||
|
||
|
||
#### `QSSF Service` | ||
|
||
5. Before executing the simulation of QSSF service, data preparation is needed. | ||
|
||
The preparation script generates the VC configuration and job trace for each cluster. | ||
|
||
```bash | ||
cd framework/QSSF\ Service/data | ||
bash prepare_data.sh | ||
``` | ||
|
||
6. Then, you can run all scheduling policies on the **Philly** trace using `sweep` mode, as shown below: | ||
|
||
```bash | ||
cd .. | ||
python simulator.py -e='Philly' -t='./data/Philly' --sweep | ||
``` | ||
|
||
See `run.sh` for more usage examples on **Helios**. Note that since we do not release job name information, the `estimator` and `qssf policy` are not available for **Helios**. | ||
|
||
|
||
|
||
7. After the program is executed, you can check the result in the `log` folder. The job log and time sequence of each VC are provided separately. | ||
|
||
8. In addition, we provide a simulation analysis and plotting script in `plot`. | ||
|
||
You can generate Figure 13 in the paper through this script. | ||
|
||
#### `CES Service` | ||
|
||
9. Run CES simulation on **Helios**: | ||
|
||
```bash | ||
cd framework/CES\ Service | ||
python CES_Helios.py | ||
``` | ||
|
||
You can specify a different cluster in the script and adjust the configuration of the CES service through the `hyperparameter` function. | ||
|
||
|
||
10. Similarly, run CES simulation on **Philly**: | ||
|
||
```bash | ||
python CES_Philly.py | ||
``` | ||
|
||
11. From the code output and generated figures `helios_ces` (Figure 14) & `philly_ces` (Figure 15), we can analyze the CES service performance in detail. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
+ `philly_trace.csv` | ||
|
||
It is used to compare with our datacenter workloads. | ||
|
||
We convert the original Philly trace file into `.csv` format and select the same period of job logs as described in ["Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads"](https://www.usenix.org/system/files/atc19-jeon.pdf) (ATC’19). | ||
|
||
The official public data can be downloaded from [philly-traces](https://github.com/msr-fiddle/philly-traces). | ||
|
||
+ `philly_trace_B.csv` | ||
|
||
In Philly, failed jobs are retried a fixed number of times. Treating each attempt as an individual job when processing the Philly trace yields `philly_trace_B.csv`. |
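The difference between `philly_trace.csv` and `philly_trace_B.csv` can be sketched as follows. This is an illustrative example, not the authors' conversion script: the function name `split_attempts` and the record layout (`job_id`, `attempts`, `start`, `end`) are hypothetical and do not reflect the actual Philly trace schema.

```python
def split_attempts(jobs):
    """Expand each job into one record per attempt.

    `jobs` is a list of dicts with a hypothetical layout:
    {"job_id": str, "attempts": [{"start": int, "end": int}, ...]}.
    """
    records = []
    for job in jobs:
        for index, attempt in enumerate(job["attempts"]):
            records.append(
                {
                    # Suffix the attempt index so every record has a unique id.
                    "job_id": f'{job["job_id"]}-attempt{index}',
                    "start": attempt["start"],
                    "end": attempt["end"],
                }
            )
    return records


# One job that was retried once, and one job that ran a single attempt.
jobs = [
    {"job_id": "a", "attempts": [{"start": 0, "end": 10}, {"start": 12, "end": 40}]},
    {"job_id": "b", "attempts": [{"start": 5, "end": 8}]},
]
print(len(jobs), "jobs ->", len(split_attempts(jobs)), "per-attempt records")
```

Counting per attempt increases the number of jobs and shortens individual durations, so statistics computed from the two files are not directly interchangeable.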
Binary file not shown.
227 changes: 227 additions & 0 deletions
analysis/1_compare with Philly trace/compare_with_Philly_trace.ipynb
6 changes: 6 additions & 0 deletions
analysis/1_compare with Philly trace/helios_cluster_summary.csv
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
id,gpu_num,job_num,cpu_job_num,gpu_job_num,avg_run_time_cpu,avg_run_time_gpu,avg_que_time_cpu,avg_que_time_gpu,avg_gpu_num,med_run_time_cpu,med_run_time_gpu,med_que_time_cpu,med_que_time_gpu,med_gpu_num,complete_rate,cancel_rate,fail_rate,complete_rate_cpu,cancel_rate_cpu,fail_rate_cpu,complete_rate_gpu,cancel_rate_gpu,fail_rate_gpu,complete_gpu_time,cancel_gpu_time,fail_gpu_time | ||
Venus,1022,246708,121405,125303,1649.859,13040.598,773.009,1253.288,6.736,21.0,204.0,0.0,0.0,1.0,0.69,0.186,0.124,0.86,0.097,0.043,0.526,0.272,0.202,6373315670.0,4920266598.0,1052030340.0 | ||
Earth,997,872886,445738,427148,162.73,5130.609,3.281,319.483,2.101,1.0,234.0,0.0,0.0,1.0,0.812,0.069,0.119,0.885,0.005,0.11,0.735,0.136,0.129,5713038873.0,4793853520.0,932310362.0 | ||
Saturn,2080,1753078,1054182,698896,619.062,5252.927,16.255,611.561,4.01,2.0,124.0,0.0,0.0,1.0,0.799,0.12,0.08,0.943,0.019,0.037,0.582,0.272,0.145,14375709962.0,10733792235.0,2542859066.0 | ||
Uranus,2119,490309,161192,329117,1211.799,9163.725,119.704,1949.182,4.038,32.0,280.0,0.0,0.0,1.0,0.668,0.173,0.159,0.791,0.116,0.092,0.607,0.201,0.192,13532231815.0,10267196772.0,2761134934.0 | ||
total,6220,3362981,1782517,1580464,628.759,6651.681,73.907,862.047,3.716,2.0,206.0,0.0,0.0,1.0,0.775,0.119,0.105,0.909,0.03,0.061,0.624,0.221,0.155,39994296320.0,30715109125.0,7288334702.0 |
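As a sanity check on the summary above, the per-cluster job counts can be summed and compared with the `total` row. The sketch below is illustrative only: it embeds the first five columns of the rows shown above instead of reading the file, and the helper names `load_rows` and `gpu_job_share` are hypothetical.

```python
import csv
import io

# First five columns of helios_cluster_summary.csv, copied from the rows above.
SUMMARY = """id,gpu_num,job_num,cpu_job_num,gpu_job_num
Venus,1022,246708,121405,125303
Earth,997,872886,445738,427148
Saturn,2080,1753078,1054182,698896
Uranus,2119,490309,161192,329117
total,6220,3362981,1782517,1580464
"""


def load_rows(text):
    """Parse the CSV text into a dict keyed by cluster id, with integer values."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        row["id"]: {key: int(value) for key, value in row.items() if key != "id"}
        for row in reader
    }


def gpu_job_share(row):
    """Fraction of a cluster's jobs that are GPU jobs (hypothetical helper)."""
    return row["gpu_job_num"] / row["job_num"]


rows = load_rows(SUMMARY)
clusters = [name for name in rows if name != "total"]

# The job counts of the four clusters add up to the `total` row.
for column in ("job_num", "cpu_job_num", "gpu_job_num"):
    assert sum(rows[c][column] for c in clusters) == rows["total"][column]

for name in clusters:
    print(f"{name}: {gpu_job_share(rows[name]):.1%} of jobs are GPU jobs")
```
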