Commit 6345d9f: initial commit
Qinghao-Hu committed Jul 20, 2021 (0 parents)
Showing 78 changed files with 3,048,770 additions and 0 deletions.
21 changes: 21 additions & 0 deletions LICENSE.txt
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2021-present NTU S-Lab

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
179 changes: 179 additions & 0 deletions README.md
@@ -0,0 +1,179 @@
# Artifact for SC '21


This repository contains the artifact for the SC '21 paper "*Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters*". It includes the following four parts:

+ `enviornment`: The experimental environment described in ***Appendix: Artifact Description/Artifact Evaluation***.

+ `data`: The Helios traces, downloaded from [HeliosData](https://github.com/S-Lab-System-Group/HeliosData).

+ `analysis`: Scripts for analyzing the traces.

+ `framework`: The `QSSF Service` and `CES Service` scripts.



> **Note that only the `Venus` trace is publicly available now. The other traces are still under review; we will release them as soon as possible.**

## Detailed Introduction

### `enviornment`
Provides details on the experimental environment, as reported in ***Appendix: Artifact Description/Artifact Evaluation***.

+ `collect_environment.sh`: Gathers execution environment information for the GPU compute nodes and the analysis platform (a rough Python approximation appears after the summary table below).

+ `env_analysis_platform`: Execution environment information for the trace analysis platform.

+ `env_datacenter_node`: Execution environment information for a GPU compute node in our datacenter (from the Volta cluster).

+ ***Summary***

| | Analysis Platform | Datacenter Node |
| ------- | ------------------- | ------------------------ |
| System | Ubuntu 20.04 LTS | CentOS 7.4 |
| CPU | Intel Core i9-10900 | 2 x Intel Xeon Gold 6146 |
| Memory | 32GB DDR4 | 376GB DDR4 |
| GPU | GeForce RTX 2080 Ti | 8 x Tesla V100-SXM2 |
| Network | Ethernet | InfiniBand EDR |
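
The following is a rough sketch, in Python for illustration, of the kind of facts `collect_environment.sh` records; the actual script is shell-based, and the availability of `lscpu` and `nvidia-smi` on the node is an assumption here:

```python
# Illustrative sketch only: gathers OS, CPU, and GPU facts similar to
# what collect_environment.sh records. Assumes lscpu and nvidia-smi
# are on PATH; the real script is shell-based and more thorough.
import platform
import subprocess

def run(cmd):
    """Return a command's stdout, or a placeholder if it fails."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True,
                              check=True).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "<unavailable>"

# Pick the "Model name" row out of lscpu's key: value output.
cpu_model = next((line.split(":", 1)[1].strip()
                  for line in run(["lscpu"]).splitlines()
                  if line.startswith("Model name")), "<unknown>")

print("System:", platform.platform())
print("CPU   :", cpu_model)
print("GPU   :", run(["nvidia-smi", "--query-gpu=name",
                      "--format=csv,noheader"]))
```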

### `data`
Initially, this folder does ***NOT*** exist. You need to download and unzip the dataset from [HeliosData](https://github.com/S-Lab-System-Group/HeliosData). After that, the folder structure should be:


```
📦data
┣ 📂Earth
┃ ┣ 📜cluster_gpu_number.csv
┃ ┗ 📜cluster_log.csv
┣ 📂Saturn
┃ ┣ 📜cluster_gpu_number.csv
┃ ┗ 📜cluster_log.csv
┣ 📂Uranus
┃ ┣ 📜cluster_gpu_number.csv
┃ ┗ 📜cluster_log.csv
┗ 📂Venus
┃ ┣ 📜cluster_gpu_number.csv
┃ ┗ 📜cluster_log.csv
```
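
As a quick sanity check, this minimal sketch (assuming the layout above) verifies that the expected trace files are in place; until the other traces are released, only `Venus` will be present:

```python
# Minimal layout check for the unzipped HeliosData dataset.
from pathlib import Path

for cluster in ["Earth", "Saturn", "Uranus", "Venus"]:
    for name in ["cluster_gpu_number.csv", "cluster_log.csv"]:
        path = Path("data") / cluster / name
        print(f"{path}: {'OK' if path.exists() else 'MISSING'}")
```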

> **Note that only the `Venus` trace is publicly available now.**

### `analysis`
Contains the parsing and plotting code used to analyze the traces.

+ **compare with Philly trace**:
  + Figure 1: Comparisons of job characteristics between Helios and Philly.

+ **cluster characterization**:
  + Figure 2: Daily pattern of the cluster usage in Helios.
  + Figure 3: Monthly trends of cluster activities in Helios.
  + Figure 4: The boxplot of utilization distributions for the top 10 largest VCs of Earth in May (sorted by size).

+ **job characterization**:
  + Figure 5: CDF of GPU (a) and CPU (b) job duration.
  + Figure 6: The CDFs of job sizes (in GPU number) with the number of jobs (a) and GPU time (b).
  + Figure 7: Distribution of jobs by their final statuses.

+ **user characterization**:
  + Figure 8: The CDFs of users that consume the cluster resources in terms of (a) GPU Time and (b) CPU Time.
  + Figure 9: (a) CDFs of users w.r.t. GPU job queuing delay. (b) Distributions of user GPU job completion ratios.


### `framework`
A prediction-based GPU resource management framework.

This folder contains the `QSSF Service` and `CES Service` scripts and related data.



## Quick Start
These scripts have been tested on Ubuntu 20.04 with Python 3.8 (on the analysis platform).

Here are the ***step-by-step*** instructions for the artifact evaluation.
### Preparing

1. Download Helios artifact and data repository.
```bash
git clone git@github.com:S-Lab-System-Group/HeliosArtifact.git
cd HeliosArtifact

git clone git@github.com:S-Lab-System-Group/HeliosData.git
mv ./HeliosData/data ./
```

2. Check software dependencies:

For the `analysis` part, JupyterLab / Jupyter Notebook is needed.

The other Python libraries used in this project are listed in `requirements.txt`; you can install them with `pip install -r requirements.txt`.


### Reproducing `analysis`

3. Prepare and parse the trace files for analysis.
```bash
cd analysis
python ./trace_parser.py --cluster-list 'Venus'
```
4. After generating all required data, you can analyze the traces through the `.ipynb` files within the 4 sub-folders of `analysis`: **1_compare with Philly trace**, **2_cluster characterization**, **3_job characterization**, **4_user characterization**.

These Jupyter Notebook scripts generate the figures for the trace analysis part of the paper.
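
If you want to inspect the trace outside the notebooks, a minimal pandas sketch like the one below works; the column names (`gpu_num`, `duration`, `state`) are assumptions, so check the header of `cluster_log.csv` for the actual schema:

```python
# Hedged sketch: peek at the Venus trace with pandas, run from the
# analysis/ folder. Column names (gpu_num, duration, state) are
# assumptions -- verify them against the printed schema first.
import pandas as pd

df = pd.read_csv("../data/Venus/cluster_log.csv")
print(df.columns.tolist())                 # inspect the real schema

gpu_jobs = df[df["gpu_num"] > 0]           # assumed column name
print("GPU jobs:", len(gpu_jobs))
print(gpu_jobs["duration"].describe())     # assumed column, seconds
print(gpu_jobs["state"].value_counts(normalize=True))  # assumed column
```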

> **Note that only the `Venus` trace is publicly available now. Thus, some generated figures are incomplete compared with the paper versions.**


### Reproducing `framework`


#### `QSSF Service`

5. Before executing the QSSF service simulation, data preparation is needed.

This step generates the VC configuration and job trace for each cluster.

```bash
cd framework/QSSF\ Service/data
bash prepare_data.sh
```

6. Then, you can run all scheduling policies on the **Philly** trace using `sweep` mode, as below:

```bash
cd ..
python simulator.py -e='Philly' -t='./data/Philly' --sweep
```

See `run.sh` for more usage examples on **Helios**. Note that since we do not release job name information, the `estimator` and the `qssf` policy are not available for **Helios**.



7. After the program finishes, you can check the results in the `log` folder. The job log and the time sequence of each VC are provided separately.

8. Besides, we provide simulation analysis and plotting scripts in `plot`.

You can generate Figure 13 of the paper with these scripts.

#### `CES Service`

9. Run CES simulation on **Helios**:

```bash
cd framework/CES\ Service
python CES_Helios.py
```

You can specify a different cluster in the script and adjust the configurations of the CES service through the `hyperparameter` function.


10. Similarly, run the CES simulation on **Philly**:

```bash
python CES_Philly.py
```

11. From the code output and generated figures `helios_ces` (Figure 14) & `philly_ces` (Figure 15), we can analyze the CES service performance in detail.
11 changes: 11 additions & 0 deletions analysis/1_compare with Philly trace/README.md
@@ -0,0 +1,11 @@
+ `philly_trace.csv`

It is used to compare with our datacenter workloads.

We converted the original Philly trace file into `.csv` format and selected the same period of job logs as described in ["Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads"](https://www.usenix.org/system/files/atc19-jeon.pdf) (ATC '19).

The official public data can be downloaded from [philly-traces](https://github.com/msr-fiddle/philly-traces).

+ `philly_trace_B.csv`

In Philly, failed jobs are retried a fixed number of times. If we process the Philly trace by treating each attempt as an individual job, we obtain `philly_trace_B.csv`.
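
For intuition, here is a hedged sketch of how such a per-attempt view can be produced, assuming the raw Philly log stores a list of attempts per job (as in the `philly-traces` repository); the field names are illustrative and should be checked against the actual file:

```python
# Hedged sketch: flatten the raw Philly job log so each retry attempt
# becomes an individual job record. Field names ("jobid", "attempts",
# "start_time", "end_time") are illustrative assumptions.
import csv
import json

with open("cluster_job_log") as f:          # file from philly-traces
    jobs = json.load(f)

with open("philly_trace_B_sketch.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["jobid", "attempt", "start_time", "end_time"])
    for job in jobs:
        # One output row per attempt, instead of one per job.
        for i, attempt in enumerate(job.get("attempts", [])):
            writer.writerow([job.get("jobid"), i,
                             attempt.get("start_time"),
                             attempt.get("end_time")])
```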
227 changes: 227 additions & 0 deletions analysis/1_compare with Philly trace/compare_with_Philly_trace.ipynb


@@ -0,0 +1,6 @@
id,gpu_num,job_num,cpu_job_num,gpu_job_num,avg_run_time_cpu,avg_run_time_gpu,avg_que_time_cpu,avg_que_time_gpu,avg_gpu_num,med_run_time_cpu,med_run_time_gpu,med_que_time_cpu,med_que_time_gpu,med_gpu_num,complete_rate,cancel_rate,fail_rate,complete_rate_cpu,cancel_rate_cpu,fail_rate_cpu,complete_rate_gpu,cancel_rate_gpu,fail_rate_gpu,complete_gpu_time,cancel_gpu_time,fail_gpu_time
Venus,1022,246708,121405,125303,1649.859,13040.598,773.009,1253.288,6.736,21.0,204.0,0.0,0.0,1.0,0.69,0.186,0.124,0.86,0.097,0.043,0.526,0.272,0.202,6373315670.0,4920266598.0,1052030340.0
Earth,997,872886,445738,427148,162.73,5130.609,3.281,319.483,2.101,1.0,234.0,0.0,0.0,1.0,0.812,0.069,0.119,0.885,0.005,0.11,0.735,0.136,0.129,5713038873.0,4793853520.0,932310362.0
Saturn,2080,1753078,1054182,698896,619.062,5252.927,16.255,611.561,4.01,2.0,124.0,0.0,0.0,1.0,0.799,0.12,0.08,0.943,0.019,0.037,0.582,0.272,0.145,14375709962.0,10733792235.0,2542859066.0
Uranus,2119,490309,161192,329117,1211.799,9163.725,119.704,1949.182,4.038,32.0,280.0,0.0,0.0,1.0,0.668,0.173,0.159,0.791,0.116,0.092,0.607,0.201,0.192,13532231815.0,10267196772.0,2761134934.0
total,6220,3362981,1782517,1580464,628.759,6651.681,73.907,862.047,3.716,2.0,206.0,0.0,0.0,1.0,0.775,0.119,0.105,0.909,0.03,0.061,0.624,0.221,0.155,39994296320.0,30715109125.0,7288334702.0
