Stabilize the metrics
xuzhao9 committed Jan 24, 2025
1 parent 666eb49 commit c0f0313
Showing 4 changed files with 39 additions and 2 deletions.
13 changes: 13 additions & 0 deletions .ci/gpu/reset-gcp-h100.sh
@@ -0,0 +1,13 @@
#!/usr/bin/env bash
# Script to reset NVIDIA H100 GPU on GCP
# To restore the default GPU state after benchmarking

# Reset GPU and Memory clocks
sudo nvidia-smi -rgc
sudo nvidia-smi -rmc

# Restore the default power limit (500W)
sudo nvidia-smi -pl 500

# Disable persistence mode
sudo nvidia-smi -pm 0
16 changes: 16 additions & 0 deletions .ci/gpu/tune-gcp-h100.sh
@@ -0,0 +1,16 @@
#!/usr/bin/env bash
# Script to tune NVIDIA H100 GPU on GCP
# To stabilize performance

set -ex

# Enable persistence mode
sudo nvidia-smi -pm 1
# Lock power limit to 650W
sudo nvidia-smi -pl 650

# Default Memory Frequency: 2619 MHz
# Default Graphics Frequency: 1980 MHz
sudo nvidia-smi -lgc 1980,1980
sudo nvidia-smi -lmc 2619,2619
sudo nvidia-smi -ac 2619,1980
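After the tuning script runs, it can be useful to confirm that the settings actually took effect. The following is a minimal sketch, not part of the commit, that parses one row of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output and compares it with the values set above. The helper names (`parse_gpu_row`, `find_mismatches`, `query_gpus`) are hypothetical, and the sketch assumes that the reported current clocks match the locked values within a small tolerance.

```python
import subprocess

# Fields requested from nvidia-smi, in the order they appear in each CSV row.
QUERY_FIELDS = ["persistence_mode", "power.limit", "clocks.gr", "clocks.mem"]

# Target values taken from tune-gcp-h100.sh (watts and MHz).
EXPECTED = {"power.limit": 650.0, "clocks.gr": 1980.0, "clocks.mem": 2619.0}


def parse_gpu_row(row):
    """Parse one CSV row into a dict keyed by QUERY_FIELDS."""
    values = [v.strip() for v in row.split(",")]
    if len(values) != len(QUERY_FIELDS):
        raise ValueError(f"expected {len(QUERY_FIELDS)} fields, got {len(values)}")
    parsed = dict(zip(QUERY_FIELDS, values))
    for key in EXPECTED:
        parsed[key] = float(parsed[key])
    return parsed


def find_mismatches(parsed, tolerance=1.0):
    """Return human-readable descriptions of settings that differ from EXPECTED."""
    problems = []
    if parsed["persistence_mode"] != "Enabled":
        problems.append(f"persistence_mode: got {parsed['persistence_mode']}")
    for key, want in EXPECTED.items():
        if abs(parsed[key] - want) > tolerance:
            problems.append(f"{key}: expected {want}, got {parsed[key]}")
    return problems


def query_gpus():
    """Query every visible GPU; requires nvidia-smi on PATH."""
    cmd = [
        "nvidia-smi",
        f"--query-gpu={','.join(QUERY_FIELDS)}",
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return [parse_gpu_row(line) for line in out.splitlines() if line.strip()]
```

On a tuned runner, `[find_mismatches(gpu) for gpu in query_gpus()]` would be expected to contain only empty lists; a non-empty entry points at the setting that did not stick.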
10 changes: 8 additions & 2 deletions .github/workflows/_linux-benchmark-h100.yml
@@ -36,7 +36,7 @@ jobs:
submodules: recursive
- name: Tune Nvidia GPU
run: |
- sudo nvidia-smi -pm 1
+ bash .ci/gpu/tune-gcp-h100.sh
sudo ldconfig
nvidia-smi
- name: Benchmarking
@@ -52,4 +52,10 @@ jobs:
run: |
. "${SETUP_SCRIPT}"
latest_result_json=$(find ./benchmark-output/ -name "result.json" | sort -r | head -n 1)
- python .ci/upload/scribe.py --json ${latest_result_json}
+ python ./.ci/upload/scribe.py --json ${latest_result_json}
+ - name: Restore Nvidia GPU
+ if: always()
+ run: |
+ bash .ci/gpu/reset-gcp-h100.sh
+ sudo ldconfig
+ nvidia-smi
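The `if: always()` condition makes the restore step run even when an earlier step fails, which is the workflow-level equivalent of a `try`/`finally` block. As an illustration only, the same tune-then-always-reset pattern could be sketched in Python as below. The `tuned_gpu` context manager and the `run_script` helper are hypothetical names and are not part of this commit; the script paths are the ones added above.

```python
import subprocess
from contextlib import contextmanager

TUNE_SCRIPT = ".ci/gpu/tune-gcp-h100.sh"
RESET_SCRIPT = ".ci/gpu/reset-gcp-h100.sh"


def run_script(path):
    """Run a shell script with bash and raise if it exits non-zero."""
    subprocess.run(["bash", path], check=True)


@contextmanager
def tuned_gpu(run=run_script):
    """Tune the GPU for the duration of the block, then always reset it.

    The tune call sits inside the try block so that the reset still runs
    when tuning itself fails, mirroring `if: always()` in the workflow.
    """
    try:
        run(TUNE_SCRIPT)
        yield
    finally:
        run(RESET_SCRIPT)
```

A local benchmarking run could then be wrapped as `with tuned_gpu(): ...`, so that an exception raised inside the block does not leave the clocks locked.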
2 changes: 2 additions & 0 deletions benchmarks/nightly/run.py
@@ -46,6 +46,7 @@ def setup_tritonbench_cwd():
"latency,gbps",
"--num-inputs",
"6",
+ "--cudagraph",
],
"bf16_gemm": [
"--op",
@@ -58,6 +59,7 @@ def setup_tritonbench_cwd():
"latency,tflops",
"--num-inputs",
"4",
+ "--cudagraph",
],
}
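Because the change adds `--cudagraph` to each benchmark entry by hand, a later entry could easily omit it. A small consistency check along the following lines could catch that. This is a sketch under assumptions: `ops_missing_flag` is a hypothetical helper, and the example dictionary uses placeholder entries rather than the real contents of `run.py`.

```python
def ops_missing_flag(operator_args, flag="--cudagraph"):
    """Return the names of benchmark entries whose argument list lacks `flag`."""
    return sorted(name for name, args in operator_args.items() if flag not in args)


# Placeholder entries for illustration; the real dictionary lives in run.py.
example_args = {
    "op_with_flag": ["--num-inputs", "4", "--cudagraph"],
    "op_without_flag": ["--num-inputs", "6"],
}

print(ops_missing_flag(example_args))  # → ['op_without_flag']
```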
