Enable T3K Resnet Tests (#11030)
* #10244: Fix optional output tensor handling for reshard

* #0: Add enable_async_mode device fixture and refactor use_program_cache

* #0: Cleanup ttnn_resnet single device test files

* #10244: Fix multi-device api issues for ttnn resnet tests, and add them to ci

* #10244: Add E2E performance tests for ttnn_resnet on t3000

* #10244: Add t3000 perf results for ttnn_resnet to README

* #0: Remove initial space when parsing perf csv

* #0: Increase timeout for Nightly N300 WH-only models job due to some ci machines being slower than others
tt-aho authored Aug 5, 2024
1 parent 8d4008d commit ec0cc14
Showing 16 changed files with 798 additions and 129 deletions.
@@ -23,7 +23,7 @@ jobs:
 { name: "Common models N300 WH B0", arch: wormhole_b0, cmd: tests/scripts/single_card/nightly/run_common_models.sh, timeout: 40 },
 { name: "GS-only ttnn nightly", arch: grayskull, cmd: tests/scripts/single_card/nightly/run_ttnn.sh, timeout: 40 },
 { name: "GS-only models", arch: grayskull, cmd: tests/scripts/single_card/nightly/run_gs_only.sh, timeout: 40 },
-{ name: "N300 WH-only models", arch: wormhole_b0, cmd: tests/scripts/single_card/nightly/run_wh_b0_only.sh, timeout: 40 },
+{ name: "N300 WH-only models", arch: wormhole_b0, cmd: tests/scripts/single_card/nightly/run_wh_b0_only.sh, timeout: 50 },
 { name: "API tests GS", arch: grayskull, cmd: ./tests/scripts/run_tests.sh --tt-arch grayskull --pipeline-type frequent_api --dispatch-mode fast, timeout: 40 },
 { name: "API tests N300 WH B0", arch: wormhole_b0, cmd: ./tests/scripts/run_tests.sh --tt-arch wormhole_b0 --pipeline-type frequent_api --dispatch-mode fast, timeout: 40 },
 # #9945: Skip SD for now
16 changes: 9 additions & 7 deletions .github/workflows/t3000-frequent-tests.yaml
@@ -17,18 +17,20 @@ jobs:
 fail-fast: false
 matrix:
 test-group: [
-{ name: "t3k tteager tests", arch: wormhole_b0, cmd: run_t3000_tteager_tests, timeout: 60,
+{ name: "t3k tteager tests", arch: wormhole_b0, cmd: run_t3000_tteager_tests, timeout: 60,
 runs-on: ["config-t3000", "in-service", "pipeline-functional"], owner_id: ULMEPM2MA}, #Sean Nijjar
-{ name: "t3k ethernet tests", arch: wormhole_b0, cmd: run_t3000_ethernet_tests, timeout: 60,
+{ name: "t3k ethernet tests", arch: wormhole_b0, cmd: run_t3000_ethernet_tests, timeout: 60,
 runs-on: ["config-t3000", "in-service", "pipeline-functional"], owner_id: ULMEPM2MA}, #Sean Nijjar
-{ name: "t3k trace stress tests", arch: wormhole_b0, cmd: run_t3000_trace_stress_tests, timeout: 120,
+{ name: "t3k trace stress tests", arch: wormhole_b0, cmd: run_t3000_trace_stress_tests, timeout: 120,
 runs-on: ["config-t3000", "in-service", "pipeline-functional"], owner_id: U03NG0A5ND7}, #Aditya Saigal
-{ name: "t3k falcon40b tests", arch: wormhole_b0, cmd: run_t3000_falcon40b_tests, timeout: 120,
-runs-on: ["config-t3000", "in-service", "pipeline-functional"], owner_id: U04S2UV6L8N}, #Sofija Jovic
-{ name: "t3k llama2_70b tests", arch: wormhole_b0, cmd: run_t3000_llama2_70b_tests, timeout: 60,
+{ name: "t3k falcon40b tests", arch: wormhole_b0, cmd: run_t3000_falcon40b_tests, timeout: 120,
+runs-on: ["config-t3000", "in-service", "pipeline-functional"], owner_id: U04S2UV6L8N}, #Sofija Jovic
+{ name: "t3k llama2_70b tests", arch: wormhole_b0, cmd: run_t3000_llama2_70b_tests, timeout: 60,
 runs-on: ["config-t3000", "in-service", "pipeline-functional"], owner_id: U03FJB5TM5Y}, #Colman Glagovich
-{ name: "t3k mixtral tests", arch: wormhole_b0, cmd: run_t3000_mixtral_tests, timeout: 60,
+{ name: "t3k mixtral tests", arch: wormhole_b0, cmd: run_t3000_mixtral_tests, timeout: 60,
 runs-on: ["config-t3000", "in-service", "pipeline-functional"], owner_id: U03PUAKE719}, #Miguel Tairum Cruz
+{ name: "t3k resnet tests", arch: wormhole_b0, cmd: run_t3000_resnet_tests, timeout: 30,
+runs-on: ["config-t3000", "in-service", "pipeline-functional"], owner_id: U013121KDH9}, #Austin Ho
 ]
 name: ${{ matrix.test-group.name }}
 env:
2 changes: 2 additions & 0 deletions .github/workflows/t3000-model-perf-tests.yaml
@@ -25,6 +25,8 @@ jobs:
 runs-on: ["arch-wormhole_b0", "config-t3000", "in-service", "pipeline-perf"], owner_id: U03FJB5TM5Y}, # Colman Glagovich
 { name: "t3k LLM falcon40b model perf tests", model: "falcon40b", model-type: "LLM", arch: wormhole_b0, cmd: run_t3000_falcon40b_tests, timeout: 75,
 runs-on: ["arch-wormhole_b0", "config-t3000", "in-service", "pipeline-perf"], owner_id: U053W15B6JF}, # Djordje Ivanovic
+{ name: "t3k LLM resnet50 model perf tests", model: "resnet50", model-type: "CNN", arch: wormhole_b0, cmd: run_t3000_resnet50_tests, timeout: 75,
+runs-on: ["arch-wormhole_b0", "config-t3000", "in-service", "pipeline-perf"], owner_id: U013121KDH9}, # Austin Ho
 #{ name: "t3k CNN model perf tests ", model-type: "CNN", arch: wormhole_b0, cmd: run_t3000_cnn_tests, timeout: 120, owner_id: }, #No tests are being run?
 ]
 name: ${{ matrix.test-group.name }}
16 changes: 8 additions & 8 deletions README.md
@@ -44,12 +44,12 @@
 >
 > Furthermore, all performance numbers here are run or based off an N300 Wormhole card.
-| Model | Last Verified Release | Gen. Token [3] | Batch | End-to-end throughput [1] | Device throughput [2] | Target |
+| Model | Last Verified Release | Gen. Token [3] | Batch | End-to-end throughput [1] | Device throughput [2] | Target |
 |----------------------------------------------------------------------------------------|---------------------------------------------------------------------------|--------------------|----------------------|--------------------------------|------------------------------|----------------|
-| [Falcon7B](./models/demos/wormhole/falcon7b) | [v0.51.0-rc13](https://github.com/tenstorrent/tt-metal/tree/v0.51.0-rc13) | 129th | 32 | 13.7 t/s/u - 438 t/s | 19.5 t/s/u - 624 t/s | 26 |
-| [Mistral-7B](./models/demos/wormhole/mistral7b) | [v0.51.0-rc13](https://github.com/tenstorrent/tt-metal/tree/v0.51.0-rc13) | 129th | 32 | 9.9 t/s/u - 317 t/s | 11.0 t/s/u - 352 t/s | 25 |
+| [Falcon7B](./models/demos/wormhole/falcon7b) | [v0.51.0-rc13](https://github.com/tenstorrent/tt-metal/tree/v0.51.0-rc13) | 129th | 32 | 13.7 t/s/u - 438 t/s | 19.5 t/s/u - 624 t/s | 26 |
+| [Mistral-7B](./models/demos/wormhole/mistral7b) | [v0.51.0-rc13](https://github.com/tenstorrent/tt-metal/tree/v0.51.0-rc13) | 129th | 32 | 9.9 t/s/u - 317 t/s | 11.0 t/s/u - 352 t/s | 25 |
 | [Mamba-2.8B](./models/demos/wormhole/mamba) | [v0.51.0-rc13](https://github.com/tenstorrent/tt-metal/tree/v0.51.0-rc13) | any | 32 | 11.6 t/s/u - 371 t/s | 16.5 t/s/u - 528 t/s | 41 |
-| [LLaMA-3.1-8B](./models/demos/wormhole/llama31_8b) | [v0.51.0-rc13](https://github.com/tenstorrent/tt-metal/tree/v0.51.0-rc13) | 129th | 8 | 8.3 t/s/u - 66.0 t/s | 9.7 t/s/u - 77.9 t/s | 23 |
+| [LLaMA-3.1-8B](./models/demos/wormhole/llama31_8b) | [v0.51.0-rc13](https://github.com/tenstorrent/tt-metal/tree/v0.51.0-rc13) | 129th | 8 | 8.3 t/s/u - 66.0 t/s | 9.7 t/s/u - 77.9 t/s | 23 |
 | [BERT-Large](./models/demos/metal_BERT_large_11/) (sen/s) [4] | | | 8 | 270 | 340 | 400 |
 | [Stable Diffusion 1.4](./models/demos/wormhole/stable_diffusion) 512x512 (sec/img) [5] | | | 1 | 6 | 5 | 3 |
 | [ResNet-50](./models/demos/ttnn_resnet) (fps) | | | 16 | 4,300 | 5,550 | 7,000 |
@@ -66,14 +66,14 @@

 ## TT-QuietBox & TT-LoudBox (2x4 mesh of WHs) Models

-| Model | Last Verified Release | Technique | Gen. Token [3] | Batch | End-to-end throughput [1] | Device throughput [2] | Target |
+| Model | Last Verified Release | Technique | Gen. Token [3] | Batch | End-to-end throughput [1] | Device throughput [2] | Target |
 |----------------------------------------------------|---------------------------------------------------------------------------|--------------------|---------------------|-----------------------|------------------------------|------------------------------|-----------------|
-| [Falcon7B](./models/demos/t3000/falcon7b) | [v0.51.0-rc13](https://github.com/tenstorrent/tt-metal/tree/v0.51.0-rc13) | Data Parallel | 129th | 256 | 7.6 t/s/u - 1950 t/s | 19.5 t/s/u - 4990 t/s | 26 t/s/u |
+| [Falcon7B](./models/demos/t3000/falcon7b) | [v0.51.0-rc13](https://github.com/tenstorrent/tt-metal/tree/v0.51.0-rc13) | Data Parallel | 129th | 256 | 7.6 t/s/u - 1950 t/s | 19.5 t/s/u - 4990 t/s | 26 t/s/u |
 | [LLaMA-2-70B](./models/demos/t3000/llama2_70b) | [v0.51.0-rc13](https://github.com/tenstorrent/tt-metal/tree/v0.51.0-rc13) | Tensor Parallel | 129th | 32 | 10.4 t/s/u - 333 t/s | 16.6 t/s/u - 531 t/s | 20 t/s/u |
 | [LLaMA-3.1-70B](./models/demos/t3000/llama3_70b) | [v0.51.0-rc13](https://github.com/tenstorrent/tt-metal/tree/v0.51.0-rc13) | Tensor Parallel | 129th | 32 | 10.4 t/s/u - 333 t/s | 15.8 t/s/u - 506 t/s | 20 t/s/u |
-| [Falcon40B](./models/demos/t3000/falcon40b) | [v0.51.0-rc13](https://github.com/tenstorrent/tt-metal/tree/v0.51.0-rc13) | Tensor Parallel | 129th | 32 | 5.3 t/s/u - 168 t/s | 12.2 t/s/u - 390 t/s | 36 t/s/u |
+| [Falcon40B](./models/demos/t3000/falcon40b) | [v0.51.0-rc13](https://github.com/tenstorrent/tt-metal/tree/v0.51.0-rc13) | Tensor Parallel | 129th | 32 | 5.3 t/s/u - 168 t/s | 12.2 t/s/u - 390 t/s | 36 t/s/u |
 | [Mixtral7Bx8](./models/demos/t3000/mixtral8x7b) | [v0.51.0-rc13](https://github.com/tenstorrent/tt-metal/tree/v0.51.0-rc13) | Tensor Parallel | 129th | 32 | 13.3 t/s/u - 426 t/s | 21.4 t/s/u - 685 t/s | 33 t/s/u |
-| ResNet50 | | Data Parallel | coming soon | | | | |
+| [ResNet-50](./models/demos/ttnn_resnet) | | Data Parallel | | 128 | 31,700 | 44,400 | 56,000 |

 ## Model Updates
 For the latest model updates and features, please see [MODEL_UPDATES.md](models/MODEL_UPDATES.md)
49 changes: 30 additions & 19 deletions conftest.py
@@ -271,36 +271,47 @@ def reset_default_device():
     ttl.device.SetDefaultDevice(device)


-@pytest.fixture(scope="function")
-def use_program_cache(request):
-    import tt_lib as ttl
-
+def get_devices(request):
     if "device" in request.fixturenames:
-        dev = request.getfixturevalue("device")
-        dev.enable_program_cache()
+        devices = [request.getfixturevalue("device")]
     elif "all_devices" in request.fixturenames:
         devices = request.getfixturevalue("all_devices")
-        for dev in devices:
-            dev.enable_program_cache()
     elif "pcie_devices" in request.fixturenames:
         devices = request.getfixturevalue("pcie_devices")
-        for dev in devices:
-            dev.enable_program_cache()
     elif "device_mesh" in request.fixturenames:
-        mesh = request.getfixturevalue("device_mesh")
-        for device_id in mesh.get_device_ids():
-            mesh.get_device(device_id).enable_program_cache()
+        devices = request.getfixturevalue("device_mesh").get_devices()
     elif "t3k_device_mesh" in request.fixturenames:
-        mesh = request.getfixturevalue("t3k_device_mesh")
-        for device_id in mesh.get_device_ids():
-            mesh.get_device(device_id).enable_program_cache()
+        devices = request.getfixturevalue("t3k_device_mesh").get_devices()
     elif "pcie_device_mesh" in request.fixturenames:
-        mesh = request.getfixturevalue("pcie_device_mesh")
-        for device_id in mesh.get_device_ids():
-            mesh.get_device(device_id).enable_program_cache()
+        devices = request.getfixturevalue("pcie_device_mesh").get_devices()
     else:
+        devices = []
+    return devices
+
+
+@pytest.fixture(scope="function")
+def use_program_cache(request):
+    devices = get_devices(request)
+    if not devices:
         logger.warning("No device fixture found to apply program cache to: PROGRAM CACHE DISABLED")
+    for dev in devices:
+        dev.enable_program_cache()
     yield
+    for dev in devices:
+        dev.disable_and_clear_program_cache()
+
+
+@pytest.fixture(scope="function")
+def enable_async_mode(request):
+    devices = get_devices(request)
+    if not devices:
+        logger.warning("No device fixture found to apply async mode to: ASYNC MODE DISABLED")
+
+    for dev in devices:
+        dev.enable_async(request.param)
+    yield request.param
+    for dev in devices:
+        dev.enable_async(False)

@pytest.fixture(scope="function")
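
The two fixtures in this `conftest.py` hunk share one setup/teardown shape: collect the devices, switch a mode on, hand control to the test, then restore the default. A minimal, self-contained sketch of that shape follows. `FakeDevice`, `program_cache`, and `async_mode` are hypothetical stand-ins written for illustration and are not part of `tt_lib` or `ttnn`; only the method names `enable_program_cache`, `disable_and_clear_program_cache`, and `enable_async` come from the diff. Because `enable_async_mode` reads `request.param`, a test would presumably supply the value through pytest's indirect parametrization, e.g. `@pytest.mark.parametrize("enable_async_mode", [True, False], indirect=True)`.

```python
from contextlib import contextmanager


class FakeDevice:
    """Hypothetical stand-in for a device object; it only records mode switches."""

    def __init__(self):
        self.program_cache_enabled = False
        self.async_enabled = False

    def enable_program_cache(self):
        self.program_cache_enabled = True

    def disable_and_clear_program_cache(self):
        self.program_cache_enabled = False

    def enable_async(self, enabled):
        self.async_enabled = enabled


@contextmanager
def program_cache(devices):
    """Same shape as the use_program_cache fixture body: enable, yield, clear."""
    for dev in devices:
        dev.enable_program_cache()
    try:
        yield
    finally:
        for dev in devices:
            dev.disable_and_clear_program_cache()


@contextmanager
def async_mode(devices, enabled):
    """Same shape as the enable_async_mode fixture body: set the mode, then reset to False."""
    for dev in devices:
        dev.enable_async(enabled)
    try:
        yield enabled
    finally:
        for dev in devices:
            dev.enable_async(False)


# Eight fake devices, chosen to match a 2x4 mesh; the number is an assumption.
devices = [FakeDevice() for _ in range(8)]
with program_cache(devices), async_mode(devices, True) as mode:
    inside = [(d.program_cache_enabled, d.async_enabled) for d in devices]
after = [(d.program_cache_enabled, d.async_enabled) for d in devices]
```

The `try`/`finally` is needed here because a plain context manager skips the code after `yield` when the body raises, whereas pytest runs the code after a fixture's `yield` as teardown even when the test fails.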
5 changes: 5 additions & 0 deletions models/demos/ttnn_resnet/README.md
@@ -8,8 +8,13 @@ Our ImageProcessor on the other hand is based on `microsoft/resnet-50` from hugg

 ## Performance

+### Single Device
 + To obtain device performance, run `WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml ./tt_metal/tools/profiler/profile_this.py -c "pytest models/demos/ttnn_resnet/tests/test_ttnn_resnet50_performant.py::test_run_resnet50_inference[16-act_dtype0-weight_dtype0-math_fidelity0-device_params0]"`
 This will generate a CSV report under `<this repo dir>/generated/profiler/reports/ops/<report name>`. The report file name is logged in the run output.

 + For end-to-end performance, run `WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest models/demos/ttnn_resnet/tests/test_perf_ttnn_resnet.py::test_perf_trace_2cqs_bare_metal[16-0.004-25-device_params0]`. This will generate a CSV with the timings and throughputs.
 Expected end-to-end perf: For batch = 16, it is about `4300 fps` currently. This may vary machine to machine.
+
+### T3000
++ For end-to-end performance, run `WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest test_perf_trace_2cqs_t3000[wormhole_b0-True-16-True-0.0043-60-device_params0]`. This will generate a CSV with the timings and throughputs.
+Expected end-to-end perf: For batch = 16 per device, or batch 128 in total, it is about `31,700 fps` currently. This may vary machine to machine.
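
As a quick sanity check on the T3000 figure, the quoted end-to-end numbers can be compared against ideal linear scaling. The sketch below assumes 8 devices (batch 128 in total at 16 per device) and takes the approximate figures from the README text above: about 4,300 fps on a single device and about 31,700 fps on T3000. The variable names are illustrative only.

```python
# Approximate end-to-end figures quoted in the README text above.
single_device_fps = 4300
t3000_fps = 31700

batch_per_device = 16
total_batch = 128
num_devices = total_batch // batch_per_device  # 8 devices

# Throughput if every device matched the single-device number exactly.
ideal_fps = single_device_fps * num_devices

# Fraction of that ideal which the quoted T3000 figure reaches.
scaling_efficiency = t3000_fps / ideal_fps

print(f"devices={num_devices} ideal={ideal_fps} fps efficiency={scaling_efficiency:.1%}")
```

With these inputs the data-parallel run reaches roughly 92% of ideal linear scaling end to end. Both inputs are described as machine-dependent, so the ratio is indicative only.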