Enable T3K Resnet Tests (#11030)

* #10244: Fix optional output tensor handling for reshard
* #0: Add enable_async_mode device fixture and refactor use_program_cache
* #0: Cleanup ttnn_resnet single device test files
* #10244: Fix multi-device api issues for ttnn resnet tests, and add them to ci
* #10244: Add E2E performance tests for ttnn_resnet on t3000
* #10244: Add t3000 perf results for ttnn_resnet to README
* #0: Remove initial space when parsing perf csv
* #0: Increase timeout for Nightly N300 WH-only models job due to some ci machines being slower than others
tenstorrent · Aug 5, 2024 · ec0cc14
1 parent 8d4008d
commit ec0cc14

Showing 16 changed files with 798 additions and 129 deletions.
diff --git a/.github/workflows/fast-dispatch-full-regressions-and-models.yaml b/.github/workflows/fast-dispatch-full-regressions-and-models.yaml
@@ -23,7 +23,7 @@ jobs:
             { name: "Common models N300 WH B0", arch: wormhole_b0, cmd: tests/scripts/single_card/nightly/run_common_models.sh, timeout: 40 },
             { name: "GS-only ttnn nightly", arch: grayskull, cmd: tests/scripts/single_card/nightly/run_ttnn.sh, timeout: 40 },
             { name: "GS-only models", arch: grayskull, cmd: tests/scripts/single_card/nightly/run_gs_only.sh, timeout: 40 },
-            { name: "N300 WH-only models", arch: wormhole_b0, cmd: tests/scripts/single_card/nightly/run_wh_b0_only.sh, timeout: 40 },
+            { name: "N300 WH-only models", arch: wormhole_b0, cmd: tests/scripts/single_card/nightly/run_wh_b0_only.sh, timeout: 50 },
             { name: "API tests GS", arch: grayskull, cmd: ./tests/scripts/run_tests.sh --tt-arch grayskull --pipeline-type frequent_api --dispatch-mode fast, timeout: 40 },
             { name: "API tests N300 WH B0", arch: wormhole_b0, cmd: ./tests/scripts/run_tests.sh --tt-arch wormhole_b0 --pipeline-type frequent_api --dispatch-mode fast, timeout: 40 },
             # #9945: Skip SD for now

diff --git a/.github/workflows/t3000-frequent-tests.yaml b/.github/workflows/t3000-frequent-tests.yaml
@@ -17,18 +17,20 @@ jobs:
       fail-fast: false
       matrix:
         test-group: [
-          { name: "t3k tteager tests", arch: wormhole_b0, cmd: run_t3000_tteager_tests, timeout: 60, 
+          { name: "t3k tteager tests", arch: wormhole_b0, cmd: run_t3000_tteager_tests, timeout: 60,
           runs-on: ["config-t3000", "in-service", "pipeline-functional"], owner_id: ULMEPM2MA}, #Sean Nijjar
-          { name: "t3k ethernet tests", arch: wormhole_b0, cmd: run_t3000_ethernet_tests, timeout: 60, 
+          { name: "t3k ethernet tests", arch: wormhole_b0, cmd: run_t3000_ethernet_tests, timeout: 60,
           runs-on: ["config-t3000", "in-service", "pipeline-functional"], owner_id: ULMEPM2MA}, #Sean Nijjar
-          { name: "t3k trace stress tests", arch: wormhole_b0, cmd: run_t3000_trace_stress_tests, timeout: 120, 
+          { name: "t3k trace stress tests", arch: wormhole_b0, cmd: run_t3000_trace_stress_tests, timeout: 120,
           runs-on: ["config-t3000", "in-service", "pipeline-functional"], owner_id: U03NG0A5ND7}, #Aditya Saigal
-          { name: "t3k falcon40b tests", arch: wormhole_b0, cmd: run_t3000_falcon40b_tests, timeout: 120, 
-          runs-on: ["config-t3000", "in-service", "pipeline-functional"], owner_id: U04S2UV6L8N}, #Sofija Jovic 
-          { name: "t3k llama2_70b tests", arch: wormhole_b0, cmd: run_t3000_llama2_70b_tests, timeout: 60, 
+          { name: "t3k falcon40b tests", arch: wormhole_b0, cmd: run_t3000_falcon40b_tests, timeout: 120,
+          runs-on: ["config-t3000", "in-service", "pipeline-functional"], owner_id: U04S2UV6L8N}, #Sofija Jovic
+          { name: "t3k llama2_70b tests", arch: wormhole_b0, cmd: run_t3000_llama2_70b_tests, timeout: 60,
           runs-on: ["config-t3000", "in-service", "pipeline-functional"], owner_id: U03FJB5TM5Y}, #Colman Glagovich
-          { name: "t3k mixtral tests", arch: wormhole_b0, cmd: run_t3000_mixtral_tests, timeout: 60, 
+          { name: "t3k mixtral tests", arch: wormhole_b0, cmd: run_t3000_mixtral_tests, timeout: 60,
           runs-on: ["config-t3000", "in-service", "pipeline-functional"], owner_id: U03PUAKE719}, #Miguel Tairum Cruz
+          { name: "t3k resnet tests", arch: wormhole_b0, cmd: run_t3000_resnet_tests, timeout: 30,
+          runs-on: ["config-t3000", "in-service", "pipeline-functional"], owner_id: U013121KDH9}, #Austin Ho
         ]
     name: ${{ matrix.test-group.name }}
     env:

diff --git a/.github/workflows/t3000-model-perf-tests.yaml b/.github/workflows/t3000-model-perf-tests.yaml
@@ -25,6 +25,8 @@ jobs:
             runs-on: ["arch-wormhole_b0", "config-t3000", "in-service", "pipeline-perf"], owner_id: U03FJB5TM5Y}, # Colman Glagovich
           { name: "t3k LLM falcon40b model perf tests", model: "falcon40b", model-type: "LLM", arch: wormhole_b0, cmd: run_t3000_falcon40b_tests, timeout: 75,
             runs-on: ["arch-wormhole_b0", "config-t3000", "in-service", "pipeline-perf"], owner_id: U053W15B6JF}, # Djordje Ivanovic
+          { name: "t3k LLM resnet50 model perf tests", model: "resnet50", model-type: "CNN", arch: wormhole_b0, cmd: run_t3000_resnet50_tests, timeout: 75,
+            runs-on: ["arch-wormhole_b0", "config-t3000", "in-service", "pipeline-perf"], owner_id: U013121KDH9}, # Austin Ho
           #{ name: "t3k CNN model perf tests ", model-type: "CNN", arch: wormhole_b0, cmd: run_t3000_cnn_tests, timeout: 120, owner_id: }, #No tests are being run?
         ]
     name: ${{ matrix.test-group.name }}

diff --git a/README.md b/README.md
@@ -44,12 +44,12 @@
 >
 > Furthermore, all performance numbers here are run or based off an N300 Wormhole card.
 
-| Model                                                                                  | Last Verified Release                                                            | Gen. Token [3]     |  Batch               | End-to-end throughput [1]      | Device throughput [2]        | Target         |
+| Model                                                                                  | Last Verified Release                                                     | Gen. Token [3]     |  Batch               | End-to-end throughput [1]      | Device throughput [2]        | Target         |
 |----------------------------------------------------------------------------------------|---------------------------------------------------------------------------|--------------------|----------------------|--------------------------------|------------------------------|----------------|
-| [Falcon7B](./models/demos/wormhole/falcon7b)                                           | [v0.51.0-rc13](https://github.com/tenstorrent/tt-metal/tree/v0.51.0-rc13) | 129th              | 32                   | 13.7 t/s/u - 438 t/s          | 19.5 t/s/u - 624 t/s        | 26             |
-| [Mistral-7B](./models/demos/wormhole/mistral7b)                                        | [v0.51.0-rc13](https://github.com/tenstorrent/tt-metal/tree/v0.51.0-rc13) | 129th              | 32                   | 9.9 t/s/u - 317 t/s           | 11.0 t/s/u - 352 t/s         | 25             |
+| [Falcon7B](./models/demos/wormhole/falcon7b)                                           | [v0.51.0-rc13](https://github.com/tenstorrent/tt-metal/tree/v0.51.0-rc13) | 129th              | 32                   | 13.7 t/s/u - 438 t/s           | 19.5 t/s/u - 624 t/s         | 26             |
+| [Mistral-7B](./models/demos/wormhole/mistral7b)                                        | [v0.51.0-rc13](https://github.com/tenstorrent/tt-metal/tree/v0.51.0-rc13) | 129th              | 32                   | 9.9 t/s/u - 317 t/s            | 11.0 t/s/u - 352 t/s         | 25             |
 | [Mamba-2.8B](./models/demos/wormhole/mamba)                                            | [v0.51.0-rc13](https://github.com/tenstorrent/tt-metal/tree/v0.51.0-rc13) | any                | 32                   | 11.6 t/s/u - 371 t/s           | 16.5 t/s/u - 528 t/s         | 41             |
-| [LLaMA-3.1-8B](./models/demos/wormhole/llama31_8b)                                     | [v0.51.0-rc13](https://github.com/tenstorrent/tt-metal/tree/v0.51.0-rc13) | 129th              | 8                    | 8.3 t/s/u - 66.0 t/s     | 9.7 t/s/u - 77.9 t/s   | 23             |
+| [LLaMA-3.1-8B](./models/demos/wormhole/llama31_8b)                                     | [v0.51.0-rc13](https://github.com/tenstorrent/tt-metal/tree/v0.51.0-rc13) | 129th              | 8                    | 8.3 t/s/u - 66.0 t/s           | 9.7 t/s/u - 77.9 t/s         | 23             |
 | [BERT-Large](./models/demos/metal_BERT_large_11/) (sen/s) [4]                          |                                                                           |                    | 8                    | 270                            | 340                          | 400            |
 | [Stable Diffusion 1.4](./models/demos/wormhole/stable_diffusion) 512x512 (sec/img) [5] |                                                                           |                    | 1                    | 6                              | 5                            | 3              |
 | [ResNet-50](./models/demos/ttnn_resnet) (fps)                                          |                                                                           |                    | 16                   | 4,300                          | 5,550                        | 7,000          |
@@ -66,14 +66,14 @@
 
 ##  TT-QuietBox & TT-LoudBox (2x4 mesh of WHs) Models
 
-| Model                                              | Last Verified Release                                                            |   Technique        | Gen. Token [3]      |  Batch                | End-to-end throughput [1]    | Device throughput [2]        | Target          |
+| Model                                              | Last Verified Release                                                     |   Technique        | Gen. Token [3]      |  Batch                | End-to-end throughput [1]    | Device throughput [2]        | Target          |
 |----------------------------------------------------|---------------------------------------------------------------------------|--------------------|---------------------|-----------------------|------------------------------|------------------------------|-----------------|
-| [Falcon7B](./models/demos/t3000/falcon7b)          | [v0.51.0-rc13](https://github.com/tenstorrent/tt-metal/tree/v0.51.0-rc13) | Data Parallel      | 129th               |  256                  | 7.6 t/s/u - 1950 t/s        |  19.5 t/s/u - 4990 t/s       |   26 t/s/u      |
+| [Falcon7B](./models/demos/t3000/falcon7b)          | [v0.51.0-rc13](https://github.com/tenstorrent/tt-metal/tree/v0.51.0-rc13) | Data Parallel      | 129th               |  256                  | 7.6 t/s/u - 1950 t/s         |  19.5 t/s/u - 4990 t/s       |   26 t/s/u      |
 | [LLaMA-2-70B](./models/demos/t3000/llama2_70b)     | [v0.51.0-rc13](https://github.com/tenstorrent/tt-metal/tree/v0.51.0-rc13) | Tensor Parallel    | 129th               |  32                   | 10.4 t/s/u - 333 t/s         |  16.6 t/s/u - 531 t/s        |   20 t/s/u      |
 | [LLaMA-3.1-70B](./models/demos/t3000/llama3_70b)   | [v0.51.0-rc13](https://github.com/tenstorrent/tt-metal/tree/v0.51.0-rc13) | Tensor Parallel    | 129th               |  32                   | 10.4 t/s/u - 333 t/s         |  15.8 t/s/u - 506 t/s        |   20 t/s/u      |
-| [Falcon40B](./models/demos/t3000/falcon40b)        | [v0.51.0-rc13](https://github.com/tenstorrent/tt-metal/tree/v0.51.0-rc13) | Tensor Parallel    | 129th               |  32                   | 5.3 t/s/u - 168 t/s         |  12.2 t/s/u - 390 t/s       |   36 t/s/u      |
+| [Falcon40B](./models/demos/t3000/falcon40b)        | [v0.51.0-rc13](https://github.com/tenstorrent/tt-metal/tree/v0.51.0-rc13) | Tensor Parallel    | 129th               |  32                   | 5.3 t/s/u - 168 t/s          |  12.2 t/s/u - 390 t/s        |   36 t/s/u      |
 | [Mixtral7Bx8](./models/demos/t3000/mixtral8x7b)    | [v0.51.0-rc13](https://github.com/tenstorrent/tt-metal/tree/v0.51.0-rc13) | Tensor Parallel    | 129th               |  32                   | 13.3 t/s/u - 426 t/s         |  21.4 t/s/u - 685 t/s        |   33 t/s/u      |
-| ResNet50                                           |                                                                           | Data Parallel      | coming soon         |                       |                              |                              |                 |
+| [ResNet-50](./models/demos/ttnn_resnet)            |                                                                           | Data Parallel      |                     |  128                  | 31,700                       |  44,400                      |   56,000        |
 
 ## Model Updates
 For the latest model updates and features, please see [MODEL_UPDATES.md](models/MODEL_UPDATES.md)

diff --git a/conftest.py b/conftest.py
@@ -271,36 +271,47 @@ def reset_default_device():
     ttl.device.SetDefaultDevice(device)
 
 
-@pytest.fixture(scope="function")
-def use_program_cache(request):
-    import tt_lib as ttl
-
+def get_devices(request):
     if "device" in request.fixturenames:
-        dev = request.getfixturevalue("device")
-        dev.enable_program_cache()
+        devices = [request.getfixturevalue("device")]
     elif "all_devices" in request.fixturenames:
         devices = request.getfixturevalue("all_devices")
-        for dev in devices:
-            dev.enable_program_cache()
     elif "pcie_devices" in request.fixturenames:
         devices = request.getfixturevalue("pcie_devices")
-        for dev in devices:
-            dev.enable_program_cache()
     elif "device_mesh" in request.fixturenames:
-        mesh = request.getfixturevalue("device_mesh")
-        for device_id in mesh.get_device_ids():
-            mesh.get_device(device_id).enable_program_cache()
+        devices = request.getfixturevalue("device_mesh").get_devices()
     elif "t3k_device_mesh" in request.fixturenames:
-        mesh = request.getfixturevalue("t3k_device_mesh")
-        for device_id in mesh.get_device_ids():
-            mesh.get_device(device_id).enable_program_cache()
+        devices = request.getfixturevalue("t3k_device_mesh").get_devices()
     elif "pcie_device_mesh" in request.fixturenames:
-        mesh = request.getfixturevalue("pcie_device_mesh")
-        for device_id in mesh.get_device_ids():
-            mesh.get_device(device_id).enable_program_cache()
+        devices = request.getfixturevalue("pcie_device_mesh").get_devices()
     else:
+        devices = []
+    return devices
+
+
+@pytest.fixture(scope="function")
+def use_program_cache(request):
+    devices = get_devices(request)
+    if not devices:
         logger.warning("No device fixture found to apply program cache to: PROGRAM CACHE DISABLED")
+    for dev in devices:
+        dev.enable_program_cache()
     yield
+    for dev in devices:
+        dev.disable_and_clear_program_cache()
+
+
+@pytest.fixture(scope="function")
+def enable_async_mode(request):
+    devices = get_devices(request)
+    if not devices:
+        logger.warning("No device fixture found to apply async mode to: ASYNC MODE DISABLED")
+
+    for dev in devices:
+        dev.enable_async(request.param)
+    yield request.param
+    for dev in devices:
+        dev.enable_async(False)
 
 
 @pytest.fixture(scope="function")
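
Note on the hunk above: `enable_async_mode` reads `request.param`, so pytest has to hand the fixture a parameter, for instance through indirect parametrization (`indirect=True`). The setup/yield/teardown shape of the fixture can be sketched without hardware as follows. `FakeDevice` and `async_mode` are hypothetical stand-ins invented for this illustration, not part of tt-metal; they model only the `enable_async` call visible in the diff.

```python
# Hypothetical stand-in for a device handle. Only the enable_async call used
# by the fixture in this diff is modelled; real devices do far more.
class FakeDevice:
    def __init__(self):
        self.async_enabled = False

    def enable_async(self, flag):
        self.async_enabled = bool(flag)


def async_mode(devices, param):
    """Generator with the same setup / yield / teardown shape as the fixture."""
    for dev in devices:
        dev.enable_async(param)   # setup: apply the requested mode
    yield param                   # the test body would run here
    for dev in devices:
        dev.enable_async(False)   # teardown: always restore sync mode


devices = [FakeDevice(), FakeDevice()]
gen = async_mode(devices, True)

value = next(gen)                              # run setup up to the yield
during = [d.async_enabled for d in devices]    # state seen by the test body

next(gen, None)                                # resume past the yield: teardown
after = [d.async_enabled for d in devices]     # state left for the next test
```

The point of the teardown loop is that async mode never leaks from one test into the next, whatever value the test was parametrized with.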

diff --git a/models/demos/ttnn_resnet/README.md b/models/demos/ttnn_resnet/README.md
@@ -8,8 +8,13 @@ Our ImageProcessor on the other hand is based on `microsoft/resnet-50` from hugg
 
 ## Performance
 
+### Single Device
 + To obtain device performance, run `WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml ./tt_metal/tools/profiler/profile_this.py -c "pytest models/demos/ttnn_resnet/tests/test_ttnn_resnet50_performant.py::test_run_resnet50_inference[16-act_dtype0-weight_dtype0-math_fidelity0-device_params0]"`
 This will generate a CSV report under `<this repo dir>/generated/profiler/reports/ops/<report name>`. The report file name is logged in the run output.
 
 + For end-to-end performance, run `WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest models/demos/ttnn_resnet/tests/test_perf_ttnn_resnet.py::test_perf_trace_2cqs_bare_metal[16-0.004-25-device_params0]`. This will generate a CSV with the timings and throughputs.
 Expected end-to-end perf: For batch = 16, it is about `4300 fps` currently. This may vary machine to machine.
+
+### T3000
++ For end-to-end performance, run `WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest test_perf_trace_2cqs_t3000[wormhole_b0-True-16-True-0.0043-60-device_params0]`. This will generate a CSV with the timings and throughputs.
+Expected end-to-end perf: For batch = 16 per device, or batch 128 in total, it is about `31,700 fps` currently. This may vary machine to machine.
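
The batch and throughput figures quoted in the README hunks can be cross-checked with a little arithmetic. This is only a rough sketch: the device count of 8 is inferred from "16 per device, 128 in total", and the single-device figure comes from a separate test run, so the resulting scaling ratio is indicative rather than a measured result.

```python
# Figures taken from the README text in this commit.
per_device_batch = 16
total_batch_reported = 128
single_device_fps = 4300     # single-device end-to-end figure
t3000_fps = 31700            # T3000 end-to-end figure

# Inferred, not stated outright: number of devices sharing the batch.
devices = total_batch_reported // per_device_batch

# Average end-to-end throughput contributed by each device on T3000.
per_device_fps = t3000_fps / devices

# Ratio against perfect linear scaling of the single-device figure.
scaling_efficiency = t3000_fps / (devices * single_device_fps)

print(f"devices={devices}, per-device fps={per_device_fps}, "
      f"scaling efficiency={scaling_efficiency:.1%}")
```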