#104: Add ttnn flatbuffer tests into CI #274
Conversation
run: |
  source env/activate
  ttrt query --save-artifacts
  export SYSTEM_DESC_PATH=${{ github.workspace }}/ttrt-artifacts/system_desc.ttsys
Any ideas on how to scrape the system descs from the current runner pool?
Do we want all the possible system descs? If so, it can be a simple Ansible script that goes into every machine listed in an inventory file and queries it. There's a GitHub API to get the list of CI runners registered in mlir.
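For illustration, a minimal sketch of the Ansible approach mentioned above, assuming an inventory group named ci_runners and a tt-mlir checkout at a fixed path on each runner (both names are assumptions, not the actual CI setup):

# collect_system_descs.yml - hypothetical playbook, not part of the repo.
# Runs `ttrt query --save-artifacts` on every host in the ci_runners group
# and pulls each resulting system desc back to the control node.
- hosts: ci_runners
  gather_facts: false
  vars:
    tt_mlir_dir: /opt/tt-mlir           # assumed checkout location on each runner
  tasks:
    - name: Query the system descriptor on this runner
      ansible.builtin.shell: |
        source env/activate
        ttrt query --save-artifacts
      args:
        chdir: "{{ tt_mlir_dir }}"
        executable: /bin/bash

    - name: Fetch the generated system desc, named per host
      ansible.builtin.fetch:
        src: "{{ tt_mlir_dir }}/ttrt-artifacts/system_desc.ttsys"
        dest: "collected/{{ inventory_hostname }}.ttsys"
        flat: true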
Yeah, this is a good question. Ideally we want all of them, but then use some policy for divvying them up amongst the tests, i.e. we want to distribute the work evenly across the pool.
I think it might make sense to save the system descs with their product name or tag them in some way, e.g. n300.ttsys / n150.ttsys / tgg.ttsys, but we probably need to further specialize the name because of row harvesting. It gets even more complicated if we have an LLM box; this is where system desc slicing could be used so we could partition the single host into 8 devices and go parallel.
Let's think through how to capture this information thoroughly because it's going to need to be rock solid to avoid CI device management hell.
Another challenge is that the system desc must be collected with the current build. Maybe this works like:
- Build on a GitHub runner, producing a build artifact
- Run a super fast job on the silicon runners that collects all the system descs into an artifact
- Go back to the GitHub runner, which runs the compiler tests with the collected system descs
- Go back to the silicon runner, which runs the flatbuffers
Not sure if that's too crazy going back and forth. I'm just not sure how else it could work.
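To visualize the back-and-forth, here is a rough GitHub Actions sketch of those four stages; job names, runner labels, and the build/run commands are placeholders rather than the real workflow:

# Hypothetical four-stage pipeline, chained with `needs:`.
jobs:
  build:
    runs-on: ubuntu-latest              # compute-only GitHub runner
    steps:
      - uses: actions/checkout@v4
      - run: ./build.sh                 # placeholder for the real build
      - uses: actions/upload-artifact@v4
        with:
          name: build
          path: build/
  collect-system-descs:
    needs: build
    runs-on: [self-hosted, silicon]     # a matrix would be needed to cover the whole pool
    steps:
      - uses: actions/download-artifact@v4
        with:
          name: build
      - run: |
          source env/activate
          ttrt query --save-artifacts
      - uses: actions/upload-artifact@v4
        with:
          name: system-descs
          path: ttrt-artifacts/
  compile-tests:
    needs: collect-system-descs
    runs-on: ubuntu-latest              # back on the compute-only runner
    steps:
      - uses: actions/download-artifact@v4
        with:
          name: system-descs
      - run: llvm-lit test/             # compiles tests, emitting flatbuffers
  run-flatbuffers:
    needs: compile-tests
    runs-on: [self-hosted, silicon]
    steps:
      - run: echo "run the generated flatbuffers with ttrt here"  # placeholder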
fyi @vmilosevic
Yeah, I'm not sure either how we can get the system desc without building first. Although, I was thinking we could do this:
- Run on a GitHub runner, build ttrt
- Transfer all test files onto all types of CI runners
- Use the system desc to generate valid test cases for that machine
- Run all valid tests on each machine
We could probably use GitHub's load balancing mechanism instead of trying to distribute the tests across CI runners ourselves, which I think might cause more overhead. I think this will be a larger discussion offline - let me know what you think.
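A sketch of how that could look in workflow terms, assuming the silicon runners carry per-family labels (n150, n300, tgg here, all assumed names), so GitHub's own queueing spreads jobs across idle runners with the matching label:

# Hypothetical per-family job: each machine queries its own system desc,
# generates only the test cases that are valid for it, then runs them.
# Assumes a prior build job uploaded a `build` artifact.
jobs:
  silicon-tests:
    strategy:
      matrix:
        family: [n150, n300, tgg]       # assumed runner labels
    runs-on: [self-hosted, "${{ matrix.family }}"]
    steps:
      - uses: actions/download-artifact@v4
        with:
          name: build
      - run: |
          source env/activate
          ttrt query --save-artifacts
          export SYSTEM_DESC_PATH=$PWD/ttrt-artifacts/system_desc.ttsys
          llvm-lit test/                # generates only the cases valid for this machine
          # exact command to run the resulting flatbuffers omitted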
So two things that I think we should work out:
- Ideally llvm-lit tests are run on the compute-only GitHub runners, but maybe it's not a big deal to just run them on the silicon runners. Either way, they need to run in order to generate the flatbuffers.
- Let's say we have a pool of 10 n300s and we have 100 silicon tests. We don't want to run all 100 tests on all n300s; we want to group the tests into 10 chunks and have each runner run only 10 tests.
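On the second point, a rough sharding sketch (the artifact layout, .ttnn extension, and n300 runner label are assumptions): ten matrix entries, each picking every 10th generated flatbuffer, so the pool splits the 100 tests rather than every machine repeating all of them.

# Hypothetical sharding of generated flatbuffers across the n300 pool.
jobs:
  n300-shard:
    strategy:
      matrix:
        shard: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    runs-on: [self-hosted, n300]        # assumed label for the n300 pool
    steps:
      - run: |
          # keep every 10th flatbuffer, offset by this shard's index
          ls ttrt-artifacts/*.ttnn | awk -v i=${{ matrix.shard }} 'NR % 10 == i' > shard.txt
          # run only the tests listed in shard.txt (exact runner invocation omitted)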
I think we should establish some terminology (feel free to suggest other names if these seem confusing or don't make sense, but I think we need some terms to refer to because our test matrix is quite complicated):
- Silicon Runner: A 1-to-1 mapping to a unique SystemDesc. Because of row harvesting, even different n300s might have unique system descs.
- Target Family: A category of system type; we should organize these by product name, like n150, n300, TG, TGG, LLM Box, etc. We will have a pool of n300s, for example (all the same target system), but each might have a unique system desc.
- Target Capabilities: High-level attributes that describe testable traits of a Target Family.
- Test Capabilities: A set of Target Capabilities defined in the test. Effectively this describes which system the test should be compiled and run against.
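To make those terms concrete, here is one hypothetical way they could be written down (file name, fields, and values are illustrative only):

# target_families.yaml - illustrative only, not an existing file.
target_families:
  n300:
    capabilities: [silicon, multi_chip]     # Target Capabilities the family exports
    silicon_runners:                        # one entry per unique system desc
      - n300-runner-01.ttsys                # row-harvested variant A
      - n300-runner-02.ttsys                # row-harvested variant B
  tg:
    capabilities: [silicon, Galaxy, HugePage]
    silicon_runners:
      - tg-runner-01.ttsys
  tgg:
    capabilities: [silicon, Galaxy, HugePage]
    silicon_runners:
      - tgg-runner-01.ttsys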
To run through an example of how this could work, tracing it from the perspective of a single test:
Please reference the llvm-lit directives REQUIRES and UNSUPPORTED.
- GitHub CI builds tt-mlir and ttrt
- All Silicon Runners pick up the build artifact and generate the system descs for all runners in our pool. We need to organize all of these somehow and bucketize them by Target Family. Each Target Family bucket forms a pool of available test targets for silicon tests to be compiled against. Each Target Family bucket has an associated set of Target Capabilities, which can be driven by the data in the system desc itself (more on this later).
- The system desc artifact (a zip file containing all system descs in the entire pool) comes back to the GitHub compute runner. Here we do some kind of partitioning of the Silicon Runners within each Target Family category. For example, n300 is a Target Family; there might be 10 Silicon Runners of this type, and ideally all of the tests that opt into being compiled against this Target Family get evenly distributed amongst the 10 available runners.
- Zooming in on a particular test, it might have the llvm-lit directive REQUIRES: Galaxy HugePage. Here Galaxy and HugePage form this test's Test Capabilities (a set of two). The Galaxy tag itself is a Target Capability associated with TG and TGG because both Target Families export this Target Capability. This would opt the test into being compiled against a single system desc that is part of this Target Family pool.
A note on Target Capabilities: I think some of these tags should be opt-in and some opt-out. For example, most silicon tests, without any Test Capabilities listed, should implicitly opt into running on most silicon runners; by default everything should run on n300s. Whereas multi-device tests should explicitly opt in (via a REQUIRES directive).
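One hypothetical way to encode that default/opt-in split (names made up for illustration):

# capability policy sketch - illustrative only.
capability_policy:
  default_family: n300                         # tests with no Test Capabilities land here
  opt_in_only: [Galaxy, HugePage, MultiDevice] # must be named in a REQUIRES: directive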
@vmilosevic has taken this work over. Closing; we can reopen if it makes sense.