
#104: Add ttnn flatbuffer tests into CI #274

Closed
tapspatel wants to merge 1 commit from tpatel/issue-104

Conversation

@tapspatel (Contributor):

No description provided.

@tapspatel requested a review from vmilosevic as a code owner on August 1, 2024 17:48
@tapspatel linked an issue on Aug 1, 2024 that may be closed by this pull request
@tapspatel self-assigned this on Aug 1, 2024
@tapspatel force-pushed the tpatel/issue-104 branch 6 times, most recently from f83e5ab to 9e9235d, on August 7, 2024 17:29
@tapspatel requested a review from nsmithtt as a code owner on August 7, 2024 17:29
@tapspatel force-pushed the tpatel/issue-104 branch 2 times, most recently from c37568c to e0ae5fd, on August 7, 2024 18:03
.github/workflows/ttrt-flatbuffer-ttnn.yml (outdated review thread, resolved)
.github/workflows/ttrt-flatbuffer-ttnn.yml (outdated review thread, resolved)
test/system_desc/n150.ttsys (outdated review thread, resolved)
.github/workflows/ttrt-flatbuffer-ttnn.yml (outdated review thread, resolved)
@tapspatel force-pushed the tpatel/issue-104 branch 2 times, most recently from adfe5bf to 082575c, on August 8, 2024 14:27
```yaml
run: |
  source env/activate
  ttrt query --save-artifacts
  export SYSTEM_DESC_PATH=${{ github.workspace }}/ttrt-artifacts/system_desc.ttsys
```
@nsmithtt (Contributor), Aug 8, 2024:

Any ideas on how to scrape the system descs from the current runner pool?

@tapspatel (Contributor, Author):

Do we want all the possible system descs? If so, it could be a simple Ansible script that goes into every machine listed in an inventory file and queries it. There's also a GitHub API to get the list of CI runners registered for tt-mlir.
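
For what it's worth, a rough sketch of that approach; the inventory group name, the `ttmlir_dir` variable, and the destination layout are all made up. The runner list itself could come from the GitHub REST endpoint for self-hosted runners (`GET /orgs/{org}/actions/runners`, or the per-repo equivalent).

```yaml
# Hypothetical Ansible playbook: collect a system desc from every silicon machine
# in an inventory. Assumes ttrt is already built on each host and that the tt-mlir
# checkout path is passed in as ttmlir_dir (a made-up variable name).
- hosts: silicon_runners
  tasks:
    - name: Query the system descriptor with ttrt
      ansible.builtin.shell: source env/activate && ttrt query --save-artifacts
      args:
        chdir: "{{ ttmlir_dir }}"
        executable: /bin/bash

    - name: Fetch the generated .ttsys back to the control node
      ansible.builtin.fetch:
        src: "{{ ttmlir_dir }}/ttrt-artifacts/system_desc.ttsys"
        dest: "collected-descs/{{ inventory_hostname }}.ttsys"
        flat: true
```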

Contributor:

Yeah, this is a good question. Ideally we want all of them, but then use some policy for divvying them up amongst the tests, i.e. we want to distribute the work evenly across the pool.

I think it might make sense to save the system descs with their product name or tag them in some way (n300.ttsys / n150.ttsys / tgg.ttsys), but we probably need to further specialize the name because of row harvesting. It gets even more complicated if we have an LLM box; this is where system desc slicing could be used, so we could partition the single host into 8 devices and go parallel.

Let's think through how to capture this information thoroughly because it's going to need to be rock solid to avoid CI device management hell.
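
For instance, something like the following artifact layout, purely illustrative; the row-harvesting suffix is a made-up convention:

```
system-descs/
  n150.ttsys
  n300.ttsys        # unharvested n300
  n300-rh1.ttsys    # n300 with one harvested row
  tgg.ttsys
```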

Contributor:

Another challenge is that the system desc must be collected with the current build. Maybe this works like:

  • Build on a GitHub runner, producing a build artifact
  • Run a super fast job on the silicon runners that collects all the system descs into an artifact
  • Go back to a GitHub runner, which runs the compiler tests with the collected system descs
  • Go back to a silicon runner, which runs the flatbuffers

Not sure if that's too crazy, going back and forth like this; I'm just not sure how else it could work.
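
A hedged sketch of that back-and-forth as a GitHub Actions job chain; the job names, runner labels, artifact names, and placeholder commands below are assumptions for illustration, not the actual tt-mlir CI config.

```yaml
# Hypothetical job chain mirroring the four steps above.
name: silicon-tests
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest                  # GitHub compute runner
    steps:
      - uses: actions/checkout@v4
      - run: mkdir -p build && echo stub > build/ttrt   # stand-in for the real tt-mlir/ttrt build
      - uses: actions/upload-artifact@v4
        with: { name: build, path: build/ }

  collect-system-descs:
    needs: build
    runs-on: [self-hosted, silicon]         # quick pass over the silicon pool
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with: { name: build, path: build/ }
      - run: |
          source env/activate
          ttrt query --save-artifacts
      - uses: actions/upload-artifact@v4
        with: { name: system-descs, path: ttrt-artifacts/*.ttsys }

  compile-tests:
    needs: collect-system-descs
    runs-on: ubuntu-latest                  # back on the compute runner
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with: { name: system-descs, path: ttrt-artifacts/ }
      - run: echo "run llvm-lit against the collected descs and upload the flatbuffers"

  run-flatbuffers:
    needs: compile-tests
    runs-on: [self-hosted, silicon]
    steps:
      - run: echo "download the flatbuffers and execute them with ttrt"
```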

@tapspatel (Contributor, Author):

Yeah, I'm not sure either how we can get the system desc without building first. Although, I was thinking we could do this:

  1. run on a GitHub runner, build ttrt
  2. transfer all test files onto all types of CI runners
  3. use the system desc to generate valid test cases for that machine
  4. run all valid tests on each machine

We could probably use GitHub's load-balancing mechanism instead of trying to distribute the tests across CI runners ourselves, which I think might cause more overhead. I think this will be a larger discussion offline; let me know what you think.
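
A minimal sketch of that direction using a matrix over per-family runner labels; the family labels and the lit parameter name are made up, and the local filtering mechanism would still need to be built.

```yaml
# Hypothetical fragment of the jobs: section. Each Target Family gets one matrix
# job; GitHub's own queueing spreads jobs with the same label across whichever
# runners in that pool are free, so we don't hand-roll the distribution.
  silicon-tests:
    needs: build
    strategy:
      matrix:
        family: [n150, n300, tgg]           # made-up runner labels, one per Target Family
    runs-on: [self-hosted, "${{ matrix.family }}"]
    steps:
      - uses: actions/checkout@v4
      - run: |
          source env/activate
          ttrt query --save-artifacts
          # Hypothetical lit parameter: the local lit config would drop tests that
          # the just-queried system desc cannot satisfy.
          llvm-lit -sv test --param system-desc=ttrt-artifacts/system_desc.ttsys
```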

@nsmithtt (Contributor), Aug 8, 2024:

So two things that I think we should work out:

  1. Ideally llvm-lit tests are run on the compute-only GitHub runners, but maybe it's not a big deal to just run them on the silicon runners. Either way, they need to run in order to generate flatbuffers.
  2. Let's say we have a pool of 10 n300s and we have 100 silicon tests. We don't want to run all 100 tests on every n300; we want to group the tests into 10 chunks and have each runner run only 10 tests (see the sharding sketch below).
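
On point 2, llvm-lit already has built-in sharding that could do that chunking mechanically; the shard numbers below just mirror the 10-runner example, and the test path is illustrative.

```sh
# Runner 3 of 10 runs only its tenth of the suite; each runner gets a disjoint slice.
llvm-lit --num-shards=10 --run-shard=3 test
```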

I think we should establish some terminology (feel free to suggest other names if these seem confusing or don't make sense, but I think we need some terms to refer to because our test matrix is quite complicated):

  • Silicon Runner: 1:1 mapping to a unique SystemDesc. Because of row harvesting, even different n300s might have unique system descs.
  • Target Family: A category of system type. We should organize these by product name, like n150, n300, TG, TGG, LLM Box, etc. We will have a pool of n300s, for example (all the same target system), but each might have a unique system desc.
  • Target Capabilities: High-level attributes that describe testable traits of a Target Family.
  • Test Capabilities: A set of Target Capabilities declared in the test. Effectively describes which systems the test should be compiled and run against.

To run through an example of how this could work, tracing it from the perspective of a single test:

Please reference the llvm-lit directives REQUIRES and UNSUPPORTED.

  1. GitHub CI builds tt-mlir and ttrt
  2. All Silicon Runners pick up the build artifact and generate system descs for all runners in our pool. We need to organize all of these somehow and bucketize them by Target Family. Each Target Family bucket forms a pool of available test targets for silicon tests to be compiled against. Each Target Family bucket has an associated set of Target Capabilities, which can be driven by the data in the system desc itself (more on this later).
  3. The system desc artifact (a zip file containing all system descs in the entire pool) comes back to the GitHub compute runner. Here we do some kind of partitioning of the Silicon Runners within each Target Family category. So, for example, n300 is a Target Family; there might be 10 Silicon Runners of this type, and ideally all of the tests that opt into being compiled against this Target Family get evenly distributed amongst the 10 available runners.
  4. Zooming in on a particular test, it might have the llvm-lit directive REQUIRES: Galaxy HugePage. Here Galaxy and HugePage form this test's Test Capabilities (a small set of two). The Galaxy tag itself is a Target Capability associated with TG and TGG, because both Target Families export this Target Capability. This would opt the test into being compiled against a single system desc that is part of this Target Family pool.

A note on Target Capabilities: I think some of these tags should be opt-in and some opt-out. For example, most silicon tests without any Test Capabilities listed should implicitly opt into running on most silicon runners; by default everything should run on n300s. Multi-device tests, on the other hand, should explicitly opt in (via the REQUIRES directive).
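
To make the directive side concrete, a hedged sketch of what a test header could look like; the feature names (galaxy, hugepage, n150) and the RUN line are illustrative, not existing tags in the repo.

```mlir
// Hypothetical Test Capabilities: lit selects this test only if the runner's
// Target Family exports both features.
// REQUIRES: galaxy, hugepage
// Opt-out example: never run on runners that export the n150 feature.
// UNSUPPORTED: n150
// RUN: ttmlir-opt %s | FileCheck %s
```

On the runner side, the lit config would then advertise the features derived from the downloaded system desc, e.g. via config.available_features.add("galaxy") in lit.cfg.py; again, this is just a sketch of how lit's feature mechanism could be wired up.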

@tapspatel force-pushed the tpatel/issue-104 branch 2 times, most recently from 7f161b6 to 05afa64, on August 8, 2024 14:52
@tapspatel requested a review from nsmithtt on August 8, 2024 15:34
@nsmithtt (Contributor):

@vmilosevic has taken this work over. Closing; we can reopen if it makes sense.

@nsmithtt closed this on Aug 21, 2024
@tapspatel deleted the tpatel/issue-104 branch on August 30, 2024 18:30
Labels: none yet
Projects: none yet
Linked issues (successfully merging this pull request may close these): ttrt testing
2 participants