Metal developer updates for morning of Sept 18, 2023 #2699
tt-rkim announced in Contributor announcements
Running WH tests before and after UMD changes
The old UMD is not compatible with FW versions after 2023-03-29 (which I believe is 7.8). The new UMD is not compatible with FW versions before 2023-08-08, which is 7.D. Using an incompatible pairing will likely crash / brick your WH machine.
Be mindful of this when running WH tests if you see machine crashes or hanging boards. Always check SMI for the FW version and note whether you have the old or new UMD.
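As a rough self-check, the pairing rule above can be written down in a few lines. This is only a sketch under assumptions not confirmed in this post: that the minor FW digit is hexadecimal (so 7.D is newer than 7.8) and that 7.8 / 7.D are the exact cutoffs.

```python
# Sketch of the UMD/FW pairing rule described above.
# Assumptions (not verified against SMI output): the minor version digit is
# hexadecimal, so "7.D" is newer than "7.8", and 7.8 / 7.D are the cutoffs.

OLD_UMD_MAX_FW = (7, 0x8)   # old UMD: not compatible with FW newer than 7.8
NEW_UMD_MIN_FW = (7, 0xD)   # new UMD: not compatible with FW older than 7.D

def parse_fw(version: str) -> tuple[int, int]:
    """Parse a 'major.minor' FW string such as '7.D' into comparable ints."""
    major, minor = version.split(".")
    return int(major, 16), int(minor, 16)

def safe_pairing(fw_version: str, new_umd: bool) -> bool:
    """Return True if this FW/UMD pairing should be safe for WH tests."""
    fw = parse_fw(fw_version)
    return fw >= NEW_UMD_MIN_FW if new_umd else fw <= OLD_UMD_MAX_FW

# Example: 7.D FW needs the new UMD; pairing it with the old UMD is unsafe.
assert safe_pairing("7.D", new_umd=True)
assert not safe_pairing("7.D", new_umd=False)
```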
You must rebase if you're working on WH by end of this week
If you are working on WH, please note we will be upgrading all our CI machines to the new FW, rendering the old UMD unusable. Branches still on the old UMD will not pass post-commit after these upgrades.
We will be making these changes sometime after Wednesday, Sep 20th, 5pm Toronto / New York time.
If you manage external contractors, please plan and prepare accordingly with your contractors.
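If your WH branch predates the new UMD, picking it up is the usual fetch-and-rebase flow. A minimal sketch, assuming your branch was cut from origin/main (adjust the base ref if yours differs):

```python
import subprocess

BASE = "origin/main"  # assumption: change this if your branch is based elsewhere

def rebase_onto_latest(base: str = BASE) -> None:
    """Fetch the latest base branch and rebase the checked-out branch onto it."""
    subprocess.run(["git", "fetch", "origin"], check=True)
    subprocess.run(["git", "rebase", base], check=True)

if __name__ == "__main__":
    rebase_onto_latest()
```

Re-run your WH tests afterwards to confirm the branch works against the new UMD and FW.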
Availability of WH machines on the cloud
We have a total of 5 bare metal WH B0 machines available in the cloud. Please check the bare metal spreadsheet for assignments / information on bare metal machines. I have released one bare metal machine with WH back into the developer pool (e09cs02).
The spreadsheet is here, as always, the same one we pinned months ago in the developers channel: https://tenstorrent.sharepoint.com/:x:/r/sites/Jasmina/_layouts/15/Doc.aspx?sourcedoc=%7B86724D16-A1D3-42FC-A000-1693B4773E63%7D&file=Cloud%20Performance%20Optimization.xlsx&action=default&mobileredirect=true&cid=cd9dc7e8-4e53-4d47-95fb-45c1a0aa4c62
All machines have been updated to 7.D FW and the 1.23 driver, except for one CI machine kept back to support people still on branches with the old UMD; no one may use it for development unless you are debugging a hang on CI. Note that this machine will also be upgraded sometime after the 20th.
There may be more machines available in the lab in Santa Clara. I cannot help you with these machines, and contractors without corporate VPN cannot access them. Please message the sjc-lab-machines channel if you're curious.
CI runner instability
We are all aware of how far from ideal CI has been over the last couple of weeks. No one is more acutely aware of how painful this is for developers than the people who take care of these machines every day - I promise you.
We are trying to dedicate more resources to CI improvement and have various things we plan to try. We will share relevant updates with everyone as they come.
I'm asking for your patience while we do this. Everyone's plate is full and we all need to remember we all have a common goal with this project.
Reminder about CI usage
Do NOT cancel silicon runs (GS or WH) unless you know you won't hang the board. Many of the runner issues can be attributed to developers cancelling runs and not taking responsibility for cleaning up the machine afterwards. Let the run time out, and put the runner out of service.
If you see a broken runner, take it out of service. If you're a contractor, have your Tenstorrent contact do it for you: https://github.com/tenstorrent-metal/metal-internal-workflows/wiki/CI-machine-care#taking-a-machine-out-of-service - This doc has been around for months and we told you all about it months ago. Please use it. Too many people silently let runner failures pass by and do nothing about them, and then that same person or someone else complains that they keep getting an out-of-service runner.
You must pass post-commit pipelines before merging. We do not want to see more pull requests with skipped checks.
I will be reaching out, with a more urgent tone, to people who did not properly read this email or who are ignoring what we are asking.
When your cloud machine won't come back up after rebooting (BM + VM)
Normally, you would file an issue to cloud with the metal tag if this happens.
However, one common cause for this is that you or someone else installed VNC dependencies on the bare metal machine. Long story short - this will cause hangs during boot.
We've documented this and will put future possible causes of boot up hang here: https://github.com/tenstorrent-metal/metal-internal-workflows/wiki/Recovering-a-%22bricked%22-device#machine-wont-boot-up
Please do not install VNC or its dependencies. If your machine freezes because of something like this, I can't really help you. Discuss with the cloud team if you need such services, or perform the fix as specified in the docs above.
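If you are not sure whether a machine you inherited already has VNC bits on it, a quick check can save a reboot surprise. A minimal sketch, assuming a Debian/Ubuntu bare metal image with dpkg available; the substring match on "vnc" is a heuristic, not a confirmed list of the offending packages:

```python
import subprocess

def installed_vnc_packages() -> list[str]:
    """List installed packages whose names contain 'vnc' (heuristic check only).

    Assumes a Debian/Ubuntu host where dpkg-query is available. This only
    detects packages; removal and boot recovery should follow the wiki doc
    linked above.
    """
    out = subprocess.run(
        ["dpkg-query", "-W"],
        check=True, capture_output=True, text=True,
    ).stdout
    hits = []
    for line in out.splitlines():
        if not line.strip():
            continue
        name = line.split()[0]  # default output is "package<TAB>version"
        if "vnc" in name.lower():
            hits.append(name)
    return hits

if __name__ == "__main__":
    found = installed_vnc_packages()
    if found:
        print("VNC-related packages found (possible boot-hang risk):")
        for pkg in found:
            print(f"  {pkg}")
    else:
        print("No VNC-related packages detected.")
```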
Clean up unused branches
For upcoming open source sanitization changes, I'm asking everyone to delete old branches that they no longer need. You may lose branches that we deem too old during sanitization.
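If you want a quick way to find candidates, the sketch below lists remote branches that are already merged into the default branch and prints the delete commands without running them (dry run). It assumes the default branch is origin/main; adjust as needed.

```python
import subprocess

DEFAULT_BRANCH = "origin/main"  # assumption: adjust if the repo's default branch differs

def merged_remote_branches() -> list[str]:
    """Remote branches already merged into the default branch (deletion candidates)."""
    out = subprocess.run(
        ["git", "branch", "-r", "--merged", DEFAULT_BRANCH],
        check=True, capture_output=True, text=True,
    ).stdout
    branches = []
    for line in out.splitlines():
        name = line.strip()
        # Skip empty lines, the default branch itself, and refs like "origin/HEAD -> ...".
        if not name or "->" in name or name == DEFAULT_BRANCH:
            continue
        branches.append(name)
    return branches

if __name__ == "__main__":
    # Dry run: print the commands instead of deleting anything.
    for branch in merged_remote_branches():
        print(f"git push origin --delete {branch.removeprefix('origin/')}")
```

Run `git fetch --prune` first so the remote branch list is current, and double-check the list before deleting anything.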
Reminder about docs
You're a developer, and you ship features. However, a feature does not stop at "working" code. If you do not complete tests or documentation, your feature is not shippable. We are trying to put out a polished product, especially with all our new ops and APIs.
If you're managing a team of contractors, please let them know about this, and be more stringent and scrutinizing with changes, especially changes that suspiciously contain no docs changes or that contain incorrect docs changes.
cc: @tenstorrent-metal/developers @tenstorrent-metal/external-developers