Blackhole IOMMU support #370

Open
joelsmithTT opened this issue Dec 4, 2024 · 0 comments

joelsmithTT commented Dec 4, 2024

Initial IOMMU support PR: #338

TODO:

  • Documentation (how to enable the system IOMMU on AMD, Intel, and ARM hosts)
  • iATU vs. no iATU: implications for the base sysmem address as seen from the device NOC
  • Does UMD need to allocate its own buffer? The only case that seems to need it is WH ERISC, but initial IOMMU enablement will focus on Blackhole
  • Clarify some of the comments in pci_device.cpp that are Wormhole-specific
@joelsmithTT joelsmithTT self-assigned this Dec 4, 2024
@broskoTT broskoTT added this to the Hugepage -> IOMMU milestone Dec 9, 2024
@joelsmithTT joelsmithTT added the Blackhole Blackhole specific issue label Dec 10, 2024
joelsmithTT added a commit that referenced this issue Dec 11, 2024
### Issue
#370  

### Description
Adds IOMMU support for Blackhole in a way that should be transparent to
the application.

### List of the changes
* Allow Blackhole to have multiple hugepages / host memory channels
* Add an API on TTDevice for iATU programming
* Rehome Blackhole iATU programming code to blackhole_tt_device.cpp
* Remove unnecessary logic to determine hugepage quantity (just use what
the application passes to the Cluster constructor)
* Add sysmem tests for Blackhole.

### Testing
Manual testing was performed for both IOMMU on and IOMMU off cases using
the newly-added sysmem tests for Blackhole.

With IOMMU on:
```
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from SiliconDriverBH
[ RUN      ] SiliconDriverBH.SysmemTestWithPcie
  Detecting chips (found 1)
2024-12-10 20:40:07.019 | WARNING  | SiliconDriver   - Unknown board type for chip 0. This might happen because chip is running old firmware. Defaulting to UNKNOWN
2024-12-10 20:40:07.020 | WARNING  | SiliconDriver   - Unknown board type for chip 0. This might happen because chip is running old firmware. Defaulting to UNKNOWN
2024-12-10 20:40:07.083 | INFO     | SiliconDriver   - Detected PCI devices: [0]
2024-12-10 20:40:07.083 | INFO     | SiliconDriver   - Using local chip ids: {0} and remote chip ids {}
2024-12-10 20:40:07.083 | INFO     | SiliconDriver   - Opened PCI device 0; KMD version: 1.30.0, IOMMU: enabled
2024-12-10 20:40:07.170 | INFO     | SiliconDriver   - Allocating sysmem without hugepages (size: 0x40000000).
2024-12-10 20:40:07.417 | INFO     | SiliconDriver   - Mapped sysmem without hugepages to IOVA 0x3ffffff80000000.
2024-12-10 20:40:07.418 | INFO     | SiliconDriver   - Device: 0 Mapping iATU region 0 from 0x0 to 0x3fffffff to 0x3ffffff80000000
[       OK ] SiliconDriverBH.SysmemTestWithPcie (658 ms)
[ RUN      ] SiliconDriverBH.RandomSysmemTestWithPcie
2024-12-10 20:40:07.672 | WARNING  | SiliconDriver   - Unknown board type for chip 0. This might happen because chip is running old firmware. Defaulting to UNKNOWN
2024-12-10 20:40:07.672 | WARNING  | SiliconDriver   - Unknown board type for chip 0. This might happen because chip is running old firmware. Defaulting to UNKNOWN
2024-12-10 20:40:07.731 | INFO     | SiliconDriver   - Detected PCI devices: [0]
2024-12-10 20:40:07.731 | INFO     | SiliconDriver   - Using local chip ids: {0} and remote chip ids {}
2024-12-10 20:40:07.731 | INFO     | SiliconDriver   - Opened PCI device 0; KMD version: 1.30.0, IOMMU: enabled
2024-12-10 20:40:07.818 | INFO     | SiliconDriver   - Allocating sysmem without hugepages (size: 0x40000000).
2024-12-10 20:40:08.081 | INFO     | SiliconDriver   - Mapped sysmem without hugepages to IOVA 0x3ffffff80000000.
2024-12-10 20:40:08.327 | WARNING  | SiliconDriver   - Unknown board type for chip 0. This might happen because chip is running old firmware. Defaulting to UNKNOWN
2024-12-10 20:40:08.327 | WARNING  | SiliconDriver   - Unknown board type for chip 0. This might happen because chip is running old firmware. Defaulting to UNKNOWN
2024-12-10 20:40:08.387 | INFO     | SiliconDriver   - Detected PCI devices: [0]
2024-12-10 20:40:08.387 | INFO     | SiliconDriver   - Using local chip ids: {0} and remote chip ids {}
2024-12-10 20:40:08.387 | INFO     | SiliconDriver   - Opened PCI device 0; KMD version: 1.30.0, IOMMU: enabled
2024-12-10 20:40:08.474 | INFO     | SiliconDriver   - Allocating sysmem without hugepages (size: 0x100000000).
2024-12-10 20:40:09.453 | INFO     | SiliconDriver   - Mapped sysmem without hugepages to IOVA 0x3fffffe00000000.
2024-12-10 20:40:09.453 | INFO     | SiliconDriver   - Device: 0 Mapping iATU region 0 from 0x0 to 0x3fffffff to 0x3fffffe00000000
2024-12-10 20:40:09.454 | INFO     | SiliconDriver   - Device: 0 Mapping iATU region 1 from 0x40000000 to 0x7fffffff to 0x3fffffe40000000
2024-12-10 20:40:09.454 | INFO     | SiliconDriver   - Device: 0 Mapping iATU region 2 from 0x80000000 to 0xbfffffff to 0x3fffffe80000000
2024-12-10 20:40:09.454 | INFO     | SiliconDriver   - Device: 0 Mapping iATU region 3 from 0xc0000000 to 0xffffffff to 0x3fffffec0000000
[       OK ] SiliconDriverBH.RandomSysmemTestWithPcie (7754 ms)
[----------] 2 tests from SiliconDriverBH (8413 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (8413 ms total)
[  PASSED  ] 2 tests.
```
With IOMMU in passthrough (reported as `IOMMU: disabled` in the log):
```
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from SiliconDriverBH
[ RUN      ] SiliconDriverBH.SysmemTestWithPcie
  Detecting chips (found 1)
2024-12-10 20:59:03.744 | WARNING  | SiliconDriver   - Unknown board type for chip 0. This might happen because chip is running old firmware. Defaulting to UNKNOWN
2024-12-10 20:59:03.745 | WARNING  | SiliconDriver   - Unknown board type for chip 0. This might happen because chip is running old firmware. Defaulting to UNKNOWN
2024-12-10 20:59:03.812 | INFO     | SiliconDriver   - Detected PCI devices: [0]
2024-12-10 20:59:03.812 | INFO     | SiliconDriver   - Using local chip ids: {0} and remote chip ids {}
2024-12-10 20:59:03.813 | INFO     | SiliconDriver   - Opened PCI device 0; KMD version: 1.30.0, IOMMU: disabled
2024-12-10 20:59:03.928 | INFO     | SiliconDriver   - Device: 0 Mapping iATU region 0 from 0x0 to 0x3fffffff to 0xe00000000
[       OK ] SiliconDriverBH.SysmemTestWithPcie (383 ms)
[ RUN      ] SiliconDriverBH.RandomSysmemTestWithPcie
2024-12-10 20:59:04.121 | WARNING  | SiliconDriver   - Unknown board type for chip 0. This might happen because chip is running old firmware. Defaulting to UNKNOWN
2024-12-10 20:59:04.121 | WARNING  | SiliconDriver   - Unknown board type for chip 0. This might happen because chip is running old firmware. Defaulting to UNKNOWN
2024-12-10 20:59:04.177 | INFO     | SiliconDriver   - Detected PCI devices: [0]
2024-12-10 20:59:04.177 | INFO     | SiliconDriver   - Using local chip ids: {0} and remote chip ids {}
2024-12-10 20:59:04.177 | INFO     | SiliconDriver   - Opened PCI device 0; KMD version: 1.30.0, IOMMU: disabled
2024-12-10 20:59:04.380 | WARNING  | SiliconDriver   - Unknown board type for chip 0. This might happen because chip is running old firmware. Defaulting to UNKNOWN
2024-12-10 20:59:04.380 | WARNING  | SiliconDriver   - Unknown board type for chip 0. This might happen because chip is running old firmware. Defaulting to UNKNOWN
2024-12-10 20:59:04.435 | INFO     | SiliconDriver   - Detected PCI devices: [0]
2024-12-10 20:59:04.435 | INFO     | SiliconDriver   - Using local chip ids: {0} and remote chip ids {}
2024-12-10 20:59:04.436 | INFO     | SiliconDriver   - Opened PCI device 0; KMD version: 1.30.0, IOMMU: disabled
2024-12-10 20:59:04.513 | INFO     | SiliconDriver   - Device: 0 Mapping iATU region 0 from 0x0 to 0x3fffffff to 0xe00000000
2024-12-10 20:59:04.513 | INFO     | SiliconDriver   - Device: 0 Mapping iATU region 1 from 0x40000000 to 0x7fffffff to 0xe40000000
2024-12-10 20:59:04.513 | INFO     | SiliconDriver   - Device: 0 Mapping iATU region 2 from 0x80000000 to 0xbfffffff to 0xe80000000
2024-12-10 20:59:04.513 | INFO     | SiliconDriver   - Device: 0 Mapping iATU region 3 from 0xc0000000 to 0xffffffff to 0xec0000000
[       OK ] SiliconDriverBH.RandomSysmemTestWithPcie (11055 ms)
[----------] 2 tests from SiliconDriverBH (11438 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (11438 ms total)
[  PASSED  ] 2 tests.
```
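
The iATU lines in both logs follow the same arithmetic: each region covers a 1 GiB device-side window, and region N targets the host base address plus N GiB. The sketch below reproduces only that arithmetic. `IatuRegion` and `plan_iatu_regions` are hypothetical names for illustration, not part of the UMD or TTDevice API, and the sketch assumes the host-side range is contiguous, which holds for the IOVA mapping shown above but is not guaranteed for separately allocated hugepages.

```cpp
#include <cstdint>
#include <vector>

// One iATU window as printed in the logs: an inclusive device-side range
// and the host-side address (IOVA or physical) it is translated to.
// Hypothetical type for illustration only.
struct IatuRegion {
    uint32_t index;
    uint64_t device_start;  // inclusive
    uint64_t device_end;    // inclusive
    uint64_t host_target;
};

// 1 GiB per window, matching the 0x0 to 0x3fffffff ranges in the logs.
constexpr uint64_t kRegionSize = 1ULL << 30;

// Hypothetical helper: split a contiguous host mapping of `sysmem_size`
// bytes starting at `host_base` into 1 GiB windows. Assumes `sysmem_size`
// is a multiple of 1 GiB; any remainder is ignored.
std::vector<IatuRegion> plan_iatu_regions(uint64_t host_base, uint64_t sysmem_size) {
    std::vector<IatuRegion> regions;
    const uint64_t count = sysmem_size / kRegionSize;
    for (uint64_t i = 0; i < count; ++i) {
        const uint64_t offset = i * kRegionSize;
        regions.push_back({static_cast<uint32_t>(i),
                           offset,
                           offset + kRegionSize - 1,
                           host_base + offset});
    }
    return regions;
}
```

Calling `plan_iatu_regions(0x3fffffe00000000, 0x100000000)` yields four regions whose targets are `0x3fffffe00000000`, `0x3fffffe40000000`, `0x3fffffe80000000`, and `0x3fffffec0000000`, matching the IOMMU-on log; with base `0xe00000000` it matches the passthrough log.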

### API Changes
There are no API changes in this PR.