Suspicious IOMMU failures with Blackhole #380

Closed
joelsmithTT opened this issue Dec 9, 2024 · 2 comments

@joelsmithTT

The UMD test prints the following (edited for brevity) when it passes:

2024-12-09 01:20:09.333 | INFO     | SiliconDriver   - Opened PCI device 0; KMD version: 1.30.0, IOMMU: enabled
2024-12-09 01:20:09.444 | INFO     | SiliconDriver   - Allocating sysmem without hugepages (size: 0x40000000).
2024-12-09 01:20:09.655 | INFO     | SiliconDriver   - Mapped sysmem without hugepages to IOVA 0x40000000.
[       OK ] SiliconDriverBH.SysmemTestWithPcie (882 ms)
[----------] 1 test from SiliconDriverBH (882 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (882 ms total)
[  PASSED  ] 1 test.

And this when it fails:

2024-12-09 01:20:16.880 | INFO     | SiliconDriver   - Opened PCI device 0; KMD version: 1.30.0, IOMMU: enabled
2024-12-09 01:20:16.986 | WARNING  | SiliconDriver   - Insufficient NumHugepages: 0 should be at least NumMMIODevices: 1 for device_id: 0xb140 revision: 0. NumHostMemChannels would be 0, bumping to 1.
2024-12-09 01:20:16.986 | INFO     | SiliconDriver   - Allocating sysmem without hugepages (size: 0x40000000).
2024-12-09 01:20:17.208 | INFO     | SiliconDriver   - Mapped sysmem without hugepages to IOVA 0x3ffffffc0000000.
Base address: 3ffffffc0000000
/home/joel/git/tt-umd/tests/blackhole/test_silicon_driver_bh.cpp:920: Failure
Expected equality of these values:
  buffer
    Which is: { '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), '\xFF' (255), ... }
  std::vector<uint8_t>(sysmem, sysmem + test_size_bytes)
    Which is: { '\xEA' (234), '\x1' (1), '\t' (9), '\xBD' (189), '\x1E' (30), '\x2' (2), '{' (123, 0x7B), '\\' (92, 0x5C), '"' (34, 0x22), '\xC9' (201), '\x3' (3), '\xC8' (200), '\x1' (1), '\xFF' (255), '\xB5' (181), '\xDE' (222), '\x7F' (127), '\xE1' (225), '\xA8' (168), '\xFB' (251), 'y' (121, 0x79), '\x17' (23), 'N' (78, 0x4E), '\xCA' (202), '\x11' (17), '\xA7' (167), 'x' (120, 0x78), '\xBD' (189), '#' (35, 0x23), '\x8D' (141), 'a' (97, 0x61), '\xD3' (211), ... }
[  FAILED  ] SiliconDriverBH.SysmemTestWithPcie (880 ms)
[----------] 1 test from SiliconDriverBH (880 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (880 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] SiliconDriverBH.SysmemTestWithPcie

 1 FAILED TEST

The kernel log shows IO_PAGE_FAULT events for the failing run:

Dec  9 01:20:17 blackhole kernel: [  127.443143] tenstorrent 0000:01:00.0: Using 58-bit DMA addresses
Dec  9 01:20:17 blackhole kernel: [  127.631613] tenstorrent 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x1fffffc0000000 flags=0x0000]
Dec  9 01:20:17 blackhole kernel: [  127.631656] tenstorrent 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x1fffffc0000004 flags=0x0000]
Dec  9 01:20:17 blackhole kernel: [  127.631692] tenstorrent 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x1fffffc0000008 flags=0x0000]
Dec  9 01:20:17 blackhole kernel: [  127.631727] tenstorrent 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x1fffffc000000c flags=0x0000]
Dec  9 01:20:17 blackhole kernel: [  127.631760] tenstorrent 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x1fffffc0000010 flags=0x0000]
Dec  9 01:20:17 blackhole kernel: [  127.631795] tenstorrent 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x1fffffc0000014 flags=0x0000]
Dec  9 01:20:17 blackhole kernel: [  127.631829] tenstorrent 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x1fffffc0000018 flags=0x0000]
Dec  9 01:20:17 blackhole kernel: [  127.631863] tenstorrent 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x1fffffc000001c flags=0x0000]
Dec  9 01:20:17 blackhole kernel: [  127.631897] tenstorrent 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x1fffffc0000020 flags=0x0000]
Dec  9 01:20:17 blackhole kernel: [  127.631931] tenstorrent 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x1fffffc0000024 flags=0x0000]

The only difference between the two test runs is where the kernel placed the IOVA (0x40000000 vs 0x3ffffffc0000000).
The errors from the IOMMU driver suggest that the top five bits of the larger address are getting lost somewhere.

IOVA: 0x3ffffffc0000000
Address seen by IOMMU: 0x1fffffc0000000

Investigate whether this is a UMD bug or a hardware constraint.

@joelsmithTT joelsmithTT added the Blackhole Blackhole specific issue label Dec 9, 2024
@joelsmithTT joelsmithTT self-assigned this Dec 9, 2024
@joelsmithTT

This is looking like a UMD bug somewhere, because my Blackhole micro driver works fine.

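int test2()
{
    BlackholePciDevice device("/dev/tenstorrent/0");
    size_t ONE_GIG = 1 << 30;

    // Allocate a page-aligned 1 GiB host buffer and map it for DMA.
    void *buffer = std::aligned_alloc(0x1000, ONE_GIG);
    uint64_t iova = device.map_for_dma(buffer, ONE_GIG);
    std::cout << "0x" << std::hex << iova << std::dec << std::endl;

    // Open a 2M TLB window onto the IOVA so the buffer can be read back
    // through the device.
    auto window = device.map_tlb_2M_UC(11, 0, iova);

    // Write a marker from the host side...
    *reinterpret_cast<uint32_t*>(buffer) = 0xdeadbeef;

    // ...and read it back through the TLB window.
    std::cout << "0x" << std::hex << window->read32(0) << std::dec << std::endl;
    return 0;
}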

It prints:

0x3ffffffc0000000
0xdeadbeef

@joelsmithTT

Here is the bug

Related: #303 #281

@broskoTT broskoTT added this to the Hugepage -> IOMMU milestone Dec 9, 2024
joelsmithTT added a commit that referenced this issue Dec 9, 2024
### Issue
#380

### Description
uint32_t isn't large enough to hold a 58-bit IOVA divided by the Blackhole
TLB window size (2 megabytes): the quotient can need up to 37 bits. Use of
uint32_t here incorrectly truncates the address, leaving the TLB register
mis-programmed.

### List of the changes
* change a uint32_t to uint64_t

### Testing
Manual

### API Changes
N/A