Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify hugepage code, Chapter I #332

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

joelsmithTT
Copy link
Contributor

@joelsmithTT joelsmithTT commented Nov 26, 2024

Issue

#254

Description

Part 1 of an effort to simplify the hugepage scheme for shared memory between device and host. Initial work on this also removed hugetlbfs aspects of the setup, but I will save that for part 2.

List of the changes

  • Remove cpuset_lib code; stop doing fancy things to determine host memory channel quantity - just use what the application passes in. If there are not enough hugepages available, UMD will fail.
  • Use libnuma to prefer hugepage allocations on the NUMA node associated with the accelerator device.
  • Stop passing desired number of huge pages to Cluster as const uint32 &.
  • Remove env var controlling what directory the hugetlbfs is mounted in, as a step towards not requiring this at all.

Testing

CI

API Changes

This PR has API changes, but they should not break anything. Metal already requires libnuma.
If moving to libnuma from hwloc is a problem, there's a commit that uses hwloc instead that we can use.

@joelsmithTT joelsmithTT force-pushed the joelsmith/joelsmith/simplify-hugepage-code-v4 branch 3 times, most recently from 2d83b7e to 18a73fe Compare December 2, 2024 18:02
}

// Prefer allocations on the NUMA node associated with the device.
numa_set_preferred(numa_node);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please create a tt_metal branch which consumes this UMD commit, and then run tt_metal CI? We don't have machines on our CI that are affected by the change in this PR.

You could run these:
"all-post-commit-workflows.yaml"
"t3000-unit-tests.yaml"
"tg-unit-tests.yaml"
"tgg-unit-tests.yaml"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This exposed deficiencies in Metal CI, which is running tests that request more hugepages than are actually provisioned.

Copy link
Contributor

@pjanevskiTT pjanevskiTT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, left few nit comments

device/hugepage.cpp Show resolved Hide resolved
device/api/umd/device/cluster.h Show resolved Hide resolved
@joelsmithTT joelsmithTT force-pushed the joelsmith/joelsmith/simplify-hugepage-code-v4 branch 2 times, most recently from 513c37d to 1003a47 Compare December 5, 2024 20:46
README.md Show resolved Hide resolved
@joelsmithTT joelsmithTT force-pushed the joelsmith/joelsmith/simplify-hugepage-code-v4 branch from 3c2eb8f to 1ea6f57 Compare January 6, 2025 18:02
Rather than make it use a host memory channel, allow Cluster
construction to succeed with zero.  Insert a check in the remote IO code
paths to emit an actionable error message if there is no available
sysmem.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants