-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify hugepage code, Chapter I #332
base: main
Are you sure you want to change the base?
Conversation
2d83b7e
to
18a73fe
Compare
} | ||
|
||
// Prefer allocations on the NUMA node associated with the device. | ||
numa_set_preferred(numa_node); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please create a tt_metal branch which consumes this UMD commit, and then run tt_metal CI? We don't have machines on our CI that are affected by the change in this PR.
You could run these:
"all-post-commit-workflows.yaml"
"t3000-unit-tests.yaml"
"tg-unit-tests.yaml"
"tgg-unit-tests.yaml"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This exposed deficiencies in Metal CI, which is running tests that request more hugepages than are actually provisioned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, left few nit comments
513c37d
to
1003a47
Compare
Also remove the cruft that sprouted up to support its continued existence in UMD Remove const int& antipattern Use libnuma
3c2eb8f
to
1ea6f57
Compare
Rather than make it use a host memory channel, allow Cluster construction to succeed with zero. Insert a check in the remote IO code paths to emit an actionable error message if there is no available sysmem.
Issue
#254
Description
Part 1 of an effort to simplify the hugepage scheme for shared memory between device and host. Initial work on this also removed hugetlbfs aspects of the setup, but I will save that for part 2.
List of the changes
Cluster
asconst uint32 &
.Testing
CI
API Changes
This PR has API changes, but they should not break anything. Metal already requires libnuma.
If moving to libnuma from hwloc is a problem, there's a commit that uses hwloc instead that we can use.