Refactor test_random to minimize collective calls #1677

Merged on Oct 17, 2024: 26 commits merged into main from bug/amd-runner-test-random

@ClaudiaComito ClaudiaComito commented Oct 15, 2024

Due Diligence

  • General:
  • Implementation:
    • unit tests: all split configurations tested
    • unit tests: multiple dtypes tested
    • benchmarks: created for new functionality
    • benchmarks: performance improved or maintained
    • documentation updated where needed

Description

test_random has caused problems before, specifically in connection with .numpy() calls (i.e. an Allgather/Allgatherv followed by a copy to CPU).

As far as I can tell, no particular instance of "allgathering" is at fault. Since this Monday, test_random has been failing consistently on the AMD runner (2-process GPU tests) around the 10th numpy() call in the module.

I have refactored test_random to gather and copy only when absolutely necessary. It now gathers/copies to CPU only 8 times, as opposed to 47 in the legacy implementation.
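To illustrate the general idea (this is not the code from the PR), here is a minimal sketch in plain Python that simulates two processes. `SimulatedComm`, `make_shards`, and both check functions are hypothetical names invented for this example; no Heat or MPI calls are used. It compares verifying that random samples lie in [0, 1) by gathering every element with verifying the same property from per-process minima and maxima, where only scalars are exchanged.

```python
import random


class SimulatedComm:
    """Hypothetical stand-in for a communicator; counts elements exchanged."""

    def __init__(self):
        self.elements_moved = 0

    def allgather(self, shards):
        # Gathering moves every element of every local chunk.
        self.elements_moved += sum(len(s) for s in shards)
        return [x for shard in shards for x in shard]

    def allreduce(self, local_values, op):
        # A reduction moves one scalar per process.
        self.elements_moved += len(local_values)
        return op(local_values)


def make_shards(n_ranks, n_local, seed=0):
    """Create one list of random floats in [0, 1) per simulated process."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n_local)] for _ in range(n_ranks)]


def check_range_by_gathering(comm, shards):
    """Gather the full array, then check it (analogous to a .numpy() check)."""
    full = comm.allgather(shards)
    return all(0.0 <= x < 1.0 for x in full)


def check_range_by_reducing(comm, shards):
    """Check locally, then combine only the local minimum and maximum."""
    global_min = comm.allreduce([min(s) for s in shards], min)
    global_max = comm.allreduce([max(s) for s in shards], max)
    return global_min >= 0.0 and global_max < 1.0


shards = make_shards(n_ranks=2, n_local=1000)
gather_comm, reduce_comm = SimulatedComm(), SimulatedComm()

assert check_range_by_gathering(gather_comm, shards)
assert check_range_by_reducing(reduce_comm, shards)
print(gather_comm.elements_moved, reduce_comm.elements_moved)  # prints: 2000 4
```

The reductions are still collective operations, so this sketch reduces the volume of data exchanged and avoids materializing the full array on every process; it does not remove communication altogether.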

Issue/s resolved: #1682

Changes proposed:

  • remove unnecessary numpy() calls

Type of change

Bug fix (non-breaking change which fixes an issue)

Memory requirements

NA

Performance

NA

Does this change modify the behaviour of other functions? If so, which?

no

Copy link
Contributor

Thank you for the PR!


@ClaudiaComito ClaudiaComito changed the title Debugging test_random on AMD runner Refactor test_random to minimize collective calls Oct 16, 2024
@ClaudiaComito ClaudiaComito added this to the 1.5.0 milestone Oct 16, 2024
@ClaudiaComito ClaudiaComito added the bug, MPI, testing, HW:ROCm, and backport release/1.5.x labels Oct 16, 2024
@ClaudiaComito ClaudiaComito requested a review from mtar October 16, 2024 12:42
Copy link

codecov bot commented Oct 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.13%. Comparing base (b40646f) to head (b3e2b31).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1677   +/-   ##
=======================================
  Coverage   92.13%   92.13%           
=======================================
  Files          83       83           
  Lines       12165    12173    +8     
=======================================
+ Hits        11208    11216    +8     
  Misses        957      957           
Flag Coverage Δ
unit 92.13% <100.00%> (+<0.01%) ⬆️


Copy link
Collaborator

@mtar mtar left a comment

Thank you. You skipped the median tests. Is this intentional?


@ClaudiaComito
Copy link
Contributor Author

Thank you. You skipped the median tests. Is this intentional?

Yes, I skipped the ht.median tests because they are very communication-intensive. The np.median tests are all still in.
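As a rough illustration of why a distributed median tends to need more communication than a mean (a sketch under assumed data, not Heat's implementation): a mean can be rebuilt exactly from per-process sums and counts, whereas local medians cannot simply be combined into the global median. The chunk values below are hypothetical.

```python
import statistics

# Hypothetical local chunks held by two processes.
shards = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0, 100.0]]
all_values = [x for shard in shards for x in shard]

# Mean: each process contributes only (local_sum, local_count).
partials = [(sum(s), len(s)) for s in shards]
total = sum(p[0] for p in partials)
count = sum(p[1] for p in partials)
distributed_mean = total / count
assert distributed_mean == statistics.mean(all_values)  # both are 16.0

# Median: combining local medians does NOT give the global median.
median_of_local_medians = statistics.median(statistics.median(s) for s in shards)
global_median = statistics.median(all_values)
print(median_of_local_medians, global_median)  # prints: 4.0 4.5
```

Getting the exact global median requires information about the global ordering of the data, which is why it generally involves more data exchange than a single scalar reduction.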


@ClaudiaComito ClaudiaComito merged commit 4b3e570 into main Oct 17, 2024
43 checks passed
@ClaudiaComito ClaudiaComito deleted the bug/amd-runner-test-random branch October 17, 2024 15:17
github-actions bot pushed a commit that referenced this pull request Oct 17, 2024
* debugging

* fix misinterpretation of dtype

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* replace numpy() calls with alternative checks

* debugging

* debugging

* debugging randint

* debugging

* cast ints to float in statistical ops

* bypass numpy call l. 197

* bypass more numpy calls, skip median checks

* bypass more numpy calls, skip median checks

* bypass numpy calls wherever possible

* reinstate median checks

* skip ht.median if split>0

* skip all ht.median

* Revert "skip all ht.median"

This reverts commit 1241454.

* Revert "skip ht.median if split>0"

This reverts commit 4da8c93.

* Revert "reinstate median checks"

This reverts commit bf50914.

(cherry picked from commit 4b3e570)
Copy link
Contributor

Successfully created backport PR for release/1.5.x:

ClaudiaComito added a commit that referenced this pull request Oct 18, 2024
(cherry picked from commit 4b3e570)

Co-authored-by: Claudia Comito <[email protected]>
Development

Successfully merging this pull request may close these issues.

[Bug]: test_random fails on AMD GPU
2 participants