Refactor `test_random` to minimize collective calls #1677

ClaudiaComito · 2024-10-15T09:35:03Z

Due Diligence

General:
- title of the PR is suitable to appear in the Release Notes
Implementation:
- unit tests: all split configurations tested
- unit tests: multiple dtypes tested
- benchmarks: created for new functionality
- benchmarks: performance improved or maintained
- documentation updated where needed

Description

`test_random` has caused problems before in connection with `.numpy()` calls (i.e. Allgather/Allgatherv followed by a copy to CPU).

As far as I can tell, no particular instance of "allgathering" is at fault. On the AMD runner (2-process GPU tests), `test_random` has been failing consistently since this Monday, around the 10th `numpy()` call in the module.

I have refactored `test_random` to gather and copy only when absolutely necessary. It now gathers/copies to CPU only 8 times, as opposed to 47 times in the legacy implementation.
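
To illustrate the general idea of checking a statistical property without gathering the full array, here is a minimal plain-Python sketch. It is not the code in this PR and does not use the Heat API: the two lists stand in for the process-local chunks of a distributed array, and `local_summary` and `combine` are hypothetical helper names. Each simulated rank contributes only three scalars, which would correspond to a small reduction rather than an Allgather of all the data.

```python
import random
import statistics


def local_summary(chunk):
    """Per-rank summary: element count, sum, and sum of squares."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)


def combine(summaries):
    """Combine per-rank summaries into the global mean and population variance."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    var = total_sq / n - mean * mean
    return mean, var


# Two simulated ranks, each holding 5000 uniform samples in [0, 1).
rng = random.Random(0)
chunks = [[rng.random() for _ in range(5000)] for _ in range(2)]

# Only three numbers per rank are exchanged here, not the samples themselves.
mean, var = combine([local_summary(c) for c in chunks])

# Reference values computed from the fully "gathered" data, for comparison only.
gathered = [x for c in chunks for x in c]
ref_mean = statistics.fmean(gathered)
ref_var = statistics.pvariance(gathered)
```

In this sketch the combined values agree with the reference values up to floating-point rounding, so a check such as "the mean of uniform samples is close to 0.5" would not need the gathered copy at all.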

Issue/s resolved: #1682

Changes proposed:

remove unnecessary numpy() calls

Type of change

Bug fix (non-breaking change which fixes an issue)

Memory requirements

NA

Performance

NA

Does this change modify the behaviour of other functions? If so, which?

no

github-actions · 2024-10-15T09:59:35Z

Thank you for the PR!

codecov · 2024-10-16T13:06:09Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.13%. Comparing base (b40646f) to head (b3e2b31).
Report is 1 commit behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1677   +/-   ##
=======================================
  Coverage   92.13%   92.13%           
=======================================
  Files          83       83           
  Lines       12165    12173    +8     
=======================================
+ Hits        11208    11216    +8     
  Misses        957      957

Flag	Coverage Δ
unit	`92.13% <100.00%> (+<0.01%)`	⬆️

mtar

Thank you. You skipped the median tests. Is this intentional?

ClaudiaComito · 2024-10-17T09:00:14Z

> Thank you. You skipped the median tests. Is this intentional?

Yes, I skipped the ht.median tests because they are very communication-intensive. The np.median tests are all still in.
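
As general background (not a description of how `ht.median` is implemented), a median cannot be assembled from per-process medians the way a mean can be assembled from per-process sums, which is one plausible reason a distributed median needs more communication. A minimal plain-Python sketch, where the two lists are hypothetical stand-ins for process-local chunks:

```python
import statistics

# Two simulated process-local chunks of unequal size.
chunks = [[1, 2, 3], [4, 5, 6, 7, 100, 200]]

# Combining local medians does not give the global median.
local_medians = [statistics.median(c) for c in chunks]
median_of_medians = statistics.median(local_medians)

# The true median needs a view of the globally ordered data.
global_median = statistics.median([x for c in chunks for x in c])
```

Here the median of the local medians and the global median differ, so the result depends on data held by other processes.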

github-actions · 2024-10-17T15:17:43Z

Successfully created backport PR for release/1.5.x:

[Backport release/1.5.x] Refactor test_random to minimize collective calls #1683

* debugging
* fix misinterpretation of dtype
* debugging
* debugging
* debugging
* debugging
* debugging
* debugging
* debugging
* debugging
* replace numpy() calls with alternative checks
* debugging
* debugging
* debugging randint
* debugging
* cast ints to float in statistical ops
* bypass numpy call l. 197
* bypass more numpy calls, skip median checks
* bypass more numpy calls, skip median checks
* bypass numpy calls wherever possible
* reinstate median checks
* skip ht.median if split>0
* skip all ht.median
* Revert "skip all ht.median" This reverts commit 1241454.
* Revert "skip ht.median if split>0" This reverts commit 4da8c93.
* Revert "reinstate median checks" This reverts commit bf50914.

(cherry picked from commit 4b3e570)

Co-authored-by: Claudia Comito <[email protected]>

ClaudiaComito added 3 commits October 15, 2024 11:34

debugging

62942d9

fix misinterpretation of dtype

024b9e9

debugging

6640c7a

debugging

62dca2b

debugging

0e5ec77

debugging

8114f8d

ClaudiaComito added 2 commits October 15, 2024 14:52

debugging

d4c433c

debugging

e58a3ec

ClaudiaComito added 3 commits October 15, 2024 15:13

debugging

4230d08

debugging

725dc02

replace numpy() calls with alternative checks

621eb48

debugging

6c01e17

debugging

315b3c4

debugging randint

d2b3240

debugging

a4d439b

ClaudiaComito added 2 commits October 16, 2024 12:04

cast ints to float in statistical ops

45dcbe1

bypass numpy call l. 197

3cf651d

bypass numpy calls wherever possible

da80129

ClaudiaComito changed the title ~~Debugging test_random on AMD runner~~ Refactor test_random to minimize collective calls Oct 16, 2024

ClaudiaComito added this to the 1.5.0 milestone Oct 16, 2024

ClaudiaComito added the bug, MPI, testing, HW:ROCm, and backport release/1.5.x labels Oct 16, 2024

ClaudiaComito requested a review from mtar October 16, 2024 12:42

mtar reviewed Oct 16, 2024

reinstate median checks

bf50914

skip ht.median if split>0

4da8c93

skip all ht.median

1241454

ClaudiaComito added 3 commits October 17, 2024 12:12

Revert "skip all ht.median"

835a555

This reverts commit 1241454.

Revert "skip ht.median if split>0"

726d784

This reverts commit 4da8c93.

Revert "reinstate median checks"

b3e2b31

This reverts commit bf50914.

mtar approved these changes Oct 17, 2024

ClaudiaComito merged commit 4b3e570 into main Oct 17, 2024
43 checks passed

ClaudiaComito deleted the bug/amd-runner-test-random branch October 17, 2024 15:17

github-actions bot mentioned this pull request Oct 17, 2024

[Backport release/1.5.x] Refactor test_random to minimize collective calls #1683

Merged
