Whither test_network_sync? #57

Closed · rouson opened this issue Apr 20, 2022 · 11 comments

rouson (Collaborator) commented Apr 20, 2022

After attempting to develop test_network_sync.f90 into a test that reports pass or fail, and after seeking advice from Tom Clune, who develops the pFUnit testing framework, I'm not confident there's a way to write a test that is both meaningful and straightforward. Without doing extensive research myself, my sense is that the formula employed in the randn functions has been thoroughly studied and there's not much to test in its behavior beyond what is already known. @milancurcic, if you have thoughts on a useful test, please let me know. Otherwise, I recommend removing the test, given that it currently just prints the generated samples without any check on their statistical properties.
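
For reference, here's a minimal, self-contained sketch of what such a statistical check might look like. It assumes a Box-Muller transform, which is one common formula behind randn-style generators (an assumption on my part, not a statement about this library's implementation), and the tolerances are arbitrary illustrations rather than derived bounds:

program check_randn_stats
  implicit none
  integer, parameter :: n = 100000
  real, parameter :: pi = 4*atan(1.0)
  real :: u1(n), u2(n), samples(n), mean, stdev
  call random_number(u1)
  call random_number(u2)
  ! Box-Muller transform: maps uniform variates to approximately N(0,1).
  ! (1 - u1 avoids log(0), since random_number yields values in [0,1).)
  samples = sqrt(-2*log(1 - u1))*cos(2*pi*u2)
  mean = sum(samples)/n
  stdev = sqrt(sum((samples - mean)**2)/(n - 1))
  ! Loose, arbitrary tolerances chosen for illustration only.
  if (abs(mean) < 0.05 .and. abs(stdev - 1) < 0.05) then
    print *, "PASS: sample statistics near N(0,1)"
  else
    print *, "FAIL: mean =", mean, " stdev =", stdev
  end if
end program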

rouson (Collaborator, Author) commented Apr 20, 2022

Also, the sync type-bound procedure suffers from an unfortunate ambiguity in its name. I understand the reason for the name, but I'd just call it broadcast or something similar because there's no explicit synchronization (sync all or sync images) and the standard imposes no precise synchronization semantics on collective subroutines other than noting that "the calculations performed by a collective subroutine have some internal synchronizations."
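
For context, a minimal sketch of the broadcast behavior in question, using the intrinsic collective co_broadcast directly (compile with caf and run with cafrun, as in the example further down):

program broadcast_demo
  implicit none
  real :: weights(3)
  call random_number(weights)                 ! images may start with different values
  call co_broadcast(weights, source_image=1)  ! all images now hold image 1's values
  print *, this_image(), ":", weights
end program

Note that no sync all appears: the only synchronization is whatever the collective performs internally.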

milancurcic (Member) commented

Yes, I wouldn't worry about this one as it will likely go away. I suggest focusing on io, datasets, and activations.

milancurcic (Member) commented

I looked at what sync does to remind myself. It updates the weights and biases on all images > 1 to match those on image 1. A passing test for this would ensure they are indeed exactly the same after the call to sync.
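
A sketch of such a test, with a plain coarray standing in for the network's weights and biases and co_broadcast standing in for the sync call (placeholder names; the real test would traverse the network type's components):

program test_sync_sketch
  implicit none
  real :: weights(5)[*]                       ! coarray so any image can read image 1's copy
  call random_number(weights)                 ! images start with independent values
  call co_broadcast(weights, source_image=1)  ! stands in for the call to sync
  sync all                                    ! make image 1's final values safe to read
  ! Exact equality is the right check: the broadcast copies values bit-for-bit.
  if (any(weights /= weights(:)[1])) then
    print *, "FAIL: image", this_image(), "differs from image 1"
  else
    print *, "PASS: image", this_image(), "matches image 1"
  end if
end program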

rouson (Collaborator, Author) commented Apr 20, 2022

@milancurcic perfect. Thanks. I'll submit a PR with the suggested test.

rouson (Collaborator, Author) commented Apr 20, 2022

@milancurcic It still might be that test_network_sync is unnecessary. Fortran 2018 provides an intrinsic subroutine, random_init(repeatable, image_distinct), where both arguments are logical with intent(in), and where passing image_distinct=.false. ensures that each image uses the same sequence of random numbers. At scale, some amount of redundant computation generally outperforms communication, so the code will likely scale better if we replace the co_broadcast with a call to random_init with image_distinct=.false.

The GNU, Intel, and Cray compilers support random_init. Some important updates to the behavior of random_init will appear in GCC 12, however, so I'll need to check whether GCC 11 works for our purposes. I just submitted a feature request to NAG, which usually responds quickly, so there's a reasonable chance that the NAG compiler will support random_init soon too.

rouson (Collaborator, Author) commented Apr 20, 2022

Actually, yet another way to view this is that the test is useful either way; in fact, it should give the same result either way. How that result is produced is an implementation detail that can be considered separately.

milancurcic (Member) commented

Yes, in other words, ensuring that the initial state of the network is valid across images.

rouson (Collaborator, Author) commented Apr 21, 2022

I have determined that with gfortran 11.2.0 and the head of the main branch of OpenCoarrays, calling random_init with image_distinct=.false. produces the same random numbers on each image without the need for co_broadcast. See below. I haven't tested earlier versions of gfortran or OpenCoarrays.

I recommend this approach for scalability reasons. I'll submit a PR with separate commits that add the test and replace the co_broadcast with random_init. I'm thinking of setting repeatable=.true. so that the test result is deterministic. The current test is non-deterministic, so if you prefer that approach, let me know. I suppose we could also consider having both a deterministic test and a non-deterministic test. I'm not sure it matters much either way, given that our only goal is to ensure the same weights on all images.

% cat random.f90 
program main
  implicit none
  real harvest(5)
  ! repeatable=.true., image_distinct=.false.: same sequence on every image
  call random_init(.true., .false.)
  call random_number(harvest)
  print *, this_image(), ":", harvest
end program

% caf random.f90 

% cafrun -n 8 ./a.out
           5 :  0.571258783      0.226831257      0.175722122      0.691333890      0.885308683    
           6 :  0.571258783      0.226831257      0.175722122      0.691333890      0.885308683    
           7 :  0.571258783      0.226831257      0.175722122      0.691333890      0.885308683    
           8 :  0.571258783      0.226831257      0.175722122      0.691333890      0.885308683    
           1 :  0.571258783      0.226831257      0.175722122      0.691333890      0.885308683    
           2 :  0.571258783      0.226831257      0.175722122      0.691333890      0.885308683    
           3 :  0.571258783      0.226831257      0.175722122      0.691333890      0.885308683  

milancurcic (Member) commented

Thanks @rouson. Any chance for this to work with GFortran 10?

rouson (Collaborator, Author) commented Apr 25, 2022

Most likely, yes. Certainly the current approach should work with GFortran 10. If random_init works with 10, I'll submit that approach as a separate PR.

milancurcic (Member) commented

Closing in favor of #78.
