[WIP] Fix GUI MPI available cores #871

gtdang · 2024-08-23T21:11:05Z

The GUI MPI was not working on HPC systems where the number of cores partitioned to an instance is less than the total cores of a node.

This fix uses the number of available cores to the process instead of the total number of cores of the system.

Work done

Use psutil to get number of available cores instead of total system cores for Linux and Windows. Mac will still use total system cores.
Exposed the number of cores selection for the MPI backend. Default is set to 1 and the maximum is set to the number of available cores.
Fixed the issue where the backend sub-options (n cores) was not displaying for the default (joblib) on startup

closes #870

jasmainak · 2024-08-25T19:48:17Z

hnn_core/gui/gui.py

@@ -320,6 +321,9 @@ def __init__(self, theme_color="#802989",
        # load default parameters
        self.params = self.load_parameters(network_configuration)

+        # Number of available cores
+        self.n_cores = self._available_cores()


Usually, I've been told that one shouldn't use all the cores available when doing multiprocessing ... the user should be told how many cores are available and how many of those they want to use.

Yes agreed. I documented the GUI's behavior in issue #870

Using the joblib backend the number of cores is exposed and adjustable. With the MPI backend it is not. MPI uses the total number of cores-1. I was curious if there was a reason it was designed this way.

I thought you could adjust the number of cores in MPIBackend too. Is that not the case? I think n_cores - 1 may not be the best default ... but I'm not an MPI expert. I know it's a bad idea with joblibs

I would just set it to 1 rather than attempt to guess the total number of cores. During the tutorial, the user can be instructed to use more cores. The less magic there is in the code, the less likely it will fail.

gtdang · 2024-08-26T14:31:52Z

Ok. I've exposed the number of cores selection for MPI. The default is set to 1 and the maximum should be the total number of cores available to the process rather than the total number of cores on the system.

Note that core availability applies to Linux and Windows systems, Mac will just show the total system cores. This shouldn't be an issue given that this issue mostly applies to running on HPC nodes. Also the user can now adjust the number of cores for MPI manually.

rythorpe · 2024-08-26T20:38:22Z

Sorry, I didn't notice this until now. I just made a comment on the issue page. I'm not opposed to this change, but also don't think it's absolutely necessary to expose the parallel backend API in the GUI with the sole reason that it adds yet another variable for new users to trip over.

dylansdaniels · 2024-08-26T21:44:36Z

@rythorpe are you saying remove the exposed Backend option from the GUI entirely? We do switch the backend from joblib to MPI via the GUI in the workshops to speed things up for users and make them aware of that functionality, so I think we do need it. Though we can simplify it. I also weighed in on the issue page :)

jasmainak · 2024-08-27T05:16:27Z

I guess it's fair to say we just use MPIBackend and default to joblib if not available ... so no need to expose n_cores for joblibbackend (expose only for mpibackend) and keep it simple

gtdang · 2024-08-27T15:05:00Z

I think users should be able to specify number of system resources the GUI is utilizing for a number of reasons.

They can multitask with other processes running long simulation.
I've noticed that MPI can be slower if number of trials <<< number of processors specified. This is likely due to overhead costs.
- I tested this on Oscar with a 3 trial simulation specified with 48 cores vs 3 cores. MPI with 48 cores was slower than both joblib and MPI specified with 3 cores.
- I can't test this again with 48 cores though. My account got suspended for resource overuse for an hour. 😅 But I could test it with a lower amount.

That said, I could only test this because I've exposed the number of cores attribute to the user!

gtdang · 2024-08-27T20:59:34Z

I've been testing the workshop instances on Oscar (8 cores). I'm finding that MPI will fail on runs with 6-8 cores specified with the message below:

--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        node1717
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------

My instance is partitioned 8 cores on a 32 core node.

(hnn_env) [gdang2@node1717 ~]$ nproc
8
(hnn_env) [gdang2@node1717 ~]$ python
Python 3.11.0 (main, Jan 10 2024, 14:25:36) [GCC 11.3.1 20221121 (Red Hat 11.3.1-4)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import multiprocessing
>>> multiprocessing.cpu_count()
32
>>> import psutil
>>> psutil.cpu_count(logical=False)
32

There must be some sort of hardware restriction that prevents process assignment to certain cores here. In this case we would want to use oversubscribe in the mpi command, however the MPIBackend won't apply that keyword because of the logic below. In my case my n.procs < 32 and so oversubscribe and hyper-threading will never be executed.

hnn-core/hnn_core/parallel_backends.py

Lines 609 to 623 in ccef332

    
           # did user try to force running on more cores than available? 
        
           oversubscribe = False 
        
           if self.n_procs > n_logical_cores: 
        
               oversubscribe = True 
        
           hyperthreading = False 
        
           if _has_mpi4py() and _has_psutil(): 
        
               import psutil 
        
               n_physical_cores = psutil.cpu_count(logical=False) 
        
               # detect if we need to use hwthread-cpus with mpiexec 
        
               if self.n_procs > n_physical_cores: 
        
                   hyperthreading = True

rythorpe · 2024-08-27T22:43:55Z

A few things:

I agree with @jasmainak that we should always default to MPIBackend for the GUI and use JoblibBackend as backup.
@dylansdaniels in the old GUI I always found setting the simulation hyperparameters distracting to the flow of our workshops, so yeah, I guess I'm advocating for limiting backend exposure as much as possible. Users who need more advanced control of backend parameters should be encouraged to use the HNN-core API directly.
@gtdang, I totally see your point that parallelizing with the maximal number of cores is not ideal. Given communication overhead, I've generally found that setting n_procs to >20 isn't beneficial for the default network. The other thing to consider that is in favor of exposing the backend API, is if we ever host a workshop on shared resources (e.g., as we have in the past with Metacell), it may at times be more efficient to have individual users run a smaller number of parallel processes so that the class, as a whole, doesn't get bogged down.

Aside from changing the default to MPIBackend for the GUI as mentioned above, what if we fix the logic within MPIBackend to always allow oversubscription and then provide a resonable default for n_procs in the GUI? If you still think it's necessary to expose the backend API in the GUI, I can get on-board, but I do think we should be mindful about limiting the number non-critical parameters in the GUI so that they're (at least visually) not overwhelming.

dylansdaniels · 2024-08-27T23:37:57Z

In addition to Metacell, we also have the GUI available on other shared resources / HPCs like NSG and Oscar. In future workshops, I believe the plan is to encourage more people to start using the GUI on NSG, and I wonder if we'll run into the issue @gtdang found where MPI fails

gtdang · 2024-08-28T14:46:38Z

I don't think the default for the GUI should be a backend that is an optional dependency that you can't install with the package and which requires variable set-up depending on user hardware and operating system.

rythorpe · 2024-08-28T15:50:09Z

As @jasmainak suggested, the idea would be to default to MPIBackend only if it's available, and then switch to JoblibBackend if it's not.

Is there perhaps another way we can streamline use of the MPIBackend for the GUI? I don't think we've ever run a workshop without explicitly using this backend because simulations take too long without it. With the previous GUI, there were a handful of steps that added to the frustration (at least for me) of just getting a single simulation to run in real-time during workshops, one of which was setting the number of cores over which to parallelize the simulation. In service of the vast majority of GUI use-cases, it'd be nice to streamline this better. As I mentioned previously, adding the number of cores as a simulation parameter in the GUI isn't, in itself, a deal breaker, but I think there's something to be said for a GUI simulation that "just runs" quickly without having to sift through a myriad of parameters that don't directly relate to the scientific discourse of our workshops.

gtdang · 2024-08-28T16:33:47Z

OK. I think that check for mpi and changing the backend on the start-up should be easy to implement.

I agree that a change to the MPIBackend class is needed, though I don't think we yet understand all the variability and system behaviors of different hardware set-ups to define default safe parameters and arguments with the mpiexec call. I only have been able to test my local macOS and Oscar's virtual machine instance GUI behavior. As the MPIBackend class is the core API, we ought to also look at how it behaves with different system architectures, operating systems. For example, just in the world of HPC, I'm not sure how it will behave in a virtual machine instance vs a Slurm job.

Given that unknown I feel like it's the easiest for now to expose the GUI argument to allow users to try different values that work for their system environment.

gtdang · 2024-08-28T19:49:30Z

Added logic to check of packages for mpi use and setting the default backend. b5a07da

jasmainak · 2024-08-28T20:38:18Z

hnn_core/gui/gui.py

+        # Displays the default backend options
+        handle_backend_change(self.widget_backend_selection.value,
+                              self._backend_config_out, self.widget_mpi_cmd,
+                              self.widget_n_jobs)


I was actually suggesting something simpler ... hnn-core already defaults to JoblibBackend if MPI is not available, so all you had to do was expose MPIBackend and get rid of the JoblibBackend. Anyhow @dylansdaniels do you want this merged for the GUI tutorial today?

That handler in the gui changes the displayed options based on the backend drop-down. It doesn't switch between backends. In hnn-core MPIBackend only switches to joblib if it's a single process, I don't think it checks for MPI and reverts to joblib if it can't find it.

Nevermind, you're right. It does revert to joblib. Though it runs on joblib with only 1 processor, is there a reason why it only uses 1?

I think it's because specifying multiple processes with the joblib backend, unlike the MPI backend, deosn't make sense for all types of simulations. Since joblib parallelizes over multiple trials, it only works for simulations that include multiple trials. Plus parallel joblib processes are memory intensive, so silently switching from MPI to joblib can produce errors if the user wasn't anticipating needing the extra resources needed for a joblib task with many trials.

I agree with @rythorpe . I would say the GUI should just use MPIBackend, and if not available, just run everything in a single core. It's probably a misnomer to call it "JoblibBackend" if it's running on a single core. Hence, I wouldn't bother exposing it to user.

Do we know if anyone who is actually using JoblibBackend in the GUI for the their research? I personally don't know of anyone and am generally under the impression that folks who prefer the GUI would want a general-use parallel backend that works without too much fuss.

Your work @ntolley is unique in that your questions require lots of parameter sweeps, which are best served by using the API directly. Note that limited RAM is only an issue for Joblib, which I'm arguing should essentially default to running a single job within the context of the GUI.

Pretty sure a lot of Darcy's initial work compared over multiple trials, although I'll concede she's using the API now. If you're doing 1-1 comparisons of ERP's for instance it's definitely more robust to compute an average over multiple trials. In tutorials we use 1 trial because it's efficient, but that isn't necessarily how you should fit these signals to data.

Again I totally agree with making MPI with oversubscription the default, but I genuinely think trial-based parallelism is a useful feature even in the GUI setting!

These are my cards on the table 😄, but if I get outvoted I'll concede

Fair enough. If GUI users use it, we should support it.

I personally feel that once a feature is added, users find ways to justify keeping it. That's why one should start with minimal features first and then add based on necessity and feedback from user base. Anyway, call is yours @ntolley !

rythorpe · 2024-08-28T21:10:07Z

OK. I think that check for mpi and changing the backend on the start-up should be easy to implement.

I agree that a change to the MPIBackend class is needed, though I don't think we yet understand all the variability and system behaviors of different hardware set-ups to define default safe parameters and arguments with the mpiexec call. I only have been able to test my local macOS and Oscar's virtual machine instance GUI behavior. As the MPIBackend class is the core API, we ought to also look at how it behaves with different system architectures, operating systems. For example, just in the world of HPC, I'm not sure how it will behave in a virtual machine instance vs a Slurm job.

Given that unknown I feel like it's the easiest for now to expose the GUI argument to allow users to try different values that work for their system environment.

That's fair. I still think it's non-ideal that MPI oversubscription isn't being invoked when it's supposed too, but we can fix this later if you think more investigation is needed.

jasmainak · 2024-09-03T14:25:23Z

hnn_core/gui/gui.py

-                                                 value='Joblib',
-                                                 description='Backend:')
+        self.widget_backend_selection = (
+            Dropdown(options=[('Joblib', 'Joblib'),


don't bother exposing this to user

asoplata · 2024-12-13T21:03:34Z

I've made a PR to @gtdang 's branch here: brown-ccv#2 which appears to work in and compensate for determining cores etc. in all of the expected use-cases . The only additions to George's changes that are not due to an master-update-rebase are in this commit: brown-ccv@25c78c0 Pasting the list of use-cases here from the commit message:

In _determine_cores_hwthreading(), if you read the lengthy comments, I
have thought through each common scenario, and I believe resolved what
to do for each, with respect to the number of cores to use and whether
or not to use hardware-threading. These scenarios include: the user
choosing to use hardware-threading (default) or not, across Macos
variations with and without hardware-threading, Linux local computer
variations with and without hardware-threading, and Linux
HPC (e.g. OSCAR) variations which appear to never support
hardware-threading. In the Windows case, due to both #589 and the
currently-untested MPI integration on Windows, I always report the
machine as not having hardware-threading.

Note that this does NOT change the default number of jobs to use for the
GUI if MPI is detected. Such a change breaks the current test_gui.py
testing: see #960 .

asoplata · 2024-12-16T18:28:36Z

There is some git cleanup going to happen before this is merged: @gtdang is going to rebase his work onto a side branch on upstream, then I will apply my changes to that. This should make the result significantly easier to see the code changes.

…based on the operation system

… attribute where relevant.

…d on start-up Previously the sub-options were hidden by default and only displayed when the backend dropdown was changed. This hid the number of cores option for the default joblib backend on start-up.

…nstalled

gtdang force-pushed the gui-mpi-available-cores branch from 11e8b3c to 0c6f4df Compare August 23, 2024 21:14

gtdang requested review from kmilo9999, dylansdaniels and ntolley August 23, 2024 21:14

jasmainak reviewed Aug 25, 2024

View reviewed changes

gtdang changed the title ~~[WIP] Fix GUI MPI available cores~~ [MRG] Fix GUI MPI available cores Aug 26, 2024

gtdang marked this pull request as ready for review August 26, 2024 17:53

gtdang force-pushed the gui-mpi-available-cores branch from c303f73 to 0954871 Compare August 28, 2024 19:22

jasmainak reviewed Aug 28, 2024

View reviewed changes

jasmainak reviewed Sep 3, 2024

View reviewed changes

gtdang mentioned this pull request Sep 3, 2024

[MRG] Fix MPIBackend platform logic #876

Merged

asoplata added the hnn-gui HNN GUI label Nov 15, 2024

asoplata mentioned this pull request Dec 13, 2024

BUG: Network jones2009_3x3_drives.json is incompatible with MPI #960

Open

2 tasks

asoplata changed the title ~~[MRG] Fix GUI MPI available cores~~ [WIP] Fix GUI MPI available cores Dec 16, 2024

feat: updated run_button_clicked to take a number of cores argument

0f2d38b

gtdang added 8 commits December 16, 2024 15:37

feat: created a static method to count the number of available cores …

2cb0aca

…based on the operation system

feat: added number of available cores as an attribute and applied the…

01859d7

… attribute where relevant.

feat: added number of cores selection for the mpi backend

c0eb000

fix: updated the gui compose so that backend sub-options are displaye…

4aa89fa

…d on start-up Previously the sub-options were hidden by default and only displayed when the backend dropdown was changed. This hid the number of cores option for the default joblib backend on start-up.

chore: removed unused argument n_cores from run_button_clicked function

12e5e7b

docs: doc string for _available_cores method

e1b2035

feat: changes the default backend based on if mpi4py and psutil are i…

c67139c

…nstalled

fix: force joblib for test_gui_run_simulations

8f929ca

gtdang force-pushed the gui-mpi-available-cores branch from 6159e1f to 8f929ca Compare December 16, 2024 20:38

chore: resolve merge conflict

64b2b02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Fix GUI MPI available cores #871

[WIP] Fix GUI MPI available cores #871

gtdang commented Aug 23, 2024 •

edited

Loading

jasmainak Aug 25, 2024

gtdang Aug 26, 2024

jasmainak Sep 3, 2024

jasmainak Sep 3, 2024

gtdang commented Aug 26, 2024

rythorpe commented Aug 26, 2024

dylansdaniels commented Aug 26, 2024 •

edited

Loading

jasmainak commented Aug 27, 2024 •

edited

Loading

gtdang commented Aug 27, 2024 •

edited

Loading

gtdang commented Aug 27, 2024 •

edited

Loading

rythorpe commented Aug 27, 2024

dylansdaniels commented Aug 27, 2024 •

edited

Loading

gtdang commented Aug 28, 2024

rythorpe commented Aug 28, 2024

gtdang commented Aug 28, 2024

gtdang commented Aug 28, 2024

jasmainak Aug 28, 2024

gtdang Aug 28, 2024 •

edited

Loading

gtdang Aug 29, 2024

rythorpe Aug 29, 2024

jasmainak Sep 3, 2024

rythorpe Sep 10, 2024

ntolley Sep 12, 2024

ntolley Sep 12, 2024

rythorpe Sep 12, 2024

jasmainak Sep 12, 2024

rythorpe commented Aug 28, 2024

jasmainak Sep 3, 2024

asoplata commented Dec 13, 2024 •

edited

Loading

asoplata commented Dec 16, 2024

[WIP] Fix GUI MPI available cores #871

Are you sure you want to change the base?

[WIP] Fix GUI MPI available cores #871

Conversation

gtdang commented Aug 23, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gtdang commented Aug 26, 2024

rythorpe commented Aug 26, 2024

dylansdaniels commented Aug 26, 2024 • edited Loading

jasmainak commented Aug 27, 2024 • edited Loading

gtdang commented Aug 27, 2024 • edited Loading

gtdang commented Aug 27, 2024 • edited Loading

rythorpe commented Aug 27, 2024

dylansdaniels commented Aug 27, 2024 • edited Loading

gtdang commented Aug 28, 2024

rythorpe commented Aug 28, 2024

gtdang commented Aug 28, 2024

gtdang commented Aug 28, 2024

Choose a reason for hiding this comment

gtdang Aug 28, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rythorpe commented Aug 28, 2024

Choose a reason for hiding this comment

asoplata commented Dec 13, 2024 • edited Loading

asoplata commented Dec 16, 2024

gtdang commented Aug 23, 2024 •

edited

Loading

dylansdaniels commented Aug 26, 2024 •

edited

Loading

jasmainak commented Aug 27, 2024 •

edited

Loading

gtdang commented Aug 27, 2024 •

edited

Loading

gtdang commented Aug 27, 2024 •

edited

Loading

dylansdaniels commented Aug 27, 2024 •

edited

Loading

gtdang Aug 28, 2024 •

edited

Loading

asoplata commented Dec 13, 2024 •

edited

Loading