Using SCOOP for distributed parallel processing #111
…nModel This is so we can do parallelization etc within this method. For now it's very simple and just the same as before.
Run as python -m scoop -vv rmg.py path/to/input.py This is VERY slow, as we're pickling, passing, and unpickling the entire database every time we try to evaluate a species. …but it works :)
This is the start of a framework for sharing the database across multiple workers with scoop. Saving is a method of rmgpy.data.rmg. Loading is done when needed. The filename used to store the database pickle is set via the environment variable RMG_DB_FILE. Call it with something like: RMG_DB_FILE=$PWD/database.pkl python -m scoop -vv rmg.py path/to/input.py
The example is a copy of the methylformate example. The lsf.sh script should be submitted to the LSF queuing system. This is the system used on "Venture" at Northeastern University. Submit with "bsub < lsf.sh". Apparently Lava is related, so it may work for that also. http://en.wikipedia.org/wiki/Platform_LSF Equivalent ones should probably be written for PBS, etc. NB. the RMG_DB_FILE environment variable is required.
This function returns the checksum (hash) of a list of files/folders, eg. hash = rmgpy.utilities.path_checksum(['path/to/database']) This will be useful for checking whether things have changed, for cache validation.
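A checksum over a list of files and folders could look roughly like the sketch below. This is not the code from this branch: the function name `path_checksum_sketch`, the helper `_update_from_file`, and the choice of SHA-1 over relative names plus file contents are all assumptions made for illustration.

```python
import hashlib
import os


def _update_from_file(digest, filename, label):
    """Feed a file's label and contents into the running digest (hypothetical helper)."""
    digest.update(label.encode('utf-8'))
    with open(filename, 'rb') as f:
        # Read in blocks so large database files are not loaded whole.
        for block in iter(lambda: f.read(65536), b''):
            digest.update(block)


def path_checksum_sketch(paths):
    """Return a hex digest covering the given files and folders (illustrative only)."""
    digest = hashlib.sha1()
    for path in sorted(paths):
        if os.path.isdir(path):
            for root, dirs, files in os.walk(path):
                dirs.sort()  # fix the traversal order so the hash is reproducible
                for name in sorted(files):
                    full = os.path.join(root, name)
                    _update_from_file(digest, full, os.path.relpath(full, path))
        else:
            _update_from_file(digest, path, os.path.basename(path))
    return digest.hexdigest()
```

Hashing names relative to each folder means the result does not change when the same tree is checked out at a different absolute location, which matters if workers see the database under different paths.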
We hash a bunch of metadata to try to be sure that the cached database is the same as what would be loaded if you loaded it from scratch. Hopefully I have included everything that matters.
We pass around the location and hash of the database cache file so that each worker can load it and check it has the right version. This removes the need for the "RMG_DB_FILE" environment variable.
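The idea of handing each worker a cache location plus the expected hash might be sketched as follows. The names `save_cache` and `load_cache` and the dictionary layout of the pickle are hypothetical; the actual branch may store and check the hash differently.

```python
import os
import pickle


def save_cache(obj, filename, source_hash):
    """Pickle obj together with the hash of the sources it was built from."""
    with open(filename, 'wb') as f:
        pickle.dump({'hash': source_hash, 'payload': obj}, f, pickle.HIGHEST_PROTOCOL)


def load_cache(filename, expected_hash):
    """Return the cached object, or None if the file is missing or stale."""
    if not os.path.exists(filename):
        return None
    with open(filename, 'rb') as f:
        record = pickle.load(f)
    if record.get('hash') != expected_hash:
        return None  # cache was built from a different version of the sources
    return record['payload']
```

A worker that gets `None` back would fall back to loading from scratch, so a stale cache costs time but never yields the wrong data.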
…mporary solution to avoid attribute errors from cclib during parsing.
Adding standalone thermoestimator to scoop2 branch.
Getting the updates
This reverts commit 7c68b07.
Added a generator to divide species list into chunks (100 species) so that output.txt is written once a chunk is calculated.
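A chunking generator of this kind can be sketched as below. The name `chunks` and its signature are assumptions; only the default of 100 items per chunk comes from the commit message.

```python
def chunks(items, size=100):
    """Yield successive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError('chunk size must be at least 1')
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == '__main__':
    species = ['species_{0}'.format(i) for i in range(250)]  # placeholder data
    for number, chunk in enumerate(chunks(species)):
        # Results for one chunk could be written out here before the next starts.
        print(number, len(chunk))
```

Writing output once per chunk bounds how much work is lost if a long run is interrupted.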
…ics with different thresholds
Updated scoop from 0.62 to the development version of 0.7RC1. With this new version of scoop you can pass environment variables to workers through bash scripts (prolog.sh). Fixed the wrong usage of futures.map. Interestingly, the older version was working correctly even with this bug. Many debug logging statements remain and should be deleted.
Doesn't work. This reverts commit 384db4e.
All arguments are now optional. I was having problems with importing gprof2dot:
```
from external.gprof2dot import gprof2dot
```
works on Pharos with Python 2.6, while
```
from external import gprof2dot
```
works on my Mac with Python 2.7. There should be a better way of handling it.
To reduce the database pickle size, thermoEstimator now only loads thermo libraries.
Originally, RMG stopped if any two species in the initial species list were identical. Since this can happen frequently when using thermoEstimator with a large list of species, I changed it so that RMG ignores the duplicate and continues execution.
Optional positional argument is added to change chunk size.
I am still not sure the best way to do it.
QM calculations fail for linear molecules with an error '''only length-1 arrays can be converted to Python scalars'''. The problem is that we only need a single rotational constant for a linear rotor, while three of them were being passed.
There are still some issues with positioning of the lone pairs. For instance
1 N 1 1 {2,S} {3,S}
2 H 0 0 {1,S}
3 H 0 0 {1,S}
will draw the lone pair on the hydrogen rather than the N. But for now, we can avoid drawing lone pairs on oxygens and such where it is not needed.
The format syntax where you omit the field numbers was introduced in Python 2.7.
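The difference can be illustrated with a short sketch; the variable names and values are arbitrary placeholders, not taken from RMG.

```python
label, value = 'x', 1.5  # arbitrary illustration values

# Explicit field numbers: accepted by Python 2.6 and later.
explicit = '{0} = {1:.1f}'.format(label, value)

# Omitted field numbers: accepted only from Python 2.7 onward;
# Python 2.6 raises ValueError ("zero length field name in format").
implicit = '{} = {:.1f}'.format(label, value)

assert explicit == implicit
```

Sticking to the explicit form keeps the code usable on machines that still run Python 2.6.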
There is a bug in pickling of the solvation groups, and there is no reason to load them by default.
Scoop now with QMthermo and many updates from Master, from @keceli.
A recent run spent quite a lot of time generating thermo data, and this seems reasonably parallelizable (each species is independent). The overheads may still be large, but this is a start to try experimenting with.
MOPAC is programmed to use all available processors. Multithreading of RMG-py routines in which a call is made to MOPAC will suffer from CPU overloading when this is not dealt with explicitly.
In the example file on: the I want to parallelize this across multiple computers, rather than multiple processors. Has anyone tried to run this across multiple computers? Scoop provides a way to specify a list of host computers that you want to work on through the From what I can read on the Pharos wiki, the SGE on Pharos provides MPI functionality (e.g.
To run rmg-scoop across nodes I have used:
And prolog.sh has only
I posted this some time ago in the scoop-user mailing list, but nobody seems to listen. Maybe you have some inspiration that could help me out. Here's my problem: I want to create a scheme in which parallelism is done on 2 levels:
This is how I implemented it on Pharos. The example requests 48 CPUs that will be split across 6 workers, thus providing 8 CPUs to each worker. I started by requesting 48 CPUs from SGE, and specified that each node provides 8 CPUs, which leads to SGE assigning 6 nodes.
Next, I create a hostfile with the names of the hosts and append "1" to only allow 1 worker call per node. The hostfile looks like this:
Next, I tell SCOOP to only deploy 6 worker calls, through the -n option:
What happens next is that SCOOP uses only 1 node, and sends the 6 worker functions to this node, instead of distributing the 6 workers to the 6 available nodes:
I read that the As a result, my observation somewhat contradicts this statement... Any inspiration for experiments that would help me understand what's going on?
Maybe the easiest way to debug is to do it interactively. You can reserve By the way, how much speedup do you get when you run MOPAC on 8 cores?
OK, I figured out why the workers were not equally distributed among the reserved nodes. A hostfile in which the hostnames of the nodes are all on one line is not interpreted by SCOOP the way I thought it was. So this is wrong:
Instead, each hostname should be on a separate line:
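Generating a hostfile in that one-host-per-line layout could be sketched as follows. The function name `hostfile_text` and the node names are hypothetical; the trailing "1" per line follows the "one worker call per node" convention described earlier in this thread.

```python
def hostfile_text(hostnames, workers_per_host=1):
    """Return hostfile contents with one 'hostname count' entry per line."""
    lines = ['{0} {1}'.format(host, workers_per_host) for host in hostnames]
    return '\n'.join(lines) + '\n'


if __name__ == '__main__':
    # Placeholder node names; a real list would come from the queuing system.
    nodes = ['node01', 'node02', 'node03']
    print(hostfile_text(nodes), end='')
```

Building the file from a list, rather than pasting the scheduler's output, avoids accidentally putting several hostnames on a single line.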
The second thing I did wrong was not pointing to the hostfile that contains the list of hostnames.
In addition, the flag
As parallelizing MOPAC jobs seems to scale pretty well over processors and over nodes, I am now focusing on Gaussian jobs. After some days of testing, I figured out that
If the flag
Also: when I increase the value of
SCOOP (Scalable Concurrent Operations in Python) is a distributed task module allowing concurrent parallel programming on various environments, from heterogeneous grids to supercomputers.
http://code.google.com/p/scoop/
This branch implements it for bits of RMG-Py.
Do a `pip install scoop`, check out this branch, then try it with `python -m scoop -vv rmg.py path/to/input.py` or see the queue submission file in `examples/rmg/scoop/` for running on a multi-computer cluster.
I don't think this is ready to be merged yet, but am making a pull request as somewhere to store discussions.
So far:
Until we implement QMThermo, parallelizing thermo estimation alone doesn't save an awful lot of time.
Implementing the reactor simulations and PDep calculations may be more difficult, or at least involve more pickling and passing. E.g. can we unpickle an updated network after a remotely executed PDep calc, or will all the reactions and species point to the wrong things?