-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multiprocess/multithread pandora map --discover
issue with GATB graph creation
#195
Comments
If I understand that snippet of code in GATB correctly, provided we name the GATB graph something like a UUID then we shouldn't have any race conditions? |
We can name the GATB graph using a GATB option (not sure if I remember which one exactly, sth that gives GATB the prefix to the graph). No need to fork/change GATB, but just works if we apply multiprocessing. I think multithreading is better and more efficient. Then all threads when executing has the same process ID, and thus we have the race condition on the temporary files, which are based on the process ID. In this case, we need to fork GATB and give these temporary files a new name. It can be a UUID, or a name based on process + thread ID. It just needs a name that is unique to each thread of the process (or unique everytime you create a new graph). PS: actually, I just checked my local GATB git repo, and there were some unsaved changes. I tried to fix this issue, but IDK why I stopped. It might have been that it needed more work, but we decided to not continue on it. Or it might have been that I left to finish another day, but then we decided we don't need this at that time, so I just switched to sth else. Anyway, I just committed these local changes and pushed to a branch (leoisl/gatb-core@6ed808e) . Very basic changes, just replaces the name of the temp files from using the process ID to a random 64-char alphanumeric string. It seems I was changing also some tests to see if this worked. I absolutely do not remember if it worked, or not, if it needs further development, etc... this can be seen only as a head start for implementing this |
Does this example not name the graph already? http://gatb-core.gforge.inria.fr/doc/api/snippets_graph.html#snippets_kmer_dbg_4 |
I think yes... that is what we have in the examples testing multithreading in leoisl/gatb-core@6ed808e#diff-f91f479f4338c9539549dbf7ee6206dfR37-R47 |
Ok, so I'm confused about why we need to fork GATB then? Can't we just generate a UUID ourselves and specify this |
the only thing that
If there are multiple threads trying to create each one a graph simultaneously, all of them will try to write and read to/from the same temporary files, causing race condition in the filesystem, as all threads have the same process ID |
Ahhh I see. Makes sense now! |
I asked GATB team about how hard would be to do some changes in GATB for this multithreading to work: GATB/gatb-core#37 |
I did some digging today and it looks like only one thread is allowed to use the HDF5 library at any one time. The library can be built to be thread-safe though https://support.hdfgroup.org/HDF5/faq/threadsafe.html I am trying to see if GATB is indeed built with this feature on or not... |
What do you think about this link I posted on the GATB issue https://stackoverflow.com/questions/36308176/using-hdf5-thread-safe-library/47529439#47529439 ? i.e. thread-safe HDF5 lib can be accessed from C, but not from C++ Edit: I looked a bit at GATB code, and they do access HDF5 from C++ but they use the C API it seems... I wonder if this means that we can use thread-safe HDF5 lib... we still have some other issues, like singleton objects, but I guess this a big one that might be not very hard to solve... just testing I guess.. |
Well the problem I am running into now is that it seems the thread-safe feature can't be used for the static library - which is what GATB are using I think. |
Yeah, I think they indeed use the static library, but look at the pandora code slack channel, hdf5 is not a problem anymore, we can tell GATB to use flat files instead of HDF5... but we have other issues |
Closing, we use threads everywhere, except for |
We first discuss multiprocessing issues, and then multithreading.
We have issues running several
pandora map
jobs concurrently with the--discover
flag set (a sort of multiprocessing). It is not straightforward to reproduce, but what might happen is when finding paths through a candidate region, a GATB graph is created:https://github.com/rmcolq/pandora/blob/c16c1fac48ccdf98e0719ea003492c3fd2ec9034/src/denovo_discovery/denovo_discovery.cpp#L35-L38
which triggers the creation of a
dummy.h5
file in the working directory. Later on, this file is deleted. This works fine if only onepandora map
job is running on the working directory, because this part is not yet multithreaded, so only one GATB graph is created, processed and deleted at a time. However, when running severalpandora map
jobs with--discover
, all these jobs will try to create and remove thedummy.h5
file possibly simultaneously, and this could create race condition on the file system (it is hard to reproduce because it depends on the order of execution of these different jobs). This is the log that we would get in failingpandora map
jobs:I know a little bit about this because I have run into this issue in another tool already. This can be easily solved by telling GATB where to store the h5 file (and this will be the solution), but this thread is also to log how to trace the files created by GATB when creating a graph (so we know what we have to do when we decide do multithread this part).
These are the files accessed by GATB when creating a graph using the same parameters we use in pandora (obtained with
strace -y -t -e trace=open,close
):For multiprocessing, we just need to give GATB a prefix to write the
.h5
file, instead of it using the defaultdummy.h5
.For multithreading, it is a bit more complicated, because we need to ensure each thread create the temporary files (the
trashme_*
files) with a different prefix. However, GATB's way of getting a temporary filename is based on the process ID:https://github.com/GATB/gatb-core/blob/2781b0302c9feb60b48d7a142a4ee1873db2fccc/gatb-core/src/gatb/system/impl/FileSystemCommon.cpp#L182-L188
which is the same for each thread. We would need to actually fork GATB, and change this to something that is more robust to multithreading (I already have this implemented ~4 months ago, but I guess it is not finished, so we can continue this development whenever you want).
The text was updated successfully, but these errors were encountered: