-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue with locking when reading from multiple threads with Dask #214
Comments
I had issues passing around the file-like objects in #210 and it was resolved by passing around the path. I think passing around the path for reading will likely be the the most flexible and stable solution, even if it is not the most performant. It will likely also enable multi-core reading of the data. So, making that the default would be fine with me, If you want to look into it, I say 🚀 |
Apparently this was already possible. With the tip from @djhoese of passing in the different lock types, it works: https://corteva.github.io/rioxarray/latest/examples/dask_read_write.html. However, once this issue is resolved, I suspect we will be able to remove locks entirely when reading the data. |
Hmm, at least in my experiments (very rough notebook at https://nbviewer.jupyter.org/gist/TomAugspurger/73224c1aed6671085dfafdf778790e56) I didn't find this to be the case. You could read from multiple processes in parallel just fine, but not with multiple threads from within a single process (My use case had multiple threads reading from blob storage to ensure that a GPU was getting data fast enough). |
That is interesting. Is there a lock that works with multi-thread + multi-core workflows such as |
Mmm the default lock used in rioxarray, the Consider a setup with
So your cluster has 8 cores. With That's different from a I haven't looked yet, but it might be necessary to have a global lock when writing. I don't know if it's safe to write chunks of a COG from multiple processes at once. |
This is definitely the case. |
This is really good information. I think having all of this documented in the dask section would be helpful for users to be able to make that choice. |
Version 0.2 is now released with the change: https://corteva.github.io/rioxarray/stable/examples/read-locks.html |
I'm trying to understand the use of this, particularly when
In the corresponding rioxarray code it looks like |
I think you're supposed to pass an actual lock. Like |
Ok, yeah that's what it looks like. The |
Good catch 👍. Probably leftovers from another idea. Mind opening a separate issue for that? |
Currently, rioxarray.open_rasterio builds a
DataArray
with a_file_obj
that's an xarrayCacheFileManager
that eventually callsrasterio.open
on the user-provided input (typically a filename). When using rioxarray withchunks=True
to do parallel reads from multiple threads this can cause workers to wait for a lock, since GDAL file objects are not safe to be read from multiple threads.This snippet demonstrates the issue by creating and then reading a
(4, 2560, 2560)
TIF. It uses a SerializableLock subclass that adds a bit of logging when locks are requested and acquired.Here's the output to `lock.log:
So the section after
--- read metadata ---
is what matters. We see all the threads try to get the same lock. The first succeeds quickly. But the rest all need to wait. Interestingly, it seems that someone (probably GDAL) has cached the array , so the actual reads of the second through 4th bands are fast (shown by the waits all being around 0.7s). I don't know enough about GDAL / rasterio to say when pieces of data are read and cached, vs. when they're just not read period, but I suspect it depends on the internal chunking of the TIF file and the chunking pattern of the request.Another note: this only affects multi-threaded reads. If we change
dask.config.set(scheduler="processes")
, we don't see any waiting for locks. That's because the lock objects themselves are different after they've been serialized.As for potential solutions: a candidate is to pass around URIs like
example.tif
rather than cached file objects. Then each thread can open its own File object. I think this would have to be an option controlled by a parameter, since the exact tradeoff between the cost of opening a new file vs. reusing an open file object can't be known ahead of time.I haven't looked closely at whether an implementation like this is easily doable, but if there's interest I can take a look.
cc @scottyhq as well, in case this interests you :) xref to pangeo-data/cog-best-practices#2 and https://github.com/pangeo-data/cog-best-practices/blob/main/rasterio-concurrency.ipynb.
The text was updated successfully, but these errors were encountered: