Parallel data loading doesn't work on macOS/Windows #184

Closed
adamjstewart opened this issue Oct 2, 2021 · 3 comments · Fixed by #304
Labels: datasets (Geospatial or benchmark datasets), samplers (Samplers for indexing datasets)
Milestone: 0.1.1

Comments

@adamjstewart
Collaborator

When trying to benchmark the data loader with num_workers > 0, it crashes with the following error message:

Traceback (most recent call last):
  File "benchmark.py", line 299, in <module>
    main(args)
  File "benchmark.py", line 181, in main
    for i, batch in enumerate(dataloader):
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/Users/Adam/torchgeo/torchgeo/datasets/geo.py", line 719, in __getitem__
    if not query.intersects(self.bounds):
  File "/Users/Adam/torchgeo/torchgeo/datasets/geo.py", line 760, in bounds
    minx = max([ds.bounds[0] for ds in self.datasets])
  File "/Users/Adam/torchgeo/torchgeo/datasets/geo.py", line 760, in <listcomp>
    minx = max([ds.bounds[0] for ds in self.datasets])
  File "/Users/Adam/torchgeo/torchgeo/datasets/geo.py", line 135, in bounds
    return BoundingBox(*self.index.bounds)
  File "/Users/Adam/torchgeo/torchgeo/datasets/utils.py", line 213, in __new__
    raise ValueError(f"Bounding box is invalid: 'minx={minx}' > 'maxx={maxx}'")
ValueError: Bounding box is invalid: 'minx=1.7976931348623157e+308' > 'maxx=-1.7976931348623157e+308'

Since this doesn't occur in serial or on Linux, I'm guessing this has something to do with the fact that Python's multiprocessing module switched from fork to spawn as the default start method on macOS for Python 3.8+.
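
For reference, a quick check (nothing TorchGeo-specific) that shows why this only reproduces off Linux. With spawn, the dataset object is pickled and sent to each worker process, so anything on it that doesn't survive pickling breaks:

import multiprocessing

# On Linux the default start method is "fork", so workers inherit the parent's
# memory and the dataset is never pickled. On macOS (Python 3.8+) and Windows
# the default is "spawn", which pickles the dataset and ships a copy to each worker.
print(multiprocessing.get_start_method())  # "fork" on Linux, "spawn" on macOS/Windows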

@adamjstewart added the datasets and samplers labels on Oct 2, 2021
@adamjstewart
Collaborator Author

Also, the fact that this wasn't caught by our unit tests means we need better integration tests. We do test our samplers in parallel, but not with a real GeoDataset.

@adamjstewart added this to the 0.1.1 milestone on Nov 20, 2021
@adamjstewart
Collaborator Author

This issue is caused by Toblerity/rtree#87; rtree indices do not pickle correctly:

$ python
Python 3.8.12 (default, Oct 25 2021, 13:41:41) 
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> from rtree import index
>>> idx = index.Index()
>>> idx.insert(0, (0, 1, 2, 3))
>>> print(idx)
rtree.index.Index(bounds=[0.0, 1.0, 2.0, 3.0], size=1)
>>> x = pickle.dumps(idx)
>>> y = pickle.loads(x)
>>> print(y)
rtree.index.Index(bounds=[1.7976931348623157e+308, 1.7976931348623157e+308, -1.7976931348623157e+308, -1.7976931348623157e+308], size=0)
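
One possible workaround (a minimal sketch, not necessarily what the eventual fix in #304 does) is to keep the inserted entries in a plain Python list and rebuild the index on unpickling via __getstate__/__setstate__:

import pickle

from rtree import index


class PicklableIndex:
    """Hypothetical wrapper that re-inserts its entries after unpickling."""

    def __init__(self, **kwargs):
        self._kwargs = kwargs
        self._entries = []  # (id, coords, obj) tuples kept purely for pickling
        self._index = index.Index(**kwargs)

    def insert(self, id, coords, obj=None):
        self._entries.append((id, coords, obj))
        self._index.insert(id, coords, obj)

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        # Delegate queries (intersection, bounds, ...) to the real index
        return getattr(self._index, name)

    def __getstate__(self):
        # Leave out the libspatialindex-backed handle, which does not pickle
        # correctly; keep only the raw entries
        return {"_kwargs": self._kwargs, "_entries": self._entries}

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Rebuild the index from scratch in the worker process
        self._index = index.Index(**self._kwargs)
        for id, coords, obj in self._entries:
            self._index.insert(id, coords, obj)


idx = PicklableIndex()
idx.insert(0, (0, 1, 2, 3))
restored = pickle.loads(pickle.dumps(idx))
print(restored.bounds)  # [0.0, 1.0, 2.0, 3.0] instead of the inverted empty bounds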

@isaaccorley
Collaborator

This also happens on Windows and is due to the whole fork vs spawn issue.
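
Until the index pickles correctly, one possible stopgap on macOS only (fork is not available on Windows, and fork on macOS has its own caveats with some native libraries) is to pass multiprocessing_context="fork" to the DataLoader so workers inherit the dataset instead of unpickling it. A minimal sketch, with TensorDataset standing in for a real GeoDataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # TensorDataset is just a placeholder for a real GeoDataset here
    dataset = TensorDataset(torch.arange(10, dtype=torch.float32))

    dataloader = DataLoader(
        dataset,
        batch_size=2,
        num_workers=2,
        multiprocessing_context="fork",  # inherit the dataset instead of pickling it
    )

    for batch in dataloader:
        pass  # workers start without hitting the rtree pickling bug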

@adamjstewart changed the title from "Parallel data loading doesn't work on macOS with Python 3.8+" to "Parallel data loading doesn't work on macOS/Windows" on Dec 18, 2021