Parallel data loading doesn't work on macOS/Windows #184

Closed
adamjstewart opened this issue Oct 2, 2021 · 3 comments · Fixed by #304
Labels: datasets (Geospatial or benchmark datasets), samplers (Samplers for indexing datasets)
Milestone: 0.1.1

Comments

@adamjstewart
Collaborator

When trying to benchmark the data loader with num_workers > 0, it crashes with the following error message:

Traceback (most recent call last):
  File "benchmark.py", line 299, in <module>
    main(args)
  File "benchmark.py", line 181, in main
    for i, batch in enumerate(dataloader):
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/Users/Adam/torchgeo/torchgeo/datasets/geo.py", line 719, in __getitem__
    if not query.intersects(self.bounds):
  File "/Users/Adam/torchgeo/torchgeo/datasets/geo.py", line 760, in bounds
    minx = max([ds.bounds[0] for ds in self.datasets])
  File "/Users/Adam/torchgeo/torchgeo/datasets/geo.py", line 760, in <listcomp>
    minx = max([ds.bounds[0] for ds in self.datasets])
  File "/Users/Adam/torchgeo/torchgeo/datasets/geo.py", line 135, in bounds
    return BoundingBox(*self.index.bounds)
  File "/Users/Adam/torchgeo/torchgeo/datasets/utils.py", line 213, in __new__
    raise ValueError(f"Bounding box is invalid: 'minx={minx}' > 'maxx={maxx}'")
ValueError: Bounding box is invalid: 'minx=1.7976931348623157e+308' > 'maxx=-1.7976931348623157e+308'

Since this doesn't occur in serial or on Linux, I'm guessing this has something to do with the fact that Python's multiprocessing module switched from fork to spawn as the default start method on macOS for Python 3.8+.
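
For reference, a quick check (nothing TorchGeo-specific) that shows why this only reproduces off Linux. With spawn, the dataset object is pickled and sent to each worker process, so anything on it that doesn't survive pickling breaks:

import multiprocessing

# On Linux the default start method is "fork", so workers inherit the parent's
# memory and the dataset is never pickled. On macOS (Python 3.8+) and Windows
# the default is "spawn", which pickles the dataset and ships a copy to each worker.
print(multiprocessing.get_start_method())  # "fork" on Linux, "spawn" on macOS/Windows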

@adamjstewart added the datasets and samplers labels on Oct 2, 2021
@adamjstewart
Collaborator Author

Also, the fact that this wasn't caught by our unit tests means we need better integration tests. We do test our samplers in parallel, but not with a real GeoDataset.

@adamjstewart added this to the 0.1.1 milestone on Nov 20, 2021
@adamjstewart
Collaborator Author

This issue is caused by Toblerity/rtree#87; rtree indices do not pickle correctly:

$ python
Python 3.8.12 (default, Oct 25 2021, 13:41:41) 
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> from rtree import index
>>> idx = index.Index()
>>> idx.insert(0, (0, 1, 2, 3))
>>> print(idx)
rtree.index.Index(bounds=[0.0, 1.0, 2.0, 3.0], size=1)
>>> x = pickle.dumps(idx)
>>> y = pickle.loads(x)
>>> print(y)
rtree.index.Index(bounds=[1.7976931348623157e+308, 1.7976931348623157e+308, -1.7976931348623157e+308, -1.7976931348623157e+308], size=0)
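
One possible workaround (a minimal sketch, not necessarily what the eventual fix in #304 does) is to keep the inserted entries in a plain Python list and rebuild the index on unpickling via __getstate__/__setstate__:

import pickle

from rtree import index


class PicklableIndex:
    """Hypothetical wrapper that re-inserts its entries after unpickling."""

    def __init__(self, **kwargs):
        self._kwargs = kwargs
        self._entries = []  # (id, coords, obj) tuples kept purely for pickling
        self._index = index.Index(**kwargs)

    def insert(self, id, coords, obj=None):
        self._entries.append((id, coords, obj))
        self._index.insert(id, coords, obj)

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        # Delegate queries (intersection, bounds, ...) to the real index
        return getattr(self._index, name)

    def __getstate__(self):
        # Leave out the libspatialindex-backed handle, which does not pickle
        # correctly; keep only the raw entries
        return {"_kwargs": self._kwargs, "_entries": self._entries}

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Rebuild the index from scratch in the worker process
        self._index = index.Index(**self._kwargs)
        for id, coords, obj in self._entries:
            self._index.insert(id, coords, obj)


idx = PicklableIndex()
idx.insert(0, (0, 1, 2, 3))
restored = pickle.loads(pickle.dumps(idx))
print(restored.bounds)  # [0.0, 1.0, 2.0, 3.0] instead of the inverted empty bounds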

@isaaccorley
Collaborator

This also happens on Windows and is due to the whole fork vs spawn issue.
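
Until the index pickles correctly, one possible stopgap on macOS only (fork is not available on Windows, and fork on macOS has its own caveats with some native libraries) is to pass multiprocessing_context="fork" to the DataLoader so workers inherit the dataset instead of unpickling it. A minimal sketch, with TensorDataset standing in for a real GeoDataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # TensorDataset is just a placeholder for a real GeoDataset here
    dataset = TensorDataset(torch.arange(10, dtype=torch.float32))

    dataloader = DataLoader(
        dataset,
        batch_size=2,
        num_workers=2,
        multiprocessing_context="fork",  # inherit the dataset instead of pickling it
    )

    for batch in dataloader:
        pass  # workers start without hitting the rtree pickling bug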

@adamjstewart changed the title from "Parallel data loading doesn't work on macOS with Python 3.8+" to "Parallel data loading doesn't work on macOS/Windows" on Dec 18, 2021