
[train::ERROR] Runtime Error: Pin memory thread exited unexpectedly #29

Open

caiyingchun opened this issue Jun 28, 2023 · 1 comment

@caiyingchun
I tried to train a new model by running train.py, but I got the following error:

[2023-06-28 10:32:08,821::train::INFO] Namespace(config='./configs/train.yml', device='cuda', logdir='./logs')
[2023-06-28 10:32:08,821::train::INFO] {'model': {'vn': 'vn', 'hidden_channels': 256, 'hidden_channels_vec': 64, 'encoder': {'name': 'cftfm', 'hidden_channels': 256, 'hidden_channels_vec': 64, 'edge_channels': 64, 'key_channels': 128, 'num_heads': 4, 'num_interactions': 6, 'cutoff': 10.0, 'knn': 48}, 'field': {'name': 'classifier', 'num_filters': 128, 'num_filters_vec': 32, 'edge_channels': 64, 'num_heads': 4, 'cutoff': 10.0, 'knn': 32}, 'position': {'num_filters': 128, 'n_component': 3}}, 'train': {'seed': 2023, 'use_apex': False, 'batch_size': 8, 'num_workers': 8, 'pin_memory': True, 'max_iters': 500000, 'val_freq': 5000, 'pos_noise_std': 0.1, 'max_grad_norm': 100.0, 'optimizer': {'type': 'adam', 'lr': 0.0002, 'weight_decay': 0, 'beta1': 0.99, 'beta2': 0.999}, 'scheduler': {'type': 'plateau', 'factor': 0.6, 'patience': 8, 'min_lr': 1e-05}, 'transform': {'mask': {'type': 'mixed', 'min_ratio': 0.0, 'max_ratio': 1.1, 'min_num_masked': 1, 'min_num_unmasked': 0, 'p_random': 0.15, 'p_bfs': 0.6, 'p_invbfs': 0.25}, 'contrastive': {'num_real': 20, 'num_fake': 20, 'pos_real_std': 0.05, 'pos_fake_std': 2.0}, 'edgesampler': {'k': 8}}}, 'dataset': {'name': 'pl', 'path': './data/crossdocked_pocket10', 'split': './data/split_by_name.pt'}}
[2023-06-28 10:32:08,823::train::INFO] Loading dataset...
[2023-06-28 10:32:09,280::train::INFO] Building model...
Num of parameters is 3711167
/data/sdb/opt/miniconda3/envs/aidd/lib/python3.7/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3190.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
(the warning above is repeated 8 times)
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/data/sdb/opt/miniconda3/envs/aidd/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/data/sdb/opt/miniconda3/envs/aidd/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/data/sdb/opt/miniconda3/envs/aidd/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 49, in _pin_memory_loop
    do_one_step()
  File "/data/sdb/opt/miniconda3/envs/aidd/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 26, in do_one_step
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/data/sdb/opt/miniconda3/envs/aidd/lib/python3.7/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/data/sdb/opt/miniconda3/envs/aidd/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 305, in rebuild_storage_fd
    fd = df.detach()
  File "/data/sdb/opt/miniconda3/envs/aidd/lib/python3.7/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/data/sdb/opt/miniconda3/envs/aidd/lib/python3.7/multiprocessing/reduction.py", line 185, in recv_handle
    return recvfds(s, 1)[0]
  File "/data/sdb/opt/miniconda3/envs/aidd/lib/python3.7/multiprocessing/reduction.py", line 161, in recvfds
    len(ancdata))
RuntimeError: received 0 items of ancdata

[2023-06-28 10:32:12,068::train::INFO] [Train] Iter 1 | Loss 10.276641 | Loss(Fron) 0.631725 | Loss(Pos) 3.812413 | Loss(Cls) 1.901050 | Loss(Edge) 1.675482 | Loss(Real) 0.126777 | Loss(Fake) 2.129193 | Loss(Surf) 0.000000
[2023-06-28 10:32:12,073::train::ERROR] Runtime Error Pin memory thread exited unexpectedly
Traceback (most recent call last):
  File "train.py", line 227, in <module>
    train(it)
  File "train.py", line 108, in train
    batch = next(train_iterator).to(args.device)
StopIteration

@pengxingang (Owner)

Try adding torch.multiprocessing.set_sharing_strategy('file_system') at the top of train.py.
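
For reference, a minimal sketch of where this call could go in train.py (an assumption based on the traceback; the key point is that it runs before any DataLoader is constructed):

import torch.multiprocessing

# Share tensors through the file system instead of file descriptors.
# The default 'file_descriptor' strategy can exhaust the per-process
# open-file limit when worker processes pass batches to the pin-memory
# thread, which surfaces as "RuntimeError: received 0 items of ancdata".
torch.multiprocessing.set_sharing_strategy('file_system')

# ... rest of train.py: config/argument parsing, dataset loading,
# DataLoader construction (num_workers=8, pin_memory=True), training loop.

If the error persists, a common alternative is raising the shell's open-file limit before launching training (e.g. ulimit -n 65535), since the underlying cause is file-descriptor exhaustion during inter-process tensor sharing.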
