Branch
main branch (mmpretrain version)
Describe the bug
Hi,
I'm running some tests to train different architectures on a specific dataset. Training goes fine, but during the validation step, at its last iteration, the process is killed with exit code 137 (no error is raised).
I watched the RAM usage, and it looks like RAM is running out, which is why I get the 137 exit. I can't find why RAM usage keeps increasing over time. It only happens during the validation step; during training everything runs smoothly.
This happens with different architectures, not a specific one. If I disable the validation step, training works perfectly, but then I can't track my model's performance over time.
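For reference, this is roughly how I watched the RAM while validation was running (a minimal psutil sketch; the interval and formatting are arbitrary):

import time
import psutil

# Log overall system memory every 10 s while training/validation runs.
while True:
    mem = psutil.virtual_memory()
    print(f'used: {mem.percent:.1f}%  available: {mem.available / 1024**3:.1f} GiB')
    time.sleep(10)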
Things I tried to solve it (see the settings sketch after this list):
Changing the batch size.
Changing the number of workers.
Disabling pin_memory.
Switching to a different machine with much more RAM.
Nothing solved the issue.
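For context, a rough sketch of the kind of dataloader settings I was varying (the keys follow the MMEngine config style, but the values, dataset type, and paths below are placeholders, not my actual config):

# Illustrative val_dataloader snippet; values and paths are hypothetical.
val_dataloader = dict(
    batch_size=32,             # tried smaller values as well
    num_workers=4,             # tried fewer workers
    pin_memory=False,          # also tried disabling this
    persistent_workers=False,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type='CustomDataset',
        data_root='data/my_dataset/test',  # hypothetical path
        pipeline=[],                       # test pipeline omitted here
    ),
)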
One thing I should mention is that the dataset is huge - around 8 million crops, roughly 800 GB. I used symlinks to split it into train/test, so the train and test folders contain files that are actually symlinks to a different location.
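To illustrate, the split was built with something along these lines (the paths and split ratio here are hypothetical, not the exact script):

import os
import random

src_dir = '/data/all_crops'        # original location of the ~800 GB of crops
train_dir = '/data/split/train'
test_dir = '/data/split/test'
os.makedirs(train_dir, exist_ok=True)
os.makedirs(test_dir, exist_ok=True)

for name in os.listdir(src_dir):
    dst_root = test_dir if random.random() < 0.1 else train_dir
    # each entry in train/ and test/ is only a symlink back to src_dir
    os.symlink(os.path.join(src_dir, name), os.path.join(dst_root, name))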
Any ideas?
Thank you.
Environment
{'sys.platform': 'linux',
'Python': '3.8.19 | packaged by conda-forge | (default, Mar 20 2024, '
'12:47:35) [GCC 12.3.0]',
'CUDA available': True,
'MUSA available': False,
'numpy_random_seed': 2147483648,
'GPU 0,1,2,3': 'NVIDIA L4',
'CUDA_HOME': '/usr',
'NVCC': 'Cuda compilation tools, release 10.1, V10.1.24',
'GCC': 'gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0',
'PyTorch': '1.9.0+cu111',
'TorchVision': '0.10.0+cu111',
'OpenCV': '4.10.0',
'MMEngine': '0.10.4',
'MMCV': '2.1.0',
'MMPreTrain': '1.2.0+17a886c'}
Other information
Config -