Issues executing examples. CUDA_ERROR_ILLEGAL_ADDRESS and torch.bmm received an invalid combination of arguments #1
Hi, first of all I'm sorry about the delay; somehow I haven't received any notification from GitHub.
Hi, for sure, during this week I will try to reproduce the error. I will run several experiments with CUDA_LAUNCH_BLOCKING=1 and come back to you. EDIT: This is the log with CUDA_LAUNCH_BLOCKING=1:
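(For anyone reproducing this: CUDA_LAUNCH_BLOCKING has to be in the environment before the CUDA context is created. A minimal way to guarantee that from Python, assuming nothing has touched the GPU yet, is:)

```python
# Force synchronous kernel launches so that a crash is reported at the actual
# failing call instead of at some later, unrelated CUDA operation.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the variable is set, before any GPU work
```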
Thanks a lot! It's weird that it happens in the forward pass; there should still be enough memory available no matter what, especially since you've said you have the full 12GB available. Just to check:
Hi,
Thanks! I've upgraded to your latest version of cupy (btw, it seems there is now a cleaner way to define custom kernels with https://docs-cupy.chainer.org/en/latest/reference/generated/cupy.RawKernel.html), so my setup should be the same as yours. I'm sorry, but I can't reproduce it; I haven't hit the error during training. I have no idea, sorry :(. Perhaps using
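(For reference, the RawKernel interface mentioned above is used roughly as follows; this is a minimal, self-contained sketch with a toy kernel, not one of this repository's kernels:)

```python
import cupy as cp

# Compile a trivial element-wise CUDA kernel from source at first launch.
add_one = cp.RawKernel(r'''
extern "C" __global__
void add_one(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] + 1.0f;
}
''', 'add_one')

x = cp.arange(16, dtype=cp.float32)
y = cp.empty_like(x)
add_one((1,), (16,), (x, y, cp.int32(x.size)))  # (grid, block, kernel args)
print(y)  # [1. 2. ... 16.]
```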
Thanks! The error is completely random; it does not always appear. Are you also using CUDA 8 and cuDNN 6? Can you tell me which NVIDIA driver you are using? And my last question is just out of curiosity: why did you choose cupy instead of PyTorch's own mechanism for custom CUDA extensions? Is there a technical reason? I suspect that the error is related to the GPU memory management done by PyTorch. I mean, as you know, PyTorch uses a caching memory allocator to speed up memory allocations. This allows fast memory deallocation without device synchronizations. However, the unused memory managed by the allocator will still show as used in other applications. Maybe that conflicts with the code executed by cupy. What do you think? But I do not understand why it only happens in my setup and not in yours...
My driver is 384.130, CUDA 8.0.61, cuDNN 6021. But I'm unable to shuffle around with drivers and CUDA, as I'm using a shared computer. Thanks for the tip; the support for extensions in PyTorch seems to have improved a lot, there is even a JIT solution in particular! The reason I went the cupy way about 1.5 years ago was that the PyTorch way was more rudimentary and supported explicit compilation only. I think my current code could surely be rewritten to use just PyTorch/JIT; I might give it a quick shot in the next few days... Regarding the interactions between PyTorch and cupy, who knows... but I would assume they both use some standard CUDA allocation calls in the end, so it should not happen that memory gets assigned twice. But removing the dependency on cupy would sort it out anyway.
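(For context, the JIT solution referred to here is presumably torch.utils.cpp_extension.load, which compiles a C++/CUDA extension at runtime. A hedged sketch with hypothetical file names:)

```python
from torch.utils.cpp_extension import load

# Build and import a CUDA extension at runtime; the source files below are
# placeholders, not files from this repository.
ecc_ext = load(
    name="ecc_aggregate",
    sources=["ecc_aggregate.cpp", "ecc_aggregate_kernel.cu"],
    verbose=True,
)

# Whatever functions ecc_aggregate.cpp binds via pybind11 are then available
# as ecc_ext.<function_name>(...).
```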
I tested with different GPUs in order to rule out a hardware error, and I had the same issue. Regarding the adaptation to the PyTorch extension, if I can help in some way, tell me. Btw, another interesting thing about this adaptation is that it allows multi-GPU training, in order to train with a bigger batch_size :) Thanks for your time!
Today I tried to adapt your code to use a PyTorch extension. Here you can find the modified files used in my first try: https://github.com/dhorka/ecc_cuda_extension. I am not using JIT; you need to compile the kernel using the provided setup.py. At the moment the code fails at run time with a segmentation fault, and I was not able to figure out what is going on, but maybe the skeleton can help you. I will check it again later. Thanks.
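(The ahead-of-time route via setup.py generally follows this pattern; a generic sketch with placeholder module and file names, not the actual setup.py of the linked repository:)

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

# Compile the CUDA kernels ahead of time with `python setup.py install`.
setup(
    name="ecc_cuda",
    ext_modules=[
        CUDAExtension(
            name="ecc_cuda",
            sources=["ecc_cuda.cpp", "ecc_cuda_kernel.cu"],
        ),
    ],
    cmdclass={"build_ext": BuildExtension},
)
```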
Thanks for your hard work on this, that was definitely a great starting point! I've fixed your code (the main problems were that not all types were supposed to be floats and that the grid/block parameters were not right), written it as JIT, added backward aggregation and ported the code to PyTorch 0.4. It's in the branch https://github.com/mys007/ecc/tree/pytorch4_cuda_extensions . Could you perhaps try to run it on your machine with PyTorch 0.4.1 and see if it works now? If
Hi mys, thanks also for your work on this issue!! I launched two training processes right now in order to be sure that the issue has disappeared :) On Monday I will tell you the results. Regarding the GPU consumption: if you are looking at the consumption in nvidia-smi, we cannot be sure that it is the real consumption, because PyTorch uses a caching memory allocator to speed up memory allocations, and the unused memory managed by the allocator will still show as used in nvidia-smi. In order to check the real memory used, we need to use some of the methods provided by PyTorch, like
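(The PyTorch helpers being alluded to are presumably along these lines; note that memory_cached was the name in the 0.4/early 1.x era and has since been renamed memory_reserved:)

```python
import torch

# Memory occupied by live tensors vs. memory held by PyTorch's caching
# allocator; nvidia-smi reports the latter (plus CUDA context overhead).
allocated = torch.cuda.memory_allocated()
cached = torch.cuda.memory_cached()  # torch.cuda.memory_reserved() in newer releases
print("allocated: %.1f MiB, cached: %.1f MiB" % (allocated / 2**20, cached / 2**20))

# Return cached-but-unused blocks to the driver so that other processes (and
# nvidia-smi) see them as free again.
torch.cuda.empty_cache()
```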
Thanks. But it may still crash in the other kernels (pooling); perhaps I should have ported all of them while I was at it... Can you please run the processes with
Indeed, but I thought this had been a feature of PyTorch from the beginning; maybe they have changed the laziness of deallocations...
Sure, I re-executed the experiment with CUDA_LAUNCH_BLOCKING=1. In parallel, I will try to port the other kernels, using your ported kernels as an example.
I am not sure what is happening with the GPU memory, because as far as I saw, when I launch the experiments (with cupy), at the beginning of the training the GPU memory consumption is more or less 8GB, but in later epochs I can see that sometimes the memory consumption is 4GB and other times 12GB...
All kernels ported; here you can find them: https://github.com/dhorka/ecc_cuda_extension. I was not able to test whether all the kernels work properly at runtime (at this moment I do not have any GPU available), but at least the compilation is working.
Wow, what a great effort! Let's wait for the results of your jobs, and if they are good, I can merge & clean up everything.
Hi Mys,
It seems like the error is cupy related, right? At this moment I have two more experiments running with all the kernels ported. I will let you know about the results when the experiments finish.
Hi, it seems it is not related to cupy... Below you can see the error output of one of the experiments with all the kernels ported to PyTorch 0.4.1:
Damn, that's really frustrating. I guess it must be some bug in the kernels manifesting itself only under some rare condition of the input data. Could you perhaps run the training with
Sure! I launched one experiment with 0 workers. Tomorrow I will come back with the results.
Well, I got some results... to be honest, this is starting to get weird... I got the error in epoch 9, as you can see in the following trace:
This is the result of this command:
After that, I tried to resume the experiment in order to check whether I could reproduce the error, but... after resuming, the training continued without problems... This is the command I used to resume the experiment:
Now I am thinking of running an experiment forcing the seed of the data loader and also setting cuDNN to deterministic mode. EDIT:
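(As an aside, forcing the seeds and making cuDNN deterministic is typically done roughly like this; a generic sketch, not the exact options of this codebase:)

```python
import random
import numpy as np
import torch

# Fix every relevant random number generator so that data loading order and
# weight initialization are reproducible across runs.
seed = 1
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

# Make cuDNN pick deterministic algorithms (at some cost in speed).
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```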
Thanks for your report. Is the crash reproducible on your side, meaning that if you rerun the training from scratch (the first command line above), will it break during epoch 7 again? In my case, I received no crash :(. Resuming will not produce the same results as training straight through without resuming, because the states of the random generators are not saved/restored (too complicated). But data loading should basically start again and crash in epoch 14 then; weird that it didn't.
If nworkers=0
I think it's because
Hi Mys,
I launched two experiments and it always crashes in iteration 164 of epoch 7. As far as I saw, I think it is reproducible on my side.
Yep, it is weird... I do not understand what is different after the resume...
I understand. Thanks for the explanation!
As far as I know (I also tested it), you do not need to restart them, because DataLoaders are able to manage the different epochs. To sum up, I think I can reproduce the error. Maybe I can debug it on my side, following your instructions.
Great news! Could you please pickle the inputs at line 136 in commit 8fbc901?
Hi Mys, done! Here you can find the file. The code I used to pickle (inputs, targets, GIs, PIs) is this one: I ran the experiment with cuDNN in deterministic mode (I forgot to disable it), but it doesn't matter, the error is the same without deterministic mode. EDIT:
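(The pickling snippet itself was not captured in this thread; a hedged sketch of what dumping and reloading the offending batch could look like, using the variable names above and an arbitrary file name, is:)

```python
import pickle

# Inside the training loop, right before the crashing call: dump the exact
# batch so it can be replayed on another machine.
with open("crashing_batch.pkl", "wb") as f:
    pickle.dump((inputs, targets, GIs, PIs), f, protocol=pickle.HIGHEST_PROTOCOL)

# On the other machine, reload the same batch and feed it to the model repeatedly:
with open("crashing_batch.pkl", "rb") as f:
    inputs, targets, GIs, PIs = pickle.load(f)
```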
Hi, thanks a lot... but when I load the batch on my computer (from either of your files) so that every training iteration runs on it, I get no crash (w/ PyTorch 0.4.1). I'm sorry, but I just think that resolving this issue is beyond my powers :(.
Hi, I was wondering: if you're in a very experimental mood, could you try to run https://github.com/mys007/ecc/tree/crazy_fix with PyTorch 0.3? There is just one extra line, which touches
Hi Mys, sorry for not answering your last comment; I have a cold and was not able to check my e-mail. Yes, of course I will try it. I would also like to check whether I can reproduce the error on my own setup with the files I sent you, because I sent them without trying to reproduce the error myself; I was thinking that maybe it is something related to the state of the RNG... But anyway, this weekend I will test your fix and also try to reproduce the error again. Thanks for your dedication!
Hi Mys,
Damn, but thanks a lot. Well, actually, there has been one other user who has contacted me by email with the same issue in the meantime (though on Sydney; CUDA 9.1, TITAN X Pascal). It's so difficult to debug. Perhaps I could rewrite the whole aggregation with sparse matrix operations, but I need to have a look at what the current support in PyTorch is.
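(A rough sketch of the sparse-matrix idea, assuming the aggregation boils down to a weighted sum over neighbor features; this is a toy example, not code from the repository:)

```python
import torch

# Toy graph: out = A @ X, where the sparse matrix A holds one weight per edge.
num_nodes, feat_dim = 4, 8
edge_index = torch.tensor([[1, 0, 3, 2],   # receiving node of each edge (row of A)
                           [0, 1, 2, 3]])  # neighbor it aggregates from (column of A)
edge_weight = torch.ones(edge_index.shape[1])

A = torch.sparse_coo_tensor(edge_index, edge_weight, (num_nodes, num_nodes))
X = torch.randn(num_nodes, feat_dim)

out = torch.sparse.mm(A, X)  # dense (num_nodes, feat_dim) result, differentiable w.r.t. X
```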
If I can do anything else, do not hesitate to ask :) By the way, on my side I ran several experiments using the Sydney dataset without errors...
Hi all, I have the same issue, quite randomly. I'm now using SPG to benchmark on ScanNet; I think they just adopt your code from the master branch. I'm now trying your pytorch4_cuda_extension branch. I use CUDA 10.2, driver version 430.26, a 1080 Ti, cupy-cuda100 6.3.0 and PyTorch 1.2.0. I will come back to you later. Update: the error occurred again. Orz
@HenrryBryant Thanks for your report and for the effort of trying out the experimental branch. I'm sorry that the problem has not been solved. Though isn't the new error message "CUDA out of memory" rather than "CUDA_ERROR_ILLEGAL_ADDRESS"?
Hi Martin, thank you for your quick reply. Actually these two errors just randomly take turns occurring. I will try with CUDA_LAUNCH_BLOCKING, otherwise I'll just have to turn to Pytorch_Geometric. Btw, I really like several works from you and Loic, they are really beautiful.
Well, what I meant is that "CUDA out of memory" might not be a bug but rather the process indeed running out of memory. Is the GPU completely free before running the code? Otherwise, you can try decreasing the batch size just for the sake of it. And yeah, I wish I could really do a rewrite in Pytorch_Geometric one day!
Hi, if you have the same problem as dhorka mentioned and you are using DataLoader, please consider replacing your DataLoader object with a for loop (though this makes it difficult to run on multiple GPUs); this has worked magically for me so far (see the sketch below). Btw, you may also replace tensor with Tensor in lines 225 & 227 of GraphConvModule.py; I'm not sure whether this also contributes to finally fixing this random bug, but my supervisor told me it could also cause weird memory leak problems. I use CUDA 10.2, driver version 430.26 and PyTorch 1.2.0 installed by pip.
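(What replacing the DataLoader with a for loop presumably amounts to is collating batches manually in the main process, so no worker subprocesses are involved; a hedged sketch, where dataset and collate_fn stand for whatever was previously passed to DataLoader:)

```python
def iterate_batches(dataset, collate_fn, batch_size=32):
    """Yield batches without torch.utils.data.DataLoader (no worker processes)."""
    for start in range(0, len(dataset), batch_size):
        end = min(start + batch_size, len(dataset))
        samples = [dataset[i] for i in range(start, end)]
        yield collate_fn(samples)

# Usage, with the same dataset and collate function DataLoader was given before:
# for inputs, targets in iterate_batches(dataset, collate_fn):
#     ...training step...
```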
@HenrryBryant Thanks for the investigations. That's an interesting note about DataLoader, but I believe it is just a random workaround that changes and spreads out the timing of the kernel runs (since there is no parallel loading, which also means the training must be very slow). I guess one could achieve the same effect by setting the number of workers to 0 or 1? Nevertheless, I think the out-of-memory issue and the illegal-address crash are two different things.
Hi,
I have some issues executing your code. First, I tried to execute your example with ModelNet10 using the command provided. It seemed to work, but in an advanced epoch the code crashed with this error:
I executed the code several times and the error appears randomly: it is not always in the same epoch, nor does it always appear in the same part of the code. Here you can see another example of the error:
I tried different versions of PyTorch: 0.2, 0.3 and 0.4. All three versions were installed using pip, and I also tried to execute the code with a version compiled from source (0.2); the same error appears. I am using a machine with 60GB of RAM, an Intel Xeon and a Titan X with 12GB of memory. Moreover, I tried different versions of open3d: 0.2.0 and 0.3.0. Finally, I modified your sample command and added edge_mem_limit in order to limit the memory used on the GPU, without success.
I also tested the code using the Sydney Urban Objects example, but in this case the error appears at the beginning of the execution:
Can you please give me some hints to solve these issues?
Thanks,