
Cuda acceleration #112

Open
andre-nguyen opened this issue Mar 10, 2016 · 13 comments · May be fixed by #257

Comments

@andre-nguyen

Hi @ahornung

I saw issue #29 and wasn't interested in the GPU-voxel approach. It is clear that many ROS applications use OctoMap as a standard, and we would benefit from parallelizing OctoMap itself. The advent of embedded GPUs such as the NVIDIA TK1 and TX1 makes this much more interesting for mobile robotics.

I would like to develop this slowly and incrementally by speeding up small parts of the code.

How feasible do you think this is, and do you have any pointers on where to start?

@ahornung
Member

Great to hear that you're interested in improving the performance! That definitely sounds feasible, and taking care of individual parts incrementally is probably the best way forward.

The critical functions would be computeUpdate(...) in OccupancyOcTreeBase and computeRayKeys(...) in OcTreeBaseImpl. You'll find that there are already conditional OpenMP parallelizations in place; these could give you some hints for a start.
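
Very roughly, the pattern of those sections looks something like the sketch below. This is only an illustrative, self-contained example, not the actual OctoMap code; traceRay() here is just a placeholder for the real per-ray work (computeRayKeys() plus collecting the keys):

    // Illustrative sketch only -- not OctoMap code. The pragma splits the
    // per-ray work across threads; without OpenMP it is ignored and the loop
    // simply runs serially.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    struct Point3 { float x, y, z; };

    // Placeholder for the real per-ray work (ray traversal / key computation).
    static int traceRay(const Point3& origin, const Point3& end) {
        return static_cast<int>(end.x - origin.x) + static_cast<int>(end.y - origin.y);
    }

    int main() {
        const Point3 origin{0.0f, 0.0f, 0.0f};
        std::vector<Point3> scan(100000);
        for (std::size_t i = 0; i < scan.size(); ++i)
            scan[i] = Point3{float(i % 50), float(i % 30), float(i % 10)};

        long totalCells = 0;
        // Each thread handles a chunk of rays; the reduction avoids a shared counter.
        #pragma omp parallel for reduction(+:totalCells) schedule(dynamic, 256)
        for (long i = 0; i < static_cast<long>(scan.size()); ++i)
            totalCells += traceRay(origin, scan[i]);

        std::printf("traversed %ld cells in total\n", totalCells);
        return 0;
    }

The harder part in practice is the shared tree update afterwards, which would still need synchronization; the per-ray traversal is the natural place to parallelize first.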

@andre-nguyen
Author

Thanks, time for me to learn CUDA then :D

@ahornung
Member

Just in case you're generally looking for speedups and are not yet committed to CUDA: it's probably worth having a look at SIMD intrinsics (SSE) as well. These changes could be less intrusive than switching certain parts to CUDA.
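
Just to give a flavour of what that looks like, here is a tiny self-contained example of SSE intrinsics; it only illustrates the technique and has nothing to do with OctoMap internals:

    // Minimal illustration of SSE intrinsics: adding the same offset to four
    // floats with a single instruction.
    #include <immintrin.h>
    #include <cstdio>

    int main() {
        alignas(16) float xs[4]  = {1.0f, 2.0f, 3.0f, 4.0f};
        alignas(16) float out[4];

        __m128 x = _mm_load_ps(xs);           // load 4 floats (16-byte aligned)
        __m128 t = _mm_set1_ps(0.5f);         // broadcast the offset
        _mm_store_ps(out, _mm_add_ps(x, t));  // 4 additions at once

        std::printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
        return 0;
    }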

@andre-nguyen
Author

Thanks for the tip, and sorry for the late response. Unfortunately I only recently received my hardware, but SSE would certainly be interesting; that way I could work from home without needing the TK1.

Please don't count on this too much, though; if it is ever ready, it will be toward the end of the summer.

@gsp-27

gsp-27 commented Aug 19, 2016

Hi, can you point me to some resources that would help me understand octrees more intuitively? I understand segment trees and am also familiar with lazy updates in 1D segment trees. Octrees are the 3-dimensional analogue of segment trees, but it is difficult for me to imagine lazy updates in them. I would like to contribute to this, and I am asking because I also plan on parallelising it, if that is even possible. Your help would be greatly appreciated.

@ahornung
Member

The best documentation will be Wikipedia, the OctoMap AuRo journal paper, and the code, in order of increasing depth on the topic.
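
For a rough intuition in the meantime: an octree node is simply a node with eight children, one per octant of its cube, much like a segment-tree node has two children, one per half of its interval, and children only exist where the space has actually been refined. A minimal conceptual sketch (this is not OctoMap's actual node class):

    // Conceptual sketch only -- not OctoMap's OcTreeNode. Each node covers a
    // cube and splits it into 8 child octants; children are created lazily.
    #include <array>
    #include <memory>

    struct SketchOctreeNode {
        float occupancyLogOdds = 0.0f;                            // per-node payload
        std::array<std::unique_ptr<SketchOctreeNode>, 8> child;   // one per octant

        bool isLeaf() const {
            for (const auto& c : child)
                if (c) return false;
            return true;
        }
    };

    int main() {
        SketchOctreeNode root;                                    // covers the whole mapped cube
        root.child[3] = std::make_unique<SketchOctreeNode>();     // refine only octant 3
        root.child[3]->occupancyLogOdds = 0.85f;
        return root.isLeaf() ? 1 : 0;
    }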

@dblanm

dblanm commented Jun 19, 2017

Hi @andre-nguyen ,

How is the CUDA implementation for OctoMap going? I am also planning to add CUDA support to OctoMap; maybe I could try to help you.

@gsp-27

gsp-27 commented Jun 19, 2017 via email

@andre-nguyen
Author

@dblanm @gsp-27 As with many projects, other tasks got out of hand and I didn't have time to get to this 😭 😭 😭

@sbaktha

sbaktha commented Jul 22, 2019

Hi, is there any update on the status of the CUDA implementation?

@saifullah3396

saifullah3396 commented Oct 15, 2019

Hi @ahornung, I have developed a CUDA-based replacement for computeUpdate() and computeRayKeys(). Could you please look at my fork https://github.com/saifullah3396/octomap and tell me whether it's good enough for a pull request? For now it does not conflict with the basic implementation. I'd really like further development on this to be done in this repository. The implementation can be tested by building the cuda-devel branch (add the CMake parameter -D__CUDA_SUPPORT__=ON) and running graph2tree as follows:
../bin/graph2tree -i ../octomap/share/data/spherical_scan.graph -o out.bt
I am still facing a few issues regarding speed. Right now a lot of data has to be copied to the GPU before each scan update. Maybe it would be better to copy the tree to the GPU once and then keep using it there, or to create the tree directly on the GPU? In any case, copying the tree to the GPU takes a lot of time.
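
To illustrate what I mean by copying once and keeping it resident: the general pattern I have in mind is allocating the device buffers once and only uploading the new scan each update, roughly like the sketch below (a generic example with made-up names, not the code in my fork):

    // Generic sketch of keeping device buffers resident across scans so that
    // each update only uploads the new points.
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <vector>

    __global__ void processPoints(const float3* points, int n, unsigned int* processed) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        // Placeholder for the real per-ray work (ray traversal / key computation).
        if (points[i].z > -1.0e6f)
            atomicAdd(processed, 1u);
    }

    int main() {
        const int maxPoints = 1 << 20;

        // Allocated once and reused for every scan, instead of once per update.
        float3* dPoints = nullptr;
        unsigned int* dProcessed = nullptr;
        cudaMalloc(&dPoints, maxPoints * sizeof(float3));
        cudaMalloc(&dProcessed, sizeof(unsigned int));

        std::vector<float3> scan(100000, float3{1.0f, 2.0f, 3.0f});
        for (int frame = 0; frame < 3; ++frame) {    // pretend scans arrive over time
            cudaMemset(dProcessed, 0, sizeof(unsigned int));
            cudaMemcpy(dPoints, scan.data(), scan.size() * sizeof(float3),
                       cudaMemcpyHostToDevice);      // only the new scan is copied
            int threads = 256;
            int blocks = (static_cast<int>(scan.size()) + threads - 1) / threads;
            processPoints<<<blocks, threads>>>(dPoints, static_cast<int>(scan.size()), dProcessed);
            unsigned int processed = 0;
            cudaMemcpy(&processed, dProcessed, sizeof(processed), cudaMemcpyDeviceToHost);
            std::printf("frame %d processed %u points\n", frame, processed);
        }

        cudaFree(dPoints);
        cudaFree(dProcessed);
        return 0;
    }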

@ahornung
Member

Thanks for your contribution @saifullah3396, that sounds really useful!

Do you have a first indication about processing times, ideally on the same benchmark data as used in the paper?

Unfortunately, I won't have time for an in-depth review, so best would be a cleaned up pull request that can be iteratively discussed and improved by the community.

@saifullah3396

@ahornung Well, in basic usage the current implementation is definitely faster, but before I produce results on the benchmark data I will work on the implementation a bit more to make it even faster. It might take me some time to add a CUDA-based hash map, but it will definitely increase performance. I will share the benchmark results once I'm finished and then send a PR! :)
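
For the hash map, the usual GPU-side building block is open addressing with atomicCAS on the keys; a minimal sketch of that general idea follows (just the insert path, not a design for the final implementation):

    // Minimal sketch of a GPU hash-set insert: open addressing with linear
    // probing and atomicCAS on 64-bit keys. A real voxel-key map would also
    // need values, deletion and resizing on top of this.
    #include <cuda_runtime.h>
    #include <cstdio>

    constexpr unsigned long long kEmpty = 0xFFFFFFFFFFFFFFFFull;

    __device__ unsigned int hashKey(unsigned long long k, unsigned int capacity) {
        k ^= k >> 33; k *= 0xff51afd7ed558ccdull; k ^= k >> 33;  // mix the bits
        return static_cast<unsigned int>(k) & (capacity - 1);    // capacity is a power of two
    }

    __global__ void insertKeys(unsigned long long* table, unsigned int capacity,
                               const unsigned long long* keys, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        unsigned long long key = keys[i];
        unsigned int slot = hashKey(key, capacity);
        while (true) {
            unsigned long long prev = atomicCAS(&table[slot], kEmpty, key);
            if (prev == kEmpty || prev == key) return;  // inserted, or already present
            slot = (slot + 1) & (capacity - 1);         // linear probing
        }
    }

    int main() {
        const unsigned int capacity = 1 << 20;          // power of two for the masking above
        unsigned long long* dTable = nullptr;
        cudaMalloc(&dTable, capacity * sizeof(unsigned long long));
        cudaMemset(dTable, 0xFF, capacity * sizeof(unsigned long long));  // all slots empty

        unsigned long long hKeys[4] = {42ull, 7ull, 42ull, 123456789ull};
        unsigned long long* dKeys = nullptr;
        cudaMalloc(&dKeys, sizeof(hKeys));
        cudaMemcpy(dKeys, hKeys, sizeof(hKeys), cudaMemcpyHostToDevice);

        insertKeys<<<1, 4>>>(dTable, capacity, dKeys, 4);
        cudaDeviceSynchronize();
        std::printf("inserted sample keys\n");

        cudaFree(dKeys);
        cudaFree(dTable);
        return 0;
    }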

@saifullah3396 linked a pull request on Oct 25, 2019 that will close this issue