
_hash_encoder Error while trying to train #96

Open
HamzaOuajhain opened this issue Apr 8, 2024 · 3 comments

@HamzaOuajhain

First of all, thank you for your work. I am trying to run nicer-slam, which uses the monosdf implementation, and I am having difficulty doing so.

I get this error:

[screenshot: error message]

And this is the call stack:

[screenshot: call stack]

Here is my gcc, CUDA, and cuDNN configuration:

[screenshot: gcc, CUDA, and cuDNN versions]

I have tried the suggestion from issue #19, but sadly with no success.

Would you be able to help?

@niujinshuchong
Member

Hi, it seems like the 1070 is not on the list here: https://github.com/cvg/nicer-slam/blob/main/code/hashencoder/backend.py#L10-L26. Maybe you need to change it.
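For reference, a minimal sketch of the kind of change meant here, assuming backend.py follows the torch-ngp-style JIT build this encoder is based on (the exact flag list and source file names in the repo may differ). The GTX 1070 is a Pascal card with compute capability 6.1, so the missing entry would be the sm_61 `-gencode` line:

```python
# Sketch only: assumes backend.py builds the extension with
# torch.utils.cpp_extension.load and a hard-coded list of -gencode flags.
import os
from torch.utils.cpp_extension import load

_src_path = os.path.dirname(os.path.abspath(__file__))

_backend = load(
    name='_hash_encoder',
    extra_cflags=['-O3', '-std=c++14'],
    extra_cuda_cflags=[
        '-O3', '-std=c++14',
        '-gencode=arch=compute_75,code=sm_75',  # example of an entry already on the list
        '-gencode=arch=compute_61,code=sm_61',  # added: GTX 1070 (Pascal, compute capability 6.1)
    ],
    # source file names assume the torch-ngp-style layout this encoder derives from
    sources=[os.path.join(_src_path, 'src', f) for f in ['hashencoder.cu', 'bindings.cpp']],
)
```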

@HamzaOuajhain
Author

Thank you for your reply. I managed to fix that problem, but when I try to train the network I get an out-of-memory error. The readme file says to lower the batch size, since I only have an 8 GB GPU. I found

batch_size=1

in eval_rendering, but I doubt that is what it means.

I also tried changing 'batch_size = ground_truth["rgb"].shape[0]' on line 285 of volsdf_train.py, but without success. This is the full error:

python training/exp_runner.py --conf confs/runconf_demo_1.conf
shell command : training/exp_runner.py --conf confs/runconf_demo_1.conf
Loading data ...
Finish loading data.
build_directory ./tmp_build_1070/
Detected CUDA files, patching ldflags
Emitting ninja build file ./tmp_build_1070/build.ninja...
Building extension module _hash_encoder_1070...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module _hash_encoder_1070...
running...
  0%| | 0/200 [00:00<?, ?it/s]/home/aspegique/anaconda3/envs/nicer-slam/lib/python3.8/site-packages/torch/autograd/__init__.py:173: UserWarning: Error detected in _hash_encodeBackward. Traceback of forward call that caused the error:
  File "training/exp_runner.py", line 54, in <module>
    trainrunner.run()
  File "/home/aspegique/Desktop/repos/nicer-slam/code/../code/training/volsdf_train.py", line 558, in run
    model_outputs = self.model(
  File "/home/aspegique/anaconda3/envs/nicer-slam/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/aspegique/Desktop/repos/nicer-slam/code/../code/model/network.py", line 129, in forward
    rgb_flat = self.rendering_network(
  File "/home/aspegique/anaconda3/envs/nicer-slam/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/aspegique/Desktop/repos/nicer-slam/code/../code/model/base_networks.py", line 336, in forward
    grid_feature = self.encoding(points / self.divide_factor)
  File "/home/aspegique/anaconda3/envs/nicer-slam/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/aspegique/Desktop/repos/nicer-slam/code/../code/hashencoder/hashgrid.py", line 210, in forward
    outputs = hash_encode(inputs, self.embeddings, self.offsets, self.per_level_scale, self.base_resolution, inputs.requires_grad)
 (Triggered internally at /opt/conda/conda-bld/pytorch_1646755903507/work/torch/csrc/autograd/python_anomaly_mode.cpp:104.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  0%| | 0/200 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "training/exp_runner.py", line 54, in <module>
    trainrunner.run()
  File "/home/aspegique/Desktop/repos/nicer-slam/code/../code/training/volsdf_train.py", line 577, in run
    loss.backward()
  File "/home/aspegique/anaconda3/envs/nicer-slam/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/aspegique/anaconda3/envs/nicer-slam/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/aspegique/anaconda3/envs/nicer-slam/lib/python3.8/site-packages/torch/autograd/function.py", line 253, in apply
    return user_fn(self, *args)
  File "/home/aspegique/anaconda3/envs/nicer-slam/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 135, in decorate_bwd
    return bwd(*args, **kwargs)
  File "/home/aspegique/Desktop/repos/nicer-slam/code/../code/hashencoder/hashgrid.py", line 64, in backward
    grad_inputs, grad_embeddings = _hash_encode_second_backward.apply(grad, inputs, embeddings, offsets, B, D, C, L, S, H, calc_grad_inputs, dy_dx)
  File "/home/aspegique/Desktop/repos/nicer-slam/code/../code/hashencoder/hashgrid.py", line 85, in forward
    grad_embeddings = torch.zeros_like(embeddings)
RuntimeError: CUDA out of memory. Tried to allocate 1016.00 MiB (GPU 0; 7.91 GiB total capacity; 4.51 GiB already allocated; 1000.12 MiB free; 5.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
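For what it's worth, the last line of the RuntimeError itself suggests one thing to try before editing any code: the PYTORCH_CUDA_ALLOC_CONF allocator setting, which helps when reserved memory far exceeds allocated memory (fragmentation). A minimal sketch, assuming the variable is set before CUDA is first touched; the 128 value is only an example, and the actual ray-batch knob in this codebase is more likely a per-iteration pixel count in the .conf file than the eval_rendering batch size (that setting name is an assumption):

```python
# Sketch only: PYTORCH_CUDA_ALLOC_CONF must be set before CUDA is initialized,
# so put it at the very top of training/exp_runner.py (or export it in the shell).
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # example value

import torch  # import torch only after the allocator setting is in place
```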

@doddodod

doddodod commented Jul 7, 2024

Similar problem:

/root/miniconda3/envs/monosdf/lib/python3.8/site-packages/torch/include/ATen/ATen.h:4:2: error: #error C++17 or later compatible compiler is required to use ATen.
    4 | #error C++17 or later compatible compiler is required to use ATen.
      |  ^~~~~
ninja: build stopped: subcommand failed.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/miniconda3/envs/monosdf/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 2107, in _run_ninja_build
[rank0]:     subprocess.run(
[rank0]:   File "/root/miniconda3/envs/monosdf/lib/python3.8/subprocess.py", line 516, in run
[rank0]:     raise CalledProcessError(retcode, process.args,
[rank0]: subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '1']' returned non-zero exit status 1.

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]:   File "training/exp_runner.py", line 58, in <module>
[rank0]:     trainrunner = MonoSDFTrainRunner(conf=opt.conf,
[rank0]:   File "/root/autodl-tmp/monosdf/code/../code/training/monosdf_train.py", line 107, in __init__
[rank0]:     self.model = utils.get_class(self.conf.get_string('train.model_class'))(conf=conf_model)
[rank0]:   File "/root/autodl-tmp/monosdf/code/../code/utils/general.py", line 18, in get_class
[rank0]:     m = __import__(module)
[rank0]:   File "/root/autodl-tmp/monosdf/code/../code/model/network.py", line 140, in <module>
[rank0]:     from hashencoder.hashgrid import _hash_encode, HashEncoder
[rank0]:   File "/root/autodl-tmp/monosdf/code/../code/hashencoder/__init__.py", line 1, in <module>
[rank0]:     from .hashgrid import HashEncoder
[rank0]:   File "/root/autodl-tmp/monosdf/code/../code/hashencoder/hashgrid.py", line 12, in <module>
[rank0]:     from .backend import _backend
[rank0]:   File "/root/autodl-tmp/monosdf/code/../code/hashencoder/backend.py", line 10, in <module>
[rank0]:     _backend = load(name='_hash_encoder',
[rank0]:   File "/root/miniconda3/envs/monosdf/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1309, in load
[rank0]:     return _jit_compile(
[rank0]:   File "/root/miniconda3/envs/monosdf/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1719, in _jit_compile
[rank0]:     _write_ninja_file_and_build_library(
[rank0]:   File "/root/miniconda3/envs/monosdf/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1832, in _write_ninja_file_and_build_library
[rank0]:     _run_ninja_build(
[rank0]:   File "/root/miniconda3/envs/monosdf/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 2123, in _run_ninja_build
[rank0]:     raise RuntimeError(message) from e
[rank0]: RuntimeError: Error building extension '_hash_encoder'
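For reference, this is a different failure from the one above: the ATen.h message means the PyTorch in this environment requires C++17, while the JIT build is invoked with an older language standard. A minimal sketch of the likely fix, assuming backend.py hard-codes `-std` flags as in the torch-ngp-style layout (the exact flags and source file names in the repo may differ):

```python
# Sketch only: bump any hard-coded C++ standard flags in
# code/hashencoder/backend.py from c++14 to c++17 for both gcc and nvcc.
import os
from torch.utils.cpp_extension import load

_src_path = os.path.dirname(os.path.abspath(__file__))

_backend = load(
    name='_hash_encoder',
    extra_cflags=['-O3', '-std=c++17'],       # was likely '-std=c++14'
    extra_cuda_cflags=['-O3', '-std=c++17'],  # same bump for the CUDA compiler
    # source file names assume the torch-ngp-style layout this encoder derives from
    sources=[os.path.join(_src_path, 'src', f) for f in ['hashencoder.cu', 'bindings.cpp']],
)
```

If the flags are already at c++17, the installed gcc may itself be too old to support C++17 and would need upgrading instead.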
