
_hash_encoder Error while trying to train #96

Open
HamzaOuajhain opened this issue Apr 8, 2024 · 3 comments

@HamzaOuajhain

First of all, thank you for your work. I am trying to run nicer-slam, which uses the monosdf implementation, and I am having difficulty doing so.

I get this error:

[screenshot: error message]

And this is the call stack:

[screenshot: call stack]

Here is my gcc, CUDA, and cuDNN configuration:

[screenshot: gcc, CUDA, and cuDNN versions]

I have tried the suggestion from issue #19, but sadly with no success.

Would you be able to help?

@niujinshuchong
Member

Hi, it seems like the 1070 is not on the list here: https://github.com/cvg/nicer-slam/blob/main/code/hashencoder/backend.py#L10-L26. Maybe you need to change it.
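For reference, a minimal sketch of the kind of change meant here, assuming backend.py follows the torch-ngp-style JIT build this encoder is based on (the exact flag list and source file names in the repo may differ). The GTX 1070 is a Pascal card with compute capability 6.1, so the missing entry would be the sm_61 `-gencode` line:

```python
# Sketch only: assumes backend.py builds the extension with
# torch.utils.cpp_extension.load and a hard-coded list of -gencode flags.
import os
from torch.utils.cpp_extension import load

_src_path = os.path.dirname(os.path.abspath(__file__))

_backend = load(
    name='_hash_encoder',
    extra_cflags=['-O3', '-std=c++14'],
    extra_cuda_cflags=[
        '-O3', '-std=c++14',
        '-gencode=arch=compute_75,code=sm_75',  # example of an entry already on the list
        '-gencode=arch=compute_61,code=sm_61',  # added: GTX 1070 (Pascal, compute capability 6.1)
    ],
    # source file names assume the torch-ngp-style layout this encoder derives from
    sources=[os.path.join(_src_path, 'src', f) for f in ['hashencoder.cu', 'bindings.cpp']],
)
```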

@HamzaOuajhain
Author

Thank you for your reply. I managed to fix that problem, but when I try to train the network I get an out-of-memory error. The readme file says to lower the batch size, since I only have an 8 GB GPU. I found

batch_size=1

in eval_rendering, but I doubt that is what it means.

I also tried changing 'batch_size = ground_truth["rgb"].shape[0]' on line 285 of volsdf_train.py, but without success. This is the full error:

python training/exp_runner.py --conf confs/runconf_demo_1.conf
shell command : training/exp_runner.py --conf confs/runconf_demo_1.conf
Loading data ...
Finish loading data.
build_directory ./tmp_build_1070/
Detected CUDA files, patching ldflags
Emitting ninja build file ./tmp_build_1070/build.ninja...
Building extension module _hash_encoder_1070...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module _hash_encoder_1070...
running...
  0%| | 0/200 [00:00<?, ?it/s]/home/aspegique/anaconda3/envs/nicer-slam/lib/python3.8/site-packages/torch/autograd/__init__.py:173: UserWarning: Error detected in _hash_encodeBackward. Traceback of forward call that caused the error:
  File "training/exp_runner.py", line 54, in <module>
    trainrunner.run()
  File "/home/aspegique/Desktop/repos/nicer-slam/code/../code/training/volsdf_train.py", line 558, in run
    model_outputs = self.model(
  File "/home/aspegique/anaconda3/envs/nicer-slam/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/aspegique/Desktop/repos/nicer-slam/code/../code/model/network.py", line 129, in forward
    rgb_flat = self.rendering_network(
  File "/home/aspegique/anaconda3/envs/nicer-slam/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/aspegique/Desktop/repos/nicer-slam/code/../code/model/base_networks.py", line 336, in forward
    grid_feature = self.encoding(points / self.divide_factor)
  File "/home/aspegique/anaconda3/envs/nicer-slam/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/aspegique/Desktop/repos/nicer-slam/code/../code/hashencoder/hashgrid.py", line 210, in forward
    outputs = hash_encode(inputs, self.embeddings, self.offsets, self.per_level_scale, self.base_resolution, inputs.requires_grad)
 (Triggered internally at /opt/conda/conda-bld/pytorch_1646755903507/work/torch/csrc/autograd/python_anomaly_mode.cpp:104.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  0%| | 0/200 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "training/exp_runner.py", line 54, in <module>
    trainrunner.run()
  File "/home/aspegique/Desktop/repos/nicer-slam/code/../code/training/volsdf_train.py", line 577, in run
    loss.backward()
  File "/home/aspegique/anaconda3/envs/nicer-slam/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/aspegique/anaconda3/envs/nicer-slam/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/aspegique/anaconda3/envs/nicer-slam/lib/python3.8/site-packages/torch/autograd/function.py", line 253, in apply
    return user_fn(self, *args)
  File "/home/aspegique/anaconda3/envs/nicer-slam/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 135, in decorate_bwd
    return bwd(*args, **kwargs)
  File "/home/aspegique/Desktop/repos/nicer-slam/code/../code/hashencoder/hashgrid.py", line 64, in backward
    grad_inputs, grad_embeddings = _hash_encode_second_backward.apply(grad, inputs, embeddings, offsets, B, D, C, L, S, H, calc_grad_inputs, dy_dx)
  File "/home/aspegique/Desktop/repos/nicer-slam/code/../code/hashencoder/hashgrid.py", line 85, in forward
    grad_embeddings = torch.zeros_like(embeddings)
RuntimeError: CUDA out of memory. Tried to allocate 1016.00 MiB (GPU 0; 7.91 GiB total capacity; 4.51 GiB already allocated; 1000.12 MiB free; 5.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
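For what it's worth, the last line of the RuntimeError itself suggests one thing to try before editing any code: the PYTORCH_CUDA_ALLOC_CONF allocator setting, which helps when reserved memory far exceeds allocated memory (fragmentation). A minimal sketch, assuming the variable is set before CUDA is first touched; the 128 value is only an example, and the actual ray-batch knob in this codebase is more likely a per-iteration pixel count in the .conf file than the eval_rendering batch size (that setting name is an assumption):

```python
# Sketch only: PYTORCH_CUDA_ALLOC_CONF must be set before CUDA is initialized,
# so put it at the very top of training/exp_runner.py (or export it in the shell).
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # example value

import torch  # import torch only after the allocator setting is in place
```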

@doddodod

doddodod commented Jul 7, 2024

Similar problem:

/root/miniconda3/envs/monosdf/lib/python3.8/site-packages/torch/include/ATen/ATen.h:4:2: error: #error C++17 or later compatible compiler is required to use ATen.
    4 | #error C++17 or later compatible compiler is required to use ATen.
      |  ^~~~~
ninja: build stopped: subcommand failed.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/miniconda3/envs/monosdf/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 2107, in _run_ninja_build
[rank0]:     subprocess.run(
[rank0]:   File "/root/miniconda3/envs/monosdf/lib/python3.8/subprocess.py", line 516, in run
[rank0]:     raise CalledProcessError(retcode, process.args,
[rank0]: subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '1']' returned non-zero exit status 1.

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]:   File "training/exp_runner.py", line 58, in <module>
[rank0]:     trainrunner = MonoSDFTrainRunner(conf=opt.conf,
[rank0]:   File "/root/autodl-tmp/monosdf/code/../code/training/monosdf_train.py", line 107, in __init__
[rank0]:     self.model = utils.get_class(self.conf.get_string('train.model_class'))(conf=conf_model)
[rank0]:   File "/root/autodl-tmp/monosdf/code/../code/utils/general.py", line 18, in get_class
[rank0]:     m = __import__(module)
[rank0]:   File "/root/autodl-tmp/monosdf/code/../code/model/network.py", line 140, in <module>
[rank0]:     from hashencoder.hashgrid import _hash_encode, HashEncoder
[rank0]:   File "/root/autodl-tmp/monosdf/code/../code/hashencoder/__init__.py", line 1, in <module>
[rank0]:     from .hashgrid import HashEncoder
[rank0]:   File "/root/autodl-tmp/monosdf/code/../code/hashencoder/hashgrid.py", line 12, in <module>
[rank0]:     from .backend import _backend
[rank0]:   File "/root/autodl-tmp/monosdf/code/../code/hashencoder/backend.py", line 10, in <module>
[rank0]:     _backend = load(name='_hash_encoder',
[rank0]:   File "/root/miniconda3/envs/monosdf/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1309, in load
[rank0]:     return _jit_compile(
[rank0]:   File "/root/miniconda3/envs/monosdf/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1719, in _jit_compile
[rank0]:     _write_ninja_file_and_build_library(
[rank0]:   File "/root/miniconda3/envs/monosdf/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1832, in _write_ninja_file_and_build_library
[rank0]:     _run_ninja_build(
[rank0]:   File "/root/miniconda3/envs/monosdf/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 2123, in _run_ninja_build
[rank0]:     raise RuntimeError(message) from e
[rank0]: RuntimeError: Error building extension '_hash_encoder'
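For reference, this is a different failure from the one above: the ATen.h message means the PyTorch in this environment requires C++17, while the JIT build is invoked with an older language standard. A minimal sketch of the likely fix, assuming backend.py hard-codes `-std` flags as in the torch-ngp-style layout (the exact flags and source file names in the repo may differ):

```python
# Sketch only: bump any hard-coded C++ standard flags in
# code/hashencoder/backend.py from c++14 to c++17 for both gcc and nvcc.
import os
from torch.utils.cpp_extension import load

_src_path = os.path.dirname(os.path.abspath(__file__))

_backend = load(
    name='_hash_encoder',
    extra_cflags=['-O3', '-std=c++17'],       # was likely '-std=c++14'
    extra_cuda_cflags=['-O3', '-std=c++17'],  # same bump for the CUDA compiler
    # source file names assume the torch-ngp-style layout this encoder derives from
    sources=[os.path.join(_src_path, 'src', f) for f in ['hashencoder.cu', 'bindings.cpp']],
)
```

If the flags are already at c++17, the installed gcc may itself be too old to support C++17 and would need upgrading instead.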
