CUDA out of memory #18

Open
kukumallou opened this issue Nov 2, 2023 · 6 comments
Labels: help wanted (Extra attention is needed)

Comments

@kukumallou
First of all, thanks for the contribution; very nice project.
I ran into a CUDA out-of-memory error when running the dense reconstruction script (run_neuralangelo-colmap_dense.sh):

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.05 GiB (GPU 0; 23.64 GiB total capacity; 19.51 GiB already allocated; 911.19 MiB free; 20.04 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I have tried reducing the number of samples per ray from 1024 to 512 and then 256, as suggested in the FAQ, but the error message is the same. By the way, I ran the sparse reconstruction script successfully and got correct results. Any idea how to fix this? Thanks a lot.
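For reference, here is a minimal sketch of how the two mitigations discussed in this thread can be expressed. It assumes launch.py accepts dotted key=value overrides after --train (as with dataset.root_dir in the dense script) and that model.num_samples_per_ray is the relevant key, as mentioned later in this thread; the allocator setting is the one suggested by the error message itself, and the 128 MB value is an untuned starting point.

    # Suggested by the OOM message: cap allocator block splitting to reduce fragmentation
    export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

    # Lower the samples per ray via a command-line override instead of editing the YAML
    python launch.py --config configs/neuralangelo-colmap_dense.yaml --gpu 0 --train \
        dataset.root_dir=$INPUT_DIR \
        model.num_samples_per_ray=256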

@hugoycj (Owner) commented Nov 2, 2023

Would you mind testing the latest version and replacing python launch.py --config configs/neuralangelo-colmap_dense-SH.yaml --gpu 0 --train dataset.root_dir=$INPUT_DIR with python launch.py --config configs/neuralangelo-colmap_dense.yaml --gpu 0 --train dataset.root_dir=$INPUT_DIR in the run_neuralangelo-colmap_dense.sh script?
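Concretely, based only on the two commands quoted above, the relevant line in run_neuralangelo-colmap_dense.sh would change roughly like this:

    # before
    python launch.py --config configs/neuralangelo-colmap_dense-SH.yaml --gpu 0 --train dataset.root_dir=$INPUT_DIR
    # after
    python launch.py --config configs/neuralangelo-colmap_dense.yaml --gpu 0 --train dataset.root_dir=$INPUT_DIR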

@kukumallou (Author)

I tried the latest version (as of Nov 2), but the error still exists. The card I'm using is a 4090 with 24 GB of memory.

@hugoycj (Owner) commented Nov 2, 2023

Sorry to bother you. Would you mind sharing at which step the out-of-memory error happens, and what the resolution of your images is?

@kukumallou (Author)

There are 140 images at a resolution of 1920x1440. Below is the output log of the script.

---sfm---
Sparse map datasets/cake exist. Aborting
---model_converter---
---colmap2mvsnet---
Image pair datasets/cake/dense/pair.txt exist. Aborting
Number of model parameters: 1162696
load third_party/Vis-MVSNet/pretrained_model/vis/-1
(1, 1, 528, 960): 100%|█████| 140/140 [02:39<00:00, 1.14s/it]
---mvsnet_fusion---
load data: 100%|███| 140/140 [00:01<00:00, 137.01it/s]
prob filter: 100%|███ 140/140 [00:00<00:00, 203.46it/s]
vis filter and med fusion: 100%|████| 140/140 [00:05<00:00, 27.54it/s]
vis filter and ave fusion: 100%|████| 140/140 [00:04<00:00, 31.20it/s]
vis filter: 100%|███| 140/140 [00:04<00:00, 30.62it/s]
back proj: 100%|████| 140/140 [00:00<00:00, 293.64it/s]
Construct combined PCD
Estimate normal
---angelo_recon---
Global seed set to 42
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Trainer(limit_train_batches=1.0) was configured so 100% of the batches per epoch will be used..
Global seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1

distributed_backend=nccl
All distributed processes registered. Starting with 1 processes

Loading dense prior from datasets/cake/dense/fused.ply
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params

0 | model | NeuSModel | 28.0 M

28.0 M Trainable params
0 Non-trainable params
28.0 M Total params
55.914 Total estimated model params size (MB)
Epoch 0: : 0it [00:00, ?it/s]Update finite_difference_eps to 0.06801176275750971
Traceback (most recent call last):
  File "launch.py", line 125, in <module>
    main()
  File "launch.py", line 114, in main
    trainer.fit(system, datamodule=dm)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
    self.fit_loop.run()
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 271, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 194, in advance
    response = self.trainer._call_lightning_module_hook("on_train_batch_start", batch, batch_idx)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1550, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/****/Dev/instant-angelo/systems/base.py", line 57, in on_train_batch_start
    update_module_step(self.model, self.current_epoch, self.global_step)
  File "/home/****/Dev/instant-angelo/systems/utils.py", line 351, in update_module_step
    m.update_step(epoch, global_step)
  File "/home/****/Dev/instant-angelo/models/neus.py", line 111, in update_step
    self.occupancy_grid_bg.every_n_step(step=global_step, occ_eval_fn=occ_eval_fn_bg, occ_thre=self.config.get('grid_prune_occ_thre_bg', 0.01))
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/nerfacc/grid.py", line 271, in every_n_step
    self._update(
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/nerfacc/grid.py", line 229, in _update
    occ = occ_eval_fn(x).squeeze(-1)
  File "/home/****/Dev/instant-angelo/models/neus.py", line 104, in occ_eval_fn_bg
    density, _ = self.geometry_bg(x)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/****/Dev/instant-angelo/models/geometry.py", line 125, in forward
    out = self.encoding_with_network(points.view(-1, self.n_input_dims)).view(*points.shape[:-1], self.n_output_dims).float()
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/****/Dev/instant-angelo/models/network_utils.py", line 193, in forward
    return self.network(self.encoding(x))
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/****/Dev/instant-angelo/models/network_utils.py", line 76, in forward
    return self.encoding(x, *args) if not self.include_xyz else torch.cat([x * self.xyz_scale + self.xyz_offset, self.encoding(x, *args)], dim=-1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.15 GiB (GPU 0; 23.64 GiB total capacity; 19.97 GiB already allocated; 970.44 MiB free; 20.47 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Epoch 0: : 0it [00:07, ?it/s]
start time: 2023-11-03 08:46:23
sfm time: 2023-11-03 08:46:23
model_converter finished: 2023-11-03 08:46:24
colmap2mvsnet finished: 2023-11-03 08:46:25
mvsnet_inference finished: 2023-11-03 08:49:06
mvsnet_fusion finished: 2023-11-03 08:49:33
angelo_recon finished: 2023-11-03 08:50:11

@lyupei commented Nov 8, 2023

Hi, I have decreased model.num_samples_per_ray from 1024 to 128, but I still encounter VRAM OOM issues. I'm using a 2070 with 8 GB of VRAM. Can I run this project by adjusting other parameters?

@hugoycj added the help wanted label on Nov 8, 2023
@jianghr-shanghaitech

@kukumallou @lyupei Wondering if you have solved the problem. It seems that OOM happens whenever VRAM is below 11 GB, e.g. on a 3080 with 10 GB of VRAM.
