CUDA out of memory #18

Open
kukumallou opened this issue Nov 2, 2023 · 6 comments
Labels: help wanted (Extra attention is needed)

Comments

@kukumallou
First of all, thanks for the contribution; very nice project.
I ran into a CUDA out-of-memory error when running the dense reconstruction script (run_neuralangelo-colmap_dense.sh):

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.05 GiB (GPU 0; 23.64 GiB total capacity; 19.51 GiB already allocated; 911.19 MiB free; 20.04 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I have tried reducing the number of samples per ray from 1024 to 512 and then 256, as suggested in the FAQ, but the error message is the same. By the way, I ran the sparse reconstruction script successfully and got correct results. Any idea how to fix this? Thanks a lot.
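For reference, here is a minimal sketch of how the two mitigations discussed in this thread can be expressed. It assumes launch.py accepts dotted key=value overrides after --train (as with dataset.root_dir in the dense script) and that model.num_samples_per_ray is the relevant key, as mentioned later in this thread; the allocator setting is the one suggested by the error message itself, and the 128 MB value is an untuned starting point.

    # Suggested by the OOM message: cap allocator block splitting to reduce fragmentation
    export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

    # Lower the samples per ray via a command-line override instead of editing the YAML
    python launch.py --config configs/neuralangelo-colmap_dense.yaml --gpu 0 --train \
        dataset.root_dir=$INPUT_DIR \
        model.num_samples_per_ray=256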

@hugoycj (Owner) commented Nov 2, 2023

Would you mind testing the latest version and replacing python launch.py --config configs/neuralangelo-colmap_dense-SH.yaml --gpu 0 --train dataset.root_dir=$INPUT_DIR with python launch.py --config configs/neuralangelo-colmap_dense.yaml --gpu 0 --train dataset.root_dir=$INPUT_DIR in the run_neuralangelo-colmap_dense.sh script?
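Concretely, based only on the two commands quoted above, the relevant line in run_neuralangelo-colmap_dense.sh would change roughly like this:

    # before
    python launch.py --config configs/neuralangelo-colmap_dense-SH.yaml --gpu 0 --train dataset.root_dir=$INPUT_DIR
    # after
    python launch.py --config configs/neuralangelo-colmap_dense.yaml --gpu 0 --train dataset.root_dir=$INPUT_DIR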

@kukumallou (Author)

I tried the latest version (as of Nov 2), but the error still exists. The card I'm using is a 4090 with 24 GB of memory.

@hugoycj (Owner) commented Nov 2, 2023

Sorry to bother you. Would you mind sharing at which step the out-of-memory error happens, and what the resolution of your images is?

@kukumallou (Author)

There are 140 images at a resolution of 1920x1440. Below is the output log of the script.

---sfm---
Sparse map datasets/cake exist. Aborting
---model_converter---
---colmap2mvsnet---
Image pair datasets/cake/dense/pair.txt exist. Aborting
Number of model parameters: 1162696
load third_party/Vis-MVSNet/pretrained_model/vis/-1
(1, 1, 528, 960): 100%|█████| 140/140 [02:39<00:00, 1.14s/it]
---mvsnet_fusion---
load data: 100%|███| 140/140 [00:01<00:00, 137.01it/s]
prob filter: 100%|███ 140/140 [00:00<00:00, 203.46it/s]
vis filter and med fusion: 100%|████| 140/140 [00:05<00:00, 27.54it/s]
vis filter and ave fusion: 100%|████| 140/140 [00:04<00:00, 31.20it/s]
vis filter: 100%|███| 140/140 [00:04<00:00, 30.62it/s]
back proj: 100%|████| 140/140 [00:00<00:00, 293.64it/s]
Construct combined PCD
Estimate normal
---angelo_recon---
Global seed set to 42
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Trainer(limit_train_batches=1.0) was configured so 100% of the batches per epoch will be used..
Global seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1

distributed_backend=nccl
All distributed processes registered. Starting with 1 processes

Loading dense prior from datasets/cake/dense/fused.ply
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params

0 | model | NeuSModel | 28.0 M

28.0 M Trainable params
0 Non-trainable params
28.0 M Total params
55.914 Total estimated model params size (MB)
Epoch 0: : 0it [00:00, ?it/s]Update finite_difference_eps to 0.06801176275750971
Traceback (most recent call last):
  File "launch.py", line 125, in <module>
    main()
  File "launch.py", line 114, in main
    trainer.fit(system, datamodule=dm)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
    self.fit_loop.run()
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 271, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 194, in advance
    response = self.trainer._call_lightning_module_hook("on_train_batch_start", batch, batch_idx)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1550, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/****/Dev/instant-angelo/systems/base.py", line 57, in on_train_batch_start
    update_module_step(self.model, self.current_epoch, self.global_step)
  File "/home/****/Dev/instant-angelo/systems/utils.py", line 351, in update_module_step
    m.update_step(epoch, global_step)
  File "/home/****/Dev/instant-angelo/models/neus.py", line 111, in update_step
    self.occupancy_grid_bg.every_n_step(step=global_step, occ_eval_fn=occ_eval_fn_bg, occ_thre=self.config.get('grid_prune_occ_thre_bg', 0.01))
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/nerfacc/grid.py", line 271, in every_n_step
    self._update(
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/nerfacc/grid.py", line 229, in _update
    occ = occ_eval_fn(x).squeeze(-1)
  File "/home/****/Dev/instant-angelo/models/neus.py", line 104, in occ_eval_fn_bg
    density, _ = self.geometry_bg(x)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/****/Dev/instant-angelo/models/geometry.py", line 125, in forward
    out = self.encoding_with_network(points.view(-1, self.n_input_dims)).view(*points.shape[:-1], self.n_output_dims).float()
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/****/Dev/instant-angelo/models/network_utils.py", line 193, in forward
    return self.network(self.encoding(x))
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/****/Dev/instant-angelo/models/network_utils.py", line 76, in forward
    return self.encoding(x, *args) if not self.include_xyz else torch.cat([x * self.xyz_scale + self.xyz_offset, self.encoding(x, *args)], dim=-1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.15 GiB (GPU 0; 23.64 GiB total capacity; 19.97 GiB already allocated; 970.44 MiB free; 20.47 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Epoch 0: : 0it [00:07, ?it/s]
start time: 2023-11-03 08:46:23
sfm time: 2023-11-03 08:46:23
model_converter finished: 2023-11-03 08:46:24
colmap2mvsnet finished: 2023-11-03 08:46:25
mvsnet_inference finished: 2023-11-03 08:49:06
mvsnet_fusion finished: 2023-11-03 08:49:33
angelo_recon finished: 2023-11-03 08:50:11

@lyupei commented Nov 8, 2023

Hi, I have decreased model.num_samples_per_ray from 1024 to 128, but I still encounter VRAM OOM issues. I'm using a 2070 with 8 GB of VRAM. Can I run this project by adjusting other parameters?

@hugoycj added the help wanted label on Nov 8, 2023
@jianghr-shanghaitech

@kukumallou @lyupei Wondering if you have solved the problem. It seems that OOM happens whenever VRAM is below 11 GB, e.g. on a 3080 with 10 GB of VRAM.
