Loading snapshot RuntimeError: Can't set params because CPU buffer has the wrong size. #502
iordachelivia asked this question in Q&A · Unanswered · Replies: 1 comment, 2 replies
Reply:
Hi there, if you use […] It would of course be great if the snapshot functionality would automate away this conversion (permitting FullyFusedMLP to be used with snapshots from CutlassMLP), but alas that's not supported yet.
I am trying to load, on a local PC, a snapshot that was trained on a remote PC (the two machines have completely different specs: different CPUs, different GPUs, everything).
I train the network on the remote PC with
python scripts/run.py --mode nerf --scene data/nerf/fox --save_snapshot saved/fox_10k.msgpack --train --n_steps 10000
And load the snapshot on the local PC
python3 scripts/run.py --mode nerf --load_snapshot saved/fox.msgpack --gui
I receive the following error:
RuntimeError: Can't set params because CPU buffer has the wrong size.
The error comes from the set_params method in dependencies/tiny-cuda-nn/include/tiny-cuda-nn/trainer.h (called by deserialize).
I receive the same error whether I train on PC A and load on PC B or whether I train on PC B and load on PC A.
I found the dump_parameters_as_images method in src/testbed.cu and printed the layer size values (first and second) from it for snapshots trained on each PC with the train command above. There is a difference at layer 4, but I don't understand where exactly it comes from. Is there some dynamic parameter in the training that would make the same training command produce different layer sizes on different PCs?
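One plausible source of such a mismatch (an assumption, not confirmed against tiny-cuda-nn's actual implementation) is that different MLP backends pad layer widths differently before serializing parameters, so the same nominal architecture can yield different total parameter counts. A purely illustrative sketch, with hypothetical padding rules:

```python
# Hypothetical illustration (NOT tiny-cuda-nn's actual code): two backends
# that round the output layer width up to different multiples end up with
# different total weight counts for the "same" network.

def n_params(widths, pad_output_to):
    """Total weight count for an MLP with layer widths `widths`, where the
    final output width is rounded up to a multiple of `pad_output_to`."""
    padded = widths[:-1] + [-(-widths[-1] // pad_output_to) * pad_output_to]
    return sum(a * b for a, b in zip(padded, padded[1:]))

widths = [32, 64, 64, 3]      # input, two hidden layers, RGB output
print(n_params(widths, 16))   # output padded to 16 -> 7168
print(n_params(widths, 1))    # no output padding   -> 6336
```

If the two backends serialize different padded sizes, set_params would see a CPU buffer whose size does not match what the deserializing backend expects.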
I found a similar issue here: NVlabs/tiny-cuda-nn#6. I do receive a warning on the local PC:
Warning: FullyFusedMLP is not supported for the selected architecture 61. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
But I'm not sure how to fix this. Is there some modification I can make so that a snapshot trained on a different PC with different specs can be loaded locally?
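Since the local GPU (compute capability 61) always falls back to CutlassMLP, one workaround consistent with the reply above (a sketch, not verified for this exact issue) is to force CutlassMLP on the training machine as well, so both sides serialize the same parameter layout. In instant-ngp the network architecture comes from a JSON config (e.g. under configs/nerf/), where the MLP backend is selected by the "otype" field; a config fragment might look like:

```json
{
  "network": {
    "otype": "CutlassMLP",
    "activation": "ReLU",
    "output_activation": "None",
    "n_neurons": 64,
    "n_hidden_layers": 2
  }
}
```

The exact field values above are assumptions based on the repository's default NeRF config; the key change is replacing every "otype": "FullyFusedMLP" with "otype": "CutlassMLP" in a copy of the config, then pointing training at that copy (scripts/run.py accepts a --network path for the network config) before saving the snapshot.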