Error When Shifting Twice #16
There is an ICLR paper named APoT (Additive Powers of Two) that did something similar (a sum of two shifts). Do you want to try their code? Hopefully it will work, because they did some other things (like adding a clip and weight normalization) that may avoid the NaN you got. Also, note that APoT quantizes both weights and activations, while we only quantize weights.
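As a rough illustration of the sum-of-two-shifts idea (this is not the APoT authors' code; the function name and exponent range are assumed for the example), each quantized magnitude is either a single power of two or the sum of two powers of two, so a multiply becomes at most two shifts and an add:

```python
import math

def quantize_apot2(w, min_exp=-6, max_exp=0):
    """Toy additive-powers-of-two quantizer: approximate |w| by the
    closest single power of two, or sum of two powers of two, keeping
    the sign. The exponent range is an assumed example setting."""
    if w == 0.0:
        return 0.0
    exps = range(min_exp, max_exp + 1)
    # Candidate magnitudes: one shift, or the sum of two shifts.
    candidates = [2.0 ** e for e in exps]
    candidates += [2.0 ** e1 + 2.0 ** e2 for e1 in exps for e2 in exps]
    best = min(candidates, key=lambda c: abs(c - abs(w)))
    return math.copysign(best, w)

print(quantize_apot2(0.75))   # 0.75 (exactly 2**-1 + 2**-2: two shifts and an add)
print(quantize_apot2(-0.7))
```

A value like 0.75 is unreachable with a single shift but exact with two, which is the extra representational headroom the paper exploits.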
Also, if you are interested, I think adding weight normalization (before calling

I'll try. Thank you very much.
I have some doubts about your code. I look forward to your answer:
1. Why clampabs the weight to a certain range? The value range of a weight should be any real number. And why is the range of the weights (-1 * (2**(weight_bits - 1) - 1), 0)? Thank you very much!
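Not an authoritative answer, but a sketch of why a clamp before power-of-two rounding matters: the rounding works on log2|w|, so a weight at or near zero blows up without it. The `clampabs` and `round_power_of_2` below are minimal stand-ins written for this illustration (exact signatures in ste.py assumed), and the shift-range arithmetic mirrors the expression quoted in the question:

```python
import math

def clampabs(w, lo, hi):
    """Clamp the magnitude of w into [lo, hi], preserving its sign
    (a minimal stand-in for ste.clampabs; exact signature assumed)."""
    return math.copysign(min(max(abs(w), lo), hi), w)

def round_power_of_2(w):
    """Snap |w| to the nearest power of two, keeping the sign.
    Without a prior clamp, w == 0 makes log2 blow up."""
    return math.copysign(2.0 ** round(math.log2(abs(w))), w)

# With weight_bits = 4: (-(2**(4 - 1) - 1), 0) = (-7, 0), i.e. shift
# exponents in [-7, 0], confining |w| to [2**-7, 2**0] = [1/128, 1].
lo, hi = 2.0 ** -7, 2.0 ** 0
print(round_power_of_2(clampabs(0.3, lo, hi)))    # 0.25
print(round_power_of_2(clampabs(1e-9, lo, hi)))   # 0.0078125 (= 2**-7)
```

On this reading (mine, not confirmed), the quoted range is the range of shift exponents rather than of the raw weights: clamping |w| into [2**min_shift, 2**max_shift] keeps log2|w| finite and within the shifts the bit-width can represent.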
Let me supplement our detailed experiment process.
1. We tried to add the weight normalization code in the function Conv2dShiftQ (class Conv2dShiftQ(_ConvNdShiftQ): ...). An error occurred: Traceback (most recent call last): ...
2. If we comment out the code, NaN occurs. Can you give me some suggestions? Thank you very much.
I think we should not modify `self.weight.data`. Can you change the code to:
```
***@***.***_script_method
def forward(self, input):
    mean = self.weight.data.mean()
    std = self.weight.data.std()
    weight_norm = self.weight.data.add(-mean).div(std)
    weight_norm = ste.clampabs(weight_norm, 2**self.shift_range[0],
                               2**self.shift_range[1])
    weight_q = ste.round_power_of_2(weight_norm, self.rounding)
```
I will try to reply to your other questions later today.
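The gist of the suggestion above is to keep self.weight.data read-only inside forward and do all the work on a local value. A torch-free numeric sketch of the same normalize → clampabs → round-to-power-of-two pipeline (plain Python lists stand in for tensors; all names here are illustrative, not the repo's API):

```python
import math

def normalize(ws):
    """Zero-mean, unit-std normalization on a plain list (mirrors the
    mean/std step in the suggested forward; unbiased std like torch's)."""
    n = len(ws)
    mean = sum(ws) / n
    std = math.sqrt(sum((w - mean) ** 2 for w in ws) / (n - 1))
    return [(w - mean) / std for w in ws]

def quantize_weights(ws, min_shift=-7, max_shift=0):
    """Normalize, clamp magnitudes into [2**min_shift, 2**max_shift],
    then round to signed powers of two -- without ever mutating `ws`
    (the analogue of leaving self.weight.data untouched)."""
    lo, hi = 2.0 ** min_shift, 2.0 ** max_shift
    out = []
    for w in normalize(ws):
        w = math.copysign(min(max(abs(w), lo), hi), w)            # clampabs
        out.append(math.copysign(2.0 ** round(math.log2(abs(w))), w))
    return out

ws = [0.1, -0.4, 0.3, 0.05]
wq = quantize_weights(ws)
print(ws)   # unchanged: [0.1, -0.4, 0.3, 0.05]
print(wq)   # every entry is a signed power of two
```

Because the stored weights are never overwritten, repeated forward calls see the same parameter values, which is exactly the property the tracer's sanity check below is complaining about.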
…On Sun., Apr. 24, 2022, 3:53 a.m. mengjingyouling, ***@***.***> wrote:
Let me supplement our detailed experiment process.
1. We tried to add the weight normalization code in the function Conv2dShiftQ:
```
class Conv2dShiftQ(_ConvNdShiftQ):
    .... ....
    ... ....
    ***@***.***_script_method
    def forward(self, input):
        mean = self.weight.data.mean()
        std = self.weight.data.std()
        self.weight.data = self.weight.data.add(-mean).div(std)
        self.weight.data = ste.clampabs(self.weight.data, 2**self.shift_range[0], 2**self.shift_range[1])
        weight_q = ste.round_power_of_2(self.weight, self.rounding)
        .....
```
An error occurred:
```
Traceback (most recent call last):
  File "train.py", line 667, in <module>
    main(opt)
  File "train.py", line 564, in main
    train(opt.hyp, opt, device, callbacks)
  File "train.py", line 385, in train
    callbacks.run('on_train_batch_end', ni, model, imgs, targets, paths, plots, opt.sync_bn)
  File "/home/ubuntu/zj/yolov3/utils/callbacks.py", line 76, in run
    logger['callback'](*args, **kwargs)
  File "/home/ubuntu/zj/yolov3/utils/loggers/__init__.py", line 89, in on_train_batch_end
    self.tb.add_graph(torch.jit.trace(de_parallel(model), imgs[0:1], strict=False), [])
  File "/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/jit/_trace.py", line 750, in trace
    _module_class,
  File "/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/jit/_trace.py", line 991, in trace_module
    _module_class,
  File "/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/jit/_trace.py", line 526, in _check_trace
    raise TracingCheckError(*diag_info)
torch.jit._trace.TracingCheckError: Tracing failed sanity checks!
ERROR: Tensor-valued Constant nodes differed in value across invocations.
This often indicates that the tracer has encountered untraceable code.
Node:
  %input.1 : Tensor = prim::Constant[value=...], scope: __module.model.0/__module.model.0.conv # /home/ubuntu/zj/yolov3/deepshift/ste.py:86:0
Source Location:
  /home/ubuntu/zj/yolov3/deepshift/ste.py(86): clampabs
  /home/ubuntu/zj/yolov3/deepshift/modules_q.py(294): forward
  /home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/nn/modules/module.py(1090): _slow_forward
  /home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/nn/modules/module.py(1102): _call_impl
  /home/ubuntu/zj/yolov3/models/common.py(47): forward
  /home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/nn/modules/module.py(1090): _slow_forward
  /home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/nn/modules/module.py(1102): _call_impl
  /home/ubuntu/zj/yolov3/models/yolo.py(150): _forward_once
  /home/ubuntu/zj/yolov3/models/yolo.py(127): forward
  /home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/nn/modules/module.py(1090): _slow_forward
  /home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/nn/modules/module.py(1102): _call_impl
  /home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/jit/_trace.py(965): trace_module
  /home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/jit/_trace.py(750): trace
  /home/ubuntu/zj/yolov3/utils/loggers/__init__.py(89): on_train_batch_end
  /home/ubuntu/zj/yolov3/utils/callbacks.py(76): run
  train.py(385): train
  train.py(564): main
  train.py(667): <module>
Comparison exception: Tensor-likes are not close!
Mismatched elements: 190 / 432 (44.0%)
Greatest absolute difference: 0.1587936282157898 at index (3, 1, 1, 0) (up to 1e-05 allowed)
Greatest relative difference: 0.26716628984998353 at index (0, 0, 0, 0) (up to 1.3e-06 allowed)
```
2. If we comment out the code:
```
# self.weight.data = ste.clampabs(self.weight.data, 2**self.shift_range[0], 2**self.shift_range[1])
```
then the NaN occurs.
Can you give me some suggestions? Thank you very much.
We used this code to train for 1000 epochs and found that the mAP of the model grew slowly; the final accuracy is only half that of the fp32 model. It seems there are some errors in this code that make the round_power_of_2 function unable to back-propagate. We added a print statement to the back-propagation part of round_power_of_2 to verify this, and sure enough, the above code cannot execute the back-propagation code, so we changed it to mean = self.weight.mean(). However, the problem of NaN still cannot be solved. Do you have any suggestions? Thank you very much.
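For context on the back-propagation point: rounding to a power of two has zero gradient almost everywhere, so training depends on a straight-through estimator (STE) whose backward pass must actually run. A toy, torch-free sketch of that contract (written for illustration, not the repo's ste.py):

```python
import math

def ste_round_power_of_2(w):
    """Forward: snap w to the nearest signed power of two.
    Returns (output, backward_fn). backward_fn is the straight-through
    estimator: it passes the upstream gradient through as if the
    rounding were the identity function."""
    out = math.copysign(2.0 ** round(math.log2(abs(w))), w)
    def backward(grad_out):
        return grad_out  # STE: pretend d(out)/d(w) == 1
    return out, backward

out, backward = ste_round_power_of_2(0.3)
print(out)            # 0.25
print(backward(1.7))  # 1.7 -- the gradient is not zeroed by the rounding
```

If the backward step never fires, as the print test above suggests, the weights receive no gradient signal and the mAP stalls, which matches the slow growth described in this comment.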
Copying the question by @mengjingyouling from this issue to create a new issue: