Error When Shifting Twice #16

Open

mostafaelhoushi opened this issue Apr 21, 2022 · 7 comments

@mostafaelhoushi (Owner)
Copying the question by @mengjingyouling from this issue to create a new issue:

We also want to discuss a problem with you. In your paper, the shift network is applied to classification networks, not object detection. What do you think? Is there a decline in accuracy?

Because a single shift (rounding a weight to one power of two) leads to some accuracy loss, we want to shift twice to reduce it. For example: 10 = 8 + 2 (shift by 3 bits + shift by 1 bit). Therefore, we modified the code as follows:

```python
def get_shift_and_sign(x, rounding='deterministic'):
    sign = torch.sign(x)
    x_abs = torch.abs(x)
    shift1 = round(torch.log(x_abs) / np.log(2), rounding)  # exponent of the first power of two
    wr1 = 2 ** shift1
    w1 = x_abs - wr1                                         # residual after the first term
    shift2 = round(torch.log(w1) / np.log(2), rounding)     # exponent of the second power of two
    return shift1, shift2, sign


def round_power_of_2(x, rounding='deterministic'):
    shift1, shift2, sign = get_shift_and_sign(x, rounding)
    x_rounded = (2.0 ** shift1 + 2.0 ** shift2) * sign
    return x_rounded
```
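
(Side note on this modification: in a minimal standalone run — using torch.round in place of the repository's custom round(x, rounding) helper, which is an assumption here — the second log already produces NaN whenever the first shift rounds the exponent up, because the residual goes negative:)

```python
import torch
import numpy as np

x = torch.tensor([0.9])
shift1 = torch.round(torch.log(torch.abs(x)) / np.log(2))  # log2(0.9) ≈ -0.15, rounds to 0
wr1 = 2.0 ** shift1                                        # 1.0
w1 = torch.abs(x) - wr1                                    # 0.9 - 1.0 = -0.1 (negative residual)
shift2 = torch.log(w1) / np.log(2)                         # log of a negative number -> nan
print(shift2)                                              # tensor([nan])
```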

However, the input in the forward() of class Conv2dShiftQ(_ConvNdShiftQ) becomes NaN, which we suspect is caused by data overflow:

```python
class Conv2dShiftQ(_ConvNdShiftQ):
    # ... ...

    #@weak_script_method
    def forward(self, input):
        print("--------------------------------------forward---------------------------------------------------")
        print("input======", input)
```
Can you give some suggestions to solve it? Thank you very much.

@mostafaelhoushi (Owner, Author) commented Apr 21, 2022

There is an ICLR paper named APoT (Additive Powers of Two) that did something similar (a sum of 2 shifts). Do you want to try their code?
https://github.com/yhhhli/APoT_Quantization

Hopefully their code will work, because they do some other things (like adding a clip and weight normalization) that may avoid the NaN you got.

Also, note that APoT quantizes both weights and activations, while we only quantize weights.
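
(Not APoT's actual code — just a rough sketch of the "sum of two shifts" idea with the residual guarded so the second log never sees a non-positive value; the function name and eps guard are made up for illustration:)

```python
import torch

def two_term_power_of_2(x, eps=1e-8):
    # Approximate x by sign(x) * (2**s1 + 2**s2), guarding against a
    # zero or negative residual so torch.log2 stays finite.
    sign = torch.sign(x)
    x_abs = torch.abs(x)
    s1 = torch.floor(torch.log2(x_abs + eps))        # floor => residual is never negative
    r = x_abs - 2.0 ** s1                            # residual in [0, 2**s1)
    s2 = torch.floor(torch.log2(r.clamp_min(eps)))   # clamp keeps log2 finite when r == 0
    second = torch.where(r > eps, 2.0 ** s2, torch.zeros_like(r))
    return sign * (2.0 ** s1 + second)

# e.g. 10 -> 8 + 2, i.e. (shift by 3) + (shift by 1)
print(two_term_power_of_2(torch.tensor([10.0, 0.9, -3.3])))  # tensor([10.0000, 0.7500, -3.0000])
```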

@mostafaelhoushi (Owner, Author) commented Apr 21, 2022

Also, if you are interested, I think adding weight normalization (before calling round_power_of_2(...)) to my DeepShift code might solve the problem of NaN. You can simply do weight normalization by:

```python
        # weight normalization
        mean = self.weight.mean()
        std = self.weight.std()
        weight = self.weight.add(-mean).div(std)

        # call round_power_of_2(...) on weight
```

@mengjingyouling

I'll try. Thank you very much.

@mengjingyouling commented Apr 24, 2022

I have some questions about your code and look forward to your answers:

```python
class Conv2dShiftQ(_ConvNdShiftQ):
    # ... ...

    def forward(self, input):
        self.weight.data = ste.clampabs(self.weight.data, 2**self.shift_range[0], 2**self.shift_range[1])
        weight_q = ste.round_power_of_2(self.weight, self.rounding)
        input_fixed_point = ste.round_fixed_point(input, self.act_integer_bits, self.act_fraction_bits)
```

1. Why clampabs the weight to a certain range? The value of a weight should be any real number, so why is the range of weights restricted to (-1 * (2**(weight_bits - 1) - 1), 0)?
2. Why does the activation need to be processed, and what do self.act_integer_bits and self.act_fraction_bits mean? (See the generic sketch below.)
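
(For question 2, here is a generic sketch of what fixed-point rounding with integer/fraction bits usually looks like — this is only an assumption for illustration, not DeepShift's actual ste.round_fixed_point:)

```python
import torch

def fixed_point_round_sketch(x, integer_bits, fraction_bits):
    # Keep `fraction_bits` bits after the binary point and clamp to the signed
    # range representable with `integer_bits` bits before it.
    scale = 2.0 ** fraction_bits
    limit = 2.0 ** (integer_bits - 1)
    return torch.clamp(torch.round(x * scale) / scale, -limit, limit - 1.0 / scale)

# integer_bits=4, fraction_bits=4: values live on a 1/16 grid in roughly [-8, 7.9375]
print(fixed_point_round_sketch(torch.tensor([3.14159, -9.0]), 4, 4))  # tensor([ 3.1250, -8.0000])
```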

Thank you very much!

@mengjingyouling commented Apr 24, 2022

Let me add the details of our experiment.

1. We tried to add the weight normalization code to Conv2dShiftQ:

```python
class Conv2dShiftQ(_ConvNdShiftQ):
    # ... ...

    #@weak_script_method
    def forward(self, input):
        # --- newly added weight normalization ---
        mean = self.weight.data.mean()
        std = self.weight.data.std()
        self.weight.data = self.weight.data.add(-mean).div(std)
        # -----------------------------------------
        self.weight.data = ste.clampabs(self.weight.data, 2**self.shift_range[0], 2**self.shift_range[1])

        weight_q = ste.round_power_of_2(self.weight, self.rounding)
        # ...
```

An error occurred:

Traceback (most recent call last):
File "train.py", line 667, in
main(opt)
File "train.py", line 564, in main
train(opt.hyp, opt, device, callbacks)
File "train.py", line 385, in train
callbacks.run('on_train_batch_end', ni, model, imgs, targets, paths, plots, opt.sync_bn)
File "/home/ubuntu/zj/yolov3/utils/callbacks.py", line 76, in run
logger['callback'](*args, **kwargs)
File "/home/ubuntu/zj/yolov3/utils/loggers/init.py", line 89, in on_train_batch_end
self.tb.add_graph(torch.jit.trace(de_parallel(model), imgs[0:1], strict=False), [])
File "/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/jit/_trace.py", line 750, in trace
_module_class,
File "/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/jit/_trace.py", line 991, in trace_module
_module_class,
File "/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/jit/_trace.py", line 526, in _check_trace
raise TracingCheckError(*diag_info)
torch.jit._trace.TracingCheckError: Tracing failed sanity checks!
ERROR: Tensor-valued Constant nodes differed in value across invocations. This often indicates that the tracer has encountered untraceable code.

Node:
%input.1 : Tensor = prim::Constantvalue=, scope: __module.model.0/__module.model.0.conv # /home/ubuntu/zj/yolov3/deepshift/ste.py:86:0
Source Location:
/home/ubuntu/zj/yolov3/deepshift/ste.py(86): clampabs
/home/ubuntu/zj/yolov3/deepshift/modules_q.py(294): forward
/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/ubuntu/zj/yolov3/models/common.py(47): forward
/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/ubuntu/zj/yolov3/models/yolo.py(150): _forward_once
/home/ubuntu/zj/yolov3/models/yolo.py(127): forward
/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/jit/_trace.py(965): trace_module
/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/jit/_trace.py(750): trace
/home/ubuntu/zj/yolov3/utils/loggers/init.py(89): on_train_batch_end
/home/ubuntu/zj/yolov3/utils/callbacks.py(76): run
train.py(385): train
train.py(564): main
train.py(667):
Comparison exception: Tensor-likes are not close!

	Mismatched elements: 190 / 432 (44.0%)
	Greatest absolute difference: 0.1587936282157898 at index (3, 1, 1, 0) (up to 1e-05 allowed)
	Greatest relative difference: 0.26716628984998353 at index (0, 0, 0, 0) (up to 1.3e-06 allowed)

2. If we comment out the line `self.weight.data = ste.clampabs(self.weight.data, 2**self.shift_range[0], 2**self.shift_range[1])`, the NaN occurs again.

Can you give me some suggestions? Thank you very much.

@mostafaelhoushi (Owner, Author) commented Apr 24, 2022 via email

I think we should not modify self.weight.data. Can you change the code to:

```python
    #@weak_script_method
    def forward(self, input):
        mean = self.weight.data.mean()
        std = self.weight.data.std()
        weight_norm = self.weight.data.add(-mean).div(std)
        weight_norm = ste.clampabs(weight_norm, 2**self.shift_range[0], 2**self.shift_range[1])
        weight_q = ste.round_power_of_2(weight_norm, self.rounding)
```

I will try to reply to your other questions later today.

@mengjingyouling commented Apr 28, 2022


We used this code to train for 1000 epochs and found that the mAP of the model grew slowly; the final accuracy is only half that of the fp32 model. It seems there is an error in this code that prevents the round_power_of_2 function from back-propagating. We then added a print statement in the backward part of round_power_of_2 to verify this, and sure enough, the backward code is never executed.
We think the correct way to write the above code is:

```python
mean = self.weight.mean()
std = self.weight.std()
weight_norm = self.weight.add(-mean).div(std)
# weight_norm = ste.clampabs(weight_norm, 2**self.shift_range[0], 2**self.shift_range[1])
# Firstly, we do not limit the bit width of the weight.
```
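
(A tiny, generic PyTorch illustration — not DeepShift code — of why the out-of-place version matters: writing to .data hides the normalization from autograd, while the out-of-place form keeps it in the graph:)

```python
import torch

w = torch.randn(4, requires_grad=True)
w_norm = w.add(-w.mean()).div(w.std())        # out-of-place: normalization stays in the graph
(w_norm ** 2).sum().backward()
print(w.grad)                                 # gradient includes the chain through mean/std

v = torch.randn(4, requires_grad=True)
v.data = v.data.add(-v.data.mean()).div(v.data.std())   # in-place on .data: invisible to autograd
(v ** 2).sum().backward()
print(v.grad)                                 # just 2*v — the normalization never enters the graph
```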

However, the problem of NaN still cannot be solved. Do you have any suggestions? Thank you very much.
