Error When Shifting Twice #16

Open

mostafaelhoushi opened this issue Apr 21, 2022 · 7 comments

@mostafaelhoushi (Owner)
Copying the question by @mengjingyouling from this issue to create a new issue:

We also want to discuss a problem with you. In your paper, the shift network is applied to classification networks, not object detection. What do you think? Is there a decline in accuracy?

Because a single shift (rounding a weight to one power of two) leads to some accuracy loss, we want to shift twice to reduce it. For example: 10 = 8 + 2 (shift by 3 bits + shift by 1 bit). Therefore, we modified the code as follows:

```python
def get_shift_and_sign(x, rounding='deterministic'):
    sign = torch.sign(x)
    x_abs = torch.abs(x)
    shift1 = round(torch.log(x_abs) / np.log(2), rounding)  # exponent of the first power of two
    wr1 = 2 ** shift1
    w1 = x_abs - wr1                                         # residual after the first term
    shift2 = round(torch.log(w1) / np.log(2), rounding)     # exponent of the second power of two
    return shift1, shift2, sign


def round_power_of_2(x, rounding='deterministic'):
    shift1, shift2, sign = get_shift_and_sign(x, rounding)
    x_rounded = (2.0 ** shift1 + 2.0 ** shift2) * sign
    return x_rounded
```
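
(Side note on this modification: in a minimal standalone run — using torch.round in place of the repository's custom round(x, rounding) helper, which is an assumption here — the second log already produces NaN whenever the first shift rounds the exponent up, because the residual goes negative:)

```python
import torch
import numpy as np

x = torch.tensor([0.9])
shift1 = torch.round(torch.log(torch.abs(x)) / np.log(2))  # log2(0.9) ≈ -0.15, rounds to 0
wr1 = 2.0 ** shift1                                        # 1.0
w1 = torch.abs(x) - wr1                                    # 0.9 - 1.0 = -0.1 (negative residual)
shift2 = torch.log(w1) / np.log(2)                         # log of a negative number -> nan
print(shift2)                                              # tensor([nan])
```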

However, the input in the forward() of class Conv2dShiftQ(_ConvNdShiftQ) becomes NaN, which we suspect is caused by data overflow:

```python
class Conv2dShiftQ(_ConvNdShiftQ):
    # ... ...

    #@weak_script_method
    def forward(self, input):
        print("--------------------------------------forward---------------------------------------------------")
        print("input======", input)
```
Can you give some suggestions to solve it? Thank you very much.

@mostafaelhoushi (Owner, Author) commented Apr 21, 2022

There is an ICLR paper named APoT (Additive Powers of Two) that did something similar (a sum of 2 shifts). Do you want to try their code?
https://github.com/yhhhli/APoT_Quantization

Hopefully their code will work, because they do some other things (like adding a clip and weight normalization) that may avoid the NaN you got.

Also, note that APoT quantizes both weights and activations, while we only quantize weights.
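
(Not APoT's actual code — just a rough sketch of the "sum of two shifts" idea with the residual guarded so the second log never sees a non-positive value; the function name and eps guard are made up for illustration:)

```python
import torch

def two_term_power_of_2(x, eps=1e-8):
    # Approximate x by sign(x) * (2**s1 + 2**s2), guarding against a
    # zero or negative residual so torch.log2 stays finite.
    sign = torch.sign(x)
    x_abs = torch.abs(x)
    s1 = torch.floor(torch.log2(x_abs + eps))        # floor => residual is never negative
    r = x_abs - 2.0 ** s1                            # residual in [0, 2**s1)
    s2 = torch.floor(torch.log2(r.clamp_min(eps)))   # clamp keeps log2 finite when r == 0
    second = torch.where(r > eps, 2.0 ** s2, torch.zeros_like(r))
    return sign * (2.0 ** s1 + second)

# e.g. 10 -> 8 + 2, i.e. (shift by 3) + (shift by 1)
print(two_term_power_of_2(torch.tensor([10.0, 0.9, -3.3])))  # tensor([10.0000, 0.7500, -3.0000])
```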

@mostafaelhoushi (Owner, Author) commented Apr 21, 2022

Also, if you are interested, I think adding weight normalization (before calling round_power_of_2(...)) to my DeepShift code might solve the problem of NaN. You can simply do weight normalization by:

```python
        # weight normalization
        mean = self.weight.mean()
        std = self.weight.std()
        weight = self.weight.add(-mean).div(std)

        # call round_power_of_2(...) on weight
```

@mengjingyouling

I'll try. Thank you very much.

@mengjingyouling commented Apr 24, 2022

I have some questions about your code and look forward to your answers:

```python
class Conv2dShiftQ(_ConvNdShiftQ):
    # ... ...

    def forward(self, input):
        self.weight.data = ste.clampabs(self.weight.data, 2**self.shift_range[0], 2**self.shift_range[1])
        weight_q = ste.round_power_of_2(self.weight, self.rounding)
        input_fixed_point = ste.round_fixed_point(input, self.act_integer_bits, self.act_fraction_bits)
```

1. Why clampabs the weight to a certain range? The value of a weight should be any real number, so why is the range of weights restricted to (-1 * (2**(weight_bits - 1) - 1), 0)?
2. Why does the activation need to be processed, and what do self.act_integer_bits and self.act_fraction_bits mean? (See the generic sketch below.)
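
(For question 2, here is a generic sketch of what fixed-point rounding with integer/fraction bits usually looks like — this is only an assumption for illustration, not DeepShift's actual ste.round_fixed_point:)

```python
import torch

def fixed_point_round_sketch(x, integer_bits, fraction_bits):
    # Keep `fraction_bits` bits after the binary point and clamp to the signed
    # range representable with `integer_bits` bits before it.
    scale = 2.0 ** fraction_bits
    limit = 2.0 ** (integer_bits - 1)
    return torch.clamp(torch.round(x * scale) / scale, -limit, limit - 1.0 / scale)

# integer_bits=4, fraction_bits=4: values live on a 1/16 grid in roughly [-8, 7.9375]
print(fixed_point_round_sketch(torch.tensor([3.14159, -9.0]), 4, 4))  # tensor([ 3.1250, -8.0000])
```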

Thank you very much!

@mengjingyouling commented Apr 24, 2022

Let me add the details of our experiment.

1. We tried to add the weight normalization code to Conv2dShiftQ:

```python
class Conv2dShiftQ(_ConvNdShiftQ):
    # ... ...

    #@weak_script_method
    def forward(self, input):
        # --- newly added weight normalization ---
        mean = self.weight.data.mean()
        std = self.weight.data.std()
        self.weight.data = self.weight.data.add(-mean).div(std)
        # -----------------------------------------
        self.weight.data = ste.clampabs(self.weight.data, 2**self.shift_range[0], 2**self.shift_range[1])

        weight_q = ste.round_power_of_2(self.weight, self.rounding)
        # ...
```

An error occurred:

Traceback (most recent call last):
File "train.py", line 667, in
main(opt)
File "train.py", line 564, in main
train(opt.hyp, opt, device, callbacks)
File "train.py", line 385, in train
callbacks.run('on_train_batch_end', ni, model, imgs, targets, paths, plots, opt.sync_bn)
File "/home/ubuntu/zj/yolov3/utils/callbacks.py", line 76, in run
logger['callback'](*args, **kwargs)
File "/home/ubuntu/zj/yolov3/utils/loggers/init.py", line 89, in on_train_batch_end
self.tb.add_graph(torch.jit.trace(de_parallel(model), imgs[0:1], strict=False), [])
File "/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/jit/_trace.py", line 750, in trace
_module_class,
File "/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/jit/_trace.py", line 991, in trace_module
_module_class,
File "/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/jit/_trace.py", line 526, in _check_trace
raise TracingCheckError(*diag_info)
torch.jit._trace.TracingCheckError: Tracing failed sanity checks!
ERROR: Tensor-valued Constant nodes differed in value across invocations. This often indicates that the tracer has encountered untraceable code.

Node:
%input.1 : Tensor = prim::Constantvalue=, scope: __module.model.0/__module.model.0.conv # /home/ubuntu/zj/yolov3/deepshift/ste.py:86:0
Source Location:
/home/ubuntu/zj/yolov3/deepshift/ste.py(86): clampabs
/home/ubuntu/zj/yolov3/deepshift/modules_q.py(294): forward
/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/ubuntu/zj/yolov3/models/common.py(47): forward
/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/ubuntu/zj/yolov3/models/yolo.py(150): _forward_once
/home/ubuntu/zj/yolov3/models/yolo.py(127): forward
/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/jit/_trace.py(965): trace_module
/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/jit/_trace.py(750): trace
/home/ubuntu/zj/yolov3/utils/loggers/init.py(89): on_train_batch_end
/home/ubuntu/zj/yolov3/utils/callbacks.py(76): run
train.py(385): train
train.py(564): main
train.py(667):
Comparison exception: Tensor-likes are not close!

	Mismatched elements: 190 / 432 (44.0%)
	Greatest absolute difference: 0.1587936282157898 at index (3, 1, 1, 0) (up to 1e-05 allowed)
	Greatest relative difference: 0.26716628984998353 at index (0, 0, 0, 0) (up to 1.3e-06 allowed)

2. If we comment out the line `self.weight.data = ste.clampabs(self.weight.data, 2**self.shift_range[0], 2**self.shift_range[1])`, the NaN occurs again.

Can you give me some suggestions? Thank you very much.

@mostafaelhoushi (Owner, Author) commented Apr 24, 2022 via email

I think we should not modify self.weight.data. Can you change the code to:

```python
    #@weak_script_method
    def forward(self, input):
        mean = self.weight.data.mean()
        std = self.weight.data.std()
        weight_norm = self.weight.data.add(-mean).div(std)
        weight_norm = ste.clampabs(weight_norm, 2**self.shift_range[0], 2**self.shift_range[1])
        weight_q = ste.round_power_of_2(weight_norm, self.rounding)
```

I will try to reply to your other questions later today.

@mengjingyouling commented Apr 28, 2022


We used this code to train for 1000 epochs and found that the mAP of the model grew slowly; the final accuracy is only half that of the fp32 model. It seems there is an error in this code that prevents the round_power_of_2 function from back-propagating. We then added a print statement in the backward part of round_power_of_2 to verify this, and sure enough, the backward code is never executed.
We think the correct way to write the above code is:

```python
mean = self.weight.mean()
std = self.weight.std()
weight_norm = self.weight.add(-mean).div(std)
# weight_norm = ste.clampabs(weight_norm, 2**self.shift_range[0], 2**self.shift_range[1])
# Firstly, we do not limit the bit width of the weight.
```
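
(A tiny, generic PyTorch illustration — not DeepShift code — of why the out-of-place version matters: writing to .data hides the normalization from autograd, while the out-of-place form keeps it in the graph:)

```python
import torch

w = torch.randn(4, requires_grad=True)
w_norm = w.add(-w.mean()).div(w.std())        # out-of-place: normalization stays in the graph
(w_norm ** 2).sum().backward()
print(w.grad)                                 # gradient includes the chain through mean/std

v = torch.randn(4, requires_grad=True)
v.data = v.data.add(-v.data.mean()).div(v.data.std())   # in-place on .data: invisible to autograd
(v ** 2).sum().backward()
print(v.grad)                                 # just 2*v — the normalization never enters the graph
```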

However, the problem of NaN still cannot be solved. Do you have any suggestions? Thank you very much.
