Output differs from HF inference #280
-
I set temperature=0.1, top-k=10, top-p=0.75, and I expected that running inference on the same prompt would give the same output. Testing both HF and vLLM inference, HF gives a stable output, while vLLM sometimes gives a different one. Is this normal? The parameters of the two inference runs are the same, yet they do not produce the same results. Looking forward to your reply!
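A minimal sketch of this kind of side-by-side comparison (the model name, prompt, and max_new_tokens below are placeholders, not my exact script):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "meta-llama/Llama-2-13b-hf"  # placeholder checkpoint
prompt = "Example prompt"

# HF inference with sampling enabled and the parameters above
tokenizer = AutoTokenizer.from_pretrained(model_name)
hf_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
inputs = tokenizer(prompt, return_tensors="pt").to(hf_model.device)
hf_out = hf_model.generate(
    **inputs,
    do_sample=True,
    temperature=0.1,
    top_k=10,
    top_p=0.75,
    max_new_tokens=256,
)
print(tokenizer.decode(hf_out[0], skip_special_tokens=True))

# vLLM inference with the same sampling parameters
llm = LLM(model=model_name)
params = SamplingParams(temperature=0.1, top_k=10, top_p=0.75, max_tokens=256)
vllm_out = llm.generate([prompt], params)
print(vllm_out[0].outputs[0].text)
```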
Replies: 6 comments 3 replies
-
I have encountered the same problem. Using the LLaMA 13B model with max_tokens=256, frequency_penalty=0.1, temperature=0.1, top-k=50, top-p=0.75, I tested a set of 40 questions and found that the outputs for 15 of them were different from the outputs obtained with Hugging Face inference.
-
The LLM inference process includes sampling, which is a random process. Because the implementations of HF and vLLM are different, it is normal to get different samples. However, if you perform argmax sampling (e.g., temperature=0), then you should be able to see the same results.
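For example, a minimal sketch of greedy (argmax) decoding in both frameworks, assuming a LLaMA-style checkpoint; the model name, prompt, and token limit are placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "meta-llama/Llama-2-13b-hf"  # placeholder checkpoint
prompt = "Example prompt"

# HF: do_sample=False makes generate() use greedy (argmax) decoding
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, do_sample=False, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))

# vLLM: temperature=0 switches SamplingParams to greedy (argmax) sampling
llm = LLM(model=model_name)
greedy = SamplingParams(temperature=0, max_tokens=256)
result = llm.generate([prompt], greedy)[0]
print(result.outputs[0].text)
```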
-
I am using greedy search for decoding.
-
I've met the same problem.
-
Has this issue been resolved?
-
I have encountered the same issue for