Wenyi5608 greedy sampling #293
Conversation
Can you maybe create a generic function which performs sampling and accepts top_p, top_k, num_beams and temperature? Ideally, we need a single method similar to HF's generate, which hides the generation loop from the user. Could you please try it?
@ilya-lavrenov HuggingFace's group_beam_search does not implement sampling (top_p, top_k, and temperature); only beam_search and greedy_search have sampling. genai implements the group_beam_search and greedy_search functions.
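For context, a single HF-generate-style entry point could be shaped roughly as in the sketch below. This is only an illustration of the requested interface; the struct fields and the generate signature are hypothetical, not the actual genai API.

```cpp
// Hypothetical sketch of a generate()-like entry point that hides the
// generation loop; names and fields are illustrative, not the genai API.
#include <cstddef>
#include <cstdint>
#include <vector>

struct GenerationConfig {
    size_t max_new_tokens = 64;
    size_t num_beams = 1;          // > 1 switches to (group) beam search
    size_t top_k = 0;              // 0 disables top-k filtering
    float top_p = 1.0f;            // 1.0 disables nucleus (top-p) filtering
    float temperature = 1.0f;
    float repetition_penalty = 1.0f;
    bool do_sample = false;        // false: greedy/beam search, true: random sampling
};

// Single user-facing method, similar in spirit to HuggingFace's generate():
// dispatches internally to greedy/random sampling or beam search based on config.
std::vector<int64_t> generate(const std::vector<int64_t>& input_ids,
                              const GenerationConfig& config);
```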
```cpp
sampling_softmax_inplace(token_scores.data(), token_scores.data() + token_scores.size());
for (size_t i = 0; i < token_scores.size(); i++) {
    logits[i] = token_scores[i].score;
```
You must take the values from logits for the remaining tokens here, not scores, and then do the softmax on them. The softmax has already modified the scores in the top_p call, so a second softmax distorts the distribution.
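A minimal sketch of that fix (the helper type and function name are hypothetical, not the PR's code): keep the original logit for every token that survives top-p/top-k filtering, apply softmax exactly once over those logits, and sample from the result.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// A token that survived top-p/top-k filtering: its id and its *original* logit.
struct TokenLogit {
    int id;
    float logit;
};

// Sample one token id from the kept tokens. Softmax is applied exactly once,
// on the original logits, so the distribution is not distorted by an earlier
// in-place softmax.
int sample_from_kept(const std::vector<TokenLogit>& kept, std::mt19937& rng) {
    float max_logit = kept.front().logit;
    for (const auto& t : kept)
        max_logit = std::max(max_logit, t.logit);

    std::vector<double> weights;
    weights.reserve(kept.size());
    for (const auto& t : kept)
        weights.push_back(std::exp(double(t.logit) - max_logit));  // unnormalized softmax

    // std::discrete_distribution normalizes the weights itself.
    std::discrete_distribution<size_t> dist(weights.begin(), weights.end());
    return kept[dist(rng)].id;
}
```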
I referred to chatglm.cpp (https://github.com/li-plus/chatglm.cpp/blob/main/chatglm.cpp#L825) to implement this function. Is this implementation different from HuggingFace's?
I checked the effect of repeated softmax on the logits; for the llama model the second softmax makes the distribution nearly uniform and the third one makes it exactly uniform. Here is how the probabilities of the most probable token evolve:
```
Softmax 0 times: tensor([[ 0.0000, 8.6805, 9.4036, 8.4245, 10.8393]])
Softmax 1 times: tensor([[3.1250e-05, 2.7906e-01, 2.0102e-01, 7.2127e-02, 2.6164e-01]])
Softmax 2 times: tensor([[3.1250e-05, 4.1308e-05, 3.8207e-05, 3.3586e-05, 4.0595e-05]])
Softmax 3 times: tensor([[3.1250e-05, 3.1250e-05, 3.1250e-05, 3.1250e-05, 3.1250e-05]])
For uniform distribution: 3.125e-05
```
The second softmax makes model predictions useless.
HuggingFace doesn't transform the logits during top_p; it just calculates softmax and filters the original logits based on the result. So sampling_softmax_inplace is not needed inside the sampling_top_p function.
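For illustration, the filtering step itself could look like the sketch below (hypothetical names, written from the description above rather than taken from the PR): the softmax is computed on a local copy only to find the cutoff, and the kept entries carry the untouched logits.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct TokenLogit {   // hypothetical helper: token id plus its original logit
    int id;
    float logit;
};

// Keep the smallest set of top tokens whose cumulative probability exceeds top_p.
// The softmax below is used only to find the cutoff; the returned logits are the
// original, unmodified values, so the caller can apply softmax exactly once later.
std::vector<TokenLogit> top_p_filter(const float* logits, size_t vocab_size, float top_p) {
    std::vector<TokenLogit> tokens(vocab_size);
    for (size_t i = 0; i < vocab_size; ++i)
        tokens[i] = {int(i), logits[i]};
    std::sort(tokens.begin(), tokens.end(),
              [](const TokenLogit& a, const TokenLogit& b) { return a.logit > b.logit; });

    // Softmax over a local copy of the sorted logits.
    std::vector<double> probs(vocab_size);
    double max_logit = tokens.front().logit;
    double sum = 0.0;
    for (size_t i = 0; i < vocab_size; ++i) {
        probs[i] = std::exp(tokens[i].logit - max_logit);
        sum += probs[i];
    }

    double cumulative = 0.0;
    size_t kept = vocab_size;
    for (size_t i = 0; i < vocab_size; ++i) {
        cumulative += probs[i] / sum;
        if (cumulative > top_p) {
            kept = i + 1;
            break;
        }
    }
    tokens.resize(kept);
    return tokens;
}
```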
Code for the softmax probs:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_checkpoint = "JackFram/llama-68m"
hf_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
hf_model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

texts = ["This is a test"]
tokenized = hf_tokenizer(texts, return_tensors="pt")
with torch.no_grad():
    t = hf_model(**tokenized).logits

print("Softmax 0 times:", torch.max(t, dim=-1).values)
for i in range(1, 4):
    t = torch.nn.functional.softmax(t, dim=-1)
    print(f"Softmax {i} times:", torch.max(t, dim=-1).values)
print(f"For uniform distribution: {1 / hf_tokenizer.vocab_size}")
```
@apaniukov Thanks a lot!
I re-implemented the sampling_top_p function; the implementation is similar to HuggingFace's and doesn't transform the logits during top_p.
```cpp
    out_token = std::max_element(logits, logits + vocab_size) - logits;
}

prompt.push_back(out_token);
```
It seems token accumulation is needed for applying sampling_repetition_penalty. These ids need to be preserved between get_out_token calls.
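For example (a sketch only; the signature and the surrounding loop are assumptions, not the PR's actual code), the generated ids could be accumulated in a vector that outlives the per-step call and passed to the penalty each step:

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Penalize tokens that already appeared in the sequence (CTRL-style penalty):
// positive logits are divided by the penalty, negative ones multiplied by it.
void sampling_repetition_penalty(float* logits, size_t vocab_size,
                                 const std::vector<int64_t>& accumulated_ids,
                                 float penalty) {
    std::set<int64_t> unique_ids(accumulated_ids.begin(), accumulated_ids.end());
    for (int64_t id : unique_ids) {
        if (id < 0 || size_t(id) >= vocab_size)
            continue;
        float& logit = logits[id];
        logit = (logit > 0) ? logit / penalty : logit * penalty;
    }
}

// Inside the generation loop the ids are preserved across get_out_token calls:
//
//   sampling_repetition_penalty(logits, vocab_size, accumulated_ids, repetition_penalty);
//   int64_t out_token = get_out_token(logits, vocab_size, config);   // hypothetical call
//   accumulated_ids.push_back(out_token);
```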
Do you mean that the logits info remains the same before and after get_out_token calls?
@wenyi5608, random sampling decoding based on your PR was merged into the generate API: pavel-esir#6. The generate API is being prepared for the merge into the main repository: #334. Thank you for your contribution!
Greedy search: added repeat_penalty, top_k, top_p and temperature functions