Wenyi5608 greedy sampling #293
Conversation
Can you maybe create a generic function which performs sampling and accepts top_p, top_k, num_beams and temperature? Ideally, we need a single method similar to HF's generate, which hides the generation loop from the user. Could you please try it?
@ilya-lavrenov HuggingFace's group_beam_search does not implement sampling (top_p, top_k, and temperature); only beam_search and greedy_search have sampling. genai implements the group_beam_search and greedy_search functions.
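For context, a single HF-generate-style entry point could be shaped roughly as in the sketch below. This is only an illustration of the requested interface; the struct fields and the generate signature are hypothetical, not the actual genai API.

```cpp
// Hypothetical sketch of a generate()-like entry point that hides the
// generation loop; names and fields are illustrative, not the genai API.
#include <cstddef>
#include <cstdint>
#include <vector>

struct GenerationConfig {
    size_t max_new_tokens = 64;
    size_t num_beams = 1;          // > 1 switches to (group) beam search
    size_t top_k = 0;              // 0 disables top-k filtering
    float top_p = 1.0f;            // 1.0 disables nucleus (top-p) filtering
    float temperature = 1.0f;
    float repetition_penalty = 1.0f;
    bool do_sample = false;        // false: greedy/beam search, true: random sampling
};

// Single user-facing method, similar in spirit to HuggingFace's generate():
// dispatches internally to greedy/random sampling or beam search based on config.
std::vector<int64_t> generate(const std::vector<int64_t>& input_ids,
                              const GenerationConfig& config);
```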
```cpp
sampling_softmax_inplace(token_scores.data(), token_scores.data() + token_scores.size());
for (size_t i = 0; i < token_scores.size(); i++) {
    logits[i] = token_scores[i].score;
```
You must take the values from logits for the remaining tokens here, not scores, and then do the softmax on them. The softmax has already modified the scores in the top_p call, so a second softmax distorts the distribution.
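A minimal sketch of that fix (the helper type and function name are hypothetical, not the PR's code): keep the original logit for every token that survives top-p/top-k filtering, apply softmax exactly once over those logits, and sample from the result.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// A token that survived top-p/top-k filtering: its id and its *original* logit.
struct TokenLogit {
    int id;
    float logit;
};

// Sample one token id from the kept tokens. Softmax is applied exactly once,
// on the original logits, so the distribution is not distorted by an earlier
// in-place softmax.
int sample_from_kept(const std::vector<TokenLogit>& kept, std::mt19937& rng) {
    float max_logit = kept.front().logit;
    for (const auto& t : kept)
        max_logit = std::max(max_logit, t.logit);

    std::vector<double> weights;
    weights.reserve(kept.size());
    for (const auto& t : kept)
        weights.push_back(std::exp(double(t.logit) - max_logit));  // unnormalized softmax

    // std::discrete_distribution normalizes the weights itself.
    std::discrete_distribution<size_t> dist(weights.begin(), weights.end());
    return kept[dist(rng)].id;
}
```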
I referred to chatglm.cpp (https://github.com/li-plus/chatglm.cpp/blob/main/chatglm.cpp#L825) to implement this function. Is this implementation different from HuggingFace's?
I checked the effect of repeated softmax on the logits; for the llama model the second softmax makes the distribution nearly uniform and the third one makes it exactly uniform. Here is how the probabilities of the most probable token evolve:
```
Softmax 0 times: tensor([[ 0.0000, 8.6805, 9.4036, 8.4245, 10.8393]])
Softmax 1 times: tensor([[3.1250e-05, 2.7906e-01, 2.0102e-01, 7.2127e-02, 2.6164e-01]])
Softmax 2 times: tensor([[3.1250e-05, 4.1308e-05, 3.8207e-05, 3.3586e-05, 4.0595e-05]])
Softmax 3 times: tensor([[3.1250e-05, 3.1250e-05, 3.1250e-05, 3.1250e-05, 3.1250e-05]])
For uniform distribution: 3.125e-05
```
The second softmax makes model predictions useless.
HuggingFace doesn't transform the logits during top_p; it just calculates softmax and filters the original logits based on the result. So sampling_softmax_inplace is not needed inside the sampling_top_p function.
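For illustration, the filtering step itself could look like the sketch below (hypothetical names, written from the description above rather than taken from the PR): the softmax is computed on a local copy only to find the cutoff, and the kept entries carry the untouched logits.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct TokenLogit {   // hypothetical helper: token id plus its original logit
    int id;
    float logit;
};

// Keep the smallest set of top tokens whose cumulative probability exceeds top_p.
// The softmax below is used only to find the cutoff; the returned logits are the
// original, unmodified values, so the caller can apply softmax exactly once later.
std::vector<TokenLogit> top_p_filter(const float* logits, size_t vocab_size, float top_p) {
    std::vector<TokenLogit> tokens(vocab_size);
    for (size_t i = 0; i < vocab_size; ++i)
        tokens[i] = {int(i), logits[i]};
    std::sort(tokens.begin(), tokens.end(),
              [](const TokenLogit& a, const TokenLogit& b) { return a.logit > b.logit; });

    // Softmax over a local copy of the sorted logits.
    std::vector<double> probs(vocab_size);
    double max_logit = tokens.front().logit;
    double sum = 0.0;
    for (size_t i = 0; i < vocab_size; ++i) {
        probs[i] = std::exp(tokens[i].logit - max_logit);
        sum += probs[i];
    }

    double cumulative = 0.0;
    size_t kept = vocab_size;
    for (size_t i = 0; i < vocab_size; ++i) {
        cumulative += probs[i] / sum;
        if (cumulative > top_p) {
            kept = i + 1;
            break;
        }
    }
    tokens.resize(kept);
    return tokens;
}
```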
Code for the softmax probs:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_checkpoint = "JackFram/llama-68m"
hf_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
hf_model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

texts = ["This is a test"]
tokenized = hf_tokenizer(texts, return_tensors="pt")
with torch.no_grad():
    t = hf_model(**tokenized).logits

print("Softmax 0 times:", torch.max(t, dim=-1).values)
for i in range(1, 4):
    t = torch.nn.functional.softmax(t, dim=-1)
    print(f"Softmax {i} times:", torch.max(t, dim=-1).values)
print(f"For uniform distribution: {1 / hf_tokenizer.vocab_size}")
```
@apaniukov Thanks a lot!
I re-implemented the sampling_top_p function; the implementation is similar to HuggingFace's and doesn't transform the logits during top_p.
```cpp
    out_token = std::max_element(logits, logits + vocab_size) - logits;
}

prompt.push_back(out_token);
```
It seems token accumulation is needed for applying sampling_repetition_penalty. These ids need to be preserved between get_out_token calls.
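For example (a sketch only; the signature and the surrounding loop are assumptions, not the PR's actual code), the generated ids could be accumulated in a vector that outlives the per-step call and passed to the penalty each step:

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Penalize tokens that already appeared in the sequence (CTRL-style penalty):
// positive logits are divided by the penalty, negative ones multiplied by it.
void sampling_repetition_penalty(float* logits, size_t vocab_size,
                                 const std::vector<int64_t>& accumulated_ids,
                                 float penalty) {
    std::set<int64_t> unique_ids(accumulated_ids.begin(), accumulated_ids.end());
    for (int64_t id : unique_ids) {
        if (id < 0 || size_t(id) >= vocab_size)
            continue;
        float& logit = logits[id];
        logit = (logit > 0) ? logit / penalty : logit * penalty;
    }
}

// Inside the generation loop the ids are preserved across get_out_token calls:
//
//   sampling_repetition_penalty(logits, vocab_size, accumulated_ids, repetition_penalty);
//   int64_t out_token = get_out_token(logits, vocab_size, config);   // hypothetical call
//   accumulated_ids.push_back(out_token);
```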
Do you mean that the logits info remains the same before and after get_out_token calls?
@wenyi5608, random sampling decoding based on your PR was merged into the generate API: pavel-esir#6. The generate API is being prepared for the merge into the main repository: #334. Thank you for your contribution!
Greedy search: added repeat_penalty, top_k, top_p and temperature functions