
Wenyi5608 greedy sampling #293

Closed

Conversation

wenyi5608 (Contributor)

Greedy search: added repeat_penalty, top_k, top_p, and temperature functions.

@ilya-lavrenov ilya-lavrenov self-assigned this Mar 8, 2024
@ilya-lavrenov ilya-lavrenov requested a review from Wovchena March 8, 2024 08:27
@ilya-lavrenov (Contributor)

Could you maybe create a generic function which performs sampling and accepts top_p, top_k, num_beams, and temperature? Currently we already have beam search with repetition penalty, but the other parameters have no effect there.

Ideally, we need a single method similar to HF's generate, which hides the generation loop from the user. Could you please try it?
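As an illustration, a unified entry point modeled on HF's generate could look roughly like the sketch below; GenerationConfig, generate, and the forward-step callback are illustrative names, not the project's actual API:

#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// All sampling knobs live in one config object, as in HF's GenerationConfig.
struct GenerationConfig {
    size_t max_new_tokens = 100;
    size_t num_beams = 1;          // >1 would dispatch to (group) beam search
    bool do_sample = false;        // false -> plain greedy argmax
    size_t top_k = 50;
    float top_p = 1.0f;
    float temperature = 1.0f;
    float repetition_penalty = 1.0f;
};

// The generation loop (run model -> postprocess logits -> pick token -> append)
// would be hidden behind a single call, so callers only choose parameters.
std::vector<int64_t> generate(
    const std::function<std::vector<float>(const std::vector<int64_t>&)>& forward_step,
    std::vector<int64_t> input_ids,
    const GenerationConfig& config);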

@wenyi5608 (Contributor, Author)

@ilya-lavrenov HuggingFace's group_beam_search does not implement sampling (top_p, top_k, and temperature); only beam_search and greedy_search support sampling. genai implements the group_beam_search and greedy_search functions.

Comment on lines +146 to +148
sampling_softmax_inplace(token_scores.data(), token_scores.data() + token_scores.size());
for (size_t i = 0; i < token_scores.size(); i++) {
logits[i] = token_scores[i].score;
@apaniukov (Contributor), Apr 12, 2024

You must take the values from logits for the remaining tokens here, not scores, and then do the softmax on them. The softmax already modified the scores in the top_p call, so the second softmax will skew the distribution towards the most probable token.
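
In other words, a rough sketch of that fix (assuming token_scores[i].id holds the original vocabulary index, as in chatglm.cpp's TokenIdScore; this is not the PR's exact code):

// Take the raw logit of each surviving token, not the score that the top_p
// call already pushed through softmax.
for (size_t i = 0; i < token_scores.size(); i++) {
    token_scores[i].score = logits[token_scores[i].id];
}
// A single softmax now turns the raw logits into a proper distribution.
sampling_softmax_inplace(token_scores.data(), token_scores.data() + token_scores.size());
for (size_t i = 0; i < token_scores.size(); i++) {
    logits[i] = token_scores[i].score;
}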

wenyi5608 (Contributor, Author)

I referred to chatglm.cpp (https://github.com/li-plus/chatglm.cpp/blob/main/chatglm.cpp#L825) to implement this function.
Is this implementation different from HuggingFace's?

apaniukov (Contributor)

I checked the effect of repeated softmax on the logits for the llama model: the second softmax makes the distribution nearly uniform, and the third one makes it exactly uniform. Here is how the probability of the most probable token evolves:

Softmax 0 times: tensor([[ 0.0000,  8.6805,  9.4036,  8.4245, 10.8393]])
Softmax 1 times: tensor([[3.1250e-05, 2.7906e-01, 2.0102e-01, 7.2127e-02, 2.6164e-01]])
Softmax 2 times: tensor([[3.1250e-05, 4.1308e-05, 3.8207e-05, 3.3586e-05, 4.0595e-05]])
Softmax 3 times: tensor([[3.1250e-05, 3.1250e-05, 3.1250e-05, 3.1250e-05, 3.1250e-05]])
For uniform distribution: 3.125e-05

The second softmax makes model predictions useless.

Huggingface doesn't transform the logits during top_p; it just calculates softmax and filters the original logits based on the result. So sampling_softmax_inplace is not needed inside the sampling_top_p function.

Code for the softmax probs:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


model_checkpoint = "JackFram/llama-68m"
hf_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
hf_model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
texts = ["This is a test"]
tokenized = hf_tokenizer(texts, return_tensors="pt")

with torch.no_grad():
    t = hf_model(**tokenized).logits
    print("Softmax 0 times:", torch.max(t, dim=-1).values)
    # Apply softmax repeatedly and track the largest value per position.
    for i in range(1, 4):
        t = torch.nn.functional.softmax(t, dim=-1)
        print(f"Softmax {i} times:", torch.max(t, dim=-1).values)

# A fully uniform distribution assigns 1 / vocab_size to every token.
print(f"For uniform distribution: {1 / hf_tokenizer.vocab_size}")

wenyi5608 (Contributor, Author)

@apaniukov Thanks a lot!

wenyi5608 (Contributor, Author)

I re-implemented the sampling_top_p function; the implementation is similar to HuggingFace's.

out_token = std::max_element(logits, logits + vocab_size) - logits;
}

prompt.push_back(out_token);
Contributor

It seems token accumulation is needed for applying sampling_repetition_penalty. These ids need to be preserved between get_out_token calls.
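
A small sketch of that idea, using a common divide/multiply formulation of the penalty; sampling_repetition_penalty, accumulated_ids, and the loop shown in the trailing comment are illustrative names based on this discussion, not the PR's exact code:

#include <cstdint>
#include <unordered_set>
#include <vector>

void sampling_repetition_penalty(std::vector<float>& logits,
                                 const std::vector<int64_t>& generated_ids,
                                 float penalty) {
    // Penalize every token id seen so far exactly once.
    std::unordered_set<int64_t> seen(generated_ids.begin(), generated_ids.end());
    for (int64_t id : seen) {
        float& value = logits[id];
        value = value > 0 ? value / penalty : value * penalty;
    }
}

// The id history then lives outside get_out_token and grows with each step:
//
//   std::vector<int64_t> accumulated_ids = prompt;  // starts with the prompt tokens
//   while (/* not finished */) {
//       int64_t out_token = get_out_token(logits, accumulated_ids, config);
//       accumulated_ids.push_back(out_token);
//   }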

wenyi5608 (Contributor, Author)

Do you mean that the logits info remains the same before and after get_out_token calls?

@as-suvorov (Contributor)

@wenyi5608, random sampling decoding based on your PR was merged into the generate API: pavel-esir#6. The generate API is being prepared for merging into the main repository: #334. Thank you for your contribution!
