CodeLlama support #270
I've added this with the latest commit. I haven't thoroughly tested it, but it's a small change, just reading the value from the config and applying it. Let me know if there are any issues with it.
What will it do with how it is in textgen now?
Will theta be overwritten or will this param get ignored?
It will calculate the rotary embedding base from the rope_theta value read from the model's config, scaled by the NTK alpha value.
So if you were using an alpha value of 93 to get the correct theta for CodeLlama (i.e. 1,000,000), you would now use an alpha value of 1 instead for the same result, as long as the config file specifies the correct theta for the model.
So with alpha, it will need to be changed to set the base to 10000.0 and then apply it. And if theta is specified, just apply it directly.
No, only if you want to use an incorrect base. The right base for Llama is 10,000 and for CodeLlama it's 1,000,000. The NTK alpha value is relative to that, and the only reason people have been using alpha = 93 is to get the above calculation to yield a base of 1,000,000, which is now read directly from the config instead. If you want to use an alpha value of, say, 2.4, that now means the same thing for CodeLlama as it does for Llama or Llama2 (i.e. scale the RoPE frequency base by 2.4^(128/126)).
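In code, the scaling described above amounts to something like this (a rough sketch; the function and variable names are illustrative rather than exllama's actual config fields, and a head dimension of 128 is assumed, which is where the 128/126 exponent comes from):

```python
# Sketch of the NTK-alpha scaling described above (illustrative names,
# head_dim = 128 assumed; not exllama's actual implementation).
def scaled_rope_base(theta: float, alpha: float, head_dim: int = 128) -> float:
    """Scale the model's RoPE frequency base (theta) by the NTK alpha factor."""
    return theta * alpha ** (head_dim / (head_dim - 2))

print(scaled_rope_base(10_000.0, 2.4))     # Llama / Llama2: theta = 10,000
print(scaled_rope_base(1_000_000.0, 2.4))  # CodeLlama: theta = 1,000,000 from the config
```

With theta read from the config, alpha = 1 leaves the model's own base untouched, which is why CodeLlama no longer needs alpha = 93.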
Right, but I ran perplexity tests on CodeLlama 34b + Airoboros 2.1 LoRA, and the numbers are lower by more than a whole point when you use an alpha of ~2.7 than when you set the "correct" base. The divide grew further at longer context. Try it yourself if you have them. So for this one case, at least, it's not good to read the config and auto-apply the original theta. I'm sure this will be true for other tunes off that model in the future, depending on what people do when training. The way this is now, I will be forced to use the 1 million base. I haven't tested what happens with Phind, WizardCoder, or when the LoRA is merged; that's still TBD. I read posts where people were having trouble replicating Phind results, and my gut says this is related.
I'm not sure what the best approach is, then. Theta = 10,000 and alpha = 2.7 would give base = 10,000 * 2.7^(128/126) = 27,429. To get a base of 27,429 with theta = 1,000,000, you'd want alpha = (27,429 / 1,000,000)^(126/128) = 0.029.
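As a quick sanity check of that arithmetic (purely illustrative, not exllama code):

```python
# Convert an alpha chosen for one theta into the equivalent alpha for another
# theta, by matching the resulting frequency base. Head dimension 128 assumed.
def base_from_alpha(theta: float, alpha: float, head_dim: int = 128) -> float:
    return theta * alpha ** (head_dim / (head_dim - 2))

def alpha_from_base(theta: float, base: float, head_dim: int = 128) -> float:
    return (base / theta) ** ((head_dim - 2) / head_dim)

base = base_from_alpha(10_000.0, 2.7)             # ~27,429
print(base, alpha_from_base(1_000_000.0, base))   # ~0.029
```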
I'm not either. That's a lot of gymnastics, but most people usually want to run at spec. It's also probably why we aren't seeing so many fine-tunes for the 34b: people think the model is shit because they set the wrong settings. Tested Samantha 1.1 with PTB_NEW. It's GGML, but same difference: 1e6 = 44.x PPL. Wish I had downloaded the GPTQ version instead. Bet there is a more ideal rope base.
@Ph0rk0z, @turboderp, hey guys, just bumping this issue. I brought up the discussion about RoPE base and RoPE frequency in a previous issue. I'm not sure if it's possible to integrate this type of technique into exllama. Also, creating exllama generators is pretty fast, so we could also write a simple regression testing script for PPL with a list of alpha and theta values that covers the likely ideal combinations.
The position embeddings aren't a bottleneck really, so it's possible other positional encoding schemes could be just as fast. But I'm always hesitant to jump on any new proposed format or technique, because it's literally impossible to keep up. Most of them don't hold up, as it turns out, and even when they're sound they don't always catch on, which leads to dead code that can potentially cause problems down the line. So it's not so much a question of performance, but more about how many different features I can realistically implement and maintain, and which ones are actually worth allocating time for.
@turboderp understood. Here is my proposal. First, instead of replacing the current rotary embedding calculation, keep both as options. Second, let's just pull the RoPE calculations from the llama.cpp repo; this will be easier and faster, and given the nature of how rotary embeddings function, it should not be a problem. Third, while not necessary, an additional testing script for PPL, and maybe a review of sample outputs, would be nice, just to see what the optimal alpha and theta values actually are. I'd be happy to formalize this into a spec now. In terms of implementation, I will take a deep dive in a couple of weeks, assuming no one else is working on it.
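If someone wants to prototype the PPL sweep from the third point before a formal spec, the shape of it would be something like the sketch below. `load_model` and `evaluate_ppl` are hypothetical placeholders for whatever exllama model setup and perplexity evaluation already exist; they are not real exllama APIs.

```python
# Hypothetical (theta, alpha) regression sweep; the two placeholder functions
# must be wired up to the existing exllama model setup and perplexity script.
import itertools

def load_model(rope_theta: float, ntk_alpha: float):
    """Placeholder: build the model/generator with the given RoPE settings."""
    raise NotImplementedError

def evaluate_ppl(model, dataset: str) -> float:
    """Placeholder: run the existing perplexity evaluation on `dataset`."""
    raise NotImplementedError

thetas = [10_000.0, 1_000_000.0]
alphas = [1.0, 2.0, 2.7, 4.0]

results = []
for theta, alpha in itertools.product(thetas, alphas):
    model = load_model(rope_theta=theta, ntk_alpha=alpha)
    results.append((theta, alpha, evaluate_ppl(model, dataset="wikitext-2")))

# Report best-to-worst so the ideal (theta, alpha) pair is easy to spot.
for theta, alpha, ppl in sorted(results, key=lambda r: r[2]):
    print(f"theta={theta:>11,.0f}  alpha={alpha:4.1f}  ppl={ppl:.3f}")
```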
Hi, CodeLlama2 is not really working on exllama.
The answers are sometimes complete gibberish.
Can you please update the library to support the new rope_theta parameter of CodeLlama?
Best regards