Hello,
I'm testing training with your code.
Thank you, as always.
I have created a dataset that mixes 8k and 64k samples at a 1:1 ratio.
I then ran training with your code, but it failed with the following error:
q_embed = (q * cos) + (rotate_half(q) * sin)
RuntimeError: The size of tensor a (1024) must match the size of tensor b (8192) at non-singleton dimension 2
0%| | 0/301 [00:00<?, ?it/s]
My guess is that the 64k samples train without problems and that the error appears when an 8k sample is processed.
Do all samples in the dataset need to have the same length for training?
For samples shorter than seq-length, should I pad them?
Thanks for your help.
--seq-length 65535 \
I think this is a problem with the RoPE cache. If we train on a 64K sequence with each card holding 8K tokens, some RoPE implementations in HF only build an 8K sin-cos cache (for example Qwen2 and Mistral; Llama 3 does not have this issue). But the position indices we use can range from 0 to 64K.
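To illustrate the kind of mismatch I mean, here is a minimal sketch in plain Python. It is not the actual HF code: `build_rope_cache` and `lookup` are hypothetical helpers, and the sizes are scaled down (64 total positions split over 8 cards instead of 64K over 8) so it runs instantly. It assumes the cache is sized from the local shard length while the position indices are global.

```python
import math

def build_rope_cache(num_positions, head_dim, base=10000.0):
    """Hypothetical helper: build (cos, sin) tables with one row per position."""
    inv_freq = [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]
    cos, sin = [], []
    for pos in range(num_positions):
        angles = [pos * f for f in inv_freq]
        angles = angles + angles  # duplicate so each row has head_dim entries
        cos.append([math.cos(a) for a in angles])
        sin.append([math.sin(a) for a in angles])
    return cos, sin

def lookup(cache, position_ids):
    """Hypothetical helper: gather the cos/sin rows for the given positions."""
    cos, sin = cache
    if max(position_ids) >= len(cos):
        raise IndexError(
            f"position {max(position_ids)} is outside a cache of {len(cos)} rows"
        )
    return [cos[p] for p in position_ids], [sin[p] for p in position_ids]

TOTAL_LEN = 64   # stands in for the 64K sequence (assumed, scaled down)
WORLD_SIZE = 8   # number of cards (assumed)
LOCAL_LEN = TOTAL_LEN // WORLD_SIZE  # tokens per card, stands in for 8K
HEAD_DIM = 8

# The last card holds the global positions 56..63.
rank = WORLD_SIZE - 1
position_ids = list(range(rank * LOCAL_LEN, (rank + 1) * LOCAL_LEN))

# Cache sized from the local shard: too small for global positions.
try:
    lookup(build_rope_cache(LOCAL_LEN, HEAD_DIM), position_ids)
except IndexError as err:
    print("local-sized cache fails:", err)

# Cache sized from the full sequence: every global position has a row.
cos, sin = lookup(build_rope_cache(TOTAL_LEN, HEAD_DIM), position_ids)
print("full-sized cache rows:", len(cos))
```

In a real model the symptom can show up as a shape mismatch rather than an index error, depending on how the implementation slices its cache, so treat this only as a picture of the idea.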