I get this error: WARNING: tokenization mismatch: 156 vs. 161. (ignored) when I finetune llama3 #126

Open · shidingz opened this issue Jun 5, 2024 · 1 comment


shidingz commented Jun 5, 2024

When I run the script scripts/llama3/train/stage_2_full_v8b_672_hr_1536.sh, I encounter this error: WARNING: tokenization mismatch: 156 vs. 161. (ignored)
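For reference, this warning typically comes from a final length check in LLaVA-style preprocessing; a minimal sketch of that check (variable names assumed from the LLaVA codebase, not verified against this repo):

    # Sketch of the check that emits the warning. cur_len is the number of
    # tokens accounted for round by round; total_len is the true length of
    # the tokenized conversation.
    if cur_len < tokenizer.model_max_length:
        if cur_len != total_len:
            # The whole sample is masked out, i.e. "(ignored)".
            target[:] = IGNORE_INDEX
            print(
                f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}."
                f" (ignored)"
            )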


shidingz commented Jun 5, 2024

I found that the llama3 template has a problem: with multi-turn conversations, WARNING: tokenization mismatch appears.

def preprocess_llama_3(
    sources,
    tokenizer: transformers.PreTrainedTokenizer,
    has_image: bool = False
) -> Dict:
Isn't this part of the code inside the function incorrect?
    # include <bos> for all rounds
    cur_len = 1
    target[:cur_len] = IGNORE_INDEX
    for i, rou in enumerate(re_rounds):
        if rou == "":
            break

        parts = rou.split(sep)
        if len(parts) != 2:
            print(f"WARNING: parts!=: {parts}")
            break
        parts[0] += sep

        # include <bos> for all rounds
        if has_image:
            round_len = len(tokenizer_image_token(rou, tokenizer)) - 1
            instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 2
        else:
            round_len = len(tokenizer(rou).input_ids) - 1
            instruction_len = len(tokenizer(parts[0]).input_ids) - 2

        # include <|eot_id|> for all rounds
        round_len += 1
        instruction_len += 1

        target[cur_len : cur_len + instruction_len] = IGNORE_INDEX
        cur_len += round_len

    target[cur_len:] = IGNORE_INDEX

The template does not add a BOS token, so why set

    cur_len = 1
    target[:cur_len] = IGNORE_INDEX

And in this part,

    round_len = len(tokenizer_image_token(rou, tokenizer)) - 1
    instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 2

what are the -1 and -2 offsets for, respectively?
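One way to check the BOS assumption directly (a minimal sketch; the model id is an example and may differ from what the stage-2 script actually loads):

    from transformers import AutoTokenizer

    # Example model id, not taken from this repo's configs.
    tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

    ids = tok("hello").input_ids
    # If the tokenizer prepends <|begin_of_text|>, ids[0] == bos_token_id and
    # the -1/-2 offsets compensate for it. If it does not, each round is
    # under-counted by one token and cur_len drifts further from total_len
    # with every round, matching a multi-turn-only mismatch like 156 vs. 161.
    print(ids[0] == tok.bos_token_id)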
Based on my understanding of this template, it should be modified like this:
    # do not include <bos> for all rounds
    cur_len = 0
    # target[:cur_len] = IGNORE_INDEX
    for i, rou in enumerate(re_rounds):
        if rou == "":
            break

        parts = rou.split(sep)
        if len(parts) != 2:
            print(f"WARNING: parts!=: {parts}")
            break
        parts[0] += sep

        # do not include <bos> for all rounds
        if has_image:
            round_len = len(tokenizer_image_token(rou, tokenizer))
            instruction_len = len(tokenizer_image_token(parts[0], tokenizer))
        else:
            round_len = len(tokenizer(rou).input_ids)
            instruction_len = len(tokenizer(parts[0]).input_ids)

        # include <|eot_id|> for all rounds
        round_len += 1
        instruction_len += 1

        target[cur_len : cur_len + instruction_len] = IGNORE_INDEX
        cur_len += round_len

    target[cur_len:] = IGNORE_INDEX
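If the offsets are right, a quick way to verify the masking on one sample is to decode only the unmasked positions (a sketch, assuming input_ids and target are the tensors built inside preprocess_llama_3):

    # Positions where target != IGNORE_INDEX are the ones that carry loss;
    # decoding them should yield only the assistant replies, nothing else.
    kept = [tid for tid, lab in zip(input_ids.tolist(), target.tolist())
            if lab != IGNORE_INDEX]
    print(tokenizer.decode(kept))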
