
[RWKV5] Add support for RWKV5 model #29095

Draft · wants to merge 102 commits into main
Conversation

@ArthurZucker (Collaborator) commented Feb 19, 2024

What does this PR do?

Adds RWKV5, supersedes #26963

@BBuf commented Mar 28, 2024

> Yeah, we can just set it to infinite; I just used 500 for a quick fix. I don't know if the original one has a limit as well.

I think it would be a good idea to set it to infinite, because RWKV has no sequence-length limit in theory.
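
For context, a minimal sketch of the quick fix versus the unbounded setting; the checkpoint name is illustrative, and int(1e30) is the sentinel value transformers itself uses to mean "no limit":

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("RWKV/rwkv-5-world-1b5", trust_remote_code=True)

# Quick fix mentioned above: an arbitrary finite cap of 500 tokens.
tok.model_max_length = 500

# Effectively infinite: RWKV has no positional limit, so use the huge
# sentinel and let memory, not the tokenizer, bound the sequence length.
tok.model_max_length = int(1e30)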

@ArthurZucker (Collaborator, Author) commented:

Okay, my slow tests are all green for the tokenizer; time to focus on the model!

@ArthurZucker (Collaborator, Author) commented:

The fast CUDA path works thanks to @kashif, but the CPU path does not yet.
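
For reference, a minimal sketch of the sequential recurrence the CPU path needs, for a single head; all names and shapes here are assumptions for illustration, not the PR's actual implementation:

import torch

def naive_rwkv5_cpu(r, k, v, w, u, state):
    # r, k, v: (T, S) receptance/key/value per timestep for one head
    # w, u:    (S, 1) per-channel decay (time_decay) and bonus (time_first)
    # state:   (S, S) recurrent state carried between chunks
    T, S = r.shape
    out = torch.zeros(T, S)
    for t in range(T):
        kv = k[t].unsqueeze(1) @ v[t].unsqueeze(0)  # outer product, (S, S)
        out[t] = r[t] @ (u * kv + state)            # current token gets the bonus
        state = kv + w * state                      # then decay the state
    return out, state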

@JL-er commented Apr 4, 2024

I set trust_remote_code=True but still get an error (error screenshots omitted). Command:

python3 utils/prepare_dataset.py -i JeanKaddour/minipile -o /users/aigclab/copilot/data/Msample

import click

import datasets
from transformers import AutoTokenizer

import torch


def _rechunk_tokenize(rechunk_size: int, input_column: str, output_column: str, examples):
    # Instantiated inside the map function so it works with num_proc > 1 workers.
    tokenizer = AutoTokenizer.from_pretrained(
        "RWKV/rwkv-6-world-1b6", trust_remote_code=True
    )

    # Token id 0 is appended after each example as a document separator.
    special_token = torch.tensor([0], dtype=torch.long)
    seqs = []
    for e in examples[input_column]:
        seq = tokenizer(e, padding=False, truncation=False, return_tensors="pt")
        seqs.append(seq.input_ids[0])
        seqs.append(special_token)
    seqs = torch.cat(seqs)
    # Drop the tail so the total length is a multiple of rechunk_size,
    # then reshape into fixed-size rows.
    rechunked = seqs[: (seqs.size(0) // rechunk_size) * rechunk_size].view(
        -1, rechunk_size
    )
    return {output_column: rechunked}


@click.command()
@click.option("--rechunk_size", default=513, help="Rechunk size for the dataset")
@click.option("--input_column", default="text", help="Column to tokenize")
@click.option(
    "--output_column", default="input_ids", help="Output column for the tokenized dataset"
)
@click.option(
    "-i",
    "--input_name",
    help="HuggingFace dataset name to tokenize, accept format:"
    + '"dataset_name" or "json:file_a,file_b,..."',
)
@click.option("-o", "--output_dir", help="Output directory for the tokenized dataset")
def main(rechunk_size, input_column, output_column, input_name, output_dir):
    print(f"Tokenizing HuggingFace dataset {input_name} to locally saved {output_dir}")
    if ":" in input_name:
        input_name, input_data_file = input_name.split(":")
        if "," in input_data_file:
            input_data_file = input_data_file.split(",")
    else:
        input_data_file = None
    dataset = datasets.load_dataset(
        input_name, data_files=input_data_file, trust_remote_code=True
    )
    dataset.shuffle().flatten_indices(num_proc=8).map(
        lambda x: _rechunk_tokenize(rechunk_size, input_column, output_column, x),
        batched=True,
        remove_columns=dataset["train"].column_names,
        num_proc=8,
    ).save_to_disk(output_dir, num_proc=8)


if __name__ == "__main__":
    main()

@JL-er commented Apr 4, 2024

A month ago, there was no problem.

@SmerkyG commented Apr 4, 2024

Important: maybe less problematic for v5 (or maybe not!), but I found that for v6 the following line is absolutely terrible for inference accuracy:

out = self.ln_x(rwkv.to(hidden.dtype)).view(batch, seq_length, -1)

versus the original @BBuf version (which I tweaked and adapted to v6):

out = F.group_norm(
    out / self.config.head_size_divisor,
    num_groups=H,
    weight=self.ln_x.weight.to(out.dtype),
    bias=self.ln_x.bias.to(out.dtype),
    eps=self.ln_x.eps,
).reshape(B, T, H * S)
out = out.to(dtype=hidden.dtype) * gate

The issue is that the potential down-cast to bf16 prior to the groupnorm causes really bad inference quality. If you look closely, this is written differently from the original @BBuf version, where the down-cast occurs after the groupnorm during non-CUDA inference.

Please see these lines of Bo Peng's original ChatRWKV code for reference about this groupnorm needing float32:
https://github.com/BlinkDL/ChatRWKV/blob/28ed01a8423842c3082f668922a1b45ac182dff0/rwkv_pip_package/src/rwkv/model.py#L377
https://github.com/BlinkDL/ChatRWKV/blob/28ed01a8423842c3082f668922a1b45ac182dff0/rwkv_pip_package/src/rwkv/model.py#L669
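
To make the ordering concrete, here is a minimal, self-contained sketch of the two variants; shapes and names are illustrative, not the PR's code:

import torch
import torch.nn.functional as F

B, T, H, S = 2, 8, 4, 16                    # batch, seq, heads, head size
rwkv = torch.randn(B * T, H * S)            # kernel output, still float32
ln_x = torch.nn.GroupNorm(num_groups=H, num_channels=H * S)
hidden_dtype = torch.bfloat16               # dtype of the rest of the model

# Problematic ordering: the round-trip through bf16 *before* the groupnorm
# discards mantissa bits that the normalization statistics depend on.
out_lossy = ln_x(rwkv.to(hidden_dtype).float()).view(B, T, -1)

# Preferred ordering: normalize in float32, down-cast only afterwards.
out = F.group_norm(
    rwkv,                                   # still float32 here
    num_groups=H,
    weight=ln_x.weight.to(rwkv.dtype),
    bias=ln_x.bias.to(rwkv.dtype),
    eps=ln_x.eps,
).view(B, T, -1)
out = out.to(hidden_dtype)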

@ArthurZucker (Collaborator, Author) commented:

Will update the group norm!

@BBuf commented Apr 6, 2024

> I set trust_remote_code=True but still get an error. [...] (full comment quoted above)

Bug fixed by BBuf/RWKV-World-HF-Tokenizer@6dd44c8; it has no relation to this PR. I will update the HF repo RWKV/rwkv-6-world-1b6 later.

@BBuf commented Apr 22, 2024

> BTW @BBuf, don't you think it would be great to have a separate GitHub repo with installables for the kernels? (You can track usage, and it's easier to maintain and propagate here!) WDYT?

This has been solved in https://huggingface.co/RWKV/rwkv-5-world-1b5/blob/main/modeling_rwkv5.py ; you need to pip install flash-rwkv first, and then modeling_rwkv5.py in this PR can be replaced directly.
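
For anyone who wants to try it, a rough sketch of that flow, assuming the Hub checkpoint's remote code wires up the kernels (untested here):

# pip install flash-rwkv transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code pulls in the modeling_rwkv5.py hosted on the Hub,
# which imports the flash-rwkv kernels installed above.
model = AutoModelForCausalLM.from_pretrained(
    "RWKV/rwkv-5-world-1b5", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "RWKV/rwkv-5-world-1b5", trust_remote_code=True
)

inputs = tokenizer("Hello", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))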
