remove fastchat and sharegpt (#2021)
* remove fastchat and sharegpt

* remove imports

* remove more fastchat imports

* chore: remove unused functions

* feat: remove sharegpt and deprecate from docs

* chore: remove unused sharegpt checks

* fix: remove sharegpt type from tests

* feat: add sharegpt deprecation error

* feat: update readme

---------

Co-authored-by: NanoCode012 <[email protected]>
winglian and NanoCode012 authored Nov 8, 2024
1 parent 3265b70 commit fd3b807
Showing 22 changed files with 28 additions and 1,804 deletions.
7 changes: 3 additions & 4 deletions README.md
@@ -383,11 +383,10 @@ See [examples](examples) for quick start. It is recommended to duplicate and mod
- typescript
type: ... # unimplemented custom format
# fastchat conversation (deprecation soon, use chat_template https://axolotl-ai-cloud.github.io/axolotl/docs/dataset-formats/conversation.html#chat_template)
# See 'conversation' options: https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
# chat_template https://axolotl-ai-cloud.github.io/axolotl/docs/dataset-formats/conversation.html#chat_template
- path: ...
type: sharegpt
conversation: chatml # default: vicuna_v1.1
type: chat_template
chat_template: chatml # defaults to tokenizer's chat_template
# local
- path: data.jsonl # or json
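The README change above swaps `type: sharegpt` for `type: chat_template`, which also implies a data-shape change: `chat_template` strategies consume `role`/`content` messages rather than ShareGPT's `from`/`value` turns. A minimal sketch of that conversion, assuming the conventional `human`/`gpt`/`system` role names (the mapping and function name are illustrative, not axolotl's API):

```python
# Sketch: convert one ShareGPT-style record into chat_template-style messages.
# ROLE_MAP covers the standard ShareGPT roles; extend it for custom roles.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def sharegpt_to_messages(record: dict) -> list[dict]:
    """Map [{"from": ..., "value": ...}] turns to [{"role": ..., "content": ...}]."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in record["conversations"]
    ]

record = {"conversations": [
    {"from": "system", "value": "You are helpful."},
    {"from": "human", "value": "Hi"},
    {"from": "gpt", "value": "Hello!"},
]}
messages = sharegpt_to_messages(record)
```

A one-off pass like this over a ShareGPT JSONL file is one way to migrate existing datasets to the messages format before switching the config.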
2 changes: 1 addition & 1 deletion devtools/dev_chat_template.yml
@@ -1,4 +1,4 @@
# Example config for debugging the sharegpt prompt format
# Example config for debugging the chat_template prompt format
base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
11 changes: 1 addition & 10 deletions docs/config.qmd
@@ -83,7 +83,7 @@ lora_on_cpu: true
datasets:
# HuggingFace dataset repo | s3://,gs:// path | "json" for local dataset, make sure to fill data_files
- path: vicgalle/alpaca-gpt4
# The type of prompt to use for training. [alpaca, sharegpt, gpteacher, oasst, reflection]
# The type of prompt to use for training. [alpaca, gpteacher, oasst, reflection]
type: alpaca # format | format:<prompt_style> (chat/instruct) | <prompt_strategies>.load_<load_fn>
ds_type: # Optional[str] (json|arrow|parquet|text|csv) defines the datatype when path is a file
data_files: # Optional[str] path to source data files
@@ -92,15 +92,6 @@ datasets:
train_on_split: train # Optional[str] name of dataset split to load from
revision: # Optional[str] The specific revision of the dataset to use when loading from the Hugging Face Hub. This can be a commit hash, tag, or branch name. If not specified, the latest version will be used. This parameter is ignored for local datasets.

# Optional[str] fastchat conversation type, only used with type: sharegpt
conversation: # Options (see Conversation 'name'): https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
field_human: # Optional[str]. Human key to use for conversation.
field_model: # Optional[str]. Assistant key to use for conversation.
# Add additional keys from your dataset as input or output roles
roles:
input: # Optional[List[str]]. These will be masked based on train_on_input
output: # Optional[List[str]].

# Custom user instruction prompt
- path: repo
type:
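The removed `roles.input`/`roles.output` options above controlled loss masking: turns whose role was listed under `input` were excluded from the loss (per `train_on_input`). A stdlib-only sketch of that idea, with tokenization stubbed out as `str.split()` purely for illustration (the helper and its names are assumptions, not axolotl internals):

```python
# Sketch of input-role masking: labels for masked turns become -100 so the
# loss ignores those tokens; other turns are trained on normally.
IGNORE_INDEX = -100

def build_labels(turns, input_roles):
    input_ids, labels = [], []
    for turn in turns:
        tokens = turn["value"].split()  # stand-in for a real tokenizer
        ids = list(range(len(input_ids), len(input_ids) + len(tokens)))
        input_ids.extend(ids)
        if turn["from"] in input_roles:
            labels.extend([IGNORE_INDEX] * len(ids))  # masked: no loss
        else:
            labels.extend(ids)  # trained on
    return input_ids, labels

turns = [{"from": "human", "value": "What is 2 + 2"},
         {"from": "gpt", "value": "It is 4"}]
ids, labels = build_labels(turns, input_roles={"human"})
```

With `chat_template` datasets, the equivalent masking is driven by the message roles rather than these removed config keys.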
59 changes: 1 addition & 58 deletions docs/dataset-formats/conversation.qmd
@@ -6,72 +6,15 @@ order: 3

## sharegpt

UPDATE: ShareGPT is being deprecated in the next release. Please see `chat_template` section below.
IMPORTANT: ShareGPT is deprecated! Please see the `chat_template` section below.

conversations where `from` is `human`/`gpt`. (optional: first row with role `system` to override default system prompt)

```{.json filename="data.jsonl"}
{"conversations": [{"from": "...", "value": "..."}]}
```

Note: `type: sharegpt` opens special configs:
- `conversation`: enables conversions to many Conversation types. Refer to the 'name' [here](https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py) for options.
- `roles`: allows you to specify the roles for input and output. This is useful for datasets with custom roles such as `tool` etc to support masking.
- `field_human`: specify the key to use instead of `human` in the conversation.
- `field_model`: specify the key to use instead of `gpt` in the conversation.

```yaml
datasets:
path: ...
type: sharegpt

conversation: # Options (see Conversation 'name'): https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
field_human: # Optional[str]. Human key to use for conversation.
field_model: # Optional[str]. Assistant key to use for conversation.
# Add additional keys from your dataset as input or output roles
roles:
input: # Optional[List[str]]. These will be masked based on train_on_input
output: # Optional[List[str]].
```

## pygmalion

```{.json filename="data.jsonl"}
{"conversations": [{"role": "...", "value": "..."}]}
```

## sharegpt.load_role

conversations where `role` is used instead of `from`

```{.json filename="data.jsonl"}
{"conversations": [{"role": "...", "value": "..."}]}
```

## sharegpt.load_guanaco

conversations where `from` is `prompter`/`assistant` instead of the default sharegpt roles

```{.json filename="data.jsonl"}
{"conversations": [{"from": "...", "value": "..."}]}
```

## sharegpt.load_ultrachat

conversations where the turns field is 'messages', human is 'user' and gpt is 'assistant'.

```{.json filename="data.jsonl"}
{"messages": [{"user": "...", "assistant": "..."}]}
```

## sharegpt_jokes

creates a chat where bot is asked to tell a joke, then explain why the joke is funny

```{.json filename="data.jsonl"}
{"conversations": [{"title": "...", "text": "...", "explanation": "..."}]}
```


## chat_template

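For intuition about what `chat_template: chatml` produces, here is a minimal sketch of chatml-style rendering. In practice the tokenizer's own Jinja `chat_template` does this (or the one named in the YAML); this illustrates only the target format, not axolotl's implementation:

```python
# Sketch: render role/content messages in the chatml format
# (<|im_start|>role\ncontent<|im_end|>), optionally appending the
# assistant header that prompts generation.
def render_chatml(messages, add_generation_prompt=False):
    out = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        out += "<|im_start|>assistant\n"
    return out

prompt = render_chatml(
    [{"role": "user", "content": "Hi"}], add_generation_prompt=True
)
```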
1 change: 0 additions & 1 deletion requirements.txt
@@ -28,7 +28,6 @@ scipy
scikit-learn==1.4.2
pynvml
art
fschat @ git+https://github.com/lm-sys/FastChat.git@27a05b04a35510afb1d767ae7e5990cbd278f8fe
gradio==3.50.2
tensorboard
python-dotenv==1.0.1
21 changes: 0 additions & 21 deletions src/axolotl/cli/preprocess.py
@@ -23,10 +23,6 @@
)
from axolotl.common.cli import PreprocessCliArgs
from axolotl.common.const import DEFAULT_DATASET_PREPARED_PATH
from axolotl.prompt_strategies.sharegpt import (
register_chatml_template,
register_llama3_template,
)
from axolotl.utils.trainer import disable_datasets_caching

LOG = logging.getLogger("axolotl.cli.preprocess")
@@ -44,23 +40,6 @@ def do_cli(config: Union[Path, str] = Path("examples/"), **kwargs):
return_remaining_strings=True
)

if parsed_cfg.chat_template == "chatml":
if parsed_cfg.default_system_message:
LOG.info(
f"ChatML set. Adding default system message: {parsed_cfg.default_system_message}"
)
register_chatml_template(parsed_cfg.default_system_message)
else:
register_chatml_template()
elif parsed_cfg.chat_template == "llama3":
if parsed_cfg.default_system_message:
LOG.info(
f"LLaMA-3 set. Adding default system message: {parsed_cfg.default_system_message}"
)
register_llama3_template(parsed_cfg.default_system_message)
else:
register_llama3_template()

if not parsed_cfg.dataset_prepared_path:
msg = (
Fore.RED
19 changes: 0 additions & 19 deletions src/axolotl/cli/train.py
@@ -19,10 +19,6 @@
)
from axolotl.common.cli import TrainerCliArgs
from axolotl.integrations.base import PluginManager
from axolotl.prompt_strategies.sharegpt import (
register_chatml_template,
register_llama3_template,
)
from axolotl.train import train

LOG = logging.getLogger("axolotl.cli.train")
@@ -42,21 +38,6 @@ def do_train(cfg, cli_args) -> None:
print_axolotl_text_art()
check_accelerate_default_config()
check_user_token()
if cfg.chat_template == "chatml" and cfg.default_system_message:
LOG.info(
f"ChatML set. Adding default system message: {cfg.default_system_message}"
)
register_chatml_template(cfg.default_system_message)
else:
register_chatml_template()

if cfg.chat_template == "llama3" and cfg.default_system_message:
LOG.info(
f"LLaMA-3 set. Adding default system message: {cfg.default_system_message}"
)
register_llama3_template(cfg.default_system_message)
else:
register_llama3_template()

if cfg.rl: # and cfg.rl != "orpo":
dataset_meta = load_rl_datasets(cfg=cfg, cli_args=cli_args)
