This repository contains a script to convert chat conversations from a JSON file to a text file, with formatting suitable for training a language model using the FastChat framework. The script relies on the FastChat repository to preprocess and tokenize the conversations before writing them to the output file.
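As a rough illustration of the conversion idea, here is a minimal sketch. It assumes the input follows the ShareGPT layout: a JSON array of records, each with a "conversations" list of {"from", "value"} turns. The function names and the USER/ASSISTANT labels are hypothetical, and the sketch uses the standard json module rather than ijson and skips FastChat's preprocessing and tokenization entirely, so it is not the actual process_conversations.py logic.

```python
import json


def conversation_to_text(record, role_names=None):
    """Flatten one ShareGPT-style record into a single block of text."""
    # Hypothetical role labels, chosen only for this illustration.
    role_names = role_names or {"human": "USER", "gpt": "ASSISTANT"}
    lines = []
    for turn in record.get("conversations", []):
        # Fall back to the raw speaker tag if it is not in the mapping.
        speaker = role_names.get(turn.get("from"), turn.get("from", "UNKNOWN"))
        lines.append(f"{speaker}: {turn.get('value', '').strip()}")
    return "\n".join(lines)


def convert(input_json_path, output_txt_path):
    """Write every non-empty conversation in the input file to the output file."""
    with open(input_json_path, encoding="utf-8") as src:
        # Loads the whole file into memory; a streaming parser such as
        # ijson avoids this for very large inputs.
        records = json.load(src)
    with open(output_txt_path, "w", encoding="utf-8") as dst:
        for record in records:
            text = conversation_to_text(record)
            if text:  # skip records that contain no turns
                dst.write(text + "\n\n")
```

Each conversation ends up as one blank-line-separated block in the output file, which is an assumption of this sketch and not necessarily the layout the real script produces.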
- Python 3.6 or higher
- IJSON library
- FastChat repository (specifically, the train.py script located under the fastchat/train/ directory)
conda create -n fastchat-conversation-converter python=3.10.9
conda activate fastchat-conversation-converter
git clone https://github.com/practicaldreamer/fastchat-conversation-converter
cd fastchat-conversation-converter
pip install ijson
mkdir repos
cd repos
git clone https://github.com/lm-sys/FastChat
cd FastChat
pip install -e .
cd ..
cd ..
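Note: My script was built around FastChat commit 5ccf842. If a newer FastChat version causes errors, checking out that commit inside repos/FastChat (git checkout 5ccf842) may help.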
python process_conversations.py \
--model_path '/home/user/Documents/models/llama-7b' \
--input_json_path '/home/user/Downloads/ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json' \
--output_txt_path '/home/user/Documents/output.txt'
Replace the arguments with the appropriate values for your use case.
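For readers curious how the three flags above might be handled, here is a minimal argparse sketch. Only the flag names come from the command shown; the build_parser helper, the help strings, and the choice to make every flag required are assumptions for illustration, not a copy of the script's actual argument handling.

```python
import argparse


def build_parser():
    """Build a parser for the three flags used in the command above."""
    parser = argparse.ArgumentParser(
        description="Convert chat conversations from JSON to text."
    )
    # Marking each flag as required is an assumption of this sketch.
    parser.add_argument("--model_path", required=True,
                        help="Directory of the model whose tokenizer is used")
    parser.add_argument("--input_json_path", required=True,
                        help="Path to the input JSON file of conversations")
    parser.add_argument("--output_txt_path", required=True,
                        help="Path of the text file to write")
    return parser


# Parse an explicit argument list instead of sys.argv so the example
# can be run as-is; the paths here are placeholders.
args = build_parser().parse_args([
    "--model_path", "models/llama-7b",
    "--input_json_path", "conversations.json",
    "--output_txt_path", "output.txt",
])
print(args.model_path, args.input_json_path, args.output_txt_path)
```

With required flags, argparse exits with a usage message when one is missing, which surfaces a forgotten path before any processing starts.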
This script is built upon the FastChat project. Please refer to the original repository for more information about the framework and its usage.