Tutorial: LLM basics from scratch provides a step-by-step explanation.
cd to the data folder:
cd data
Initialize Git LFS for large files:
git lfs install
Clone the dataset:
git clone https://huggingface.co/datasets/Skylion007/openwebtext
Unzip the dataset:
bash unzip.sh
Back in the root folder, run the following command:
python convert_data.py
It converts all the .xz files in data/openwebtext/subsets and puts the converted .txt files in the folder data/extracted.
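The conversion step can be pictured with a minimal sketch like the one below. It is not the actual contents of convert_data.py, which may work differently; the function name convert_xz_folder is hypothetical, and the sketch assumes each .xz file holds UTF-8 text.

```python
import lzma
from pathlib import Path


def convert_xz_folder(src_dir, dst_dir):
    """Decompress every .xz file in src_dir into a .txt file in dst_dir.

    Hypothetical helper for illustration; assumes the archives contain UTF-8 text.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for xz_path in sorted(src.glob("*.xz")):
        txt_path = dst / (xz_path.stem + ".txt")
        # Copy line by line so a large file is never held in memory all at once.
        with lzma.open(xz_path, "rt", encoding="utf-8") as fin, \
                open(txt_path, "w", encoding="utf-8") as fout:
            for line in fin:
                fout.write(line)
        written.append(txt_path)
    return written
```

Under these assumptions, calling convert_xz_folder("data/openwebtext/subsets", "data/extracted") would mirror the folder layout described above.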
We use neetbox for monitoring. Open localhost:20202 (neetbox's default port) in your browser to check the progress. If you are working on a remote server, you can use ssh -L 20202:localhost:20202 user@remotehost
to forward the port to your local machine, or you can access the server's IP address directly with the port number. Either way, you will see all the processes.
The script will also ask whether you'd like to delete the original .xz files to save disk space. To keep them, type n and press Enter.
python train.py --config config/gptv1_s.toml
Since we are using neetbox for monitoring, open localhost:20202 (neetbox's default port) in your browser to check the progress.
python inference.py --config config/gptv1_s.toml
Open localhost:20202 (neetbox's default port) in your browser and feed text to your model via the action button.
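If the page does not load, especially when working through an SSH tunnel, it can help to check whether anything is listening on the port at all. The sketch below uses only the Python standard library; the function name dashboard_reachable is hypothetical, and it only tests that a TCP connection succeeds, not that neetbox is the service answering.

```python
import socket


def dashboard_reachable(host="localhost", port=20202, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 20202 is the neetbox default port mentioned above.
    print("reachable" if dashboard_reachable() else "not reachable")
```

A result of "not reachable" on your local machine while the script is running remotely usually suggests the port forward is not active.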
For more information, see also LLM basics from scratch.