
Hands-on Large(?) Language Model from scratch

Tutorial: LLM basics from scratch provides a step-by-step explanation.


how to run

Download Dataset

cd into the data folder:

cd data

Initialize Git LFS for Large Files

git lfs install

Clone the dataset:

git clone https://huggingface.co/datasets/Skylion007/openwebtext

Unzip the dataset:

bash unzip.sh

Convert Data

Go back to the root folder and run the following command:

python convert_data.py

It converts all the .xz files in data/openwebtext/subsets and puts the converted .txt files in the data/extracted folder.
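For a sense of what this step involves, here is a minimal sketch, assuming each .xz file decompresses to plain UTF-8 text; the actual convert_data.py may handle the archives differently and also reports its progress to neetbox:

```python
# Minimal sketch of the .xz -> .txt conversion, NOT the repository's convert_data.py.
# Assumption: each .xz file is a plain LZMA-compressed text stream.
import lzma
from pathlib import Path

src = Path("data/openwebtext/subsets")
dst = Path("data/extracted")
dst.mkdir(parents=True, exist_ok=True)

for xz_file in sorted(src.glob("*.xz")):
    out_file = dst / (xz_file.stem + ".txt")
    with lzma.open(xz_file, "rt", encoding="utf-8", errors="ignore") as fin, \
            out_file.open("w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(line)
```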

We are using neetbox for monitoring. Open localhost:20202 (neetbox's default port) in your browser to check the progress. If you are working on a remote server, you can run ssh -L 20202:localhost:20202 user@remotehost to forward the port to your local machine, or access the server's IP address directly with that port number, and you will see all the running processes:

(Screenshot: neetbox web dashboard showing the conversion progress)

The script will also ask whether you'd like to delete the original .xz files to save disk space. If you want to keep them, type n and press Enter.
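The deletion prompt could look something like this hypothetical sketch (the real script may word the question and handle the answer differently):

```python
# Hypothetical sketch of the optional cleanup step; not the repository's convert_data.py.
from pathlib import Path

answer = input("Delete original .xz files to save disk space? [y/n] ").strip().lower()
if answer == "y":
    for xz_file in Path("data/openwebtext/subsets").glob("*.xz"):
        xz_file.unlink()
```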

train

python train.py --config config/gptv1_s.toml
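The --config flag points train.py at a TOML file holding the run's hyperparameters. As a rough illustration of how such a flag can be wired up (the actual argument parsing in train.py and the keys inside config/gptv1_s.toml are assumptions here):

```python
# Hypothetical sketch of loading a TOML training config; not the repository's train.py.
import argparse
import tomllib  # Python 3.11+; use the third-party "tomli" package on older versions

parser = argparse.ArgumentParser()
parser.add_argument("--config", required=True, help="path to a TOML config file")
args = parser.parse_args()

with open(args.config, "rb") as f:
    config = tomllib.load(f)

print(config)  # e.g. model width, number of layers, context length, learning rate, ...
```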

Since we are using neetbox for monitoring, open localhost:20202 (neetbox's default port) in your browser to check the training progress:

(Screenshot: neetbox web dashboard showing training metrics)

predict

python inference.py --config config/gptv1_s.toml

Open localhost:20202 (neetbox's default port) in your browser and feed text to your model via the action button.

(Screenshot: neetbox action panel used to send a prompt to the model)
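Under the hood, prediction with a GPT-style model is an autoregressive loop: the prompt is encoded into token ids, the model produces logits for the next token, one token is sampled and appended, and the loop repeats. A minimal PyTorch sketch of that loop follows; model, encode, and decode are placeholders here, not the actual objects built by inference.py:

```python
# Minimal sketch of autoregressive sampling for a GPT-style model.
# `model` is assumed to map a (batch, time) tensor of token ids to
# (batch, time, vocab_size) logits; `encode`/`decode` are placeholder tokenizer functions.
import torch

@torch.no_grad()
def generate(model, encode, decode, prompt, max_new_tokens=100, temperature=1.0):
    model.eval()
    ids = torch.tensor([encode(prompt)], dtype=torch.long)      # shape (1, T)
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :] / temperature             # last position's logits
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)       # sample one token
        ids = torch.cat([ids, next_id], dim=1)                  # append and continue
    return decode(ids[0].tolist())
```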


further

For more information, see LLM basics from scratch.