Modded-NanoGPT: a replication attempt of nGPT

This repo is a fork of modded-nanogpt, which makes it easy to test and play with different methods for improving LLM training. It is specifically intended as an attempt to reproduce the results of the nGPT paper.

I used a smaller model than in the paper (162M vs 0.5B and 1B) and far fewer tokens (2.5B vs 500B). But I guess we have to start somewhere if we want to verify the results ourselves.

These are the validation loss curves for GPT and nGPT, trained on 2.5B tokens of FineWeb:

[figure: validation loss curves, GPT vs nGPT, 2.5B FineWeb tokens]

You can see how nGPT lags behind the whole time, but catches up and even beats the baseline during the cooldown phase of the WSD schedule.

By "GPT", I mean the modded-nanogpt baseline using AdamW (I derived the current implementation and removed all the tricks and enhancements like Muon, QK norm, zero init... and switched back to something very similar to the original nanogpt). This is implemented in the train_gpt2.py file.

By "nGPT", I mean the normalized version of this baseline, as described quite precisely in section 2.6 of the nGPT paper. This is implemented in the train_ngpt2.py file.

Here is another experiment, at a smaller scale:

[figure: validation loss curves, GPT vs nGPT, smaller scale]

Same observation: nGPT lags behind the whole time, but catches up at the end.

Note that the "GPT" label isn't quite correct here because 1) RoPE is used and 2) a Llama-style MLP block is used, but I chose to follow what was done in the paper. No weight tying is employed (that's why the model isn't 124M parameters).
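For context, by "Llama MLP block" I mean the gated SiLU (SwiGLU) MLP rather than GPT-2's fc → GELU → proj. A minimal sketch, with a hypothetical expansion factor (the exact hidden size used here may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LlamaMLP(nn.Module):
    """Gated SiLU MLP as used in Llama, replacing GPT-2's fc -> GELU -> proj."""
    def __init__(self, n_embd: int, hidden: int | None = None):
        super().__init__()
        hidden = hidden if hidden is not None else 4 * n_embd  # hypothetical expansion factor
        self.gate_proj = nn.Linear(n_embd, hidden, bias=False)
        self.up_proj = nn.Linear(n_embd, hidden, bias=False)
        self.down_proj = nn.Linear(hidden, n_embd, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: elementwise gate with SiLU, then project back down
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```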

Next up, I wanted to check the length extrapolation capabilities of a trained GPT vs nGPT, inspired by Figure 13 of the paper. These are the results for the 162M models on the PG19 dataset:

[figure: perplexity vs context length on PG19, GPT vs nGPT (162M)]

Quite impressive compared to the GPT baseline. Note that the overall shape of the curves doesn't match Figure 13 of the nGPT paper (the GPT perplexity is supposed to shoot up, but here it kind of levels off).
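The evaluation behind this kind of curve can be as simple as chunking PG19 into non-overlapping windows of a given length and averaging the token-level loss. A rough sketch, assuming a `model` that returns logits of shape (batch, ctx, vocab) and a 1-D tensor of token ids (this is not the exact evaluation script used here):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity_at_ctx(model, tokens: torch.Tensor, ctx_len: int, device: str = "cuda") -> float:
    """Average perplexity over non-overlapping windows of length ctx_len."""
    model.eval()
    total_loss, total_tokens = 0.0, 0
    for start in range(0, tokens.numel() - ctx_len - 1, ctx_len):
        x = tokens[start : start + ctx_len].unsqueeze(0).to(device)
        y = tokens[start + 1 : start + ctx_len + 1].unsqueeze(0).to(device)
        logits = model(x)  # assumed to return (1, ctx_len, vocab_size)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        total_loss += loss.item() * y.numel()
        total_tokens += y.numel()
    return math.exp(total_loss / total_tokens)

# e.g. sweep the evaluation context length beyond the training length
# for ctx in [256, 512, 1024, 2048, 4096]:
#     print(ctx, perplexity_at_ctx(model, pg19_tokens, ctx))
```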

Notes about my experiments / potential things to look into:

  • the slowdown incurred by nGPT over GPT is only about 20%, as opposed to the 80% reported in the paper. Note that I used the latest 2.5 release of PyTorch.
  • I used the WSD LR scheduler (see the sketch after this list), as opposed to the cosine scheduler used in the paper.
  • the behavior of "nGPT catching up and beating GPT during the cooldown" is a bit strange; maybe try making the cooldown longer? (proposed by @Grad62304977)
  • train at ctx_len=256 and see if it can extrapolate to ctx_len=1024? That's easier to see than 1024/2048, for example (also proposed by @Grad62304977)
  • maybe try looking at the EDM2 code? It's also from NVIDIA and shares similar ideas about heavy normalization.
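For reference, the WSD (warmup-stable-decay) schedule mentioned above is just a linear warmup, a flat plateau, and a final cooldown to zero; a minimal sketch with hypothetical phase lengths:

```python
def wsd_lr(step: int, max_lr: float, warmup_steps: int, total_steps: int, cooldown_steps: int) -> float:
    """Warmup-Stable-Decay: linear warmup, flat plateau, linear cooldown to zero."""
    if step < warmup_steps:                  # warmup phase
        return max_lr * (step + 1) / warmup_steps
    if step < total_steps - cooldown_steps:  # stable phase
        return max_lr
    # cooldown phase: decay linearly to zero over the last cooldown_steps
    remaining = total_steps - step
    return max_lr * remaining / cooldown_steps

# hypothetical usage: lr = wsd_lr(step, max_lr=2**-9, warmup_steps=250, total_steps=5000, cooldown_steps=1000)
```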

I used muP for both GPT and nGPT to set the LR of 2**(-9), found with a sweep at n_embd=64. I used it in what I like to call a "non-invasive" way: at the desired width n_embd=768 I fixed SP=muP. See this.
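To make the "non-invasive" setup concrete: with the base width set to the target width (768), all muP multipliers equal 1 at n_embd=768 (so SP=muP there), and only the narrow sweep models at n_embd=64 get rescaled learning rates. Below is a rough sketch assuming the standard muP rule for Adam, where hidden-weight LRs scale as 1/fan_in; the embedding/output-head rules are simplified and the parameter-name check is nanoGPT-specific and only illustrative.

```python
import torch.nn as nn

def mup_param_groups(model: nn.Module, base_lr: float, base_width: int = 768, width: int = 768):
    """Sketch of muP LR scaling for Adam: hidden matrices get base_lr * base_width / width,
    vector-like params (embeddings, norms, biases) keep base_lr.
    The output head has its own muP rule, which is glossed over here."""
    hidden_matrices, others = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # treat 2D+ weights as "hidden" matrices, except the token embedding ("wte" in nanoGPT)
        if p.ndim >= 2 and "wte" not in name:
            hidden_matrices.append(p)
        else:
            others.append(p)
    return [
        {"params": hidden_matrices, "lr": base_lr * base_width / width},  # 1/fan_in scaling
        {"params": others, "lr": base_lr},                                # width-independent LR
    ]

# With width == base_width == 768 this reduces to plain SP; at the sweep width n_embd=64
# the hidden-weight LR is scaled by 768/64 relative to the swept base LR.
```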

This is the LR test sweep I did with the GPT baseline to check the correctness of my muP implementation:

[figure: LR sweep at several widths, GPT baseline]

And the same for nGPT:

[figure: LR sweep at several widths, nGPT]

Feel free to reach out to me or open a PR if you have any questions or suggestions. I can provide the trained models, for example.

