A character-level seq2seq transformer written from scratch in a single file, seq2seq.py.
Optimized for readability and learnability.
- single file
- as readable as possible
- comments for learnings and common errors
- working code that:
  - trains on paired sequences of text
  - given input text, generates the corresponding output text
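To make "character-level" and "paired sequences" concrete, here is a minimal sketch of how one text pair could be turned into integer ids. It is not taken from seq2seq.py: the special tokens `<pad>`, `<sos>`, `<eos>`, the helper names, and the example sentence pair are all assumptions made for illustration.

```python
# Hypothetical special tokens; seq2seq.py may use different ones.
PAD, SOS, EOS = "<pad>", "<sos>", "<eos>"

def build_vocab(texts):
    """Map every distinct character (plus the special tokens) to an integer id."""
    chars = sorted(set("".join(texts)))
    itos = [PAD, SOS, EOS] + chars
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

def encode(text, stoi):
    """Turn a string into ids, wrapped in start and end markers."""
    return [stoi[SOS]] + [stoi[ch] for ch in text] + [stoi[EOS]]

def decode(ids, itos):
    """Turn ids back into a string, dropping the special tokens."""
    return "".join(itos[i] for i in ids if itos[i] not in (PAD, SOS, EOS))

# A made-up (source, target) pair, not a sample from the dataset.
pairs = [("mujhe yeh movie pasand hai", "I like this movie")]
stoi, itos = build_vocab([s for pair in pairs for s in pair])

src_ids = encode(pairs[0][0], stoi)
tgt_ids = encode(pairs[0][1], stoi)

# Teacher forcing: the decoder sees the target shifted right by one position
# and is trained to predict the next character at every step.
decoder_input, decoder_target = tgt_ids[:-1], tgt_ids[1:]
```

The shift between `decoder_input` and `decoder_target` is the usual source of off-by-one bugs in seq2seq training, so it is worth checking that both have the same length.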
We train a character-level seq2seq transformer to translate from Hinglish (a modern hybrid of Hindi and English) to English.
After training, the same model is used to translate sample Hinglish sentences to English.
The dataset used is cmu-hinglish-dog on Hugging Face, which provides samples of movie reviews written in Hinglish that have been translated to English.
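Translating a sentence after training typically means generating the output one character at a time. The sketch below shows greedy decoding under stated assumptions: `next_token_scores` is a hypothetical stand-in for the trained model, `toy_scores` is a fake model that only upper-cases its input, and the `<sos>`/`<eos>` markers are illustrative. The decoding strategy actually used in seq2seq.py may differ.

```python
SOS, EOS = "<sos>", "<eos>"  # hypothetical start and end markers

def greedy_decode(next_token_scores, source, max_len=50):
    """Generate output one token at a time, always taking the top-scoring token."""
    output = [SOS]
    for _ in range(max_len):
        scores = next_token_scores(source, output)  # dict: token -> score
        best = max(scores, key=scores.get)
        if best == EOS:  # the model signals that the output is complete
            break
        output.append(best)
    return "".join(output[1:])  # drop the start marker

def toy_scores(source, output_so_far):
    """Stand-in for a trained model: 'translates' by upper-casing the source."""
    pos = len(output_so_far) - 1  # characters generated so far
    target = source.upper()
    next_token = target[pos] if pos < len(target) else EOS
    return {next_token: 1.0, "?": 0.0}

translation = greedy_decode(toy_scores, "hello")
```

The `max_len` cap matters in practice: an undertrained model may never emit the end marker, and without the cap the loop would not terminate.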
python >= 3.10
torch >= 2.0
datasets
pip install torch
pip install datasets
python seq2seq.py
All contributions in the form of confusions, concerns, suggestions, or improvements are welcome!
This repo was motivated by my previous "single file" repo single_file_gpt, which in turn was influenced by Andrej Karpathy's nanoGPT.
The demo in this repo uses the cmu-hinglish-dog dataset on Hugging Face, originally produced by Zhou et al., 2018. This dataset can also be found in the datasets-CMU_DoG repo on GitHub.