
Commit

fix xl and rmd
Anton Emelyanov committed Feb 11, 2021
1 parent ad61a6a commit c0b6b09
Showing 6 changed files with 228 additions and 303 deletions.
252 changes: 5 additions & 247 deletions README.md
@@ -26,8 +26,8 @@ We support 🤗HuggingFace interface only for ruGPT3Large, ruGPT3Medium, ruGPT3S
Here you can find examples of [finetuning](examples/Finetune_RuGPTs_with_HF.ipynb) or [generation](examples/Generate_text_with_RuGPTs_HF.ipynb).

These examples are also adapted for Google Colab:
- * [![finetuning](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sberbank-ai/ru-gpts/blob/master/examples/Finetune_RuGPTs_with_HF.ipynb)
- * [![generation](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sberbank-ai/ru-gpts/blob/master/examples/Generate_text_with_RuGPTs_HF.ipynb)
+ * finetuning: [![finetuning](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sberbank-ai/ru-gpts/blob/master/examples/Finetune_RuGPTs_with_HF.ipynb)
+ * generation: [![generation](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sberbank-ai/ru-gpts/blob/master/examples/Generate_text_with_RuGPTs_HF.ipynb)

Basic usage:
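The basic-usage snippet itself is collapsed in this diff view; as a rough sketch, loading and sampling from the 🤗HuggingFace checkpoints typically looks like this (the checkpoint name, prompt, and sampling parameters below are illustrative, not taken from the collapsed text):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# any of the 🤗HuggingFace checkpoints listed below can be substituted here
model_name = "sberbank-ai/rugpt3small_based_on_gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# prompt and sampling parameters are illustrative only
input_ids = tokenizer.encode("Александр Сергеевич Пушкин родился в ", return_tensors="pt")
out = model.generate(input_ids, max_length=50, do_sample=True, top_k=50, top_p=0.95)
print(tokenizer.decode(out[0]))
```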

@@ -129,249 +129,7 @@ You can test the deepspeed installation with the following command: `ds_report`.

An example of RuGPT3XL inference is [here](examples/ruGPT3XL_generation.ipynb) or [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sberbank-ai/ru-gpts/blob/master/examples/ruGPT3XL_generation.ipynb)

# Details of pretraining
All GPUs are Tesla V100-SXM3 32 GB.
# Advanced
We also provide pretraining scripts for all models (except ruGPT2Large). See the [scripts](scripts/) dir.

## Details of pretraining ruGPT3XL
The model was trained with a 512 context length using [deepspeed](https://github.com/microsoft/DeepSpeed) and [megatron](https://github.com/NVIDIA/Megatron-LM) code by the [SberDevices](https://sberdevices.ru/) team. After that, the model was finetuned with a 2048 context length. Note! The model has sparse attention modules.

Total training time was around 10 days on 256 GPUs. Final perplexity on the test set is `11.4`.

🤗HuggingFace model card [link](https://huggingface.co/sberbank-ai/rugpt3xl).

See more details [here](gw/).

## Details of pretraining ruGPT3Large
The model was trained with a 1024 context length using transformers by the [SberDevices](https://sberdevices.ru/) team on 80B tokens for around 3 epochs. After that, we finetuned it with a 2048 context length. To load a transformers checkpoint, use `--load-openai`.

The training process took around two weeks on 8 DGX2 (128 GPUs) for the 1024 context and a few days on 16 GPUs for the 2048 context on [Christophari](https://sbercloud.ru/ru/christofari).

Perplexity on the test set is `16`.

You can obtain this model here: [GDrive](https://drive.google.com/file/d/1t4xw-nvNLQ8kt9FrWW4bPEgCr45M98vu/view?usp=sharing), [Yandex.Disk](https://yadi.sk/d/X7v84O9jrQ8jJg), [GDrive option-2](https://drive.google.com/file/d/1wtc2iBNTcYrqwOzRyEWYWoVBc9xfsbPP/view?usp=sharing), or use it in transformers with the model name `sberbank-ai/rugpt3large_based_on_gpt2` (see [usage](#Usage-ruGPT3Large) for details).

🤗HuggingFace model card [link](https://huggingface.co/sberbank-ai/rugpt3large_based_on_gpt2)

## Details of pretraining ruGPT3Medium
The model was trained with a 1024 context length using transformers by the [SberDevices](https://sberdevices.ru/) team on 80B tokens for around 3 epochs. After that, the model was finetuned with a 2048 context length.

Total training time was around 16 days on 64 GPUs.

You can obtain this model here: [GDrive](https://drive.google.com/file/d/1Lb9ILKw0N2ZSEG80QyaCvkp1b2RAw1pC/view?usp=sharing), [Yandex.Disk](https://yadi.sk/d/yE0cw0QIikCPAg), [GDrive option-2](https://drive.google.com/file/d/1gADn4VxDBVrxZ9Wv4bISbDjwCm_3mrDH/view?usp=sharing), or use it in transformers with the model name `sberbank-ai/rugpt3medium_based_on_gpt2` (see [usage](#Usage-ruGPT3Medium) for details).

🤗HuggingFace model card [link](https://huggingface.co/sberbank-ai/rugpt3medium_based_on_gpt2)

## Details of pretraining ruGPT3Small
The model was trained with a 1024 context length using transformers by the [SberDevices](https://sberdevices.ru/) team on 80B tokens for around 3 epochs. After that, the model was finetuned with a 2048 context length.

Total training time was around one week on 32 GPUs.

You can obtain this model here: [GDrive](https://drive.google.com/file/d/19dyhhayJSVJpVPwPzqLRIdCtOddvkzJ4/view?usp=sharing), or use it in transformers with the model name `sberbank-ai/rugpt3small_based_on_gpt2` (see [usage](#Usage-ruGPT3Small) for details).

🤗HuggingFace model card [link](https://huggingface.co/sberbank-ai/rugpt3small_based_on_gpt2)

## Details of pretraining ruGPT2Large
The model was trained with a 1024 context length using transformers by the [SberDevices](https://sberdevices.ru/) team on 170 GB of data on 64 GPUs for 3 weeks.

You can obtain this model here: [GDrive](https://drive.google.com/file/d/1r65MwU0arie8NggxpSmc_3Ja5ldRNS70/view?usp=sharing), [Yandex.Disk](https://yadi.sk/d/BRbn4fl9wqKy0w), [GDrive option-2](https://drive.google.com/file/d/17YuV-uuhSVvMD1cnTe7cR-qscb3BtTiG/view?usp=sharing), or use it in transformers with the model name `sberbank-ai/rugpt2large` (see [usage](#Usage-ruGPT2Large) for details).

🤗HuggingFace model card [link](https://huggingface.co/sberbank-ai/rugpt2large)

# Usage
## Usage ruGPT3XL
See all details [here](gw/) or run the example on [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sberbank-ai/ru-gpts/blob/master/examples/ruGPT3XL_generation.ipynb)

## Usage ruGPT3Large
We've provided 2 scripts that pretrain and generate with ruGPT3Large. Save and load model checkpoints with `--save` and `--load`.

### Finetuning
#### Data preparation
We support three file formats for training, but all require preprocessing. First, place your training data in a loose JSON format, with one JSON object containing a text sample per line. For example:

```json
{"src": "KISH", "text": "Как же джокер ты хитер", "type": "Ru", "id": "0", "title": "First Part"}
{"src": "The Internet", "text": "Ты удачи приговор", "type": "Ru", "id": "42", "title": "Second Part"}
```

The name of the text field of the JSON can be changed by using the `--text-key` flag. The other metadata fields are optional and are not used in training.
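
For instance, a short sketch of producing such a file with the standard `json` module (field names mirror the example above; only the text field is used for training):

```python
import json

samples = [
    {"src": "KISH", "text": "Как же джокер ты хитер", "type": "Ru", "id": "0", "title": "First Part"},
    {"src": "The Internet", "text": "Ты удачи приговор", "type": "Ru", "id": "42", "title": "Second Part"},
]

# one JSON object per line ("loose json"), keeping non-ASCII characters as-is
with open("train.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```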
#### Running script
`bash ./scripts/pretrain_ruGPT3Large.sh`

This script runs single-GPU ruGPT3Large pretraining. It contains the command for running on [Christophari](https://sbercloud.ru/ru/christofari):

```bash
MP_SIZE=1                # model-parallel size (1 = no model parallelism)
NUM_GPUS_PER_WORKER=1    # number of processes launched by mpirun

# ${now} and ${host} below are run identifiers, assumed to be defined earlier in the script
mpirun --np ${NUM_GPUS_PER_WORKER} python pretrain_megatron.py \
--train-data /home/jovyan/data/train.jsonl \
--valid-data /home/jovyan/data/valid.jsonl \
--test-data /home/jovyan/data/valid.jsonl \
--save /home/jovyan/ruGPT3Large/checkpoints_${now}_${host} \
--load /home/jovyan/ruGPT3Large \
--tensorboard-dir /home/jovyan/ruGPT3Large/runs_${now}_${host} \
--save-interval 500 \
--eval-interval 500 \
--log-interval 100 \
--model-parallel-size ${MP_SIZE} \
--num-layers 24 \
--hidden-size 1536 \
--num-attention-heads 16 \
--seq-length 2048 \
--max-position-embeddings 2048 \
--vocab-size 50257 \
--batch-size 1 \
--train-iters 200000 \
--distributed-backend nccl \
--lr 0.00015 \
--lr-decay-style cosine \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--warmup .01 \
--fp16 \
--lazy-loader \
--checkpoint-activations \
--loose-json \
--text-key text \
--tokenizer-path /home/jovyan/ruGPT3Large \
--tokenizer-type GPT2BPETokenizer \
--finetune \
```

Or you can use the transformers interface:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sberbank-ai/rugpt3large_based_on_gpt2")

model = AutoModel.from_pretrained("sberbank-ai/rugpt3large_based_on_gpt2")
```
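
Note that `AutoModel` loads the bare transformer without a language-modeling head; for sampling, a minimal sketch with the LM-head variant and top-k/top-p parameters (the prompt and parameter values are illustrative, not the script's defaults):

```python
import torch
from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("sberbank-ai/rugpt3large_based_on_gpt2")
model = AutoModelWithLMHead.from_pretrained("sberbank-ai/rugpt3large_based_on_gpt2")
model.eval()

input_ids = tokenizer.encode("на словах ты лев толстой", return_tensors="pt")
with torch.no_grad():
    out = model.generate(
        input_ids,
        max_length=100,
        do_sample=True,  # sample instead of greedy decoding
        top_k=50,        # keep only the 50 most likely next tokens
        top_p=0.95,      # nucleus (top-p) sampling threshold
    )
print(tokenizer.decode(out[0]))
```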

### Text Generation
`bash ./scripts/generate_ruGPT3Large.sh`

Starts an interactive terminal session that generates text either conditionally or unconditionally depending on what the user enters into the prompt.

The script is capable of top-k or top-p sampling, as specified by the appropriate variables within the script.

Example of generation:

```text
Context: на словах ты лев толстой
ruGPT3Large: а в сущности, - ты тоже не дурак, просто так же, как и твой человек, то есть твоя "жизнь", а также как и ты думаешь по-настоящему "ты" и есть твои "жизнь" или "выбор" в отношении твоего положения.
Context: как же джокер ты хитер
ruGPT3Large: или автор книги по бизнесу!
```

Example of generation in Colab: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sberbank-ai/ru-gpts/blob/master/examples/ruGPT3_generation_example.ipynb)

## Usage ruGPT3Medium
You can run the megatron script with the `--load-openai` option or use the transformers interface:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sberbank-ai/rugpt3medium_based_on_gpt2")

model = AutoModel.from_pretrained("sberbank-ai/rugpt3medium_based_on_gpt2")
```

### Text Generation
`bash ./scripts/generate_ruGPT3Medium.sh`

Starts an interactive terminal session that generates text either conditionally or unconditionally depending on what the user enters into the prompt.

The script is capable of top-k or top-p sampling, as specified by the appropriate variables within the script.

Example of generation:

```text
Context >>> На словах ты Лев Толстой, а на деле
ruGPT: На словах ты Лев Толстой, а на деле я — Лев Давидович Троцкий, — сказал я. — Так что мы еще посмотрим
Context: как же джокер ты хитер
ruGPT: как же джокер ты хитер, в этой игре
- Я не злодей, просто хотел узнать, можно ли узнать о чём?
```

## Usage ruGPT3Small
You can run the megatron script with the `--load-openai` option or use the transformers interface:

```python
from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("sberbank-ai/rugpt3small_based_on_gpt2")

model = AutoModelWithLMHead.from_pretrained("sberbank-ai/rugpt3small_based_on_gpt2")
```
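
Equivalently, the high-level `pipeline` API works for quick experiments (a sketch; the prompt and sampling parameters are illustrative):

```python
from transformers import pipeline

# builds the tokenizer and LM-head model in one call
generator = pipeline("text-generation", model="sberbank-ai/rugpt3small_based_on_gpt2")
result = generator("На словах ты Лев Толстой, а на деле", max_length=50, do_sample=True, top_p=0.95)
print(result[0]["generated_text"])
```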

### Text Generation
`bash ./scripts/generate_ruGPT3Small.sh`

Starts an interactive terminal session that generates text either conditionally or unconditionally depending on what the user enters into the prompt.

The script is capable of top-k or top-p sampling, as specified by the appropriate variables within the script.

Example of generation:

```text
Context >>> На словах ты Лев Толстой, а на деле
ruGPT: На словах ты Лев Толстой, а на деле – Толстой, – с улыбкой заметил Николай, – я вижу, что ты прав.
– А вот это – другое дело, – сказал Лев Толстой, – это дело другое.
– Да, да, – согласился Николай, – я прав.
– А вот что, Лев Николаевич, – сказал Лев Толстой, – я думаю, что в этом отношении у меня нет оснований сомневаться в твоей правоте.
```

Example of finetuning on essays and generation in Colab: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sberbank-ai/ru-gpts/blob/master/examples/Finetune_ruGPT3Small.ipynb)

## Usage ruGPT2Large
We've provided 2 scripts that pretrain and generate with ruGPT2Large, based on the original [transformers](https://github.com/huggingface/transformers/tree/v2.8.0) code.

### Finetuning
#### Data preparation
We can pass raw text files to the model.
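
For example, a tiny sketch that writes such plain-text files (the file name echoes the command below; the content is just a placeholder):

```python
# pretrain_transformers.py consumes plain UTF-8 text files, no JSON wrapping needed
texts = ["Первый обучающий документ.", "Второй обучающий документ."]

with open("train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(texts) + "\n")
```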
#### Running script
`bash ./scripts/pretrain_ruGPT2Large.sh`

This script runs single-GPU ruGPT2Large pretraining. It contains the command for running on [Christophari](https://sbercloud.ru/ru/christofari):

```bash
python pretrain_transformers.py \
--output_dir=/home/jovyan/rugpt2large/checkpoints_"${now}"_"${host}" \
--model_type=gpt2 \
--model_name_or_path=/home/jovyan/gpt2_large_bbpe_v50 \
--do_train \
--train_data_file=/home/jovyan/data/train.txt \
--do_eval \
--eval_data_file=/home/jovyan/data/valid.txt \
--fp16
```

Or use the transformers interface:

```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("sberbank-ai/rugpt2large")
model = AutoModel.from_pretrained("sberbank-ai/rugpt2large")
```

### Text Generation
`bash ./scripts/generate_ruGPT2Large.sh`

Starts an interactive terminal session that generates text either conditionally or unconditionally depending on what the user enters into the prompt.

The script is capable of top-k or top-p sampling, as specified by the appropriate variables within the script.

Example of generation:

```text
Context: На словах ты Лев Толстой, а на деле
ruGPT: На словах ты Лев Толстой, а на деле – козел!» – так я про себя подумал, но решил не отвечать. Я встал, поклонился
```
**Note!** All training params (such as lr, wd, ...) may have been different during the real training runs. This is just an example.
