diff --git a/README.md b/README.md index e3e9589..a42fea8 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,7 @@ # ruGPT3XL, ruGPT3Large, ruGPT3Medium, ruGPT3Small and ruGPT2Large Russian GPT trained with 2048 context length (ruGPT3XL) with sparse attention, Russian GPT trained with 2048 context length (ruGPT3Large), Russian GPT Medium trained with context 2048 (ruGPT3Medium), Russian GPT Small trained with context 2048 (ruGPT3Small) and Russian GPT2 large (ruGPT2Large) trained with 1024 context length. -We suggest you use ruGPT2Large or ruGPT3XL because this model is more stable and tested. +We suggest using ruGPT2Large or ruGPT3XL because these models are well tested and achieve the best perplexity. Examples [here](examples/) @@ -115,50 +115,19 @@ For more details please see full code of dataset: `src.dataset_rugpt3.RuGpt3Text **Note!** This way is valid for all RuGPTs models except RuGPT3XL. - - - - - -## Setup ruGPT3XL -See all details [here](gw/) -## Setup ruGPT3Large -Code reused from microsoft [implementation](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM) of Megatron-LM. -Supports only python3.6. -To use this repo please install the latest supported versions of PyTorch with GPU support. -Additionally, part of this codebase leverages tensorflow-cpu to (optionally) perform dataloading of TFRecords for GPT training. We recommend creating a virtual environment (to avoid breaking existing tf installations) and install our `requirements.txt`. +### Megatron with deepspeed and sparsity +This section mostly covers usage of the RuGPT3XL model and training models with sparse attention. ```bash -python -m pip install virtualenv -virtualenv gpt_env -source gpt_env/bin/activate -pip install -r requirements.txt +apt-get install llvm-9-dev +pip install cpufeature +pip install triton==0.2.3 +DS_BUILD_CPU_ADAM=1 DS_BUILD_SPARSE_ATTN=1 pip install deepspeed==0.3.7 ``` -For using of sparse operations in attention additionally install [torch-blocksparse](https://github.com/ptillet/torch-blocksparse): +You can test the deepspeed installation with the following command: `ds_report`. -```bash -source gpt_env/bin/activate -pip install torch-blocksparse -``` - -Torch-Blocksparse depends on CUDA 10.1 and the [Triton](https://github.com/ptillet/triton) language and compiler, which requires llvm-9. - -## Setup ruGPT3Medium -For this model you can use code from microsoft [implementation](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM) of Megatron-LM in our repo or use transformers interface. Therefore, you should follow the instructions for ruGPT2Large or ruGPT3Large for installation. - -## Setup ruGPT3Small -For this model you can use code from microsoft [implementation](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM) of Megatron-LM in our repo or use transformers interface. Therefore, you should follow the instructions for ruGPT2Large or ruGPT3Large for installation. - -## Setup ruGPT2Large -This model is smaller and was trained with [transformers==v2.8.0](https://github.com/huggingface/transformers/tree/v2.8.0). -For installing use command: -```bash -pip install transformers -``` +An example of RuGPT3XL inference is [here](examples/ruGPT3XL_generation.ipynb) or [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sberbank-ai/ru-gpts/blob/master/examples/ruGPT3XL_generation.ipynb) # Details of pretraining All GPUs are Tesla V100-SXM3 32 Gb.
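For quick reference, the inference flow from the linked notebook reduces to a few lines. This is a minimal sketch, assuming the repo has been cloned into the working directory and the environment above is installed:

```python
import sys

# The RuGPT3XL generation wrapper lives under gw/ in this repo.
sys.path.append("ru-gpts/gw")
from generation_wrapper import RuGPT3XL

# Downloads the weights and deepspeed config from the Hugging Face hub.
# seq_len is the max generation context, up to 2048 tokens.
gpt = RuGPT3XL.from_pretrained("sberbank-ai/rugpt3xl", seq_len=512)

# generate() returns a list of decoded strings.
# Prompt: "Who was the president of the USA in 2020?"
print(gpt.generate("Кто был президентом США в 2020? ", max_length=50)[0])
```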
diff --git a/examples/Finetune_and_generate_RuGPTs_deepspeed_megatron.ipynb b/examples/Finetune_and_generate_RuGPTs_deepspeed_megatron.ipynb new file mode 100644 index 0000000..939b6e8 --- /dev/null +++ b/examples/Finetune_and_generate_RuGPTs_deepspeed_megatron.ipynb @@ -0,0 +1,402 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "Finetune_and_generate_RuGPTs_deepspeed_megatron.ipynb", + "provenance": [], + "collapsed_sections": [] + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "accelerator": "GPU" + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "pRzAXrPVzsHX" + }, + "source": [ + "# Finetune RuGPTs in megatron and deepspeed\n", + "This notebook shows how to finetune RuGPTs models with megatron and deepspeed, using RuGPT3Small as an example. Note that other models will require more GPU memory.\n", + "\n", + "This notebook is valid for all RuGPTs models except RuGPT3XL.\n", + "## Install env" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hu1OzWZ6zqQv" + }, + "source": [ + "!pip3 install transformers==3.5.0" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "ozJOYbK-11pk", + "outputId": "47a4a5b0-71c8-46dc-b9df-4f1cea205ea1" + }, + "source": [ + "%%writefile setup.sh\n", + "\n", + "export CUDA_HOME=/usr/local/cuda-10.1\n", + "git clone https://github.com/NVIDIA/apex\n", + "pip install -v --no-cache-dir --global-option=\"--cpp_ext\" --global-option=\"--cuda_ext\" ./apex" + ], + "execution_count": 3, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Writing setup.sh\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "M46Pk6DJ19Jk" + }, + "source": [ + "!sh setup.sh" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "_gE7xBM_z-uW" + }, + "source": [ + "!git clone --single-branch --branch deepspeed_full https://github.com/sberbank-ai/ru-gpts" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "-bVWryahFmtx" + }, + "source": [ + "!pip install deepspeed==0.3.7" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8di4sCoS0Pyw" + }, + "source": [ + "## Download files" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "96qG_A1n0CiF" + }, + "source": [ + "!wget -O train.txt https://www.dropbox.com/s/oa3v9c7g9bp40xw/train.txt?dl=0\n", + "!wget -O valid.txt https://www.dropbox.com/s/mworl3ld6r3bg62/valid.txt?dl=0" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jqW38Hni64xH" + }, + "source": [ + "## Prepare data for parallel\n", + "We use a custom implementation of a distributed dataset. For training and evaluation we should specify a file `file.list` with a list of paths to txt files. All files from `file.list` will be split between available GPUs. The logic of splitting is described by the following code:\n", + "\n", + "```python\n", + "shard_size = len(files) // world_size\n", + "shard_start = rank * shard_size\n", + "shard_end = (rank + 1) * shard_size\n", + "files = files[shard_start:shard_end]\n", + "```\n", + "\n", + "For example, with four files and `world_size=2`, rank 0 takes `files[0:2]` and rank 1 takes `files[2:4]`.\n", + "\n", + "For more details please see full code of dataset: `src.dataset_rugpt3.RuGpt3TextDataset`."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wtItuLGA38db" + }, + "source": [ + "!echo train.txt > train.list\n", + "!echo valid.txt > valid.list" + ], + "execution_count": 9, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-EF0JepF0S41" + }, + "source": [ + "## Train\n", + "Load the model from Huggingface and finetune it on essays.\n", + "\n", + "This will take around ten minutes." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XHluAlFh0SJo" + }, + "source": [ + "!export PYTHONPATH=${PYTHONPATH}:${HOME}/ru-gpts\n", + "\n", + "!USE_DEEPSPEED=1 python -m torch.distributed.launch --nproc_per_node 1 ru-gpts/pretrain_gpt3.py \\\n", + " --train-data-path \"train.list\" \\\n", + " --test-data-path \"valid.list\" \\\n", + " --max-files-per-process 100 \\\n", + " --logging-dir=\"log\" \\\n", + " --save model \\\n", + " --load-huggingface sberbank-ai/rugpt3small_based_on_gpt2 \\\n", + " --save-interval 1000 \\\n", + " --log-interval 100 \\\n", + " --eval-interval 1000 \\\n", + " --eval-iters 100 \\\n", + " --model-parallel-size 1 \\\n", + " --num-layers 12 \\\n", + " --hidden-size 768 \\\n", + " --num-attention-heads 12 \\\n", + " --batch-size 1 \\\n", + " --seq-length 2048 \\\n", + " --max-position-embeddings 2048 \\\n", + " --train-iters 2000 \\\n", + " --resume-dataloader \\\n", + " --distributed-backend \"nccl\" \\\n", + " --lr 0.00015 \\\n", + " --lr-decay-style \"cosine\" \\\n", + " --lr-decay-iters 3200 \\\n", + " --clip-grad 0.5 \\\n", + " --warmup .004 \\\n", + " --fp16 \\\n", + " --checkpoint-activations \\\n", + " --deepspeed-activation-checkpointing \\\n", + " --deepspeed \\\n", + " --deepspeed_config ru-gpts/src/deepspeed_config/gpt3_small_2048.json \\\n" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ALvcD5SE8RtP" + }, + "source": [ + "At the end of training, the output should be something like this:\n", + "\n", + "\"-----------------------------------------------------------------------------------------\n", + "\n", + " validation loss at the end of training for test data | LM loss: 3.0002 | LM PPL: 20.090\n", + "\n", + "-----------------------------------------------------------------------------------------\"" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jVc1YUGbNSy3" + }, + "source": [ + "!rm -rf ru-gpts" + ], + "execution_count": 11, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0HmKilrb8lQm" + }, + "source": [ + "## Generate\n", + "\n", + "Load the pretrained model from the dir and generate."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "kAH-WpCG8lmG", + "outputId": "bb67e780-21bf-4e05-c457-7cba4463ce93" + }, + "source": [ + "!export PYTHONPATH=${PYTHONPATH}:${HOME}/ru-gpts\n", + "\n", + "!python ru-gpts/generate_samples.py \\\n", + " --load model/ \\\n", + " --model-parallel-size 1 \\\n", + " --num-layers 12 \\\n", + " --hidden-size 768 \\\n", + " --num-attention-heads 12 \\\n", + " --batch-size 1 \\\n", + " --seq-length 500 \\\n", + " --max-position-embeddings 2048 \\\n", + " --distributed-backend \"nccl\" \\\n", + " --tokenizer-path sberbank-ai/rugpt3small_based_on_gpt2 \\\n", + " --no-load-optim\n" + ], + "execution_count": 17, + "outputs": [ + { + "output_type": "stream", + "text": [ + "2021-02-11 22:34:33.200550: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1\n", + "Generate Samples\n", + "WARNING: No training data specified\n", + "using world size: 1 and model-parallel size: 1 \n", + " > using dynamic loss scaling\n", + "> initializing model parallel with size 1\n", + "> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234\n", + "prepare tokenizer done, size 50264\n", + "building GPT3 model ...\n", + " > number of parameters on model parallel rank 0: 125231616\n", + "Load checkpoint from model/\n", + "global rank 0 is loading checkpoint model/iter_0002000/mp_rank_00/model_optim_rng.pt\n", + " successfully loaded model/iter_0002000/mp_rank_00/model_optim_rng.pt\n", + "Loaded\n", + "\n", + "Context prompt (stop to exit) >>> Тема: «Создает человека природа, но развивает и образует его общество». (В.Т. Белинский)\\nСочинение: \n", + "\u001b[H\u001b[2J\n", + "Taken time 18.70\n", + "\n", + "\n", + "Context: Тема: «Создает человека природа, но развивает и образует его общество». (В.Т. Белинский)\\nСочинение: \n", + "\n", + "GPT: Тема: «Создает человека природа, но развивает и образует его общество». (В.Т. Белинский)\\nСочинение: С. Паркинсон - американский публицист, политический деятель. Он понимает, что политика - это выбор между гибельным и неприятным. Как часто приходится политикам делать этот выбор. Порой, действительно, трудно обманывать ожидания своих избирателей, народа, но это бывает необходимо. Как сказал автор высказывания, чтобы власть не обманула ожидания народа. Я согласен с автором. Люди, получившие в результате неправильный выбор, часто забывают о тех, кто живет на другом конце земного шара. Угасая, они забывают о том, что на планете существуют «статусные» организации, которые могут нарушить правила человеческого бытия. Это организации, которые на терпящем бедствие корабле спасают в первую очередь женщин и детей, а потом спасаются сами. Это страны, в которых не выполняется функция «расширения территории» (вооруженные силы, милиция, пограничная стража). В демократических странах сейчас действует многопартийная система, которая соответствует международным документам о политических правах человека. Система политического права – многопартийная. Она подразумевает, что каждый гражданин несет ответственность за свой выбор, и потому каждый должен ориентироваться в политике, быть активным участником общественной жизни страны. Из философско-правовых понятий « политика» и « права» известно, что политика - это выбор между гибельным и неприятным. Как часто приходится политикам делать этот выбор. 
Порой, действительно, трудно обманывать ожидания своих избирателей, народа, но это бывает необходимо. Из других философских понятий: «Политика>> Traceback (most recent call last):\n", + " File \"ru-gpts/generate_samples.py\", line 204, in \n", + " main()\n", + " File \"ru-gpts/generate_samples.py\", line 200, in main\n", + " generate_samples(model, tokenizer, args)\n", + " File \"ru-gpts/generate_samples.py\", line 106, in generate_samples\n", + " raw_text = input(\"\\nContext prompt (stop to exit) >>> \")\n", + "KeyboardInterrupt\n", + "^C\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VCapfDfeBq0x" + }, + "source": [ + "### Convert checkpoint to Huggingface format" + ] + }, + { + "cell_type": "code", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "4JnhIyqd-Eeo", + "outputId": "89877660-45b2-4a6a-a9de-24259a28ee87" + }, + "source": [ + "!export PYTHONPATH=${PYTHONPATH}:${HOME}/ru-gpts\n", + "\n", + "!python ru-gpts/convert2huggingface.py \\\n", + " --load model/ \\\n", + " --model-parallel-size 1 \\\n", + " --num-layers 12 \\\n", + " --hidden-size 768 \\\n", + " --num-attention-heads 12 \\\n", + " --max-position-embeddings 2048 \\\n", + " --tokenizer-path sberbank-ai/rugpt3small_based_on_gpt2 \\\n", + " --no-load-optim \\\n", + " --export-huggingface model_hf\n" + ], + "execution_count": 20, + "outputs": [ + { + "output_type": "stream", + "text": [ + "2021-02-11 22:36:01.838720: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1\n", + "WARNING: No training data specified\n", + "using world size: 1 and model-parallel size: 1 \n", + " > using dynamic loss scaling\n", + "> initializing model parallel with size 1\n", + "prepare tokenizer done, size 50264\n", + "building GPT3 model ...\n", + " > number of parameters on model parallel rank 0: 125231616\n", + "Load checkpoint from model/\n", + "global rank 0 is loading checkpoint model/iter_0002000/mp_rank_00/model_optim_rng.pt\n", + " successfully loaded model/iter_0002000/mp_rank_00/model_optim_rng.pt\n", + "Loaded\n", + "Export to huggingface model model_hf with config {'vocab_size': 50264, 'n_positions': 2048, 'n_ctx': 2048, 'n_embd': 768, 'n_layer': 12, 'n_head': 12}\n", + "Saved huggingface model \n", + "Exported in huggingface format to model_hf\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KRlEwlPdE0L8" + }, + "source": [ + "#### Test load" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5U81i24aEEm0" + }, + "source": [ + "from transformers import GPT2LMHeadModel" + ], + "execution_count": 21, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "eBRatZnJEcCX" + }, + "source": [ + "model = GPT2LMHeadModel.from_pretrained(\"model_hf\")" + ], + "execution_count": 22, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/examples/ruGPT3XL_generation.ipynb b/examples/ruGPT3XL_generation.ipynb new file mode 100644 index 0000000..72f46de --- /dev/null +++ b/examples/ruGPT3XL_generation.ipynb @@ -0,0 +1,689 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": 
"3.7.8" + }, + "colab": { + "name": "ruGPT3XL_generation", + "provenance": [], + "collapsed_sections": [] + }, + "accelerator": "GPU" + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "q4tnxQoRogVV" + }, + "source": [ + "# Install env" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JfKxFhRPoWNv" + }, + "source": [ + "### Install Apex" + ] + }, + { + "cell_type": "code", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "QW1w_w0itrLP", + "outputId": "d5b83680-0fe3-4122-885d-5caf3385da91" + }, + "source": [ + "%%writefile setup.sh\n", + "\n", + "export CUDA_HOME=/usr/local/cuda-10.1\n", + "git clone https://github.com/NVIDIA/apex\n", + "pip install -v --no-cache-dir --global-option=\"--cpp_ext\" --global-option=\"--cuda_ext\" ./apex" + ], + "execution_count": 1, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Writing setup.sh\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4Hwo1XzYts5a" + }, + "source": [ + "!sh setup.sh" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MMXQ-moaoayT" + }, + "source": [ + "### Install triton" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "F4b2CBehhQX7" + }, + "source": [ + "!apt-get install llvm-9-dev" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "gbWxEvnN02bY" + }, + "source": [ + "!pip install cpufeature" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "euv2A1weyZ5p" + }, + "source": [ + "!pip install triton==0.2.3" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IId2GBmCod9A" + }, + "source": [ + "### Install DeepSpeed" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hmx7xwp_Kmz6" + }, + "source": [ + "!DS_BUILD_CPU_ADAM=1 DS_BUILD_SPARSE_ATTN=1 pip install deepspeed==0.3.7" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "47RnjoAJolAc" + }, + "source": [ + "#### Test installation: we should have the following output" + ] + }, + { + "cell_type": "code", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "hRR8wi7kK3dO", + "outputId": "86074d56-214b-4dd6-dac3-0f4091e9c5c1" + }, + "source": [ + "!ds_report" + ], + "execution_count": 7, + "outputs": [ + { + "output_type": "stream", + "text": [ + "--------------------------------------------------\n", + "DeepSpeed C++/CUDA extension op report\n", + "--------------------------------------------------\n", + "NOTE: Ops not installed will be just-in-time (JIT) compiled at\n", + " runtime if needed. Op compatibility means that your system\n", + " meet the required dependencies to JIT install the op.\n", + "--------------------------------------------------\n", + "JIT compiled ops requires ninja\n", + "ninja .................. \u001b[92m[OKAY]\u001b[0m\n", + "--------------------------------------------------\n", + "op name ................ installed .. compatible\n", + "--------------------------------------------------\n", + "cpu_adam ............... \u001b[92m[YES]\u001b[0m ...... \u001b[92m[OKAY]\u001b[0m\n", + "fused_adam ............. \u001b[93m[NO]\u001b[0m ....... \u001b[92m[OKAY]\u001b[0m\n", + "fused_lamb ............. \u001b[93m[NO]\u001b[0m ....... \u001b[92m[OKAY]\u001b[0m\n", + "sparse_attn ............ 
\u001b[92m[YES]\u001b[0m ...... \u001b[92m[OKAY]\u001b[0m\n", + "transformer ............ \u001b[93m[NO]\u001b[0m ....... \u001b[92m[OKAY]\u001b[0m\n", + "stochastic_transformer . \u001b[93m[NO]\u001b[0m ....... \u001b[92m[OKAY]\u001b[0m\n", + "utils .................. \u001b[93m[NO]\u001b[0m ....... \u001b[92m[OKAY]\u001b[0m\n", + "--------------------------------------------------\n", + "DeepSpeed general environment info:\n", + "torch install path ............... ['/usr/local/lib/python3.6/dist-packages/torch']\n", + "torch version .................... 1.7.0+cu101\n", + "torch cuda version ............... 10.1\n", + "nvcc version ..................... 10.1\n", + "deepspeed install path ........... ['/usr/local/lib/python3.6/dist-packages/deepspeed']\n", + "deepspeed info ................... 0.3.7, unknown, unknown\n", + "deepspeed wheel compiled w. ...... torch 1.7, cuda 10.1\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HDyVwahMCLKR" + }, + "source": [ + "# And this cell should be run without errors\n", + "import deepspeed.ops.sparse_attention.sparse_attn_op" + ], + "execution_count": 8, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sbNKoC8mo0_S" + }, + "source": [ + "### Download repo and install other libs" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "t43yH5k1jtZZ", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "e1cb348a-3ba6-4c11-9df9-ed74f7e1555d" + }, + "source": [ + "!git clone https://github.com/sberbank-ai/ru-gpts.git" + ], + "execution_count": 5, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Cloning into 'ru-gpts'...\n", + "remote: Enumerating objects: 77, done.\u001b[K\n", + "remote: Counting objects: 100% (77/77), done.\u001b[K\n", + "remote: Compressing objects: 100% (58/58), done.\u001b[K\n", + "remote: Total 325 (delta 38), reused 47 (delta 19), pack-reused 248\u001b[K\n", + "Receiving objects: 100% (325/325), 263.91 KiB | 8.80 MiB/s, done.\n", + "Resolving deltas: 100% (162/162), done.\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "H2XiJvm_tQgL" + }, + "source": [ + "!pip install transformers==3.5.1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "gJLYfiOitYNx" + }, + "source": [ + "!pip install natsort" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7TbybJfIpBVa" + }, + "source": [ + "# Test model" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rU8lvJHAjpPQ" + }, + "source": [ + "### Load model" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "EnTy1SEajpPV" + }, + "source": [ + "import warnings\n", + "warnings.filterwarnings(\"ignore\")" + ], + "execution_count": 6, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Z-DSEz0ljpPV" + }, + "source": [ + "import sys\n", + "sys.path.append(\"ru-gpts/gw\")" + ], + "execution_count": 7, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "_HeCFnJEjpPV" + }, + "source": [ + "from generation_wrapper import RuGPT3XL" + ], + "execution_count": 8, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OdyughHDjpPV" + }, + "source": [ + "Note! seq_len is the max sequence length used in the generation process. The max available seq_len is 2048 (in tokens).\n", + "Also, inference takes around 10 GB of GPU memory."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "scrolled": true, + "id": "56aNJNPYjpPW", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "30a94066-8356-4dd6-fe59-8e728eba59a8" + }, + "source": [ + "gpt = RuGPT3XL.from_pretrained(\"sberbank-ai/rugpt3xl\", seq_len=512)" + ], + "execution_count": 9, + "outputs": [ + { + "output_type": "stream", + "text": [ + "> initializing model parallel with size 1\n", + "> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234\n", + "Use alternating sparse & dense attention layers\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7qR_3287jpPW" + }, + "source": [ + "### Get logits" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9yfRylnnjpPW" + }, + "source": [ + "logits = gpt(\"Кто был президентом США в 2020? \").logits" + ], + "execution_count": 12, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "BHcwcMwHjpPX", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "2f481a21-4995-498f-ee1f-b0c195f6d794" + }, + "source": [ + "type(logits), logits.shape" + ], + "execution_count": 13, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(torch.Tensor, torch.Size([1, 8, 50264]))" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 13 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J7a29eADjpPX" + }, + "source": [ + "### Get loss" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "q44D1VlojpPX" + }, + "source": [ + "input_ids = [gpt.tokenizer(\"Кто был президентом США в 2020? \")['input_ids']]\n", + "labels = input_ids" + ], + "execution_count": 14, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "9zIpNDegjpPX" + }, + "source": [ + "import torch\n", + "\n", + "\n", + "with torch.no_grad():\n", + " loss = gpt(input_ids=input_ids, labels=labels).loss" + ], + "execution_count": 15, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "WcYvXFsBjpPY", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "eaab922b-b74e-4517-c72e-1c05d017fdff" + }, + "source": [ + "loss" + ], + "execution_count": 16, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[tensor(4.3908, device='cuda:0')]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 16 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Lgi-JUNijpPY" + }, + "source": [ + "### Simple generation" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Z0AIfKPPjpPY" + }, + "source": [ + "def filter_resuls(nr):\n", + " return [x[:x.find(\"<|endoftext|>\")] for x in nr]" + ], + "execution_count": 17, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aHWixkU3jpPY" + }, + "source": [ + "Greedy decoding" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wla741VxjpPY", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "6b3d93d0-b6c9-4a97-d645-82a1fdd6b21d" + }, + "source": [ + "filter_resuls(gpt.generate(\n", + " \"Кто был президентом США в 2020? \",\n", + " max_length=50,\n", + " no_repeat_ngram_size=3,\n", + " repetition_penalty=2.,\n", + "))" + ], + "execution_count": 18, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "['Кто был президентом США в 2020? 
\\nВ этом году выборы президента Соединенных Штатов Америки пройдут уже через несколько дней. И, как и всегда на протяжении последних лет (а это не первый раз), кандидаты будут бороться за право стать главой государства с помощь']" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 18 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "I_ltaza-jpPZ" + }, + "source": [ + "sample" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3oPE1lP2jpPZ", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "9d4040fd-2beb-4efd-96f6-c44ebd260c31" + }, + "source": [ + "filter_resuls(gpt.generate(\n", + " \"Кто был президентом США в 2020? \", do_sample=True, num_return_sequences=5,\n", + " max_length=50,\n", + " no_repeat_ngram_size=3,\n", + " repetition_penalty=2.,\n", + "))" + ], + "execution_count": 19, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "['Кто был президентом США в 2020? \\nНовым, на первый взгляд странным и непонятным образом определилось будущее Америки. За несколько часов до конца таймера ожидания выборов президента уже ясно было - кто займет президентский пост после окончания голосования – Дональд Трам',\n", + " 'Кто был президентом США в 2020? \\n\"Я никогда не выйду живым из этого леса\",- говорит главный герой фильма \"Зеленая миля\". Услышав это, начинаешь задумываться: а так ли хорош твой опыт жизни на другой стороне земного шара',\n", + " 'Кто был президентом США в 2020? \\nВы хотите знать, как будет выглядеть ваш мир через сорок лет после того срока своего президентства уходящего следующего президента? Посмотрите на эти четыре изображения и попробуйте угадать кто бы это мог быть. \\n Вот первый предполаг',\n", + " 'Кто был президентом США в 2020? \\nВ интернете появился новый опрос на тему, которую так трудно назвать \"демократичной\". Народ хочет знать о том. какие президенты стояли у руля их страны и за сколько часов до собственной смерти они успевали подписат',\n", + " 'Кто был президентом США в 2020? \\nВот, мы знаем. Кто президент России с 2018 — точно: Владимир Владимирович Путин наш дорогой и любимый (пока не отлучим его по суду). У остальных тоже есть свои любимчики... или там ненав']" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 19 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9FbokoJDjpPZ" + }, + "source": [ + "### Top_k top_p filtering" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8L-CF6uLjpPZ", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "6bd8a3d2-a751-43e7-9770-6378cbd4050e" + }, + "source": [ + "filter_resuls(gpt.generate(\n", + " \"Александр Сергеевич Пушкин родился в \",\n", + " top_k=5,\n", + " top_p=0.95,\n", + " temperature=1.2,\n", + " num_return_sequences=5,\n", + " do_sample=True,\n", + " max_length=50,\n", + " no_repeat_ngram_size=3,\n", + " repetition_penalty=2.,\n", + "))" + ], + "execution_count": 20, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "['Александр Сергеевич Пушкин родился в \\nМоскве. В 1799 г., после того как его отец, отставной поручик лейб-гвардии Преображенского полка Александр Иванович (17451816), женился на вдове капитана Екатерине Петро',\n", + " 'Александр Сергеевич Пушкин родился в \\nМоскве, а его отец - Александр Иванович Глинка (1786-1831) служил капельмейстером при дворе императора Павла I. В 1811 г. 
семья Пушкиных переехала во Владимирскую губерни',\n", + " 'Александр Сергеевич Пушкин родился в \\n1817 году. Его отец – Александр Иванович, служил чиновником при министерстве внутренних дел Российской империи и умер рано; его мать Мария Алексеевна Ганнибал (урожденная Пушкина), урождённая Энгельга',\n", + " 'Александр Сергеевич Пушкин родился в \\n1817 г. (по другим сведениям 1820). В детстве и ранней юности жил с родителями за границей, учился у лучших педагогов Франции - Лагарпа, Жозефа Мари Ашара. С ранних лет увле',\n", + " 'Александр Сергеевич Пушкин родился в \\n1799 г. Его отец, Александр Матвеевич Ганнибал (ум.-1831), был родом из деревни Слепушкино Тверской губернии; дед - отставной майор Иван Петрович Гамильтон-Пушки']" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 20 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "scs7xKdhjpPZ" + }, + "source": [ + "### Beamsearch" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7Qw65CVzjpPZ", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "a07008ae-efd4-4673-dee5-c3350890a56f" + }, + "source": [ + "filter_resuls(gpt.generate(\n", + " text=\"Александр Сергеевич Пушкин родился в \",\n", + " max_length=50,\n", + " num_beams=10,\n", + " no_repeat_ngram_size=3,\n", + " repetition_penalty=2.,\n", + " num_return_sequences=5,\n", + "))" + ], + "execution_count": 21, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "['Александр Сергеевич Пушкин родился в \\n1799 году в селе Михайловском Псковской губернии. Его отец, \\nАлександр Львович Пушкин, происходил из старинного \\nдворянского рода. Мать, урожденная Ганнибал, был',\n", + " 'Александр Сергеевич Пушкин родился в \\n1799 году в селе Михайловском Псковской губернии. Его отец, \\nАлександр Львович Пушкин, происходил из старинного \\nдворянского рода. Мать поэта, Мария Алексеевна Ганнибал',\n", + " 'Александр Сергеевич Пушкин родился в \\n1799 году в селе Михайловском Псковской губернии. Его отец, \\nАлександр Львович Пушкин, происходил из старинного \\nдворянского рода. Мать поэта, Наталья Николаевна \\nПушкин',\n", + " 'Александр Сергеевич Пушкин родился в \\n1799 году в селе Михайловском Псковской губернии. Его отец, \\nАлександр Львович Пушкин, происходил из старинного \\nдворянского рода. Мать, урождённая Ганнибал, был',\n", + " 'Александр Сергеевич Пушкин родился в \\n1799 году в селе Михайловском Псковской губернии. Его отец, \\nАлександр Львович Пушкин, происходил из старинного \\nдворянского рода. 
Мать, урожденная Ганнибал,']" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 21 + } + ] + } + ] +} \ No newline at end of file diff --git a/src/download_utils.py b/src/download_utils.py new file mode 100644 index 0000000..9974cd2 --- /dev/null +++ b/src/download_utils.py @@ -0,0 +1,73 @@ +import os + +from transformers.file_utils import ( + cached_path, + hf_bucket_url, + is_remote_url, +) +from transformers.utils import logging + +logger = logging.get_logger(__name__) +WEIGHTS_NAME = "mp_rank_00_model_states.pt" +DEEPSPEED_CONFIG_NAME = "deepspeed_config.json" + + +def download_model_files(pretrained_model_name_or_path): + weights_path = download_file_from_hf(pretrained_model_name_or_path, WEIGHTS_NAME) + deepspeed_config_path = download_file_from_hf(pretrained_model_name_or_path, DEEPSPEED_CONFIG_NAME) + return weights_path, deepspeed_config_path + + +def download_file_from_hf(pretrained_model_name_or_path: str, file_name: str) -> str: + # Load model + if pretrained_model_name_or_path is not None: + if os.path.isdir(pretrained_model_name_or_path): + if os.path.isfile(os.path.join(pretrained_model_name_or_path, file_name)): + # Load from a PyTorch checkpoint + archive_file = os.path.join(pretrained_model_name_or_path, file_name) + else: + raise EnvironmentError( + "Error no file named {} found in directory {}".format( + file_name, + pretrained_model_name_or_path, + ) + ) + elif os.path.isfile(pretrained_model_name_or_path) or is_remote_url(pretrained_model_name_or_path): + archive_file = pretrained_model_name_or_path + else: + archive_file = hf_bucket_url( + pretrained_model_name_or_path, + filename=file_name, + revision=None, + mirror=None, + ) + + try: + # Load from URL or cache if already cached + resolved_archive_file = cached_path( + archive_file, + cache_dir=None, + force_download=False, + proxies=None, + resume_download=False, + local_files_only=False, + ) + except EnvironmentError as err: + logger.error(err) + msg = ( + f"Can't load weights for '{pretrained_model_name_or_path}'. 
Make sure that:\n\n" + f"- '{pretrained_model_name_or_path}' is a correct model identifier listed on " + f"'https://huggingface.co/models'\n\n" + f"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a " + f"file named {file_name}.\n\n" + ) + raise EnvironmentError(msg) + + if resolved_archive_file == archive_file: + logger.info("loading weights file {}".format(archive_file)) + else: + logger.info("loading weights file {} from cache at {}".format(archive_file, resolved_archive_file)) + else: + resolved_archive_file = None + + return resolved_archive_file diff --git a/src/xl_wrapper.py b/src/xl_wrapper.py new file mode 100644 index 0000000..0d04f52 --- /dev/null +++ b/src/xl_wrapper.py @@ -0,0 +1,294 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- +import os +import random +from typing import Union, Iterable + +import numpy as np +import torch +from deepspeed import DeepSpeedConfig +from torch.nn import CrossEntropyLoss +from transformers import GPT2Tokenizer, PreTrainedModel, PretrainedConfig + +from src import mpu +from .fp16 import FP16_Module +from .model import GPT3Model +from .download_utils import download_model_files +from transformers.utils import logging + + +logger = logging.get_logger(__name__) +NoneType = type(None) + + +def get_deepspeed_config(path): + return DeepSpeedConfig(path) + + +def get_sparse_attention_config(path, num_heads): + ds_config = get_deepspeed_config(path) + if hasattr(ds_config, 'sparse_attention') and ds_config.sparse_attention: + sa_config = ds_config.sparse_attention + sa_mode = sa_config.get('mode') + if sa_mode == 'dense': + from deepspeed.ops.sparse_attention import DenseSparsityConfig as STConfig + elif sa_mode == 'fixed': + from deepspeed.ops.sparse_attention import FixedSparsityConfig as STConfig + elif sa_mode == 'bigbird': + from deepspeed.ops.sparse_attention import BigBirdSparsityConfig as STConfig + elif sa_mode == 'bslongformer': + from deepspeed.ops.sparse_attention import BSLongformerSparsityConfig as STConfig + elif sa_mode == 'variable': + from deepspeed.ops.sparse_attention import VariableSparsityConfig as STConfig + else: + raise NotImplementedError( + f'Given sparsity mode, {sa_mode}, has not been implemented yet!' + ) + del sa_config['mode'] + return STConfig(num_heads=num_heads, **sa_config) + else: + return None + + +def get_model(deepspeed_config_path): + num_local_heads = 16 + sparse_mode = 'alternating' + deepspeed_sparsity_config = get_sparse_attention_config(deepspeed_config_path, num_local_heads) + if deepspeed_sparsity_config is not None: + logger.info(f"Use sparse attention with mode {sparse_mode}") + else: + logger.info("Use dense attention") + model = GPT3Model(num_layers=24, + vocab_size=50264, + hidden_size=2048, + num_attention_heads=num_local_heads, + embedding_dropout_prob=0.1, attention_dropout_prob=0.1, output_dropout_prob=0.1, + max_sequence_length=2048, + checkpoint_activations=False, + checkpoint_num_layers=1, + parallel_output=False, + deepspeed_sparsity_config=deepspeed_sparsity_config, + sparse_mode=sparse_mode) + # GPU allocation. + model.cuda(torch.cuda.current_device()) + + # Fp16 conversion.
+ model = FP16_Module(model) + + return model + + +def setup_model(weights_path, deepspeed_config_path): + model = get_model(deepspeed_config_path) + logger.info("Load checkpoint from " + weights_path) + checkpoint = torch.load(weights_path, map_location=lambda storage, loc: storage)['module'] + model.load_state_dict(checkpoint) + model.eval() + logger.info("Model Loaded") + return model + + +def get_masks_and_position_ids(data, + eod_token, + reset_position_ids, + reset_attention_mask): + # Extract batch size and sequence length. + batch_size, seq_length = data.size() + + # Attention mask (lower triangular). + if reset_attention_mask: + att_mask_batch = batch_size + else: + att_mask_batch = 1 + attention_mask = torch.tril(torch.ones( + (att_mask_batch, seq_length, seq_length), device=data.device)).view( + att_mask_batch, 1, seq_length, seq_length) + + # Loss mask. + loss_mask = torch.ones(data.size(), dtype=torch.float, device=data.device) + loss_mask[data == eod_token] = 0.0 + + # Position ids. + position_ids = torch.arange(seq_length, dtype=torch.long, + device=data.device) + position_ids = position_ids.unsqueeze(0).expand_as(data) + # We need to clone as the ids will be modified based on batch index. + if reset_position_ids: + position_ids = position_ids.clone() + + if reset_position_ids or reset_attention_mask: + # Loop through the batches: + for b in range(batch_size): + + # Find indices where EOD token is. + eod_index = position_ids[b, data[b] == eod_token] + # Detach indices from positions if going to modify positions. + if reset_position_ids: + eod_index = eod_index.clone() + + # Loop through EOD indices: + prev_index = 0 + for j in range(eod_index.size()[0]): + i = eod_index[j] + # Mask attention loss. + if reset_attention_mask: + attention_mask[b, 0, (i + 1):, :(i + 1)] = 0 + # Reset positions.
+ if reset_position_ids: + position_ids[b, (i + 1):] -= (i + 1 - prev_index) + prev_index = i + 1 + + return attention_mask, loss_mask, position_ids + + +class ModelOutput(object): + def __init__(self, logits, loss=None): + self.logits = logits + self.loss = loss + + def __getitem__(self, key): + if key == "logits": + return self.logits + raise StopIteration + + +class RuGPT3XL(PreTrainedModel): + def __init__(self, model, tokenizer, model_path, seq_len=512): + super().__init__(PretrainedConfig()) + self.model = model + self.pad_token_id = tokenizer.encoder['<pad>'] + self.eos_token_id = tokenizer.encoder['<|endoftext|>'] + self.seq_len = seq_len + self.model_path = model_path + self.tokenizer = tokenizer + + @classmethod + def from_pretrained(cls, model_name_or_path, seq_len=512): + init_method = 'tcp://' + os.getenv('MASTER_ADDR', 'localhost') + ':' + os.getenv('MASTER_PORT', '6000') + torch.distributed.init_process_group(backend='nccl', world_size=1, rank=0, init_method=init_method) + mpu.initialize_model_parallel(1) + + seed = 1234 + random.seed(seed) + np.random.seed(seed) + torch.manual_seed(seed) + mpu.model_parallel_cuda_manual_seed(seed) + tokenizer = GPT2Tokenizer.from_pretrained(model_name_or_path) + logger.info("Check cached model files...") + weights_path, deepspeed_config_path = download_model_files(model_name_or_path) + model = setup_model(weights_path, deepspeed_config_path) + model.cuda() + model = model.eval() + return cls(model, tokenizer=tokenizer, seq_len=seq_len, model_path=model_name_or_path) + + def prepare_inputs_for_generation(self, input_ids: torch.LongTensor, **kwargs): + kwargs.update({"input_ids": input_ids}) + return kwargs + + def generate( + self, text: Union[str, NoneType] = None, + input_ids: Union[torch.LongTensor, NoneType] = None, + max_length: Union[int, NoneType] = None, + min_length: Union[int, NoneType] = None, + do_sample: Union[bool, NoneType] = None, + early_stopping: Union[bool, NoneType] = None, + num_beams: Union[int, NoneType] = None, + temperature: Union[float, NoneType] = None, + top_k: Union[int, NoneType] = None, + top_p: Union[float, NoneType] = None, + repetition_penalty: Union[float, NoneType] = None, + bad_words_ids: Union[Iterable[int], NoneType] = None, + bos_token_id: Union[int, NoneType] = None, + pad_token_id: Union[int, NoneType] = None, + eos_token_id: Union[int, NoneType] = None, + length_penalty: Union[float, NoneType] = None, + no_repeat_ngram_size: Union[int, NoneType] = None, + num_return_sequences: Union[int, NoneType] = None, + decoder_start_token_id: Union[int, NoneType] = None, + use_cache: Union[bool, NoneType] = None, + **model_kwargs): + if text is not None: + input_ids = torch.cuda.LongTensor([self.tokenizer(text)['input_ids']]) + if eos_token_id is None: + eos_token_id = self.eos_token_id + if pad_token_id is None: + pad_token_id = self.pad_token_id + res = super().generate( + input_ids=input_ids, + max_length=max_length, + min_length=min_length, + do_sample=do_sample, + early_stopping=early_stopping, + num_beams=num_beams, + temperature=temperature, + top_k=top_k, + top_p=top_p, + repetition_penalty=repetition_penalty, + bad_words_ids=bad_words_ids, + bos_token_id=bos_token_id, + pad_token_id=pad_token_id, + eos_token_id=eos_token_id, + length_penalty=length_penalty, + no_repeat_ngram_size=no_repeat_ngram_size, + num_return_sequences=num_return_sequences, + decoder_start_token_id=decoder_start_token_id, + use_cache=use_cache, + **model_kwargs + ) + return list(map(self.tokenizer.decode, res.tolist())) + + def 
__call__(self, text=None, input_ids=None, labels=None, **kwargs): + if input_ids is None: + if text is None: + text = "" + input_ids = torch.cuda.LongTensor([self.tokenizer(text)['input_ids']]) + if isinstance(input_ids, list): + input_ids = torch.cuda.LongTensor(input_ids) + if isinstance(labels, list): + labels = torch.cuda.LongTensor(labels) + res = [] + if labels is not None: + lbls = labels + else: + lbls = [None] * len(input_ids) + loss = None + original_context_length = 0 + for tokens, lbl in zip(input_ids, lbls): + context_tokens = tokens.tolist() + context_length = len(context_tokens) + original_context_length = len(context_tokens) + if context_length < self.seq_len: + context_tokens.extend([self.pad_token_id] * (self.seq_len - context_length)) + if labels is not None: + lbl = lbl.tolist() + lbl.extend([self.pad_token_id] * (self.seq_len - context_length)) + lbl = torch.cuda.LongTensor(lbl) + context_tokens_tensor = torch.cuda.LongTensor(context_tokens) + context_length_tensor = torch.cuda.LongTensor([context_length]) + + torch.distributed.broadcast(context_length_tensor, mpu.get_model_parallel_src_rank(), + group=mpu.get_model_parallel_group()) + torch.distributed.broadcast(context_tokens_tensor, mpu.get_model_parallel_src_rank(), + group=mpu.get_model_parallel_group()) + + # context_length = context_length_tensor[0].item() + + tokens = context_tokens_tensor + tokens = tokens.view(1, -1).contiguous() + tokens = tokens.to(torch.cuda.current_device()) + attention_mask, loss_mask, position_ids = get_masks_and_position_ids(tokens, self.pad_token_id, False, + False) + lm_logits = self.model(tokens, position_ids, attention_mask) + loss = None + if labels is not None: + # Shift so that tokens < n predict n + shift_logits = lm_logits[..., :-1, :].contiguous() + shift_labels = lbl[..., 1:].contiguous() + # Flatten the tokens + loss_fct = CrossEntropyLoss(ignore_index=self.pad_token_id) + loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)) + res.append((lm_logits, loss)) + logits = torch.cat([x[0] for x in res], dim=0)[:, : original_context_length, :] + if loss is not None: + loss = [x[1] for x in res] + return ModelOutput(logits, loss)
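For reference, a minimal usage sketch of the wrapper above, mirroring the calls and shapes shown in the generation notebook; it assumes the repo root is on `PYTHONPATH` and a CUDA device is available:

```python
# Minimal sketch: score a prompt with the RuGPT3XL wrapper defined above.
from src.xl_wrapper import RuGPT3XL

gpt = RuGPT3XL.from_pretrained("sberbank-ai/rugpt3xl", seq_len=512)

# __call__ pads each sample to seq_len, broadcasts it to the model-parallel
# group, and returns logits truncated back to the original context length.
prompt_ids = [gpt.tokenizer("Кто был президентом США в 2020? ")["input_ids"]]
out = gpt(input_ids=prompt_ids, labels=prompt_ids)
print(out.logits.shape)  # torch.Size([1, <prompt length>, 50264])
print(out.loss)          # a list with one LM-loss tensor per sample
```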