diff --git a/homeworks/hw3_bert/hw_attention_and_bert.ipynb b/homeworks/hw3_bert/hw_attention_and_bert.ipynb new file mode 100644 index 0000000..43da786 --- /dev/null +++ b/homeworks/hw3_bert/hw_attention_and_bert.ipynb @@ -0,0 +1,596 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "d755d457", + "metadata": {}, + "source": [ + "\n", + "## Grading Criteria\n", + "\n", + "**Maximum Score: 10 points**\n", + "\n", + "1. **Step #1: Implementation of Attention Mechanisms (4 points)** \n", + " - 2 points for implementing the `additive` attention mechanism correctly.\n", + " - 2 points for implementing the `multiplicative` attention mechanism correctly.\n", + "\n", + "2. **Step #2: BERT-based Text Classification Task (4 points)** \n", + " - 2 points for setting up the BERT model correctly for classification.\n", + " - 2 points for evaluating the model performance accurately.\n", + "\n", + "3. **Code Quality and Comments (2 points)** \n", + " - 1 point for code clarity and logical structuring of functions and classes.\n", + " - 1 point for detailed comments explaining each part of the code.\n", + "\n", + "**Total: 10 points** \n", + "Each section will be reviewed to ensure that all requirements are met and that the code is efficient and well-documented. Pay attention to using proper variable names and providing comments that describe the purpose of each function and major code sections.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jNtLJlW4v5VF" + }, + "source": [ + "## Attention & BERT\n", + "\n", + "For this homework assignment, your goal is to delve into the Attention mechanism (implementing several of its variants) and revisit the text classification task, this time solving it with BERT.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import random\n", + "\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "from sklearn.metrics import roc_auc_score, roc_curve, accuracy_score\n", + "\n", + "import torch\n", + "import torch.nn as nn\n", + "import torch.nn.functional as F\n", + "\n", + "import matplotlib.pyplot as plt\n", + "from IPython.display import clear_output \n", + "# Display inline matplotlib plots\n", + "%matplotlib inline" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step #1. Implementation of Attention\n", + "\n", + "In this task, please implement the Attention mechanism, specifically several methods for calculating attention scores. While this mechanism is already implemented in popular frameworks, you'll implement it using `numpy` to gain a better understanding.\n", + "\n", + "Your task in this part: implement `additive` and `multiplicative` variants of Attention. For your convenience (and as an example), the `dot product` attention (based on scalar product) is already implemented.\n", + "\n", + "Detailed descriptions of these types of Attention are available in the lecture slides." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "decoder_hidden_state = np.array([7, 11, 4]).astype(float)[:, None]\n", + "\n", + "plt.figure(figsize=(2, 5))\n", + "plt.pcolormesh(decoder_hidden_state)\n", + "plt.colorbar()\n", + "plt.title(\"Decoder state\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Dot product attention (example implementation)\n", + "\n", + "Let's consider a single encoder state – a vector with dimensions `(n_hidden, 1)`, where `n_hidden = 3`:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "single_encoder_hidden_state = np.array([1, 5, 11]).astype(float)[:, None]\n", + "\n", + "plt.figure(figsize=(2, 5))\n", + "plt.pcolormesh(single_encoder_hidden_state)\n", + "plt.colorbar()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The attention score between these encoder and decoder states is simply calculated as a dot product:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "np.dot(decoder_hidden_state.T, single_encoder_hidden_state)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the general case, there are, of course, multiple encoder states. Attention scores are computed with each encoder state:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "encoder_hidden_states = (\n", + " np.array([[1, 5, 11], [7, 4, 1], [8, 12, 2], [-9, 0, 1]]).astype(float).T\n", + ")\n", + "\n", + "encoder_hidden_states" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Then, to calculate the dot products between a single decoder state and all encoder states, we can use the following function (which is essentially just matrix multiplication and type conversion):\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def dot_product_attention_score(decoder_hidden_state, encoder_hidden_states):\n", + " \"\"\"\n", + " decoder_hidden_state: np.array of shape (n_features, 1)\n", + " encoder_hidden_states: np.array of shape (n_features, n_states)\n", + "\n", + " return: np.array of shape (1, n_states)\n", + " Array with dot product attention scores\n", + " \"\"\"\n", + " attention_scores = np.dot(decoder_hidden_state.T, encoder_hidden_states)\n", + " return attention_scores" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dot_product_attention_score(decoder_hidden_state, encoder_hidden_states)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To calculate the \"weights,\" we need Softmax:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def softmax(vector):\n", + " \"\"\"\n", + " vector: np.array of shape (n, m)\n", + "\n", + " return: np.array of shape (n, m)\n", + " Matrix where softmax is computed for every row independently\n", + " \"\"\"\n", + " nice_vector = vector - vector.max()\n", + " exp_vector = np.exp(nice_vector)\n", + " exp_denominator = np.sum(exp_vector, axis=1)[:, np.newaxis]\n", + " softmax_ = exp_vector / exp_denominator\n", + " return softmax_" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "weights_vector = 
softmax(\n", + " dot_product_attention_score(decoder_hidden_state, encoder_hidden_states)\n", + ")\n", + "\n", + "weights_vector" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, we'll use these weights and compute the final vector, as described for dot product attention.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "attention_vector = weights_vector.dot(encoder_hidden_states.T).T\n", + "print(attention_vector)\n", + "\n", + "plt.figure(figsize=(2, 5))\n", + "plt.pcolormesh(attention_vector, cmap=\"spring\")\n", + "plt.colorbar()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This vector accumulates information from all encoder states, weighted based on proximity to the given decoder state. Let's implement all the above transformations in a single function:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def dot_product_attention(decoder_hidden_state, encoder_hidden_states):\n", + " \"\"\"\n", + " decoder_hidden_state: np.array of shape (n_features, 1)\n", + " encoder_hidden_states: np.array of shape (n_features, n_states)\n", + "\n", + " return: np.array of shape (n_features, 1)\n", + " Final attention vector\n", + " \"\"\"\n", + " softmax_vector = softmax(\n", + " dot_product_attention_score(decoder_hidden_state, encoder_hidden_states)\n", + " )\n", + " attention_vector = softmax_vector.dot(encoder_hidden_states.T).T\n", + " return attention_vector" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "assert (\n", + " attention_vector\n", + " == dot_product_attention(decoder_hidden_state, encoder_hidden_states)\n", + ").all()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Multiplicative attention. Your current task: implement multiplicative attention.\n", + "\n", + "$$\n", + "e_i = \\mathbf{s}^TW_{mult}\\mathbf{h}_i\n", + "$$\n", + "\n", + "The weight matrix `W_mult` is given below. 
It should be noted that multiplicative attention allows working with encoder and decoder states of different dimensions, so the encoder states will be updated:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "encoder_hidden_states_complex = (\n", + " np.array([[1, 5, 11, 4, -4], [7, 4, 1, 2, 2], [8, 12, 2, 11, 5], [-9, 0, 1, 8, 12]])\n", + " .astype(float)\n", + " .T\n", + ")\n", + "\n", + "W_mult = np.array(\n", + " [\n", + " [-0.78, -0.97, -1.09, -1.79, 0.24],\n", + " [0.04, -0.27, -0.98, -0.49, 0.52],\n", + " [1.08, 0.91, -0.99, 2.04, -0.15],\n", + " ]\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# your code here" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Implement the attention calculation according to the formulas and create the final function `multiplicative_attention`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def multiplicative_attention(decoder_hidden_state, encoder_hidden_states, W_mult):\n", + " \"\"\"\n", + " decoder_hidden_state: np.array of shape (n_features_dec, 1)\n", + " encoder_hidden_states: np.array of shape (n_features_enc, n_states)\n", + " W_mult: np.array of shape (n_features_dec, n_features_enc)\n", + "\n", + " return: np.array of shape (n_features_enc, 1)\n", + " Final attention vector\n", + " \"\"\"\n", + " # your code here\n", + " return attention_vector" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Additive attention. Now you need to implement additive attention.\n", + "\n", + "$$\n", + "e_i = \\mathbf{v}^T \\text{tanh} (W_{add-enc} \\mathbf{h}_i + W_{add-dec} \\mathbf{s})\n", + "$$\n", + "\n", + "The weight matrices `W_add_enc` and `W_add_dec` are provided below, as well as the weight vector `v_add`. For activation calculation, you can use `np.tanh`." 
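+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Both the multiplicative and the additive scores plug into the same recipe as dot-product attention: compute a score against every encoder state, apply softmax, and take the weighted sum of the encoder states. Below is a small sanity-check sketch on toy random matrices (the shapes and weights here are illustrative assumptions, not the assignment's); it reuses the `softmax` function defined above.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Sanity-check sketch of both score variants on toy random data.\n",
+    "# The shapes and weight matrices below are illustrative, not the graded ones.\n",
+    "n_dec, n_enc, n_int, n_states = 3, 5, 7, 4\n",
+    "rng = np.random.default_rng(0)\n",
+    "\n",
+    "s = rng.normal(size=(n_dec, 1))         # decoder state\n",
+    "H = rng.normal(size=(n_enc, n_states))  # encoder states stacked as columns\n",
+    "W_m = rng.normal(size=(n_dec, n_enc))   # multiplicative weight matrix\n",
+    "W_e = rng.normal(size=(n_int, n_enc))   # additive: encoder projection\n",
+    "W_d = rng.normal(size=(n_int, n_dec))   # additive: decoder projection\n",
+    "v = rng.normal(size=(n_int, 1))         # additive: scoring vector\n",
+    "\n",
+    "mult_scores = s.T @ W_m @ H                    # shape (1, n_states)\n",
+    "add_scores = v.T @ np.tanh(W_e @ H + W_d @ s)  # shape (1, n_states)\n",
+    "\n",
+    "# the rest of the pipeline is identical to the dot-product case\n",
+    "for scores in (mult_scores, add_scores):\n",
+    "    weights = softmax(scores)  # (1, n_states)\n",
+    "    context = H @ weights.T    # (n_enc, 1): weighted sum of encoder states\n",
+    "    print(context.ravel())"
+   ]
+  },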
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "v_add = np.array([[-0.35, -0.58, 0.07, 1.39, -0.79, -1.78, -0.35]]).T\n", + "\n", + "W_add_enc = np.array(\n", + " [\n", + " [-1.34, -0.1, -0.38, 0.12, -0.34],\n", + " [-1.0, 1.28, 0.49, -0.41, -0.32],\n", + " [-0.39, -1.38, 1.26, 1.21, 0.15],\n", + " [-0.18, 0.04, 1.36, -1.18, -0.53],\n", + " [-0.23, 0.96, 1.02, 0.39, -1.26],\n", + " [-1.27, 0.89, -0.85, -0.01, -1.19],\n", + " [0.46, -0.12, -0.86, -0.93, -0.4],\n", + " ]\n", + ")\n", + "\n", + "W_add_dec = np.array(\n", + " [\n", + " [-1.62, -0.02, -0.39],\n", + " [0.43, 0.61, -0.23],\n", + " [-1.5, -0.43, -0.91],\n", + " [-0.14, 0.03, 0.05],\n", + " [0.85, 0.51, 0.63],\n", + " [0.39, -0.42, 1.34],\n", + " [-0.47, -0.31, -1.34],\n", + " ]\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# your code here" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Implement the attention calculation according to the formulas and create the final function `additive_attention`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def additive_attention(\n", + " decoder_hidden_state, encoder_hidden_states, v_add, W_add_enc, W_add_dec\n", + "):\n", + " \"\"\"\n", + " decoder_hidden_state: np.array of shape (n_features_dec, 1)\n", + " encoder_hidden_states: np.array of shape (n_features_enc, n_states)\n", + " v_add: np.array of shape (n_features_int, 1)\n", + " W_add_enc: np.array of shape (n_features_int, n_features_enc)\n", + " W_add_dec: np.array of shape (n_features_int, n_features_dec)\n", + "\n", + " return: np.array of shape (n_features_enc, 1)\n", + " Final attention vector\n", + " \"\"\"\n", + " # your code here\n", + " return attention_vector" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Submit the `multiplicative_attention` and `additive_attention` functions in the contest. Don’t forget to import `numpy`!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step #2. Text classification using a pretrained language model.\n", + "\n", + "We work with the SST-2 dataset. Split the dataset into train and test sets." 
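+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "One possible baseline, shown here only as a sketch: encode each text with a frozen pretrained encoder, take the first ([CLS]) token representation, and fit a simple classifier on top. It assumes the HuggingFace `transformers` package is installed; the `distilbert-base-uncased` checkpoint and the toy texts are illustrative choices, not requirements. The same pipeline applies to `texts_train` / `texts_test` once they are loaded in the cells below, and fine-tuning the whole encoder usually reaches higher accuracy than this frozen-feature baseline.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Sketch of a frozen-encoder baseline (assumes the `transformers` package;\n",
+    "# the checkpoint name and the toy data below are illustrative only).\n",
+    "from sklearn.linear_model import LogisticRegression\n",
+    "from transformers import AutoModel, AutoTokenizer\n",
+    "\n",
+    "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
+    "tokenizer = AutoTokenizer.from_pretrained(\"distilbert-base-uncased\")\n",
+    "encoder = AutoModel.from_pretrained(\"distilbert-base-uncased\").to(device).eval()\n",
+    "\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def encode_texts(texts, batch_size=64):\n",
+    "    \"\"\"Return the first-token ([CLS]) embedding of every text as a numpy array.\"\"\"\n",
+    "    features = []\n",
+    "    for i in range(0, len(texts), batch_size):\n",
+    "        batch = tokenizer(\n",
+    "            list(texts[i : i + batch_size]),\n",
+    "            padding=True,\n",
+    "            truncation=True,\n",
+    "            max_length=128,\n",
+    "            return_tensors=\"pt\",\n",
+    "        ).to(device)\n",
+    "        hidden = encoder(**batch).last_hidden_state  # (batch, seq_len, hidden)\n",
+    "        features.append(hidden[:, 0, :].cpu().numpy())\n",
+    "    return np.concatenate(features)\n",
+    "\n",
+    "\n",
+    "# toy demonstration data, NOT the assignment's dataset\n",
+    "toy_texts = [\"a warm and funny film\", \"a dull, lifeless mess\", \"simply wonderful\", \"painfully boring\"]\n",
+    "toy_labels = np.array([1, 0, 1, 0])\n",
+    "\n",
+    "X = encode_texts(toy_texts)\n",
+    "clf = LogisticRegression(max_iter=1000).fit(X, toy_labels)\n",
+    "print(clf.predict_proba(X))  # predict_proba yields the probabilities needed for out_dict"
+   ]
+  },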
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# do not change the code in the block below\n", + "# __________start of block__________\n", + "\n", + "!wget https://raw.githubusercontent.com/girafe-ai/ml-course/msu_branch/homeworks/hw08_attention/holdout_texts08.npy\n", + "# __________end of block__________" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# do not change the code in the block below\n", + "# __________start of block__________\n", + "df = pd.read_csv(\n", + " \"https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv\",\n", + " delimiter=\"\\t\",\n", + " header=None,\n", + ")\n", + "texts_train = df[0].values[:5000]\n", + "y_train = df[1].values[:5000]\n", + "texts_test = df[0].values[5000:]\n", + "y_test = df[1].values[5000:]\n", + "texts_holdout = np.load(\"holdout_texts08.npy\", allow_pickle=True)\n", + "# __________end of block__________" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The rest of the code is up to you to write.\n", + "To successfully achieve the maximum score, you need to reach at least 84.5% accuracy on the test part of the dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# your beautiful experiments here" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Submitting the Assignment in the Contest\n", + "\n", + "Save the probabilities of belonging to class 0 and class 1, respectively, in the dictionary `out_dict`:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "out_dict = {\n", + " 'train': None,  # replace with an np.array of size (5000, 2) with probas\n", + " 'test': None,  # replace with an np.array of size (1920, 2) with probas\n", + " 'holdout': None,  # replace with an np.array of size (500, 2) with probas\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Several `assert`s to check your solution:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "assert isinstance(out_dict[\"train\"], np.ndarray), \"Dict values should be numpy arrays\"\n", + "assert out_dict[\"train\"].shape == (\n", + " 5000,\n", + " 2,\n", + "), \"The predicted probas shape does not match the train set size\"\n", + "assert np.allclose(\n", + " out_dict[\"train\"].sum(axis=1), 1.0\n", + "), \"Probas do not sum up to 1 for some of the objects\"\n", + "\n", + "assert isinstance(out_dict[\"test\"], np.ndarray), \"Dict values should be numpy arrays\"\n", + "assert out_dict[\"test\"].shape == (\n", + " 1920,\n", + " 2,\n", + "), \"The predicted probas shape does not match the test set size\"\n", + "assert np.allclose(\n", + " out_dict[\"test\"].sum(axis=1), 1.0\n", + "), \"Probas do not sum up to 1 for some of the objects\"\n", + "\n", + "assert isinstance(out_dict[\"holdout\"], np.ndarray), \"Dict values should be numpy arrays\"\n", + "assert out_dict[\"holdout\"].shape == (\n", + " 500,\n", + " 2,\n", + "), \"The predicted probas shape does not match the holdout set size\"\n", + "assert np.allclose(\n", + " out_dict[\"holdout\"].sum(axis=1), 1.0\n", + "), \"Probas do not sum up to 1 for some of the objects\"" + ] + } + ], + "metadata": { + "colab": { + "collapsed_sections": [], + "name": "NLP_hw01_texts.ipynb", + "provenance": [] + }, + "kernelspec": { + "display_name": "Py3 Research", +
"language": "python", + "name": "py3_research" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.7" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +}