diff --git a/Automate Documents Categorization- PyTorch & Neural Networks/Text Classification.ipynb b/Automate Documents Categorization- PyTorch & Neural Networks/Text Classification.ipynb new file mode 100644 index 000000000..7253fc8ec --- /dev/null +++ b/Automate Documents Categorization- PyTorch & Neural Networks/Text Classification.ipynb @@ -0,0 +1 @@ +{"cells":[{"cell_type":"markdown","id":"f595719d-84d4-4e8f-8f89-6a35f737ffff","metadata":{},"source":["# **Automate Documents Categorization- PyTorch & Neural Networks**\n","\n","\n","The implementation of an automated machine learning system makes it very efficient. Such a system, equipped with advanced natural language processing and machine learning capabilities, could sift through the vast archives, categorizing articles into their respective topics with remarkable precision. As a result, readers would seamlessly access a wealth of knowledge tailored to their interests, while the editorial team gains newfound agility in content management.\n","\n"]},{"cell_type":"markdown","id":"9c018c6b-f595-49a6-bcea-bcc2ee2bdbf9","metadata":{},"source":["# Setup\n"]},{"cell_type":"markdown","id":"f661ca54-9c3e-442e-84f9-e6053932a503","metadata":{},"source":["### Installing Required Libraries\n","\n"]},{"cell_type":"code","execution_count":null,"id":"c20d0ed2-a2c6-4d9a-bce1-8f3250723934","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["# All Libraries required for this lab are listed below. The libraries pre-installed on Skills Network Labs are commented.\n","# !pip install -qy pandas==1.3.4 numpy==1.21.4 seaborn==0.9.0 matplotlib==3.5.0 scikit-learn==0.20.1\n","# - Update a specific package\n","# !pip install pmdarima -U\n","# - Update a package to specific version\n","# !pip install --upgrade pmdarima==2.0.2\n","# Note: If your environment doesn't support \"!pip install\", use \"!mamba install\""]},{"cell_type":"code","execution_count":null,"id":"5fa7f7c7-b8cd-407c-ad5c-60162958b93e","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["!pip install -Uqq portalocker>=2.0.0\n","!pip install -qq torchtext\n","!pip install -qq torchdata\n","!pip install -Uqq plotly"]},{"cell_type":"markdown","id":"65f3991c-d8d6-464b-812b-73bd0155bbe8","metadata":{},"source":["### Importing Required Libraries\n"]},{"cell_type":"code","execution_count":null,"id":"8662f2ca-9c7c-4154-ac80-c918cc745ebc","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["from tqdm import tqdm\n","import numpy as np\n","import pandas as pd\n","from itertools import accumulate\n","import matplotlib.pyplot as plt\n","\n","import torch\n","import torch.nn as nn\n","\n","from torch.utils.data import DataLoader\n","import numpy as np\n","from torchtext.datasets import AG_NEWS\n","from IPython.display import Markdown as md\n","from tqdm import tqdm\n","\n","from torchtext.vocab import build_vocab_from_iterator\n","from torchtext.datasets import AG_NEWS\n","from torch.utils.data.dataset import random_split\n","from torchtext.data.functional import to_map_style_dataset\n","from sklearn.manifold import TSNE\n","import plotly.graph_objs as go\n","\n","\n","def warn(*args, **kwargs):\n"," pass\n","import warnings\n","warnings.warn = warn\n","warnings.filterwarnings('ignore')"]},{"cell_type":"code","execution_count":null,"id":"ce416f81-d1ed-4bf4-bf4f-838cb6c6c262","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["def plot(COST,ACC):\n"," fig, ax1 = plt.subplots()\n"," color = 'tab:red'\n"," ax1.plot(COST, color=color)\n"," 
ax1.set_xlabel('epoch', color=color)\n"," ax1.set_ylabel('total loss', color=color)\n"," ax1.tick_params(axis='y', color=color)\n"," \n"," ax2 = ax1.twinx() \n"," color = 'tab:blue'\n"," ax2.set_ylabel('accuracy', color=color) # we already handled the x-label with ax1\n"," ax2.plot(ACC, color=color)\n"," ax2.tick_params(axis='y', color=color)\n"," fig.tight_layout() # otherwise the right y-label is slightly clipped\n"," \n"," plt.show()"]},{"cell_type":"markdown","id":"6efafea4-5344-4cd3-911a-92082bbb22ea","metadata":{},"source":["# Toy Dataset \n","To gain a deeper understanding of the TorchText pipeline, let's engage with a toy dataset through interactive exploration. \n","\n","We have a dataset that consists of a list of tuples, where each tuple contains a random numeric label and a text document.\n"]},{"cell_type":"code","execution_count":null,"id":"7d6ff252-49fa-4d63-9274-ca742a28ea44","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["dataset = [\n"," (1,\"Introduction to NLP\"),\n"," (2,\"Basics of PyTorch\"),\n"," (1,\"NLP Techniques for Text Classification\"),\n"," (3,\"Named Entity Recognition with PyTorch\"),\n"," (3,\"Sentiment Analysis using PyTorch\"),\n"," (3,\"Machine Translation with PyTorch\"),\n"," (1,\" NLP Named Entity,Sentiment Analysis,Machine Translation \"),\n"," (1,\" Machine Translation with NLP \"),\n"," (1,\" Named Entity vs Sentiment Analysis NLP \")]"]},{"cell_type":"markdown","id":"0e8d4c54-1a8e-4745-a977-b7b92edd1e08","metadata":{},"source":["### Tokenizer\n"]},{"cell_type":"markdown","id":"7c50cef5-0a8c-4bea-937f-4fb4d6feb76e","metadata":{},"source":["First import the **```get_tokenizer```** function from **```torchtext.data.utils```**\n"]},{"cell_type":"code","execution_count":null,"id":"47ee702b-7ad2-4d13-b46b-0ba49ab09b4a","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["from torchtext.data.utils import get_tokenizer"]},{"cell_type":"markdown","id":"7b168061-b439-4bf5-b8f4-3b74026b8b92","metadata":{},"source":["Next, we create the tokenizer, we set it to \"basic_english\" tokenizer provided by torchtext. The \"basic_english\" tokenizer is designed to handle basic English text and splits text into individual tokens based on spaces and punctuation marks.\n"]},{"cell_type":"code","execution_count":null,"id":"3f46e08c-aa29-49d0-b256-9867ee67b9da","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["tokenizer = get_tokenizer(\"basic_english\")"]},{"cell_type":"markdown","id":"487b8d9b-fc8d-467d-aac1-4ee9a6647b38","metadata":{},"source":["We iterate over each tuple (y, sentence) in the corpus, tokenizing the sentence. 
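For example, under the \"basic_english\" tokenizer the title \"Basics of PyTorch\" becomes the tokens `['basics', 'of', 'pytorch']`; note that the text is also lowercased. 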
Therefore, each word becomes a token for the computer to understand individual words from a sentence.\n"]},{"cell_type":"code","execution_count":null,"id":"97156f07-1885-4917-8e91-8eff0bf17e26","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["for y, sentence in dataset:\n"," # Print the label (y) associated with the current sentence\n"," print(\"y:\", y)\n","\n"," # Print the sentence text\n"," print(\"sentence:\", sentence)\n","\n"," # Tokenize the sentence using the \"basic_english\" tokenizer\n"," # The tokenizer splits the sentence into individual tokens (words)\n"," tokens = tokenizer(sentence)\n","\n"," # Print the list of tokens obtained from tokenizing the sentence\n"," print(\"tokenizer:\", tokens)"]},{"cell_type":"markdown","id":"bc22415a-9aa9-4f83-9aab-3d5d13add592","metadata":{},"source":["### Token Indices\n"]},{"cell_type":"markdown","id":"d6905efb-a09b-40b6-ac25-81e86ac4930a","metadata":{},"source":["We need to represent words as numbers as NLP algorithms can process and manipulate numbers more efficiently and quickly than raw text. We use the function **```build_vocab_from_iterator```**, the output is typically referred to as 'token indices' or simply 'indices.' These indices represent the numeric representations of the tokens in the vocabulary.\n","\n","The **```build_vocab_from_iterator```** function, when applied to a list of tokens, assigns a unique index to each token based on its position in the vocabulary. These indices serve as a way to represent the tokens in a numerical format that can be easily processed by machine learning models.\n","\n","For example, given a vocabulary with tokens [\"apple\", \"banana\", \"orange\"], the corresponding indices might be [0, 1, 2], where \"apple\" is represented by index 0, \"banana\" by index 1, and \"orange\" by index 2.\n"]},{"cell_type":"markdown","id":"8b709f31-e564-4e83-a10b-672837675ff8","metadata":{},"source":["**```dataset```** is an iterable therefore we use a generator function yield_tokens to apply the **```tokenizer```**. The purpose of the generator function **```yield_tokens```** is to yield tokenized texts one at a time. Instead of processing the entire dataset and returning all the tokenized texts in one go, the generator function processes and yields each tokenized text individually as it is requested. The tokenization process is performed lazily, which means the next tokenized text is generated only when needed, saving memory and computational resources.\n"]},{"cell_type":"code","execution_count":null,"id":"290e848d-1b9a-47a6-b54b-59f3142493f6","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["def yield_tokens(data_iter):\n"," for _,text in data_iter:\n"," yield tokenizer(text)"]},{"cell_type":"code","execution_count":null,"id":"6a3f3d67-d27b-4aea-a005-4b455e04c1aa","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["my_iterator = yield_tokens(dataset) "]},{"cell_type":"markdown","id":"9a725c94-ce89-4aa1-b036-71acedbfb3dc","metadata":{},"source":["This creates an iterator called **```my_iterator```** using the generator. 
To evaluate the generator and retrieve values, you can iterate over **```my_iterator```** with a for loop or pull items from it one at a time with the **```next()```** function.\n"]},{"cell_type":"code","execution_count":null,"id":"9f0d4af5-2621-48e7-80c5-dc9622d7afbb","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["next(my_iterator)"]},{"cell_type":"markdown","id":"66c0b789-296a-441e-b8d0-16cba6d21358","metadata":{},"source":["We build a vocabulary from the tokenized texts generated by the **```yield_tokens```** generator function, which processes the dataset. The **```build_vocab_from_iterator()```** function constructs the vocabulary, including a special token `<unk>` to represent out-of-vocabulary words. \n","\n","### Out-of-vocabulary (OOV)\n","When text data is tokenized, there may be words that are not present in the vocabulary because they are rare or unseen during the vocabulary building process. When encountering such OOV words during actual language processing tasks like text generation or language modeling, the model can use the `<unk>` token to represent them.\n","\n","For example, if the word \"apple\" is present in the vocabulary, but \"pineapple\" is not, \"apple\" will be used normally in the text, but \"pineapple\" (being an OOV word) would be replaced by the `<unk>` token.\n","\n","By including the `<unk>` token in the vocabulary, you provide a consistent way to handle out-of-vocabulary words in your language model or other natural language processing tasks.\n"]},{"cell_type":"code","execution_count":null,"id":"7a76b30e-089c-4b63-9f3d-3d48baf1eadf","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["vocab = build_vocab_from_iterator(yield_tokens(dataset), specials=[\"<unk>\"])\n","vocab.set_default_index(vocab[\"<unk>\"])"]},{"cell_type":"markdown","id":"cf0e5a0d-3d16-4f96-831e-a384bda08577","metadata":{},"source":["Using **```vocab.get_stoi()```**, you can obtain the entire string-to-index mapping of the vocabulary. This mapping is useful when you want to look up the index of a specific word in the vocabulary, which is essential when converting text data into numerical form for various natural language processing tasks.\n"]},{"cell_type":"code","execution_count":null,"id":"f5df9263-6f64-4199-9963-4fc8fdf442b5","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["vocab.get_stoi()"]},{"cell_type":"markdown","id":"259d5d81-c257-421d-a9c8-9c5ee8c95534","metadata":{},"source":["Prepare the text processing pipeline with the tokenizer and vocabulary. The text and label pipelines will be used to process the raw data strings from the dataset iterators. \n","\n","The function **```text_pipeline```** will tokenize the input text, and **```vocab```** will then be applied to get the token indices. \n","The **```label_pipeline```** will ensure that the labels start at zero.\n"]},{"cell_type":"code","execution_count":null,"id":"b5799561-dd03-4780-9b99-5d5f7b18d65a","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["def text_pipeline(x):\n","    return vocab(tokenizer(x))\n","\n","def label_pipeline(x):\n","    return int(x) - 1"]},{"cell_type":"markdown","id":"9d82e735-d257-435c-afa7-352cb762e96f","metadata":{},"source":["We can apply the functions to each element in the dataset, storing the labels in ```label_list``` and the token indices in ```text_list```. 
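For instance, `label_pipeline(3)` returns `2`, and `text_pipeline(\"Basics of PyTorch\")` returns the vocabulary indices of the tokens `['basics', 'of', 'pytorch']`. 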
Later on, to retrieve the original sample, we use the list offsets, which indicate the size of each sequence.\n"]},{"cell_type":"code","execution_count":null,"id":"41bc51ca-4156-4d0b-afe0-675eb4df9258","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["label_list, text_list, offsets = [], [], [0]\n","for i, (_label, _text) in enumerate(dataset):\n"," print(\"iteration\",i) \n"," print(\"label:\",_label)\n"," new_label=label_pipeline(_label)\n"," print(\"new label:\",new_label)\n"," label_list.append(new_label)\n"," \n"," print(\"text:\",_text)\n"," processed_text=text_pipeline(_text)\n"," print(\"processed text:\",processed_text)\n"," \n"," processed_text = torch.tensor(processed_text, dtype=torch.int64)\n"," text_list.append(processed_text)\n"," print(\"offsets:\",processed_text.size(0))\n"," \n"," offsets.append(processed_text.size(0))\n"," print(\"\\n\")"]},{"cell_type":"markdown","id":"c7455b38-9867-482a-87df-b7c6ce211599","metadata":{},"source":["We change ```label_list``` to a tensor to work with PyTorch.\n"]},{"cell_type":"code","execution_count":null,"id":"5ba207d5-c30e-43c7-8b13-62c252e35632","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["label_list = torch.tensor(label_list, dtype=torch.int64)\n","label_list"]},{"cell_type":"code","execution_count":null,"id":"6ba79fde-80fd-42c7-ba2a-6730928d942e","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["text_list"]},{"cell_type":"markdown","id":"9dc9730c-f311-48f8-8e83-a7938bc78059","metadata":{},"source":["We see each sequence in the list ```text_list``` is of a different length, we will flatten the list making processing more efficient.\n","\n","**Why do we flatten the list?**\n","\n","By flattening the list and using cumulative offsets, we create a unified, contiguous representation of the entire text. This approach eliminates the need for padding (which essentially involves keeping sentences separate and of equal length, but can be memory-intensive). As a result, the model can process the text more effectively without having to handle individual sequences separately. This approach can lead to faster computations and reduced memory usage.\n"]},{"cell_type":"code","execution_count":null,"id":"04d04ea3-eb6e-4c30-b8b4-e3fe501437f9","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":[" text_list = torch.cat(text_list)\n"," text_list"]},{"cell_type":"markdown","id":"2c0daf01-a98d-46a6-85b2-8ed3ae1dc1bb","metadata":{},"source":["Offsets represent the length of each sequence. By applying the`cumsum`function, we can find the position of each sample in```text_list```. Each element in the output will be the cumulative sum of each sample's size, providing the precise index position for each sample in```text_list```.\n"]},{"cell_type":"code","execution_count":null,"id":"4f58b328-fad8-492c-9086-9d9ac6934cd2","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["print(offsets)\n","offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)\n","offsets"]},{"cell_type":"markdown","id":"1d3319c3-b666-4382-8c0d-f16a4cf3fcbf","metadata":{},"source":["---\n"]},{"cell_type":"markdown","id":"ac3bc816-2aaa-46c3-b639-fabb8fa4abde","metadata":{},"source":["## Embedding\n","\n","### Embeddings in PyTorch \n","\n","We can represent each word as a special kind of number, which we call a \"word vector\" or \"embedding.\" These word vectors are like coordinates in a lower-dimensional space. 
Let's say we want to represent three words: \"cat,\" \"dog,\" and \"bird.\" We can create word vectors for each of them as follows:\n","\n","Word \"cat\": $\\mathbf{e}_{\\text{cat}} = [0.8, 0.5]$\n","\n","Word \"dog\": $\\mathbf{e}_{\\text{dog}} = [0.3, -0.2]$\n","\n","Word \"bird\": $\\mathbf{e}_{\\text{bird}} = [-0.6, 0.9]$\n","\n","The magic of these word vectors is that words with similar meanings or used in similar contexts end up closer to each other in this lower-dimensional space. In our example, if we visualize these points, we might see that \"cat\" and \"dog\" are closer to each other than either is to \"bird.\" This closeness helps the computer understand the relationships between words, which is useful for various machine learning tasks.\n","\n","![2d](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-GPXX0Y15EN/Screenshot%202023-07-31%20at%208.26.00%20AM.png)\n"]},{"cell_type":"markdown","id":"9d2ba2d8-d237-42b7-956a-ecbdc1a0d71f","metadata":{},"source":["**```nn.Embedding(vocab_size, embedding_dim)```** is a PyTorch layer used in natural language processing. It converts words (represented as integers) into compact and meaningful vectors of a specified size. The `vocab_size` is the number of unique words, and `embedding_dim` is the size of the resulting word vectors. For example, if `vocab_size=3` and `embedding_dim=2`, each word will be transformed into a 2-dimensional vector.\n"]},{"cell_type":"code","execution_count":null,"id":"27dd702f-c5dd-4488-a462-8d26e312ef00","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["embedding_dim=4\n","\n","vocab_size=len(vocab)\n","print(vocab_size)"]},{"cell_type":"code","execution_count":null,"id":"688b1a71-0db5-4915-aa9c-937e7799cbff","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["embedding = nn.Embedding(vocab_size, embedding_dim)"]},{"cell_type":"markdown","id":"5450afb9-0ec1-461c-98ef-58214b051c26","metadata":{},"source":[" Embedding are randomly initialized, we will set them to our randomly initialized values for later.\n"]},{"cell_type":"code","execution_count":null,"id":"34f2398b-f30a-40d2-820b-fea45ae42ada","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["wights_=torch.randn(22,4)\n","embedding.weight.data=wights_"]},{"cell_type":"markdown","id":"8c5841d2-f9ee-493e-b259-763f9efb9d88","metadata":{},"source":["The random embedding values are stored in tokens.\n"]},{"cell_type":"code","execution_count":null,"id":"cba56b5b-fd2b-442b-b256-f1fd434c852d","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["for word in dataset[1][1].split():\n"," print(\"word:\",word)\n"," print(\"Token Indices\",text_pipeline(word))\n"," print(\"embedding\",embedding(torch.tensor(text_pipeline(word))))"]},{"cell_type":"markdown","id":"15553b69-86e7-4b78-b866-0ec46bae8a37","metadata":{},"source":["### Embedding Bag\n","\n","Now, a sentence can have different lengths (more words or fewer words), and this can be a problem when we want to use this information in a computer program. So, to solve this problem, we use \"embedding bag\". **Embedding bag** that takes all the word embeddings in a sentence and combines them to create a single fixed-length representation that summarizes the overall meaning of the whole sentence. It does this by averaging the word embeddings together.\n","\n","An `embedding_bag` in PyTorch takes dense vectors of a specified size (embedding_dim), similar to the `Embedding` layer. 
However, the key difference is that `embedding_bag` performs a pooling operation (like averaging) on the input embeddings to generate a single fixed-length vector, whereas the `Embedding` layer maps each input to a unique vector from a lookup table.\n"]},{"cell_type":"code","execution_count":null,"id":"bfb1d2f9-29d0-4c4d-8171-b459e687f287","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["embedding_bag=nn.EmbeddingBag(vocab_size, embedding_dim)"]},{"cell_type":"markdown","id":"f96f43a5-011d-499f-9806-18063dba6782","metadata":{},"source":["The parameters of embedding bag are randomly initialized; we will set them to our randomly initialized values so we can compare them to the Embedding:\n"]},{"cell_type":"code","execution_count":null,"id":"eafda05f-023b-430a-9249-0d9802a8344f","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["embedding_bag.weight.data=wights_"]},{"cell_type":"markdown","id":"64c80c9d-ad1e-4edb-8478-23e77d7a5a5b","metadata":{},"source":["The input consists of token indices, where 'offsets' determine the starting index position. If we only have one sequence, the offset value is 0. The code below showcases this.\n"]},{"cell_type":"code","execution_count":null,"id":"830f5408-7888-472f-8217-b0a6cd772118","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["embedding_bag(torch.tensor(text_pipeline(dataset[1][1])),offsets=torch.tensor([0]))"]},{"cell_type":"markdown","id":"05332684-b18b-4539-a368-e3f491972c36","metadata":{},"source":["We can show that the output of **`embedding_bag`** is just the average of embeddings.\n"]},{"cell_type":"code","execution_count":null,"id":"4ced0e38-8dfe-4297-9d33-f971673203d8","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["for y,sentence in dataset:\n"," print(\"sentence:\",sentence)\n"," my_tokenizes=tokenizer(sentence)\n"," print(\"tokens:\",my_tokenizes)\n"," token_indices=vocab(my_tokenizes)\n"," my_embeddings=embedding(torch.tensor(token_indices).reshape(-1))\n"," print(\"my embeddings \\n\",my_embeddings)\n"," print(\"mean embeddings:\",my_embeddings.mean(0))\n"," my_embeddings_bag=embedding_bag(torch.tensor(token_indices),torch.tensor([0]))\n"," print(\"embeddings bag:\",my_embeddings_bag)\n"," print(\"\\n\")"]},{"cell_type":"markdown","id":"c78f4965-2f1a-4bb8-b2a5-b3d8b7fd264e","metadata":{},"source":["By inputting multiple sequences from `text_list`, the `offsets` will provide precise location information for each sequence. The resulting output will consist of an embedding bag representing each individual sequence.\n"]},{"cell_type":"code","execution_count":null,"id":"a1f5b1d3-f70e-4fae-adb0-5ac56c41a86f","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["embedding_bag( text_list ,offsets=offsets)"]},{"cell_type":"markdown","id":"d40df917-df9f-4dfd-a136-653ee524f412","metadata":{},"source":["If you inspect, the `embedding_bag` collection will be the same for each generated embeddings bag.\n","\n","Now after going through all of this. 
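Here is a minimal recap sketch of the toy pipeline, reusing the objects defined above (`tokenizer`, `text_pipeline`, and `embedding_bag`):\n","\n","```python\n","# raw text -> tokens -> token indices -> one pooled vector per sequence\n","sentence = \"Sentiment Analysis using PyTorch\"\n","tokens = tokenizer(sentence)                     # ['sentiment', 'analysis', 'using', 'pytorch']\n","indices = torch.tensor(text_pipeline(sentence))  # token indices from the vocabulary\n","pooled = embedding_bag(indices, offsets=torch.tensor([0]))\n","print(pooled.shape)                              # torch.Size([1, 4]) since embedding_dim=4\n","```\n","\n","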
The gif below gives you an overview.\n","\n","![nlp](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-GPXX0Y15EN/NLP.gif)\n"]},{"cell_type":"markdown","id":"5626297e-ca2f-4226-9042-d6bc3c7e15c8","metadata":{},"source":["---\n"]},{"cell_type":"markdown","id":"2a99dacc-26f1-42c4-9e72-52c902091004","metadata":{},"source":["### Import the AG News dataset\n","\n","Load the AG_NEWS dataset for the train split and split it into input text and corresponding labels:\n"]},{"cell_type":"code","execution_count":null,"id":"360b34a2-3513-4239-b04c-a4508eee5ad2","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["train_iter = AG_NEWS(split=\"train\")"]},{"cell_type":"markdown","id":"ae392669-6e54-4941-80bd-313a3a389583","metadata":{},"source":["The AG_NEWS dataset in torchtext does not support direct indexing like a list or tuple. It is not a random-access dataset but rather an iterable-style dataset that must be consumed through an iterator, which is more memory-efficient for large text corpora.\n"]},{"cell_type":"code","execution_count":null,"id":"2cc6290a-52d7-43f6-a3a5-42927cd37d33","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["y, text = next(iter(train_iter))\n","print(y, text)"]},{"cell_type":"markdown","id":"f4bea358-c18e-4357-8ab5-3eeaa95a028c","metadata":{},"source":["We can find the label of the sample.\n"]},{"cell_type":"code","execution_count":null,"id":"1ce136eb-ee13-430a-a273-be7138be59f7","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["ag_news_label = {1: \"World\", 2: \"Sports\", 3: \"Business\", 4: \"Sci/Tec\"}\n","ag_news_label[y]"]},{"cell_type":"markdown","id":"62157cbd-4f51-4953-bcfb-90e52e224fcf","metadata":{},"source":["We can also use the dataset to find all the classes.\n"]},{"cell_type":"code","execution_count":null,"id":"48e7e480-c399-44a2-982a-a7913e80c079","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["num_class = len(set([label for (label, text) in train_iter]))\n","num_class"]},{"cell_type":"markdown","id":"3b9e6b3c-e7b0-4517-9c25-5105a8d55e80","metadata":{},"source":["We can build the vocabulary as before, this time using the AG News dataset to obtain the token indices.\n"]},{"cell_type":"code","execution_count":null,"id":"5bb91ec6-ab55-4f42-913d-e83b6e41f8b1","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=[\"<unk>\"])\n","vocab.set_default_index(vocab[\"<unk>\"])"]},{"cell_type":"markdown","id":"4017f44f-1454-45ee-a54c-d77b7ce1f7ce","metadata":{},"source":["Here are some token indices:\n"]},{"cell_type":"code","execution_count":null,"id":"9b29189f-34be-427f-913c-cfbe7af41664","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["vocab([\"age\",\"hello\"])"]},{"cell_type":"markdown","id":"e56a5740-91bb-41b7-86ca-680558ec56a8","metadata":{},"source":["### Dataset \n"]},{"cell_type":"markdown","id":"16fa7837-6e62-4293-85a3-5554c2ada70b","metadata":{},"source":["We can convert the dataset into map-style datasets and then perform a random split to create separate training and validation datasets. The training dataset will contain 95% of the samples, while the validation dataset will contain the remaining 5%. 
These datasets can be used for training and evaluating a machine learning model for text classification on the AG_NEWS dataset.\n"]},{"cell_type":"code","execution_count":null,"id":"9058cd37-88eb-4623-a982-252c1a2bac3b","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["# Split the dataset into training and testing iterators.\n","train_iter, test_iter = AG_NEWS()\n","\n","# Convert the training and testing iterators to map-style datasets.\n","train_dataset = to_map_style_dataset(train_iter)\n","test_dataset = to_map_style_dataset(test_iter)\n","\n","# Determine the number of samples to be used for training and validation (5% for validation).\n","num_train = int(len(train_dataset) * 0.95)\n","\n","# Randomly split the training dataset into training and validation datasets using `random_split`.\n","# The training dataset will contain 95% of the samples, and the validation dataset will contain the remaining 5%.\n","split_train_, split_valid_ = random_split(train_dataset, [num_train, len(train_dataset) - num_train])"]},{"cell_type":"markdown","id":"6821e9be-094f-4e68-a4cc-344288e5109f","metadata":{},"source":["The code checks if a CUDA-compatible GPU is available in the system using PyTorch, a popular deep learning framework. If a GPU is available, it assigns the device variable to \"cuda\" (which stands for CUDA, the parallel computing platform and application programming interface model developed by NVIDIA). If a GPU is not available, it assigns the device variable to \"cpu\" (which means the code will run on the CPU instead).\n"]},{"cell_type":"code","execution_count":null,"id":"19473200-4065-4138-a3d9-bed603d49902","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n","device"]},{"cell_type":"markdown","id":"850f9550-8fda-4af2-b6e6-b7e0fd52fdce","metadata":{},"source":["### Data Loader\n"]},{"cell_type":"markdown","id":"47914855-c43c-438f-aa91-616f7f31724a","metadata":{},"source":["In PyTorch, the **`collate_fn`** function is used in conjunction with data loaders to customize the way batches are created from individual samples. The provided code defines a `collate_batch` function in PyTorch, which is used with data loaders to customize batch creation from individual samples. It processes a batch of data, including labels and text sequences. It applies the `label_pipeline` and `text_pipeline` functions to preprocess the labels and texts, respectively. The processed data is then converted into PyTorch tensors and returned as a tuple containing the label tensor, text tensor, and offsets tensor representing the starting positions of each text sequence in the combined tensor. 
The function also ensures that the returned tensors are moved to the specified device (e.g., GPU) for efficient computation.\n"]},{"cell_type":"code","execution_count":null,"id":"6954033b-16e6-4a5b-96fa-bb9d51bf4f78","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["def collate_batch(batch):\n"," label_list, text_list, offsets = [], [], [0]\n"," for _label, _text in batch:\n"," label_list.append(label_pipeline(_label))\n"," processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)\n"," text_list.append(processed_text)\n"," offsets.append(processed_text.size(0))\n"," label_list = torch.tensor(label_list, dtype=torch.int64)\n"," offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)\n"," text_list = torch.cat(text_list)\n"," return label_list.to(device), text_list.to(device), offsets.to(device)"]},{"cell_type":"markdown","id":"4883538c-30cc-4d73-b019-1a0639bc8947","metadata":{},"source":["We convert the dataset objects to a data loader by applying the collate function.\n"]},{"cell_type":"code","execution_count":null,"id":"3428afdd-3f12-45d9-bfca-813d8d6344fc","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["BATCH_SIZE = 64\n","\n","train_dataloader = DataLoader(\n"," split_train_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch\n",")\n","valid_dataloader = DataLoader(\n"," split_valid_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch\n",")\n","test_dataloader = DataLoader(\n"," test_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch\n",")"]},{"cell_type":"markdown","id":"b1fb5f55-b6b5-4bef-ab07-118ddf10c453","metadata":{},"source":["We can see the output sequence when we have the label, text, and offsets for each batch.\n"]},{"cell_type":"code","execution_count":null,"id":"8fffeb5b-292e-4369-9fa1-7da3dca8f68e","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["label, text, offsets=next(iter(valid_dataloader ))\n","label, text, offsets"]},{"cell_type":"markdown","id":"df092251-38b6-4aea-8fbc-8b1719b89bc4","metadata":{},"source":["### Neural Network\n"]},{"cell_type":"markdown","id":"b00e02c3-b411-4dd0-aaab-a0fee635bd51","metadata":{},"source":["We have created a neural network for a text classification model using an `EmbeddingBag` layer, followed by a softmax output layer. 
Additionally, we have initialized the model using a specific method.\n"]},{"cell_type":"code","execution_count":null,"id":"6b395dcf-5323-466c-9fc5-68c106a8ed2e","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["from torch import nn\n","\n","class TextClassificationModel(nn.Module):\n"," def __init__(self, vocab_size, embed_dim, num_class):\n"," super(TextClassificationModel, self).__init__()\n"," self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)\n"," self.fc = nn.Linear(embed_dim, num_class)\n"," self.init_weights()\n","\n"," def init_weights(self):\n"," initrange = 0.5\n"," self.embedding.weight.data.uniform_(-initrange, initrange)\n"," self.fc.weight.data.uniform_(-initrange, initrange)\n"," self.fc.bias.data.zero_()\n","\n"," def forward(self, text, offsets):\n"," embedded = self.embedding(text, offsets)\n"," return self.fc(embedded)"]},{"cell_type":"markdown","id":"02ce16a5-623b-4dd0-bb4c-741a37cf3eb5","metadata":{},"source":["We have created the model, and the embedding dimension size is a free parameter.\n"]},{"cell_type":"code","execution_count":null,"id":"b206820b-6747-4a9b-a741-b735936e6589","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["emsize=64"]},{"cell_type":"markdown","id":"62409b62-7fdf-4142-ae0f-282b391bc7c6","metadata":{},"source":["We need the vocabulary size to determine the number of embeddings.\n"]},{"cell_type":"code","execution_count":null,"id":"1eab91b8-754f-42aa-983a-88ad34f876f6","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["vocab_size=len(vocab)\n","vocab_size"]},{"cell_type":"markdown","id":"478b3f98-87e3-4424-9022-3b7a18f3dd7f","metadata":{},"source":["We have also determined the number of classes for the output layer.\n"]},{"cell_type":"code","execution_count":null,"id":"3141700f-0879-44e6-a72d-bf0b23983c9f","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["num_class "]},{"cell_type":"markdown","id":"e3d213ef-39c2-4b79-9f84-e2957ac5c933","metadata":{},"source":["Creating the model:\n"]},{"cell_type":"code","execution_count":null,"id":"37fe9c29-b156-426b-bbd4-aeb0b733ab7f","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["model = TextClassificationModel(vocab_size, emsize, num_class).to(device)\n","model"]},{"cell_type":"markdown","id":"6175d9c3-f63a-4ea4-9cf9-037c75def79a","metadata":{},"source":["The code line `predicted_label=model(text, offsets)` is used to obtain predicted labels from a machine learning model for a given input text and its corresponding offsets. The `model` is the machine learning model being used for text classification or similar tasks.\n"]},{"cell_type":"code","execution_count":null,"id":"7f949c89-5e7f-4e0d-83ec-d244df893bd8","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["predicted_label=model(text, offsets)"]},{"cell_type":"markdown","id":"56cbc475-9042-4c20-851e-e6869c39240e","metadata":{},"source":["We verify the output shape of our model. In this case, the model is trained with a mini-batch size of 64 samples. The output layer of the model produces 4 logits for each neuron, corresponding to the four classes in the classification task. 
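To turn these logits into a class decision, take the index of the largest logit in each row; applying a softmax instead would give class probabilities. A minimal sketch using the batch above:\n","\n","```python\n","class_indices = predicted_label.argmax(1)              # shape: (batch_size,)\n","probabilities = torch.softmax(predicted_label, dim=1)  # each row sums to 1\n","print(class_indices[:5], probabilities[0])\n","```\n","\n","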
We can also create a function to find the accuracy given a dataset.\n"]},{"cell_type":"code","execution_count":null,"id":"e7f961bb-f575-486a-9738-41995e9901f6","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["predicted_label.shape"]},{"cell_type":"markdown","id":"492988fb-052e-40dc-98d7-2d986d08b55a","metadata":{},"source":["Function **`predict`** takes in a text and a text pipeline, which preprocesses the text for machine learning. It uses a pre-trained model to predict the label of the text for text classification on the AG_NEWS dataset. The function returns the predicted label as a result.\n"]},{"cell_type":"code","execution_count":null,"id":"efd1e56e-513d-4f85-800b-bd273b01d638","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["def predict(text, text_pipeline):\n"," with torch.no_grad():\n"," text = torch.tensor(text_pipeline(text))\n"," output = model(text, torch.tensor([0]))\n"," return ag_news_label[output.argmax(1).item() + 1]"]},{"cell_type":"code","execution_count":null,"id":"9ed8f537-236f-4f44-9a0f-70ec62cb1241","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["predict(\"I like sports\",text_pipeline )"]},{"cell_type":"markdown","id":"c4637a16-bf89-4a4f-a4e1-0212581d0d82","metadata":{},"source":["We create a function to evaluate the model's accuracy on a dataset.\n"]},{"cell_type":"code","execution_count":null,"id":"d5b79791-805f-41d0-8aed-49d130ef04ec","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["def evaluate(dataloader):\n"," model.eval()\n"," total_acc, total_count= 0, 0\n","\n"," with torch.no_grad():\n"," for idx, (label, text, offsets) in enumerate(dataloader):\n"," predicted_label = model(text, offsets)\n","\n"," total_acc += (predicted_label.argmax(1) == label).sum().item()\n"," total_count += label.size(0)\n"," return total_acc / total_count"]},{"cell_type":"markdown","id":"03f3625e-6c30-46e8-b99e-948483e9fe01","metadata":{},"source":["We proceeded to evaluate the model, and upon observation, we found that its performance is no better than average. This outcome is expected, considering that the model has not undergone any training yet.\n"]},{"cell_type":"code","execution_count":null,"id":"eb8b944d-fc0b-4916-8a46-6d7cd9825095","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["evaluate(test_dataloader)"]},{"cell_type":"markdown","id":"7ad7a062-c570-485a-a1b6-02dd4dcf47f9","metadata":{},"source":["---\n"]},{"cell_type":"markdown","id":"7c175f7a-bfe7-486f-83a6-ce2a65c168c3","metadata":{},"source":["## Train the Model\n","\n","We set the learning rate (LR) to 0.1, which determines the step size at which the optimizer updates the model's parameters during training. The CrossEntropyLoss criterion is used to calculate the loss between the model's predicted outputs and the ground truth labels. This loss function is commonly employed for multi-class classification tasks.\n","\n","The chosen optimizer is Stochastic Gradient Descent (SGD), which optimizes the model's parameters based on the computed gradients with respect to the loss function. The SGD optimizer uses the specified learning rate to control the size of the weight updates.\n","\n","Additionally, a learning rate scheduler is defined using StepLR. This scheduler adjusts the learning rate during training, reducing it by a factor (gamma) of 0.1 after every epoch (step) to improve convergence and fine-tune the model's performance. 
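As a standalone illustration (assuming the scheduler is stepped once per epoch), the learning rate would decay from 0.1 to 0.01 to 0.001 over the first three epochs:\n","\n","```python\n","# Illustration only, with a dummy parameter; not part of the training setup below.\n","dummy_opt = torch.optim.SGD([torch.zeros(1, requires_grad=True)], lr=0.1)\n","dummy_sched = torch.optim.lr_scheduler.StepLR(dummy_opt, step_size=1, gamma=0.1)\n","for epoch in range(3):\n","    print(epoch, dummy_sched.get_last_lr())  # [0.1], then [0.01], then [0.001] (up to rounding)\n","    dummy_opt.step()\n","    dummy_sched.step()\n","```\n","\n","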
These components together form the essential setup for training a neural network using the specified learning rate, loss criterion, optimizer, and learning rate scheduler.\n"]},{"cell_type":"code","execution_count":null,"id":"ccec8d60-2ace-4a37-a1eb-179a9657fe3f","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["LR=0.1\n","\n","criterion = torch.nn.CrossEntropyLoss()\n","optimizer = torch.optim.SGD(model.parameters(), lr=LR)\n","scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)"]},{"cell_type":"markdown","id":"394b937c-69cf-4bbf-b251-33f530c24720","metadata":{},"source":["Training the model, which should take about 20 minutes.\n"]},{"cell_type":"code","execution_count":null,"id":"959a53dd-c3cc-48c4-8889-faf88368ade4","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["EPOCHS = 10\n","cum_loss_list=[]\n","acc_epoch=[]\n","acc_old=0\n","\n","for epoch in tqdm(range(1, EPOCHS + 1)):\n"," model.train()\n"," cum_loss=0\n"," for idx, (label, text, offsets) in enumerate(train_dataloader):\n"," optimizer.zero_grad()\n"," predicted_label = model(text, offsets)\n"," loss = criterion(predicted_label, label)\n"," loss.backward()\n"," torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)\n"," optimizer.step()\n"," cum_loss+=loss.item()\n","\n"," cum_loss_list.append(cum_loss)\n"," accu_val = evaluate(valid_dataloader)\n"," acc_epoch.append(accu_val)\n","\n"," if accu_val > acc_old:\n"," acc_old= accu_val\n"," torch.save(model.state_dict(), 'my_model.pth')"]},{"cell_type":"code","execution_count":null,"id":"f627d529-4afa-4bf6-9021-5fdf608f4a24","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["plot(cum_loss_list,acc_epoch)"]},{"cell_type":"code","execution_count":null,"id":"5b759210-2972-4d91-9087-a3a148b97d81","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["evaluate(test_dataloader)"]},{"cell_type":"markdown","id":"347da002-10bd-4c0f-af3c-37fe1f96ebcc","metadata":{},"source":["This code snippet provides a summary for generating a 3D t-SNE visualization of embeddings using Plotly. 
It demonstrates how words that are similar to each other are positioned closer together.\n"]},{"cell_type":"code","execution_count":null,"id":"ee4d4415-da07-433a-a4d4-92f79629e703","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["# Get the first batch from the validation data\n","batch = next(iter(valid_dataloader))\n","\n","# Extract the text and offsets from the batch\n","label, text, offsets = batch\n","\n","# Send the data to the device (GPU if available)\n","text = text.to(device)\n","offsets = offsets.to(device)\n","\n","# Get the embeddings bag output for the batch\n","embedded = model.embedding(text, offsets)\n","\n","# Convert the embeddings tensor to a numpy array\n","embeddings_numpy = embedded.detach().cpu().numpy()\n","\n","# Perform t-SNE on the embeddings to reduce their dimensionality to 3D.\n","X_embedded_3d = TSNE(n_components=3).fit_transform(embeddings_numpy)\n","\n","# Create a 3D scatter plot using Plotly\n","trace = go.Scatter3d(\n"," x=X_embedded_3d[:, 0],\n"," y=X_embedded_3d[:, 1],\n"," z=X_embedded_3d[:, 2],\n"," mode='markers',\n"," marker=dict(\n"," size=5,\n"," color=label.numpy(), # Use label information for color\n"," colorscale='Viridis', # Choose a colorscale\n"," opacity=0.8\n"," )\n",")\n","\n","layout = go.Layout(title=\"3D t-SNE Visualization of Embeddings\",\n"," scene=dict(xaxis_title='Dimension 1',\n"," yaxis_title='Dimension 2',\n"," zaxis_title='Dimension 3'))\n","\n","fig = go.Figure(data=[trace], layout=layout)\n","fig.show()"]},{"cell_type":"markdown","id":"836b30d7-67f0-41ea-8743-b14b10d042b8","metadata":{},"source":["We can make a prediction on the following article using the function **`predict`**.\n"]},{"cell_type":"code","execution_count":null,"id":"dd9220f0-c596-43a2-b65f-5762fce8eeca","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["article=\"\"\"Canada navigated a stiff test against the Republic of Ireland on a rain soaked evening in Perth, coming from behind to claim a vital 2-1 victory at the Women’s World Cup.\n","Katie McCabe opened the scoring with an incredible Olimpico goal – scoring straight from a corner kick – as her corner flew straight over the despairing Canada goalkeeper Kailen Sheridan at Perth Rectangular Stadium in Australia.\n","Just when Ireland thought it had safely navigated itself to half time with a lead, Megan Connolly failed to get a clean connection on a clearance with the resulting contact squirming into her own net to level the score.\n","Minutes into the second half, Adriana Leon completed the turnaround for the Olympic champion, slotting home from the edge of the area to seal the three points.\"\"\""]},{"cell_type":"markdown","id":"46abdd3b-f2c5-4e45-b5bc-6131b69d9ebd","metadata":{},"source":["This markdown content generates a styled box with light gray background and padding. It contains an `
<h1>` header displaying the content of the `article` variable, and an `<h2>` header indicating the predicted category of the news article which is provided by the `result` variable. The placeholders `{article}` and `{result}` will be dynamically replaced with actual values when this markdown is rendered.\n"]},{"cell_type":"code","execution_count":null,"id":"2b5a1eae-d3e0-470e-978e-786b1be3a765","metadata":{"vscode":{"languageId":"python"}},"outputs":[],"source":["result = predict(article, text_pipeline)\n","\n","markdown_content = f'''\n","<div style=\"background-color: lightgray; padding: 10px;\">\n","<h1>{article}</h1>\n","<h2>The category of the news article: {result}</h2>\n","</div>\n","'''\n","\n","md(markdown_content)"]}],"metadata":{"kernelspec":{"display_name":"Python","language":"python","name":"conda-env-python-py"},"language_info":{"name":""}},"nbformat":4,"nbformat_minor":4}