From fb2dd046c89be2fec00a7e95099002be53a13419 Mon Sep 17 00:00:00 2001
From: Sayantika Laskar <127471376+SayantikaLaskar@users.noreply.github.com>
Date: Sat, 8 Jun 2024 10:40:07 +0530
Subject: [PATCH] Add spam mail detection

---
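Note for reviewers: the README in this patch lists accuracy, precision, recall, and F1-score as evaluation metrics, while the notebook encodes spam as 0 and ham as 1. scikit-learn's `precision_score`, `recall_score`, and `f1_score` treat label 1 as the positive class by default, so with this encoding they would describe ham unless `pos_label=0` is passed. The sketch below is a minimal, dependency-free illustration of the four metrics with spam as the positive class. The function name `spam_metrics` and the sample labels are hypothetical and are not taken from the notebook.

```python
from collections import Counter

SPAM, HAM = 0, 1  # label encoding used in the notebook: spam = 0, ham = 1


def spam_metrics(y_true, y_pred, positive=SPAM):
    """Return accuracy, precision, recall and F1 for the `positive` class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one label")

    # Tally the four confusion-matrix cells relative to the positive class.
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    tp, fp, fn, tn = (counts[k] for k in ("tp", "fp", "fn", "tn"))

    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels for illustration only; these are not results from the notebook.
y_true = [SPAM, SPAM, HAM, HAM, HAM, SPAM]
y_pred = [SPAM, HAM, HAM, HAM, SPAM, SPAM]
metrics = spam_metrics(y_true, y_pred)
print(metrics)
```

In the notebook itself, the same numbers could be obtained from `sklearn.metrics` by passing `pos_label=0`, assuming `Y_test` and the model's predictions are available as integer arrays.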
 Spam Mail Classifier/README.md                |  96 ++
 ...rediction_using_Machine_Learning (1).ipynb | 927 ++++++++++++++++++
 2 files changed, 1023 insertions(+)
 create mode 100644 Spam Mail Classifier/README.md
 create mode 100644 Spam Mail Classifier/Spam_Mail_Prediction_using_Machine_Learning (1).ipynb

diff --git a/Spam Mail Classifier/README.md b/Spam Mail Classifier/README.md
new file mode 100644
index 000000000..f0302c3ae
--- /dev/null
+++ b/Spam Mail Classifier/README.md	
@@ -0,0 +1,96 @@
+# Spam Mail Detection Project
+
+## Overview
+
+Welcome to the Spam Mail Detection Project! This project develops a machine learning model that classifies emails as spam or not spam (ham), with the goal of filtering unwanted messages out of a user's inbox efficiently and accurately. It is part of the GirlScript Summer of Code (GSSoC) 2024 initiative.
+
+## Table of Contents
+
+1. [Introduction](#introduction)
+2. [Features](#features)
+3. [Installation](#installation)
+4. [Usage](#usage)
+5. [Model Training](#model-training)
+6. [License](#license)
+
+## Introduction
+
+Spam emails are a significant issue for many users, leading to wasted time and potential security threats. This project uses machine learning to identify and filter them out. The included notebook loads a labeled dataset (`mail_data.csv`, with `Category` and `Message` columns), converts the message text into TF-IDF feature vectors, and trains a logistic regression classifier on them.
+
+## Features
+
+- **Email Preprocessing**: Clean and preprocess email data for model training.
+- **Feature Extraction**: Extract relevant features from emails. The notebook uses TF-IDF weights computed from the message text; email headers and other metadata are possible additional features.
+- **Model Training**: Train a machine learning model using the preprocessed and feature-extracted data.
+- **Spam Detection**: Classify new emails as spam or not spam using the trained model.
+- **Evaluation Metrics**: Evaluate the performance of the model using accuracy, precision, recall, and F1-score.
+
+## Installation
+
+To set up the project locally, follow these steps:
+
+1. Clone the repository:
+    ```sh
+    git clone https://github.com/yourusername/spam-mail-detection.git
+    cd spam-mail-detection
+    ```
+
+2. Create a virtual environment and activate it:
+    ```sh
+    python -m venv venv
+    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
+    ```
+
+3. Install the required dependencies:
+    ```sh
+    pip install -r requirements.txt
+    ```
+
+## Usage
+
+1. **Data Preprocessing**:
+    - Run the preprocessing script to clean and prepare the email data:
+      ```sh
+      python preprocess.py
+      ```
+
+2. **Feature Extraction**:
+    - Extract features from the preprocessed data:
+      ```sh
+      python feature_extraction.py
+      ```
+
+3. **Model Training**:
+    - Train the machine learning model:
+      ```sh
+      python train_model.py
+      ```
+
+4. **Spam Detection**:
+    - Use the trained model to classify new emails:
+      ```sh
+      python predict.py --email "path_to_email_file"
+      ```
+
+
+## Model Training
+
+The model training process involves the following steps:
+
+1. Load and preprocess the dataset.
+2. Extract features from the emails.
+3. Split the data into training and testing sets.
+4. Train a machine learning model (the notebook uses logistic regression; Naive Bayes, SVM, and Random Forest are alternatives).
+5. Evaluate the model using appropriate metrics.
+
+
+## License
+
+This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for more details.
+
+
+We hope you find this project useful and look forward to your contributions!
+
+---
+
+*This project is developed as part of the GirlScript Summer of Code (GSSoC) 2024 initiative.*
\ No newline at end of file
diff --git a/Spam Mail Classifier/Spam_Mail_Prediction_using_Machine_Learning (1).ipynb b/Spam Mail Classifier/Spam_Mail_Prediction_using_Machine_Learning (1).ipynb
new file mode 100644
index 000000000..612d95889
--- /dev/null
+++ b/Spam Mail Classifier/Spam_Mail_Prediction_using_Machine_Learning (1).ipynb	
@@ -0,0 +1,927 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Importing the Dependencies"
+      ],
+      "metadata": {
+        "id": "fjWeEGD-0EtH"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "n8w458Fyyeft"
+      },
+      "outputs": [],
+      "source": [
+        "import numpy as np\n",
+        "import pandas as pd\n",
+        "from sklearn.model_selection import train_test_split\n",
+        "from sklearn.feature_extraction.text import TfidfVectorizer\n",
+        "from sklearn.linear_model import LogisticRegression\n",
+        "from sklearn.metrics import accuracy_score"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Data Collection & Pre-processing"
+      ],
+      "metadata": {
+        "id": "nXr7GETI1uv-"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# loading the data from csv file to a pandas dataframe\n",
+        "raw_mail_data = pd.read_csv('/content/mail_data.csv')"
+      ],
+      "metadata": {
+        "id": "TEy05ApM1g41"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "print(raw_mail_data)"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "xePFRWCm2FRc",
+        "outputId": "6f87fc00-1328-476c-8033-bcd14c3a06d8"
+      },
+      "execution_count": null,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "     Category                                            Message\n",
+            "0         ham  Go until jurong point, crazy.. Available only ...\n",
+            "1         ham                      Ok lar... Joking wif u oni...\n",
+            "2        spam  Free entry in 2 a wkly comp to win FA Cup fina...\n",
+            "3         ham  U dun say so early hor... U c already then say...\n",
+            "4         ham  Nah I don't think he goes to usf, he lives aro...\n",
+            "...       ...                                                ...\n",
+            "5567     spam  This is the 2nd time we have tried 2 contact u...\n",
+            "5568      ham               Will ü b going to esplanade fr home?\n",
+            "5569      ham  Pity, * was in mood for that. So...any other s...\n",
+            "5570      ham  The guy did some bitching but I acted like i'd...\n",
+            "5571      ham                         Rofl. Its true to its name\n",
+            "\n",
+            "[5572 rows x 2 columns]\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# replace null values with an empty string\n",
+        "mail_data = raw_mail_data.where(pd.notnull(raw_mail_data), '')"
+      ],
+      "metadata": {
+        "id": "y2rsT5nT2JBA"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# printing the first 5 rows of the dataframe\n",
+        "mail_data.head()"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 206
+        },
+        "id": "fWXv7COg272M",
+        "outputId": "b44f1e80-e0f7-4a6f-f1f5-a23316323f08"
+      },
+      "execution_count": null,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "  Category                                            Message\n",
+              "0      ham  Go until jurong point, crazy.. Available only ...\n",
+              "1      ham                      Ok lar... Joking wif u oni...\n",
+              "2     spam  Free entry in 2 a wkly comp to win FA Cup fina...\n",
+              "3      ham  U dun say so early hor... U c already then say...\n",
+              "4      ham  Nah I don't think he goes to usf, he lives aro..."
+            ]
+          },
+          "metadata": {},
+          "execution_count": 6
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# checking the number of rows and columns in the dataframe\n",
+        "mail_data.shape"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "a9q3ycSH3Dyd",
+        "outputId": "b4c74161-8e31-4542-fc21-ef3d6ac51e8d"
+      },
+      "execution_count": null,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "(5572, 2)"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 7
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Label Encoding"
+      ],
+      "metadata": {
+        "id": "xhR7s_t13pkA"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# encode the labels: spam mail as 0, ham mail as 1\n",
+        "\n",
+        "mail_data.loc[mail_data['Category'] == 'spam', 'Category'] = 0\n",
+        "mail_data.loc[mail_data['Category'] == 'ham', 'Category'] = 1"
+      ],
+      "metadata": {
+        "id": "kansbBdx3TAz"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "spam - 0\n",
+        "ham - 1"
+      ],
+      "metadata": {
+        "id": "2gD99sNg4YM5"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# separating the data as text and label\n",
+        "\n",
+        "X = mail_data['Message']\n",
+        "\n",
+        "Y = mail_data['Category']"
+      ],
+      "metadata": {
+        "id": "Z9K7sXtz4Vsh"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "print(X)"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "oJZ4hpg746Bu",
+        "outputId": "aac864bd-dfb4-441f-f21c-72ae47b6c621"
+      },
+      "execution_count": null,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "0       Go until jurong point, crazy.. Available only ...\n",
+            "1                           Ok lar... Joking wif u oni...\n",
+            "2       Free entry in 2 a wkly comp to win FA Cup fina...\n",
+            "3       U dun say so early hor... U c already then say...\n",
+            "4       Nah I don't think he goes to usf, he lives aro...\n",
+            "                              ...                        \n",
+            "5567    This is the 2nd time we have tried 2 contact u...\n",
+            "5568                 Will ü b going to esplanade fr home?\n",
+            "5569    Pity, * was in mood for that. So...any other s...\n",
+            "5570    The guy did some bitching but I acted like i'd...\n",
+            "5571                           Rofl. Its true to its name\n",
+            "Name: Message, Length: 5572, dtype: object\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "print(Y)"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "OCnwQusp47zP",
+        "outputId": "75cc459e-e541-48b5-93c2-ded8789a0838"
+      },
+      "execution_count": null,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "0       1\n",
+            "1       1\n",
+            "2       0\n",
+            "3       1\n",
+            "4       1\n",
+            "       ..\n",
+            "5567    0\n",
+            "5568    1\n",
+            "5569    1\n",
+            "5570    1\n",
+            "5571    1\n",
+            "Name: Category, Length: 5572, dtype: object\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Splitting the data into training data and test data"
+      ],
+      "metadata": {
+        "id": "bXsx8Aaf5EHC"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)"
+      ],
+      "metadata": {
+        "id": "aSjY5OTx49tP"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "print(X.shape)\n",
+        "print(X_train.shape)\n",
+        "print(X_test.shape)"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "tMquKtlh6Lxu",
+        "outputId": "e0178f5a-21ac-4194-c595-1a81654752de"
+      },
+      "execution_count": null,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "(5572,)\n",
+            "(4457,)\n",
+            "(1115,)\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Feature Extraction"
+      ],
+      "metadata": {
+        "id": "kCX1EKnJ6l4x"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# transform the text data to feature vectors that can be used as input to the logistic regression model\n",
+        "\n",
+        "feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)\n",
+        "\n",
+        "X_train_features = feature_extraction.fit_transform(X_train)\n",
+        "X_test_features = feature_extraction.transform(X_test)\n",
+        "\n",
+        "# Convert Y_train and Y_test values to integers\n",
+        "Y_train = Y_train.astype('int')\n",
+        "Y_test = Y_test.astype('int')\n"
+      ],
+      "metadata": {
+        "id": "JyAmeXa26Xye"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "print(X_train)"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "zy9BjMhg9siT",
+        "outputId": "08ab590c-049b-496d-97a1-4b78048807f3"
+      },
+      "execution_count": null,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "3075                  Don know. I did't msg him recently.\n",
+            "1787    Do you know why god created gap between your f...\n",
+            "1614                         Thnx dude. u guys out 2nite?\n",
+            "4304                                      Yup i'm free...\n",
+            "3266    44 7732584351, Do you want a New Nokia 3510i c...\n",
+            "                              ...                        \n",
+            "789     5 Free Top Polyphonic Tones call 087018728737,...\n",
+            "968     What do u want when i come back?.a beautiful n...\n",
+            "1667    Guess who spent all last night phasing in and ...\n",
+            "3321    Eh sorry leh... I din c ur msg. Not sad alread...\n",
+            "1688    Free Top ringtone -sub to weekly ringtone-get ...\n",
+            "Name: Message, Length: 4457, dtype: object\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "print(X_train_features)"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "kWNtiRsSACQq",
+        "outputId": "f4fcb312-2e62-477d-a18a-da79bb2c36ac"
+      },
+      "execution_count": null,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "  (0, 5413)\t0.6198254967574347\n",
+            "  (0, 4456)\t0.4168658090846482\n",
+            "  (0, 2224)\t0.413103377943378\n",
+            "  (0, 3811)\t0.34780165336891333\n",
+            "  (0, 2329)\t0.38783870336935383\n",
+            "  (1, 4080)\t0.18880584110891163\n",
+            "  (1, 3185)\t0.29694482957694585\n",
+            "  (1, 3325)\t0.31610586766078863\n",
+            "  (1, 2957)\t0.3398297002864083\n",
+            "  (1, 2746)\t0.3398297002864083\n",
+            "  (1, 918)\t0.22871581159877646\n",
+            "  (1, 1839)\t0.2784903590561455\n",
+            "  (1, 2758)\t0.3226407885943799\n",
+            "  (1, 2956)\t0.33036995955537024\n",
+            "  (1, 1991)\t0.33036995955537024\n",
+            "  (1, 3046)\t0.2503712792613518\n",
+            "  (1, 3811)\t0.17419952275504033\n",
+            "  (2, 407)\t0.509272536051008\n",
+            "  (2, 3156)\t0.4107239318312698\n",
+            "  (2, 2404)\t0.45287711070606745\n",
+            "  (2, 6601)\t0.6056811524587518\n",
+            "  (3, 2870)\t0.5864269879324768\n",
+            "  (3, 7414)\t0.8100020912469564\n",
+            "  (4, 50)\t0.23633754072626942\n",
+            "  (4, 5497)\t0.15743785051118356\n",
+            "  :\t:\n",
+            "  (4454, 4602)\t0.2669765732445391\n",
+            "  (4454, 3142)\t0.32014451677763156\n",
+            "  (4455, 2247)\t0.37052851863170466\n",
+            "  (4455, 2469)\t0.35441545511837946\n",
+            "  (4455, 5646)\t0.33545678464631296\n",
+            "  (4455, 6810)\t0.29731757715898277\n",
+            "  (4455, 6091)\t0.23103841516927642\n",
+            "  (4455, 7113)\t0.30536590342067704\n",
+            "  (4455, 3872)\t0.3108911491788658\n",
+            "  (4455, 4715)\t0.30714144758811196\n",
+            "  (4455, 6916)\t0.19636985317119715\n",
+            "  (4455, 3922)\t0.31287563163368587\n",
+            "  (4455, 4456)\t0.24920025316220423\n",
+            "  (4456, 141)\t0.292943737785358\n",
+            "  (4456, 647)\t0.30133182431707617\n",
+            "  (4456, 6311)\t0.30133182431707617\n",
+            "  (4456, 5569)\t0.4619395404299172\n",
+            "  (4456, 6028)\t0.21034888000987115\n",
+            "  (4456, 7154)\t0.24083218452280053\n",
+            "  (4456, 7150)\t0.3677554681447669\n",
+            "  (4456, 6249)\t0.17573831794959716\n",
+            "  (4456, 6307)\t0.2752760476857975\n",
+            "  (4456, 334)\t0.2220077711654938\n",
+            "  (4456, 5778)\t0.16243064490100795\n",
+            "  (4456, 2870)\t0.31523196273113385\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Training the Model"
+      ],
+      "metadata": {
+        "id": "Pva5RHYdAXZ6"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Logistic Regression"
+      ],
+      "metadata": {
+        "id": "GecnPeGQAaQY"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "model = LogisticRegression()"
+      ],
+      "metadata": {
+        "id": "L0vJOJWEAHh7"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# training the Logistic Regression model with the training data\n",
+        "model.fit(X_train_features, Y_train)"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 75
+        },
+        "id": "arWwqBNsAlr4",
+        "outputId": "8ed56123-c774-4ef4-beaf-5b88376d8406"
+      },
+      "execution_count": null,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "LogisticRegression()"
+            ],
+            "text/html": [
+              "<style>#sk-container-id-2 {color: black;background-color: white;}#sk-container-id-2 pre{padding: 0;}#sk-container-id-2 div.sk-toggleable {background-color: white;}#sk-container-id-2 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-2 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-2 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-2 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-2 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-2 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-2 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-2 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-2 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-2 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-2 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-2 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-2 
div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-2 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-2 div.sk-item {position: relative;z-index: 1;}#sk-container-id-2 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-2 div.sk-item::before, #sk-container-id-2 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-2 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-2 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-2 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-2 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-2 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-2 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-2 div.sk-label-container {text-align: center;}#sk-container-id-2 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. 
See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-2 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-2\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>LogisticRegression()</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-2\" type=\"checkbox\" checked><label for=\"sk-estimator-id-2\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">LogisticRegression</label><div class=\"sk-toggleable__content\"><pre>LogisticRegression()</pre></div></div></div></div></div>"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 23
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Evaluating the trained model"
+      ],
+      "metadata": {
+        "id": "r8cYckM1A7XZ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# prediction on training data\n",
+        "\n",
+        "prediction_on_training_data = model.predict(X_train_features)\n",
+        "accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)"
+      ],
+      "metadata": {
+        "id": "tQAxlE6SA1FZ"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "print('Accuracy on training data : ', accuracy_on_training_data)"
+      ],
+      "metadata": {
+        "id": "jZ6YrljxBfVJ",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "outputId": "0c816d93-1eb1-40fc-d67d-9276cc78cd62"
+      },
+      "execution_count": null,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Accuracy on training data :  0.9670181736594121\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# prediction on test data\n",
+        "\n",
+        "prediction_on_test_data = model.predict(X_test_features)\n",
+        "accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)"
+      ],
+      "metadata": {
+        "id": "R00nGZjsDG8E"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "print('Accuracy on test data : ', accuracy_on_test_data)"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "1RskbscVDfKK",
+        "outputId": "7ffcd95c-3a64-4608-a114-6d57b4cc7e73"
+      },
+      "execution_count": null,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Accuracy on test data :  0.9659192825112107\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Building a Predictive System"
+      ],
+      "metadata": {
+        "id": "ReFsV1WjELoS"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "input_mail = [\"I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.\"]\n",
+        "\n",
+        "# convert text to feature vectors\n",
+        "input_data_features = feature_extraction.transform(input_mail)\n",
+        "\n",
+        "# making predictions\n",
+        "\n",
+        "prediction = model.predict(input_data_features)\n",
+        "print(prediction)\n",
+        "\n",
+        "# label 1 is treated as ham; any other label is reported as spam\n",
+        "if prediction[0] == 1:\n",
+        "  print('Ham mail')\n",
+        "\n",
+        "else:\n",
+        "  print('Spam mail')"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "Btg42OqdDkCH",
+        "outputId": "74932794-0527-451d-b257-fc281e5f2648"
+      },
+      "execution_count": null,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "[1]\n",
+            "Ham mail\n"
+          ]
+        }
+      ]
+    }
+  ]
+}
\ No newline at end of file