diff --git a/lab-dw-data-cleaning-and-formatting.ipynb b/lab-dw-data-cleaning-and-formatting.ipynb
index cdfc3c6..d59f6c1 100644
--- a/lab-dw-data-cleaning-and-formatting.ipynb
+++ b/lab-dw-data-cleaning-and-formatting.ipynb
@@ -1,371 +1,4223 @@
{
- "cells": [
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "25d7736c-ba17-4aff-b6bb-66eba20fbf4e",
+ "metadata": {
+ "id": "25d7736c-ba17-4aff-b6bb-66eba20fbf4e"
+ },
+ "source": [
+ "# Lab | Data Cleaning and Formatting"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d1973e9e-8be6-4039-b70e-d73ee0d94c99",
+ "metadata": {
+ "id": "d1973e9e-8be6-4039-b70e-d73ee0d94c99"
+ },
+ "source": [
+ "In this lab, we will be working with the customer data from an insurance company, which can be found in the CSV file located at the following link: https://raw.githubusercontent.com/data-bootcamp-v4/data/main/file1.csv\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "31b8a9e7-7db9-4604-991b-ef6771603e57",
+ "metadata": {
+ "id": "31b8a9e7-7db9-4604-991b-ef6771603e57"
+ },
+ "source": [
+ "# Challenge 1: Data Cleaning and Formatting"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "81553f19-9f2c-484b-8940-520aff884022",
+ "metadata": {
+ "id": "81553f19-9f2c-484b-8940-520aff884022"
+ },
+ "source": [
+ "## Exercise 1: Cleaning Column Names"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "34a929f4-1be4-4fa8-adda-42ffd920be90",
+ "metadata": {
+ "id": "34a929f4-1be4-4fa8-adda-42ffd920be90"
+ },
+ "source": [
+ "To ensure consistency and ease of use, standardize the column names of the dataframe. Start by taking a first look at the dataframe and identifying any column names that need to be modified. Use appropriate naming conventions and make sure that column names are descriptive and informative.\n",
+ "\n",
+ "*Hint*:\n",
+ "- *Column names should be in lower case*\n",
+ "- *White spaces in column names should be replaced by `_`*\n",
+ "- *`st` could be replaced with `state`*"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "5810735c-8056-4442-bbf2-dda38d3e284a",
+ "metadata": {
+ "id": "5810735c-8056-4442-bbf2-dda38d3e284a"
+ },
+ "outputs": [
{
- "cell_type": "markdown",
- "id": "25d7736c-ba17-4aff-b6bb-66eba20fbf4e",
- "metadata": {
- "id": "25d7736c-ba17-4aff-b6bb-66eba20fbf4e"
- },
- "source": [
- "# Lab | Data Cleaning and Formatting"
+ "data": {
+ "text/plain": [
+ " Customer ST GENDER Education \\\n",
+ "0 RB50392 Washington NaN Master \n",
+ "1 QZ44356 Arizona F Bachelor \n",
+ "2 AI49188 Nevada F Bachelor \n",
+ "3 WW63253 California M Bachelor \n",
+ "4 GA49547 Washington M High School or Below \n",
+ "... ... ... ... ... \n",
+ "4003 NaN NaN NaN NaN \n",
+ "4004 NaN NaN NaN NaN \n",
+ "4005 NaN NaN NaN NaN \n",
+ "4006 NaN NaN NaN NaN \n",
+ "4007 NaN NaN NaN NaN \n",
+ "\n",
+ " Customer Lifetime Value Income Monthly Premium Auto \\\n",
+ "0 NaN 0.0 1000.0 \n",
+ "1 697953.59% 0.0 94.0 \n",
+ "2 1288743.17% 48767.0 108.0 \n",
+ "3 764586.18% 0.0 106.0 \n",
+ "4 536307.65% 36357.0 68.0 \n",
+ "... ... ... ... \n",
+ "4003 NaN NaN NaN \n",
+ "4004 NaN NaN NaN \n",
+ "4005 NaN NaN NaN \n",
+ "4006 NaN NaN NaN \n",
+ "4007 NaN NaN NaN \n",
+ "\n",
+ " Number of Open Complaints Policy Type Vehicle Class \\\n",
+ "0 1/0/00 Personal Auto Four-Door Car \n",
+ "1 1/0/00 Personal Auto Four-Door Car \n",
+ "2 1/0/00 Personal Auto Two-Door Car \n",
+ "3 1/0/00 Corporate Auto SUV \n",
+ "4 1/0/00 Personal Auto Four-Door Car \n",
+ "... ... ... ... \n",
+ "4003 NaN NaN NaN \n",
+ "4004 NaN NaN NaN \n",
+ "4005 NaN NaN NaN \n",
+ "4006 NaN NaN NaN \n",
+ "4007 NaN NaN NaN \n",
+ "\n",
+ " Total Claim Amount \n",
+ "0 2.704934 \n",
+ "1 1131.464935 \n",
+ "2 566.472247 \n",
+ "3 529.881344 \n",
+ "4 17.269323 \n",
+ "... ... \n",
+ "4003 NaN \n",
+ "4004 NaN \n",
+ "4005 NaN \n",
+ "4006 NaN \n",
+ "4007 NaN \n",
+ "\n",
+ "[4008 rows x 11 columns]"
]
- },
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "url = \"https://raw.githubusercontent.com/data-bootcamp-v4/data/main/file1.csv\"\n",
+ "\n",
+ "df_1 = pd.read_csv(url)\n",
+ "\n",
+ "df_1\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "7c012bf0-f8a6-4819-96a3-a5053f6b994d",
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "markdown",
- "id": "d1973e9e-8be6-4039-b70e-d73ee0d94c99",
- "metadata": {
- "id": "d1973e9e-8be6-4039-b70e-d73ee0d94c99"
- },
- "source": [
- "In this lab, we will be working with the customer data from an insurance company, which can be found in the CSV file located at the following link: https://raw.githubusercontent.com/data-bootcamp-v4/data/main/file1.csv\n"
+ "data": {
+ "text/plain": [
+ " Customer ST GENDER Education \\\n",
+ "0 RB50392 Washington NaN Master \n",
+ "1 QZ44356 Arizona F Bachelor \n",
+ "2 AI49188 Nevada F Bachelor \n",
+ "3 WW63253 California M Bachelor \n",
+ "4 GA49547 Washington M High School or Below \n",
+ "... ... ... ... ... \n",
+ "4003 NaN NaN NaN NaN \n",
+ "4004 NaN NaN NaN NaN \n",
+ "4005 NaN NaN NaN NaN \n",
+ "4006 NaN NaN NaN NaN \n",
+ "4007 NaN NaN NaN NaN \n",
+ "\n",
+ " Customer Lifetime Value Income Monthly Premium Auto \\\n",
+ "0 NaN 0.0 1000.0 \n",
+ "1 697953.59% 0.0 94.0 \n",
+ "2 1288743.17% 48767.0 108.0 \n",
+ "3 764586.18% 0.0 106.0 \n",
+ "4 536307.65% 36357.0 68.0 \n",
+ "... ... ... ... \n",
+ "4003 NaN NaN NaN \n",
+ "4004 NaN NaN NaN \n",
+ "4005 NaN NaN NaN \n",
+ "4006 NaN NaN NaN \n",
+ "4007 NaN NaN NaN \n",
+ "\n",
+ " Number of Open Complaints Policy Type Vehicle Class \\\n",
+ "0 1/0/00 Personal Auto Four-Door Car \n",
+ "1 1/0/00 Personal Auto Four-Door Car \n",
+ "2 1/0/00 Personal Auto Two-Door Car \n",
+ "3 1/0/00 Corporate Auto SUV \n",
+ "4 1/0/00 Personal Auto Four-Door Car \n",
+ "... ... ... ... \n",
+ "4003 NaN NaN NaN \n",
+ "4004 NaN NaN NaN \n",
+ "4005 NaN NaN NaN \n",
+ "4006 NaN NaN NaN \n",
+ "4007 NaN NaN NaN \n",
+ "\n",
+ " Total Claim Amount \n",
+ "0 2.704934 \n",
+ "1 1131.464935 \n",
+ "2 566.472247 \n",
+ "3 529.881344 \n",
+ "4 17.269323 \n",
+ "... ... \n",
+ "4003 NaN \n",
+ "4004 NaN \n",
+ "4005 NaN \n",
+ "4006 NaN \n",
+ "4007 NaN \n",
+ "\n",
+ "[4008 rows x 11 columns]"
]
- },
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "url = \"https://raw.githubusercontent.com/data-bootcamp-v4/data/main/file1.csv\"\n",
+ "\n",
+ "df_1 = pd.read_csv(url)\n",
+ "\n",
+ "df_1\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "8f814bb4-5301-41f3-9ad5-000b067d05ee",
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "markdown",
- "id": "31b8a9e7-7db9-4604-991b-ef6771603e57",
- "metadata": {
- "id": "31b8a9e7-7db9-4604-991b-ef6771603e57"
- },
- "source": [
- "# Challenge 1: Data Cleaning and Formatting"
+ "data": {
+ "text/plain": [
+ " customer st gender education \\\n",
+ "0 RB50392 Washington NaN Master \n",
+ "1 QZ44356 Arizona F Bachelor \n",
+ "2 AI49188 Nevada F Bachelor \n",
+ "3 WW63253 California M Bachelor \n",
+ "4 GA49547 Washington M High School or Below \n",
+ "... ... ... ... ... \n",
+ "4003 NaN NaN NaN NaN \n",
+ "4004 NaN NaN NaN NaN \n",
+ "4005 NaN NaN NaN NaN \n",
+ "4006 NaN NaN NaN NaN \n",
+ "4007 NaN NaN NaN NaN \n",
+ "\n",
+ " customer-lifetime-value income monthly-premium-auto \\\n",
+ "0 NaN 0.0 1000.0 \n",
+ "1 697953.59% 0.0 94.0 \n",
+ "2 1288743.17% 48767.0 108.0 \n",
+ "3 764586.18% 0.0 106.0 \n",
+ "4 536307.65% 36357.0 68.0 \n",
+ "... ... ... ... \n",
+ "4003 NaN NaN NaN \n",
+ "4004 NaN NaN NaN \n",
+ "4005 NaN NaN NaN \n",
+ "4006 NaN NaN NaN \n",
+ "4007 NaN NaN NaN \n",
+ "\n",
+ " number-of-open-complaints policy-type vehicle-class \\\n",
+ "0 1/0/00 Personal Auto Four-Door Car \n",
+ "1 1/0/00 Personal Auto Four-Door Car \n",
+ "2 1/0/00 Personal Auto Two-Door Car \n",
+ "3 1/0/00 Corporate Auto SUV \n",
+ "4 1/0/00 Personal Auto Four-Door Car \n",
+ "... ... ... ... \n",
+ "4003 NaN NaN NaN \n",
+ "4004 NaN NaN NaN \n",
+ "4005 NaN NaN NaN \n",
+ "4006 NaN NaN NaN \n",
+ "4007 NaN NaN NaN \n",
+ "\n",
+ " total-claim-amount \n",
+ "0 2.704934 \n",
+ "1 1131.464935 \n",
+ "2 566.472247 \n",
+ "3 529.881344 \n",
+ "4 17.269323 \n",
+ "... ... \n",
+ "4003 NaN \n",
+ "4004 NaN \n",
+ "4005 NaN \n",
+ "4006 NaN \n",
+ "4007 NaN \n",
+ "\n",
+ "[4008 rows x 11 columns]"
]
- },
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df_1.columns = df_1.columns.str.lower()\n",
+ "df_1.columns = df_1.columns.str.replace(' ', '-')\n",
+ "df_1\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "id": "028054e1-0d00-4694-95d0-314962e2c100",
+ "metadata": {},
+ "outputs": [],
+ "source": [
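+ "# Lower-case the names and convert spaces (and the hyphens introduced by the previous cell) to underscores\n",
+ "df_1.columns = df_1.columns.str.lower().str.replace(r'[ -]', '_', regex=True)"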
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "a3be9516-44ee-4466-bf5d-8d2c9a588ea8",
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "markdown",
- "id": "81553f19-9f2c-484b-8940-520aff884022",
- "metadata": {
- "id": "81553f19-9f2c-484b-8940-520aff884022"
- },
- "source": [
- "## Exercise 1: Cleaning Column Names"
+ "data": {
+ "text/plain": [
+ " customer state gender education \\\n",
+ "0 RB50392 Washington NaN Master \n",
+ "1 QZ44356 Arizona F Bachelor \n",
+ "2 AI49188 Nevada F Bachelor \n",
+ "3 WW63253 California M Bachelor \n",
+ "4 GA49547 Washington M High School or Below \n",
+ "... ... ... ... ... \n",
+ "4003 NaN NaN NaN NaN \n",
+ "4004 NaN NaN NaN NaN \n",
+ "4005 NaN NaN NaN NaN \n",
+ "4006 NaN NaN NaN NaN \n",
+ "4007 NaN NaN NaN NaN \n",
+ "\n",
+ " customer-lifetime-value income monthly-premium-auto \\\n",
+ "0 NaN 0.0 1000.0 \n",
+ "1 697953.59% 0.0 94.0 \n",
+ "2 1288743.17% 48767.0 108.0 \n",
+ "3 764586.18% 0.0 106.0 \n",
+ "4 536307.65% 36357.0 68.0 \n",
+ "... ... ... ... \n",
+ "4003 NaN NaN NaN \n",
+ "4004 NaN NaN NaN \n",
+ "4005 NaN NaN NaN \n",
+ "4006 NaN NaN NaN \n",
+ "4007 NaN NaN NaN \n",
+ "\n",
+ " number-of-open-complaints policy-type vehicle-class \\\n",
+ "0 1/0/00 Personal Auto Four-Door Car \n",
+ "1 1/0/00 Personal Auto Four-Door Car \n",
+ "2 1/0/00 Personal Auto Two-Door Car \n",
+ "3 1/0/00 Corporate Auto SUV \n",
+ "4 1/0/00 Personal Auto Four-Door Car \n",
+ "... ... ... ... \n",
+ "4003 NaN NaN NaN \n",
+ "4004 NaN NaN NaN \n",
+ "4005 NaN NaN NaN \n",
+ "4006 NaN NaN NaN \n",
+ "4007 NaN NaN NaN \n",
+ "\n",
+ " total-claim-amount \n",
+ "0 2.704934 \n",
+ "1 1131.464935 \n",
+ "2 566.472247 \n",
+ "3 529.881344 \n",
+ "4 17.269323 \n",
+ "... ... \n",
+ "4003 NaN \n",
+ "4004 NaN \n",
+ "4005 NaN \n",
+ "4006 NaN \n",
+ "4007 NaN \n",
+ "\n",
+ "[4008 rows x 11 columns]"
]
- },
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df_1 = df_1.rename(columns={'st': 'state'})\n",
+ "df_1"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9cb501ec-36ff-4589-b872-6252bb150316",
+ "metadata": {
+ "id": "9cb501ec-36ff-4589-b872-6252bb150316"
+ },
+ "source": [
+ "## Exercise 2: Cleaning Invalid Values"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "771fdcf3-8e20-4b06-9c24-3a93ba2b0909",
+ "metadata": {
+ "id": "771fdcf3-8e20-4b06-9c24-3a93ba2b0909"
+ },
+ "source": [
+ "The dataset contains columns with inconsistent and incorrect values that could affect the accuracy of our analysis. Therefore, we need to clean these columns to ensure that they only contain valid data.\n",
+ "\n",
+ "Note that this exercise will focus only on cleaning inconsistent values and will not involve handling null values (NaN or None).\n",
+ "\n",
+ "*Hint*:\n",
+ "- *Gender column contains various inconsistent values such as \"F\", \"M\", \"Femal\", \"Male\", \"female\", which need to be standardized, for example, to \"M\" and \"F\".*\n",
+ "- *State abbreviations can be replaced with their full names, for example \"AZ\": \"Arizona\", \"Cali\": \"California\", \"WA\": \"Washington\"*\n",
+ "- *In education, \"Bachelors\" could be replaced by \"Bachelor\"*\n",
+ "- *In Customer Lifetime Value, delete the `%` character*\n",
+ "- *In vehicle class, \"Sports Car\", \"Luxury SUV\" and \"Luxury Car\" could be replaced by \"Luxury\"*"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "3f8ee5cb-50ab-48af-8a9f-9a389804033c",
+ "metadata": {
+ "id": "3f8ee5cb-50ab-48af-8a9f-9a389804033c"
+ },
+ "outputs": [
{
- "cell_type": "markdown",
- "id": "34a929f4-1be4-4fa8-adda-42ffd920be90",
- "metadata": {
- "id": "34a929f4-1be4-4fa8-adda-42ffd920be90"
- },
- "source": [
- "To ensure consistency and ease of use, standardize the column names of the dataframe. Start by taking a first look at the dataframe and identifying any column names that need to be modified. Use appropriate naming conventions and make sure that column names are descriptive and informative.\n",
- "\n",
- "*Hint*:\n",
- "- *Column names should be in lower case*\n",
- "- *White spaces in column names should be replaced by `_`*\n",
- "- *`st` could be replaced for `state`*"
+ "data": {
+ "text/plain": [
+ " customer state gender education customer-lifetime-value \\\n",
+ "0 RB50392 Washington NaN Master NaN \n",
+ "1 QZ44356 Arizona F Bachelor 697953.59% \n",
+ "2 AI49188 Nevada F Bachelor 1288743.17% \n",
+ "3 WW63253 California M Bachelor 764586.18% \n",
+ "4 GA49547 Washington M High School or Below 536307.65% \n",
+ "\n",
+ " income monthly-premium-auto number-of-open-complaints policy-type \\\n",
+ "0 0.0 1000.0 1/0/00 Personal Auto \n",
+ "1 0.0 94.0 1/0/00 Personal Auto \n",
+ "2 48767.0 108.0 1/0/00 Personal Auto \n",
+ "3 0.0 106.0 1/0/00 Corporate Auto \n",
+ "4 36357.0 68.0 1/0/00 Personal Auto \n",
+ "\n",
+ " vehicle-class total-claim-amount \n",
+ "0 Four-Door Car 2.704934 \n",
+ "1 Four-Door Car 1131.464935 \n",
+ "2 Two-Door Car 566.472247 \n",
+ "3 SUV 529.881344 \n",
+ "4 Four-Door Car 17.269323 "
]
- },
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df_1.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 40,
+ "id": "307c6e73-f251-4c43-888f-2e4782ca698a",
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "code",
- "execution_count": null,
- "id": "5810735c-8056-4442-bbf2-dda38d3e284a",
- "metadata": {
- "id": "5810735c-8056-4442-bbf2-dda38d3e284a"
- },
- "outputs": [],
- "source": [
- "# Your code here"
- ]
- },
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[nan]\n"
+ ]
+ }
+ ],
+ "source": [
+ "gender_values = df_1[\"gender\"].unique()\n",
+ "print(gender_values)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "id": "7286f75b-69e2-46a0-adf1-2254b521b368",
+ "metadata": {},
+ "outputs": [],
+ "source": [
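+ "# Map only the non-standard spellings; replace() leaves \"F\", \"M\" and NaN untouched,\n",
+ "# whereas map() would turn every value missing from the dict into NaN\n",
+ "gender_mapping = {\"Femal\": \"F\", \"female\": \"F\", \"Male\": \"M\"}\n",
+ "df_1['gender'] = df_1['gender'].replace(gender_mapping)"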
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "id": "f2f8fc47-23cc-47ba-80c6-02e35a69bcf9",
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "markdown",
- "id": "9cb501ec-36ff-4589-b872-6252bb150316",
- "metadata": {
- "id": "9cb501ec-36ff-4589-b872-6252bb150316"
- },
- "source": [
- "## Exercise 2: Cleaning invalid Values"
+ "data": {
+ "text/plain": [
+ " customer st gender education \\\n",
+ "0 RB50392 Washington NaN Master \n",
+ "1 QZ44356 Arizona NaN Bachelor \n",
+ "2 AI49188 Nevada NaN Bachelor \n",
+ "3 WW63253 California NaN Bachelor \n",
+ "4 GA49547 Washington NaN High School or Below \n",
+ "... ... ... ... ... \n",
+ "4003 NaN NaN NaN NaN \n",
+ "4004 NaN NaN NaN NaN \n",
+ "4005 NaN NaN NaN NaN \n",
+ "4006 NaN NaN NaN NaN \n",
+ "4007 NaN NaN NaN NaN \n",
+ "\n",
+ " customer-lifetime-value income monthly-premium-auto \\\n",
+ "0 NaN 0.0 1000.0 \n",
+ "1 697953.59% 0.0 94.0 \n",
+ "2 1288743.17% 48767.0 108.0 \n",
+ "3 764586.18% 0.0 106.0 \n",
+ "4 536307.65% 36357.0 68.0 \n",
+ "... ... ... ... \n",
+ "4003 NaN NaN NaN \n",
+ "4004 NaN NaN NaN \n",
+ "4005 NaN NaN NaN \n",
+ "4006 NaN NaN NaN \n",
+ "4007 NaN NaN NaN \n",
+ "\n",
+ " number-of-open-complaints policy-type vehicle-class \\\n",
+ "0 1/0/00 Personal Auto Four-Door Car \n",
+ "1 1/0/00 Personal Auto Four-Door Car \n",
+ "2 1/0/00 Personal Auto Two-Door Car \n",
+ "3 1/0/00 Corporate Auto SUV \n",
+ "4 1/0/00 Personal Auto Four-Door Car \n",
+ "... ... ... ... \n",
+ "4003 NaN NaN NaN \n",
+ "4004 NaN NaN NaN \n",
+ "4005 NaN NaN NaN \n",
+ "4006 NaN NaN NaN \n",
+ "4007 NaN NaN NaN \n",
+ "\n",
+ " total-claim-amount \n",
+ "0 2.704934 \n",
+ "1 1131.464935 \n",
+ "2 566.472247 \n",
+ "3 529.881344 \n",
+ "4 17.269323 \n",
+ "... ... \n",
+ "4003 NaN \n",
+ "4004 NaN \n",
+ "4005 NaN \n",
+ "4006 NaN \n",
+ "4007 NaN \n",
+ "\n",
+ "[4008 rows x 11 columns]"
]
- },
+ },
+ "execution_count": 24,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
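+ "# Use replace() rather than map(): map() turns every value missing from the dictionary (such as \"F\" and \"M\") into NaN\n",
+ "gender_mapping = {\"Male\": \"M\", \"female\": \"F\", \"Femal\": \"F\"}\n",
+ "df_1[\"gender\"] = df_1[\"gender\"].replace(gender_mapping)\n",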
+ "\n",
+ "df_1"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "id": "320abb22-44d2-4920-b764-1054faf986bd",
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "markdown",
- "id": "771fdcf3-8e20-4b06-9c24-3a93ba2b0909",
- "metadata": {
- "id": "771fdcf3-8e20-4b06-9c24-3a93ba2b0909"
- },
- "source": [
- "The dataset contains columns with inconsistent and incorrect values that could affect the accuracy of our analysis. Therefore, we need to clean these columns to ensure that they only contain valid data.\n",
- "\n",
- "Note that this exercise will focus only on cleaning inconsistent values and will not involve handling null values (NaN or None).\n",
- "\n",
- "*Hint*:\n",
- "- *Gender column contains various inconsistent values such as \"F\", \"M\", \"Femal\", \"Male\", \"female\", which need to be standardized, for example, to \"M\" and \"F\".*\n",
- "- *State abbreviations be can replaced with its full name, for example \"AZ\": \"Arizona\", \"Cali\": \"California\", \"WA\": \"Washington\"*\n",
- "- *In education, \"Bachelors\" could be replaced by \"Bachelor\"*\n",
- "- *In Customer Lifetime Value, delete the `%` character*\n",
- "- *In vehicle class, \"Sports Car\", \"Luxury SUV\" and \"Luxury Car\" could be replaced by \"Luxury\"*"
- ]
- },
+ "ename": "KeyError",
+ "evalue": "'state'",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[1;31mKeyError\u001b[0m Traceback (most recent call last)",
+ "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\core\\indexes\\base.py:3805\u001b[0m, in \u001b[0;36mIndex.get_loc\u001b[1;34m(self, key)\u001b[0m\n\u001b[0;32m 3804\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m-> 3805\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_engine\u001b[38;5;241m.\u001b[39mget_loc(casted_key)\n\u001b[0;32m 3806\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m err:\n",
+ "File \u001b[1;32mindex.pyx:167\u001b[0m, in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[1;34m()\u001b[0m\n",
+ "File \u001b[1;32mindex.pyx:196\u001b[0m, in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[1;34m()\u001b[0m\n",
+ "File \u001b[1;32mpandas\\\\_libs\\\\hashtable_class_helper.pxi:7081\u001b[0m, in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[1;34m()\u001b[0m\n",
+ "File \u001b[1;32mpandas\\\\_libs\\\\hashtable_class_helper.pxi:7089\u001b[0m, in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[1;34m()\u001b[0m\n",
+ "\u001b[1;31mKeyError\u001b[0m: 'state'",
+ "\nThe above exception was the direct cause of the following exception:\n",
+ "\u001b[1;31mKeyError\u001b[0m Traceback (most recent call last)",
+ "Cell \u001b[1;32mIn[26], line 2\u001b[0m\n\u001b[0;32m 1\u001b[0m state_mapping \u001b[38;5;241m=\u001b[39m {\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mAZ\u001b[39m\u001b[38;5;124m\"\u001b[39m: \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mArizona\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCali\u001b[39m\u001b[38;5;124m\"\u001b[39m: \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCalifornia\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mWA\u001b[39m\u001b[38;5;124m\"\u001b[39m: \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mWashington\u001b[39m\u001b[38;5;124m\"\u001b[39m}\n\u001b[1;32m----> 2\u001b[0m df_1[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mstate\u001b[39m\u001b[38;5;124m'\u001b[39m] \u001b[38;5;241m=\u001b[39m df_1[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mstate\u001b[39m\u001b[38;5;124m'\u001b[39m]\u001b[38;5;241m.\u001b[39mreplace(state_mapping)\n",
+ "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\core\\frame.py:4102\u001b[0m, in \u001b[0;36mDataFrame.__getitem__\u001b[1;34m(self, key)\u001b[0m\n\u001b[0;32m 4100\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mcolumns\u001b[38;5;241m.\u001b[39mnlevels \u001b[38;5;241m>\u001b[39m \u001b[38;5;241m1\u001b[39m:\n\u001b[0;32m 4101\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_getitem_multilevel(key)\n\u001b[1;32m-> 4102\u001b[0m indexer \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mcolumns\u001b[38;5;241m.\u001b[39mget_loc(key)\n\u001b[0;32m 4103\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m is_integer(indexer):\n\u001b[0;32m 4104\u001b[0m indexer \u001b[38;5;241m=\u001b[39m [indexer]\n",
+ "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\core\\indexes\\base.py:3812\u001b[0m, in \u001b[0;36mIndex.get_loc\u001b[1;34m(self, key)\u001b[0m\n\u001b[0;32m 3807\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(casted_key, \u001b[38;5;28mslice\u001b[39m) \u001b[38;5;129;01mor\u001b[39;00m (\n\u001b[0;32m 3808\u001b[0m \u001b[38;5;28misinstance\u001b[39m(casted_key, abc\u001b[38;5;241m.\u001b[39mIterable)\n\u001b[0;32m 3809\u001b[0m \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;28many\u001b[39m(\u001b[38;5;28misinstance\u001b[39m(x, \u001b[38;5;28mslice\u001b[39m) \u001b[38;5;28;01mfor\u001b[39;00m x \u001b[38;5;129;01min\u001b[39;00m casted_key)\n\u001b[0;32m 3810\u001b[0m ):\n\u001b[0;32m 3811\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m InvalidIndexError(key)\n\u001b[1;32m-> 3812\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m(key) \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01merr\u001b[39;00m\n\u001b[0;32m 3813\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m:\n\u001b[0;32m 3814\u001b[0m \u001b[38;5;66;03m# If we have a listlike key, _check_indexing_error will raise\u001b[39;00m\n\u001b[0;32m 3815\u001b[0m \u001b[38;5;66;03m# InvalidIndexError. Otherwise we fall through and re-raise\u001b[39;00m\n\u001b[0;32m 3816\u001b[0m \u001b[38;5;66;03m# the TypeError.\u001b[39;00m\n\u001b[0;32m 3817\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_check_indexing_error(key)\n",
+ "\u001b[1;31mKeyError\u001b[0m: 'state'"
+ ]
+ }
+ ],
+ "source": [
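+ "# The raw column is named \"st\"; rename it first (a no-op if it is already \"state\") to avoid the KeyError\n",
+ "df_1 = df_1.rename(columns={\"st\": \"state\"})\n",
+ "state_mapping = {\"AZ\": \"Arizona\", \"Cali\": \"California\", \"WA\": \"Washington\"}\n",
+ "df_1['state'] = df_1['state'].replace(state_mapping)"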
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "id": "eb3ae7f3-ed57-4902-973a-8f686c70ba71",
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "code",
- "execution_count": null,
- "id": "3f8ee5cb-50ab-48af-8a9f-9a389804033c",
- "metadata": {
- "id": "3f8ee5cb-50ab-48af-8a9f-9a389804033c"
- },
- "outputs": [],
- "source": [
- "# Your code here"
- ]
- },
+ "ename": "KeyError",
+ "evalue": "'state'",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[1;31mKeyError\u001b[0m Traceback (most recent call last)",
+ "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\core\\indexes\\base.py:3805\u001b[0m, in \u001b[0;36mIndex.get_loc\u001b[1;34m(self, key)\u001b[0m\n\u001b[0;32m 3804\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m-> 3805\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_engine\u001b[38;5;241m.\u001b[39mget_loc(casted_key)\n\u001b[0;32m 3806\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m err:\n",
+ "File \u001b[1;32mindex.pyx:167\u001b[0m, in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[1;34m()\u001b[0m\n",
+ "File \u001b[1;32mindex.pyx:196\u001b[0m, in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[1;34m()\u001b[0m\n",
+ "File \u001b[1;32mpandas\\\\_libs\\\\hashtable_class_helper.pxi:7081\u001b[0m, in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[1;34m()\u001b[0m\n",
+ "File \u001b[1;32mpandas\\\\_libs\\\\hashtable_class_helper.pxi:7089\u001b[0m, in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[1;34m()\u001b[0m\n",
+ "\u001b[1;31mKeyError\u001b[0m: 'state'",
+ "\nThe above exception was the direct cause of the following exception:\n",
+ "\u001b[1;31mKeyError\u001b[0m Traceback (most recent call last)",
+ "Cell \u001b[1;32mIn[22], line 2\u001b[0m\n\u001b[0;32m 1\u001b[0m state_mapping \u001b[38;5;241m=\u001b[39m {\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mWashington\u001b[39m\u001b[38;5;124m\"\u001b[39m: \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mWA\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mArizona\u001b[39m\u001b[38;5;124m\"\u001b[39m:\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mAZ\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCalifornia\u001b[39m\u001b[38;5;124m\"\u001b[39m: \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCA\u001b[39m\u001b[38;5;124m\"\u001b[39m}\n\u001b[1;32m----> 2\u001b[0m df_1[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mstate\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m=\u001b[39m df_1[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mstate\u001b[39m\u001b[38;5;124m\"\u001b[39m]\u001b[38;5;241m.\u001b[39mmap(state_mapping)\n\u001b[0;32m 4\u001b[0m df_1\n",
+ "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\core\\frame.py:4102\u001b[0m, in \u001b[0;36mDataFrame.__getitem__\u001b[1;34m(self, key)\u001b[0m\n\u001b[0;32m 4100\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mcolumns\u001b[38;5;241m.\u001b[39mnlevels \u001b[38;5;241m>\u001b[39m \u001b[38;5;241m1\u001b[39m:\n\u001b[0;32m 4101\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_getitem_multilevel(key)\n\u001b[1;32m-> 4102\u001b[0m indexer \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mcolumns\u001b[38;5;241m.\u001b[39mget_loc(key)\n\u001b[0;32m 4103\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m is_integer(indexer):\n\u001b[0;32m 4104\u001b[0m indexer \u001b[38;5;241m=\u001b[39m [indexer]\n",
+ "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\core\\indexes\\base.py:3812\u001b[0m, in \u001b[0;36mIndex.get_loc\u001b[1;34m(self, key)\u001b[0m\n\u001b[0;32m 3807\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(casted_key, \u001b[38;5;28mslice\u001b[39m) \u001b[38;5;129;01mor\u001b[39;00m (\n\u001b[0;32m 3808\u001b[0m \u001b[38;5;28misinstance\u001b[39m(casted_key, abc\u001b[38;5;241m.\u001b[39mIterable)\n\u001b[0;32m 3809\u001b[0m \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;28many\u001b[39m(\u001b[38;5;28misinstance\u001b[39m(x, \u001b[38;5;28mslice\u001b[39m) \u001b[38;5;28;01mfor\u001b[39;00m x \u001b[38;5;129;01min\u001b[39;00m casted_key)\n\u001b[0;32m 3810\u001b[0m ):\n\u001b[0;32m 3811\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m InvalidIndexError(key)\n\u001b[1;32m-> 3812\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m(key) \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01merr\u001b[39;00m\n\u001b[0;32m 3813\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m:\n\u001b[0;32m 3814\u001b[0m \u001b[38;5;66;03m# If we have a listlike key, _check_indexing_error will raise\u001b[39;00m\n\u001b[0;32m 3815\u001b[0m \u001b[38;5;66;03m# InvalidIndexError. Otherwise we fall through and re-raise\u001b[39;00m\n\u001b[0;32m 3816\u001b[0m \u001b[38;5;66;03m# the TypeError.\u001b[39;00m\n\u001b[0;32m 3817\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_check_indexing_error(key)\n",
+ "\u001b[1;31mKeyError\u001b[0m: 'state'"
+ ]
+ }
+ ],
+ "source": [
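+ "# replace() keeps states that are not in the mapping (e.g. \"Nevada\"); map() would turn them into NaN\n",
+ "state_mapping = {\"Washington\": \"WA\", \"Arizona\": \"AZ\", \"California\": \"CA\"}\n",
+ "df_1[\"state\"] = df_1[\"state\"].replace(state_mapping)\n",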
+ "\n",
+ "df_1"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "id": "f12369be-549c-450b-bb9e-18591620c83d",
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "markdown",
- "id": "85ff78ce-0174-4890-9db3-8048b7d7d2d0",
- "metadata": {
- "id": "85ff78ce-0174-4890-9db3-8048b7d7d2d0"
- },
- "source": [
- "## Exercise 3: Formatting data types"
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>customer</th>\n",
+ "      <th>st</th>\n",
+ "      <th>gender</th>\n",
+ "      <th>education</th>\n",
+ "      <th>customer-lifetime-value</th>\n",
+ "      <th>income</th>\n",
+ "      <th>monthly-premium-auto</th>\n",
+ "      <th>number-of-open-complaints</th>\n",
+ "      <th>policy-type</th>\n",
+ "      <th>vehicle-class</th>\n",
+ "      <th>total-claim-amount</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>0</th>\n",
+ "      <td>RB50392</td>\n",
+ "      <td>Washington</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>0.0</td>\n",
+ "      <td>1000.0</td>\n",
+ "      <td>1/0/00</td>\n",
+ "      <td>Personal Auto</td>\n",
+ "      <td>Four-Door Car</td>\n",
+ "      <td>2.704934</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>1</th>\n",
+ "      <td>QZ44356</td>\n",
+ "      <td>Arizona</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>697953.59%</td>\n",
+ "      <td>0.0</td>\n",
+ "      <td>94.0</td>\n",
+ "      <td>1/0/00</td>\n",
+ "      <td>Personal Auto</td>\n",
+ "      <td>Four-Door Car</td>\n",
+ "      <td>1131.464935</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>2</th>\n",
+ "      <td>AI49188</td>\n",
+ "      <td>Nevada</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>1288743.17%</td>\n",
+ "      <td>48767.0</td>\n",
+ "      <td>108.0</td>\n",
+ "      <td>1/0/00</td>\n",
+ "      <td>Personal Auto</td>\n",
+ "      <td>Two-Door Car</td>\n",
+ "      <td>566.472247</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>3</th>\n",
+ "      <td>WW63253</td>\n",
+ "      <td>California</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>764586.18%</td>\n",
+ "      <td>0.0</td>\n",
+ "      <td>106.0</td>\n",
+ "      <td>1/0/00</td>\n",
+ "      <td>Corporate Auto</td>\n",
+ "      <td>SUV</td>\n",
+ "      <td>529.881344</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>4</th>\n",
+ "      <td>GA49547</td>\n",
+ "      <td>Washington</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>536307.65%</td>\n",
+ "      <td>36357.0</td>\n",
+ "      <td>68.0</td>\n",
+ "      <td>1/0/00</td>\n",
+ "      <td>Personal Auto</td>\n",
+ "      <td>Four-Door Car</td>\n",
+ "      <td>17.269323</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>...</th>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "      <td>...</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>4003</th>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>4004</th>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>4005</th>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>4006</th>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>4007</th>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "<p>4008 rows × 11 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " customer st gender education customer-lifetime-value income \\\n",
+ "0 RB50392 Washington NaN NaN NaN 0.0 \n",
+ "1 QZ44356 Arizona NaN NaN 697953.59% 0.0 \n",
+ "2 AI49188 Nevada NaN NaN 1288743.17% 48767.0 \n",
+ "3 WW63253 California NaN NaN 764586.18% 0.0 \n",
+ "4 GA49547 Washington NaN NaN 536307.65% 36357.0 \n",
+ "... ... ... ... ... ... ... \n",
+ "4003 NaN NaN NaN NaN NaN NaN \n",
+ "4004 NaN NaN NaN NaN NaN NaN \n",
+ "4005 NaN NaN NaN NaN NaN NaN \n",
+ "4006 NaN NaN NaN NaN NaN NaN \n",
+ "4007 NaN NaN NaN NaN NaN NaN \n",
+ "\n",
+ " monthly-premium-auto number-of-open-complaints policy-type \\\n",
+ "0 1000.0 1/0/00 Personal Auto \n",
+ "1 94.0 1/0/00 Personal Auto \n",
+ "2 108.0 1/0/00 Personal Auto \n",
+ "3 106.0 1/0/00 Corporate Auto \n",
+ "4 68.0 1/0/00 Personal Auto \n",
+ "... ... ... ... \n",
+ "4003 NaN NaN NaN \n",
+ "4004 NaN NaN NaN \n",
+ "4005 NaN NaN NaN \n",
+ "4006 NaN NaN NaN \n",
+ "4007 NaN NaN NaN \n",
+ "\n",
+ " vehicle-class total-claim-amount \n",
+ "0 Four-Door Car 2.704934 \n",
+ "1 Four-Door Car 1131.464935 \n",
+ "2 Two-Door Car 566.472247 \n",
+ "3 SUV 529.881344 \n",
+ "4 Four-Door Car 17.269323 \n",
+ "... ... ... \n",
+ "4003 NaN NaN \n",
+ "4004 NaN NaN \n",
+ "4005 NaN NaN \n",
+ "4006 NaN NaN \n",
+ "4007 NaN NaN \n",
+ "\n",
+ "[4008 rows x 11 columns]"
]
- },
+ },
+ "execution_count": 32,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
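+ "# Use replace() rather than map(): map() would set every education level other than \"Bachelors\" to NaN\n",
+ "education_mapping = {\"Bachelors\": \"Bachelor\"}\n",
+ "df_1[\"education\"] = df_1[\"education\"].replace(education_mapping)\n",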
+ "\n",
+ "df_1"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 72,
+ "id": "637c369a-3d6c-4ba9-b3e9-cde9c55317c3",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df_1['customer-lifetime-value'] = df_1['customer-lifetime-value'].astype(str)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 74,
+ "id": "b7f385ea-3675-470f-a267-59b15e52c153",
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "markdown",
- "id": "b91c2cf8-79a2-4baf-9f65-ff2fb22270bd",
- "metadata": {
- "id": "b91c2cf8-79a2-4baf-9f65-ff2fb22270bd"
- },
- "source": [
- "The data types of many columns in the dataset appear to be incorrect. This could impact the accuracy of our analysis. To ensure accurate analysis, we need to correct the data types of these columns. Please update the data types of the columns as appropriate."
+ "data": {
+ "text/plain": [
+ "0 nan\n",
+ "1 697953.59\n",
+ "2 1288743.17\n",
+ "3 764586.18\n",
+ "4 536307.65\n",
+ " ... \n",
+ "4003 nan\n",
+ "4004 nan\n",
+ "4005 nan\n",
+ "4006 nan\n",
+ "4007 nan\n",
+ "Name: customer-lifetime-value, Length: 4008, dtype: object"
]
- },
+ },
+ "execution_count": 74,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df_1['customer-lifetime-value']"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 76,
+ "id": "87dc7ff2-e28c-4191-a00b-0079d8b519f3",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df_1['customer-lifetime-value'] = df_1['customer-lifetime-value'].str.rstrip('%').astype(float)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 36,
+ "id": "72dd4c1b-f1c3-4849-b875-1e2237c98792",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "vehicle_class_mapping = {\"Sports Car\": \"Luxury\", \"Luxury SUV\": \"Luxury\", \"Luxury Car\": \"Luxury\"}\n",
+ "df_1[\"vehicle-class\"] = df_1[\"vehicle-class\"].replace(vehicle_class_mapping)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 38,
+ "id": "1ac80f75-7fbd-40ad-815b-42a0352cb6c8",
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "markdown",
- "id": "43e5d853-ff9e-43b2-9d92-aef2f78764f3",
- "metadata": {
- "id": "43e5d853-ff9e-43b2-9d92-aef2f78764f3"
- },
- "source": [
- "It is important to note that this exercise does not involve handling null values (NaN or None)."
- ]
- },
+ "ename": "TypeError",
+ "evalue": "argument of type 'float' is not iterable",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)",
+ "Cell \u001b[1;32mIn[38], line 10\u001b[0m\n\u001b[0;32m 6\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m 7\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m row\n\u001b[1;32m---> 10\u001b[0m df_1[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mvehicle-class\u001b[39m\u001b[38;5;124m\"\u001b[39m]\u001b[38;5;241m.\u001b[39mapply(vehicle)\n",
+ "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\core\\series.py:4924\u001b[0m, in \u001b[0;36mSeries.apply\u001b[1;34m(self, func, convert_dtype, args, by_row, **kwargs)\u001b[0m\n\u001b[0;32m 4789\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mapply\u001b[39m(\n\u001b[0;32m 4790\u001b[0m \u001b[38;5;28mself\u001b[39m,\n\u001b[0;32m 4791\u001b[0m func: AggFuncType,\n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 4796\u001b[0m \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs,\n\u001b[0;32m 4797\u001b[0m ) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m DataFrame \u001b[38;5;241m|\u001b[39m Series:\n\u001b[0;32m 4798\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[0;32m 4799\u001b[0m \u001b[38;5;124;03m Invoke function on values of Series.\u001b[39;00m\n\u001b[0;32m 4800\u001b[0m \n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 4915\u001b[0m \u001b[38;5;124;03m dtype: float64\u001b[39;00m\n\u001b[0;32m 4916\u001b[0m \u001b[38;5;124;03m \"\"\"\u001b[39;00m\n\u001b[0;32m 4917\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m SeriesApply(\n\u001b[0;32m 4918\u001b[0m \u001b[38;5;28mself\u001b[39m,\n\u001b[0;32m 4919\u001b[0m func,\n\u001b[0;32m 4920\u001b[0m convert_dtype\u001b[38;5;241m=\u001b[39mconvert_dtype,\n\u001b[0;32m 4921\u001b[0m by_row\u001b[38;5;241m=\u001b[39mby_row,\n\u001b[0;32m 4922\u001b[0m args\u001b[38;5;241m=\u001b[39margs,\n\u001b[0;32m 4923\u001b[0m kwargs\u001b[38;5;241m=\u001b[39mkwargs,\n\u001b[1;32m-> 4924\u001b[0m )\u001b[38;5;241m.\u001b[39mapply()\n",
+ "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\core\\apply.py:1427\u001b[0m, in \u001b[0;36mSeriesApply.apply\u001b[1;34m(self)\u001b[0m\n\u001b[0;32m 1424\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mapply_compat()\n\u001b[0;32m 1426\u001b[0m \u001b[38;5;66;03m# self.func is Callable\u001b[39;00m\n\u001b[1;32m-> 1427\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mapply_standard()\n",
+ "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\core\\apply.py:1507\u001b[0m, in \u001b[0;36mSeriesApply.apply_standard\u001b[1;34m(self)\u001b[0m\n\u001b[0;32m 1501\u001b[0m \u001b[38;5;66;03m# row-wise access\u001b[39;00m\n\u001b[0;32m 1502\u001b[0m \u001b[38;5;66;03m# apply doesn't have a `na_action` keyword and for backward compat reasons\u001b[39;00m\n\u001b[0;32m 1503\u001b[0m \u001b[38;5;66;03m# we need to give `na_action=\"ignore\"` for categorical data.\u001b[39;00m\n\u001b[0;32m 1504\u001b[0m \u001b[38;5;66;03m# TODO: remove the `na_action=\"ignore\"` when that default has been changed in\u001b[39;00m\n\u001b[0;32m 1505\u001b[0m \u001b[38;5;66;03m# Categorical (GH51645).\u001b[39;00m\n\u001b[0;32m 1506\u001b[0m action \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mignore\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(obj\u001b[38;5;241m.\u001b[39mdtype, CategoricalDtype) \u001b[38;5;28;01melse\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[1;32m-> 1507\u001b[0m mapped \u001b[38;5;241m=\u001b[39m obj\u001b[38;5;241m.\u001b[39m_map_values(\n\u001b[0;32m 1508\u001b[0m mapper\u001b[38;5;241m=\u001b[39mcurried, na_action\u001b[38;5;241m=\u001b[39maction, convert\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mconvert_dtype\n\u001b[0;32m 1509\u001b[0m )\n\u001b[0;32m 1511\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(mapped) \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(mapped[\u001b[38;5;241m0\u001b[39m], ABCSeries):\n\u001b[0;32m 1512\u001b[0m \u001b[38;5;66;03m# GH#43986 Need to do list(mapped) in order to get treated as nested\u001b[39;00m\n\u001b[0;32m 1513\u001b[0m \u001b[38;5;66;03m# See also GH#25959 regarding EA support\u001b[39;00m\n\u001b[0;32m 1514\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m 
obj\u001b[38;5;241m.\u001b[39m_constructor_expanddim(\u001b[38;5;28mlist\u001b[39m(mapped), index\u001b[38;5;241m=\u001b[39mobj\u001b[38;5;241m.\u001b[39mindex)\n",
+ "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\core\\base.py:921\u001b[0m, in \u001b[0;36mIndexOpsMixin._map_values\u001b[1;34m(self, mapper, na_action, convert)\u001b[0m\n\u001b[0;32m 918\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(arr, ExtensionArray):\n\u001b[0;32m 919\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m arr\u001b[38;5;241m.\u001b[39mmap(mapper, na_action\u001b[38;5;241m=\u001b[39mna_action)\n\u001b[1;32m--> 921\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m algorithms\u001b[38;5;241m.\u001b[39mmap_array(arr, mapper, na_action\u001b[38;5;241m=\u001b[39mna_action, convert\u001b[38;5;241m=\u001b[39mconvert)\n",
+ "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\core\\algorithms.py:1743\u001b[0m, in \u001b[0;36mmap_array\u001b[1;34m(arr, mapper, na_action, convert)\u001b[0m\n\u001b[0;32m 1741\u001b[0m values \u001b[38;5;241m=\u001b[39m arr\u001b[38;5;241m.\u001b[39mastype(\u001b[38;5;28mobject\u001b[39m, copy\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mFalse\u001b[39;00m)\n\u001b[0;32m 1742\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m na_action \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[1;32m-> 1743\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m lib\u001b[38;5;241m.\u001b[39mmap_infer(values, mapper, convert\u001b[38;5;241m=\u001b[39mconvert)\n\u001b[0;32m 1744\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m 1745\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m lib\u001b[38;5;241m.\u001b[39mmap_infer_mask(\n\u001b[0;32m 1746\u001b[0m values, mapper, mask\u001b[38;5;241m=\u001b[39misna(values)\u001b[38;5;241m.\u001b[39mview(np\u001b[38;5;241m.\u001b[39muint8), convert\u001b[38;5;241m=\u001b[39mconvert\n\u001b[0;32m 1747\u001b[0m )\n",
+ "File \u001b[1;32mlib.pyx:2972\u001b[0m, in \u001b[0;36mpandas._libs.lib.map_infer\u001b[1;34m()\u001b[0m\n",
+ "Cell \u001b[1;32mIn[38], line 2\u001b[0m, in \u001b[0;36mvehicle\u001b[1;34m(row)\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mvehicle\u001b[39m(row):\n\u001b[1;32m----> 2\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mLuxury Car\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;129;01min\u001b[39;00m row:\n\u001b[0;32m 3\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mLuxury\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[0;32m 4\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mLuxury SUV\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;129;01min\u001b[39;00m row:\n",
+ "\u001b[1;31mTypeError\u001b[0m: argument of type 'float' is not iterable"
+ ]
+ }
+ ],
+ "source": [
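+ "def vehicle(row):\n",
+ "    # Missing values are NaN (a float), so testing them with `in` raises a TypeError\n",
+ "    if not isinstance(row, str):\n",
+ "        return row\n",
+ "    if row in (\"Sports Car\", \"Luxury SUV\", \"Luxury Car\"):\n",
+ "        return \"Luxury\"\n",
+ "    return row\n",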
+ "\n",
+ "\n",
+ "df_1[\"vehicle-class\"].apply(vehicle)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 78,
+ "id": "b0eaefa3-0108-48db-b84c-be3c044e7b81",
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "markdown",
- "id": "329ca691-9196-4419-8969-3596746237a1",
- "metadata": {
- "id": "329ca691-9196-4419-8969-3596746237a1"
- },
- "source": [
- "*Hint*:\n",
- "- *Customer lifetime value should be numeric*\n",
- "- *Number of open complaints has an incorrect format. Look at the different values it takes with `unique()` and take the middle value. As an example, 1/5/00 should be 5. Number of open complaints is a string - remember you can use `split()` to deal with it and take the number you need. Finally, since it should be numeric, cast the column to be in its proper type.*"
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>customer</th>\n",
+ "      <th>state</th>\n",
+ "      <th>gender</th>\n",
+ "      <th>education</th>\n",
+ "      <th>customer-lifetime-value</th>\n",
+ "      <th>income</th>\n",
+ "      <th>monthly-premium-auto</th>\n",
+ "      <th>number-of-open-complaints</th>\n",
+ "      <th>policy-type</th>\n",
+ "      <th>vehicle-class</th>\n",
+ "      <th>total-claim-amount</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>0</th>\n",
+ "      <td>RB50392</td>\n",
+ "      <td>WA</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>0.0</td>\n",
+ "      <td>1000.0</td>\n",
+ "      <td>1/0/00</td>\n",
+ "      <td>Personal Auto</td>\n",
+ "      <td>Four-Door Car</td>\n",
+ "      <td>2.704934</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>1</th>\n",
+ "      <td>QZ44356</td>\n",
+ "      <td>AZ</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>Bachelors</td>\n",
+ "      <td>697953.59</td>\n",
+ "      <td>0.0</td>\n",
+ "      <td>94.0</td>\n",
+ "      <td>1/0/00</td>\n",
+ "      <td>Personal Auto</td>\n",
+ "      <td>Four-Door Car</td>\n",
+ "      <td>1131.464935</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>2</th>\n",
+ "      <td>AI49188</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>Bachelors</td>\n",
+ "      <td>1288743.17</td>\n",
+ "      <td>48767.0</td>\n",
+ "      <td>108.0</td>\n",
+ "      <td>1/0/00</td>\n",
+ "      <td>Personal Auto</td>\n",
+ "      <td>Two-Door Car</td>\n",
+ "      <td>566.472247</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>3</th>\n",
+ "      <td>WW63253</td>\n",
+ "      <td>CA</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>Bachelors</td>\n",
+ "      <td>764586.18</td>\n",
+ "      <td>0.0</td>\n",
+ "      <td>106.0</td>\n",
+ "      <td>1/0/00</td>\n",
+ "      <td>Corporate Auto</td>\n",
+ "      <td>SUV</td>\n",
+ "      <td>529.881344</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>4</th>\n",
+ "      <td>GA49547</td>\n",
+ "      <td>WA</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>536307.65</td>\n",
+ "      <td>36357.0</td>\n",
+ "      <td>68.0</td>\n",
+ "      <td>1/0/00</td>\n",
+ "      <td>Personal Auto</td>\n",
+ "      <td>Four-Door Car</td>\n",
+ "      <td>17.269323</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " customer state gender education customer-lifetime-value income \\\n",
+ "0 RB50392 WA NaN NaN NaN 0.0 \n",
+ "1 QZ44356 AZ NaN Bachelors 697953.59 0.0 \n",
+ "2 AI49188 NaN NaN Bachelors 1288743.17 48767.0 \n",
+ "3 WW63253 CA NaN Bachelors 764586.18 0.0 \n",
+ "4 GA49547 WA NaN NaN 536307.65 36357.0 \n",
+ "\n",
+ " monthly-premium-auto number-of-open-complaints policy-type \\\n",
+ "0 1000.0 1/0/00 Personal Auto \n",
+ "1 94.0 1/0/00 Personal Auto \n",
+ "2 108.0 1/0/00 Personal Auto \n",
+ "3 106.0 1/0/00 Corporate Auto \n",
+ "4 68.0 1/0/00 Personal Auto \n",
+ "\n",
+ " vehicle-class total-claim-amount \n",
+ "0 Four-Door Car 2.704934 \n",
+ "1 Four-Door Car 1131.464935 \n",
+ "2 Two-Door Car 566.472247 \n",
+ "3 SUV 529.881344 \n",
+ "4 Four-Door Car 17.269323 "
]
- },
+ },
+ "execution_count": 78,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df_1.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 86,
+ "id": "fa5dde9b-c5cd-49ae-b4ff-057e39899947",
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "code",
- "execution_count": null,
- "id": "eb8f5991-73e9-405f-bf1c-6b7c589379a9",
- "metadata": {
- "id": "eb8f5991-73e9-405f-bf1c-6b7c589379a9"
- },
- "outputs": [],
- "source": [
- "# Your code here"
+ "data": {
+ "text/plain": [
+ " customer state gender education customer-lifetime-value income \\\n",
+ "0 RB50392 WA NaN NaN NaN 0.0 \n",
+ "1 QZ44356 AZ NaN Bachelors 697953.59 0.0 \n",
+ "2 AI49188 NaN NaN Bachelors 1288743.17 48767.0 \n",
+ "3 WW63253 CA NaN Bachelors 764586.18 0.0 \n",
+ "4 GA49547 WA NaN NaN 536307.65 36357.0 \n",
+ "... ... ... ... ... ... ... \n",
+ "4003 NaN NaN NaN NaN NaN NaN \n",
+ "4004 NaN NaN NaN NaN NaN NaN \n",
+ "4005 NaN NaN NaN NaN NaN NaN \n",
+ "4006 NaN NaN NaN NaN NaN NaN \n",
+ "4007 NaN NaN NaN NaN NaN NaN \n",
+ "\n",
+ " monthly-premium-auto number-of-open-complaints policy-type \\\n",
+ "0 1000.0 1/0/00 Personal Auto \n",
+ "1 94.0 1/0/00 Personal Auto \n",
+ "2 108.0 1/0/00 Personal Auto \n",
+ "3 106.0 1/0/00 Corporate Auto \n",
+ "4 68.0 1/0/00 Personal Auto \n",
+ "... ... ... ... \n",
+ "4003 NaN NaN NaN \n",
+ "4004 NaN NaN NaN \n",
+ "4005 NaN NaN NaN \n",
+ "4006 NaN NaN NaN \n",
+ "4007 NaN NaN NaN \n",
+ "\n",
+ " vehicle-class total-claim-amount \n",
+ "0 NaN 2.704934 \n",
+ "1 NaN 1131.464935 \n",
+ "2 NaN 566.472247 \n",
+ "3 NaN 529.881344 \n",
+ "4 NaN 17.269323 \n",
+ "... ... ... \n",
+ "4003 NaN NaN \n",
+ "4004 NaN NaN \n",
+ "4005 NaN NaN \n",
+ "4006 NaN NaN \n",
+ "4007 NaN NaN \n",
+ "\n",
+ "[4008 rows x 11 columns]"
]
- },
+ },
+ "execution_count": 86,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "vehicle_class_mapping = {\"Sports Car\": \"Luxury\", \"Luxury SUV\": \"Luxury\"}\n",
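+ "# replace() changes only the listed classes; map() would turn every unlisted class into NaN\n",
+ "df_1[\"vehicle-class\"] = df_1[\"vehicle-class\"].replace(vehicle_class_mapping)\n",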
+ "\n",
+ "df_1"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "85ff78ce-0174-4890-9db3-8048b7d7d2d0",
+ "metadata": {
+ "id": "85ff78ce-0174-4890-9db3-8048b7d7d2d0"
+ },
+ "source": [
+ "## Exercise 3: Formatting data types"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8080375a-03e0-4be1-ab1e-c17ce40ae911",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b91c2cf8-79a2-4baf-9f65-ff2fb22270bd",
+ "metadata": {
+ "id": "b91c2cf8-79a2-4baf-9f65-ff2fb22270bd"
+ },
+ "source": [
+ "Many columns in the dataset are stored with an incorrect data type, which can distort the analysis. Update the data types of these columns as appropriate."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "43e5d853-ff9e-43b2-9d92-aef2f78764f3",
+ "metadata": {
+ "id": "43e5d853-ff9e-43b2-9d92-aef2f78764f3"
+ },
+ "source": [
+ "Note that this exercise does not involve handling null values (NaN or None)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "329ca691-9196-4419-8969-3596746237a1",
+ "metadata": {
+ "id": "329ca691-9196-4419-8969-3596746237a1"
+ },
+ "source": [
+ "*Hint*:\n",
+ "- *Customer lifetime value should be numeric*\n",
+ "- *Number of open complaints has an incorrect format. Look at the different values it takes with `unique()` and take the middle value. As an example, 1/5/00 should be 5. Number of open complaints is a string - remember you can use `split()` to deal with it and take the number you need. Finally, since it should be numeric, cast the column to be in its proper type.*"
+ ]
+ },
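A minimal sketch of the two hints above, run on a hypothetical toy frame named `demo` rather than on `df_1`. The sample values are assumptions modelled on the raw data (a trailing `%` on customer lifetime value, `1/x/00` strings for complaints). It uses `pd.to_numeric` for the final cast because a plain `astype(int)` raises on NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mirroring the two columns named in the hint.
demo = pd.DataFrame({
    "customer-lifetime-value": ["697953.59%", "1288743.17%", np.nan],
    "number-of-open-complaints": ["1/0/00", "1/5/00", np.nan],
})

# Strip the trailing percent sign, then cast; NaN is allowed in a float column.
demo["customer-lifetime-value"] = (
    demo["customer-lifetime-value"].str.rstrip("%").astype(float)
)

# Split on "/" and keep the middle piece; missing rows stay missing.
middle = demo["number-of-open-complaints"].str.split("/").str[1]

# Convert the digit strings to numbers without failing on the missing rows.
demo["number-of-open-complaints"] = pd.to_numeric(middle)

print(demo.dtypes)
```

Once the null values have been handled in the next exercise, the column can be cast to a plain integer type.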
+ {
+ "cell_type": "code",
+ "execution_count": 96,
+ "id": "eb8f5991-73e9-405f-bf1c-6b7c589379a9",
+ "metadata": {
+ "id": "eb8f5991-73e9-405f-bf1c-6b7c589379a9"
+ },
+ "outputs": [
{
- "cell_type": "markdown",
- "id": "14c52e28-2d0c-4dd2-8bd5-3476e34fadc1",
- "metadata": {
- "id": "14c52e28-2d0c-4dd2-8bd5-3476e34fadc1"
- },
- "source": [
- "## Exercise 4: Dealing with Null values"
+ "data": {
+ "text/plain": [
+ " customer state gender education customer-lifetime-value income \\\n",
+ "0 RB50392 WA NaN NaN NaN 0.0 \n",
+ "1 QZ44356 AZ NaN Bachelors 697953.59 0.0 \n",
+ "2 AI49188 NaN NaN Bachelors 1288743.17 48767.0 \n",
+ "3 WW63253 CA NaN Bachelors 764586.18 0.0 \n",
+ "4 GA49547 WA NaN NaN 536307.65 36357.0 \n",
+ "\n",
+ " monthly-premium-auto number-of-open-complaints policy-type \\\n",
+ "0 1000.0 1/0/00 Personal Auto \n",
+ "1 94.0 1/0/00 Personal Auto \n",
+ "2 108.0 1/0/00 Personal Auto \n",
+ "3 106.0 1/0/00 Corporate Auto \n",
+ "4 68.0 1/0/00 Personal Auto \n",
+ "\n",
+ " vehicle-class total-claim-amount \n",
+ "0 NaN 2.704934 \n",
+ "1 NaN 1131.464935 \n",
+ "2 NaN 566.472247 \n",
+ "3 NaN 529.881344 \n",
+ "4 NaN 17.269323 "
]
- },
+ },
+ "execution_count": 96,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df_1.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 98,
+ "id": "1977e0e4-159b-4c46-a30e-9b93a36b2559",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df_1['customer-lifetime-value'] = df_1['customer-lifetime-value'].astype(float)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 100,
+ "id": "fa40988e-1a58-4cdc-83f5-3dc61f3c16ff",
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "markdown",
- "id": "34b9a20f-7d32-4417-975e-1b4dfb0e16cd",
- "metadata": {
- "id": "34b9a20f-7d32-4417-975e-1b4dfb0e16cd"
- },
- "source": [
- "Identify any columns with null or missing values. Identify how many null values each column has. You can use the `isnull()` function in pandas to find columns with null values.\n",
- "\n",
- "Decide on a strategy for handling the null values. There are several options, including:\n",
- "\n",
- "- Drop the rows or columns with null values\n",
- "- Fill the null values with a specific value (such as the column mean or median for numerical variables, and mode for categorical variables)\n",
- "- Fill the null values with the previous or next value in the column\n",
- "- Fill the null values based on a more complex algorithm or model (note: we haven't covered this yet)\n",
- "\n",
- "Implement your chosen strategy to handle the null values. You can use the `fillna()` function in pandas to fill null values or `dropna()` function to drop null values.\n",
- "\n",
- "Verify that your strategy has successfully handled the null values. You can use the `isnull()` function again to check if there are still null values in the dataset.\n",
- "\n",
- "Remember to document your process and explain your reasoning for choosing a particular strategy for handling null values.\n",
- "\n",
- "After formatting data types, as a last step, convert all the numeric variables to integers."
+ "data": {
+ "text/plain": [
+ "0 NaN\n",
+ "1 697953.59\n",
+ "2 1288743.17\n",
+ "3 764586.18\n",
+ "4 536307.65\n",
+ " ... \n",
+ "4003 NaN\n",
+ "4004 NaN\n",
+ "4005 NaN\n",
+ "4006 NaN\n",
+ "4007 NaN\n",
+ "Name: customer-lifetime-value, Length: 4008, dtype: float64"
]
- },
+ },
+ "execution_count": 100,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df_1[\"customer-lifetime-value\"]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 42,
+ "id": "9064ebbc-bb82-4703-8e2d-5980adbd8cb2",
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "code",
- "execution_count": null,
- "id": "f184fc35-7831-4836-a0a5-e7f99e01b40e",
- "metadata": {
- "id": "f184fc35-7831-4836-a0a5-e7f99e01b40e"
- },
- "outputs": [],
- "source": [
- "# Your code here"
- ]
- },
+ "ename": "ValueError",
+ "evalue": "cannot convert float NaN to integer",
+ "output_type": "error",
+ "traceback": [
+ "---------------------------------------------------------------------------",
+ "ValueError Traceback (most recent call last)",
+ "Cell In[42], line 1\n----> 1 df_1['number-of-open-complaints'] = df_1['number-of-open-complaints'].str.split('/').str[1].astype(int)\n",
+ "File ~\\anaconda3\\Lib\\site-packages\\pandas\\core\\generic.py:6643, in NDFrame.astype(self, dtype, copy, errors)\n 6637 results = [\n 6638 ser.astype(dtype, copy=copy, errors=errors) for _, ser in self.items()\n 6639 ]\n 6641 else:\n 6642 # else, only a single dtype is given\n-> 6643 new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)\n 6644 res = self._constructor_from_mgr(new_data, axes=new_data.axes)\n 6645 return res.__finalize__(self, method=\"astype\")\n",
+ "File ~\\anaconda3\\Lib\\site-packages\\pandas\\core\\internals\\managers.py:430, in BaseBlockManager.astype(self, dtype, copy, errors)\n 427 elif using_copy_on_write():\n 428 copy = False\n--> 430 return self.apply(\n 431 \"astype\",\n 432 dtype=dtype,\n 433 copy=copy,\n 434 errors=errors,\n 435 using_cow=using_copy_on_write(),\n 436 )\n",
+ "File ~\\anaconda3\\Lib\\site-packages\\pandas\\core\\internals\\managers.py:363, in BaseBlockManager.apply(self, f, align_keys, **kwargs)\n 361 applied = b.apply(f, **kwargs)\n 362 else:\n--> 363 applied = getattr(b, f)(**kwargs)\n 364 result_blocks = extend_blocks(applied, result_blocks)\n 366 out = type(self).from_blocks(result_blocks, self.axes)\n",
+ "File ~\\anaconda3\\Lib\\site-packages\\pandas\\core\\internals\\blocks.py:758, in Block.astype(self, dtype, copy, errors, using_cow, squeeze)\n 755 raise ValueError(\"Can not squeeze with more than one column.\")\n 756 values = values[0, :] # type: ignore[call-overload]\n--> 758 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)\n 760 new_values = maybe_coerce_values(new_values)\n 762 refs = None\n",
+ "File ~\\anaconda3\\Lib\\site-packages\\pandas\\core\\dtypes\\astype.py:237, in astype_array_safe(values, dtype, copy, errors)\n 234 dtype = dtype.numpy_dtype\n 236 try:\n--> 237 new_values = astype_array(values, dtype, copy=copy)\n 238 except (ValueError, TypeError):\n 239 # e.g. _astype_nansafe can fail on object-dtype of strings\n 240 # trying to convert to float\n 241 if errors == \"ignore\":\n",
+ "File ~\\anaconda3\\Lib\\site-packages\\pandas\\core\\dtypes\\astype.py:182, in astype_array(values, dtype, copy)\n 179 values = values.astype(dtype, copy=copy)\n 181 else:\n--> 182 values = _astype_nansafe(values, dtype, copy=copy)\n 184 # in pandas we don't store numpy str dtypes, so convert to object\n 185 if isinstance(dtype, np.dtype) and issubclass(values.dtype.type, str):\n",
+ "File ~\\anaconda3\\Lib\\site-packages\\pandas\\core\\dtypes\\astype.py:133, in _astype_nansafe(arr, dtype, copy, skipna)\n 129 raise ValueError(msg)\n 131 if copy or arr.dtype == object or dtype == object:\n 132 # Explicit copy, or required since NumPy can't view from / to object.\n--> 133 return arr.astype(dtype, copy=True)\n 135 return arr.astype(dtype, copy=copy)\n",
+ "ValueError: cannot convert float NaN to integer"
+ ]
+ }
+ ],
+ "source": [
+ "df_1['number-of-open-complaints'] = df_1['number-of-open-complaints'].str.split('/').str[1].astype(int)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 102,
+ "id": "1ad2a170-57d2-4bd2-9de4-cdf437a185e2",
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "markdown",
- "id": "98416351-e999-4156-9834-9b00a311adfa",
- "metadata": {
- "id": "98416351-e999-4156-9834-9b00a311adfa"
- },
- "source": [
- "## Exercise 5: Dealing with duplicates"
+ "data": {
+ "text/plain": [
+ "array(['1/0/00', '1/2/00', '1/1/00', '1/3/00', '1/5/00', '1/4/00', nan],\n",
+ " dtype=object)"
]
- },
+ },
+ "execution_count": 102,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df_1[\"number-of-open-complaints\"].unique()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 104,
+ "id": "b300fef4-4c8d-4b5b-b615-ff028532a319",
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "markdown",
- "id": "ea0816a7-a18e-4d4c-b667-a8452a800bd1",
- "metadata": {
- "id": "ea0816a7-a18e-4d4c-b667-a8452a800bd1"
- },
- "source": [
- "Use the `.duplicated()` method to identify any duplicate rows in the dataframe.\n",
- "\n",
- "Decide on a strategy for handling the duplicates. Options include:\n",
- "- Dropping all duplicate rows\n",
- "- Keeping only the first occurrence of each duplicated row\n",
- "- Keeping only the last occurrence of each duplicated row\n",
- "- Dropping duplicates based on a subset of columns\n",
- "- Dropping duplicates based on a specific column\n",
- "\n",
- "Implement your chosen strategy using the `drop_duplicates()` function.\n",
- "\n",
- "Verify that your strategy has successfully handled the duplicates by checking for duplicates again using `.duplicated()`.\n",
- "\n",
- "Remember to document your process and explain your reasoning for choosing a particular strategy for handling duplicates.\n",
- "\n",
- "Save the cleaned dataset to a new CSV file.\n",
- "\n",
- "*Hint*: *after dropping duplicates, reset the index to ensure consistency*."
+ "data": {
+ "text/plain": [
+ " customer state gender education customer-lifetime-value income \\\n",
+ "0 RB50392 WA NaN NaN NaN 0.0 \n",
+ "1 QZ44356 AZ NaN Bachelors 697953.59 0.0 \n",
+ "2 AI49188 NaN NaN Bachelors 1288743.17 48767.0 \n",
+ "3 WW63253 CA NaN Bachelors 764586.18 0.0 \n",
+ "4 GA49547 WA NaN NaN 536307.65 36357.0 \n",
+ "... ... ... ... ... ... ... \n",
+ "4003 NaN NaN NaN NaN NaN NaN \n",
+ "4004 NaN NaN NaN NaN NaN NaN \n",
+ "4005 NaN NaN NaN NaN NaN NaN \n",
+ "4006 NaN NaN NaN NaN NaN NaN \n",
+ "4007 NaN NaN NaN NaN NaN NaN \n",
+ "\n",
+ " monthly-premium-auto number-of-open-complaints policy-type \\\n",
+ "0 1000.0 0 Personal Auto \n",
+ "1 94.0 0 Personal Auto \n",
+ "2 108.0 0 Personal Auto \n",
+ "3 106.0 0 Corporate Auto \n",
+ "4 68.0 0 Personal Auto \n",
+ "... ... ... ... \n",
+ "4003 NaN NaN NaN \n",
+ "4004 NaN NaN NaN \n",
+ "4005 NaN NaN NaN \n",
+ "4006 NaN NaN NaN \n",
+ "4007 NaN NaN NaN \n",
+ "\n",
+ " vehicle-class total-claim-amount \n",
+ "0 NaN 2.704934 \n",
+ "1 NaN 1131.464935 \n",
+ "2 NaN 566.472247 \n",
+ "3 NaN 529.881344 \n",
+ "4 NaN 17.269323 \n",
+ "... ... ... \n",
+ "4003 NaN NaN \n",
+ "4004 NaN NaN \n",
+ "4005 NaN NaN \n",
+ "4006 NaN NaN \n",
+ "4007 NaN NaN \n",
+ "\n",
+ "[4008 rows x 11 columns]"
]
- },
+ },
+ "execution_count": 104,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "complaints_mapping = {\"1/0/00\": \"0\", \"1/2/00\": \"2\", \"1/1/00\": \"1\", \"1/3/00\": \"3\", \"1/4/00\": \"4\", \"1/5/00\": \"5\"}\n",
+ "df_1[\"number-of-open-complaints\"] = df_1[\"number-of-open-complaints\"].map(complaints_mapping)\n",
+ "\n",
+ "df_1"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 106,
+ "id": "96adf88a-396a-4db4-8504-e0da253a92a9",
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "code",
- "execution_count": null,
- "id": "1929362c-47ed-47cb-baca-358b78d401a0",
- "metadata": {
- "id": "1929362c-47ed-47cb-baca-358b78d401a0"
- },
- "outputs": [],
- "source": [
- "# Your code here"
+ "data": {
+ "text/plain": [
+ "array(['0', '2', '1', '3', '5', '4', nan], dtype=object)"
]
- },
+ },
+ "execution_count": 106,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df_1[\"number-of-open-complaints\"].unique()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "14c52e28-2d0c-4dd2-8bd5-3476e34fadc1",
+ "metadata": {
+ "id": "14c52e28-2d0c-4dd2-8bd5-3476e34fadc1"
+ },
+ "source": [
+ "## Exercise 4: Dealing with Null values"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "34b9a20f-7d32-4417-975e-1b4dfb0e16cd",
+ "metadata": {
+ "id": "34b9a20f-7d32-4417-975e-1b4dfb0e16cd"
+ },
+ "source": [
+ "Identify any columns with null or missing values. Identify how many null values each column has. You can use the `isnull()` function in pandas to find columns with null values.\n",
+ "\n",
+ "Decide on a strategy for handling the null values. There are several options, including:\n",
+ "\n",
+ "- Drop the rows or columns with null values\n",
+ "- Fill the null values with a specific value (such as the column mean or median for numerical variables, and mode for categorical variables)\n",
+ "- Fill the null values with the previous or next value in the column\n",
+ "- Fill the null values based on a more complex algorithm or model (note: we haven't covered this yet)\n",
+ "\n",
+ "Implement your chosen strategy to handle the null values. You can use the `fillna()` function in pandas to fill null values or `dropna()` function to drop null values.\n",
+ "\n",
+ "Verify that your strategy has successfully handled the null values. You can use the `isnull()` function again to check if there are still null values in the dataset.\n",
+ "\n",
+ "Remember to document your process and explain your reasoning for choosing a particular strategy for handling null values.\n",
+ "\n",
+ "After formatting data types, as a last step, convert all the numeric variables to integers."
+ ]
+ },
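The steps described above can be sketched roughly as follows. This is an illustration on a hypothetical toy frame named `demo`, not the lab's solution. The choice of median for the numeric column and mode for the categorical one is an assumption, and any of the other listed strategies could be substituted.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one fully empty row plus scattered gaps.
demo = pd.DataFrame({
    "gender": ["F", np.nan, "M", "F", np.nan],
    "income": [48767.0, 0.0, np.nan, 36357.0, np.nan],
})

# 1. Count the null values in each column.
print(demo.isnull().sum())

# 2. Drop rows in which every column is null, then rebuild the index.
demo = demo.dropna(how="all").reset_index(drop=True)

# 3. Fill numeric gaps with the median and categorical gaps with the mode.
demo["income"] = demo["income"].fillna(demo["income"].median())
demo["gender"] = demo["gender"].fillna(demo["gender"].mode()[0])

# 4. Verify that no nulls remain, then cast the numeric column to integers.
print(demo.isnull().sum().sum())
demo["income"] = demo["income"].astype(int)
print(demo)
```

Dropping the fully empty rows first keeps them from influencing the counts and makes the final integer cast possible, since an integer column cannot hold NaN.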
+ {
+ "cell_type": "code",
+ "execution_count": 65,
+ "id": "f184fc35-7831-4836-a0a5-e7f99e01b40e",
+ "metadata": {
+ "id": "f184fc35-7831-4836-a0a5-e7f99e01b40e"
+ },
+ "outputs": [
{
- "cell_type": "markdown",
- "id": "60840701-4783-40e2-b4d8-55303f9100c9",
- "metadata": {
- "id": "60840701-4783-40e2-b4d8-55303f9100c9"
- },
- "source": [
- "# Bonus: Challenge 2: creating functions on a separate `py` file"
+ "data": {
+ "text/plain": [
+ " customer state gender education \\\n",
+ "0 RB50392 Washington NaN Master \n",
+ "1 QZ44356 Arizona F Bachelor \n",
+ "2 AI49188 Nevada F Bachelor \n",
+ "3 WW63253 California M Bachelor \n",
+ "4 GA49547 Washington M High School or Below \n",
+ "... ... ... ... ... \n",
+ "4003 NaN NaN NaN NaN \n",
+ "4004 NaN NaN NaN NaN \n",
+ "4005 NaN NaN NaN NaN \n",
+ "4006 NaN NaN NaN NaN \n",
+ "4007 NaN NaN NaN NaN \n",
+ "\n",
+ " customer-lifetime-value income monthly-premium-auto \\\n",
+ "0 NaN 0.0 1000.0 \n",
+ "1 697953.59% 0.0 94.0 \n",
+ "2 1288743.17% 48767.0 108.0 \n",
+ "3 764586.18% 0.0 106.0 \n",
+ "4 536307.65% 36357.0 68.0 \n",
+ "... ... ... ... \n",
+ "4003 NaN NaN NaN \n",
+ "4004 NaN NaN NaN \n",
+ "4005 NaN NaN NaN \n",
+ "4006 NaN NaN NaN \n",
+ "4007 NaN NaN NaN \n",
+ "\n",
+ " number-of-open-complaints policy-type vehicle-class \\\n",
+ "0 1/0/00 Personal Auto Four-Door Car \n",
+ "1 1/0/00 Personal Auto Four-Door Car \n",
+ "2 1/0/00 Personal Auto Two-Door Car \n",
+ "3 1/0/00 Corporate Auto SUV \n",
+ "4 1/0/00 Personal Auto Four-Door Car \n",
+ "... ... ... ... \n",
+ "4003 NaN NaN NaN \n",
+ "4004 NaN NaN NaN \n",
+ "4005 NaN NaN NaN \n",
+ "4006 NaN NaN NaN \n",
+ "4007 NaN NaN NaN \n",
+ "\n",
+ " total-claim-amount \n",
+ "0 2.704934 \n",
+ "1 1131.464935 \n",
+ "2 566.472247 \n",
+ "3 529.881344 \n",
+ "4 17.269323 \n",
+ "... ... \n",
+ "4003 NaN \n",
+ "4004 NaN \n",
+ "4005 NaN \n",
+ "4006 NaN \n",
+ "4007 NaN \n",
+ "\n",
+ "[4008 rows x 11 columns]"
]
- },
+ },
+ "execution_count": 65,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df_1"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 67,
+ "id": "697de917-29db-420d-8a5c-503fc959a21a",
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "markdown",
- "id": "9d1adb3a-17cf-4899-8041-da21a4337fb4",
- "metadata": {
- "id": "9d1adb3a-17cf-4899-8041-da21a4337fb4"
- },
- "source": [
- "Put all the data cleaning and formatting steps into functions, and create a main function that performs all the cleaning and formatting.\n",
- "\n",
- "Write these functions in separate .py file(s). By putting these steps into functions, we can make the code more modular and easier to maintain."
+ "data": {
+ "text/plain": [
+ "customer 2937\n",
+ "state 2937\n",
+ "gender 3054\n",
+ "education 2937\n",
+ "customer-lifetime-value 2940\n",
+ "income 2937\n",
+ "monthly-premium-auto 2937\n",
+ "number-of-open-complaints 2937\n",
+ "policy-type 2937\n",
+ "vehicle-class 2937\n",
+ "total-claim-amount 2937\n",
+ "dtype: int64"
]
- },
+ },
+ "execution_count": 67,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df_1.isnull().sum()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 46,
+ "id": "b3279efd-e4af-4c89-81c3-a455afb33229",
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "markdown",
- "id": "0e170dc2-b62c-417a-8248-e63ed18a70c4",
- "metadata": {
- "id": "0e170dc2-b62c-417a-8248-e63ed18a70c4"
- },
- "source": [
- "*Hint: autoreload module is a utility module in Python that allows you to automatically reload modules in the current session when changes are made to the source code. This can be useful in situations where you are actively developing code and want to see the effects of changes you make without having to constantly restart the Python interpreter or Jupyter Notebook kernel.*"
- ]
- },
+ "ename": "TypeError",
+ "evalue": "Cannot convert [['RB50392' 'QZ44356' 'AI49188' ... nan nan nan]\n ['Washington' 'Arizona' 'Nevada' ... nan nan nan]] to numeric",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)",
+ "Cell \u001b[1;32mIn[46], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m df_1\u001b[38;5;241m.\u001b[39mfillna(df_1\u001b[38;5;241m.\u001b[39mmedian(), inplace\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mTrue\u001b[39;00m)\n",
+ "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\core\\frame.py:11706\u001b[0m, in \u001b[0;36mDataFrame.median\u001b[1;34m(self, axis, skipna, numeric_only, **kwargs)\u001b[0m\n\u001b[0;32m 11698\u001b[0m \u001b[38;5;129m@doc\u001b[39m(make_doc(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmedian\u001b[39m\u001b[38;5;124m\"\u001b[39m, ndim\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m2\u001b[39m))\n\u001b[0;32m 11699\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mmedian\u001b[39m(\n\u001b[0;32m 11700\u001b[0m \u001b[38;5;28mself\u001b[39m,\n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 11704\u001b[0m \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs,\n\u001b[0;32m 11705\u001b[0m ):\n\u001b[1;32m> 11706\u001b[0m result \u001b[38;5;241m=\u001b[39m \u001b[38;5;28msuper\u001b[39m()\u001b[38;5;241m.\u001b[39mmedian(axis, skipna, numeric_only, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n\u001b[0;32m 11707\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(result, Series):\n\u001b[0;32m 11708\u001b[0m result \u001b[38;5;241m=\u001b[39m result\u001b[38;5;241m.\u001b[39m__finalize__(\u001b[38;5;28mself\u001b[39m, method\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmedian\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
+ "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\core\\generic.py:12431\u001b[0m, in \u001b[0;36mNDFrame.median\u001b[1;34m(self, axis, skipna, numeric_only, **kwargs)\u001b[0m\n\u001b[0;32m 12424\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mmedian\u001b[39m(\n\u001b[0;32m 12425\u001b[0m \u001b[38;5;28mself\u001b[39m,\n\u001b[0;32m 12426\u001b[0m axis: Axis \u001b[38;5;241m|\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;241m=\u001b[39m \u001b[38;5;241m0\u001b[39m,\n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 12429\u001b[0m \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs,\n\u001b[0;32m 12430\u001b[0m ) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m Series \u001b[38;5;241m|\u001b[39m \u001b[38;5;28mfloat\u001b[39m:\n\u001b[1;32m> 12431\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_stat_function(\n\u001b[0;32m 12432\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmedian\u001b[39m\u001b[38;5;124m\"\u001b[39m, nanops\u001b[38;5;241m.\u001b[39mnanmedian, axis, skipna, numeric_only, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs\n\u001b[0;32m 12433\u001b[0m )\n",
+ "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\core\\generic.py:12377\u001b[0m, in \u001b[0;36mNDFrame._stat_function\u001b[1;34m(self, name, func, axis, skipna, numeric_only, **kwargs)\u001b[0m\n\u001b[0;32m 12373\u001b[0m nv\u001b[38;5;241m.\u001b[39mvalidate_func(name, (), kwargs)\n\u001b[0;32m 12375\u001b[0m validate_bool_kwarg(skipna, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mskipna\u001b[39m\u001b[38;5;124m\"\u001b[39m, none_allowed\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mFalse\u001b[39;00m)\n\u001b[1;32m> 12377\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_reduce(\n\u001b[0;32m 12378\u001b[0m func, name\u001b[38;5;241m=\u001b[39mname, axis\u001b[38;5;241m=\u001b[39maxis, skipna\u001b[38;5;241m=\u001b[39mskipna, numeric_only\u001b[38;5;241m=\u001b[39mnumeric_only\n\u001b[0;32m 12379\u001b[0m )\n",
+ "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\core\\frame.py:11562\u001b[0m, in \u001b[0;36mDataFrame._reduce\u001b[1;34m(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)\u001b[0m\n\u001b[0;32m 11558\u001b[0m df \u001b[38;5;241m=\u001b[39m df\u001b[38;5;241m.\u001b[39mT\n\u001b[0;32m 11560\u001b[0m \u001b[38;5;66;03m# After possibly _get_data and transposing, we are now in the\u001b[39;00m\n\u001b[0;32m 11561\u001b[0m \u001b[38;5;66;03m# simple case where we can use BlockManager.reduce\u001b[39;00m\n\u001b[1;32m> 11562\u001b[0m res \u001b[38;5;241m=\u001b[39m df\u001b[38;5;241m.\u001b[39m_mgr\u001b[38;5;241m.\u001b[39mreduce(blk_func)\n\u001b[0;32m 11563\u001b[0m out \u001b[38;5;241m=\u001b[39m df\u001b[38;5;241m.\u001b[39m_constructor_from_mgr(res, axes\u001b[38;5;241m=\u001b[39mres\u001b[38;5;241m.\u001b[39maxes)\u001b[38;5;241m.\u001b[39miloc[\u001b[38;5;241m0\u001b[39m]\n\u001b[0;32m 11564\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m out_dtype \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;129;01mand\u001b[39;00m out\u001b[38;5;241m.\u001b[39mdtype \u001b[38;5;241m!=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mboolean\u001b[39m\u001b[38;5;124m\"\u001b[39m:\n",
+ "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\core\\internals\\managers.py:1500\u001b[0m, in \u001b[0;36mBlockManager.reduce\u001b[1;34m(self, func)\u001b[0m\n\u001b[0;32m 1498\u001b[0m res_blocks: \u001b[38;5;28mlist\u001b[39m[Block] \u001b[38;5;241m=\u001b[39m []\n\u001b[0;32m 1499\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m blk \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mblocks:\n\u001b[1;32m-> 1500\u001b[0m nbs \u001b[38;5;241m=\u001b[39m blk\u001b[38;5;241m.\u001b[39mreduce(func)\n\u001b[0;32m 1501\u001b[0m res_blocks\u001b[38;5;241m.\u001b[39mextend(nbs)\n\u001b[0;32m 1503\u001b[0m index \u001b[38;5;241m=\u001b[39m Index([\u001b[38;5;28;01mNone\u001b[39;00m]) \u001b[38;5;66;03m# placeholder\u001b[39;00m\n",
+ "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\core\\internals\\blocks.py:404\u001b[0m, in \u001b[0;36mBlock.reduce\u001b[1;34m(self, func)\u001b[0m\n\u001b[0;32m 398\u001b[0m \u001b[38;5;129m@final\u001b[39m\n\u001b[0;32m 399\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mreduce\u001b[39m(\u001b[38;5;28mself\u001b[39m, func) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m \u001b[38;5;28mlist\u001b[39m[Block]:\n\u001b[0;32m 400\u001b[0m \u001b[38;5;66;03m# We will apply the function and reshape the result into a single-row\u001b[39;00m\n\u001b[0;32m 401\u001b[0m \u001b[38;5;66;03m# Block with the same mgr_locs; squeezing will be done at a higher level\u001b[39;00m\n\u001b[0;32m 402\u001b[0m \u001b[38;5;28;01massert\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mndim \u001b[38;5;241m==\u001b[39m \u001b[38;5;241m2\u001b[39m\n\u001b[1;32m--> 404\u001b[0m result \u001b[38;5;241m=\u001b[39m func(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mvalues)\n\u001b[0;32m 406\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mvalues\u001b[38;5;241m.\u001b[39mndim \u001b[38;5;241m==\u001b[39m \u001b[38;5;241m1\u001b[39m:\n\u001b[0;32m 407\u001b[0m res_values \u001b[38;5;241m=\u001b[39m result\n",
+ "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\core\\frame.py:11481\u001b[0m, in \u001b[0;36mDataFrame._reduce..blk_func\u001b[1;34m(values, axis)\u001b[0m\n\u001b[0;32m 11479\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m np\u001b[38;5;241m.\u001b[39marray([result])\n\u001b[0;32m 11480\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m> 11481\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m op(values, axis\u001b[38;5;241m=\u001b[39maxis, skipna\u001b[38;5;241m=\u001b[39mskipna, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwds)\n",
+ "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\core\\nanops.py:147\u001b[0m, in \u001b[0;36mbottleneck_switch.__call__..f\u001b[1;34m(values, axis, skipna, **kwds)\u001b[0m\n\u001b[0;32m 145\u001b[0m result \u001b[38;5;241m=\u001b[39m alt(values, axis\u001b[38;5;241m=\u001b[39maxis, skipna\u001b[38;5;241m=\u001b[39mskipna, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwds)\n\u001b[0;32m 146\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m--> 147\u001b[0m result \u001b[38;5;241m=\u001b[39m alt(values, axis\u001b[38;5;241m=\u001b[39maxis, skipna\u001b[38;5;241m=\u001b[39mskipna, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwds)\n\u001b[0;32m 149\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m result\n",
+ "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\core\\nanops.py:787\u001b[0m, in \u001b[0;36mnanmedian\u001b[1;34m(values, axis, skipna, mask)\u001b[0m\n\u001b[0;32m 785\u001b[0m inferred \u001b[38;5;241m=\u001b[39m lib\u001b[38;5;241m.\u001b[39minfer_dtype(values)\n\u001b[0;32m 786\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m inferred \u001b[38;5;129;01min\u001b[39;00m [\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mstring\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmixed\u001b[39m\u001b[38;5;124m\"\u001b[39m]:\n\u001b[1;32m--> 787\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCannot convert \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mvalues\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m to numeric\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m 788\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m 789\u001b[0m values \u001b[38;5;241m=\u001b[39m values\u001b[38;5;241m.\u001b[39mastype(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mf8\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
+ "\u001b[1;31mTypeError\u001b[0m: Cannot convert [['RB50392' 'QZ44356' 'AI49188' ... nan nan nan]\n ['Washington' 'Arizona' 'Nevada' ... nan nan nan]] to numeric"
+ ]
+ }
+ ],
+ "source": [
+ "df_1.fillna(df_1.median(), inplace=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 69,
+ "id": "f8321af2-e2fc-4b3e-b7c8-4d3ad3e41593",
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "code",
- "execution_count": null,
- "id": "a52c6dfc-cd11-4d01-bda4-f719fa33e9a4",
- "metadata": {
- "id": "a52c6dfc-cd11-4d01-bda4-f719fa33e9a4"
- },
- "outputs": [],
- "source": [
- "# Your code here"
+ "data": {
+ "text/plain": [
+ " customer state gender education \\\n",
+ "0 RB50392 Washington NaN Master \n",
+ "1 QZ44356 Arizona F Bachelor \n",
+ "2 AI49188 Nevada F Bachelor \n",
+ "3 WW63253 California M Bachelor \n",
+ "4 GA49547 Washington M High School or Below \n",
+ "... ... ... ... ... \n",
+ "1066 TM65736 Oregon M Master \n",
+ "1067 VJ51327 Cali F High School or Below \n",
+ "1068 GS98873 Arizona F Bachelor \n",
+ "1069 CW49887 California F Master \n",
+ "1070 MY31220 California F College \n",
+ "\n",
+ " customer-lifetime-value income monthly-premium-auto \\\n",
+ "0 NaN 0.0 1000.0 \n",
+ "1 697953.59% 0.0 94.0 \n",
+ "2 1288743.17% 48767.0 108.0 \n",
+ "3 764586.18% 0.0 106.0 \n",
+ "4 536307.65% 36357.0 68.0 \n",
+ "... ... ... ... \n",
+ "1066 305955.03% 38644.0 78.0 \n",
+ "1067 2031499.76% 63209.0 102.0 \n",
+ "1068 323912.47% 16061.0 88.0 \n",
+ "1069 462680.11% 79487.0 114.0 \n",
+ "1070 899704.02% 54230.0 112.0 \n",
+ "\n",
+ " number-of-open-complaints policy-type vehicle-class \\\n",
+ "0 1/0/00 Personal Auto Four-Door Car \n",
+ "1 1/0/00 Personal Auto Four-Door Car \n",
+ "2 1/0/00 Personal Auto Two-Door Car \n",
+ "3 1/0/00 Corporate Auto SUV \n",
+ "4 1/0/00 Personal Auto Four-Door Car \n",
+ "... ... ... ... \n",
+ "1066 1/1/00 Personal Auto Four-Door Car \n",
+ "1067 1/2/00 Personal Auto SUV \n",
+ "1068 1/0/00 Personal Auto Four-Door Car \n",
+ "1069 1/0/00 Special Auto SUV \n",
+ "1070 1/0/00 Personal Auto Two-Door Car \n",
+ "\n",
+ " total-claim-amount \n",
+ "0 2.704934 \n",
+ "1 1131.464935 \n",
+ "2 566.472247 \n",
+ "3 529.881344 \n",
+ "4 17.269323 \n",
+ "... ... \n",
+ "1066 361.455219 \n",
+ "1067 207.320041 \n",
+ "1068 633.600000 \n",
+ "1069 547.200000 \n",
+ "1070 537.600000 \n",
+ "\n",
+ "[1071 rows x 11 columns]"
]
- },
+ },
+ "execution_count": 69,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df_1.dropna(subset=[\"customer\", \"state\", \"education\", \"income\", \"policy-type\"], thresh=4)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "98416351-e999-4156-9834-9b00a311adfa",
+ "metadata": {
+ "id": "98416351-e999-4156-9834-9b00a311adfa"
+ },
+ "source": [
+ "## Exercise 5: Dealing with duplicates"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ea0816a7-a18e-4d4c-b667-a8452a800bd1",
+ "metadata": {
+ "id": "ea0816a7-a18e-4d4c-b667-a8452a800bd1"
+ },
+ "source": [
+ "Use the `.duplicated()` method to identify any duplicate rows in the dataframe.\n",
+ "\n",
+ "Decide on a strategy for handling the duplicates. Options include:\n",
+ "- Dropping all duplicate rows\n",
+ "- Keeping only the first occurrence of each duplicated row\n",
+ "- Keeping only the last occurrence of each duplicated row\n",
+ "- Dropping duplicates based on a subset of columns\n",
+ "- Dropping duplicates based on a specific column\n",
+ "\n",
+ "Implement your chosen strategy using the `drop_duplicates()` method.\n",
+ "\n",
+ "Verify that your strategy has successfully handled the duplicates by checking for duplicates again using `.duplicated()`.\n",
+ "\n",
+ "Remember to document your process and explain your reasoning for choosing a particular strategy for handling duplicates.\n",
+ "\n",
+ "Save the cleaned dataset to a new CSV file.\n",
+ "\n",
+ "*Hint*: *after dropping duplicates, reset the index to ensure consistency*."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 108,
+ "id": "1929362c-47ed-47cb-baca-358b78d401a0",
+ "metadata": {
+ "id": "1929362c-47ed-47cb-baca-358b78d401a0"
+ },
+ "outputs": [
{
- "cell_type": "markdown",
- "id": "80f846bb-3f5e-4ca2-96c0-900728daca5a",
- "metadata": {
- "tags": [],
- "id": "80f846bb-3f5e-4ca2-96c0-900728daca5a"
- },
- "source": [
- "# Bonus: Challenge 3: Analyzing Clean and Formated Data"
+ "data": {
+ "text/plain": [
+ " customer state gender education customer-lifetime-value income \\\n",
+ "0 RB50392 WA NaN NaN NaN 0.0 \n",
+ "1 QZ44356 AZ NaN Bachelors 697953.59 0.0 \n",
+ "2 AI49188 NaN NaN Bachelors 1288743.17 48767.0 \n",
+ "3 WW63253 CA NaN Bachelors 764586.18 0.0 \n",
+ "4 GA49547 WA NaN NaN 536307.65 36357.0 \n",
+ "\n",
+ " monthly-premium-auto number-of-open-complaints policy-type \\\n",
+ "0 1000.0 0 Personal Auto \n",
+ "1 94.0 0 Personal Auto \n",
+ "2 108.0 0 Personal Auto \n",
+ "3 106.0 0 Corporate Auto \n",
+ "4 68.0 0 Personal Auto \n",
+ "\n",
+ " vehicle-class total-claim-amount \n",
+ "0 NaN 2.704934 \n",
+ "1 NaN 1131.464935 \n",
+ "2 NaN 566.472247 \n",
+ "3 NaN 529.881344 \n",
+ "4 NaN 17.269323 "
]
- },
+ },
+ "execution_count": 108,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df_1.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 48,
+ "id": "c3dd5f43-f92f-4741-970b-042c37cc0d2d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df_1.drop_duplicates(inplace=True)\n",
+ "df_1.reset_index(drop=True, inplace=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 122,
+ "id": "4aa24225-57c5-4007-bb3b-4dcec8e62539",
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "markdown",
- "id": "9021630e-cc90-446c-b5bd-264d6c864207",
- "metadata": {
- "id": "9021630e-cc90-446c-b5bd-264d6c864207"
- },
- "source": [
- "You have been tasked with analyzing the data to identify potential areas for improving customer retention and profitability. Your goal is to identify customers with a high policy claim amount and a low customer lifetime value.\n",
- "\n",
- "In the Pandas Lab, we only looked at high policy claim amounts because we couldn't look into low customer lifetime values. If we had tried to work with that column, we wouldn't have been able to because customer lifetime value wasn't clean and in its proper format. So after cleaning and formatting the data, let's get some more interesting insights!\n",
- "\n",
- "Instructions:\n",
- "\n",
- "- Review the statistics again for total claim amount and customer lifetime value to gain an understanding of the data.\n",
- "- To identify potential areas for improving customer retention and profitability, we want to focus on customers with a high policy claim amount and a low customer lifetime value. Consider customers with a high policy claim amount to be those in the top 25% of the total claim amount, and clients with a low customer lifetime value to be those in the bottom 25% of the customer lifetime value. Create a pandas DataFrame object that contains information about customers with a policy claim amount greater than the 75th percentile and a customer lifetime value in the bottom 25th percentile.\n",
- "- Use DataFrame methods to calculate summary statistics about the high policy claim amount and low customer lifetime value data. To do so, select both columns of the dataframe simultaneously and pass it to the `.describe()` method. This will give you descriptive statistics, such as mean, median, standard deviation, minimum and maximum values for both columns at the same time, allowing you to compare and analyze their characteristics."
+ "data": {
+ "text/plain": [
+ "4001"
]
- },
+ },
+ "execution_count": 122,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df_1.duplicated(subset=[\"policy-type\", \"education\"]).sum()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 124,
+ "id": "d5254710-f299-4802-8ecc-756f8fa0db2b",
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "code",
- "execution_count": null,
- "id": "211e82b5-461a-4d6f-8a23-4deccb84173c",
- "metadata": {
- "id": "211e82b5-461a-4d6f-8a23-4deccb84173c"
- },
- "outputs": [],
- "source": [
- "# Your code here"
+ "data": {
+ "text/plain": [
+ " customer state gender education customer-lifetime-value income \\\n",
+ "0 RB50392 WA NaN NaN NaN 0.0 \n",
+ "1 QZ44356 AZ NaN Bachelors 697953.59 0.0 \n",
+ "2 AI49188 NaN NaN Bachelors 1288743.17 48767.0 \n",
+ "3 WW63253 CA NaN Bachelors 764586.18 0.0 \n",
+ "4 GA49547 WA NaN NaN 536307.65 36357.0 \n",
+ "... ... ... ... ... ... ... \n",
+ "1067 VJ51327 NaN NaN NaN 2031499.76 63209.0 \n",
+ "1068 GS98873 AZ NaN Bachelors 323912.47 16061.0 \n",
+ "1069 CW49887 CA NaN NaN 462680.11 79487.0 \n",
+ "1070 MY31220 CA NaN NaN 899704.02 54230.0 \n",
+ "1071 NaN NaN NaN NaN NaN NaN \n",
+ "\n",
+ " monthly-premium-auto number-of-open-complaints policy-type \\\n",
+ "0 1000.0 0 Personal Auto \n",
+ "1 94.0 0 Personal Auto \n",
+ "2 108.0 0 Personal Auto \n",
+ "3 106.0 0 Corporate Auto \n",
+ "4 68.0 0 Personal Auto \n",
+ "... ... ... ... \n",
+ "1067 102.0 2 Personal Auto \n",
+ "1068 88.0 0 Personal Auto \n",
+ "1069 114.0 0 Special Auto \n",
+ "1070 112.0 0 Personal Auto \n",
+ "1071 NaN NaN NaN \n",
+ "\n",
+ " vehicle-class total-claim-amount \n",
+ "0 NaN 2.704934 \n",
+ "1 NaN 1131.464935 \n",
+ "2 NaN 566.472247 \n",
+ "3 NaN 529.881344 \n",
+ "4 NaN 17.269323 \n",
+ "... ... ... \n",
+ "1067 NaN 207.320041 \n",
+ "1068 NaN 633.600000 \n",
+ "1069 NaN 547.200000 \n",
+ "1070 NaN 537.600000 \n",
+ "1071 NaN NaN \n",
+ "\n",
+ "[1072 rows x 11 columns]"
]
+ },
+ "execution_count": 124,
+ "metadata": {},
+ "output_type": "execute_result"
}
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.9.13"
- },
- "colab": {
- "provenance": []
- }
+ ],
+ "source": [
+ "df_1.drop_duplicates()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "60840701-4783-40e2-b4d8-55303f9100c9",
+ "metadata": {
+ "id": "60840701-4783-40e2-b4d8-55303f9100c9"
+ },
+ "source": [
+ "# Bonus: Challenge 2: Creating functions in a separate `.py` file"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9d1adb3a-17cf-4899-8041-da21a4337fb4",
+ "metadata": {
+ "id": "9d1adb3a-17cf-4899-8041-da21a4337fb4"
+ },
+ "source": [
+ "Put all the data cleaning and formatting steps into functions, and create a main function that performs all the cleaning and formatting.\n",
+ "\n",
+ "Write these functions in separate .py file(s). By putting these steps into functions, we can make the code more modular and easier to maintain."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0e170dc2-b62c-417a-8248-e63ed18a70c4",
+ "metadata": {
+ "id": "0e170dc2-b62c-417a-8248-e63ed18a70c4"
+ },
+ "source": [
+ "*Hint: `autoreload` is an IPython extension that automatically reloads imported modules in the current session when their source code changes. Enable it with `%load_ext autoreload` followed by `%autoreload 2`. This is useful while you are actively developing code and want to see the effect of your changes without restarting the Python interpreter or Jupyter Notebook kernel.*"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a52c6dfc-cd11-4d01-bda4-f719fa33e9a4",
+ "metadata": {
+ "id": "a52c6dfc-cd11-4d01-bda4-f719fa33e9a4"
+ },
+ "outputs": [],
+ "source": [
+ "# Your code here"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "80f846bb-3f5e-4ca2-96c0-900728daca5a",
+ "metadata": {
+ "id": "80f846bb-3f5e-4ca2-96c0-900728daca5a",
+ "tags": []
+ },
+ "source": [
+ "# Bonus: Challenge 3: Analyzing Clean and Formatted Data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9021630e-cc90-446c-b5bd-264d6c864207",
+ "metadata": {
+ "id": "9021630e-cc90-446c-b5bd-264d6c864207"
+ },
+ "source": [
+ "You have been tasked with analyzing the data to identify potential areas for improving customer retention and profitability. Your goal is to identify customers with a high policy claim amount and a low customer lifetime value.\n",
+ "\n",
+ "In the Pandas Lab, we only looked at high policy claim amounts, because the customer lifetime value column had not yet been cleaned and converted to its proper format. Now that the data is clean and formatted, let's get some more interesting insights!\n",
+ "\n",
+ "Instructions:\n",
+ "\n",
+ "- Review the statistics again for total claim amount and customer lifetime value to gain an understanding of the data.\n",
+ "- To identify potential areas for improving customer retention and profitability, we want to focus on customers with a high policy claim amount and a low customer lifetime value. Consider customers with a high policy claim amount to be those in the top 25% of the total claim amount, and customers with a low customer lifetime value to be those in the bottom 25% of the customer lifetime value. Create a pandas DataFrame that contains the customers whose total claim amount is above the 75th percentile and whose customer lifetime value is below the 25th percentile.\n",
+ "- Use DataFrame methods to calculate summary statistics for this high-claim, low-lifetime-value group. To do so, select both columns at once and call the `.describe()` method on the result. This gives descriptive statistics such as the mean, median (the 50% row), standard deviation, minimum, and maximum for both columns side by side, so you can compare their characteristics."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "211e82b5-461a-4d6f-8a23-4deccb84173c",
+ "metadata": {
+ "id": "211e82b5-461a-4d6f-8a23-4deccb84173c"
+ },
+ "outputs": [],
+ "source": [
+ "# Your code here"
+ ]
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "provenance": []
+ },
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
},
- "nbformat": 4,
- "nbformat_minor": 5
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.12.7"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
}
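
The Challenge 3 instructions can be sketched roughly as follows. This is a minimal illustration, not the lab's reference solution: the toy data and the column names `total_claim_amount` and `customer_lifetime_value` are assumptions standing in for the cleaned dataframe produced earlier in the notebook.

```python
import pandas as pd

# Toy data standing in for the cleaned insurance dataframe.
# Column names are assumptions; adjust them to match your cleaned columns.
df = pd.DataFrame({
    "total_claim_amount": [100.0, 250.0, 400.0, 900.0, 1200.0, 1500.0, 300.0, 1100.0],
    "customer_lifetime_value": [2000.0, 9000.0, 7000.0, 1500.0, 8000.0, 1000.0, 6000.0, 5000.0],
})

cols = ["total_claim_amount", "customer_lifetime_value"]

# Thresholds: 75th percentile of claims, 25th percentile of lifetime value.
claim_q75 = df["total_claim_amount"].quantile(0.75)
clv_q25 = df["customer_lifetime_value"].quantile(0.25)

# Keep customers with a high claim amount AND a low lifetime value.
mask = (df["total_claim_amount"] > claim_q75) & (df["customer_lifetime_value"] < clv_q25)
high_claim_low_clv = df[mask]

# Summary statistics for both columns at once.
summary = high_claim_low_clv[cols].describe()
print(summary)
```

Both conditions are combined with `&`, so only customers meeting both thresholds remain. With very few matching rows, some statistics (such as the standard deviation) may be `NaN`.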