Commit

Update T2.md
rancidghoul authored Aug 28, 2024
1 parent 437d6ca commit 0b3a348
Showing 1 changed file with 20 additions and 46 deletions.
66 changes: 20 additions & 46 deletions source/_posts/T2.md
@@ -264,67 +264,41 @@ Each challenge has a pre-determined score. A participant’s score depends on ho

**<span style="color: #ADD8E6; font-size: 1rem;">Data Preprocessing with Iris Dataset</span>**

**Objective: The goal of this task is to understand and apply basic data preprocessing techniques using the Iris dataset. This task will help you understand the basics of data preprocessing, including handling missing data, feature scaling, encoding categorical variables, and performing exploratory data analysis.
Data preprocessing is a crucial step in the data analysis and machine learning pipeline, where raw data is transformed into a clean and usable format. This process involves several steps:
Data Cleaning: Handling missing values, correcting errors, and removing duplicates. Techniques include imputation, where missing data is filled in, or outlier removal to eliminate anomalies.
Data Transformation: Scaling features to ensure they are on the same scale (e.g., normalization or standardization), encoding categorical variables into numerical values (e.g., one-hot encoding), and transforming data to meet the assumptions of a model (e.g., log transformation).
Data Reduction: Reducing the dimensionality of the data through techniques like Principal Component Analysis (PCA) or feature selection, which helps improve model performance and reduce computation time.
Data Splitting: Dividing the dataset into training, validation, and test sets to evaluate model performance and prevent overfitting.
The Iris dataset is a classic dataset in machine learning, consisting of 150 samples of iris flowers, with four features (sepal length, sepal width, petal length, petal width) and a target variable (species of iris).
**
**Objective:** The goal of this task is to understand and apply basic data preprocessing techniques using the Iris dataset. This task will help you understand the basics of data preprocessing, including handling missing data, feature scaling, encoding categorical variables, and performing exploratory data analysis.

**Data preprocessing:** Data preprocessing is a crucial step in the data analysis and machine learning pipeline, where raw data is transformed into a clean and usable format. This process involves several steps.

<hr>
**Data Cleaning:** Handling missing values, correcting errors, and removing duplicates. Techniques include imputation, where missing data is filled in, or outlier removal to eliminate anomalies.

**<span style="color: #FF6363; font-size: 1rem;">Question 1</span>**
**Data Transformation:** Scaling features to ensure they are on the same scale (e.g., normalization or standardization), encoding categorical variables into numerical values (e.g., one-hot encoding), and transforming data to meet the assumptions of a model (e.g., log transformation).

**<span style="color: #ADD8E6; font-size: 1rem;">Numpy</span>**
**Data Reduction:** Reducing the dimensionality of the data through techniques like Principal Component Analysis (PCA) or feature selection, which helps improve model performance and reduce computation time.

**NumPy is a popular Python library used for working with numbers and arrays. Think of it as a tool that helps you do math and handle large sets of numbers easily. It makes it simple to perform calculations on lists of numbers and matrices. NumPy is great for anyone who wants to do data analysis or scientific computing because it speeds up these tasks with its fast and powerful features.**
**Data Splitting:** Dividing the dataset into training, validation, and test sets to evaluate model performance and prevent overfitting.

**<span style="color: #ADD8E6; font-size: 1rem;">DataSet</span>**
The Iris dataset is a classic dataset in machine learning, consisting of 150 samples of iris flowers, with four features (sepal length, sepal width, petal length, petal width) and a target variable (species of iris).
**
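
The steps described above can be sketched on the Iris data as follows. This is a minimal illustration, not the expected task solution: it assumes scikit-learn is installed, uses a single train/test split rather than a separate validation set, and the 80/20 ratio, `random_state=0`, and the choice of two principal components are arbitrary assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the 150 x 4 feature matrix and the integer-encoded species labels.
X, y = load_iris(return_X_y=True)

# Data cleaning: count missing values (the bundled Iris data has none).
n_missing = int(np.isnan(X).sum())

# Data splitting: hold out 20% as a test set, stratified by species.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Data transformation: standardize with statistics from the training set only,
# so no information from the test set leaks into the scaler.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Data reduction: project the four scaled features onto two principal components.
pca = PCA(n_components=2).fit(X_train_scaled)
X_train_reduced = pca.transform(X_train_scaled)
X_test_reduced = pca.transform(X_test_scaled)

print(n_missing)
print(X_train_reduced.shape, X_test_reduced.shape)
```

Fitting the scaler and PCA on the training rows and only applying them to the test rows keeps the evaluation honest; fitting on the full dataset first is a common source of overly optimistic scores.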

We use a dataset of details about 15 students each having attributes – Height, Weight, Age, Average Grade and Courses. We use the python code given below to create a NumPy array of our dataset.

**Python code to create NumPy array for the task:**
<hr>

```python

import numpy as np
# Creating a dataset with 15 students and 5 attributes
data = np.array([
[170, 65, 19, 85, 5],
[180, 75, 20, 90, 6],
[160, 55, 18, 80, 4],
[175, 70, 21, 88, 7],
[155, 50, 19, 82, 5],
[165, 62, 22, 89, 6],
[178, 80, 23, 91, 7],
[162, 58, 20, 78, 3],
[172, 68, 19, 86, 5],
[169, 66, 20, 84, 4],
[171, 64, 22, 87, 6],
[177, 72, 21, 90, 9],
[174, 76, 24, 88, 8],
[158, 52, 18, 75, 3],
[164, 63, 19, 81, 4]
])

# Printing the dataset with student labels
print("Student\tHeight\tWeight\tAge\tAvg Grade\tCourses")
for index, student in enumerate(data):
    print(f"Student {index + 1}\t{student[0]}\t{student[1]}\t{student[2]}\t{student[3]}\t\t{student[4]}")

```

<span style="color: #ADD8E6;">_Objective:_</span>
- <span style="color: #FF6363;">Question 1.1 : </span>Find the Average Height of the Students

<span style="color: #ADD8E6;">_Task1:_</span>
- <span style="color: #FF6363;">Question 1.1 : </span>Load the Dataset

Explanation: You need to use the mean() function from NumPy to compute the average value of the height column in the dataset.
Explanation: Load the Iris dataset from scikit-learn’s datasets module and display the first 5 rows of the dataset to understand its structure.
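
One possible way to do this is sketched below. It assumes pandas is available alongside scikit-learn; the `df` variable and the added `species` column are illustrative choices, not something the task prescribes.

```python
import pandas as pd
from sklearn.datasets import load_iris

# load_iris() returns a Bunch with the feature matrix, target codes, and names.
iris = load_iris()

# Wrap the features in a DataFrame so the columns carry readable names.
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Map the integer target codes (0, 1, 2) to the species names.
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Show the first 5 rows and the overall shape (rows, columns).
print(df.head())
print(df.shape)
```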

- <span style="color: #FF6363;">Question 1.2 : </span>Find the Age of the Oldest Student
<span style="color: #ADD8E6;">_Task2:_</span> Data Cleaning
- <span style="color: #FF6363;">Question 2.1 : </span>Exploratory Data Analysis (EDA)

Explanation: Use the max() function from NumPy to find the maximum value in the age column and determine the age of the oldest student.
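
As a small illustration of column-wise `mean()` and `max()` in NumPy, the sketch below uses a hypothetical three-row array (`sample`, with made-up values) rather than the task dataset, so the column positions are assumptions for the example only.

```python
import numpy as np

# Hypothetical mini-dataset: columns are height, weight, age.
sample = np.array([
    [150, 45, 17],
    [160, 55, 21],
    [170, 65, 19],
])

# sample[:, 0] selects every row of column 0 (height).
average_height = np.mean(sample[:, 0])

# sample[:, 2] selects every row of column 2 (age).
oldest_age = np.max(sample[:, 2])

print(average_height)  # 160.0
print(oldest_age)      # 21
```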
Explanation:
• Print the summary statistics of the dataset (mean, median, mode, standard deviation, etc.).
• Check for any missing values in the dataset and handle them appropriately.
• Plot the distribution of each feature using histograms.
• Visualize the pairwise relationships between features using a pair plot (or scatter plot matrix).
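
The four points above could be approached along these lines. This is a hedged sketch assuming pandas and matplotlib are installed; the median imputation is just one reasonable way to handle missing values (and is a no-op on the bundled Iris data), and the output file names are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Summary statistics: describe() gives count, mean, std, min, quartiles, max.
summary = df.describe()
summary.loc["median"] = df.median()
print(summary)

# Mode: mode() can return several rows when values tie; show the first one.
print(df.mode().iloc[0])

# Missing values: count NaNs per column, then fill any with the column median.
missing_per_column = df.isna().sum()
print(missing_per_column)
df = df.fillna(df.median())

# Distribution of each feature as a histogram.
df.hist(bins=20, figsize=(8, 6))
plt.tight_layout()
plt.savefig("iris_histograms.png")  # placeholder file name

# Pairwise relationships between features as a scatter plot matrix.
scatter_matrix(df, figsize=(8, 8), diagonal="hist")
plt.savefig("iris_pairplot.png")  # placeholder file name
```

In an interactive notebook, the `matplotlib.use("Agg")` line and the `savefig` calls can be dropped in favour of `plt.show()`.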

- <span style="color: #FF6363;">Question 1.3 : </span>Find the Index of the Student Who Took the Most Courses
