Update T2.md
rancidghoul authored Aug 28, 2024
1 parent 0b3a348 commit 1623228
Showing 1 changed file with 37 additions and 75 deletions.
112 changes: 37 additions & 75 deletions source/_posts/T2.md
@@ -286,96 +286,58 @@ The Iris dataset is a classic dataset in machine learning, consisting of 150 sam



<span style="color: #ADD8E6;">_Task1 :_</span>
- <span style="color: #FF6363;">Question 1.1 : </span> Load the Dataset

Explanation: Load the Iris dataset from scikit-learn’s datasets module and display the first 5 rows of the dataset to understand its structure.
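
A minimal sketch of Question 1.1, assuming scikit-learn 0.23 or later (needed for the `as_frame` option) and pandas are installed. The variable names `iris` and `df` are illustrative choices, not part of the task.

```python
from sklearn.datasets import load_iris

# as_frame=True returns the data as pandas objects instead of NumPy arrays
iris = load_iris(as_frame=True)

# iris.frame holds the four feature columns plus a numeric "target" column
df = iris.frame

# Display the first 5 rows to inspect the structure
print(df.head())
print(iris.target_names)  # species names behind the numeric target codes
```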

<span style="color: #ADD8E6;">_Task2 :_</span> Data Cleaning
- <span style="color: #FF6363;">Question 2.1 : </span>Exploratory Data Analysis (EDA)

Explanation:
• Print the summary statistics of the dataset (mean, median, mode, standard deviation, etc.).
• Check for any missing values in the dataset and handle them appropriately.
• Plot the distribution of each feature using histograms.
• Visualize the pairwise relationships between features using a pair plot (or scatter plot matrix).
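
One possible way to cover the four EDA steps above, sketched with pandas and matplotlib. Using `pandas.plotting.scatter_matrix` for the pair plot is an assumption made here to avoid an extra dependency; seaborn's `pairplot` is a common alternative. The names `features`, `summary`, and `missing` are illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame
features = df.drop(columns="target")

# Summary statistics: describe() gives mean, std, and quartiles;
# median and mode are added as extra rows
summary = features.describe()
summary.loc["median"] = features.median()
summary.loc["mode"] = features.mode().iloc[0]  # first mode if there are ties
print(summary)

# Missing values per column (the Iris dataset has none)
missing = df.isna().sum()
print(missing)
# If there were missing values, dropping or imputing them are two options,
# for example: df = df.dropna()

# Distribution of each feature
features.hist(bins=20, figsize=(8, 6))
plt.tight_layout()

# Pairwise relationships, colored by the numeric species code
pd.plotting.scatter_matrix(features, c=df["target"], figsize=(8, 8), diagonal="hist")
plt.show()
```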

- <span style="color: #FF6363;">Question 1.3 : </span>Find the Index of the Student Who Took the Most Courses

Explanation: Use the argmax() function from NumPy to locate the index of the maximum value in the number of courses column.

- <span style="color: #FF6363;">Question 1.4 : </span>Find the Number of Students with an Average Grade Above 85

Explanation: Use a NumPy condition to filter the dataset for students with an average grade above 85, and then use the sum() function to count them.

- <span style="color: #FF6363;">Question 1.5 : </span>Calculate the Ratio of a Student's Age to Their Average Grade for Each Student

Explanation: Perform element-wise division of the age column by the average grade column to get the ratio for each student.
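
The three NumPy questions above (1.3 to 1.5) can be sketched together. This example uses only the first five rows of the student dataset listed under the Pandas question, and assumes the column order Height, Weight, Age, Avg_Grade, Courses given there. Variable names are illustrative.

```python
import numpy as np

# First five rows of the student dataset
# Columns: Height, Weight, Age, Avg_Grade, Courses
data = np.array([
    [170, 65, 19, 85, 5],
    [180, 75, 20, 90, 6],
    [160, 55, 18, 80, 4],
    [175, 70, 21, 88, 7],
    [155, 50, 19, 82, 5],
])
age, avg_grade, courses = data[:, 2], data[:, 3], data[:, 4]

# Question 1.3: index of the student who took the most courses
most_courses_idx = np.argmax(courses)

# Question 1.4: a boolean mask sums to the number of True entries
above_85 = np.sum(avg_grade > 85)

# Question 1.5: element-wise ratio of age to average grade
ratio = age / avg_grade

print(most_courses_idx, above_85, ratio)
```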

<span style="color: #ADD8E6;">_References:_</span>
- [<span style="color: #55AAFF;">W3Schools</span>](https://www.w3schools.com/python/numpy/default.asp)
- [<span style="color: #55AAFF;">Numpy Documentation</span>](https://numpy.org/doc/stable/reference/arrays.ndarray.html)
- [<span style="color: #55AAFF;">GeeksforGeeks</span>](https://www.geeksforgeeks.org/numpy-tutorial/)
- [<span style="color: #55AAFF;">NumPy cheat sheet</span>](https://images.datacamp.com/image/upload/v1676302459/Marketing/Blog/Numpy_Cheat_Sheet.pdf)

<hr>

**<span style="color: #FF6363; font-size: 1rem;">Question 2</span>**

**<span style="color: #ADD8E6; font-size: 1rem;">Pandas</span>**

**Pandas is a powerful open-source data analysis and manipulation library for Python. It provides data structures such as DataFrames and Series, which are built on top of NumPy arrays and designed to handle a wide range of data types and operations efficiently. Pandas is used extensively in data science and machine learning for tasks such as data cleaning, transformation, and analysis.**

**<span style="color: #ADD8E6; font-size: 1rem;">Dataset</span>**
<span style="color: #ADD8E6;">_Task3 :_</span> Data Transformation
- <span style="color: #FF6363;">Question 3.1 : </span> Feature Scaling

*We will use a dataset with 15 students, each having 5 attributes. Let's first convert the list into a Pandas DataFrame.*
```python
data = [
[170, 65, 19, 85, 5],
[180, 75, 20, 90, 6],
[160, 55, 18, 80, 4],
[175, 70, 21, 88, 7],
[155, 50, 19, 82, 5],
[165, 62, 22, 89, 6],
[178, 80, 23, 91, 7],
[162, 58, 20, 78, 3],
[172, 68, 19, 86, 5],
[169, 66, 20, 84, 4],
[171, 64, 22, 87, 6],
[177, 72, 21, 90, 9],
[174, 76, 24, 88, 8],
[158, 52, 18, 75, 3],
[164, 63, 19, 81, 4]
]
# Column names, in order: 'Height', 'Weight', 'Age', 'Avg_Grade', 'Courses'.
```
<span style="color: #ADD8E6;">_Objective:_</span>
- <span style="color: #FF6363;">Question 2.1 : </span>Create a Pandas DataFrame

Explanation: Convert the list above (or a NumPy array built from it) into a DataFrame and assign the column names.

- <span style="color: #FF6363;">Question 2.2 : </span>Describe the DataFrame

Explanation: The describe() function provides various summary statistics (mean, standard deviation, min, max, and percentiles) for numeric columns in the DataFrame.
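
A short sketch of Questions 2.1 and 2.2, using only the first five rows of the student dataset to keep the example compact. The names `columns` and `df` are illustrative.

```python
import pandas as pd

# First five rows of the student dataset
data = [
    [170, 65, 19, 85, 5],
    [180, 75, 20, 90, 6],
    [160, 55, 18, 80, 4],
    [175, 70, 21, 88, 7],
    [155, 50, 19, 82, 5],
]
columns = ["Height", "Weight", "Age", "Avg_Grade", "Courses"]

# Question 2.1: build the DataFrame and assign column names
df = pd.DataFrame(data, columns=columns)

# Question 2.2: count, mean, std, min, quartiles, and max per numeric column
print(df.describe())
```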

- <span style="color: #FF6363;">Question 2.3 : </span>Count the Number of Students in Each Age Group
Explanation:
• Standardize the features by removing the mean and scaling to unit variance using StandardScaler from scikit-learn.
• Alternatively, perform Min-Max scaling on the features using MinMaxScaler.
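
Both scaling options for Question 3.1 can be sketched as follows. For brevity this fits the scalers on the full Iris feature matrix; in a modeling workflow the scaler is usually fit on the training split only and then applied to the test split.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = load_iris().data  # shape (150, 4)

# Standardization: each feature gets zero mean and unit variance
X_std = StandardScaler().fit_transform(X)

# Min-Max scaling: each feature is rescaled to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std.mean(axis=0).round(3), X_std.std(axis=0).round(3))
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
```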

Explanation: Use the value_counts() function to count occurrences of unique values in a column.
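
For the age-group count (Question 2.3), a minimal sketch on the first five rows of the student dataset. Sorting by age with `sort_index()` is an optional choice made here for readability.

```python
import pandas as pd

df = pd.DataFrame(
    [[170, 65, 19, 85, 5], [180, 75, 20, 90, 6], [160, 55, 18, 80, 4],
     [175, 70, 21, 88, 7], [155, 50, 19, 82, 5]],
    columns=["Height", "Weight", "Age", "Avg_Grade", "Courses"],
)

# Number of students per age, ordered by age rather than by frequency
age_counts = df["Age"].value_counts().sort_index()
print(age_counts)
```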
- <span style="color: #FF6363;">Question 3.2 : </span>Encoding the Target Variable

- <span style="color: #FF6363;">Question 2.4 : </span>Filter the DataFrame
Explanation:
• Encode the categorical target variable (species) into numeric values using label encoding.
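
A sketch of label encoding for Question 3.2. scikit-learn's copy of Iris already stores the target as integers, so this example first rebuilds the species names as strings (an assumption made purely for illustration) and then encodes them with `LabelEncoder`.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder

iris = load_iris()

# Rebuild the categorical species column as strings
species = iris.target_names[iris.target]

# LabelEncoder assigns integers to the sorted unique class names
encoder = LabelEncoder()
y = encoder.fit_transform(species)

print(encoder.classes_)  # class name for each integer code
print(y[:5])
```

`encoder.inverse_transform(y)` maps the codes back to species names when they are needed for plots or reports.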

<span style="color: #ADD8E6;">_Task4 :_</span> Data Splitting
- <span style="color: #FF6363;">Question 4.1 : </span> Splitting the Dataset

Explanation: Filtering allows you to extract specific rows from the DataFrame based on certain conditions.
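
Filtering (Question 2.4) can be sketched with a boolean mask. The condition used here, an average grade above 85, is a hypothetical example because the question does not fix one; the data is the first five rows of the student dataset.

```python
import pandas as pd

df = pd.DataFrame(
    [[170, 65, 19, 85, 5], [180, 75, 20, 90, 6], [160, 55, 18, 80, 4],
     [175, 70, 21, 88, 7], [155, 50, 19, 82, 5]],
    columns=["Height", "Weight", "Age", "Avg_Grade", "Courses"],
)

# Keep only the rows where the condition is True
high_achievers = df[df["Avg_Grade"] > 85]
print(high_achievers)
```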
Explanation:
• Split the dataset into training and testing sets using an 80-20 split. Use train_test_split from scikit-learn.
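
A minimal sketch of the 80-20 split for Question 4.1. The `random_state=42` and `stratify=y` arguments are assumptions added here for reproducibility and balanced classes; the task itself only requires `test_size=0.2`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```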

- <span style="color: #FF6363;">Question 2.5 : </span>Calculate the Average Grade for Each Age Group
<span style="color: #ADD8E6;">_Bonus Task :_</span>
- <span style="color: #FF6363;">Question 5.1 : </span> Principal Component Analysis (PCA)
Explanation:
• Perform PCA on the Iris dataset to reduce the dimensionality to 2 components.
• Plot the data points in the new 2D space with different colors for each species.
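
The bonus task can be sketched as below. Standardizing the features before PCA is an assumption made here (the task does not require it), chosen because PCA is sensitive to feature scale.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

# Reduce the four features to two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)

# One scatter call per species so each gets its own color and legend entry
fig, ax = plt.subplots()
for label, name in enumerate(iris.target_names):
    mask = iris.target == label
    ax.scatter(X_2d[mask, 0], X_2d[mask, 1], label=name)
ax.set_xlabel("Principal component 1")
ax.set_ylabel("Principal component 2")
ax.legend()
plt.show()
```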


Explanation: The groupby() function in Pandas is used to group data based on one or more columns. After grouping, you can apply aggregation functions like mean() to these groups. In this task, you will group students by their age and then calculate the average grade for each age group.
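
The grouping step for Question 2.5 can be sketched on the first five rows of the student dataset. The name `mean_grade_by_age` is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[170, 65, 19, 85, 5], [180, 75, 20, 90, 6], [160, 55, 18, 80, 4],
     [175, 70, 21, 88, 7], [155, 50, 19, 82, 5]],
    columns=["Height", "Weight", "Age", "Avg_Grade", "Courses"],
)

# Group rows by age, then average the grade within each group
mean_grade_by_age = df.groupby("Age")["Avg_Grade"].mean()
print(mean_grade_by_age)
```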
<span style="color: #ADD8E6;">_Deliverables :_</span>
• A Jupyter Notebook or Google Colab Notebook containing:
• Code for each of the tasks.
• Comments explaining each step.
• Plots and visualizations generated during EDA.

<span style="color: #ADD8E6;">_References:_</span>
- [<span style="color: #55AAFF;">W3Schools</span>](https://www.w3schools.com/python/pandas/default.asp)
- [<span style="color: #55AAFF;">Pandas Documentation</span>](https://pandas.pydata.org/docs/reference/frame.html)
- [<span style="color: #55AAFF;">GeeksforGeeks</span>](https://www.geeksforgeeks.org/pandas-tutorial/)
- [<span style="color: #55AAFF;">Pandas cheat sheet </span>](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

- [<span style="color: #55AAFF;">Exploratory data analysis</span>](https://www.kaggle.com/code/imoore/intro-to-exploratory-data-analysis-eda-in-python)
- [<span style="color: #55AAFF;">Data preprocessing (Importing the dataset, identifying and handling the missing values, encoding the categorical data, splitting the dataset)</span>](https://www.kaggle.com/code/alirezahasannejad/data-preprocessing-in-machine-learning)
- [<span style="color: #55AAFF;">Data loading</span>](https://youtu.be/h_NWeliQNOQ?si=-r7ZOcMmrGIqveHB)
- [<span style="color: #55AAFF;">Handling missing values</span>](https://youtu.be/J-KfMnhUrdA?si=1CRhXKqzL_lefQmf)
- [<span style="color: #55AAFF;">Data encoding</span>](https://youtu.be/r3pvRpCtaLQ?si=JAc5IQdZyIp8B7H1)
- [<span style="color: #55AAFF;">Splitting data into train and test</span>](https://youtu.be/Y2YoiAgG-Bk?si=ejqlmWjDTQgfl6sO)
<hr>
