Update T2.md

cognizance-amrita · Aug 28, 2024 · 1623228
1 parent 0b3a348
commit 1623228
Showing 1 changed file with 37 additions and 75 deletions.
diff --git a/source/_posts/T2.md b/source/_posts/T2.md
@@ -286,96 +286,58 @@ The Iris dataset is a classic dataset in machine learning, consisting of 150 sam
 
 
 
-<span style="color: #ADD8E6;">_Task1:_</span>
-- <span style="color: #FF6363;">Question 1.1 : </span>Load the Dataset
+<span style="color: #ADD8E6;">_Task1 :_</span>
+- <span style="color: #FF6363;">Question 1.1 : </span> Load the Dataset
 
     Explanation: Load the Iris dataset from scikit-learn’s datasets module and display the first 5 rows of the dataset to understand its structure.
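
A minimal sketch of this step, assuming pandas is installed alongside scikit-learn. The `species` column is an illustrative addition (not something the task asks for) that makes the rows easier to read.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects; iris.frame holds the four
# feature columns plus a numeric "target" column.
iris = load_iris(as_frame=True)
df = iris.frame

# Assumption: add a readable species name next to the numeric target.
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

# Show the first 5 rows to get a feel for the structure.
print(df.head())
```

`df.shape` should report 150 rows, matching the dataset description above.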
 
-<span style="color: #ADD8E6;">_Task2:_</span>Data Cleaning
+<span style="color: #ADD8E6;">_Task2 :_</span> Data Cleaning
 - <span style="color: #FF6363;">Question 2.1 : </span>Exploratory Data Analysis (EDA)
 
     Explanation:
-•	Print the summary statistics of the dataset (mean, median, mode, standard deviation, etc.).
-•	Check for any missing values in the dataset and handle them appropriately.
-•	Plot the distribution of each feature using histograms.
-•	Visualize the pairwise relationships between features using a pair plot (or scatter plot matrix).
-
-- <span style="color: #FF6363;">Question 1.3 : </span>Find the Index of the Student Who Took the Most Courses
-
-    Explanation: Use the argmax() function from NumPy to locate the index of the maximum value in the number of courses column.
-
-- <span style="color: #FF6363;">Question 1.4 : </span>Find the Number of Students with an Average Grade Above 85
-
-    Explanation: Use a NumPy condition to filter the dataset for students with an average grade above 85, and then use the sum() function to count them.
-
-- <span style="color: #FF6363;">Question 1.5 : </span>Calculate the Ratio of a Student's Age to Their Average Grade for Each Student
-
-    Explanation: Perform element-wise division of the age column by the average grade column to get the ratio for each student.
+    •	Print the summary statistics of the dataset (mean, median, mode, standard deviation, etc.).
+    •	Check for any missing values in the dataset and handle them appropriately.
+    •	Plot the distribution of each feature using histograms.
+    •	Visualize the pairwise relationships between features using a pair plot (or scatter plot matrix).
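
One possible way to cover the four EDA points is sketched below. It assumes pandas and matplotlib are available and uses pandas' own `scatter_matrix` for the pairwise plot; seaborn's `pairplot` would be an equally valid choice. The copy of Iris bundled with scikit-learn has no missing values, so the check is expected to report zeros and no imputation is shown.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; remove this line in Jupyter/Colab
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame
features = df.drop(columns="target")

# Summary statistics: describe() covers count, mean, std, min, max, quartiles.
summary = features.describe()
medians = features.median()
modes = features.mode().iloc[0]  # first mode of each column

# Count missing values per column.
missing = features.isnull().sum()
print(summary, medians, modes, missing, sep="\n\n")

# One histogram per feature.
axes = features.hist(bins=20, figsize=(8, 6))
plt.tight_layout()

# Pairwise scatter plots, colored by the numeric class label.
scatter_matrix(features, c=df["target"], figsize=(8, 8))
plt.show()
```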
 
-<span style="color: #ADD8E6;">_References:_</span>
-- [<span style="color: #55AAFF;">W3Schools</span>](https://www.w3schools.com/python/numpy/default.asp)
-- [<span style="color: #55AAFF;">Numpy Documentation</span>](https://numpy.org/doc/stable/reference/arrays.ndarray.html)
-- [<span style="color: #55AAFF;">GeeksforGeeks</span>]( https://www.geeksforgeeks.org/numpy-tutorial/)
-- [<span style="color: #55AAFF;">NumPy cheat sheet</span>](https://images.datacamp.com/image/upload/v1676302459/Marketing/Blog/Numpy_Cheat_Sheet.pdf)
-
-<hr>
-
-**<span style="color: #FF6363; font-size: 1rem;">Question 2</span>**
-
-**<span style="color: #ADD8E6; font-size: 1rem;">Pandas</span>**
-
-**Pandas is a powerful open-source data analysis and manipulation library for Python. It provides data structures like Data Frames and Series that are built on top of NumPy arrays and are designed to handle a wide range of data types and operations efficiently. Pandas is extensively used in data science and machine learning for tasks such as data cleaning, transformation, and analysis.**
-
-**<span style="color: #ADD8E6; font-size: 1rem;">DataSet</span>**
+<span style="color: #ADD8E6;">_Task3 :_</span> Data Transformation
+- <span style="color: #FF6363;">Question 3.1 : </span> Feature Scaling
 
-*We will use a dataset with 15 students, each having 5 attributes. Let's first convert the list into a Pandas DataFrame.*
-```lua
-data = [
-    [170, 65, 19, 85, 5],
-    [180, 75, 20, 90, 6],
-    [160, 55, 18, 80, 4],
-    [175, 70, 21, 88, 7],
-    [155, 50, 19, 82, 5],
-    [165, 62, 22, 89, 6],
-    [178, 80, 23, 91, 7],
-    [162, 58, 20, 78, 3],
-    [172, 68, 19, 86, 5],
-    [169, 66, 20, 84, 4],
-    [171, 64, 22, 87, 6],
-    [177, 72, 21, 90, 9],
-    [174, 76, 24, 88, 8],
-    [158, 52, 18, 75, 3],
-    [164, 63, 19, 81, 4]
-]
-# column names being ‘Height’, ‘Weight’, ‘Age’, ‘Avg_Grade’ and ‘Courses’ in that order.
-```
-<span style="color: #ADD8E6;">_Objective:_</span>
-- <span style="color: #FF6363;">Question 2.1 : </span>Create a Pandas DataFrame
-
-    Explanation: You need to understand how to convert a NumPy array into a DataFrame and assign column names.
-
-- <span style="color: #FF6363;">Question 2.2 : </span>Describe the DataFrame
-
-    Explanation: The describe() function provides various summary statistics (mean, standard deviation, min, max, and percentiles) for numeric columns in the DataFrame.
-
-- <span style="color: #FF6363;">Question 2.3 : </span>Count the Number of Students in Each Age Group
+    Explanation:
+    •	Standardize the features by removing the mean and scaling to unit variance using StandardScaler from scikit-learn.
+    •	Alternatively, perform Min-Max scaling on the features using MinMaxScaler.
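
Both scalers can be sketched as follows. For brevity the scalers are fitted on the full feature matrix; in a real pipeline it is usually safer to fit them on the training split only and reuse the fitted scaler on the test split.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_iris(return_X_y=True)

# Standardization: each feature ends up with mean 0 and unit variance.
X_std = StandardScaler().fit_transform(X)

# Min-Max scaling: each feature is rescaled to the range [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

print("standardized mean:", X_std.mean(axis=0).round(6))
print("standardized std: ", X_std.std(axis=0).round(6))
print("min-max range:    ", X_minmax.min(axis=0), X_minmax.max(axis=0))
```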
 
-    Explanation: Use the value_counts() function to count occurrences of unique values in a column.
+- <span style="color: #FF6363;">Question 3.2 : </span>Encoding the Target Variable
 
-- <span style="color: #FF6363;">Question 2.4 : </span>Filter the DataFrame
+    Explanation:
+    •	Encode the categorical target variable (species) into numeric values using label encoding.
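
A short sketch of label encoding with `LabelEncoder`. scikit-learn already ships the Iris target as integers, so the string labels are rebuilt here first (an assumption made purely so there is something to encode).

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder

iris = load_iris()

# Rebuild the string labels ("setosa", ...) from the numeric target.
species = iris.target_names[iris.target]

# LabelEncoder assigns integers to classes in sorted order.
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(species)

# Show which integer each species name was mapped to.
for name, code in zip(encoder.classes_, encoder.transform(encoder.classes_)):
    print(name, "->", code)
```

`encoder.inverse_transform(y_encoded)` maps the integers back to species names when they are needed for plots or reports.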
+
+<span style="color: #ADD8E6;">_Task4 :_</span> Data Splitting
+- <span style="color: #FF6363;">Question 4.1 : </span> Splitting the Dataset
 
-  Explanation: Filtering allows you to extract specific rows from the DataFrame based on certain conditions.
+    Explanation:
+    •	Split the dataset into training and testing sets using an 80-20 split. Use train_test_split from scikit-learn.
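
The 80-20 split might look like the sketch below. `random_state=42` and `stratify=y` are optional choices added here for reproducibility and balanced classes; the task itself only requires `test_size=0.2`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# test_size=0.2 keeps 20% of the 150 samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

With 150 samples this gives 120 training rows and 30 test rows.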
 
-- <span style="color: #FF6363;">Question 2.5 : </span>Calculate the Average Grade for Each Age Group
+  <span style="color: #ADD8E6;">_Bonus Task :_</span>
+  - <span style="color: #FF6363;">Question 5.1 : </span> Principal Component Analysis (PCA)
+    Explanation:
+    •	Perform PCA on the Iris dataset to reduce the dimensionality to 2 components.
+    •	Plot the data points in the new 2D space with different colors for each species.
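
A possible sketch of the bonus task, assuming matplotlib for plotting. Standardizing the features before PCA is a common choice made here, not something the task prescribes.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; remove this line in Jupyter/Colab
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# Assumption: scale first so no single feature dominates the components.
X_scaled = StandardScaler().fit_transform(iris.data)

# Project the 4 features onto the first 2 principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# One scatter call per species, so each gets its own color and legend entry.
fig, ax = plt.subplots()
for label, name in enumerate(iris.target_names):
    mask = iris.target == label
    ax.scatter(X_2d[mask, 0], X_2d[mask, 1], label=name)
ax.set_xlabel("Principal component 1")
ax.set_ylabel("Principal component 2")
ax.legend()
plt.show()

print("explained variance ratio:", pca.explained_variance_ratio_)
```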
+
 
-  Explanation: The groupby() function in Pandas is used to group data based on one or more columns. After grouping, you can apply aggregation functions like mean() to these groups. In this task, you will group students by their age and then calculate the average grade for each age group.
+<span style="color: #ADD8E6;">_Deliverables :_</span>
+•	A Jupyter Notebook or Google Colab notebook containing:
+    •	Code for each of the tasks.
+    •	Comments explaining each step.
+    •	Plots and visualizations generated during EDA.
 
 <span style="color: #ADD8E6;">_References:_</span>
-- [<span style="color: #55AAFF;">W3Schools</span>](https://www.w3schools.com/python/pandas/default.asp)
-- [<span style="color: #55AAFF;">Pandas Documentation</span>](https://pandas.pydata.org/docs/reference/frame.html)
-- [<span style="color: #55AAFF;">GeeksforGeeks</span>](https://www.geeksforgeeks.org/pandas-tutorial/)
-- [<span style="color: #55AAFF;">Pandas cheat sheet </span>](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
-
+- [<span style="color: #55AAFF;">Exploratory data analysis</span>](https://www.kaggle.com/code/imoore/intro-to-exploratory-data-analysis-eda-in-python)
+- [<span style="color: #55AAFF;">Data preprocessing (Importing the dataset, identifying and handling the missing values, encoding the categorical data, splitting the dataset)</span>](https://www.kaggle.com/code/alirezahasannejad/data-preprocessing-in-machine-learning)
+- [<span style="color: #55AAFF;">Data loading</span>](https://youtu.be/h_NWeliQNOQ?si=-r7ZOcMmrGIqveHB)
+- [<span style="color: #55AAFF;">Handling missing values</span>](https://youtu.be/J-KfMnhUrdA?si=1CRhXKqzL_lefQmf)
+- [<span style="color: #55AAFF;">Data encoding</span>](https://youtu.be/r3pvRpCtaLQ?si=JAc5IQdZyIp8B7H1)
+- [<span style="color: #55AAFF;">Splitting data into train and test</span>](https://youtu.be/Y2YoiAgG-Bk?si=ejqlmWjDTQgfl6sO)
 <hr>
 <hr>