This repository showcases the results of the labs completed during my first semester at Igor Sikorsky Kyiv Polytechnic Institute, where I pursued a Master's degree in Informatics and Software Engineering 🎓
This repository contains Python code that demonstrates data manipulation and analysis tasks using Apache Spark and the PySpark library. The code works with a dataset on crime rates in different regions and showcases various Spark functionalities for big-data processing.
**Data Loading and Schema Definition:**
- The code starts by installing the `pyspark` library and importing the necessary modules.
- It mounts Google Drive to access the dataset file named "Crime.csv".
- The schema for the dataset is defined using the `StructType` and `StructField` classes to specify the data type of each column.
**Creating a Spark Session and Loading Data:**
- A Spark session is created using `SparkSession.builder` to enable Spark functionality.
- The dataset is read from the CSV file and loaded into a DataFrame (`df`) using `spark.read.format("csv")`.
**Task 1: Finding the Average Crime Rate in Southern States:**
- The code uses the `select` and `agg` functions to keep only the rows where the `Southern` column is greater than 0 (indicating a southern state) and calculate the average of the `CrimeRate` column.
**Task 2: Calculating Total Police Expenditure:**
- The code uses the `groupBy` and `sum` functions to calculate the total sum of the `ExpenditureYear0` column.
**Task 3: Filtering Data for High Crime Rates in Northern States with High Poverty:**
- The code uses the `select` function to filter rows where the `Southern` column equals 0 (indicating a northern state) and the `BelowWage` column is greater than 200.
- The filtered DataFrame is displayed.
**Task 4: Sorting Data by Youth Unemployment:**
- The code uses the `sort` function to sort the DataFrame by the `YouthUnemployment` column in ascending order.
**Task 5: Sorting Data by Wage in Descending Order:**
- The code uses the `sort` function to sort the DataFrame by the `Wage` column in descending order.
**Task 1: Create a New Table with Expenditure on Police and an Array of Males Counts:**
- The code creates a new table with two columns: "ExpenditureYear0", representing police expenditure, and "Males", an array of the counts of males per 1000 population associated with each expenditure value. This is achieved using the `collect_set` function to gather the unique "Males" values for each "ExpenditureYear0" group.
**Task 2: Transforming the Second Column into Multiple Columns:**
- The code transforms the second column of the table from Task 1 into several separate columns. This is done using the `select` function and accessing each element of the "Males" array by index (e.g., `df2.Males[0], df2.Males[1], df2.Males[2]`). The result displays "ExpenditureYear0" and the corresponding "Males" values in separate columns.
**Task 3: Finding the Average Crime Rate for Combinations of "Southern" and "HighYouthUnemploy":**
- The code finds the average crime rate for every combination of the "Southern" and "HighYouthUnemploy" parameters. It groups the DataFrame by these two columns and calculates the average crime rate with the `avg` aggregate function.
**Task 4: Adding a New Column with the Difference between the Average and Current Police Expenditure:**
- The code adds a new column named "ExpenditureYear0_avg-ExpenditureYear0", which holds the difference between the average police expenditure and the current "ExpenditureYear0" value, where the average is taken over southern or non-southern states depending on the row. It calculates the average expenditure for southern and non-southern states separately and applies the difference accordingly. The resulting DataFrame is saved to a new CSV file named "Crime2.csv".