Youtube Data Analysis ETL pipeline

Overview

YouTube Data Analysis project involves developing ETL pipeline using AWS services for securely managing, streamlining, and performing analysis on the structured and semi-structured data Using AWS Lambda and AWS glue. Further, the transformed parquet data is loaded into AWS Redshift to visualize data insights in Tableau Desktop.

Below are some of the business questions we aim to answer with this project.

Which region has the highest YouTube viewership?
Which category of videos are most consumed?
Which videos are most disliked based on category and region?
What are the total number of likes across all videos in regions Canada, United States, and Great Britain?
Which category of videos are most liked?

Architecture Diagram

Goals

AWS Cloud : Processing vast amounts of data to answer business questions could be challenging on local computers, so need to use the cloud, in this case, we will use AWS.
Data Ingestion : Build a mechanism to ingest data into S3. Involves planning a strategy to handle both CSV and JSON files extracted from Kaggle.
Staging : Build a staging Layer using the Glue catalog for further processing of JSON and CSV data.
Testing : Testing small batches of CSV and JSON Data with AWS Lambda and AWS Glue before automating the entire processing layer.
ETL (Extract, Transform, Load) : Both CSV and JSON Data in raw format should be cleaned and transformed into Parquet format using AWS Lambda and AWS Glue.
Automation : Make sure the entire ETL is automated to handle higher throughputs using AWS Lambda triggers.
Data Lake : All the cleaned data from CSV and JSON needs a centralized repo such as S3 to store them to perform further transformations and loading into the warehouse.
Analytics : Further transformed data should be loaded into Redshift for faster querying and analytics purposes.
Reporting : Build a dashboard from Redshift to get insights required for business needs using Amazon QuickSight.
Scalability : As the size of data increases, we need to make sure the architecture built scales with it.
Monitoring : Use CloudWatch to monitor logs across all AWS services.

Services

Amazon S3 : An object storage service providing manufacturing scalability, data availability, security, and performance.
AWS IAM : Identity and Access Management enabling secure management of access to AWS services and resources.
QuickSight : A scalable, serverless, embeddable, machine learning-powered business intelligence (BI) service built for the cloud.
AWS Glue : A serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.
AWS Lambda : A computing service allowing programmers to run code without creating or managing servers.
AWS Athena : An interactive query service for S3, with no need to load data as it stays in S3.
AWS Redshift : A data warehouse service used to store processed data for analytics purposes.
AWS CloudWatch : A service for monitoring and observing logs across all AWS services.

Datasets used

This Kaggle dataset contains statistics (CSV files and JSON files) on daily popular YouTube videos over the course of many months. There are up to 200 trending videos published every day for many locations. The data for each region is in its own file. The video title, channel title, publication time, tags, views, likes and dislikes, description, and comment count are among the items included in the data. A category_id field, which differs by area, is also included in the JSON file linked to the region. https://www.kaggle.com/datasets/datasnaek/youtube-new

Risks and Mitigation

Costs of AWS services could be increasing if not monitored properly. To tackle this problem, budgets can be introduced using AWS billing and cost management.

Conclusion

Overall, we have to process both CSV and JSON data in the cloud environment and visualize data insights to answer business questions.

Data Visualization with Tableau Desktop

Install the ODBC Redshift Driver: Required for Tableau Desktop to connect to Redshift.
Create an Extract Connection: Connect Tableau Desktop to Redshift.
Create Visualizations: Use Tableau Desktop to process and visualize the data.

Dashboard

Checkout Tableau data visualization at Youtube data metrics | Tableau Public

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
transformed_parquet_data		transformed_parquet_data
ETL-job-csv-to-parquet.py		ETL-job-csv-to-parquet.py
README.md		README.md
Tableau dashboard.pdf		Tableau dashboard.pdf
lambda_function.py		lambda_function.py
loading_cleaned_data_into_warehouse_for_analysis (1).py		loading_cleaned_data_into_warehouse_for_analysis (1).py
youtube_data_analysis_architecture.png		youtube_data_analysis_architecture.png
youtube_data_analysis_architecture_100%.png		youtube_data_analysis_architecture_100%.png
youtube_data_visualization.hyper		youtube_data_visualization.hyper
youtube_data_visualization_with_extract.twb		youtube_data_visualization_with_extract.twb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Youtube Data Analysis ETL pipeline

Overview

Architecture Diagram

Goals

Services

Datasets used

Risks and Mitigation

Conclusion

Data Visualization with Tableau Desktop

Dashboard

Checkout Tableau data visualization at Youtube data metrics | Tableau Public

About

Releases

Packages

Languages

ravishankar324/Youtube-Data-Analysis-ETL-Using-AWS-Cloud-Services

Folders and files

Latest commit

History

Repository files navigation

Youtube Data Analysis ETL pipeline

Overview

Architecture Diagram

Goals

Services

Datasets used

Risks and Mitigation

Conclusion

Data Visualization with Tableau Desktop

Dashboard

Checkout Tableau data visualization at Youtube data metrics | Tableau Public

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages