Subject | Percentage |
---|---|
1. MapReduce Paradigm | 25% |
2. PySpark and Spark | 65% |
3. Data Partitioning and SQL Queries | 10% |
- Understand the fundamentals of big data
- Understand the fundamentals of the MapReduce paradigm
- Use PySpark (Python API for Apache Spark) to solve big data problems
- Use SQL for NoSQL data (DataFrames in Spark and Amazon Athena)
- Understand Amazon Athena & Google BigQuery: access and analyze big data with SQL
At the completion of this course, students will be able to understand:
- Elements of Big Data:
  - Cluster Computing
  - Persistence, Queries, Analytics
  - Data Replication
  - Distributed File System and Fault Tolerance
  - Scale-out Architecture vs. Scale-up Architecture
- What is the MapReduce paradigm?
  - Data partitioning and partitions
  - Mapper function: map()
  - Reducer function: reduce()
  - Combiner function: combine()
  - Sort & Shuffle: SQL's GROUP BY
  - Classic MapReduce Algorithms
  - Data Design Patterns
- Fundamentals of Spark and PySpark:
  - Spark Architecture
  - Spark: engine for large-scale data analytics
  - Data Abstractions in Spark and PySpark
  - RDDs and DataFrames
  - Transformations and Actions
  - Running simple programs in PySpark
- NoSQL Databases & Serverless Architectures
  - SQL for NoSQL data & Relational Algebra
  - Amazon Athena and SQL
  - Google BigQuery and SQL
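
As a rough illustration of the MapReduce outcomes listed above (mapper, combiner, sort & shuffle, reducer), here is a minimal word-count sketch in plain Python. It only mimics the paradigm on in-memory lists and is not how Hadoop or Spark implement it. The function names `map_fn`, `combine_fn`, `shuffle`, `reduce_fn`, and `word_count`, and the sample input, are hypothetical and chosen for this example.

```python
from collections import defaultdict
from itertools import chain


def map_fn(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def combine_fn(pairs):
    """Combiner: pre-aggregate the pairs of a single partition locally."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def shuffle(pairs):
    """Sort & Shuffle: group all values by key, similar to SQL's GROUP BY."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(key, values):
    """Reducer: collapse all values of one key into a single result."""
    return key, sum(values)


def word_count(partitions):
    """Run map -> combine per partition, then shuffle -> reduce globally."""
    combined = []
    for partition in partitions:
        mapped = chain.from_iterable(map_fn(line) for line in partition)
        combined.extend(combine_fn(mapped))
    return [reduce_fn(key, values) for key, values in shuffle(combined)]


# Two hypothetical partitions of a tiny input file (assumed sample data).
partitions = [["fox jumped", "fox ran"], ["dog ran", "fox slept"]]
result = word_count(partitions)
print(result)
```

The combiner is optional for correctness here because addition is associative and commutative; it only reduces how many pairs reach the shuffle step.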
for the first 3 weeks of class
for the last 7 weeks of class
- Apache Spark (main site)
- PySpark API and documentation
- RDD Programming Guide
- DataFrame Programming Guide
The weekly coverage might change as it depends on the progress of the class. However, you must keep up with the reading and programming assignments.
TOPIC: Introduction to Big Data and MapReduce
- REQUIRED:
- OPTIONAL:
TOPIC: Introduction to Big Data and MapReduce
- REQUIRED:
- OPTIONAL:
TOPIC: Introduction to MapReduce
- REQUIRED:
- OPTIONAL:
TOPIC: Introduction to MapReduce
- REQUIRED:
- OPTIONAL:
TOPIC: Review of MapReduce paradigm with Examples
- Exam-1, in-class
- Closed books/notes/software/internet/friends
TOPIC: Introduction to Spark & PySpark
- REQUIRED:
- OPTIONAL:
TOPIC: Introduction to Spark and PySpark (Python API for Spark)
- REQUIRED:
- OPTIONAL:
TOPIC: Spark's Nuts and Bolts
- REQUIRED:
- OPTIONAL:
TOPIC: Data Design Patterns
- REQUIRED:
TOPIC: Data Design Patterns
- REQUIRED:
  - Chapters 3, 4, 5 of Data Algorithms with Spark Book by Mahmoud Parsian
  - Data Design Patterns: InMapper Combiner, mapPartitions
  - Top-10 Algorithm
  - MinMax Algorithm
- OPTIONAL:
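
The MinMax algorithm and the in-mapper combining pattern named in the required reading above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the book's implementation: partitions are modeled as ordinary lists, and the names `minmax_partition` and `merge` and the sample data are hypothetical.

```python
from functools import reduce


def minmax_partition(numbers):
    """Per-partition function in the style of mapPartitions():
    consume the iterator once and yield one (min, max, count) tuple."""
    local_min = local_max = None
    count = 0
    for n in numbers:
        if count == 0:
            local_min = local_max = n
        else:
            local_min = min(local_min, n)
            local_max = max(local_max, n)
        count += 1
    if count > 0:  # an empty partition emits nothing
        yield (local_min, local_max, count)


def merge(a, b):
    """Combine two (min, max, count) summaries into one."""
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2])


# Hypothetical partitions, including an empty one (assumed sample data).
partitions = [[5, 3, 9], [], [12, -4], [7]]

# One small summary per non-empty partition instead of one pair per element.
summaries = [s for part in partitions for s in minmax_partition(iter(part))]
final = reduce(merge, summaries)
print(summaries, final)
```

With a PySpark RDD, the same two functions could plausibly be used as `rdd.mapPartitions(minmax_partition).reduce(merge)`. Emitting one tuple per partition keeps the amount of data sent to the final reduction small, which is the point of the pattern.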
TOPIC: RDD Design Patterns
- REQUIRED:
- Spark's RDD Partitioning
- Chapters 3, 4, 5 of Data Algorithms with Spark Book by Mahmoud Parsian
- Spark's mapPartitions() Transformation - mapPartitions() Tutorial
- Review reducers: groupByKey(), reduceByKey(), and combineByKey()
- Review for Exam-2
- Problem solving & Q/A session
- Exam-2, in-class
- Closed books/notes/software/internet/friends
- Spark's DataFrames (1)
- Chapters 4, 6, 7, 12 of PySpark Algorithms Book by Mahmoud Parsian
- Video: Structuring Spark: SQL, DataFrames, Datasets And Streaming - 28 mins
- Spark's DataFrames (2)
- Chapters 4, 6, 7, 12 of PySpark Algorithms Book by Mahmoud Parsian
- Video: Structuring Spark: SQL, DataFrames, Datasets And Streaming - 28 mins
- Introduction to Graph data structures
- MapReduce Design Pattern: Graph Algorithms
- Chapter 6 of Data Algorithms with Spark Book by Mahmoud Parsian
- Chapter 11 of PySpark Algorithms Book by Mahmoud Parsian
- MapReduce Design Pattern: Graph Algorithms
- Chapter 6 of Data Algorithms with Spark Book by Mahmoud Parsian
- Chapter 11 of PySpark Algorithms Book by Mahmoud Parsian
- Academic holiday (no classes)
- No Office Hours
- Introduction to Serverless Analytics
- SQL Access to Big Data
- SQL Access: Amazon Athena
- SQL Access: Google BigQuery
- Review for Final Exam
- Q/A session
- In-Class Exam
- Date: Tuesday, December 10, 2024
- Time: 5:45 PM - 7:45 PM PST (TBDL)
- Closed books/notes/software/internet/friends