Course Information for Fall Quarter 2024


Main Subjects

Subject                                 Percentage
1. MapReduce Paradigm                   25%
2. PySpark and Spark                    65%
3. Data Partitioning and SQL Queries    10%

Course Description & Objectives

  • Understand the fundamentals of big data
  • Understand the fundamentals of the MapReduce paradigm
  • Use PySpark (Python API for Apache Spark) to solve big data problems
  • Use SQL for NoSQL data (DataFrames in Spark and Amazon Athena)
  • Understand Amazon Athena & Google BigQuery: access and analyze big data with SQL

Course Objectives

At the completion of this course, students will be able to understand:

  • Elements of Big Data:

    • Cluster Computing
    • Persistence, Queries, Analytics
    • Data Replication
    • Distributed File System and Fault Tolerance
    • Scale-out Architecture vs. Scale-up Architecture
  • What is the MapReduce paradigm?

    • Data partitioning and partitions
    • Mapper function: map()
    • Reducer function: reduce()
    • Combiner function: combine()
    • Sort & Shuffle (analogous to SQL's GROUP BY)
    • Classic MapReduce Algorithms
    • Data Design Patterns
  • Fundamentals of Spark and PySpark:

    • Spark Architecture
    • Spark: engine for large-scale data analytics
    • Data Abstractions in Spark and PySpark
    • RDDs and DataFrames
    • Transformations and Actions
    • Running simple programs in PySpark
  • NoSQL Databases & Serverless Architectures

    • SQL for NoSQL data & Relational Algebra
    • Amazon Athena and SQL
    • Google BigQuery and SQL
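
To make the MapReduce items above concrete, here is a minimal single-machine sketch of a word count in plain Python. It is only an illustration, not course material: the function names (mapper, combiner, sort_and_shuffle, reducer, word_count) and the sample input are hypothetical, and a real framework would run each stage in parallel across a cluster rather than in one process.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    """map(): emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def combiner(pairs):
    """combine(): pre-aggregate counts locally within a single partition."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def sort_and_shuffle(pairs):
    """Group all values by key (sorted by key), much like SQL's GROUP BY."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reducer(key, values):
    """reduce(): collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(partitions):
    """Run map -> combine per partition, then shuffle and reduce globally."""
    combined = []
    for partition in partitions:
        mapped = chain.from_iterable(mapper(line) for line in partition)
        combined.extend(combiner(mapped))
    grouped = sort_and_shuffle(combined)
    return [reducer(key, values) for key, values in grouped.items()]


# Hypothetical input: two partitions, each holding a few lines of text.
partitions = [["a fox jumped", "a fox"], ["a dog barked"]]
result = word_count(partitions)
print(result)
```

The combiner is optional: it only reduces the number of pairs sent through the shuffle, and for a sum it can safely reuse the same aggregation logic as the reducer.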

Required Books

for the first 3 weeks of class

for the last 7 weeks of class


Required Software, API, and Documentation


Tentative Course Outline

The weekly coverage might change as it depends on the progress of the class.
However, you must keep up with the reading and programming assignments.

Session-1: Tuesday, September 24, 2024

TOPIC: Introduction to Big Data and MapReduce


Session-2: Thursday, September 26, 2024

TOPIC: Introduction to Big Data and MapReduce


Session-3: Tuesday, October 1, 2024

TOPIC: Introduction to MapReduce


Session-4: Thursday, October 3, 2024

TOPIC: Introduction to MapReduce


Session-5: Tuesday, October 8, 2024

TOPIC: Review of MapReduce paradigm with Examples


Session-6: Thursday, October 10, 2024

  • Exam-1, in-class
  • Closed books/notes/software/internet/friends

Session-7: Tuesday, October 15, 2024

TOPIC: Introduction to Spark & PySpark


Session-8: Thursday, October 17, 2024

TOPIC: Introduction to Spark and PySpark (Python API for Spark)


Session-9: Tuesday, October 22, 2024

TOPIC: Spark's Nuts and Bolts


Session-10: Thursday, October 24, 2024

TOPIC: Data Design Patterns


Session-11: Tuesday, October 29, 2024

TOPIC: Data Design Patterns


Session-12: Thursday, October 31, 2024

TOPIC: RDD Design Patterns


Session-13: Tuesday, November 5, 2024

  • Review for Exam-2
  • Problem solving & Q/A session

Session-14: Thursday, November 7, 2024

  • Exam-2, in-class
  • Closed books/notes/software/internet/friends

Session-15: Tuesday, November 12, 2024


Session-16: Thursday, November 14, 2024


Session-17: Tuesday, November 19, 2024


Session-18: Thursday, November 21, 2024


November 25-29, Thanksgiving Recess

  • Academic holiday (no classes)
  • No Office Hours

Session-19: Tuesday, December 3, 2024

  • Introduction to Serverless Analytics
  • SQL Access to Big Data
    • SQL Access: Amazon Athena
    • SQL Access: Google BigQuery

Session-20: Thursday, December 5, 2024

  • Review for Final Exam
  • Q/A session

Session-21: Final Exam

  • In-Class Exam
  • Date: Tuesday, December 10, 2024
  • Time: 5:45 PM - 7:45 PM PST (TBDL)
  • Closed books/notes/software/internet/friends