Institute for Advanced Analytics - Course - Distributed Data Processing

chelseypaulsen/IAA_Sessions

Distributed Data Processing Module - Dan Zaratsian, Jan 2019


IAA Module - Session 1 - Platform Overview

  • Introduction and Module Agenda
  • Distributed Computing Platform
  • Walk-through of Apache Tools & Services (Spark, Zeppelin, Ambari, NiFi, Ranger, Hive, plus others)
  • Distributed Architectures and Use Cases
  • Demo - Hortonworks/Cloudera Data Platform (Hadoop)
  • Demo - Google Dataproc (Hadoop)
  • Docker Setup & Troubleshooting

IAA Module - Session 2 - SQL and NoSQL Services

  • Intro to Apache Hive
  • Apache Hive Syntax and Schema Design
  • Demo & Lab - Apache Hive
  • Intro to Apache HBase and Apache Phoenix
  • Apache HBase Schema Design & Best Practices
  • Apache Phoenix Syntax
  • Demo & Lab - Apache HBase/Phoenix
  • Intro to Apache SparkSQL
  • Apache SparkSQL Syntax and Best Practices
  • Demo & Lab - SparkSQL

IAA Module - Session 3 - Spark Data Processing & Machine Learning

  • Apache Spark Overview
  • Building and deploying Spark machine learning models
  • Considerations for ML in distributed environments
  • Spark Best Practices and Tuning
  • Demo & Lab - Spark ML with NYC Taxi Data

IAA Module - Session 4 - Realtime, Streaming Systems

  • Intro to Apache Kafka
  • Demo - Apache Kafka
  • Intro to Apache NiFi
  • Demo - Apache NiFi Walk-through
  • Demo & Lab - Realtime Data Processing and Analysis

IAA Module - Session 5 - Google Cloud Platform (GCP)

  • Intro to Google Cloud Platform
  • Overview of the cloud ecosystem & services
  • Deploying solutions in the Cloud
  • Industry trends & Applications
  • Walk-through of Tools and Services
  • Demos & Lab

IAA Module - Session 6 - Special Topics or Hackathon Project

This session serves as overflow for the previous sessions. If any content needs extra time or a deeper dive, it will be covered here.

If all sessions run smoothly and on time, this session will instead be a 75-minute hackathon. Here is the planned flow of events:
1) Ingest real-time streaming data using Apache NiFi
2) Process and store the streaming data using Apache NiFi (bonus points if Apache Spark is used as well)
3) Use Hive to query the persisted data in order to answer specific questions
4) Use Spark ML to build a predictive model against the persisted data
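
The four steps above can be rehearsed end to end on a laptop before touching a cluster. The sketch below is only a stand-in written in plain Python, not the hackathon solution: a generator plays the role of the NiFi ingest, an in-memory SQLite table plays the role of the Hive table, and a closed-form least-squares fit plays the role of a Spark ML regression. The table name `trips`, the field names, and the sample events are all hypothetical and made up for illustration.

```python
import sqlite3

# Hypothetical, made-up events standing in for a real-time feed.
EVENTS = [
    {"trip_id": 1, "distance_miles": 1.0, "fare": 5.0},
    {"trip_id": 2, "distance_miles": 2.0, "fare": 7.0},
    {"trip_id": 3, "distance_miles": 3.0, "fare": 9.0},
    {"trip_id": 4, "distance_miles": 4.0, "fare": 11.0},
    {"trip_id": 5, "distance_miles": 0.0, "fare": 4.0},  # malformed: zero distance
]


def ingest(events):
    """Step 1 stand-in: yield events one at a time, as a streaming source would."""
    for event in events:
        yield event


def process_and_store(stream, conn):
    """Step 2 stand-in: drop malformed events and persist the rest."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trips "
        "(trip_id INTEGER, distance_miles REAL, fare REAL)"
    )
    for e in stream:
        if e["distance_miles"] > 0 and e["fare"] > 0:
            conn.execute(
                "INSERT INTO trips VALUES (?, ?, ?)",
                (e["trip_id"], e["distance_miles"], e["fare"]),
            )
    conn.commit()


def query_average_fare(conn):
    """Step 3 stand-in: answer a question with a SQL aggregate."""
    return conn.execute("SELECT AVG(fare) FROM trips").fetchone()[0]


def fit_fare_model(conn):
    """Step 4 stand-in: least-squares fit of fare = slope * distance + intercept."""
    rows = conn.execute("SELECT distance_miles, fare FROM trips").fetchall()
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def run_pipeline(events):
    """Run all four stand-in steps against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    try:
        process_and_store(ingest(events), conn)
        avg_fare = query_average_fare(conn)
        slope, intercept = fit_fare_model(conn)
    finally:
        conn.close()
    return avg_fare, slope, intercept


if __name__ == "__main__":
    print(run_pipeline(EVENTS))
```

Keeping each step behind its own function mirrors the hand-offs in the real stack, so any one stand-in can later be swapped for the actual service without changing the others.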


