Distributed Data Processing Module - Dan Zaratsian, Jan 2019
- Introduction and Module Agenda
- Distributed Computing Platform
- Walk-through of Apache Tools & Services (Spark, Zeppelin, Ambari, NiFi, Ranger, Hive, plus others)
- Distributed Architectures and Use Cases
- Demo - Hortonworks/Cloudera Data Platform (Hadoop)
- Demo - Google Dataproc (Hadoop)
- Docker Setup & Troubleshooting
- Intro to Apache Hive
- Apache Hive Syntax and Schema Design
- Demo & Lab - Apache Hive
- Intro to Apache HBase and Apache Phoenix
- Apache HBase Schema Design & Best Practices
- Apache Phoenix Syntax
- Demo & Lab - Apache HBase/Phoenix
- Intro to Apache SparkSQL
- Apache SparkSQL Syntax and Best Practices
- Demo & Lab - SparkSQL
- Apache Spark Overview
- Building and deploying Spark machine learning models
- Considerations for ML in distributed environments
- Spark Best Practices and Tuning
- Demo & Lab - Spark ML with NYC Taxi Data
- Intro to Apache Kafka
- Demo - Apache Kafka
- Intro to Apache NiFi
- Demo - Apache NiFi Walk-through
- Demo & Lab - Realtime Data Processing and Analysis
- Intro to Google Cloud Platform
- Overview of the cloud ecosystem & services
- Deploying solutions in the Cloud
- Industry trends & Applications
- Walk-through of Tools and Services
- Demos & Lab
This session is reserved as overflow from the previous sessions: if extra time is needed, or a deeper dive into specific content is required, it will be used for that.
If all sessions run smoothly and on time, this session will instead be designated
as a 75-minute hackathon. Here is the planned flow of events:
1) Ingest real-time streaming data using Apache NiFi
2) Process and store the streaming data using Apache NiFi (bonus points if Apache Spark is used as well)
3) Use Hive to query the persisted data in order to answer specific queries
4) Use Spark ML to build a predictive model against the persisted data