Distributed Data Processing Module - Dan Zaratsian, Jan 2019
- Introduction and Module Agenda
- Distributed Computing Platform
- Walk-through of Apache Tools & Services (Spark, Zeppelin, Ambari, NiFi, Ranger, Hive, plus others)
- Distributed Architectures and Use Cases
- Demo - Hortonworks/Cloudera Data Platform (Hadoop)
- Demo - Google Dataproc (Hadoop)
- Docker Setup & Troubleshooting
- Intro to Apache Hive
- Apache Hive Syntax and Schema Design
- Demo & Lab - Apache Hive
- Intro to Apache HBase and Apache Phoenix
- Apache HBase Schema Design & Best Practices
- Apache Phoenix Syntax
- Demo & Lab - Apache HBase/Phoenix
- Intro to Apache SparkSQL
- Apache SparkSQL Syntax and Best Practices
- Demo & Lab - SparkSQL
- Apache Spark Overview
- Building and deploying Spark machine learning models
- Considerations for ML in distributed environments
- Spark Best Practices and Tuning
- Demo & Lab - Spark ML with NYC Taxi Data
- Intro to Apache Kafka
- Demo - Apache Kafka
- Intro to Apache NiFi
- Demo - Apache NiFi Walk-through
- Demo & Lab - Realtime Data Processing and Analysis
- Intro to Google Cloud Platform
- Overview of the cloud ecosystem & services
- Deploying solutions in the Cloud
- Industry trends & Applications
- Walk-through of Tools and Services
- Demos & Lab
This session is reserved as overflow from the previous sessions: if extra time is needed, or a deeper dive into specific content is required, it will be used for that.
If all sessions run smoothly and on time, this session will instead be designated
as a 75-minute hackathon. Here is the planned flow of events:
1) Ingest real-time streaming data using Apache NiFi
2) Process and store the streaming data using Apache NiFi (bonus points if Apache Spark is used as well)
3) Use Hive to query the persisted data in order to answer specific queries
4) Use Spark ML to build a predictive model against the persisted data