Distributed Services for Machine Learning - Dan Zaratsian, March 2022
- Introduction and Agenda
- Distributed Computing
- Walk-through of Tools and Services
- Distributed Architectures and Use Cases
- Google Colab Notebook Environment
- Google BigQuery Sandbox
- Background starting with Hadoop
- Intro to Apache Hive
- Apache Hive Syntax and Schema Design
- Intro to Apache Spark SQL
- Apache Spark SQL
- BigQuery (Serverless SQL)
- Google Cloud Firestore (NoSQL)
Assignments
- Assignment 1 SQL - (Solution)
- Due on Tuesday, March 15 by 11:59pm
- Please complete as an individual assignment
- Email your code and answers to [email protected]
- Assignment 2 NoSQL - (Solution)
- Due on Tuesday, March 15 by 11:59pm
- Please complete as an individual assignment
- Email your code and answers to [email protected]
Asset Directory
Slides
- Apache Spark Overview
- Spark Machine Learning (MLlib)
- ML Pipelines
- Building and deploying Spark machine learning models
- Considerations for ML in distributed environments
- Spark Best Practices and Tuning
- Spark Code Walk-through (within Google Colab)
Assignment
- Assignment 3
- Due on Monday, March 21 by 11:59pm
- Please complete as an individual assignment
- Email your code to [email protected]
NOTE: Slides from this week are a continuation of Session 3
- Spark Pipeline Components
- Spark Best Practices
- Deploying / Submitting Spark Applications
- Scikit-learn Model Deployment Process
Asset Directory
Slides
- Apache Kafka
- Google Pub/Sub
- Demo of Pub/Sub
- Spark Streaming
- Demo of Spark Streaming
- Apache Beam (Google Dataflow)
Asset Directory
Slides
- Overview of Google Cloud
- BigQuery ML
- AutoML
- Serverless functions with Google Cloud Functions
- Container-Based Deployments
Assignment
- Assignment 4 - SparkML or Docker Container
- This is a TEAM assignment.
- Due on Thursday, March 31 by 11:59pm
- Email me with any questions regarding the assignment.
- Please submit your code by email to [email protected]
- Google Colab Notebooks
- Google Vertex AI Platform
- Google Vertex Notebooks (Workbench)
- Apache Zeppelin
- Apache Spark Docs
- Google BigQuery
- Google BigQuery Sandbox
- Apache Hive Docs
- Google Cloud Firestore
- Apache HBase Docs
- Apache Phoenix Docs
- Google Cloud Pub/Sub
- Apache Kafka Docs
- Apache NiFi Docs
- Docker Docs
- Google Deep Learning Containers