Spark Talk

What is Apache Spark
Use Cases
Batch and Streaming APIs
Architecture
Directed Acyclic Graph (DAG)
Example With Last.fm Data

About Spark

Distributed data processing engine
Best suited for big data but not limited to it
Fast and resilient
Supports both batch and streaming
APIs for Scala, Java, Python, R, and SQL
Machine learning with MLlib

Use Cases

Process big data that cannot fit into memory
Extract, transform, load (ETL)
Dashboards from streaming data (such as live sales on an e-commerce site)
Any use case where data has to be processed quickly

Spark APIs

Low-level APIs such as RDD and DStreams
High-level Structured APIs such as DataFrame and Dataset
RDD and Dataset are type-safe.
DataFrame is not type-safe, but it usually performs best because Spark can optimize its operations.
Easy to switch between APIs.

Architecture

Spark runs on a distributed cluster
The driver controls the Spark job and distributes tasks to executors.
The cluster manager provides resources to the driver and the executors.
Spark can run in standalone mode, on YARN, or on Mesos.
Experimental Kubernetes support.
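
The split of work between driver and executors can be pictured with a toy model. This is plain Python, not Spark: a thread pool stands in for the executors, and the names split_into_partitions, count_words_task and run_job are made up for this sketch. It only shows the shape of the idea: the driver cuts the input into partitions, creates one task per partition, and combines the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Driver side: slice the input into roughly equal partitions."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words_task(partition):
    """Executor side: one task processes exactly one partition."""
    return sum(len(line.split()) for line in partition)


def run_job(records, num_executors=3):
    """Driver side: one task per partition, run on the worker pool,
    then combine the partial results into the final answer."""
    partitions = split_into_partitions(records, num_executors)
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_counts = list(executors.map(count_words_task, partitions))
    return sum(partial_counts)


lines = [
    "spark runs on a cluster",
    "the driver schedules tasks",
    "executors run them",
]
total_words = run_job(lines)
print(total_words)  # 12
```

In a real cluster the executors are separate processes on other machines, and the cluster manager decides where they run. The thread pool here hides all of that.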

Directed Acyclic Graph (DAG)

Spark represents the flow of a job as a directed acyclic graph (DAG) of steps.
When a step fails, Spark uses the DAG to re-run the steps that lead up to it and regenerate the lost data.
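
The recovery idea can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Spark's implementation: the Step class and its fields are hypothetical, and the graph is a simple chain. Each step remembers its parent, so a lost result can be rebuilt by walking back through the graph and recomputing only what is missing.

```python
class Step:
    """One node in a toy lineage graph: a function plus the step it depends on."""

    def __init__(self, name, func, parent=None):
        self.name = name
        self.func = func
        self.parent = parent
        self.cached = None  # result held in memory; may be lost
        self.runs = 0       # how many times this step was computed

    def compute(self):
        if self.cached is not None:
            return self.cached
        # Walk back through the lineage to get the input, recomputing
        # any ancestor whose result is no longer available.
        upstream = self.parent.compute() if self.parent else None
        self.cached = self.func(upstream)
        self.runs += 1
        return self.cached


load = Step("load", lambda _: [1, 2, 3, 4, 5, 6])
evens = Step("filter_even", lambda xs: [x for x in xs if x % 2 == 0], parent=load)
squares = Step("square", lambda xs: [x * x for x in xs], parent=evens)

first = squares.compute()
print(first)  # [4, 16, 36]

# Simulate a failure that wipes out the last two results.
evens.cached = None
squares.cached = None

recovered = squares.compute()
print(recovered)                           # [4, 16, 36]
print(load.runs, evens.runs, squares.runs)  # 1 2 2
```

Only the lost steps run again; the load step still has its result, so it is computed once.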

Example With Last.fm Data

Analyze the 2018 data and find the top 10 artists
Get recommendations for each artist using the MusicBrainz API
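
The first step, finding the top artists for a year, boils down to filter, group, count, sort and limit. Below is a minimal sketch of that logic in plain Python rather than Spark. The column names artist, track and timestamp (Unix seconds) are assumptions for illustration, the rows are placeholders, and top_artists is a made-up helper; the real data file may be laid out differently.

```python
import csv
import io
from collections import Counter
from datetime import datetime, timezone


def top_artists(csv_file, year, limit=10):
    """Count plays per artist for one year and return the most played."""
    plays = Counter()
    for row in csv.DictReader(csv_file):
        played_at = datetime.fromtimestamp(int(row["timestamp"]), tz=timezone.utc)
        if played_at.year == year:      # filter
            plays[row["artist"]] += 1   # group and count
    return plays.most_common(limit)     # sort descending and limit


# Placeholder rows with assumed column names; not real listening data.
sample = io.StringIO(
    "artist,track,timestamp\n"
    "Artist A,Song 1,1514764800\n"  # 2018
    "Artist B,Song 2,1520000000\n"  # 2018
    "Artist A,Song 3,1530000000\n"  # 2018
    "Artist C,Song 4,1500000000\n"  # 2017, filtered out
)
ranking = top_artists(sample, year=2018)
print(ranking)  # [('Artist A', 2), ('Artist B', 1)]
```

With Spark the same steps would be expressed as transformations on a distributed dataset, so the counting runs in parallel across executors instead of in a single loop.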
