Sparkify-Capstone

Sparkify Logo

Spark for Big Data - Project Description

Imagine you are working on the data team for a popular digital music service similar to Spotify or Pandora. Many users stream their favourite songs every day, either on the free tier, which places advertisements between songs, or on the premium subscription, where they stream ad-free but pay a monthly flat rate. Users can upgrade, downgrade or cancel their service at any time, so it's crucial to make sure they love the service. Every interaction with the service generates data: playing songs, logging out, liking a song with a thumbs up, hearing an ad, or downgrading a subscription. All this data contains the key insights for keeping your users happy and helping your business thrive. It's your job on the data team to predict which users are at risk of churning, either by downgrading from premium to the free tier or by cancelling their service altogether. If you can accurately identify these users before they leave, your business can offer them discounts and incentives, potentially saving millions in revenue.

To tackle this project, you are provided with a large dataset that contains the events described above. You will need to load, explore and clean this dataset with Spark. Based on your exploration, you will create features and build models with Spark to predict which users will churn from your digital music service. This project is all about demonstrating mastery of scalable data manipulation and machine learning with Spark. After completing it, you'll have built a useful model with a massive dataset, and you'll be able to apply the same skills to wrangle data and build models in your role as a data scientist.
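As a rough sketch of what the churn label can look like in PySpark (an illustration, not necessarily the notebook's exact code), a user can be flagged as churned once a "Cancellation Confirmation" event appears in their log. The file name and the userId/page column names below are assumptions based on the standard Sparkify event log:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumptions: the unzipped event log is mini_sparkify_event_data.json and
# exposes `userId` and `page` columns, as in the standard Sparkify data.
spark = SparkSession.builder.appName("Sparkify").getOrCreate()
df = spark.read.json("mini_sparkify_event_data.json")

# One row per user: churn = 1 if the user ever hit "Cancellation Confirmation".
churn_labels = (df
    .withColumn("is_cancel",
                F.when(F.col("page") == "Cancellation Confirmation", 1).otherwise(0))
    .groupBy("userId")
    .agg(F.max("is_cancel").alias("churn")))

# Join the per-user label back onto every event row for feature engineering.
df_labeled = df.join(churn_labels, on="userId", how="left")
```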

Table of Contents

  • Data Cleaning
  • Exploratory Data Analysis
  • Feature Engineering
  • Modeling

Libraries used

  • pyspark
  • matplotlib
  • seaborn

Run jupyter notebook (local)

Start either jupyter lab or jupyter notebook and run

Sparkify.ipynb

This notebook uses a small subset (128 MB) of the full dataset (12 GB). The data file is compressed with 7-Zip; please unzip it before use.
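After extracting the archive, the mini event log can be loaded and sanity-checked with Spark along these lines (a sketch; the extracted JSON file name is an assumption):

```python
from pyspark.sql import SparkSession

# Assumed name of the extracted file; adjust if the archive unpacks differently.
EVENT_LOG = "mini_sparkify_event_data.json"

spark = SparkSession.builder.master("local[*]").appName("Sparkify").getOrCreate()
events = spark.read.json(EVENT_LOG)

events.printSchema()                      # inspect the raw event schema
print("rows:", events.count())            # quick row count on the 128 MB subset
events.select("page").distinct().show()   # event types such as NextSong, Thumbs Up, Logout
```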

Medium Blog Post

Can be found here: https://towardsdatascience.com/how-to-predict-churns-in-sparkify-ab9a5c3f218d
