Imagine you are working on the data team for a popular digital music service similar to Spotify or Pandora. Many users stream their favourite songs on your service every day, either using the free tier, which places advertisements between songs, or the premium subscription model, where they stream music ad-free but pay a flat monthly rate. Users can upgrade, downgrade or cancel their service at any time, so it's crucial to make sure your users love the service.

Every time a user interacts with the service, whether playing songs, logging out, liking a song with a thumbs up, hearing an ad, or downgrading their service, it generates data. This data holds the key insights for keeping your users happy and helping your business thrive. It's your job on the data team to predict which users are at risk of churning, either by downgrading from the premium to the free tier or by cancelling their service altogether. If you can accurately identify these users before they leave, your business can offer them discounts and incentives, potentially saving millions in revenue.

To tackle this project, you are provided with a large dataset that contains the events described above. You will need to load, explore and clean this dataset with Spark. Based on your exploration, you will create features and build models with Spark to predict which users will churn from your digital music service.

This project is all about demonstrating mastery of scalable data manipulation and machine learning with Spark. After completing it, you'll have built a useful model on a massive dataset, and you'll be able to apply the same Spark skills to wrangle data and build models in your role as a data scientist.
- Data Cleaning
- Exploratory Data Analysis
- Feature Engineering
- Modeling
- pyspark
- matplotlib
- seaborn
Start either jupyter lab or jupyter notebook and run
Sparkify.ipynb
This notebook uses a tiny subset (128MB) of the full dataset (12GB). The data file is compressed with 7-Zip; please extract it before use.
The accompanying article can be found here: https://towardsdatascience.com/how-to-predict-churns-in-sparkify-ab9a5c3f218d