We first discuss the training pipeline and then the inference pipeline.
Since the MPD is distributed as multiple JSON slices, we need a script that reads all the slices and builds datasets that are easy to use later on. It creates multiple dataframes:
- df_tracks_info: extra information about each track, e.g. artist, duration, album, tid, etc.
- df_tracks: information about which track belongs to which playlist, and in which position
- df_playlists_info: information about each playlist, e.g. number of artists, edits, duration, name, number of tracks, etc.
- df_playlists_test and df_playlists_test_info: same as before but for the test set provided in the challenge.
It also creates a dictionary that links each track URI to a track ID.
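For illustration, the slice-reading step might look roughly like this (the file paths and column names are assumptions based on the public MPD format, not necessarily the repo's exact code):

```python
import glob
import json
import pandas as pd

rows_tracks, rows_playlists = [], []
for path in sorted(glob.glob("data/mpd.slice.*.json")):
    with open(path) as f:
        mpd_slice = json.load(f)
    for playlist in mpd_slice["playlists"]:
        # Playlist-level metadata goes to df_playlists_info.
        rows_playlists.append({k: v for k, v in playlist.items() if k != "tracks"})
        # Track membership and position go to df_tracks.
        for track in playlist["tracks"]:
            rows_tracks.append(
                {"pid": playlist["pid"], "pos": track["pos"], "track_uri": track["track_uri"]}
            )

df_playlists_info = pd.DataFrame(rows_playlists)
df_tracks = pd.DataFrame(rows_tracks)

# Map each track URI to a compact integer track ID (tid).
uri_to_tid = {uri: tid for tid, uri in enumerate(df_tracks["track_uri"].unique())}
df_tracks["tid"] = df_tracks["track_uri"].map(uri_to_tid)
```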
This script builds its own test set to create a more robust challenge: it takes playlists from the original dataset and removes a certain number of tracks from each, either at random or from the end (a sketch of the removal step follows the list below).
- df_playlists_challenge: same as df_playlists but with the playlists chosen for the test set.
- df_tracks_challenge: same as df_tracks but with the tracks chosen for the test set.
- df_tracks_challenge_incomplete: same as df_tracks but with the modifications specified above.
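A minimal sketch of the track-removal step, assuming a fixed number of missing tracks per playlist (the actual script may choose this number differently):

```python
import numpy as np

def make_incomplete(group, n_missing, from_end=False):
    """Drop n_missing tracks from one playlist, randomly or from the end."""
    group = group.sort_values("pos")
    if from_end:
        return group.iloc[:-n_missing]
    keep = np.random.choice(len(group), size=len(group) - n_missing, replace=False)
    return group.iloc[np.sort(keep)]

df_tracks_challenge_incomplete = (
    df_tracks_challenge.groupby("pid", group_keys=False)
    .apply(lambda g: make_incomplete(g, n_missing=5, from_end=np.random.rand() < 0.5))
)
```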
This script creates the final training/test sets by subtracting the previous sets from the original sets.
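A sketch of this subtraction, assuming the (pid, pos) pair identifies each playlist entry:

```python
# Training data: all tracks from playlists NOT chosen for the challenge.
df_tracks_train = df_tracks[~df_tracks["pid"].isin(df_tracks_challenge["pid"])]

# Test targets: the tracks removed from the challenge playlists, found via an
# anti-join on the (pid, pos) pair.
merged = df_tracks_challenge.merge(
    df_tracks_challenge_incomplete[["pid", "pos"]],
    on=["pid", "pos"], how="left", indicator=True,
)
df_test_targets = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
```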
The dataframes created so far follow the template of df_tracks. This script creates new dataframes in which each row is a playlist with all its track IDs and their positions. It also creates extra dataframes in which each row is a track, specifying its position and the playlist ID it belongs to.
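For example, the playlist-per-row dataframe could be built with a groupby like this (a sketch, not the exact code):

```python
# One row per playlist, with the full ordered track list and positions.
df_playlist_rows = (
    df_tracks.sort_values(["pid", "pos"])
    .groupby("pid")
    .agg(tids=("tid", list), positions=("pos", list))
    .reset_index()
)
```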
Creates the training playlist/song matrix. This matrix has as many rows as playlists and as many columns as unique tracks; the entry for each song belonging to a playlist is set to one. Since the matrix is very large and sparse, a sparse representation is used. The same process is then repeated to create another matrix, this time using the whole dataset instead of the training subset.
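A sketch of the matrix construction with scipy.sparse, assuming the pid and tid columns from the earlier steps:

```python
import numpy as np
from scipy.sparse import csr_matrix

n_playlists = int(df_tracks_train["pid"].max()) + 1
n_tracks = len(uri_to_tid)

# Build the binary playlist x track interaction matrix in one shot.
matrix_train = csr_matrix(
    (np.ones(len(df_tracks_train)),
     (df_tracks_train["pid"], df_tracks_train["tid"])),
    shape=(n_playlists, n_tracks),
)
matrix_train.data[:] = 1  # clip duplicate (pid, tid) entries back to one
```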
This script builds a similarity matrix between playlists and another between songs, which helps precompute calculations for later steps.
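One common choice, sketched below, is cosine similarity on the sparse interaction matrix (the actual script may use a different measure); the output stays sparse because the input is sparse:

```python
from sklearn.metrics.pairwise import cosine_similarity

# Rows of matrix_train are playlists, columns are tracks, so:
sim_playlists = cosine_similarity(matrix_train, dense_output=False)    # playlist x playlist
sim_tracks = cosine_similarity(matrix_train.T, dense_output=False)     # track x track
```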
It tests the model against the test set previously created. The results are:
- R-Precision: 0.7607746294573476
- NDCG: 0.768559893750165
- Song Clicks: 0.0
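For reference, track-level R-Precision can be computed like this (a sketch of the metric, not the repo's evaluation code):

```python
def r_precision(recommended, held_out):
    """Fraction of held-out tracks recovered in the first |held_out| slots."""
    relevant = set(held_out)
    return len(set(recommended[: len(relevant)]) & relevant) / len(relevant)

r_precision(["t1", "t2", "t3", "t4"], ["t2", "t4", "t9"])  # -> 1/3
```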
We have built a Gradio app on Hugging Face to test the model: App
To this end we use the following scripts:
Contains the necessary functions to retrieve the URIs of the tracks of a playlist, given its URL, through the Spotify API (a sketch of this step follows the list below).
Contains the necessary functions to run the model given a list of URIs.
Main application that returns Spotify iframes according to the recommendations.
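The URI-retrieval step could be sketched with spotipy, the usual Python client for the Spotify API (the credentials setup and function name here are illustrative):

```python
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Reads SPOTIPY_CLIENT_ID / SPOTIPY_CLIENT_SECRET from the environment.
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())

def get_track_uris(playlist_url):
    """Return the track URIs of a public playlist given its URL."""
    results = sp.playlist_items(playlist_url)
    uris = [item["track"]["uri"] for item in results["items"] if item["track"]]
    while results["next"]:  # page through playlists longer than one page
        results = sp.next(results)
        uris += [item["track"]["uri"] for item in results["items"] if item["track"]]
    return uris
```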
Since music tastes vary over the years, it is essential that the model can be retrained. For that reason, the inference pipeline has been adapted to this goal: every time an inference is made, the new list of URIs is saved to a dynamic dataset on Hugging Face: Dataset 1 Dataset 2. This processed data is immediately added to the static dataset, so it can be used in the next inference.
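A minimal sketch of this feedback loop using the datasets library (the repo id below is hypothetical; the real datasets are linked above):

```python
from datasets import Dataset, concatenate_datasets, load_dataset

def log_inference(track_uris, repo_id="user/dynamic-playlists"):  # hypothetical id
    """Append the URIs of a new inference to the dynamic dataset on the Hub."""
    new_rows = Dataset.from_dict({"track_uri": track_uris})
    existing = load_dataset(repo_id, split="train")
    combined = concatenate_datasets([existing, new_rows])
    combined.push_to_hub(repo_id)
```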