Skip to content

Latest commit

 

History

History
167 lines (150 loc) · 10.1 KB

README.md

File metadata and controls

167 lines (150 loc) · 10.1 KB

ACOUSTICS a music recommender system

One of the most used machine learning algorithms is recommendation systems. A recommender (or recommendation) system (or engine) is a filtering system which aim is to predict a rating or preference a user would give to an item, eg. a film, a product, a song, etc.

Which type of recommender can we have?

There are two main types of recommender systems: 

  • Content-based filters
  • Collaborative filters

Content-based filters predicts what a user likes based on what that particular user has liked in the past. On the other hand, collaborative-based filters predict what a user like based on what other users, that are similar to that particular user, have liked.

2) Collaborative filters

Collaborative Filters work with an interaction matrix, also called rating matrix. The aim of this algorithm is to learn a function that can predict if a user will benefit from an item - meaning the user will likely buy, listen to, watch this item.

Among collaborative-based systems, we can encounter two types: user-item filtering and item-item filtering.

What algorithms do collaborative filters use to recommend new songs? There are several machine learning algorithms that can be used in the case of collaborative filtering. Among them, we can mention nearest-neighbor, clustering, and matrix factorization.

K-Nearest Neighbors (kNN) is considered the standard method when it comes to both user-based and item-based collaborative filtering approaches.

We'll go through the steps for generating a music recommender system using a k-nearest algorithm approach.

Importing required libraries

Importing all the required libraries. import warnings warnings.filterwarnings("ignore", category=FutureWarning) import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns from scipy.sparse import csr_matrix from recommeders.knn_recommender import Recommender

Reading the files

We are going to use the Million Song Dataset, a freely-available collection of audio features and metadata for a million contemporary popular music tracks. There are two files that will be interesting for us. The first of them will give us information about the songs. Particularly, it contains the user ID, song ID and the listen count. On the other hand, the second file will contain song ID, title of that song, release, artist name and year. We need to merge these two DataFrames. For that aim, we'll use the song_ID #Read userid-songid-listen_count song_info = pd.read_csv('https://static.turi.com/datasets/millionsong/10000.txt',sep='\t',header=None) song_info.columns = ['user_id', 'song_id', 'listen_count']

#Read song metadata song_actual = pd.read_csv('https://static.turi.com/datasets/millionsong/song_data.csv') song_actual.drop_duplicates(['song_id'], inplace=True)

#Merge the two dataframes above to create input dataframe for recommender systems songs = pd.merge(song_info, song_actual, on="song_id", how="left") songs.head() We'll save this dataset into a csv file so we have this available if there is any other recommendation system project we want to do. songs.to_csv('songs.csv', index=False) We can read this file into a new DataFrame that we'd call df_songs. df_songs = pd.read_csv('songs.csv')

Exploring the data

As usual, any data science or machine learning project starts with an exploratory data analysis (EDA). The aim of EDA is to understand and get insights on our data. We'll first inspect the first rows of our DataFrame. df_songs.head() Then, we'll check how many observions there are in the dataset. #Get total observations print(f"There are {df_songs.shape[0]} observations in the dataset") Now, we should perform some cleaning steps. But looking at the dataset, we can see that there is no missing values. df_songs.isnull().sum() And most of the columns contain strings. df_songs.dtypes Let's start exploring some characteristics of the dataset:

  • Unique songs: #Unique songs unique_songs = df_songs['title'].unique().shape[0] print(f"There are {unique_songs} unique songs in the dataset")
  • Unique artists: #Unique artists unique_artists = df_songs['artist_name'].unique().shape[0] print(f"There are {unique_artists} unique artists in the dataset")
  • Unique users: #Unique users unique_users = df_songs['user_id'].unique().shape[0] print(f"There are {unique_users} unique users in the dataset") We'll go ahead and explore the popularity of songs and artists.

Most popular songs

How do we determine which are the most popular songs? For this task, we'll count how many times each song appears. Note that while we are using listen_count, we only care about the number of rows, we don't consider the number present in that row. This number represents how many times one user listen to the same song. #count how many rows we have by song, we show only the ten more popular songs ten_pop_songs = df_songs.groupby('title')['listen_count'].count().reset_index().sort_values(['listen_count', 'title'], ascending = [0,1]) ten_pop_songs['percentage'] = round(ten_pop_songs['listen_count'].div(ten_pop_songs['listen_count'].sum())*100, 2) ten_pop_songs = ten_pop_songs[:10] ten_pop_songs labels = ten_pop_songs['title'].tolist() counts = ten_pop_songs['listen_count'].tolist() plt.figure() sns.barplot(x=counts, y=labels, palette='Set3') sns.despine(left=True, bottom=True)

Most popular artist

For the next task, we'll count how many times each artist appears. Again, we'll count how many times the same artist appears. #count how many rows we have by artist name, we show only the ten more popular artist ten_pop_artists = df_songs.groupby(['artist_name'])['listen_count'].count().reset_index().sort_values(['listen_count', 'artist_name'], ascending = [0,1]) ten_pop_artists = ten_pop_artists[:10] ten_pop_artists plt.figure() labels = ten_pop_artists['artist_name'].tolist() counts = ten_pop_artists['listen_count'].tolist() sns.barplot(x=counts, y=labels, palette='Set2') sns.despine(left=True, bottom=True)

Listen count by user

We can also get some other information from the feature listen_count. We will answer the folloging questions: What was the maximum time the same user listen to a same song? listen_counts = pd.DataFrame(df_songs.groupby('listen_count').size(), columns=['count']) print(f"The maximum time the same user listened to the same songs was: {listen_counts.reset_index(drop=False)['listen_count'].iloc[-1]}") How many times on average the same user listen to a same song? print(f"On average, a user listen to the same song {df_songs['listen_count'].mean()} times") We can also check the distribution of listen_count: plt.figure(figsize=(20, 5)) sns.boxplot(x='listen_count', data=df_songs) sns.despine() What are the most frequent number of times a user listen to the same song? listen_counts_temp = listen_counts[listen_counts['count'] > 50].reset_index(drop=False) plt.figure(figsize=(16, 8)) sns.barplot(x='listen_count', y='count', palette='Set3', data=listen_counts_temp) plt.gca().spines['top'].set_visible(False) plt.gca().spines['right'].set_visible(False) plt.show(); How many songs does a user listen in average? song_user = df_songs.groupby('user_id')['song_id'].count() plt.figure(figsize=(16, 8)) sns.distplot(song_user.values, color='orange') plt.gca().spines['top'].set_visible(False) plt.gca().spines['right'].set_visible(False) plt.show(); print(f"A user listens to an average of {np.mean(song_user)} songs") print(f"A user listens to an average of {np.median(song_user)} songs, with minimum {np.min(song_user)} and maximum {np.max(song_user)} songs") We can see that a user listens in average to 27 songs. Even the maximum amount of songs listen by an user is 711, and we have 9567 songs in our dataset.

So, not all user listen to all songs, so a lot of values in the song x users matrix are going to be zero. Thus, we’ll be dealing with extremely sparse data.

How sparse? Let's check that:

Get how many values should it be if all songs have been listen by all users

values_matrix = unique_users * unique_songs

Substract the total values with the actural shape of the DataFrame songs

zero_values_matrix = values_matrix - df_songs.shape[0] print(f"The matrix of users x songs has {zero_values_matrix} values that are zero") Dealing with such a sparse matrix, we'll take a lot of memory and resources. To make our life easier, let's just select all those users that have listened to at least 16 songs.

Prepare the data

Get users which have listen to at least 16 songs

song_ten_id = song_user[song_user > 16].index.to_list()

Filtered the dataset to keep only those users with more than 16 listened

df_song_id_more_ten = df_songs[df_songs['user_id'].isin(song_ten_id)].reset_index(drop=True) We need now to work with a scipy-sparse matrix to avoid overflow and wasted memory. For that purpose, we'll use the csr_matrix function from scipy.sparse.

convert the dataframe into a pivot table

df_songs_features = df_song_id_more_ten.pivot(index='song_id', columns='user_id', values='listen_count').fillna(0)

obtain a sparse matrix

mat_songs_features = csr_matrix(df_songs_features.values) Let's take a look at the table user x song. df_songs_features.head() Because the system will output the id of the song, instead of the title, we'll make a function that maps those indices with the song title. df_unique_songs = df_songs.drop_duplicates(subset=['song_id']).reset_index(drop=True)[['song_id', 'title']] decode_id_song = { song: i for i, song in enumerate(list(df_unique_songs.set_index('song_id').loc[df_songs_features.index].title)) }

Model and recommendations

So, we know that we want to use the model to predict songs. For that, we'll use the Recommender class wrote in the knn_recommender file. model = Recommender(metric='cosine', algorithm='brute', k=20, data=mat_songs_features, decode_id_song=decode_id_song) song = 'I believe in miracles' new_recommendations = model.make_recommendation(new_song=song, n_recommendations=10) print(f"The recommendations for {song} are:") print(f"{new_recommendations}")