Title: Generate lyrics for your favorite singer
Collaborators: Jarra Omar, Orlando Cedeno
Abstract: The goal of this project is to generate lyrics for a favorite singer of your choice. For our data, we used Genius's API to load artists and a collection of their songs into a CSV file. After some pre-processing of the text, we trained a deep learning model to generate new lyrics based on the patterns and themes found in the existing lyrics.
genius.ipynb: Create a Data file directory in the project structure that holds the artists you're looking for, each in a file such as "kanye.txt". The notebook iterates through that directory and searches the Genius API for each artist's songs (currently limited to 10 per artist). It then places the songs in a per-artist folder, e.g. "songs/Kanye West", where each song's lyrics are stored in its own text file.
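The per-artist folder layout described above could be sketched as follows. The `save_artist_songs` helper and the placeholder song list are hypothetical; in the actual notebook the titles and lyrics come back from the Genius API search rather than being passed in by hand.

```python
import os

def save_artist_songs(artist, songs, out_dir="songs"):
    """Write each (title, lyrics) pair into songs/<Artist>/<title>.txt."""
    artist_dir = os.path.join(out_dir, artist)
    os.makedirs(artist_dir, exist_ok=True)
    for title, lyrics in songs:
        # Sanitize the title so it is a safe filename.
        safe = "".join(c for c in title if c.isalnum() or c in " -_").strip()
        with open(os.path.join(artist_dir, safe + ".txt"), "w", encoding="utf-8") as f:
            f.write(lyrics)

# Placeholder songs (real titles and lyrics come from the Genius API search).
save_artist_songs("Kanye West", [("Stronger", "work it harder ..."),
                                 ("Heartless", "in the night ...")])
```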
lyricgenerator.ipynb: Searches the song folder, iterates through each artist, and adds each individual song to one large CSV with columns ['Artist'] and ['Lyric']. Each song is then pre-processed: stop words are removed, unknown characters are stripped, words are stemmed, and the lyrics are cleaned in preparation for training. The result is written to preprocessed_lyric.csv. We then tokenize each lyric in this CSV, label-encode each artist, and split the data into training, testing, and validation sets (saved under data_splits for further examination).

From there we load the data and create an MLP model: we one-hot encode the target variable, add early stopping, and then train and test the model's accuracy on the subset of 50 songs available at the moment. Next we set up the RNN model, which uses the same training data (from the preprocessed CSV through the train/test/validation split). The LSTM-based model consists of three LSTM layers with 128 neurons each, with dropout layers (rate 0.2) between them to mitigate overfitting. We feed this model all of the lyric training data so that it can pick up syntactic and rhythmic patterns.

For lyric generation, we concatenated all the lyrics for each artist and randomly selected two words from that artist's corpus as a seed for the LSTM model; each subsequent word is then predicted from the words that precede it, starting from that two-word seed. The generated lyrics were written to a CSV file with the same formatting as the training data, so that the MLP model can make predictions on them. Our MLP model has nine layers with varying numbers of neurons and activation functions; it computes weighted sums to learn connections between lyrics and the artists they are typically associated with. This gives us an objective way of measuring whether the lyrics we generate resemble the actual lyrics an artist has written.
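The pre-processing step (stop-word removal, stemming, cleaning) could look roughly like the sketch below. The tiny stop-word set and the suffix-stripping "stemmer" are simplifications for illustration; a real notebook would more likely use NLTK's stop-word list and PorterStemmer.

```python
import re

# Tiny illustrative stop-word list (a real pipeline would use NLTK's full list).
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "in", "is", "it", "i", "you"}

def clean_lyric(text):
    """Lowercase, keep only letters, drop stop words, and crudely stem."""
    words = re.findall(r"[a-z']+", text.lower())
    kept = [w for w in words if w not in STOP_WORDS]
    stemmed = []
    for w in kept:
        # Very rough stemming: strip one common suffix if the stem stays long enough.
        for suf in ("ing", "ed", "s"):
            if w.endswith(suf) and len(w) > len(suf) + 2:
                w = w[: -len(suf)]
                break
        stemmed.append(w)
    return " ".join(stemmed)

print(clean_lyric("I am walking in the night"))
```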
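The label-encoding and train/test/validation split might be sketched as below. These plain-Python stand-ins mirror what sklearn's `LabelEncoder` and `train_test_split` would do; the 80/10/10 ratio is an assumption, not stated in the notebook.

```python
import random

def label_encode(artists):
    """Map each artist name to an integer id, like sklearn's LabelEncoder."""
    classes = sorted(set(artists))
    index = {a: i for i, a in enumerate(classes)}
    return [index[a] for a in artists], classes

def train_test_val_split(rows, train=0.8, test=0.1, rng=None):
    """Shuffle rows and cut them into train/test/validation lists."""
    rng = rng or random.Random(42)
    rows = rows[:]
    rng.shuffle(rows)
    n_train, n_test = int(len(rows) * train), int(len(rows) * test)
    return (rows[:n_train],
            rows[n_train:n_train + n_test],
            rows[n_train + n_test:])

labels, classes = label_encode(["Kanye West", "Drake", "Kanye West"])
```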
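The three-layer LSTM with dropout described above might look like this Keras architecture sketch. Only the layer counts, layer sizes, and dropout rate come from the description; the vocabulary size, sequence length, embedding size, and optimizer are assumptions.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense

VOCAB_SIZE = 5000  # assumed vocabulary size
SEQ_LEN = 10       # assumed input sequence length

model = Sequential([
    Embedding(VOCAB_SIZE, 64, input_length=SEQ_LEN),  # assumed embedding size
    LSTM(128, return_sequences=True),  # three LSTM layers, 128 neurons each,
    Dropout(0.2),                      # with dropout 0.2 between them
    LSTM(128, return_sequences=True),
    Dropout(0.2),
    LSTM(128),
    Dropout(0.2),
    Dense(VOCAB_SIZE, activation="softmax"),  # next-word distribution
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
```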
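The seed-and-extend generation loop could be sketched like this. `dummy_predict` is a hypothetical stand-in for the trained LSTM's next-word prediction, which in the real notebook maps a tokenized context window to a softmax over the vocabulary.

```python
import random

def generate_lyric(corpus_words, predict_next, length=20, seed_len=2, rng=None):
    """Seed with `seed_len` random words from the artist's concatenated lyrics,
    then let the model predict each following word from the running context."""
    rng = rng or random.Random(0)
    out = [rng.choice(corpus_words) for _ in range(seed_len)]
    while len(out) < length:
        out.append(predict_next(out))
    return " ".join(out)

def dummy_predict(context):
    # Stand-in for the LSTM: trivially repeat the last word.
    return context[-1]

words = "light night harder stronger faster".split()
print(generate_lyric(words, dummy_predict, length=5))
```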
After training, we tested our model and its accuracy came to 0.725. This was strong enough for us to continue and classify the generated results. We tokenized the generated lyrics in the same fashion as the preprocessed data and, across over 200 generated songs, the classifier reached an accuracy of 0.895. To be sure, we want to note that there may be some overfitting due to the limited corpus of words for each artist, so there could be some bias when classifying the generated lyrics for each artist.
Genius File: This file uses the Genius API to collect each artist's songs and their lyrics, as described under genius.ipynb above.