Background

In 2015, Meek Mill famously accused Drake of having a verse ghost-written by Quentin Miller. In the ensuing fallout, an extensive debate took place over the role of ghostwriting in hip-hop and the extent to which Drake relied on others for his lyrics. While Miller was given 4 credits on Drake's 2015 album If You're Reading This It's Too Late, it was never fully settled how much of those songs was attributable to Miller, or whether his influence extended beyond them.

Goals / Research Questions

This project has two main goals:

  1. See how effective authorial classification can be on small- to medium-sized corpora such as song lyrics.
  2. Test whether Drake's lyrical style diverged from his style on other projects, either across If You're Reading This It's Too Late as a whole or only in the songs co-written by Miller.

Methods

I scraped lyrics from 138 Quentin Miller songs and 260 Drake songs, using only lyrics that each artist performed himself and discarding verses from featured artists. I removed obvious identifiers such as the words "Drake," "Miller," and "Toronto." I then stemmed the corpus, stripping conjugations and other word endings so that "walked" and "walks" are both treated as the base form "walk."
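
For concreteness, a minimal sketch of that preprocessing step might look like the following (the identifier list and the clean_lyrics helper are illustrative stand-ins, not the exact code used in the repo):

```python
# Minimal preprocessing sketch (assumes NLTK is installed; clean_lyrics is a hypothetical helper).
import re
from nltk.stem import PorterStemmer

IDENTIFIERS = {"drake", "miller", "toronto"}  # obvious giveaway words removed from the corpus
stemmer = PorterStemmer()

def clean_lyrics(text: str) -> str:
    """Lowercase, drop identifier words, and stem every remaining token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    kept = [stemmer.stem(tok) for tok in tokens if tok not in IDENTIFIERS]
    return " ".join(kept)

print(clean_lyrics("Walked through the city with my woes"))  # -> "walk through the citi with my woe"
```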

I used seven types of models to classify songs by author. The first three used linear methods: logistic regression, support vector machines, and linear classification with stochastic gradient descent. Models four through six used ensemble methods: random forest, AdaBoost, and gradient boosting. All six of these models used a "bag of words" approach in which songs are transformed into vectors of word frequencies. In particular, after testing multiple models, I found bigrams (counting both individual word frequencies and word-pairing frequencies) to be the most effective way to vectorize the data. I also cross-validated transformations using term frequency-inverse document frequency (tf-idf), a technique that places more weight on rare words that one author disproportionately uses and less weight on words that appear in most of the documents. The seventh model was an ensemble built on top of the previous models to aggregate their predictions.
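
The repo contains the actual training code; as an illustrative sketch of the general setup described above, a scikit-learn pipeline along these lines captures the bigram vectorization, the optional tf-idf scaling, and one of the linear classifiers (the parameter grid and variable names here are placeholders, not the values actually used):

```python
# Illustrative sketch of the bigram + tf-idf + linear classifier setup (scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("counts", CountVectorizer(ngram_range=(1, 2))),  # unigrams + bigrams ("bag of words" + word pairings)
    ("tfidf", TfidfTransformer()),                     # optional tf-idf scaling
    ("clf", SGDClassifier(max_iter=1000)),             # linear classifier trained with SGD
])

# Cross-validate over whether tf-idf is used and over the regularization strength.
param_grid = {
    "tfidf": [TfidfTransformer(), "passthrough"],
    "clf__alpha": [1e-5, 1e-4, 1e-3],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
# search.fit(train_lyrics, train_labels)   # train_lyrics: list of song strings, train_labels: "drake"/"miller"
```

The same pipeline shape applies to the other classifiers by swapping out the final step.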

Results

The linear models obtained between 85 and 88 percent accuracy during cross-validation. The ensemble models performed slightly worse, at 79 to 84 percent accuracy. For what it's worth, our models reached nearly 100 percent training accuracy after cross-validation. More importantly, all of the models overperformed on the held-out set: six of them achieved between 86 and 93 percent test accuracy, the exception being the AdaBoost model, which performed poorly on the held-out set (75 percent accuracy).

Interestingly, all of the linear models elected to use tf-idf transformations during cross-validation, whereas the ensemble methods performed better without tf-idf scaling. The superior performance of the linear models suggests that there are linear relationships between how frequently a (tf-idf scaled) term appears and who the author is. I should qualify this observation by noting that the non-linear models had more unexplored hyperparameters, so with more thorough tuning there may yet be non-linear relationships to be learned between frequency counts and authorship. A deeper neural network, or even a deeper random forest, might also have squeezed out more performance given more computing power.

Our best model was the stochastic gradient descent linear model, which had the highest validation accuracy (93 percent). Our ensemble aggregation method also achieved 93 percent test accuracy, but I opted for SGD as our best model due to the ensemble method's lack of empirical rigor and interpretability. On the test set, the SGD model misclassified only 1 of the 15 Miller songs and 1 of the 40 Drake songs. While we shouldn't be overly optimistic, this is a very encouraging result. We are more worried about "false positives" - incorrectly predicting Miller when the song was actually Drake's all along - than "false negatives" - failing to identify Miller and instead attributing a Miller song to Drake. If we often incorrectly predict a Drake song to be by Miller during supervised testing, we can't be confident in any Miller prediction on our ambiguous set, iyrtitl. The worst that can come from incorrectly predicting Drake for Miller songs is that there are Miller-influenced or Miller-written songs that our model didn't pick up. A 1/40 false positive rate is not bad, whereas the 1/15 false negative rate is less concerning for us. In short, our model seems to lean toward predicting Drake, which means that when it does predict Miller, it usually is correct.

It is worth noting that the test set is relatively small, and the 96 percent test accuracy seems a bit fluky when put in the context of the 93 percent validation accuracy and the performance of the other models. Furthermore, if we were to play around with the priors, we could arrive at rather different estimates of how likely a song actually is to be from Miller given a Miller prediction from our model. This is to say, we can't be incredibly precise about how confident we are in a Miller prediction on iyrtitl. However, as it pertains to our first goal, all available evidence suggests there is some validity in our approach of modeling authorship via song lyrics.
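
To make the false-positive/false-negative distinction concrete, one quick check is the confusion matrix and the precision of Miller predictions on the held-out songs. The sketch below reuses the hypothetical names from the earlier pipeline sketch (a fitted `search`, plus `test_lyrics` / `test_labels` for the held-out songs):

```python
# Sketch: inspect false positives/negatives for the "miller" class on the held-out songs.
from sklearn.metrics import confusion_matrix, precision_score

predictions = search.best_estimator_.predict(test_lyrics)

# Rows = true author, columns = predicted author.
print(confusion_matrix(test_labels, predictions, labels=["drake", "miller"]))

# Precision of Miller predictions: of the songs we call "miller", how many really are?
print(precision_score(test_labels, predictions, pos_label="miller"))
```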

As for the second goal - better understanding Drake's lyrical style and Miller's potential influence on iyrtitl - I offer three cautious takeaways:

  1. The models, in aggregate, suggest that two of Drake's songs on If You're Reading This It's Too Late - "10 Bands" and "No Tellin" - were closer to Miller's style than other Drake songs. Our SGD model also indicates that "Now and Forever" is worth a second look in terms of authorship. That our models identified "10 Bands" as potentially written by Miller is encouraging. After the ghostwriting allegations came out, DJ Funk Flex leaked a "10 Bands" reference track written by Miller. Compared to the three other Quentin Miller reference tracks that were leaked, the "10 Bands" final product was the most similar to its reference track. In short, Miller wrote the first draft of "10 Bands," and the song was edited (by some combination of people) before the final product. Our model seems to have identified a certain Miller-esque style in the song even after editing, which suggests that our method has some potential to detect latent influence or authorship even when it is not direct copy-and-paste plagiarism. Again, the evidence from our models is far from conclusive, but from what we can gather, they do suggest that Miller may have had influence on multiple Drake songs from iyrtitl.

  2. The SGD model identified If You're Reading This It's Too Late as the most similar to Miller's style out of all the Drake albums. The margins here are not enormous: the average Drake song is predicted to come from Drake with just below 70 percent confidence, while the average iyrtitl song is predicted to come from Drake with about 58 percent confidence. This is probably within the margin of error and could be due to the fact that iyrtitl was the only album held out of the training set in its entirety. Retraining with the same hyperparameters while holding out a second album in its entirety could provide more insight into this possibility.

  3. Based on the coefficients of the models, I found that the artists do employ different vocabularies (a sketch of how these coefficients can be inspected follows this list). Thematic differences emerged, such as Drake's songs more frequently using romantic language like 'girl,' 'baby,' and 'night.' Meanwhile, our SGD model found Miller more likely to use the word "Nike," while other models found him more likely to talk about his "daughter." The models also identified ad-libs and catch-phrases such as "wait wait," "yeah yeah," and "yuh" that distinguish Quentin Miller's work. Finally, although a somewhat controversial subject, our models did key on how the artists used non-contextual words such as 'are,' 'let,' and 'cause' at different frequencies. I say controversial because these words are frequently used by both artists, and by most people in general. While evidence suggests authors tend to use non-contextual words at characteristic frequencies, we have to be cautious about whether differences in the frequency of common words reflect legitimate differences in distribution rather than random noise/variance. The best we can say is that the tf-idf scaling should partially control for this noise concern, so Drake stylistically probably does use the word "are" more frequently than Miller.
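
As referenced in the third takeaway, the per-term coefficients of the linear models are what drive these vocabulary observations. A minimal sketch of reading them off a fitted linear pipeline, again using the hypothetical names from the earlier sketches (and assuming "miller" is the positive class, which holds if the labels are "drake"/"miller"):

```python
# Sketch: read the most Drake-leaning and most Miller-leaning terms off the fitted linear model.
import numpy as np

best = search.best_estimator_
terms = best.named_steps["counts"].get_feature_names_out()   # unigram and bigram feature names
weights = best.named_steps["clf"].coef_[0]                    # one weight per term

order = np.argsort(weights)
print("Most Miller-leaning terms:", terms[order[-10:]])       # largest positive weights
print("Most Drake-leaning terms:", terms[order[:10]])         # largest negative weights
```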

Future Work on the Drake-Miller Controversy

  • Transforming the data: I used relatively modest stop-word removal. Other techniques range from keeping only non-contextual (function) words, by removing all contextual words, to keeping only contextual (content) words, by removing all non-contextual words. The former approach of looking only at the frequency of non-contextual words is often employed in stylometric analysis of authorship. The latter approach of focusing on contextual words is used to tease out thematic differences between classes. Both have their advantages and disadvantages. Studying only non-contextual words lacks interpretability: knowing that Drake uses the word 'as' 1.25 times as frequently as Miller doesn't tell us anything all that interesting, even if it is a decent predictor of Drake's style. Furthermore, the non-contextual word frequency approach to document classification is well founded in the literature, but is less well tested on song lyrics. When documents are the size of songs (250-900 words long), there is much more variance in frequency counts within a document, limiting the effectiveness of a purely non-contextual word approach. The thematic approach, on the other hand, is less widely used for resolving ambiguous authorship: Drake could very well tell Miller to write a song about a given theme such as 'girls' or 'Toronto,' and a purely thematic analysis would have a very difficult time knowing that Drake didn't write the song himself. Considering the challenges of using either method on its own, I opted for the middle road of using both and having the models figure out which bigrams are most relevant. However, future exploration might examine contextual and non-contextual words independently or otherwise transform the data in unique ways.
  • Supplementing the data: The relative ease of the binary classification problem limits how optimistic we can be about the results on the piece of ambiguous authorship. Even though our models suggest that portions of iyrtitl were more likely to be written by Quentin Miller than other Drake songs, it could well be that the album was simply different from other Drake projects and not actually closer to Miller's style; it could even be closer to some other artist Drake was trying to imitate at the time. Future attempts might add 'noise' to the task by including other artists and turning it into a multi-class problem, to see if iyrtitl's relative closeness to Miller's style still bears out. The data folder of this project already has albums from J. Cole (an artist found to be stylistically close to Drake) and PartyNextDoor (a frequent collaborator who has also received writing credits on Drake songs).
  • Changing document size: So far, we've analyzed songs as documents. A different approach might scale upwards by treating whole albums as single documents, or scale downwards by treating individual verses, bars, or lines as documents. The finer-grained approaches could be interesting in that they open the door to going beyond the bag-of-words or basic word-embedding approach and attempting to learn rhyme scheme, assonance, consonance, and meter.
  • Additional models: Beyond improving the existing models with better hyperparameters, other classification strategies could provide further insight into the data. Based on the success of linear modeling, LDA (or even QDA) might also be a solid classifier. Other unsupervised methods (PCA, clustering) could offer insight into similarities and better visualizations of the distance between individual songs and albums.
  • Full stacking algorithm: The final model presented is a naïve ensemble method that merely aggregates the predictions of the other models. There could be some advantages to this approach compared to simply picking a single model on validation accuracy, because of the disparate preprocessing procedures. At present, there was a small positive effect on test set performance compared to most of the individual models, which indicates there is some potential to the idea. One easy step to improve the current aggregate approach would be weighting sub-models based on their performance metrics. The most empirically robust approach, though, would be to implement a full stacking algorithm in which models are deployed based on their effectiveness on subsets of the feature space (see the sketch after this list). Stacking is generally conducted with weaker, more independent learners than those used in the present approach, so it would require a substantial rework of the base learners. However, considering the size and sparsity of the feature space, it is very possible some models do better than others in certain cases, meaning there could be additional performance gains to be had through stacking.
  • Further neural network exploration: The current convolutional neural network is a very simple model, in part due to limitations in processing power. Beyond experimenting with different architectures and hyperparameters, we could also try using pretrained word embeddings (e.g., GloVe, fastText, word2vec) to get better predictions while keeping training time manageable.
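
As a rough sketch of what the full stacking idea above could look like with scikit-learn's built-in StackingClassifier (the particular base learners and settings are placeholders rather than the models actually trained):

```python
# Sketch: a proper stacking ensemble over a few base learners, using scikit-learn's StackingClassifier.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def text_model(clf):
    """Wrap a classifier with its own unigram+bigram tf-idf vectorizer."""
    return make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)

stack = StackingClassifier(
    estimators=[
        ("sgd", text_model(SGDClassifier(max_iter=1000))),
        ("svm", text_model(LinearSVC())),
        ("rf", text_model(RandomForestClassifier(n_estimators=200))),
    ],
    final_estimator=LogisticRegression(),  # meta-learner combining the base models' out-of-fold predictions
    cv=5,                                  # out-of-fold predictions avoid leaking training labels
)
# stack.fit(train_lyrics, train_labels)
```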

Future Applications

  • Ghostwriting in hip-hop: So far we've looked at the most famous example of ghostwriting in hip-hop. This approach could be applied to rumored ghostwriting connections (see basically any Dr. Dre or P. Diddy song). It could also be applied to credited writers to see how much a song is a collaborative affair versus a case of the credited writer doing most of the lifting while the artist takes the credit. The most ambitious application, however, would be a 'cold start' approach in which we look for ghostwriting or plagiarism without knowing in advance who the most likely ghostwriter is. For instance, what if Miller had never been outed? Could an algorithm have identified Miller, out of a broad set of artists, as the real author of "10 Bands"? As mentioned above, the multi-class problem increases the complexity immensely, so at present 'cold start' identification seems unlikely. Somewhere in the middle - where a set of potential writers is identified in advance and machine learning then looks for stylistic similarity or influence - is probably the most promising area for future work.