Our data is a small, random sample of the song_lyrics dataset on HuggingFace, which can be found here. Of the sampled data, 80% is used for training, 10% for validation, and 10% for testing.
Preprocessing for this task involves isolating the relevant features from the sampled data (lyrics and genre) and tokenizing the genre according to the following mapping (a sketch of applying the mapping appears after the table):
| Genre   | Token |
|---------|-------|
| pop     | 0     |
| rap     | 1     |
| rock    | 2     |
| rb      | 3     |
| misc    | 4     |
| country | 5     |
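
As an illustration, here is a minimal pandas sketch of how this mapping could be applied during preprocessing; the column names `lyrics` and `genre` are assumptions and may differ from the actual sampled file.

```python
import pandas as pd

# Genre-to-token mapping from the table above.
GENRE_TO_TOKEN = {"pop": 0, "rap": 1, "rock": 2, "rb": 3, "misc": 4, "country": 5}

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the lyrics and genre columns and tokenize the genre."""
    out = df[["lyrics", "genre"]].copy()
    out["genre"] = out["genre"].map(GENRE_TO_TOKEN)
    # Drop rows whose genre is not one of the six mapped labels.
    return out.dropna(subset=["genre"])
```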
Training and evaluation were carried out with the NLPScholar toolkit. The model can be reproduced by running main.py with train_config.yaml as an additional argument, and our evaluation results can be replicated by running main.py with eval_config.yaml. Performance metrics for the resulting predictions in preds.tsv can then be computed with sklearn.
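
For concreteness, a sketch of that last step is below, assuming preds.tsv is a tab-separated file containing gold and predicted labels; the column names used here (`true`, `predicted`) are assumptions about its layout, not NLPScholar's documented output format.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report

# Reproduce the runs described above:
#   python main.py train_config.yaml   # training
#   python main.py eval_config.yaml    # evaluation -> preds.tsv

# Column names below are assumptions about the preds.tsv layout.
preds = pd.read_csv("preds.tsv", sep="\t")
y_true = preds["true"]
y_pred = preds["predicted"]

print("Accuracy:", accuracy_score(y_true, y_pred))
# Per-genre precision, recall, and F1 for the six genre tokens.
print(classification_report(y_true, y_pred))
```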
Using a random ordering of the dataset, we create an 80:10:10 split for training, validation, and evaluation, respectively. From there, we progressively halve the data to produce datasets of various sizes, ensuring that training and evaluation could happen in a reasonable amount of time. Preprocessing removed the unnecessary title column and renamed the column headers to match what NLPScholar expects.
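
A rough sketch of this split-and-halve procedure is below; the file path, column names, and the choice to halve the training portion (rather than the whole sample) are illustrative assumptions.

```python
import pandas as pd

# Illustrative path and column names; the actual sample file may differ.
df = pd.read_csv("sampled_lyrics.csv")
df = df.drop(columns=["title"])            # remove the unneeded title column
df = df.sample(frac=1, random_state=42)    # random ordering before splitting
# df = df.rename(columns={...})            # rename headers to what NLPScholar expects

# 80:10:10 split for training, validation, and evaluation.
n = len(df)
train = df.iloc[: int(0.8 * n)]
val = df.iloc[int(0.8 * n): int(0.9 * n)]
test = df.iloc[int(0.9 * n):]

# Progressively halve the training set to get full / half / quarter / eighth versions.
subsets = {"full": train}
for name in ("half", "quarter", "eighth"):
    train = train.iloc[: len(train) // 2]
    subsets[name] = train
```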
We trained a series of models with varying dataset sizes (full, half, quarter, eighth) in order to guarantee results within the timeframe we needed. BERT-cased was selected as our model on the recommendation of Professor Forest. The full model finished training within 24 hours, so the smaller models were not needed. The updated train config file is now included in the GitHub repository.
Metrics for each of the six genres can be found in analysis.tsv. The model performs well on genres that are overrepresented in the training data, such as pop and rap, but also performs remarkably well on misc, which occurs far less frequently in the training data. The two worst genres for classification, country and R&B, were also the genres with the lowest representation in the training data.