collecting_data_assignment4

Intended use

The corpus is intended for analysing the lyrics of singer-songwriter and producer Grimes. Anyone with an interest in her musical projects can download and use it.

Corpus collection and selection

The corpus of 16 songs was manually selected and downloaded from Grimes' artist page on Genius. Songs were selected according to three criteria: (1) Grimes is the only singer on the song. (2) The song is among Grimes' most-viewed songs on Genius, as indicated by the view counter. (3) The corpus should be balanced across different EPs and albums. Despite this, songs from the Visions and especially the Art Angels albums may be overrepresented.

Metadata collection

I gathered metadata for each of the 16 selected songs. The following values were collected manually from Grimes' artist page on Discogs:

title	length	release_year	ablum_title
song title	length in minutes	release year (integer)	title of the album, EP or other release
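
As a minimal sketch, the metadata could be loaded with Python's standard csv module as shown below. The comma delimiter, the "minutes:seconds" length format and the sample row are assumptions made for illustration only and are not taken from metadata.csv; the column names, including the spelling ablum_title, follow the table above.

```python
import csv
import io


def read_metadata(handle):
    """Read metadata rows and convert release_year to an integer."""
    rows = []
    for row in csv.DictReader(handle):
        row["release_year"] = int(row["release_year"])
        rows.append(row)
    return rows


# Placeholder row for illustration only (not real corpus data).
sample = io.StringIO(
    "title,length,release_year,ablum_title\n"
    "Example Song,3:30,2012,Example Album\n"
)
metadata = read_metadata(sample)
```

For the real file, the StringIO object would be replaced by an opened file handle, for example open("metadata.csv", newline="", encoding="utf-8").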

Cleaning and preprocessing

The texts downloaded from Genius are not completely 'raw': they contain section labels in brackets that follow the song structure (e.g. "chorus", "verse 1", "outro"). I used these labels to split each song into its parts and added the parts to the CSV file. This makes it possible, for instance, to compare all the choruses in the corpus. After creating the song parts, I removed the bracketed labels to obtain the 'raw' lyric text. This cleaned text was assigned to the Text column, and only this text was tokenized and lemmatized.
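
The splitting step described above can be sketched roughly as follows. This is not the notebook's actual code: the function names, the regular expression and the sample text are hypothetical, and the sketch assumes that each section label sits on its own line in square brackets.

```python
import re

# Assumed header format: a bracketed label alone on a line, e.g. "[Chorus]" or "[Verse 1]".
SECTION_HEADER = re.compile(r"^\[([^\]]+)\][ \t]*$", re.MULTILINE)


def normalize_label(label):
    """Turn a header such as 'Verse 1' into a label such as 'verse_1'."""
    return re.sub(r"\s+", "_", label.strip().lower())


def split_song_parts(lyrics):
    """Return (song_part, text) pairs with the bracketed headers removed.

    Any text before the first header is ignored in this sketch.
    """
    parts = []
    matches = list(SECTION_HEADER.finditer(lyrics))
    for i, match in enumerate(matches):
        start = match.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(lyrics)
        parts.append((normalize_label(match.group(1)), lyrics[start:end].strip()))
    return parts


# Placeholder text for illustration only (not actual lyrics).
sample = (
    "[Verse 1]\nfirst placeholder line\nsecond placeholder line\n\n"
    "[Chorus]\nla-la-la\n"
)
parts = split_song_parts(sample)
choruses = [text for label, text in parts if label == "chorus"]
```

Filtering on the label, as in the last line, is one way to collect all choruses for comparison.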

Annotations added and tools used

The annotations were added using a spaCy Jupyter Notebook tutorial. Some notes were added to evaluate and comment on the quality of the annotations.

Format of corpus files

The final CSV file contains the following columns:

Column	Description
Filename	Filename
length	length in minutes
release_year	year of release
ablum_title	Title of the release (can also be an EP or Single)
Text	Cleaned lyric
Song Part	As the Genius lyrics were annotated, I assigned the lyrics to the part they belong to, i.e. "verse_1" or "chorus"
Doc	The doc element created using spaCy
Tokens	Tokenization of the entire text
Lemmas	Lemmatization of the entire text

Quality checks

The doc element created with the spaCy pipeline contained errors in the more complex annotation tasks. For instance, in Named Entity Recognition (NER), vocalizations such as 'la-la-la' were recognized by spaCy as persons or organisations. spaCy seems to struggle with annotating a lyrical corpus, which might be due to a lack of lyrics in the pipeline's training material. I therefore removed these annotations in the updated version, so that the enriched CSV file only contains well-processed annotations. Concretely, the updated CSV file no longer contains the following columns: Proper_nouns, POS, NER. Further explanation is given in the Jupyter Notebook.
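
Removing the three columns could look roughly like the sketch below, which uses only Python's standard csv module. The function name drop_columns and the sample row are hypothetical; the notebook may well do this differently, for example with pandas.

```python
import csv
import io

# Columns named in the quality check above.
DROPPED_COLUMNS = {"Proper_nouns", "POS", "NER"}


def drop_columns(src, dst, dropped=DROPPED_COLUMNS):
    """Copy CSV rows from src to dst, omitting the named columns."""
    reader = csv.DictReader(src)
    kept = [name for name in reader.fieldnames if name not in dropped]
    # extrasaction="ignore" silently skips the dropped fields in each row.
    writer = csv.DictWriter(dst, fieldnames=kept, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return kept


# Placeholder row for illustration only (values are made up).
src = io.StringIO("Filename,Text,POS,NER\nexample.txt,la-la-la,INTJ,PERSON\n")
dst = io.StringIO()
kept = drop_columns(src, dst)
```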
