From e2822716233eededcd0dc49a62885d2f2c373fb0 Mon Sep 17 00:00:00 2001
From: Lasse Hansen
Date: Mon, 16 Oct 2023 12:15:51 +0200
Subject: [PATCH 1/5] docs: add daradio datasheet

---
 docs/datasheets/daradio.md | 130 +++++++++++++++++++++++++++++++++++++
 1 file changed, 130 insertions(+)
 create mode 100644 docs/datasheets/daradio.md

diff --git a/docs/datasheets/daradio.md b/docs/datasheets/daradio.md
new file mode 100644
index 00000000..02a765d4
--- /dev/null
+++ b/docs/datasheets/daradio.md
@@ -0,0 +1,130 @@
+# DaRadio Datasheet
+
+*Version*: 1.0.0
+
+*Homepage*: https://github.com/centre-for-humanities-computing/danish-foundation-models
+
+*License*: Not publicly available.
+
+---
+
+DaRadio consists of radio broadcasts from the Danish radio stations DR P1 and Radio24Syv, and contains approximately 140.000 hours of speech. DaRadio includes all shows aired on DR P1 from 2005 to 2021, and all shows aired on Radio24Syv from 2011 to 2019.
+
+DaRadio has been deduplicated using a series of heuristics based on metadata. For more on deduplication, see the data cleaning section further below.
+
+
+## Datasheet
+
+Following the recommendation and framework of [1], we add the following datasheet.
+
+### Motivation:
+
+**For what purpose was the dataset created? Who created the dataset? Who funded the creation of the dataset?**
+
+Data included in DaRadio was collected following the Danish [Legal Deposit Act](https://www.retsinformation.dk/eli/lta/2004/1439) by the Royal Danish Library (RDL). From this, a dataset of Danish speech-only radio was derived by RDL. The dataset was created for research purposes, including training a Danish wav2vec2.0 model.
+
+The dataset was preprocessed to remove duplicates by a team of researchers at the Center for Humanities Computing, Aarhus University (CHC) with collaborators from the Danish speech-processing company [Alvenir](alvenir.ai).
+
+**Any other comments?**
+
+No.
+
+
+## Composition
+
+**What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?**
+
+Instances of the dataset include an mp3 file for each show aired on the two staions within the period. Further metadata include information on date and time of airing, title, short description of the show, and various internal identifiers used by DNL.
+
+**How many instances are there in total (of each type, if appropriate)?**
+
+DaRadio consists of a total of 215.582 hours of unprocessed Danish speech radio shows across two stations, DR P1 and Radio24syv. The table below shows the distribution over the stations with and without heuristic rerun removal.
+
+
+| Source     | Duration (hours) | Reruns removed |
+|------------|------------------|----------------|
+| P1         | 145.160          | False          |
+|            | 97.401           | True           |
+| Radio24syv | 70.422           | False          |
+|            | 44.569           | True           |
+
+
+**Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?**
+
+The dataset contains all shows from the two stations in the time period (2005-2021 for DR P1 and 2011-2019 for Radio24syv).
+
+**If the dataset is a sample from a larger set, what was the sampling strategy?**
+
+The dataset is a subset of all Danish radio. The two stations were chosen for the dataset as they are talk-radio only.
+
+
+**Who was involved in the data collection process?**
+
+The Royal Danish Library collects Danish radio shows and constructed DaRadio for handing to researchers at CHC.
+
+
+**Over what timeframe was the data collected?**
+
+The dataset includes radio shows from the period 2005 to 2021.
+
+**Were any ethical review processes conducted?**
+
+The Royal Danish Library collects radio shows in adherence to Danish Archival laws. DaRadio was constructed for a research project, for which a project proposal was accepted by RDL. No other ethical review processes were conducted.
+
+
+## Preprocessing/cleaning/labeling
+
+**Was any preprocessing/Cleaning/Labeling of the data done
+(e.g., discretization or bucketing, tokenization, part-of-speech tagging,
+SIFT feature extraction, removal of instances, processing of missing values)?**
+
+DaRadio has been deduplicated using a series of heuristic filters and all files have been converted to 16 Khz .wav files.
+
+Reruns/duplicates were identified by the following rules:
+
+- If the phrase "sendt første gang" ["aired the first time"] or "genudsendelse" ["rerun"] appeared in the show description.
+- If the title contained "(G)" (short for "genudsendelse"))
+- If the show was broadcast between 23:00 and 5:00.
+
+
+
+**Is the software used to preprocess/clean/label the instances available?**
+
+The scripts are available at the following GitHub repository: [link](https://github.com/centre-for-humanities-computing/Gjallarhorn).
+
+## Uses
+
+**Has the dataset been used for any tasks already?**
+
+Yes, the dataset has been used to pre-train a [Danish wav2vec2.0 model.](https://huggingface.co/chcaa/xls-r-300m-danish)
+
+**Is there a repository that links to any or all papers or systems that use the dataset?**
+
+No.
+
+**What (other) tasks could the dataset be used for?**
+
+As the dataset only contains un-labelled data, i.e. no transcriptions, it is mainly useful for pre-training language models without further processing. However, given the metadata and re-occuring hosts, further processing might make it possible to train e.g. text-to-speech systems.
+
+**Is there anything about the composition of the dataset or the way it was collected and
+preprocessed/cleaned/labeled that might impact future uses?**
+
+This dataset is static and does not evolve over time with the language, thus will become increasingly outdated over time.
+
+
+## Distribution
+
+**Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?**
+
+No.
+
+
+### Citation
+If you wish to cite this work please see our GitHub page for an up to date citation: https://github.com/centre-for-humanities-computing/danish-foundation-models
+
+
+### References:
+
+- [1] T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III,
+  and K. Crawford. Datasheets for datasets. arXiv preprint arXiv:1803.09010, 2018.
+

From 33267e50bb66b2b0455f2a1ccb038b5577628699 Mon Sep 17 00:00:00 2001
From: Lasse Hansen
Date: Mon, 16 Oct 2023 13:46:43 +0200
Subject: [PATCH 2/5] Update docs/datasheets/daradio.md

Co-authored-by: Kenneth Enevoldsen
---
 docs/datasheets/daradio.md | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/docs/datasheets/daradio.md b/docs/datasheets/daradio.md
index 02a765d4..8e740651 100644
--- a/docs/datasheets/daradio.md
+++ b/docs/datasheets/daradio.md
@@ -25,9 +25,6 @@ Data included in DaRadio was collected following the Danish [Legal Deposit Act](
 
 The dataset was preprocessed to remove duplicates by a team of researchers at the Center for Humanities Computing, Aarhus University (CHC) with collaborators from the Danish speech-processing company [Alvenir](alvenir.ai).
 
-**Any other comments?**
-
-No.
 
 
 ## Composition

From 5a1a3cdb41f145f466875d87ced1ae5be9256960 Mon Sep 17 00:00:00 2001
From: Lasse Hansen
Date: Mon, 16 Oct 2023 13:47:31 +0200
Subject: [PATCH 3/5] Apply suggestions from code review

Co-authored-by: Kenneth Enevoldsen
---
 docs/datasheets/daradio.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/datasheets/daradio.md b/docs/datasheets/daradio.md
index 8e740651..f9ddab27 100644
--- a/docs/datasheets/daradio.md
+++ b/docs/datasheets/daradio.md
@@ -31,7 +31,7 @@ The dataset was preprocessed to remove duplicates by a team of researchers at th
 
 **What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?**
 
-Instances of the dataset include an mp3 file for each show aired on the two staions within the period. Further metadata include information on date and time of airing, title, short description of the show, and various internal identifiers used by DNL.
+Instances of the dataset include an mp3 file for each show aired on the two staions within the period. Further metadata include information on date and time of airing, title, short description of the show, and various internal identifiers used by RDL.
 
 **How many instances are there in total (of each type, if appropriate)?**
 
@@ -97,11 +97,11 @@ Yes, the dataset has been used to pre-train a [Danish wav2vec2.0 model.](https:/
 
 **Is there a repository that links to any or all papers or systems that use the dataset?**
 
-No.
+No, but as of 23/10/16 no others have used the dataset.
 
 **What (other) tasks could the dataset be used for?**
 
-As the dataset only contains un-labelled data, i.e. no transcriptions, it is mainly useful for pre-training language models without further processing. However, given the metadata and re-occuring hosts, further processing might make it possible to train e.g. text-to-speech systems.
+As the dataset only contains un-labelled data, i.e. no transcriptions, it is mainly designed for pre-training language models. However, given the metadata and re-occuring hosts, further processing might make it possible to train e.g. text-to-speech systems.
 
 **Is there anything about the composition of the dataset or the way it was collected and
 preprocessed/cleaned/labeled that might impact future uses?**

From 15fd1fdb8efbbf9e93c338047b9fed19a5e202f4 Mon Sep 17 00:00:00 2001
From: Lasse Hansen
Date: Mon, 16 Oct 2023 13:49:04 +0200
Subject: [PATCH 4/5] Apply suggestions from code review

---
 docs/datasheets/daradio.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/datasheets/daradio.md b/docs/datasheets/daradio.md
index f9ddab27..81112011 100644
--- a/docs/datasheets/daradio.md
+++ b/docs/datasheets/daradio.md
@@ -57,7 +57,7 @@ The dataset is a subset of all Danish radio. The two stations were chosen for th
 
 **Who was involved in the data collection process?**
 
-The Royal Danish Library collects Danish radio shows and constructed DaRadio for handing to researchers at CHC.
+The RDL collects Danish radio shows and constructed DaRadio for handing to researchers at CHC.
 
 
 **Over what timeframe was the data collected?**
@@ -66,7 +66,7 @@ The dataset includes radio shows from the period 2005 to 2021.
 
 **Were any ethical review processes conducted?**
 
-The Royal Danish Library collects radio shows in adherence to Danish Archival laws. DaRadio was constructed for a research project, for which a project proposal was accepted by RDL. No other ethical review processes were conducted.
+The RDL collects radio shows in adherence to Danish Archival laws. DaRadio was constructed for a research project, for which a project proposal was accepted by RDL. No other ethical review processes were conducted.
 
 
 ## Preprocessing/cleaning/labeling

From 696a9db213d9f58ddea2e21c24c4a7ab8dc84d3f Mon Sep 17 00:00:00 2001
From: Lasse Hansen
Date: Mon, 16 Oct 2023 13:50:47 +0200
Subject: [PATCH 5/5] Apply suggestions from code review

---
 docs/datasheets/daradio.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/datasheets/daradio.md b/docs/datasheets/daradio.md
index 81112011..2710db84 100644
--- a/docs/datasheets/daradio.md
+++ b/docs/datasheets/daradio.md
@@ -83,6 +83,7 @@ Reruns/duplicates were identified by the following rules:
 - If the title contained "(G)" (short for "genudsendelse"))
 - If the show was broadcast between 23:00 and 5:00.
 
+The deduplication was coded and conducted by researchers at CHC.
 
 
 **Is the software used to preprocess/clean/label the instances available?**
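
Note on the rerun rules: the three heuristics listed in the datasheet's Preprocessing section could be sketched in Python roughly as below. This is an illustration only and is not taken from the Gjallarhorn repository. The `Show` record, its field names, and the `is_probable_rerun` helper are hypothetical, and two behaviours are assumptions the datasheet does not specify: phrase matching is case-insensitive, and the night window treats 23:00 as inclusive and 05:00 as exclusive.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class Show:
    """Hypothetical metadata record; field names are assumptions."""
    title: str
    description: str
    start: time  # local time of airing


# Phrases from the datasheet: "aired the first time" and "rerun".
RERUN_PHRASES = ("sendt første gang", "genudsendelse")


def is_probable_rerun(show: Show) -> bool:
    """Return True if any of the three datasheet rules matches."""
    # Rule 1: a rerun phrase appears in the description
    # (case-insensitive matching is an assumption).
    description = show.description.lower()
    if any(phrase in description for phrase in RERUN_PHRASES):
        return True

    # Rule 2: the title carries the "(G)" marker.
    if "(G)" in show.title:
        return True

    # Rule 3: aired at night. The window wraps past midnight, so it is
    # an OR of two comparisons (boundary handling is an assumption).
    return show.start >= time(23, 0) or show.start < time(5, 0)
```

Because the rules are combined with a logical OR, the filter favours removing too much over keeping duplicates: any show aired at night is dropped even if it was a first broadcast.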