Add datasheets #265

KennethEnevoldsen · 2024-04-24T11:30:17Z

Agreed with @peterbjorgensen to add datasheets to our datasets.

@jankounchained, will you add one for NCC?
@TTTTao725, will you add one for the datasets you created? We can discuss it next time you are in.
@peterbjorgensen

For now, feel free to keep them minimal (we can always expand them later). Is there anything the datasheet should contain at a minimum?

Columns (at least a description of non-standard ones)
Source
Data transformations

It might be useful to look at:

T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford. Datasheets for datasets. arXiv preprint arXiv:1803.09010, 2018.

previous datasheets:
https://github.com/centre-for-humanities-computing/danish-foundation-models/tree/main/docs/datasheets

peterbjorgensen · 2024-04-25T12:42:29Z

As I wrote in another issue:

For each dataset we should have a dataset card or datasheet in the same style as Hugging Face dataset cards: https://huggingface.co/docs/hub/datasets-cards
I prefer to have the dataset cards on GitHub to be able to track changes. The filename of the datasheet should be the same as the "source" identifier, i.e. {source}.md.
The data card contains a header in YAML to make it machine-readable, which is then followed by descriptions in Markdown.
I see that it can be a problem if a dataset contains sub-sources with different licenses, for example. In that case the license field in the YAML should be a dictionary that maps from sub-sources to a specific license.

Alternatively, the license field could be a keyword, e.g. "multiple", and then we add a "license" field in the "metadata" dictionary of each document. I think I prefer the YAML dictionary approach, because the idea is that the datasheets make it possible to select datasets based on the metadata without reading through the actual data first.

License should also be a required field in my opinion.
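
To make the dictionary approach concrete, here is a minimal sketch of how selection by license could work once the YAML header has been parsed. It is only an illustration under stated assumptions: the headers are shown as already-parsed Python dictionaries, the source and sub-source names and license strings are made up, and the `docs/datasheets` directory and all function names are hypothetical.

```python
from pathlib import Path

# Hypothetical, already-parsed YAML headers keyed by "source" identifier.
# All names and license strings here are invented for illustration.
HEADERS = {
    # Sub-sources with different licenses: license is a dictionary.
    "example_corpus": {"license": {"news": "cc0-1.0", "forum": "other"}},
    # A single license for the whole dataset: license is a plain string.
    "example_single": {"license": "cc-by-4.0"},
}


def datasheet_path(source, root=Path("docs/datasheets")):
    """Return the datasheet path for a source, following {source}.md.

    The root directory is an assumption, not something fixed in this issue.
    """
    return root / f"{source}.md"


def licenses_by_subsource(source, header):
    """Normalise the license field to a {sub_source: license} mapping.

    Raises ValueError if the field is missing, treating it as required.
    """
    license_field = header.get("license")
    if license_field is None:
        raise ValueError(f"{source}: 'license' is a required field")
    if isinstance(license_field, str):
        # One license for everything: use the source name as the only key.
        return {source: license_field}
    if isinstance(license_field, dict):
        return dict(license_field)
    raise TypeError(f"{source}: 'license' must be a string or a mapping")


def select_subsources(headers, allowed):
    """List (source, sub_source) pairs whose license is in `allowed`."""
    selected = []
    for source, header in headers.items():
        for sub_source, lic in licenses_by_subsource(source, header).items():
            if lic in allowed:
                selected.append((source, sub_source))
    return selected


if __name__ == "__main__":
    print(datasheet_path("example_corpus"))
    print(select_subsources(HEADERS, {"cc0-1.0", "cc-by-4.0"}))
```

With this shape, filtering happens purely on the datasheet metadata, so the documents themselves never have to be opened. The "multiple" keyword alternative would instead require reading each document's metadata to find its license.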

KennethEnevoldsen assigned KennethEnevoldsen, peterbjorgensen, jankounchained and TTTTao725 Apr 24, 2024
