Add datasheets #265

Open
KennethEnevoldsen opened this issue Apr 24, 2024 · 1 comment

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented Apr 24, 2024

Agreed with @peterbjorgensen to add datasheets to our datasets.

@jankounchained will you add one for NCC?
@TTTTao725 will you add one for the datasets you created? We can discuss it next time you are in.
@peterbjorgensen

For now, feel free to keep them minimal (we can always expand on them later). Is there anything we feel the datasheet should at least contain?

  • Columns (at least a description of non-standard ones)
  • Source
  • Data transformations

It might be useful to look at:

T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford. Datasheets for datasets. arXiv preprint arXiv:1803.09010, 2018.

previous datasheets:
https://github.com/centre-for-humanities-computing/danish-foundation-models/tree/main/docs/datasheets

@peterbjorgensen
Contributor

As I wrote in another issue:

For each dataset we should have a dataset card or datasheet in the same style as Hugging Face dataset cards: https://huggingface.co/docs/hub/datasets-cards
I prefer to have the dataset cards on github to be able to track changes. The filename of the datasheet should be the same name as the "source" identifier, i.e. {source}.md.
The data card contains a header in yaml to make it machine readable, which is then followed by descriptions in markdown.
I see that it can be a problem if the datasets contain sub-sources with different licenses for example. In that case the license field in the yaml should be a dictionary that maps from sub-sources to a specific license.
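
To make the dictionary idea concrete, here is a minimal sketch of how a consumer could treat both forms of the license field uniformly. It assumes the yaml header has already been parsed into a Python dict; the keys `source` and `license`, the function name, and the example source and sub-source names are all hypothetical, not an agreed schema.

```python
from collections.abc import Mapping

# Hypothetical parsed yaml headers. The source names, sub-source names and
# licenses below are made up for illustration only.
single_license_header = {"source": "example_source", "license": "cc0-1.0"}
multi_license_header = {
    "source": "example_mixed",
    "license": {"sub_a": "cc0-1.0", "sub_b": "cc-by-4.0"},
}


def licenses_by_subsource(header):
    """Return a {sub_source: license} dict for one datasheet header.

    A plain string license is assumed to cover the whole source and is
    keyed by the source identifier itself.
    """
    license_field = header["license"]
    if isinstance(license_field, str):
        return {str(header["source"]): license_field}
    if isinstance(license_field, Mapping):
        return {str(k): str(v) for k, v in license_field.items()}
    raise TypeError(f"unsupported license field: {license_field!r}")
```

With this shape, code that reads the datasheets never has to special-case single-license datasets.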

Alternatively, the license field could be a keyword, e.g. multiple, and then we add a "license" field to the "metadata" dictionary of each document. I think I prefer the yaml dictionary approach, because the idea is that the datasheets make it possible to select datasets based on the metadata without reading through the actual data first.

License should also be a required field in my opinion.
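
As a rough sketch of what selecting on metadata could look like with a required license field: the function below takes already-parsed datasheet headers and keeps only the sources whose licenses are all in an allowed set. The header keys, function name, and example values are assumptions for illustration, not an existing API in this repository.

```python
from collections.abc import Mapping


def select_sources(headers, allowed_licenses):
    """Return source identifiers whose licenses are all in allowed_licenses.

    Raises ValueError if a header has no license field, which treats the
    license as required.
    """
    allowed = set(allowed_licenses)
    selected = []
    for header in headers:
        if "license" not in header:
            raise ValueError(
                f"datasheet for {header.get('source')!r} has no license field"
            )
        field = header["license"]
        # The field is either one license string or a sub-source -> license dict.
        licenses = set(field.values()) if isinstance(field, Mapping) else {field}
        if licenses <= allowed:
            selected.append(header["source"])
    return selected


# Hypothetical headers; names and licenses are made up.
example_headers = [
    {"source": "example_source", "license": "cc0-1.0"},
    {"source": "example_mixed", "license": {"sub_a": "cc0-1.0", "sub_b": "cc-by-4.0"}},
]
```

A source with mixed licenses is only selected when every sub-source license is allowed; filtering at sub-source level would need the per-document metadata instead.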
