For now feel free to keep them minimal (then we can always expand on it). Is there anything we feel like the datasheet should at least contain?
Columns (at least a description of non-standard ones)
Source
Data transformations
It might be useful to look at:
T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford. Datasheets for datasets. arXiv preprint arXiv:1803.09010, 2018.
For each dataset we should have a dataset card or datasheet in the same style as Hugging Face dataset cards: https://huggingface.co/docs/hub/datasets-cards
I prefer to have the dataset cards on GitHub to be able to track changes. The filename of the datasheet should be the same as the "source" identifier, i.e. {source}.md.
The data card contains a header in YAML to make it machine readable, which is then followed by descriptions in Markdown.
I see that it can be a problem if the datasets contain sub-sources with different licenses for example. In that case the license field in the yaml should be a dictionary that maps from sub-sources to a specific license.
Alternatively, the license field could be a keyword, e.g. "multiple", and then we add a "license" field in the "metadata" dictionary of each document. I think I would prefer the YAML dictionary approach, because the idea is that the datasheets make it possible to select datasets based on the metadata without reading through the actual data first.
License should also be a required field in my opinion.
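To make the dictionary approach a bit more concrete, here is a rough sketch of how selection by license could work once the YAML headers have been parsed. Everything in it is an assumption for illustration: the source names, the sub-source names, the license strings, and the helper functions are hypothetical, and the headers are written as plain Python dicts standing in for whatever the YAML parser returns.

```python
from typing import Dict, List, Set, Union

# The license field is either a single string or a mapping
# from sub-source name to license (the dictionary approach).
LicenseField = Union[str, Dict[str, str]]


def licenses_by_subsource(source: str, license_field: LicenseField) -> Dict[str, str]:
    """Normalize the license field to a {sub-source: license} mapping."""
    if isinstance(license_field, str):
        # A single license applies to the whole source.
        return {source: license_field}
    return dict(license_field)


def select_sources(headers: Dict[str, dict], allowed: Set[str]) -> List[str]:
    """Return the sources whose sub-sources all carry an allowed license."""
    selected = []
    for source, header in headers.items():
        # License is treated as a required field.
        if "license" not in header:
            raise ValueError(f"{source}.md is missing the required license field")
        licenses = licenses_by_subsource(source, header["license"])
        if all(lic in allowed for lic in licenses.values()):
            selected.append(source)
    return selected


# Hypothetical parsed headers, keyed by the "source" identifier.
headers = {
    "example_single": {"license": "cc0-1.0"},
    "example_mixed": {"license": {"part_a": "cc0-1.0", "part_b": "other"}},
}

print(select_sources(headers, {"cc0-1.0"}))
```

With this shape the selection only touches the headers, so nothing in the actual data has to be read to decide which sources are usable.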
Agreed with @peterbjorgensen to add datasheets to our datasets.
@jankounchained will you add one for NCC?
@TTTTao725 will you add one for the datasets you created - we can discuss it next time you are in
@peterbjorgensen
For now feel free to keep them minimal (then we can always expand on it). Is there anything we feel like the datasheet should at least contain?
It might be useful to look at:
T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford. Datasheets for datasets. arXiv preprint arXiv:1803.09010, 2018.
previous datasheets:
https://github.com/centre-for-humanities-computing/danish-foundation-models/tree/main/docs/datasheets