Datasheets for datasets, Gebru et al., 2018

Paper, Tags: #nlp, #datasets

The lack of a standardized process for documenting datasets in ML can have severe consequences in high-stakes domains such as criminal justice, hiring, critical infrastructure, or finance. We propose datasheets for datasets, suggesting that every dataset be accompanied by a datasheet that documents its motivation, composition, collection process, recommended uses, potential harms, etc.

We have refined a list of suggested datasheet questions, grouped by section:

Motivation

  • For what purpose was this dataset created?
  • Who created the dataset, and on behalf of which entity?
  • Who funded the creation of the dataset?
  • Any other comments?

Composition

Dataset creators should read through the questions in this section prior to any data collection and then provide answers once collection is complete. These questions are aimed at dataset consumers, so that they can make informed decisions about using the dataset for specific tasks. The answers to some of these questions reveal information about compliance with the EU's GDPR.

  • What do the instances that comprise the dataset represent (e.g. documents, photos, people, countries)?
  • How many instances are there in total?
  • Does the dataset contain all possible instances or is it a sample of instances from a larger set?
  • What data does each instance consist of?
  • Is there a label or target associated with each instance?
  • Is any information missing from individual instances?
  • Are relationships between individual instances made explicit (e.g. users' movie ratings, social network links)?
  • Are there recommended data splits (e.g. training, development/validation, testing)?
  • Are there any errors, sources of noise, or redundancies in the dataset?
  • Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g. websites, tweets, other datasets)?
  • Does the dataset contain data that might be considered confidential?
  • Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?
  • Does the dataset relate to people?
  • Does the dataset identify any subpopulations?
  • Is it possible to identify individuals either directly or indirectly?
  • Does the dataset contain data that might be considered sensitive in any way?
  • Any other comments?
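
Several of the Composition questions (total instance count, missing information, labels, redundancies) can be partly answered by inspecting the data itself. The following is a minimal sketch, not something taken from the paper: it assumes the dataset is a list of flat Python dicts with hashable values, and the function name `composition_summary` and the `label_key` parameter are hypothetical.

```python
from collections import Counter


def composition_summary(records, label_key="label"):
    """Collect a few Composition facts from a list of flat dict records.

    Assumption: every record is a dict with string keys and hashable values.
    """
    # Union of all field names seen anywhere in the dataset.
    all_keys = set()
    for record in records:
        all_keys.update(record.keys())

    # A field counts as missing if it is absent or explicitly None.
    missing = Counter()
    for record in records:
        for key in all_keys:
            if record.get(key) is None:
                missing[key] += 1

    # Label distribution over the instances that have a label.
    labels = Counter(
        record[label_key] for record in records
        if record.get(label_key) is not None
    )

    # Exact duplicates: identical field/value pairs seen more than once.
    seen = Counter(tuple(sorted(record.items())) for record in records)
    duplicates = sum(count - 1 for count in seen.values() if count > 1)

    return {
        "num_instances": len(records),
        "missing_per_field": dict(missing),
        "label_counts": dict(labels),
        "exact_duplicates": duplicates,
    }
```

A summary like this only covers the mechanical questions. Questions about confidentiality, offensiveness, or identifiability of individuals still need a human answer.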

Collection process

As with the previous section, dataset creators should read through these questions prior to any data collection. In addition to the goals of the prior section, the answers to questions here may provide information that allows others to reconstruct the dataset without access to it.

  • How was the data associated with each instance acquired? What mechanisms or procedures were used to collect the data?
  • If the dataset is a sample from a larger set, what was the sampling strategy?
  • Who was involved in the data collection process?
  • Over what timeframe was the data collected?
  • Were any ethical review processes conducted?
  • Does the dataset relate to people? (more questions follow in this subsection)
  • Any other comments?

Preprocessing / cleaning / labeling

Dataset creators should read through these questions prior to any preprocessing, cleaning, or labeling and then provide answers once these tasks are complete. Dataset consumers will be able to determine whether the 'raw' data has been processed in ways that are compatible with their chosen tasks.

  • Was any preprocessing/cleaning/labeling of the data done?
  • Was the 'raw' data saved in addition to the preprocessed/cleaned/labeled data?
  • Is the software used to preprocess/clean/label the instances available?
  • Any other comments?

Uses

These questions are intended to encourage dataset creators to reflect on the tasks for which the dataset should and should not be used. By explicitly highlighting these tasks, dataset creators can help dataset consumers to make informed decisions, thereby avoiding potential risks or harms.

  • Has the dataset been used for any tasks already?
  • Is there a repository that links to any or all papers or systems that use the dataset?
  • What tasks can the dataset be used for?
  • Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?
  • Are there tasks for which the dataset should not be used?
  • Any other comments?

Distribution

Dataset creators should provide answers to these questions prior to distributing the dataset either internally within the entity on behalf of which the dataset was created or externally to third parties.

  • Will the dataset be distributed to third parties outside of the entity on behalf of which the dataset was created?
  • How will the dataset be distributed?
  • When will the dataset be distributed?
  • Will the dataset be distributed under a copyright or other intellectual property license, and/or under applicable terms of use?
  • Have any third parties imposed IP-based or other restrictions on the data associated with the instances?
  • Do any export controls or other regulatory restrictions apply to the dataset or to individual instances?
  • Any other comments?

Maintenance

  • Who is supporting/hosting/maintaining the dataset?
  • How can the owner/curator/manager of the dataset be contacted?
  • Is there an erratum?
  • Will the dataset be updated?
  • If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances?
  • Will older versions of the dataset continue to be supported/hosted/maintained?
  • If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so?
  • Any other comments?
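
To make the question list easier to apply, a datasheet can be kept as a small structured object next to the dataset. The sketch below is one possible layout under stated assumptions, not a format defined by the paper: the `Datasheet` class, the `SECTIONS` mapping, and the Markdown rendering are all hypothetical, and only a few questions from two sections are included for brevity.

```python
from dataclasses import dataclass, field

# Hypothetical subset of the question list; extend with the remaining sections.
SECTIONS = {
    "Motivation": [
        "For what purpose was this dataset created?",
        "Who created the dataset, and on behalf of which entity?",
        "Who funded the creation of the dataset?",
    ],
    "Maintenance": [
        "Who is supporting/hosting/maintaining the dataset?",
        "Will the dataset be updated?",
    ],
}


@dataclass
class Datasheet:
    dataset_name: str
    answers: dict = field(default_factory=dict)  # maps question -> answer

    def answer(self, question, text):
        """Record an answer, rejecting questions not in SECTIONS."""
        if not any(question in questions for questions in SECTIONS.values()):
            raise KeyError(f"Unknown datasheet question: {question}")
        self.answers[question] = text

    def unanswered(self):
        """Return the questions that still lack an answer, in section order."""
        return [
            question
            for questions in SECTIONS.values()
            for question in questions
            if question not in self.answers
        ]

    def to_markdown(self):
        """Render the datasheet as Markdown text."""
        lines = [f"# Datasheet: {self.dataset_name}"]
        for section, questions in SECTIONS.items():
            lines.append(f"\n## {section}")
            for question in questions:
                reply = self.answers.get(question, "_unanswered_")
                lines.append(f"- **{question}** {reply}")
        return "\n".join(lines)
```

Calling `unanswered()` before a release gives a quick check that no question was skipped, which fits the advice to answer each section at the matching stage of the dataset's lifecycle.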

When creating datasheets for datasets that relate to people, it may be necessary for dataset creators to work with experts in other domains such as anthropology. Complex and contextual social, historical, and geographical factors influence how best to collect a dataset in a manner that is respectful of individuals and protective of their data and privacy.

The paper contains two example datasheets, one for Labeled Faces in the Wild and one for Pang and Lee's polarity dataset.