Lessons from archives: strategies for collecting sociocultural data in machine learning, Jo and Gebru, 2019
Paper, Tags: #nlp, #datasets
We argue that a new specialization should be formed within ML that is focused on methodologies for data collection and annotation: efforts that require institutional frameworks and procedures. By showing data collection practices from another field, we encourage ML research to be more cognizant and systematic in data collection.
Lessons from archives: summaries of approaches in archival and library sciences to some of the most important topics in data collection:
- Consent: community and participatory archives
- Inclusivity: mission statements and collection policies
- Power: data consortia
- Transparency: appraisal records and committee-based data collection
- Ethics and privacy: codes of ethics and conduct
Archives have institutional and procedural structures that regulate data collection, annotation and preservation, and ML can draw from them.
Data collection in ML subfields is done without following a rigorous procedure or set of guidelines. Some subfields have fine-grained approaches, but NLP and CV emphasize size and efficiency, and their data collection is minimally supervised.
Approaches to data collection fall on a spectrum. At one end, the laissez-faire approach does not intervene at all; at the interventionist extreme, archives and similar institutions monitor collection extensively.
Datasets composed without an adequate degree of intervention will replicate biases accrued from multiple levels of filtering. Even before data collection, data is subject to historical and representation bias. See paper section 4 for examples.
Data obtained from the Internet overrepresent younger generations and those from developed countries with Internet access.
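One way to make the overrepresentation claim above concrete is to compare each group's share in a collected sample with its share in a reference population. The sketch below is only an illustration and is not taken from the paper: the function name `representation_ratios`, the age buckets and all numbers are hypothetical, chosen to show the calculation.

```python
from collections import Counter


def representation_ratios(sample_groups, reference_shares):
    """Return sample share divided by reference share for each group.

    A ratio above 1 means the group is overrepresented in the sample
    relative to the reference population; below 1, underrepresented.
    """
    counts = Counter(sample_groups)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample_groups must not be empty")
    return {
        group: (counts.get(group, 0) / total) / share
        for group, share in reference_shares.items()
    }


# Hypothetical values, invented purely for illustration (not real statistics).
sample = ["18-29"] * 6 + ["30-49"] * 3 + ["50+"] * 1
reference = {"18-29": 0.25, "30-49": 0.35, "50+": 0.40}

ratios = representation_ratios(sample, reference)
for group, ratio in ratios.items():
    print(f"{group}: {ratio:.2f}")
```

A check like this only flags skew along attributes that are already recorded; it says nothing about groups or attributes the collection process never captured.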
Archives are not the only place we can learn from. For dealing with direct human subjects, and with issues of privacy and representation, we can draw from experimental and fieldwork-driven social sciences such as sociology and psychology. Historians are well-versed in historical context and anthropologists in cultural sensitivities. In navigating an uncharted path, the ML community can look to older fields for examples of successes and failures on comparable matters.