The SoSciSoCi corpus was created as a training corpus for the identification of software usage statements in social science publications. The corpus consists of the methods sections of 480 randomly chosen PLoS articles that contain the keyword "Social Science". The corpus was created jointly by David Schindler, Benjamin Zapilko and Frank Krüger.
The objective was to mark all software usage statements in the scientific publications. Because the number of software mentions was assumed to be low in comparison to other domains, we added XXX sentences that contain software usage statements as positive samples. The sentences were shuffled for annotation so that annotators could rely only on the context of the individual sentence. Additional information such as version or manufacturer was explicitly excluded from the annotation. Annotation was performed by seven annotators: two students from the University of Rostock, two students from GESIS, Benjamin Zapilko from GESIS, and David Schindler and Frank Krüger from the University of Rostock.
To assess the quality of the annotation procedure, 10% of the sentences were annotated by two annotators. The resulting inter-rater reliability (IRR), measured as Cohen's kappa, is 0.816.
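For readers who want to compute the same kind of agreement score on their own double-annotated sentences, the following is a minimal sketch of Cohen's kappa for two annotators. The function name `cohens_kappa` and the toy labels are illustrative assumptions; they are not part of the corpus or its tooling, and the toy labels do not reproduce the value reported above.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data (hypothetical): 1 = sentence contains a software usage statement, 0 = it does not.
annotator_1 = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
annotator_2 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Kappa corrects the raw agreement rate for the agreement expected by chance, which matters here because sentences with software usage statements are comparatively rare.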
The corpus was used in the following publication:
David Schindler, Benjamin Zapilko and Frank Krüger: Investigating Software Usage in the Social Sciences: A Knowledge Graph Approach. In: Proceedings of the 17th Extended Semantic Web Conference, Heraklion, Crete, Greece, May 31 - June 4, 2020.
Please cite this publication when using the corpus.
This work is licensed under a Creative Commons Attribution 4.0 International License.