IMPORTANT NOTE:
The test set for the hierarchical labels consists of expert labels on the same (aggregated, crowd-sourced) spans
that were used as inputs to the second annotation phase. This makes it a true gold standard only for the
task of assigning detailed labels to already-extracted spans.
If you are developing a model that performs span extraction and label assignment jointly, you will
need a test set that contains gold standard spans as well as labels. We are currently collecting this,
and expect to be finished very shortly. Thanks for your patience!
This corpus release contains 4,993 documents annotated with (P)articipants, (I)nterventions, and (O)utcomes.
The files included in this release are as follows:
documents/
Documents are labeled by their PubMed identification number (PMID).
Each document has two files:
1) PMID.text - the raw text of the abstract
2) PMID.tokens - the space-separated tokens (from nltk's punkt tokenizer) that labels are assigned to
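As a minimal sketch of loading one document pair: the helper name read_document and the root argument are hypothetical (they are not part of this release), and UTF-8 encoding is assumed.

```python
from pathlib import Path


def read_document(root, pmid):
    """Return (raw_text, tokens) for one PMID.

    Assumes the two files sit directly under <root>/documents/ and are UTF-8.
    """
    doc_dir = Path(root) / "documents"
    raw_text = (doc_dir / f"{pmid}.text").read_text(encoding="utf-8")
    # Tokens are space-separated; split() with no argument also tolerates
    # a trailing newline or repeated whitespace.
    tokens = (doc_dir / f"{pmid}.tokens").read_text(encoding="utf-8").split()
    return raw_text, tokens
```

The token list, not the raw text, is what the annotation labels line up with.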
annotations/
All annotation files are presented as a space-separated list of labels for the corresponding document tokens.
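Because each label corresponds positionally to one token, an annotation line can be paired with the token list and grouped into spans. The sketch below is illustrative only: the function names are hypothetical, and it assumes that 0 marks an unlabeled token in every annotation file (this README states that only for the hierarchical label mappings).

```python
def parse_labels(line):
    """Convert a space-separated label string into a list of ints."""
    return [int(label) for label in line.split()]


def align(tokens, labels):
    """Pair each token with its label, refusing mismatched lengths."""
    if len(tokens) != len(labels):
        raise ValueError(f"{len(tokens)} tokens but {len(labels)} labels")
    return list(zip(tokens, labels))


def labeled_spans(labels):
    """Return (start, end, label) for each run of identical non-zero labels.

    `end` is exclusive, so tokens[start:end] recovers the span.
    """
    spans, start = [], None
    for i in range(len(labels) + 1):
        current = labels[i] if i < len(labels) else 0  # sentinel closes a final run
        previous = labels[i - 1] if i > 0 else 0
        if current != previous:
            if previous != 0:
                spans.append((start, i, previous))
            start = i
    return spans
```

Checking the lengths before zipping is a cheap guard against pairing an annotation file with the wrong document.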
The first division of annotations is into either:
1) individual/
This folder contains each individual annotation provided by each worker, with multiple annotations per document
Annotators are assigned unique worker id (WID) numbers. The annotation files are labeled as:
PMID_WID.ann
2) aggregated/
This folder contains the aggregated (cleaner, less noisy) annotations with only one file per document
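The PMID_WID.ann naming can be split back into its parts as in the sketch below. The function name is hypothetical, and the sketch assumes the PMID itself never contains an underscore.

```python
from pathlib import Path


def split_annotation_name(filename):
    """Split 'PMID_WID.ann' into (pmid, wid)."""
    stem = Path(filename).stem  # drops the ".ann" suffix
    # Split on the first underscore only; everything after it is the worker id.
    pmid, _, wid = stem.partition("_")
    return pmid, wid
```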
Within these folders, the annotations are separated by the two annotation phases:
1) starting_spans/
These are the first-phase annotations where workers highlighted spans containing target information.
2) hierarchical_labels/
These are the second-phase annotations where workers received the previous PMID_AGGREGATED.ann annotations
for each document and assigned more specific labels to whichever already-labeled tokens they deemed relevant.
After this, the files are separated by PICO element and then the train/test partitions. The test folder contains
two versions:
1) test/gold/
These annotations were collected from medical professionals and are the true target testing set
2) test/crowd/
These annotations were collected on Amazon Mechanical Turk (AMT) and are used to validate the quality of the crowd-sourced labels
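The folder levels described above can be composed into a directory path as sketched below. This is an assumption-laden illustration: the nesting order simply follows the order in which this README introduces each level, the PIO folder names are taken from the label mapping headings, and annotation_dir is a hypothetical helper. Check the result against the actual release before relying on it.

```python
from pathlib import PurePosixPath

# Allowed values at each level, in the order this README describes them.
SOURCES = ("individual", "aggregated")
PHASES = ("starting_spans", "hierarchical_labels")
ELEMENTS = ("participants", "interventions", "outcomes")
PARTITIONS = ("train", "test/gold", "test/crowd")


def annotation_dir(source, phase, element, partition):
    """Build the relative directory for one slice of the annotations."""
    for value, allowed in (
        (source, SOURCES),
        (phase, PHASES),
        (element, ELEMENTS),
        (partition, PARTITIONS),
    ):
        if value not in allowed:
            raise ValueError(f"{value!r} is not one of {allowed}")
    return PurePosixPath("annotations") / source / phase / element / partition
```

For example, annotation_dir("aggregated", "hierarchical_labels", "outcomes", "test/gold") yields annotations/aggregated/hierarchical_labels/outcomes/test/gold.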
The label mappings for each PIO element are:
participants/
0: No label
1: Age
2: Sex
3: Sample size
4: Condition
interventions/
0: No label
1: Surgical
2: Physical
3: Pharmacological
4: Educational
5: Psychological
6: Other
7: Control
outcomes/
0: No label
1: Physical
2: Pain
3: Mortality
4: Adverse effects
5: Mental
6: Other
Here, two sections of the hierarchy have been collapsed:
Mental =
Mental health
Mental behavioral impact
Participant behavior
Other =
Satisfaction with care
Non-health outcomes
Quality of intervention
Resource use
Withdrawal from study
The more specific labels were conflated so frequently that they were of little practical use individually.
The full expansions of the "Mental" and "Other" labels are available upon request ([email protected]).
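For convenience, the label mappings listed above can be captured as lookup tables. The names LABELS and decode are hypothetical and are not shipped with the corpus; the integer-to-name pairs are copied from this README.

```python
# Integer label -> name, per PIO element, as listed in this README.
LABELS = {
    "participants": {
        0: "No label", 1: "Age", 2: "Sex", 3: "Sample size", 4: "Condition",
    },
    "interventions": {
        0: "No label", 1: "Surgical", 2: "Physical", 3: "Pharmacological",
        4: "Educational", 5: "Psychological", 6: "Other", 7: "Control",
    },
    "outcomes": {
        0: "No label", 1: "Physical", 2: "Pain", 3: "Mortality",
        4: "Adverse effects", 5: "Mental", 6: "Other",
    },
}


def decode(element, label_ids):
    """Map integer labels to their names for one PIO element.

    Raises KeyError for an unknown element or an out-of-range label.
    """
    names = LABELS[element]
    return [names[label_id] for label_id in label_ids]
```

Note that the same integer means different things per element (for example, 1 is "Age" for participants but "Surgical" for interventions), so the element always has to be supplied.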