STREUSLE 4.0: merge from prepare-4.0 branch
nschneid committed Feb 11, 2018
2 parents 1281f67 + af6020c commit 688e652
Showing 47 changed files with 153,485 additions and 145,974 deletions.
18 changes: 16 additions & 2 deletions ACKNOWLEDGMENTS.md
@@ -17,18 +17,28 @@ Preposition supersense annotation schemers:
* Tim O'Gorman
* Meredith Green
* Abhijit Suresh
* Na-Rae Han
* Archna Bhatia
* Sarah R. Moeller
* Omri Abend
* Austin Blodgett
* Jakob Prange
* Adi Bitan
* Dotan Dvir

Preposition supersense annotators (University of Colorado):
Preposition supersense v1 annotators (University of Colorado):

* Meredith Green (lead)
* Julia Bonn
* Evan Coles-Harris
* Audrey Farber
* Nicole Gordiyenko
* Megan Hutto
* Story Kiser
* Celeste Smitz
* Tim Watervoort

Preposition supersense pilot annotators (Carnegie Mellon University):
Preposition supersense v1 pilot annotators (Carnegie Mellon University):

* Archna Bhatia
* Carlos Ramírez
@@ -48,6 +58,7 @@ Special thanks
* Mark Steedman
* Claire Bonial
* Tim Baldwin
* Miriam Butt
* Chris Dyer
* Ed Hovy
* Lingpeng Kong
@@ -63,3 +74,6 @@ This research was supported in part by:
* NSF CAREER grant IIS-1054319
* DARPA grant FA8750-12-2-0342 funded under the DEFT program
* a Google Research Award for Q/A PropBank annotation
* DARPA 15-18-CwC-FP-032 Communicating with Computers
* DTRA HDTRA1-16-1-0002/Project # 1553695, eTASC - Empirical Evidence for a Theoretical Approach to Semantic Components
* DARPA LORELEI Semantic Annotation and Technology Transfer
140 changes: 140 additions & 0 deletions CONLLULEX.md
@@ -0,0 +1,140 @@
CoNLL-U-Lex Format
==================

*Nathan Schneider, 2018-12-01*

The file [streusle.conllulex](streusle.conllulex) contains the STREUSLE corpus.
It is structured in a tab-separated format which augments the
10-column [CoNLL-U format](http://universaldependencies.org/format.html)
with 9 additional columns for lexical expressions, for a total of 19 columns.

Sentences are ordered sequentially within documents (reviews);
documents are presented in numerical order by their ID, all in the same file.
Sentences are separated by blank lines.
The markup for each sentence consists of:

- a header section with lines of the form `# key = value`, and
- a body consisting of tokens, one per line.

As an illustration, consider the following example (best viewed in a spreadsheet editor):

```
# sent_id = reviews-010378-0002
# text = I did not have a good experience w/ Dr. Ghassemlou.
# streusle_sent_id = ewtb.r.010378.2
# mwe = I did not have_ a good _experience~w / Dr._Ghassemlou .
1 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 4 nsubj 4:nsubj _ _ PRON I _ _ _ _ _ O-PRON
2 did do AUX VBD Mood=Ind|Tense=Past|VerbForm=Fin 4 aux 4:aux _ _ AUX do _ _ _ _ _ O-AUX
3 not not PART RB _ 4 advmod 4:advmod _ _ ADV not _ _ _ _ _ O-ADV
4 have have VERB VB VerbForm=Inf 0 root 0:root _ 1:1 V have experience v.stative _ 3:1 _ have experience with B-V-v.stative
5 a a DET DT Definite=Ind|PronType=Art 7 det 7:det _ _ DET a _ _ _ _ _ o-DET
6 good good ADJ JJ Degree=Pos 7 amod 7:amod _ _ ADJ good _ _ _ _ _ o-ADJ
7 experience experience NOUN NN Number=Sing 4 obj 4:obj _ 1:2 _ _ _ _ 3:2 _ _ I_
8 w with ADP IN Abbr=Yes 10 case 10:case SpaceAfter=No _ P with p.Topic p.Topic 3:3 _ _ I~-P-p.Topic
9 / / PUNCT , _ 10 punct 10:punct _ _ PUNCT / _ _ _ _ _ O-PUNCT
10 Dr. Dr. PROPN NNP Number=Sing 7 nmod 7:nmod _ 2:1 N Dr. Ghassemlou n.PERSON _ _ _ _ B-N-n.PERSON
11 Ghassemlou Ghassemlou PROPN NNP Number=Sing 10 flat 10:flat SpaceAfter=No 2:2 _ _ _ _ _ _ _ I_
12 . . PUNCT . _ 4 punct 4:punct _ _ PUNCT . _ _ _ _ _ O-PUNCT
```

Header
------

There are 4 pieces of information in the sentence header:

- `sent_id`: the sentence ID in the UD_English corpus
- `text`: the original sentence string
- `streusle_sent_id`: the sentence ID from STREUSLE releases going back to version 1.0;
this begins with the designator `ewtb.r` for English Web Treebank - Reviews subcorpus.
The UD_English sentences are the ones from the English Web Treebank, so `sent_id`
and `streusle_sent_id` are redundant, but including `streusle_sent_id` leaves open
the possibility of including non-UD sentences in the future.
- `mwe`: a human-readable string consisting of the tokens of the sentence with `_` and `~`
added to mark up strong and weak MWEs, respectively. Equivalent machine-readable
information is indicated in the body of the sentence.

Additionally, the first sentence in each document is preceded by a `newdoc` header line.
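
The header lines can be read with a few lines of Python. The following is a minimal sketch, not the corpus's own tooling; the function name `parse_header` is hypothetical, and it assumes every header line starts with `#` and that keys and values are separated by ` = `. A line without ` = ` (such as a bare `# newdoc`) is stored with an empty value.

```python
def parse_header(lines):
    """Collect leading `# key = value` lines of one sentence into a dict."""
    meta = {}
    for line in lines:
        if not line.startswith("#"):
            break  # the first token line ends the header section
        body = line[1:].strip()
        # Split on the first " = " only, so values may themselves contain "=".
        key, sep, value = body.partition(" = ")
        meta[key.strip()] = value.strip() if sep else ""
    return meta


header = parse_header([
    "# sent_id = reviews-010378-0002",
    "# text = I did not have a good experience w/ Dr. Ghassemlou.",
    "# streusle_sent_id = ewtb.r.010378.2",
])
```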

Body
----

Each token line has the following 19 columns, with `_` indicating an empty value
in a column.
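
As a rough illustration of the 19-column layout, a token line can be split on tabs and paired with the column names used in this document. This is a sketch under assumptions: `COLUMNS` and `parse_token_line` are hypothetical names, and mapping `_` to `None` is a simplification that is deliberately skipped for FORM and LEMMA, where an underscore could be a literal token.

```python
COLUMNS = [
    "ID", "FORM", "LEMMA", "UPOSTAG", "XPOSTAG", "FEATS", "HEAD", "DEPREL",
    "DEPS", "MISC",                       # columns 1-10: CoNLL-U
    "SMWE", "LEXCAT", "LEXLEMMA", "SS", "SS2", "WMWE", "WLEMMA", "WCAT",
    "LEXTAG",                             # columns 11-19: lexical annotation
]
LITERAL = {"FORM", "LEMMA"}  # assumption: keep "_" as-is in these columns


def parse_token_line(line):
    """Split one tab-separated token line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}")
    return {
        name: (None if value == "_" and name not in LITERAL else value)
        for name, value in zip(COLUMNS, values)
    }


# Token 8 of the example sentence, rebuilt with explicit tabs.
example = "\t".join([
    "8", "w", "with", "ADP", "IN", "Abbr=Yes", "10", "case", "10:case",
    "SpaceAfter=No", "_", "P", "with", "p.Topic", "p.Topic", "3:3", "_", "_",
    "I~-P-p.Topic",
])
token = parse_token_line(example)
```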

The first 10 columns are copied exactly from the UD_English corpus following the
UDv2 standard. __TODO: The UD_English version is ..., subsequent to 2.0 to incorporate
corrections (primarily to lemmas and POS tags).__
Refer to [this page](http://universaldependencies.org/format.html)
and others on the UD website for documentation of UD's conventions for
encoding orthography, morphology, and syntax.

1. ID: Word index: an integer starting at 1 for each new sentence, or a decimal number for empty nodes that capture ellipsis phenomena. Empty nodes are listed but ignored for purposes of lexical semantics.

2. FORM: Word form or punctuation symbol.

3. LEMMA: Lemma or stem of word form.

4. UPOSTAG: Universal part-of-speech tag, e.g. `ADP` for adpositions.

5. XPOSTAG: Language-specific part-of-speech tag. For UD_English this comes from the Penn Treebank (PTB) tagset: e.g. `IN` for adpositions and subordinating conjunctions.

6. FEATS: List of morphological features, separated by `|` symbols.

7. HEAD: Head of the current word, which is either a value of ID or zero (0).

8. DEPREL: Dependency relation to the HEAD, e.g. `obj` for direct object (`root` iff HEAD = 0).

9. DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.

10. MISC: Any other annotation. In this corpus, for non-empty (regular) token nodes,
the only thing that goes here is `SpaceAfter=No` to indicate how the tokenization
maps to the original sentence string.

11. SMWE: Two integers, the first identifying a strong MWE grouping of tokens, and the second identifying the current token's position relative to the other tokens that form the MWE. E.g., in the above example, *have* and *experience* form a (discontinuous) strong MWE; *have* has `1:1` in the SMWE column and *experience* has `1:2`. Both integers are 1-based.

12. LEXCAT: A syntactic category that applies to *strong lexical expressions* (strong MWEs and single-word expressions, regardless of whether they belong to a weak MWE).
The set of valid supersense labels (SS and SS2) is determined based on LEXCAT.

Possible values of LEXCAT are: `N` (noun, common or proper), `PRON` (non-possessive pronoun, including indefinites like *someone*), `PRON.POSS` (possessive pronoun), `POSS` (possessive clitic), `V` (full verb or copula), `AUX` (auxiliary), `P` (single-word adposition), `PP` (prepositional phrase MWE), `INF` (nonsemantic infinitive marker *to* or infinitive-subject-marker *for*), `INF.P` (infinitive marker *to* when it receives an adposition supersense), `DISC` (discourse/pragmatic expression); and `ADJ`, `ADV`, `DET`, `CCONJ`, `SCONJ`, `INTJ`, `NUM`, `SYM`, `PUNCT`, `X`, which are in line with Universal part-of-speech tags.

__Approximately 300 tokens currently have LEXCAT=`!!@` to indicate that they need to be manually corrected, in most cases by adding a noun supersense. These will be fixed in a subsequent release.__

13. LEXLEMMA: The lemma(s) of the component word(s) of the strong expression (single- or multiword) that begins with the current token. `_` for non-initial tokens in a strong MWE. Thus, for *have*, LEXLEMMA is `have experience`, while for *experience* it is `_`.

14. SS: Supersense label, if applicable; it appears only on the token that is initial within its strong expression. The value is a noun supersense label (prefixed with `n.`; requires LEXCAT=`N`), a verb supersense label (prefixed with `v.`; requires LEXCAT=`V`), or an adposition supersense label (prefixed with `p.`; requires LEXCAT=`P`, `PP`, `INF.P`, `POSS`, or `PRON.POSS`). Special values are `` `$`` (opaque possessive slot in idiom; requires LEXCAT=`POSS` or `PRON.POSS`) and `??` (unable to assign a supersense because the usage is unintelligible, incomplete, marginal, or nonnative).

15. SS2: Second supersense label; used only for adpositional expressions, which always have two labels listed, a role label in SS and a function label in SS2 (often these are identical).

16. WMWE: Weak MWE grouping and position, analogous to the SMWE column. In the example, *have experience w* forms a weak MWE, and this is indicated with WMWE=`3:1`, `3:2`, and `3:3` on the respective tokens. Weak MWE identifiers are kept distinct from strong MWE identifiers.

17. WLEMMA: If the token begins a weak MWE, as *have* does, then this column holds the lemmas of its constituent words. Otherwise, it is blank (`_`).

18. WCAT: Placeholder for a weak MWE category (currently not used).

19. LEXTAG: BIO-style tag summarizing the full lexical analysis, including any strong and weak MWE segmentations, LEXCAT, and supersenses. This is intended for sequence taggers.

* The BIO symbols are: `O` for token not belonging to any MWE, `B` for the token beginning an MWE, `I_` for a token continuing a strong MWE, and `I~` for a token continuing a weak MWE. Lowercase variants `o`, `b`, `i_`, and `i~` apply when the token is contained within a separate discontinuous MWE.

* If the token is not continuing a strong expression (i.e. everything but `I_` and `i_`), the LEXCAT and supersense (if applicable) are appended following hyphens. If SS and SS2 are identical, only one copy is included in the tag; if they differ, they are rendered as SS`|`SS2.

* Thus, for the tokens *have a good experience w*, the respective LEXTAG values are:

- *have*: `B-V-v.stative` - begins an MWE, verbal, stative supersense
- *a*: `o-DET` - not part of any MWE, but contained within one; determiner, no supersense
- *good*: `o-ADJ` - not part of any MWE, but contained within one; adjective, no supersense
- *experience*: `I_` - attaches to the most recent non-`O`/`o` token (*have*) to join it in a strong MWE
- *w*: `I~-P-p.Topic` - attaches to the most recent non-`O`/`o` token (*experience*), joining its strong expression (*have experience*) into a weak expression with whatever strong expression contains *w*. Adpositional; SS and SS2 are both `p.Topic`.
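
The way a LEXTAG is assembled from its parts can be sketched as follows. This is an illustrative reading of the rules above, not code from the corpus scripts: `make_lextag` is a hypothetical helper, and the supersense pair in the last call is made up purely to show the `|` rendering.

```python
def make_lextag(bio, lexcat=None, ss=None, ss2=None):
    """Compose a LEXTAG string from a BIO symbol, LEXCAT, and supersense(s)."""
    # Tokens continuing a strong expression carry the bare symbol only.
    if bio in ("I_", "i_"):
        return bio
    if lexcat is None:
        raise ValueError("LEXCAT is required unless the tag is I_ or i_")
    parts = [bio, lexcat]
    if ss is not None:
        # Identical SS and SS2 are written once; otherwise joined as SS|SS2.
        parts.append(ss if ss2 in (None, ss) else f"{ss}|{ss2}")
    return "-".join(parts)


tags = [
    make_lextag("B", "V", "v.stative"),            # have
    make_lextag("o", "DET"),                       # a
    make_lextag("I_"),                             # experience
    make_lextag("I~", "P", "p.Topic", "p.Topic"),  # w
    make_lextag("O", "P", "p.Locus", "p.Goal"),    # made-up differing pair
]
```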

Remarks
-------

The CoNLL-U-Lex format was designed to balance machine-readability, human-readability, and interoperability.
It supports workflows such as viewing/editing in a spreadsheet editor, processing by Unix command-line utilities or simple scripts, and viewing a diff of changes in version control.
The replication of CoNLL-U in the first 10 columns gives direct access to rich morphological and syntactic information from the UD project and facilitates easy patching as new versions of the UD data are made available.
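
As an example of the "simple scripts" workflow, the sketch below tallies the values of the LEXCAT column (column 12), in the spirit of the counts listed in LEXCAT.txt. It is an assumption-laden illustration: `lexcat_counts` is a hypothetical name, the input is assumed to be tab-separated, and no special handling is given to empty nodes.

```python
from collections import Counter


def lexcat_counts(lines):
    """Count LEXCAT values over all token lines of a .conllulex file."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip sentence separators and header lines
        columns = line.split("\t")
        counts[columns[11]] += 1  # column 12 (0-based index 11) is LEXCAT
    return counts


# Typical use, assuming the corpus file is in the working directory:
#     with open("streusle.conllulex", encoding="utf-8") as f:
#         for value, n in sorted(lexcat_counts(f).items()):
#             print(n, value)
```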

To simplify use cases like sorting and filtering by various components of the annotation, there is considerable redundancy in the lexical-level annotations.
The LEXTAG and LEMMA columns are sufficient to reconstruct columns 11-18 and the `mwe` string in the header
(with the exception of 6 sentences, where the analysis in `mwe` is too complex to be encoded in the body columns
and has thus been automatically simplified).
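
To give a sense of how part of this reconstruction might work, the sketch below recovers only the strong MWE groupings (the information in SMWE) from a sentence's LEXTAG values. It is one possible reading of the BIO scheme described above, not the provided script's implementation: `strong_mwe_groups` is a hypothetical name, the tags are assumed to be well-formed, and weak groupings and the 6 exceptional sentences are not handled.

```python
def strong_mwe_groups(lextags):
    """Return strong MWEs as lists of 1-based token positions."""
    units = []  # strong expressions opened by B/b or I~/i~ tags
    # Most recent open unit outside a gap (True) and inside a gap (False).
    current = {True: None, False: None}
    for position, tag in enumerate(lextags, start=1):
        bio = tag.split("-", 1)[0]   # "B", "I_", "I~", "O" or lowercase
        outer = bio[0].isupper()     # lowercase symbols occur inside a gap
        kind = bio.upper()
        if kind == "O":
            continue                 # not part of any MWE
        if kind == "I_":
            current[outer].append(position)  # continue the strong MWE
        else:
            # "B" or "I~": the token begins a new strong expression.
            current[outer] = [position]
            units.append(current[outer])
    # Single-token units are not strong MWEs.
    return [unit for unit in units if len(unit) > 1]


example_tags = [
    "O-PRON", "O-AUX", "O-ADV", "B-V-v.stative", "o-DET", "o-ADJ", "I_",
    "I~-P-p.Topic", "O-PUNCT", "B-N-n.PERSON", "I_", "O-PUNCT",
]
groups = strong_mwe_groups(example_tags)
```

For the example sentence, the two groups found correspond to the tokens marked `1:1`/`1:2` and `2:1`/`2:2` in the SMWE column.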

A script is provided for checking internal consistency of the .conllulex file and converting to a JSON representation: [conllulex2json.py](conllulex2json.py). The JSON format contains the same information but consolidates columns 11-18 into lexical-level data structures under `"swes"` (single-word expressions), `"smwes"` (strong MWEs), and `"wmwes"` (weak MWEs). For Python scripts, the `conllulex2json` module can be imported for loading Python objects directly without storing a JSON file.
24 changes: 24 additions & 0 deletions LEXCAT.txt
@@ -0,0 +1,24 @@
262 !!@
52 !@
4129 ADJ
3608 ADV
2020 AUX
2176 CCONJ
4043 DET
183 DISC
487 INF
247 INF.P
132 INTJ
9106 N
588 NUM
3981 P
59 POSS
170 PP
5088 PRON
1058 PRON.POSS
5875 PUNCT
473 SCONJ
123 SYM
7677 V
37 X
4011 _
5 changes: 0 additions & 5 deletions LICENSE

This file was deleted.
