Tagset for GOST tagger #89

Open
marcverhagen opened this issue Apr 19, 2019 · 5 comments

@marcverhagen
Contributor

I am adding CLAWS tag sets to the vocabulary (so far just CLAWS5 and CLAWS7), but it is not clear which one the GOST tagger is using. It is clearly not CLAWS5, as shown in the table below for the tags that appear in the GOST output for MASC3-0203 (which, by the way, only gives 23 tokens), but it is not CLAWS7 either.

| tag | CLAWS5 | CLAWS7 |
|------|--------|--------|
| MCMC | - | + |
| NN1 | + | + |
| NNU | - | + |
| NP1 | - | + |
| VV0 | - | + |
| YCOL | - | - |
| YDSH | - | - |

Must look into the other CLAWS tag sets.

@marcverhagen
Contributor Author

Probably CLAWS6 or CLAWS8, but need to look at more tags. Note that YCOL and YDSH are punctuation tags and that CLAWS7 is CLAWS6 minus punctuation tags.

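| tag | CLAWS5 | CLAWS6 | CLAWS7 | CLAWS8 |
|------|--------|--------|--------|--------|
| MCMC | - | + | + | + |
| NN1 | + | + | + | + |
| NNU | - | + | + | + |
| NP1 | - | + | + | + |
| VV0 | - | + | + | + |
| YCOL | - | + | - | + |
| YDSH | - | + | - | + |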
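The elimination done by eye in the table above can be sketched in a few lines of Python. This is only an illustration: the membership data is copied from the table, and the names `TAGSETS`, `TABLE`, and `candidate_tagsets` are made up for this sketch and are not part of the vocabulary code.

```python
# Column order of the table above.
TAGSETS = ["CLAWS5", "CLAWS6", "CLAWS7", "CLAWS8"]

# One row per tag observed in the GOST output; "+" means the tag is in that set.
TABLE = {
    "MCMC": "-+++",
    "NN1":  "++++",
    "NNU":  "-+++",
    "NP1":  "-+++",
    "VV0":  "-+++",
    "YCOL": "-+-+",
    "YDSH": "-+-+",
}

def candidate_tagsets(observed_tags):
    """Return the tag sets that contain every one of the observed tags."""
    return [
        name
        for i, name in enumerate(TAGSETS)
        if all(TABLE[tag][i] == "+" for tag in observed_tags)
    ]

print(candidate_tagsets(TABLE))  # ['CLAWS6', 'CLAWS8']
```

With only these seven tags the check cannot separate CLAWS6 from CLAWS8, which is why more tags from the GOST output are needed.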

@marcverhagen
Contributor Author

Beyond the POS tags, GOST also produces semantic tags drawn from the 200+ basic semantic tags of the UCREL Semantic Analysis System (USAS, http://ucrel.lancs.ac.uk/usas/), as well as identifiers from the GO ontology. The GOST service puts these in a list-valued semtags attribute on the Token, where the list holds either one USAS semantic tag or one or more GO categories.
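A rough sketch of checking that rule for a semtags list is shown below. It assumes that GO categories appear as standard GO identifiers ("GO:" followed by seven digits) and that any other string is a USAS tag; the actual form in the GOST output has not been verified here, and `classify_semtags` is a hypothetical helper name.

```python
import re

# Assumption: GO categories are written as standard GO identifiers.
GO_ID = re.compile(r"^GO:\d{7}$")

def classify_semtags(semtags):
    """Return "go" or "usas" for a semtags list, or raise ValueError.

    Valid lists hold either exactly one USAS tag or one or more GO ids.
    """
    if not semtags:
        raise ValueError("semtags must not be empty")
    if all(GO_ID.match(tag) for tag in semtags):
        return "go"
    if len(semtags) == 1:
        return "usas"
    raise ValueError("semtags mixes tag sets or has several USAS tags")

print(classify_semtags(["Z5"]))                        # usas
print(classify_semtags(["GO:0005575", "GO:0008150"]))  # go
```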

Because we have two tag sets for the same property, we need to define this in the metadata a bit differently from existing tag set definitions, where we just give a single URI. For example, for the value of posTagSet on Token we can use a URI pointing to a tag set discriminator in the vocabulary. Now that the semtags property can hold both USAS types and GO categories, we need to be able to say so in the metadata:

| Properties | Types | Description |
|------------|-------|-------------|
| semanticTags | List of String or URI | Semantic types that can be used in the semtags property |

So in the metadata we can say:

{ "contains": {
   "http://vocab.lappsgrid.org/Token": {
      "semanticTags": [ "tags-sem-bio-go", "tags-sem-basic-asus" ] }}}

For the full names I am proposing one of the following:

  1. ns/tagset/sem#basic-asus and ns/tagset/sem#bio-go
  2. ns/tagset/sem-basic#asus and ns/tagset/sem-bio#go
  3. ns/tagset/sem/basic#asus and ns/tagset/sem/bio#go

I think I prefer the last one because the number of different sets of semantic tags may become large.

@marcverhagen marcverhagen self-assigned this Apr 20, 2019
@marcverhagen
Contributor Author

marcverhagen commented May 2, 2019

For the full names we are now leaning towards not creating subdirectories under http://vocab.lappsgrid.org/ns/tagset/sem, so we would get something like

| name | url |
|------|-----|
| tags-sem-asus | http://vocab.lappsgrid.org/ns/tagset/sem#asus |
| tags-sem-bio-go | http://vocab.lappsgrid.org/ns/tagset/sem#bio-go |
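Under this flat naming scheme, the short name and the full URL differ only in a fixed prefix, so one can be derived from the other. The sketch below illustrates that and builds the Token metadata with the semanticTags list; `tagset_url` and `token_metadata` are hypothetical helpers, not existing LAPPS Grid functions, and the scheme itself is still only a proposal.

```python
VOCAB = "http://vocab.lappsgrid.org"
TOKEN = VOCAB + "/Token"
PREFIX = "tags-sem-"

def tagset_url(name):
    """Expand a short name such as "tags-sem-bio-go" to its full URL."""
    if not name.startswith(PREFIX):
        raise ValueError(f"not a semantic tag set name: {name}")
    return f"{VOCAB}/ns/tagset/sem#{name[len(PREFIX):]}"

def token_metadata(names):
    """Build the metadata fragment that lists the semantic tag sets on Token."""
    return {"contains": {TOKEN: {"semanticTags": list(names)}}}

names = ["tags-sem-bio-go", "tags-sem-asus"]
for name in names:
    print(name, tagset_url(name))
print(token_metadata(names))
```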

@nancyide
Contributor

nancyide commented May 3, 2019 via email

@marcverhagen
Contributor Author

The asus discriminator refers to the 200+ semantic tags used by the UCREL Semantic Analysis System (USAS), and they are in the GOST output.
