LexicalEntry ids #55
Comments
For the XML files I think we should follow XML conventions and use ID for ids. This may also be necessary for validation tools to ensure, e.g., that IDs are unique in a document and that IDREF targets are present. There is no interpretable meaning within the ID strings, and using forms that look like lemmas is only a convenience for human annotators. The actual forms are in If you must have a LexicalEntry ID be an accurate representation of the lemma, |
Thank you, Michael. We were considering using a hash of the lemmas, but Punycode seems more robust. We have a 1-1 correspondence with the lemma, which may be useful for validation. |
Actually, what was the problem with putting the unicode lemmas directly in the ID value, such as |
Accented characters can be used in XML IDs so I don't really see an issue here. The use of IDs also provides some extra validation to the DTD, namely that IDs are unique and that all references to the IDs actually exist. |
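To make the point about accented characters concrete, here is a minimal sketch of an ID check in Python. It assumes that a valid xsd:ID is an NCName, that is, an XML 1.0 (Fifth Edition) Name without colons; the function name `is_valid_xml_id` and the sample IDs are hypothetical and not part of any schema or tool mentioned in this thread.

```python
import re

# NameStartChar ranges from XML 1.0 (Fifth Edition), minus the colon,
# which NCName (and therefore xsd:ID) does not allow.
_START = (
    r"A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    r"\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF"
    r"\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# NameChar adds "-", ".", digits, the middle dot, and combining marks.
_REST = _START + r"\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"
_NCNAME = re.compile(f"[{_START}][{_REST}]*")


def is_valid_xml_id(candidate):
    """Return True if candidate is a well-formed NCName (usable as xsd:ID)."""
    return _NCNAME.fullmatch(candidate) is not None


# Hypothetical IDs, for illustration only.
print(is_valid_xml_id("pwn-café-n"))                 # accented letter
print(is_valid_xml_id("1984-n"))                     # starts with a digit
print(is_valid_xml_id("word-Jack,_o_Estripador-n"))  # contains a comma
```

This only checks well-formedness of a single string. Uniqueness and IDREF resolution still need a validating parser or a separate pass over the document.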
English WordNet 2021 sense IDs conform to the old ID definition but not the more recent xsd:ID. |
@arademaker, did you consider that hashes are one-way (you cannot retrieve what you hash), so the resulting IDs are not legible? The same holds, to a lesser extent, for '_'-substitutions of a number of off-limits characters. |
@goodmami, Punycode is rather English-centric and may be very cumbersome when more than one character cannot be reduced to ASCII. |
@1313ou I'd say it's English-centric only in that it's ASCII-based, but so are some other languages, e.g., Malay or Rotokas. In any case, having looked closer at the XML spec, I suggested in my second comment above that Punycode, or any such encoding, is not necessary as the accented characters can be used in IDs. To be clear, I no longer recommend using it for this purpose, and I've edited my comment above to make this more obvious. I do suggest that we add some text to the page for ID suggestions, or maybe even a Javascript-based validator. All we have currently that I can see is:
The lexicon ID prefix is probably good advice for lexical entries as well because we might have lexical entries for digits or something else that shouldn't appear as the initial character in an XML ID. This means we should have recommendations for lexicon IDs (e.g., that it follows |
I don't think the global schema must define IDs beyond the requirement that they be valid xsd:IDs. What's the problem with letting each wordnet define what they look like? The basic reason is that IDs are functionally opaque (and as such should not be parsed), even if it's nice for the lexicographer to recognize something in them. So "recommendations" is the right word. |
Right, I'm only suggesting that we write some "recommendations". Even if the current text says "...synsets must have...", it might be better to change that must to should. These recommendations are just to help ensure the lexicons can be validated correctly. Otherwise wordnet authors should be free to design their own conventions. |
About the discussion, in fact, as @jmccrae said
The problem occurs for some other characters, such as |
At first, the option was, after replacing spaces with underscores, to apply some other substitutions, as follows:

# formatting lexical_entry
written_form_ = written_form.replace(" ", "_")
word_id = f"word-{written_form_}-{part_of_speech}"
for char in "&;()+º',?–!’":
    word_id = word_id.replace(char, '_')
for char in "/":
    word_id = word_id.replace(char, ':')

But we'd like to avoid this ad-hoc solution. In the future, a new character could break the code. |
Again, it makes sense for this mapping to be global (not depending on a specific language or environment) and reversible, instead of generating a hash or random ID:
An option can be to consider using the utf-8 hexadecimal encoding of the lemma, with part-of-speech for uniqueness. In this case, we generate, for the example before "Jack, o Estripador", from 11077369-n, the ID What do you think @jmccrae @goodmami @1313ou @arademaker ? |
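A minimal sketch of the UTF-8 hexadecimal idea, for discussion only. The exact ID layout proposed above is not preserved in this thread, so the `word-<hex>-<pos>` layout, the `word` prefix default, and the function names `lemma_to_hex_id` and `hex_id_to_lemma` are assumptions made for illustration.

```python
def lemma_to_hex_id(lemma, pos, prefix="word"):
    """Build an ID from the hex form of the lemma's UTF-8 bytes.

    The "<prefix>-<hex>-<pos>" layout is an assumption, not the
    layout used by any existing wordnet.
    """
    hex_form = lemma.encode("utf-8").hex()
    return f"{prefix}-{hex_form}-{pos}"


def hex_id_to_lemma(entry_id):
    """Recover (lemma, pos) from an ID built by lemma_to_hex_id."""
    # The hex part never contains "-", so the last two dashes delimit it
    # (assuming the part of speech contains no dash either).
    _prefix, hex_form, pos = entry_id.rsplit("-", 2)
    return bytes.fromhex(hex_form).decode("utf-8"), pos


entry_id = lemma_to_hex_id("Jack, o Estripador", "n")
print(entry_id)
print(hex_id_to_lemma(entry_id))
```

The mapping is reversible and always yields characters allowed in an XML ID, but the lemma is no longer legible in the ID, which is the same concern raised about hashes earlier in the thread.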
@FredsoNerd thanks for the additional context. Yes, many punctuation characters are excluded from the First, if you collapse multiple punctuation characters into a single replacement character (as you currently do with Let's also look at some examples from the Open English Wordnet:
So it looks like the OEWN has some ad-hoc rules for replacing those. In addition, spaces are replaced with underscores ( To construct an ID, you can then:
To recover the form from the ID, you do those steps in reverse. That is, after stripping the lexicon ID and part of speech and their dashes, all other dash characters indicate escape patterns to be unescaped. |
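The round trip described above could be sketched as follows. The concrete construction steps are not preserved in this thread, so this is a stand-in under stated assumptions: spaces become underscores, a conservative set of letters, digits, and periods passes through unchanged, and every other character (including literal dashes and underscores) becomes a dash-delimited hexadecimal code point such as `-002c-`. All function names, the `pwn` lexicon ID, and the escape format are hypothetical.

```python
import re

# Conservative subset of XML name characters passed through unchanged
# (ASCII letters and digits, period, common accented Latin letters).
_SAFE = re.compile(r"[A-Za-z0-9.\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF]")
# An escape is a lowercase hex code point between two dashes.
_ESCAPE = re.compile(r"-([0-9a-f]+)-")


def escape_lemma(lemma):
    """Escape a lemma so it only contains characters valid in an XML ID."""
    out = []
    for ch in lemma:
        if _SAFE.fullmatch(ch):
            out.append(ch)
        elif ch == " ":
            out.append("_")
        else:
            out.append(f"-{ord(ch):04x}-")  # also covers literal "-" and "_"
    return "".join(out)


def unescape_lemma(escaped):
    """Invert escape_lemma."""
    # Underscores first: at this point they can only stand for spaces,
    # because literal underscores were written as "-005f-".
    text = escaped.replace("_", " ")
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def make_entry_id(lexicon, lemma, pos):
    """Assumed layout: <lexicon ID>-<escaped lemma>-<part of speech>."""
    return f"{lexicon}-{escape_lemma(lemma)}-{pos}"


def parse_entry_id(entry_id, lexicon):
    """Strip the lexicon ID and part of speech, then unescape the rest."""
    if not entry_id.startswith(lexicon + "-"):
        raise ValueError("ID does not start with the lexicon ID")
    body, _, pos = entry_id[len(lexicon) + 1:].rpartition("-")
    return unescape_lemma(body), pos


entry_id = make_entry_id("pwn", "Jack, o Estripador", "n")
print(entry_id)
print(parse_entry_id(entry_id, "pwn"))
```

Because each off-limits character gets its own escape, two different lemmas cannot collapse to the same ID, which is what the single-replacement-character approach cannot guarantee.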
This is the plan and the reason for opening the issue and asking @FredsoNerd to comment here. Of course, I never considered changing the actual ID type in the schema. Thank you for your suggestions, @goodmami. I will discuss with @FredsoNerd how to implement them. |
As @goodmami pointed out we have some ad-hoc rules in OEWN for special characters: https://github.com/globalwordnet/english-wordnet/blob/master/scripts/wordnet_yaml.py#L13 A less 'ad-hoc' approach is to replace them with XML character entities such as |
@FredsoNerd, I think the global wordnet does not have to impose constraints on specific wordnets. However, xsd:ID well-formedness is required because uniqueness and proper reference are involved. I have a problem with legacy sensekeys being promoted to IDs, because the colon makes them non-conformant. However, they can prove useful in the database and should be kept as an extra field, possibly as an extension. |
Yes, I like that idea, in line with @goodmami's suggestion too. But |
The
That is, |
One extra puzzle for me: why does XML use begin/end marks (& and ;)? Could we eventually run into any trouble by using only a single mark in -appos-? |
In the DTDs, a LexicalEntry has an identifier defined as https://github.com/globalwordnet/schemas/blob/master/WN-LMF-1.1.dtd#L35
The type ID, https://www.w3.org/TR/REC-xml/#id, is quite restricted and can potentially be an issue for words in other languages with accents, etc. Nevertheless, I do want to preserve legibility and avoid creating extra artificial IDs. Ideally, I would like a 1-1 relation with the URI used in the RDF encoding. But we can use %-escapes in URIs. Any ideas?