xml2ccg #20
Additionally removed some outdated imports and comments. One "arg == None" was changed to "arg is None".
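For context on the `arg is None` change, here is a minimal sketch of why the identity check is the safer spelling. The `AlwaysEqual` class is purely hypothetical and not taken from this code base; it only illustrates that `==` can be overridden while `is` cannot.

```python
class AlwaysEqual:
    """Hypothetical class whose __eq__ claims equality with everything."""

    def __eq__(self, other):
        return True


arg = AlwaysEqual()

# "==" consults AlwaysEqual.__eq__, so the comparison is misleadingly true.
equals_none = (arg == None)  # noqa: E711

# "is" compares object identity and cannot be overridden.
is_none = (arg is None)
```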
- Using 2to3 and some manual labor
- Especially comments can be overlooked
- Updating lex.py and yacc.py
- Tested the changes with arabic.ccg
- The editor buttons are broken with this version
It still works fine inside a debian-docker container using XQuartz on the host system. Windows will be tested soon.
The content of Features and Testbed is presented properly again. Previously, only the last items were shown because the relevant code was indented one level too shallow.
…ested by tiny.ccg.
…aising UnboundLocalError for pos in the next line.
…w. Adding CategoryParser for complexcat and atomcats.
…he CategoryParser.
I tried to keep it to a minimum, but it is possible that more locations are missing. In future iterations, this can be improved to be more conservative and consistent with the places where maybe_quote is used.
This reflects the fact that in hand-crafted xml files, rules / macros might have been forgotten.
case<0>: acc0:p-case;
should become
<macro name="@Acc0">
  <fs id="0" attr="case" val="p-case"/>
</macro>
but becomes instead
<macro name="@Acc0">
  <fs id="0">
    <feat attr="case" val="p-case" />
  </fs>
</macro>
This is fine in theory (although tiny.ccg claims the first form is the expected one), but it causes trouble when comparing the XMLs and when converting back and forth between XML and CCG. However, since xml2ccg's purpose is more or less a one-way recovery, this slight inconsistency is fine and should be permitted by the tests.
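One way the tests could tolerate both spellings is to normalize the compact form into the nested one before comparing. The following is only a sketch under that assumption; `normalize_fs` and `canonical` are hypothetical helper names and not part of xml2ccg or its test suite.

```python
import xml.etree.ElementTree as ET


def normalize_fs(macro):
    """Rewrite <fs id attr val/> into <fs id><feat attr val/></fs> in place."""
    # Materialize the list first so the tree is not modified while iterating.
    for fs in list(macro.iter("fs")):
        if "attr" in fs.attrib and "val" in fs.attrib:
            feat = ET.SubElement(fs, "feat")
            feat.set("attr", fs.attrib.pop("attr"))
            feat.set("val", fs.attrib.pop("val"))
    return macro


def canonical(elem):
    """Return a nested tuple of tags and attributes, ignoring whitespace text."""
    return (
        elem.tag,
        tuple(sorted(elem.attrib.items())),
        tuple(canonical(child) for child in elem),
    )


compact = ET.fromstring(
    '<macro name="@Acc0"><fs id="0" attr="case" val="p-case"/></macro>'
)
nested = ET.fromstring(
    '<macro name="@Acc0"> <fs id="0"> <feat attr="case" val="p-case" /> </fs> </macro>'
)

same = canonical(normalize_fs(compact)) == canonical(normalize_fs(nested))
```

Comparing the canonical tuples rather than serialized strings sidesteps differences in whitespace and attribute order.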
Caveat: If a macro is defined in the form
<macro name="@macro">
  <lf>
    <satop nomvar="NOMVAR">
      <diamond mode="MODE">
        <prop name="NAME" />
      </diamond>
    </satop>
  </lf>
</macro>
the resulting ccg entry looks like this:
MACRO<NOMVAR:MODE>: NAME;
However, ccg2xml drops the macro name and uses the prop name instead:
<macro name="@name">
  <lf>
    <satop nomvar="NOMVAR">
      <diamond mode="MODE">
        <prop name="NAME" />
      </diamond>
    </satop>
  </lf>
</macro>
This is functionally equivalent, as the macro name is only an identifier. However, it can lead to name clashes in certain circumstances as well as issues with unit tests (as the macro names now differ).
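To spot affected macros before converting, a small check along the following lines could be run over a morph file. This is a sketch only: `mismatched_macros` is a hypothetical helper, the sample XML is made up for illustration, and the comparison (exact match after stripping the leading "@") is an assumption about how names should line up.

```python
import xml.etree.ElementTree as ET


def mismatched_macros(root):
    """Return (macro name, prop name) pairs whose names differ."""
    result = []
    for macro in root.iter("macro"):
        prop = macro.find("./lf/satop/diamond/prop")
        if prop is None:
            continue  # not an LF macro of the form discussed here
        # Assumption: names are compared exactly, minus the leading "@".
        macro_name = macro.get("name", "").lstrip("@")
        if macro_name != prop.get("name"):
            result.append((macro.get("name"), prop.get("name")))
    return result


# Made-up sample data, not taken from any of the shipped grammars.
sample = ET.fromstring("""
<morph>
  <macro name="@sg">
    <lf><satop nomvar="X"><diamond mode="num"><prop name="sg"/></diamond></satop></lf>
  </macro>
  <macro name="@plural">
    <lf><satop nomvar="X"><diamond mode="num"><prop name="pl"/></diamond></satop></lf>
  </macro>
</morph>
""")

clashes = mismatched_macros(sample)
```

Any pair reported this way would lose its original macro name in an XML to CCG to XML round trip.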
With commit 66aa180 / 2decc3b I identified a couple of problems with the test suite that allowed some wrongly translated XML files to slip through (below are only the important excerpts).
arabic morph:
arabic lexicon:
diaspace lexicon:
diaspace rules:
inherit lexicon:
tiny has several of the above issues but no new ones.
…ized. Fixing typerrais parsing. Fixing several [*DEFAULT*] values and handling nomvars more properly.
The only remaining problem with the diaspace grammar is now family entries which have features of the following type:
These are currently parsed into [modality], thus the information about the nomvar is lost. I am not sure whether modality should even be treated specially like the "index" features. If so, I guess it could also only work with a single uppercase letter as its name. In either case, I don't know how to represent this in ccg so that it would generate the right output. Maybe the original xml grammar can be changed, or this is something the ccg format does not support while OpenCCG does.
Also disallows saveSection on a None text element. If a section was edited but "Done" was clicked on another section, a "NoneType has no get" exception was thrown. However, a proper fix would be to allow "Done" only on edited sections.
Similarly to the modality attributes mentioned above, the diaspace grammar contains a few index attributes with complex names:
<feat attr="index">
  <lf>
    <nomvar name="GL:gs-GeneralizedLocation"/>
  </lf>
</feat>
Since ccg2xml (so it seems) only parses single uppercase letters properly into index attributes, this index feature gets translated from its current ccg representation
into
<feat attr="GL">
  <featvar name="GL:gs-GeneralizedLocation"/>
</feat>
Is this a limitation of the ccg files? Or are those errors in the grammar which should not be possible in xml either? These two issues seem ( :-) ) to be the remaining problems for xml2ccg.
This may be a limitation of what ccg2xml can parse. But in general the ability to support LF-valued features (beyond the special index feature) is an important part of the native XML grammar format (note that the .ccg format was designed for easier human authoring but was never exhaustively checked against what the native XML format supports). In the flights and comic grammars (under openccg/grammars), LF-valued features are used to propagate the info and owner features from the semantics to the syntax, in order to implement a version of Steedman's theory of communicative structure (theme/rheme and 'kontrast'), which is described in this article [http://aclweb.org/anthology/J10-2001.pdf]. One way to wrap up xml2ccg, of course, would be to emit warnings when a native XML grammar cannot be adequately translated to .ccg; another would be to try to make ccg2xml complete, but that option would not be for the faint of heart.
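The warning option could start out as a simple scan for LF-valued features. The sketch below is not the xml2ccg implementation: `untranslatable_features` is a hypothetical helper, the sample XML is made up, and the "single uppercase letter" rule is only the behaviour ccg2xml appeared to show in this thread, not a documented constraint.

```python
import re
import warnings
import xml.etree.ElementTree as ET

# Assumption from this thread: only single uppercase letters survive as index vars.
SIMPLE_VAR = re.compile(r"^[A-Z]$")


def untranslatable_features(root):
    """Collect messages for LF-valued features that .ccg may not express."""
    problems = []
    for feat in root.iter("feat"):
        nomvar = feat.find("./lf/nomvar")
        if nomvar is None:
            continue  # plain feature without an LF value
        attr = feat.get("attr")
        name = nomvar.get("name", "")
        if attr != "index":
            problems.append(f"LF-valued feature '{attr}' (nomvar '{name}')")
        elif not SIMPLE_VAR.match(name):
            problems.append(f"index feature with complex nomvar '{name}'")
    return problems


# Made-up sample; only the second entry mirrors the diaspace excerpt above.
sample = ET.fromstring("""
<fs>
  <feat attr="index"><lf><nomvar name="X"/></lf></feat>
  <feat attr="index"><lf><nomvar name="GL:gs-GeneralizedLocation"/></lf></feat>
  <feat attr="info"><lf><nomvar name="M"/></lf></feat>
</fs>
""")

problems = untranslatable_features(sample)
for message in problems:
    warnings.warn(f"cannot be translated to .ccg faithfully: {message}")
```

Returning the messages as a list and warning separately keeps the check usable both from the converter and from tests.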
Thank you, I was already afraid that this would be the case. I will consider the options and see if I can find some time over the holidays to implement one or the other.
Note: This PR relies on #18 and #19 and thus contains the same commits as well. Once those are merged, it will be slightly smaller. I can also rebase/squash etc. for a shorter history.
xml2ccg
This PR introduces a script xml2ccg, which is roughly the inverse of ccg2xml.
Since the recommended way to edit grammars is not fiddling around with xml files but with a ccg file, the tool should only be seen as a one-off generator of a lost ccg file.
I am looking forward to your review and feedback!
Changelog
Features
xml2ccg.py: a new script to create a ccg file from a directory containing the appropriate grammar xml files. It comes with the same xml2ccg and xml2ccg.bat convenience scripts as ccg2xml. Just like ccg2xml.py, it is copied to the bin directory using the ccg-build process. However, it is not auto-generated.
xml2ccg is tested as follows:
1a. An additional hand-crafted grammar (diaspace, LGPL 2.1+) is used, although no original ccg files exist anymore.
that some implicit macro definitions were not explicitly written by hand,
while the ccg2xml generator adds those. This is especially the case for the
types.xml, which lists all macro types explicitly when generated via ccg2xml,
while the hand-crafted variant only contains ontology types.
case<0>: acc0:p-case;
is converted. According to tiny.ccg it should become:
val != None
is treated in the same way as macro/fs.
Fixes & smaller changes
Caveats
MACRO<NOMVAR:MODE>: NAME;
results in macro name="@MACRO". Thus, in a handcrafted xml where macro names are different from prop names, the information is converted "properly" to ccg, but lost on the conversion back, leading to some strange errors. The only solution to this problem is to change the xml files beforehand, so that the prop names and macro names are already the same (and unique).
grammar.xml's content is largely ignored; the script assumes all files to be in the same directory instead of following the paths inside grammar.xml.