xml2ccg #20

Open
wants to merge 48 commits into
base: master
shoeffner
Contributor

Note: This PR relies on #18 and #19 and thus contains the same commits as well. Once those are merged, it will be slightly smaller. I can also rebase/squash etc. for a shorter history.

xml2ccg

This PR introduces a script xml2ccg, which is roughly the inverse of ccg2xml.
Since the recommended way to edit grammars is to work on a ccg file rather than fiddling with the xml files, the tool should only be seen as a one-off generator for recovering a lost ccg file.

I am looking forward to your review and feedback!

Changelog

Features

xml2ccg.py: a new script to create a ccg file from a directory containing the appropriate grammar xml files. It comes with the same xml2ccg and xml2ccg.bat convenience scripts as ccg2xml. Just like ccg2xml.py, it is copied to the bin directory using the ccg-build process. However, it is not auto-generated.

xml2ccg is tested as follows:

  1. Each available ccg grammar (arabic.ccg, tiny.ccg, tinytiny.ccg, grammar_template.ccg, inherit.ccg) was converted to its xml counterpart and put into the test/ccg2xml directory.
    1a. An additional hand-crafted grammar (diaspace, LGPL 2.1+) is used, although no original ccg files exist anymore.
  2. test_xml2ccg.py generates a ccg file from each xml directory and then generates a new xml directory from that temporary ccg file.
  3. The original xml directory and the new xml directory are compared (except for the properties of the root elements and the grammar.xml's file attributes).
  4. While this works well for all ccg2xml generated grammars, for the hand-crafted grammar a few looser rules are needed:
  • The newly generated grammar is allowed to have more entries. It is possible
    that some implicit macro definitions were not explicitly written by hand,
    while the ccg2xml generator adds those. This is especially the case for the
    types.xml, which lists all macro types explicitly when generated via ccg2xml,
    while the hand-crafted variant only contains ontology types.
  • The ccg2xml tool has a few small inconsistencies with the documentation in tiny.ccg for handling certain situations, especially how case<0>: acc0:p-case; is converted. According to tiny.ccg it should become:
    <macro name="@Acc0">
        <fs id="0" attr="case" val="p-case"/>
    </macro>
    but instead becomes
    <macro name="@Acc0">
      <fs id="0">
        <feat attr="case" val="p-case" />
      </fs>
    </macro>
    However, both variants carry the same information. For the final comparison in the test, macro/fs/feat with a val != None is therefore treated in the same way as macro/fs.
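The equivalence between the two macro forms can be sketched as follows. This is only an illustration, not the code in test_xml2ccg.py: the helper name `flatten_fs` is hypothetical, and it assumes that reducing every `fs` to `(id, attr, val)` triples is enough to compare the two variants.

```python
import xml.etree.ElementTree as ET


def flatten_fs(macro):
    """Collect (id, attr, val) triples for every <fs> in a macro.

    Hypothetical helper: <fs id attr val/> and <fs id><feat attr val/></fs>
    produce the same triple, so both spellings compare equal.
    """
    triples = set()
    for fs in macro.iter("fs"):
        fs_id = fs.get("id")
        # Compact form: attr/val sit directly on the <fs> element.
        if fs.get("attr") is not None:
            triples.add((fs_id, fs.get("attr"), fs.get("val")))
        # Nested form: attr/val sit on a <feat> child with a val attribute.
        for feat in fs.findall("feat"):
            if feat.get("val") is not None:
                triples.add((fs_id, feat.get("attr"), feat.get("val")))
    return triples


compact = ET.fromstring(
    '<macro name="@Acc0"><fs id="0" attr="case" val="p-case"/></macro>'
)
nested = ET.fromstring(
    '<macro name="@Acc0"><fs id="0"><feat attr="case" val="p-case" /></fs></macro>'
)

assert flatten_fs(compact) == flatten_fs(nested)
```

Comparing sets of triples also ignores the order of features, which may or may not be desirable for a stricter test.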

Fixes & smaller changes

  • Multiple entries with similar names would be discarded by ccg2xml, because some complex xml structures were only shallow-copied. These copies are now deep copies (ccg.ply:785, ccg.ply:1881)
  • warning_count was neither defined nor properly used and was thus removed (ccg.ply)
  • Removed the executable file permission from various files (ccg.ply, README, arabic.ccg)
  • Indentation and whitespace in build.xml and src/ccg2xml/build.xml were streamlined
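
The shallow-copy problem behind the first fix can be illustrated in isolation. The dictionary below is a hypothetical stand-in for a nested parse structure and is not taken from ccg.ply; it only shows why a shallow copy lets later entries overwrite shared nested data.

```python
import copy

# Hypothetical stand-in for a nested structure; not the actual ccg.ply data.
template = {"name": "entry", "children": [{"attr": "case", "val": "nom"}]}

shallow = copy.copy(template)      # copies the outer dict only
deep = copy.deepcopy(template)     # copies nested objects as well

# Mutating a nested child through the shallow copy also changes the template...
shallow["children"][0]["val"] = "acc"
shallow_leaked = template["children"][0]["val"] == "acc"

# ...whereas the deep copy owns its own nested objects.
deep["children"][0]["val"] = "dat"
deep_leaked = template["children"][0]["val"] == "dat"

print(shallow_leaked, deep_leaked)  # True False
```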

Caveats

  • ccg2xml ignores macro names when generating xml files but instead uses the entity names prefixed with an @ for macro names. Thus, an entry MACRO<NOMVAR:MODE>: NAME; results in
    <macro name="@NAME">
      <lf>
        <satop nomvar="NOMVAR">
          <diamond mode="MODE">
            <prop name="NAME" />
          </diamond>
        </satop>
      </lf>
    </macro>
    instead of using macro name="@MACRO". Thus, in a hand-crafted xml where macro names differ from prop names, the information is converted "properly" to ccg but lost on the conversion back, leading to some strange errors. The only solution to this problem is to change the xml files beforehand, so that the prop names and macro names are already the same (and unique).
  • The grammar.xml's content is largely ignored: the script assumes all files to be in the same directory instead of following the paths inside grammar.xml.
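
Before converting a hand-crafted grammar, the affected macros could be listed with a small check like the one below. It is a sketch under assumptions: `mismatched_macros` is a hypothetical helper that is not part of this PR, it only looks at the macro/lf/satop/diamond/prop shape shown above, and it compares names exactly, including case.

```python
import xml.etree.ElementTree as ET


def mismatched_macros(root):
    """Yield (macro_name, prop_name) where the macro name is not '@' + prop name."""
    for macro in root.iter("macro"):
        prop = macro.find("./lf/satop/diamond/prop")
        if prop is None:
            continue  # macro has a different shape; not covered by this sketch
        prop_name = prop.get("name", "")
        if macro.get("name") != "@" + prop_name:
            yield macro.get("name"), prop_name


# Made-up example data: the first macro would lose its name on a round trip.
root = ET.fromstring("""
<morph>
  <macro name="@MACRO">
    <lf><satop nomvar="NOMVAR"><diamond mode="MODE"><prop name="NAME"/></diamond></satop></lf>
  </macro>
  <macro name="@OTHER">
    <lf><satop nomvar="X"><diamond mode="MODE"><prop name="OTHER"/></diamond></satop></lf>
  </macro>
</morph>
""")

for macro_name, prop_name in mismatched_macros(root):
    print(f"{macro_name} would come back as @{prop_name}")
```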

Additionally removed some outdated imports and comments.
One "arg == None" was changed to "arg is None".
- Using 2to3 and some manual labor
   - Especially comments can be overlooked
- Updating lex.py and yacc.py
- Tested the changes with arabic.ccg
- The editor buttons are broken with this version
It still works fine inside a debian-docker container using XQuartz on
the host system.
Windows will be tested soon.
The content of Features and Testbed is presented properly again.
Before, only the last items were shown, as the indentation level was one
too few.
…aising UnboundLocalError for pos in the next line.
…w. Adding CategoryParser for complexcat and atomcats.
I tried to keep it to a minimum, but it is possible that more locations
are missing. In future iterations, this can be improved to be more
conservative and consistent with the places where maybe_quote is used.
This reflects the fact that in hand-crafted xml files,
rules / macros might have been forgotten.
case<0>: acc0:p-case;

should become

<macro name="@Acc0">
    <fs id="0" attr="case" val="p-case"/>
</macro>

but becomes instead

<macro name="@Acc0">
  <fs id="0">
    <feat attr="case" val="p-case" />
  </fs>
</macro>

This is fine in theory (although tiny.ccg claims the former form would be
produced), but causes trouble when comparing the XMLs and
when converting back and forth between XML and CCG.

However, since xml2ccg's purpose is more or less a one-way recovery,
this slight inconsistency is fine and should be permitted by the tests.
Caveat:
If a macro is defined in the form

  <macro name="@macro">
    <lf>
      <satop nomvar="NOMVAR">
        <diamond mode="MODE">
          <prop name="NAME" />
        </diamond>
      </satop>
    </lf>
  </macro>

The resulting ccg entry looks like this:

MACRO<NOMVAR:MODE>: NAME;

However, ccg2xml drops the macro name and uses the prop name instead:

  <macro name="@name">
    <lf>
      <satop nomvar="NOMVAR">
        <diamond mode="MODE">
          <prop name="NAME" />
        </diamond>
      </satop>
    </lf>
  </macro>

This is functionally equivalent, as the macro name is only an identifier.
However, this can lead to name clashes in certain circumstances as well
as issues with unit tests (as the macro names now differ).
@shoeffner
Contributor Author

With commit 66aa180 / 2decc3b I identified a couple of problems with the test suite which allowed some wrongly translated xml files to slip through (below are only the important excerpts).
I am currently working on a fix for these parses.

arabic morph:

Original:
<fs id="2">
    <feat attr="PERS" val="1st"/>
</fs>

Generated:
<fs attr="PERS" id="2" val="1st"></fs>

arabic lexicon:

Original:
<feat attr="lex" val="[*DEFAULT*]"/>

Generated:
<feat attr="lex" val="*"/>

diaspace lexicon:

Original:
<feat attr="num">
    <featvar name="NUM"/>
</feat>

Generated:
<feat attr="NUM">
    <featvar name="NUM"/>
</feat>

diaspace rules:

Original:
<typeraising dir="forward" useDollar="false">
    <arg>
        <atomcat type="pper"/>
    </arg>
</typeraising>

Generated:
<typeraising dir="forward" useDollar="false"/>

inherit lexicon:

Original:
<feat attr="index">
    <lf>
        <nomvar name="E"/>
    </lf>
</feat>

Generated:
// nothing (other feats are processed, but feats containing lf not)

tiny has multiple of the above issues but no new issues.
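
A stricter comparison that would flag differences like the ones above could look roughly like this. It is a minimal sketch, not the comparison used in test_xml2ccg.py: `diff_elements` is a hypothetical helper, and it assumes that children appear in the same order in both trees.

```python
import xml.etree.ElementTree as ET


def diff_elements(a, b, path=""):
    """Return a list of human-readable differences between two elements."""
    if a.tag != b.tag:
        return [f"{path}: tag {a.tag!r} != {b.tag!r}"]
    here = f"{path}/{a.tag}"
    problems = []
    if a.attrib != b.attrib:
        problems.append(f"{here}: attributes {a.attrib} != {b.attrib}")
    if len(a) != len(b):
        problems.append(f"{here}: {len(a)} children != {len(b)} children")
    # Recurse pairwise; surplus children are already reported by the count check.
    for child_a, child_b in zip(a, b):
        problems.extend(diff_elements(child_a, child_b, here))
    return problems


orig_rule = ET.fromstring(
    '<typeraising dir="forward" useDollar="false">'
    '<arg><atomcat type="pper"/></arg></typeraising>'
)
gen_rule = ET.fromstring('<typeraising dir="forward" useDollar="false"/>')

orig_feat = ET.fromstring('<feat attr="num"><featvar name="NUM"/></feat>')
gen_feat = ET.fromstring('<feat attr="NUM"><featvar name="NUM"/></feat>')

for problem in diff_elements(orig_rule, gen_rule) + diff_elements(orig_feat, gen_feat):
    print(problem)
```

Such a check would still need the looser rules described in the PR text (extra generated entries, the two equivalent fs forms) layered on top.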

…ized. Fixing typerrais parsing. Fixing several [*DEFAULT*] values and handling nomvars more properly.
@shoeffner
Contributor Author

shoeffner commented Dec 7, 2018

The only remaining problem with the diaspace grammar is now family entries which have features of the following type:

<feat attr="modality">
    <lf>
        <nomvar name="SM:gs-SpatialModality"/>
    </lf>
</feat>

These are currently parsed into [modality], so the information about the nomvar is lost. I am not sure whether modality should even be treated specially like the "index" features; and if so, it would presumably only work with a single uppercase letter as its name, I guess.

In either case, I don't know how to represent this in ccg so that it would generate the right output. Maybe the original xml grammar can be changed or this is something the ccg format does not support, while OpenCCG does.

Also disallows saveSection on a None text element.
If a section was edited but "Done" was clicked on another section,
a "NoneType has no get" exception was thrown.
However, a proper fix would be to allow "Done" only on edited sections.
@shoeffner
Contributor Author

Similar to the above-mentioned modality attributes, the diaspace grammar contains a few index attributes with complex names:

<feat attr="index">
    <lf>
        <nomvar name="GL:gs-GeneralizedLocation"/>
    </lf>
</feat>

Since ccg2xml parses only (so it seems) single uppercase letters properly into index attributes, this index feature gets translated from its current ccg representation

[GL:gs-GeneralizedLocation]

into

<feat attr="GL">
    <featvar name="GL:gs-GeneralizedLocation"/>
</feat>

Is this a limitation of the ccg files? Or are those errors in the grammar which should not be possible in xml either?

These two issues seem ( :-) ) to be the remaining problems for xml2ccg.
Do you have any ideas on how to progress with these?

@mwhite14850
Member

This may be a limitation of what ccg2xml can parse. But in general the ability to support LF-valued features (beyond the special index feature) is an important part of the native XML grammar format (note that the .ccg format was designed for easier human authoring but was never exhaustively checked against what the native XML format supports). In the flights and comic grammars (under openccg/grammars), LF-valued features are used to propagate the info and owner features from the semantics to the syntax, in order to implement a version of Steedman's theory of communicative structure (theme/rheme and 'kontrast'), which is described in this article [http://aclweb.org/anthology/J10-2001.pdf]. One way to wrap up xml2ccg, of course, would be to emit warnings when a native XML grammar cannot be adequately translated to .ccg; another would be to try to make ccg2xml complete, but that option would not be for the faint of heart.
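
The warning-based option mentioned above might be sketched as follows. This is an assumption-laden illustration, not existing xml2ccg code: `warn_untranslatable` is a hypothetical helper, and the rule it applies (only index features whose nomvar is a single uppercase letter survive the round trip) is taken from the observations earlier in this thread, not from a specification of the .ccg format.

```python
import re
import warnings
import xml.etree.ElementTree as ET

# Assumption from this thread: only single uppercase letters round-trip cleanly.
SIMPLE_VAR = re.compile(r"[A-Z]")


def warn_untranslatable(root):
    """Warn about LF-valued features that may be lost when converting to .ccg."""
    flagged = []
    for feat in root.iter("feat"):
        nomvar = feat.find("./lf/nomvar")
        if nomvar is None:
            continue  # not an LF-valued feature
        attr, name = feat.get("attr"), nomvar.get("name", "")
        if attr != "index" or not SIMPLE_VAR.fullmatch(name):
            warnings.warn(
                f"feat attr={attr!r} with nomvar {name!r} "
                "may not survive conversion to .ccg"
            )
            flagged.append((attr, name))
    return flagged


sample = ET.fromstring("""
<entry>
  <feat attr="index"><lf><nomvar name="E"/></lf></feat>
  <feat attr="modality"><lf><nomvar name="SM:gs-SpatialModality"/></lf></feat>
  <feat attr="index"><lf><nomvar name="GL:gs-GeneralizedLocation"/></lf></feat>
</entry>
""")
flagged = warn_untranslatable(sample)
```

Returning the flagged features in addition to warning makes it easy for a test or a caller to decide whether to abort the conversion.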

@shoeffner
Contributor Author

Thank you, I was already afraid that this would be the case. I will consider the options and see if I can find some time over the holidays to implement one or the other.
