GraphML output in CollateX Python #65

Open
djbpitt opened this issue Aug 18, 2018 · 1 comment


djbpitt commented Aug 18, 2018

@tla @rhdekker I'm looking at adding GraphML output to CollateX Python (it's already in CollateX Java), and I'm not confident about the target output format. Specifically, the Java GraphML output for nodes contains three fields, one for the node id, one for the rank, and one that is a concatenation of the t properties of the tokens on that node. For example, using the first example at https://collatex.net/demo/, the Dekker alignment algorithm, and with Segmentation and Transposition both checked, the first non-start node in the GraphML output is:

        <node id="n1">
            <data key="d0">1</data>
            <data key="d2">1</data>
            <data key="d1">This morning </data>
        </node>

This means node 1, rank 1, and the concatenated t value of the tokens is “This morning ”.
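For reference, here is a minimal sketch of reading those three fields back with Python's standard library. The mapping of `d0` and `d2` to "number" and "rank" is an assumption (both are `1` in this example, so the snippet alone cannot tell them apart), and a full GraphML file would also carry a namespace that this fragment omits.

```python
import xml.etree.ElementTree as ET

# The node fragment quoted above, without the surrounding <graphml> wrapper.
NODE_XML = """
<node id="n1">
    <data key="d0">1</data>
    <data key="d2">1</data>
    <data key="d1">This morning </data>
</node>
"""

# Assumed meaning of each key id; d0 vs. d2 is a guess, see the note above.
KEY_NAMES = {"d0": "number", "d1": "tokens", "d2": "rank"}


def read_node(xml_text):
    """Return the node id and its <data> children as a plain dict."""
    node = ET.fromstring(xml_text)
    fields = {"id": node.get("id")}
    for data in node.findall("data"):
        fields[KEY_NAMES[data.get("key")]] = data.text
    return fields


node_fields = read_node(NODE_XML)
print(node_fields)
```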

This output seems to have two limitations:

  1. It does not persist the n values, which cannot be recreated without knowing how normalization was performed. It is possible to add these as a separate property on (that is, as a <data> child of) the node. It also does not persist other properties that the user might have added to the token during normalization.
  2. It does not persist the tokenization, which cannot be recreated without knowing how it was performed originally.

The second of these limitations goes away if Segmentation is turned off, so that no node can contain more than one token, but that then restricts the types of CollateX variant graphs that can be exported as GraphML.
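To illustrate the second limitation, the following sketch shows two different tokenizations collapsing to the same concatenated string. The token dictionaries and their lowercased n values are hypothetical; they only mimic the t/n shape of CollateX tokens.

```python
def concatenate_t(tokens):
    """Join the t values of a node's tokens, as the Java GraphML output does."""
    return "".join(token["t"] for token in tokens)


# Hypothetical tokens: the same text split into two tokens or kept as one.
two_tokens = [{"t": "This ", "n": "this"}, {"t": "morning ", "n": "morning"}]
one_token = [{"t": "This morning ", "n": "this morning"}]

# Both tokenizations yield the same string, so the boundaries are lost.
same_output = concatenate_t(two_tokens) == concatenate_t(one_token)
print(same_output)
```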

It is possible to support complex objects in GraphML by customizing the schema (http://graphml.graphdrawing.org/primer/graphml-primer.html#Complex). In that case, even with Segmentation enabled, each pair of t and n values (and other properties that the user might have added to the token during normalization) could be represented by a complex type. It isn't clear to me, though, whether that is the best strategy, especially because it was not adopted for the CollateX Java output.
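As a rough sketch of what the complex-type approach might produce, the code below nests one element per token inside the node's `<data>` child. The `cx` namespace URI, the `token` element name, and the `build_node` helper are all hypothetical, and the schema extension that would make this output valid GraphML is not shown. Token property names are assumed to be legal XML element names.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace for the extension elements; not defined by GraphML or CollateX.
CX_NS = "http://example.org/collatex-graphml-ext"
ET.register_namespace("cx", CX_NS)


def build_node(node_id, rank, tokens):
    """Build a <node> whose token <data> holds one child element per token."""
    node = ET.Element("node", {"id": node_id})
    rank_data = ET.SubElement(node, "data", {"key": "d2"})
    rank_data.text = str(rank)
    tokens_data = ET.SubElement(node, "data", {"key": "d1"})
    for token in tokens:
        token_el = ET.SubElement(tokens_data, f"{{{CX_NS}}}token")
        # Every token property (t, n, or anything user-added) becomes a child element.
        for name, value in token.items():
            child = ET.SubElement(token_el, f"{{{CX_NS}}}{name}")
            child.text = value
    return node


node = build_node(
    "n1",
    1,
    [{"t": "This ", "n": "this"}, {"t": "morning ", "n": "morning"}],
)
xml_text = ET.tostring(node, encoding="unicode")
print(xml_text)
```

Because each token keeps its own element, both the tokenization and the per-token properties survive even when Segmentation places several tokens on one node.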

Might either of you be able to provide some guidance about the requirements and expectations?


tla commented Aug 20, 2018

It's a good question - to be honest I never use segmentation from CollateX, as I prefer to join segments later using my own rules, so I never thought about how segments ought to be represented. I'm not sure who else actually uses the GraphML output.

Concerning persistence of token properties: I think it would be a great idea to extend the schema to allow for arbitrary JSON-like structures in tokens. (And then as long as you're extending the schema, I suppose there could also be support for tokens-within-tokens for the segmentation.) That would actually make the GraphML output useful to me again.
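One way to prototype this without touching the schema is to serialize the token structure as a JSON string inside an ordinary `<data>` element. This is a simpler stand-in, not the schema extension described above. The key id `d3`, the nested `tokens` property, and both helper functions are assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET


def tokens_to_data(tokens, key="d3"):
    """Store a JSON-like token structure as the text of a <data> element."""
    data = ET.Element("data", {"key": key})
    data.text = json.dumps(tokens, ensure_ascii=False)
    return data


def data_to_tokens(data):
    """Recover the token structure from a <data> element."""
    return json.loads(data.text)


# Hypothetical segment: an outer token that carries its component tokens.
segment = [
    {
        "t": "This morning ",
        "n": "this morning",
        "tokens": [
            {"t": "This ", "n": "this"},
            {"t": "morning ", "n": "morning"},
        ],
    }
]

data = tokens_to_data(segment)
serialized = ET.tostring(data, encoding="unicode")
round_tripped = data_to_tokens(ET.fromstring(serialized))
print(round_tripped == segment)
```

The trade-off is that generic GraphML tools would see only an opaque string, whereas a schema extension would expose the structure as XML.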
