GraphML output in CollateX Python #65

djbpitt · 2018-08-18T16:01:35Z

@tla @rhdekker I'm looking at adding GraphML output to CollateX Python (it's already in CollateX Java), and I'm not confident about the target output format. Specifically, the Java GraphML output for nodes contains three fields, one for the node id, one for the rank, and one that is a concatenation of the t properties of the tokens on that node. For example, using the first example at https://collatex.net/demo/, the Dekker alignment algorithm, and with Segmentation and Transposition both checked, the first non-start node in the GraphML output is:

        <node id="n1">
            <data key="d0">1</data>
            <data key="d2">1</data>
            <data key="d1">This morning </data>
        </node>

This means node 1, rank 1, and the concatenated t value of the tokens is “This morning ”.
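For anyone consuming this output, here is a minimal Python sketch of reading the fragment above. It keeps the `<data>` values keyed by their `key` attribute instead of naming them, because the mapping of `d0`, `d1`, and `d2` to id, rank, and tokens is declared in the file's `<key>` elements, which are not shown here. It also assumes the fragment carries no namespace; a complete GraphML file normally declares one, and the tag lookups would then need to be namespace-qualified. `read_node` is a hypothetical helper name, not part of CollateX.

```python
import xml.etree.ElementTree as ET

# The node fragment quoted above, copied verbatim (no namespace assumed).
NODE_XML = """
<node id="n1">
    <data key="d0">1</data>
    <data key="d2">1</data>
    <data key="d1">This morning </data>
</node>
"""


def read_node(xml_text):
    """Return the node's id attribute and a dict of its <data> values by key."""
    node = ET.fromstring(xml_text.strip())
    fields = {data.get("key"): data.text for data in node.findall("data")}
    return node.get("id"), fields


node_id, fields = read_node(NODE_XML)
print(node_id, fields)
```

Note that every value comes back as a string, and the trailing space in the token text is preserved.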

This output seems to have two limitations:

1. It does not persist the n values, which cannot be recreated without knowing how normalization was performed. It is possible to add this as a separate property on (that is, <data> child of) the node. It also does not persist other properties that the user might have added to the token during normalization.
2. It does not persist the tokenization, which cannot be recreated without knowing how it was performed originally.

The second of these limitations goes away if Segmentation is turned off, so that no node can contain more than one token, but that then restricts the types of CollateX variant graphs that can be exported as GraphML.
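Both limitations can be illustrated with a small Python sketch. The token dictionaries are invented for illustration: the n values assume a normalization that lowercases and strips whitespace, which is only one possibility, and `concatenated_t` is a hypothetical helper that mimics the concatenation described above.

```python
# Two hypothetical tokenizations of the same reading.
tokens_a = [
    {"t": "This ", "n": "this"},
    {"t": "morning ", "n": "morning"},
]
tokens_b = [
    {"t": "This morning ", "n": "this morning"},
]


def concatenated_t(tokens):
    """Join the t values of all tokens on a node into one string."""
    return "".join(token["t"] for token in tokens)


# Both tokenizations collapse to the same exported string, so the
# token boundaries and the n values cannot be read back from it.
same_output = concatenated_t(tokens_a) == concatenated_t(tokens_b)
print(same_output, repr(concatenated_t(tokens_a)))
```

Because the two inputs produce identical output, nothing in the exported string distinguishes them.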

It is possible to support complex objects in GraphML by customizing the schema (http://graphml.graphdrawing.org/primer/graphml-primer.html#Complex). In that case, even with Segmentation enabled, each pair of t and n values (and other properties that the user might have added to the token during normalization) could be represented by a complex type. It isn't clear to me, though, whether that is the best strategy, especially because it was not adopted for the CollateX Java output.
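As a rough sketch of what structured token content might look like, the Python below nests one child element per token inside a `<data>` element and reads the pairs back. Everything specific here is an assumption made for illustration: the `token` element name, the `d3` key, the use of `d2` for rank, and the helper names. It does not define or validate the schema extension that real GraphML would need for such content.

```python
import xml.etree.ElementTree as ET


def build_node(node_id, rank, tokens, tokens_key="d3"):
    """Build a <node> whose tokens are child elements, not one joined string."""
    node = ET.Element("node", {"id": node_id})
    rank_el = ET.SubElement(node, "data", {"key": "d2"})  # assumed rank key
    rank_el.text = str(rank)
    holder = ET.SubElement(node, "data", {"key": tokens_key})
    for token in tokens:
        # Hypothetical <token> element carrying t and n as attributes.
        ET.SubElement(holder, "token", {"t": token["t"], "n": token["n"]})
    return node


def read_tokens(node, tokens_key="d3"):
    """Recover the (t, n) pairs from a node built by build_node."""
    for data in node.findall("data"):
        if data.get("key") == tokens_key:
            return [(tok.get("t"), tok.get("n")) for tok in data.findall("token")]
    return []


tokens = [{"t": "This ", "n": "this"}, {"t": "morning ", "n": "morning"}]
xml_text = ET.tostring(build_node("n1", 1, tokens), encoding="unicode")
recovered = read_tokens(ET.fromstring(xml_text))
print(recovered)
```

With this shape, both the token boundaries and the n values survive a round trip, even when Segmentation places several tokens on one node.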

Might either of you be able to provide some guidance about the requirements and expectations?

tla · 2018-08-20T21:00:06Z

It's a good question - to be honest I never use segmentation from CollateX, as I prefer to join segments later using my own rules, so I never thought about how segments ought to be represented. I'm not sure who else actually uses the GraphML output.

Concerning persistence of token properties: I think it would be a great idea to extend the schema to allow for arbitrary JSON-like structures in tokens. (And then as long as you're extending the schema, I suppose there could also be support for tokens-within-tokens for the segmentation.) That would actually make the GraphML output useful to me again.
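One lighter-weight alternative to extending the schema, shown here only as a sketch, is to serialize each node's token list as a JSON string inside an ordinary string-typed `<data>` element. This is a different technique from the schema extension suggested above: consumers would have to know that the value is JSON. The `d3` key, the helper names, and the `extra` property are hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def tokens_to_data(tokens, key="d3"):
    """Store a list of token dicts as JSON text in a <data> element."""
    data = ET.Element("data", {"key": key})
    data.text = json.dumps(tokens)
    return data


def data_to_tokens(data):
    """Parse the JSON text of a <data> element back into token dicts."""
    return json.loads(data.text)


tokens = [
    {"t": "This ", "n": "this", "extra": {"lemma": "this"}},
    {"t": "morning ", "n": "morning"},
]

# Round trip through serialized XML to confirm nothing is lost.
xml_text = ET.tostring(tokens_to_data(tokens), encoding="unicode")
restored = data_to_tokens(ET.fromstring(xml_text))
print(restored == tokens)
```

The trade-off is that the token structure is opaque to GraphML tools, which see only a string.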

djbpitt added python feature labels Aug 18, 2018
