@tla @rhdekker I'm looking at adding GraphML output to CollateX Python (it's already in CollateX Java), and I'm not confident about the target output format. Specifically, the Java GraphML output for nodes contains three fields: one for the node id, one for the rank, and one that is a concatenation of the t properties of the tokens on that node. For example, using the first example at https://collatex.net/demo/, the Dekker alignment algorithm, and with Segmentation and Transposition both checked, the first non-start node in the GraphML output is:
This means node 1, rank 1, and the concatenated t value of the tokens is “This morning ”.
This output seems to have two limitations:
It does not persist the n values, which cannot be recreated without knowing how normalization was performed. It would be possible to add these as a separate property on (that is, a <data> child of) the node. It also does not persist other properties that the user might have added to the token during normalization.
It does not persist the tokenization, which cannot be recreated without knowing how it was performed originally.
The second of these limitations goes away if Segmentation is turned off, so that no node can contain more than one token, but that then restricts the types of CollateX variant graphs that can be exported as GraphML.
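To make the two limitations concrete, here is a minimal Python sketch of a node serialized with only an id, a rank, and the concatenated t values. The function name `node_to_graphml` and the key names `number`, `rank`, and `tokens` are illustrative assumptions, not the keys that CollateX Java actually emits.

```python
import xml.etree.ElementTree as ET

def node_to_graphml(node_id, rank, tokens):
    """Build a <node> holding only id, rank, and the concatenated t values.

    The key names used here are hypothetical placeholders.
    """
    node = ET.Element("node", id=str(node_id))
    fields = (
        ("number", node_id),
        ("rank", rank),
        ("tokens", "".join(tok["t"] for tok in tokens)),
    )
    for key, value in fields:
        data = ET.SubElement(node, "data", key=key)
        data.text = str(value)
    return node

# Two tokens merged onto one node, as happens with Segmentation enabled.
tokens = [{"t": "This ", "n": "this"}, {"t": "morning ", "n": "morning"}]
node = node_to_graphml(1, 1, tokens)
concatenated = node.find("data[@key='tokens']").text
```

After serialization only the string in `concatenated` survives: the n values are gone, and nothing records where one token ended and the next began.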
It is possible to support complex objects in GraphML by customizing the schema (http://graphml.graphdrawing.org/primer/graphml-primer.html#Complex). In that case, even with Segmentation enabled, each pair of t and n values (and other properties that the user might have added to the token during normalization) could be represented by a complex type. It isn't clear to me, though, whether that is the best strategy, especially because it was not adopted for the CollateX Java output.
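As a rough sketch of what the complex-type route might look like, the following Python writes one child element per token inside a `<data>` element, with each token property stored as an attribute. The namespace URI, the `cx` prefix, the `token` element name, and both function names are assumptions made up for illustration; a real implementation would also need a matching schema extension, and property names would have to be valid XML attribute names.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace for the hypothetical schema extension.
CX_NS = "http://example.org/collatex-graphml"
ET.register_namespace("cx", CX_NS)

def tokens_to_data(tokens, key="tokens"):
    """Store each token as a namespaced child element of <data>."""
    data = ET.Element("data", key=key)
    for tok in tokens:
        el = ET.SubElement(data, f"{{{CX_NS}}}token")
        for name, value in tok.items():
            el.set(name, str(value))  # attribute values are always strings
    return data

def data_to_tokens(data):
    """Recover the token dictionaries from a <data> element."""
    return [dict(el.attrib) for el in data.findall(f"{{{CX_NS}}}token")]

tokens = [{"t": "This ", "n": "this"}, {"t": "morning ", "n": "morning"}]
serialized = ET.tostring(tokens_to_data(tokens))
restored = data_to_tokens(ET.fromstring(serialized))
```

With this layout both the token boundaries and the t/n pairs survive a round trip, at the cost of output that generic GraphML consumers would not understand without the extension.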
Might either of you be able to provide some guidance about the requirements and expectations?
It's a good question - to be honest I never use segmentation from CollateX, as I prefer to join segments later using my own rules, so I never thought about how segments ought to be represented. I'm not sure who else actually uses the GraphML output.
Concerning persistence of token properties: I think it would be a great idea to extend the schema to allow for arbitrary JSON-like structures in tokens. (And then as long as you're extending the schema, I suppose there could also be support for tokens-within-tokens for the segmentation.) That would actually make the GraphML output useful to me again.
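One lightweight way to carry arbitrary JSON-like token structures, sketched here purely as an assumption rather than an agreed format, is to serialize the token list as a JSON string inside an ordinary string-valued `<data>` element. The key name `tokens`, the function names, and the sample `lemma`, `pos`, and `parts` properties are all hypothetical.

```python
import json
import xml.etree.ElementTree as ET

def tokens_to_json_data(tokens, key="tokens"):
    """Embed the full token list as a JSON string in a <data> element."""
    data = ET.Element("data", key=key)
    data.text = json.dumps(tokens, ensure_ascii=False)
    return data

def json_data_to_tokens(data):
    """Parse the JSON payload back into Python structures."""
    return json.loads(data.text)

# Nested properties, including tokens-within-tokens, survive unchanged.
tokens = [
    {"t": "This ", "n": "this", "lemma": "this", "pos": ["DET"]},
    {"t": "morning ", "n": "morning", "parts": [{"t": "morn"}, {"t": "ing "}]},
]
serialized = ET.tostring(tokens_to_json_data(tokens))
restored = json_data_to_tokens(ET.fromstring(serialized))
```

This keeps the file valid against the unextended GraphML schema, but the token structure is opaque to tools that only read GraphML, so it trades interoperability for simplicity.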