It often happens in automated collation that very common / frequent tokens, e.g. punctuation or words like 'and' or 'the', get matched a little too eagerly by the algorithm so that more substantive tokens are misaligned. Moreover, the set of tokens that cause this problem will vary according to language / text type / etc.
At the moment I am dealing with this by assigning random character strings to the `n` field of the JSON object for these tokens, so that CollateX won't match them with anything else. This works, but leads to a bunch of duplicated tokens in the output, which I then clean up with a graph search algorithm.
Since what I am doing in post-processing looks and smells an awful lot like collation, it seems like something CollateX should be able to handle internally: match the 'substantive' tokens on a first pass, and the non-substantive ones on a second pass, relative to the alignment that has already been done. The easiest way of specifying these 'unimportant' tokens might be a regular expression, since (as mentioned) they will vary from text to text.
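For reference, here is a minimal sketch of the workaround described above, in Python. The `t`/`n` token fields follow CollateX's documented JSON input format; the regex for 'unimportant' tokens and the helper names are illustrative assumptions and would need tuning per text:

```python
import random
import re
import string

# Illustrative pattern for "unimportant" tokens: punctuation plus a few
# English function words. In practice this would vary by language/text type.
UNIMPORTANT = re.compile(r"^(?:[\W_]+|and|the|of)$", re.IGNORECASE)

def random_n(length=12):
    """Return a random normalization string, vanishingly unlikely
    to match the 'n' value of any token in another witness."""
    return "".join(random.choices(string.ascii_lowercase, k=length))

def make_witness(witness_id, text):
    """Build a CollateX JSON witness in which unimportant tokens get a
    random 'n' value, so the aligner cannot match them across witnesses."""
    tokens = []
    for t in text.split():
        n = random_n() if UNIMPORTANT.match(t) else t.lower()
        tokens.append({"t": t, "n": n})
    return {"id": witness_id, "tokens": tokens}

witness = make_witness("A", "the cat and the dog")
# 'cat' and 'dog' keep their normalized forms and can align as usual;
# 'the' and 'and' get random 'n' strings and will never be matched.
```

The resulting witness objects go into the usual `{"witnesses": [...]}` input structure. The downside, as noted, is that each suppressed token ends up unaligned in the output and has to be re-merged afterwards.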