It often happens in automated collation that very common / frequent tokens, e.g. punctuation or words like 'and' or 'the', get matched a little too eagerly by the algorithm so that more substantive tokens are misaligned. Moreover, the set of tokens that cause this problem will vary according to language / text type / etc.
At the moment I am dealing with this by assigning random character strings to the `n` field of the JSON object for these tokens, so that CollateX won't match them with anything else. This works, but leads to a bunch of duplicated tokens in the output, which I then clean up with a graph search algorithm.
Since what I am doing in post-processing looks and smells an awful lot like collation, it seems like something CollateX should be able to handle internally: match the 'substantive' tokens on a first pass, and the non-substantive ones on a second pass, relative to the alignment that has already been done. The easiest way of specifying these 'unimportant' tokens might be a regular expression, since (as mentioned) they will vary from text to text.
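For reference, here is a minimal sketch of the workaround described above, in Python. The `t`/`n` token fields follow CollateX's documented JSON input format; the regex for 'unimportant' tokens and the helper names are illustrative assumptions and would need tuning per text:

```python
import random
import re
import string

# Illustrative pattern for "unimportant" tokens: punctuation plus a few
# English function words. In practice this would vary by language/text type.
UNIMPORTANT = re.compile(r"^(?:[\W_]+|and|the|of)$", re.IGNORECASE)

def random_n(length=12):
    """Return a random normalization string, vanishingly unlikely
    to match the 'n' value of any token in another witness."""
    return "".join(random.choices(string.ascii_lowercase, k=length))

def make_witness(witness_id, text):
    """Build a CollateX JSON witness in which unimportant tokens get a
    random 'n' value, so the aligner cannot match them across witnesses."""
    tokens = []
    for t in text.split():
        n = random_n() if UNIMPORTANT.match(t) else t.lower()
        tokens.append({"t": t, "n": n})
    return {"id": witness_id, "tokens": tokens}

witness = make_witness("A", "the cat and the dog")
# 'cat' and 'dog' keep their normalized forms and can align as usual;
# 'the' and 'and' get random 'n' strings and will never be matched.
```

The resulting witness objects go into the usual `{"witnesses": [...]}` input structure. The downside, as noted, is that each suppressed token ends up unaligned in the output and has to be re-merged afterwards.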