
Allow to pass pre-tokenized Texts to HeidelTime #89

Open
2 tasks
narnold-cl opened this issue Nov 26, 2021 · 0 comments
Hello,

thank you very much for your work. We are using HeidelTime in a dynamic setting and have run into several problems. We will list them here as issues, together with suggested design changes that should address them. Most should be straightforward to implement for someone familiar with the project.

Is this software still under active development? If not, would you mind translating these high-level propositions to a lower level and pointing out which parts of the implementation would need to change?

Standalone's dependencies

Regarding the standalone version: as far as I understand, HeidelTime needs tokenized text to work, but it does not accept pre-tokenized text as input. Instead, it contains hard-coded dependencies on external taggers (for tokenization as well as for POS tagging), which need to be installed separately.

This has several disadvantages:

  • Tokenization gets out of sync if you don't use the exact same tokenizer (and even then you have to run the tokenizer twice)
  • The internally produced tokens are discarded, as the TimeML version in use does not support explicit token tags
  • Hard-coded dependencies (use those specific tokenizers/taggers or none at all)
  • The tool is not actually standalone
  • Generating the TimeML for a single text file currently involves loading a large language model for tokenization/POS tagging; tagging another file repeats the whole procedure

Especially in dynamic contexts, this introduces a huge cost that could easily be avoided.

Parsing tokenized text is quite simple; for example, it could be given in a "one token per line" format, or something similar to CoNLL. It should not be much harder to also accept already POS-tagged text, getting rid of the hard-coded external dependencies entirely, without necessarily reducing performance. An example of such an input is shown below.
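To illustrate, a pre-tokenized and optionally POS-tagged input could look like the following. The exact column layout here is only an assumption for illustration: one token per line, an optional tab-separated POS column, and blank lines as sentence boundaries.

```
Heidelberg	NNP
was	VBD
founded	VBN
in	IN
1386	CD
.	.

The	DT
university	NN
still	RB
exists	VBZ
.	.
```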

Solution:

  • Provide a way to parse pre-tokenized texts instead of invoking an external tokenizer internally (a minimal parsing sketch follows this list)
  • Add a CLI option to define the input data format (raw / pre-tokenized / POS-tagged (CoNLL))
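As a minimal sketch, assuming the "one token per line" layout shown above, such input could be read roughly like this. Class and method names are illustrative and not part of HeidelTime's actual code base.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch of a reader for "one token per line" input with an
 * optional tab-separated POS column (CoNLL-like). Names are illustrative,
 * not part of HeidelTime's actual API.
 */
public class PretokenizedReader {

    public static final class Token {
        public final String form;
        public final String pos; // null if the input is only tokenized
        public Token(String form, String pos) { this.form = form; this.pos = pos; }
    }

    /** Blank lines mark sentence boundaries; every other line is "token[\tPOS]". */
    public static List<List<Token>> parse(String text) {
        List<List<Token>> sentences = new ArrayList<>();
        List<Token> current = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            if (line.trim().isEmpty()) {          // sentence boundary
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
                continue;
            }
            String[] cols = line.split("\t");
            current.add(new Token(cols[0], cols.length > 1 ? cols[1] : null));
        }
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }
}
```

If something like the proposed format option were added, the standalone runner could hand these token and POS lists to HeidelTime directly and skip loading the external tokenizer/tagger models entirely.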