
Allow to pass pre-tokenized Texts to HeidelTime #89

Open
2 tasks
narnold-cl opened this issue Nov 26, 2021 · 0 comments
Hello,

thank you very much for your work. We are using HeidelTime in a dynamic setting and have run into several problems. We will list them here as issues, together with suggested design changes that should address them. Most should be straightforward to implement for someone familiar with the project.

Is this software still under active development? If not, would you mind translating these high-level propositions to a lower level and pointing out which parts of the implementation would need to change?

Standalone's dependencies

Regarding the standalone version: as far as I understand, HeidelTime needs tokenized text to work, but it does not accept pre-tokenized text as input. Instead, it contains hard-coded dependencies on external taggers (for tokenization as well as for POS tagging), which need to be installed separately.

This has several disadvantages:

  • Tokenization gets out of sync if you don't use the exact same tokenizer (and even then you have to run the tokenizer twice)
  • The internally produced tokens are discarded, as the TimeML version in use does not support explicit token tags
  • Hard-coded dependencies (use those specific tokenizers/taggers or none at all)
  • The tool is not actually standalone
  • Generating the TimeML for a single text file currently involves loading a large language model for tokenization/POS tagging; tagging another file repeats the whole procedure

Especially in dynamic contexts, this introduces a huge cost that could easily be avoided.

Parsing tokenized text is quite simple; for example, it could be given in a "one token per line" format, or something similar to CoNLL. It should not be much harder to also accept already POS-tagged text, getting rid of the hard-coded external dependencies entirely, without necessarily reducing performance. An example of such an input is shown below.
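To illustrate, a pre-tokenized and optionally POS-tagged input could look like the following. The exact column layout here is only an assumption for illustration: one token per line, an optional tab-separated POS column, and blank lines as sentence boundaries.

```
Heidelberg	NNP
was	VBD
founded	VBN
in	IN
1386	CD
.	.

The	DT
university	NN
still	RB
exists	VBZ
.	.
```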

Solution:

  • Provide a way to parse pre-tokenized texts instead of invoking an external tokenizer internally (a minimal parsing sketch follows this list)
  • Add a CLI option to define the input data format (raw / pre-tokenized / POS-tagged (CoNLL))
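As a minimal sketch, assuming the "one token per line" layout shown above, such input could be read roughly like this. Class and method names are illustrative and not part of HeidelTime's actual code base.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch of a reader for "one token per line" input with an
 * optional tab-separated POS column (CoNLL-like). Names are illustrative,
 * not part of HeidelTime's actual API.
 */
public class PretokenizedReader {

    public static final class Token {
        public final String form;
        public final String pos; // null if the input is only tokenized
        public Token(String form, String pos) { this.form = form; this.pos = pos; }
    }

    /** Blank lines mark sentence boundaries; every other line is "token[\tPOS]". */
    public static List<List<Token>> parse(String text) {
        List<List<Token>> sentences = new ArrayList<>();
        List<Token> current = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            if (line.trim().isEmpty()) {          // sentence boundary
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
                continue;
            }
            String[] cols = line.split("\t");
            current.add(new Token(cols[0], cols.length > 1 ? cols[1] : null));
        }
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }
}
```

If something like the proposed format option were added, the standalone runner could hand these token and POS lists to HeidelTime directly and skip loading the external tokenizer/tagger models entirely.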