Custom model with pretokenized input including multiword #56
Hello!
I'm trying to train a custom parser with trankit, using pretokenized input extracted from CoNLL-U files.
I may not be doing this the intended way, but I ran into bugs for French (multiword tokens) and for Chinese ("KeyError UD-Japanese-Like" when I parse my test file right after training finishes), so I modified the source code to fix them. I also changed the path of the xlm_roberta model in file_utils.py so that it is downloaded only once when training multiple models of the same type, such as 'customized'. The file train_pred_trainkit.py is an example of how to apply these modifications, especially the function `pred_trankit`.
I hope this is helpful, and thanks a lot for developing trankit!
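
For readers unfamiliar with the multiword issue, here is a minimal sketch of extracting pretokenized sentences from CoNLL-U text. It is not the code from this PR: the function name `read_pretokenized` and the sample sentence are hypothetical, and the choice to keep the syntactic words while skipping the multiword range lines (IDs such as `3-4`) is an assumption about how the input is prepared.

```python
def read_pretokenized(conllu_text):
    """Return a list of sentences, each a list of word forms.

    Hypothetical helper for illustration only. It keeps syntactic words
    and skips multiword range lines (ID like "3-4") and empty nodes
    (ID like "5.1"), as defined by the CoNLL-U format.
    """
    sentences, current = [], []
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comment, e.g. "# text = ..."
            continue
        cols = line.split("\t")
        if len(cols) < 2:
            continue
        token_id, form = cols[0], cols[1]
        if "-" in token_id or "." in token_id:
            # Multiword token range or empty node: not a syntactic word.
            continue
        current.append(form)
    if current:
        sentences.append(current)
    return sentences


# Made-up French example: "du" is a multiword token covering "de" + "le".
SAMPLE = "\n".join([
    "# text = Il vient du parc",
    "1\tIl\til\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tvient\tvenir\tVERB\t_\t_\t0\troot\t_\t_",
    "3-4\tdu\t_\t_\t_\t_\t_\t_\t_\t_",
    "3\tde\tde\tADP\t_\t_\t5\tcase\t_\t_",
    "4\tle\tle\tDET\t_\t_\t5\tdet\t_\t_",
    "5\tparc\tparc\tNOUN\t_\t_\t2\tobl\t_\t_",
    "",
])

if __name__ == "__main__":
    print(read_pretokenized(SAMPLE))
```

If the range line were not skipped, "du" would appear alongside "de" and "le", and the token count would no longer match the gold annotation.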