Custom model with pretokenized input including multiword #56

ziqianPeng · 2022-07-30T23:10:24Z

Hello!
I'm trying to train a custom parser with trankit, using pretokenized input extracted from CoNLL-U files.

I may not be using the intended workflow, but with my setup I ran into bugs for French (multiword tokens) and Chinese (a "KeyError UD-Japanese-Like" when parsing my test file right after training finished), so I modified the source code to fix them. I also changed the path of the xlm_roberta model in file_utils.py so that it is downloaded only once when training multiple models of the same type, such as 'customized'.
The file train_pred_trainkit.py is an example of how to apply these modifications, in particular the function pred_trankit.
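
For context, here is a minimal sketch of what "pretokenized input extracted from CoNLL-U files" can look like, assuming the standard 10-column CoNLL-U layout. The helper name `read_pretokenized` and the sample sentence are hypothetical and not part of trankit or of this PR. The sketch keeps only syntactic words and skips multiword token ranges (IDs like `3-4`) and empty nodes (IDs like `1.1`), which is one possible choice and may differ from what pred_trankit does.

```python
from typing import Iterable, List


def read_pretokenized(lines: Iterable[str]) -> List[List[str]]:
    """Collect the FORM column of each sentence in CoNLL-U formatted lines.

    Hypothetical helper for illustration only (not a trankit function).
    """
    sentences: List[List[str]] = []
    current: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ..." carry no tokens.
            continue
        cols = line.split("\t")
        token_id = cols[0]
        # Assumption: skip multiword ranges ("3-4") and empty nodes ("1.1"),
        # so only the syntactic words remain.
        if "-" in token_id or "." in token_id:
            continue
        current.append(cols[1])
    if current:
        # Handle a file that does not end with a blank line.
        sentences.append(current)
    return sentences


# French example: the surface token "au" (3-4) expands to the words "à" + "le".
SAMPLE = [
    "# text = Je vais au parc",
    "1\tJe\tje\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tvais\taller\tVERB\t_\t_\t0\troot\t_\t_",
    "3-4\tau\t_\t_\t_\t_\t_\t_\t_\t_",
    "3\tà\tà\tADP\t_\t_\t5\tcase\t_\t_",
    "4\tle\tle\tDET\t_\t_\t5\tdet\t_\t_",
    "5\tparc\tparc\tNOUN\t_\t_\t2\tobl\t_\t_",
    "",
]

if __name__ == "__main__":
    print(read_pretokenized(SAMPLE))
```

The result is a list of sentences, each a list of word strings. Whether the multiword surface form or its component words should be passed on is exactly the kind of decision that matters for French.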

I hope this is helpful, and thanks a lot for developing trankit!

ziqianPeng added 2 commits July 31, 2022 00:28

adapte trankit for custom model

a6b21d4

example

82b1135
