Tokenizer issue when using an English dataset (CoNLL2003) #12
Comments
I'll add handling for this part when I have time; I hadn't considered the wordpiece issue with English datasets before.
Hi, could you share your approach for handling wordpiece in English datasets? I'd like to try it myself. Thanks!
@Daniel19960601 Remove that line, then align the labels with the sub-tokens.
For this problem you can refer to this post
@brennenhuang Using -1 doesn't seem to work; when eval maps the predictions back to labels, -1 has no key in the label map.
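The label-alignment advice above can be sketched as follows. This is a minimal illustration, not code from this repo: `toy_tokenize` is a stand-in for BERT's wordpiece tokenizer, and the continuation label `"X"` is one common convention (it is a real entry in the label map, which avoids the missing-key problem at eval that `-1` causes; masking continuation positions out of the loss is an alternative).

```python
def align_labels(words, labels, tokenize):
    """Align word-level NER labels with wordpiece sub-tokens.

    The first sub-token of each word keeps the original label;
    continuation sub-tokens get the placeholder label "X", so
    len(ntokens) == len(nlabels) always holds.
    """
    ntokens, nlabels = [], []
    for word, label in zip(words, labels):
        pieces = tokenize(word)
        ntokens.extend(pieces)
        # first piece keeps the label, continuation pieces get "X"
        nlabels.extend([label] + ["X"] * (len(pieces) - 1))
    return ntokens, nlabels


def toy_tokenize(word):
    # stand-in for BERT wordpiece: splits "Johanson" like bert-base-cased might
    if word == "Johanson":
        return ["Johan", "##son"]
    return [word]


words = ["Johanson", "lives", "in", "Oslo"]
labels = ["B-PER", "O", "O", "B-LOC"]
ntokens, nlabels = align_labels(words, labels, toy_tokenize)
# ntokens: ["Johan", "##son", "lives", "in", "Oslo"]
# nlabels: ["B-PER", "X", "O", "O", "B-LOC"]
```

Because the label sequence now has the same length as the sub-token sequence, the assertion `len(ori_tokens) == len(ntokens)` in the repo can be replaced by a check against the aligned labels instead.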
@reBiocoder |
Hi, thanks for replying under my question "What should I do when wordPiece splitting occurs?".
My description was unclear; the problem I ran into is this:
The dataset is CoNLL2003 and the BERT model is bert-base-cased. At runtime the following error occurs:
As you can see, the tokenizer split a word into sub-tokens, so `assert len(ori_tokens) == len(ntokens)` fails. How can this be resolved? Thank you.