Tokenizer problem when using an English dataset (CoNLL-2003) #12

Daniel19960601 · 2021-02-25T02:10:09Z

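Hello, and thank you for replying to my earlier question, "What should I do when WordPiece tokens appear?"
My description there was not clear enough. The problem I am running into is this:
The dataset is CoNLL-2003 and the BERT model is bert-base-cased. At runtime I get the following error: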

File "D:\python-workspace\BERT-BiLSTM-CRF-NER-pytorch-master\utils.py", line 162, in convert_examples_to_features
assert len(ori_tokens) == len(ntokens), f"{len(ori_tokens)}, {len(ntokens)}, {ori_tokens}, {ntokens}"
AssertionError: 3, 8, ['[CLS]', '-DOCSTART-', '[SEP]'], ['[CLS]', '-', 'do', '##cs', '##tar', '##t', '-', '[SEP]']

As you can see, the tokenizer splits the word into subwords, so the check assert len(ori_tokens) == len(ntokens) fails. How can I fix this? Thank you.

hertz-pj · 2021-03-04T06:38:26Z

I'll add handling for this when I have time. I hadn't considered the WordPiece issue with English datasets before.

Daniel19960601 · 2021-03-28T07:57:50Z

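> I'll add handling for this when I have time. I hadn't considered the WordPiece issue with English datasets before.

Hello, could you outline the approach for handling WordPiece on English datasets? I'd like to try implementing it myself. Thanks!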

zingxy · 2021-10-01T01:23:35Z

@Daniel19960601 Remove that line, then align the labels with the tokens.

brennenhuang · 2022-06-20T01:59:39Z

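For this problem, see prepro.py in this repository:
https://github.com/wzhouad/NLL-IE
Specifically, it first filters out -DOCSTART-, which marks the start of a document, and then uses truecase to restore proper English capitalization for the first sentence of each document.
Every sentence is then split with WordPiece. Note that if a word is split into three subwords, only the first one is mapped to the word's label; the remaining subwords are assigned label -1 (they do not contribute to the loss).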
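
The alignment described above can be sketched roughly as follows. This is a minimal illustration, not the code from prepro.py: `align_labels`, `toy_tokenize`, and the label ids are hypothetical names chosen here, and `toy_tokenize` is only a stand-in for a real WordPiece tokenizer. Adding [CLS]/[SEP] and the truecase step are left out.

```python
from typing import Callable, List, Tuple

IGNORE_ID = -1  # label id given to every non-first subword (assumption: -1, as in the comment above)


def align_labels(
    words: List[str],
    labels: List[int],
    tokenize: Callable[[str], List[str]],
) -> Tuple[List[str], List[int]]:
    """Split each word into subwords; only the first subword keeps the word's label."""
    assert len(words) == len(labels)
    tokens: List[str] = []
    label_ids: List[int] = []
    for word, label in zip(words, labels):
        pieces = tokenize(word)
        if not pieces:  # the tokenizer returned nothing for this word, so skip it
            continue
        tokens.extend(pieces)
        label_ids.extend([label] + [IGNORE_ID] * (len(pieces) - 1))
    return tokens, label_ids


def toy_tokenize(word: str) -> List[str]:
    """Hypothetical stand-in for a WordPiece tokenizer, for illustration only."""
    table = {"Taipei": ["Tai", "##pei"]}
    return table.get(word, [word])


# Hypothetical label ids: 0 = no entity, 1 = PLACE.
sentences = [
    (["-DOCSTART-"], [0]),
    (["I", "live", "in", "Taipei"], [0, 0, 0, 1]),
]

# Drop the -DOCSTART- marker lines before tokenizing.
kept = [(w, l) for w, l in sentences if w != ["-DOCSTART-"]]

tokens, label_ids = align_labels(kept[0][0], kept[0][1], toy_tokenize)
print(tokens)
print(label_ids)
```

With a softmax classification head, positions labeled -1 can be skipped by passing `ignore_index=-1` to `torch.nn.CrossEntropyLoss`. A CRF layer would need a mask instead, which this sketch does not cover.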

reBiocoder · 2022-08-11T03:47:54Z

@brennenhuang Using -1 doesn't seem to work. When mapping predictions back to labels during eval, -1 has no key.

brennenhuang · 2022-08-22T02:29:16Z

@reBiocoder
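For example, at eval time:
I live in Taipei
After the tokenizer:
I live in Tai ##pei
Converted to IDs:
20 30 10 21 23
Predictions:
None None None PLACE X
In other words, the prediction for ##pei is ignored when producing the output.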
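
One way to read this is that -1 is never looked up in the label map, because those positions are filtered out before decoding. A minimal sketch under that assumption, where `decode_predictions`, `id2label`, and the ids are hypothetical names used only for illustration:

```python
from typing import Dict, List

IGNORE_ID = -1  # marks non-first subwords in the aligned label ids


def decode_predictions(
    pred_ids: List[int],
    label_ids: List[int],
    id2label: Dict[int, str],
) -> List[str]:
    """Return one predicted label per original word, skipping ignored subword positions."""
    return [
        id2label[pred]
        for pred, gold in zip(pred_ids, label_ids)
        if gold != IGNORE_ID
    ]


id2label = {0: "None", 1: "PLACE"}  # hypothetical label map; it has no entry for -1

# Tokens: I live in Tai ##pei
label_ids = [0, 0, 0, 1, IGNORE_ID]  # aligned labels; -1 marks ##pei
pred_ids = [0, 0, 0, 1, 0]           # the model still predicts something for ##pei

word_labels = decode_predictions(pred_ids, label_ids, id2label)
print(word_labels)
```

The aligned label ids act as a mask here, so the output has one label per original word and the value predicted for ##pei never reaches `id2label`.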
