ValueError: cannot copy sequence with size 37 to array axis with dimension 36 #3

Open
tianke0711 opened this issue Jun 13, 2021 · 11 comments

Comments

@tianke0711

tianke0711 commented Jun 13, 2021

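Hello. After I switched the data to BIEOS tags, the test data has no labels, so I gave every character a temporary O tag. When I then ran the model, the following error appeared. Any advice would be appreciated!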

File "/NER/CLUENER2020/BERT-LSTM-CRF/train.py", line 83, in evaluate
    for idx, batch_samples in enumerate(dev_loader):
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 560, in __next__
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "NER/CLUENER2020/BERT-LSTM-CRF/data_loader.py", line 97, in collate_fn
    batch_labels[j][:cur_tags_len] = labels[j]
@whyalwaysonline

I ran into the same problem. For example, 36 is the size of the first sample in the batch, and the error is raised as soon as a later sample is longer than 36. I don't know how to fix it; if anyone has an idea, please email [email protected]
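
The failure mode described here can be reproduced in isolation. The following is a minimal sketch, not the repository's collate_fn: pad_labels and pad_id are hypothetical names, and it assumes the label matrix is allocated with a fixed width before each sample's labels are copied into its row. A sample with 37 labels cannot fit into a row of width 36, so the sketch raises an error that names the offending sample.

```python
def pad_labels(labels, width, pad_id=-1):
    """Pad every label sequence to `width`; fail loudly if one is too long.

    Hypothetical helper for illustration; not part of the repository.
    """
    padded = []
    for j, tags in enumerate(labels):
        if len(tags) > width:
            raise ValueError(
                f"sample {j} has {len(tags)} labels but the row width is {width}"
            )
        # Fill the remainder of the row with the padding id.
        padded.append(list(tags) + [pad_id] * (width - len(tags)))
    return padded


# A batch whose second sample has one label more than the allocated width.
batch = [[0] * 36, [0] * 37]
try:
    pad_labels(batch, width=36)
except ValueError as err:
    print(err)
```

Running a check like this over the dev or test set should point at the samples whose label count exceeds the width derived from the tokenized text.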

@tianke0711
Author

@whyalwaysonline Not solved yet. I have given up on it for now.

@hemingkx
Owner

Sorry, I have been rather busy these past two days. I will look into this issue next week.

@tianke0711
Author

@hemingkx Thank you, much appreciated!

@whyalwaysonline

sentences.append((self.tokenizer.convert_tokens_to_ids(words), token_start_idxs))
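For everyone's reference: the problem seems to lie in self.tokenizer.convert_tokens_to_ids(words) in the statement above. In my tests, for sentences that do not raise the error, the size of this element is larger than that of token_start_idxs, whereas for sentences that do raise it the size is smaller, which causes the later size mismatch.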

@whyalwaysonline

Found the problem: when the data contains English words such as “Air Jordan”, the space is dropped during tokenization, which causes the size mismatch.
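
One way to check whether a dataset is affected is to scan it before training. The sketch below is only an illustration under stated assumptions: each sample is a (characters, labels) pair with one label per character, and the tokenizer emits no token for a whitespace character, as reported in this comment. find_whitespace_samples is a hypothetical helper, not part of the repository.

```python
def find_whitespace_samples(samples):
    """Return (sample_index, char_positions) for samples containing whitespace.

    Assumes each sample is a (chars, labels) pair with one label per character.
    """
    hits = []
    for i, (chars, labels) in enumerate(samples):
        if len(chars) != len(labels):
            raise ValueError(f"sample {i}: {len(chars)} chars vs {len(labels)} labels")
        # Whitespace characters are the ones assumed to vanish during tokenization.
        positions = [k for k, ch in enumerate(chars) if ch.isspace()]
        if positions:
            hits.append((i, positions))
    return hits


samples = [
    (list("耐克鞋"), ["O", "O", "O"]),
    (list("Air Jordan"), ["O"] * 10),
]
print(find_whitespace_samples(samples))  # [(1, [3])]
```

Every sample listed by such a scan would end up with fewer tokens than labels if the assumption about the tokenizer holds.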

@chenslcool

How should this be fixed, then?

@chenslcool

Solved: just remove the spaces from the data.
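
If spaces are removed from the text, the label attached to each removed space presumably has to go with it, otherwise the labels outnumber the tokens again. A minimal sketch, assuming one label per character; drop_spaces is a hypothetical helper, not the repository's code.

```python
def drop_spaces(chars, labels):
    """Remove whitespace characters together with their labels."""
    if len(chars) != len(labels):
        raise ValueError("chars and labels must have the same length")
    # Keep only the (character, label) pairs whose character is not whitespace.
    kept = [(ch, tag) for ch, tag in zip(chars, labels) if not ch.isspace()]
    new_chars = [ch for ch, _ in kept]
    new_labels = [tag for _, tag in kept]
    return new_chars, new_labels


chars, labels = drop_spaces(list("Air Jordan"), ["O"] * 10)
print("".join(chars), len(chars), len(labels))  # AirJordan 9 9
```

This is only safe when the dropped space does not carry a boundary tag; if a space were tagged B, E or S, the entity span would need to be re-tagged.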

@mzx987654

mzx987654 commented Apr 22, 2022

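@whyalwaysonline
> Found the problem: when the data contains English words such as “Air Jordan”, the space is dropped during tokenization, which causes the size mismatch.

What should I do if the problem persists even after removing the spaces?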

@chenslcool

chenslcool commented Apr 22, 2022 via email

@chernzheng

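> What should I do if the problem persists even after removing the spaces?

The simplest approach is to replace each space with an underscore “_”. Removing only the spaces without also removing their labels leaves the text and the labels misaligned, which leads to this kind of mismatch error. My training data also mixes Chinese and English; replacing spaces with underscores fixed it, and the final model performed very well.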
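
The underscore replacement can be sketched as a small preprocessing step. This is an illustration only: replace_spaces is a hypothetical helper, the tags are made-up BIEOS-style examples, and it assumes the tokenizer keeps the placeholder as a token of its own (a placeholder missing from the vocabulary would typically map to the unknown token, which still preserves the length).

```python
def replace_spaces(chars, placeholder="_"):
    """Replace each whitespace character with `placeholder`, keeping the length."""
    return [placeholder if ch.isspace() else ch for ch in chars]


chars = list("Air Jordan")
# Illustrative tags only: one tag per character, including the space.
labels = ["B", "I", "I", "I", "I", "I", "I", "I", "I", "E"]

new_chars = replace_spaces(chars)
assert len(new_chars) == len(labels)  # labels stay aligned, nothing to re-tag
print("".join(new_chars))  # Air_Jordan
```

Because the sequence length is unchanged, the existing labels can be reused as they are, which is what makes this simpler than deleting spaces and their labels.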
