*.prp files contain ^M artifacts which break model.setup() #14

Open
ypuzikov opened this issue Feb 17, 2018 · 1 comment

@ypuzikov

Observed on LDC2014T12 data instances:

  • train_380
  • train_961
  • train_995
  • train_1442

Preprocessing produces a *.prp file containing the annotations generated by the Stanford CoreNLP tool. In all of the cases above, I noticed ^M characters in the middle of the CoreNLP output, like so:

[Text=currently CharacterOffsetBegin=0 ... ]^M
[Text=america CharacterOffsetBegin=10 ... ]^M
^M
[Text=is CharacterOffsetBegin=18 ... ]^M
...

Not sure why this happens -- maybe CoreNLP does not process multi-sentence instances correctly? In any case, reporting for those who might wonder what is going on.

I solved it by manually deleting the dangling ^M part from the *.prp file.
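
For anyone who would rather not edit the files by hand, here is a minimal sketch of the same cleanup in Python. The function name `strip_carriage_returns` is hypothetical and not part of this project. It assumes that removing every carriage-return byte is acceptable for a *.prp file. Note that a line that held only ^M becomes an empty line; whether model.setup() then needs that empty line deleted as well is not something I can confirm.

```python
from pathlib import Path


def strip_carriage_returns(path):
    """Remove every carriage-return byte (0x0D, shown as ^M) from a file.

    Hypothetical helper, not part of the project. Returns the number of
    bytes removed, so 0 means the file was not affected.
    """
    p = Path(path)
    data = p.read_bytes()
    cleaned = data.replace(b"\r", b"")
    removed = len(data) - len(cleaned)
    if removed:
        # Rewrite the file only when something actually changed.
        p.write_bytes(cleaned)
    return removed
```

Running it over each affected *.prp file and checking for a non-zero return value also shows which instances contained the artifact in the first place.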

@Juicechuan
Member

That is a carriage-return character, commonly produced by Windows line endings. The wrapper for Stanford CoreNLP inserted these when writing the processed file. However, corenlp.py should still process the file correctly.
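
As a rough illustration of how a reader can tolerate these characters, here is a sketch that relies on Python's universal-newlines mode. This is not the actual code in corenlp.py; the function name `read_prp_lines` and the UTF-8 encoding are assumptions made for the example.

```python
def read_prp_lines(path):
    """Yield the lines of a file without their line terminators.

    Hypothetical helper. In text mode with the default newline=None,
    Python translates "\r\n" and a lone "\r" to "\n", so no ^M survives.
    """
    with open(path, encoding="utf-8") as fh:  # encoding is an assumption
        for line in fh:
            yield line.rstrip("\n")
```

With this approach a line that contained only ^M comes back as an empty string, so a parser that treats empty lines specially would still need to account for it.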
