*.prp files contain ^M artifacts which break model.setup() #14

ypuzikov · 2018-02-17T16:32:32Z

Observed on LDC2014T12 data instances:

train_380
train_961
train_995
train_1442

After preprocessing, there is a *.prp file containing the annotations produced by the Stanford CoreNLP tool. In all of the cases above, a ^M appears in the middle of the CoreNLP output, like so:

[Text=currently CharacterOffsetBegin=0 ... ]^M
[Text=america CharacterOffsetBegin=10 ... ]^M
^M
[Text=is CharacterOffsetBegin=18 ... ]^M
...

Not sure why this happens -- maybe CoreNLP does not process multi-sentence instances correctly? In any case, reporting for those who might wonder what is going on.

I solved it by manually deleting the dangling ^M part from the *.prp file.
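
For anyone who would rather not edit the files by hand, here is a minimal sketch of the same cleanup in Python. It is not part of the parser: the helper name `strip_carriage_returns` is made up for illustration, and it assumes that every carriage return (byte 0x0D, displayed as ^M) in the file is unwanted.

```python
def strip_carriage_returns(path):
    """Rewrite the file at `path` with every carriage return (0x0D) removed.

    Hypothetical helper, not part of the parser. Returns the number of
    bytes that were removed.
    """
    # Work on raw bytes so that no newline translation happens on read or write.
    with open(path, "rb") as f:
        data = f.read()
    cleaned = data.replace(b"\r", b"")
    # Only touch the file when something actually changed.
    if cleaned != data:
        with open(path, "wb") as f:
            f.write(cleaned)
    return len(data) - len(cleaned)
```

Note that a line consisting only of ^M becomes an empty line after this. Whether model.setup() accepts that empty line is not something I have verified, so keep a backup of the original *.prp file.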

Juicechuan · 2018-02-18T20:19:49Z

That is a carriage-return character, typically produced on Windows. The wrapper for Stanford CoreNLP inserted these when writing out the processed file. However, corenlp.py should still process the file correctly.
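
To check which lines of a file are affected, a small sketch like the following may help. The function name and the sample bytes are illustrative only; the sample mirrors the shortened excerpt from the report, not a real *.prp file.

```python
def lines_with_carriage_return(data):
    # Split on LF only, so a CR stays attached to the line it belongs to.
    # Returns 1-based line numbers of lines that contain a CR byte.
    lines = data.split(b"\n")
    return [i for i, line in enumerate(lines, start=1) if b"\r" in line]


# Illustrative sample: three annotation lines ending in CR LF,
# plus one line that holds nothing but a CR.
sample = b"[Text=currently ...]\r\n[Text=america ...]\r\n\r\n[Text=is ...]\r\n"
print(lines_with_carriage_return(sample))  # prints [1, 2, 3, 4]
```

Line 3 of the sample holds only the carriage return, so it looks blank in an editor but is not an empty string once read as bytes. That may be why a blank-line check trips over it, though this is a guess and not confirmed against corenlp.py.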
