Just wanted to point out that a good number of highlights in SCAT sentences appear to be malformed, most likely because the hon/hoff tags were inserted sequentially without accounting for the offset shift introduced by previously inserted tags. Here are just some examples, but there are many more:
highlighted.train.en
row 383: We get up and then <hon><p>it</p> < <hon>hoff><hoff> ends up snowing a foot on us.
row 9907: <hon>They<hoff> didn't have enough mo <hon>ney to su < <hon>hoff<hoff> > pport themselves... so <p>they</p> go and have nine kids.
row 9916: Everybody reads the paper since you <hon <hon>> m<hoff> ade<hoff> <p>it</p> a daily.
row 10497: But that doesn't stop a young <hon>Platecarpus < <hon>hof<hoff> f> ... when <p>it</p> wants a snack.
row 10549: Just imagine how far away from us you'd have to move <hon>the Sun < <hon>hof<hoff> f> to make <p>it</p> appear as small and faint as a star.
row 10967: In cuba, with <hon>people < <hon>hoff><hoff> like me, <p>they</p> always found a reason to hit us.
row 11196: -You mean, when <hon <hon>> i<hoff> t < <hon>ho<hoff> ff> hardens, <p>it</p>-- -lt turns into plastic.
row 11211: And then if <hon>the boys<hoff> do want to farm, or if Laurie marries someone that would like <hon>to farm. <hoff <hon>> ..<hoff> and the boys don't want to... at least <p>they</p> have that college education to fall back on.
row 11225: If you ever get to be astronauts, you're going to thank us for making you wear <hon>these jumpsuits <hon><hoff<hoff> > because <p>they</p> provide ease of movement and additional storage space in orbit.
row 11213: <hon>The weather < <hon>ho<hoff> ff> does as <p>it</p> pleases.
row 11232: She was wearing a <hon>muumu <hon>u <<hoff> hoff <hon>><hoff> , but <p>it</p> had to be sl<p>it</p> so she could f<p>it</p> into <p>it</p>.
highlighted.test.fr
row 750: J'ai fait un calcul rapide et <p>il</p> <hon>éta <hon>it <hoff<hoff> > peu probable qu'<p>il</p> dise un truc important ou qu'<p>il</p> fasse une interview télévisée, donc je ne pensais pas priver ma chaine de grand chose.
row 901: Ravi l'a attrapé, e <hon>t <hoff> noté l'adresse référencée sur le document que vous nous avez donné, et nous pensions qu'<p>il</p> pourrait <hon <hon>> être <hof<hoff> f> intéressant de le souligner.
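To illustrate the suspected cause, here is a minimal sketch of how inserting tags one after another with the original character offsets can produce exactly this kind of corruption. The function names and the span format (`(start, end)` character offsets, non-overlapping) are assumptions made for the example and are not taken from the SCAT code.

```python
def insert_tags_naive(text, spans):
    """Insert tags left to right, reusing the ORIGINAL offsets.

    After the first insertion the text is longer, so later offsets
    point into (or next to) the tags that were already inserted.
    """
    for start, end in spans:
        text = text[:start] + "<hon>" + text[start:end] + "<hoff>" + text[end:]
    return text


def insert_tags_safe(text, spans):
    """Insert tags right to left, so earlier offsets stay valid.

    Assumes the spans do not overlap.
    """
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + "<hon>" + text[start:end] + "<hoff>" + text[end:]
    return text


sentence = "The weather does as it pleases."
spans = [(0, 3), (4, 11)]  # "The" and "weather"

print(insert_tags_naive(sentence, spans))
# <hon<hon>>The<ho<hoff>ff> weather does as it pleases.
print(insert_tags_safe(sentence, spans))
# <hon>The<hoff> <hon>weather<hoff> does as it pleases.
```

The naive output shows the same `<hon<hon>>` and split `<ho<hoff>ff>` patterns as the rows above, which is why I suspect an offset problem, but I have not checked the actual preprocessing script.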
I don't think the amount of corrupted data is enough to significantly affect your results, but it could still be an issue. Would you consider implementing a well-formedness check for the tags and correcting the malformed examples? Thank you in advance!
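A check along these lines might be enough to flag the affected rows. This is only a sketch under my own assumptions: the only tags are `<hon>`, `<hoff>`, `<p>` and `</p>`, tags must be properly nested and closed, and a clean sentence never contains a bare `<` or `>`. The names `is_well_formed` and `malformed_rows` are made up for the example.

```python
import re

# Known tags first, then any leftover angle bracket (assumed to be tag debris).
TAG_OR_BRACKET = re.compile(r"<hon>|<hoff>|<p>|</p>|[<>]")
OPENER_FOR = {"<hoff>": "<hon>", "</p>": "<p>"}


def is_well_formed(line):
    """Return True if all tags in `line` are complete, nested and closed."""
    stack = []
    for match in TAG_OR_BRACKET.finditer(line):
        token = match.group()
        if token in ("<", ">"):
            return False  # stray bracket, e.g. from a split "<ho<hoff>ff>"
        if token in OPENER_FOR:
            # A closing tag must match the most recently opened tag.
            if not stack or stack[-1] != OPENER_FOR[token]:
                return False
            stack.pop()
        else:
            if token in stack:
                return False  # same tag opened twice without closing
            stack.append(token)
    return not stack  # every opened tag was closed


def malformed_rows(lines):
    """Return the 0-based indices of lines that fail the check."""
    return [i for i, line in enumerate(lines) if not is_well_formed(line)]


good = "<hon>The weather<hoff> does as <p>it</p> pleases."
bad = "<hon>The weather < <hon>ho<hoff> ff> does as <p>it</p> pleases."
print(malformed_rows([good, bad]))  # [1]
```

Dividing `len(malformed_rows(lines))` by the number of lines would give a rough prevalence figure. Note that this only catches syntactic damage: a row where the tags are intact but shifted (for example a `<hon>` that starts in the middle of a word) would still pass.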
I didn't work on this part, but do you have statistics on the prevalence of the malformed tags? If it's very small and it doesn't break the code, it probably won't change the results much. Does it break your use case? If so, I would just recommend dropping those samples.