Malformed highlight tags #16

Open
gsarti opened this issue Oct 22, 2021 · 2 comments

gsarti commented Oct 22, 2021

Dear authors,

Just wanted to point out that a good number of highlights in SCAT sentences appear to be malformed, most likely because the hon/hoff tags were inserted sequentially without accounting for the character offset introduced by previously inserted tags. Here are some examples, but there are many more:

highlighted.train.en

  • row 383: We get up and then <hon><p>it</p> < <hon>hoff><hoff> ends up snowing a foot on us.
  • row 9907: <hon>They<hoff> didn't have enough mo <hon>ney to su < <hon>hoff<hoff> > pport themselves... so <p>they</p> go and have nine kids.
  • row 9916: Everybody reads the paper since you <hon <hon>> m<hoff> ade<hoff> <p>it</p> a daily.
  • row 10497: But that doesn't stop a young <hon>Platecarpus < <hon>hof<hoff> f> ... when <p>it</p> wants a snack.
  • row 10549: Just imagine how far away from us you'd have to move <hon>the Sun < <hon>hof<hoff> f> to make <p>it</p> appear as small and faint as a star.
  • row 10967: In cuba, with <hon>people < <hon>hoff><hoff> like me, <p>they</p> always found a reason to hit us.
  • row 11196: -You mean, when <hon <hon>> i<hoff> t < <hon>ho<hoff> ff> hardens, <p>it</p>-- -lt turns into plastic.
  • row 11211: And then if <hon>the boys<hoff> do want to farm, or if Laurie marries someone that would like <hon>to farm. <hoff <hon>> ..<hoff> and the boys don't want to... at least <p>they</p> have that college education to fall back on.
  • row 11225: If you ever get to be astronauts, you're going to thank us for making you wear <hon>these jumpsuits <hon><hoff<hoff> > because <p>they</p> provide ease of movement and additional storage space in orbit.
  • row 11213: <hon>The weather < <hon>ho<hoff> ff> does as <p>it</p> pleases.
  • row 11232: She was wearing a <hon>muumu <hon>u <<hoff> hoff <hon>><hoff> , but <p>it</p> had to be sl<p>it</p> so she could f<p>it</p> into <p>it</p>.

highlighted.test.fr

  • row 750: J'ai fait un calcul rapide et <p>il</p> <hon>éta <hon>it <hoff<hoff> > peu probable qu'<p>il</p> dise un truc important ou qu'<p>il</p> fasse une interview télévisée, donc je ne pensais pas priver ma chaine de grand chose.
  • row 901: Ravi l'a attrapé, e <hon>t <hoff> noté l'adresse référencée sur le document que vous nous avez donné, et nous pensions qu'<p>il</p> pourrait <hon <hon>> être <hof<hoff> f> intéressant de le souligner.
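
To illustrate the suspected cause, here is a minimal sketch of how inserting tags at character offsets computed on the untagged sentence can produce this kind of corruption. This is only a hypothesis about the bug, not the actual SCAT preprocessing code. The function names and the `(start, end)` span format are assumptions made for the example.

```python
def insert_tags_naive(text, spans, open_tag="<hon>", close_tag="<hoff>"):
    """Insert tags left to right, reusing offsets from the ORIGINAL text.

    Hypothetical reconstruction of the bug: after the first insertion the
    remaining offsets are stale, so later tags land inside earlier ones.
    """
    for start, end in spans:
        text = text[:start] + open_tag + text[start:end] + close_tag + text[end:]
    return text


def insert_tags_safe(text, spans, open_tag="<hon>", close_tag="<hoff>"):
    """Insert tags right to left so that pending offsets stay valid.

    Assumes the spans do not overlap.
    """
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + open_tag + text[start:end] + close_tag + text[end:]
    return text


sentence = "The weather does as it pleases."
spans = [(0, 3), (4, 11)]  # "The" and "weather", offsets on the untagged text

print(insert_tags_naive(sentence, spans))
# <hon<hon>>The<ho<hoff>ff> weather does as it pleases.
print(insert_tags_safe(sentence, spans))
# <hon>The<hoff> <hon>weather<hoff> does as it pleases.
```

The naive output has the same shape as the fragments above (a tag opened inside another tag, a split `<hoff>`), which is why stale offsets seem like a plausible explanation.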

I don't think the amount of corrupted data is enough to significantly affect your results, but it may still be an issue. Would you consider implementing a well-formedness check for the tags and correcting the malformed examples? Thank you in advance!
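
A well-formedness check of this kind could be sketched as follows. It assumes that highlights are never nested, that `<hon>`, `<hoff>`, `<p>` and `</p>` are the only tags, and that the sentences themselves contain no literal `<` or `>` characters. That last assumption may not hold for every subtitle line. The function name `is_well_formed` is hypothetical.

```python
import re

HIGHLIGHT_TAG = re.compile(r"<hon>|<hoff>")
PRONOUN_TAG = re.compile(r"</?p>")


def is_well_formed(line: str) -> bool:
    """Return True if every <hon> is closed by a <hoff> and no tag debris remains."""
    is_open = False
    for match in HIGHLIGHT_TAG.finditer(line):
        if match.group() == "<hon>":
            if is_open:  # <hon> inside an open highlight
                return False
            is_open = True
        else:
            if not is_open:  # <hoff> without a matching <hon>
                return False
            is_open = False
    if is_open:  # highlight never closed
        return False
    # Any angle bracket left after removing complete tags is a tag fragment
    # (assumption: the text itself contains no '<' or '>').
    residue = PRONOUN_TAG.sub("", HIGHLIGHT_TAG.sub("", line))
    return "<" not in residue and ">" not in residue


# Row 10967 from highlighted.train.en, as quoted above
bad = "In cuba, with <hon>people < <hon>hoff><hoff> like me, <p>they</p> always found a reason to hit us."
# A made-up well-formed line for comparison
good = "<hon>The weather<hoff> does as <p>it</p> pleases."

print(is_well_formed(bad))   # False
print(is_well_formed(good))  # True
```

The residue check matters because some corrupted lines, such as row 901 of highlighted.test.fr, still contain a balanced number of complete `<hon>` and `<hoff>` tags.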


gsarti commented Jun 10, 2022

Hi @kayoyin @CoderPat,

Could you please let me know if you intend to fix the issue with the highlights? Thank you in advance!

@CoderPat
Collaborator

I didn't work on this part, but do you have statistics for the prevalence of the malformed tags? If it's very small and it doesn't break the code, it probably won't change the results much. Does it break your use case for them? I would just recommend dropping the samples in that case.
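
A rough way to get such prevalence statistics, and to drop the affected samples, is sketched below. It is a heuristic and not part of the SCAT code. It assumes highlights are not nested and that the text contains no literal `<` or `>`. The helper names are hypothetical, and row indices here are 0-based.

```python
import re

PRONOUN_TAG = re.compile(r"</?p>")
# A complete highlight whose content contains no stray angle brackets
HIGHLIGHT_PAIR = re.compile(r"<hon>([^<>]*)<hoff>")


def looks_malformed(line):
    """Strip all complete tags; any angle bracket left over signals corruption."""
    text = PRONOUN_TAG.sub("", line)
    text = HIGHLIGHT_PAIR.sub(r"\1", text)
    return "<" in text or ">" in text


def malformed_rows(lines):
    """Return the 0-based indices of lines that look malformed."""
    return [i for i, line in enumerate(lines) if looks_malformed(line)]


def drop_rows(lines, rows):
    """Return the lines whose index is not in rows."""
    rows = set(rows)
    return [line for i, line in enumerate(lines) if i not in rows]


sample = [
    "<hon>They<hoff> saved up, so <p>they</p> left.",  # made-up clean line
    "<hon>The weather < <hon>ho<hoff> ff> does as <p>it</p> pleases.",  # row 11213 above
    "No highlights here.",
]

bad = malformed_rows(sample)
print(f"{len(bad)} of {len(sample)} lines look malformed")
kept = drop_rows(sample, bad)
```

If the highlighted files are line-aligned with other files (an assumption), the same row indices would need to be dropped from each of them to keep the data parallel.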
