Single-side deduplication #928

Open
ZJaume opened this issue Nov 13, 2024 · 5 comments
Labels
quality Improving robustness and translation quality

Comments

@ZJaume
Collaborator

ZJaume commented Nov 13, 2024

Some experiments that a colleague ran during the MaCoCu project found that deduplicating on only the source side or only the target side improved translation quality. IIRC it was not clear which was better, source side or target side, but both were better than deduplicating on the whole sentence pair. In some cases I think the gain was about 1 BLEU point for mid-resource languages. This probably reduces the number of translation inconsistencies.

I couldn't find the table with the results, but I think this is worth exploring.

Maybe you are already doing this, but I was not sure. At least in the old pipeline, dedupe uses the whole sentence pair.
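To make the difference concrete, here is a minimal sketch of the three deduplication keys discussed above. It is not the pipeline's actual dedupe step: the `dedupe` helper, its `key` parameter, and the toy corpus are all made up for illustration, and keeping the first occurrence is just one possible policy.

```python
from typing import Iterable, Iterator, Literal, Tuple

Pair = Tuple[str, str]  # (source, target)


def dedupe(
    pairs: Iterable[Pair],
    key: Literal["pair", "source", "target"] = "pair",
) -> Iterator[Pair]:
    """Yield the first pair seen for each key, preserving input order.

    Hypothetical helper for illustration, not the pipeline's implementation.
    """
    seen = set()
    for src, tgt in pairs:
        if key == "pair":
            k = (src, tgt)   # whole sentence pair
        elif key == "source":
            k = src          # source side only
        else:
            k = tgt          # target side only
        if k not in seen:
            seen.add(k)
            yield (src, tgt)


# Toy corpus (invented): one exact duplicate, one source with two
# translations, and one target shared by two sources.
corpus = [
    ("hello", "hola"),
    ("hello", "hola"),
    ("hello", "buenas"),
    ("hi", "hola"),
]

by_pair = list(dedupe(corpus, "pair"))
by_source = list(dedupe(corpus, "source"))
by_target = list(dedupe(corpus, "target"))
```

With pair-level dedup, `("hello", "hola")` and `("hello", "buenas")` both survive, so the same source keeps two competing translations. Source-side dedup keeps only one of them.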

@ZJaume ZJaume added the quality Improving robustness and translation quality label Nov 13, 2024
@gregtatum
Member

We are de-duplicating based on source and target.

@eu9ene
Collaborator

eu9ene commented Dec 13, 2024

I was investigating another issue and I saw this in the en-zh tokenized corpus:

Natural ▁ compound

Natural ▁ products

Natural ▁ product

Natural ▁ Products

correspond to

天然 产物

天然 产物

天然 产物

天然 产物

Google Translate translates 天然 产物 as Natural Products.

So probably when we train en-zh it's ok to leave it, but for zh-en it would make sense to do source-side deduplication, otherwise we have 4 different translations for the same Chinese phrase. The question here is which translation is correct... It would make sense to run some model to score each of them and pick the best one instead of naive deduplication.

@ZJaume
Collaborator Author

ZJaume commented Dec 16, 2024

Picking the one with the best BCAI score?

天然 产物       Natural compound 0.548
天然 产物       Natural products        0.889
天然 产物       Natural product 0.881
天然 产物       Natural Products        0.901
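A rough sketch of what score-based source-side dedup could look like, using the four scores above as input. It assumes the scores are already computed and does not call Bicleaner AI itself. `best_per_source` is a hypothetical helper name, not something that exists in the pipeline.

```python
from typing import Dict, Iterable, List, Tuple

Scored = Tuple[str, str, float]  # (source, target, score)


def best_per_source(scored: Iterable[Scored]) -> List[Tuple[str, str]]:
    """Keep, for each source sentence, the target with the highest score.

    Hypothetical helper: scores are assumed to be precomputed upstream.
    On ties, the first candidate seen wins.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for src, tgt, score in scored:
        if src not in best or score > best[src][1]:
            best[src] = (tgt, score)
    # dicts preserve insertion order, so sources come out in first-seen order
    return [(src, tgt) for src, (tgt, _) in best.items()]


scored = [
    ("天然 产物", "Natural compound", 0.548),
    ("天然 产物", "Natural products", 0.889),
    ("天然 产物", "Natural product", 0.881),
    ("天然 产物", "Natural Products", 0.901),
]

kept = best_per_source(scored)
```

For these four rows, only `Natural Products` (0.901) is kept. Note that the top three scores are very close, so in practice the choice between them would be fairly arbitrary.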

@gregtatum
Member

@ZJaume What is the BCAI score?

@ZJaume
Collaborator Author

ZJaume commented Dec 17, 2024

Bicleaner AI, sorry 😅

3 participants