Single-side deduplication #928

ZJaume · 2024-11-13T15:31:12Z

Some experiments that a colleague ran during the MaCoCu project found that deduplicating on only the source side or only the target side improved translation quality. IIRC it was not clear which was better, source or target, but both were better than deduplicating on the whole sentence pair. In some cases I think the gain was about 1 BLEU point for mid-resource languages. This probably reduces the number of translation inconsistencies.

I couldn't find the table with the results, but I think this is worth exploring.

Maybe you are already doing this, but I was not sure. At least in the old pipeline, dedupe uses the whole sentence pair.


gregtatum · 2024-11-15T14:50:46Z

We are de-duplicating based on source and target.

eu9ene · 2024-12-13T20:08:59Z

I was investigating another issue and saw this in the en-zh tokenized corpus:

Natural ▁ compound

Natural ▁ products

Natural ▁ product

Natural ▁ Products

correspond to

天然 产物

天然 产物

天然 产物

天然 产物

Google Translate translates 天然产物 as Natural Products.

So when we train en-zh it's probably OK to leave it, but for zh-en it would make sense to do source-side deduplication; otherwise we have 4 different translations for the same Chinese phrase. The question here is which translation is correct. It would make sense to run some model to score each of them and pick the best one instead of doing naive deduplication.
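To make the difference concrete, here is a minimal sketch of pair-level versus naive source-side deduplication for the zh-en direction. The helper names `dedupe_pairs` and `dedupe_by_source` are hypothetical and are not taken from the pipeline; the data is the example above with the tokenization markers removed.

```python
def dedupe_pairs(pairs):
    """Drop exact (source, target) duplicates, keeping first occurrences."""
    return list(dict.fromkeys(pairs))


def dedupe_by_source(pairs):
    """Naive source-side dedup: keep only the first target seen per source."""
    first_seen = {}
    for source, target in pairs:
        first_seen.setdefault(source, target)
    return list(first_seen.items())


# zh-en direction: Chinese is the source side (illustrative data from this thread).
pairs = [
    ("天然 产物", "Natural compound"),
    ("天然 产物", "Natural products"),
    ("天然 产物", "Natural product"),
    ("天然 产物", "Natural Products"),
]

print(len(dedupe_pairs(pairs)))   # all four pairs survive pair-level dedup
print(dedupe_by_source(pairs))    # only the first pair survives
```

With the naive version, whichever translation happens to come first in the corpus wins, which is why scoring the candidates looks attractive.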

ZJaume · 2024-12-16T11:15:38Z

Picking the one with the best BCAI score?

天然 产物       Natural compund 0.548
天然 产物       Natural products        0.889
天然 产物       Natural product 0.881
天然 产物       Natural Products        0.901
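A rough sketch of what that selection could look like, assuming the scores have already been computed and stored alongside each pair. The function name `best_per_source` is hypothetical, and nothing here calls the scoring tool itself; the rows are copied from the table above.

```python
def best_per_source(scored_pairs):
    """Keep, for each source sentence, the target with the highest score.

    scored_pairs: iterable of (source, target, score) tuples, with scores
    assumed to be precomputed by an external quality model.
    """
    best = {}
    for source, target, score in scored_pairs:
        # Replace the stored candidate only when a strictly better score appears.
        if source not in best or score > best[source][1]:
            best[source] = (target, score)
    return [(source, target) for source, (target, _) in best.items()]


# Rows copied from the table above.
scored = [
    ("天然 产物", "Natural compund", 0.548),
    ("天然 产物", "Natural products", 0.889),
    ("天然 产物", "Natural product", 0.881),
    ("天然 产物", "Natural Products", 0.901),
]

print(best_per_source(scored))
```

On ties the first candidate seen is kept, so the result is deterministic for a fixed corpus order.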

gregtatum · 2024-12-17T15:50:52Z

@ZJaume What is the BCAI score?

ZJaume · 2024-12-17T16:39:21Z

Bicleaner AI, sorry 😅

ZJaume added the "quality" label (Improving robustness and translation quality) on Nov 13, 2024
