Commit

fixed bug related to NLLB language token being duplicated
ghanvert committed Dec 17, 2024
1 parent 4ab60f5 commit 2d205bd
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion src/tokenizer_wrapper.py
@@ -21,7 +21,7 @@ def __call__(
     ) -> dict:
         if isinstance(self.tokenizer, NllbTokenizerFast):
             input_ids = self.tokenizer.encode(text, **self.kwargs)
-            input_ids[0] = self.tokenizer.convert_tokens_to_ids(src_lang)
+            input_ids[:, 0] = self.tokenizer.convert_tokens_to_ids(src_lang)
         elif isinstance(self.tokenizer, T5TokenizerFast):
             input_ids = self.tokenizer.encode(src_lang + text, **self.kwargs)

