Some multilingual/seq2seq models such as M2M100 (c.f. the Generation section of the linked page) require the bos_token to be set to the target language id via the sequence tgt property.

In the case of the translation server, to be able to specify the requested translation language, we need to manipulate the sequence tgt property directly prior to translation. But in its current state the server has a disconnect between the sequence ref/ref_tok (which can be manipulated through tokenizers/processors) and the tgt string prior to being sent to ctranslate2, c.f. OpenNMT-py/onmt/translate/translation_server.py, line 588 in cb1cb22. Basically, the tgt parameter of the self.translator.translate method is never provided, c.f. line 599 of the same file.

I implemented a one-line patch that properly passes the parameter through and allows me to do multilingual translation. It should have no side effects on other types of models (for which the sequence ref is empty after tokenization), since the parameter is set to an empty string in those cases. Here's the PR: #2585
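For illustration, here is a minimal sketch of what the change amounts to; the variable names around the call are assumptions for readability, not the literal diff (see #2585 for that):

# onmt/translate/translation_server.py (sketch only, not the literal diff):
# forward the preprocessed ref strings as the previously omitted tgt argument.
# texts_ref is a hypothetical name for the list built from sequence["ref"];
# it holds "__fr__"-style prefixes, or empty strings when no ref is set.
scores, predictions = self.translator.translate(
    texts_to_translate,  # hypothetical name for the preprocessed source batch
    tgt=texts_ref,       # empty strings leave non-multilingual models unaffected
)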
Example of multilingual translation with a M2M100 model, using a server conf.json (sketched after the script) and a custom preprocess/postprocess script:

available_models/m2m-multi4-ft-ck945k/tokenizer/m2m100_tokenizer.py:

import os
from pathlib import Path

from transformers import M2M100Tokenizer

cache = None


def loadTokenizer(model_root, logger):
    """Load the M2M100 tokenizer once and reuse it across requests."""
    global cache
    if cache is not None:
        return cache
    model_path = os.path.join(model_root, "m2m-multi4-ft-ck945k/tokenizer/")
    logger.info("Loading m2m100 tokenizer from %s", model_path)
    cache = M2M100Tokenizer.from_pretrained(model_path)
    return cache


def preprocess(sequence, server_model):
    """Preprocess a single sequence.

    Args:
        sequence (dict[str, Unknown]): The sequence to preprocess.

    Returns:
        sequence (dict[str, Unknown]): The preprocessed sequence.
    """
    server_model.logger.info(f"Running preprocessor '{Path(__file__).stem}'")
    ref = sequence.get("ref", None)
    if ref is not None and ref[0] is not None:
        server_model.logger.debug(f"{ref[0]=}")
        tgt_lang = ref[0].get("tgt_lang", None)
        if tgt_lang is not None:
            server_model.logger.debug(f"{tgt_lang=}")
            tokenizer = loadTokenizer(server_model.model_root, server_model.logger)
            seg = sequence.get("seg", None)
            # encode() prepends the source language token and appends </s>
            tok = tokenizer.convert_ids_to_tokens(tokenizer.encode(seg[0]))
            sequence["seg"][0] = " ".join(tok)
            # put the target language token in ref so it can be passed as tgt
            lang_prefix = f"__{tgt_lang}__"
            sequence["ref"][0] = f"{lang_prefix}"
            server_model.logger.info(f"Added lang prefix to ref: '{lang_prefix}'")
            server_model.logger.debug(f"{sequence['ref'][0]=}")
    return sequence


def postprocess(sequence, server_model):
    """Postprocess a single sequence.

    Args:
        sequence (dict[str, Unknown]): The sequence to postprocess.

    Returns:
        sequence (dict[str, Unknown]): The post processed sequence.
    """
    server_model.logger.info(f"Running postprocessor '{Path(__file__).stem}'")
    tokenizer = loadTokenizer(server_model.model_root, server_model.logger)
    seg = sequence.get("seg", None)
    # drop the leading target language token emitted by the model, then detokenize
    detok = tokenizer.decode(
        tokenizer.convert_tokens_to_ids(seg[0].split()[1:]),
        skip_special_tokens=True,
    )
    return detok
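The conf.json that goes with this could look roughly like the sketch below. The exact keys for registering custom preprocess/postprocess scripts vary between OpenNMT-py versions, so every field here is an assumption rather than the actual configuration used in this issue:

{
    "models_root": "./available_models",
    "models": [
        {
            "id": 100,
            "model": "m2m-multi4-ft-ck945k/model.pt",
            "load": true,
            "opt": {
                "beam_size": 5
            },
            "preprocess": ["m2m-multi4-ft-ck945k/tokenizer/m2m100_tokenizer.py"],
            "postprocess": ["m2m-multi4-ft-ck945k/tokenizer/m2m100_tokenizer.py"]
        }
    ]
}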
Sample request to server:
[
{
"src": "Brian is in the kitchen.",
"id": 100,
"ref": {
"src_lang": "en",
"tgt_lang": "fr"
}
},
{
"src": "By the way, do you like to eat pancakes?",
"id": 100,
"ref": {
"src_lang": "en",
"tgt_lang": "fr"
}
}
]
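Assuming the server runs with its default port and URL root (both are configurable, so an assumption here), the request above can be posted like this:

import requests

payload = [
    {"src": "Brian is in the kitchen.", "id": 100,
     "ref": {"src_lang": "en", "tgt_lang": "fr"}},
]
# assuming the default port (5000) and url root (/translator) of the server
r = requests.post("http://localhost:5000/translator/translate", json=payload)
print(r.json())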
Please read the README of the project: we are no longer supporting OpenNMT-py and are switching to https://github.com/eole-nlp/eole
I suggest you switch to eole if you intend to get support in the future. The server in eole is not ready yet, but future development will be done there.
Cheers.