Philip's blog #24

https://blog.philip-huang.tech/?page=bpe-tokenization

- tags: paper-notes gpt2 bpe tokenizer tokenization
- date: 2023/09/06

Paper link: https://arxiv.org/abs/1508.07909

Neural machine translation (NMT) models typically operate with a fixed vocabulary, but translation is an open-vocabulary problem.

Previous work addressed the translation of out-of-vocabulary (OOV) words, typically through a back-off dictionary. In this paper, we introduce a simpler and more effective approach that enables NMT models to perform open-vocabulary translation by encoding rare and unknown words as sequences of subword units. This is based on the intuition that various word classes are translatable via units smaller than words, for instance names via transliteration and compounds via compositional translation.

We discuss the suitability of different word segmentation techniques, including simple character n-gram models and a segmentation based on the byte pair encoding (BPE) compression algorithm, and show empirically that subword models improve over a back-off dictionary baseline on the WMT 15 translation tasks (English→German and English→Russian) by up to 1.1 and 1.3 BLEU, respectively.
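
The BPE learning procedure at the heart of the paper is compact. The sketch below closely follows the paper's own Python listing (Figure 1 in the paper) and uses its toy vocabulary; the one change is that the learned merges are collected in a list instead of printed, so they can be reused below. Each word is a space-separated sequence of characters ending in the end-of-word marker `</w>`, and the most frequent adjacent symbol pair is merged, over and over:

```python
import re
import collections

def get_stats(vocab):
    # Count the frequency of every adjacent symbol pair in the vocabulary.
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    # Rewrite the vocabulary, fusing every occurrence of `pair` into one symbol.
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        v_out[p.sub(''.join(pair), word)] = v_in[word]
    return v_out

# The paper's toy vocabulary: word frequencies over character sequences.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

num_merges = 10
merges = []
for _ in range(num_merges):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)  # merge the most frequent pair first
    vocab = merge_vocab(best, vocab)
    merges.append(best)

print(merges[:3])  # [('e', 's'), ('es', 't'), ('est', '</w>')]
```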

The paper's core question
Can we improve the translation of rare and unknown words in neural machine translation by encoding them as sequences of subword units?
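
Concretely, the question is whether an unseen word can be decomposed into subwords the model already knows. Continuing the sketch above (an illustration, not code from the paper; full implementations such as subword-nmt re-check candidate pairs after each merge, while this version makes one pass per merge in learned order, which suffices here), the learned merges are replayed on the new word:

```python
def apply_bpe(word, merges):
    # Split the word into characters plus the end-of-word marker used
    # during learning, then apply each learned merge in priority order.
    symbols = list(word) + ['</w>']
    for pair in merges:
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == pair:
                symbols[i:i + 2] = [''.join(pair)]  # fuse the pair in place
            else:
                i += 1
    return symbols

# The paper's own example: 'lowest' never occurs in the toy corpus,
# but its subwords do.
print(apply_bpe('lowest', merges))  # ['low', 'est</w>']
```

The OOV word decomposes into 'low' and 'est', both seen during training, which is exactly the open-vocabulary behaviour the paper is after.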
