fix skip_lang / filter improvement / fix "Repeating because of invalid translation" error #788
+266
−193
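This code turned out to be in the wrong place from the start. Language detection now runs on the merged `text_regions` instead of on the fragmentary textlines that come straight out of OCR, which makes langdetect far more accurate: previously only a handful of regions were identified correctly, and now only a handful are identified incorrectly. This makes `skip_lang` actually work. (The detection library is still not accurate enough and needs further improvement.)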
EDIT: Fixed several major bugs in the chatgpt translator and updated deepseek along the way.
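To illustrate why detecting on merged regions helps, here is a minimal sketch. It does not use the project's code or the real langdetect API: `toy_detect` is a hypothetical stand-in detector based on Unicode ranges, and `filter_regions` is an assumed helper name. A short fragment containing only kanji is easy to misclassify, while the merged sentence carries enough signal.

```python
from typing import Callable, Iterable, List, Set


def toy_detect(text: str) -> str:
    """Hypothetical stand-in for a real detector: guesses from Unicode ranges."""
    kana = sum(1 for ch in text if "\u3040" <= ch <= "\u30ff")
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    if kana > 0:
        return "ja"
    if cjk > latin:
        return "zh"
    if latin > 0:
        return "en"
    return "unknown"


def filter_regions(
    region_texts: Iterable[str],
    skip_langs: Set[str],
    detect: Callable[[str], str] = toy_detect,
) -> List[str]:
    """Keep only merged regions whose detected language is not skipped."""
    return [t for t in region_texts if detect(t) not in skip_langs]


# One Japanese sentence, as OCR might split it into textlines.
fragments = ["今日", "はいい天気ですね"]
merged = "".join(fragments)

print(toy_detect(fragments[0]))  # the kanji-only fragment looks Chinese
print(toy_detect(merged))        # the merged region is recognised as Japanese
print(filter_regions([merged, "Hello world"], {"en"}))
```

A real detector would replace `toy_detect`; the point is only that the filter should see the merged text, not the individual fragments.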
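The `Repeating because of invalid translation` error

This error wastes a large number of tokens. It is raised when the translated list contains fewer entries than expected, which happens very often; I am not sure why nobody has reported it before. Once it is triggered, the code keeps retrying until the retry limit is reached, regardless of whether a later batch returns the same number of entries as `queries`. The final output is always the first (incorrect) batch, with values from the last retry appended into the positions where the first batch had empty strings. In other words, everything returned by the intermediate retries is thrown away, only the last retry's translations for those empty positions are used, and the final result is always wrong.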
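The behaviour one would want instead can be sketched as follows. This is an assumption about the intended logic, not the code in this PR: `translate_with_retry` and the `request` callable are hypothetical names. Each attempt is judged on its own, the first attempt with the right number of non-empty entries is returned immediately, and incomplete batches are never merged into later ones.

```python
from typing import Callable, List


def translate_with_retry(
    queries: List[str],
    request: Callable[[List[str]], List[str]],  # hypothetical translator call
    max_retries: int = 3,
) -> List[str]:
    """Return the first complete batch; never mix results across attempts."""
    last: List[str] = []
    for _attempt in range(max_retries + 1):
        result = request(queries)
        complete = len(result) == len(queries) and all(r.strip() for r in result)
        if complete:
            return result  # stop retrying as soon as one batch is valid
        last = result
    # Retries exhausted: fall back to the last batch, padded to the right length.
    padded = list(last[: len(queries)])
    padded += [""] * (len(queries) - len(padded))
    return padded


# Usage with a fake translator that is short on the first call only.
calls = []

def fake_request(qs: List[str]) -> List[str]:
    calls.append(len(qs))
    return ["a"] if len(calls) == 1 else ["A", "B"]

print(translate_with_retry(["x", "y"], fake_request))
```

With this shape, a successful second attempt costs exactly one extra request rather than running to the retry limit.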
Example:
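Testing with this prompt removed, at least half of the content that previously could not be returned now comes back. Simply raising the retry count is enough to brute-force past the content-moderation refusals; with this prompt in place, no number of retries gets around them.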
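Added a custom list of characters to delete from the start and end of the text. It is meant for removing gibberish at the head and tail of OCR output, which is more convenient than doing it through the replacement dictionary. Also removed the leading space that can appear after dictionary replacement. (Spaces are supposedly stripped during translation, but in my experience they did not seem to be, and the space still appeared to take up room in the output.)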
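A minimal sketch of this kind of cleanup, under assumptions: `clean_ocr_text`, `edge_chars`, and `replacements` are hypothetical names, the entries of `edge_chars` are assumed to be single characters, and the order of the steps here may differ from the PR.

```python
from typing import Dict, Iterable


def clean_ocr_text(
    text: str,
    edge_chars: Iterable[str],     # single characters to trim from both ends
    replacements: Dict[str, str],  # dictionary replacement applied afterwards
) -> str:
    """Trim listed characters from both ends, replace, then drop leading spaces."""
    text = text.strip("".join(edge_chars))
    for old, new in replacements.items():
        text = text.replace(old, new)
    # A replacement at the start can leave a leading space behind.
    return text.lstrip()


print(clean_ocr_text("~|こんにちは|~", ["~", "|"], {}))
print(clean_ocr_text("※ こんにちは", [], {"※": ""}))
```

`str.strip` with a character set only touches the ends of the string, so the same characters inside the text are left alone.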