Change AnalyzeContext.result from LinkedList to TreeSet to fix issue 662: #662
I reproduced the problem with these three cases:
ik_max tokenization results:
Symptom:
The output Lexemes are not emitted in offset order, as shown below:
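Cause:
This commit added a method to AnalyzeContext that "splits out the single characters of the preceding conflicting lexeme". It conflicts with the original single-character logic in AnalyzeContext, "output the single characters missed between lexemes inside the path". In some cases (such as the three examples above), single characters output by the former are output again by the latter. This also breaks the output order and causes ES indexing errors.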
Fix:
Change the result field of AnalyzeContext from LinkedList to TreeSet. This avoids duplicate additions and guarantees that results are output in order.
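The effect of the container change can be illustrated with a minimal sketch. It does not use the real IK classes: `Token` is a hypothetical stand-in for Lexeme that keeps only an offset and a length, and its ordering (by offset, then longer first) is an assumption made for the example. The emission sequence is also invented to mimic one single character being output twice and out of order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedList;
import java.util.List;
import java.util.TreeSet;

public class ResultOrderingSketch {

    // Hypothetical stand-in for a lexeme: only offset and length matter here.
    static final class Token implements Comparable<Token> {
        final int begin;
        final int length;

        Token(int begin, int length) {
            this.begin = begin;
            this.length = length;
        }

        // Assumed ordering: by start offset, then longer tokens first.
        @Override
        public int compareTo(Token other) {
            if (begin != other.begin) {
                return Integer.compare(begin, other.begin);
            }
            return Integer.compare(other.length, length);
        }

        @Override
        public String toString() {
            return begin + "+" + length;
        }
    }

    // Invented emission order: the single character at offset 3 is emitted
    // twice, and the one at offset 2 arrives after it.
    static List<Token> emitted() {
        return Arrays.asList(
                new Token(0, 2), new Token(3, 1), new Token(2, 1), new Token(3, 1));
    }

    // Fill the given container with the emitted tokens and render its contents.
    static List<String> collect(Collection<Token> result) {
        result.addAll(emitted());
        List<String> out = new ArrayList<>();
        for (Token t : result) {
            out.add(t.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // LinkedList keeps insertion order and duplicates.
        System.out.println(collect(new LinkedList<>())); // [0+2, 3+1, 2+1, 3+1]
        // TreeSet sorts by compareTo and drops elements that compare equal.
        System.out.println(collect(new TreeSet<>()));    // [0+2, 2+1, 3+1]
    }
}
```

Note that TreeSet decides equality through `compareTo`, not `equals`, so two tokens with the same offset and length count as duplicates in this sketch even though `Token` does not override `equals`.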
Results after the fix: