About the "startOffset must be non-negative" error #662

scfw · 2019-03-29T03:48:33Z

When testing in my own environment, pengcong90's commit causes the "startOffset must be non-negative" error. I'm not sure whether this is an isolated case or a real bug.

BigBrother5 · 2019-03-29T05:54:34Z

Same problem here! It happens when indexing "得饶人处且饶人".
My configuration is:
"analyzer": "ik_max_word",
"search_analyzer": "ik_max_word"

foqq · 2019-04-09T12:46:43Z

text: "黎明前的黑暗"
analyzer: "ik_max_word",
search_analyzer: "ik_max_word"
exception:
"startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=2,endOffset=3,lastStartOffset=3 for field 'description'"

version: 6.7.0

Cosin · 2019-04-14T07:24:46Z

@foqq @BigBrother5 What a coincidence: I ran into this problem too, and with exactly the same kind of text:

得饶人处且饶人
黎明前的黑暗
宝剑锋从磨砺出

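In my tests, this bug exists only on 6.7.0 and 6.7.1, and only in ik_max_word mode. ik_smart is not affected, and neither are earlier or later versions.

@medcl A friend told me you had already fixed this bug before. Could you take another look?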

levylll · 2019-04-15T02:49:05Z

Same here, on version 6.7.1. I hope @medcl can fix this soon. Many thanks!

levylll · 2019-04-15T02:51:27Z

I saw medcl say in another thread that this apparently cannot be solved... sad
#652 (comment)
Maybe we will have to roll back to an earlier version...

scfw · 2019-04-15T02:53:55Z

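This is how I have worked around it for now: comment out the code committed by pengcong90 and rebuild the package. Everything has been quiet since.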

scfw · 2019-04-15T02:57:25Z

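In principle it cannot be solved, but some of the errors may be caused by the code that pengcong90 committed. Test it yourself and the specific cause becomes clear.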

medcl · 2019-04-15T13:10:38Z

Could someone help submit a PR? Thanks, I'm swamped.

dotNetDR · 2019-05-06T11:13:19Z

I just spun up a test Elasticsearch 6.7.0 instance on Alibaba Cloud and ran straight into this pitfall.

PUT test?include_type_name=false
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "content": {
        "type": "text",
        "term_vector": "with_positions_offsets",
        "analyzer": "ik_max_word"
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": "1",
      "number_of_replicas": "1"
    }
  }
}

PUT test/_doc/1
{
  "content": "机关算尽太聪明"
}

----
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=2,endOffset=3,lastStartOffset=4 for field 'content'"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=2,endOffset=3,lastStartOffset=4 for field 'content'"
  },
  "status": 400
}

douniwan5788 · 2019-05-15T10:23:14Z

The latest 7.0.0 has this problem too:
黎明前的黑暗
java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=2,endOffset=3,lastStartOffset=3

scfw · 2019-05-15T10:28:30Z

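Take a look at the commits after 7.0: the commented-out code has already been committed, but no release containing it has been published yet.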

vistart · 2019-05-16T17:59:23Z

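@scfw 7.0.1 and 6.7.2 have been released. Do they fix this problem? In my own tests, upgrading the version directly without changing the contents of the data folder still shows the problem.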

Starting with Elasticsearch 6.7, the offset check has become more stringent, and the word segmentation feature no longer supports backtracking. infinilabs/analysis-ik#662

medcl · 2019-07-07T23:15:01Z

Could you please help test the latest version?

douniwan5788 · 2019-07-08T06:29:27Z

"Could you please help test the latest version?"

It seems to work now, but I have to reindex again...

czhcc · 2019-10-08T10:43:38Z

I'm using 6.8.0. Is there a way to fix this?

IDrinkMoreWater · 2019-10-10T12:51:39Z

I have run into a similar problem too, and it is really frustrating. I'm using 6.8.2.

sinboun · 2019-11-26T14:06:52Z

6.8.1 has the same problem. Is there any fix, please?

dotNetDR · 2019-11-29T10:47:22Z

I looked at the code: the fix starts from 7.0.0.
The buggy code was introduced in 6.7.0, and its impact lasts through 6.8.4.

bmnnmb · 2020-08-13T06:31:59Z

startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=16,endOffset=17,lastStartOffset=17 for field 'remark'
ES and ik are both version 6.8.11 and the keyword is "黎明前的黑暗", yet the same error still occurs. Awkward. I will try rolling back to 6.6.1.

dotNetDR · 2020-10-28T03:35:08Z

In
/src/main/java/org/wltea/analyzer/core/AnalyzeContext.java
comment out the code at line 271 yourself (see 7.0.0 for reference):

					int innerIndex = index + 1;
					for (; innerIndex < index + l.getLength(); innerIndex++) {
						Lexeme innerL = path.peekFirst();
						if (innerL != null && innerIndex == innerL.getBegin()) {
							this.outputSingleCJK(innerIndex - 1);
						}
					}

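Then rebuild ik and install the result, and the problem is solved.

Before compiling, remember to set the version in /src/main/resources/plugin-descriptor.properties:

elasticsearch.version=<your ES version>

For example: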

elasticsearch.version=6.8.6

Update 2020-11-12:
Also remember to change this entry in pom.xml:

        <elasticsearch.version>${change this to your version}</elasticsearch.version>

yanghanxy · 2020-11-19T11:16:22Z

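@medcl Hello,
I also hit this problem using the code on the 6.X branch.
Looking at the commit history of the 6.X branch, the last commit introduced this problem, and it has never been fixed.
Could that commit be reverted so that it does not affect more people using the 6.X branch? Thanks!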


yanghanxy · 2020-11-19T12:38:43Z

I reproduced the problem with these three cases:

得饶人处且饶人
黎明前的黑暗
宝剑锋从磨砺出

ik_max_word tokenization results:

[0-7 : 得饶人处且饶人 : CN_WORD, 0-4 : 得饶人处 : CN_WORD, 1-3 : 饶人 : CN_WORD, 4-5 : 且 : CN_CHAR, 3-4 : 处 : CN_CHAR, 4-5 : 且 : CN_CHAR, 5-7 : 饶人 : CN_WORD]
[0-6 : 黎明前的黑暗 : CN_WORD, 0-3 : 黎明前 : CN_WORD, 0-2 : 黎明 : CN_WORD, 3-4 : 的 : CN_CHAR, 2-3 : 前 : CN_CHAR, 3-4 : 的 : CN_CHAR, 4-6 : 黑暗 : CN_WORD]
[0-7 : 宝剑锋从磨砺出 : CN_WORD, 0-3 : 宝剑锋 : CN_WORD, 0-2 : 宝剑 : CN_WORD, 3-4 : 从 : CN_CHAR, 2-3 : 锋 : CN_CHAR, 3-4 : 从 : CN_CHAR, 4-6 : 磨砺 : CN_WORD, 6-7 : 出 : CN_CHAR]

Symptom:
The output Lexemes are not emitted in offset order, as in these excerpts:

[4-5 : 且 : CN_CHAR, 3-4 : 处 : CN_CHAR, 4-5 : 且 : CN_CHAR]
[3-4 : 的 : CN_CHAR, 2-3 : 前 : CN_CHAR, 3-4 : 的 : CN_CHAR]
[3-4 : 从 : CN_CHAR, 2-3 : 锋 : CN_CHAR, 3-4 : 从 : CN_CHAR]
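To make the symptom concrete, here is a minimal sketch of the rule spelled out in the error message, applied to the offsets listed in this thread for "黎明前的黑暗". It is not Lucene's or Elasticsearch's actual code; the class `OffsetOrderCheck` and the method `firstViolation` are hypothetical names, and the starting value of 0 for `lastStartOffset` is an assumption.

```java
import java.util.Arrays;
import java.util.List;

public class OffsetOrderCheck {

    /**
     * Returns the index of the first token that breaks the rule quoted in the
     * error message, or -1 if every token passes.
     * Each token is an int[]{startOffset, endOffset}.
     */
    static int firstViolation(List<int[]> tokens) {
        int lastStartOffset = 0; // assumed initial value
        for (int i = 0; i < tokens.size(); i++) {
            int start = tokens.get(i)[0];
            int end = tokens.get(i)[1];
            // startOffset must be non-negative, endOffset must be >= startOffset,
            // and offsets must not go backwards.
            if (start < 0 || end < start || start < lastStartOffset) {
                return i;
            }
            lastStartOffset = start;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Offsets in the order listed in this thread, before the fix.
        List<int[]> before = Arrays.asList(
                new int[]{0, 6}, new int[]{0, 3}, new int[]{0, 2},
                new int[]{3, 4}, new int[]{2, 3}, new int[]{3, 4}, new int[]{4, 6});
        // The same tokens, deduplicated and sorted by offset.
        List<int[]> after = Arrays.asList(
                new int[]{0, 6}, new int[]{0, 3}, new int[]{0, 2},
                new int[]{2, 3}, new int[]{3, 4}, new int[]{4, 6});

        System.out.println(firstViolation(before)); // 4
        System.out.println(firstViolation(after));  // -1
    }
}
```

The token at index 4 has startOffset=2 and endOffset=3 while the previous start offset is 3, which lines up with the numbers in the exception reported earlier for this text.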

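Cause:
The method this commit added to AnalyzeContext, "split out the single characters of the previous conflicting lexeme", conflicts with AnalyzeContext's original single-character logic, "output the single characters missed between lexemes inside the path". In some cases (such as the three examples above), single characters output by the former are output again by the latter, which also introduces an ordering problem and causes the ES indexing error.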

Fix:
Change AnalyzeContext's result from a LinkedList to a TreeSet, which avoids duplicate additions and guarantees that results are output in order.

Results after the fix:

[0-7 : 得饶人处且饶人 : CN_WORD, 0-4 : 得饶人处 : CN_WORD, 1-3 : 饶人 : CN_WORD, 3-4 : 处 : CN_CHAR, 4-5 : 且 : CN_CHAR, 5-7 : 饶人 : CN_WORD]
[0-6 : 黎明前的黑暗 : CN_WORD, 0-3 : 黎明前 : CN_WORD, 0-2 : 黎明 : CN_WORD, 2-3 : 前 : CN_CHAR, 3-4 : 的 : CN_CHAR, 4-6 : 黑暗 : CN_WORD]
[0-7 : 宝剑锋从磨砺出 : CN_WORD, 0-3 : 宝剑锋 : CN_WORD, 0-2 : 宝剑 : CN_WORD, 2-3 : 锋 : CN_CHAR, 3-4 : 从 : CN_CHAR, 4-6 : 磨砺 : CN_WORD, 6-7 : 出 : CN_CHAR]
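The idea behind the LinkedList to TreeSet change can be sketched in isolation. This is only an illustration, not the code from the PR: `Token`, `BY_OFFSET`, and `dedupeAndSort` are hypothetical names, and the ordering (start offset ascending, then longer token first) is an assumption chosen to match the results listed above rather than a copy of IK's own Lexeme comparison.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class OrderedTokens {

    /** Hypothetical stand-in for a lexeme: the span [begin, end) of the input text. */
    static final class Token {
        final int begin;
        final int end;
        final String text;

        Token(int begin, int end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }

        @Override
        public String toString() {
            return begin + "-" + end + " : " + text;
        }
    }

    // Start offset ascending, then longer tokens first. Two tokens with the same
    // span compare as equal, so a TreeSet keeps only one of them.
    static final Comparator<Token> BY_OFFSET = (a, b) ->
            a.begin != b.begin ? Integer.compare(a.begin, b.begin)
                               : Integer.compare(b.end, a.end);

    /** Collects tokens into a TreeSet, which drops duplicate spans and sorts the rest. */
    static List<Token> dedupeAndSort(List<Token> emitted) {
        TreeSet<Token> ordered = new TreeSet<>(BY_OFFSET);
        ordered.addAll(emitted);
        return new ArrayList<>(ordered);
    }

    public static void main(String[] args) {
        // Emission order listed in this thread for "黎明前的黑暗" before the fix.
        List<Token> emitted = Arrays.asList(
                new Token(0, 6, "黎明前的黑暗"), new Token(0, 3, "黎明前"),
                new Token(0, 2, "黎明"), new Token(3, 4, "的"),
                new Token(2, 3, "前"), new Token(3, 4, "的"),
                new Token(4, 6, "黑暗"));
        for (Token t : dedupeAndSort(emitted)) {
            System.out.println(t);
        }
    }
}
```

With a list, every emitted token is kept in emission order, so the repeated "的" and the late "前" both survive. A sorted set removes the repeat and restores offset order at the cost of comparing on insertion.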

@medcl Please help evaluate this, thanks: #835

pfsan · 2021-04-25T09:02:03Z

Version 6.7.2 also shows this problem. How can it be solved?

medcl · 2022-05-24T06:22:14Z

Reverted.

vistart mentioned this issue May 16, 2019

[Version 6.7.1] startOffset must be non-negative, and endOffset must be >= startOffset #675

Open

DustinPT referenced this issue in DustinPT/elasticsearch-analysis-ik May 26, 2019

https://github.com/medcl/elasticsearch-analysis-ik/issues/662: comment out the code committed by pengcong90

b3d9547

yanghanxy added a commit to yanghanxy/elasticsearch-analysis-ik that referenced this issue Nov 19, 2020

bug fix for issue 662

c8dacc3

Change this.result from LinkedList to TreeSet to fix issue 662, infinilabs#662

yanghanxy mentioned this issue Nov 19, 2020

bug fix for issue 662 #835

Open

ghr111 mentioned this issue Jun 7, 2021

es7.4.1 "endOffset must be >= startOffset" error #888

Open

yudingling pushed a commit to yudingling/elasticsearch-analysis-ik that referenced this issue Jun 25, 2021

fix: infinilabs#662

ee19f22

marmot-z mentioned this issue May 13, 2022

Why cloud service manufacturers use open source software for free and don't fix open source software's bugs #945

Closed

medcl closed this as completed May 24, 2022
