
Chinese word segmentation problem (中文分词问题) #27

Open
liuxm6 opened this issue Aug 2, 2016 · 2 comments
Comments


liuxm6 commented Aug 2, 2016

For example, with both Complex and Simple modes, "哈尔滨市" (Harbin City) is segmented only as the single token "哈尔滨市", never as both "哈尔滨市" and "哈尔滨"; with MaxWord mode it comes out as the single characters "哈", "尔", "滨", "市". How can this be solved? Thanks to the author.


liuxm6 commented Aug 2, 2016

As a result, when a user searches for "哈尔滨", articles that contain "哈尔滨市" do not appear in the results.
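A toy illustration of why this happens (this is not Lucene/Solr internals, just a minimal inverted index with a hypothetical longest-match-only tokenizer): if the analyzer emits only the token "哈尔滨市" for a document, an exact-token lookup for the shorter query "哈尔滨" finds nothing, because that token was never indexed.

```python
def build_index(docs, tokenize):
    """Build a minimal inverted index: token -> set of doc ids."""
    index = {}
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index.setdefault(token, set()).add(doc_id)
    return index

# Hypothetical tokenizer that, like the Complex/Simple behavior described
# above, emits only the longest dictionary match for this phrase.
def longest_match_only(text):
    return ["哈尔滨市"] if "哈尔滨市" in text else [text]

docs = {1: "哈尔滨市今天下雪"}
index = build_index(docs, longest_match_only)

print(index.get("哈尔滨", set()))    # empty: the shorter word was never indexed
print(index.get("哈尔滨市", set()))  # {1}: only the full city name matches
```

The fix therefore has to happen at tokenization time: either the segmenter emits both "哈尔滨市" and "哈尔滨", or the dictionary is adjusted so the desired token is produced.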


amao12580 commented Aug 4, 2016

@liuxm6 I spent some time studying the dictionary-loading logic. When the custom dicPath fails to load (or no dicPath is defined), the three default dictionaries are loaded: chars.dic, units.dic, words.dic, from inside the mmseg4j-core-1.10.2.jar file at mmseg4j-core-1.10.2.jar!\data*. words.dic already defines "哈尔滨市" as a complete word (line 35887), and line 35884 also defines "哈尔滨" as a complete word, which is why you see this behavior.
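To see why both dictionary entries lead to "哈尔滨市" winning, here is a toy greedy forward-maximum-matching sketch. This is deliberately simplified: mmseg4j's Complex mode uses the chunk-scoring MMSEG algorithm, not plain greedy matching, but the longest-entry-wins effect, and why deleting the longer entry changes the output, is the same.

```python
def max_match(text, dictionary):
    """Greedy forward maximum matching: at each position take the
    longest dictionary word; fall back to a single character."""
    result, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                result.append(text[i:j])
                i = j
                break
    return result

words = {"哈尔滨", "哈尔滨市"}
print(max_match("哈尔滨市", words))  # ['哈尔滨市'] -- the longer entry wins

words.discard("哈尔滨市")            # mimic deleting the "哈尔滨市" entry
print(max_match("哈尔滨市", words))  # ['哈尔滨', '市']
```

With the longer entry removed, the segmenter falls through to "哈尔滨" plus the leftover character "市", which is the tokenization the search use case needs.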

Workaround: download the mmseg4j-core source, delete line 35887 from words.dic, and Complex and Simple modes then segment this case correctly. Run the install step yourself to produce the patched jar.

To make debugging easier, I'm providing a jar with the change already applied: link
