Add keyword extraction and provide a benchmark #23

Add a downloadable external dictionary to help new-word discovery exclude known words #24
blmoistawinde · Oct 8, 2020 · f97f2bb
1 parent bdd1098
Showing 10 changed files with 928 additions and 34 deletions.
diff --git a/README.md b/README.md
@@ -7,7 +7,7 @@ Sow with little data seed, harvest much from a text field.
 ![GitHub stars](https://img.shields.io/github/stars/blmoistawinde/harvesttext?style=social) 
 ![PyPI - Python Version](https://img.shields.io/badge/python-3.6+-blue.svg) 
 ![GitHub](https://img.shields.io/github/license/mashape/apistatus.svg) 
-![Version](https://img.shields.io/badge/version-V0.7-red.svg)
+![Version](https://img.shields.io/badge/version-V0.8-red.svg)
 
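 ## Purpose
 HarvestText is a library focused on unsupervised and weakly supervised methods. It integrates domain knowledge (such as entity types and aliases) to process and analyze domain-specific text simply and efficiently. It suits many text preprocessing and exploratory analysis tasks, with potential applications in novel analysis, web text, and specialized literature.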
@@ -478,6 +478,37 @@ Text summarization (avoid repetition)
 武磊和郜林，谁是中国最好的前锋？
 ```
 
+<a id="关键词抽取"> </a>
+
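+### Keyword extraction
+
+Two algorithms are currently available: `textrank`, and `jieba_tfidf` (the default), which wraps jieba with parameters and stopwords preconfigured by HarvestText.
+
+Example (see [example](./examples/basics.py) for the full version):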
+
+```python3
+# text holds the lyrics of JJ Lin's song 《关键词》 ("Keywords")
+print("《关键词》里的关键词")
+kwds = ht.extract_keywords(text, 5, method="jieba_tfidf")
+print("jieba_tfidf", kwds)
+kwds = ht.extract_keywords(text, 5, method="textrank")
+print("textrank", kwds)
+```
+
+```
+《关键词》里的关键词
+jieba_tfidf ['自私', '慷慨', '落叶', '消逝', '故事']
+textrank ['自私', '落叶', '慷慨', '故事', '位置']
+```
+
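+[CSL.ipynb](./examples/kwd_benchmark/CSL.ipynb) compares the different algorithms, and this library's implementation against [textrank4zh](https://github.com/letiantian/TextRank4ZH), on the [CSL dataset](https://github.com/CLUEbenchmark/CLUE#6-csl-%E8%AE%BA%E6%96%87%E5%85%B3%E9%94%AE%E8%AF%8D%E8%AF%86%E5%88%AB-keyword-recognition). Because only one dataset is used and it is poorly suited to all of the algorithms above, the results are for reference only.
+
+| Algorithm | P@5 | R@5 | F@5 |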
+| --- | --- | --- | --- |
+| textrank4zh | 0.0836 | 0.1174 | 0.0977 |
+| ht_textrank | 0.0955 | 0.1342 | 0.1116 |
+| ht_jieba_tfidf | **0.1035** | **0.1453** | **0.1209** |
+
 
 <a id="内置资源"> </a>
 
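@@ -486,9 +517,11 @@ Text summarization (avoid repetition)
 The library now bundles some resources for convenient use and for building demos.
 
 Resources include:
-- Positive/negative sentiment lexicon, compiled by Li Jun of Tsinghua University from http://nlp.csai.tsinghua.edu.cn/site2/index.php/13-sms
-- Baidu stopword list, from the web: https://wenku.baidu.com/view/98c46383e53a580216fcfed9.html
-- Domain lexicons from Tsinghua THUNLP: http://thuocl.thunlp.org/ All types: `['IT', '动物', '医药', '历史人名', '地名', '成语', '法律', '财经', '食物']`
+- `get_qh_sent_dict`: positive/negative sentiment lexicon, compiled by Li Jun of Tsinghua University from http://nlp.csai.tsinghua.edu.cn/site2/index.php/13-sms
+- `get_baidu_stopwords`: Baidu stopword list, from the web: https://wenku.baidu.com/view/98c46383e53a580216fcfed9.html
+- `get_qh_typed_words`: domain lexicons from Tsinghua THUNLP: http://thuocl.thunlp.org/ All types: `['IT', '动物', '医药', '历史人名', '地名', '成语', '法律', '财经', '食物']`
+- `get_english_senti_lexicon`: English sentiment lexicon
+- `get_jieba_dict`: (requires download) jieba word-frequency dictionary
 
 
 In addition, a special resource is provided, *Romance of the Three Kingdoms* (《三国演义》), which includes: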
@@ -590,6 +623,21 @@ min_aggregation = np.sqrt(length) / 15
 </details>
 <br/>
 
+<details><summary>Filter out known words with the jieba dictionary (expand to view)</summary>
+```
+from harvesttext.resources import get_jieba_dict
+jieba_dict = get_jieba_dict(min_freq=100)
+print("Number of words in the jieba dictionary with frequency > 100:", len(jieba_dict))
+text = "1979-1998-2020的喜宝们 我现在记忆不太好，大概是拍戏时摔坏了~有什么笔记都要当下写下来。前几天翻看，找着了当时记下的话.我觉得喜宝既不娱乐也不启示,但这就是生活就是人生,10/16来看喜宝吧"
+new_words_info = ht.word_discover(text, 
+                                    excluding_words=set(jieba_dict),       # exclude words already in the dictionary
+                                    exclude_number=True)                   # exclude numbers (default True)
+new_words = new_words_info.index.tolist()
+print(new_words)                                                         # ['喜宝']
+```
+</details>
+<br/>
+
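 [Updated based on feedback](https://github.com/blmoistawinde/HarvestText/issues/13#issue-551894838) Originally only a single string was accepted by default; a list of strings is now also accepted and is concatenated automatically.
 
 [Updated based on feedback](https://github.com/blmoistawinde/HarvestText/issues/14#issuecomment-576081430) Results are now sorted by word frequency in descending order by default; pass `sort_by='score'` to sort by the overall quality score instead.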
@@ -802,3 +850,5 @@ we imagine what we'll find, in another life.
 
 [EventTriplesExtraction](https://github.com/liuhuanyong/EventTriplesExtraction)
 
+[textrank4ZH](https://github.com/letiantian/TextRank4ZH)
+
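For readers checking the benchmark table added to the README above, the sketch below shows one common way to compute P@k, R@k, and F@k for keyword extraction. It is an assumption about the metric definitions, not a copy of the notebook's code: the function names `prf_at_k` and `macro_prf_at_k` and the toy data are hypothetical. The table values are at least consistent with F@5 being the harmonic mean of the reported P@5 and R@5, for example 2 * 0.1035 * 0.1453 / (0.1035 + 0.1453) ≈ 0.1209.

```python
from typing import Iterable, Sequence, Tuple


def prf_at_k(predicted: Sequence[str], gold: Iterable[str], k: int = 5) -> Tuple[float, float]:
    """Return (precision@k, recall@k) for a single document.

    Assumption: precision is divided by k even if fewer than k keywords
    were predicted; other conventions divide by the number predicted.
    """
    top_k = list(predicted)[:k]
    gold_set = set(gold)
    hits = sum(1 for word in top_k if word in gold_set)
    precision = hits / k if k else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    return precision, recall


def macro_prf_at_k(predictions, golds, k: int = 5) -> Tuple[float, float, float]:
    """Average P@k and R@k over documents, then take their harmonic mean as F@k."""
    pairs = [prf_at_k(pred, gold, k) for pred, gold in zip(predictions, golds)]
    if not pairs:
        return 0.0, 0.0, 0.0
    p = sum(pair[0] for pair in pairs) / len(pairs)
    r = sum(pair[1] for pair in pairs) / len(pairs)
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f


# Toy data (hypothetical): two documents, five predicted keywords each.
predictions = [["a", "b", "c", "d", "e"], ["x", "y", "z", "u", "v"]]
golds = [["a", "c", "q"], ["y"]]
p, r, f = macro_prf_at_k(predictions, golds, k=5)
print(round(p, 4), round(r, 4), round(f, 4))
```

With few gold keywords per document, P@5 is capped well below 1 even for a perfect extractor, which is one reason absolute scores on such a benchmark can look low.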
diff --git a/examples/basics.py b/examples/basics.py
@@ -1,6 +1,7 @@
 #coding=utf-8
 import re
 from harvesttext import HarvestText
+
 ht = HarvestText()
 
 def new_word_discover():
@@ -398,29 +399,80 @@ def test_english():
     # for sent0 in sentences:
     #     print(sent0, ht_eng.analyse_sent(sent0))
 
-
+def jieba_dict_new_word():
+    from harvesttext.resources import get_jieba_dict
+    jieba_dict = get_jieba_dict(min_freq=100)
+    print("Number of words in the jieba dictionary with frequency > 100:", len(jieba_dict))
+    text = "1979-1998-2020的喜宝们 我现在记忆不太好，大概是拍戏时摔坏了~有什么笔记都要当下写下来。前几天翻看，找着了当时记下的话.我觉得喜宝既不娱乐也不启示,但这就是生活就是人生,10/16来看喜宝吧"
+    new_words_info = ht.word_discover(text, 
+                                      excluding_words=set(jieba_dict),       # exclude words already in the dictionary
+                                      exclude_number=True)                   # exclude numbers (default True)
+    new_words = new_words_info.index.tolist()
+    print(new_words)                                                         # ['喜宝']
+
+def extract_keywords():
+    text = """
+好好爱自己 就有人会爱你
+这乐观的说词
+幸福的样子 我感觉好真实
+找不到形容词
+沉默在掩饰 快泛滥的激情
+只剩下语助词
+有一种踏实 当你口中喊我名字
+落叶的位置 谱出一首诗
+时间在消逝 我们的故事开始
+这是第一次
+让我见识爱情 可以慷慨又自私
+你是我的关键词
+我不太确定 爱最好的方式
+是动词或名词
+很想告诉你 最赤裸的感情
+却又忘词
+聚散总有时 而哭笑也有时
+我不怕潜台词
+有一种踏实 是你心中有我名字
+落叶的位置 谱出一首诗
+时间在消逝 我们的故事开始
+这是第一次
+让我见识爱情 可以慷慨又自私
+你是我的关键词
+你藏在歌词 代表的意思
+是专有名词
+落叶的位置 谱出一首诗
+我们的故事 才正要开始
+这是第一次
+爱一个人爱得 如此慷慨又自私
+你是我的关键
+    """
+    print("《关键词》里的关键词")
+    kwds = ht.extract_keywords(text, 5, method="jieba_tfidf")
+    print("jieba_tfidf", kwds)
+    kwds = ht.extract_keywords(text, 5, method="textrank")
+    print("textrank", kwds)
 
 if __name__ == "__main__":
-    test_english()
-    new_word_discover()
-    new_word_register()
-    entity_segmentation()
-    sentiment_dict()
-    sentiment_dict_default()
-    entity_search()
-    text_summarization()
-    entity_network()
-    save_load_clear()
-    load_resources()
-    linking_strategy()
-    find_with_rules()
-    load_resources()
-    using_typed_words()
-    build_word_ego_graph()
-    entity_error_check()
-    depend_parse()
-    named_entity_recognition()
-    el_keep_all()
-    filter_el_with_rule()
-    clean_text()
-    cut_paragraph()
+    # test_english()
+    # new_word_discover()
+    # new_word_register()
+    # entity_segmentation()
+    # sentiment_dict()
+    # sentiment_dict_default()
+    # entity_search()
+    # text_summarization()
+    # entity_network()
+    # save_load_clear()
+    # load_resources()
+    # linking_strategy()
+    # find_with_rules()
+    # load_resources()
+    # using_typed_words()
+    # build_word_ego_graph()
+    # entity_error_check()
+    # depend_parse()
+    # named_entity_recognition()
+    # el_keep_all()
+    # filter_el_with_rule()
+    # clean_text()
+    # cut_paragraph()
+    # jieba_dict_new_word()
+    extract_keywords()
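The `jieba_dict_new_word()` example in this file passes the dictionary's words to `excluding_words`. As a rough illustration of that idea only, and not HarvestText's internal implementation, the sketch below drops candidates that already appear in a frequency dictionary or consist only of digits. The function name `filter_new_words`, the candidate list, and the frequencies are all hypothetical, and treating `min_freq` as an inclusive threshold is an assumption.

```python
from typing import Dict, Iterable, List


def filter_new_words(candidates: Iterable[str],
                     freq_dict: Dict[str, int],
                     min_freq: int = 100,
                     exclude_number: bool = True) -> List[str]:
    """Keep only candidates that are not already well-attested dictionary words."""
    # Assumption: a word counts as "known" when its frequency is >= min_freq.
    known = {word for word, freq in freq_dict.items() if freq >= min_freq}
    kept = []
    for word in candidates:
        if word in known:
            continue  # already in the dictionary, so not a new word
        if exclude_number and word.isdigit():
            continue  # pure numbers are not interesting as new words
        kept.append(word)
    return kept


# Hypothetical frequencies and candidates, for illustration only.
toy_dict = {"生活": 5000, "人生": 3000, "喜宝": 3}
candidates = ["喜宝", "生活", "2020", "人生"]
print(filter_new_words(candidates, toy_dict))  # ['喜宝']
```

A frequency threshold keeps rare dictionary entries eligible as "new" words, which matters when a general-purpose dictionary lists a term that is still unusual in the target domain.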