
Commit

fixes
blmoistawinde committed May 13, 2024
1 parent 35c841c commit ba9b3d2
Showing 7 changed files with 148 additions and 69 deletions.
128 changes: 86 additions & 42 deletions README.md
@@ -1,8 +1,6 @@
# HarvestText

Sow with little data seed, harvest much from a text field.

播撒几多种子词,收获万千领域实
HarvestText : A Toolkit for Text Mining and Preprocessing

![GitHub stars](https://img.shields.io/github/stars/blmoistawinde/harvesttext?style=social)
![PyPI - Python Version](https://img.shields.io/badge/python-3.6+-blue.svg)
@@ -186,61 +184,98 @@ text1 = "回复@钱旭明QXM:[嘻嘻][嘻嘻] //@钱旭明QXM:杨大哥[good][go
print("清洗微博【@和表情符等】")
print("原:", text1)
print("清洗后:", ht0.clean_text(text1))
```

```
各种清洗文本
清洗微博【@和表情符等】
原: 回复@钱旭明QXM:[嘻嘻][嘻嘻] //@钱旭明QXM:杨大哥[good][good]
清洗后: 杨大哥
```
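
For reference, the core of this weibo cleanup is two regex substitutions taken from the clean_text diff further down; a standalone sketch, with the default expression_len=(1,6) baked into the emoticon pattern:

```python
import re

text = "回复@钱旭明QXM:[嘻嘻][嘻嘻] //@钱旭明QXM:杨大哥[good][good]"
# strip @-mentions and reply/repost headers
text = re.sub(r"(回复)?(//)?\s*@\S*?\s*(:|:| |$)", " ", text)
# strip bracketed emoticons of 1-6 non-space characters, e.g. [嘻嘻], [good]
text = re.sub(r"\[\S{1,6}?\]", "", text)
print(text.strip())  # 杨大哥
```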

```python
# clean URLs
text1 = "【#赵薇#:正筹备下一部电影 但不是青春片....http://t.cn/8FLopdQ"
print("清洗网址URL")
print("原:", text1)
print("清洗后:", ht0.clean_text(text1, remove_url=True))
```
```
清洗网址URL
原: 【#赵薇#:正筹备下一部电影 但不是青春片....http://t.cn/8FLopdQ
清洗后: 【#赵薇#:正筹备下一部电影 但不是青春片....
```
```python
# remove email addresses
text1 = "我的邮箱是[email protected],欢迎联系"
print("清洗邮箱")
print("原:", text1)
print("清洗后:", ht0.clean_text(text1, email=True))
```
```
清洗邮箱
原: 我的邮箱是[email protected],欢迎联系
清洗后: 我的邮箱是,欢迎联系
```
```python
# decode URL percent-escapes
text1 = "www.%E4%B8%AD%E6%96%87%20and%20space.com"
print("URL转正常字符")
print("原:", text1)
print("清洗后:", ht0.clean_text(text1, norm_url=True, remove_url=False))
```
```
URL转正常字符
原: www.%E4%B8%AD%E6%96%87%20and%20space.com
清洗后: www.中文 and space.com
```
```python
text1 = "www.中文 and space.com"
print("正常字符转URL[含有中文和空格的request需要注意]")
print("原:", text1)
print("清洗后:", ht0.clean_text(text1, to_url=True, remove_url=False))
```
```
正常字符转URL[含有中文和空格的request需要注意]
原: www.中文 and space.com
清洗后: www.%E4%B8%AD%E6%96%87%20and%20space.com
```
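
Both directions are plain percent-encoding; clean_text uses urllib.parse for the to_url case (see the harvesttext.py diff below), and the round trip can be reproduced with the standard library alone:

```python
from urllib.parse import quote, unquote

print(unquote("www.%E4%B8%AD%E6%96%87%20and%20space.com"))  # www.中文 and space.com
print(quote("www.中文 and space.com"))  # www.%E4%B8%AD%E6%96%87%20and%20space.com
```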
```python
# decode HTML escape characters
text1 = "<a c> ''"
print("HTML转正常字符")
print("原:", text1)
print("清洗后:", ht0.clean_text(text1, norm_html=True))
```
```
HTML转正常字符
原: &lt;a c&gt;&nbsp;&#x27;&#x27;
清洗后: <a c> ''
```
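
norm_html corresponds to the standard library's html.unescape, which is exactly what the clean_text diff below calls:

```python
import html

# &lt; -> <, &gt; -> >, &nbsp; -> non-breaking space, &#x27; -> '
print(html.unescape("&lt;a c&gt;&nbsp;&#x27;&#x27;"))  # <a c> ''
```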
```python
# convert traditional Chinese to simplified
text1 = "心碎誰買單"
print("繁体字转简体")
print("原:", text1)
print("清洗后:", ht0.clean_text(text1, t2s=True))
```

```
各种清洗文本
清洗微博【@和表情符等】
原: 回复@钱旭明QXM:[嘻嘻][嘻嘻] //@钱旭明QXM:杨大哥[good][good]
清洗后: 杨大哥
清洗网址URL
原: 【#赵薇#:正筹备下一部电影 但不是青春片....http://t.cn/8FLopdQ
清洗后: 【#赵薇#:正筹备下一部电影 但不是青春片....
清洗邮箱
原: 我的邮箱是[email protected],欢迎联系
清洗后: 我的邮箱是,欢迎联系
URL转正常字符
原: www.%E4%B8%AD%E6%96%87%20and%20space.com
清洗后: www.中文 and space.com
正常字符转URL[含有中文和空格的request需要注意]
原: www.中文 and space.com
清洗后: www.%E4%B8%AD%E6%96%87%20and%20space.com
HTML转正常字符
原: &lt;a c&gt;&nbsp;&#x27;&#x27;
清洗后: <a c> ''
繁体字转简体
原: 心碎誰買單
清洗后: 心碎谁买单
```
```python
# extract text from markdown hyperlinks
text1 = "欢迎使用[HarvestText : A Toolkit for Text Mining and Preprocessing](https://github.com/blmoistawinde/HarvestText)这个库"
print("markdown超链接提取文本")
print("原:", text1)
print("清洗后:", ht0.clean_text(text1, t2s=True))
```
```
markdown超链接提取文本
原: 欢迎使用[HarvestText : A Toolkit for Text Mining and Preprocessing](https://github.com/blmoistawinde/HarvestText)这个库
清洗后: 欢迎使用HarvestText : A Toolkit for Text Mining and Preprocessing这个库
```
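
This cleanup is a single regex substitution in clean_text (visible in the harvesttext.py diff below); a standalone sketch:

```python
import re

def strip_markdown_links(text):
    # "[anchor text](url)" -> "anchor text"
    return re.sub(r"\[(.+?)\]\(\S+\)", r"\1", text)

print(strip_markdown_links("欢迎使用[HarvestText](https://github.com/blmoistawinde/HarvestText)这个库"))
# 欢迎使用HarvestText这个库
```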

<a id="命名实体识别"> </a>

@@ -307,27 +342,33 @@ def entity_error_check():
sent0 = "武磊和吴磊拼音相同"
print(sent0)
print(ht0.entity_linking(sent0, pinyin_tolerance=0))
"""
武磊和吴磊拼音相同
[([0, 2], ('武磊', '#人名#')), ([3, 5], ('武磊', '#人名#'))]
"""
sent1 = "武磊和吴力只差一个拼音"
print(sent1)
print(ht0.entity_linking(sent1, pinyin_tolerance=1))
"""
武磊和吴力只差一个拼音
[([0, 2], ('武磊', '#人名#')), ([3, 5], ('武磊', '#人名#'))]
"""
sent2 = "武磊和吴磊只差一个字"
print(sent2)
print(ht0.entity_linking(sent2, char_tolerance=1))
"""
武磊和吴磊只差一个字
[([0, 2], ('武磊', '#人名#')), ([3, 5], ('武磊', '#人名#'))]
"""
sent3 = "吴磊和吴力都可能是武磊的代称"
print(sent3)
print(ht0.get_linking_mention_candidates(sent3, pinyin_tolerance=1, char_tolerance=1))
"""
吴磊和吴力都可能是武磊的代称
('吴磊和吴力都可能是武磊的代称', defaultdict(<class 'list'>, {(0, 2): {'武磊'}, (3, 5): {'武磊'}}))
"""
```

```
武磊和吴磊拼音相同
[([0, 2], ('武磊', '#人名#')), ([3, 5], ('武磊', '#人名#'))]
武磊和吴力只差一个拼音
[([0, 2], ('武磊', '#人名#')), ([3, 5], ('武磊', '#人名#'))]
武磊和吴磊只差一个字
[([0, 2], ('武磊', '#人名#')), ([3, 5], ('武磊', '#人名#'))]
吴磊和吴力都可能是武磊的代称
('吴磊和吴力都可能是武磊的代称', defaultdict(<class 'list'>, {(0, 2): {'武磊'}, (3, 5): {'武磊'}}))
```
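
HarvestText already depends on pypinyin (see the import in harvesttext/algorithms/entity_discoverer.py below). As a rough sketch of the idea behind pinyin_tolerance, not the library's actual matching code, two mentions can be linked when their pinyin sequences are close enough:

```python
from pypinyin import lazy_pinyin

def same_pinyin(a, b):
    # "武磊" and "吴磊" both romanize to ['wu', 'lei']
    return lazy_pinyin(a) == lazy_pinyin(b)

print(same_pinyin("武磊", "吴磊"))  # True  -> linkable at pinyin_tolerance=0
print(same_pinyin("武磊", "吴力"))  # False: ['wu', 'lei'] vs ['wu', 'li'], one syllable apart
```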
<a id="情感分析"> </a>

### Sentiment Analysis
@@ -378,7 +419,17 @@ print("%s:%f" % ("二十万",sent_dict["二十万"]))
print("%s:%f" % ("万恶",sent_dict["万恶"]))
print("%f:%s" % (ht.analyse_sent(docs[0]), docs[0]))
print("%f:%s" % (ht.analyse_sent(docs[1]), docs[1]))

```
```
sentiment dictionary using default seed words
scale="0-1", 按照最大为1,最小为0进行线性伸缩,0.5未必是中性
赞同:1.000000
二十万:0.153846
万恶:0.000000
0.449412:张市筹设兴华实业公司外区资本家踊跃投资晋察冀边区兴华实业公司,自筹备成立以来,解放区内外企业界人士及一般商民,均踊跃认股投资
0.364910:打倒万恶的资本家
```
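
The scale="0-1" figures above are a linear min-max rescaling of the raw word scores, which is why the most positive word (赞同) lands exactly on 1 and the most negative (万恶) on 0, and why 0.5 is not necessarily neutral. A minimal sketch of that transform (my reading of the documented behavior, not the library's exact code):

```python
def scale_01(sent_dict):
    lo, hi = min(sent_dict.values()), max(sent_dict.values())
    # linear rescaling: the maximum maps to 1, the minimum to 0
    return {w: (s - lo) / (hi - lo) for w, s in sent_dict.items()}
```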
```python
print("scale=\"+-1\", 在正负区间内分别伸缩,保留0作为中性的语义")
sent_dict = ht.build_sent_dict(docs,min_times=1,scale="+-1")
print("%s:%f" % ("赞同",sent_dict["赞同"]))
@@ -389,13 +440,6 @@ print("%f:%s" % (ht.analyse_sent(docs[1]), docs[1]))
```

```
sentiment dictionary using default seed words
scale="0-1", 按照最大为1,最小为0进行线性伸缩,0.5未必是中性
赞同:1.000000
二十万:0.153846
万恶:0.000000
0.449412:张市筹设兴华实业公司外区资本家踊跃投资晋察冀边区兴华实业公司,自筹备成立以来,解放区内外企业界人士及一般商民,均踊跃认股投资
0.364910:打倒万恶的资本家
scale="+-1", 在正负区间内分别伸缩,保留0作为中性的语义
赞同:1.000000
二十万:0.000000
@@ -859,7 +903,7 @@ we imagine what we'll find, in another life.
```
@misc{zhangHarvestText,
author = {Zhiling Zhang},
title = {{G}it{H}ub - blmoistawinde/{H}arvest{T}ext},
title = {HarvestText: A Toolkit for Text Mining and Preprocessing},
journal = {GitHub repository},
howpublished = {\url{https://github.com/blmoistawinde/HarvestText}},
year = {2023}
5 changes: 5 additions & 0 deletions examples/basics.py
@@ -350,6 +350,11 @@ def clean_text():
print("繁体字转简体")
print("原:", text1)
print("清洗后:", ht0.clean_text(text1, t2s=True))
# extract text from markdown hyperlinks
text1 = "欢迎使用[HarvestText : A Toolkit for Text Mining and Preprocessing](https://github.com/blmoistawinde/HarvestText)这个库"
print("markdown超链接提取文本")
print("原:", text1)
print("清洗后:", ht0.clean_text(text1, t2s=True))

def extract_only_chinese(file):
pattern = re.compile(r'[^\u4e00-\u9fa5]')
2 changes: 1 addition & 1 deletion harvesttext/__init__.py
@@ -4,7 +4,7 @@
from .harvesttext import HarvestText
from .resources import *

__version__ = '0.8.1.8'
__version__ = '0.8.2.1'

def saveHT(htModel,filename):
with open(filename, "wb") as f:
5 changes: 4 additions & 1 deletion harvesttext/algorithms/entity_discoverer.py
@@ -8,7 +8,6 @@
from pypinyin import lazy_pinyin
from collections import defaultdict
from sklearn.metrics.pairwise import cosine_similarity
from gensim.models import FastText

class NERPEntityDiscover:
def __init__(self, sent_words, type_entity_dict, entity_count, pop_words_cnt, word2id, id2word,
@@ -137,6 +136,10 @@ def __init__(self, sent_words, type_entity_dict, entity_count, pop_words_cnt, wo
self.entity_mention_dict, self.entity_type_dict = self.organize(partition, pattern_entity2mentions)

def train_emb(self, sent_words, word2id, id2word, emb_dim, min_count, ft_iters, use_subword, min_n, max_n):
try:
from gensim.models import FastText
except ImportError:
raise Exception("The FastText feature requires gensim: pip install -U gensim")
"""因为fasttext的词频筛选策略(>=5),word2id和id2word会发生改变,但是要保持按照词频的排序
:return: emb_mat, word2id, id2word
58 changes: 35 additions & 23 deletions harvesttext/harvesttext.py
Expand Up @@ -727,10 +727,10 @@ def cut_sentences(self, para, drop_empty_line=True, strip=True, deduplicate=Fals
return sentences

def clean_text(self, text, remove_url=True, email=True, weibo_at=True, stop_terms=("转发微博",),
emoji=True, weibo_topic=False, deduplicate_space=True,
emoji=True, weibo_topic=False, markdown_hyperlink=True, deduplicate_space=True,
norm_url=False, norm_html=False, to_url=False,
remove_puncts=False, remove_tags=True, t2s=False,
expression_len=(1,6), linesep2space=False):
expression_len=(1,6), linesep2space=False, custom_regex=None):
'''
Perform various text cleaning operations: weibo-specific formats, URLs, emails, HTML code, and so on
Expand All @@ -741,6 +741,7 @@ def clean_text(self, text, remove_url=True, email=True, weibo_at=True, stop_term
:param stop_terms: remove certain specific terms from the text; default ("转发微博",)
:param emoji: (on by default) remove text wrapped in \[\], usually emoticons
:param weibo_topic: (off by default) remove text wrapped in ##, usually weibo topics
:param markdown_hyperlink: (on by default) reduce markdown-style hyperlinks "[text](link)" to just "text"
:param deduplicate_space: (on by default) merge consecutive spaces in the text into one
:param norm_url: (off by default) unescape special characters in URLs back to plain form (e.g. %20 to space)
:param norm_html: (off by default) unescape HTML entities back to plain form (e.g. \&nbsp; to space)
@@ -750,6 +751,7 @@ def clean_text(self, text, remove_url=True, email=True, weibo_at=True, stop_term
:param t2s: (off by default) convert traditional Chinese characters to simplified
:param expression_len: assumed length range for emoticons; bracketed text outside this range is not treated as an emoticon and is left alone, e.g. [加上特别番外荞麦花开时共五册]. Set to None for no limit
:param linesep2space: (off by default) replace line breaks with spaces
:param custom_regex: (default None) a regex, or a list of regexes, whose matches are removed from the text first
:return: the cleaned text
'''
# invisible unicode characters
@@ -760,12 +762,42 @@ def clean_text(self, text, remove_url=True, email=True, weibo_at=True, stop_term
# contradictory settings in opposite directions
if norm_url and to_url:
raise Exception("norm_url和to_url是矛盾的设置")
if custom_regex is not None:
if type(custom_regex) == str:
custom_regex = [custom_regex]
for pattern in custom_regex:
text = re.sub(pattern, "", text)

if norm_html:
text = html.unescape(text)
if to_url:
text = urllib.parse.quote(text)
if remove_tags:
text = w3lib.html.remove_tags(text)
if markdown_hyperlink:
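# keep the anchor text of "[text](url)" and drop the URL part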
text = re.sub(r"\[(.+?)\]\(\S+\)", r"\1", text)
if weibo_topic:
text = re.sub(r"#.+#", "", text) # 去除话题内容(中间可能有空格)
if emoji:
# remove emoticons wrapped in brackets
# the lazy match (?) avoids deleting the text between two emoticons
if type(expression_len) in {tuple, list} and len(expression_len) == 2:
# a length range avoids deleting legitimate bracketed text, e.g. [加上特别番外荞麦花开时共五册]
lb, rb = expression_len
text = re.sub(r"\[\S{"+str(lb)+r","+str(rb)+r"}?\]", "", text)
else:
text = re.sub(r"\[\S+?\]", "", text)
# text = re.sub(r"\[\S+\]", "", text)
# remove real, pictographic emoji
emoji_pattern = re.compile("["
u"\U0001F600-\U0001F64F" # emoticons
u"\U0001F300-\U0001F5FF" # symbols & pictographs
u"\U0001F680-\U0001F6FF" # transport & map symbols
u"\U0001F1E0-\U0001F1FF" # flags (iOS)
u"\U00002702-\U000027B0"
"]+", flags=re.UNICODE)
text = emoji_pattern.sub(r'', text)

if remove_url:
try:
URL_REGEX = re.compile(
@@ -786,27 +818,7 @@ def clean_text(self, text, remove_url=True, email=True, weibo_at=True, stop_term
text = re.sub(EMAIL_REGEX, "", text)
if weibo_at:
text = re.sub(r"(回复)?(//)?\s*@\S*?\s*(:|:| |$)", " ", text) # 去除正文中的@和回复/转发中的用户名
if emoji:
# remove emoticons wrapped in brackets
# the lazy match (?) avoids deleting the text between two emoticons
if type(expression_len) in {tuple, list} and len(expression_len) == 2:
# a length range avoids deleting legitimate bracketed text, e.g. [加上特别番外荞麦花开时共五册]
lb, rb = expression_len
text = re.sub(r"\[\S{"+str(lb)+r","+str(rb)+r"}?\]", "", text)
else:
text = re.sub(r"\[\S+?\]", "", text)
# text = re.sub(r"\[\S+\]", "", text)
# remove real, pictographic emoji
emoji_pattern = re.compile("["
u"\U0001F600-\U0001F64F" # emoticons
u"\U0001F300-\U0001F5FF" # symbols & pictographs
u"\U0001F680-\U0001F6FF" # transport & map symbols
u"\U0001F1E0-\U0001F1FF" # flags (iOS)
u"\U00002702-\U000027B0"
"]+", flags=re.UNICODE)
text = emoji_pattern.sub(r'', text)
if weibo_topic:
text = re.sub(r"#\S+#", "", text) # 去除话题内容

if linesep2space:
text = text.replace("\n", " ") # 不需要换行的时候变成1行
if deduplicate_space:
1 change: 0 additions & 1 deletion requirements.txt
@@ -1,5 +1,4 @@
scikit-learn
gensim
jieba
numpy
scipy
18 changes: 17 additions & 1 deletion tests/test_hard_text_cleaning.py
@@ -29,7 +29,23 @@ def test_hard_text_cleaning():
text1 = "JJ棋牌数据4.3万。数据链接http://www.jj.cn/,数据第一个账号,第二个密码,95%可登录,可以登录官网查看数据是否准确"
text2 = ht.clean_text(text1)
assert text2 == "JJ棋牌数据4.3万。数据链接,数据第一个账号,第二个密码,95%可登录,可以登录官网查看数据是否准确"

# complex web-page cleaning
text1 = "发布了头条文章:《【XT】每日开工链新事儿 06.30 星期二》 [http://t.cn/A6LsKirA#区块链[超话]#](http://t.cn/A6LsKirA#%E5%8C%BA%E5%9D%97%E9%93%BE[%E8%B6%85%E8%AF%9D]#) #数字货币[超话]# #买价值币,只选XT# #比特币[超话]# #XT每日开工链新事儿? 06.30# #腾讯回应起诉老干妈#"
text2 = ht.clean_text(text1, markdown_hyperlink=True, weibo_topic=True)
print("清洗前:", [text1])
print("清洗后:", [text2])
assert text2 == "发布了头条文章:《【XT】每日开工链新事儿 06.30 星期二》"
# supplementary cleaning with custom regexes
text1 = "【#马化腾状告陶华碧#,#腾讯请求查封贵州老于妈公司1624万财产#】6月30日,据中国裁判文书网,【】广东省深圳市南山区人民法院发布一则民事裁定书"
text2 = ht.clean_text(text1, custom_regex=r"【.*?】")
print("清洗前:", [text1])
print("清洗后:", [text2])
assert text2 == "6月30日,据中国裁判文书网,广东省深圳市南山区人民法院发布一则民事裁定书"
text1 = "#嘎龙[超话]#【云次方/嘎龙】 回忆录?!1-2 http://t.cn/A6yvkujb 3 http://t.cn/A6yvkGO 4 http://t.cn/A6yZ59m0"
text2 = ht.clean_text(text1, weibo_topic=True, custom_regex=[r"【.*?】", r'[0-9\-]* +http[s]?://(?:[a-zA-Z]|[0-9]|[#$%*-;=?&@~.&+]|[!*,])+'])
print("清洗前:", [text1])
print("清洗后:", [text2])
assert text2 == "回忆录?!"

if __name__ == "__main__":
test_hard_text_cleaning()
