The jieba word segmentation library in Python

Author: L-L  Date: 2023-08-10 01:26:30

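jieba is a widely used third-party library for Chinese word segmentation. Chinese text is written without spaces between words, so it has to be segmented before individual words can be processed.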

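1. Installing jieba

Open a cmd window (run it as administrator if the installation needs elevated permissions) and enter the command: pip install jieba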

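2. Overview of jieba features

Features:
Three segmentation modes are supported:
Precise mode: tries to cut the sentence as accurately as possible; suited to text analysis
Full mode: scans out every word that can be formed from the sentence; very fast, but it cannot resolve ambiguity
Search engine mode: builds on precise mode and splits long words a second time to improve recall; suited to search engine segmentation

Traditional Chinese segmentation is supported
Custom dictionaries are supported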

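Segmentation functions:

jieba.cut and jieba.lcut take two main arguments:

The first argument is the string to segment
The cut_all argument controls whether full mode is used (it defaults to False, which selects precise mode)

jieba.cut returns a generator; jieba.lcut converts the result to a list and returns it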

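jieba.cut_for_search and jieba.lcut_for_search take one required argument:

The string to segment

These methods are suited to building an inverted index for a search engine, because the segmentation is fine-grained
jieba.cut_for_search returns a generator; jieba.lcut_for_search returns a list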

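Adding a custom dictionary:

Developers can supply their own dictionary to cover words that are missing from jieba's built-in dictionary. jieba can recognize new words on its own, but adding them explicitly gives higher accuracy

Usage:

Using a custom dictionary file:

jieba.load_userdict(file_name) # file_name is the path to the custom dictionary file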

Modifying the dictionary dynamically in a program:

jieba.add_word(new_word) # new_word is the word to add

jieba.del_word(word) # removes word from the dictionary

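Keyword extraction:

jieba.analyse.extract_tags(sentence, topK) # requires import jieba.analyse first

sentence is the text to extract keywords from

topK is the number of keywords with the highest TF-IDF weights to return; the default is 20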

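Part-of-speech tagging:

jieba.posseg.POSTokenizer(tokenizer=None) creates a custom tagger; the tokenizer argument specifies the jieba.Tokenizer that it uses internally for segmentation

jieba.posseg.dt is the default part-of-speech tagger
It tags each word of the segmented sentence with its part of speech, using a notation compatible with ictclas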

3. Examples

3.1 Precise mode

import jieba

list1 = jieba.lcut("中华人民共和国是一个伟大的国家")
print(list1)
print("Precise mode: " + "/".join(list1))

3.2 Full mode

import jieba

list2 = jieba.lcut("中华人民共和国是一个伟大的国家", cut_all=True)
print(list2)
print("Full mode: " + "/".join(list2))

3.3 Search engine mode

import jieba

list3 = jieba.lcut_for_search("中华人民共和国是一个伟大的国家")
print(list3)
print("Search engine mode: " + " ".join(list3))

3.4 Modifying the dictionary

import jieba

text = "中信建投投资公司了一款游戏，中信也投资了一个游戏公司"
word = jieba.lcut(text)
print(word)
# Add words
jieba.add_word("中信建投")
jieba.add_word("投资公司")
word1 = jieba.lcut(text)
print(word1)
# Delete a word
jieba.del_word("中信建投")
word2 = jieba.lcut(text)
print(word2)

3.5 Part-of-speech tagging

import jieba.posseg as pseg

words = pseg.cut("我爱北京天安门")
for i in words:
    print(i.word, i.flag)  # the word and its part-of-speech tag

3.6 Counting how often characters appear in Romance of the Three Kingdoms

Download the text of Romance of the Three Kingdoms:

import jieba

# Open and read the file (replace "file path" with the path to the text file)
with open("file path", "r", encoding="utf-8") as f:
    txt = f.read()
words = jieba.lcut(txt)  # segment the text in precise mode
counts = {}  # maps each word to the number of times it occurs
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    counts[word] = counts.get(word, 0) + 1  # add 1 each time the word appears
items = list(counts.items())  # convert the key-value pairs to a list
items.sort(key=lambda x: x[1], reverse=True)  # sort by count, largest first
for word, count in items[:15]:  # print the 15 most frequent words
    print("{0:<10}{1:>5}".format(word, count))
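The counting loop above can also be written with collections.Counter from the standard library. This is a minimal sketch, not the author's code: the helper name top_words is hypothetical, and the short hand-written token list stands in for the output of jieba.lcut(txt) so that the snippet runs without the novel's text.

```python
from collections import Counter


def top_words(tokens, n=15):
    """Count tokens longer than one character and return the n most frequent."""
    counts = Counter(token for token in tokens if len(token) > 1)
    return counts.most_common(n)


# Hand-written tokens standing in for jieba.lcut(txt)
sample_tokens = ["曹操", "曰", "孔明", "曹操", "，", "玄德", "孔明", "曹操"]

top = top_words(sample_tokens, n=2)
for word, count in top:
    print("{0:<10}{1:>5}".format(word, count))
```

most_common(n) also copes with texts that contain fewer than n distinct words, where indexing a sorted list by position would raise an IndexError.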

import jieba

# Second version: merge aliases of the same character and drop frequent words that are not names
excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此", "如何"}
with open("三国演义.txt", "r", encoding="utf-8") as f:
    txt = f.read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    elif word in ("诸葛亮", "孔明曰"):
        rword = "孔明"
    elif word in ("关公", "云长"):
        rword = "关羽"
    elif word in ("玄德", "玄德曰"):
        rword = "刘备"
    elif word in ("孟德", "丞相"):
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1

for word in excludes:
    counts.pop(word, None)  # pop avoids a KeyError if the word never occurred

items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items[:10]:  # print the 10 most frequent names
    print("{0:<10}{1:>5}".format(word, count))
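The chain of elif branches can be replaced by a lookup table, which keeps the alias list in one place as it grows. The following is a sketch under the same assumptions as the code above: ALIASES and count_characters are hypothetical names, the aliases and excluded words are the ones already used above, and the hand-written token list stands in for jieba.lcut(txt).

```python
from collections import Counter

# Alias -> canonical name, taken from the elif branches above
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}
EXCLUDES = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此", "如何"}


def count_characters(tokens, aliases=ALIASES, excludes=EXCLUDES):
    """Count tokens after merging aliases, skipping excluded and one-character tokens."""
    counts = Counter()
    for token in tokens:
        if len(token) == 1:
            continue
        name = aliases.get(token, token)  # fall back to the token itself
        if name in excludes:
            continue
        counts[name] += 1
    return counts


# Hand-written tokens standing in for jieba.lcut(txt)
sample_tokens = ["玄德", "曰", "将军", "刘备", "玄德曰", "丞相", "曹操", "孔明"]
result = count_characters(sample_tokens)
for word, count in result.most_common(10):
    print("{0:<10}{1:>5}".format(word, count))
```

Filtering excluded words during counting gives the same totals as deleting them afterwards, and adding a new alias only requires one more dictionary entry.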

Source: https://www.cnblogs.com/L-hua/p/15584823.html

Tags: python, jieba, word segmentation library
