Python3 文章标题关键字提取的例子

作者：Muzi_Water 时间：2022-02-08 03:45:32　

思路：

1.读取所有文章标题；

2.用“结巴分词”的工具包进行文章标题的词语分割；

3.用“sklearn”的工具包计算Tf-idf（词频-逆文档率）;

4.得到满足关键词权重阈值的词

结巴分词详见：结巴分词Github

sklearn详见：文本特征提取——4.2.3.4 Tf-idf项加权

import os
import jieba
import sys
from sklearn.feature_extraction.text import TfidfVectorizer

sys.path.append("../")
jieba.load_userdict('userdictTest.txt')
STOP_WORDS = set((
"基于", "面向", "研究", "系统", "设计", "综述", "应用", "进展", "技术", "框架", "txt"
))

def getFileList(path):
filelist = []
files = os.listdir(path)
for f in files:
if f[0] == '.':
pass
else:
filelist.append(f)
return filelist, path

def fenci(filename, path, segPath):

# 保存分词结果的文件夹
if not os.path.exists(segPath):
os.mkdir(segPath)
seg_list = jieba.cut(filename)
result = []
for seg in seg_list:
seg = ''.join(seg.split())
if len(seg.strip()) >= 2 and seg.lower() not in STOP_WORDS:
result.append(seg)

# 将分词后的结果用空格隔开，保存至本地
f = open(segPath + "/" + filename + "-seg.txt", "w+")
f.write(' '.join(result))
f.close()

def Tfidf(filelist, sFilePath, path, tfidfw):
corpus = []
for ff in filelist:
fname = path + ff
f = open(fname + "-seg.txt", 'r+')
content = f.read()
f.close()
corpus.append(content)

vectorizer = TfidfVectorizer() # 该类实现词向量化和Tf-idf权重计算
tfidf = vectorizer.fit_transform(corpus)
word = vectorizer.get_feature_names()
weight = tfidf.toarray()

if not os.path.exists(sFilePath):
os.mkdir(sFilePath)

for i in range(len(weight)):
print('----------writing all the tf-idf in the ', i, 'file into ', sFilePath + '/', i, ".txt----------")
f = open(sFilePath + "/" + str(i) + ".txt", 'w+')
result = {}
for j in range(len(word)):
if weight[i][j] >= tfidfw:
result[word[j]] = weight[i][j]
resultsort = sorted(result.items(), key=lambda item: item[1], reverse=True)
for z in range(len(resultsort)):
f.write(resultsort[z][0] + " " + str(resultsort[z][1]) + '\r\n')
print(resultsort[z][0] + " " + str(resultsort[z][1]))
f.close()

TfidfVectorizer( ) 类实现了词向量化和Tf-idf权重的计算

词向量化：vectorizer.fit_transform是将corpus中保存的切分后的单词转为词频矩阵，其过程为先将所有标题切分的单词形成feature特征和列索引，并在dictionary中保存了{‘特征'：索引，……}，如{‘农业'：0，‘大数据'：1，……}，在csc_matric中为每个标题保存了 (标题下标，特征索引) 词频tf……，然后对dictionary中的单词进行排序重新编号，并对应更改csc_matric中的特征索引，以便形成一个特征向量词频矩阵，接着计算每个feature的idf权重，其计算公式为其中是所有文档数量，是包含该单词的文档数。最后计算tf*idf并进行正则化，得到关键词权重。

以下面六个文章标题为例进行关键词提取

Using jieba on 农业大数据研究与应用进展综述.txt

Using jieba on 基于Hadoop的分布式并行增量爬虫技术研究.txt

Using jieba on 基于RPA的财务共享服务中心账表核对流程优化.txt

Using jieba on 基于大数据的特征趋势统计系统设计.txt

Using jieba on 网络大数据平台异常风险监测系统设计.txt

Using jieba on 面向数据中心的多源异构数据统一访问框架.txt

----------writing all the tf-idf in the 0 file into ./keywords/ 0 .txt----------

农业 0.773262366783

大数据 0.634086202434

----------writing all the tf-idf in the 1 file into ./keywords/ 1 .txt----------

hadoop 0.5

分布式 0.5

并行增量 0.5

爬虫 0.5

----------writing all the tf-idf in the 2 file into ./keywords/ 2 .txt----------

rpa 0.408248290464

优化 0.408248290464

服务中心 0.408248290464

流程 0.408248290464

财务共享 0.408248290464

账表核对 0.408248290464

----------writing all the tf-idf in the 3 file into ./keywords/ 3 .txt----------

特征 0.521823488025

统计 0.521823488025

趋势 0.521823488025

大数据 0.427902724969

----------writing all the tf-idf in the 4 file into ./keywords/ 4 .txt----------

大数据平台 0.4472135955

异常 0.4472135955

监测 0.4472135955

网络 0.4472135955

风险 0.4472135955

----------writing all the tf-idf in the 5 file into ./keywords/ 5 .txt----------

多源异构数据 0.57735026919

数据中心 0.57735026919

统一访问 0.57735026919

来源：https://blog.csdn.net/Muzi_Water/article/details/83993842

标签：Python3,标题,关键字,提取

投稿

Python3 文章标题关键字提取的例子

猜你喜欢

asp.net DropDownList实现二级联动效果

MySQL全局锁和表锁的深入理解

浅谈PyTorch的可重复性问题(如何使实验结果可复现)

Go channel实现原理分析

解密Python中的描述符（descriptor）

python自动识别文本编码格式代码

javascript基础之数组(Array)对象

精致的web设计

python shell命令行中import多层目录下的模块操作

Win10系统下Pytorch环境的搭建过程

信息分类是为了更好的索引

用javascript给表格加滚动条

sql时间格式化输出、Convert函数应用示例

用python实现一个简单的验证码

关于base64编码的原理及实现方法分享

基于tensorflow指定GPU运行及GPU资源分配的几种方式小结

ASP中使用Form和QueryString集合

sqlserver 触发器学习（实现自动编号）

基于python 的Pygame最小开发框架

将以用户为中心的设计嵌入产品设计和开发流程