Paper plagiarism checking: Python source code for text similarity with SimHash
Author: 别None了 Date: 2023-02-05 18:11:35
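Use cases:
1. Compute the SimHash value and the Hamming distance between two values.
2. SimHash suits longer texts (more than roughly 300 to 500 characters); the shorter the text, the higher the false-positive rate.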
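To make the two operations in item 1 concrete, here is a minimal, dependency-free sketch. It is not the implementation used in this article: it assumes the text is already split into tokens, uses token frequency as the weight, and takes the first 8 bytes of an MD5 digest as the 64-bit token hash. The names `token_hash64`, `simhash64` and `hamming` are hypothetical helpers chosen for illustration.

```python
import hashlib
from collections import Counter


def token_hash64(token: str) -> int:
    """Stable 64-bit hash of a token (assumption: first 8 bytes of its MD5 digest)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash64(tokens) -> int:
    """Weighted bit-vote fingerprint; each token's weight is its frequency."""
    votes = [0] * 64
    for token, weight in Counter(tokens).items():
        h = token_hash64(token)
        for bit in range(64):
            # A set bit votes +weight for that position, a clear bit votes -weight.
            votes[bit] += weight if (h >> bit) & 1 else -weight
    fingerprint = 0
    for bit, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two texts are then compared with `hamming(simhash64(tokens_a), simhash64(tokens_b))`; a smaller distance suggests more similar token sets.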
Python implementation:
The code is as follows:
# -*- encoding:utf-8 -*-
import math

import jieba
import jieba.analyse


class SimHash(object):
    def getBinStr(self, source):
        # Hash a keyword to a 64-character bit string.
        if source == "":
            # Return a string, not the int 0, so callers can iterate over the bits.
            return "0" * 64
        x = ord(source[0]) << 7
        m = 1000003
        mask = 2 ** 128 - 1
        for c in source:
            x = ((x * m) ^ ord(c)) & mask
        x ^= len(source)
        # Keep the low 64 bits, left-padded with zeros.
        return bin(x).replace('0b', '').zfill(64)[-64:]

    def getWeight(self, source):
        return ord(source)

    def unwrap_weight(self, arr):
        # Map a list of signed values to a bit string: positive -> "1", otherwise "0".
        ret = ""
        for item in arr:
            tmp = 0
            if int(item) > 0:
                tmp = 1
            ret += str(tmp)
        return ret

    def sim_hash(self, rawstr):
        seg = jieba.cut(rawstr)
        keywords = jieba.analyse.extract_tags("|".join(seg), topK=100, withWeight=True)
        ret = []
        for keyword, weight in keywords:
            binstr = self.getBinStr(keyword)
            # Round the keyword weight up to an integer, once per keyword.
            weight = int(math.ceil(weight))
            keylist = []
            for c in binstr:
                if c == "1":
                    keylist.append(weight)
                else:
                    keylist.append(-weight)
            ret.append(keylist)
        if not ret:
            # No keywords were extracted (for example, empty input).
            return "0" * 64
        # Dimensionality reduction: sum each bit column and keep its sign.
        rows = len(ret)
        cols = len(ret[0])
        result = []
        for i in range(cols):
            tmp = 0
            for j in range(rows):
                tmp += ret[j][i]
            result.append("1" if tmp > 0 else "0")
        return "".join(result)

    def distance(self, hashstr1, hashstr2):
        # Hamming distance: the number of positions where the bits differ.
        length = 0
        for index, char in enumerate(hashstr1):
            if char != hashstr2[index]:
                length += 1
        return length


if __name__ == "__main__":
    simhash = SimHash()
    str1 = '咱哥俩谁跟谁啊'
    str2 = '咱们俩谁跟谁啊'
    hash1 = simhash.sim_hash(str1)
    print(hash1)
    hash2 = simhash.sim_hash(str2)
    distance = simhash.distance(hash1, hash2)
    value = 5  # maximum Hamming distance still treated as similar
    print("simhash distance:", distance, "threshold:", value, "similar:", distance <= value)
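The final comparison above treats two texts as similar when their fingerprints differ in at most 5 bits. As a rough sketch of that decision on its own, the following works directly on the 64-character bit strings that `sim_hash` returns, using XOR on the parsed integers instead of a character loop. `hamming_from_bits` and `is_similar` are hypothetical helper names, and the default threshold of 5 simply mirrors the value used in the example; it is an assumption to tune for your own data, not a recommended constant.

```python
def hamming_from_bits(bits1: str, bits2: str) -> int:
    """Hamming distance between two equal-length strings of '0' and '1'."""
    if len(bits1) != len(bits2):
        raise ValueError("fingerprints must have the same length")
    # XOR leaves a 1 exactly where the two fingerprints disagree.
    return bin(int(bits1, 2) ^ int(bits2, 2)).count("1")


def is_similar(bits1: str, bits2: str, threshold: int = 5) -> bool:
    """True when the fingerprints differ in at most `threshold` bits."""
    return hamming_from_bits(bits1, bits2) <= threshold


a = "0" * 64
b = "0" * 60 + "1111"   # differs from a in 4 positions
c = "1" * 8 + "0" * 56  # differs from a in 8 positions
print(hamming_from_bits(a, b), is_similar(a, b))  # 4 True
print(hamming_from_bits(a, c), is_similar(a, c))  # 8 False
```

Checking the lengths first avoids silently comparing fingerprints of different sizes, which the character loop would either truncate or fail on with an IndexError.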
Source: https://coderl.blog.csdn.net/article/details/122740744
Tags: python, simhash, paper plagiarism checking, text similarity