Filtering Sensitive Words in Python

Author: 学到老 | Date: 2021-02-26 04:23:17

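Overview:

Sensitive-word filtering can be viewed as a kind of text anti-spam algorithm. For example:
Task: given a sensitive-word file filtered_words.txt, replace every sensitive word the user types with asterisks (*), one per character. For example, if 「北京」 is in the list and the user enters 「北京是个好城市」 ("Beijing is a nice city"), the result is 「**是个好城市」.
Code: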

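# coding=utf-8
def filterwords(path):
    # Load the word list (one sensitive word per line), skipping blank lines.
    with open(path, 'r', encoding='utf-8') as f:
        words = [line.strip() for line in f if line.strip()]
    print(words)
    userinput = input('myinput:')
    for word in words:
        if word in userinput:
            # Mask the word with one asterisk per character and keep the
            # result, so that earlier replacements are not lost.
            userinput = userinput.replace(word, '*' * len(word))
    return userinput

print(filterwords('filtered_words.txt'))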
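
For small word lists, the replacement step can also be done in a single pass with a regular expression. The following is a minimal sketch, not the author's method: `build_pattern`, `mask`, and the sample word list are hypothetical names and data chosen for illustration.

```python
import re

def build_pattern(words):
    # Longest words first, so a longer entry wins over its own prefix.
    ordered = sorted((w for w in words if w), key=len, reverse=True)
    return re.compile("|".join(re.escape(w) for w in ordered))

def mask(text, pattern):
    # Replace every match with one asterisk per matched character.
    return pattern.sub(lambda m: "*" * len(m.group(0)), text)

words = ["北京", "程序员"]  # hypothetical word list, for illustration only
pattern = build_pattern(words)
print(mask("北京是个好城市", pattern))  # prints **是个好城市
```

`re.escape` keeps characters such as `*` or `.` in the word list from being read as regex syntax.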

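Another example, this time an anti-pornography filter:

Write a sensitive-word filtering program that prompts the user for a comment. If the input contains any of the words below:
Sensitive word list: li = ["苍老师", " * ", " * ", " * "]
then replace each sensitive word in the input with *** and append the result to a list. If the input contains no sensitive words, append it to the same list unchanged.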
content = input('Enter your comment: ')
li = ["苍老师", " * ", " * ", " * "]
comments = []  # collects the comments after filtering
for word in li:
    if word in content:
        # Replace the sensitive word with three asterisks.
        content = content.replace(word, '***')
# Filtered or not, the comment is appended to the list.
comments.append(content)
print(comments)

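Practical case:

A BAT (Baidu, Alibaba, Tencent) interview question: how would you quickly replace 50,000 sensitive words in 1 billion titles?
There are one billion titles stored in a file, one title per line. There are 50,000 sensitive words stored in another file. Write a program that filters every sensitive word out of every title and saves the result to a third file.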
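
Whatever matching algorithm is chosen, a file of this size is best processed as a stream rather than loaded whole. The sketch below shows only that outer loop; `filter_file`, `naive_mask`, the sample words, and the file names are all hypothetical, and the placeholder filter stands in for a real matcher.

```python
import os
import tempfile

def filter_file(src_path, dst_path, mask_line):
    # Stream the titles so that only one line is held in memory at a time.
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(mask_line(line.rstrip("\n")) + "\n")

def naive_mask(title, words=("bad", "spam")):
    # Placeholder filter: one scan of the title per word, which is
    # too slow once the word list grows to tens of thousands of entries.
    for w in words:
        title = title.replace(w, "*" * len(w))
    return title

# Tiny demo with temporary files standing in for the real title files.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "titles.txt")
dst_path = os.path.join(tmp_dir, "titles_filtered.txt")
with open(src_path, "w", encoding="utf-8") as f:
    f.write("a bad title\nclean title\nspam and bad\n")

filter_file(src_path, dst_path, naive_mask)
with open(dst_path, encoding="utf-8") as f:
    filtered = f.read().splitlines()
print(filtered)  # ['a *** title', 'clean title', '**** and ***']
```

Because `mask_line` is passed in as a function, the naive filter can be swapped for a trie-based or automaton-based one without touching the file handling.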

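1. DFA-based sensitive-word filtering

Among text-filtering algorithms, DFA is one of the better choices. DFA stands for Deterministic Finite Automaton.
The core of the algorithm is to build keyword trees (tries) from the sensitive words, so that words sharing a prefix share a path.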
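
To make the keyword-tree idea concrete, here is a minimal sketch of such a tree stored as nested dictionaries. `build_trie`, `match_length_at`, and the sample words are hypothetical; the `'\x00'` end marker is the same sentinel used by the listing in this section.

```python
END = "\x00"  # end-of-word marker

def build_trie(words):
    root = {}
    for word in words:
        level = root
        for ch in word:
            # Reuse the existing branch for a shared prefix, or create one.
            level = level.setdefault(ch, {})
        level[END] = 0
    return root

def match_length_at(trie, text, start):
    # Walk the trie from text[start]; return the length of the
    # shortest keyword starting there, or 0 if there is none.
    level = trie
    for offset, ch in enumerate(text[start:], 1):
        if ch not in level:
            return 0
        level = level[ch]
        if END in level:
            return offset
    return 0

trie = build_trie(["ab", "acd"])
print(trie)  # {'a': {'b': {'\x00': 0}, 'c': {'d': {'\x00': 0}}}}
print(match_length_at(trie, "xxacd", 2))  # 3
```

Both words start with `a`, so they share the first node; each lookup costs at most one dictionary access per character, regardless of how many words are in the list.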
Python implementation of the DFA algorithm:

# -*- coding:utf-8 -*-

import time
time1 = time.time()

# DFA algorithm
class DFAFilter():
    def __init__(self):
        self.keyword_chains = {}  # nested dicts forming the keyword tree
        self.delimit = '\x00'     # marks the end of a keyword

    def add(self, keyword):
        keyword = keyword.lower()
        chars = keyword.strip()
        if not chars:
            return
        level = self.keyword_chains
        for i in range(len(chars)):
            if chars[i] in level:
                # Shared prefix: follow the existing branch.
                level = level[chars[i]]
            else:
                if not isinstance(level, dict):
                    break
                # Create a new branch for the remaining characters.
                for j in range(i, len(chars)):
                    level[chars[j]] = {}
                    last_level, last_char = level, chars[j]
                    level = level[chars[j]]
                last_level[last_char] = {self.delimit: 0}
                break
        if i == len(chars) - 1:
            level[self.delimit] = 0

    def parse(self, path):
        # Load the word list, one keyword per line.
        with open(path, encoding='utf-8') as f:
            for keyword in f:
                self.add(str(keyword).strip())

    def filter(self, message, repl="*"):
        message = message.lower()
        ret = []
        start = 0
        while start < len(message):
            level = self.keyword_chains
            step_ins = 0
            for char in message[start:]:
                if char in level:
                    step_ins += 1
                    if self.delimit not in level[char]:
                        level = level[char]
                    else:
                        # A full keyword matched: mask it and skip past it.
                        ret.append(repl * step_ins)
                        start += step_ins - 1
                        break
                else:
                    # No keyword starts here: keep the character.
                    ret.append(message[start])
                    break
            else:
                # Message ended in the middle of a keyword prefix.
                ret.append(message[start])
            start += 1

        return ''.join(ret)

if __name__ == "__main__":
    gfw = DFAFilter()
    path = "F:/文本反垃圾算法/sensitive_words.txt"
    gfw.parse(path)
    text = " * 苹果新品发布会雞八"
    result = gfw.filter(text)

    print(text)
    print(result)
    time2 = time.time()
    print('Total time: ' + str(time2 - time1) + 's')

Output:

* 苹果新品发布会雞八
****苹果新品发布会**
Total time: 0.0010344982147216797s

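2. AC-automaton-based sensitive-word filtering

AC automaton (Aho-Corasick): a typical use case is that you are given n words and an article of m characters, and must find how many of the words appear in the article.
Put simply, an AC automaton is a trie combined with the KMP idea of failure (mismatch) pointers, so the text is scanned once without backtracking.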

# -*- coding:utf-8 -*-

import time
from collections import deque
time1 = time.time()

# AC automaton algorithm
class node(object):
    def __init__(self):
        self.next = {}       # child nodes keyed by character
        self.fail = None     # failure pointer
        self.isWord = False  # True if a sensitive word ends at this node
        self.word = ""       # the sensitive word that ends here

class ac_automation(object):

    def __init__(self):
        self.root = node()

    # Add a sensitive word to the trie
    def addword(self, word):
        temp_root = self.root
        for char in word:
            if char not in temp_root.next:
                temp_root.next[char] = node()
            temp_root = temp_root.next[char]
        temp_root.isWord = True
        temp_root.word = word

    # Build the failure pointers, breadth-first from the root
    def make_fail(self):
        temp_que = deque([self.root])
        while temp_que:
            temp = temp_que.popleft()
            for key, value in temp.next.items():
                if temp is self.root:
                    temp.next[key].fail = self.root
                else:
                    p = temp.fail
                    while p is not None:
                        if key in p.next:
                            # Point at the matching child, not at p's own failure pointer.
                            temp.next[key].fail = p.next[key]
                            break
                        p = p.fail
                    if p is None:
                        temp.next[key].fail = self.root
                temp_que.append(temp.next[key])

    # Find the sensitive words contained in the text
    def search(self, content):
        p = self.root
        result = []
        currentposition = 0

        while currentposition < len(content):
            word = content[currentposition]
            # On a mismatch, follow failure pointers back toward the root.
            while word not in p.next and p is not self.root:
                p = p.fail

            if word in p.next:
                p = p.next[word]
            else:
                p = self.root

            # Collect every word ending here, including shorter ones
            # reachable through the failure pointers.
            temp = p
            while temp is not self.root:
                if temp.isWord:
                    result.append(temp.word)
                temp = temp.fail
            currentposition += 1
        return result

    # Load the sensitive-word list
    def parse(self, path):
        with open(path, encoding='utf-8') as f:
            for keyword in f:
                keyword = str(keyword).strip()
                if keyword:
                    self.addword(keyword)
        # Failure pointers must be built before searching.
        self.make_fail()

    # Replace sensitive words
    def words_replace(self, text):
        """
        :param text: the text to filter
        :return: the text with sensitive words masked
        """
        result = list(set(self.search(text)))
        for x in result:
            m = text.replace(x, '*' * len(x))
            text = m
        return text

if __name__ == '__main__':

    ah = ac_automation()
    path = 'F:/文本反垃圾算法/sensitive_words.txt'
    ah.parse(path)
    text1 = " * 苹果新品发布会雞八"
    text2 = ah.words_replace(text1)

    print(text1)
    print(text2)

    time2 = time.time()
    print('Total time: ' + str(time2 - time1) + 's')

Output:

* 苹果新品发布会雞八
****苹果新品发布会**
Total time: 0.0010304450988769531s
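
The `words_replace` method in the listing above calls `str.replace` once per distinct word found, which rescans the text each time. One possible alternative is to mask by position during the single scan. The following is a compact sketch under that assumption, not the author's implementation; `build_automaton` and `mask` are hypothetical names, and the automaton is stored in parallel lists rather than node objects.

```python
from collections import deque

def build_automaton(words):
    # Per state: transitions, failure link, lengths of keywords ending there.
    goto, fail, out = [{}], [0], [[]]
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(len(word))
    queue = deque(goto[0].values())  # depth-1 states keep fail = 0 (root)
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the suffix state.
            out[nxt] += out[fail[nxt]]
            queue.append(nxt)
    return goto, fail, out

def mask(text, automaton, repl="*"):
    # repl is assumed to be a single character.
    goto, fail, out = automaton
    chars, state = list(text), 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for length in out[state]:
            # Overwrite the matched span in place.
            chars[i - length + 1:i + 1] = repl * length
    return "".join(chars)

automaton = build_automaton(["abcd", "bc", "he", "she"])
print(mask("abce ushers", automaton))  # a**e u***rs
```

In the example, `bc` is found inside the partial match `abc` through a failure link, and the overlapping `she` and `he` are both masked in the same pass.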

Source: https://cloud.tencent.com/developer/article/1395616
