How to Pick Out the Duplicate Lines Between Two Files with Python

Author: 非完美主义者  Date: 2021-02-16 12:53:04

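This article shares a Python script that picks out the lines of one file that also appear in another file, for your reference. The details are as follows.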
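
To make the goal concrete, here is a minimal in-memory sketch of the split that the script below produces. It is not the article's method (the script works through temporary files and a binary search so that large inputs can be handled in chunks); the function name `split_by_presence` and the sample lines are made up for illustration, and the sketch assumes both inputs fit in memory.

```python
def split_by_presence(a_lines, b_lines):
    """Return (unique, repeated): lines of A absent from B / present in B."""
    # Compare stripped text, as the script's binarySearch does
    b_keys = {line.strip() for line in b_lines}
    unique, repeated = [], []
    for line in a_lines:
        if line.strip() in b_keys:
            repeated.append(line)
        else:
            unique.append(line)
    return unique, repeated


unique, repeated = split_by_presence(
    ["apple\n", "banana\n", "cherry\n"],
    ["banana\n", "durian\n"],
)
print(unique)    # lines the script would write to "result"
print(repeated)  # lines the script would write to "repeat"
```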

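#!/usr/bin/env python3
'''
Split the lines of file A into those that also appear in file B
(written to "repeat") and those that do not (written to "result").
'''

import sys
import os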

def binarySearch(value, lines):
    '''
    String lookup: binary search for value in the sorted list lines.
    Returns 1 if found, otherwise 0.
    '''
    right = len(lines) - 1
    left = 0
    a = value.strip()
    while left <= right:
        middle = (left + right) // 2
        b = lines[middle].strip()
        if a == b:
            return 1

        if a < b:
            right = middle - 1
        else:
            left = middle + 1

    return 0

DPT = 100000  # DPT stands for "Data Per File": the number of lines per temporary file

fileAName = sys.argv[1]
fileBName = sys.argv[2]

# STEP 1: split file B, the comparison baseline, into temporary files named temp1, temp2, ..., tempN
print("Splitting the comparison file...\n")
fB = open(fileBName)
tempFileNo = 1
tempFileName = "temp{0}".format(tempFileNo)
fTemp = open(tempFileName, "w+")
line = fB.readline()
lineCount = 0
while line:
    # Start a new temporary file once the current one holds DPT lines
    if lineCount >= DPT:
        fTemp.flush()
        fTemp.close()
        tempFileNo = tempFileNo + 1
        tempFileName = "temp{0}".format(tempFileNo)
        fTemp = open(tempFileName, "w+")
        lineCount = 0
    fTemp.write(line)
    lineCount = lineCount + 1
    line = fB.readline()

fTemp.flush()
fTemp.close()

fB.close()
print("Split complete: {0} temporary file(s), {1} line(s) in total.\n".format(tempFileNo, (tempFileNo-1)*DPT + lineCount))

# STEP 2: compare file A against each temporary file split from B in turn,
# writing the lines that survive alternately to result0 and result1.
# The result file written last holds the final result.
fA = open(fileAName)
tempIndex = 0
fOut = open("repeat", "w+")
repeatCount = 0
for i in range(1, tempFileNo + 1):
    print("Comparing against temporary file {0}...\n".format(i))
    if 0 == tempIndex:
        resultTempFile = "result0"
        tempIndex = 1
    else:
        resultTempFile = "result1"
        tempIndex = 0
    fResult = open(resultTempFile, "w+")

    fTemp = open("temp{0}".format(i))
    # Strip before sorting so the list is ordered by the same key binarySearch compares
    lineList = sorted(l.strip() for l in fTemp)
    fTemp.close()

    line = fA.readline()
    while line:
        if 0 == binarySearch(line, lineList):
            fResult.write(line)
        else:
            fOut.write(line)
            repeatCount = repeatCount + 1
        line = fA.readline()
    fA.close()

    fResult.flush()
    fResult.close()

    # The lines that survived this round become the input of the next round
    fA = open(resultTempFile)

fA.close()
fOut.flush()
fOut.close()

print("Comparison complete: {0} duplicate line(s)".format(repeatCount))

# os.replace overwrites an existing "result" file on every platform
os.replace(resultTempFile, "result")

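# STEP 3: delete the temporary files once finished
print("Deleting temporary files...\n")
while tempFileNo > 0:
    tempFileName = "temp{0}".format(tempFileNo)
    os.remove(tempFileName)
    tempFileNo = tempFileNo - 1

# Remove the intermediate result file left over from the alternation, if any
for leftover in ("result0", "result1"):
    if os.path.exists(leftover):
        os.remove(leftover)

print("Script finished.\n")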
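
The hand-written binary search above can also be expressed with the standard library's `bisect` module. The following is a sketch rather than the article's code: `contains_sorted` is a hypothetical helper name, and it assumes the list already holds stripped lines sorted in ascending order.

```python
from bisect import bisect_left


def contains_sorted(value, sorted_lines):
    """Return True if value (stripped) occurs in sorted_lines.

    Assumes sorted_lines contains stripped strings in ascending order.
    """
    key = value.strip()
    # bisect_left returns the leftmost index at which key could be inserted
    pos = bisect_left(sorted_lines, key)
    return pos < len(sorted_lines) and sorted_lines[pos] == key


# Sample data made up for illustration
chunk = sorted(line.strip() for line in ["pear\n", "apple\n", "fig\n"])
print(contains_sorted("fig\n", chunk))
print(contains_sorted("kiwi\n", chunk))
```

Unlike `binarySearch`, this helper returns a boolean, so a caller would test `if not contains_sorted(line, lineList)` instead of comparing the result with 0.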

Source: https://blog.csdn.net/qyshooter/article/details/53508924
