Computing the Difference, Intersection, and Union of Two DataFrames with Pandas

Author: 尤而小屋; published: 2023-10-21 06:14:51

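Hi everyone, I'm Peter~

This article shows how to use Pandas functions to compute the difference, intersection, and union of two DataFrames, treating each row as one element of a set.

Sample data

Create a small sample dataset: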

In [1]:

import pandas as pd

In [2]:

df1 = pd.DataFrame({"col1":[1,2,3,4,5],
                    "col2":[6,7,8,9,10]
                   })

df2 = pd.DataFrame({"col1":[1,3,7],
                    "col2":[6,8,10]
                   })

In [3]:

df1

Out[3]:

	col1	col2
0	1	6
1	2	7
2	3	8
3	4	9
4	5	10

In [4]:

df2

Out[4]:

	col1	col2
0	1	6
1	3	8
2	7	10

The two DataFrames have two rows in common: (1, 6) and (3, 8).
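As a cross-check for the results in the following sections, the rows can also be compared as plain Python sets of tuples. This is only a minimal sketch for small data, not one of the article's methods; the variable names `rows1`, `rows2`, `common`, `only_one` and `either` are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3, 4, 5], "col2": [6, 7, 8, 9, 10]})
df2 = pd.DataFrame({"col1": [1, 3, 7], "col2": [6, 8, 10]})

# Turn every row into a plain tuple so rows can be used as set elements.
rows1 = set(df1.itertuples(index=False, name=None))
rows2 = set(df2.itertuples(index=False, name=None))

common = rows1 & rows2    # intersection: rows present in both frames
only_one = rows1 ^ rows2  # symmetric difference: rows present in exactly one frame
either = rows1 | rows2    # union: rows present in at least one frame

print(sorted(common))
print(sorted(only_one))
print(sorted(either))
```

Sets discard row order and the index, so this is useful for verification rather than as a replacement for the DataFrame-based methods.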

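Difference

The methods in this section return the symmetric difference: the rows that appear in exactly one of the two DataFrames.

Method 1: concat + drop_duplicates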

In [5]:

df3 = pd.concat([df1,df2])
df3

Out[5]:

	col1	col2
0	1	6
1	2	7
2	3	8
3	4	9
4	5	10
0	1	6
1	3	8
2	7	10

In [6]:

# Result 1: keep=False drops every copy of a duplicated row

df3.drop_duplicates(["col1","col2"],keep=False)

Out[6]:

	col1	col2
1	2	7
3	4	9
4	5	10
2	7	10

Method 2: append + drop_duplicates (pandas < 2.0)

In [7]:

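# DataFrame.append was deprecated in pandas 1.4 and removed in 2.0;
# on newer versions use pd.concat([df1, df2]) instead.
df4 = df1.append(df2)
df4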

Out[7]:

	col1	col2
0	1	6
1	2	7
2	3	8
3	4	9
4	5	10
0	1	6
1	3	8
2	7	10

In [8]:

# Result 2: same as Result 1

df4.drop_duplicates(["col1","col2"],keep=False)

Out[8]:

	col1	col2
1	2	7
3	4	9
4	5	10
2	7	10
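The results above include the row (7, 10), which comes from df2. If only the rows of df1 that are absent from df2 are wanted (a one-sided difference), one possible approach is a left merge with `indicator=True`. This is a sketch rather than one of the article's methods; the names `merged` and `df1_minus_df2` are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3, 4, 5], "col2": [6, 7, 8, 9, 10]})
df2 = pd.DataFrame({"col1": [1, 3, 7], "col2": [6, 8, 10]})

# indicator=True adds a "_merge" column recording where each row was found:
# "left_only", "right_only" or "both".
merged = df1.merge(df2, on=["col1", "col2"], how="left", indicator=True)

# Keep the rows found only in df1 and drop the helper column.
df1_minus_df2 = merged.loc[merged["_merge"] == "left_only", ["col1", "col2"]]
print(df1_minus_df2)
```

Swapping the roles of df1 and df2 gives the other one-sided difference.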

Intersection

Method 1: merge

In [9]:

# Result

# Equivalent: df5 = pd.merge(df1, df2, how="inner")
df5 = pd.merge(df1,df2)

df5

Out[9]:

	col1	col2
0	1	6
1	3	8
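When `on` is omitted, `pd.merge` joins on every column name the two DataFrames share, which here means both `col1` and `col2`. The sketch below spells the keys out and, as an added assumption that set semantics are wanted, removes repeated rows from each side first so the merge cannot multiply them. The names `keys` and `intersection` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3, 4, 5], "col2": [6, 7, 8, 9, 10]})
df2 = pd.DataFrame({"col1": [1, 3, 7], "col2": [6, 8, 10]})

keys = ["col1", "col2"]

# Deduplicate each side, then keep only the key combinations present in both.
intersection = pd.merge(
    df1.drop_duplicates(subset=keys),
    df2.drop_duplicates(subset=keys),
    on=keys,
    how="inner",
)
print(intersection)
```

For the sample data, which has no repeated rows, this gives the same rows as the call above.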

Method 2: concat + duplicated + loc

In [10]:

df6 = pd.concat([df1,df2])
df6

Out[10]:

	col1	col2
0	1	6
1	2	7
2	3	8
3	4	9
4	5	10
0	1	6
1	3	8
2	7	10

In [11]:

s = df6.duplicated(subset=['col1','col2'], keep='first')
s

Out[11]:

0    False
1    False
2    False
3    False
4    False
0     True
1     True
2    False
dtype: bool

In [12]:

# Result: s is already a boolean mask, so it can be passed to loc directly
df8 = df6.loc[s]
df8

Out[12]:

	col1	col2
0	1	6
1	3	8
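One caveat of this method: `duplicated` cannot tell whether a repeated row comes from the other DataFrame or from the same one. The sketch below uses two hypothetical frames, `df_a` and `df_b`, that share no rows, to show the effect and a possible guard (deduplicating each frame before concatenating).

```python
import pandas as pd

keys = ["col1", "col2"]

# Hypothetical data: (1, 6) is repeated inside df_a, and df_b shares no row with df_a.
df_a = pd.DataFrame({"col1": [1, 1, 2], "col2": [6, 6, 7]})
df_b = pd.DataFrame({"col1": [3], "col2": [8]})

# Naive version: the repeat inside df_a is reported as if it were a common row.
combined = pd.concat([df_a, df_b])
naive = combined.loc[combined.duplicated(subset=keys, keep="first")]

# Guarded version: after per-frame deduplication, a duplicate can only
# arise when a row is present in both frames.
combined_safe = pd.concat([df_a.drop_duplicates(subset=keys),
                           df_b.drop_duplicates(subset=keys)])
safe = combined_safe.loc[combined_safe.duplicated(subset=keys, keep="first")]

print(naive)
print(safe)
```

The sample df1 and df2 contain no repeated rows, so the result shown above is not affected.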

Method 3: concat + groupby + query

In [13]:

# df6 = pd.concat([df1,df2])

df6

Out[13]:

	col1	col2
0	1	6
1	2	7
2	3	8
3	4	9
4	5	10
0	1	6
1	3	8
2	7	10

In [14]:

df9 = df6.groupby(["col1", "col2"]).size().reset_index()
df9.columns = ["col1", "col2", "count"]

df9

Out[14]:

	col1	col2	count
0	1	6	2
1	2	7	1
2	3	8	2
3	4	9	1
4	5	10	1
5	7	10	1

In [15]:

df10 = df9.query("count > 1")[["col1", "col2"]]
df10

Out[15]:

	col1	col2
0	1	6
2	3	8
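The counting and column-renaming steps can also be combined, since `Series.reset_index` accepts a `name` argument for the values column. A compact sketch of the same idea follows; the names `counts` and `intersection` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3, 4, 5], "col2": [6, 7, 8, 9, 10]})
df2 = pd.DataFrame({"col1": [1, 3, 7], "col2": [6, 8, 10]})

keys = ["col1", "col2"]

# Count how often each (col1, col2) combination occurs in the stacked frames
# and name the count column in the same step.
counts = (
    pd.concat([df1, df2])
    .groupby(keys)
    .size()
    .reset_index(name="count")
)

# Combinations seen more than once occur in both frames
# (assuming neither frame repeats a row on its own).
intersection = counts.query("count > 1")[keys]
print(intersection)
```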

Union

Method 1: concat + drop_duplicates

In [16]:

df11 = pd.concat([df1,df2])
df11

Out[16]:

	col1	col2
0	1	6
1	2	7
2	3	8
3	4	9
4	5	10
0	1	6
1	3	8
2	7	10

In [17]:

# Result

# df12 = df11.drop_duplicates(subset=["col1","col2"],keep="last")
df12 = df11.drop_duplicates(subset=["col1","col2"],keep="first")
df12

Out[17]:

	col1	col2
0	1	6
1	2	7
2	3	8
3	4	9
4	5	10
2	7	10
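The result above keeps the original index labels, so label 2 appears twice (once from each DataFrame). If a clean 0..n-1 index is preferred, a small variation is to reset the index after dropping duplicates. This is a sketch; the name `union` is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3, 4, 5], "col2": [6, 7, 8, 9, 10]})
df2 = pd.DataFrame({"col1": [1, 3, 7], "col2": [6, 8, 10]})

union = (
    pd.concat([df1, df2])
    .drop_duplicates(subset=["col1", "col2"], keep="first")
    .reset_index(drop=True)  # discard the old labels and renumber from 0
)
print(union)
```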

Method 2: append + drop_duplicates (pandas < 2.0)

In [18]:

# DataFrame.append was removed in pandas 2.0; on newer versions use pd.concat([df1, df2]).
df13 = df1.append(df2)

# df13.drop_duplicates(subset=["col1","col2"],keep="last")
df13.drop_duplicates(subset=["col1","col2"],keep="first")

Out[18]:

	col1	col2
0	1	6
1	2	7
2	3	8
3	4	9
4	5	10
2	7	10

Method 3: merge

In [19]:

pd.merge(df1,df2,how="outer")

Out[19]:

	col1	col2
0	1	6
1	2	7
2	3	8
3	4	9
4	5	10
5	7	10
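An outer merge with `indicator=True` also records where each row came from, so the three set operations can be read off one merged frame. The following is a sketch under the assumption that neither DataFrame repeats a row; the names `merged`, `union`, `intersection` and `sym_diff` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3, 4, 5], "col2": [6, 7, 8, 9, 10]})
df2 = pd.DataFrame({"col1": [1, 3, 7], "col2": [6, 8, 10]})

keys = ["col1", "col2"]

# "_merge" is "both", "left_only" or "right_only" for every row of the union.
merged = pd.merge(df1, df2, on=keys, how="outer", indicator=True)

union = merged[keys]                                    # every row
intersection = merged.loc[merged["_merge"] == "both", keys]
sym_diff = merged.loc[merged["_merge"] != "both", keys]  # rows in exactly one frame

print(merged)
```

Filtering on `"left_only"` or `"right_only"` instead gives the two one-sided differences.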

Source: https://mp.weixin.qq.com/s/kmuVEdt13c8qRFA6w5lYFw

Tags: Pandas, DataFrame, difference, intersection, union
