A Brief Look at DataFrame Selection Methods in pandas ([], loc, iloc, at, iat, ix)

Author: ForeseeMark  Date: 2023-03-09 19:28:59

pandas offers several ways to slice a DataFrame, and they are easy to confuse if you don't know them well. The examples below walk through each one.

The data

First, generate some random data:

In [5]: import random
...: import datetime as dt
...: import pandas as pd
...:
...: rnd_1 = [random.randrange(1, 20) for _ in range(1000)]
...: rnd_2 = [random.randrange(1, 20) for _ in range(1000)]
...: rnd_3 = [random.randrange(1, 20) for _ in range(1000)]
...: fecha = pd.date_range('2012-4-10', '2015-1-4')  # 1000 consecutive days
...:
...: data = pd.DataFrame({'fecha': fecha, 'rnd_1': rnd_1, 'rnd_2': rnd_2, 'rnd_3': rnd_3})
In [6]: data.describe()
Out[6]:
rnd_1 rnd_2 rnd_3
count 1000.000000 1000.000000 1000.000000
mean 9.946000 9.825000 9.894000
std 5.553911 5.559432 5.423484
min 1.000000 1.000000 1.000000
25% 5.000000 5.000000 5.000000
50% 10.000000 10.000000 10.000000
75% 15.000000 15.000000 14.000000
max 19.000000 19.000000 19.000000

Slicing with []

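Square brackets can slice a DataFrame, somewhat like slicing a Python list. They support row selection (with a slice), column selection (with a list of column names), and block selection (by chaining the two).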

# Row selection
In [7]: data[1:5]
Out[7]:
fecha rnd_1 rnd_2 rnd_3
1 2012-04-11 1 16 3
2 2012-04-12 7 6 1
3 2012-04-13 2 16 7
4 2012-04-14 4 17 7
# Column selection
In [10]: data[['rnd_1', 'rnd_3']]
Out[10]:
rnd_1 rnd_3
0 8 12
1 1 3
2 7 1
3 2 7
4 4 7
5 12 8
6 2 12
7 9 8
8 13 17
9 4 7
10 14 14
11 19 16
12 2 12
13 15 18
14 13 18
15 13 11
16 17 7
17 14 10
18 9 6
19 11 15
20 16 13
21 18 9
22 1 18
23 4 3
24 6 11
25 2 13
26 7 17
27 11 8
28 3 12
29 4 2
.. ... ...
970 8 14
971 19 5
972 13 2
973 8 10
974 8 17
975 6 16
976 3 2
977 12 6
978 12 10
979 15 13
980 8 4
981 17 3
982 1 17
983 11 5
984 7 7
985 13 14
986 6 19
987 13 9
988 3 15
989 19 6
990 7 11
991 11 7
992 19 12
993 2 15
994 10 4
995 14 13
996 12 11
997 11 15
998 17 14
999 3 8
[1000 rows x 2 columns]
# Block selection
In [11]: data[:7][['rnd_1', 'rnd_2']]
Out[11]:
rnd_1 rnd_2
0 8 17
1 1 16
2 7 6
3 2 16
4 4 17
5 12 19
6 2 7

When selecting multiple columns, however, you cannot use a slice such as 1:5 inside the list the way you can for rows.

In [12]: data[['rnd_1':'rnd_3']]
File "<ipython-input-13-6291b6a83eb0>", line 1
data[['rnd_1':'rnd_3']]
^
SyntaxError: invalid syntax
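
A range of columns can still be selected by label through loc, which accepts slices of column labels. The following is a minimal sketch on a small stand-in frame; `df` and its values are illustrative assumptions, not the data generated above.

```python
import pandas as pd

# Small stand-in frame; the values are made up for illustration.
df = pd.DataFrame({
    "fecha": pd.date_range("2012-04-10", periods=4),
    "rnd_1": [8, 1, 7, 2],
    "rnd_2": [17, 16, 6, 16],
    "rnd_3": [12, 3, 1, 7],
})

# Label slices on the column axis work with loc; both ends are included.
cols_by_label = df.loc[:, "rnd_1":"rnd_3"]

# The positional equivalent with iloc excludes the end position.
cols_by_position = df.iloc[:, 1:4]

print(cols_by_label.columns.tolist())
```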

loc

loc lets you select rows and columns by index label.

In [13]: data.loc[1:5]
Out[13]:
fecha rnd_1 rnd_2 rnd_3
1 2012-04-11 1 16 3
2 2012-04-12 7 6 1
3 2012-04-13 2 16 7
4 2012-04-14 4 17 7
5 2012-04-15 12 19 8

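Note the difference from the first method: loc slices by label and includes both endpoints, so row 5 is selected as well, whereas [] slices by position and stops at row 4.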
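
The endpoint difference can be checked directly. This is a small sketch on an illustrative frame with the default integer index; the frame and its values are assumptions made for the example.

```python
import pandas as pd

# Illustrative frame with the default integer index 0..6.
df = pd.DataFrame({"rnd_1": [8, 1, 7, 2, 4, 12, 2]})

by_position = df[1:5]    # positions 1 to 4; the end is excluded
by_label = df.loc[1:5]   # labels 1 to 5; the end is included

print(len(by_position), len(by_label))
```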

In [14]: data.loc[2:4, ['rnd_2', 'fecha']]
Out[14]:
rnd_2 fecha
2 6 2012-04-12
3 16 2012-04-13
4 17 2012-04-14

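loc can also select the data between two specific dates. Both dates in this example are present in the index; that is required on an unsorted index, while a sorted DatetimeIndex also accepts slice bounds that are not present.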

In [15]: data_fecha = data.set_index('fecha')
...: data_fecha.head()
Out[15]:
rnd_1 rnd_2 rnd_3
fecha
2012-04-10 8 17 12
2012-04-11 1 16 3
2012-04-12 7 6 1
2012-04-13 2 16 7
2012-04-14 4 17 7
In [16]: # Create two specific dates
...: fecha_1 = dt.datetime(2013, 4, 14)
...: fecha_2 = dt.datetime(2013, 4, 18)
...:
...: # Slice the data
...: data_fecha.loc[fecha_1: fecha_2]
Out[16]:
rnd_1 rnd_2 rnd_3
fecha
2013-04-14 17 10 5
2013-04-15 14 4 9
2013-04-16 1 2 18
2013-04-17 9 15 1
2013-04-18 16 7 17

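Update: unless you have a specific need, it is strongly recommended to use loc rather than []. When assigning new values into a DataFrame, loc avoids the chained indexing problem, whereas chained [] calls are likely to make pandas emit a SettingWithCopyWarning.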

See the official documentation for details: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
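
To make the recommendation concrete, here is a minimal sketch of an assignment written with a single loc call. The frame, the condition, and the replacement value are illustrative assumptions.

```python
import pandas as pd

# Illustrative frame; values are made up for the example.
df = pd.DataFrame({"rnd_1": [8, 1, 7, 2], "rnd_2": [17, 16, 6, 16]})

# Chained indexing such as df[df["rnd_1"] > 5]["rnd_2"] = 0 assigns into a
# temporary object and may leave df unchanged (pandas may warn about it).
# A single loc call selects rows and column in one step and writes to df.
df.loc[df["rnd_1"] > 5, "rnd_2"] = 0

print(df["rnd_2"].tolist())
```

Rows where rnd_1 exceeds 5 get rnd_2 set to 0; the other rows keep their values.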

iloc

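If loc selects by the values of the index, iloc selects by position in the index. iloc does not care what the index values are, only where they sit, so its brackets take integer positions rather than labels.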

# Row selection
In [17]: data_fecha.iloc[10:15]
Out[17]:
rnd_1 rnd_2 rnd_3
fecha
2012-04-20 14 6 14
2012-04-21 19 14 16
2012-04-22 2 6 12
2012-04-23 15 8 18
2012-04-24 13 8 18
# Column selection
In [18]: data_fecha.iloc[:,[1,2]].head()
Out[18]:
rnd_2 rnd_3
fecha
2012-04-10 17 12
2012-04-11 16 3
2012-04-12 6 1
2012-04-13 16 7
2012-04-14 17 7
# Selecting specific rows and columns
In [19]: data_fecha.iloc[[1,12,34],[0,2]]
Out[19]:
rnd_1 rnd_3
fecha
2012-04-11 1 3
2012-04-22 2 12
2012-05-14 17 10
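
The label versus position distinction is easiest to see when the index labels are not 0, 1, 2. A short sketch follows, using an illustrative frame whose labels 10, 20, 30 are assumptions chosen for the example.

```python
import pandas as pd

# Index labels deliberately differ from positions (illustrative values).
df = pd.DataFrame({"rnd_1": [8, 1, 7]}, index=[10, 20, 30])

first_by_position = df.iloc[0, 0]      # first row, first column
first_by_label = df.loc[10, "rnd_1"]   # row labelled 10, column rnd_1

print(first_by_position, first_by_label)
```

Here position 0 and label 10 refer to the same row, while label 0 does not exist in the index at all.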

at

at works much like loc but accesses data faster; it can only access a single element, not several.

In [20]: timeit data_fecha.at[fecha_1,'rnd_1']
The slowest run took 3783.11 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 11.3 µs per loop
In [21]: timeit data_fecha.loc[fecha_1,'rnd_1']
The slowest run took 121.24 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 192 µs per loop
In [22]: data_fecha.at[fecha_1,'rnd_1']
Out[22]: 17
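
As a rough sketch of scalar access with at, the example below reads one cell by label, compares it with the same lookup through loc, and then overwrites that cell. The frame, the key, and the value 99 are illustrative assumptions.

```python
import datetime as dt
import pandas as pd

# Illustrative frame with a daily DatetimeIndex.
df = pd.DataFrame(
    {"rnd_1": [8, 1, 7]},
    index=pd.date_range("2012-04-10", periods=3),
)
key = dt.datetime(2012, 4, 11)

value_at = df.at[key, "rnd_1"]    # scalar lookup by label
value_loc = df.loc[key, "rnd_1"]  # same cell via loc

df.at[key, "rnd_1"] = 99          # at can also set a single cell
print(value_at, value_loc, df.at[key, "rnd_1"])
```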

iat

iat is to iloc what at is to loc: a faster, position-based accessor that, like at, can only access a single element.

In [23]: data_fecha.iat[1,0]
Out[23]: 1
In [24]: timeit data_fecha.iat[1,0]
The slowest run took 6.23 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.77 µs per loop
In [25]: timeit data_fecha.iloc[1,0]
10000 loops, best of 3: 158 µs per loop
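
The timeit command above is an IPython magic. Outside IPython, a comparable measurement can be sketched with the standard library's timeit module; the frame and the loop count of 1000 are assumptions, and absolute timings depend on the machine and the pandas version.

```python
import timeit
import pandas as pd

# Illustrative frame; values are made up for the example.
df = pd.DataFrame({"rnd_1": [8, 1, 7], "rnd_2": [17, 16, 6]})

# Both accessors return the same cell.
same_value = df.iat[1, 0] == df.iloc[1, 0]

# Time 1000 scalar lookups with each accessor; no timings are quoted here
# because they vary between environments.
t_iat = timeit.timeit(lambda: df.iat[1, 0], number=1000)
t_iloc = timeit.timeit(lambda: df.iloc[1, 0], number=1000)

print(same_value, t_iat, t_iloc)
```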

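ix

The methods above all require a looked-up key to be present in the index, or a position to be within bounds. ix also accepts slice bounds that are not in the DataFrame's index. Note that ix was deprecated in pandas 0.20.0 and removed in 1.0.0; on a sorted index, loc slicing accepts such bounds too.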

In [28]: date_1 = dt.datetime(2013, 1, 10, 8, 30)
...: date_2 = dt.datetime(2013, 1, 13, 4, 20)
...:
...: # Slice the data
...: data_fecha.ix[date_1: date_2]
Out[28]:
rnd_1 rnd_2 rnd_3
fecha
2013-01-11 19 17 19
2013-01-12 10 9 17
2013-01-13 15 3 10

As the example shows, January 10, 2013 is not selected: that index entry is treated as 00:00, which is earlier than 08:30.
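
On pandas versions where ix is not available, the same kind of slice can be written with loc, provided the DatetimeIndex is sorted. The sketch below uses a short illustrative frame (an assumption, not the article's data) with the same two bounds.

```python
import datetime as dt
import pandas as pd

# Daily index sorted in ascending order (values are illustrative).
df = pd.DataFrame(
    {"rnd_1": range(6)},
    index=pd.date_range("2013-01-09", periods=6),
)

date_1 = dt.datetime(2013, 1, 10, 8, 30)
date_2 = dt.datetime(2013, 1, 13, 4, 20)

# Neither bound is an index entry; loc keeps the rows that fall between them.
subset = df.loc[date_1:date_2]

print(subset.index.strftime("%Y-%m-%d").tolist())
```

As with the ix example, January 10 is excluded because its midnight timestamp precedes the 08:30 lower bound, while January 13 at midnight falls before the 04:20 upper bound and is kept.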

Source: https://blog.csdn.net/wr339988/article/details/65446138
