A Detailed Guide to Python Web Scraping: Using Beautiful Soup

Author: 小狐狸梦想去童话镇 | Time: 2021-02-21 15:20:48

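I. Introduction to Beautiful Soup

Beautiful Soup is a powerful parsing tool that uses a web page's structure and attributes to parse it.

It provides functions for navigating, searching, and modifying the parse tree. You usually do not need to worry about the document's encoding, because Beautiful Soup converts incoming documents to Unicode and outgoing documents to UTF-8. Beautiful Soup relies on an underlying parser to do the actual parsing; a commonly used one is lxml.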
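As a minimal sketch of the parser and encoding points above, the snippet below parses a small UTF-8 byte string. It assumes the parser bundled with Python ("html.parser") instead of lxml so that it runs without extra installs, and the one-line document is a made-up sample rather than the article's test file.

```python
from bs4 import BeautifulSoup

# A tiny UTF-8 encoded document passed in as bytes (hypothetical sample).
raw = "<html><head><title>百度一下</title></head></html>".encode("utf-8")

# "html.parser" ships with Python; "lxml" is a separate install.
# from_encoding states the byte encoding explicitly instead of letting
# Beautiful Soup guess it.
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")

title_text = soup.title.string
print(type(title_text).__name__)  # the text comes back as a Unicode string type
print(title_text)
```

Swapping "html.parser" for "lxml" is the only change needed to use the parser this article relies on, provided lxml is installed.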

II. Using Beautiful Soup

Sample file test03.html:

<!DOCTYPE html>
<html>
<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type" />
<meta content="IE=Edge" http-equiv="X-UA-Compatible" />
<meta content="always" name="referrer" />
<link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css" />
<title>百度一下，你就知道 </title>
</head>
<body link="#0000cc">
<div id="wrapper">
<div id="head">
<div class="head_wrapper">
<div id="u1">
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品 </a>
</div>
</div>
</div>
</div>
</body>
</html>

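1. Node selectors

As we saw earlier, a web page is made up of element nodes, and extracting the content of a particular node gives you the data shown on the page. Node selectors simplify this: you can get at the data precisely without writing regular expressions.

from bs4 import BeautifulSoup

# Read the sample file as bytes and parse it with the lxml parser.
with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
print(soup.head)
print(soup.head.title)
print(soup.a)

[Output]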

<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="always" name="referrer"/>
<link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
<title>百度一下，你就知道 </title>
</head>
<title>百度一下，你就知道 </title>
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>

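Analysis:

The first print statement outputs the page's head node.

The second outputs the title node inside head. This is a nested selection, because title is nested inside head.

The third outputs an a node. The source contains several a nodes, but only the first one is returned: when multiple nodes match, this kind of selection returns just the first match and ignores the rest.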
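The first-match behaviour described above can be checked with a short sketch. It assumes a trimmed-down copy of the sample markup and the built-in "html.parser" so it runs without lxml; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for test03.html (hypothetical sample).
html = """
<div id="u1">
<a class="mnav" href="http://news.baidu.com">新闻</a>
<a class="mnav" href="http://map.baidu.com">地图</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a             # attribute access returns the first <a> only
all_links = soup.find_all("a")  # every <a>, in document order
missing = soup.table            # there is no <table>, so this is None

print(first_link["href"])
print(len(all_links))
print(missing)
```

Because a missing node comes back as None rather than raising an error, chained access such as soup.table.tr fails with an AttributeError, so it is worth checking for None first.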

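2. Extracting information

The data we want usually lives in a node's name, its attribute values, or its text. The following code shows how to read all three:

from bs4 import BeautifulSoup

# Read the sample file as bytes and parse it with the lxml parser.
with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
print(soup.body.name)
print(soup.body.a.attrs['class'])
print(soup.body.a.attrs['href'])
print(soup.body.a.string)

[Output]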

body
['mnav']
http://news.baidu.com
新闻

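Analysis:

The first line prints the name of the body node.

The second prints the class attribute of the first a node. It is a list because class can hold multiple values.

The third prints the href attribute of that a node.

The fourth prints the text of that a node.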
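A small sketch of the same three lookups, plus a safe way to read an attribute that may be absent. It assumes "html.parser" and a one-tag sample; the second class name "nav-item" is invented here to show the list behaviour.

```python
from bs4 import BeautifulSoup

# One-tag sample; the class "nav-item" is hypothetical.
html = '<a class="mnav nav-item" href="http://news.baidu.com" name="tj_trnews">新闻</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

tag_name = link.name         # node name
classes = link["class"]      # multi-valued attribute: a list of strings
href = link["href"]          # single-valued attribute: a plain string
target = link.get("target")  # absent attribute: None instead of a KeyError
text = link.string           # the tag's single text node

print(tag_name, classes, href, target, text)
```

link["href"] is shorthand for link.attrs["href"]; both raise KeyError for a missing attribute, which is why get() is the safer choice on real pages.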

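3. Related-node selection

(1) Children and descendants

A node's direct children are available through the contents attribute, which returns a list, and the children attribute, which returns an iterator. All of its descendants are available through the descendants attribute, which returns a generator. Any of the three can be traversed with a for loop.

from bs4 import BeautifulSoup

# Read the sample file as bytes and parse it with the lxml parser.
with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
# print(soup.body.contents)
# Print each direct child of <body> together with its index.
for i, content in enumerate(soup.body.contents):
    print(i, content)

[Output]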

0
1 <div id="wrapper">
<div id="head">
<div class="head_wrapper">
<div id="u1">
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品 </a>
</div>
</div>
</div>
</div>
2
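To make the difference between the three attributes concrete, here is a minimal sketch on a two-link fragment. It assumes "html.parser" and markup written on one line so that no whitespace text nodes appear between the tags.

```python
from bs4 import BeautifulSoup

# Single-line fragment, so there are no newline text nodes between tags.
html = '<div id="u1"><a href="http://news.baidu.com">新闻</a><a href="http://map.baidu.com">地图</a></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.div

direct = div.contents                           # a list of the direct children
child_names = [c.name for c in div.children]    # iterator over the same nodes
descendant_count = len(list(div.descendants))   # also visits the text inside each <a>

print(type(direct).__name__, len(direct))
print(child_names)
print(descendant_count)
```

In the article's output the entries at index 0 and 2 look empty because they are the line breaks around the wrapper div, which Beautiful Soup keeps as text nodes.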

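(2) Parent and ancestors

To get a node's parent, use the parent attribute. For example, to get the parent of the title node in the sample:

from bs4 import BeautifulSoup

# Read the sample file as bytes and parse it with the lxml parser.
with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
print(soup.title.parent)

[Output]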

<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="always" name="referrer"/>
<link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
<title>百度一下，你就知道 </title>
</head>

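Similarly, to get a node's ancestors, use the parents attribute, which yields each ancestor in turn, from the direct parent up to the document root.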
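A short sketch of parent versus parents, assuming "html.parser" and a small hand-written fragment. The last ancestor is the BeautifulSoup object itself, whose name is "[document]".

```python
from bs4 import BeautifulSoup

html = '<html><body><div id="u1"><a href="http://news.baidu.com">新闻</a></div></body></html>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

parent_name = link.parent.name                   # the direct parent only
ancestor_names = [p.name for p in link.parents]  # every ancestor, innermost first

print(parent_name)
print(ancestor_names)
```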

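(3) Siblings

next_sibling returns the node's next sibling. Note that a sibling may be a text node, such as the line break between two tags, rather than an element.

previous_sibling returns the node's previous sibling.

next_siblings returns a generator over all the siblings that follow the node.

previous_siblings returns a generator over all the siblings that precede the node.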
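The sibling attributes can be sketched as follows. The fragment and its id values are made up for illustration, and "html.parser" is assumed; find_next_sibling() is shown as one way to skip over whitespace text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# The ids n1..n3 are hypothetical; the newlines between tags are deliberate.
html = '<div><a id="n1">新闻</a>\n<a id="n2">地图</a>\n<a id="n3">视频</a></div>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

immediate = first.next_sibling  # the newline text node, not the next <a>
later_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]
next_tag = first.find_next_sibling("a")  # skips text nodes, returns the next <a>

print(repr(immediate))
print(later_ids)
print(next_tag["id"])
```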

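4. Method selectors

find_all()

Finds all elements that match the given conditions. Its signature is:

find_all(name, attrs, recursive, string, limit, **kwargs)

Older versions of Beautiful Soup call the string argument text.

(1) name

Queries elements by node name. For example, to find the a elements in the sample:

from bs4 import BeautifulSoup

# Read the sample file as bytes and parse it with the lxml parser.
with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
print(soup.find_all(name="a"))
# find_all() returns a list, so the matches can be iterated one by one.
for a in soup.find_all(name="a"):
    print(a)

[Output]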

[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>, <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>, <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图 </a>, <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频 </a>, <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧 </a>, <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品 </a>]
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品 </a>
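Beyond a single tag name, the name argument also accepts a list of names, and the limit argument caps the number of results. The sketch below illustrates both on a small made-up fragment, assuming "html.parser".

```python
from bs4 import BeautifulSoup

html = """<title>百度一下</title>
<a href="http://news.baidu.com">新闻</a>
<a href="http://map.baidu.com">地图</a>
<a href="http://v.baidu.com">视频</a>"""
soup = BeautifulSoup(html, "html.parser")

either = soup.find_all(["title", "a"])   # any tag whose name is in the list
first_two = soup.find_all("a", limit=2)  # stop after two matches

print([tag.name for tag in either])
print([a["href"] for a in first_two])
```

Using limit can save time on large pages when only the first few matches are needed.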

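(2) attrs

You can also filter by tag attributes. The attrs argument is a dictionary.

from bs4 import BeautifulSoup

# Read the sample file as bytes and parse it with the lxml parser.
with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
print(soup.find_all(name="a", attrs={"class": "bri"}))

[Output]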

[<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品 </a>]

As you can see, once the class="bri" condition is added, the query returns only one a element.
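Attribute filters can also be written as keyword arguments instead of an attrs dictionary. The sketch below compares the two forms on a small sample, assuming "html.parser"; class_ carries a trailing underscore because class is a reserved word in Python.

```python
from bs4 import BeautifulSoup

html = """<div id="u1">
<a class="mnav" href="http://news.baidu.com">新闻</a>
<a class="bri" href="//www.baidu.com/more/">更多产品</a>
</div>"""
soup = BeautifulSoup(html, "html.parser")

by_dict = soup.find_all("a", attrs={"class": "bri"})  # dictionary form
by_keyword = soup.find_all("a", class_="bri")         # keyword form, same filter
by_id = soup.find_all(id="u1")                        # any tag with this id

print([a["href"] for a in by_dict])
print([a["href"] for a in by_keyword])
print([tag.name for tag in by_id])
```

The dictionary form remains useful for attribute names that are not valid Python identifiers, such as data-* attributes.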

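(3) string (formerly text)

The string argument (called text in older versions of Beautiful Soup) matches against a node's text. It accepts either a string or a compiled regular expression.

import re
from bs4 import BeautifulSoup

# Read the sample file as bytes and parse it with the lxml parser.
with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
print(soup.find_all(name="a", string=re.compile('新闻')))

[Output]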

[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>]

The result contains only the a tags whose text matches the regular expression, that is, whose text contains "新闻".
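The difference between passing a plain string and passing a regular expression can be sketched like this. The second link and its example.com URL are invented for the comparison, "html.parser" is assumed, and the string keyword is used rather than the older text name.

```python
import re
from bs4 import BeautifulSoup

# The second link is a hypothetical entry added to show partial matching.
html = '<a href="http://news.baidu.com">新闻</a><a href="http://example.com/live">新闻联播</a>'
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all("a", string="新闻")                # text must equal the string
partial = soup.find_all("a", string=re.compile("新闻"))  # text must contain a match

print(len(exact))
print(len(partial))
```

A plain string is compared against the whole text, so stray whitespace in the page (such as the trailing space in the article's sample links) prevents a match, whereas a regular expression still finds it.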

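find()

find() is used the same way as find_all(). The only difference is that find() returns just the first matching element as a single element (or None if nothing matches), whereas find_all() returns a list of all matching elements.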
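The no-match case is where the two methods differ most in practice. A minimal sketch, assuming "html.parser" and a one-link sample:

```python
from bs4 import BeautifulSoup

html = '<a class="mnav" href="http://news.baidu.com">新闻</a>'
soup = BeautifulSoup(html, "html.parser")

hit = soup.find("a", class_="mnav")          # a single Tag
miss = soup.find("a", class_="bri")          # no match: None
miss_all = soup.find_all("a", class_="bri")  # no match: an empty list

# Guard against None before reading attributes from a find() result.
miss_href = miss["href"] if miss is not None else None

print(hit["href"], miss, miss_all, miss_href)
```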

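5. CSS selectors

To use CSS selectors, call the select() method and pass in a CSS selector.

For example, to get the a tags in the sample with a CSS selector:

from bs4 import BeautifulSoup

# Read the sample file as bytes and parse it with the lxml parser.
with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
print(soup.select('a'))
# select() returns a list, so the matches can be iterated one by one.
for a in soup.select('a'):
    print(a)

[Output]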

[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>, <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>, <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图 </a>, <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频 </a>, <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧 </a>, <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品 </a>]
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品 </a>
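select() accepts more specific selectors than a bare tag name. The sketch below tries an id-plus-class selector, an attribute selector, and select_one() on a trimmed copy of the sample markup, assuming "html.parser".

```python
from bs4 import BeautifulSoup

html = """<div id="u1">
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图</a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon">更多产品</a>
</div>"""
soup = BeautifulSoup(html, "html.parser")

nav_links = soup.select("#u1 a.mnav")        # <a class="mnav"> inside id="u1"
by_attr = soup.select('a[name="tj_trmap"]')  # attribute selector
first_bri = soup.select_one("a.bri")         # first match only, or None

print([a.get_text() for a in nav_links])
print(by_attr[0]["href"])
print(first_bri.get_text())
```

select_one() relates to select() the way find() relates to find_all(): it returns a single element instead of a list.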

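Getting attributes

To get the href attribute of the a tags above:

from bs4 import BeautifulSoup

# Read the sample file as bytes and parse it with the lxml parser.
with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
# Index each tag like a dictionary to read an attribute.
for a in soup.select('a'):
    print(a['href'])

[Output]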

http://news.baidu.com
https://www.hao123.com
http://map.baidu.com
http://v.baidu.com
http://tieba.baidu.com
//www.baidu.com/more/
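The last href in the output is protocol-relative (it starts with //), so it cannot be requested as it stands. One possible way to normalise such links is urljoin from the standard library, sketched below. The base URL is an assumption made for illustration, as is the third link without an href; "html.parser" is assumed as well.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.baidu.com/"  # assumed page URL, for illustration only

html = """<a href="http://news.baidu.com">新闻</a>
<a href="//www.baidu.com/more/">更多产品</a>
<a name="no_href">无链接</a>"""
soup = BeautifulSoup(html, "html.parser")

# a[href] selects only the links that actually carry an href attribute,
# so indexing with a["href"] cannot raise KeyError here.
absolute = [urljoin(BASE_URL, a["href"]) for a in soup.select("a[href]")]

for url in absolute:
    print(url)
```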

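Getting text

To get the text of the a tags above, use the get_text() method or the string attribute. The two agree here because each a tag contains a single text node; string returns None when a tag has more than one child.

from bs4 import BeautifulSoup

# Read the sample file as bytes and parse it with the lxml parser.
with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
# Print the text of each link twice, once per access style.
for a in soup.select('a'):
    print(a.get_text())
    print(a.string)

[Output]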

新闻
新闻
hao123
hao123
地图
地图
视频
视频
贴吧
贴吧
更多产品
更多产品
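The two ways of reading text diverge once a tag has nested children. A minimal sketch, assuming "html.parser"; the ids and the nested b tag are invented to trigger the difference.

```python
from bs4 import BeautifulSoup

# The ids and the nested <b> are hypothetical additions.
html = '<a id="plain">新闻 </a><a id="mixed">更多<b>产品</b></a>'
soup = BeautifulSoup(html, "html.parser")
plain = soup.find(id="plain")
mixed = soup.find(id="mixed")

plain_string = plain.string                 # single text child: returned as-is
plain_clean = plain.get_text(strip=True)    # same text, surrounding whitespace removed
mixed_string = mixed.string                 # two children: None
mixed_text = mixed.get_text()               # all nested text joined together

print(repr(plain_string), repr(plain_clean))
print(mixed_string, mixed_text)
```

For scraping, get_text(strip=True) is usually the more forgiving choice, since it tolerates both nested markup and padding whitespace.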

Source: https://blog.csdn.net/gets_s/article/details/120372061
