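Python Beautiful Soup Library: Getting Started and Installation Tutorial
Author: Cachel wood  Date: 2023-03-04 06:24:27
Contents
Installing the Beautiful Soup library
Understanding the Beautiful Soup library
Importing the Beautiful Soup library
The BeautifulSoup class
Reviewing demo.html
The Tag element
Tag name
Tag attrs (attributes)
Tag NavigableString
Basic HTML format
Downward traversal of the tag tree
Upward traversal of the tag tree
Sibling traversal of tags
The bs4 library's prettify() method
Encoding in the bs4 library
Installing the Beautiful Soup library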
pip install beautifulsoup4
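A quick way to confirm the installation worked is to import the package and print its version. Note that the package is installed as beautifulsoup4 but imported as bs4. This is only a minimal check; the variable name `version` is chosen here for illustration.

```python
import bs4

# The installed distribution is "beautifulsoup4"; the importable module is "bs4".
version = bs4.__version__
print(version)
```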
Understanding the Beautiful Soup library
Beautiful Soup is a library for parsing, traversing, and maintaining the "tag tree"
Importing the Beautiful Soup library
from bs4 import BeautifulSoup
import bs4
The BeautifulSoup class
A BeautifulSoup object corresponds to the entire content of one HTML/XML document
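As a minimal sketch of this idea, the example below builds a BeautifulSoup object from a small HTML string instead of a downloaded page. The HTML string and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# A tiny, hypothetical document used only for this illustration.
html = "<html><head><title>Sample</title></head><body><p>Hello</p></body></html>"

# Parse the whole document with Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

root_name = soup.name            # the document object reports a special name
title_text = soup.title.string   # text inside the <title> tag

print(root_name)
print(title_text)
```

The same object gives access to every tag in the document, which is what the following sections rely on.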
Reviewing demo.html
import requests
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
print(demo)
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>
The Tag element
| Basic element | Description |
|---|---|
| Tag | A tag, the most basic unit of information organization; it opens with <> and closes with </> |
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.title)
tag = soup.a
print(tag)
<title>This is a python demo page</title>
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
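Any tag that exists in HTML syntax can be accessed with soup.<tag>. When the HTML document contains several tags with the same name, soup.<tag> returns the first one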
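Because soup.<tag> only returns the first match, getting every matching tag needs a different call. The sketch below uses bs4's find_all() method, which the original text does not cover; the HTML snippet is a made-up example.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two <a> tags.
html = '<p><a href="/one" id="link1">One</a> and <a href="/two" id="link2">Two</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.a                  # only the first <a> tag
all_links = soup.find_all("a")  # every <a> tag, in document order
ids = [a["id"] for a in all_links]

print(first["id"])
print(ids)
```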
Tag name
| Basic element | Description |
|---|---|
| Name | The tag's name; the name of <p>…</p> is 'p'. Format: <tag>.name |
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.a.name)
print(soup.a.parent.name)
print(soup.a.parent.parent.name)
a
p
body
Tag attrs (attributes)
| Basic element | Description |
|---|---|
| Attributes | The tag's attributes, organized as a dictionary. Format: <tag>.attrs |
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
tag = soup.a
print(tag.attrs)
print(tag.attrs['class'])
print(tag.attrs['href'])
print(type(tag.attrs))
print(type(tag))
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
['py1']
http://www.icourse163.org/course/BIT-268001
<class 'dict'>
<class 'bs4.element.Tag'>
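Besides going through .attrs, a tag can be indexed like a dictionary directly. The sketch below shows that shortcut together with get() and has_attr(), using a made-up link; the URL and the second class value are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical tag; the URL and the "featured" class are invented for this example.
html = '<a href="http://example.com/course" class="py1 featured" id="link1">Basic Python</a>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.a

href = tag["href"]          # same value as tag.attrs["href"]
classes = tag["class"]      # class is multi-valued, so it comes back as a list
missing = tag.get("title")  # get() returns None instead of raising KeyError
has_id = tag.has_attr("id")

print(href, classes, missing, has_id)
```

Using get() is a reasonable default when an attribute may be absent, since plain indexing raises KeyError in that case.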
Tag NavigableString
| Basic element | Description |
|---|---|
| NavigableString | The non-attribute string inside a tag, i.e. the string between <> and </>. Format: <tag>.string |
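The following sketch illustrates .string on two made-up snippets. It also shows one behavior worth knowing: when a tag has more than one child, .string is None, and get_text() (a bs4 method not mentioned in the original text) can be used to collect the text instead.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# A tag whose only content is a single nested string.
soup = BeautifulSoup("<p><b>The demo python page</b></p>", "html.parser")
text = soup.p.string  # reaches through the single <b> child
is_nav = isinstance(text, NavigableString)

# A tag with mixed content: a string plus a nested tag.
mixed = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
mixed_string = mixed.p.string    # None, because <p> has two children
mixed_text = mixed.p.get_text()  # concatenates all text inside <p>

print(text, is_nav)
print(mixed_string, mixed_text)
```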
Tag Comment
| Basic element | Description |
|---|---|
| Comment | The comment part of a string inside a tag; represented by the special Comment type |
import requests
from bs4 import BeautifulSoup
newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>","html.parser")
print(newsoup.b.string)
print(type(newsoup.b.string))
print(newsoup.p.string)
print(type(newsoup.p.string))
This is a comment
<class 'bs4.element.Comment'>
This is not a comment
<class 'bs4.element.NavigableString'>
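Since .string prints a comment without its <!-- --> markers, the printed text alone does not reveal whether it came from a comment. A type check is one way to tell them apart; the sketch below reuses the same two-tag snippet, with variable names chosen for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser"
)

# Comment is a subclass of NavigableString, so check for Comment specifically.
b_is_comment = isinstance(newsoup.b.string, Comment)
p_is_comment = isinstance(newsoup.p.string, Comment)

print(b_is_comment)
print(p_is_comment)
```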
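Basic HTML format
Downward traversal of the tag tree
| Attribute | Description |
|---|---|
| .contents | List of child nodes; stores all of the tag's children in a list |
| .children | Iterator over child nodes; like .contents, used to loop over the children |
| .descendants | Iterator over descendant nodes; covers all descendants, used for loop traversal |
The BeautifulSoup object is the root node of the tag tree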
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.head)
print(soup.head.contents)
print(soup.body.contents)
print(len(soup.body.contents))
print(soup.body.contents[1])
<head><title>This is a python demo page</title></head>
[<title>This is a python demo page</title>]
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
5
<p class="title"><b>The demo python introduces several python courses.</b></p>
for child in soup.body.children:
    print(child)  # iterate over child nodes
for child in soup.body.descendants:
    print(child)  # iterate over descendant nodes
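To make the difference between children and descendants concrete, here is a small offline sketch on a made-up document. Descendants include strings as well as tags, so the example filters by type where only tags are wanted.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical document: two <p> children, one of which contains a nested <b>.
html = "<body><p><b>bold</b> text</p><p>second</p></body>"
soup = BeautifulSoup(html, "html.parser")

# Direct children only.
child_names = [child.name for child in soup.body.children]

# All descendants, keeping tags and skipping strings.
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

# Descendants counted with strings included.
descendant_count = len(list(soup.body.descendants))

print(child_names)
print(descendant_tags)
print(descendant_count)
```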
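Upward traversal of the tag tree
| Attribute | Description |
|---|---|
| .parent | The node's parent tag |
| .parents | Iterator over the node's ancestor tags, used to loop over the ancestors |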
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.title.parent)
print(soup.html.parent)
<head><title>This is a python demo page</title></head>
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
p
body
html
[document]
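The ancestor chain can also be collected into a list, which makes it easy to inspect offline. This sketch uses a made-up document and shows that the BeautifulSoup object itself sits at the top of the chain and has no parent.

```python
from bs4 import BeautifulSoup

# Hypothetical nested document.
html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk upward from <a> to the root of the tree.
chain = [parent.name for parent in soup.a.parents]

# The root object has no parent of its own.
root_parent = soup.parent

print(chain)
print(root_parent)
```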
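Sibling traversal of tags
| Attribute | Description |
|---|---|
| .next_sibling | Returns the next sibling node in HTML text order |
| .previous_sibling | Returns the previous sibling node in HTML text order |
| .next_siblings | Iterator; returns all following sibling nodes in HTML text order |
| .previous_siblings | Iterator; returns all preceding sibling nodes in HTML text order |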
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.a.next_sibling)
print(soup.a.next_sibling.next_sibling)
print(soup.a.previous_sibling)
print(soup.a.previous_sibling.previous_sibling)
print(soup.a.parent)
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
None
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
for sibling in soup.a.next_siblings:
    print(sibling)  # iterate over following siblings
for sibling in soup.a.previous_siblings:
    print(sibling)  # iterate over preceding siblings
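As the output above suggests, siblings are not always tags: plain strings such as " and " are sibling nodes too. The sketch below, on a made-up paragraph, shows one way to keep only tag siblings, either with a type check or with bs4's find_next_sibling() method (not covered in the original text).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical paragraph with two links separated by text.
html = '<p>Intro: <a id="link1">Basic</a> and <a id="link2">Advanced</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# All following siblings, strings included.
later = list(first.next_siblings)

# Only the siblings that are tags.
later_tags = [s for s in first.next_siblings if isinstance(s, Tag)]

# Jump straight to the next <a> sibling, skipping the text in between.
next_link = first.find_next_sibling("a")

print(len(later))
print(later_tags[0]["id"])
print(next_link["id"])
```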
The bs4 library's prettify() method
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
.prettify() adds a '\n' line break after each tag (<>) and its content in the HTML text
.prettify() can also be called on a single tag: <tag>.prettify()
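Calling prettify() on a single tag can be sketched as follows. The one-line HTML input is made up for illustration; the result is a string with each tag and each piece of text on its own line.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line snippet.
html = '<p class="title"><b>Bold text</b></p>'
soup = BeautifulSoup(html, "html.parser")

# Pretty-print only the <p> tag rather than the whole document.
pretty = soup.p.prettify()
lines = pretty.splitlines()

print(pretty)
```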
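Encoding in the bs4 library
The bs4 library converts any HTML input to Unicode and outputs UTF-8 by default
Python 3.x strings are Unicode and UTF-8 is the default source encoding, so parsing non-ASCII text works without trouble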
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>中文</p>","html.parser")
print(soup.p.string)
print(soup.p.prettify())
中文
<p>
 中文
</p>
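The conversion can be seen more directly when the input is bytes in another encoding. The sketch below assumes a GBK-encoded input, chosen only for illustration, and passes that encoding explicitly through the from_encoding argument; the parsed text comes back as an ordinary Python string and can be written out as UTF-8 bytes with encode().

```python
from bs4 import BeautifulSoup

# Assumed input: the same markup as above, but as GBK-encoded bytes.
raw = "<p>中文</p>".encode("gbk")

# Tell the parser which encoding the bytes use.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string                 # decoded to a normal Unicode string
utf8_bytes = soup.p.encode("utf-8")  # serialized back out as UTF-8 bytes

print(text)
print(utf8_bytes)
```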
Source: https://blog.csdn.net/weixin_46530492/article/details/119960182