Building a Web Crawler in Python with beautifulSoup

Author: mdxy-dxy  Date: 2022-05-17 21:10:34

An earlier article covered scraping web pages with phantomjs (https://www.jb51.net/article/55789.htm); that approach worked together with selectors.

With beautifulSoup (docs: http://www.crummy.com/software/BeautifulSoup/bs4/doc/), a Python module, scraping page content is straightforward.

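# Python 3: urlencode and urlopen live in urllib.parse and urllib.request
from urllib.parse import urlencode
from urllib.request import urlopen

from bs4 import BeautifulSoup

url = 'http://www.baidu.com/s'
values = {'wd': '网球'}
# Percent-encode the query parameters and append them to the URL
full_url = url + '?' + urlencode(values)
response = urlopen(full_url)
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(response, 'html.parser')
# Collect every <a> tag on the results page
alinks = soup.find_all('a')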

The code above fetches Baidu's search results for the keyword 网球 (tennis) and collects every link on the results page.
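To make the URL-building step concrete, here is a minimal sketch that uses only the standard library and makes no network request. The helper name build_search_url is hypothetical and is not part of urllib or beautifulSoup.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def build_search_url(base_url, params):
    """Append percent-encoded query parameters to base_url (hypothetical helper)."""
    return base_url + '?' + urlencode(params)


full_url = build_search_url('http://www.baidu.com/s', {'wd': '网球'})
print(full_url)
# http://www.baidu.com/s?wd=%E7%BD%91%E7%90%83

# Decoding the query string recovers the original search term
print(parse_qs(urlsplit(full_url).query))
# {'wd': ['网球']}
```

urlencode encodes non-ASCII text as UTF-8 before percent-encoding it, which is why the two Chinese characters become six escaped bytes.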

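beautifulSoup ships with many useful methods for navigating and modifying the parsed document.

A few of the handier features:

Constructing a node (tag) element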

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

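Attributes are available through .attrs, which returns a dictionary

tag.attrs
# {'class': ['boldest']}

A single attribute can also be read by indexing the tag like a dictionary, for example tag['class']. (Writing tag.class does not work, because class is a reserved word in Python.)

Attributes can be modified freely as well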

tag['class'] = 'verybold'
tag['id'] = 1
tag
# <b class="verybold" id="1">Extremely bold</b>

del tag['class']
del tag['id']
tag
# <b>Extremely bold</b>

tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None
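The difference between indexing and get() is easy to miss, so here is a small self-contained sketch of reading attributes safely. The id value x1 and the fallback string 'no title' are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest" id="x1">Extremely bold</b>', 'html.parser')
tag = soup.b

# class is multi-valued in HTML, so BeautifulSoup returns a list for it
classes = tag.get('class', [])
# Most other attributes come back as plain strings
tag_id = tag.get('id')
# get() with a default avoids the KeyError that tag['title'] would raise
title = tag.get('title', 'no title')
# has_attr() checks for an attribute without reading it
has_title = tag.has_attr('title')

print(classes, tag_id, title, has_title)
# ['boldest'] x1 no title False
```

In a crawler, get() is usually the safer choice, since real pages often omit attributes that you expect to be present.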

You can also navigate and search the DOM freely, as in the following example.

1. Build a document

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

2. Explore the document

soup.head
# <head><title>The Dormouse's story</title></head>
soup.title
# <title>The Dormouse's story</title>
soup.body.b
# <b>The Dormouse's story</b>
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# ["The Dormouse's story"]
len(soup.contents)
# 1 (html.parser counts a newline before <html>, if present, as an extra node)
soup.contents[0].name
# 'html'
text = title_tag.contents[0]
text.contents
# AttributeError: 'NavigableString' object has no attribute 'contents'

for child in title_tag.children:
    print(child)
# The Dormouse's story
head_tag.contents
# [<title>The Dormouse's story</title>]
for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story

len(list(soup.children))
# 1
len(list(soup.descendants))
# 25 (the exact count depends on the parser and on whitespace in the document)
title_tag.string
# "The Dormouse's story"
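For a crawler, the usual next step is to turn the tags returned by find_all into plain data. The following is a minimal sketch under the assumption that only the link text and href are of interest; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>"""

soup = BeautifulSoup(html, 'html.parser')

# class_ (with a trailing underscore) filters by CSS class
links = [(a.get_text(), a.get('href')) for a in soup.find_all('a', class_='sister')]
for text, href in links:
    print(text, href)
# Elsie http://example.com/elsie
# Lacie http://example.com/lacie
# Tillie http://example.com/tillie

# find() returns only the first match; here the tag is looked up by its id
second = soup.find(id='link2')
print(second.get_text())
# Lacie
```

Using a.get('href') rather than a['href'] keeps the loop from failing on anchors that have no href attribute.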

Tags: beautifulSoup, crawler
