A Detailed Summary of Essential Python Web-Scraping Techniques

Author: 小旺不正经 | Date: 2022-10-02 12:47:44

Defining a custom function

import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent header sent with every request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:93.0) Gecko/20100101 Firefox/93.0'}

def baidu(company):
    # Build the Baidu news search URL for this company
    url = 'https://www.baidu.com/s?rtt=4&tn=news&word=' + company
    print(url)
    html = requests.get(url, headers=headers).text
    s = BeautifulSoup(html, 'html.parser')
    # Select the links inside elements that carry the news-title class
    title = s.select('.news-title_1YtI1 a')
    for i in title:
        print(i.text)

# Call the function once for each company
companies = ['腾讯', '阿里巴巴', '百度集团']
for i in companies:
    baidu(i)

This prints the titles of the search results for several companies in one batch.
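The function above builds the URL by joining the company name onto a fixed prefix. As a small sketch of making the encoding of the Chinese search term explicit, the standard library's urllib.parse.urlencode can build the same query string. The helper name build_news_url and the constant BASE_URL are made up for this illustration and are not part of the original code.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.baidu.com/s'  # same endpoint as in the article


def build_news_url(company):
    """Return the news-search URL for one company (hypothetical helper)."""
    # urlencode percent-encodes non-ASCII text as UTF-8 and joins the pairs with '&'
    query = urlencode({'rtt': 4, 'tn': 'news', 'word': company})
    return BASE_URL + '?' + query


if __name__ == '__main__':
    for name in ['腾讯', '阿里巴巴', '百度集团']:
        print(build_news_url(name))
```

The resulting string can be passed to requests.get in place of the concatenated URL.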

Saving the results to a text file

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:93.0) Gecko/20100101 Firefox/93.0'}

def baidu(company):
    url = 'https://www.baidu.com/s?rtt=4&tn=news&word=' + company
    print(url)
    html = requests.get(url, headers=headers).text
    s = BeautifulSoup(html, 'html.parser')
    title = s.select('.news-title_1YtI1 a')
    # Open in append mode ('a') so each call adds to the file; "with" closes it afterwards
    with open('test.text', 'a', encoding='utf-8') as fl:
        for i in title:
            fl.write(i.text + '\n')

# Call the function once for each company
companies = ['腾讯', '阿里巴巴', '百度集团']
for i in companies:
    baidu(i)

The code that writes to the file:

# Append mode ('a') adds to the end of the file instead of overwriting it
with open('test.text', 'a', encoding='utf-8') as fl:
    for i in title:
        fl.write(i.text + '\n')
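To see what append mode does in isolation, here is a minimal sketch that needs no network access. The helpers append_titles and read_lines and the sample titles are invented for the illustration; only the call open(..., 'a', encoding='utf-8') comes from the code above.

```python
import os
import tempfile


def append_titles(path, titles):
    """Append each title to the file at path, one per line (hypothetical helper)."""
    # 'a' creates the file if it is missing, otherwise writes after the existing content
    with open(path, 'a', encoding='utf-8') as fl:
        for title in titles:
            fl.write(title + '\n')


def read_lines(path):
    """Return the file's lines without their trailing newlines."""
    with open(path, encoding='utf-8') as fl:
        return fl.read().splitlines()


if __name__ == '__main__':
    # Work in a fresh temporary directory so no real file is touched
    path = os.path.join(tempfile.mkdtemp(), 'test.text')
    append_titles(path, ['first title', 'second title'])
    append_titles(path, ['third title'])  # the second call adds to the file
    print(read_lines(path))
```

Because the mode is 'a', running the scraper twice leaves two copies of every title in the file. Delete the file first if that is not what you want.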

Exception handling

for i in companies:
    try:
        baidu(i)
        print('Succeeded')
    except Exception:
        # Catch ordinary errors only, so Ctrl+C still stops the program
        print('Failed')

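Because the try/except sits inside the loop, a failure for one company does not stop the program: it prints the failure message and moves on to the next company.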
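The same pattern can be tried without touching the network. In this sketch, fake_fetch is a stand-in that fails on purpose for one input, and run_all is a hypothetical helper that records which items succeeded and which failed instead of only printing. Neither name appears in the original code.

```python
def run_all(items, task):
    """Call task(item) for every item; return (succeeded, failed) lists."""
    succeeded, failed = [], []
    for item in items:
        try:
            task(item)
            succeeded.append(item)
        except Exception as exc:  # ordinary errors only; KeyboardInterrupt passes through
            print('failed:', item, '-', exc)
            failed.append(item)
    return succeeded, failed


def fake_fetch(name):
    """Stand-in for baidu(); raises for one made-up name to simulate a failure."""
    if name == 'bad':
        raise ValueError('simulated network error')


if __name__ == '__main__':
    print(run_all(['a', 'bad', 'c'], fake_fetch))
```

Keeping the failed items in a list makes it possible to retry just those afterwards.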

Pausing between requests

import time

for i in companies:
    try:
        baidu(i)
        print('Succeeded')
    except Exception:
        print('Failed')
    # Pause for 5 seconds before moving on to the next company
    time.sleep(5)

time.sleep(5)

The argument in the parentheses is in seconds.

The program pauses at whatever point in the code the call is placed.
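Where the call goes matters. As a sketch, the hypothetical helper run_with_pauses below sleeps between items but not before the first one, so a run over n items pauses n - 1 times. Passing the sleep function in as a parameter is an assumption made here so the behaviour can be checked without actually waiting.

```python
import time


def run_with_pauses(items, task, delay, sleep=time.sleep):
    """Call task(item) for each item, pausing delay seconds between items."""
    results = []
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay)  # pause before every item except the first
        results.append(task(item))
    return results


if __name__ == '__main__':
    start = time.monotonic()
    print(run_with_pauses(['a', 'b', 'c'], str.upper, delay=0.5))
    print('elapsed: %.1f s' % (time.monotonic() - start))
```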

Scraping multiple pages

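Search Baidu for 腾讯 (Tencent).

Switch to the second page of results.

Remove the unneeded parameters from the URL, which leaves: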

https://www.baidu.com/s?wd=腾讯&pn=10

From this, the pattern is:

https://www.baidu.com/s?wd=腾讯&pn=0 is page 1
https://www.baidu.com/s?wd=腾讯&pn=10 is page 2
https://www.baidu.com/s?wd=腾讯&pn=20 is page 3
https://www.baidu.com/s?wd=腾讯&pn=30 is page 4
..........
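Put as a formula, page n (counting from 1) uses pn = (n - 1) * 10. A short sketch of that mapping follows; page_url and BASE are hypothetical names chosen for this illustration.

```python
BASE = 'https://www.baidu.com/s?wd=腾讯&pn='


def page_url(page):
    """Return the search URL for a 1-based page number."""
    if page < 1:
        raise ValueError('page numbers start at 1')
    # Each page advances the pn parameter by 10
    return BASE + str((page - 1) * 10)


if __name__ == '__main__':
    for page in range(1, 5):
        print(page, page_url(page))
```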

Code

import requests
from bs4 import BeautifulSoup
import time

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:93.0) Gecko/20100101 Firefox/93.0'}

def baidu(c):
    # c is the zero-based page index; pn advances by 10 per page
    url = 'https://www.baidu.com/s?wd=腾讯&pn=' + str(c * 10)
    print(url)
    html = requests.get(url, headers=headers).text
    s = BeautifulSoup(html, 'html.parser')
    title = s.select('.t a')
    for i in title:
        print(i.text)

# Fetch the first 10 pages, pausing 2 seconds after each request
for i in range(10):
    baidu(i)
    time.sleep(2)

Source: https://blog.csdn.net/weixin_42403632/article/details/120884472

Tags: Python, web-scraping techniques
