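Asynchronously Scraping the Zhihu Hot List with Python: A Worked Example
Author: 程序员班长  Date: 2022-02-26 04:48:47
1. Broken code: the abstract and the detail URL cannot be retrieved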
import asyncio
from bs4 import BeautifulSoup
import aiohttp

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'referer': 'https://www.baidu.com/s?tn=02003390_43_hao_pg&isource=infinity&iname=baidu&itype=web&ie=utf-8&wd=%E7%9F%A5%E4%B9%8E%E7%83%AD%E6%A6%9C'
}

async def getPages(url):
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get(url) as resp:
            print(resp.status)  # print the HTTP status code
            html = await resp.text()
            soup = BeautifulSoup(html, 'lxml')
            items = soup.select('.HotList-item')
            for item in items:
                title = item.select('.HotList-itemTitle')[0].text
                try:
                    abstract = item.select('.HotList-itemExcerpt')[0].text
                except IndexError:
                    abstract = 'No Abstract'
                hot = item.select('.HotList-itemMetrics')[0].text
                try:
                    # select() returns a list, so take the first match before reading 'src'
                    img = item.select('.HotList-itemImgContainer img')[0]['src']
                except (IndexError, KeyError):
                    img = 'No Img'
                print("{}\n{}\n{}".format(title, abstract, img))

if __name__ == '__main__':
    url = 'https://www.zhihu.com/billboard'
    asyncio.run(getPages(url))
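2. Inspecting the JS code
It turns out that the detail links, image links, and question abstracts are all inside the page's JS rather than in the HTML elements (the CSDN developer-assistant plugin is really handy for spotting this).
A regular expression can pull this information out of the page source: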
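As a minimal offline sketch of that idea, the snippet below runs the regular expression against a tiny made-up page string instead of the live site. The variable `sample_html` and the helper `extract_hot_list` are hypothetical names introduced here for illustration; only the two key names `"hotList"` and `"guestFeeds"` and the nested field names come from the article's code.

```python
import json
import re

# Hypothetical, heavily shortened stand-in for the page source. Only the
# "hotList" and "guestFeeds" keys mirror the article; the rest is invented.
sample_html = (
    '<script>{"hotList":'
    '[{"target":{"titleArea":{"text":"Example question"},'
    '"link":{"url":"https://www.zhihu.com/question/1"}}}]'
    ',"guestFeeds":{}}</script>'
)

# Non-greedy match: capture everything between the two keys.
pattern = re.compile(r'"hotList":(.*?),"guestFeeds":', re.S)


def extract_hot_list(html):
    """Return the hot-list entries as Python objects, or [] if not found."""
    match = pattern.search(html)
    if match is None:
        return []
    return json.loads(match.group(1))


items = extract_hot_list(sample_html)
print(items[0]["target"]["titleArea"]["text"])
print(items[0]["target"]["link"]["url"])
```

Returning an empty list when the pattern is missing keeps the caller from crashing on `None.group(1)` if the page layout changes.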
The full code follows.
import asyncio
import json
import re
import aiohttp

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'referer': 'https://www.baidu.com/s?tn=02003390_43_hao_pg&isource=infinity&iname=baidu&itype=web&ie=utf-8&wd=%E7%9F%A5%E4%B9%8E%E7%83%AD%E6%A6%9C'
}

async def getPages(url):
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get(url) as resp:
            print(resp.status)  # print the HTTP status code
            html = await resp.text()
            # The hot-list data sits in the page as JSON between these two keys
            regex = re.compile(r'"hotList":(.*?),"guestFeeds":')
            match = regex.search(html)
            if match is None:
                print('hotList data not found in the page')
                return
            text = match.group(1)
            # print(json.loads(text))  # parse the JSON into Python objects
            for item in json.loads(text):
                title = item['target']['titleArea']['text']
                question = item['target']['excerptArea']['text']
                hot = item['target']['metricsArea']['text']
                link = item['target']['link']['url']
                img = item['target']['imageArea']['url']
                if not img:
                    img = 'No Img'
                if not question:
                    question = 'No Abstract'
                print("Title:{}\nPopular:{}\nQuestion:{}\nLink:{}\nImg:{}".format(title, hot, question, link, img))

if __name__ == '__main__':
    url = 'https://www.zhihu.com/billboard'
    asyncio.run(getPages(url))
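The regular expression above depends on `"guestFeeds"` being the key that immediately follows `"hotList"`, and the chained indexing raises `KeyError` if a field is missing. One possible alternative, shown here only as a sketch and not as the original author's method, is to let the standard library's `json.JSONDecoder.raw_decode` parse the value that starts right after `"hotList":` and to read nested fields through a small fallback helper. The names `decode_after_key`, `dig`, and `page` are hypothetical, and the sample string is made up.

```python
import json


def decode_after_key(html, key='"hotList":'):
    """Decode the JSON value that immediately follows `key` in `html`."""
    start = html.find(key)
    if start == -1:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever text comes after it. It does not skip whitespace,
    # so the value must begin directly after the key.
    value, _end = json.JSONDecoder().raw_decode(html, start + len(key))
    return value


def dig(obj, *keys, default=""):
    """Follow `keys` through nested dicts; return `default` if any is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj


# Made-up page fragment: note that no "guestFeeds" key follows the list.
page = 'x"hotList":[{"target":{"titleArea":{"text":"T"}}}],"somethingElse":1'

items = decode_after_key(page)
title = dig(items[0], "target", "titleArea", "text")
img = dig(items[0], "target", "imageArea", "url", default="No Img")
print(title, img)
```

The trade-off is that this version no longer cares what comes after the list, but it still assumes the `"hotList":` key appears exactly once in that form.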
Source: https://kantlee.blog.csdn.net/article/details/113665084
Tags: Python, async, scraping, Zhihu, hot list