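Python Crawler in Practice: Scraping Douban Images with Scrapy
Author: 濯君 Time: 2023-06-08 10:56:20
Use Scrapy to scrape all personal photos of a given film star on Douban
Monica Bellucci is used as the example
1. First, in the command line, change to the directory where the project should live and run scrapy startproject banciyuan
to create the Scrapy project.
This generates the standard Scrapy project skeleton: scrapy.cfg plus a banciyuan package containing items.py, middlewares.py, pipelines.py, settings.py and a spiders directory.
2. To make it easy to run the Scrapy project from PyCharm, create a new main.py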
from scrapy import cmdline
cmdline.execute("scrapy crawl banciyuan".split())
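Then open Edit Configurations in PyCharm
and add a Python run configuration for main.py; once that is set, running main.py runs the Scrapy project.
3. Analyze the HTML page and create the corresponding spider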
from scrapy import Spider
import scrapy
from banciyuan.items import BanciyuanItem


class BanciyuanSpider(Spider):
    name = 'banciyuan'
    allowed_domains = ['movie.douban.com']
    start_urls = ["https://movie.douban.com/celebrity/1025156/photos/"]
    url = "https://movie.douban.com/celebrity/1025156/photos/"

    def parse(self, response):
        # The last link in the paginator carries the total number of pages.
        num = response.xpath('//div[@class="paginator"]/a[last()]/text()').extract_first('')
        # Fall back to one page if the paginator is missing, so int() cannot fail on ''.
        page_count = int(num) if num.strip().isdigit() else 1
        for i in range(page_count):
            # The start offset advances by 30 for each listing page.
            suffix = '?type=C&start=' + str(i * 30) + '&sortby=like&size=a&subtype=a'
            yield scrapy.Request(url=self.url + suffix, callback=self.get_page)

    def get_page(self, response):
        # Collect the detail-page link of every photo on this listing page.
        href_list = response.xpath('//div[@class="article"]//div[@class="cover"]/a/@href').extract()
        for href in href_list:
            yield scrapy.Request(url=href, callback=self.get_info)

    def get_info(self, response):
        # The detail page holds the image URL and the page title.
        src = response.xpath(
            '//div[@class="article"]//div[@class="photo-show"]//div[@class="photo-wp"]/a[1]/img/@src').extract_first('')
        title = response.xpath('//div[@id="content"]/h1/text()').extract_first('')
        item = BanciyuanItem()
        item['title'] = title
        item['src'] = [src]
        yield item
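The pagination step in parse() can be looked at in isolation. The following is a minimal sketch, not the author's code: build_page_urls, BASE_URL and PAGE_SIZE are hypothetical names introduced here, and the query parameters are copied from the suffix used in the spider. It assumes the paginator text is either a page count or missing, and falls back to a single page in the latter case.

```python
from urllib.parse import urlencode

# Assumed constants mirroring the spider above (hypothetical names).
BASE_URL = "https://movie.douban.com/celebrity/1025156/photos/"
PAGE_SIZE = 30  # same step as the i * 30 offset used in parse()


def build_page_urls(last_page_text, base_url=BASE_URL, page_size=PAGE_SIZE):
    """Return one listing URL per page.

    Falls back to a single page when the paginator text is empty
    or cannot be parsed as an integer.
    """
    try:
        page_count = max(1, int((last_page_text or "").strip()))
    except ValueError:
        page_count = 1

    urls = []
    for page in range(page_count):
        # Same parameters, in the same order, as the hand-built suffix.
        query = urlencode({
            "type": "C",
            "start": page * page_size,
            "sortby": "like",
            "size": "a",
            "subtype": "a",
        })
        urls.append(f"{base_url}?{query}")
    return urls
```

Inside the spider, each returned URL would be passed to scrapy.Request with callback=self.get_page, exactly as the loop in parse() does.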
4. items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class BanciyuanItem(scrapy.Item):
    # define the fields for your item here like:
    src = scrapy.Field()
    title = scrapy.Field()
pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from scrapy.pipelines.images import ImagesPipeline
import scrapy
class BanciyuanPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Request the image and pass the item along so file_path() can read it.
        yield scrapy.Request(url=item['src'][0], meta={'item': item})

    def file_path(self, request, response=None, info=None, *, item=None):
        item = request.meta['item']
        # Use the last URL segment as the file name.
        image_name = item['src'][0].split('/')[-1]
        # image_name.replace('.webp', '.jpg')
        # Store the image under a folder named after the first word of the title.
        path = '%s/%s' % (item['title'].split(' ')[0], image_name)
        return path
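The path logic in file_path() can be tried out without running a crawl. The sketch below is an illustration under stated assumptions, not the pipeline's actual code: build_image_path and the "untitled" fallback folder are hypothetical, and the example URL is made up. It also shows why the commented-out replace call would have had no effect on its own: str.replace returns a new string, so the result has to be assigned back.

```python
import posixpath
from urllib.parse import urlparse


def build_image_path(title, src, rename_webp=False):
    """Return '<first word of title>/<file name>' for an image URL."""
    # Take the file name from the URL path, ignoring any query string.
    image_name = posixpath.basename(urlparse(src).path)

    if rename_webp and image_name.endswith(".webp"):
        # Strings are immutable: build a new name and rebind the variable.
        image_name = image_name[: -len(".webp")] + ".jpg"

    # split() without arguments also drops leading whitespace and newlines,
    # which h1 text extracted from HTML often contains.
    words = title.split()
    folder = words[0] if words else "untitled"  # hypothetical fallback name
    return f"{folder}/{image_name}"
```

The returned value is relative; in the pipeline, Scrapy resolves it against IMAGES_STORE. Renaming the extension only changes the file name, so whether it is appropriate depends on the format the pipeline actually writes.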
settings.py
# Scrapy settings for banciyuan project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'banciyuan'
SPIDER_MODULES = ['banciyuan.spiders']
NEWSPIDER_MODULE = 'banciyuan.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT must be a plain string, not a headers dict
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'banciyuan.middlewares.BanciyuanSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'banciyuan.middlewares.BanciyuanDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'banciyuan.pipelines.BanciyuanPipeline': 1,
}
IMAGES_STORE = './images'
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
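The settings file above leaves every throttling option commented out, so the spider sends requests as fast as Scrapy's defaults allow. A possible fragment for slowing it down is sketched here; the setting names are the ones already listed in the template comments, but the values are illustrative assumptions rather than tuned recommendations.

```python
# Hypothetical additions to settings.py; values are examples only.

# Wait about one second between requests to the same site.
DOWNLOAD_DELAY = 1

# Let AutoThrottle adjust the delay based on observed response latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5   # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60    # upper bound when latency is high
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per server
```

With AutoThrottle enabled, DOWNLOAD_DELAY acts as the lower bound for the adjusted delay.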
5. Crawl results
reference
Source code
Original article: https://blog.csdn.net/zzldm/article/details/117425949
Tags: Python, Scrapy