Python Web Scraping in Practice: Crawling Douban Images with Scrapy

Author: 濯君  Date: 2023-06-08 10:56:20

This tutorial uses Scrapy to download all the personal photos of a film star from Douban.

Monica Bellucci (莫妮卡·贝鲁奇) is used as the example.


1. First, open a command line in the directory where you want the project to live and run scrapy startproject banciyuan to create the Scrapy project.

The generated project structure is as follows.

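For reference, this is the standard layout that scrapy startproject generates (file names may vary slightly between Scrapy versions):

banciyuan/
├── scrapy.cfg
└── banciyuan/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py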

2. To make it easy to run the Scrapy project from PyCharm, create a main.py:


from scrapy import cmdline

# Equivalent to typing "scrapy crawl banciyuan" at the command line.
cmdline.execute("scrapy crawl banciyuan".split())
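main.py is just a convenience wrapper: cmdline.execute() hands the argument list to Scrapy's own command-line entry point, so running this script from the project root is equivalent to typing scrapy crawl banciyuan in a terminal. PyCharm only needs it because its run configurations point at a script rather than a shell command.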

Then open Edit Configurations in PyCharm.

(Screenshot: PyCharm's Edit Configurations menu)

Apply the following settings; once that is done, you can run the Scrapy project simply by running main.py.

(Screenshot: run configuration pointing at main.py)

3. Analyze the HTML of the photo-list page and create the corresponding spider.

(Screenshot: the photo-list page HTML in the browser inspector)


from scrapy import Spider
import scrapy

from banciyuan.items import BanciyuanItem

class BanciyuanSpider(Spider):
    name = 'banciyuan'
    allowed_domains = ['movie.douban.com']
    start_urls = ["https://movie.douban.com/celebrity/1025156/photos/"]
    url = "https://movie.douban.com/celebrity/1025156/photos/"

    def parse(self, response):
        # The last <a> in the paginator holds the total page count.
        num = response.xpath('//div[@class="paginator"]/a[last()]/text()').extract_first('')
        # Each list page shows 30 photos; request every page.
        for i in range(int(num)):
            suffix = '?type=C&start=' + str(i * 30) + '&sortby=like&size=a&subtype=a'
            yield scrapy.Request(url=self.url + suffix, callback=self.get_page)

    def get_page(self, response):
        # Collect the detail-page link of every photo on this list page.
        href_list = response.xpath('//div[@class="article"]//div[@class="cover"]/a/@href').extract()
        for href in href_list:
            yield scrapy.Request(url=href, callback=self.get_info)

    def get_info(self, response):
        # Extract the full-size image URL and the page title from the detail page.
        src = response.xpath(
            '//div[@class="article"]//div[@class="photo-show"]//div[@class="photo-wp"]/a[1]/img/@src').extract_first('')
        title = response.xpath('//div[@id="content"]/h1/text()').extract_first('')
        item = BanciyuanItem()
        item['title'] = title
        item['src'] = [src]
        yield item
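Before running the full spider, the XPath expressions above can be checked interactively with Scrapy's shell (optional; run it from the project root so the User-Agent configured in settings.py is applied):

scrapy shell "https://movie.douban.com/celebrity/1025156/photos/"
>>> response.xpath('//div[@class="paginator"]/a[last()]/text()').extract_first('')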

4. items.py


# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class BanciyuanItem(scrapy.Item):
    # Fields filled in by the spider: image URL list and page title.
    src = scrapy.Field()
    title = scrapy.Field()
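Since scrapy.Item supports dict-style access, the spider fills these fields with ordinary subscript syntax. A quick illustration (the URL is a made-up placeholder):

item = BanciyuanItem()
item['title'] = 'Monica Bellucci'
item['src'] = ['https://example.com/photo.jpg']  # placeholder URL
print(item['title'], item['src'][0])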

pipelines.py


# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from scrapy.pipelines.images import ImagesPipeline
import scrapy

class BanciyuanPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Request the image URL and pass the item along for file_path().
        yield scrapy.Request(url=item['src'][0], meta={'item': item})

    def file_path(self, request, response=None, info=None, *, item=None):
        # Save each image as <first word of the page title>/<original filename>.
        item = request.meta['item']
        image_name = item['src'][0].split('/')[-1]
        path = '%s/%s' % (item['title'].split(' ')[0], image_name)
        return path
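By default, ImagesPipeline simply skips files that fail to download. As an optional extension (not part of the original code), item_completed can be overridden in the same class to drop items whose image could not be fetched; a minimal sketch:

from scrapy.exceptions import DropItem

class BanciyuanPipeline(ImagesPipeline):
    # ... get_media_requests and file_path as above ...

    def item_completed(self, results, item, info):
        # results holds one (success, file_info_or_failure) tuple per
        # request yielded by get_media_requests.
        if not any(ok for ok, _ in results):
            raise DropItem('Image download failed: %s' % item['src'][0])
        return item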

settings.py


# Scrapy settings for banciyuan project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'banciyuan'

SPIDER_MODULES = ['banciyuan.spiders']
NEWSPIDER_MODULE = 'banciyuan.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'banciyuan.middlewares.BanciyuanSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'banciyuan.middlewares.BanciyuanDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
  'banciyuan.pipelines.BanciyuanPipeline': 1,
}
IMAGES_STORE = './images'

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
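One practical dependency note: ImagesPipeline needs Pillow for image processing. Install it with pip install Pillow before the first run, otherwise Scrapy disables the pipeline with a warning and no images are saved.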

5. Crawl results

(Screenshot: the downloaded images)
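With file_path() as defined above and IMAGES_STORE = './images', the downloaded files end up grouped under the first word of each photo page's title, roughly like this (file names are illustrative):

images/
└── 莫妮卡·贝鲁奇/
    ├── p616651814.jpg
    ├── p2266293606.jpg
    └── ...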

Reference

Source code

Source: https://blog.csdn.net/zzldm/article/details/117425949

Tags: Python, Scrapy