Scrapy Crawlers: A Problem with Response Subclasses in Practice

Author: ponponon  Date: 2023-11-03 01:29:31

While using Scrapy to crawl wallpapers today (url: http://pic.netbian.com/4kmein...), I ran into a few problems. I am recording them here for future reference and discussion.

Because the site is rendered dynamically, I chose to drive Scrapy through Selenium. (Scrapy fetches pages much like the requests library does, by issuing plain HTTP requests, so on its own it cannot scrape JavaScript-rendered pages.)

The Downloader Middleware therefore needs to receive the Request and return a Response. The problem lies in that Response: the official documentation gives the signature class scrapy.http.Response(url[, status=200, headers=None, body=b'', flags=None, request=None]), so I imported it with from scrapy.http import Response.


Running scrapy crawl girl produces the following error:

results = response.xpath('//*[@id="main"]/div[3]/ul/li/a/img')
    raise NotSupported("Response content isn't text")
scrapy.exceptions.NotSupported: Response content isn't text

Checking the relevant code:

# middlewares.py
from scrapy import signals
from scrapy.http import Response
from scrapy.exceptions import IgnoreRequest
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class Pic4KgirlDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        try:
            self.browser = webdriver.Chrome()
            self.wait = WebDriverWait(self.browser, 10)
            self.browser.get(request.url)
            self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#main > div.page > a:nth-child(10)')))
            return Response(url=request.url, status=200, request=request, body=self.browser.page_source.encode('utf-8'))
        # except:
        #     raise IgnoreRequest()
        finally:
            self.browser.close()

The problem presumably lies in this line:

return Response(url=request.url, status=200, request=request, body=self.browser.page_source.encode('utf-8'))

Looking at the definition of the Response class:

@property
def text(self):
    """For subclasses of TextResponse, this will return the body
    as text (unicode object in Python 2 and str in Python 3)
    """
    raise AttributeError("Response content isn't text")

def css(self, *a, **kw):
    """Shortcut method implemented only by responses whose content
    is text (subclasses of TextResponse).
    """
    raise NotSupported("Response content isn't text")

def xpath(self, *a, **kw):
    """Shortcut method implemented only by responses whose content
    is text (subclasses of TextResponse).
    """
    raise NotSupported("Response content isn't text")

This shows that the base Response class cannot be used for parsing: its text, css and xpath members do nothing but raise, and only subclasses that override these methods can actually parse a body.
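The pattern at work is ordinary method overriding: the base class raises, and only a subclass supplies a real implementation. A stdlib-only sketch (the class names and the toy xpath implementation are made up for illustration):

```python
class BaseResponse:
    """Mimics scrapy's base Response: parsing methods just raise."""
    def __init__(self, body: bytes):
        self.body = body

    def xpath(self, query):
        raise NotImplementedError("Response content isn't text")


class TextLikeResponse(BaseResponse):
    """Mimics TextResponse: overrides xpath with a real (toy) implementation."""
    def xpath(self, query):
        # Toy stand-in: report what would be queried against the decoded body.
        return f"querying {query!r} against {self.body.decode('utf-8')!r}"


base = BaseResponse(b"<html></html>")
try:
    base.xpath("//img")
except NotImplementedError as exc:
    print(exc)  # Response content isn't text

sub = TextLikeResponse(b"<html></html>")
print(sub.xpath("//img"))
```

The crawler hit exactly this: the middleware returned an instance of the base class, so every xpath call landed on the raising stub.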

Response subclasses

**TextResponse objects**
class scrapy.http.TextResponse(url[, encoding[, ...]])
**HtmlResponse objects**
class scrapy.http.HtmlResponse(url[, ...])
**XmlResponse objects**
class scrapy.http.XmlResponse(url[, ...])

As an example, look at the definition of TextResponse. Importing it with from scrapy.http import TextResponse, we find:

class TextResponse(Response):
    _DEFAULT_ENCODING = 'ascii'

    def __init__(self, *args, **kwargs):
        self._encoding = kwargs.pop('encoding', None)
        self._cached_benc = None
        self._cached_ubody = None
        self._cached_selector = None
        super(TextResponse, self).__init__(*args, **kwargs)

where the xpath method has been overridden:

@property
def selector(self):
    from scrapy.selector import Selector
    if self._cached_selector is None:
        self._cached_selector = Selector(self)
    return self._cached_selector

def xpath(self, query, **kwargs):
    return self.selector.xpath(query, **kwargs)

def css(self, query):
    return self.selector.css(query)

So instead of instantiating the base Response class, you must return one of its subclasses (here HtmlResponse), which actually implement text, css and xpath.

Related: Scrapy Tutorial Part 11 — Request and Response

Scrapy documentation: https://doc.scrapy.org/en/lat...

Chinese translation: https://www.jb51.net/article/248161.htm

Source: https://segmentfault.com/a/1190000018449717

Tags: Scrapy, crawler, Response, subclass
