Scrapy: Iterative Crawling Stops After the First Level, and How to Fix It

Author: bladestone Date: 2022-11-02 21:03:32

Introduction

In Scrapy, it is often necessary to crawl page data level by level: crawl a page from a URL, extract more URLs from that page, crawl those, and repeat.
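The level-by-level pattern described above can be sketched without Scrapy. This is only a minimal illustration: the in-memory `FAKE_SITE` mapping, the `crawl` function, and the `fetch_links` parameter are hypothetical names standing in for real HTTP requests and link extraction.

```python
from collections import deque

# Hypothetical in-memory "site": each URL maps to the links found on that page.
FAKE_SITE = {
    "/schools": ["/school/1", "/school/2"],
    "/school/1": ["/major/1a"],
    "/school/2": ["/major/2a", "/major/2b"],
}


def crawl(start_url, fetch_links, depth_limit):
    """Visit pages level by level, following extracted links up to depth_limit."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= depth_limit:
            continue  # do not follow links found beyond the depth limit
        for link in fetch_links(url):
            if link not in seen:  # skip pages that are already scheduled
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


pages = crawl("/schools", lambda u: FAKE_SITE.get(u, []), depth_limit=5)
for url, depth in pages:
    print(depth, url)
```

In a Scrapy spider the queue is managed by the framework: each callback yields new requests, and each request names the callback that handles the next level.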

This article describes a problem in which an iterative crawl only fetched the first level of pages.

The Problem

Run the spider:

scrapy crawl enrolldata

The output of the Scrapy run is as follows:
```
2018-05-06 17:23:06 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: enrolldata)
2018-05-06 17:23:06 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.4.0, Python 3.6.1 (default, Apr 24 2017, 23:31:02) - [GCC 6.2.0 20161005], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Linux-4.15.0-20-generic-x86_64-with-debian-buster-sid
2018-05-06 17:23:06 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'enrolldata', 'CONCURRENT_REQUESTS': 60, 'CONCURRENT_REQUESTS_PER_IP': 60, 'DEPTH_LIMIT': 5, 'NEWSPIDER_MODULE': 'enrolldata.spiders', 'SPIDER_MODULES': ['enrolldata.spiders']}
2018-05-06 17:23:06 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2018-05-06 17:23:06 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-05-06 17:23:06 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-05-06 17:23:06 [scrapy.middleware] INFO: Enabled item pipelines:
['enrolldata.pipelines.EnrolldataPipeline']
2018-05-06 17:23:06 [scrapy.core.engine] INFO: Spider opened
open spider ………..pipeline
2018-05-06 17:23:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-05-06 17:23:06 [py.warnings] WARNING: /home/bladestone/codebase/python36env/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py:59: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry http://www.heao.gov.cn/ in allowed_domains.
warnings.warn("allowed_domains accepts only domains, not URLs. Ignoring URL entry %s in allowed_domains." % domain, URLWarning)

2018-05-06 17:23:06 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
end of start requests
2018-05-06 17:23:14 [scrapy.core.engine] DEBUG: Crawled (200)
```

The spider code is as follows:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import FormRequest

from enrolldata.items import EnrolldataItem


class SchoolspiderSpider(scrapy.Spider):
    name = 'enrolldata'
    cookies = {}
    # Note: this entry is a URL, not a domain name (see the warning in the log).
    allowed_domains = ['http://www.heao.gov.cn/']
    start_urls = ['http://www.heao.gov.cn/JHCX/PZ/enrollplan/SchoolList.aspx']
    # ............. (other attributes elided in the original)

    def start_requests(self):
        # Form fields posted to the school list page to request a given page number.
        formdata = {}
        formdata['PagesUpDown$edtPage'] = '1'
        formdata['__EVENTTARGET'] = 'PagesUpDown$lbtnGO'
        formdata['__EVENTARGUMENT'] = ''
        formdata['__VIEWSTATE'] = '/wEPDwUKMjA1MTU4MDA1Ng9kFgICBQ9kFgICAQ8PFggeDGZDdXJyZW50UGFnZQIBHhFmVG90YWxSZWNvcmRDb3VudAK4ER4KZlBhZ2VDb3VudAKVAR4JZlBhZ2VTaXplAg9kZGSI36vb/TsBmDT8pwwx37ajH1x0og=='
        formdata['__VIEWSTATEGENERATOR'] = 'AABB4DD8'
        formdata['__EVENTVALIDATION'] = '/wEWBQLYvvTTCwK2r/yJBQK6r/CJBgLqhPDLCwLQ0r3uCMy0KhJCAT8jebTQL0eNdj7uk4L5'
        # First level: request the school list (only page 1 here).
        for i in range(1, 2):
            formdata['PagesUpDown$edtPage'] = str(i)
            yield FormRequest(url=self.start_urls[0], headers=self.headers,
                              formdata=formdata, callback=self.parse_school)
        print("end of start requests")

    def parse(self, response):
        print("parse method is invoked")

    def parse_school(self, response):
        print("parse school data.....")
        # Extract the link to each school's detail page.
        urls = response.xpath('//*[@id="SpanSchoolList"]/div/div[2]/ul/li/a/@href').extract()
        print("print out all the matched urls")
        print(urls)

        # Second level: request each school's page.
        for url in urls:
            request_url = self.base_url + url
            print("request_url in major:" + request_url)
            yield scrapy.Request(request_url, headers=self.request_headers,
                                 cookies=self.cookies,
                                 callback=self.parse_major_enroll, meta=self.meta)

    # ...... (remaining code elided in the original)

The code raised no errors, but it only produced the results of crawling the first level of pages. The second level was never crawled.

Problem Analysis

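Judging from the log, there are no error messages. The first level is crawled correctly, but the second-level crawl is never executed, even though the code itself appears to be written correctly.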

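So what is the problem? Could it be that the code is simply not being executed? I added log statements: the log lines were printed as expected, yet the code that handles the second level still did not run. Were the requests being filtered or intercepted, so that the callback was never invoked?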

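A code review showed that the problem lies in the allowed_domains setting: because it was set incorrectly, all the remaining links were filtered out directly. The log does hint at this. It contains a URLWarning from offsite.py stating that allowed_domains accepts only domains, not URLs, and that the URL entry http://www.heao.gov.cn/ is ignored.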
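Why a URL entry causes every followed link to be dropped can be illustrated with a simplified model of a host-based filter. This is a sketch under assumptions, not Scrapy's actual source: the function names `build_host_regex` and `is_allowed` are hypothetical, and the matching rule (an allowed domain matches itself and its subdomains) is a simplification.

```python
import re
from urllib.parse import urlparse


def build_host_regex(allowed_domains):
    """Build a regex matching the allowed hosts and their subdomains."""
    if not allowed_domains:
        return re.compile("")  # no restriction: every host matches
    alternatives = "|".join(re.escape(d) for d in allowed_domains)
    return re.compile(r"^(.*\.)?(%s)$" % alternatives)


def is_allowed(url, allowed_domains):
    """Return True if the URL's host passes the simplified offsite check."""
    host = urlparse(url).hostname or ""
    return bool(build_host_regex(allowed_domains).search(host))


url = "http://www.heao.gov.cn/JHCX/PZ/enrollplan/SchoolList.aspx"
print(is_allowed(url, ["http://www.heao.gov.cn/"]))  # False
print(is_allowed(url, ["heao.gov.cn"]))              # True
```

The check compares only the host name of each request. A full URL such as `http://www.heao.gov.cn/` can never equal a host name, so in this model no request passes.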

allowed_domains must be a list of domain names, not a list of URLs.
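One way to guard against this mistake is to reduce each entry to its host name before assigning it to allowed_domains. The helper below is a hypothetical sketch using only the standard library; `to_domain` is not part of Scrapy.

```python
from urllib.parse import urlparse


def to_domain(entry):
    """Return the bare host name for a domain or URL entry."""
    # Without a scheme, urlparse treats the text as a path,
    # so prefix "//" to make it parse as a network location.
    if "://" not in entry:
        entry = "//" + entry
    return urlparse(entry).hostname or ""


print(to_domain("http://www.heao.gov.cn/"))  # www.heao.gov.cn
print(to_domain("heao.gov.cn"))              # heao.gov.cn
```

The helper keeps the `www.` prefix. The fix in this article uses the shorter `heao.gov.cn` instead.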

Solution

The previous domain setting needs to be changed from:

allowed_domains = ['http://www.heao.gov.cn/']

to:

allowed_domains = ['heao.gov.cn']

After re-running the spider, multiple levels are crawled correctly.

Source: https://blog.csdn.net/blueheart20/article/details/80216440

Tags: Scrapy, iteration, web crawling