Basic Scrapy Commands and settings.py Configuration

Author: hankleo  Date: 2021-12-03 14:05:47

This article walks through the basic Scrapy command-line commands and the main settings.py options, with examples for reference.

Basic Scrapy Commands

1. Create a crawler project


scrapy startproject [project_name]
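Running the command generates a project skeleton roughly like the following (exact layout may vary slightly across Scrapy versions; `maitian` is the project name used later in this article):

```
maitian/
├── scrapy.cfg            # deploy configuration
└── maitian/
    ├── __init__.py
    ├── items.py          # item definitions
    ├── middlewares.py    # spider/downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/
        └── __init__.py
```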

2. Create a spider file


scrapy genspider [spider_name] [domain]

3. Run a spider (crawl)


scrapy crawl [spider_name]
# -o exports the scraped items to a file
scrapy crawl [spider_name] -o zufang.json
scrapy crawl [spider_name] -o zufang.csv
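In newer Scrapy releases (2.1+), the same export can also be configured once in settings.py via the `FEEDS` setting instead of passing `-o` on every run; the file name and options below are illustrative (`overwrite` needs Scrapy 2.4+):

```python
# settings.py
FEEDS = {
    "zufang.json": {
        "format": "json",
        "encoding": "utf8",
        "overwrite": True,  # replace the file on each run
    },
}
```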

4. check: run contract checks on the project's spiders


scrapy check

5. list: show all spiders in the project


scrapy list

6. view: download a page and open it in the browser, as Scrapy sees it


scrapy view http://www.baidu.com

7. scrapy shell: start an interactive scraping shell


scrapy shell https://www.baidu.com
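Inside the shell, objects such as `request` and `response` are pre-populated, and helpers like `fetch()` and `view()` are available. A typical session might look like this (the selector and follow-up URL are illustrative):

```
>>> response.status
200
>>> response.css("title::text").get()
>>> fetch("https://www.baidu.com/more/")   # load a different page in place
>>> view(response)                         # open the current response in a browser
```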

8. runspider: run a standalone spider file directly, without a project


scrapy runspider zufang_spider.py

Scrapy: settings.py configuration


# -*- coding: utf-8 -*-
# Scrapy settings for maitian project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#   https://doc.scrapy.org/en/latest/topics/settings.html
#   https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#   https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'maitian'
SPIDER_MODULES = ['maitian.spiders']
NEWSPIDER_MODULE = 'maitian.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# (only a single User-Agent string can be set here; it cannot rotate)
USER_AGENT = 'maitian (+http://www.yourdomain.com)'
# Obey robots.txt rules (obeyed by default; False disables the check)
ROBOTSTXT_OBEY = False
# Write log output to a file
LOG_FILE = "maitian.log"
# The five log levels, lowest to highest: DEBUG, INFO, WARNING, ERROR, CRITICAL
# The higher the threshold, the less is logged
# LOG_LEVEL = "INFO"
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0):
# waits this many seconds between consecutive requests to the same site
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (a remote debugging console; enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#  'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#  'maitian.middlewares.MaitianSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#  'maitian.middlewares.MaitianDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#  'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines (they are enabled here, in settings)
# Priority values range from 0 to 1000; lower values run earlier
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#  'maitian.pipelines.MaitianPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
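As a sketch of what a DOWNLOADER_MIDDLEWARES entry points at, here is a hypothetical middleware that rotates the User-Agent header; the class name and UA strings are assumptions for illustration, not part of the article's project:

```python
import random


# Hypothetical downloader middleware: pick a random User-Agent per request.
# It would be registered under DOWNLOADER_MIDDLEWARES, e.g.
#   {'maitian.middlewares.RandomUserAgentMiddleware': 543}
class RandomUserAgentMiddleware:
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    ]

    def process_request(self, request, spider):
        # Scrapy calls this for every outgoing request; returning None
        # lets the request continue through the remaining middlewares.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None
```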
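Similarly, the ITEM_PIPELINES entry would point at a class like the following; this body is an illustrative assumption (the article does not show the real MaitianPipeline), normalizing an assumed `price` field:

```python
# Hypothetical pipeline matching the 'maitian.pipelines.MaitianPipeline': 300
# entry: cleans up items before they are exported.
class MaitianPipeline:
    def process_item(self, item, spider):
        # Called for every item the spider yields; must return the item
        # (or raise scrapy.exceptions.DropItem to discard it).
        if "price" in item:
            item["price"] = float(item["price"])
        return item
```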


Hopefully this article is helpful to readers building Python programs with the Scrapy framework.

Source: https://www.cnblogs.com/hankleo/p/11824244.html
