Using a Proxy with Python's Scrapy Crawling Framework

Author: goldensun  Posted: 2022-07-28 01:34:49

1. Create a new "middlewares.py" in your Scrapy project directory


# Import base64; it is needed only if the proxy you are going
# to use requires authentication
import base64

# Start your middleware class
class ProxyMiddleware(object):
    # Override process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines only if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up HTTP Basic authentication for the proxy.
        # b64encode expects bytes, so encode/decode around the call
        # (base64.encodestring is Python 2 only and was removed in Python 3.9)
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
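The header construction can be checked outside Scrapy. A minimal sketch with made-up credentials (`user:secret` is a placeholder, not a real account):

import base64

# Hypothetical credentials in the USERNAME:PASSWORD form the proxy expects
proxy_user_pass = "user:secret"

# b64encode operates on bytes, so encode the string first and decode
# the result back to str before using it as a header value
encoded_user_pass = base64.b64encode(proxy_user_pass.encode("utf-8")).decode("ascii")
header_value = "Basic " + encoded_user_pass
print(header_value)  # Basic dXNlcjpzZWNyZXQ=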

2. Add the following to the project settings file (./project_name/settings.py)


DOWNLOADER_MIDDLEWARES = {
    # in old Scrapy versions this path was
    # 'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware'
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}
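The numbers are middleware priorities: lower values sit closer to the engine, so their process_request() runs first. A quick way to see the order (the dict below just mirrors the setting above):

# Mirror of the DOWNLOADER_MIDDLEWARES setting above
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}

# Sort by priority to see which middleware handles a request first
order = sorted(DOWNLOADER_MIDDLEWARES, key=DOWNLOADER_MIDDLEWARES.get)
print(order[0])  # project_name.middlewares.ProxyMiddleware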

That's it: just two steps, and requests now go through the proxy. Time to test ^_^


import scrapy

class TestSpider(scrapy.Spider):
    name = "test"
    allowed_domains = ["whatismyip.com"]
    # The following URL is subject to change; you can get the latest one from:
    # http://www.whatismyip.com/faq/automation.asp
    start_urls = ["http://xujian.info"]

    def parse(self, response):
        # Save the page so you can check which IP address the site saw
        with open('test.html', 'wb') as f:
            f.write(response.body)

3. Use a random User-Agent

By default Scrapy sends a single User-Agent on every request, which makes the crawler easy for sites to block. The code below picks a User-Agent at random from a predefined list for each page it fetches.

Add the following to settings.py:


DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'Crawler.comm.rotate_useragent.RotateUserAgentMiddleware': 400,
}

Note: "Crawler" is the name of your project, and it is also the name of a directory. The middleware code is below:


#!/usr/bin/python
# -*- coding: utf-8 -*-

import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Pick a random user-agent for this request
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)

    # The default user_agent_list covers Chrome on Windows, Linux,
    # Chrome OS, and macOS; for more user-agent strings, see
    # http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
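The rotation itself is plain `random.choice`, so it can be sanity-checked standalone. A sketch with a shortened stand-in list (the abbreviated UA strings are illustrative only):

import random

# Shortened stand-ins for the middleware's user_agent_list
user_agent_list = [
    "UA-chrome-windows",
    "UA-chrome-linux",
    "UA-chrome-mac",
]

# Each request draws one entry at random, so over many requests the
# crawler presents different browser signatures to the server
seen = {random.choice(user_agent_list) for _ in range(1000)}
print(sorted(seen))  # all three entries appear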
Tags: Python, Scrapy