A Basic Python Web Crawler Implementation

Author: ACdreamers  Date: 2021-10-02 12:24:13

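First, let's look at the Python libraries for fetching web pages: urllib and urllib2. Both are part of the Python 2 standard library.

So what is the difference between urllib and urllib2?
urllib2 can be thought of as an extension of urllib. Its most obvious advantage is that urllib2.urlopen() accepts a Request object as its argument, which lets you control the headers of the HTTP request.
Prefer urllib2 for HTTP requests. However, urllib.urlretrieve() and the quote/unquote family of functions such as urllib.quote were never added to urllib2, so urllib is still needed at times.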
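
For readers on Python 3, where urllib2 no longer exists and its functionality lives in urllib.request and urllib.parse, the two points above can be sketched as follows. This is only an illustration: the helper names, the "q" query parameter, and the example.com URL are hypothetical and do not come from the original article.

```python
import urllib.parse
import urllib.request


def build_request(url, user_agent="Chrome"):
    """Create a Request carrying a custom User-Agent header (nothing is sent yet)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def build_search_url(base_url, keyword):
    """Percent-encode a keyword before appending it as a query parameter.

    The parameter name "q" is a placeholder chosen for this sketch.
    """
    return base_url + "?q=" + urllib.parse.quote(keyword)
```

Passing the object returned by build_request() to urllib.request.urlopen() would send the request with that header, which is the Python 3 counterpart of handing a Request to urllib2.urlopen().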

The argument passed to urllib.urlopen() must be a URL with a supported scheme, such as http, ftp, or file. For example:

urllib.urlopen('http://www.baidu.com')
urllib.urlopen('file:///D:/Python/Hello.py')
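
The file scheme can be tried without any network access. The sketch below assumes Python 3 (urllib.request.urlopen rather than the Python 2 call discussed here); read_local_file is a hypothetical helper name introduced only for this example.

```python
import pathlib
import urllib.request


def read_local_file(path):
    """Read a local file through urlopen by converting its path to a file: URL."""
    # as_uri() needs an absolute path, hence resolve().
    url = pathlib.Path(path).resolve().as_uri()
    with urllib.request.urlopen(url) as response:
        return url, response.read()
```

Building the URL with pathlib avoids hand-writing the slashes and drive letter, which is where file: URLs most often go wrong.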

As an example, suppose we want to download every GIF image on a web page. The Python code is as follows:

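# Python 2 code: in Python 3 these functions live in urllib.request.
import re
import urllib

def getHtml(url):
    # Fetch the page and return its raw HTML.
    page = urllib.urlopen(url)
    html = page.read()
    return html

def getImg(html):
    # Capture the value of every src attribute that ends in .gif.
    reg = r'src="(.*?\.gif)"'
    imgre = re.compile(reg)
    imgList = re.findall(imgre, html)
    print imgList
    cnt = 1
    for imgurl in imgList:
        # Save the images as 1.gif, 2.gif, ... in the current directory.
        urllib.urlretrieve(imgurl, '%s.gif' % cnt)
        cnt += 1

if __name__ == '__main__':
    html = getHtml('http://www.baidu.com')
    getImg(html)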
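
One practical wrinkle with the example above is that src attributes are often relative paths, which cannot be downloaded as they stand. A minimal Python 3 sketch of the extraction step, resolving each match against the page URL, might look like this. The function names are hypothetical, and the pattern is tightened to `[^"]+` (a change from the original regex) so that a match cannot run across several attributes.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlretrieve

# Match src="...gif" without crossing a closing quote.
GIF_SRC = re.compile(r'src="([^"]+\.gif)"')


def extract_gif_urls(html, base_url):
    """Return an absolute URL for every .gif src found in html."""
    return [urljoin(base_url, src) for src in GIF_SRC.findall(html)]


def download_all(urls):
    """Save each URL as 1.gif, 2.gif, ... in the current directory."""
    for index, url in enumerate(urls, start=1):
        urlretrieve(url, "%d.gif" % index)
```

Here html is expected to be a str, so bytes returned by read() would need decoding first.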

With this approach we can fetch a page and then extract the data we need from it.

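In practice, a crawler built on urllib is very inefficient, because each call blocks until the response arrives and pages are fetched one at a time. Next, let's look at Tornado Web Server.
Tornado is a lightweight, highly scalable web server written in Python that uses non-blocking I/O; the well-known FriendFeed site was built with it. Unlike most other mainstream web frameworks (chiefly Python ones), Tornado uses non-blocking I/O based on epoll, so it responds quickly, can handle thousands of concurrent connections, and is particularly well suited to real-time web services.

Fetching web pages with Tornado is comparatively efficient.
According to the Tornado website, you also need to install backports.ssl_match_hostname. The website is: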

http://www.tornadoweb.org/en/stable/

# Python 2 code.
import tornado.httpclient

def Fetch(url):
    http_header = {'User-Agent' : 'Chrome'}
    # Both timeouts are given in seconds.
    http_request = tornado.httpclient.HTTPRequest(url=url, method='GET', headers=http_header, connect_timeout=200, request_timeout=600)
    print 'Hello'
    # HTTPClient is Tornado's blocking client: fetch() returns only once the response has arrived.
    http_client = tornado.httpclient.HTTPClient()
    print 'Hello World'

    print 'Start downloading data...'
    http_response = http_client.fetch(http_request)
    print 'Finish downloading data...'

    # HTTP status code
    print http_response.code

    # Each field is a (name, value) pair.
    all_fields = http_response.headers.get_all()
    for field in all_fields:
        print field

    print http_response.body

if __name__ == '__main__':
    Fetch('http://www.baidu.com')
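
The HTTPClient used above is Tornado's blocking client, so on its own it fetches one page at a time. The non-blocking benefit shows up with AsyncHTTPClient, where several requests can be in flight at once. The following is a rough sketch under stated assumptions: Python 3 with a recent Tornado release that runs on the asyncio event loop; fetch_with_tornado and fetch_all are hypothetical names introduced for this example.

```python
import asyncio


async def fetch_with_tornado(url):
    """Fetch one URL with Tornado's non-blocking client."""
    # Imported here so the rest of the sketch loads without Tornado installed.
    from tornado.httpclient import AsyncHTTPClient

    response = await AsyncHTTPClient().fetch(url, headers={"User-Agent": "Chrome"})
    return response.code, len(response.body)


async def fetch_all(urls, fetch=fetch_with_tornado):
    """Start every fetch at once and wait for all of them to finish.

    Results come back in the same order as urls.
    """
    return await asyncio.gather(*(fetch(url) for url in urls))
```

Calling asyncio.run(fetch_all(['http://www.baidu.com'])) would return a list of (status code, body length) pairs. The fetch parameter exists only so that a different coroutine can be substituted, for example when trying the code without network access.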

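Common methods of the response object returned by urllib2.urlopen():

(1) info() returns the response headers

(2) getcode() returns the HTTP status code

(3) geturl() returns the URL that was actually retrieved, which can differ from the one passed in if a redirect was followed

(4) read() reads the content of the response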
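
These four methods are still available on the response returned by urllib.request.urlopen() in Python 3, although newer code usually reads the status attribute instead of calling getcode(). The sketch below exercises them against a throwaway server on the local machine so that no external site is needed. HelloHandler, start_local_server, and inspect_response are hypothetical names for this example only.

```python
import http.server
import threading
import urllib.request


class HelloHandler(http.server.BaseHTTPRequestHandler):
    """Answer every GET with a tiny fixed HTML page."""

    def do_GET(self):
        body = b"<html>hello</html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_local_server():
    """Serve HelloHandler on a free localhost port in a background thread."""
    server = http.server.HTTPServer(("127.0.0.1", 0), HelloHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def inspect_response(url):
    """Open url and collect what info(), getcode(), geturl() and read() return."""
    # An empty ProxyHandler bypasses any system proxy for the local server.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as response:
        return {
            "headers": response.info(),
            "status": response.getcode(),
            "final_url": response.geturl(),
            "body": response.read(),
        }
```

After server = start_local_server(), the URL to inspect is "http://127.0.0.1:%d/" % server.server_address[1]; call server.shutdown() when finished.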
