An Example of Crawling Web Content with Scrapy under Python

Author: 止鱼  Time: 2022-05-29 13:43:24

I spent the past week learning Python and Scrapy, and went from zero to a complete, working web crawler. The process was painful at times, but I enjoyed it; that is what doing technical work is like.

First came installing Python, which has plenty of pitfalls to climb out of one by one. Since I am on Windows (no budget for a Mac), I ran into all kinds of problems during installation, mostly missing dependencies of one sort or another.

I won't repeat the installation tutorials here. One note: if you hit an "ERROR: Microsoft Visual C/C++ is required" style message during installation, it usually means the Windows build toolchain is missing. Most tutorials online tell you to install Visual Studio for this, which is overkill; installing the Windows SDK alone is enough.
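For reference, once the build tools are in place, Scrapy itself installs with a single pip command (a minimal sketch; pin versions as you see fit):

pip install scrapy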

My crawler code is pasted below.

The spider main program:


# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from zjf.FsmzItems import FsmzItem
from scrapy.selector import Selector

# 圈圈: the emotion & lifestyle channel
class MySpider(scrapy.Spider):
    # spider name
    name = "MySpider"
    # allowed domains
    allowed_domains = ["nvsheng.com"]
    # start URLs (filled in from the command line, see __init__)
    start_urls = []
    # flag: pagination links are only collected on the first response
    x = 0

    # parse callback
    def parse(self, response):
        item = FsmzItem()
        sel = Selector(response)
        item['title'] = sel.xpath('//h1/text()').extract()
        item['text'] = sel.xpath('//*[@class="content"]/p/text()').extract()
        item['imags'] = sel.xpath('//div[@id="content"]/p/a/img/@src|//div[@id="content"]/p/img/@src').extract()
        # on the first page only, queue up the remaining pages
        if MySpider.x == 0:
            page_list = self.getUrl(response)
            for page_single in page_list:
                yield Request(page_single)
        MySpider.x += 1
        yield item

    # init: pass the start URL in dynamically
    # command-line usage: scrapy crawl MySpider -a start_url="http://some_url"
    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [kwargs.get('start_url')]

    # collect the pagination links (every pager link except the "next" button)
    def getUrl(self, response):
        url_list = []
        select = Selector(response)
        page_list_tmp = select.xpath('//div[@class="viewnewpages"]/a[not(@class="next")]/@href').extract()
        for page_tmp in page_list_tmp:
            if page_tmp not in url_list:
                url_list.append("http://www.nvsheng.com/emotion/px/" + page_tmp)
        return url_list
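As a quick sanity check on the XPath expressions above, you can run Scrapy's Selector against an inline HTML string (a minimal sketch with made-up HTML, not the real page markup):

from scrapy.selector import Selector

html = '<h1>A title</h1><div class="content"><p>First paragraph.</p><p>Second.</p></div>'
sel = Selector(text=html)
print(sel.xpath('//h1/text()').extract())                     # ['A title']
print(sel.xpath('//*[@class="content"]/p/text()').extract())  # ['First paragraph.', 'Second.']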

The pipeline class:


# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from zjf import settings
import json, random
import urllib.request
import requests
from requests_toolbelt.multipart.encoder import MultipartEncoder

class MyPipeline(object):
    flag = 1
    post_title = ''
    post_text = []
    post_text_imageUrl_list = []
    cs = []
    user_id = ''

    def __init__(self):
        # pick a random user_id to post under
        MyPipeline.user_id = MyPipeline.getRandomUser('37619,18441390,18441391')

    # process the data
    def process_item(self, item, spider):
        # the random user_id chosen in __init__, used to simulate a post
        user_id = MyPipeline.user_id
        # join the body text into a single string
        text = item['text']
        text_str_tmp = ""
        for s in text:
            text_str_tmp = text_str_tmp + s
        # print(text_str_tmp)
        # take the title from the first page only
        if MyPipeline.flag == 1:
            title = item['title']
            MyPipeline.post_title = MyPipeline.post_title + title[0]
        # download each image, upload it, and keep the returned URL and size
        text_insert_pic = ''
        text_insert_pic_w = ''
        text_insert_pic_h = ''
        for imag_url in item['imags']:
            img_name = imag_url.replace('/', '').replace('.', '').replace('|', '').replace(':', '')
            pic_dir = settings.IMAGES_STORE + '%s.jpg' % (img_name)
            urllib.request.urlretrieve(imag_url, pic_dir)
            # upload the image; the API returns JSON
            upload_img_result = MyPipeline.uploadImage(pic_dir, 'image/jpeg')
            # pull the stored image path out of the JSON
            text_insert_pic = upload_img_result['result']['image_url']
            text_insert_pic_w = upload_img_result['result']['w']
            text_insert_pic_h = upload_img_result['result']['h']
        # assemble the JSON fragment for this page
        if MyPipeline.flag == 1:
            cs_json = {"c": text_str_tmp, "i": "", "w": text_insert_pic_w, "h": text_insert_pic_h}
        else:
            cs_json = {"c": text_str_tmp, "i": text_insert_pic, "w": text_insert_pic_w, "h": text_insert_pic_h}
        MyPipeline.cs.append(cs_json)
        MyPipeline.flag += 1
        return item

    # called when the spider opens
    def open_spider(self, spider):
        pass

    # called when the spider closes: submit the assembled post
    def close_spider(self, spider):
        strcs = json.dumps(MyPipeline.cs)
        jsonData = {"apisign": "99ea3eda4b45549162c4a741d58baa60", "user_id": MyPipeline.user_id, "gid": 30, "t": MyPipeline.post_title, "cs": strcs}
        MyPipeline.uploadPost(jsonData)

    # upload one image
    @staticmethod
    def uploadImage(img_path, content_type):
        """uploadImage function"""
        #UPLOAD_IMG_URL = "http://api.qa.douguo.net/robot/uploadpostimage"
        UPLOAD_IMG_URL = "http://api.douguo.net/robot/uploadpostimage"
        m = MultipartEncoder(
            fields={'user_id': MyPipeline.user_id,
                    'apisign': '99ea3eda4b45549162c4a741d58baa60',
                    'image': ('filename', open(img_path, 'rb'), 'image/jpeg')}
        )
        r = requests.post(UPLOAD_IMG_URL, data=m, headers={'Content-Type': m.content_type})
        return r.json()

    # create the post from the collected fragments
    @staticmethod
    def uploadPost(jsonData):
        CREATE_POST_URL = "http://api.douguo.net/robot/uploadimagespost"
        reqPost = requests.post(CREATE_POST_URL, data=jsonData)

    # pick a random user id from a comma-separated string
    @staticmethod
    def getRandomUser(userStr):
        user_list = str(userStr).split(',')
        return random.choice(user_list)
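None of this runs unless the pipeline is registered in the project's settings.py, which is also where the IMAGES_STORE directory used above comes from. A minimal sketch (the module path zjf.pipelines is my assumption; adjust it to wherever MyPipeline actually lives):

# settings.py (excerpt)
BOT_NAME = 'zjf'

# directory the pipeline downloads images into before uploading them
IMAGES_STORE = 'D:\\pics\\'

# enable the pipeline; lower numbers run earlier (0-1000)
ITEM_PIPELINES = {
    'zjf.pipelines.MyPipeline': 300,  # assumed module path
}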

The Items class, which defines the saved fields:


# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class FsmzItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    #tutor = scrapy.Field()
    #strongText = scrapy.Field()
    text = scrapy.Field()
    imags = scrapy.Field()
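Scrapy Items behave like dictionaries, which is why the spider assigns to item['title'] and the pipeline reads it back with the same keys. A quick sketch:

from zjf.FsmzItems import FsmzItem

item = FsmzItem()
item['title'] = ['An example title']
print(item['title'][0])  # An example title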

Then type the following at the command line (note the full http:// scheme: the value is dropped straight into start_urls, so a bare domain will not make a valid request):

scrapy crawl MySpider -a start_url="http://www.aaa.com"

This crawls the content under aaa.com.
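If you prefer launching the spider from a plain Python script instead of the scrapy CLI, CrawlerProcess forwards the same keyword arguments to the spider (a minimal sketch; the import path zjf.spiders.MySpider is hypothetical):

from scrapy.crawler import CrawlerProcess
from zjf.spiders.MySpider import MySpider  # hypothetical module path

process = CrawlerProcess()
process.crawl(MySpider, start_url="http://www.aaa.com")
process.start()  # blocks until the crawl finishes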

Source: https://blog.csdn.net/qq_31573519/article/details/66975211

Tags: Python, Scrapy, crawling, web pages