Sina News Category Crawler

The complete project is available on GitHub (https://github.com/daleyzou/sinainfo.git).

1. Introduction

The crawler walks the Sina news navigation page and collects every top-level category, every subcategory, the article links under each subcategory, and the news content of each article page.

Demo screenshot:

[sinaData demo image not reproduced here]
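In place of the image, here is a rough sketch of the output the crawler produces: one folder per top-level category under ./Data/, one subfolder per subcategory, and one .txt file per article. The category and file names below are purely illustrative; the real ones depend on what Sina's navigation page lists at crawl time:

./Data/
    新闻/
        国内/
            news.sina.com.cn_c_nd_2018-05-01_doc-example123.txt
            news.sina.com.cn_c_nd_2018-05-01_doc-example456.txt
        国际/
    体育/
        NBA/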


2. Code

items.py

import scrapy


class SinainfoItem(scrapy.Item):
    # Title and URL of each top-level category
    parentTitle = scrapy.Field()
    parentUrls = scrapy.Field()

    # Title and URL of each subcategory
    subTitle = scrapy.Field()
    subUrls = scrapy.Field()

    # Directory where the subcategory's articles are stored
    subFilename = scrapy.Field()

    # Article links under the subcategory
    sonUrls = scrapy.Field()

    # Article headline and content
    head = scrapy.Field()
    content = scrapy.Field()
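For orientation, a finished item reaching the pipeline would carry values along the following lines; every value here is hypothetical and only shows which field holds what:

from sinainfo.items import SinainfoItem

item = SinainfoItem()
item['parentTitle'] = u'新闻'                       # top-level category title (hypothetical)
item['parentUrls'] = 'http://news.sina.com.cn/'     # top-level category URL (hypothetical)
item['subTitle'] = u'国内'                          # subcategory title (hypothetical)
item['subUrls'] = 'http://news.sina.com.cn/china/'  # subcategory URL (hypothetical)
item['subFilename'] = './Data/新闻/国内'             # storage directory built by the spider
item['sonUrls'] = 'http://news.sina.com.cn/c/2018-05-01/doc-example.shtml'  # article URL (hypothetical)
item['head'] = u'Example headline'                  # text of the article's <h1>
item['content'] = u'Example body text ...'          # concatenated <p> text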

spiders/sina.py (the spider)

# -*- coding: utf-8 -*-
import scrapy
import sys
import os

# noinspection PyUnresolvedReferences
from sinainfo.items import SinainfoItem

# Python 2 only: make implicit str/unicode conversions default to UTF-8
reload(sys)
sys.setdefaultencoding('utf-8')


class SinaSpider(scrapy.Spider):
    name = 'sina'
    allowed_domains = ['sina.com.cn']
    start_urls = ['http://news.sina.com.cn/guide/']

    def parse(self, response):
        items = []
        # Titles and URLs of all top-level categories
        parentTitle = response.xpath("//div[@id='tab01']/div/h3/a/text()").extract()
        parentUrls = response.xpath('//div[@id="tab01"]/div/h3/a/@href').extract()

        # URLs and titles of all subcategories
        subUrls = response.xpath('//div[@id="tab01"]/div/ul/li/a/@href').extract()
        subTitle = response.xpath('//div[@id="tab01"]/div/ul/li/a/text()').extract()

        # Iterate over all top-level categories
        for i in range(0, len(parentTitle)):
            # Directory path and name for this top-level category
            parentFilename = "./Data/" + parentTitle[i]
            # Create the directory if it does not exist yet
            if not os.path.exists(parentFilename):
                os.makedirs(parentFilename)

            # Iterate over all subcategories
            for j in range(0, len(subUrls)):
                item = SinainfoItem()
                # Save the top-level category's title and URL
                item['parentTitle'] = parentTitle[i]
                item['parentUrls'] = parentUrls[i]
                # A subcategory belongs to this category if its URL starts
                # with the category URL
                if_belong = subUrls[j].startswith(item['parentUrls'])
                # If it belongs here, place its storage directory under the
                # category directory
                if if_belong:
                    subFilename = parentFilename + '/' + subTitle[j]
                    # Create the directory if it does not exist yet
                    if not os.path.exists(subFilename):
                        os.makedirs(subFilename)
                    # Store the subcategory's URL, title and directory
                    item['subUrls'] = subUrls[j]
                    item['subTitle'] = subTitle[j]
                    item['subFilename'] = subFilename
                    items.append(item)

        # Request every subcategory URL; the Response, together with the item
        # passed in meta, is handled by the callback second_parse()
        for item in items:
            yield scrapy.Request(url=item['subUrls'],
                                 meta={'meta_1': item}, callback=self.second_parse)

    # For each subcategory page, recursively request the article links it contains
    def second_parse(self, response):
        # Retrieve the item attached to this Response
        meta_1 = response.meta['meta_1']
        # Collect every link on the subcategory page
        sonUrls = response.xpath('//a/@href').extract()

        items = []
        for i in range(0, len(sonUrls)):
            # Keep a link only if it ends with .shtml and starts with the
            # top-level category URL
            if_belong = sonUrls[i].endswith('.shtml') and sonUrls[i].startswith(
                meta_1['parentUrls'])
            # If it belongs here, copy the fields into one item so they travel together
            if if_belong:
                item = SinainfoItem()
                item['parentTitle'] = meta_1['parentTitle']
                item['parentUrls'] = meta_1['parentUrls']
                item['subTitle'] = meta_1['subTitle']
                item['subUrls'] = meta_1['subUrls']
                item['subFilename'] = meta_1['subFilename']
                item['sonUrls'] = sonUrls[i]
                items.append(item)

        for item in items:
            yield scrapy.Request(url=item['sonUrls'],
                                 meta={'meta_2': item}, callback=self.detail_parse)

    # Parse the article page: extract the headline and body text
    def detail_parse(self, response):
        item = response.meta['meta_2']
        content = ""
        head = response.xpath('//h1[@id="main_title"]/text()').extract_first()
        content_list = response.xpath('//div[@id="artibody"]/p/text()').extract()
        # Join the text of all <p> tags into one string
        for content_one in content_list:
            content += content_one
        item['head'] = head
        item['content'] = content
        # Hand the finished item to the pipeline
        yield item

pipelines.py

class SinainfoPipeline(object):
    def process_item(self, item, spider):
        sonUrls = item['sonUrls']

        # Use the middle part of the article URL as the file name,
        # replacing '/' with '_', and save it as a .txt file
        filename = sonUrls[7:-6].replace('/', '_')
        filename += ".txt"

        fp = open(item['subFilename'] + '/' + filename, 'w')
        fp.write(item['content'])
        fp.close()
        return item
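The slicing filename = sonUrls[7:-6] strips the leading 'http://' (7 characters) and the trailing '.shtml' (6 characters) from the article URL before the remaining slashes are replaced with underscores. A quick sketch with a made-up URL:

# Made-up article URL, only to illustrate the slicing
sonUrls = 'http://news.sina.com.cn/c/nd/2018-05-01/doc-example123.shtml'

filename = sonUrls[7:-6].replace('/', '_') + '.txt'
print(filename)   # news.sina.com.cn_c_nd_2018-05-01_doc-example123.txt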

settings.py

BOT_NAME = 'sinainfo'

SPIDER_MODULES = ['sinainfo.spiders']
NEWSPIDER_MODULE = 'sinainfo.spiders'

LOG_LEVEL = 'DEBUG'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'
DOWNLOAD_DELAY = 3
COOKIES_ENABLED = False

ITEM_PIPELINES = {
    'sinainfo.pipelines.SinainfoPipeline': 300,
}

3. Running

Method 1:

(1) Create a main.py file in the project root directory, used for debugging:
from scrapy import cmdline
cmdline.execute('scrapy crawl sina'.split())
(2) Run the program:
py2 main.py   (i.e. run main.py with a Python 2 interpreter)

Method 2:

From the command line:

(1) Change into the project's /sinainfo/sinainfo/spiders directory

(2) Run: scrapy crawl sina
