[Python] Getting Started with the Scrapy Crawler Framework

Date: 2019-11-20

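Overview:

　　This article is an introduction to the Scrapy framework and shows how to use it to crawl page information.

　　Example project: crawl the Tencent recruitment pages at https://hr.tencent.com/position.php?&start=

　　Development environment: Windows 10, Python 3.5, Scrapy 1.5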

1. Install the framework

　　> pip install scrapy

　　// If the installation fails, see https://blog.csdn.net/dapenghehe/article/details/51548079

　　// or download and install Twisted

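2. Create the project (scrapy startproject)

　　1. Before you start crawling, you must create a new Scrapy project. Change into the directory where you want to keep it and run the following command (tencentSpider is the project name):

　　　　> scrapy startproject tencentSpider

　　2. Enter the project directory (tencentSpider).

　　　　The project directory structure is as follows:

　　　　scrapy.cfg: the project's configuration file.

　　　　tencentSpider/: the project's Python module; your code is imported from here.

　　　　tencentSpider/spiders/: the directory that holds the spider code (spider files are mainly edited here).

　　　　tencentSpider/items.py: the project's item definitions (the target data).

　　　　tencentSpider/middlewares.py: the project's middlewares.

　　　　tencentSpider/pipelines.py: the project's pipelines.

　　　　tencentSpider/settings.py: the project's settings file.

　　At this point the project is basically set up. The next step is to write the spider code.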

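3. Define the target (tencentSpider/items.py)

　　Decide which URLs to crawl and which information you need, then define the fields to be scraped in items.py.

　　This project crawls https://hr.tencent.com/position.php?&start= and collects the job title, detail link, category, number of openings, location, and publication date of each posting.

　　1. Open items.py in the tencentSpider directory.

　　2. An Item defines structured data fields for storing the scraped data. It behaves like a Python dict but adds some extra protection against errors, such as rejecting fields that were never declared.

　　3. You define an Item by creating a class that subclasses scrapy.Item and declaring class attributes of type scrapy.Field (think of it as something like an ORM mapping).

　　4. Next, create a TencentspiderItem class to build the item model.

　　The code of items.py is as follows: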

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TencentspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Job title
    title = scrapy.Field()
    # Detail link
    link = scrapy.Field()
    # Category
    cate = scrapy.Field()
    # Number of openings
    num = scrapy.Field()
    # Location
    address = scrapy.Field()
    # Publication date
    date = scrapy.Field()
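To make the "extra protection" concrete, here is a minimal standard-library sketch of the idea. StrictRecord and try_set are hypothetical names used only for illustration; this is not Scrapy's implementation. The real scrapy.Item behaves similarly in one respect: assigning to a field that was never declared raises KeyError, which catches typos such as item['titel'].

```python
class StrictRecord(dict):
    """Dict that only accepts declared keys (hypothetical stand-in, not Scrapy code)."""

    # Same field names as TencentspiderItem
    fields = ("title", "link", "cate", "num", "address", "date")

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(
                "{} does not support field: {}".format(type(self).__name__, key)
            )
        super().__setitem__(key, value)


def try_set(record, key, value):
    """Return True if the assignment was accepted, False if it was rejected."""
    try:
        record[key] = value
        return True
    except KeyError:
        return False


record = StrictRecord()
accepted = try_set(record, "title", "Backend Engineer")  # declared field
rejected = try_set(record, "titel", "Backend Engineer")  # typo, never declared
```

A plain dict would silently store the misspelled key, and the mistake would only surface later as a missing title in the output.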

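4. Write the spider (spiders/tencent.py)

　　1. Crawl the data

　　　　① In the directory that contains scrapy.cfg, run the following command. It creates a spider named tencent in the tencentSpider/spiders directory and specifies the domain it is allowed to crawl (you can also create the file by hand; the basic code layout is shown below):

　　　　　　> scrapy genspider tencent "hr.tencent.com"

　　　　② Open tencent.py in the tencentSpider/spiders directory. The default code is as follows: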

# -*- coding: utf-8 -*-
import scrapy


class TencentSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['http://hr.tencent.com/']

    def parse(self, response):
        pass

　　　　③ Write the spider. The basic approach: build the paginated URLs, parse the content with XPath, and hand the items to the pipeline:

# -*- coding: utf-8 -*-
import scrapy
from tencentSpider.items import TencentspiderItem


class TencentSpider(scrapy.Spider):
    # Name of the spider
    name = 'tencent'
    allowed_domains = ["hr.tencent.com"]

    # Base URL; the page offset is appended to it
    url = "https://hr.tencent.com/position.php?&start="
    offset = 0

    # Entry URL for the first request
    start_urls = [url + str(offset)]

    def parse(self, response):
        info_ls = response.xpath('//tr[contains(@class, "odd")] | //tr[contains(@class, "even")]')

        # Site root, used to build absolute detail links
        origin_url = "https://hr.tencent.com/"

        for each in info_ls:
            # Create the item object
            item = TencentspiderItem()

            # extract_first() returns None (or the given default) when a cell
            # is empty, instead of raising IndexError like [0].extract()

            # Job title
            item['title'] = each.xpath("./td/a/text()").extract_first()
            # Detail link
            item['link'] = origin_url + each.xpath("./td/a/@href").extract_first(default="")
            # Category
            item['cate'] = each.xpath('./td[2]/text()').extract_first()
            # Number of openings
            item['num'] = each.xpath('./td[3]/text()').extract_first()
            # Location
            item['address'] = each.xpath('./td[4]/text()').extract_first()
            # Publication date
            item['date'] = each.xpath('./td[5]/text()').extract_first()

            # Hand the item over to the pipelines
            yield item

        # Follow the pagination; only 100 entries are crawled (offsets 0 to 90)
        if self.offset < 90:
            self.offset += 10

            # After one page has been processed, request the next page
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
        else:
            print("[ALL_END: spider finished]")
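The two ideas in the spider, building one URL per page and reading each row cell by cell, can be tried without Scrapy. The following is a minimal sketch under stated assumptions: the SAMPLE markup is hypothetical and only mirrors the structure the XPath expressions expect, page_urls and parse_rows are illustrative names, and xml.etree.ElementTree stands in for Scrapy's selectors (it matches the class attribute exactly rather than with contains()).

```python
import xml.etree.ElementTree as ET

BASE_URL = "https://hr.tencent.com/position.php?&start="


def page_urls(total=100, page_size=10):
    """Build one URL per page: start=0, 10, ..., total - page_size."""
    return [BASE_URL + str(offset) for offset in range(0, total, page_size)]


# Hypothetical, simplified markup; the real page is not reproduced here
SAMPLE = """
<table>
  <tr class="even">
    <td><a href="position_detail.php?id=1">Backend Engineer</a></td>
    <td>Technology</td><td>2</td><td>Shenzhen</td><td>2018-08-01</td>
  </tr>
  <tr class="odd">
    <td><a href="position_detail.php?id=2">Product Planner</a></td>
    <td></td><td>1</td><td>Beijing</td><td>2018-08-01</td>
  </tr>
</table>
"""


def parse_rows(markup):
    """Extract the six fields from each odd/even row; an empty cell becomes None."""
    rows = []
    for tr in ET.fromstring(markup.strip()).iter("tr"):
        if tr.get("class") not in ("odd", "even"):
            continue
        anchor = tr.find("./td/a")
        cells = [td.text for td in tr.findall("./td")]
        rows.append({
            "title": anchor.text,
            "link": "https://hr.tencent.com/" + anchor.get("href"),
            "cate": cells[1],
            "num": cells[2],
            "address": cells[3],
            "date": cells[4],
        })
    return rows


urls = page_urls()
rows = parse_rows(SAMPLE)
```

Ten pages of ten rows cover 100 entries, so the last offset is 90. The second sample row has an empty category cell, which is the case where indexing the first match of a text() query would fail.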

　　　　④ Modify the settings file (settings.py); excerpt:

　　　　　　The three main changes are:

# Whether to obey robots.txt; False for this project
ROBOTSTXT_OBEY = False


# Default request headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;)',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
}

# Enable the item pipeline
ITEM_PIPELINES = {
   'tencentSpider.pipelines.TencentspiderPipeline': 300,
}

　　　　⑤ Write the pipeline file pipelines.py:

　　　　　　The pipeline here simply saves the data to a file in JSON format:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json

class TencentspiderPipeline(object):
    def __init__(self):
        # Open the output file and start a JSON array
        self.save_path = open("res_info.json", "w", encoding="utf8")
        self.save_path.write("[")

    def process_item(self, item, spider):
        # Serialize each item and append it to the file, followed by a comma
        json_text = json.dumps(dict(item), ensure_ascii=False) + ", \n"
        self.save_path.write(json_text)

        return item

    def close_spider(self, spider):
        # The empty object absorbs the trailing comma so the array stays valid JSON
        self.save_path.write("{}]")
        self.save_path.close()

　　　　⑥ Run the spider:

　　　　　　> scrapy crawl tencent

　　　　⑦ View the results by opening the data file res_info.json.
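The result file can also be inspected from Python. The sketch below assumes the file is the JSON array written by the pipeline above, which ends with an empty {} placeholder; load_jobs is a hypothetical helper name, and the sample content is made up so the example can run without crawling anything.

```python
import json
import os
import tempfile


def load_jobs(path):
    """Load the JSON array and drop empty objects such as the trailing {}."""
    with open(path, encoding="utf8") as f:
        return [job for job in json.load(f) if job]


# Hypothetical sample in the same shape the pipeline produces
sample = '[{"title": "Backend Engineer", "address": "Shenzhen"}, \n{}]'

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "res_info.json")
    with open(demo_path, "w", encoding="utf8") as f:
        f.write(sample)
    jobs = load_jobs(demo_path)
```

After a real crawl, calling load_jobs("res_info.json") from the directory where the spider was run should return one dict per scraped posting.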