Scrapy (2): Scrapy shell, Tencent job listings, saving data to MySQL and MongoDB, and the meta dict passed between responses

Date: 2019-11-13

Tags: scrapy shell, Tencent job listings, mysql, mongodb, data storage, passing data between responses, meta dict. Category: Python


Scrapy (2): Scrapy shell

What is Scrapy shell

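The Scrapy shell is an interactive console where we can try out and debug code without starting a spider. It can also be used to test XPath or CSS expressions and see how they behave, which makes it easier to extract data from the pages we crawl.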

Scrapy's built-in selectors:

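xpath(): takes an XPath expression and returns a SelectorList of all nodes matching it
extract(): serializes the matched nodes to Unicode strings and returns them as a list; extract_first() returns only the first match
css(): takes a CSS expression and returns a SelectorList of all nodes matching it; the syntax is the same as in BeautifulSoup4
re(): extracts data using the given regular expression and returns a list of Unicode strings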

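What is a spider
The Spider class defines how a site (or a group of sites) is crawled, including the crawling actions (for example, whether to follow links) and how to extract structured data (items) from page content. In other words, a Spider is where you define the crawling behavior and the parsing logic for a particular page (or set of pages).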

Hands-on

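Suppose that when crawling Tencent's job listings we need the position name, position category, number of openings, work location, publish date, and requirements, and that we want to save them to both MongoDB and MySQL.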

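Since we need to match several fields, we can first try the parsing with scrapy shell url. Running the command drops into Python's interactive mode; if IPython is installed, the IPython environment is used in preference.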

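At that point the response has already been fetched by default, so calling response.xpath('//**') directly (with a real expression in place of the placeholder) parses the data, which is very convenient.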

This is very handy during project development.

Local/Scrapy/tencent/tencent/items.py: define the required fields

import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Position name
    work_name = scrapy.Field()
    # Position category
    work_category = scrapy.Field()
    # Number of openings
    work_number = scrapy.Field()
    # Work location
    work_address = scrapy.Field()
    # Publish date
    publish_time = scrapy.Field()
    # Job description and requirements
    work_content = scrapy.Field()

Local/Scrapy/tencent/tencent/settings.py: enable the pipeline

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'tencent.pipelines.TencentPipeline': 300,
}

Local/Scrapy/tencent/tencent/spiders/ten_spider.py: write the spider

# -*- coding: utf-8 -*-
import scrapy
from ..items import TencentItem

class TenSpiderSpider(scrapy.Spider):
    name = 'ten_spider'
    allowed_domains = ['tencent.com']
    start_urls = ['https://hr.tencent.com/position.php']
    base_url = 'https://hr.tencent.com/'


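    def parse(self, response):
        # Skip the first and last rows of the listing table
        tr_list = response.xpath('//table[@class="tablelist"]//tr')[1:-1]

        # extract_first() returns None instead of raising when nothing matches
        next_url = response.xpath('//a[@id="next"]/@href').extract_first()

        for tr in tr_list:
            items = TencentItem()
            detail_url = tr.xpath('./td[1]/a/@href').extract_first()
            if not detail_url:
                continue
            items['work_name'] = tr.xpath('./td[1]/a/text()').extract_first()
            items['work_category'] = tr.xpath('./td[2]/text()').extract_first()
            items['work_number'] = tr.xpath('./td[3]/text()').extract_first()
            items['work_address'] = tr.xpath('./td[4]/text()').extract_first()
            items['publish_time'] = tr.xpath('./td[5]/text()').extract_first()
            # Hand the partly filled item to the detail callback through meta
            yield scrapy.Request(url=self.base_url + detail_url,
                                 callback=self.detail_parse,
                                 meta={'items': items})

        # Follow the next page only when a usable link exists.
        # Assumption: a disabled "next" link may carry a javascript: href.
        if next_url and not next_url.startswith('javascript'):
            yield scrapy.Request(url=self.base_url + next_url, callback=self.parse)


    def detail_parse(self, response):
        # Retrieve the item that parse() attached to the request
        items = response.meta.get('items')
        # Join all list entries of the detail page into one string
        work_content = '|'.join(response.xpath('//ul[@class="squareli"]/li/text()').extract())
        items['work_content'] = work_content

        yield items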

Local/Scrapy/tencent/tencent/pipelines.py: set up storage. For MySQL the table must be created in the database in advance; MongoDB creates the collection by itself.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo
import pymysql


class TencentPipeline(object):
    def open_spider(self, spider):
        # Open the MySQL connection once, when the spider starts
        self.conn = pymysql.connect(
            host='127.0.0.1',
            port=3306,
            user='pywjh',
            password='pywjh',
            db='bole',
            charset='utf8'
        )
        self.cursor = self.conn.cursor()
        # MongoDB alternative: uncomment to store items in tencent_db.job
        # self.client = pymongo.MongoClient(host='127.0.0.1', port=27017)
        # self.db = self.client['tencent_db']
        # self.collection = self.db['job']

    def process_item(self, item, spider):
        # MongoDB alternative (insert_one replaces the deprecated insert)
        # self.collection.insert_one(dict(item))
        sql = "insert into job(name, category, number, address, time, content) values(%s, %s, %s, %s, %s, %s)"
        # Parameters are passed separately so the driver escapes them
        self.cursor.execute(sql,
                            [item['work_name'],
                             item['work_category'],
                             item['work_number'],
                             item['work_address'],
                             item['publish_time'],
                             item['work_content']
                             ])
        self.conn.commit()

        return item

    def close_spider(self, spider):
        # Release the database resources when the spider finishes
        self.cursor.close()
        self.conn.close()
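
Since the MySQL table has to exist before the crawl starts, it may help to see what creating it could look like. The sketch below is only an assumption: the column names come from the INSERT statement in the pipeline and the id column from the query output, but the column types and lengths are guesses, and create_job_table is a hypothetical helper that accepts any open DB-API connection such as the one returned by pymysql.connect.

```python
# Hypothetical schema: column names follow the pipeline's INSERT statement,
# while every type and length below is an assumption.
CREATE_JOB_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS job (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255),
    category VARCHAR(64),
    number INT,
    address VARCHAR(64),
    time VARCHAR(32),
    content TEXT
) DEFAULT CHARSET = utf8
"""


def create_job_table(conn):
    """Create the job table on an open DB-API connection (for example pymysql)."""
    cursor = conn.cursor()
    try:
        cursor.execute(CREATE_JOB_TABLE_SQL)
        conn.commit()
    finally:
        # Always release the cursor, even if the statement fails
        cursor.close()
```

It would be called once before running the spider, for example with the connection built from the same host, user, password and database settings the pipeline uses.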

Results:

Mongo

> show dbs
admin       (empty)
local       0.078GB
pywjh       0.078GB
tencent_db  0.078GB
test        0.078GB
> use tencent_db
switched to db tencent_db
> show collections
job
system.indexes
> db.job.find().pretty()
{
        "_id" : ObjectId("5be12ee28a6e9e0b3cad0d25"),
        "work_category" : "技術類",
        "publish_time" : "2018-11-06",
        "work_name" : "22989-高級網絡運維工程師",
        "work_number" : "2",
        "work_address" : "深圳",
        "work_content" : "1.負責騰訊雲機房網絡、VPC、負載均衡平臺的規劃，建設，不斷提高運維效率；|2.負責對網絡問題分可用性；|4.負責分析業務不合理、不高效地方，提出優化改進方案並推動實施。|1.本科及以上學歷；|2.3年以上相關工做經驗，熟精通路由協議（ospf，bgp），有大規模網絡規劃、運維經驗優先；|4.熟悉主流虛擬化技術原理(如kvm,xen,hyper-v,lxc)，有實際的ows操做系統，對系統性能相關問題有較深入理解；|6.擅長shell、python、perl中一種或幾種，熟練應用awk、sed、grep、strace、tcpdump、gdb等經常使用命令。"
}
................
................
................
{
        "_id" : ObjectId("5be12ee38a6e9e0b3cad0d38"),
        "work_category" : "市場類",
        "publish_time" : "2018-11-06",
        "work_name" : "25667-企點行業渠道銷售經理（北京）",
        "work_number" : "1",
        "work_address" : "北京",
        "work_content" : "負責企點產品的行業渠道拓展工做，包括ISV/SI的維護及跟進，達成制定的銷售業績指標；（主要是汽合做，快速實現拳頭優點和標杆效應；|按期拜訪渠道合做夥伴，充分了解客戶需求並積極跟進，制定合理方案，負責方案提示、談判務的市場化，幫助合做夥伴更加健康地發展。|本科及以上學歷，5年以上渠道或業務管理經驗，有saas、互聯網廣告行業工做經驗優造性思惟和營銷推廣能力。"
}

MySQL

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| bole               |
| mysql              |
| performance_schema |
| pywjh              |
| sys                |
+--------------------+
6 rows in set (0.08 sec)

mysql> use bole;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+----------------+
| Tables_in_bole |
+----------------+
| blzx           |
| job            |
+----------------+
2 rows in set (0.00 sec)

mysql> select * from job;
+----+------------------------------------------------------+------------------+--------+---------+------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id | name                                                 | category         | number | address | time       | content                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
+----+------------------------------------------------------+------------------+--------+---------+------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|  1 | 22989-高級網絡運維工程師                             | 技術類           |      2 | 深圳    | 2018-11-06 | 1.負載均衡平臺的規劃，建設，不斷提高運維效率；|2.負責對網絡問題分析解決，造成方法論，提高團隊技術能力；|3.負責經過技術手以上學歷；|2.3年以上相關工做經驗，熟悉TCP/IP協議，瞭解SDN相關技術，可以定位linux虛擬化環境下網絡異常；|3.熟悉主流的網技術原理(如kvm,xen,hyper-v,lxc)，有實際的對虛擬化疑難問題trouble shooting經驗；|5.精通linux，windows操做系統，對系統erl中一種或幾種，熟練應用awk、sed、grep、strace、tcpdump、gdb等經常使用命令。                                                                                                                                                                                                                                                                                                                           |
|  2 | 22989-運營產品中心web前端開發                        | 技術類           |      2 | 深圳    | 2018-11-06 | 負責工做；| 負責頁面相關的接入層的開發（nodejs）。| 負責前端框架的搭建，公共組件的開發和維護。|本科以上學歷，計算機相關專avascript，熟悉使用jQuery，react.js等框架及類庫；| 熟悉經常使用WEB開發調試工具；| 有使用grunt、gulp、webpack等工具進行前做態度端正，可以積極主動去工做，高效推進項目完成；| 具備良好的邏輯思惟及語言表達能溝通力，要能高效配合團隊成員，共同                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
.......................
.......................
.......................

| 67 | SA-騰訊社交廣告運維工程師(深圳)                      | 技術類           |      1 | 深圳    | 2018-11-06 | 負責存存儲組件的運維，包括平常擴容，縮容，故障處理，演習，部署容災跟進；|負責db，mysql的運維，包括db優化，故障處理，演習護產品的質量穩定，經過技術手段、流程制度提高組件的健壯性，可用性；|其餘和以上工做相關的專項事務。|3年以上工做經驗，精能相關問題有較深入理解；|精通shell編程，熟練應用awk、sed、grep、strace、tcudump、gdb等經常使用命令；|精通mysql，對mysql相is，memcache，mongodb，leveldb等運維經驗者優先。|熟練使用Linux/unix操做系統，熟悉主流虛擬化技術與開源組件，有devops實體系結構方面的知識；|熟悉集羣高可用性方案，有必定帶寬成本速度優化經驗；|熟悉互聯網產品的運維流程，有海量運營產品架構本科或以上學歷，工做細緻、善於思考，有很強的數據分析和問題解決能力。 |
+----+------------------------------------------------------+------------------+--------+---------+------------+-------------------------------------------------------------------------------------------------------------------------------+
67 rows in set (0.00 sec)