Python crawlers: JS encryption and obfuscation, and using the Scrapy framework

Date: 2020-05-07




1. JS encryption and JS obfuscation

JS encryption

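The JS source is encrypted so that the code is harder for attackers to steal. (The encryption and decryption routines usually both live in the front end.)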

http://www.bm8.com.cn/jsConfusion/ # online tool for reversing JS encryption/obfuscation

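JS obfuscation

# Purpose: shrinking the JS speeds up HTTP transfer; obfuscating it makes the code harder to read and so protects it
    · Merge multiple JS files into one
    · Strip spaces and line breaks from the JS code
    · Shorten the variable names in the JS
    · Remove comments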
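
To make the steps above concrete, here is a toy sketch in Python that merges several JS sources, strips comments, and collapses whitespace. The function name `naive_minify` and the sample snippets are made up for this illustration. It is deliberately naive: it does not rename variables (that needs a real JS parser), and it would mangle comment-like text inside string literals such as "http://...". Real minifiers and obfuscators work on a parsed syntax tree.

```python
import re


def naive_minify(*sources: str) -> str:
    """Toy minifier: merge sources, drop comments, collapse whitespace."""
    merged = "\n".join(sources)                              # merge multiple JS files
    merged = re.sub(r"/\*.*?\*/", "", merged, flags=re.S)    # remove /* block */ comments
    merged = re.sub(r"//[^\n]*", "", merged)                 # remove // line comments
    merged = re.sub(r"\s+", " ", merged)                     # collapse spaces and line breaks
    merged = re.sub(r"\s*([{}();,=])\s*", r"\1", merged)     # drop spaces around punctuation
    return merged.strip()


# Two hypothetical "files" to merge
file_a = """
// add two numbers
function add(first, second) {
    return first + second; /* sum */
}
"""
file_b = "var total = add(1, 2);\n"

print(naive_minify(file_a, file_b))
```

When crawling, the reverse direction is what matters: a beautifier restores the whitespace, but shortened variable names and removed comments cannot be recovered.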

2. The Scrapy crawler framework

Overview of Scrapy's features

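- High-performance network requests
- High-performance data parsing
- High-performance persistent storage
- Deep crawling
- Full-site crawling
- Distributed crawling
- Middleware
- Passing parameters between requests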

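Download and installation

- Installing the environment:
    - macOS/Linux: pip install scrapy
    - Windows:
        - pip install wheel
        - Download the Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
        - In the download directory, run pip install Twisted-17.1.0-cp35-cp35m-win_amd64.whl (pick the wheel that matches your Python version and architecture; this one is for CPython 3.5, 64-bit)
        - pip install pywin32
        - pip install scrapy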

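Basic usage

Creating a project

- Create a new project: scrapy startproject ProName
    - Directory structure:
        - spiders (package): initially empty; spider files go here
        - settings: the configuration file, where you typically set
            - ROBOTSTXT_OBEY = False, so that robots.txt is not obeyed
            - USER_AGENT, to spoof the User-Agent
            - LOG_LEVEL, to choose the log level
- cd ProName: enter the project directory
- Create a spider file in the spiders folder
    - scrapy genspider spiderName www.xxx.com
- Write the code: most of it goes in the spider file
- Run the project: scrapy crawl spiderName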
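
The three settings mentioned in the list above correspond to the following names in settings.py. This is a minimal sketch: the User-Agent string and the choice of the ERROR level are placeholder assumptions, not values from the original post.

```python
# settings.py (fragment)

# Do not obey robots.txt (the generated project template sets this to True)
ROBOTSTXT_OBEY = False

# Send a browser-like User-Agent instead of Scrapy's default one (placeholder value)
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

# Only log errors so that the spider's own print output stays readable
LOG_LEVEL = "ERROR"
```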

Scrapy directory structure

- Project name
	- Project package folder of the same name
		- spiders folder
		- __init__.py
		- items.py
		- middlewares.py
		- pipelines.py
		- settings.py
	- scrapy.cfg

Scrapy data parsing

# Scrapy can parse responses with XPath
# extract_first() returns the first matched string (index 0)
# extract() returns all matched strings as a list
content = div.xpath('.//div[@class="link-detail"]/a/text()').extract_first()

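Scrapy data storage

# Terminal-based persistent storage
	- Can only write the return value of the parse method to a local file (the file suffix selects the output format, e.g. .json, .csv or .xml)
	- scrapy crawl spiderName -o filePath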


    
    
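# Pipeline-based persistent storage (**)
	- Declare the fields to store in items (Field is a catch-all type that can hold most kinds of data)
	- Write the pipeline class in the pipelines file and register it under 'ITEM_PIPELINES' in the settings file

	- Workflow
	    - 1. Parse the data in the spider file
	    - 2. Define the matching attributes in the item class
	    - 3. Store the parsed data in an item object
	    - 4. Submit the item object to the pipeline (yield item)
	    - 5. The pipeline class's process_item method receives the item and can then persist it in any form
	    - 6. Enable the pipeline in the settings file

	- One pipeline class corresponds to one storage backend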
    
	## Two approaches
    	# Pipeline that stores to a local file
class ChoutiproPipeline(object):
    fp = None

    # Called once, when the spider is opened
    def open_spider(self, spider):
        print('Spider started')
        self.fp = open('./本地持久化存儲文件.txt', 'w', encoding='utf-8')

    # Called for every item the spider yields
    def process_item(self, item, spider):
        author = item['author']
        content = item['content']

        self.fp.write(author + ':' + content + '\n')

        # Return the item so that the next pipeline class receives it
        return item

    # Called once, when the spider is closed
    def close_spider(self, spider):
        print('Spider finished')
        self.fp.close()
        
        
        
        # Pipeline that stores to MySQL
import pymysql


class MySqlChoutiproPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        print('Opening database connection')
        # Create the connection and one cursor that is reused for every item
        self.conn = pymysql.Connection(host='127.0.0.1', port=3306, db='scrapy_db1', user='root', password='123', charset='utf8')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        author = item['author']
        content = item['content']

        # Pass the values as query parameters so that the driver escapes them
        sql = 'insert into chouti values (%s, %s)'

        try:
            self.cursor.execute(sql, (author, content))
            self.conn.commit()  # commit

        except Exception as e:
            print(e)
            self.conn.rollback()  # roll back

        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
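
To try the database pipeline pattern without a running MySQL server, the same open/process/close lifecycle can be sketched with the standard-library sqlite3 module. This swaps MySQL for SQLite purely for illustration; the class name `SqliteChoutiproPipeline`, the in-memory database, and the sample item are assumptions. The values are passed as query parameters rather than formatted into the SQL string, so quotes inside scraped text cannot break the statement.

```python
import sqlite3


class SqliteChoutiproPipeline:
    """Same lifecycle as the MySQL pipeline, backed by stdlib sqlite3."""

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once: open the connection and make sure the table exists
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS chouti (author TEXT, content TEXT)"
        )

    def process_item(self, item, spider):
        try:
            # "?" placeholders let the driver handle quoting and escaping
            self.conn.execute(
                "INSERT INTO chouti VALUES (?, ?)",
                (item["author"], item["content"]),
            )
            self.conn.commit()
        except sqlite3.Error as exc:
            print(exc)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        # Called once: release the connection
        self.conn.close()


# Drive the pipeline by hand, the way Scrapy would during a crawl
pipeline = SqliteChoutiproPipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({"author": 'O"Brien', "content": "hello"}, spider=None)
rows = pipeline.conn.execute("SELECT author, content FROM chouti").fetchall()
pipeline.close_spider(spider=None)
print(rows)
```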