Crawler Day 6
About https://www.aqistudy.cn/html/city_detail.html (the China air quality online monitoring and analysis platform)
Working through its JS encryption and obfuscation
At first it looked as if the data for the date wasn't loading; that was only because it had already been loaded, so switch to a different date.
You can see that the data is loaded dynamically.
But the data is encrypted.
It is a POST request, and the request payload is encrypted as well.
Chrome's JS event listeners may not have been catching it, so switch to Firefox, which handles this better.
Event listeners
getData requests the data
Find the element in Chrome (at its definition) and you can see it calls getAQIData and getWeatherData.
function getWeatherData() {
    var method = 'GETCITYWEATHER';
    var param = {};
    param.city = city;
    param.type = type;
    param.startTime = startTime;
    param.endTime = endTime;
    getServerData(method, param, function(obj) {
        data = obj.data;
        if (data.total > 0) {
            dataTemp.splice(0, dataTemp.length);
            dataHumi.splice(0, dataHumi.length);
            dataWind.splice(0, dataWind.length);
            for (i = 0; i < data.rows.length; i++) {
                dataTemp.push({
                    x: converTimeFormat(data.rows[i].time).getTime(),
                    y: parseInt(data.rows[i].temp)
                });
                dataHumi.push({
                    x: converTimeFormat(data.rows[i].time).getTime(),
                    y: parseInt(data.rows[i].humi)
                });
                dataWind.push({
                    x: converTimeFormat(data.rows[i].time).getTime(),
                    y: parseInt(data.rows[i].wse),
                    d: data.rows[i].wd,
                    w: data.rows[i].tq,
                    marker: {symbol: getWindDirectionUrl(data.rows[i].wd)}
                });
            }
            state++;
            if (state >= 2) {
                showCurrentTab();
            }
        }
    }, 0.5);
}
Searching also locates getWeatherData, inside which getServerData is called with 4 arguments: method, param, function(obj), and 0.5.
The one in the middle is a callback function passed as an argument, and the 0.5); after getServerData shows this is a function call.
Go find where getServerData is defined.
It's gibberish again... encrypted? No, it's obfuscated.
Deobfuscate it on this site: http://www.bm8.com.cn/jsConfusion/
The param object (city, type, startTime, endTime, etc.) goes through MD5 encryption.
The URL sends a request to that .php endpoint and gets the dynamic data back.
Write a fake function of your own to call, in order to obtain the encrypted param request parameter.
function getPostParamCode(method, city, type, startTime, endTime) {
    var param = {};
    param.city = city;
    param.type = type;
    param.startTime = startTime;
    param.endTime = endTime;
    return getParam(method, param);
}
Then send a POST request with this parameter to crawl the content.
The code is as follows; execjs compiles the string into runnable JS code.
import requests
import execjs  # added for running the JS

node = execjs.get()

# Params
method = 'GETDETAIL'
city = '北京'
type = 'HOUR'
start_time = '2018-01-25 00:00:00'
end_time = '2018-01-25 23:00:00'

# Compile javascript
file = 'jsCode.js'
ctx = node.compile(open(file, encoding='utf-8').read())

# Get params
js = 'getPostParamCode("{0}","{1}","{2}","{3}","{4}")'.format(method, city, type, start_time, end_time)
params = ctx.eval(js)

url = 'https://www.aqistudy.cn/apinew/aqistudyapi.php'
data = {
    'd': params
}
page_text = requests.post(url, data=data).text

js = 'decodeData("{0}")'.format(page_text)
decrypted_data = ctx.eval(js)
# print(js)
print(decrypted_data)
Then decrypt the response with decodeData to get the final data.
params
tdgHOYxwKdDSgYXe+RLPzYCgLvrddahasI5XXklB4gVLYqab+XRPpMD/oSqnJ/aEmFwzVEUhLnPzRy03+X1BIzLvxQKwu4A3YsqR3OemYgNnHqPdBwvJlbxia99YeK+xhYnh+pXoudhbw1bJHi/H1n7o0PGXMb60NrW7f/Yd0Y+H4hNSDHVYyZnBsxJh6kkarSTzqNharSCvztTU3b95na/jKrVddatUdH5CVexOuKjxjdT0C1swsJBH7bdn3Sga7wXZ20GcktH39BwkMaScAudbM3yYSgDrJkCmV4i6ZZlU54+aR4MY7r7J9IpW1TSy93gC24xTvjiaa5Apo2c77/b7gcIiTvSc14c2AnLDI5oOfgIl4J2hRMFfqr4g4Lfuq1cRlOQg5c5uZrQjyIsIFicregIDGNu4fluOdSLC+Pg+OQDMIlqLzHtwgZ2MW0HuoL8o/copcJu1ClHTCk0y+g==
decrypted_data
{"success":true,"errcode":0,"errmsg":"success","result":{"success":true,"data":{"total":24,"rows":[{"time":"2018-01-25 00:00:00","aqi":"43","pm2_5":"29","pm10":"43","co":"0.7","no2":"56","o3":"23","so2":"6","rank":null},{"time":"2018-01-25 01:00:00","aqi":"25","pm2_5":"15","pm10":"25","co":"0.6","no2":"45","o3":"34","so2":"6","rank":null},{"time":"2018-01-25 02:00:00","aqi":"25","pm2_5":"9","pm10":"25","co":"0.5","no2":"38","o3":"39","so2":"5","rank":null},{"time":"2018-01-25 03:00:00","aqi":"22","pm2_5":"9","pm10":"22","co":"0.5","no2":"40","o3":"37","so2":"5","rank":null},{"time":"2018-01-25 04:00:00","aqi":"16","pm2_5":"8","pm10":"15","co":"0.5","no2":"32","o3":"45","so2":"4","rank":null},{"time":"2018-01-25 05:00:00","aqi":"16","pm2_5":"8","pm10":"14","co":"0.4","no2":"25","o3":"51","so2":"4","rank":null},{"time":"2018-01-25 06:00:00","aqi":"17","pm2_5":"7","pm10":"15","co":"0.4","no2":"24","o3":"53","so2":"4","rank":null},{"time":"2018-01-25 07:00:00","aqi":"18","pm2_5":"5","pm10":"18","co":"0.4","no2":"26","o3":"52","so2":"3","rank":null},{"time":"2018-01-25 08:00:00","aqi":"19","pm2_5":"5","pm10":"19","co":"0.4","no2":"27","o3":"51","so2":"4","rank":null},{"time":"2018-01-25 09:00:00","aqi":"20","pm2_5":"6","pm10":"20","co":"0.5","no2":"28","o3":"50","so2":"4","rank":null},{"time":"2018-01-25 10:00:00","aqi":"21","pm2_5":"6","pm10":"21","co":"0.4","no2":"22","o3":"58","so2":"4","rank":null},{"time":"2018-01-25 11:00:00","aqi":"27","pm2_5":"9","pm10":"27","co":"0.4","no2":"17","o3":"63","so2":"5","rank":null},{"time":"2018-01-25 12:00:00","aqi":"25","pm2_5":"9","pm10":"25","co":"0.4","no2":"15","o3":"66","so2":"4","rank":null},{"time":"2018-01-25 13:00:00","aqi":"22","pm2_5":"9","pm10":"21","co":"0.4","no2":"14","o3":"68","so2":"4","rank":null},{"time":"2018-01-25 14:00:00","aqi":"23","pm2_5":"6","pm10":"18","co":"0.3","no2":"13","o3":"71","so2":"4","rank":null},{"time":"2018-01-25 15:00:00","aqi":"23","pm2_5":"7","pm10":"17","co":"0.3","no2":"13","o3":"71","so2":"4","rank":null},{"time":"2018-01-25 16:00:00","aqi":"23","pm2_5":"7","pm10":"19","co":"0.4","no2":"14","o3":"71","so2":"4","rank":null},{"time":"2018-01-25 17:00:00","aqi":"22","pm2_5":"8","pm10":"19","co":"0.3","no2":"17","o3":"68","so2":"4","rank":null},{"time":"2018-01-25 18:00:00","aqi":"20","pm2_5":"6","pm10":"20","co":"0.4","no2":"23","o3":"62","so2":"3","rank":null},{"time":"2018-01-25 19:00:00","aqi":"24","pm2_5":"7","pm10":"24","co":"0.4","no2":"29","o3":"54","so2":"4","rank":null},{"time":"2018-01-25 20:00:00","aqi":"23","pm2_5":"8","pm10":"23","co":"0.4","no2":"31","o3":"48","so2":"4","rank":null},{"time":"2018-01-25 21:00:00","aqi":"25","pm2_5":"11","pm10":"25","co":"0.5","no2":"39","o3":"39","so2":"4","rank":null},{"time":"2018-01-25 22:00:00","aqi":"26","pm2_5":"12","pm10":"26","co":"0.5","no2":"45","o3":"32","so2":"5","rank":null},{"time":"2018-01-25 23:00:00","aqi":"31","pm2_5":"13","pm10":"31","co":"0.6","no2":"50","o3":"25","so2":"5","rank":null}]}}}
The engine judges from the data flow which component just ran and triggers the corresponding action.
Twisted provides the async support; the downloader does the downloading.
The scheduler's filter (really just a class built around a set, etc.) removes duplicate URLs (sketched below).
The pipeline handles storage, to a database or to a text file.
The spider does the crawling.
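The set-based dedup idea mentioned above is easy to sketch by hand; this toy class only illustrates the principle, it is not Scrapy's actual RFPDupeFilter:

class SimpleDupeFilter(object):
    """Toy version of the scheduler's filter: remember every URL in a set."""

    def __init__(self):
        self.seen = set()

    def request_seen(self, url):
        if url in self.seen:
            return True  # duplicate, the scheduler would drop it
        self.seen.add(url)
        return False


f = SimpleDupeFilter()
print(f.request_seen('https://movie.douban.com/top250'))  # False, first time seen
print(f.request_seen('https://movie.douban.com/top250'))  # True, filtered out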
Eat it while it's hot; it won't taste good once it's cold. (A parent's maxim.)
Start with the simple version:
Scrapy-based crawling with storage to a text file.
items.py (define the fields: the item that gets passed along)
first (write the spider)
settings.py (write the configuration)
pipelines.py (write the storage logic); the standard project layout is sketched below
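For reference, these files live in the layout that scrapy startproject generates (first.py is just the spider name used here; the rest is Scrapy's default naming):

BoosPro/
    scrapy.cfg
    BoosPro/
        items.py        # field definitions (the item that gets passed along)
        pipelines.py    # storage logic
        settings.py     # configuration
        spiders/
            first.py    # the spider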
Quite a few things went wrong:
1. open_spider() and close_spider() run before and after the crawl and must use exactly those names; the pipeline also has to be configured in settings.
2. fp.write() goes inside process_item.
3. In the spider (first), when filling in values inside the for loop, the item must be instantiated inside the loop (item = the Item class from items.py); instantiating it outside the loop can go wrong. Instantiating one per record is less efficient, but each record gets written out, and in the pipeline you read the values the same way; [] indexing is quite convenient.
4. The item class has to declare the fields as attributes before you can pass or assign them; otherwise it raises an error.
settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

LOG_LEVEL = 'ERROR'

ITEM_PIPELINES = {
    'BoosPro.pipelines.BoosproPipeline': 300,
}
items.py
import scrapy


class BoosproItem(scrapy.Item):
    title_text = scrapy.Field()
    socre_text = scrapy.Field()
The spider's .py file
# -*- coding: utf-8 -*-
import scrapy
from BoosPro.items import BoosproItem
# This was meant for Boss Zhipin, but that site blocked me; it works the same way as the Douban spider below


class BoosSpider(scrapy.Spider):
    name = 'douban'  # Boss Zhipin was blocked, switched to Douban
    start_urls = ['https://movie.douban.com/top250']
    url_next = 'https://movie.douban.com/top250?start=25&filter='

    def parse(self, response):
        title_text = response.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[1]/a/span[1]/text()').extract()
        # //*[@id="main"]/div/div[3]/ul/li[18]/div/div[1]/h3/a/span
        socre_text = response.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[2]/div/span[2]/text()').extract()
        print(title_text, socre_text)

        item = BoosproItem()
        item['title_text'] = title_text  # can a whole list be written in?
        item['socre_text'] = socre_text
        yield item
There were errors here:
# First error: the field wasn't declared -- NightprItem has no job_text field.
# Second error: this style works, but it keeps overwriting the same values: {'job_text': 'Python高級工程師/專家', 'salary_text': '30-50K'}
# for index, job in enumerate(job_text):
#     item['job_text'] = job
#     item['salary_text'] = salary_text[index]
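The fix for that overwrite is the same pattern used in the Douban spider further down: build a fresh item per row and yield each one, rather than reusing a single item. A rough sketch inside parse(), keeping the NightprItem / job_text / salary_text names from the note above:

# inside parse(), after extracting the job_text and salary_text lists
for index, job in enumerate(job_text):
    item = NightprItem()  # a fresh item each iteration, so nothing gets overwritten
    item['job_text'] = job
    item['salary_text'] = salary_text[index]
    yield item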
The pipeline: writing out to a file
import pymysql
from redis import Redis


class BoosproPipeline(object):
    def open_spider(self, spider):
        self.fp = open('book.text', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        print('*****************************************************', item['title_text'])
        for title in item['title_text']:
            title = title.replace('\n', '')
            print(title)
        title_text = item['title_text']
        socre_text = item['socre_text']
        self.title = {}
        for index, title in enumerate(title_text):
            self.title[title] = ': ' + socre_text[index]
        return item

    def close_spider(self, spider):
        self.fp.write(str(self.title))
items.py
title_text = scrapy.Field()
socre_text = scrapy.Field()
title_desc = scrapy.Field()
The spider's .py file
# -*- coding: utf-8 -*-
import scrapy
# Handles writing to the file, crawling two pages (manual pagination), and deep-crawling the description content
from doubanPro.items import DoubanproItem


class DianyingSpider(scrapy.Spider):
    name = 'dianying'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://movie.douban.com/top250']
    url_next = 'https://movie.douban.com/top250?start=25&filter='
    page = 1

    def parse(self, response):
        # print(response.text)
        title_text = response.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[1]/a/span[1]/text()').extract()
        # //*[@id="main"]/div/div[3]/ul/li[18]/div/div[1]/h3/a/span
        socre_text = response.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[2]/div/span[2]/text()').extract()
        detail_urls = response.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[1]/a/@href').extract()

        for index, text in enumerate(title_text):
            item = DoubanproItem()  # runs on every iteration; can it go outside the loop? No, it cannot
            item['title_text'] = text
            item['socre_text'] = socre_text[index]
            detail_url = detail_urls[index]
            print('data', item['title_text'], item['socre_text'])
            yield scrapy.Request(url=detail_url, callback=self.detail_parse, meta={'item': item})

        # request the second page
        if self.page < 2:
            self.page += 2
            yield scrapy.Request(url=self.url_next, callback=self.parse)  # second pass

    # deep crawl of the detail page
    def detail_parse(self, response):
        item = response.meta['item']
        # print('response', response)
        # //*[@id="link-report"]/span[1]/span
        title_desc = response.xpath('//*[@id="link-report"]//text()').extract()  # the result is always a list
        # print('title_desc', title_desc[2])
        item['title_desc'] = title_desc[2]
        yield item
First the addresses in start_urls go through parse(self, response); then you parse with xpath, fill in the item's values, and yield item. Under the hood this is essentially scrapy.Request(url, callback=parse)  # no need to write headers
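That "essentially scrapy.Request(url, callback=parse)" step can be made explicit by overriding start_requests inside the spider class; roughly (a simplified sketch -- Scrapy's real default also sets dont_filter=True):

# roughly what Scrapy does with start_urls behind the scenes
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url=url, callback=self.parse)  # no headers needed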
Or:
Inside parse, loop over a few URLs with an if check; because it calls itself recursively, it ends up behaving like a for loop:
yield scrapy.Request(url=deurl, callback=parse)  # recursion
Deep crawling:
In parse you fill the item, and inside the manually issued URL requests (the if branch) you add the deep crawl: follow a link and scrape the content of the opened page.
def parse():
    for ... in response.xpath(...).extract():
        item[''] = ...
        detail_url = parsed[index]  # or another tag, or an xpath that needs the pipe |
        yield scrapy.Request(url=detail_url, callback=deep_parse, meta={'item': item})
        # handed to the deep parsing function: the callback runs, stitches the newly parsed
        # fields onto the already parsed ones, then yields the item to the pipeline
    if page < 5:
        page += 1
        yield scrapy.Request(url=page_url, callback=parse)  # recursive call; the deep parse runs in between

def deep_parse():
    xpath = ...
    yield item
Mistakes made:
# Did this fix the problem of only one key-value pair ending up in the file? Yes, the changes above fixed it!
# print(title_text, socre_text)
# item = DoubanproItem()  # executed every iteration? Could it go outside the loop? No, it could not!
Pipeline storage
class DoubanproPipeline(object):
    datas = []

    def open_spider(self, spider):
        print('spider started')
        self.fp = open('book.text', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # the spider argument lets you interact with the spider; the item carries the values passed over
        print('***********3333333333333333****', item)
        data = item['title_text'] + ' ' + item['socre_text'] + ':' + item['title_desc'] + '\n'
        self.fp.write(data)
        return item

    # this one closes the file, it does not write -- I mixed those up; it also runs only once,
    # whereas process_item above runs once per item
    def close_spider(self, spider):
        self.fp.close()
        print('spider finished!')
settings.py
ITEM_PIPELINES = {
    'DatabasePro.pipelines.DatabaseproPipeline': 300,
    'DatabasePro.pipelines.RedisPipeline': 301,
}
pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql
from redis import Redis


class DatabaseproPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        # the port 3306 is an int, not a string
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', password='123', db='spider', charset='utf8')
        print('conn', self.conn)

    def process_item(self, item, spider):
        # the spider argument lets you interact with the spider; the item carries the values passed over
        self.cursor = self.conn.cursor()
        sql = 'insert into douban values ("%s","%s")' % (item['title_text'], item['socre_text'])
        try:
            self.cursor.execute(sql)
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    # this one closes things, it does not write -- I mixed those up; it runs only once,
    # whereas process_item above runs once per item
    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
        print('spider finished!')


class RedisPipeline(object):
    conn = None

    def open_spider(self, spider):
        # the port is an int, not a string
        self.conn = Redis(host='127.0.0.1', port=6379)
        print(spider.start_urls)  # data exchange with the spider

    def process_item(self, item, spider):
        print('item', item['title_text'], item['socre_text'], '(((((((((((((((((((((((((((')
        dic = {
            'title': item['title_text'],
            'socre': item['socre_text']
        }
        self.conn.lpush('movie', dic)
        return item
Mistakes made:
1. Ports: 3306 (MySQL), 6379 (Redis), 27017 (MongoDB); don't mix them up.
2. Forgot to enable the second pipeline in settings.
from redis import Redis

self.conn = Redis(host='127.0.0.1', port=6379)
self.conn.lpush('movie', dic)
The SQL statements
import pymysql

self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', password='123', db='spider', charset='utf8')
self.cursor = self.conn.cursor()
sql = 'insert into douban values ("%s","%s")' % (item['title_text'], item['socre_text'])
try:
    self.cursor.execute(sql)
    self.conn.commit()
except Exception as e:
    print(e)
    self.conn.rollback()
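One note on that INSERT: building the SQL with %-string formatting works, but letting pymysql substitute the parameters is the safer habit (same table and fields as above; this is an alternative, not what the code above does):

# the driver escapes the values itself when they are passed separately
sql = 'insert into douban values (%s, %s)'
try:
    self.cursor.execute(sql, (item['title_text'], item['socre_text']))
    self.conn.commit()
except Exception as e:
    print(e)
    self.conn.rollback()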