(Anti-anti-crawler) Font-based anti-scraping on X車之家's car model configuration page

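  To be honest, I have written quite a few crawlers lately, and scraping sites that have no anti-scraping measures, or whose defenses stop at checking cookies, UA, and referer, is not much fun. A while ago I came across a column on Zhihu, an anti-anti-crawler series, and got hooked. So far I have reproduced the code for every entry except the second by following the author's approach, and I learned a great deal. Python crawling is an active topic on Zhihu: experienced developers often share their experience and tutorials there, and reading, studying, discussing, and reproducing their code is both convenient and rewarding.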

  Enough preamble; on to today's topic: a crawler for the car parameter configuration pages on 汽車之家 (Autohome).

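  When I first finished reading that article I was baffled. What baffled me was not parsing the JS to get the real data, but this one sentence:

   "Just replace by index"? I carefully compared the data obtained through the JS against the web page and could not find anything in the returned data that resembled the page. In the end I searched around online and finally found an article by another blogger. After running that code I got the data, to my great delight, and finally understood what "just replace by index" meant.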
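To make "replace by index" concrete, here is a minimal sketch of the substitution step on its own. It assumes, as the regexes in the script below imply, that the decoded JS yields CSS rules of the form `.<class>::before { content:"<text>" }` joined by `#`, and that the page data contains empty `<span class='<class>'></span>` placeholders. The helper names `build_class_map` and `fill_spans`, the class names, and the sample strings are all made up for illustration.

```python
import re


def build_class_map(rules_text):
    """Map each CSS class name to the text in its ::before content rule."""
    pattern = r'\.([\w-]+)::before\s*\{\s*content:\s*"(.*?)"\s*\}'
    return dict(re.findall(pattern, rules_text))


def fill_spans(text, class_map):
    """Replace every empty <span class='x'></span> with class_map['x']."""
    def repl(match):
        # Leave the placeholder untouched when the class is unknown.
        return class_map.get(match.group(1), match.group(0))
    return re.sub(r"<span class='([\w-]+)'></span>", repl, text)


# Hypothetical sample data, not taken from the real page
rules = ('.hs_kw0_cfg::before { content:"Leather" }'
         '#.hs_kw1_cfg::before { content:"seats" }')
raw = "<span class='hs_kw0_cfg'></span> <span class='hs_kw1_cfg'></span>: standard"
print(fill_spans(raw, build_class_map(rules)))
```

Building the class-to-text map once and substituting in a single `re.sub` pass avoids searching the rules text again for every span.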

  Below is the code I wrote. Note that it is only meant to run end to end and fetch the data; I have not wrapped it into any structured form.

import re
import os
import json

import requests
import xlwt
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By


url = "https://car.autohome.com.cn/config/series/2357.html#pvareaid=3454437"
headers = {
    "User-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/71.0.3578.98 Safari/537.36"
}
# Minimal fake DOM for running the page's JS: every inserted CSS rule is collected in `rules`
DOM = ("var rules = '2';"
       "var document = {};"
       "function getRules(){return rules}"
       "document.createElement = function() {"
       "      return {"
       "              sheet: {"
       "                      insertRule: function(rule, i) {"
       "                              if (rules.length == 0) {"
       "                                      rules = rule;"
       "                              } else {"
       "                                      rules = rules + '#' + rule;"
       "                              }"
       "                      }"
       "              }"
       "      }"
       "};"
       "document.querySelectorAll = function() {"
       "      return {};"
       "};"
       "document.head = {};"
       "document.head.appendChild = function() {};"

       "var window = {};"
       "window.decodeURIComponent = decodeURIComponent;")

response = requests.get(url=url, headers=headers, timeout=10)
# print(response.content.decode("utf-8"))
html = response.content.decode("utf-8")

# Match the car data embedded in the page
car_info = ""
config = re.search("var config = (.*?)};", html, re.S)   # vehicle parameters
option = re.search("var option = (.*?)};", html, re.S)   # active/passive safety equipment
bag = re.search("var bag = (.*?)};", html, re.S)         # optional packages
# Raw string so the regex backslashes are not treated as string escapes
js_list = re.findall(r'(\(function\([a-zA-Z]{2}.*?_\).*?\(document\);)', html)
# Concatenate all of the model data into car_info
if config and option and bag:
    car_info = car_info + config.group(0) + option.group(0) + bag.group(0)
# print(car_info)

# Wrap the JS in a local HTML file and run it with selenium to get true_text
for item in js_list:
    DOM = DOM + item
html_type = "<html><meta http-equiv='Content-Type' content='text/html; charset=utf-8' /><head></head><body>    <script type='text/javascript'>"
js = html_type + DOM + " document.write(rules)</script></body></html>"    # JS string to execute
os.makedirs("D:\\test11", exist_ok=True)    # do not fail when the folder already exists
with open("D:\\test11\\asd.html", "w", encoding="utf-8") as f:
    f.write(js)
# Selenium 4 API; raw string so the backslashes in the driver path stay literal
browser = webdriver.Chrome(service=Service(r"D:\chromedrive\chromedriver.exe"))
browser.get("file:///D:/test11/asd.html")
true_text = browser.find_element(By.TAG_NAME, 'body').text
browser.quit()
# print(true_text)

span_list = re.findall("<span(.*?)></span>", car_info)    # all span tags in the car data

# Replace each span with the text that true_text assigns to its class
for span in span_list:
    info = re.search("'(.*?)'", span)
    if info:
        class_info = re.escape(info.group(1)) + "::before { content:(.*?)}"
        content = re.search(class_info, true_text).group(1)     # the rule's content for this class
        car_info = car_info.replace("<span class='" + info.group(1) + "'></span>",
                                    re.search("\"(.*?)\"", content).group(1))
# print(car_info)

# Persistence
car_item = {}
config = re.search("var config = (.*?);", car_info).group(1)
option = re.search("var option = (.*?);var", car_info).group(1)
bag = re.search("var bag = (.*?);", car_info).group(1)

config_re = json.loads(config)
option_re = json.loads(option)
bag_re = json.loads(bag)

config_item = config_re['result']['paramtypeitems'][0]['paramitems']
option_item = option_re['result']['configtypeitems'][0]['configitems']
bag_item = bag_re['result']['bagtypeitems'][0]['bagitems']

for car in config_item:
    car_item[car['name']] = []
    for value in car['valueitems']:
        car_item[car['name']].append(value['value'])

for car in option_item:
    car_item[car['name']] = []
    for value in car['valueitems']:
        car_item[car['name']].append(value['value'])

for car in bag_item[0]['valueitems']:
    car_item[car['name']] = []
    car_item[car['name']].append(car['bagid'])
    car_item[car['name']].append(car['pricedesc'])
    car_item[car['name']].append(car['description'])

# Build the spreadsheet
workbook = xlwt.Workbook(encoding='utf-8')  # create a workbook
worksheet = workbook.add_sheet('汽車之家')  # create a sheet
cols = 0
start_row = 0

for co in car_item:
    cols = cols + 1
    worksheet.write(start_row, cols, co)  # write the configuration names into row 0 (the first row)

# Number of model variants; the key must match the page's own spelling of this field
end_row_num = start_row + len(car_item['車型名稱'])
for row in range(start_row, end_row_num):
    col_num = 0  # column index
    row += 1
    for col in car_item:
        col_num = col_num + 1
        values = car_item[col]
        # Package columns hold only three entries, so pad short lists with ""
        cell = str(values[row - 1]) if row - 1 < len(values) else ""
        worksheet.write(row, col_num, cell)

workbook.save('d:\\test.xls')

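   OK, to stress it once more: this code is only meant to run end to end and fetch the data; it has not been organized into a structured, reusable form.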
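One fragile spot in a quick script like this is cutting the `var config = {...};` objects out of the page with non-greedy regexes, which can stop early if a value happens to contain `;` or `};`. As a possible refinement (not what the script above does), the sketch below lets the standard library's `json.JSONDecoder.raw_decode` decide where the object ends. The helper name `extract_js_object` and the sample string are hypothetical, and the sketch assumes the assigned value is valid JSON.

```python
import json
import re


def extract_js_object(html, var_name):
    """Return the JSON value assigned to `var <var_name> = ...` in html, or None."""
    match = re.search(r"var\s+" + re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here the trailing ';' and more script).
    obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    return obj


# Hypothetical sample: note the ';' and '};' inside a string value
sample = ('var config = {"result": {"note": "a;b};c", "items": [1, 2]}};'
          'var option = {"result": {}};')
print(extract_js_object(sample, "config"))
print(extract_js_object(sample, "option"))
```

Because the JSON parser tracks strings and nesting itself, punctuation inside values no longer affects where the match ends.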

  Finally, thanks to this blogger: https://www.cnblogs.com/kangz/p/10011348.html

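Following the Zhihu column over this period has made my own shortcomings obvious. The lesson: stay calm, analyze, and think it through. If a site has encrypted its API, it can be decrypted.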
