To be honest, I've been writing quite a few crawlers lately, and scraping sites with no anti-crawling measures, or whose anti-crawling stops at checking cookies, the User-Agent, or the referer, really isn't much fun anymore. A while ago I came across a column on Zhihu, an anti-anti-crawler series, and promptly fell down that rabbit hole; so far I have reproduced the code for every article except the second one by following the author's approach, and I've learned a lot from it. Python crawlers are a genuinely active topic on Zhihu: experienced developers regularly share their experience and tutorials there, and reading, discussing, and reproducing their code is both convenient and rewarding.
Enough rambling, let's get to today's topic: a crawler for the car specification and configuration data on Autohome (汽車之家).
Honestly, when I first finished reading that article I was completely lost, and not because of parsing the JS to get the real data, but because of this one sentence:
"Just replace according to the index"? I carefully compared the data obtained through the JS against the page itself and could not find any resemblance between the two. In the end I searched all over the web and finally found an article by another blogger; after running his code I successfully got the data, was very pleased, and at last understood what "just replace according to the index" actually meant.
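To make that sentence concrete, here is a minimal, self-contained illustration. The class name and the character below are made up by me (the real class names are randomly generated per response), but the shape is the same: the page replaces individual characters in the spec data with empty span placeholders, and the obfuscated JS builds ::before rules that carry the real characters.

import re

# Hypothetical example: 'hs_kw16_abcd' and the content value are invented here.
obfuscated = "軸距<span class='hs_kw16_abcd'></span>700"      # placeholder span inside the spec data
rule = ".hs_kw16_abcd::before { content:\"2\" }"              # matching rule produced by the decoded JS

# "Replace according to the index" simply means: for every placeholder span,
# look up the ::before rule with the same class name and substitute its content back in.
real_char = re.search(r'content:"(.*?)"', rule).group(1)
print(obfuscated.replace("<span class='hs_kw16_abcd'></span>", real_char))
# -> 軸距2700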
Below is the code I wrote. Of course, this code is only meant to run end to end and pull the data; it has not been structured or wrapped up in any way.
import re
import os
import json

import requests
import xlwt
from selenium import webdriver


url = "https://car.autohome.com.cn/config/series/2357.html#pvareaid=3454437"
headers = {
    "User-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/71.0.3578.98 Safari/537.36"
}

# Minimal DOM stub: the page's obfuscation JS expects document/window objects,
# so we fake just enough of them and collect every CSS rule it inserts into `rules`.
DOM = ("var rules = '2';"
       "var document = {};"
       "function getRules(){return rules}"
       "document.createElement = function() {"
       "    return {"
       "        sheet: {"
       "            insertRule: function(rule, i) {"
       "                if (rules.length == 0) {"
       "                    rules = rule;"
       "                } else {"
       "                    rules = rules + '#' + rule;"
       "                }"
       "            }"
       "        }"
       "    }"
       "};"
       "document.querySelectorAll = function() {"
       "    return {};"
       "};"
       "document.head = {};"
       "document.head.appendChild = function() {};"
       "var window = {};"
       "window.decodeURIComponent = decodeURIComponent;")

response = requests.get(url=url, headers=headers, timeout=10)
# print(response.content.decode("utf-8"))
html = response.content.decode("utf-8")

# Pull out the car parameters embedded in the page source
car_info = ""
config = re.search("var config = (.*?)};", html, re.S)  # car specifications
option = re.search("var option = (.*?)};", html, re.S)  # active/passive safety equipment
bag = re.search("var bag = (.*?)};", html, re.S)        # optional packages
js_list = re.findall('(\(function\([a-zA-Z]{2}.*?_\).*?\(document\);)', html)

# Concatenate all of the model's parameters into car_info
if config and option and bag:
    car_info = car_info + config.group(0) + option.group(0) + bag.group(0)
# print(car_info)

# Wrap the obfuscation JS into a local HTML file, run it through selenium,
# and read the accumulated rules back out as true_text
for item in js_list:
    DOM = DOM + item
html_type = "<html><meta http-equiv='Content-Type' content='text/html; charset=utf-8' /><head></head><body> <script type='text/javascript'>"
js = html_type + DOM + " document.write(rules)</script></body></html>"  # the JS string to be executed
os.makedirs("D:\\test11", exist_ok=True)
with open("D:\\test11\\asd.html", "w", encoding="utf-8") as f:
    f.write(js)
browser = webdriver.Chrome(executable_path=r"D:\chromedrive\chromedriver.exe")
browser.get("file://D:/test11/asd.html")
true_text = browser.find_element_by_tag_name('body').text
# print(true_text)

span_list = re.findall("<span(.*?)></span>", car_info)  # every placeholder span in the parameters

# Replace each span placeholder with the keyword decoded into true_text
for span in span_list:
    info = re.search("'(.*?)'", span)
    if info:
        class_info = str(info.group(1)) + "::before { content:(.*?)}"
        content = re.search(class_info, true_text).group(1)  # the matched glyph
        car_info = car_info.replace(str("<span class='" + info.group(1) + "'></span>"),
                                    re.search("\"(.*?)\"", content).group(1))
# print(car_info)

# Persist the data
car_item = {}
config = re.search("var config = (.*?);", car_info).group(1)
option = re.search("var option = (.*?);var", car_info).group(1)
bag = re.search("var bag = (.*?);", car_info).group(1)

config_re = json.loads(config)
option_re = json.loads(option)
bag_re = json.loads(bag)

config_item = config_re['result']['paramtypeitems'][0]['paramitems']
option_item = option_re['result']['configtypeitems'][0]['configitems']
bag_item = bag_re['result']['bagtypeitems'][0]['bagitems']

for car in config_item:
    car_item[car['name']] = []
    for value in car['valueitems']:
        car_item[car['name']].append(value['value'])

for car in option_item:
    car_item[car['name']] = []
    for value in car['valueitems']:
        car_item[car['name']].append(value['value'])

for car in bag_item[0]['valueitems']:
    car_item[car['name']] = []
    car_item[car['name']].append(car['bagid'])
    car_item[car['name']].append(car['pricedesc'])
    car_item[car['name']].append(car['description'])

# Write the spreadsheet
workbook = xlwt.Workbook(encoding='ascii')  # create a workbook
worksheet = workbook.add_sheet('汽車之家')   # create a sheet
cols = 0
start_row = 0

for co in car_item:
    cols = cols + 1
    worksheet.write(start_row, cols, co)  # row 0 holds the spec names (header row)

end_row_num = start_row + len(car_item['車型名稱'])  # number of trim records
for row in range(start_row, end_row_num):
    col_num = 0  # column counter
    row += 1
    for col in car_item:
        col_num = col_num + 1
        worksheet.write(row, col_num, str(car_item[col][row - 1]))

workbook.save('d:\\test.xls')
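As a side note, the replacement loop above rescans true_text once per span. A small refactor I find cleaner is to build a class-to-character dictionary once and then replace every placeholder in a single pass. This is my own sketch, not the original author's code, and it assumes the decoded rules look like .hs_kwN_xxxx::before { content:"X" } joined with '#', which is how the DOM stub accumulates them:

import re

def build_font_map(true_text):
    """Parse the decoded CSS rules into {class_name: real_character}."""
    pattern = r'\.?(\w+)::before\s*\{\s*content:\s*"(.*?)"'
    return {name: char for name, char in re.findall(pattern, true_text)}

def restore_spans(car_info, font_map):
    """Swap every <span class='...'></span> placeholder for its real character."""
    for name, char in font_map.items():
        car_info = car_info.replace("<span class='%s'></span>" % name, char)
    return car_info

# usage: car_info = restore_spans(car_info, build_font_map(true_text))

And if you don't want a Chrome window flashing up just to evaluate the local HTML file, passing the --headless argument through ChromeOptions to webdriver.Chrome should work as well.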
OK, let me stress it once more: this code is only meant to run end to end and pull the data; it has not been structured or wrapped up in any way.
Finally, thanks to this blogger: https://www.cnblogs.com/kangz/p/10011348.html
Following the Zhihu column over this stretch of time has made my own shortcomings very clear! Stay calm, analyze, and think it through: if a site encrypts its API, it can definitely be decrypted!