JD is remarkably friendly to crawlers, nothing like the scorched-earth anti-scraping of Tmall and Taobao. In this post we'll crawl JD and work out how its data is served.
```
# Target site: jd.com
# Keyword: 手機 ("mobile phone"; any keyword works, this post uses phones)
```
The resulting URL:
1 https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&wq=%E6%89%8B%E6%9C%BA&pvid=c53afe790a6f440f9adf7edcaabd8703
Dragging down the page, it's obvious that part of the data is fetched dynamically via Ajax. Since dynamic data is involved, there's nothing for it but to capture the requests. Before doing that, though, it's worth paging through a few results to see whether the URL changes.
Clicking to the next page:
https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=%E6%89%8B%E6%9C%BA&cid2=653&cid3=655&page=3&s=60&click=0 # the key parameter page has appeared
Then click back to the first page:
https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=%E6%89%8B%E6%9C%BA&cid2=653&cid3=655&page=1&s=1&click=0 # at this point the pattern emerges
Change page to 2 and try it: the data returned is the same as the first page, and page=4 returns the same data as page=3. So the rule is simple: each time you open a new results page, just increase page by 2, as sketched below.
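To make the rule concrete, here is a tiny sketch (my own illustration, not code from the original post) of how a visible results page maps onto the page parameter:

```python
def search_page(visible_page):
    # Visible page n maps to page = 2n - 1 in the Search URL:
    # page 1 -> page=1, page 2 -> page=3, page 3 -> page=5, ...
    return 2 * visible_page - 1

assert [search_page(n) for n in (1, 2, 3)] == [1, 3, 5]
```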
Now capture the traffic and grab the dynamic data:
Visiting that URL directly just bounces back to the homepage, so clearly some parameter is missing. A few blog posts later, it turns out the request has to carry a Referer header.
The Referer is simply the URL of the current results page. Next, let's see how these dynamic-data URLs are constructed by visiting a few more pages and comparing:
Page 1: https://search.jd.com/s_new.php?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=%E6%89%8B%E6%9C%BA&cid2=653&cid3=655&page=2&s=30&scrolling=y&log_id=1547824670.57168&tpl=3_M&show_items=7643003,5089235,100000822981,5089273,5821455,7437788,5089225,100001172674,8894451,7081550,100000651175,6946605,8895275,7437564,100000349372,100002293114,8735304,100000820311,6949475,100000773875,7357933,100000971366,8638898,7694047,8790521,7479912,7651927,7686683,100001464948,100000650837

Page 2: https://search.jd.com/s_new.php?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=%E6%89%8B%E6%9C%BA&cid2=653&cid3=655&page=4&s=86&scrolling=y&log_id=1547824734.86451&tpl=3_M&show_items=5283387,7428766,6305258,7049459,8024543,6994622,5826236,3133841,6577511,100000993102,5295423,5963066,8717360,100000400014,7425622,7621213,100000993265,100002727566,28331229415,2321948,6737464,7029523,34250730122,3133811,36121534193,11794447957,5159244,28751842981,100001815307,35175013603

Page 3: https://search.jd.com/s_new.php?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=%E6%89%8B%E6%9C%BA&cid2=653&cid3=655&page=6&s=140&scrolling=y&log_id=1547824799.50167&tpl=3_M&show_items=3889169,4934609,5242942,4270017,32399556682,7293054,28209134950,100000993265,32796441851,5980401,6176077,27424489997,27493450925,5424574,100000015166,6840907,30938386315,12494304703,7225861,34594345130,29044581673,28502299808,4577217,8348845,31426728970,6425153,31430342752,15501730722,100000322417,5283377
Look closely at page: page 1 uses page=2, page 2 uses page=4, page 3 uses page=6. Meanwhile the values after show_items= keep changing. What on earth are those?
A bit more blog reading explains it: each JD results page holds 60 products. The first 30 render directly in the HTML, and the last 30 are loaded dynamically. The numbers after show_items= are the pids of the first 30 products, and each pid can be read straight out of the HTML source.
OK, to summarize: the URL for the first 30 products of the first page is this:
https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=%E6%89%8B%E6%9C%BA&cid2=653&cid3=655&page=1&s=1&click=0
and the last 30, the dynamic ones, come from this:
https://search.jd.com/s_new.php?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=%E6%89%8B%E6%9C%BA&cid2=653&cid3=655&page=2&s=30&scrolling=y&log_id=1547825445.77300&tpl=3_M&show_items=7643003,5089235,100000822981,5089273,5821455,7437788,5089225,100001172674,8894451,7081550,100000651175,6946605,8895275,7437564,100000349372,100002293114,8735304,100000820311,6949475,100000773875,7357933,100000971366,8638898,8790521,7479912,7651927,7686683,100001464948,100000650837,1861091
Requests for the last 30 items must carry the Referer and the pids, and moving to the next page is just page+2. The whole flow can be checked with a quick script like the sketch below, and then we can start building.
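This is a minimal throwaway sketch using requests and lxml rather than the Scrapy spider that follows, and it assumes JD's markup and parameters still match what was captured above:

```python
import requests
from lxml import etree

keyword = '手機'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '
                         'Gecko/20100101 Firefox/23.0'}

# 1. The static half: the first 30 products of the first page (page=1).
url = ('https://search.jd.com/Search?keyword={kw}&enc=utf-8&qrst=1&rt=1'
       '&stop=1&vt=2&wq={kw}&cid2=653&cid3=655&page=1&s=1&click=0').format(kw=keyword)
doc = etree.HTML(requests.get(url, headers=headers).text)

# 2. The pids of those 30 products sit in each <li>'s data-pid attribute.
pids = doc.xpath('//*[@id="J_goodsList"]/ul/li/@data-pid')

# 3. The dynamic half: page is one higher (page=2), show_items carries the
#    pids, and the Referer must be the page we just visited.
next_url = ('https://search.jd.com/s_new.php?keyword={kw}&enc=utf-8&qrst=1&rt=1'
            '&stop=1&vt=2&wq={kw}&cid2=653&cid3=655&page=2&s=30&scrolling=y'
            '&show_items={items}').format(kw=keyword, items=','.join(pids))
resp = requests.get(next_url, headers=dict(headers, referer=url))
print(len(etree.HTML(resp.text).xpath('//li[@class="gl-item"]')))  # expect 30
```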
Directory structure (the Scrapy project files are listed below):
jdspider.py
```python
import scrapy
from ..items import JdItem


class JdSpider(scrapy.Spider):
    name = 'jd'
    allowed_domains = ['jd.com']  # writing www.jd.com here can make search.jd.com uncrawlable
    keyword = "手機"
    page = 1
    url = 'https://search.jd.com/Search?keyword=%s&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=%s&cid2=653&cid3=655&page=%d&click=0'
    next_url = 'https://search.jd.com/s_new.php?keyword=%s&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=%s&cid2=653&cid3=655&page=%d&scrolling=y&show_items=%s'

    def start_requests(self):
        yield scrapy.Request(self.url % (self.keyword, self.keyword, self.page), callback=self.parse)

    def parse(self, response):
        """
        Scrape the first 30 products of each page; this data sits directly
        in the page HTML.
        """
        ids = []
        for li in response.xpath('//*[@id="J_goodsList"]/ul/li'):
            item = JdItem()
            title = li.xpath('div/div/a/em/text()').extract_first("")  # title
            price = li.xpath('div/div/strong/i/text()').extract_first("")  # price
            p_id = li.xpath('@data-pid').extract_first("")  # product id
            ids.append(p_id)
            url = li.xpath('div/div[@class="p-name p-name-type-2"]/a/@href').extract_first("")  # link to follow

            item['title'] = title
            item['price'] = price
            item['url'] = url
            # prepend https: to the scheme-relative links
            if item['url'].startswith('//'):
                item['url'] = 'https:' + item['url']  # careless readers beware: the colon is required
            elif not item['url'].startswith('https:'):
                item['info'] = None
                yield item
                continue

            yield scrapy.Request(item['url'], callback=self.info_parse, meta={"item": item})

        headers = {'referer': response.url}
        # Requests for the last 30 items are checked for a Referer, which is
        # simply the real URL of this page.
        # A wrong Referer redirects to: https://www.jd.com/?se=deny
        self.page += 1
        yield scrapy.Request(self.next_url % (self.keyword, self.keyword, self.page, ','.join(ids)),
                             callback=self.next_parse, headers=headers)

    def next_parse(self, response):
        """
        Scrape the last 30 products of each page; this data lives at a special
        URL built from the ids of the first 30 products.
        """
        for li in response.xpath('//li[@class="gl-item"]'):
            item = JdItem()
            title = li.xpath('div/div/a/em/text()').extract_first("")  # title
            price = li.xpath('div/div/strong/i/text()').extract_first("")  # price
            url = li.xpath('div/div[@class="p-name p-name-type-2"]/a/@href').extract_first("")  # link to follow
            item['title'] = title
            item['price'] = price
            item['url'] = url

            if item['url'].startswith('//'):
                item['url'] = 'https:' + item['url']
            elif not item['url'].startswith('https:'):
                item['info'] = None
                yield item
                continue

            yield scrapy.Request(item['url'], callback=self.info_parse, meta={"item": item})

        if self.page < 200:
            self.page += 1
            yield scrapy.Request(self.url % (self.keyword, self.keyword, self.page), callback=self.parse)

    def info_parse(self, response):
        """
        Follow each product link and scrape its details; everything is stored
        in the item's info sub-field.
        """
        item = response.meta['item']
        item['info'] = {}
        name = response.xpath('//div[@class="inner border"]/div[@class="head"]/a/text()').extract_first("")
        type = response.xpath('//div[@class="item ellipsis"]/text()').extract_first("")
        item['info']['name'] = name
        item['info']['type'] = type

        for div in response.xpath('//div[@class="Ptable"]/div[@class="Ptable-item"]'):
            h3 = div.xpath('h3/text()').extract_first()
            if not h3:  # extract_first() returns None when the heading is missing
                h3 = "unknown"
            dt = div.xpath('dl/dl/dt/text()').extract()  # pass lists to zip() below
            dd = div.xpath('dl/dl/dd[not(@class)]/text()').extract()
            item['info'][h3] = {}
            for t, d in zip(dt, dd):
                item['info'][h3][t] = d
        yield item
```
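To run the spider without the scrapy CLI, the standard CrawlerProcess pattern works; the import path below is an assumption based on the project package being named jd, as the pipeline setting later suggests:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from jd.spiders.jdspider import JdSpider  # assumed module path for this project layout

process = CrawlerProcess(get_project_settings())  # picks up settings.py
process.crawl(JdSpider)
process.start()  # blocks until the crawl finishes
```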
items.py
```python
import scrapy


class JdItem(scrapy.Item):
    title = scrapy.Field()  # title
    price = scrapy.Field()  # price
    url = scrapy.Field()    # product link
    info = scrapy.Field()   # detailed info
```
pipelines.py
```python
from pymongo import MongoClient
from scrapy.utils.project import get_project_settings  # scrapy.conf is gone from recent Scrapy

settings = get_project_settings()


class JdphonePipeline(object):
    def __init__(self):
        # read the host, port, database and collection names from settings
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        dbname = settings['MONGODB_DBNAME']
        col = settings['MONGODB_COL']

        # create a mongo client
        client = MongoClient(host=host, port=port)

        # select the database
        db = client[dbname]

        # select the collection
        self.col = db[col]

    def process_item(self, item, spider):
        data = dict(item)
        self.col.insert_one(data)  # insert() is deprecated in recent pymongo
        return item
```
settings.py
```python
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'

ITEM_PIPELINES = {
    'jd.pipelines.JdphonePipeline': 300,
}

# loopback address of the Mongo host
MONGODB_HOST = '127.0.0.1'
# port, 27017 by default
MONGODB_PORT = 27017
# database name
MONGODB_DBNAME = 'JingDong'
# collection name
MONGODB_COL = 'JingDongPhone'
SQL_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"
SQL_DATE_FORMAT = "%Y-%m-%d"
```
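After a crawl, a quick pymongo check (a sketch, reusing the connection values above) confirms that items landed in the collection:

```python
from pymongo import MongoClient

col = MongoClient('127.0.0.1', 27017)['JingDong']['JingDongPhone']
print(col.count_documents({}))  # how many products were stored
doc = col.find_one()
if doc:
    print(doc['title'], doc['price'])
    print(doc['info'])  # nested details, or None when the link couldn't be followed
```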
The code is essentially copied from this blogger, with only minor modifications: http://www.javashuo.com/article/p-paxztipn-bz.html
Well, that wraps up this JD crawler. There's actually another version of a JD crawler floating around online; I'll dig into it next time. JD really is friendly to crawlers.