Scraping JD.com with Scrapy

JD.com is very friendly to crawlers, unlike Tmall and Taobao, which go to absurd lengths to block them. In this post we crawl JD.com and look at how its data is actually fetched.

# Target site: jd.com
# Keyword: 手機 (any keyword works; this post uses "手機", i.e. mobile phones)

Searching for the keyword gives the following URL:

https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&wq=%E6%89%8B%E6%9C%BA&pvid=c53afe790a6f440f9adf7edcaabd8703

Scrolling down the page, it becomes obvious that part of the data is loaded dynamically via Ajax. Since dynamic data is involved there is nothing for it but to capture the requests. Before doing that, though, it is worth flipping through a few pages to see how the URL changes.

Clicking to the next page gives:

https://search.jd.com/Search?keyword=手機&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=手機&cid2=653&cid3=655&page=3&s=60&click=0  # the key parameter, page, has appeared

Clicking back to the first page gives:

https://search.jd.com/Search?keyword=手機&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=手機&cid2=653&cid3=655&page=1&s=1&click=0   # at this point the pattern is clear

Trying page=2 returns exactly the same data as the first page, and page=4 returns the same data as page=3. So the rule is simple: to move to the next page of results, just increase page by 2.
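In other words (a small sketch of my own, not from the original post), result page n maps to page parameter 2*n-1 for the half rendered in the HTML and 2*n for the half loaded via Ajax:

def page_params(n):
    # result page n -> (page for the static first 30 items, page for the Ajax last 30 items)
    return 2 * n - 1, 2 * n

for n in (1, 2, 3):
    print(n, page_params(n))  # (1, 2), (3, 4), (5, 6)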

Now capture the requests and look at the dynamically loaded data:

Opening the captured URL directly just bounces back to the home page, so the request is clearly missing something. A few blog posts later it turns out that a Referer header has to be sent.

The Referer value is simply the URL of the current search page. Next, let's figure out how this dynamic-data URL is constructed by visiting a few more pages and comparing:

Page 1: https://search.jd.com/s_new.php?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=%E6%89%8B%E6%9C%BA&cid2=653&cid3=655&page=2&s=30&scrolling=y&log_id=1547824670.57168&tpl=3_M&show_items=7643003,5089235,100000822981,5089273,5821455,7437788,5089225,100001172674,8894451,7081550,100000651175,6946605,8895275,7437564,100000349372,100002293114,8735304,100000820311,6949475,100000773875,7357933,100000971366,8638898,7694047,8790521,7479912,7651927,7686683,100001464948,100000650837

Page 2: https://search.jd.com/s_new.php?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=%E6%89%8B%E6%9C%BA&cid2=653&cid3=655&page=4&s=86&scrolling=y&log_id=1547824734.86451&tpl=3_M&show_items=5283387,7428766,6305258,7049459,8024543,6994622,5826236,3133841,6577511,100000993102,5295423,5963066,8717360,100000400014,7425622,7621213,100000993265,100002727566,28331229415,2321948,6737464,7029523,34250730122,3133811,36121534193,11794447957,5159244,28751842981,100001815307,35175013603

Page 3: https://search.jd.com/s_new.php?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=%E6%89%8B%E6%9C%BA&cid2=653&cid3=655&page=6&s=140&scrolling=y&log_id=1547824799.50167&tpl=3_M&show_items=3889169,4934609,5242942,4270017,32399556682,7293054,28209134950,100000993265,32796441851,5980401,6176077,27424489997,27493450925,5424574,100000015166,6840907,30938386315,12494304703,7225861,34594345130,29044581673,28502299808,4577217,8348845,31426728970,6425153,31430342752,15501730722,100000322417,5283377

Look closely at the page parameter: page 1 uses page=2, page 2 uses page=4, page 3 uses page=6. The values after show_items= keep changing as well. What on earth are those?

It turns out that each JD result page contains 60 products: the first 30 are rendered directly in the page, and the last 30 are loaded dynamically. The numbers after show_items= are the pids of the first 30 products, and they can be read straight out of the HTML source.
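As a quick illustration (a sketch of my own using the requests and parsel libraries, not code from the original post), the pids can be pulled out of the static page's data-pid attributes, which is exactly what the spider below does:

import requests
from parsel import Selector

# one-off check against the static first-half page shown above
url = ('https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8'
       '&qrst=1&rt=1&stop=1&vt=2&wq=%E6%89%8B%E6%9C%BA&cid2=653&cid3=655&page=1&s=1&click=0')
html = requests.get(url, headers={'user-agent': 'Mozilla/5.0'}).text

# every product <li> carries its pid in a data-pid attribute
pids = Selector(text=html).xpath('//*[@id="J_goodsList"]/ul/li/@data-pid').getall()
print(len(pids), pids[:5])  # roughly 30 pids, ready to be joined into show_items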

 

OK, to summarize: the URL for the first 30 items of a result page is this one:

https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=%E6%89%8B%E6%9C%BA&cid2=653&cid3=655&page=1&s=1&click=0

 

And the URL for the dynamically loaded last 30 items is this one:

https://search.jd.com/s_new.php?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=%E6%89%8B%E6%9C%BA&cid2=653&cid3=655&page=2&s=30&scrolling=y&log_id=1547825445.77300&tpl=3_M&show_items=7643003,5089235,100000822981,5089273,5821455,7437788,5089225,100001172674,8894451,7081550,100000651175,6946605,8895275,7437564,100000349372,100002293114,8735304,100000820311,6949475,100000773875,7357933,100000971366,8638898,8790521,7479912,7651927,7686683,100001464948,100000650837,1861091

 

When requesting the last 30 items, the Referer header and the pids must be included, and moving to the next page is just a matter of adding 2 to page. Time to start coding.
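Before wiring this into Scrapy, here is a minimal sketch (my own, using the requests library) of fetching the dynamic half of page 1 with the Referer header; the pids would come from the static page as shown earlier:

import requests

pids = '7643003,5089235,100000822981'  # truncated for illustration; normally all 30 pids go here
referer = ('https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8'
           '&qrst=1&rt=1&stop=1&vt=2&wq=%E6%89%8B%E6%9C%BA&cid2=653&cid3=655&page=1&s=1&click=0')
next_url = ('https://search.jd.com/s_new.php?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8'
            '&qrst=1&rt=1&stop=1&vt=2&wq=%E6%89%8B%E6%9C%BA&cid2=653&cid3=655'
            '&page=2&s=30&scrolling=y&show_items=' + pids)

resp = requests.get(next_url, headers={
    'referer': referer,            # a missing or wrong Referer redirects to https://www.jd.com/?se=deny
    'user-agent': 'Mozilla/5.0',
})
print(resp.status_code, resp.url)  # a 200 that stays on s_new.php means the Referer was accepted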


 

Directory structure:
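(The layout below is an assumed standard Scrapy layout, inferred from the files listed next and the 'jd.pipelines.JdphonePipeline' path in settings.py, not copied from the original post.)

jd/
├── scrapy.cfg
└── jd/
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── jdspider.py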

jdspider.py

import scrapy
from ..items import JdItem


class JdSpider(scrapy.Spider):
    name = 'jd'
    allowed_domains = ['jd.com']  # using www.jd.com here can prevent search.jd.com from being crawled
    keyword = "手機"
    page = 1
    url = 'https://search.jd.com/Search?keyword=%s&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=%s&cid2=653&cid3=655&page=%d&click=0'
    next_url = 'https://search.jd.com/s_new.php?keyword=%s&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=%s&cid2=653&cid3=655&page=%d&scrolling=y&show_items=%s'

    def start_requests(self):
        yield scrapy.Request(self.url % (self.keyword, self.keyword, self.page), callback=self.parse)

    def parse(self, response):
        """
        Parse the first 30 products of each result page; this data is rendered directly in the page.
        :param response:
        :return:
        """
        ids = []
        for li in response.xpath('//*[@id="J_goodsList"]/ul/li'):
            item = JdItem()
            title = li.xpath('div/div/a/em/text()').extract_first("")  # title
            price = li.xpath('div/div/strong/i/text()').extract_first("")  # price
            p_id = li.xpath('@data-pid').extract_first("")  # product id
            ids.append(p_id)
            url = li.xpath('div/div[@class="p-name p-name-type-2"]/a/@href').extract_first("")  # link to follow

            item['title'] = title
            item['price'] = price
            item['url'] = url
            # prepend https: to protocol-relative links
            if item['url'].startswith('//'):
                item['url'] = 'https:' + item['url']  # note: don't forget the colon after https
            elif not item['url'].startswith('https:'):
                item['info'] = None
                yield item
                continue

            yield scrapy.Request(item['url'], callback=self.info_parse, meta={"item": item})

        headers = {'referer': response.url}
        # the request for the last 30 items is checked for a Referer header, which is simply the URL of this page
        # a wrong Referer redirects to: https://www.jd.com/?se=deny
        self.page += 1
        yield scrapy.Request(self.next_url % (self.keyword, self.keyword, self.page, ','.join(ids)),
                             callback=self.next_parse, headers=headers)

    def next_parse(self, response):
        """
        Parse the last 30 products of each result page; this data comes from a separate URL
        that carries the pids of the first 30 products in show_items.
        :param response:
        :return:
        """
        for li in response.xpath('//li[@class="gl-item"]'):
            item = JdItem()
            title = li.xpath('div/div/a/em/text()').extract_first("")  # title
            price = li.xpath('div/div/strong/i/text()').extract_first("")  # price
            url = li.xpath('div/div[@class="p-name p-name-type-2"]/a/@href').extract_first("")  # link to follow
            item['title'] = title
            item['price'] = price
            item['url'] = url

            if item['url'].startswith('//'):
                item['url'] = 'https:' + item['url']  # note: don't forget the colon after https
            elif not item['url'].startswith('https:'):
                item['info'] = None
                yield item
                continue

            yield scrapy.Request(item['url'], callback=self.info_parse, meta={"item": item})

        if self.page < 200:
            self.page += 1
            yield scrapy.Request(self.url % (self.keyword, self.keyword, self.page), callback=self.parse)

    def info_parse(self, response):
        """
        Follow each product link and collect its detailed specs; everything is stored
        in the item's nested info field.
        :param response:
        :return:
        """
        item = response.meta['item']
        item['info'] = {}
        name = response.xpath('//div[@class="inner border"]/div[@class="head"]/a/text()').extract_first("")
        type = response.xpath('//div[@class="item ellipsis"]/text()').extract_first("")
        item['info']['name'] = name
        item['info']['type'] = type

        for div in response.xpath('//div[@class="Ptable"]/div[@class="Ptable-item"]'):
            h3 = div.xpath('h3/text()').extract_first("")  # spec group heading
            if not h3:
                h3 = "未知"
            dt = div.xpath('dl/dl/dt/text()').extract()  # spec names, passed to zip() as a list
            dd = div.xpath('dl/dl/dd[not(@class)]/text()').extract()  # spec values
            item['info'][h3] = {}
            for t, d in zip(dt, dd):
                item['info'][h3][t] = d
        yield item

 

 items.py

import scrapy


class JdItem(scrapy.Item):
    title = scrapy.Field()  # title

    price = scrapy.Field()  # price

    url = scrapy.Field()  # product link

    info = scrapy.Field()  # detailed specs

 

pipelines.py

from pymongo import MongoClient
from scrapy.utils.project import get_project_settings


class JdphonePipeline(object):
    def __init__(self):
        # read the MongoDB host, port, database and collection names from settings
        # (scrapy.conf has been removed from newer Scrapy versions, so get_project_settings() is used instead)
        settings = get_project_settings()
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        dbname = settings['MONGODB_DBNAME']
        col = settings['MONGODB_COL']

        # create a MongoClient instance
        client = MongoClient(host=host, port=port)

        # get the database
        db = client[dbname]

        # get the collection
        self.col = db[col]

    def process_item(self, item, spider):
        data = dict(item)
        self.col.insert_one(data)  # insert() is deprecated in pymongo 3+
        return item

 

 settings.py

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'

ITEM_PIPELINES = {
   'jd.pipelines.JdphonePipeline': 300,
}

# MongoDB host (loopback address)
MONGODB_HOST = '127.0.0.1'
# port, 27017 by default
MONGODB_PORT = 27017
# database name
MONGODB_DBNAME = 'JingDong'
# collection name
MONGODB_COL = 'JingDongPhone'
SQL_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"
SQL_DATE_FORMAT = "%Y-%m-%d"
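With MongoDB running locally, the spider can then be started in the usual Scrapy way (a usage note of mine, not from the original post); the spider name 'jd' comes from the code above, and -o optionally dumps the items to a file as well:

scrapy crawl jd
scrapy crawl jd -o phones.json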

 

The code is largely copied from this blogger's post, with only minor modifications: http://www.javashuo.com/article/p-paxztipn-bz.html

That's it for this JD crawler. There is actually another approach to crawling JD floating around online, which I'll look into next time. JD really is crawler-friendly.
