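Hello everyone. I hope you will read this article with an honest, rigorous, and professional attitude. ヾ(๑╹◡╹)ノ"
In this post we use Python to scrape Tmall lingerie sales data and analyze it to find the most common cup sizes among Chinese women, the most popular lingerie colors, and the keywords that show up in reviews.
I hope that after reading it you will be able to buy your girlfriend a bra she likes.
Let's first look at what the analysis produces. (The walkthrough is detailed; I recommend typing it out as you follow along.)
(So happy just buying lingerie)
If the images are hard to read, drag them into a separate window.
These conclusions come from about ten thousand records, so they may be skewed. Still, I hope the single readers among you manage to find someone in that 0.06% group.
Below I explain in detail how to scrape the Tmall lingerie sales data, then store, analyze, and display it.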
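Examining the Tmall site
Open the purchase page of any product (the page where the reviews are visible), press F12 to open the developer tools, go to the Network tab, refresh the page, and search for list_ in the position shown in the figure. You will see a request named list_detail_rate.htm?itemId= ....
As shown in the figure below: [single-click] this URL and you can see that it returns JSON data. If you inspect it, you will find that this JSON is the product's review data, under ['rateDetail']['rateList'].
[Double-click] this URL and you get a new page, as shown.
Take a look at this information.
The path shown here is the URL for fetching the review data. It has many parameters, and you can work out what each one does.
itemId is the product ID, sellerId is the shop ID, and currentPage is the current page. sellerId can be set to any value without affecting the data returned.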
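As a minimal sketch of how these three parameters fit together, the URL can be assembled with the standard library's `urllib.parse.urlencode` instead of string concatenation. The helper name `build_rate_url` and the sample IDs are illustrative assumptions; only the endpoint and the parameter names come from the request observed above.

```python
from urllib.parse import urlencode

RATE_ENDPOINT = "https://rate.tmall.com/list_detail_rate.htm"


def build_rate_url(item_id, current_page, seller_id="2451699564"):
    """Build the review-list URL (hypothetical helper, not part of the original code)."""
    params = {
        "itemId": item_id,            # product ID
        "sellerId": seller_id,        # shop ID; must be present, any value works
        "currentPage": current_page,  # 1-based page number
    }
    return RATE_ENDPOINT + "?" + urlencode(params)


print(build_rate_url(123456, 2))
```

Letting `urlencode` join the parameters avoids the easy mistake of dropping an `&` between two of them.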
Scraping Tmall review data
Write a method that fetches Tmall review data: getCommentDetail.
# Fetch one page of review data for a product
def getCommentDetail(itemId, currentPage):
    # itemId: product ID; sellerId: shop ID (must have a value, but any value works)
    url = ('https://rate.tmall.com/list_detail_rate.htm?itemId=' + str(itemId)
           + '&sellerId=2451699564&order=3&currentPage=' + str(currentPage)
           + '&append=0&callback=jsonp128')
    html = common.getUrlContent(url)  # fetch the response body
    # Strip the JSONP wrapper; check that the callback name really is jsonp128
    html = html.strip().replace('jsonp128(', '', 1)
    html = html.rstrip(')')
    # json.loads already understands true/false, so they need no quoting
    tmalljson = json.loads(html)  # convert the string to a dict
    return tmalljson
Pay attention to the jsonp128 value here: check what it is on your side, since yours is probably different from mine.
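Because the callback name can vary, one option is to strip the JSONP wrapper without hard-coding it. The following is a sketch under that assumption, not the author's method; `strip_jsonp` is a hypothetical helper and the sample response is made up for illustration.

```python
import json
import re

# Matches `anyCallbackName( ... )` and captures everything between the
# first "(" and the last ")", so parentheses inside review text survive.
_JSONP_RE = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def strip_jsonp(text):
    """Return the payload of a JSONP (or plain JSON) response as a Python object."""
    match = _JSONP_RE.match(text)
    payload = match.group(1) if match else text  # fall back to plain JSON
    return json.loads(payload)


sample = 'jsonp128({"rateDetail": {"rateList": [{"rateContent": "nice :)"}]}})'
data = strip_jsonp(sample)
print(data["rateDetail"]["rateList"][0]["rateContent"])
```

Capturing up to the last closing parenthesis matters because review text can itself contain `)` characters.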
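Also, common is a utility class I wrote myself. It mainly wraps the functionality covered in the previous post, such as the requests and pymysql modules. I will paste it at the end of the article.
The method above has two variables, itemId and currentPage, which we control dynamically, so we need a batch of product IDs and the maximum number of review pages to iterate over.
Write a method that gets the maximum number of review pages: getLastPage.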
# Get the maximum number of review pages for a product
def getLastPage(itemId):
    tmalljson = getCommentDetail(itemId, 1)
    return tmalljson['rateDetail']['paginator']['lastPage']  # last page number
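A product can have far more review pages than are worth downloading, so the last page number is usually combined with a cap. A minimal sketch, assuming the 20-page cap mentioned in this article's main loop; `pages_to_fetch` is a hypothetical helper name.

```python
def pages_to_fetch(last_page, max_pages=20):
    """Return the 1-based page numbers to request for one product."""
    return range(1, min(last_page, max_pages) + 1)


print(list(pages_to_fetch(3)))
print(len(pages_to_fetch(99)))
```

Using `min(...) + 1` as the exclusive upper bound keeps the cap inclusive, so a product with 99 pages yields exactly 20 requests.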
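So how do we get the list of product IDs? We can search Tmall for a product keyword and watch the page in developer mode.
Look at how the elements on this page are laid out and the product ID is easy to spot. You can of course find a way to confirm it.
Now write a method that gets the product IDs: getProductIdList.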
# Get product IDs
def getProductIdList():
    url = 'https://list.tmall.com/search_product.htm?q=內衣'  # q is the search keyword
    html = common.getUrlContent(url)  # fetch the page
    soup = BeautifulSoup(html, 'html.parser')
    idList = []
    # Use Beautiful Soup to extract every product ID on the page
    productList = soup.find_all('div', {'class': 'product'})
    for product in productList:
        if product.has_attr('data-id'):  # skip blocks without an ID
            idList.append(product['data-id'])
    return idList
All the building blocks are now in place, so it is time to put them together.
The rest of the assembly goes in the main block.
if __name__ == '__main__':
    productIdList = getProductIdList()  # get product IDs
    # The search page lists 60 products; only the first 30 are used
    for itemId in productIdList[:30]:
        try:
            print('----------', itemId, '------------')
            maxPage = getLastPage(itemId)  # maximum number of review pages
        except Exception as e:
            print(e)
            continue
        # At most 20 pages per product at 20 reviews per page,
        # so at most 400 reviews per product
        for num in range(1, min(maxPage, 20) + 1):
            try:
                # Fetch one page of reviews for this product
                tmalljson = getCommentDetail(itemId, num)
                rateList = tmalljson['rateDetail']['rateList']
                commentList = []
                for rate in rateList:
                    # auctionSku holds the product description (color and size)
                    m = re.split('[:;]', rate['auctionSku'])
                    # '天貓' (Tmall) is stored as the data source
                    commentList.append([m[1], m[3], '天貓', rate['rateContent'], rate['rateDate']])
                print(num)
                sql = "insert into bra(bra_id, bra_color, bra_size, resource, comment, comment_time) values(null, %s, %s, %s, %s, %s)"
                common.patchInsertData(sql, commentList)  # batch insert into MySQL
            except Exception as e:
                print(e)
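The `re.split('[:;]', ...)` step in the loop above takes the second and fourth fields of `auctionSku` as color and size. The sketch below isolates that step and guards against strings with fewer fields. The sample strings are assumed to resemble the real data, and `parse_sku` is a hypothetical helper.

```python
import re


def parse_sku(auction_sku):
    """Split an auctionSku string into (color, size); None if the shape is unexpected."""
    parts = re.split(r"[:;]", auction_sku)
    if len(parts) < 4:
        return None
    return parts[1].strip(), parts[3].strip()


print(parse_sku("顏色分類:膚色;尺碼:75B"))
print(parse_sku("顏色分類:膚色"))
```

Returning `None` for malformed rows lets the caller skip a single review instead of losing the whole page to an `IndexError`.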
That completes all the code. Here are the full listings of common.py and tmallbra.py.
# -*- coding:utf-8 -*-
# Author: zww
import requests
import time
import random
import pymysql
import csv


# Wrapper around requests, csv and pymysql
class Common(object):
    def getUrlContent(self, url, data=None):
        # Request headers
        header = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate',  # 'br' needs an extra package to decode
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
            'Cache-Control': 'max-age=0'
        }
        timeout = random.choice(range(80, 180))
        while True:
            try:
                # Request the URL and get the response
                rep = requests.get(url, headers=header, timeout=timeout)
                # rep.encoding = 'utf-8'
                break
            # Everything below is error handling: wait, then retry
            except requests.exceptions.Timeout as e:
                print('3:', e)
                time.sleep(random.choice(range(8, 15)))
            except requests.exceptions.ConnectionError as e:
                print('4:', e)
                time.sleep(random.choice(range(20, 60)))
            except requests.exceptions.ChunkedEncodingError as e:
                print('6:', e)
                time.sleep(random.choice(range(5, 15)))
        print('request success')
        return rep.text  # full text of the response

    def writeData(self, data, url):
        with open(url, 'a', errors='ignore', newline='') as f:
            f_csv = csv.writer(f)
            f_csv.writerows(data)
        print('write_csv success')

    def queryData(self, sql):
        db = pymysql.connect(host="localhost", user="zww", password="960128", database="test")
        cursor = db.cursor()
        results = []
        try:
            cursor.execute(sql)  # run the query
            results = cursor.fetchall()
        except Exception as e:
            print('Error while querying:', e)
            db.rollback()  # roll back on error
        finally:
            db.close()  # close the database connection
        return results

    def insertData(self, sql):
        # Open the database connection
        db = pymysql.connect(host="localhost", user="zww", password="000000", database="zwwdb")
        # Create a cursor object with cursor()
        cursor = db.cursor()
        try:
            # sql = "INSERT INTO WEATHER(w_id, w_date, w_detail, w_temperature) VALUES (null, '%s','%s','%s')" % (data[0], data[1], data[2])
            cursor.execute(sql)  # insert a single row
            db.commit()  # commit to the database
            print('insert data success')
        except Exception as e:
            print('Error while inserting:', e)
            db.rollback()  # roll back on error
        finally:
            db.close()  # close the database connection

    def patchInsertData(self, sql, datas):
        # Open the database connection
        db = pymysql.connect(host="localhost", user="zww", password="960128", database="test")
        # Create a cursor object with cursor()
        cursor = db.cursor()
        try:
            # Batch insert
            # cursor.executemany('insert into WEATHER(w_id, w_date, w_detail, w_temperature_low, w_temperature_high) value(null, %s,%s,%s,%s)',datas)
            cursor.executemany(sql, datas)
            db.commit()  # commit to the database
            print('insert data success')
        except Exception as e:
            print('Error while inserting:', e)
            db.rollback()  # roll back on error
        finally:
            db.close()  # close the database connection
In the code above, make sure the database settings match your own setup.
# -*- coding:utf-8 -*-
# Author: zww
from Include.commons.common import Common
from bs4 import BeautifulSoup
import json
import re

common = Common()


# Get product IDs
def getProductIdList():
    # q is the search keyword; change it to scrape data for anything else you want
    url = 'https://list.tmall.com/search_product.htm?q=內衣'
    html = common.getUrlContent(url)  # fetch the page
    soup = BeautifulSoup(html, 'html.parser')
    idList = []
    # Use Beautiful Soup to extract every product ID on the page
    productList = soup.find_all('div', {'class': 'product'})
    for product in productList:
        if product.has_attr('data-id'):  # skip blocks without an ID
            idList.append(product['data-id'])
    return idList


# Fetch one page of review data for a product
def getCommentDetail(itemId, currentPage):
    # itemId: product ID; sellerId: shop ID (must have a value, but any value works)
    url = ('https://rate.tmall.com/list_detail_rate.htm?itemId=' + str(itemId)
           + '&sellerId=2451699564&order=3&currentPage=' + str(currentPage)
           + '&append=0&callback=jsonp128')
    html = common.getUrlContent(url)  # fetch the response body
    # Strip the JSONP wrapper; check that the callback name really is jsonp128
    html = html.strip().replace('jsonp128(', '', 1)
    html = html.rstrip(')')
    # json.loads already understands true/false, so they need no quoting
    tmalljson = json.loads(html)  # convert the string to a dict
    return tmalljson


# Get the maximum number of review pages for a product
def getLastPage(itemId):
    tmalljson = getCommentDetail(itemId, 1)
    return tmalljson['rateDetail']['paginator']['lastPage']  # last page number


if __name__ == '__main__':
    productIdList = getProductIdList()  # get product IDs
    # The search page lists 60 products; only the first 30 are used
    for itemId in productIdList[:30]:
        try:
            print('----------', itemId, '------------')
            maxPage = getLastPage(itemId)  # maximum number of review pages
        except Exception as e:
            print(e)
            continue
        # At most 20 pages per product at 20 reviews per page,
        # so at most 400 reviews per product
        for num in range(1, min(maxPage, 20) + 1):
            try:
                # Fetch one page of reviews for this product
                tmalljson = getCommentDetail(itemId, num)
                rateList = tmalljson['rateDetail']['rateList']
                commentList = []
                for rate in rateList:
                    # auctionSku holds the product description (color and size)
                    m = re.split('[:;]', rate['auctionSku'])
                    # '天貓' (Tmall) is stored as the data source
                    commentList.append([m[1], m[3], '天貓', rate['rateContent'], rate['rateDate']])
                print(num)
                sql = "insert into bra(bra_id, bra_color, bra_size, resource, comment, comment_time) values(null, %s, %s, %s, %s, %s)"
                common.patchInsertData(sql, commentList)  # batch insert into MySQL
            except Exception as e:
                print(e)
Storing and analyzing the data
All the code is ready; the only thing missing is the database. I use MySQL here.
CREATE TABLE `bra` (
  `bra_id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'id',
  `bra_color` varchar(25) NULL COMMENT 'color',
  `bra_size` varchar(25) NULL COMMENT 'cup size',
  `resource` varchar(25) NULL COMMENT 'data source',
  `comment` varchar(500) CHARACTER SET utf8mb4 DEFAULT NULL COMMENT 'review',
  `comment_time` datetime NULL COMMENT 'review time',
  PRIMARY KEY (`bra_id`)
) character set utf8;
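Two things need attention here. The comment column must use the utf8mb4 character set, because reviews may contain emoji. The table itself must be set to utf8, otherwise it cannot store Chinese.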
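The reason for utf8mb4 is byte length: MySQL's legacy `utf8` character set stores at most three bytes per character, while emoji take four bytes in UTF-8. The short sketch below illustrates this in Python; `fits_in_mysql_utf8` is a hypothetical helper for checking a review before inserting it.

```python
# Show how many UTF-8 bytes an ASCII letter, a Chinese character and an emoji need.
for ch in ("A", "中", "😀"):
    print(repr(ch), len(ch.encode("utf-8")), "bytes")


def fits_in_mysql_utf8(text):
    """True if every character needs at most 3 UTF-8 bytes (MySQL's legacy utf8)."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


print(fits_in_mysql_utf8("很舒服"))
print(fits_in_mysql_utf8("很舒服😀"))
```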
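Once the table is created, you can run the full code. (This may take a while; it could be made multi-threaded.) When it finishes, check whether the database contains data.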
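One way the multi-threading idea could look is a thread pool that processes several products at once. This is a sketch only: `fetch_product` is a stand-in for the per-product work (fetching pages and inserting rows), and the worker count of 4 is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_product(item_id):
    """Placeholder for the real per-product work; returns a status tuple."""
    return item_id, "done"


def crawl_all(item_ids, workers=4):
    """Run fetch_product for every ID on a small thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_product, item_ids))


print(crawl_all(["1001", "1002", "1003"]))
```

Since patchInsertData in this article opens its own connection on every call, the threads would not share a database connection. Sending more requests in parallel may also make the site throttle or block them sooner, so a small pool is the safer starting point.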
The data is there, but it includes some superfluous descriptive text, which we can tidy up a little.
update bra set bra_color = REPLACE(bra_color,'2B6521-無鋼圈4-','');
update bra set bra_color = REPLACE(bra_color,'-1','');
update bra set bra_color = REPLACE(bra_color,'5','');
update bra set bra_size = substr(bra_size,1,3);
Adjust these statements to match your own data. Once the data is reasonably clean, we can analyze what is in the database.
select 'A cup' as cup, CONCAT(ROUND(COUNT(*)/(select count(*) from bra) * 100, 2), '%') as share, COUNT(*) as sales from bra where bra_size like '%A'
union all
select 'B cup' as cup, CONCAT(ROUND(COUNT(*)/(select count(*) from bra) * 100, 2), '%') as share, COUNT(*) as sales from bra where bra_size like '%B'
union all
select 'C cup' as cup, CONCAT(ROUND(COUNT(*)/(select count(*) from bra) * 100, 2), '%') as share, COUNT(*) as sales from bra where bra_size like '%C'
union all
select 'D cup' as cup, CONCAT(ROUND(COUNT(*)/(select count(*) from bra) * 100, 2), '%') as share, COUNT(*) as sales from bra where bra_size like '%D'
union all
select 'E cup' as cup, CONCAT(ROUND(COUNT(*)/(select count(*) from bra) * 100, 2), '%') as share, COUNT(*) as sales from bra where bra_size like '%E'
union all
select 'F cup' as cup, CONCAT(ROUND(COUNT(*)/(select count(*) from bra) * 100, 2), '%') as share, COUNT(*) as sales from bra where bra_size like '%F'
union all
select 'G cup' as cup, CONCAT(ROUND(COUNT(*)/(select count(*) from bra) * 100, 2), '%') as share, COUNT(*) as sales from bra where bra_size like '%G'
union all
select 'H cup' as cup, CONCAT(ROUND(COUNT(*)/(select count(*) from bra) * 100, 2), '%') as share, COUNT(*) as sales from bra where bra_size like '%H'
order by sales desc;
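The same tally can also be done in Python once the `bra_size` values are loaded into a list. This is a rough sketch of the query's logic: group by the trailing cup letter and divide by the total row count. `cup_shares` is a hypothetical helper and the sample sizes are made up.

```python
from collections import Counter

CUPS = "ABCDEFGH"


def cup_shares(sizes):
    """Return (cup, count, percent of all rows) tuples, most common first."""
    total = len(sizes)
    if total == 0:
        return []
    # Only sizes that end in a cup letter are counted, like the `like '%A'` filters.
    counts = Counter(s[-1].upper() for s in sizes if s and s[-1].upper() in CUPS)
    return [(cup, n, round(n / total * 100, 2)) for cup, n in counts.most_common()]


sample = ["75B", "80B", "75A", "85C", "75B", "均碼"]
for row in cup_shares(sample):
    print(row)
```

As in the SQL, the percentage is taken over all rows, including sizes that do not end in a cup letter, so the shares need not add up to 100.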
(I would love to know which six ladies bought the G cups (~ ̄▽ ̄)~ )
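Data visualization
To display the data I used the pyecharts module. If you are not familiar with it, you can read up on it at http://pyecharts.org/#/zh-cn/prepare
I won't go into detail here; the code follows.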
# encoding: utf-8
# author zww
from pyecharts import Pie  # pyecharts 0.5.x style import; newer releases use pyecharts.charts
from Include.commons.common import Common

if __name__ == '__main__':
    common = Common()
    # Count the rows for each cup size
    results = common.queryData("""select count(*) from bra where bra_size like '%A'
        union all select count(*) from bra where bra_size like '%B'
        union all select count(*) from bra where bra_size like '%C'
        union all select count(*) from bra where bra_size like '%D'
        union all select count(*) from bra where bra_size like '%E'
        union all select count(*) from bra where bra_size like '%F'
        union all select count(*) from bra where bra_size like '%G'""")
    # G is listed second, so the values are reordered to match the labels
    attr = ["A cup", "G cup", "B cup", "C cup", "D cup", "E cup", "F cup"]
    v1 = [results[i][0] for i in (0, 6, 1, 2, 3, 4, 5)]
    pie = Pie("Bra cup sizes", width=1300, height=620)
    pie.add("", attr, v1, is_label_show=True)
    pie.render('size.html')
    print('success')

    # Count the rows for each color
    results = common.queryData("""select count(*) from bra where bra_color like '%膚%'
        union all select count(*) from bra where bra_color like '%灰%'
        union all select count(*) from bra where bra_color like '%黑%'
        union all select count(*) from bra where bra_color like '%藍%'
        union all select count(*) from bra where bra_color like '%粉%'
        union all select count(*) from bra where bra_color like '%紅%'
        union all select count(*) from bra where bra_color like '%紫%'
        union all select count(*) from bra where bra_color like '%綠%'
        union all select count(*) from bra where bra_color like '%白%'
        union all select count(*) from bra where bra_color like '%褐%'
        union all select count(*) from bra where bra_color like '%黃%'""")
    attr = ["Nude", "Gray", "Black", "Blue", "Pink", "Red", "Purple", "Green", "White", "Brown", "Yellow"]
    v1 = [row[0] for row in results]  # same order as the queries above
    pieColor = Pie("Bra colors", width=1300, height=620)
    pieColor.add("", attr, v1, is_label_show=True)
    pieColor.render('color.html')
    print('success')
That's it for this chapter. You now know what you should know, and also what you shouldn't.
All the code is on GitHub: https://github.com/zwwjava/python_capture