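Ahem~ Don't be suspicious: this is a serious visualization project, with a bit of popular science on the side 🐶
The data was collected with a crawler: roughly 200,000 reviews from about 50 bra listings on Taobao~
The data source is chenjiandongx/cup-size.
For the many gentlemen who only know A/B/C, a little background is probably needed before we look at the data~
First, we need two concepts, the bust and underbust measurements; see the diagram:
The difference between the bust and underbust measurements determines the cup size; the mapping is shown in the figure below:
With the underbust measurement and the cup, we can determine the corresponding bra size~
Of course, sizes are further split into UK sizing and international sizing; see the figure below: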
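As a rough companion to the explanation above, here is a minimal sketch that turns two measurements into a metric size label. The 2.5 cm step table (AA at 7.5 cm up to E at 20 cm) is a commonly cited convention and is an assumption here, since the reference figures are not reproduced in the text; `CUP_STEPS`, `cup_letter`, `band_size` and `bra_size` are hypothetical names.

```python
# Assumed mapping from (bust - underbust) in cm to a cup letter.
# This is a common convention, not a table taken from the article's figure.
CUP_STEPS = [(7.5, "AA"), (10.0, "A"), (12.5, "B"), (15.0, "C"), (17.5, "D"), (20.0, "E")]


def cup_letter(bust_cm, underbust_cm):
    """Return the cup whose nominal difference is closest to the measured one."""
    diff = bust_cm - underbust_cm
    return min(CUP_STEPS, key=lambda step: abs(step[0] - diff))[1]


def band_size(underbust_cm):
    """Round the underbust to the nearest multiple of 5 (70, 75, 80, ...)."""
    return int(underbust_cm / 5 + 0.5) * 5


def bra_size(bust_cm, underbust_cm):
    """Combine band and cup into a label such as '75B'."""
    return f"{band_size(underbust_cm)}{cup_letter(bust_cm, underbust_cm)}"


print(bra_size(87.5, 75))  # difference of 12.5 cm
```

Picking the nearest nominal difference keeps the sketch short; a real size chart may use ranges with different boundaries.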
Alright, now we can start the visualization~

    from collections import Counter
    import re

    import jieba
    import jieba.posseg as psg
    import pandas as pd
    from IPython.display import Image
    from pyecharts import options as opts
    from pyecharts.charts import *
    from pyecharts.commons.utils import JsCode
    from stylecloud import gen_stylecloud

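The raw data is a txt file; to make it easier to work with, we convert it to a DataFrame~
For the size field, a regular expression extracts the underbust measurement and the cup. The code is as follows: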
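
    # Each line looks like "<datetime>,顏色分類:<color>;尺碼:<size>,<comment>".
    # The literal labels and punctuation must match the data file exactly.
    patterns = re.compile(
        r'(?P<datetime>.*),顏色分類:(?P<color>.*?);尺碼:(?P<size>.*?),(?P<comment>.*)'
    )

    with open('/home/kesci/input/cup6439/cup_all.txt', 'r') as f:
        lines = f.readlines()

    obj_list = []
    for item in lines:
        obj = patterns.search(item)
        if obj is None:  # skip lines that do not follow the expected layout
            continue
        obj_list.append(obj.groupdict())

    data = pd.DataFrame(obj_list)

    # Split the size field into underbust (70, 75, ..., 95) and cup letter.
    # [05] matches only "0" or "5"; writing [0|5] would also accept a literal "|".
    data = pd.concat(
        [data,
         data['size'].str.extract(
             r'(?P<circumference>[7-9][05]).*(?P<cup>[a-zA-Z])', expand=True)],
        axis=1,
    )
    data.head()
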
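To make the two-step extraction easier to follow, here is a small stdlib-only sketch that applies the same two patterns to a single line. The sample line is made up for illustration, and `LINE_RE`, `SIZE_RE` and `parse_line` are hypothetical names; the real file may use different punctuation or script, in which case the literals need adjusting.

```python
import re

# Same two patterns as in the article; the sample line below is invented.
LINE_RE = re.compile(
    r'(?P<datetime>.*),顏色分類:(?P<color>.*?);尺碼:(?P<size>.*?),(?P<comment>.*)'
)
SIZE_RE = re.compile(r'(?P<circumference>[7-9][05]).*(?P<cup>[a-zA-Z])')


def parse_line(line):
    """Return a dict of fields, or None when the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    record = m.groupdict()
    size = SIZE_RE.search(record['size'])
    # Lines without a recognizable size keep None for both derived fields.
    record['circumference'] = size.group('circumference') if size else None
    record['cup'] = size.group('cup') if size else None
    return record


sample = '2018-01-01 10:00:00,顏色分類:膚色薄款;尺碼:75B=34B,質量很好'
print(parse_line(sample))
print(parse_line('not a review line'))
```

Because `.*` in the size pattern is greedy, the `cup` group captures the last letter in the size string, which is worth keeping in mind for sizes written with extra text after the cup.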
Let's use jieba word segmentation to see which keywords appear most often in the product category field~

    # Collect nouns ('n') and person names ('nr') longer than one character
    # from the category field of every review.
    w_all = []
    for item in data.color:
        w_l = psg.cut(item)
        w_l = [w for w, flag in w_l if flag in ('n', 'nr') and len(w) > 1]
        w_all.extend(w_l)

    c = Counter(w_all)
    counter = c.most_common(50)

    bar = (
        Bar(init_opts=opts.InitOpts(theme='purple-passion', width='1000px', height='800px'))
        .add_xaxis([x for x, y in counter[::-1]])
        .add_yaxis('Occurrences', [y for x, y in counter[::-1]], category_gap='30%')
        .set_global_opts(
            title_opts=opts.TitleOpts(
                title="Most frequent keywords",
                pos_left="center",
                title_textstyle_opts=opts.TextStyleOpts(font_size=20)),
            datazoom_opts=opts.DataZoomOpts(range_start=70, range_end=100, orient='vertical'),
            visualmap_opts=opts.VisualMapOpts(
                is_show=False, max_=6e4, min_=3000, dimension=0,
                range_color=['#f5d69f', '#f5898b', '#ef5055']),
            legend_opts=opts.LegendOpts(is_show=False),
            xaxis_opts=opts.AxisOpts(is_show=False),
            yaxis_opts=opts.AxisOpts(
                axistick_opts=opts.AxisTickOpts(is_show=False),
                axisline_opts=opts.AxisLineOpts(is_show=False)))
        .set_series_opts(
            label_opts=opts.LabelOpts(is_show=True, position='right', font_style='italic'),
            itemstyle_opts={"normal": {
                "barBorderRadius": [30, 30, 30, 30],
                'shadowBlur': 10,
                'shadowColor': 'rgba(120, 36, 50, 0.5)',
                'shadowOffsetY': 5,
            }})
        .reversal_axis()
    )
    bar.render_notebook()

- Color: skin tone > black > pink > white;
- Thin styles > thick styles;
- Underwire seems to be a fairly important selling point;
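The counting step behind this chart can be sketched without jieba, assuming the segmenter has already produced (word, part-of-speech) pairs. The tagged tokens below are made up for illustration and `count_keywords` is a hypothetical helper; only the filter (tags `n`/`nr`, more than one character) mirrors the article's code.

```python
from collections import Counter


def count_keywords(tagged_tokens, keep_flags=('n', 'nr'), top=3):
    """Count words whose tag is in keep_flags and that have more than one character."""
    words = [w for w, flag in tagged_tokens if flag in keep_flags and len(w) > 1]
    return Counter(words).most_common(top)


# Hypothetical tagger output for three product variants (not real data).
tagged = [
    ('膚色', 'n'), ('薄款', 'n'), ('無', 'v'), ('鋼圈', 'n'),
    ('黑色', 'n'), ('薄款', 'n'), ('有', 'v'), ('鋼圈', 'n'),
    ('膚色', 'n'), ('薄款', 'n'), ('杯', 'n'),
]
print(count_keywords(tagged))
```

Dropping single-character tokens removes fragments such as '杯' that carry little meaning on their own, at the cost of losing any genuinely meaningful one-character words.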

    t_data = data.groupby(['circumference', 'cup'])['datetime'].count().reset_index()
    t_data.columns = ['circumference', 'cup', 'num']
    # t_data.num = round(t_data.num.div(t_data.num.sum(axis=0), axis=0) * 100, 1)

    # Inner ring: one node per cup; outer ring: one child per concrete size.
    data_pair = [
        {"name": 'A', "label": {"show": True}, "children": []},
        {"name": 'B', "label": {"show": True}, "children": []},
        {"name": 'C', "label": {"show": True},
         'shadowBlur': 10, 'shadowColor': 'rgba(120, 36, 50, 0.5)', 'shadowOffsetY': 5,
         "children": []},
        {"name": 'D', "label": {"show": False}, "children": []},
        {"name": 'E', "label": {"show": False}, "children": []},
    ]
    cup_index = {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4}

    for _, row in t_data.iterrows():
        if row.cup not in cup_index:  # only uppercase A-E are charted
            continue
        child_data = {
            "name": '{}-{}'.format(row.circumference, row.cup),
            "value": row.num,
            # Only label sizes with more than 3000 reviews to avoid clutter.
            "label": {"show": bool(row.num > 3000)},
        }
        data_pair[cup_index[row.cup]]['children'].append(child_data)

    c = (
        Sunburst(init_opts=opts.InitOpts(theme='purple-passion', width="1000px", height="1000px"))
        .add(
            "",
            data_pair=data_pair,
            highlight_policy="ancestor",
            radius=[0, "100%"],
            sort_='null',
            levels=[
                {},
                {"r0": "20%", "r": "48%",
                 "itemStyle": {"borderColor": 'rgb(220,220,220)', "borderWidth": 2}},
                {"r0": "50%", "r": "80%",
                 "label": {"align": "right"},
                 "itemStyle": {"borderColor": 'rgb(220,220,220)', "borderWidth": 1}},
            ],
        )
        .set_global_opts(
            visualmap_opts=opts.VisualMapOpts(
                is_show=False, max_=90000, min_=3000,
                range_color=['#f5d69f', '#f5898b', '#ef5055']),
            title_opts=opts.TitleOpts(
                title="Bra\n\nSize Distribution",
                pos_left="center",
                pos_top="center",
                title_textstyle_opts=opts.TextStyleOpts(font_style='oblique', font_size=30)))
        .set_series_opts(label_opts=opts.LabelOpts(font_size=18, formatter="{b}: {c}"))
    )
    c.render_notebook()

- By cup alone: B > A > C
- By specific size: 75B > 80B > 75A > 70A
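The two comparisons above come from the same flat table of (underbust, cup, count) rows, viewed at two levels. Here is a minimal sketch of that grouping in plain Python; the counts are made up, and `cup_totals` and `build_sunburst_data` are hypothetical helpers rather than part of pyecharts.

```python
from collections import defaultdict


def cup_totals(rows):
    """Sum the counts per cup (the 'by cup alone' comparison)."""
    totals = defaultdict(int)
    for _, cup, num in rows:
        totals[cup] += num
    return dict(totals)


def build_sunburst_data(rows, label_threshold=3000):
    """Group (circumference, cup, count) rows into cup -> size nodes."""
    children = defaultdict(list)
    for circumference, cup, num in rows:
        children[cup].append({
            "name": f"{circumference}-{cup}",
            "value": num,
            # Hide labels on small slices, as in the chart code.
            "label": {"show": num > label_threshold},
        })
    return [{"name": cup, "children": children[cup]} for cup in sorted(children)]


# Made-up counts, not the article's data.
rows = [('70', 'A', 4000), ('75', 'A', 5000), ('75', 'B', 9000),
        ('80', 'B', 6000), ('80', 'C', 2000)]
print(cup_totals(rows))
print(build_sunburst_data(rows)[2])
```

Building the nodes from whatever cups appear in the data avoids hard-coding one branch per letter, though the chart code's fixed A-E list has the advantage of a guaranteed ring order.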
Next, let's look at the cup proportions for each underbust measurement:

    grid = Grid(init_opts=opts.InitOpts(theme='purple-passion', width='1000px', height='1000px'))

    # Lay the six pies out in two columns and three rows (positions in percent).
    for idx, c in enumerate(['70', '75', '80', '85', '90', '95']):
        x = 30 if idx % 2 == 0 else 70
        y = (idx // 2) * 30 + 20
        pos_x = str(x) + '%'
        pos_y = str(y) + '%'

        pie = Pie(init_opts=opts.InitOpts())
        pie.add(
            c,
            [[row.cup, row.num] for i, row in t_data[t_data.circumference == c].iterrows()],
            center=[pos_x, pos_y],
            radius=[70, 100],
            label_opts=opts.LabelOpts(formatter='{b}:{d}%'),
        )
        pie.set_global_opts(
            title_opts=opts.TitleOpts(
                title="Underbust={}".format(c),
                pos_top=str(y - 1) + '%',
                pos_left=str(x - 4) + '%',
                title_textstyle_opts=opts.TextStyleOpts(font_size=15)),
            legend_opts=opts.LegendOpts(is_show=True))
        grid.add(pie, grid_opts=opts.GridOpts(pos_left='20%'))

    grid.render_notebook()

- Underbust = 70: A > B > C
- Underbust = 75: B > A > C
- Underbust = 80: B > A > C
- Underbust = 85: B > C > A
- Underbust = 90: C > B > A
- Underbust = 95: C > B > D
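Rankings like the ones above can also be computed directly instead of being read off the pies. The sketch below works on plain (underbust, cup, count) tuples with made-up numbers; `cup_shares` is a hypothetical helper, and the percentages play the role of the `{d}` value in the pie labels.

```python
from collections import defaultdict


def cup_shares(rows):
    """Return {circumference: [(cup, percent), ...]} sorted by descending share."""
    by_band = defaultdict(dict)
    for circumference, cup, num in rows:
        by_band[circumference][cup] = by_band[circumference].get(cup, 0) + num
    shares = {}
    for band, counts in by_band.items():
        total = sum(counts.values())
        ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
        shares[band] = [(cup, round(100 * n / total, 1)) for cup, n in ranked]
    return shares


# Made-up counts, not the article's data.
rows = [('70', 'A', 600), ('70', 'B', 300), ('70', 'C', 100),
        ('90', 'C', 500), ('90', 'B', 400), ('90', 'A', 100)]
for band, ranked in cup_shares(rows).items():
    print(band, ' > '.join(cup for cup, _ in ranked), ranked)
```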
Finally, let's see which words come up most often in the reviews~

    w_all = []
    for item in data.comment:
        w_l = jieba.lcut(item)
        w_all.extend(w_l)

    c = Counter(w_all)

    gen_stylecloud(
        ' '.join(w_all),
        size=1000,
        # max_words=1000,
        font_path='/home/kesci/work/font/simhei.ttf',
        # palette='palettable.tableau.TableauMedium_10',
        icon_name='fas fa-heartbeat',
        output_name='comment.png',
        # Stopwords must match the tokens in the data, so they stay in Chinese.
        custom_stopwords=['沒有', '用戶', '填寫', '評論'],
    )
    Image(filename='comment.png')

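Only static images without interactivity can be uploaded to this article; for a better reading experience, visit my KLab project: 【Pyecharts】Visualizing 200,000 Taobao Bra Product Reviews~
🌈 Likes and support are welcome~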