A few months ago I spent some free time crawling sushi data from a food-delivery platform (I refuse to admit I was living on takeout back then), and now that I have a moment I'm writing it up to share. This article is aimed at casual onlookers and readers with a tiny bit of background.
Tip: to speed up the crawl, this crawler uses async coroutines. If you only need a small amount of data, I don't recommend this approach, since you can get banned; it is easy to change it back to regular synchronous code.
Following the ETL flow of data analysis, the little crawler breaks down as follows:
import pandas as pd
import aiohttp
import asyncio
from multiprocessing.pool import Pool
from datetime import date
import pymysql  # MySQL driver used by SQLAlchemy's mysql+pymysql connection string
from sqlalchemy import create_engine
2.1 First comes the data request function:

async def gethtml(url):
    header = {
        'Accept': 'application/json, text/plain, */*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Host': 'www.ele.me',
        'Referer': 'https://www.ele.me/place/wsbrgts6d1ry?latitude=28.111704&longitude=113.011304',
        'x-shard': 'loc=113.011304,28.111704',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
    }
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(url=url, headers=header) as r:
                r.raise_for_status()  # raise on 4xx/5xx (fixes the original `if not r.raise_for_status()` check)
                return await r.json()
    except Exception as e:
        print(e)  # a failed request logs the error and implicitly returns None
All subsequent data requests go through this function; since we rely on async coroutines, it is defined with async.
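As noted in the tip at the top, the crawl can also be rewritten as plain synchronous code; here is a minimal sketch with requests (my own illustration, not part of the original script):

import requests

def gethtml_sync(url, header):
    # hypothetical synchronous counterpart of gethtml(), fine for small crawls
    try:
        r = requests.get(url, headers=header, timeout=10)
        r.raise_for_status()
        return r.json()
    except Exception as e:
        print(e)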
2.2 Next come the data extraction functions:
def getshopid(html):
    # collect the unique shop ids from one page of search results
    shop_id = {i['restaurant']['id'] for i in html['restaurant_with_foods']}
    return shop_id

def geturl(ids):
    # build the shop-detail URL and the menu URL for every deduplicated shop id
    restaurant_url = {'https://www.ele.me/restapi/shopping/restaurant/%s?latitude=28.09515&longitude=113.012001&terminal=web' % shop_id
                      for shop_id in ids}
    foodurl = {'https://www.ele.me/restapi/shopping/v2/menu?restaurant_id=%s&terminal=web' % shop_id
               for shop_id in ids}
    return restaurant_url, foodurl
These functions respectively extract the shop ids and build the detail URLs. The thing to watch during extraction is deduplication, handled here with the simple, brute-force set data structure.
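A tiny illustration with made-up ids: merging per-page id sets into one deduplicated set is a single union call:

# made-up per-page id sets from three search pages
page_ids = [{101, 102, 103}, {102, 104}, {103, 105}]
all_ids = set().union(*page_ids)  # cross-page duplicates collapse automatically
print(all_ids)  # {101, 102, 103, 104, 105} (set order may vary)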
2.3 With extraction done, the data is reloaded with pandas for the final analysis, as follows:
def food_table(foodlists):
    # one tuple per dish: (restaurant_id, name, price, month_sales, date, weekday)
    foods = {(y['specfoods'][0]['restaurant_id'], y['name'], y['specfoods'][0]['price'],
              y['month_sales'], date.today().strftime('%Y-%m-%d'), date.today().strftime('%A'))
             for foodlist in foodlists for x in foodlist for y in x['foods']}
    return foods

def shop_table(shoplist):
    # one tuple per shop: (id, name, distance, delivery_fee, min_order, rating, rating_count)
    shop_detail = {(shop['id'], shop['name'], shop['distance'], shop['float_delivery_fee'],
                    shop['float_minimum_order_amount'], shop['rating'], shop['rating_count'])
                   for shop in shoplist}
    return shop_detail
These build the food-detail table and the shop-detail table, respectively.
2.4 The last step is the analysis itself, done with pandas. Here I use each shop's total monthly sales as a simple metric:
def join_table(shoptable, foodtable):
    shoptable = pd.DataFrame(list(shoptable), columns=['id', 'name', 'distance', 'delivery_fee',
                                                       'minimum_order_amount', 'rating', 'rating_count'])
    foodtable = pd.DataFrame(list(foodtable), columns=['id', 'fname', 'price', 'msale', 'date', 'weekday'])
    new = pd.merge(shoptable, foodtable, on='id')  # SQL-style join on the shop id
    new['total'] = new['msale'] * new['price']     # monthly revenue per dish
    group = new.groupby(['name', 'id'])
    return new, group.sum()
This step uses pandas in place of SQL. You could also write everything into MySQL first and process it there:
connect = create_engine('mysql+pymysql://root:12345678@localhost:3306/waimai?charset=utf8')
pd.io.sql.to_sql(frame=detail, name=k, con=connect, if_exists='append')
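If you take the MySQL route, pandas can later pull a keyword's table straight back into a DataFrame; a minimal sketch (the table name 'sushi' is just an example):

import pandas as pd
from sqlalchemy import create_engine

connect = create_engine('mysql+pymysql://root:12345678@localhost:3306/waimai?charset=utf8')
df = pd.read_sql('sushi', con=connect)  # 'sushi' is a hypothetical table name
print(df.head())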
Finally, the main coroutine glues the steps together:

async def main(name):
    pool = Pool(8)
    # fire all search-page requests concurrently
    htasks = [asyncio.ensure_future(gethtml(url)) for url in name]
    htmls = await asyncio.gather(*htasks)
    # merge the id sets from every page (the original used only ids[0], i.e. the first page);
    # skip pages whose request failed and returned None
    ids = set().union(*(getshopid(html) for html in htmls if html))
    restaurant_url, food_url = geturl(ids)
    print('async crawl...')
    shoptasks = [asyncio.ensure_future(gethtml(url)) for url in restaurant_url]
    foodtasks = [asyncio.ensure_future(gethtml(url)) for url in food_url]
    fdone, fpending = await asyncio.wait(foodtasks)
    sdone, spending = await asyncio.wait(shoptasks)
    shoplist = [task.result() for task in sdone]
    foodlist = [task.result() for task in fdone]
    print('distribute parse...')
    # hand the CPU-bound parsing to a process pool
    sparse_jobs = [pool.apply_async(shop_table, args=(shoplist,))]
    fparse_jobs = [pool.apply_async(food_table, args=(foodlist,))]
    shoptable = [x.get() for x in sparse_jobs][0]
    foodtable = [x.get() for x in fparse_jobs][0]
    new, result = join_table(shoptable, foodtable)
    return new, result
while len(lists) > 0:
    for k, v in list(lists.items()):
        try:
            loop = asyncio.get_event_loop()
            tasks = asyncio.ensure_future(main(v))
            loop.run_until_complete(tasks)
            detail, totals = tasks.result()
            lists.pop(k)  # only drop a keyword once it has crawled successfully
            print('done:{}'.format(k))
        except KeyError:
            # a missing field aborts this keyword; it stays in lists and is retried
            print('fail:{}'.format(k))
        else:
            connect = create_engine('mysql+pymysql://root:12345678@localhost:3306/waimai?charset=utf8')
            pd.io.sql.to_sql(frame=detail, name=k, con=connect, if_exists='append')
Because the crawl is async, it has to run inside an event loop. The lists dict holds the search pages for whatever areas you want to cover; a few example lists are given after the sketch below:
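On Python 3.7+ the explicit loop plumbing can be collapsed into asyncio.run, which drives the coroutine and returns its result directly; the equivalent call inside the loop body would be:

# asyncio.run replaces get_event_loop / ensure_future / run_until_complete
detail, totals = asyncio.run(main(v))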
wuyisquare=['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E5%92%96%E5%95%A1&latitude=28.19652&limit=100&longitude=112.977361&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
sushi = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E5%AF%BF%E5%8F%B8&latitude=28.111704&limit=100&longitude=113.011304&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
yangqi = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E8%8C%B6&latitude=28.23188&limit=100&longitude=112.871522&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
tea = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E5%92%96%E5%95%A1&latitude=28.09515&limit=100&longitude=113.012001&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
fen = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E7%AD%92%E5%AD%90%E9%AA%A8%E7%B2%89&latitude=28.111704&limit=100&longitude=113.011304&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
gaosheng = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E7%B2%89&latitude=28.09515&limit=100&longitude=113.012001&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
fangcun = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E6%96%B9%E5%AF%B8%E5%AF%BF%E5%8F%B8&latitude=28.23188&limit=100&longitude=112.871522&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
luoyide = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E7%BD%97%E4%B9%89%E5%BE%B7&latitude=28.23188&limit=100&longitude=112.871522&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
lists = {'sushi': sushi, 'tea': tea, 'fen': fen, 'gaosheng': gaosheng, 'luoyide': luoyide, 'fangcun': fangcun}
To search your own area, just swap the keyword, latitude, and longitude in the URL; coordinates are easy to get from any map API (no free advertising here).
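For convenience, a small helper (my own addition, not in the original script) can generate such a page list for any keyword and coordinates:

from urllib.parse import quote

def search_urls(keyword, lat, lng, offsets=range(0, 120, 24)):
    # hypothetical helper: one search URL per page offset for a keyword/location
    base = ('https://www.ele.me/restapi/shopping/restaurants/search'
            '?extras%5B%5D=activity&keyword={kw}&latitude={lat}'
            '&limit=100&longitude={lng}&offset={off}&terminal=web')
    return [base.format(kw=quote(keyword), lat=lat, lng=lng, off=x) for x in offsets]

sushi = search_urls('寿司', 28.111704, 113.011304)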
This crawler exercises async requests, set-based deduplication, and syncing pandas data into a database: all basics, good for practice. As for the data's value, dig into it yourself; it's kind of fun. For example, how monthly sales relate to various dimensions: scatter plots, bar charts, calendar heatmaps:
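As one minimal sketch (assuming the new DataFrame returned by join_table), monthly revenue against shop rating with matplotlib:

import matplotlib.pyplot as plt

# hypothetical example: total monthly revenue per shop vs. its rating,
# using the `new` DataFrame returned by join_table
per_shop = new.groupby(['id', 'rating'], as_index=False)['total'].sum()
plt.scatter(per_shop['rating'], per_shop['total'])
plt.xlabel('rating')
plt.ylabel('monthly revenue')
plt.title('monthly revenue vs. shop rating')
plt.show()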
Next up I'll play with WeChat and QQ bots, stay tuned~~~~~
These posts are written as references for newcomers; stars and likes are very welcome. And a quick plug while I'm at it: my face-score calculator mini program, source here.