Python 3: Scraping Taobao Reviews

1. Define the target

  • Scrape the review details of a Taobao product

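2. Analyze how the page loads

  • Is the target data in the page source (right-click -> View Page Source)?
    • No
  • The target data is visible through Inspect Element (right-click -> Inspect, or F12)
    • F12 -> Network -> press F5 to reload and record the page activity -> click the reviews tab on the page -> Name -> use Preview to find the review data (entries whose Type is script) -> Headers, which shows the URL that returns the review data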

[Screenshot: reloading the page with F12 and F5]

3. Analyze the request for the target data: examine its parameters and build the URL yourself

1. Find the URL:

https://rate.tmall.com/list_detail_rate.htm?itemId=539137284584&spuId=701871908&sellerId=929347050&order=3&currentPage=1&append=0&content=1&tagId=&posi=&picture=&ua=098%23E1hvLpvWvRQvUvCkvvvvvjiPPFSptjlbPLsy6jYHPmP96jrWn2s9ljiEPFMyQjrURphvCvvvvvmCvpvW7D%2BnMq5w7Di4OzbNdphvHmQhsUE8o9v9BmeS8kH2mOcEmfwGiQhvCvvv9UUPvpvhvv2MMQhCvvOvUvvvphmivpvUvvmv%2BJZCZ94EvpvVmvvC9jxvKphv8vvvvvCvpvvvvvmmH6CvvHIvvUUdphvWvvvv9krvpv3Fvvmm86CvmVWEvpCWCh%2BMvvaw1WCl%2Bb8rwZHlYhzBRfpKofkXAf00Io3EAp0YyfUZEcqh1j7yHdUfbcc6D76fde%2BRfwLvaB46NZ59QnkQRqwiLO2vqU0QKLyCvvpvvvvv3QhvCvvhvvv%3D&isg=BBwcrmBIqyRNj10slC4flSrd7ToOPcHVm6szQvYdFofqQb3LHqQ2T4ezpam5SfgX&needFold=0&_ksTS=1527496615091_664&callback=jsonp665
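To see at a glance which parameters the request carries, the query string can be split with the standard library. This is only a minimal sketch: it uses a shortened copy of the URL above, and the long ua and isg values are swapped for placeholder strings (UA_PLACEHOLDER and ISG_PLACEHOLDER are made up for readability, as are the variable names).

```python
from urllib.parse import urlsplit, parse_qs

# Shortened copy of the captured URL; the "ua" and "isg" values are placeholders.
captured = (
    "https://rate.tmall.com/list_detail_rate.htm"
    "?itemId=539137284584&spuId=701871908&sellerId=929347050"
    "&order=3&currentPage=1&append=0&content=1&tagId=&posi=&picture="
    "&ua=UA_PLACEHOLDER&isg=ISG_PLACEHOLDER&needFold=0"
    "&_ksTS=1527496615091_664&callback=jsonp665"
)

# keep_blank_values=True keeps empty parameters such as tagId= in the result.
query = parse_qs(urlsplit(captured).query, keep_blank_values=True)
for name, values in query.items():
    print(name, "=", values[0])
```

Listing the parameters this way makes it easier to spot the ones that change between requests.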

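2. Analyze the URL

    • currentPage: the current page number, updated dynamically (the total page count can be read from the response when the URL is requested with requests; here it is 99 pages)
    • _ksTS: a timestamp, updated dynamically, so it has to be set by hand and passed in as a dynamic parameter (using time.time())
    • callback: the callback function name; from observation, callback=jsonp665 (665 = 664 + 1, where 664 comes from 1527496615091_664)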
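The three dynamic parameters above can be generated roughly like this. It is a sketch that follows the pattern observed in this post; the helper name build_dynamic_params and its now argument are made up here so that the example is reproducible. Note that it splits str(time.time()) at the decimal point, so the first half of _ksTS is in seconds rather than the milliseconds seen in the captured URL.

```python
import time

def build_dynamic_params(page, now=None):
    """Return the three query parameters that change on every request."""
    # str(time.time()) looks like "1527496615.664"; split it at the dot.
    seconds, fraction = str(time.time() if now is None else now).split(".")
    return {
        "currentPage": page,
        "_ksTS": "%s_%s" % (seconds, fraction),
        # Observed pattern: the jsonp suffix is the _ksTS suffix plus one.
        "callback": "jsonp%d" % (int(fraction) + 1),
    }

# A fixed timestamp is passed in only to make the output repeatable.
print(build_dynamic_params(1, now=1527496615.664))
```

In real use, now is left out so that every request gets a fresh timestamp.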

3. Build the URL by passing the parameters (here named pagram) to requests.get()

    • Fetch the data
      • response = requests.get(url, params=pagram)
      • data = response.text
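When the base URL already has a query string, requests appends the params entries to it. The sketch below approximates the resulting URL with the standard library only, so it runs without sending a request. The helper preview_url and the shortened base URL are illustrative and not part of the original code.

```python
from urllib.parse import urlencode

def preview_url(base_url, params):
    """Approximate the URL that requests.get(base_url, params=params) would request."""
    # Append with "&" when the base URL already has a query string.
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + urlencode(params)

# Shortened base URL for illustration; the real one carries more parameters.
base = "https://rate.tmall.com/list_detail_rate.htm?itemId=539137284584&order=3"
pagram = {"currentPage": 2, "_ksTS": "1527496615_664", "callback": "jsonp665"}
print(preview_url(base, pagram))
```

This is why the base URL in the full script omits currentPage, _ksTS and callback: they are supplied on each request through params.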

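4. Write to storage

    • Parse the data: use a regular expression to extract the needed part of data
    • Convert the JSON data into a dictionary
    • Pick out the needed fields from data
    • Write the needed fields to storage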
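The parsing steps above can be tried offline on a sample string. The response body below is hypothetical: it only imitates the jsonpNNN(...) wrapper and the three fields used in this post, and a real response contains many more fields.

```python
import json
import re

# Hypothetical response body, shaped like the fields used in this post.
sample = (
    'jsonp665({"rateDetail":{"rateList":[{"rateContent":"Nice, fast delivery",'
    '"reply":"","displayUserNick":"t***1"}]}})'
)

# Greedy match from the first "{" to the last "}" drops the jsonpNNN( ... ) wrapper.
payload = re.findall(r"\{.*\}", sample, re.S)[0]

# JSON text -> Python dict, then pick out the list of reviews.
rate_list = json.loads(payload)["rateDetail"]["rateList"]
for item in rate_list:
    print(item["displayUserNick"], item["rateContent"])
```

The re.S flag is added here so that the match would still work if the body spanned several lines.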

5. The saved CSV file can be opened in Excel
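As an alternative to replacing commas by hand, the standard csv module quotes any field that contains a comma, and Excel reads such fields back correctly. This is a sketch of that swapped-in technique rather than the approach used in this post. The sample row is made up, and io.StringIO stands in for a real file opened with open('votes.csv', 'w', encoding='gbk', newline='').

```python
import csv
import io

# Made-up sample row with the three fields used in this post.
rows = [
    {"rateContent": "Nice, fast delivery", "reply": "", "displayUserNick": "t***1"},
]

buffer = io.StringIO()  # stands in for a file opened with newline=''
writer = csv.writer(buffer)
writer.writerow(["content", "reply", "nickname"])
for item in rows:
    # Fields containing commas are quoted automatically.
    writer.writerow([item["rateContent"], item["reply"], item["displayUserNick"]])

print(buffer.getvalue())
```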

The code is pasted below:

# -*- coding:utf-8 -*-

# A fixed value such as _ksTS=1526545121518_1881 goes stale,
# so the timestamp is passed in dynamically (hence the time module)
# callback=jsonp1882
import requests
import time
import random
import re
import json

url = 'https://rate.tmall.com/list_detail_rate.htm?itemId=539137284584&spuId=701871908&sellerId=929347050&order=3&append=0&content=1&tagId=&posi=&picture=&ua=098%23E1hvsvvLvZIvUpCkvvvvvjiPPFdZ6jtPPLqOzjivPmPh1jDRRFchAjYbPsMh6jYWR46Cvvyv2vZjwchvJCurvpvEvvkUCgR2vV2LdphvmpvhOQb3gpCU4UhCvCLwMCHJGaMwznAY8xS50YAizRl4k46CvvyvCWgmYNZvECojvpvhvvpvvvGCvvpvvPMMuphvmvvv9bhrvjKCkphvC99vvOClpbyCvm9vvvvvphvvvvvv9F1vpvkjvvmmZhCv2CUvvUEpphvWwpvv9DCvpv11mphvLvp%2F6QvjWz7%2BkU97%2B3%2BraNBraB4AVAElYWmQrEt1pwLU%2BnezrmphQRAn3feAOHcIAXcBKFyK2ixrQj7Jymx%2F1j7QiXTAVArlMR29VEQCvpvVvvpvvhCvRphvCvvvvvm5vpvhvvmv9u6CvvyvCV4mRLyvVbervpvEvvBxvkgKv2kqRphvCvvvvvmCvpvZz2sm4VdNznswvCDfY0IwXaAv7Ihtvpvhvvvvvv%3D%3D&isg=BBgYp5ys9ga0jdox7XxaDMe26UbZGXLdB_e3zlII19NS7bvX-hKvGsuvISVdfTRj&needFold=0'

# Send the HTTP requests
# t = time.time() returns the current timestamp
# The CSV file can be opened in Excel
# GBK encoding is used so that Excel on a Chinese-locale system shows the text correctly
with open('votes.csv', 'w', encoding='gbk') as f:
    f.write('Review content,Seller reply,Nickname\n')
    for i in range(99):
        t = str(time.time()).split('.')

        # Build the URL: these are the parameters of the GET request
        pagram = {
            'currentPage': i + 1,
            '_ksTS': '%s_%s' % (t[0], t[1]),
            'callback': 'jsonp%s' % (int(t[1]) + 1)
        }

        # Sleep for a random interval so that requests are not sent too fast
        # and the site's behavior analysis is less likely to flag them
        time.sleep(random.random())

        response = requests.get(url, params=pagram)
        # Persist the data: database or file
        # CSV file: fields are separated by ','
        data = response.text

        # Parse the data: keep only the {...} part of the JSONP response
        data = re.findall(r'{.*}', data)[0]
        # The json module converts between JSON text and dicts

        # JSON text -> dict
        data = json.loads(data)
        data = data['rateDetail']['rateList']
        print(data)
        for item in data:
            # Replace ASCII commas with full-width ones so that they do not
            # split a field, and end each row with a newline
            f.write('%s,%s,%s\n' % (
                item['rateContent'].replace(',', ','),
                item['reply'].replace(',', ','),
                item['displayUserNick']))