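Crawl every post, and the comments on each post, from the Weibo account 「深圳移動」 (Shenzhen Mobile).
Language: Python 2.7
Library: requests (import requests)
Weibo accounts: several, purchased online
IP proxy: a dynamic-IP proxy server rented online
User-agents: several, collected by searching online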
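2. The mobile Weibo site shows no pagination; the timeline keeps loading as you scroll down (1671 pages in total). The underlying JSON data, however, is still served page by page.
https://m.weibo.cn/api/container/getIndex?type=uid&value=1922826034&containerid=1076031922826034&page=2
Changing the value after page= fetches the JSON data for each page of the mobile timeline.
3. From the JSON page above, take the idstr field, which is the post id.
The mobile page for a single post is available at https://m.weibo.cn/status/4177994643916324.
Format: https://m.weibo.cn/status/【id】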
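4. The address https://m.weibo.cn/api/comments/show?id=4131150395559419&page=1
returns the comments on one post as JSON; id is the post id and page is the comment page number.
Format: https://m.weibo.cn/api/comments/show?id=【id】&page=【page_num】
If the first line of the response has ok=1, the post has comments; if it has ok=0, the post has no comments.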
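The three URL formats above can be sketched as small helpers. This is only an illustration: the function names are hypothetical, and the rule that containerid is "107603" followed by the uid is an assumption read off the single example URL, not something the source states.

```python
# Hypothetical helpers; only the URL patterns themselves come from the text.
UID = "1922826034"  # uid of the account used in the example URLs


def timeline_url(page, uid=UID):
    """JSON for one page of the account's timeline.

    Assumption: containerid is "107603" + uid, as in the example URL.
    """
    return ("https://m.weibo.cn/api/container/getIndex"
            "?type=uid&value={0}&containerid=107603{0}&page={1}".format(uid, page))


def status_url(weibo_id):
    """Mobile page of a single post."""
    return "https://m.weibo.cn/status/{0}".format(weibo_id)


def comments_url(weibo_id, page):
    """JSON for one page of comments on a post."""
    return "https://m.weibo.cn/api/comments/show?id={0}&page={1}".format(weibo_id, page)


def has_comments(payload):
    """True when the decoded comment JSON reports ok=1."""
    return payload.get("ok") == 1
```

Looping page from 1 upward and stopping at the first response where has_comments() is false would walk through all comment pages of one post.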
1. Set the user-agent, cookies, and headers.
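Collect a large number of user-agent strings online, buy several Weibo accounts on Taobao, and obtain the cookie of each account.
random.choice() returns a random element of a list on each call. This avoids visiting Weibo with the same cookie or the same user-agent many times in a short period, which would get that cookie or user-agent banned.
2. Fetch the JSON data for each timeline page and extract the idstr field to get the id of every post.
time.sleep(random.randint(1,4)) makes the pause a random number of seconds rather than a fixed value.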
3. In the same way, fetch the comment data from the comment JSON pages.
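The steps above can be sketched as follows. This is a minimal sketch, not the crawler itself: AGENTS and COOKIES are placeholder pools, the function names are hypothetical, and the top-level 'cards' key is taken from the way the crawl script reads the timeline JSON.

```python
import random

# Placeholder pools; real values would be the collected user-agents and account cookies.
AGENTS = ["UA-one", "UA-two", "UA-three"]
COOKIES = ["cookie-one", "cookie-two"]


def pick_headers(agents=AGENTS, cookies=COOKIES):
    """Build request headers with a randomly chosen user-agent and cookie."""
    return {
        "User-Agent": random.choice(agents),
        "Cookie": random.choice(cookies),
        "Host": "m.weibo.cn",
        "Accept": "application/json, text/plain, */*",
    }


def extract_ids(payload):
    """Collect idstr from every card in a timeline page that carries an mblog entry."""
    ids = []
    for card in payload.get("cards") or []:
        mblog = card.get("mblog")
        if mblog and mblog.get("idstr"):
            ids.append(mblog["idstr"])
    return ids


def random_pause(low=1, high=4):
    """Number of seconds to sleep before the next request (inclusive bounds)."""
    return random.randint(low, high)
```

Calling pick_headers() before every request, rather than once at start-up, is what spreads the traffic across the pools; time.sleep(random_pause()) then spaces the requests out.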
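1. After the crawler has run for a while, it fails with the error "No JSON object could be decoded". Debugging shows that the page source can no longer be fetched and the response is a 404. The cause is that a user-agent has been used too many times and banned, which happens mainly when all requests come from a single IP address. Here I ran the crawler on a server with a dynamic IP address, so there was no need to configure proxy IPs in the crawler itself; configuring proxy IPs works much like choosing a user-agent with random.choice(). Note that even with a dynamic IP, a user-agent can still be banned.
On how an anti-crawler setup blocks specific user-agents from a site:
Source: "Nginx anti-crawler guide: blocking certain User Agents from crawling a site" (《Nginx反爬蟲攻略:禁止某些User Agent抓取網站》)
2. When a large amount of data is crawled, code is needed that automatically refreshes the cookies of the Weibo accounts.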
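For the first problem, a rough sketch of the two ideas mentioned there: choosing a proxy with random.choice(), and checking the response before decoding it so that a 404 does not surface as a JSON decoding error. The proxy addresses and function names are hypothetical; the dictionary shape matches what the proxies argument of requests.get accepts.

```python
import json
import random

# Hypothetical proxy addresses, for illustration only.
PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]


def pick_proxy(pool=PROXIES):
    """Return a requests-style proxies mapping with one randomly chosen proxy."""
    proxy = random.choice(pool)
    return {"http": proxy, "https": proxy}


def parse_body(status_code, text):
    """Decode a response body, returning None instead of raising on a bad response."""
    if status_code != 200:
        return None  # e.g. the 404 seen once a user-agent is banned
    try:
        return json.loads(text)
    except ValueError:
        return None  # body was not JSON
```

With requests this would be used as resp = requests.get(url, headers=headers, proxies=pick_proxy(), timeout=5) followed by parse_body(resp.status_code, resp.text); a None result is the signal to switch user-agent or cookie and retry.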
A reference article that contributed greatly to this crawl: "Python Weibo crawler (3): fetching Weibo comment data"
http://blog.csdn.net/FlySky1991/article/details/76924443
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Collect the id of every post on the timeline, page by page (Python 2.7).
import sys
import time
import random

import requests

import crawler.user_agents as ua
from crawler import cookies as ck

# Python 2 only: make utf-8 the default codec for implicit str/unicode conversion.
reload(sys)
sys.setdefaultencoding('utf8')


def writeintxt(rows, filename):
    """Append one "page,id" line per collected post."""
    with open(filename, 'a') as output:
        for row in rows:
            output.write(str(row[0]) + ',' + str(row[1]) + '\n')


cookies = random.choice(ck.cookies)
user_agent = random.choice(ua.agents)
headers = {
    'User-agent': user_agent,
    'Host': 'm.weibo.cn',
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch, br',
    'Referer': 'https://m.weibo.cn/u/1922826034',
    'Cookie': cookies,
    'Connection': 'keep-alive',
}

id_list = []
base_url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=1922826034&containerid=1076031922826034&page='
for i in range(0, 1672):
    try:
        url = base_url + str(i)
        resp = requests.get(url, headers=headers, timeout=5)
        jsondata = resp.json()

        for d in jsondata.get('cards') or []:
            mblog = d.get('mblog')
            if not mblog:
                # Skip a card without a post instead of losing the whole page.
                continue
            id_list.append([i, mblog.get('idstr')])
        time.sleep(random.randint(1, 4))
    except Exception:
        # Report the page that failed and move on to the next one.
        print(i)
        print('*' * 100)
print("ok")

writeintxt(id_list, 'weibo_id')
```
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Fetch every page of comments for each collected post id (Python 2.7).
import sys
import time
import random

import requests

import crawler.user_agents as ua
from crawler import cookies as ck

# Python 2 only: make utf-8 the default codec for implicit str/unicode conversion.
reload(sys)
sys.setdefaultencoding('utf8')

DATA_DIR = u'D:/MattDoc/實習/1124爬取深圳移動新浪微博/網頁/'


def readfromtxt(filename):
    with open(DATA_DIR + filename, 'r') as f:
        return f.read()


def writeintxt(comments_by_id, filename):
    """Append one line per post: id####record####record####..."""
    with open(DATA_DIR + filename, 'a+') as output:
        for weibo_id, records in comments_by_id.items():
            comment_str = ''
            for record in records:
                comment_str = comment_str + str(record) + '####'
            output.write(weibo_id + '####' + comment_str + '\n')


user_agent = random.choice(ua.agents)
cookies = random.choice(ck.cookies)
headers = {
    'User-agent': user_agent,
    'Host': 'm.weibo.cn',
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch, br',
    'Referer': 'https://m.weibo.cn/u/1922826034',
    'Cookie': cookies,
    'Connection': 'keep-alive',
}

base_url = 'https://m.weibo.cn/api/comments/show?id='
weibo_id_list = readfromtxt('weibo_id1.txt').split('\n')
result_dict = {}
for line in weibo_id_list:
    # Accept both "page,id" lines and bare ids, and skip blank lines.
    weibo_id = line.split(',')[-1].strip()
    if not weibo_id:
        continue
    try:
        record_list = []
        page = 1
        while True:
            url = base_url + weibo_id + '&page=' + str(page)
            resp = requests.get(url, headers=headers, timeout=100)
            jsondata = resp.json()
            if jsondata.get('ok') != 1:
                break  # ok=0: stop paging for this post
            page += 1
            for d in jsondata.get('data'):
                # '$$' separates the fields, so strip it from free text.
                comment = d.get('text').replace('$$', '')
                like_count = d.get('like_counts')
                user_id = d.get('user').get('id')
                user_name = d.get('user').get('screen_name').replace('$$', '')
                one_record = (str(user_id) + '$$' + str(like_count) + '$$' +
                              str(user_name) + '$$' + str(comment))
                record_list.append(one_record)

        result_dict[weibo_id] = record_list
        time.sleep(random.randint(2, 3))
    except Exception:
        # Report the post that failed and move on to the next one.
        print(weibo_id)
        print('*' * 100)
print("ok")

writeintxt(result_dict, 'comment1.txt')
```
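The comment script stores one post per line, with "####" after the post id and after every record, and "$$" between the four fields of a record. A sketch of reading such a line back, assuming that layout; parse_comment_line is a hypothetical helper, not part of the original scripts.

```python
def parse_comment_line(line):
    """Split one output line into (weibo_id, list of comment dicts).

    Assumed layout: id####uid$$likes$$name$$text####uid$$likes$$name$$text####
    """
    parts = line.rstrip("\n").split("####")
    weibo_id = parts[0]
    records = []
    for raw in parts[1:]:
        if not raw:
            continue  # trailing separator leaves an empty tail
        user_id, like_count, user_name, text = raw.split("$$", 3)
        records.append({
            "user_id": user_id,
            "like_count": like_count,
            "user_name": user_name,
            "text": text,
        })
    return weibo_id, records
```

One caveat of this layout: the crawler strips "$$" from names and comments but not "####", so a comment that happens to contain "####" would be split incorrectly when read back.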
```python
# encoding=utf-8
""" User-Agents """
agents = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
]
```
```python
# encoding=utf-8
""" cookies """
cookies = [
    "SINAGLOBAL=6061592354656.324.1489207743838; un=18240343109; TC-V5-G0=52dad2141fc02c292fc30606953e43ef; wb_cusLike_2140170130=N; _s_tentry=login.sina.com.cn; Apache=5393750164131.485.1511882292296; ULV=1511882292314:55:14:7:5393750164131.485.1511882292296:1511789163477; TC-Page-G0=1e758cd0025b6b0d876f76c087f85f2c; TC-Ugrow-G0=e66b2e50a7e7f417f6cc12eec600f517; login_sid_t=7cbd20d7f5c121ef83f50e3b28a77ed7; cross_origin_proto=SSL; WBStorage=82ca67f06fa80da0|undefined; UOR=,,login.sina.com.cn; WBtopGlobal_register_version=573631b425a602e8; crossidccode=CODE-tc-1EjHEO-2SNIe8-y00Hd0Yq79mGw3l1975ae; SSOLoginState=1511882345; SCF=AvFiX3-W7ubLmZwXrMhoZgCv_3ZXikK7fhjlPKRLjog0OIIQzSqq7xsdv-_GhEe8XWdkHikzsFJyqtvqej6OkaM.; SUB=_2A253GQ45DeThGeRP71IQ9y7NyDyIHXVUb3jxrDV8PUNbmtAKLWrSkW9NTjfYoWTfrO0PkXSICRzowbfjExbQidve; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9WFaVAdSwLmvOo1VRiSlRa3q5JpX5KzhUgL.FozpSh5pS05pe052dJLoIfMLxKBLBonL122LxKnLB.qL1-z_i--fiKyFi-2Xi--fi-2fiKyFTCH8SFHF1C-4eFH81FHWSE-RebH8SE-4BC-RSFH8SFHFBbHWeEH8SEHWeF-RegUDMJ7t; SUHB=04W-u1HCo6armH; ALF=1543418344; wvr=6",
    "SINAGLOBAL=6061592354656.324.1489207743838; TC-V5-G0=52dad2141fc02c292fc30606953e43ef; wb_cusLike_2140170130=N; _s_tentry=login.sina.com.cn; Apache=5393750164131.485.1511882292296; ULV=1511882292314:55:14:7:5393750164131.485.1511882292296:1511789163477; TC-Page-G0=1e758cd0025b6b0d876f76c087f85f2c; TC-Ugrow-G0=e66b2e50a7e7f417f6cc12eec600f517; login_sid_t=7cbd20d7f5c121ef83f50e3b28a77ed7; WBStorage=82ca67f06fa80da0|undefined; WBtopGlobal_register_version=573631b425a602e8; crossidccode=CODE-tc-1EjHEO-2SNIe8-y00Hd0Yq79mGw3l1975ae; cross_origin_proto=SSL; UOR=,,login.sina.com.cn; SSOLoginState=1511882443; SCF=AvFiX3-W7ubLmZwXrMhoZgCv_3ZXikK7fhjlPKRLjog0-14gBQox9IhSK8vZVaZYWsLxUaOWNkudAR9iT6NFJkg.; SUB=_2A253GQ6bDeRhGeNH6FsZ8CjLzj2IHXVUb2dTrDV8PUNbmtAKLWTjkW9NSqHIBUvGapKd6-MQhJTejk3w_ivUUNXZ; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9W5gYdHWIHRmedh9Nyrij6XN5JpX5K2hUgL.Fo-4e0.RehqNSK22dJLoI0.LxK-L122LB.qLxK-LB.BLBKqLxKMLB.2LBKzLxKnL12-L122LxK.LBK2L12qLxKqLBKqL1KHiqc-t; SUHB=0auwlDzUYulNGs; ALF=1543418442; un=13728408992; wvr=6",
    "SINAGLOBAL=6061592354656.324.1489207743838; TC-V5-G0=52dad2141fc02c292fc30606953e43ef; wb_cusLike_2140170130=N; _s_tentry=login.sina.com.cn; Apache=5393750164131.485.1511882292296; ULV=1511882292314:55:14:7:5393750164131.485.1511882292296:1511789163477; TC-Page-G0=1e758cd0025b6b0d876f76c087f85f2c; TC-Ugrow-G0=e66b2e50a7e7f417f6cc12eec600f517; login_sid_t=7cbd20d7f5c121ef83f50e3b28a77ed7; WBStorage=82ca67f06fa80da0|undefined; WBtopGlobal_register_version=573631b425a602e8; crossidccode=CODE-tc-1EjHEO-2SNIe8-y00Hd0Yq79mGw3l1975ae; wb_cusLike_5939806751=N; cross_origin_proto=SSL; UOR=,,login.sina.com.cn; SSOLoginState=1511882512; SCF=AvFiX3-W7ubLmZwXrMhoZgCv_3ZXikK7fhjlPKRLjog089iFKjxeT1Oc6cbJkkqgWrnQAuMVukRrJy3898cKIb8.; SUB=_2A253GQ9ADeRhGeNH6FsZ8ynJzz6IHXVUb2eIrDV8PUNbmtAKLVWhkW9NSqG4DzNeLkyPCmJIKq6bXfKXpSRCPLqO; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9W50J-rDh2D6-QEqNOZ2NddF5JpX5K2hUgL.Fo-4e0.Re0MfShz2dJLoIEeLxK-LB--L1KeLxK-L1hqLBoMLxKnL1K5LBo8IC281xEfIg5tt; SUHB=0gHiPrbPWNJvao; ALF=1543418511; un=15614187608; wvr=6",
    "SINAGLOBAL=6061592354656.324.1489207743838; TC-V5-G0=52dad2141fc02c292fc30606953e43ef; wb_cusLike_2140170130=N; _s_tentry=login.sina.com.cn; Apache=5393750164131.485.1511882292296; ULV=1511882292314:55:14:7:5393750164131.485.1511882292296:1511789163477; TC-Page-G0=1e758cd0025b6b0d876f76c087f85f2c; TC-Ugrow-G0=e66b2e50a7e7f417f6cc12eec600f517; login_sid_t=7cbd20d7f5c121ef83f50e3b28a77ed7; WBStorage=82ca67f06fa80da0|undefined; WBtopGlobal_register_version=573631b425a602e8; crossidccode=CODE-tc-1EjHEO-2SNIe8-y00Hd0Yq79mGw3l1975ae; wb_cusLike_5939806751=N; wb_cusLike_5939837542=N; cross_origin_proto=SSL; UOR=,,login.sina.com.cn; SSOLoginState=1511882567; SCF=AvFiX3-W7ubLmZwXrMhoZgCv_3ZXikK7fhjlPKRLjog02c5hBW41ia6vpj1cAqbFzE2KCcsXvDxToS_KOeUnwRc.; SUB=_2A253GQ8XDeRhGeNH6FsZ9CjKyjuIHXVUb2ffrDV8PUNbmtAKLU7wkW9NSqGOexL53l1CujvuLpAFNeOEsl05T_5E; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9WWuISqBnuGqpyxGiWdJ4bOv5JpX5K2hUgL.Fo-4e0.RShqceKM2dJLoI0YLxK-L1K5L1K2LxK.L1KnLBoeLxK-L1K5L1K2LxKqL1-2L1KqLxK.L1KMLBo-LxKMLB.zLB.qLxK-L1hML1-Bt; SUHB=0LcSwyK5XYMzbr; ALF=1543418566; un=13242833134; wvr=6",
]
```