Continuing from the previous post: this time we will crawl Weibo pages that require login.
The crawler's goal is to collect each user's tweet count, following count, and follower count, as a data store for building a user relationship graph (not implemented yet).
Install the third-party libraries requests and pymongo.
Install MongoDB.
Create a weibo crawler project.
How to create a Scrapy project has been covered in earlier posts, so let's get straight to the point.
For the Item data I only need the basics: profile details plus the tweet count, following count, and follower count.
from scrapy import Item, Field


class ProfileItem(Item):
    """Tweet count, following count, follower count and profile details of an account"""
    _id = Field()
    nick_name = Field()
    profile_pic = Field()
    tweet_stats = Field()
    following_stats = Field()
    follower_stats = Field()
    sex = Field()
    location = Field()
    birthday = Field()
    bio = Field()


class FollowingItem(Item):
    """Accounts this user follows"""
    _id = Field()
    relationship = Field()


class FollowedItem(Item):
    """Accounts following this user (fans)"""
    _id = Field()
    relationship = Field()
To make crawling easier, we use the mobile version of Weibo, http://weibo.cn/, as the login entry point.
A user's Weibo uid can be obtained by visiting their profile page, or extracted from the href attribute of the links on the follow/fans lists.
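Both link formats can be reduced to a uid with a couple of regular expressions. A minimal sketch (the sample hrefs are made up for illustration; the spider below does the same thing inside relationship_parse):

import re

def extract_uid(href):
    """Pull the numeric uid out of hrefs like '...?uid=2634877355&...' or '/2634877355/info'."""
    match = re.search(r'uid=(\d+)', href) or re.search(r'/(\d+)', href)
    return match.group(1) if match else None

print(extract_uid('/some_page?uid=2634877355&rl=1'))  # -> 2634877355
print(extract_uid('/2634877355/info'))                # -> 2634877355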
import re

import scrapy
from scrapy.exceptions import CloseSpider

# Adjust the package name to your project layout.
from weibo.items import ProfileItem, FollowingItem, FollowedItem


class WeiboSpiderSpider(scrapy.Spider):
    name = "weibo_spider"
    allowed_domains = ["weibo.cn"]
    url = "http://weibo.cn/"
    start_urls = ['2634877355', ...]   # seed Weibo uids (crawl entry points)
    task_set = set(start_urls)         # uids waiting to be crawled
    tasked_set = set()                 # uids already crawled
    ...

    def start_requests(self):
        while len(self.task_set) > 0:
            _id = self.task_set.pop()
            if _id in self.tasked_set:
                raise CloseSpider(reason="already crawled %s" % _id)
            self.tasked_set.add(_id)
            info_url = self.url + _id
            info_item = ProfileItem()
            following_url = info_url + "/follow"
            following_item = FollowingItem()
            following_item["_id"] = _id
            following_item["relationship"] = []
            follower_url = info_url + "/fans"
            follower_item = FollowedItem()
            follower_item["_id"] = _id
            follower_item["relationship"] = []
            yield scrapy.Request(info_url, meta={"item": info_item},
                                 callback=self.account_parse)
            yield scrapy.Request(following_url, meta={"item": following_item},
                                 callback=self.relationship_parse)
            yield scrapy.Request(follower_url, meta={"item": follower_item},
                                 callback=self.relationship_parse)

    def account_parse(self, response):
        item = response.meta["item"]
        sel = scrapy.Selector(response)
        profile_url = sel.xpath("//div[@class='ut']/a/@href").extract()[1]
        counts = sel.xpath("//div[@class='u']/div[@class='tip2']").extract_first()
        item['_id'] = re.findall(r'^/(\d+)/info', profile_url)[0]
        item['tweet_stats'] = re.findall(r'微博\[(\d+)\]', counts)[0]
        item['following_stats'] = re.findall(r'关注\[(\d+)\]', counts)[0]
        item['follower_stats'] = re.findall(r'粉丝\[(\d+)\]', counts)[0]
        # Few tweets, many followings and few followers: treat as a zombie fan.
        if int(item['tweet_stats']) < 4500 and int(item['following_stats']) > 1000 \
                and int(item['follower_stats']) < 500:
            raise CloseSpider("zombie fan")
        yield scrapy.Request("http://weibo.cn" + profile_url, meta={"item": item},
                             callback=self.profile_parse)

    def profile_parse(self, response):
        item = response.meta['item']
        sel = scrapy.Selector(response)
        info = sel.xpath("//div[@class='tip']/following-sibling::div[@class='c']").extract_first()
        item["profile_pic"] = sel.xpath("//div[@class='c']/img/@src").extract_first()
        item["nick_name"] = re.findall(u'昵称:(.*?)<br>', info)[0]
        item["sex"] = re.findall(u'性别:(.*?)<br>', info) and re.findall(u'性别:(.*?)<br>', info)[0] or ''
        item["location"] = re.findall(u'地区:(.*?)<br>', info) and re.findall(u'地区:(.*?)<br>', info)[0] or ''
        item["birthday"] = re.findall(u'生日:(.*?)<br>', info) and re.findall(u'生日:(.*?)<br>', info)[0] or ''
        item["bio"] = re.findall(u'简介:(.*?)<br>', info) and re.findall(u'简介:(.*?)<br>', info)[0] or ''
        yield item

    def relationship_parse(self, response):
        item = response.meta["item"]
        sel = scrapy.Selector(response)
        uids = sel.xpath("//table/tr/td[last()]/a[last()]/@href").extract()
        new_uids = []
        for uid in uids:
            if "uid" in uid:
                new_uids.append(re.findall(r'uid=(\d+)&', uid)[0])
            else:
                try:
                    new_uids.append(re.findall(r'/(\d+)', uid)[0])
                except IndexError:
                    print('--------', uid)
        item["relationship"].extend(new_uids)
        for i in new_uids:
            if i not in self.tasked_set:
                self.task_set.add(i)
        next_page = sel.xpath("//*[@id='pagelist']/form/div/a[text()='下页']/@href").extract_first()
        if next_page:
            yield scrapy.Request("http://weibo.cn" + next_page, meta={"item": item},
                                 callback=self.relationship_parse)
        else:
            yield item
A few points in the code are worth noting.
In start_urls we fill in Weibo uids; some users have a custom domain (as in the screenshot above), and you have to visit their page first to get the real uid.
The number of seed uids in start_urls should be more than 10, so that the new uids we discover while crawling can keep being added to the pending queue. The figure of 10 comes from the Scrapy documentation:
REACTOR_THREADPOOL_MAXSIZE
Default: 10
The maximum limit for the Twisted Reactor thread pool size.
When we hit a link that does not need to be crawled any further (a link we have already crawled, an account matching the zombie-fan rule, and so on), we raise CloseSpider to stop the crawl.
import random

# Import the cookie list produced by getCookies() below; adjust the module path
# to wherever you keep it in your project.
from weibo.cookies import cookies


class CookiesMiddleware(object):
    """Rotate cookies: attach a randomly chosen logged-in cookie to each request."""

    def process_request(self, request, spider):
        cookie = random.choice(cookies)
        request.cookies = cookie
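For the middleware to take effect it has to be registered in settings.py. A minimal sketch, assuming the class lives in weibo/middlewares.py (the module path and the priority number are assumptions, so adjust them to your project):

DOWNLOADER_MIDDLEWARES = {
    # Path and priority are assumptions -- point this at wherever CookiesMiddleware lives.
    'weibo.middlewares.CookiesMiddleware': 401,
}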
Originally I wanted to simulate login through the mobile version of Weibo, but the captcha was just too hard to deal with. So I went straight to login code for the web version of Weibo that someone had already written, SinaSpider; it is well written and worth a look if you are interested. Someone else wrote a simulated login that handles the captcha, which I tested and it works; I just have not figured out how to fit it into my project yet.
# encoding=utf-8
import json
import base64

import requests

# Fill in your own Weibo accounts here.
myWeiBo = [
    {'no': 'xx@sina.com', 'psw': 'xx'},
    {'no': 'xx@qq.com', 'psw': 'xx'},
]


def getCookies(weibo):
    """Log every account in and collect its cookies."""
    cookies = []
    loginURL = r'https://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.15)'
    for elem in weibo:
        account = elem['no']
        password = elem['psw']
        username = base64.b64encode(account.encode('utf-8')).decode('utf-8')
        postData = {
            "entry": "sso",
            "gateway": "1",
            "from": "null",
            "savestate": "30",
            "useticket": "0",
            "pagerefer": "",
            "vsnf": "1",
            "su": username,
            "service": "sso",
            "sp": password,
            "sr": "1440*900",
            "encoding": "UTF-8",
            "cdult": "3",
            "domain": "sina.com.cn",
            "prelt": "0",
            "returntype": "TEXT",
        }
        session = requests.Session()
        r = session.post(loginURL, data=postData)
        jsonStr = r.content.decode('gbk')
        info = json.loads(jsonStr)
        if info["retcode"] == "0":
            print("Get Cookie Success!( Account:%s )" % account)
            cookie = session.cookies.get_dict()
            cookies.append(cookie)
        else:
            print("Failed!( Reason:%s )" % info["reason"].encode("utf-8"))
    return cookies


cookies = getCookies(myWeiBo)
The login / anti-crawling part is probably the hardest piece of the whole project. There is still a lot of it I do not understand; I will dig into it when I have time.
For the pipeline, you only need to pay attention to which type of Item gets stored in which collection.
from pymongo import MongoClient
from scrapy import log
from scrapy.conf import settings  # old-style global settings import; adjust for newer Scrapy versions

# Adjust the package name to your project layout.
from weibo.items import ProfileItem, FollowingItem, FollowedItem


class MongoDBPipeline(object):
    def __init__(self):
        connection = MongoClient(
            host=settings['MONGODB_SERVER'],
            port=settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.info = db[settings['INFO']]
        self.following = db[settings['FOLLOWING']]
        self.followed = db[settings['FOLLOWED']]

    def process_item(self, item, spider):
        # Route each item type to its own collection.
        if isinstance(item, ProfileItem):
            self.info.insert(dict(item))
        elif isinstance(item, FollowingItem):
            self.following.insert(dict(item))
        elif isinstance(item, FollowedItem):
            self.followed.insert(dict(item))
        log.msg("Weibo added to MongoDB database!",
                level=log.DEBUG, spider=spider)
        return item
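The pipeline reads its connection details from settings.py. The setting names below are the ones the pipeline looks up; the concrete values and the pipeline's module path are assumptions, so substitute your own:

ITEM_PIPELINES = {
    'weibo.pipelines.MongoDBPipeline': 300,   # module path is an assumption
}

MONGODB_SERVER = 'localhost'
MONGODB_PORT = 27017
MONGODB_DB = 'weibo'          # database name (assumption)
INFO = 'information'          # collection for ProfileItem (assumption)
FOLLOWING = 'following'       # collection for FollowingItem (assumption)
FOLLOWED = 'followed'         # collection for FollowedItem (assumption)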
Run the program and you should see the data we want in MongoDB.
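A quick way to check from a Python shell, using the example database and collection names from the settings sketch above (both are assumptions):

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client['weibo']                     # database name from the example settings
print(db['information'].find_one())      # a sample ProfileItem document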
DOWNLOAD_DELAY in settings has to be set to 5 to keep the crawler from being banned by Weibo.
I tried falling back to simulated login when cookie-based login failed, but the results were poor.
I tried using a proxy IP pool to get around the anti-crawling measures, but failed, mainly because I do not really know how yet (a rough sketch of the idea follows after these notes).
Eventually I plan to use D3.js to visualize the crawled data (maybe).
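For reference, a minimal sketch of what such a proxy-rotating downloader middleware could look like. The PROXIES list holds placeholder addresses and the class is purely illustrative, not code from this project; it would also need to be registered in DOWNLOADER_MIDDLEWARES like the cookies middleware above.

import random

# Placeholder proxies -- replace with real, working ones.
PROXIES = [
    'http://127.0.0.1:8888',
    'http://127.0.0.1:8889',
]


class RandomProxyMiddleware(object):
    """Attach a randomly chosen proxy to every outgoing request."""

    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(PROXIES)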
Project repository: weibo_spider