Scrapy Learning (4): Crawling Weibo Data

Preface

Scrapy Learning (3): Crawling Douban Book Information

Following on from the previous post, this time we crawl Weibo, which requires logging in before its pages can be accessed.
The crawler's goal is to collect each user's tweet count, following count and follower count, as a data reserve for building a user relationship graph (not yet implemented).

Preparation

  • Install the third-party libraries requests and pymongo

  • Install MongoDB

  • Create a weibo crawler project

How to create a Scrapy project was already covered in the previous articles, so let's get straight to the point.

Creating the Items

For the Item data I only need the basic information: profile details, tweet count, following count and follower count.

from scrapy import Item, Field


class ProfileItem(Item):
    """
    An account's tweet count, following count, follower count and profile details
    """
    _id = Field()
    nick_name = Field()
    profile_pic = Field()
    tweet_stats = Field()
    following_stats = Field()
    follower_stats = Field()
    sex = Field()
    location = Field()
    birthday = Field()
    bio = Field()
    
class FollowingItem(Item):
    """
    Weibo accounts this user is following
    """
    _id = Field()
    relationship = Field()

class FollowedItem(Item):
    """
    Weibo accounts that follow this user (fans)
    """
    _id = Field()
    relationship = Field()

Writing the Spider

To make crawling easier, we use the mobile version of Weibo, http://weibo.cn/, as the login entry point.

A user's Weibo uid can be obtained by visiting the profile page, or extracted from the href attribute of the links on the follow/fans pages.
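
As a quick reference, a minimal sketch of the two href shapes involved and how the uid can be pulled out with re (the example paths are made-up placeholders; 2634877355 is the seed uid used below, and this is the same extraction relationship_parse does later):

import re

# weibo.cn links either carry the uid in the query string ...
href_with_query = "/some/page?uid=2634877355&page=1"   # placeholder href
# ... or as the first path segment
href_with_path = "/2634877355/info"

print(re.findall(r'uid=(\d+)&', href_with_query)[0])   # -> 2634877355
print(re.findall(r'/(\d+)', href_with_path)[0])        # -> 2634877355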

import re

import scrapy
from scrapy.exceptions import CloseSpider

# adjust this import to your project layout (the project here is called "weibo")
from weibo.items import ProfileItem, FollowingItem, FollowedItem


class WeiboSpiderSpider(scrapy.Spider):
    name = "weibo_spider"
    allowed_domains = ["weibo.cn"]
    url = "http://weibo.cn/"
    start_urls = ['2634877355',...] # seed Weibo uids to crawl from
    task_set = set(start_urls) # uids waiting to be crawled
    tasked_set = set() # uids that have already been crawled
    ...   
    
    def start_requests(self):
        while len(self.task_set) > 0:
            _id = self.task_set.pop()
            if _id in self.tasked_set:
                raise CloseSpider(reason="uid %s has already been crawled" % _id)
            self.tasked_set.add(_id)
            info_url = self.url + _id
            info_item = ProfileItem()
            following_url = info_url + "/follow"
            following_item = FollowingItem()
            following_item["_id"] = _id
            following_item["relationship"] = []
            follower_url = info_url + "/fans"
            follower_item = FollowedItem()
            follower_item["_id"] = _id
            follower_item["relationship"] = []
            yield scrapy.Request(info_url, meta={"item":info_item}, callback=self.account_parse)
            yield scrapy.Request(following_url, meta={"item":following_item}, callback=self.relationship_parse)
            yield scrapy.Request(follower_url, meta={"item":follower_item}, callback=self.relationship_parse)

    def account_parse(self, response):
        item = response.meta["item"]
        sel = scrapy.Selector(response)
        profile_url = sel.xpath("//div[@class='ut']/a/@href").extract()[1]
        counts = sel.xpath("//div[@class='u']/div[@class='tip2']").extract_first()
        item['_id'] = re.findall(u'^/(\d+)/info',profile_url)[0]
        item['tweet_stats'] = re.findall(u'微博\[(\d+)\]', counts)[0]
        item['following_stats'] = re.findall(u'关注\[(\d+)\]', counts)[0]
        item['follower_stats'] = re.findall(u'粉丝\[(\d+)\]', counts)[0]
        if int(item['tweet_stats']) < 4500 and int(item['following_stats']) > 1000 and int(item['follower_stats']) < 500:
            raise CloseSpider("zombie follower")
        yield scrapy.Request("http://weibo.cn"+profile_url, meta={"item": item},callback=self.profile_parse)

    def profile_parse(self,response):
        item = response.meta['item']
        sel = scrapy.Selector(response)
        info = sel.xpath("//div[@class='tip']/following-sibling::div[@class='c']").extract_first()
        item["profile_pic"] = sel.xpath("//div[@class='c']/img/@src").extract_first()
        item["nick_name"] = re.findall(u'暱稱:(.*?)<br>',info)[0]
        item["sex"] = re.findall(u'性別:(.*?)<br>',info) and re.findall(u'性別:(.*?)<br>',info)[0] or ''
        item["location"] = re.findall(u'地區:(.*?)<br>',info) and re.findall(u'地區:(.*?)<br>',info)[0] or ''
        item["birthday"] = re.findall(u'生日:(.*?)<br>',info) and re.findall(u'生日:(.*?)<br>',info)[0] or ''
        item["bio"] = re.findall(u'簡介:(.*?)<br>',info) and re.findall(u'簡介:(.*?)<br>',info)[0] or ''
        yield item

    def relationship_parse(self, response):
        item = response.meta["item"]
        sel = scrapy.Selector(response)
        uids = sel.xpath("//table/tr/td[last()]/a[last()]/@href").extract()
        new_uids = []
        for uid in uids:
            if "uid" in uid:
                new_uids.append(re.findall('uid=(\d+)&',uid)[0])
            else:
                try:
                    new_uids.append(re.findall('/(\d+)', uid)[0])
                except:
                    print('--------',uid)
                    pass
        item["relationship"].extend(new_uids)
        for i in new_uids:
            if i not in self.tasked_set:
                self.task_set.add(i)
        next_page = sel.xpath("//*[@id='pagelist']/form/div/a[text()='下页']/@href").extract_first()
        if next_page:
            yield scrapy.Request("http://weibo.cn"+next_page, meta={"item": item},callback=self.relationship_parse)
        else:
            yield item

A few places in the code are worth pointing out.

start_urls

What we fill in here are Weibo uids. Some users have a custom domain, so you have to visit their page first to get the real uid.
The number of initial seeds in start_urls should be more than 10. This ensures that the new seeds we discover later can keep being added to the queue of pending uids; the figure of 10 comes from the Scrapy documentation.

REACTOR_THREADPOOL_MAXSIZE

Default: 10
This is the maximum limit for the size of the Twisted reactor thread pool.
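
If you want to tweak it, a minimal settings.py sketch (10 is simply Scrapy's documented default):

# settings.py (sketch; 10 is Scrapy's default value)
REACTOR_THREADPOOL_MAXSIZE = 10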

CloseSpider

When we hit a link that does not need to be crawled any further (e.g. a uid that has already been crawled, or an account that matches our "zombie follower" rule), we can raise the CloseSpider exception (from scrapy.exceptions) to close the spider.

編寫middlewares

import random

# "cookies" is the list built by the getCookies() helper below; adjust the import to your layout
from weibo.cookies import cookies


class CookiesMiddleware(object):
    """Attach a randomly chosen cookie to every request"""

    def process_request(self, request, spider):
        cookie = random.choice(cookies)
        request.cookies = cookie
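
For the middleware to take effect it has to be enabled in settings.py. A minimal sketch, assuming the class lives in weibo/middlewares.py (the module path and the priority value are assumptions):

# settings.py (sketch)
DOWNLOADER_MIDDLEWARES = {
    'weibo.middlewares.CookiesMiddleware': 401,
}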

Writing the cookie-fetching helper

Originally I wanted to simulate login through the mobile version of Weibo, but the captcha was just too hard to deal with. So I directly used someone else's code for logging into the web version of Weibo, SinaSpider. It is well written, and worth a look if you are interested. Another author wrote a simulated login that handles the captcha, which I tested and it works; I just have not figured out how to fit it into my project yet.

# encoding=utf-8
import json
import base64
import requests

myWeiBo = [
    {'no': 'xx@sina.com', 'psw': 'xx'},
    {'no': 'xx@qq.com', 'psw': 'xx'},
]


def getCookies(weibo):
    """ 獲取Cookies """
    cookies = []
    loginURL = r'https://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.15)'
    for elem in weibo:
        account = elem['no']
        password = elem['psw']
        username = base64.b64encode(account.encode('utf-8')).decode('utf-8')
        postData = {
            "entry": "sso",
            "gateway": "1",
            "from": "null",
            "savestate": "30",
            "useticket": "0",
            "pagerefer": "",
            "vsnf": "1",
            "su": username,
            "service": "sso",
            "sp": password,
            "sr": "1440*900",
            "encoding": "UTF-8",
            "cdult": "3",
            "domain": "sina.com.cn",
            "prelt": "0",
            "returntype": "TEXT",
        }
        session = requests.Session()
        r = session.post(loginURL, data=postData)
        jsonStr = r.content.decode('gbk')
        info = json.loads(jsonStr)
        if info["retcode"] == "0":
            print("Get Cookie Success!( Account:%s )" % account)
            cookie = session.cookies.get_dict()
            cookies.append(cookie)
        else:
            print("Failed!( Reason:%s )" % info["reason"].encode("utf-8"))
    return cookies

cookies = getCookies(myWeiBo)

The login / anti-crawling part is probably the hardest part of the whole project. There are still many things I do not quite understand; I will look into them when I have time.

Writing the pipelines

Here we just need to pay attention to which collection each type of Item gets stored in.

from pymongo import MongoClient
# these two imports match the older Scrapy API used in this snippet
from scrapy import log
from scrapy.conf import settings

from weibo.items import ProfileItem, FollowingItem, FollowedItem  # adjust to your project layout


class MongoDBPipeline(object):
    def __init__(self):
        connection = MongoClient(
            host=settings['MONGODB_SERVER'],
            port=settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.info = db[settings['INFO']]
        self.following = db[settings['FOLLOWING']]
        self.followed = db[settings['FOLLOWED']]

    def process_item(self, item, spider):

        if isinstance(item, ProfileItem):
            self.info.insert(dict(item))
        elif isinstance(item, FollowingItem):
            self.following.insert(dict(item))
        elif isinstance(item, FollowedItem):
            self.followed.insert(dict(item))
        log.msg("Weibo  added to MongoDB database!",
                level=log.DEBUG, spider=spider)
        return item
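
For this to run, the settings keys referenced above must exist and the pipeline must be enabled in settings.py. A minimal sketch (only the key names come from the code; the concrete values and the module path are assumptions):

# settings.py (sketch; values are placeholders)
ITEM_PIPELINES = {
    'weibo.pipelines.MongoDBPipeline': 300,
}

MONGODB_SERVER = 'localhost'
MONGODB_PORT = 27017
MONGODB_DB = 'weibo'
INFO = 'information'        # collection for ProfileItem
FOLLOWING = 'following'     # collection for FollowingItem
FOLLOWED = 'followed'       # collection for FollowedItem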

Run the spider (scrapy crawl weibo_spider) and you will see the data we want show up in MongoDB.

Summary

  • DOWNLOAD_DELAY in settings has to be set to 5 to keep from getting banned by Weibo (see the sketch after this list)

  • I tried falling back to simulated login when logging in with cookies failed, but the results were far from ideal

  • I tried using a proxy IP pool to get around the anti-crawling measures, but failed, mainly because I do not really know how yet

  • In the future I plan to visualize the crawled data with D3.js (maybe)
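
A minimal sketch of the throttling setting mentioned above (5 seconds is simply the value that worked for this project):

# settings.py (sketch)
DOWNLOAD_DELAY = 5  # seconds between requests, to avoid getting banned by Weibo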

Project repository: weibo_spider
