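I originally planned to write a crawler for Weibo follower lists that could count, among the followers of a given Weibo account, how many are zombie followers (paid "water army" accounts) and how many are big-V (verified influencer) accounts.
Only after finishing the crawler did I discover that Weibo now lets you view just the first 5 pages of a follower list... Haha, oh well. It is a bit frustrating; Taobao likewise shows only the first 100 pages of reviews.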
Here is the crawler code:
import re

import requests

tmpt_url = 'https://weibo.com/p/1005051678105910/follow?page=%d#Pl_Official_HisRelation__59'


def get_data(tmpt_url):
    # Only the first 5 pages are visible, so build URLs for pages 1 to 5.
    urllist = [tmpt_url % i for i in range(1, 6)]
    user_id = []       # follower IDs
    user_name = []     # follower names
    user_follow = []   # number of accounts each follower follows
    user_fans = []     # each follower's own follower count
    user_address = []  # follower locations
    headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
               'Accept-Encoding': 'gzip, deflate, br',
               'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
               'Connection': 'keep-alive',
               'Cookie': 'copy this from your own browser; withheld here for privacy',
               'Host': 'weibo.com',
               'Upgrade-Insecure-Requests': '1',
               'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0'}
    for url in urllist:
        html = requests.get(url, headers=headers).text
        # The patterns match escaped quotes (\") and slashes (\/) as they appear in the page source.
        user_id.extend(re.findall(r'<a class=\\"S_txt1\\" target=\\"_blank\\" usercard=\\"id=(\d+)&refer_flag=\d+_\\" href=\\"\\/\S+\?refer_flag=\d+_\\" >\S+<\\/a>', html))
        user_name.extend(re.findall(r'<a class=\\"S_txt1\\" target=\\"_blank\\" usercard=\\"id=\d+&refer_flag=\d+_\\" href=\\"\\/\S+\?refer_flag=\d+_\\" >(\S+)<\\/a>', html))
        user_follow.extend(re.findall(r'關注 <em class=\\"count\\"><a target=\\"_blank\\" href=\\"\\/\d+\\/follow\\" >(\d+)<\\/a>', html))
        user_fans.extend(re.findall(r'粉絲<em class=\\"count\\"><a target=\\"_blank\\" href=\\"\\/\d+\\/fans\?current=fans\\" >(\d+)<\\/a>', html))
        user_address.extend(re.findall(r'<em class=\\"tit S_txt2\\">地址<\\/em><span>(\S+\s?\S+?)<\\/span>\\r\\n\\t\\t\\t\\t\\t<\\/div>', html))
    print('user_id', user_id)
    print('user_name', user_name)
    print('user_follow', user_follow)
    print('user_fans', user_fans)
    print('user_address', user_address)


if __name__ == '__main__':
    get_data(tmpt_url)
The URL in tmpt_url points to Sun Li's Weibo account.
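The patterns in the crawler match `\"` and `\/` rather than plain `"` and `/`, which suggests the follower markup arrives as an escaped string inside the page source. The sketch below illustrates that on a small fragment. The fragment, its ID, and the name `example_user` are invented for illustration and are not real Weibo output. It also puts both capture groups into one pattern, a small variation on the original two patterns, so that each ID stays paired with its name.

```python
import re

# Hypothetical fragment, invented for illustration: escaped markup shaped
# like what the crawler's patterns expect (note the literal \" and \/).
sample = (
    r'<a class=\"S_txt1\" target=\"_blank\" '
    r'usercard=\"id=1234567890&refer_flag=1005050005_\" '
    r'href=\"\/u\/1234567890?refer_flag=1005050005_\" >example_user<\/a>'
)

# Same shape as the crawler's user_id / user_name patterns, but with both
# capture groups in one expression so findall returns (id, name) tuples.
pattern = re.compile(
    r'<a class=\\"S_txt1\\" target=\\"_blank\\" '
    r'usercard=\\"id=(\d+)&refer_flag=\d+_\\" '
    r'href=\\"\\/\S+\?refer_flag=\d+_\\" >(\S+)<\\/a>'
)

pairs = pattern.findall(sample)
print(pairs)  # [('1234567890', 'example_user')]
```

In the pattern, `\\` matches one literal backslash, so `\\"` matches the two characters `\"` in the source text.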
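Below is the information scraped from the first 5 pages of her follower list, including: follower ID, follower name, the number of accounts the follower follows, the follower's own follower count, and the follower's location.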
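The crawler collects each of these fields in its own list, so the values for one follower are tied together only by position. Here is a minimal sketch of lining them up per follower, assuming `get_data` were changed to return its five lists (as written it only prints them). The `Follower` type, the `to_records` helper, and the sample values are hypothetical and not part of the original code.

```python
from typing import List, NamedTuple


class Follower(NamedTuple):
    """One row of the follower list (hypothetical container type)."""
    user_id: str
    name: str
    follow: int   # number of accounts this follower follows
    fans: int     # this follower's own follower count
    address: str


def to_records(user_id, user_name, user_follow, user_fans, user_address) -> List[Follower]:
    """Combine the five parallel lists into one record per follower."""
    columns = (user_id, user_name, user_follow, user_fans, user_address)
    # Refuse to zip lists of different lengths: the rows would be misaligned.
    if len({len(col) for col in columns}) != 1:
        raise ValueError('field lists have different lengths: %s'
                         % [len(col) for col in columns])
    return [Follower(uid, name, int(follow), int(fans), addr)
            for uid, name, follow, fans, addr in zip(*columns)]


# Made-up sample values, only to show the shape of the result.
records = to_records(
    ['1000000001', '1000000002'],
    ['user_a', 'user_b'],
    ['120', '35'],
    ['12', '4800'],
    ['Beijing', 'Shanghai'],
)
for r in records:
    print(r.user_id, r.name, r.follow, r.fans, r.address)
```

The length check is there because `re.findall` returns only the matches it finds: if one follower's entry lacks a field, that list comes back shorter and every later value would shift by one position.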