PS: In the next article I'll analyze how the rankings change over time and plot the trends as an animated bar chart and a line chart.
1. Original site information
Let's first look at the original site page.
3. How to find the scraping parameters on 123fans
step1: Open 123fans in a browser (usually Firefox or Chrome; I use 360 Browser).
step2: Press F12, then Ctrl+R to reload the page and capture its network requests.
step3: Click results.php, then look under Headers for the parameters the code needs.
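Once the request is visible under Headers, the two query parameters that matter are qi (the period number) and c (the category), so the URL for any period can be built directly. A minimal sketch (the parameter meanings are inferred from the URLs used later in this article, not from official documentation):

```python
# Build result-page URLs from the qi (period) and c (category) parameters
# seen under the Headers tab; parameter meanings are inferred, not documented.
base = "https://123fans.cn/results.php?qi={}&c=1"
urls = [base.format(qi) for qi in range(538, 535, -1)]
print(urls[0])  # https://123fans.cn/results.php?qi=538&c=1
```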
1 Fetch the page with Python's Requests library
# Fetch the current page and parse it into a standard format with BeautifulSoup
import requests  # HTTP requests
import bs4       # BeautifulSoup

url = "https://123fans.cn/lastresults.php?c=1"
# Note: the request method is set by requests.get itself, so it does not
# belong in the headers dict; only the User-Agent is needed here.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
req = requests.get(url, timeout=30, headers=headers)
soup = bs4.BeautifulSoup(req.text, "html.parser")
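Before wiring up the real request, it helps to see what BeautifulSoup does with a fragment of markup. The snippet below is a hypothetical stand-in for the page structure (the real markup on 123fans.cn may differ):

```python
import bs4

# Hypothetical fragment mimicking the result page's structure
# (an assumption for illustration, not the site's actual markup)
html = '<h2>第538期</h2><table><tr><td class="name"><a>Alice</a></td></tr></table>'
soup = bs4.BeautifulSoup(html, "html.parser")
print(soup.h2.text)                               # 第538期
print(soup.find("td", {"class": "name"}).a.text)  # Alice
```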
2 Assemble the scraped data into a DataFrame
import re            # regular expressions
import numpy as np
import pandas as pd

period_data = pd.DataFrame(np.zeros((400, 5)))
period_data.columns = ['name', 'popularity_value', 'period_num', 'end_time', 'rank']

# names
i = 0
name = soup.findAll("td", {"class": "name"})
for each in name:
    period_data['name'][i] = each.a.text
    i += 1

# popularity values
j = 0
popularity = soup.findAll("td", {"class": "ballot"})
for each in popularity:
    period_data['popularity_value'][j] = float(each.text.replace(",", ''))
    j += 1

# period number
period_num = int(re.findall('[0-9]+', str(soup.h2.text))[0])
period_data['period_num'] = period_num

# end date ('結束日期' is the "end date" label on the page)
end_time_0 = str(re.findall('結束日期.+[0-9]+', str(soup.findAll("div", {"class": "results"})))).split('.')
end_time = ''
for str_1 in end_time_0:
    end_time = end_time + re.findall('[0-9]+', str_1)[0]
period_data['end_time'] = end_time

# sort by popularity and add an ordinal rank column
period_data_1 = period_data.sort_values(by='popularity_value', ascending=False)
period_data_1['rank'] = range(period_data_1.shape[0])
popularity: use findAll to extract all of the popularity values.
period_data_1['rank']: append an ordinal column at the end to make it easy to slice off the top entries.
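The same table can also be built without manual index counters by collecting the cells into lists first. A sketch on a hypothetical two-row fragment (the class names match the article's code; the data itself is made up):

```python
import bs4
import pandas as pd

# Hypothetical two-row fragment standing in for the real results table
html = ('<table><tr><td class="name"><a>Alice</a></td><td class="ballot">1,234</td></tr>'
        '<tr><td class="name"><a>Bob</a></td><td class="ballot">2,345</td></tr></table>')
soup = bs4.BeautifulSoup(html, "html.parser")

# Collect each column with a list comprehension instead of an index counter
names = [td.a.text for td in soup.findAll("td", {"class": "name"})]
votes = [float(td.text.replace(",", "")) for td in soup.findAll("td", {"class": "ballot"})]

df = pd.DataFrame({"name": names, "popularity_value": votes})
df = df.sort_values(by="popularity_value", ascending=False).reset_index(drop=True)
df["rank"] = range(len(df))
print(df.loc[0, "name"])  # Bob
```

This also sizes the DataFrame to the actual number of rows, so no oversized zero matrix is needed.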
Next, the batch-scraping code.
5. Batch scraper code walkthrough
1 Define the scraper function
import requests  # HTTP requests
import bs4       # BeautifulSoup
import re        # regular expressions
import numpy as np
import pandas as pd
import warnings
import time
import random

warnings.filterwarnings('ignore')  # suppress warnings

# Every value in headers can be found under the Headers tab
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}

def crawler(url):
    req = requests.get(url, timeout=30, headers=headers)  # fetch the page
    soup = bs4.BeautifulSoup(req.text, "html.parser")     # parse it
    period_data = pd.DataFrame(np.zeros((400, 5)))        # 400x5 zero matrix as a buffer
    period_data.columns = ['name', 'popularity_value', 'period_num', 'end_time', 'rank']
    # fill the current period's data into the table
    # names
    i = 0
    name = soup.findAll("td", {"class": "name"})
    for each in name:
        period_data['name'][i] = each.a.text  # add each name in turn
        i += 1
    # popularity values
    j = 0
    popularity = soup.findAll("td", {"class": "ballot"})
    for each in popularity:
        period_data['popularity_value'][j] = float(each.text.replace(",", ''))  # add each value in turn
        j += 1
    # period number
    period_num = int(re.findall('[0-9]+', str(soup.h2.text))[0])
    period_data['period_num'] = period_num
    # end date ('結束日期' is the "end date" label on the page)
    end_time_0 = str(re.findall('結束日期.+[0-9]+', str(soup.findAll("div", {"class": "results"})))).split('.')
    end_time = ''
    for str_1 in end_time_0:
        end_time = end_time + re.findall('[0-9]+', str_1)[0]
    period_data['end_time'] = end_time
    # ordinal rank column, handy for slicing off the top N
    period_data_1 = period_data.sort_values(by='popularity_value', ascending=False)
    period_data_1['rank'] = range(period_data_1.shape[0])
    return period_data_1
This block gathers the step-by-step scraping code into a single function so it can be called repeatedly.
2 Call the function repeatedly for batch scraping
period_data_final = pd.DataFrame(np.zeros((1, 5)))  # 1x5 zero row as a placeholder
period_data_final.columns = ['name', 'popularity_value', 'period_num', 'end_time', 'rank']
for qi in range(538, 499, -1):
    print("currently scraping period", qi)
    if qi == 538:
        url = "https://123fans.cn/lastresults.php?c=1"
    else:
        url = "https://123fans.cn/results.php?qi={}&c=1".format(qi)
    time.sleep(random.uniform(1, 2))  # random pause between requests
    date = crawler(url)
    period_data_final = period_data_final.append(date)
period_data_final_1 = period_data_final.loc[1:, :]  # drop the useless placeholder first row
This block calls the scraper function repeatedly to fetch each page and uses append to merge the results into one DataFrame.
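One caveat if you run this in a current environment: DataFrame.append was deprecated and then removed in pandas 2.0. Collecting the per-period frames in a list and concatenating once is the modern equivalent, sketched here with dummy frames standing in for crawler's output:

```python
import pandas as pd

# Dummy stand-ins for crawler(url) results; the real frames would carry the
# five columns used in the article
frames = []
for qi in range(538, 536, -1):
    frames.append(pd.DataFrame({"period_num": [qi], "name": ["placeholder"]}))

# One concat at the end replaces repeated DataFrame.append calls
period_data_final = pd.concat(frames, ignore_index=True)
print(len(period_data_final))  # 2
```

Besides working on pandas 2.x, a single concat avoids copying the accumulated frame on every loop iteration.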
Note 3: time.sleep is given a random value so the script pauses for an unpredictable interval between requests, which helps avoid anti-scraping measures.
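The randomized pause is simply random.uniform feeding time.sleep: every delay lands somewhere in the 1–2 second window rather than on a fixed, easily detected interval. A minimal sketch:

```python
import random

# Sample a few delays the way the scraper does; each lies within [1, 2] seconds
delays = [random.uniform(1, 2) for _ in range(5)]
print(all(1 <= d <= 2 for d in delays))  # True
```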
This article is shared from the WeChat official account 阿黎逸陽的代碼 (gh_f3910c467dfe).