This project was written in early July 2017. It mainly uses Python to scrape data from 網貸之家 (Wangdaizhijia) and 人人貸 (Renrendai) for analysis.
Wangdaizhijia is the largest P2P data platform in China, and Renrendai is one of the top twenty P2P lending platforms in China.
Source code address
The packet-capture tool is mainly the Network tab of Chrome's developer tools. All of Wangdaizhijia's data is returned as JSON via AJAX, whereas Renrendai has both AJAX-returned data and data rendered directly in HTML pages.
From the captured requests you can see the request method (GET or POST), the request headers, and the request parameters.
From the response you can see the format of the returned data (JSON in this example), its structure, and the concrete values.
Note: this is Wangdaizhijia's current backend API endpoint. The data endpoint at the time the crawler was written differs from the current one, so the Wangdaizhijia part of the crawler no longer works.
Based on the results of the packet-capture analysis, we construct the requests. This project uses Python's requests library to simulate HTTP requests.
The code:
```python
import requests


class SessionUtil():
    def __init__(self, headers=None, cookie=None):
        self.session = requests.Session()
        if headers is None:
            headersStr = {
                "Accept": "application/json, text/javascript, */*; q=0.01",
                "X-Requested-With": "XMLHttpRequest",
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36",
                "Accept-Encoding": "gzip, deflate, sdch, br",
                "Accept-Language": "zh-CN,zh;q=0.8"
            }
            self.headers = headersStr
        else:
            self.headers = headers
        self.cookie = cookie

    # send a GET request
    def getReq(self, url):
        return self.session.get(url, headers=self.headers).text

    def addCookie(self, cookie):
        self.headers['cookie'] = cookie

    # send a POST request
    def postReq(self, url, param):
        return self.session.post(url, param).text
```
When setting the request headers, the only essential field is "User-Agent". Wangdaizhijia and Renrendai have no anti-crawling measures; you do not even need to set the "Referer" field to avoid cross-origin errors.
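As a quick illustration, here is a minimal usage sketch of the SessionUtil class above; the endpoint URL is a placeholder, not the actual Wangdaizhijia interface.

```python
import json

from sessionUtil import SessionUtil

session = SessionUtil()                                    # default headers already carry a browser User-Agent
raw = session.getReq('https://example.com/api/platList')   # placeholder URL
data = json.loads(raw)                                     # the platform APIs return JSON
print(data)
```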
Below is a crawler example
```python
import json
import time
import traceback

from databaseUtil import DatabaseUtil
from dictUtil import DictUtil
from logUtil import LogUtil
from sessionUtil import SessionUtil


def handleData(returnStr):
    # parse the JSON response and pull out the platform detail object
    jsonData = json.loads(returnStr)
    platData = jsonData.get('data').get('platOuterVo')
    return platData


# columns of the problemPlatDetail table; the original code read each field with an
# explicit jsonOne.get(...) call and concatenated the values into one huge INSERT
# string -- building the statement from this field list with a parameterized query
# is equivalent and avoids the quoting problems
FIELDS = ['actualCapital', 'aliasName', 'association', 'associationDetail', 'autoBid',
          'autoBidCode', 'bankCapital', 'bankFunds', 'bidSecurity', 'bindingFlag',
          'businessType', 'companyName', 'credit', 'creditLevel', 'delayScore',
          'delayScoreDetail', 'displayFlg', 'drawScore', 'drawScoreDetail', 'equityVoList',
          'experienceScore', 'experienceScoreDetail', 'fundCapital', 'gjlhhFlag', 'gjlhhTime',
          'gruarantee', 'inspection', 'juridicalPerson', 'locationArea', 'locationAreaName',
          'locationCity', 'locationCityName', 'manageExpense', 'manageExpenseDetail',
          'newTrustCreditor', 'newTrustCreditorCode', 'officeAddress', 'onlineDate',
          'payment', 'paymode', 'platBackground', 'platBackgroundDetail',
          'platBackgroundDetailExpand', 'platBackgroundExpand', 'platEarnings',
          'platEarningsCode', 'platName', 'platStatus', 'platUrl', 'problem', 'problemTime',
          'recordId', 'recordLicId', 'registeredCapital', 'riskCapital', 'riskFunds',
          'riskReserve', 'riskcontrol', 'securityModel', 'securityModelCode',
          'securityModelOther', 'serviceScore', 'serviceScoreDetail', 'startInvestmentAmout',
          'term', 'termCodes', 'termWeight', 'transferExpense', 'transferExpenseDetail',
          'trustCapital', 'trustCreditor', 'trustCreditorMonth', 'trustFunds', 'tzjPj',
          'vipExpense', 'withTzj', 'withdrawExpense']


def storeData(jsonOne, conn, cur, platId):
    # read every column value from the response object and insert one row
    values = [jsonOne.get(field) for field in FIELDS] + [platId]
    columns = ','.join(FIELDS + ['platId'])
    placeholders = ','.join(['%s'] * len(values))
    sql = 'insert into problemPlatDetail (' + columns + ') values (' + placeholders + ')'
    cur.execute(sql, values)
    conn.commit()


conn, cur = DatabaseUtil().getConn()
session = SessionUtil()
logUtil = LogUtil("problemPlatDetail.log")

# collect the ids of all problem platforms scraped earlier
cur.execute('select platId from problemPlat')
data = cur.fetchall()
print(data)
mylist = list()
for i in range(0, len(data)):
    platId = str(data[i].get('platId'))
    mylist.append(platId)
print(mylist)

# request each platform's detail data and store it
for i in mylist:
    url = '' + i  # the base URL was elided in the original post
    try:
        data = session.getReq(url)
        platData = handleData(data)
        dictObject = DictUtil(platData)
        storeData(dictObject, conn, cur, i)
    except Exception:
        traceback.print_exc()

cur.close()
conn.close()
```
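DatabaseUtil, DictUtil, and LogUtil above are small helper modules from the project and are not reproduced here. For orientation, the sketch below shows one plausible shape for DatabaseUtil, assuming pymysql with a DictCursor (consistent with the data[i].get('platId') calls above); the connection parameters are placeholders, not the project's actual configuration.

```python
import pymysql


class DatabaseUtil():
    def getConn(self):
        # placeholder credentials; DictCursor returns each row as a dict,
        # which matches the data[i].get('platId') usage in the crawler
        conn = pymysql.connect(host='localhost', user='root', password='***',
                               db='p2p', charset='utf8',
                               cursorclass=pymysql.cursors.DictCursor)
        return conn, conn.cursor()
```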
Throughout the process we construct requests and then parse each response: JSON responses are parsed with the json library, and HTML pages are parsed with BeautifulSoup (for HTML pages with complex structure, the lxml library is recommended). The parsed results are stored in a MySQL database.
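For the HTML side (Renrendai), a minimal parsing sketch looks like the following; the page URL and CSS selector are hypothetical, since the actual Renrendai markup is not shown here.

```python
from bs4 import BeautifulSoup

from sessionUtil import SessionUtil

html = SessionUtil().getReq('https://www.renrendai.com/loan.html')  # hypothetical page
soup = BeautifulSoup(html, 'html.parser')   # pass 'lxml' instead for complex pages
# hypothetical selector: pull the loan titles out of the listing page
titles = [a.get_text(strip=True) for a in soup.select('a.loan-title')]
print(titles)
```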
Crawler code address (note: the crawler code runs under both Python 2 and Python 3; I deployed it on an Aliyun server and run it with Python 2)
The data analysis mainly uses Python's numpy, pandas, and matplotlib, supplemented by Haizhi BDP.
In general, the data is read into a pandas DataFrame for analysis.
Below is an example of reading the problem-platform data
```python
import pandas as pd

problemPlat = pd.read_csv('problemPlat.csv', parse_dates=True)  # problem platforms
```
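Note that the time-based slicing and monthly resampling further below need a DatetimeIndex, which parse_dates=True alone does not create unless an index column is set. A hedged variant (the column name 'problemTime' is an assumption, not taken from the project's CSV):

```python
# 'problemTime' is an assumed date column name; adjust it to the actual CSV header
problemPlat = pd.read_csv('problemPlat.csv', parse_dates=['problemTime'], index_col='problemTime')
```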
Data structure:
e.g., the number of problem platforms over time
```python
problemPlat['id']['2012':'2017'].resample('M').count().plot(title='P2P發生問題')  # trend of the number of problem P2P platforms over time
```
Graphical display:
Done with Haizhi BDP (the Python tooling for plotting map distributions is fairly involved, and I had not learned it at the time).
e.g., the nationwide distribution of platform trading volume in June
Code:
```python
juneData['amount'].hist(normed=True)
juneData['amount'].plot(kind='kde', style='k--')  # probability distribution of June trading volume
```
Kernel density plot:
Kernel density distribution of the logarithm of trading volume
```python
import numpy as np

np.log10(juneData['amount']).hist(normed=True)
np.log10(juneData['amount']).plot(kind='kde', style='k--')  # probability distribution after taking log10
```
Graphical display:
You can see that after taking the base-10 logarithm, the distribution is closer to a regular pyramid shape.
```python
# rolling correlation (50-day window) between Lufax's volume and the volume of all platforms
lujinData = platVolume[platVolume['wdzjPlatId'] == 59]
corr = lujinData['amount'].rolling(50, min_periods=50).corr(allPlatDayData['amount'])
corr.plot(title='陸金所交易額與全部平臺交易額的相關係數變化趨勢')
```
Graphical display:
Comparison of trading volume between car-loan platforms and all platforms
```python
import matplotlib.pyplot as plt

carFinanceDayData = carFinanceData.resample('D').sum()['amount']
fig, axes = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(14, 7))
carFinanceDayData.plot(ax=axes[0], title='車貸平臺交易額')
allPlatDayData['amount'].plot(ax=axes[1], title='全部p2p平臺交易額')
```
```python
from fbprophet import Prophet

# forecast Lufax's daily volume one year ahead with Prophet
lujinAmount = platVolume[platVolume['wdzjPlatId'] == 59]
lujinAmount['y'] = lujinAmount['amount']
lujinAmount['ds'] = lujinAmount['date']
m = Prophet(yearly_seasonality=True)
m.fit(lujinAmount)
future = m.make_future_dataframe(periods=365)
forecast = m.predict(future)
m.plot(forecast)
```
Graphical display of the trend forecast:
Data analysis code address (note: the data analysis code can only run in a Python 3 environment)
Sample output of the code (you can view the code and the graphical output without installing a Python environment)
This is the first project I wrote after moving from Java web development to data work, and also my first Python project. I did not run into many pitfalls along the way; overall, the barrier to entry for crawling, data analysis, and the Python language itself is very low.
If you want to get started with Python web crawling, I recommend 《Python網絡數據採集》 (Web Scraping with Python).
If you want to get started with Python data analysis, I recommend 《利用Python進行數據分析》 (Python for Data Analysis).