PageRank is the predecessor of TextRank. As the names suggest, TextRank is used in NLP applications such as computing text importance (sentence ranking) and text summarization, while PageRank was originally born of search engines' need to compute the importance and ranking of web pages. In the spirit of tracing things back to their source, knowing not just that something works but why, this post studies and implements the algorithm at a practical level.
There are plenty of blog posts online, but few that truly explain a topic thoroughly and clearly. Let me give it a try, and lay out both the principles and a programming implementation.
+ Author: Johnny Zen + Affiliation: School of Computer Science, Xihua University + Blog post: https://www.cnblogs.com/johnnyzen/p/10888248.html + CSDN notice: all rights reserved; infringement will be pursued.
A handwritten note is attached as a memento, haha~
The partial overview of the principles, the three examples, and the formula derivation come from the blog post PageRank算法原理與實現 (PageRank Algorithm: Principles and Implementation).
Most of the figures and formulas in this post were typeset by the author in Markdown.
PageRank, also known as web page ranking or Google's left-side ranking, is a technique in which a search engine computes scores from the hyperlinks between web pages, used as one factor in ranking pages. It is named after Google co-founder Larry Page. Google uses it to reflect the relevance and importance of web pages, and in search engine optimization it is often one of the factors used to evaluate the effectiveness of page optimization.
A suggestion in the TextRank paper regarding PageRank: improve the original PageRank graph model from unweighted edges to a weighted graph. Edge weights express the "strength" of the connection between two nodes, and may therefore improve the model's effectiveness.
In the context of Web surfing, it is unusual for a page to include multiple or partial links to another page, and hence the original PageRank definition for graph-based ranking is assuming unweighted graphs. However, in our model the graphs are built from natural language texts, and may include multiple or partial links between the units (vertices) that are extracted from text. It may be therefore useful to indicate and incorporate into the model the "strength" of the connection between two vertices Vi and Vj as a weight W(i,j) added to the corresponding edge that connects the two vertices.
① Suppose a group of four web pages: A, B, C, and D. If all the other pages link only to A, then A's PR (PageRank) value will be the sum of the PageRank values of B, C, and D.
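Example ① can be checked with a few lines of Python (a minimal sketch; the starting PR values of 0.25 are arbitrary):

```python
# Example ①: B, C and D all link only to A. In the simplified model,
# PR(A) is the sum of PR(T)/|Out(T)| over every page T linking to A,
# and here each of B, C, D has exactly one outgoing link.
pr = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}  # arbitrary start values
out_links = {"B": ["A"], "C": ["A"], "D": ["A"]}

pr_A = sum(pr[t] / len(targets) for t, targets in out_links.items())
print(pr_A)  # 0.75, i.e. PR(B) + PR(C) + PR(D)
```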
② Again suppose a group of four web pages: A, B, C, and D. This time suppose B links to A and C, C links only to A, and D links to all three other pages.
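Example ② as a sketch: each page splits its PR evenly across its outgoing links, so PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3 (starting values again arbitrary):

```python
# Example ②: B links to A and C, C links only to A, and D links to
# all three other pages. A page passes PR/|Out| along each outgoing link.
pr = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
out_links = {"B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}

pr_A = sum(pr[t] / len(targets)
           for t, targets in out_links.items() if "A" in targets)
print(pr_A)  # PR(B)/2 + PR(C)/1 + PR(D)/3
```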
Rewritten as a formula that is easier to understand, i.e.
\[ PR(A) = (1-d) + d*\begin{bmatrix} \frac{ PR(T_{1}) }{ |Out(T_{1})| } +\frac{ PR(T_{2}) }{ |Out(T_{2})| } + ... + \frac{ PR(T_{m}) }{ |Out(T_{m})| } \end{bmatrix}\quad \]
\[ TR(T_{i}) = (1-d) + d*\begin{bmatrix} \frac{ TR(T_{1})*Similarity(T_{i},T_{1}) }{ |Out(T_{1})| } +\frac{ TR(T_{2})*Similarity(T_{i},T_{2}) }{ |Out(T_{2})| } + ... + \frac{ TR(T_{m})*Similarity(T_{i},T_{m}) }{ |Out(T_{m})| } \end{bmatrix}\quad \]
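The weighted TextRank update above can be sketched as follows; `textrank_once` is a hypothetical helper for one update step, and the similarity matrix here stands in for any sentence-similarity measure:

```python
def textrank_once(tr, sim, d=0.85):
    """One update of the weighted TextRank scores TR(Ti): each
    in-neighbour Tj contributes TR(Tj) * sim(Tj, Ti) divided by
    Tj's out-degree, per the formula above."""
    n = len(tr)
    new_tr = [0.0] * n
    for i in range(n):
        total = 0.0
        for j in range(n):
            if j == i or sim[j][i] == 0:
                continue  # Tj does not connect to Ti
            out_j = sum(1 for k in range(n) if k != j and sim[j][k] > 0)
            total += tr[j] * sim[j][i] / out_j
        new_tr[i] = (1 - d) + d * total
    return new_tr

# With a uniform similarity matrix, every sentence keeps TR = 1.0:
sim = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(textrank_once([1.0, 1.0, 1.0], sim))
```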
\[ PR(A) = \frac{(1-d)}{N} + d*\begin{bmatrix} \frac{ PR(T_{1}) }{ |Out(T_{1})| } +\frac{ PR(T_{2}) }{ |Out(T_{2})| } + ... + \frac{ PR(T_{m}) }{ |Out(T_{m})| } \end{bmatrix}\quad \]
where N is the total number of page nodes.
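With the variant that divides (1-d) by the node count N, the PR values keep a total of 1, and the update can be written in matrix form. Below is a minimal NumPy sketch on a hypothetical 3-page graph (A→B, A→C, B→C, C→A); the matrix M is an assumption built for this illustration:

```python
import numpy as np

d, N = 0.85, 3
# M[i, j] = 1/|Out(j)| if page j links to page i, else 0.
# Graph: A -> B, A -> C, B -> C, C -> A   (columns: A, B, C)
M = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
pr = np.full(N, 1.0 / N)            # start from 1/N each
for _ in range(100):
    pr = (1 - d) / N + d * (M @ pr)
print(pr, pr.sum())                 # the PR values sum to 1
```

Because each column of M sums to 1 (every page distributes its full PR across its out-links), the total PR is preserved at every step.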
For example, computing PR(C) with d = 0.5, taking PR(A) = 1 and PR(B) = 0.75:
\[ PR(C) = 0.5 + 0.5*(PR(A)/2 + PR(B)/1) = 0.5+0.5*(1/2+0.75)= 1.125 \]
Then iterate (i.e., loop) this computation repeatedly until the values converge.
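A minimal sketch of that iteration, assuming the same 3-page graph used in the code example below (A→B, A→C, B→C, C→A) with d = 0.5; here all three values are updated simultaneously from the previous round, which reaches the same fixed point as an in-place update:

```python
# 3-page graph: A -> B, A -> C, B -> C, C -> A; damping d = 0.5.
d = 0.5
pr = {"A": 1.0, "B": 1.0, "C": 1.0}   # initial PR = 1
for _ in range(60):
    pr = {
        "A": (1 - d) + d * (pr["C"] / 1),                # only C links to A
        "B": (1 - d) + d * (pr["A"] / 2),                # A links to B, |Out(A)| = 2
        "C": (1 - d) + d * (pr["A"] / 2 + pr["B"] / 1),  # A and B link to C
    }
print(pr)  # approaches A ≈ 1.0769, B ≈ 0.7692, C ≈ 1.1538
```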
Environment: Python 3.6.2
```python
import numpy as np

class PageRank:
    def __init__(self, pages=['A', 'B'], links=[(1, 0), (0, 1)], d=0.85):
        self.pages = pages
        self.links = links        # (n, m): node n links to node m
        self.dtype = np.float64   # higher precision
        self.d = d                # damping factor
        # Initial PR value of every page: 1, 1/N, or anything else.
        # (NumPy arrays support the element-wise arithmetic that native
        # Python lists do not.) Tests on several data sets confirm the
        # initial value does not change the converged PR values.
        self.pageRanks = np.array([1 / len(self.pages)] * len(self.pages),
                                  dtype=self.dtype)
        # For every node, the list of nodes that link INTO it
        self.inputLinksList = [[] for _ in self.pages]
        for (src, dst) in self.links:
            self.inputLinksList[dst].append(src)
        # For every node, the list of nodes it links OUT to
        self.outputLinksList = [[] for _ in self.pages]
        for (src, dst) in self.links:
            self.outputLinksList[src].append(dst)

    def getInputLinksList(self):
        return self.inputLinksList

    def getOutputLinksList(self):
        return self.outputLinksList

    def getCurrentPageRanks(self):
        return self.pageRanks

    def getLinks(self):
        return self.links

    @staticmethod
    def iterOnce(pageRanks, d, links, inputLinksList, outputLinksList):
        """Run one iteration; a static method usable on its own."""
        pageRanks = np.copy(pageRanks)   # copy, or the caller's array is mutated
        for i in range(len(pageRanks)):
            result = 0.0
            for j in inputLinksList[i]:  # indices of nodes linking into node i
                result += pageRanks[j] / len(outputLinksList[j])
            pageRanks[i] = (1 - d) + d * result
        return pageRanks

    @staticmethod
    def maxAbs(array):
        """Return the index of the element with the largest absolute value."""
        max_i = 0
        for i in range(len(array)):
            if abs(array[max_i]) < abs(array[i]):
                max_i = i
        return max_i

    def train(self, maxIterationSize=100, threshold=0.0000001):
        """Iterate until maxIterationSize rounds, or until the largest
        per-node change drops below threshold."""
        print("[PageRank.train]", 0, " self.pageRanks:", self.pageRanks)
        iteration = 1
        lastPageRanks = self.pageRanks  # previous round; self.pageRanks is current
        # Initialize the per-node change between rounds to a large value
        difPageRanks = np.array([100000.0] * len(self.pageRanks), dtype=self.dtype)
        while iteration <= maxIterationSize:
            if abs(difPageRanks[PageRank.maxAbs(difPageRanks)]) < threshold:
                break
            self.pageRanks = PageRank.iterOnce(lastPageRanks, self.d, self.links,
                                               self.inputLinksList, self.outputLinksList)
            difPageRanks = lastPageRanks - self.pageRanks
            print("[PageRank.train]", iteration, " self.pageRanks:", self.pageRanks)
            lastPageRanks = np.array(self.pageRanks)
            iteration += 1
        print("[PageRank.train] iteration:", iteration - 1)
        print("[PageRank.train] difPageRanks:", difPageRanks)
        return self.pageRanks
```
```python
# User-supplied page link data
# pages = ["A", "B", "C", "D"]
# links = [(1, 0), (1, 2), (2, 0), (3, 0), (3, 1), (3, 2)]  # (n, m): n links to m
pages = ["A", "B", "C"]
links = [(0, 1), (0, 2), (1, 2), (2, 0)]  # (n, m): n links to m
d = 0.5  # damping factor

pageRank = PageRank(pages, links, d)
pageRanks = pageRank.train(12, 0.000000000001)  # PR value of each node
print("pageRanks:", pageRanks)
print("sum(pageRanks) :", np.sum(pageRanks))
print("getInputLinksList:", pageRank.getInputLinksList())
print("getOutputLinksList:", pageRank.getOutputLinksList())
```
```
# Output with initial PR value 1:
[PageRank.train] 0  self.pageRanks: [ 1.  1.  1.]
[PageRank.train] 1  self.pageRanks: [ 1.     0.75   1.125]
[PageRank.train] 2  self.pageRanks: [ 1.0625     0.765625   1.1484375]
[PageRank.train] 3  self.pageRanks: [ 1.07421875  0.76855469  1.15283203]
[PageRank.train] 4  self.pageRanks: [ 1.07641602  0.769104    1.15365601]
[PageRank.train] 5  self.pageRanks: [ 1.076828   0.769207   1.1538105]
[PageRank.train] 6  self.pageRanks: [ 1.07690525  0.76922631  1.15383947]
[PageRank.train] 7  self.pageRanks: [ 1.07691973  0.76922993  1.1538449 ]
[PageRank.train] 8  self.pageRanks: [ 1.07692245  0.76923061  1.15384592]
[PageRank.train] 9  self.pageRanks: [ 1.07692296  0.76923074  1.15384611]
[PageRank.train] 10  self.pageRanks: [ 1.07692305  0.76923076  1.15384615]
[PageRank.train] 11  self.pageRanks: [ 1.07692307  0.76923077  1.15384615]
[PageRank.train] 12  self.pageRanks: [ 1.07692308  0.76923077  1.15384615]
[PageRank.train] iteration: 12
[PageRank.train] difPageRanks: [ -3.35654704e-09  -8.39136760e-10  -1.25870514e-09]
pageRanks: [ 1.07692308  0.76923077  1.15384615]
sum(pageRanks) : 2.99999999874
getInputLinksList: [[2], [0], [0, 1]]
getOutputLinksList: [[1, 2], [2], [0]]

# Output with initial PR value 1/N:
[PageRank.train] 0  self.pageRanks: [ 0.33333333  0.33333333  0.33333333]
[PageRank.train] 1  self.pageRanks: [ 0.66666667  0.66666667  1.        ]
[PageRank.train] 2  self.pageRanks: [ 1.     0.75   1.125]
[PageRank.train] 3  self.pageRanks: [ 1.0625     0.765625   1.1484375]
[PageRank.train] 4  self.pageRanks: [ 1.07421875  0.76855469  1.15283203]
[PageRank.train] 5  self.pageRanks: [ 1.07641602  0.769104    1.15365601]
[PageRank.train] 6  self.pageRanks: [ 1.076828   0.769207   1.1538105]
[PageRank.train] 7  self.pageRanks: [ 1.07690525  0.76922631  1.15383947]
[PageRank.train] 8  self.pageRanks: [ 1.07691973  0.76922993  1.1538449 ]
[PageRank.train] 9  self.pageRanks: [ 1.07692245  0.76923061  1.15384592]
[PageRank.train] 10  self.pageRanks: [ 1.07692296  0.76923074  1.15384611]
[PageRank.train] 11  self.pageRanks: [ 1.07692305  0.76923076  1.15384615]
[PageRank.train] 12  self.pageRanks: [ 1.07692307  0.76923077  1.15384615]
[PageRank.train] iteration: 12
[PageRank.train] difPageRanks: [ -1.79015842e-08  -4.47539605e-09  -6.71309408e-09]
pageRanks: [ 1.07692307  0.76923077  1.15384615]
sum(pageRanks) : 2.99999999329
getInputLinksList: [[2], [0], [0, 1]]
getOutputLinksList: [[1, 2], [2], [0]]
```
A small question to close with: how can one prove or guarantee that PageRank necessarily converges after n iterations? That is, does convergence always hold? Try to prove it.
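One hedged hint, a sketch rather than a full proof: write one iteration as the affine map \(f(x) = (1-d)\mathbf{1} + d\,Mx\), where the entry \(M_{ij}\) equals \(1/|Out(T_j)|\) when \(T_j\) links to \(T_i\), and 0 otherwise. If every page has at least one out-link, each column of \(M\) sums to at most 1, so for any two PR vectors \(x\) and \(y\):

\[ \| f(x) - f(y) \|_{1} = d\,\| M(x - y) \|_{1} \le d\,\| x - y \|_{1} \]

Since \(0 < d < 1\), the map is a contraction in the L1 norm, and by the Banach fixed-point theorem the iteration converges geometrically to a unique fixed point regardless of the initial values. That is consistent with the experiment above, where initial values of 1 and 1/N reach the same limit.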