PPT原稿參見Stanford;http://www.stanford.edu/class/cs124/lec/med.pdfpython
Tips:因爲本人水平有限,對MED的我的理解可能有紕漏之處,請勿盡信。算法
Edit:我的理解指編輯之意,也即對於兩個字符串,對其中的一個進行各類編輯操做(插入、刪除、替換)使其變爲另外一個字符串。要解決的問題是求出最小的編輯操做次數是多少。編程
基因系列比對數組
X,Y是大小分別爲n,m的字符串。app
定義D(i,j)表示X[1..i],Y[1..j]兩子字符串的距離。則:框架
X與Y的距離爲D(n,m)函數
對於D(i,j)的計算結果作成表格(矩陣),D(i,j)的運算結果能夠創建在以前的結果之上。post
對本算法的一點我的理解:測試
設 i:輸出子字符串長度this
j:輸入子字符串長度
D(0,0)=0
D(i,0):輸入0個字符轉換到i個字符的輸出,也即要插入i個字符,代價爲insertCost*i
D(0,j):目標長度爲0,輸入長度爲j,因此代價爲:deletionCost*j
D(i,j)的上一步能夠分爲三種狀況:
一、上一步輸入長度爲j,輸出長度爲i-1,那麼如今這一步確定至少還要插入一個字符才能達到i長度輸出:
D(i-1,j)+inseartCost*1
二、上一步輸入長度爲j-1,輸出長度爲i-1,那麼如今這一步第j個輸入只須要作替換處理(若是第j個輸入與第i個輸出不相等)或者保持不變(若是第j個輸入與第i個輸出相等):
D(i-1,j-1)+substituteCost*(source[j]==target[i] ? 0 : 1)
三、上一步輸入長度爲j-1,輸出長度爲i,那麼如今這一步因爲又多了一個字符,因此要把多的這個字符刪除:
D(i,j-1)+deletionCost*1
因爲咱們要求的是最小Edit Distance,天然,就是上述三種狀況中最小值作爲D(i,j)的值。具體算法以下:
最小編輯距離動態算法(Levenshtein):
D(n,m)即爲最小距離。
上圖中從(0,0)到(n,m)的每個非遞減路徑即爲兩字符串的對齊關係。
爲何要加權?
由於有些字符被寫錯的機率要大些(如搜索引擎中常常能自動搜索到相近的詞句)
#=============================================================================== # Using dynamic programming to realize the MED algorithm(Levenshtein) # MED: short for Minimum Edit Distance #=============================================================================== import types import numpy as np class MED: def __init__(self): self.insCost = 1 #insertion cost self.delCost = 1 #deletion cost self.subsCost = 1 #substitution cost self.D = 0 def mDistance(self, word1, word2): assert type(word1) is types.StringType assert type(word2) is types.StringType size1 = len(word1) size2 = len(word2) if size1==0 or size2 ==0: return size1*self.delCost+size2*self.insCost D_mat = np.zeros((size1+1,size2+1)) D_mat[:,0] = range(size1+1) D_mat[0,:] = range(size2+1) for i in range(1,size1+1): for j in range(1,size2+1): insert_cost = D_mat[i-1, j] + self.insCost*1 delete_cost = D_mat[i, j-1] + self.delCost*1 if word1[i-1]==word2[j-1]: substitue_cost = D_mat[i-1, j-1] else: substitue_cost = D_mat[i-1, j-1] + self.subsCost*1 D_mat[i,j] = min(insert_cost, delete_cost, substitue_cost) self.D = D_mat return D_mat[size1, size2] if __name__ == "__main__": word1 = "Function" word2 = "fanctional" med = MED() print "MED distance is :" ,med.mDistance(word1, word2)
輸出結果:
MED distance is : 4.0
對原ED計算函數作點更改(每次獲得MED時記錄該值的來源,而後從D(n,m)開始利用記錄的來源回溯至起始位置,獲得該MED的完整路徑):
def computingAlignment(self, word1, word2): assert type(word1) is types.StringType assert type(word2) is types.StringType size1 = len(word1) size2 = len(word2) if size1==0 or size2 ==0: return size1*self.delCost+size2*self.insCost D_mat = np.zeros((size1+1,size2+1)) D_rec = np.zeros((size1+1, size2+1)) D_mat[:,0] = range(size1+1) D_mat[0,:] = range(size2+1) for i in range(1,size1+1): for j in range(1,size2+1): insert_cost = D_mat[i-1, j] + self.insCost*1 delete_cost = D_mat[i, j-1] + self.delCost*1 if word1[i-1]==word2[j-1]: substitue_cost = D_mat[i-1, j-1] else: substitue_cost = D_mat[i-1, j-1] + self.subsCost*1 D_mat[i,j] = min(insert_cost, delete_cost, substitue_cost) if D_mat[i,j] == insert_cost:#Record Where the min val comes from D_rec[i,j] = 1 elif D_mat[i,j]== delete_cost: D_rec[i,j] = 2 else: D_rec[i,j] = 3 self.D = D_mat self.D_rec = D_rec # BackTrace alignRevPath=[]#Get the reverse path j = size2 i = size1 while i!=0 or j!=0:#Be carefull of this row alignRevPath.append([i,j,D_rec[i,j]]) if D_rec[i,j]==1: i -=1 elif D_rec[i,j]==2: j -=1 elif D_rec[i,j]==3: i -=1 j-=1 elif D_rec[i,j]==0: if i>0: i -= 1 if j>0: j -= 1 alignStr1 =[] alignStr2 =[] if alignRevPath[-1][0]!=0:#process the first postion of the path alignStr1.append(word1[alignRevPath[-1][0]-1]) else: alignStr1.append('*') if alignRevPath[-1][1]!=0: alignStr2.append(word2[alignRevPath[-1][1]-1]) else: alignStr2.append('*') for i in range(len(alignRevPath)-1, 0, -1): #process the rest of the path k = np.subtract(alignRevPath[i-1], alignRevPath[i]) bK = k>0 if bK[0]!=0: alignStr1.append(word1[alignRevPath[i-1][0]-1]) else: alignStr1.append('*') if bK[1]!=0: alignStr2.append(word2[alignRevPath[i-1][1]-1]) else: alignStr2.append('*') return (alignStr1, alignStr2)
上面的代碼中alignRevPath用來保存路徑上的每個位置,其每一個元素都爲3元列表,前兩維爲路上的座標,第3維取0、一、二、3四種值,0表示路徑到達邊界了,1表示當前的ED結果由前一個insert操做獲得,2表示當前ED結果由前一個Delete獲得,3表示當前ED結果由前一個substitue獲得。
在主程序中加入以下測試代碼:
word1 = "efnction" word2 = "faunctional" med = MED() w1, w2 = med.computingAlignment(word1, word2) print ' '.join(w1) print '| '*len(w1) print ' '.join(w2)
獲得的輸出:
原理:
圖示:
從上圖能夠看出,匹配的並不必定是連續的子串.這是由於咱們的懲罰項設置爲:
也即迭代過程中的s(xi,yj)取-1(不匹配)或1(匹配)。當子串中有不匹配的字符出現時,將會對以前的匹配計數減1。
若是想匹配最長連續子串,能夠令懲罰項F(i,j)爲0(不匹配)或F(i-1,j-1)+1(匹配)
Python 實現(對computingAlignment作少量更改):
def longestSubstr(self, word1, word2): assert type(word1) is types.StringType assert type(word2) is types.StringType size1 = len(word1) size2 = len(word2) if size1==0 or size2 ==0: return size1*self.delCost+size2*self.insCost D_mat = np.zeros((size1+1,size2+1)) D_rec = np.zeros((size1+1, size2+1)) D_mat[:,0] = 0 D_mat[0,:] = 0 for i in range(1,size1+1): for j in range(1,size2+1): insert_cost = D_mat[i-1, j] - self.insCost*1 delete_cost = D_mat[i, j-1] - self.delCost*1 if word1[i-1]==word2[j-1]: substitue_cost = D_mat[i-1, j-1] +1 else: substitue_cost = D_mat[i-1, j-1] - self.subsCost*1 # substitue_cost = 0 D_mat[i,j] = max(0,insert_cost, delete_cost, substitue_cost) if D_mat[i,j] == insert_cost:#Record Where the min val comes from D_rec[i,j] = 1 elif D_mat[i,j]== delete_cost: D_rec[i,j] = 2 elif D_mat[i,j]==substitue_cost: D_rec[i,j] = 3 self.D = D_mat self.D_rec = D_rec maxVal = np.max(D_mat) maxBool = D_mat == maxVal numMax = np.sum(maxBool) alignRevPaths=[] for i in range(numMax): alignRevPaths.append([]) pathStarts=[] for i in range(size1+1): for j in range(size2+1): if maxBool[i,j]: pathStarts.append([i,j]) for m in range(numMax): i = pathStarts[m][0] j = pathStarts[m][1] while i!=0 and j!=0: alignRevPaths[m].append([i,j,D_rec[i,j]]) if D_rec[i,j]==1: i -=1 elif D_rec[i,j]==2: j -=1 elif D_rec[i,j]==3: i -=1 j-=1 elif D_rec[i,j]==0: break if D_mat[i,j]==0: break str1=[] str2=[] for m in range(numMax): str1.append([]) str2.append([]) p = alignRevPaths[m] for i in range(len(p)-1, -1,-1): str1[m].append(word1[p[i][0]-1]) str2[m].append(word2[p[i][1]-1]) return ([''.join(i) for i in str1],[''.join(i) for i in str2])
代碼測試:
word1 = "ATCAT" word2 = "ATTATC" med = MED() str1,str2= med.longestSubstr(word1, word2) print "Longest match substr:" print "match substr in word1:", str1 print "match substr in word2:", str2
輸出結果:
Longest match substr:
match substr in word1: ['ATC', 'ATCAT']
match substr in word2: ['ATC', 'ATTAT']
若是把longestSubstr中的
substitue_cost = D_mat[i-1, j-1] - self.subsCost*1
改成:
substitue_cost = 0
便可用來求最大連續子字符串,一樣運行上述測試代碼,可得:
Longest match substr:
match substr in word1: ['ATC']
match substr in word2: ['ATC']
改用另一組字符串進行實驗進行進一步驗證:
word1 = "fefnction" word2 = "faunctional" med = MED() str1,str2= med.longestSubstr(word1, word2) print "Longest match substr:" print "match substr in word1:", str1 print "match substr in word2:", str2
當求解的是最長子字符串(非連續)時,輸出:
Longest match substr:
match substr in word1: ['nction']
match substr in word2: ['nction']
當求解的是最長連續子字符串時,輸出:
Longest match substr:
match substr in word1: ['nction']
match substr in word2: ['nction']
在以上的算法中,MED的一個最大的特色就是利用了矩陣保存以前處理的結果數據,以作爲下一次的輸入。對於最長子字符串的查找,一樣套用了MED的框架,但仔細一想咱們會發現,其實最長子串的查找並不必定須要記錄路徑alignRevPaths,有了alignRevPaths這個路徑,咱們編程時是根據這個路徑處理字符的。路徑的設置主要是爲了在computingAlignment中對字符進行對齊,由於字符對齊的狀況下,字符不必定是連續,好比會有以下對齊的形式中的「*」號:
因此咱們纔想到要用路徑作記錄。
但在最長子串中,子串確定是連續的,天然路徑也就不須要了。
下面對最長子串程序作簡化:
def longestSubstr2(self,word1, word2): assert type(word1) is types.StringType assert type(word2) is types.StringType size1 = len(word1) size2 = len(word2) if size1==0 or size2 ==0: return size1*self.delCost+size2*self.insCost D_mat = np.zeros((size1+1,size2+1)) D_mat[:,0] = 0 D_mat[0,:] = 0 for i in range(1,size1+1): for j in range(1,size2+1): insert_cost = D_mat[i-1, j] - self.insCost*1 delete_cost = D_mat[i, j-1] - self.delCost*1 if word1[i-1]==word2[j-1]: substitue_cost = D_mat[i-1, j-1] +1 else: substitue_cost = D_mat[i-1, j-1] - self.subsCost*1 # substitue_cost = 0 D_mat[i,j] = max(0,insert_cost, delete_cost, substitue_cost) self.D = D_mat maxVal = np.max(D_mat) maxBool = D_mat == maxVal numMax = np.sum(maxBool) alignRevPaths=[] for i in range(numMax): alignRevPaths.append([]) pathStarts=[] for i in range(size1+1): for j in range(size2+1): if maxBool[i,j]: pathStarts.append([i,j]) str1=[] str2=[] for m in range(numMax): str1.append([]) str2.append([]) i = pathStarts[m][0] j = pathStarts[m][1] s1Tmp = [] s2Tmp = [] while D_mat[i,j]!=0: s1Tmp.append(word1[i-1]) s2Tmp.append(word2[j-1]) i -= 1 j -= 1 str1[m]=[s1Tmp[len(s1Tmp)-i-1] for i in range(len(s1Tmp))] str2[m]=[s2Tmp[len(s2Tmp)-i-1] for i in range(len(s2Tmp))] return ([''.join(i) for i in str1],[''.join(i) for i in str2])
使用如下字符作實驗:
word1 = "ATCAT"
word2 = "ATTATC"
輸出結果一樣爲:
Longest match substr:
match substr in word1: ['ATC', 'ATCAT']
match substr in word2: ['ATC', 'ATTAT']
一點我的感悟:
雖然這裏的動態規劃算法看上去有點難以想象,但聯想起線性代數中的矩陣運算也就不難理解了。
就拿最長子串的程序來講,其實際過程仍可看作:對每word1中的每個字符在word2中進行查找,當匹配第一個字符後繼續匹配第二個字符,而後第3、第四……個,直到有字符不匹配時,記錄該匹配成功的串長度及字串始末位置。
而這裏的Smith-waterman算法將這個匹配記錄在一個2維矩陣(數組)中。咱們都知道,線性代數中的矩陣乘法一樣是能夠展開爲各元素的乘與累加操做的。而這裏,我認爲,發明者正是利用了這一思想。