Similarity measurement means computing the degree of similarity between individuals: the larger the similarity value, the more alike the individuals are; the smaller the value, the greater the difference between them.
To compute the similarity between several different texts or short dialogue messages, a good approach is to map the words of each text into a vector space, establishing a correspondence between the text and vector data, and then measure text similarity by the size of the difference between the resulting vectors. Below we describe in detail a mature algorithm of this kind: vector-space cosine similarity.
Vector-space cosine similarity (Cosine Similarity)
Cosine similarity uses the cosine of the angle between two vectors in a vector space as the measure of the difference between two individuals. The closer the cosine is to 1, the closer the angle is to 0°, and the more similar the two vectors are; this is what is called "cosine similarity".
In the figure above, the angle between vectors a and b is small, so vector a and vector b are highly similar; in the extreme case, a and b coincide completely, as shown below:
As in Figure 2: vectors a and b can be regarded as equal, i.e. the texts they represent are completely similar, or in other words identical. If the angle between a and b is large, or they point in opposite directions, as shown below:
As in Figure 3: the angle between vectors a and b is large, so vector a and vector b have very low similarity; the texts they represent are essentially dissimilar. Can we, then, use a function of the size of the angle between two vectors to compute the similarity of individuals?
Vector-space cosine similarity is a method of computing individual similarity based on exactly this idea. A detailed derivation follows.
Starting from the cosine, the most basic formula is the simple one from middle school for a right triangle (Figure 4). For an angle $\theta$ with adjacent side $b$ and hypotenuse $c$, the cosine is:

$$\cos\theta = \frac{b}{c}$$
However, this applies only to right triangles; in a general triangle, the law of cosines is used instead (Figure 5). For a triangle with sides $a$, $b$, $c$, the cosine of the angle $\theta$ between sides $a$ and $b$ is:

$$\cos\theta = \frac{a^2 + b^2 - c^2}{2ab}$$
In a triangle expressed with vectors, suppose vector $a$ is $(x_1, y_1)$ and vector $b$ is $(x_2, y_2)$. The law of cosines can then be rewritten to give the cosine of the angle $\theta$ between $a$ and $b$ as:

$$\cos\theta = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}}$$
Extending this, the formula remains correct when $a$ and $b$ are $n$-dimensional rather than two-dimensional. Suppose $a = (A_1, A_2, \ldots, A_n)$ and $b = (B_1, B_2, \ldots, B_n)$; then the cosine of the angle $\theta$ between $a$ and $b$ is:

$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$
The closer the cosine is to 1, the closer the angle is to 0°, and the more similar the two vectors are; when the angle equals 0, the two vectors are equal. This is "cosine similarity".
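To make the n-dimensional formula concrete, here is a minimal plain-Python sketch (the function name `cosine` is ours, not from any library):

```python
import math

def cosine(a, b):
    # cos(theta) = sum(A_i * B_i) / (sqrt(sum(A_i^2)) * sqrt(sum(B_i^2)))
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine([1, 0], [1, 0]))  # parallel vectors  -> 1.0
print(cosine([1, 0], [0, 1]))  # perpendicular vectors -> 0.0
```

Vectors pointing the same way give a cosine of 1 regardless of their lengths, which is why this works as a similarity measure for texts of different sizes.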
[An example of computing text similarity with cosine]
Here is an example of using the above theory to compute the similarity of texts. For simplicity, we start with sentences.
Sentence A: 這隻皮靴號碼大了。那隻號碼合適 ("This boot is too big; that one is the right size.")
Sentence B: 這隻皮靴號碼不小,那隻更合適 ("This boot is not small; that one fits better.")
How do we compute the degree of similarity between these two sentences?
The basic idea is: the more similar the wording of two sentences, the more similar their content should be. We can therefore start from term frequencies to compute their similarity.
Step 1: word segmentation.
Sentence A: 這隻 / 皮靴 / 號碼 / 大了。那隻 / 號碼 / 合適。
Sentence B: 這隻 / 皮靴 / 號碼 / 不 / 小,那隻 / 更 / 合適。
Step 2: list all the words.
這隻, 皮靴, 號碼, 大了, 那隻, 合適, 不, 小, 更
Step 3: count term frequencies.
Sentence A: 這隻 1, 皮靴 1, 號碼 2, 大了 1, 那隻 1, 合適 1, 不 0, 小 0, 更 0
Sentence B: 這隻 1, 皮靴 1, 號碼 1, 大了 0, 那隻 1, 合適 1, 不 1, 小 1, 更 1
Step 4: write down the term-frequency vectors.
Sentence A: (1, 1, 2, 1, 1, 1, 0, 0, 0)
Sentence B: (1, 1, 1, 0, 1, 1, 1, 1, 1)
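The four steps above can be sketched in Python with `collections.Counter` (the token lists are written out by hand here; in practice a segmenter such as jieba would produce them):

```python
from collections import Counter

# hand-segmented tokens of sentence A and sentence B
tokens_a = ["這隻", "皮靴", "號碼", "大了", "那隻", "號碼", "合適"]
tokens_b = ["這隻", "皮靴", "號碼", "不", "小", "那隻", "更", "合適"]

# merged word list, in the order used in the article
vocab = ["這隻", "皮靴", "號碼", "大了", "那隻", "合適", "不", "小", "更"]

freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
vect_a = [freq_a[w] for w in vocab]  # -> [1, 1, 2, 1, 1, 1, 0, 0, 0]
vect_b = [freq_b[w] for w in vocab]  # -> [1, 1, 1, 0, 1, 1, 1, 1, 1]
print(vect_a, vect_b)
```

`Counter` returns 0 for words that never occur, which is exactly what the merged-vocabulary vectors need.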
At this point, the problem has become: how do we compute the degree of similarity between these two vectors? We can picture them as two line segments in space, both starting from the origin ([0, 0, ...]) and pointing in different directions. The two segments form an angle: if the angle is 0°, the directions are identical and the segments coincide, meaning the texts represented by the two vectors are exactly equal; if the angle is 90°, they are perpendicular and their directions are completely dissimilar; if the angle is 180°, the directions are exactly opposite. We can therefore judge the similarity of the vectors by the size of the angle: the smaller the angle, the more similar they are.
Using the n-dimensional formula above, we compute the cosine of the two sentence vectors, A = (1, 1, 2, 1, 1, 1, 0, 0, 0) and B = (1, 1, 1, 0, 1, 1, 1, 1, 1), to determine the similarity of the two sentences:

$$\cos\theta = \frac{1\times1 + 1\times1 + 2\times1 + 1\times0 + 1\times1 + 1\times1 + 0 + 0 + 0}{\sqrt{1^2+1^2+2^2+1^2+1^2+1^2+0^2+0^2+0^2}\times\sqrt{1^2+1^2+1^2+0^2+1^2+1^2+1^2+1^2+1^2}} = \frac{6}{3 \times 2\sqrt{2}} \approx 0.71$$

The cosine of the angle comes out to about 0.71, fairly close to 1, so sentence A and sentence B are largely similar.
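The arithmetic can be verified directly in Python:

```python
import math

a = [1, 1, 2, 1, 1, 1, 0, 0, 0]
b = [1, 1, 1, 0, 1, 1, 1, 1, 1]
dot = sum(x * y for x, y in zip(a, b))     # 6
norm_a = math.sqrt(sum(x * x for x in a))  # sqrt(9) = 3
norm_b = math.sqrt(sum(y * y for y in b))  # sqrt(8)
print(round(dot / (norm_a * norm_b), 2))   # 0.71
```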
From this we obtain the processing flow for computing text similarity:
Extract the keywords of the two documents;
Take a number of keywords from each document, merge them into one set, and count each document's term frequency over the words in this set;
Generate each document's term-frequency vector;
Compute the cosine similarity of the two vectors; the larger the value, the more similar the documents.
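The whole flow can be wired together end to end. Below is a minimal sketch under simplifying assumptions: naive regex tokenization stands in for real keyword extraction, and the function name `text_similarity` is ours:

```python
import math
import re
from collections import Counter

def text_similarity(doc_a, doc_b):
    # 1-2. tokenize and count term frequencies over the merged word set
    freq_a = Counter(re.findall(r"\w+", doc_a.lower()))
    freq_b = Counter(re.findall(r"\w+", doc_b.lower()))
    vocab = sorted(set(freq_a) | set(freq_b))
    # 3. term-frequency vectors
    va = [freq_a[w] for w in vocab]
    vb = [freq_b[w] for w in vocab]
    # 4. cosine similarity of the two vectors
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
    return dot / norm if norm else 0.0

print(text_similarity("this boot is big", "this boot is not small"))
```

For a production pipeline, the tokenizer and keyword-extraction step would be replaced with something appropriate to the language of the documents.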
Python implementation
```python
def cosin_distance(vector1, vector2):
    """
    K(X, Y) = <X, Y> / (||X|| * ||Y||)
    :param vector1: first term-frequency vector
    :param vector2: second term-frequency vector
    :return: cosine similarity, or None if either vector is all zeros
    """
    dot_product = 0.0
    normA = 0.0
    normB = 0.0
    for a, b in zip(vector1, vector2):
        dot_product += a * b
        normA += a ** 2
        normB += b ** 2
    if normA == 0.0 or normB == 0.0:
        return None
    return dot_product / ((normA * normB) ** 0.5)
```
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# vect1 and vect2 are the two term-frequency vectors
user_tag_matric = np.array([vect1, vect2])
user_similarity = cosine_similarity(user_tag_matric)
print(user_similarity)
```
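For instance, running the two sentence vectors from the example through scikit-learn's `cosine_similarity` (assuming scikit-learn is installed) reproduces the ≈0.71 computed above; the off-diagonal entry of the returned matrix is the similarity between the two rows:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

vect1 = [1, 1, 2, 1, 1, 1, 0, 0, 0]
vect2 = [1, 1, 1, 0, 1, 1, 1, 1, 1]
sim = cosine_similarity(np.array([vect1, vect2]))
print(round(float(sim[0, 1]), 2))  # 0.71
```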
```python
import functools
import math
import re
import time

text1 = "This game is one of the very best. games ive played. the ;pictures? " \
        "cant descripe the real graphics in the game."
text2 = "this game have/ is3 one of the very best. games ive played. the ;pictures? " \
        "cant descriPe now the real graphics in the game."
text3 = "So in the picture i saw a nice size detailed metal puzzle. Eager to try since I enjoy 3d wood puzzles, i ordered it. Well to my disappointment I got in the mail a small square about 4 inches around. And to add more disappointment when I built it it was smaller than the palm of my hand. For the price it should of been much much larger. Don't be fooled. It's only worth $5.00.Update 4/15/2013I have bought and completed 13 of these MODELS from A.C. Moore for $5.99 a piece, so i stand by my comment that thiss one is overpriced. It was still fun to build just like all the others from the maker of this brand.Just be warned, They are small."
text4 = "I love it when an author can bring you into their made up world and make you feel like a friend, confidant, or family. Having a special child of my own I could relate to the teacher and her madcap class. I've also spent time in similar classrooms and enjoyed the uniqueness of each and every child. Her story drew me into their world and had me laughing so hard my family thought I had lost my mind, so I shared the passage so they could laugh with me. Read this book if you enjoy a book with strong women, you won't regret it."


def timeit(func):
    """Decorator that prints the running time of the wrapped function."""
    @functools.wraps(func)
    def wrap(*args, **kwargs):
        start = time.time()
        res = func(*args, **kwargs)
        print('elapsed time: {0:.4f}'.format(time.time() - start))
        return res
    return wrap


def preprocess(text):
    """Text preprocessing; adapt the logic to the task at hand."""
    return text.split()


@timeit
def compute_cosine(words1, words2):
    """Compute the cosine similarity of two token lists."""
    # 1. count term frequencies (collections.Counter or nltk's FreqDist
    #    would do the same job)
    words1_dict = {}
    words2_dict = {}
    for word in words1:
        word = re.sub('[^a-zA-Z]', '', word).lower()
        if word:
            words1_dict[word] = words1_dict.get(word, 0) + 1
    for word in words2:
        word = re.sub('[^a-zA-Z]', '', word).lower()
        if word:
            words2_dict[word] = words2_dict.get(word, 0) + 1

    # 2. sort by frequency
    dic1 = sorted(words1_dict.items(), key=lambda x: x[1], reverse=True)
    dic2 = sorted(words2_dict.items(), key=lambda x: x[1], reverse=True)

    # 3. build the merged vocabulary and the two term-frequency vectors
    words_key = [w for w, _ in dic1]
    words_key += [w for w, _ in dic2 if w not in words_key]
    vect1 = [words1_dict.get(w, 0) for w in words_key]
    vect2 = [words2_dict.get(w, 0) for w in words_key]

    # 4. cosine similarity
    dot = 0
    sq1 = 0
    sq2 = 0
    for i in range(len(vect1)):
        dot += vect1[i] * vect2[i]
        sq1 += vect1[i] ** 2
        sq2 += vect2[i] ** 2
    try:
        result = round(float(dot) / (math.sqrt(sq1) * math.sqrt(sq2)), 2)
    except ZeroDivisionError:
        result = 0.0
    return result


def cosin_distance(vector1, vector2):
    """K(X, Y) = <X, Y> / (||X|| * ||Y||)"""
    dot_product = 0.0
    normA = 0.0
    normB = 0.0
    for a, b in zip(vector1, vector2):
        dot_product += a * b
        normA += a ** 2
        normB += b ** 2
    if normA == 0.0 or normB == 0.0:
        return None
    return dot_product / ((normA * normB) ** 0.5)


if __name__ == '__main__':
    tokens1 = preprocess(text1)
    tokens2 = preprocess(text2)
    print(compute_cosine(tokens1, tokens2))
```