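I made up my mind to tackle machine learning. After a lot of tossing and turning I wrestled with linear algebra, then with probability theory, and then went back over the advanced mathematics from my university days. It took the better part of a day, but today it finally paid off, so I am writing down my line of thinking to share with you.
Math background:
Since I can't type math symbols here, I will spell them out in plain text, like this:
vector A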
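① A line represented with a vector: {t * vector A | t ∈ R}
In the two-dimensional plane, as long as vector A is not the zero vector, this expression describes the straight line through the origin in the direction of vector A. Adding a fixed vector B, {vector B + t * vector A | t ∈ R}, covers any line in the plane.
// Generalizing from this: how should we represent a line in higher dimensions?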
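One way to approach that question: the set notation carries over unchanged to any dimension, because t * vector A only needs scalar multiplication. Below is a minimal sketch with NumPy. The function name `points_on_line`, the optional offset, and the sample vector are made up for illustration.

```python
import numpy as np


def points_on_line(direction, ts, offset=None):
    """Return the points offset + t * direction for each t in ts, one per row."""
    direction = np.asarray(direction, dtype=float)
    if offset is None:
        # No offset means the line passes through the origin.
        offset = np.zeros_like(direction)
    ts = np.asarray(ts, dtype=float)
    # Broadcasting: (len(ts), 1) * (n,) gives a (len(ts), n) array.
    return np.asarray(offset, dtype=float) + ts[:, None] * direction


# A line in 4-dimensional space with direction A = (1, 2, 0, -1).
A = [1, 2, 0, -1]
pts = points_on_line(A, [-1, 0, 0.5, 2])
print(pts)
```

Every row is a multiple of A, which is exactly what "lying on the same line through the origin" means, whatever the dimension.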
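② Dot product of vectors: vector A · vector B = |vector A| * |vector B| * cos(θ), where θ is the angle between the two vectors // I couldn't find the math symbols, so please bear with this notation
These two concepts alone are enough to solve the simplest version of the text similarity problem.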
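To make the second formula concrete, here is a small sketch that solves it for the angle: cos(θ) = (A · B) / (|A| * |B|). It uses only the standard library. The function name `angle_between` and the example vectors are illustrative.

```python
import math


def angle_between(a, b):
    """Return the angle in radians between two equal-length, nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_theta = dot / (norm_a * norm_b)
    # Clamp against floating-point rounding before calling acos.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.acos(cos_theta)


print(math.degrees(angle_between([1, 0], [1, 1])))  # about 45 degrees
print(math.degrees(angle_between([1, 0], [0, 3])))  # perpendicular: about 90 degrees
```

Note that the lengths of the vectors cancel out, so only their directions matter. That is why a long text and a short text on the same topic can still come out as similar.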
----------------------------------------------------------------------------------------------------------------------------
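Program outline:
1. Read the texts.
2. Decode the text content (UTF-8).
3. Segment the texts into words.
4. Drop the segmented words that are stop words, then count how often each remaining word of the reference text occurs in the text being compared --> term frequencies of the text to classify.
5. Compare the term frequencies of the text to classify with those of the reference text by computing the cosine of the angle between them. The smaller the angle (the closer the cosine is to 1), the more similar the texts.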
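Step 4 is the easiest one to get wrong, so here is a minimal sketch of it in isolation. It assumes the texts are already segmented into word lists (in the real program that is jieba's job). The words, the stop words, and the helper name `term_frequencies` are invented for the example.

```python
from collections import Counter


def term_frequencies(tokens, stopwords, vocabulary=None):
    """Count tokens that are not stop words.

    If vocabulary is given, words outside it are ignored, so the result
    can be compared axis by axis with the counts of the reference text.
    """
    counts = Counter()
    for token in tokens:
        word = token.strip()
        if not word or word in stopwords:
            continue
        if vocabulary is None or word in vocabulary:
            counts[word] += 1
    return counts


stopwords = {"the", "a", "of"}
sample_tokens = ["the", "war", "of", "tanks", "war"]
test_tokens = ["a", "war", "of", "phones", "tanks"]

sample_counts = term_frequencies(sample_tokens, stopwords)
test_counts = term_frequencies(test_tokens, stopwords, vocabulary=sample_counts)
print(sample_counts)  # war: 2, tanks: 1
print(test_counts)    # war: 1, tanks: 1 ("phones" is not in the sample)
```

Restricting the second count to the sample's vocabulary is a design choice: it guarantees both texts are described along the same set of words.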
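Let me explain:
Step 4 is, in essence, counting with a dictionary: it records how often each word occurs. Those counts are then treated as a line in a multidimensional space, one dimension per word (line <----> its matrix representation).
Step 5 represents that line by its vector and then uses the dot product of two such vectors to work out the angle between them. Simple, isn't it? This is the first time I have seen for myself what mathematics is good for.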
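A small worked example may help with the jump from dictionaries to vectors. The sketch below fixes one axis per sample word and then applies the dot-product formula. The word counts and the helper names `to_vector` and `cosine` are made up for illustration.

```python
import numpy as np


def to_vector(counts, keys):
    """Lay a word -> count dictionary out along a fixed list of words."""
    return np.array([counts.get(key, 0) for key in keys], dtype=float)


def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0


sample = {"war": 2, "tanks": 1, "army": 1}
text_a = {"war": 1, "tanks": 1}
text_b = {"army": 3}

keys = sorted(sample)  # one axis per sample word, same order for every text
s = to_vector(sample, keys)
print(cosine(s, to_vector(text_a, keys)))  # about 0.866
print(cosine(s, to_vector(text_b, keys)))  # about 0.408
```

Here text_a shares two of the sample's three words and scores higher than text_b, which repeats a single shared word three times.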
Here is the code (Python 3.4):
import collections

import jieba
import numpy as np

# Raw strings keep the Windows backslashes from being read as escape sequences.
ftest1fn = r"D:\Tempory\mobile2.txt"
ftest2fn = r"D:\Tempory\war2.txt"
sampfn = r"D:\Tempory\war1.txt"
stopfn = r"D:\Tempory\stopwords.txt"


def get_cossimi(x, y):
    """Return the cosine of the angle between term-frequency vectors x and y."""
    myx = np.array(x, dtype=float)
    myy = np.array(y, dtype=float)
    norms = np.linalg.norm(myx) * np.linalg.norm(myy)
    if norms == 0:
        # A text that shares no word with the sample gives an all-zero vector.
        return 0.0
    return float(np.dot(myx, myy) / norms)


def read_text(path):
    """Read a whole UTF-8 text file."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


def count_words(text, stopwords, vocabulary=None):
    """Segment text with jieba and count the words that are not stop words.

    If vocabulary is given, only words already in it are counted.
    """
    counts = collections.Counter()
    for myword in jieba.cut(text):
        if myword.strip() in stopwords:
            continue
        if vocabulary is None or myword in vocabulary:
            counts[myword] += 1
    return counts


if __name__ == '__main__':
    print("loading...")
    # One stop word per line; the empty string also drops whitespace-only tokens.
    stopwords = set(read_text(stopfn).splitlines()) | {""}

    print("working...")
    # The sample text defines the vocabulary, i.e. the axes of the vector space.
    all_words = count_words(read_text(sampfn), stopwords)
    mytest1_words = count_words(read_text(ftest1fn), stopwords, all_words)
    mytest2_words = count_words(read_text(ftest2fn), stopwords, all_words)

    # Build the three vectors in the same key order.
    keys = list(all_words)
    sampdate = [all_words[key] for key in keys]
    test1data = [mytest1_words[key] for key in keys]
    test2data = [mytest2_words[key] for key in keys]

    test1simi = get_cossimi(sampdate, test1data)
    test2simi = get_cossimi(sampdate, test2data)
    # Report the sample file name rather than dumping the whole sample vector.
    print("Cosine similarity between {0} and sample {1}: {2}".format(ftest1fn, sampfn, test1simi))
    print("Cosine similarity between {0} and sample {1}: {2}".format(ftest2fn, sampfn, test2simi))