3. Big Data Mining Series: Text Similarity Matching

Date: 2019-11-12

Tags: data mining series, text similarity matching


Preface

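In this post we compute text similarity, mainly with the jieba and Gensim modules. What is text similarity good for? It lets us find articles with similar content, recommend similar articles to readers, and check whether several articles look suspiciously like plagiarism. Right, let's hit the road; buckle up.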

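Handling character encoding when reading large text files on Windows:

Let's first look at the code, using the most basic open() call:

f = open(r'F:\Learnning\daomubiji.txt', 'r').read()

This raises an error: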

Traceback (most recent call last):
  File "H:/PythonProject/untitled/DataParse/Day2/002.py", line 32, in <module>
    f=open('F:\Learnning\daomubiji.txt','r',).read()
UnicodeDecodeError: 'gbk' codec can't decode byte 0xac in position 512518: illegal multibyte sequence

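One workaround for this is to serve the file over HTTP and read it that way, decoding as UTF-8 and ignoring bytes that cannot be decoded:

import urllib.request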
hf = urllib.request.urlopen('http://127.0.0.1/txt.html').read().decode('utf8','ignore')
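
If the file's encoding is known, another option is to pass it to open() directly. With errors="ignore", bytes that cannot be decoded are dropped, much like the decode('utf8','ignore') call above. A minimal sketch, assuming the text is mostly UTF-8; the helper name read_text_lenient and the sample file are hypothetical:

```python
import os
import tempfile


def read_text_lenient(path, encoding="utf-8"):
    """Read a text file, silently dropping bytes that are invalid in `encoding`."""
    with open(path, "r", encoding=encoding, errors="ignore") as f:
        return f.read()


# Build a small sample file containing one byte (0xac) that is not valid UTF-8 here.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "wb") as f:
    f.write("盜墓".encode("utf-8") + b"\xac" + "筆記".encode("utf-8"))

text = read_text_lenient(sample_path)
print(text)
```

Dropping bytes loses a little text, so this only suits cases like this one, where a few damaged characters in a long novel do not matter.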

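Introduction to the jieba module

The jieba module segments Chinese text into words. Sentences and articles are analysed word by word, so Chinese text has to be segmented before we can process it. English is different: splitting on spaces is enough. Let's see how jieba is used:
Install jieba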

pip install jieba

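cut method, full mode:

import jieba

setence = '溫柔易淡在北京東方之音工做'
w1 = jieba.cut(setence, cut_all=True)   # cut segments the text; cut_all=True selects full mode
print(w1)   # jieba.cut returns a generator, so this prints the generator object
for w in w1:
    print(w)

The output is: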

溫柔
易
淡
在
北京
北京東方
京東
京東方
東方
之
音
工做

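As shown above, full mode emits every word it can find, including overlapping ones.

cut method, accurate mode:

Only a small change to the code above is needed: cut_all=False. Accurate mode is the default.

w1 = jieba.cut(setence, cut_all=False)   # cut segments the text; cut_all=False selects accurate mode

The output is: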

溫柔
易淡
在
北京東方
之音
工做

Clearly there are no overlapping words any more. By the way, words carry weights, so the segmentation here is chosen according to those weights.
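
To illustrate how word weights can drive segmentation, here is a toy sketch that uses dynamic programming to pick the split with the highest total log-probability. The function segment and the frequencies are made up for illustration. This is not jieba's actual implementation, only the general idea of frequency-weighted segmentation.

```python
import math


def segment(text, freq):
    """Split text into the word sequence with the highest total log-probability."""
    total = sum(freq.values())
    n = len(text)
    # best[i] = (log-probability of the best split of text[i:], end index of its first word)
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = text[i:j]
            # Single characters are always allowed so that every text can be split.
            if word in freq or j == i + 1:
                log_p = math.log(freq.get(word, 1) / total)
                candidates.append((log_p + best[j][0], j))
        best[i] = max(candidates)
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(text[i:j])
        i = j
    return words


# Hypothetical frequencies, chosen only to show the effect of the weights.
freq = {'北京': 50, '東方': 30, '京東': 10, '北京東方': 40}
print(segment('北京東方', freq))

# Lowering the weight of the long word makes the two shorter words win.
freq_low = {**freq, '北京東方': 1}
print(segment('北京東方', freq_low))
```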

cut_for_search method:

This method is meant for building inverted indexes (search-engine mode). The code is as follows:

w2 = jieba.cut_for_search(setence)
for i in w2:
    print(i)

The output is:

溫柔
易淡
在
北京
京東
東方
京東方
北京東方
之音
工做
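
For readers unfamiliar with inverted indexes, the following is a minimal sketch of what such an index looks like once the text has been segmented. The helper build_inverted_index and the second document are hypothetical; the first token list is copied from the output above.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


docs = {
    # Tokens as produced by cut_for_search in the output above
    1: ['溫柔', '易淡', '在', '北京', '京東', '東方', '京東方', '北京東方', '之音', '工做'],
    # Hypothetical second document
    2: ['北京', '工做'],
}
index = build_inverted_index(docs)
print(index['北京'])
print(index['京東方'])
```

Because cut_for_search also emits the shorter words inside long ones, a query for 北京 can find a document that only contained 北京東方.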

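Using jieba dictionaries

jieba ships with its own dictionary, but sometimes we need a custom one. We can write our own dictionary in the jieba module's directory, as shown below.
I searched my computer for python and found the jieba module under lib: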

C:\Users\Leo\AppData\Local\Programs\Python\Python35\Lib\site-packages\jieba

In this directory there is a file called dict.txt, which ships with jieba. Let's look at the format of dict.txt:

word	weight	part of speech
AT&T	3	nz

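The first column is the word, the second is the word's weight, and the third is its part of speech.
Now that we know the jieba dictionary format, we can define our own dictionary. I edited mine with Notepad, which ships with Windows; see below:
My own custom dictionary:

word	weight	part of speech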
ljf	3	n
Leo	3	n
溫柔易淡	3	n
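
The same entries can also be written from Python instead of Notepad. A minimal sketch: the helper write_user_dict and the temporary path are hypothetical, and it assumes the layout that jieba's README describes for user dictionaries, namely one entry per line with word, frequency and tag separated by spaces, saved as UTF-8.

```python
import os
import tempfile


def write_user_dict(path, entries):
    """Write (word, frequency, tag) entries, one per line, separated by spaces."""
    with open(path, "w", encoding="utf-8") as f:
        for word, freq, tag in entries:
            f.write(f"{word} {freq} {tag}\n")


entries = [("ljf", 3, "n"), ("Leo", 3, "n"), ("溫柔易淡", 3, "n")]
dict_path = os.path.join(tempfile.mkdtemp(), "dict1.txt")
write_user_dict(dict_path, entries)

# Read the file back to check what was written.
with open(dict_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

The resulting path can then be passed to jieba.load_userdict.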

Then we load our custom dictionary:

import jieba.posseg
# Load the custom dictionary
jieba.load_userdict(r"C:\dict1.txt")
sentence3 = 'ljfLeo溫柔易淡都是同一個人'
w4 = jieba.posseg.cut(sentence3)
for i in w4:
    print(i.word, '---', i.flag)

The output is:

ljf --- n     # With the custom dictionary, ljf and Leo are split apart.
Leo --- n
溫柔易淡 --- n
都 --- d
是 --- v
同一個 --- b
人 --- n

Now let's look back at the output without the custom dictionary:

ljfLeo --- eng    # Without the custom dictionary, ljfLeo is treated as a single English word.
溫柔 --- a
易 --- a
淡 --- a
都 --- d
是 --- v
同一個 --- b
人 --- n

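Part-of-speech tags

a adjective
b distinguishing word
d adverb
e interjection
f locative word
i idiom
m numeral
n noun
nr person name
ns place name
nt organization
nz other proper noun
p preposition
r pronoun
t time word
u particle
v verb
vn verbal noun
w punctuation
un unknown word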

Using posseg to determine parts of speech:

from jieba import posseg
sentence = '溫柔易淡在北京東方之音工做'
w5 = posseg.cut(sentence)
for i in w5:
    print(i.word, '=', i.flag)

The output is:

溫柔 = a
易 = a
淡 = a
在 = p
北京東方 = ns
之 = u
音 = n
工做 = vn

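Keyword extraction with analyse

This is done with analyse.extract_tags. The code is as follows:

import jieba.analyse
sentence2 = '溫柔易淡在北京東方之音工做'
tag = jieba.analyse.extract_tags(sentence2, 3)   # the second argument asks for 3 keywords
print(tag)

The output is: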

['北京東方', '易淡', '之音']

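tokenize: getting word offsets

As with tuples and lists, offsets start at 0. Each result is a (word, start, end) tuple. The code is as follows:

w9 = jieba.tokenize(sentence2)
for i in w9:
    print(i)

The output is: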

('溫柔', 0, 2)
('易淡', 2, 4)
('在', 4, 5)
('北京東方', 5, 9)
('之音', 9, 11)
('工做', 11, 13)
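
The offsets behave like Python slice bounds: start is inclusive and end is exclusive. A small check in plain Python, assuming sentence2 holds the same sentence used earlier (the tuples are copied from the output above):

```python
sentence2 = '溫柔易淡在北京東方之音工做'

# (word, start, end) tuples copied from the tokenize output above
tokens = [
    ('溫柔', 0, 2), ('易淡', 2, 4), ('在', 4, 5),
    ('北京東方', 5, 9), ('之音', 9, 11), ('工做', 11, 13),
]

# Slicing the sentence with each pair of offsets gives back the word.
recovered = [sentence2[start:end] for _, start, end in tokens]
print(recovered)

# In the default mode the words do not overlap, so joining them rebuilds the sentence.
print(''.join(recovered) == sentence2)
```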

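tokenize: getting word offsets in search mode

Search mode corresponds to cut_for_search rather than to cut's full mode: on top of the accurate-mode result, it also emits the shorter words contained in long words.

w9 = jieba.tokenize(sentence2, mode='search')
for i in w9:
    print(i)

The output is: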

('溫柔', 0, 2)
('易淡', 2, 4)
('在', 4, 5)
('北京', 5, 7)
('京東', 6, 8)
('東方', 7, 9)
('京東方', 6, 9)
('北京東方', 5, 9)
('之音', 9, 11)
('工做', 11, 13)

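Extracting keywords from Daomu Biji (盜墓筆記)

Keywords are extracted with analyse.extract_tags; the second argument, 3, asks for three keywords.
The text was downloaded from the internet. I then deleted a few sentences in Notepad (a token deletion) and saved the file again, which avoids most of the encoding problems that would otherwise stop open from reading it. Damned character encodings.

import jieba.analyse
f = open(r'C:\Program Files\phpstudy\WWW\daomubiji.txt', 'rb').read()
tag = jieba.analyse.extract_tags(f, 3)
print(tag)
# Output:
['三叔', '咱們', '老癢']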

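Introduction to Gensim

Gensim is an open-source third-party Python toolkit for learning, without supervision, latent topic vector representations from raw unstructured text. It supports a range of topic-model algorithms including TF-IDF, LSA, LDA and word2vec, supports streaming training, and provides APIs for common tasks such as similarity computation and information retrieval.
Here are four basic Gensim concepts:

Corpus: a collection of raw texts used to train the latent topic structure without supervision. The corpus needs no manually annotated extra information. In Gensim, a corpus is usually an iterable object (a list, for example); each iteration returns a sparse vector that represents one text.
Vector: a list made up of a set of text features. It is the internal representation of a piece of text in Gensim.
Sparse vector: we can omit the redundant zero or empty elements of a vector. Each remaining element is then a (key, value) tuple.
Model: an abstract term. It defines a transformation between two vector spaces (that is, from one vector representation of a text to another).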
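
The difference between a dense vector and a sparse vector can be shown in a few lines of plain Python. This is only an illustration of the idea; the helpers to_sparse and to_dense are hypothetical and are not part of Gensim.

```python
def to_sparse(dense):
    """Keep only the non-zero entries as (index, value) tuples."""
    return [(i, v) for i, v in enumerate(dense) if v != 0]


def to_dense(sparse, length):
    """Expand (index, value) tuples back into a list of the given length."""
    dense = [0] * length
    for i, v in sparse:
        dense[i] = v
    return dense


dense_vec = [0, 2, 0, 0, 1, 0]
sparse_vec = to_sparse(dense_vec)
print(sparse_vec)
print(to_dense(sparse_vec, len(dense_vec)))
```

For text, where a document uses only a tiny fraction of the vocabulary, the sparse form is far smaller than the dense one.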

Basic usage

#!/usr/bin/env python
# __author__:Leo
from gensim import corpora
from collections import defaultdict
from pprint import pprint
# Create some simple document content
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())   # a list of common words; any word in it is dropped
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]   # read a nested comprehension like this from right to left; "if word not in stoplist" drops the stop words

# remove words that appear only once
frequency = defaultdict(int)   # default dictionary with int values, so missing keys start at 0
for text in texts:    # the next two lines count word frequencies
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1]
         for text in texts]    # keep only the words that appear more than once

dictionary = corpora.Dictionary(texts)   # build a dictionary from the texts just produced
dictionary.save(r'D:\deerwester.dict')  # store the dictionary, for future reference
print(dictionary.token2id)    # print the id of each word

#To actually convert tokenized documents to vectors:
new_doc = "Human computer interaction"

# Purpose of doc2bow: the function doc2bow() simply counts the number of occurrences of each distinct word, converts the word to its integer
# word id and returns the result as a sparse vector

new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)  # the word "interaction" does not appear in the dictionary and is ignored
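
To make the doc2bow step above concrete, here is a plain-Python sketch of the same idea: map each known token to an integer id, count occurrences, and skip unknown tokens. The helpers build_token2id and doc_to_bow are hypothetical and are not Gensim functions. Ids here are assigned in order of first appearance, which may differ from the ids Gensim's Dictionary assigns.

```python
from collections import Counter


def build_token2id(texts):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc_to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; tokens missing from token2id are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


texts = [["human", "interface", "computer"],
         ["survey", "user", "computer", "system"]]
token2id = build_token2id(texts)
bow = doc_to_bow("human computer interaction".split(), token2id)
print(token2id)
print(bow)   # "interaction" is not in token2id, so it does not appear
```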

For a beginner's introduction, see the official site: http://radimrehurek.com/gensim/tut1.html

For concrete applications and detailed material, see the Gensim site: http://radimrehurek.com/gensim/tutorial.html

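Text similarity matching

After the basics above, we know how to use jieba for word segmentation, keyword extraction, custom dictionaries and so on. Next, let's see how to match text similarity with the TF-IDF algorithm and jieba.
TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. Put simply, it is a statistical method for assessing how important a word is to one document in a document collection or corpus: the importance rises with how often the word appears in that document, but falls in inverse relation to how often it appears across the corpus.
For more on TF-IDF and its applications, see Baidu Baike.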
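
A small worked example may help. The sketch below computes one common textbook variant of TF-IDF by hand: tf is the term's share of the document, and idf is log(N / df), where N is the number of documents and df the number of documents containing the term. The function tf_idf and the sample documents are made up for illustration; Gensim's TfidfModel uses its own weighting and normalisation defaults, so its numbers will differ.

```python
import math


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # df[term] = number of documents that contain the term
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        w = {}
        for term in set(doc):
            tf = doc.count(term) / len(doc)
            idf = math.log(n_docs / df[term])
            w[term] = tf * idf
        weights.append(w)
    return weights


docs = [["graph", "trees", "graph"], ["graph", "minors"], ["user", "system"]]
weights = tf_idf(docs)
print(weights[0])
```

In the first document "graph" occurs twice and "trees" once, yet "trees" gets the higher weight, because "graph" also appears in another document while "trees" does not.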

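Steps:

Read the documents
Segment the documents to be compared into words
Arrange the documents into the required format for later computation
Compute word frequencies
[Optional] Filter out low-frequency words
Build a dictionary from the corpus
Load the document to compare
Convert the document to compare into a sparse vector with doc2bow
Process the sparse vectors further to obtain a new corpus
Run the new corpus through TfidfModel to obtain the TF-IDF representation
Get the number of features from token2id
Build an index with sparse matrix similarity
Obtain the final similarity results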

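Code:

Now that we have a rough idea of the steps for text similarity matching, let's see how to implement them in code:

shezhao.txt is 盜墓筆記4 蛇沼鬼城.txt
daomubiji.txt is the complete collection, downloaded from some website (I don't know whether it really is complete)
mihaiguichao.txt is 盜墓筆記5 謎海歸巢.txt, which belongs to the same set as 盜墓筆記4 蛇沼鬼城.txt.

Find Daomu Biji online yourself; I won't provide download links here.

As before, if open hits a file with character-encoding problems, I open the file in Notepad, delete some text and save it, or delete the content at the position reported in the error and save it.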

#!/usr/bin/env python
# __author__:Leo
from gensim import corpora, models, similarities
import jieba
from collections import defaultdict

f1 = open(r'F:\Learnning\shezhao.txt', 'r').read()
f2 = open(r'F:\Learnning\daomubiji.txt', 'r').read()
data1 = jieba.cut(f1)
data2 = jieba.cut(f2)
data1_words = ''
data2_words = ''
for word in data1:
    data1_words += word + " "
for word in data2:
    data2_words += word + " "
documents = [data1_words, data2_words]
texts = [[word for word in document.split()]
         for document in documents]   # read nested comprehensions like this from right to left: the outer for loop first, then the inner one

# print('texts', texts)
frequency = defaultdict(int)   # default dictionary with int values
for text in texts:      # the next two lines count how often each word occurs, so low-frequency words can be removed below
    for token in text:
        frequency[token] += 1

dictionary = corpora.Dictionary(texts)
dictionary.save(r'F:\Learnning\dictionary.txt')

# texts = [[word for word in text if frequency[word] > 1]
#          for text in texts]   # optional: drop words that occur only once; the if test runs for each word of the inner loop
texts = [[word for word in text]
         for text in texts]
f3 = open(r'F:\Learnning\mihaiguichao.txt', 'r').read()
data3 = jieba.cut(f3)
data3_words = ''
for item in data3:
    data3_words += item + ' '
new_vec = dictionary.doc2bow(data3_words.split())    # build the vector for the document to compare
corpus = [dictionary.doc2bow(text) for text in texts]   # build the new corpus
corpora.MmCorpus.serialize(r"F:\Learnning\CSDN-python大數據\Day2\XinYU.mm", corpus)  # save the new corpus
tfidf = models.TfidfModel(corpus)   # build the TF-IDF model
featureNum = len(dictionary.token2id.keys())   # get the number of features from token2id
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=featureNum)   # sparse matrix similarity, which builds the index
sim = index[tfidf[new_vec]]    # compute the final similarity results
print(sim)

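The printed similarities all lie between 0 and 1; the closer a value is to 1, the higher the similarity. If the data set is very large, consider using multiple processes or a cluster.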
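
The 0 to 1 range follows from the measure itself: Gensim documents SparseMatrixSimilarity as computing cosine similarity, and with non-negative TF-IDF weights the cosine cannot be negative. A minimal plain-Python sketch of cosine similarity on sparse (id, weight) vectors; the function cosine_similarity and the sample vectors are hypothetical, not Gensim code:

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors given as (id, weight) lists."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(w * b[i] for i, w in a.items() if i in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0   # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc_a = [(0, 1.0), (2, 2.0)]
doc_b = [(0, 2.0), (2, 4.0)]   # same direction as doc_a, only longer
doc_c = [(1, 3.0)]             # shares no terms with doc_a
print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

The first pair scores approximately 1 because the two vectors point in the same direction, regardless of document length; the second pair scores 0 because the documents have no terms in common.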