Implementing Text Similarity Algorithms (Part 2): Grouped and Clause-Level Popularity Statistics

Date: 2020-03-04

Tags: text, similarity, algorithm, implementation, grouping, clause splitting, popularity statistics

1. Scenario

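In the previous post, 軟件老王 introduced four requirements for similarity-based popularity statistics (see "文本類似性熱度統計(python版)"). This post covers grouped statistics and grouped clause-level statistics (requirements 1 and 2).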

2. Solution

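Grouped popularity statistics first group the rows by a chosen column and then compute the popularity statistics on the texts within each group. Most of the work is in the grouping. Clause-level splitting merely breaks each text on punctuation marks, so switching to it only requires swapping in the snippet shown in the code notes.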
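As a rough illustration of the grouping step, the sketch below collects (category, reason) rows by category in plain Python. The full script uses pandas `groupby` for this; the `rows` data and the `group_rows` helper are hypothetical stand-ins so the example runs without any dependencies.

```python
from collections import defaultdict

# Hypothetical sample rows: (category, reason), mirroring the two-column source sheet.
rows = [
    ("軟件老王1", "主機不能加電"),
    ("軟件老王2", "機器噪音大"),
    ("軟件老王1", "噪音太大"),
    ("軟件老王2", "聲音太大"),
]


def group_rows(rows):
    """Collect the reason texts of each category, preserving input order."""
    groups = defaultdict(list)
    for category, reason in rows:
        groups[category].append(reason)
    return dict(groups)


groups = group_rows(rows)
for category, reasons in groups.items():
    # Each group would then go through the similarity statistics on its own.
    print(category, reasons)
```

Because every group is processed independently, identical texts in different categories are counted separately.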

2.1 Complete code

The complete code follows. Readers who need it, or who would rather skip the walkthrough, can copy it and run it as is.

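import re
import jieba
import jieba.posseg as pseg
import xlwt  # library for writing .xls workbooks
import pandas as pd
from gensim import corpora, models, similarities
# Load the stop-word list, one word per line
def StopWordsList(filepath):
    with open(filepath, 'r', encoding='utf8') as f:
        wlst = [w.strip() for w in f]
    return wlst
# Hex-encode each character's code point; used to build distinct sheet names
def str_to_hex(s):
    return ''.join([hex(ord(c)).replace('0x', '') for c in s])
# Segment with jieba, dropping stop words and unwanted parts of speech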
def seg_sentence(sentence, stop_words):
    stop_flag = ['x', 'c', 'u', 'd', 'p', 't', 'uj', 'f', 'r']
    sentence_seged = pseg.cut(sentence)
    outstr = []
    for word, flag in sentence_seged:
        if word not in stop_words and flag not in stop_flag:
            outstr.append(word)
    return outstr
if __name__ == '__main__':
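    # 1 Custom jieba dictionaries. 軟件老王 added industry terms here; each file is plain text with one term per line.
    # These dictionaries are not uploaded; comment out the three calls below if you do not have them.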
    jieba.load_userdict("g1.txt")
    jieba.load_userdict("g2.txt")
    jieba.load_userdict("g3.txt")

    # 2 Stop words: words that are filtered out of the segmented text. This is a general-purpose list found online.
    spPath = 'stop.txt'
    stop_words = StopWordsList(spPath)

    # 3 Excel output setup
    wbk = xlwt.Workbook(encoding='ascii')
    sheet = wbk.add_sheet("軟件老王sheet")  # name of the summary sheet
    sheet.write(0, 0, '軟件老王1-類別')
    sheet.write(0, 1, '軟件老王2-緣由')
    sheet.write(0, 2, '軟件老王3-統計數量')
    sheet.write(0, 3, '導航-連接到明細sheet表')

    inputfile = '軟件老王-source2.xlsx'
    data = pd.read_excel(inputfile)  # read the source data
    grp1 = data.groupby('類別')
    rcount = 1
    for name, group in grp1:
        print(name)  # show which group is being processed
        texts = []
        orig_txt = []
        key_list = []
        name_list = []
        sheet_list = []
        name = name.replace('\n', '').replace('/', '')
        for i in range(len(group)):
            row = group.iloc[i].values
            cell = row[1]
            if cell is None:
                continue
            if not isinstance(cell, str):
                continue
            item = cell.strip('\n\r').split('\t')
            string = item[0]
            if string is None or len(string) == 0:
                continue
            else:
                textstr = seg_sentence(string, stop_words)
                texts.append(textstr)
                orig_txt.append(string)
        # 4 Similarity computation
        dictionary = corpora.Dictionary(texts)
        feature_cnt = len(dictionary.token2id.keys())
        corpus = [dictionary.doc2bow(text) for text in texts]
        # Note: despite its name, `tfidf` holds an LSI model, not a TF-IDF model.
        tfidf = models.LsiModel(corpus)
        index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=feature_cnt)
        result_lt = []
        word_dict = {}
        count =0
        for keyword in orig_txt:
            count = count+1
            print('Processing row ' + str(count))
            if keyword in result_lt or keyword is None or len(keyword) == 0:
                continue
            kw_vector = dictionary.doc2bow(seg_sentence(keyword, stop_words))
            sim = index[tfidf[kw_vector]]
            result_list = []
            for i in range(len(sim)):
                if sim[i] > 0.5:
                    if orig_txt[i] in result_lt and orig_txt[i] not in result_list:
                        continue
                    result_list.append(orig_txt[i])
                    result_lt.append(orig_txt[i])
            if len(result_list) >0:
                word_dict[keyword] = len(result_list)
            if len(result_list) >= 1:
                name = name.strip('\n\r').replace('\n', '').replace('/', '').replace('，', '').replace('。', '').replace(
                    '*', '')
                name = re.sub(u"([^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a])", "", name)
                sname = name[0:10] + '_' + re.sub(u"([^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a])", "", keyword[0:10])+ '_'\
                        + str(len(result_list)+ len(str_to_hex(keyword))) + str_to_hex(keyword)[-5:]
                sheet_t = wbk.add_sheet(sname)  # detail sheet listing the texts counted under this keyword
                for i in range(len(result_list)):
                    sheet_t.write(i, 0, label=result_list[i])
        # 5 Sort by popularity (軟件老王)
        with open("rjlw.txt", 'w', encoding='utf-8') as wf2:  # opened but never written to; rjlw.txt ends up empty
            orderList = list(word_dict.values())
            orderList.sort(reverse=True)
            count = len(orderList)
            for i in range(count):
                for key in word_dict:
                    if word_dict[key] == orderList[i]:
                        key_list.append(key)
                        name_list.append(name)
                        word_dict[key] = 0
            wf2.truncate()
        # 6 Write the summary rows to the target Excel workbook
        for i in range(len(key_list)):
            sheet.write(i+rcount, 0, label=name_list[i])
            sheet.write(i+rcount, 1, label=key_list[i])
            sheet.write(i+rcount, 2, label=orderList[i])
            if orderList[i] >= 1:
                shname = name_list[i][0:10] + '_' + re.sub(u"([^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a])", "", key_list[i][0:10]) \
                         + '_'+ str(orderList[i]+ len(str_to_hex(key_list[i])))+ str_to_hex(key_list[i])[-5:]
                link = 'HYPERLINK("#%s!A1";"%s")' % (shname, shname)
                sheet.write(i+rcount, 3, xlwt.Formula(link))
        rcount = rcount + len(key_list)
        key_list = []
        name_list = []
        orderList = []
        texts = []
        orig_txt = []
        sheet_list =[]
    wbk.save('軟件老王-target2.xls')

2.2 Code notes

The code above is commented fairly thoroughly, so it is not walked through line by line here; only a few points are highlighted.

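(1) Grouped processing resembles part one of this series (whole-sentence popularity statistics). The difference is that the rows are first grouped by a column and the similarity statistics are then computed within each group. The similarity part itself is unchanged; what actually differs is mainly the Excel handling.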
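The per-group counting loop can be sketched in isolation as follows. This is a simplified stand-in, not the script's actual pipeline: it replaces jieba segmentation and the gensim LSI index with a bag-of-characters cosine similarity, and `char_vector`, `cosine`, and `count_similar` are hypothetical helper names. Only the greedy "claim everything above the 0.5 threshold" logic mirrors the script, so the counts it produces will not match the results table.

```python
import math
from collections import Counter


def char_vector(text):
    """Bag-of-characters vector; a crude stand-in for jieba tokens (assumption)."""
    return Counter(text)


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def count_similar(texts, threshold=0.5):
    """Greedily count, for each unclaimed text, the unclaimed texts similar to it.

    Returns {representative text: number of texts counted under it}.
    """
    claimed = set()
    counts = {}
    for i, keyword in enumerate(texts):
        if i in claimed:
            continue  # already counted under an earlier representative
        members = [
            j for j, other in enumerate(texts)
            if j not in claimed
            and cosine(char_vector(keyword), char_vector(other)) > threshold
        ]
        claimed.update(members)
        if members:
            counts[keyword] = len(members)
    return counts


texts = ["噪音太大", "聲音太大", "主機不能加電", "有時不能加電"]
print(count_similar(texts))
```

Once a text has been claimed by an earlier representative it is never counted again, which is why the order of the input rows affects which text becomes the representative of a cluster.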

(2) The Excel rows are grouped with the pandas package; see "python中excel數據分組處理" (grouping Excel data in Python).

(3) For requirement 2, grouped clause-level statistics, the code is as follows:

        for i in range(len(group)):
            row = group.iloc[i].values
            cell = row[1]
            if cell is None:
                continue
            if not isinstance(cell, str):
                continue
            item = cell.strip('\n\r').split('\t')
            string = item[0]
            # 軟件老王: split the reason text on punctuation marks, then process each piece separately.
            lt = re.split('，|。|！|？', string)
            for t in lt:
                # Skip empty or whitespace-only pieces
                if t is None or t.strip() == '':
                    continue
                else:
                    textstr = seg_sentence(t, stop_words)
                    texts.append(textstr)
                    orig_txt.append(t)
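To see what the split produces on its own, here is a minimal standalone sketch using the same delimiter pattern as the snippet above. The `split_clauses` helper and the sample sentence are illustrative assumptions, not part of the original script.

```python
import re

# Same delimiter pattern as the snippet above: full-width comma, period,
# exclamation mark, and question mark.
CLAUSE_DELIMITERS = '，|。|！|？'


def split_clauses(text):
    """Split text on the delimiters and drop empty or whitespace-only pieces."""
    return [part for part in re.split(CLAUSE_DELIMITERS, text) if part.strip()]


print(split_clauses('主機不能加電，噪音太大。'))
```

The filter matters because `re.split` returns a trailing empty string whenever the text ends with a delimiter.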

2.3 Results

(1) Input: 軟件老王-source2.xlsx (columns: 類別 = category, 緣由 = reason)

類別	緣由
軟件老王1	主機不能加電
軟件老王1	有時不能加電
軟件老王1	開機加電
軟件老王2	自檢報錯或死機
軟件老王2	機器噪音大
軟件老王3	噪音問題
軟件老王1	噪音太大
軟件老王1	噪音噪聲
軟件老王1	聲音太大
軟件老王2	聲音太大
軟件老王3	聲音太大

(2) Output: 軟件老王-target2.xls (columns: category, reason, count, link to the detail sheet)

軟件老王1-類別	軟件老王2-緣由	軟件老王3-統計數量	導航-連接到明細sheet表
軟件老王1	主機不能加電	3	軟件老王1_主機不能加電_2707535
軟件老王1	噪音太大	2	軟件老王1_噪音太大_18a5927
軟件老王1	聲音太大	1	軟件老王1_聲音太大_17a5927
軟件老王2	自檢報錯或死機	1	軟件老王2_自檢報錯或死機_29b673a
軟件老王2	機器噪音大	1	軟件老王2_機器噪音大_2135927
軟件老王2	聲音太大	1	軟件老王2_聲音太大_17a5927
軟件老王3	噪音問題	1	軟件老王3_噪音問題_17e9898
軟件老王3	聲音太大	1	軟件老王3_聲音太大_17a5927
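The detail-sheet names in the last column follow the naming expression used in the complete code: the first 10 characters of the group name, the first 10 characters of the keyword with unsupported characters stripped, then the count plus the length of the keyword's hex encoding, followed by the last 5 hex characters. The sketch below isolates that expression; `detail_sheet_name` and `NON_NAME_CHARS` are hypothetical names introduced here, and the group name is assumed to be already sanitized as in the script.

```python
import re

# Keeps CJK characters, digits, and ASCII letters; everything else is removed.
NON_NAME_CHARS = re.compile(u"([^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a])")


def str_to_hex(s):
    """Concatenate the hex code point of every character (same helper as in the script)."""
    return ''.join([hex(ord(c)).replace('0x', '') for c in s])


def detail_sheet_name(group_name, keyword, count):
    """Rebuild the sheet name the script derives for a keyword and its count."""
    hexed = str_to_hex(keyword)
    return (group_name[0:10] + '_' + NON_NAME_CHARS.sub('', keyword[0:10])
            + '_' + str(count + len(hexed)) + hexed[-5:])


print(detail_sheet_name('軟件老王1', '噪音太大', 2))
```

The hex suffix makes it unlikely, though not impossible, that two keywords sharing the same first 10 characters end up with the same sheet name, and the truncation keeps the name within Excel's 31-character limit for short counts.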