Training Chinese word2vec on Wikipedia with gensim

Date: 2019-11-11


Introduction to Word2Vec

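Word2Vec is a way of representing words. Unlike a one-hot vector, a word2vec vector lets us express how similar two words are by computing the distance between their vectors. word2vec captures more features: words that appear in similar contexts end up close together, while unrelated words end up further apart. For example, the vectors for 【騰訊】 (Tencent) and 【網易】 (NetEase) will be very close, as will those for 【寶馬】 (BMW) and 【保時捷】 (Porsche), whereas 【騰訊】 or 【網易】 will sit further away from 【寶馬】 or 【保時捷】. This is because Tencent and NetEase both belong to the internet category, while BMW and Porsche both belong to the automotive category. Birds of a feather flock together: internet circles mostly talk about internet topics, and car circles talk about cars.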
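To make the contrast with one-hot vectors concrete, here is a minimal sketch that measures closeness with cosine similarity. The dense vectors are made-up numbers chosen for illustration, not values from a trained model, and the helper name cosine is our own.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# One-hot vectors: every word gets its own axis, so two different
# words are always orthogonal and carry no notion of similarity.
one_hot = {
    'Tencent': [1, 0, 0],
    'NetEase': [0, 1, 0],
    'BMW': [0, 0, 1],
}

# Hypothetical dense vectors (NOT from a trained model), picked so that
# the two internet companies point in a similar direction.
dense = {
    'Tencent': [0.9, 0.1],
    'NetEase': [0.8, 0.2],
    'BMW': [0.1, 0.9],
}

print(cosine(one_hot['Tencent'], one_hot['NetEase']))  # 0.0
print(cosine(dense['Tencent'], dense['NetEase']))      # close to 1
print(cosine(dense['Tencent'], dense['BMW']))          # much smaller
```

With one-hot vectors every pair of distinct words scores 0, whereas dense vectors can place related words near each other.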

So how do we obtain the word2vec vector of a word? Below we show how to get the word vectors we want with Python and gensim. Overall, there are three steps:

1. Preprocess the Chinese wiki data
2. Segment the text into words
3. Train word2vec with gensim

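Preprocessing the Chinese wiki data

First, download the Chinese wiki dump: zhwiki-latest-pages-articles.xml.bz2. The zhwiki data contains many Traditional Chinese characters and we want a Simplified Chinese corpus, so we need to do two things next:

1. Use WikiCorpus from the gensim module to extract the raw text from the bz2 file
2. Use OpenCC to convert the Traditional characters to Simplified characters

Extracting the raw text with WikiCorpus

The Python code for the data processing is as follows: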

from __future__ import print_function
from gensim.corpora import WikiCorpus
import jieba
import codecs
import os
import six
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
import multiprocessing

 
class Config:
    data_path = 'xxx/zhwiki'
    zhwiki_bz2 = 'zhwiki-latest-pages-articles.xml.bz2'
    zhwiki_raw = 'zhwiki_raw.txt'
    zhwiki_raw_t2s = 'zhwiki_raw_t2s.txt'
    zhwiki_seg_t2s = 'zhwiki_seg.txt'
    embedded_model_t2s = 'embedding_model_t2s/zhwiki_embedding_t2s.model'
    embedded_vector_t2s = 'embedding_model_t2s/vector_t2s'
 
 
def dataprocess(_config):
    i = 0
    if six.PY3:
        output = open(os.path.join(_config.data_path, _config.zhwiki_raw), 'w', encoding='utf-8')
    else:
        output = codecs.open(os.path.join(_config.data_path, _config.zhwiki_raw), 'w')
    # Note: the lemmatize argument is no longer supported in gensim 4.0 and later; omit it there.
    wiki = WikiCorpus(os.path.join(_config.data_path, _config.zhwiki_bz2), lemmatize=False, dictionary={})
    for text in wiki.get_texts():
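        if six.PY3:
            # Depending on the gensim version, tokens are bytes or str; normalise to str.
            tokens = [t.decode('utf-8', 'ignore') if isinstance(t, bytes) else t for t in text]
            output.write(' '.join(tokens) + '\n')
        else:
            output.write(' '.join(text) + '\n')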
        i += 1
        if i % 10000 == 0:
            print('Saved ' + str(i) + ' articles')
    output.close()
    print('Finished. Saved ' + str(i) + ' articles')

config = Config()
dataprocess(config)

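Converting Traditional characters to Simplified with OpenCC

OpenCC must be installed beforehand; for how to install it on Linux, see the article linked in the original post. Two Linux commands are all it takes to convert the Traditional characters to Simplified ones, and the conversion is fast.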

$ cd /xxx/zhwiki/
$ opencc -i zhwiki_raw.txt -o zhwiki_raw_t2s.txt -c t2s.json
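If you would rather drive the conversion from the same Python script, one option is to call the opencc command through subprocess. This is only a sketch: the helper names build_opencc_command and convert_t2s are our own, it assumes the opencc executable is on your PATH, and it writes to zhwiki_raw_t2s.txt because that is the file name Config.zhwiki_raw_t2s expects in the segmentation step.

```python
import subprocess


def build_opencc_command(src, dst, config='t2s.json'):
    """Return the argument list for the opencc command line shown above."""
    return ['opencc', '-i', src, '-o', dst, '-c', config]


def convert_t2s(src, dst):
    """Run opencc; raises CalledProcessError if the tool exits with an error."""
    subprocess.check_call(build_opencc_command(src, dst))


# Build (but do not run) the command, to check it before a long conversion.
cmd = build_opencc_command('zhwiki_raw.txt', 'zhwiki_raw_t2s.txt')
print(' '.join(cmd))
```

Calling convert_t2s with full paths under data_path removes the need for the cd step.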

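Segmenting the text

For word segmentation we simply use the Python jieba module. You could also use LTP from Harbin Institute of Technology, or the Stanford segmenter through its NLTK Python interface; both are accurate and well regarded. However, both take a long time to install, the Stanford one especially. The jieba segmentation code is as follows: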

def is_alpha(tok):
    try:
        return tok.encode('ascii').isalpha()
    except UnicodeEncodeError:
        return False


def zhwiki_segment(_config, remove_alpha=True):
    i = 0
    # codecs.open with an explicit encoding behaves the same on Python 2 and 3.
    output = codecs.open(os.path.join(_config.data_path, _config.zhwiki_seg_t2s), 'w', encoding='utf-8')
    print('Start...')
    with codecs.open(os.path.join(_config.data_path, _config.zhwiki_raw_t2s), 'r', encoding='utf-8') as raw_input:
        # Iterate lazily; readlines() would load the whole corpus into memory.
        for line in raw_input:
            line = line.strip()
            i += 1
            if i % 10000 == 0:
                print('line ' + str(i))
            text = line.split()
            if remove_alpha:
                # Drop tokens that consist only of ASCII letters.
                text = [w for w in text if not is_alpha(w)]
            # Segment each whitespace-separated chunk with jieba and rejoin with spaces.
            tokens = [tok for t in text for tok in jieba.cut(t)]
            tmp = ' '.join(tokens).strip()
            if tmp:
                output.write(tmp + '\n')
    output.close()

zhwiki_segment(config)
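To see what the is_alpha filter keeps and what it drops, here is a small sketch that can be run without jieba. The sample line and the helper name drop_ascii_words are made up for illustration.

```python
def is_alpha(tok):
    """True if the token consists only of ASCII letters."""
    try:
        return tok.encode('ascii').isalpha()
    except UnicodeEncodeError:
        return False


def drop_ascii_words(line):
    """Split on whitespace and drop purely alphabetic ASCII tokens."""
    return [w for w in line.split() if not is_alpha(w)]


# Hypothetical input line, not taken from the real corpus.
sample = u'google 是 一家 互联网 公司 2019 gensim'
print(drop_ascii_words(sample))
# ['是', '一家', '互联网', '公司', '2019']
```

Note that digits such as 2019 survive, because isalpha() is only true for letters; if you also want to drop numbers, the filter needs an extra check.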

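Training word2vec with gensim

The Python gensim module provides word2vec training, which makes training our model very convenient. For how to use gensim, see the article 基於Gensim的Word2Vec實踐 (Word2Vec in practice with Gensim).
In this run the word vector size (size) is 50, the training window is 5, the minimum word frequency is 5, and multiple worker threads are used. The code is as follows: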

def word2vec(_config, saved=False):
    print('Start...')
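    # Note: gensim 4.0 renamed the `size` argument to `vector_size`.
    model = Word2Vec(LineSentence(os.path.join(_config.data_path, _config.zhwiki_seg_t2s)),
                     size=50, window=5, min_count=5, workers=multiprocessing.cpu_count())
    if saved:
        # The embedding_model_t2s/ directory must already exist under data_path.
        model.save(os.path.join(_config.data_path, _config.embedded_model_t2s))
        # save_word2vec_format lives on the KeyedVectors object (model.wv).
        model.wv.save_word2vec_format(os.path.join(_config.data_path, _config.embedded_vector_t2s), binary=False)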
    print("Finished!")
    return model
 
 
def wordsimilarity(word, model):
    semi = []
    try:
        # most_similar lives on the KeyedVectors object (model.wv).
        semi = model.wv.most_similar(word, topn=10)
    except KeyError:
        print('The word is not in the vocabulary!')
    for term in semi:
        print('%s,%s' % (term[0], term[1]))

model = word2vec(config, saved=True)
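The min_count and window settings are easier to picture on a toy corpus. The sketch below is a simplified illustration with helper names of our own, not gensim's implementation: real training adds details such as subsampling and randomly shrinking the window, which are ignored here.

```python
from collections import Counter


def prune_vocab(sentences, min_count):
    """Keep only words that occur at least min_count times in the corpus."""
    counts = Counter(w for sent in sentences for w in sent)
    return {w for w, c in counts.items() if c >= min_count}


def context_pairs(sentence, window):
    """Yield (center, context) pairs for words at most `window` positions apart."""
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, sentence[j]


# Made-up toy corpus: 'a' and 'b' occur twice, everything else once.
sentences = [['a', 'b', 'c', 'd'], ['a', 'b', 'e']]
print(sorted(prune_vocab(sentences, min_count=2)))
# ['a', 'b']
print(list(context_pairs(['a', 'b', 'c'], window=1)))
# [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b')]
```

With min_count=5 on the wiki corpus, words seen fewer than five times get no vector at all, which is why querying a rare word raises KeyError.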

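word2vec training is now complete: we have the model and the word vectors we wanted, saved locally. Below we look at the 10 words closest to 【寶馬】 (BMW) and to 【騰訊】 (Tencent). The words close to 【寶馬】 mostly belong to the automotive industry and are car brands; the words close to 【騰訊】 mostly belong to the internet industry.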

>>> wordsimilarity(word=u'寶馬', model=model)
保時捷,0.92567974329
固特異,0.888278841972
勞斯萊斯,0.884045600891
奧迪,0.881808757782
馬自達,0.881799697876
亞菲特,0.880708634853
歐寶,0.877104878426
雪鐵龍,0.876984715462
瑪莎拉蒂,0.868475496769
桑塔納,0.865387916565

>>> wordsimilarity(word=u'騰訊', model=model)
網易,0.880213916302
優酷,0.873666107655
騰訊網,0.87026232481
廣州日報,0.859486758709
微信,0.835543811321
天涯社區,0.834927380085
李彥宏,0.832848489285
土豆網,0.831390202045
團購,0.829696238041
搜狐網,0.825544642448
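The vector file written with binary=False is plain text: a header line with the vocabulary size and dimension, then one line per word. As a rough sketch of what the ranking above does, the code below reads such a file and sorts the words by cosine similarity without gensim. The function names and the three toy vectors are made up for illustration; on the real file you would pass the path from Config.embedded_vector_t2s instead.

```python
import io
import math
import os
import tempfile


def load_text_vectors(path):
    """Read a word2vec text file: 'count dim' header, then 'word v1 ... vdim'."""
    vectors = {}
    with io.open(path, 'r', encoding='utf-8') as f:
        _, dim = (int(x) for x in f.readline().split())
        for line in f:
            parts = line.rstrip().split(' ')
            if len(parts) != dim + 1:
                continue  # skip malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def most_similar(word, vectors, topn=10):
    """Rank all other words by cosine similarity; raises KeyError for unknown words."""
    query = vectors[word]
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Tiny made-up vector file standing in for the real one.
path = os.path.join(tempfile.mkdtemp(), 'toy_vectors.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'3 2\ncar 1.0 0.0\ntruck 0.9 0.1\nweb 0.0 1.0\n')

toy = load_text_vectors(path)
print([w for w, _ in most_similar('car', toy, topn=2)])
# ['truck', 'web']
```

Scanning every word like this is fine for a toy file but slow on a full vocabulary, so it is meant to explain the output rather than replace the gensim call.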
