I recently tried out Word2Vec and GloVe, along with their Python counterparts, gensim word2vec and python-glove, and wanted to test them on a larger corpus; the Wikipedia dumps naturally came to mind. Wikipedia provides an excellent official data source at https://dumps.wikimedia.org, where dumps in many languages and formats can be downloaded conveniently. I had previously used gensim to play with the English Wikipedia corpus, training LSI and LDA models to compute document similarity, so I wondered whether gensim also offered a convenient way to process the Wikipedia data and train a word2vec model for computing semantic similarity between words. Thanks to Google, I found a long thread in the gensim Google Group, "training word2vec on full Wikipedia", which covers pretty much everything about training word2vec on the Wikipedia corpus with gensim; the gensim author, Radim Řehůřek, even added a small fix to a newer gensim release as a result of that discussion, so all that was left for me to do was verify it. There is also a wiki2vec project on GitHub that does the same thing, but I prefer solving the problem with Python and gensim.
There is plenty of reference material on word2vec in both English and Chinese. In English, you can read the officially recommended papers as well as the articles written by the gensim author Radim Řehůřek. In Chinese, I recommend @licstar's "Deep Learning in NLP (一) 詞向量和語言模型", the Youdao tech salon's "Deep Learning實戰之word2vec", @飛林沙's "word2vec的學習思路", and falao_beiliu's "深度學習word2vec筆記之基礎篇" and "深度學習word2vec筆記之算法篇".
1. Word2Vec on the English Wikipedia
I first tested the English Wikipedia data, downloading the latest bz2-compressed XML dump (downloaded on March 1, 2015), about 11 GB, from:
https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
Processing takes two stages. First, the XML wiki dump is converted to plain text with the following script (process_wiki.py):
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import logging
import os.path
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print globals()['__doc__'] % locals()
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    output = open(outp, 'w')
    # lemmatize=False: do not run the pattern lemmatizer, which slows processing badly
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        # each article becomes one line of space-separated tokens
        output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")

    output.close()
    logger.info("Finished Saved " + str(i) + " articles")
This uses gensim's Wikipedia processing class WikiCorpus; get_texts turns each Wikipedia article into one line of text and strips punctuation and other markup. Note that in "wiki = WikiCorpus(inp, lemmatize=False, dictionary={})", lemmatize is set to False so that the pattern module is not used to lemmatize the English words; regardless of whether pattern is installed on your machine, using it slows this step down severely.
Run "python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text":
2015-03-07 15:08:39,181: INFO: running process_enwiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text
2015-03-07 15:11:12,860: INFO: Saved 10000 articles
2015-03-07 15:13:25,369: INFO: Saved 20000 articles
2015-03-07 15:15:19,771: INFO: Saved 30000 articles
2015-03-07 15:16:58,424: INFO: Saved 40000 articles
2015-03-07 15:18:12,374: INFO: Saved 50000 articles
2015-03-07 15:19:03,213: INFO: Saved 60000 articles
2015-03-07 15:19:47,656: INFO: Saved 70000 articles
2015-03-07 15:20:29,135: INFO: Saved 80000 articles
2015-03-07 15:22:02,365: INFO: Saved 90000 articles
2015-03-07 15:23:40,141: INFO: Saved 100000 articles
.....
2015-03-07 19:33:16,549: INFO: Saved 3700000 articles
2015-03-07 19:33:49,493: INFO: Saved 3710000 articles
2015-03-07 19:34:23,442: INFO: Saved 3720000 articles
2015-03-07 19:34:57,984: INFO: Saved 3730000 articles
2015-03-07 19:35:31,976: INFO: Saved 3740000 articles
2015-03-07 19:36:05,790: INFO: Saved 3750000 articles
2015-03-07 19:36:32,392: INFO: finished iterating over Wikipedia corpus of 3758076 documents with 2018886604 positions (total 15271374 articles, 2075130438 positions before pruning articles shorter than 50 words)
2015-03-07 19:36:32,394: INFO: Finished Saved 3758076 articles
On my MacBook Pro (4 cores, 16 GB RAM) this took about four and a half hours. After roughly 3.75 million articles were processed, we get a 12 GB plain-text English Wikipedia corpus, wiki.en.text, which looks like this:
anarchism is collection of movements and ideologies that hold the state to be undesirable unnecessary or harmful these movements advocate some form of stateless society instead often based on self governed voluntary institutions or non hierarchical free associations although anti statism is central to anarchism as political philosophy anarchism also entails rejection of and often hierarchical organisation in general as an anti dogmatic philosophy anarchism draws on many currents of thought and strategy anarchism does not offer fixed body of doctrine from single particular world view instead fluxing and flowing as philosophy there are many types and traditions of anarchism not all of which are mutually exclusive anarchist schools of thought can differ fundamentally supporting anything from extreme individualism to complete collectivism strains of anarchism have often been divided into the categories of social and individualist anarchism or similar dual classifications anarchism is usually considered radical left wing ideology and much of anarchist economics and anarchist legal philosophy reflect anti authoritarian interpretations of communism collectivism syndicalism mutualism or participatory economics etymology and terminology the term anarchism is compound word composed from the word anarchy and the suffix ism themselves derived respectively from the greek anarchy from anarchos meaning one without rulers from the privative prefix ἀν an without and archos leader ruler cf archon or arkhē authority sovereignty realm magistracy and the suffix or ismos isma from the verbal infinitive suffix…
With this data, you can train a word2vec model with either the original C word2vec binary or gensim's Python word2vec implementation. We tried the former and found it quite slow, so we went with the gensim word2vec training script from the Google Group thread, with one small change: the output in text vector format is also kept, which makes debugging easier. The script train_word2vec_model.py is as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import logging
import os.path
import sys
import multiprocessing

from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 4:
        print globals()['__doc__'] % locals()
        sys.exit(1)
    inp, outp1, outp2 = sys.argv[1:4]

    # 400-dimensional vectors, window 5, drop words with fewer than 5 occurrences,
    # one worker per CPU core
    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())

    # trim unneeded model memory = use (much) less RAM
    # model.init_sims(replace=True)
    model.save(outp1)
    model.save_word2vec_format(outp2, binary=False)
Run "python train_word2vec_model.py wiki.en.text wiki.en.text.model wiki.en.text.vector":
2015-03-09 22:48:29,588: INFO: running train_word2vec_model.py wiki.en.text wiki.en.text.model wiki.en.text.vector 2015-03-09 22:48:29,593: INFO: collecting all words and their counts 2015-03-09 22:48:29,607: INFO: PROGRESS: at sentence #0, processed 0 words and 0 word types 2015-03-09 22:48:50,686: INFO: PROGRESS: at sentence #10000, processed 29353579 words and 430650 word types 2015-03-09 22:49:08,476: INFO: PROGRESS: at sentence #20000, processed 54695775 words and 610833 word types 2015-03-09 22:49:22,985: INFO: PROGRESS: at sentence #30000, processed 75344844 words and 742274 word types 2015-03-09 22:49:35,607: INFO: PROGRESS: at sentence #40000, processed 93430415 words and 859131 word types 2015-03-09 22:49:44,125: INFO: PROGRESS: at sentence #50000, processed 106057188 words and 935606 word types 2015-03-09 22:49:49,185: INFO: PROGRESS: at sentence #60000, processed 114319016 words and 952771 word types 2015-03-09 22:49:53,316: INFO: PROGRESS: at sentence #70000, processed 121263134 words and 969526 word types 2015-03-09 22:49:57,268: INFO: PROGRESS: at sentence #80000, processed 127773799 words and 984130 word types 2015-03-09 22:50:07,593: INFO: PROGRESS: at sentence #90000, processed 142688762 words and 1062932 word types 2015-03-09 22:50:19,162: INFO: PROGRESS: at sentence #100000, processed 159550824 words and 1157644 word types ...... 2015-03-09 23:11:52,977: INFO: PROGRESS: at sentence #3700000, processed 1999452503 words and 7990138 word types 2015-03-09 23:11:55,367: INFO: PROGRESS: at sentence #3710000, processed 2002777270 words and 8002903 word types 2015-03-09 23:11:57,842: INFO: PROGRESS: at sentence #3720000, processed 2006213923 words and 8019620 word types 2015-03-09 23:12:00,439: INFO: PROGRESS: at sentence #3730000, processed 2009762733 words and 8035408 word types 2015-03-09 23:12:02,793: INFO: PROGRESS: at sentence #3740000, processed 2013066196 words and 8045218 word types 2015-03-09 23:12:05,178: INFO: PROGRESS: at sentence #3750000, processed 2016363087 words and 8057784 word types 2015-03-09 23:12:07,013: INFO: collected 8069236 word types from a corpus of 2018886604 words and 3758076 sentences 2015-03-09 23:12:12,230: INFO: total 1969354 word types after removing those with count<5 2015-03-09 23:12:12,230: INFO: constructing a huffman tree from 1969354 words 2015-03-09 23:14:07,415: INFO: built huffman tree with maximum node depth 29 2015-03-09 23:14:09,790: INFO: resetting layer weights 2015-03-09 23:15:04,506: INFO: training model with 4 workers on 1969354 vocabulary and 400 features, using 'skipgram'=1 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0 2015-03-09 23:15:19,112: INFO: PROGRESS: at 0.01% words, alpha 0.02500, 19098 words/s 2015-03-09 23:15:20,224: INFO: PROGRESS: at 0.03% words, alpha 0.02500, 37671 words/s 2015-03-09 23:15:22,305: INFO: PROGRESS: at 0.07% words, alpha 0.02500, 75393 words/s 2015-03-09 23:15:27,712: INFO: PROGRESS: at 0.08% words, alpha 0.02499, 65618 words/s 2015-03-09 23:15:29,452: INFO: PROGRESS: at 0.09% words, alpha 0.02500, 70966 words/s 2015-03-09 23:15:34,032: INFO: PROGRESS: at 0.11% words, alpha 0.02498, 77369 words/s 2015-03-09 23:15:37,249: INFO: PROGRESS: at 0.12% words, alpha 0.02498, 74935 words/s 2015-03-09 23:15:40,618: INFO: PROGRESS: at 0.14% words, alpha 0.02498, 75399 words/s 2015-03-09 23:15:42,301: INFO: PROGRESS: at 0.16% words, alpha 0.02497, 86029 words/s 2015-03-09 23:15:46,283: INFO: PROGRESS: at 0.17% words, alpha 0.02497, 83033 words/s 2015-03-09 23:15:48,374: INFO: PROGRESS: at 
0.18% words, alpha 0.02497, 83370 words/s 2015-03-09 23:15:51,398: INFO: PROGRESS: at 0.19% words, alpha 0.02496, 82794 words/s 2015-03-09 23:15:55,069: INFO: PROGRESS: at 0.21% words, alpha 0.02496, 83753 words/s 2015-03-09 23:15:57,718: INFO: PROGRESS: at 0.23% words, alpha 0.02496, 85031 words/s 2015-03-09 23:16:00,106: INFO: PROGRESS: at 0.24% words, alpha 0.02495, 86567 words/s 2015-03-09 23:16:05,523: INFO: PROGRESS: at 0.26% words, alpha 0.02495, 84850 words/s 2015-03-09 23:16:06,596: INFO: PROGRESS: at 0.27% words, alpha 0.02495, 87926 words/s 2015-03-09 23:16:09,500: INFO: PROGRESS: at 0.29% words, alpha 0.02494, 88618 words/s 2015-03-09 23:16:10,714: INFO: PROGRESS: at 0.30% words, alpha 0.02494, 91023 words/s 2015-03-09 23:16:18,467: INFO: PROGRESS: at 0.32% words, alpha 0.02494, 85960 words/s 2015-03-09 23:16:19,547: INFO: PROGRESS: at 0.33% words, alpha 0.02493, 89140 words/s 2015-03-09 23:16:23,500: INFO: PROGRESS: at 0.36% words, alpha 0.02493, 92026 words/s 2015-03-09 23:16:29,738: INFO: PROGRESS: at 0.37% words, alpha 0.02491, 88180 words/s 2015-03-09 23:16:32,000: INFO: PROGRESS: at 0.40% words, alpha 0.02492, 92734 words/s 2015-03-09 23:16:34,392: INFO: PROGRESS: at 0.42% words, alpha 0.02491, 93300 words/s 2015-03-09 23:16:41,018: INFO: PROGRESS: at 0.43% words, alpha 0.02490, 89727 words/s ....... 2015-03-10 05:03:31,849: INFO: PROGRESS: at 99.20% words, alpha 0.00020, 95350 words/s 2015-03-10 05:03:32,901: INFO: PROGRESS: at 99.21% words, alpha 0.00020, 95350 words/s 2015-03-10 05:03:34,296: INFO: PROGRESS: at 99.21% words, alpha 0.00020, 95350 words/s 2015-03-10 05:03:35,635: INFO: PROGRESS: at 99.22% words, alpha 0.00020, 95349 words/s 2015-03-10 05:03:36,730: INFO: PROGRESS: at 99.22% words, alpha 0.00020, 95350 words/s 2015-03-10 05:03:37,489: INFO: reached the end of input; waiting to finish 8 outstanding jobs 2015-03-10 05:03:37,908: INFO: PROGRESS: at 99.23% words, alpha 0.00019, 95350 words/s 2015-03-10 05:03:39,028: INFO: PROGRESS: at 99.23% words, alpha 0.00019, 95350 words/s 2015-03-10 05:03:40,127: INFO: PROGRESS: at 99.24% words, alpha 0.00019, 95350 words/s 2015-03-10 05:03:40,910: INFO: training on 1994415728 words took 20916.4s, 95352 words/s 2015-03-10 05:03:41,058: INFO: saving Word2Vec object under wiki.en.text.model, separately None 2015-03-10 05:03:41,209: INFO: not storing attribute syn0norm 2015-03-10 05:03:41,209: INFO: storing numpy array 'syn0' to wiki.en.text.model.syn0.npy 2015-03-10 05:04:35,199: INFO: storing numpy array 'syn1' to wiki.en.text.model.syn1.npy 2015-03-10 05:11:25,400: INFO: storing 1969354x400 projection weights into wiki.en.text.vector |
After about seven hours we get a word2vec model in gensim's default format plus a model in the original C word2vec text vector format, wiki.en.text.vector, which looks like this:
1969354 400
the 0.129255 0.015725 0.049174 -0.016438 -0.018912 0.032752 0.079885 0.033669 -0.077722 -0.025709 0.012775 0.044153 0.134307 0.070499 -0.002243 0.105198 -0.016832 -0.028631 -0.124312 -0.123064 -0.116838 0.051181 -0.096058 -0.049734 0.017380 -0.101221 0.058945 0.013669 -0.012755 0.061053 0.061813 0.083655 -0.069382 -0.069868 0.066529 -0.037156 -0.072935 -0.009470 0.037412 -0.004406 0.047011 0.005033 -0.066270 -0.031815 0.023160 -0.080117 0.172918 0.065486 -0.072161 0.062875 0.019939 -0.048380 0.198152 -0.098525 0.023434 0.079439 0.045150 -0.079479 -0.051441 -0.021556 -0.024981 -0.045291 0.040284 -0.082500 0.014618 -0.071998 0.031887 0.043916 0.115783 -0.174898 0.086603 -0.023124 0.007293 -0.066576 -0.164817 -0.081223 0.058412 0.000132 0.064160 0.055848 0.029776 -0.103420 -0.007541 -0.031742 0.082533 -0.061760 -0.038961 0.001754 -0.023977 0.069616 0.095920 0.017136 0.067126 -0.111310 0.053632 0.017633 -0.003875 -0.005236 0.063151 0.039729 -0.039158 0.001415 0.021754 -0.012540 0.015070 -0.062636 -0.013605 -0.031770 0.005296 -0.078119 -0.069303 -0.080634 -0.058377 0.024398 -0.028173 0.026353 0.088662 0.018755 -0.113538 0.055538 -0.086012 -0.027708 -0.028788 0.017759 0.029293 0.047674 -0.106734 -0.134380 0.048605 -0.089583 0.029426 0.030552 0.141916 -0.022653 0.017204 -0.036059 0.061045 -0.000077 -0.076579 0.066747 0.060884 -0.072817…
…
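The first line of wiki.en.text.vector holds the vocabulary size and the vector dimensionality (1969354 and 400 here); each following line is a word and its 400 floats. If you ever need to read this text format without gensim, a minimal sketch along the following lines should do (the file name is the one produced above; the helper load_text_vectors is my own illustration, not part of any library):

# Minimal sketch: load the C word2vec text format written above into a dict of
# numpy arrays. load_text_vectors is an illustrative helper.
import numpy as np

def load_text_vectors(path):
    vectors = {}
    with open(path) as f:
        vocab_size, dim = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = np.array(parts[1:1 + dim], dtype=np.float32)
    return vectors

# vecs = load_text_vectors("wiki.en.text.vector")
# print(len(vecs), vecs["the"][:5])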
In ipython we load and test the model with gensim; since the model is about 7 GB, loading takes a while:
In [2]: import gensim

In [3]: model = gensim.models.Word2Vec.load_word2vec_format("wiki.en.text.vector", binary=False)

In [4]: model.most_similar("queen")
Out[4]:
[(u'princess', 0.5760838389396667),
 (u'hyoui', 0.5671186447143555),
 (u'janggyung', 0.5598698854446411),
 (u'king', 0.5556215047836304),
 (u'dollallolla', 0.5540223121643066),
 (u'loranella', 0.5522741079330444),
 (u'ramphaiphanni', 0.5310937166213989),
 (u'jeheon', 0.5298476219177246),
 (u'soheon', 0.5243583917617798),
 (u'coronation', 0.5217245221138)]

In [5]: model.most_similar("man")
Out[5]:
[(u'woman', 0.7120707035064697),
 (u'girl', 0.58659827709198),
 (u'handsome', 0.5637181997299194),
 (u'boy', 0.5425317287445068),
 (u'villager', 0.5084836483001709),
 (u'mustachioed', 0.49287813901901245),
 (u'mcgucket', 0.48355430364608765),
 (u'spider', 0.4804879426956177),
 (u'policeman', 0.4780033826828003),
 (u'stranger', 0.4750771224498749)]

In [6]: model.most_similar("woman")
Out[6]:
[(u'man', 0.7120705842971802),
 (u'girl', 0.6736541986465454),
 (u'prostitute', 0.5765659809112549),
 (u'divorcee', 0.5429972410202026),
 (u'person', 0.5276163816452026),
 (u'schoolgirl', 0.5102938413619995),
 (u'housewife', 0.48748138546943665),
 (u'lover', 0.4858251214027405),
 (u'handsome', 0.4773051142692566),
 (u'boy', 0.47445783019065857)]

In [8]: model.similarity("woman", "man")
Out[8]: 0.71207063453821218

In [10]: model.doesnt_match("breakfast cereal dinner lunch".split())
Out[10]: 'cereal'

In [11]: model.similarity("woman", "girl")
Out[11]: 0.67365416785207421

In [13]: model.most_similar("frog")
Out[13]:
[(u'toad', 0.6868536472320557),
 (u'barycragus', 0.6607867479324341),
 (u'grylio', 0.626731276512146),
 (u'heckscheri', 0.6208407878875732),
 (u'clamitans', 0.6150864362716675),
 (u'coplandi', 0.612680196762085),
 (u'pseudacris', 0.6108512878417969),
 (u'litoria', 0.6084023714065552),
 (u'raniformis', 0.6044802665710449),
 (u'watjulumensis', 0.6043726205825806)]
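As an extra usage note (not part of the session above), most_similar also accepts positive and negative word lists, so the classic vector-arithmetic query can be tried on this model; on a corpus of this size 'queen' usually ranks near the top, though the exact scores will vary from run to run:

# Hypothetical follow-up query, not from the original session:
model.most_similar(positive=["woman", "king"], negative=["man"], topn=5)
# expect words like 'queen' among the top results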
Everything works fine with the text-format vectors, but loading gensim's default numpy-based model runs into a problem:
In [1]: import gensim

In [2]: model = gensim.models.Word2Vec.load("wiki.en.text.model")

In [3]: model.most_similar("man")
...
RuntimeWarning: invalid value encountered in divide
  self.syn0norm = (self.syn0 / sqrt((self.syn0 ** 2).sum(-1))[..., newaxis]).astype(REAL)
Out[3]:
[(u'ahsns', nan),
 (u'ny\xedl', nan),
 (u'indradeo', nan),
 (u'jaimovich', nan),
 (u'addlepate', nan),
 (u'jagello', nan),
 (u'festenburg', nan),
 (u'picatic', nan),
 (u'tolosanum', nan),
 (u'mithoo', nan)]
This is why I modified the training script above. With smaller data, say the first 100,000 lines of text, there is no problem at all in either format, but after the full English Wikipedia run this issue appears. I have tried a few things to fix it without success; suggestions or solutions are very welcome.
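Judging from the RuntimeWarning, some rows of syn0 seem to have zero norm or contain non-finite values, so the normalization step divides by zero. A small diagnostic sketch along those lines (my own guess at how to narrow the problem down, not a confirmed fix) would be:

# Hedged diagnostic sketch: look for rows of the raw embedding matrix that would
# break the normalization step (zero norm, or NaN/inf entries). Inspection only.
import numpy as np
import gensim

model = gensim.models.Word2Vec.load("wiki.en.text.model")
norms = np.sqrt((model.syn0 ** 2).sum(axis=1))

zero_norm = np.where(norms == 0)[0]                             # rows that divide by zero
non_finite = np.where(~np.isfinite(model.syn0).all(axis=1))[0]  # rows containing NaN/inf

print "zero-norm rows:", len(zero_norm), "  non-finite rows:", len(non_finite)
for idx in zero_norm[:10]:
    print model.index2word[idx]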
2. Word2Vec on the Chinese Wikipedia
After testing the English Wikipedia, I naturally wanted to try the Chinese Wikipedia data. The process is similar and again has two steps, but the Chinese data needs some extra handling: traditional-to-simplified conversion, Chinese word segmentation, and removal of non-UTF-8 characters. The Chinese dump can be downloaded from: https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2.
The Chinese Wikipedia data is fairly small: the whole compressed XML file is only about 1 GB, much smaller than the English dump. First process the XML archive with process_wiki.py, running: python process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
2015-03-11 17:39:22,739: INFO: running process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
2015-03-11 17:40:08,329: INFO: Saved 10000 articles
2015-03-11 17:40:45,501: INFO: Saved 20000 articles
2015-03-11 17:41:23,659: INFO: Saved 30000 articles
2015-03-11 17:42:01,748: INFO: Saved 40000 articles
2015-03-11 17:42:33,779: INFO: Saved 50000 articles
......
2015-03-11 17:55:23,094: INFO: Saved 200000 articles
2015-03-11 17:56:14,692: INFO: Saved 210000 articles
2015-03-11 17:57:04,614: INFO: Saved 220000 articles
2015-03-11 17:57:57,979: INFO: Saved 230000 articles
2015-03-11 17:58:16,621: INFO: finished iterating over Wikipedia corpus of 232894 documents with 51603419 positions (total 2581444 articles, 62177405 positions before pruning articles shorter than 50 words)
2015-03-11 17:58:16,622: INFO: Finished Saved 232894 articles
This yields a text corpus of roughly 230,000 Chinese articles, wiki.zh.text, about 750 MB. Inspecting it shows that, besides some English words mixed in, there are also many traditional Chinese characters scattered throughout, so I followed the approach in @licstar's "維基百科簡體中文語料的獲取": install opencc and convert the traditional characters in wiki.zh.text to simplified ones:
opencc -i wiki.zh.text -o wiki.zh.text.jian -c zht2zhs.ini
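If you would rather stay inside Python, the opencc Python package can do roughly the same conversion line by line. This is an assumption on my part: that package uses configuration names such as 't2s' rather than the zht2zhs.ini file passed to the command-line tool above, and this sketch was not part of the original pipeline:

# -*- coding: utf-8 -*-
# Sketch only: traditional -> simplified with the opencc Python package
# (assumed API: OpenCC('t2s').convert), not the command-line opencc used above.
import codecs
from opencc import OpenCC

cc = OpenCC('t2s')
with codecs.open('wiki.zh.text', 'r', encoding='utf-8', errors='ignore') as fin, \
     codecs.open('wiki.zh.text.jian', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(cc.convert(line))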
Then comes word segmentation. This time I used a Chinese word segmenter trained with MeCab; it has not reached a production-ready state yet, but its speed and segmentation quality are good enough for this task:
mecab -d ../data/ -O wakati wiki.zh.text.jian -o wiki.zh.text.jian.seg -b 10000000
Here the data directory holds the segmentation model and dictionary files trained for MeCab; see "用MeCab打造一套實用的中文分詞系統" for details.
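If a MeCab Chinese model is not at hand, a pure-Python segmenter such as jieba (my substitution here, not what this post actually used) produces the same kind of space-separated output; a rough sketch:

# -*- coding: utf-8 -*-
# Sketch: segment the simplified corpus with jieba instead of the MeCab-based
# segmenter used above (an alternative, not the original pipeline).
import codecs
import jieba

with codecs.open('wiki.zh.text.jian', 'r', encoding='utf-8', errors='ignore') as fin, \
     codecs.open('wiki.zh.text.jian.seg', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(' '.join(jieba.cut(line.strip())) + '\n')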
With the segmented Chinese Wikipedia data, I assumed word2vec training could now simply be run:
python train_word2vec_model.py wiki.zh.text.jian.seg wiki.zh.text.model wiki.zh.text.vector
But there was still a problem; the error was:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 5394-5395: invalid continuation byte
A bit of googling suggested the file contained non-UTF-8 bytes, so I cleaned it up with iconv:
iconv -c -t UTF-8 < wiki.zh.text.jian.seg > wiki.zh.text.jian.seg.utf-8
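The same cleanup can also be done in Python by decoding with errors='ignore' and re-encoding, which drops the undecodable bytes much like iconv -c does (a sketch, offered only as an alternative):

# Sketch: strip bytes that are not valid UTF-8, the way `iconv -c` does above.
with open('wiki.zh.text.jian.seg', 'rb') as fin, \
     open('wiki.zh.text.jian.seg.utf-8', 'wb') as fout:
    for line in fin:
        fout.write(line.decode('utf-8', 'ignore').encode('utf-8'))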
With the iconv cleanup done, things basically worked. Run:
python train_word2vec_model.py wiki.zh.text.jian.seg.utf-8 wiki.zh.text.model wiki.zh.text.vector
2015-03-11 18:50:02,586: INFO: running train_word2vec_model.py wiki.zh.text.jian.seg.utf-8 wiki.zh.text.model wiki.zh.text.vector 2015-03-11 18:50:02,592: INFO: collecting all words and their counts 2015-03-11 18:50:02,592: INFO: PROGRESS: at sentence #0, processed 0 words and 0 word types 2015-03-11 18:50:12,476: INFO: PROGRESS: at sentence #10000, processed 12914562 words and 254662 word types 2015-03-11 18:50:20,215: INFO: PROGRESS: at sentence #20000, processed 22308801 words and 373573 word types 2015-03-11 18:50:28,448: INFO: PROGRESS: at sentence #30000, processed 30724902 words and 460837 word types ... 2015-03-11 18:52:03,498: INFO: PROGRESS: at sentence #210000, processed 143804601 words and 1483608 word types 2015-03-11 18:52:07,772: INFO: PROGRESS: at sentence #220000, processed 149352283 words and 1521199 word types 2015-03-11 18:52:11,639: INFO: PROGRESS: at sentence #230000, processed 154741839 words and 1563584 word types 2015-03-11 18:52:12,746: INFO: collected 1575172 word types from a corpus of 156430908 words and 232894 sentences 2015-03-11 18:52:13,672: INFO: total 278291 word types after removing those with count<5 2015-03-11 18:52:13,673: INFO: constructing a huffman tree from 278291 words 2015-03-11 18:52:29,323: INFO: built huffman tree with maximum node depth 25 2015-03-11 18:52:29,683: INFO: resetting layer weights 2015-03-11 18:52:38,805: INFO: training model with 4 workers on 278291 vocabulary and 400 features, using 'skipgram'=1 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0 2015-03-11 18:52:49,504: INFO: PROGRESS: at 0.10% words, alpha 0.02500, 15008 words/s 2015-03-11 18:52:51,935: INFO: PROGRESS: at 0.38% words, alpha 0.02500, 44434 words/s 2015-03-11 18:52:54,779: INFO: PROGRESS: at 0.56% words, alpha 0.02500, 53965 words/s 2015-03-11 18:52:57,240: INFO: PROGRESS: at 0.62% words, alpha 0.02491, 52116 words/s 2015-03-11 18:52:58,823: INFO: PROGRESS: at 0.72% words, alpha 0.02494, 55804 words/s 2015-03-11 18:53:03,649: INFO: PROGRESS: at 0.94% words, alpha 0.02486, 58277 words/s 2015-03-11 18:53:07,357: INFO: PROGRESS: at 1.03% words, alpha 0.02479, 56036 words/s ...... 2015-03-11 19:22:09,002: INFO: PROGRESS: at 98.38% words, alpha 0.00044, 85936 words/s 2015-03-11 19:22:10,321: INFO: PROGRESS: at 98.50% words, alpha 0.00044, 85971 words/s 2015-03-11 19:22:11,934: INFO: PROGRESS: at 98.55% words, alpha 0.00039, 85940 words/s 2015-03-11 19:22:13,384: INFO: PROGRESS: at 98.65% words, alpha 0.00036, 85960 words/s 2015-03-11 19:22:13,883: INFO: training on 152625573 words took 1775.1s, 85982 words/s 2015-03-11 19:22:13,883: INFO: saving Word2Vec object under wiki.zh.text.model, separately None 2015-03-11 19:22:13,884: INFO: not storing attribute syn0norm 2015-03-11 19:22:13,884: INFO: storing numpy array 'syn0' to wiki.zh.text.model.syn0.npy 2015-03-11 19:22:20,797: INFO: storing numpy array 'syn1' to wiki.zh.text.model.syn1.npy 2015-03-11 19:22:40,667: INFO: storing 278291x400 projection weights into wiki.zh.text.vector |
Let's look at how the trained Chinese Wikipedia word2vec model "wiki.zh.text.vector" performs:
In [1]: import gensim In [2]: model = gensim.models.Word2Vec.load("wiki.zh.text.model") In [3]: model.most_similar(u"足球") Out[3]: [(u'\u8054\u8d5b', 0.6553816199302673), (u'\u7532\u7ea7', 0.6530429720878601), (u'\u7bee\u7403', 0.5967546701431274), (u'\u4ff1\u4e50\u90e8', 0.5872289538383484), (u'\u4e59\u7ea7', 0.5840631723403931), (u'\u8db3\u7403\u961f', 0.5560152530670166), (u'\u4e9a\u8db3\u8054', 0.5308005809783936), (u'allsvenskan', 0.5249762535095215), (u'\u4ee3\u8868\u961f', 0.5214947462081909), (u'\u7532\u7ec4', 0.5177896022796631)] In [4]: result = model.most_similar(u"足球") In [5]: for e in result: print e[0], e[1] ....: 聯賽 0.65538161993 甲級 0.653042972088 籃球 0.596754670143 俱樂部 0.587228953838 乙級 0.58406317234 足球隊 0.556015253067 亞足聯 0.530800580978 allsvenskan 0.52497625351 表明隊 0.521494746208 甲組 0.51778960228 In [6]: result = model.most_similar(u"男人") In [7]: for e in result: print e[0], e[1] ....: 女人 0.77537125349 傢伙 0.617369174957 媽媽 0.567102909088 漂亮 0.560832381248 잘했어 0.540875017643 謊話 0.538448691368 爸爸 0.53660941124 傻瓜 0.535608053207 예쁘다 0.535151124001 mc劉 0.529670000076 In [8]: result = model.most_similar(u"女人") In [9]: for e in result: print e[0], e[1] ....: 男人 0.77537125349 個人某 0.589010596275 媽媽 0.576344847679 잘했어 0.562340974808 美麗 0.555426716805 爸爸 0.543958246708 新娘 0.543640494347 謊話 0.540272831917 妞兒 0.531066179276 老婆 0.528521537781 In [10]: result = model.most_similar(u"青蛙") In [11]: for e in result: print e[0], e[1] ....: 老鼠 0.559612870216 烏龜 0.489831030369 蜥蜴 0.478990525007 貓 0.46728849411 鱷魚 0.461885392666 蟾蜍 0.448014199734 猴子 0.436584025621 白雪公主 0.434905380011 蚯蚓 0.433413207531 螃蟹 0.4314712286 In [12]: result = model.most_similar(u"姨夫") In [13]: for e in result: print e[0], e[1] ....: 堂伯 0.583935439587 祖父 0.574735701084 妃所生 0.569327116013 內弟 0.562012672424 早卒 0.558042645454 曕 0.553856015205 胤禎 0.553288519382 陳潛 0.550716996193 愔之 0.550510883331 叔父 0.550032019615 In [14]: result = model.most_similar(u"衣服") In [15]: for e in result: print e[0], e[1] ....: 鞋子 0.686688780785 穿着 0.672499775887 衣物 0.67173999548 大衣 0.667605519295 褲子 0.662670075893 內褲 0.662210345268 裙子 0.659705817699 西裝 0.648508131504 洋裝 0.647238850594 圍裙 0.642895817757 In [16]: result = model.most_similar(u"公安局") In [17]: for e in result: print e[0], e[1] ....: 司法局 0.730189085007 公安廳 0.634275555611 公安 0.612798035145 房管局 0.597343325615 商業局 0.597183346748 軍管會 0.59476184845 體育局 0.59283208847 財政局 0.588721752167 戒毒所 0.575558543205 新聞辦 0.573395550251 In [18]: result = model.most_similar(u"鐵道部") In [19]: for e in result: print e[0], e[1] ....: 盛光祖 0.565509021282 交通部 0.548688530922 批覆 0.546967327595 劉志軍 0.541010737419 立項 0.517836689949 報送 0.510296344757 計委 0.508456230164 水利部 0.503531932831 國務院 0.503227233887 經貿委 0.50156635046 In [20]: result = model.most_similar(u"清華大學") In [21]: for e in result: print e[0], e[1] ....: 北京大學 0.763922810555 化學系 0.724210739136 物理系 0.694550514221 數學系 0.684280991554 中山大學 0.677202701569 復旦 0.657914161682 師範大學 0.656435549259 哲學系 0.654701948166 生物系 0.654403865337 中文系 0.653147578239 In [22]: result = model.most_similar(u"衛視") In [23]: for e in result: print e[0], e[1] ....: 湖南 0.676812887192 中文臺 0.626506924629 収蔵 0.621356606483 黃金檔 0.582251906395 cctv 0.536769032478 安徽 0.536752820015 非同凡響 0.534517168999 唱響 0.533438682556 最強音 0.532605051994 金鷹 0.531676828861 In [26]: result = model.most_similar(u"林丹") In [27]: for e in result: print e[0], e[1] ....: 黃綜翰 0.538035452366 蔣燕皎 0.52646958828 劉鑫 0.522252976894 韓晶娜 0.516120731831 王曉理 0.512289524078 王適 0.508560419083 楊影 0.508159279823 陳躍 0.507353425026 龔智超 
0.503159761429 李敬元 0.50262516737 In [28]: result = model.most_similar(u"語言學") In [29]: for e in result: print e[0], e[1] ....: 社會學 0.632598280907 人類學 0.623406708241 歷史學 0.618442356586 比較文學 0.604823827744 心理學 0.600066184998 人文科學 0.577783346176 社會心理學 0.575571238995 政治學 0.574541330338 地理學 0.573896467686 哲學 0.573873817921 In [30]: result = model.most_similar(u"計算機") In [31]: for e in result: print e[0], e[1] ....: 自動化 0.674171924591 應用 0.614087462425 自動化系 0.611132860184 材料科學 0.607891201973 集成電路 0.600370049477 技術 0.597518980503 電子學 0.591316461563 建模 0.577238917351 工程學 0.572855889797 微電子 0.570086717606 In [32]: model.similarity(u"計算機", u"自動化") Out[32]: 0.67417196002404789 In [33]: model.similarity(u"女人", u"男人") Out[33]: 0.77537125129824813 In [34]: model.doesnt_match(u"早餐 晚餐 午飯 中心".split()) Out[34]: u'\u4e2d\u5fc3' In [35]: print model.doesnt_match(u"早餐 晚餐 午飯 中心".split()) 中心 |
There are good cases and bad cases, and the bad cases may well outnumber the good ones; this depends on the size of the corpus, the quality of the segmenter, and so on. The experiment stops here for now. As for what word2vec is actually good for: beyond computing word similarity, what practitioners care about more is how word2vec performs in concrete application tasks, which is the more interesting question, and I welcome discussion on it.
Source: "我愛自然語言處理" (I Love Natural Language Processing): www.52nlp.cn