I recently tried out Word2Vec and GloVe, along with their Python counterparts, gensim word2vec and python-glove, and wanted to test them on a larger corpus; the Wikipedia dumps naturally came to mind. Wikipedia provides an excellent official data source at https://dumps.wikimedia.org, where dumps in many languages and formats can be downloaded conveniently. I had previously used gensim to play with the English Wikipedia corpus, training LSI and LDA models to compute document similarity, so I wondered whether gensim also offered a convenient way to process the Wikipedia data and train a word2vec model for computing semantic similarity between words. Thanks to Google, I found a long thread in the gensim Google Group, "training word2vec on full Wikipedia", which covers pretty much everything about training word2vec on the Wikipedia corpus with gensim; the gensim author, Radim Řehůřek, even added a small fix to a newer gensim release as a result of that discussion, so all that was left for me to do was verify it. There is also a wiki2vec project on GitHub that does the same thing, but I prefer solving the problem with Python and gensim.
There is plenty of reference material on word2vec in both English and Chinese. In English, you can read the officially recommended papers as well as the articles written by the gensim author Radim Řehůřek. In Chinese, I recommend @licstar's "Deep Learning in NLP (一) 詞向量和語言模型", the Youdao tech salon's "Deep Learning實戰之word2vec", @飛林沙's "word2vec的學習思路", and falao_beiliu's "深度學習word2vec筆記之基礎篇" and "深度學習word2vec筆記之算法篇".
1. Word2Vec on the English Wikipedia
I first tested the English Wikipedia data, downloading the latest bz2-compressed XML dump (downloaded on March 1, 2015), about 11 GB, from:
https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
Processing takes two stages. First, the XML wiki dump is converted to plain text with the following script (process_wiki.py):
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import logging
import os.path
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print globals()['__doc__'] % locals()
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    output = open(outp, 'w')
    # lemmatize=False: do not run the pattern lemmatizer, which slows processing badly
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        # each article becomes one line of space-separated tokens
        output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")

    output.close()
    logger.info("Finished Saved " + str(i) + " articles")
This uses gensim's Wikipedia processing class WikiCorpus; get_texts turns each Wikipedia article into one line of text and strips punctuation and other markup. Note that in "wiki = WikiCorpus(inp, lemmatize=False, dictionary={})", lemmatize is set to False so that the pattern module is not used to lemmatize the English words; regardless of whether pattern is installed on your machine, using it slows this step down severely.
Run "python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text":
2015-03-07 15:08:39,181: INFO: running process_enwiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text
2015-03-07 15:11:12,860: INFO: Saved 10000 articles
2015-03-07 15:13:25,369: INFO: Saved 20000 articles
2015-03-07 15:15:19,771: INFO: Saved 30000 articles
2015-03-07 15:16:58,424: INFO: Saved 40000 articles
2015-03-07 15:18:12,374: INFO: Saved 50000 articles
2015-03-07 15:19:03,213: INFO: Saved 60000 articles
2015-03-07 15:19:47,656: INFO: Saved 70000 articles
2015-03-07 15:20:29,135: INFO: Saved 80000 articles
2015-03-07 15:22:02,365: INFO: Saved 90000 articles
2015-03-07 15:23:40,141: INFO: Saved 100000 articles
.....
2015-03-07 19:33:16,549: INFO: Saved 3700000 articles
2015-03-07 19:33:49,493: INFO: Saved 3710000 articles
2015-03-07 19:34:23,442: INFO: Saved 3720000 articles
2015-03-07 19:34:57,984: INFO: Saved 3730000 articles
2015-03-07 19:35:31,976: INFO: Saved 3740000 articles
2015-03-07 19:36:05,790: INFO: Saved 3750000 articles
2015-03-07 19:36:32,392: INFO: finished iterating over Wikipedia corpus of 3758076 documents with 2018886604 positions (total 15271374 articles, 2075130438 positions before pruning articles shorter than 50 words)
2015-03-07 19:36:32,394: INFO: Finished Saved 3758076 articles
On my MacBook Pro (4 cores, 16 GB RAM) this took about four and a half hours. After roughly 3.75 million articles were processed, we get a 12 GB plain-text English Wikipedia corpus, wiki.en.text, which looks like this:
anarchism is collection of movements and ideologies that hold the state to be undesirable unnecessary or harmful these movements advocate some form of stateless society instead often based on self governed voluntary institutions or non hierarchical free associations although anti statism is central to anarchism as political philosophy anarchism also entails rejection of and often hierarchical organisation in general as an anti dogmatic philosophy anarchism draws on many currents of thought and strategy anarchism does not offer fixed body of doctrine from single particular world view instead fluxing and flowing as philosophy there are many types and traditions of anarchism not all of which are mutually exclusive anarchist schools of thought can differ fundamentally supporting anything from extreme individualism to complete collectivism strains of anarchism have often been divided into the categories of social and individualist anarchism or similar dual classifications anarchism is usually considered radical left wing ideology and much of anarchist economics and anarchist legal philosophy reflect anti authoritarian interpretations of communism collectivism syndicalism mutualism or participatory economics etymology and terminology the term anarchism is compound word composed from the word anarchy and the suffix ism themselves derived respectively from the greek anarchy from anarchos meaning one without rulers from the privative prefix ἀν an without and archos leader ruler cf archon or arkhē authority sovereignty realm magistracy and the suffix or ismos isma from the verbal infinitive suffix…
With this data, you can train a word2vec model with either the original C word2vec binary or gensim's Python word2vec implementation. We tried the former and found it quite slow, so we went with the gensim word2vec training script from the Google Group thread, with one small change: the output in text vector format is also kept, which makes debugging easier. The script train_word2vec_model.py is as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import logging
import os.path
import sys
import multiprocessing

from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 4:
        print globals()['__doc__'] % locals()
        sys.exit(1)
    inp, outp1, outp2 = sys.argv[1:4]

    # 400-dimensional vectors, window 5, drop words with fewer than 5 occurrences,
    # one worker per CPU core
    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())

    # trim unneeded model memory = use (much) less RAM
    # model.init_sims(replace=True)
    model.save(outp1)
    model.save_word2vec_format(outp2, binary=False)
Run "python train_word2vec_model.py wiki.en.text wiki.en.text.model wiki.en.text.vector":
2015-03-09 22:48:29,588: INFO: running train_word2vec_model.py wiki.en.text wiki.en.text.model wiki.en.text.vector 2015-03-09 22:48:29,593: INFO: collecting all words and their counts 2015-03-09 22:48:29,607: INFO: PROGRESS: at sentence #0, processed 0 words and 0 word types 2015-03-09 22:48:50,686: INFO: PROGRESS: at sentence #10000, processed 29353579 words and 430650 word types 2015-03-09 22:49:08,476: INFO: PROGRESS: at sentence #20000, processed 54695775 words and 610833 word types 2015-03-09 22:49:22,985: INFO: PROGRESS: at sentence #30000, processed 75344844 words and 742274 word types 2015-03-09 22:49:35,607: INFO: PROGRESS: at sentence #40000, processed 93430415 words and 859131 word types 2015-03-09 22:49:44,125: INFO: PROGRESS: at sentence #50000, processed 106057188 words and 935606 word types 2015-03-09 22:49:49,185: INFO: PROGRESS: at sentence #60000, processed 114319016 words and 952771 word types 2015-03-09 22:49:53,316: INFO: PROGRESS: at sentence #70000, processed 121263134 words and 969526 word types 2015-03-09 22:49:57,268: INFO: PROGRESS: at sentence #80000, processed 127773799 words and 984130 word types 2015-03-09 22:50:07,593: INFO: PROGRESS: at sentence #90000, processed 142688762 words and 1062932 word types 2015-03-09 22:50:19,162: INFO: PROGRESS: at sentence #100000, processed 159550824 words and 1157644 word types ...... 2015-03-09 23:11:52,977: INFO: PROGRESS: at sentence #3700000, processed 1999452503 words and 7990138 word types 2015-03-09 23:11:55,367: INFO: PROGRESS: at sentence #3710000, processed 2002777270 words and 8002903 word types 2015-03-09 23:11:57,842: INFO: PROGRESS: at sentence #3720000, processed 2006213923 words and 8019620 word types 2015-03-09 23:12:00,439: INFO: PROGRESS: at sentence #3730000, processed 2009762733 words and 8035408 word types 2015-03-09 23:12:02,793: INFO: PROGRESS: at sentence #3740000, processed 2013066196 words and 8045218 word types 2015-03-09 23:12:05,178: INFO: PROGRESS: at sentence #3750000, processed 2016363087 words and 8057784 word types 2015-03-09 23:12:07,013: INFO: collected 8069236 word types from a corpus of 2018886604 words and 3758076 sentences 2015-03-09 23:12:12,230: INFO: total 1969354 word types after removing those with count<5 2015-03-09 23:12:12,230: INFO: constructing a huffman tree from 1969354 words 2015-03-09 23:14:07,415: INFO: built huffman tree with maximum node depth 29 2015-03-09 23:14:09,790: INFO: resetting layer weights 2015-03-09 23:15:04,506: INFO: training model with 4 workers on 1969354 vocabulary and 400 features, using 'skipgram'=1 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0 2015-03-09 23:15:19,112: INFO: PROGRESS: at 0.01% words, alpha 0.02500, 19098 words/s 2015-03-09 23:15:20,224: INFO: PROGRESS: at 0.03% words, alpha 0.02500, 37671 words/s 2015-03-09 23:15:22,305: INFO: PROGRESS: at 0.07% words, alpha 0.02500, 75393 words/s 2015-03-09 23:15:27,712: INFO: PROGRESS: at 0.08% words, alpha 0.02499, 65618 words/s 2015-03-09 23:15:29,452: INFO: PROGRESS: at 0.09% words, alpha 0.02500, 70966 words/s 2015-03-09 23:15:34,032: INFO: PROGRESS: at 0.11% words, alpha 0.02498, 77369 words/s 2015-03-09 23:15:37,249: INFO: PROGRESS: at 0.12% words, alpha 0.02498, 74935 words/s 2015-03-09 23:15:40,618: INFO: PROGRESS: at 0.14% words, alpha 0.02498, 75399 words/s 2015-03-09 23:15:42,301: INFO: PROGRESS: at 0.16% words, alpha 0.02497, 86029 words/s 2015-03-09 23:15:46,283: INFO: PROGRESS: at 0.17% words, alpha 0.02497, 83033 words/s 2015-03-09 23:15:48,374: INFO: PROGRESS: at 
0.18% words, alpha 0.02497, 83370 words/s 2015-03-09 23:15:51,398: INFO: PROGRESS: at 0.19% words, alpha 0.02496, 82794 words/s 2015-03-09 23:15:55,069: INFO: PROGRESS: at 0.21% words, alpha 0.02496, 83753 words/s 2015-03-09 23:15:57,718: INFO: PROGRESS: at 0.23% words, alpha 0.02496, 85031 words/s 2015-03-09 23:16:00,106: INFO: PROGRESS: at 0.24% words, alpha 0.02495, 86567 words/s 2015-03-09 23:16:05,523: INFO: PROGRESS: at 0.26% words, alpha 0.02495, 84850 words/s 2015-03-09 23:16:06,596: INFO: PROGRESS: at 0.27% words, alpha 0.02495, 87926 words/s 2015-03-09 23:16:09,500: INFO: PROGRESS: at 0.29% words, alpha 0.02494, 88618 words/s 2015-03-09 23:16:10,714: INFO: PROGRESS: at 0.30% words, alpha 0.02494, 91023 words/s 2015-03-09 23:16:18,467: INFO: PROGRESS: at 0.32% words, alpha 0.02494, 85960 words/s 2015-03-09 23:16:19,547: INFO: PROGRESS: at 0.33% words, alpha 0.02493, 89140 words/s 2015-03-09 23:16:23,500: INFO: PROGRESS: at 0.36% words, alpha 0.02493, 92026 words/s 2015-03-09 23:16:29,738: INFO: PROGRESS: at 0.37% words, alpha 0.02491, 88180 words/s 2015-03-09 23:16:32,000: INFO: PROGRESS: at 0.40% words, alpha 0.02492, 92734 words/s 2015-03-09 23:16:34,392: INFO: PROGRESS: at 0.42% words, alpha 0.02491, 93300 words/s 2015-03-09 23:16:41,018: INFO: PROGRESS: at 0.43% words, alpha 0.02490, 89727 words/s ....... 2015-03-10 05:03:31,849: INFO: PROGRESS: at 99.20% words, alpha 0.00020, 95350 words/s 2015-03-10 05:03:32,901: INFO: PROGRESS: at 99.21% words, alpha 0.00020, 95350 words/s 2015-03-10 05:03:34,296: INFO: PROGRESS: at 99.21% words, alpha 0.00020, 95350 words/s 2015-03-10 05:03:35,635: INFO: PROGRESS: at 99.22% words, alpha 0.00020, 95349 words/s 2015-03-10 05:03:36,730: INFO: PROGRESS: at 99.22% words, alpha 0.00020, 95350 words/s 2015-03-10 05:03:37,489: INFO: reached the end of input; waiting to finish 8 outstanding jobs 2015-03-10 05:03:37,908: INFO: PROGRESS: at 99.23% words, alpha 0.00019, 95350 words/s 2015-03-10 05:03:39,028: INFO: PROGRESS: at 99.23% words, alpha 0.00019, 95350 words/s 2015-03-10 05:03:40,127: INFO: PROGRESS: at 99.24% words, alpha 0.00019, 95350 words/s 2015-03-10 05:03:40,910: INFO: training on 1994415728 words took 20916.4s, 95352 words/s 2015-03-10 05:03:41,058: INFO: saving Word2Vec object under wiki.en.text.model, separately None 2015-03-10 05:03:41,209: INFO: not storing attribute syn0norm 2015-03-10 05:03:41,209: INFO: storing numpy array 'syn0' to wiki.en.text.model.syn0.npy 2015-03-10 05:04:35,199: INFO: storing numpy array 'syn1' to wiki.en.text.model.syn1.npy 2015-03-10 05:11:25,400: INFO: storing 1969354x400 projection weights into wiki.en.text.vector |
After about seven hours we get a word2vec model in gensim's default format plus a model in the original C word2vec text vector format, wiki.en.text.vector, which looks like this:
1969354 400
the 0.129255 0.015725 0.049174 -0.016438 -0.018912 0.032752 0.079885 0.033669 -0.077722 -0.025709 0.012775 0.044153 0.134307 0.070499 -0.002243 0.105198 -0.016832 -0.028631 -0.124312 -0.123064 -0.116838 0.051181 -0.096058 -0.049734 0.017380 -0.101221 0.058945 0.013669 -0.012755 0.061053 0.061813 0.083655 -0.069382 -0.069868 0.066529 -0.037156 -0.072935 -0.009470 0.037412 -0.004406 0.047011 0.005033 -0.066270 -0.031815 0.023160 -0.080117 0.172918 0.065486 -0.072161 0.062875 0.019939 -0.048380 0.198152 -0.098525 0.023434 0.079439 0.045150 -0.079479 -0.051441 -0.021556 -0.024981 -0.045291 0.040284 -0.082500 0.014618 -0.071998 0.031887 0.043916 0.115783 -0.174898 0.086603 -0.023124 0.007293 -0.066576 -0.164817 -0.081223 0.058412 0.000132 0.064160 0.055848 0.029776 -0.103420 -0.007541 -0.031742 0.082533 -0.061760 -0.038961 0.001754 -0.023977 0.069616 0.095920 0.017136 0.067126 -0.111310 0.053632 0.017633 -0.003875 -0.005236 0.063151 0.039729 -0.039158 0.001415 0.021754 -0.012540 0.015070 -0.062636 -0.013605 -0.031770 0.005296 -0.078119 -0.069303 -0.080634 -0.058377 0.024398 -0.028173 0.026353 0.088662 0.018755 -0.113538 0.055538 -0.086012 -0.027708 -0.028788 0.017759 0.029293 0.047674 -0.106734 -0.134380 0.048605 -0.089583 0.029426 0.030552 0.141916 -0.022653 0.017204 -0.036059 0.061045 -0.000077 -0.076579 0.066747 0.060884 -0.072817…
…
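The first line of wiki.en.text.vector holds the vocabulary size and the vector dimensionality (1969354 and 400 here); each following line is a word and its 400 floats. If you ever need to read this text format without gensim, a minimal sketch along the following lines should do (the file name is the one produced above; the helper load_text_vectors is my own illustration, not part of any library):

# Minimal sketch: load the C word2vec text format written above into a dict of
# numpy arrays. load_text_vectors is an illustrative helper.
import numpy as np

def load_text_vectors(path):
    vectors = {}
    with open(path) as f:
        vocab_size, dim = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = np.array(parts[1:1 + dim], dtype=np.float32)
    return vectors

# vecs = load_text_vectors("wiki.en.text.vector")
# print(len(vecs), vecs["the"][:5])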
In ipython we load and test the model with gensim; since the model is about 7 GB, loading takes a while:
In [2]: import gensim

In [3]: model = gensim.models.Word2Vec.load_word2vec_format("wiki.en.text.vector", binary=False)

In [4]: model.most_similar("queen")
Out[4]:
[(u'princess', 0.5760838389396667),
 (u'hyoui', 0.5671186447143555),
 (u'janggyung', 0.5598698854446411),
 (u'king', 0.5556215047836304),
 (u'dollallolla', 0.5540223121643066),
 (u'loranella', 0.5522741079330444),
 (u'ramphaiphanni', 0.5310937166213989),
 (u'jeheon', 0.5298476219177246),
 (u'soheon', 0.5243583917617798),
 (u'coronation', 0.5217245221138)]

In [5]: model.most_similar("man")
Out[5]:
[(u'woman', 0.7120707035064697),
 (u'girl', 0.58659827709198),
 (u'handsome', 0.5637181997299194),
 (u'boy', 0.5425317287445068),
 (u'villager', 0.5084836483001709),
 (u'mustachioed', 0.49287813901901245),
 (u'mcgucket', 0.48355430364608765),
 (u'spider', 0.4804879426956177),
 (u'policeman', 0.4780033826828003),
 (u'stranger', 0.4750771224498749)]

In [6]: model.most_similar("woman")
Out[6]:
[(u'man', 0.7120705842971802),
 (u'girl', 0.6736541986465454),
 (u'prostitute', 0.5765659809112549),
 (u'divorcee', 0.5429972410202026),
 (u'person', 0.5276163816452026),
 (u'schoolgirl', 0.5102938413619995),
 (u'housewife', 0.48748138546943665),
 (u'lover', 0.4858251214027405),
 (u'handsome', 0.4773051142692566),
 (u'boy', 0.47445783019065857)]

In [8]: model.similarity("woman", "man")
Out[8]: 0.71207063453821218

In [10]: model.doesnt_match("breakfast cereal dinner lunch".split())
Out[10]: 'cereal'

In [11]: model.similarity("woman", "girl")
Out[11]: 0.67365416785207421

In [13]: model.most_similar("frog")
Out[13]:
[(u'toad', 0.6868536472320557),
 (u'barycragus', 0.6607867479324341),
 (u'grylio', 0.626731276512146),
 (u'heckscheri', 0.6208407878875732),
 (u'clamitans', 0.6150864362716675),
 (u'coplandi', 0.612680196762085),
 (u'pseudacris', 0.6108512878417969),
 (u'litoria', 0.6084023714065552),
 (u'raniformis', 0.6044802665710449),
 (u'watjulumensis', 0.6043726205825806)]
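As an extra usage note (not part of the session above), most_similar also accepts positive and negative word lists, so the classic vector-arithmetic query can be tried on this model; on a corpus of this size 'queen' usually ranks near the top, though the exact scores will vary from run to run:

# Hypothetical follow-up query, not from the original session:
model.most_similar(positive=["woman", "king"], negative=["man"], topn=5)
# expect words like 'queen' among the top results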
Everything works fine with the text-format vectors, but loading gensim's default numpy-based model runs into a problem:
In [1]: import gensim

In [2]: model = gensim.models.Word2Vec.load("wiki.en.text.model")

In [3]: model.most_similar("man")
...
RuntimeWarning: invalid value encountered in divide
  self.syn0norm = (self.syn0 / sqrt((self.syn0 ** 2).sum(-1))[..., newaxis]).astype(REAL)
Out[3]:
[(u'ahsns', nan),
 (u'ny\xedl', nan),
 (u'indradeo', nan),
 (u'jaimovich', nan),
 (u'addlepate', nan),
 (u'jagello', nan),
 (u'festenburg', nan),
 (u'picatic', nan),
 (u'tolosanum', nan),
 (u'mithoo', nan)]
This is why I modified the training script above. With smaller data, say the first 100,000 lines of text, there is no problem at all in either format, but after the full English Wikipedia run this issue appears. I have tried a few things to fix it without success; suggestions or solutions are very welcome.
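Judging from the RuntimeWarning, some rows of syn0 seem to have zero norm or contain non-finite values, so the normalization step divides by zero. A small diagnostic sketch along those lines (my own guess at how to narrow the problem down, not a confirmed fix) would be:

# Hedged diagnostic sketch: look for rows of the raw embedding matrix that would
# break the normalization step (zero norm, or NaN/inf entries). Inspection only.
import numpy as np
import gensim

model = gensim.models.Word2Vec.load("wiki.en.text.model")
norms = np.sqrt((model.syn0 ** 2).sum(axis=1))

zero_norm = np.where(norms == 0)[0]                             # rows that divide by zero
non_finite = np.where(~np.isfinite(model.syn0).all(axis=1))[0]  # rows containing NaN/inf

print "zero-norm rows:", len(zero_norm), "  non-finite rows:", len(non_finite)
for idx in zero_norm[:10]:
    print model.index2word[idx]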
2. Word2Vec on the Chinese Wikipedia
After testing the English Wikipedia, I naturally wanted to try the Chinese Wikipedia data. The process is similar and again has two steps, but the Chinese data needs some extra handling: traditional-to-simplified conversion, Chinese word segmentation, and removal of non-UTF-8 characters. The Chinese dump can be downloaded from: https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2.
The Chinese Wikipedia data is fairly small: the whole compressed XML file is only about 1 GB, much smaller than the English dump. First process the XML archive with process_wiki.py, running: python process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
2015-03-11 17:39:22,739: INFO: running process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
2015-03-11 17:40:08,329: INFO: Saved 10000 articles
2015-03-11 17:40:45,501: INFO: Saved 20000 articles
2015-03-11 17:41:23,659: INFO: Saved 30000 articles
2015-03-11 17:42:01,748: INFO: Saved 40000 articles
2015-03-11 17:42:33,779: INFO: Saved 50000 articles
......
2015-03-11 17:55:23,094: INFO: Saved 200000 articles
2015-03-11 17:56:14,692: INFO: Saved 210000 articles
2015-03-11 17:57:04,614: INFO: Saved 220000 articles
2015-03-11 17:57:57,979: INFO: Saved 230000 articles
2015-03-11 17:58:16,621: INFO: finished iterating over Wikipedia corpus of 232894 documents with 51603419 positions (total 2581444 articles, 62177405 positions before pruning articles shorter than 50 words)
2015-03-11 17:58:16,622: INFO: Finished Saved 232894 articles
This yields a text corpus of roughly 230,000 Chinese articles, wiki.zh.text, about 750 MB. Inspecting it shows that, besides some English words mixed in, there are also many traditional Chinese characters scattered throughout, so I followed the approach in @licstar's "維基百科簡體中文語料的獲取": install opencc and convert the traditional characters in wiki.zh.text to simplified ones:
opencc -i wiki.zh.text -o wiki.zh.text.jian -c zht2zhs.ini
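If you would rather stay inside Python, the opencc Python package can do roughly the same conversion line by line. This is an assumption on my part: that package uses configuration names such as 't2s' rather than the zht2zhs.ini file passed to the command-line tool above, and this sketch was not part of the original pipeline:

# -*- coding: utf-8 -*-
# Sketch only: traditional -> simplified with the opencc Python package
# (assumed API: OpenCC('t2s').convert), not the command-line opencc used above.
import codecs
from opencc import OpenCC

cc = OpenCC('t2s')
with codecs.open('wiki.zh.text', 'r', encoding='utf-8', errors='ignore') as fin, \
     codecs.open('wiki.zh.text.jian', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(cc.convert(line))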
Then comes word segmentation. This time I used a Chinese word segmenter trained with MeCab; it has not reached a production-ready state yet, but its speed and segmentation quality are good enough for this task:
mecab -d ../data/ -O wakati wiki.zh.text.jian -o wiki.zh.text.jian.seg -b 10000000
Here the data directory holds the segmentation model and dictionary files trained for MeCab; see "用MeCab打造一套實用的中文分詞系統" for details.
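If a MeCab Chinese model is not at hand, a pure-Python segmenter such as jieba (my substitution here, not what this post actually used) produces the same kind of space-separated output; a rough sketch:

# -*- coding: utf-8 -*-
# Sketch: segment the simplified corpus with jieba instead of the MeCab-based
# segmenter used above (an alternative, not the original pipeline).
import codecs
import jieba

with codecs.open('wiki.zh.text.jian', 'r', encoding='utf-8', errors='ignore') as fin, \
     codecs.open('wiki.zh.text.jian.seg', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(' '.join(jieba.cut(line.strip())) + '\n')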
With the segmented Chinese Wikipedia data, I assumed word2vec training could now simply be run:
python train_word2vec_model.py wiki.zh.text.jian.seg wiki.zh.text.model wiki.zh.text.vector
But there was still a problem; the error was:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 5394-5395: invalid continuation byte
A bit of googling suggested the file contained non-UTF-8 bytes, so I cleaned it up with iconv:
iconv -c -t UTF-8 < wiki.zh.text.jian.seg > wiki.zh.text.jian.seg.utf-8
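The same cleanup can also be done in Python by decoding with errors='ignore' and re-encoding, which drops the undecodable bytes much like iconv -c does (a sketch, offered only as an alternative):

# Sketch: strip bytes that are not valid UTF-8, the way `iconv -c` does above.
with open('wiki.zh.text.jian.seg', 'rb') as fin, \
     open('wiki.zh.text.jian.seg.utf-8', 'wb') as fout:
    for line in fin:
        fout.write(line.decode('utf-8', 'ignore').encode('utf-8'))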
With the iconv cleanup done, things basically worked. Run:
python train_word2vec_model.py wiki.zh.text.jian.seg.utf-8 wiki.zh.text.model wiki.zh.text.vector
2015-03-11 18:50:02,586: INFO: running train_word2vec_model.py wiki.zh.text.jian.seg.utf-8 wiki.zh.text.model wiki.zh.text.vector 2015-03-11 18:50:02,592: INFO: collecting all words and their counts 2015-03-11 18:50:02,592: INFO: PROGRESS: at sentence #0, processed 0 words and 0 word types 2015-03-11 18:50:12,476: INFO: PROGRESS: at sentence #10000, processed 12914562 words and 254662 word types 2015-03-11 18:50:20,215: INFO: PROGRESS: at sentence #20000, processed 22308801 words and 373573 word types 2015-03-11 18:50:28,448: INFO: PROGRESS: at sentence #30000, processed 30724902 words and 460837 word types ... 2015-03-11 18:52:03,498: INFO: PROGRESS: at sentence #210000, processed 143804601 words and 1483608 word types 2015-03-11 18:52:07,772: INFO: PROGRESS: at sentence #220000, processed 149352283 words and 1521199 word types 2015-03-11 18:52:11,639: INFO: PROGRESS: at sentence #230000, processed 154741839 words and 1563584 word types 2015-03-11 18:52:12,746: INFO: collected 1575172 word types from a corpus of 156430908 words and 232894 sentences 2015-03-11 18:52:13,672: INFO: total 278291 word types after removing those with count<5 2015-03-11 18:52:13,673: INFO: constructing a huffman tree from 278291 words 2015-03-11 18:52:29,323: INFO: built huffman tree with maximum node depth 25 2015-03-11 18:52:29,683: INFO: resetting layer weights 2015-03-11 18:52:38,805: INFO: training model with 4 workers on 278291 vocabulary and 400 features, using 'skipgram'=1 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0 2015-03-11 18:52:49,504: INFO: PROGRESS: at 0.10% words, alpha 0.02500, 15008 words/s 2015-03-11 18:52:51,935: INFO: PROGRESS: at 0.38% words, alpha 0.02500, 44434 words/s 2015-03-11 18:52:54,779: INFO: PROGRESS: at 0.56% words, alpha 0.02500, 53965 words/s 2015-03-11 18:52:57,240: INFO: PROGRESS: at 0.62% words, alpha 0.02491, 52116 words/s 2015-03-11 18:52:58,823: INFO: PROGRESS: at 0.72% words, alpha 0.02494, 55804 words/s 2015-03-11 18:53:03,649: INFO: PROGRESS: at 0.94% words, alpha 0.02486, 58277 words/s 2015-03-11 18:53:07,357: INFO: PROGRESS: at 1.03% words, alpha 0.02479, 56036 words/s ...... 2015-03-11 19:22:09,002: INFO: PROGRESS: at 98.38% words, alpha 0.00044, 85936 words/s 2015-03-11 19:22:10,321: INFO: PROGRESS: at 98.50% words, alpha 0.00044, 85971 words/s 2015-03-11 19:22:11,934: INFO: PROGRESS: at 98.55% words, alpha 0.00039, 85940 words/s 2015-03-11 19:22:13,384: INFO: PROGRESS: at 98.65% words, alpha 0.00036, 85960 words/s 2015-03-11 19:22:13,883: INFO: training on 152625573 words took 1775.1s, 85982 words/s 2015-03-11 19:22:13,883: INFO: saving Word2Vec object under wiki.zh.text.model, separately None 2015-03-11 19:22:13,884: INFO: not storing attribute syn0norm 2015-03-11 19:22:13,884: INFO: storing numpy array 'syn0' to wiki.zh.text.model.syn0.npy 2015-03-11 19:22:20,797: INFO: storing numpy array 'syn1' to wiki.zh.text.model.syn1.npy 2015-03-11 19:22:40,667: INFO: storing 278291x400 projection weights into wiki.zh.text.vector |
Let's look at how the trained Chinese Wikipedia word2vec model "wiki.zh.text.vector" performs:
In [1]: import gensim In [2]: model = gensim.models.Word2Vec.load("wiki.zh.text.model") In [3]: model.most_similar(u"足球") Out[3]: [(u'\u8054\u8d5b', 0.6553816199302673), (u'\u7532\u7ea7', 0.6530429720878601), (u'\u7bee\u7403', 0.5967546701431274), (u'\u4ff1\u4e50\u90e8', 0.5872289538383484), (u'\u4e59\u7ea7', 0.5840631723403931), (u'\u8db3\u7403\u961f', 0.5560152530670166), (u'\u4e9a\u8db3\u8054', 0.5308005809783936), (u'allsvenskan', 0.5249762535095215), (u'\u4ee3\u8868\u961f', 0.5214947462081909), (u'\u7532\u7ec4', 0.5177896022796631)] In [4]: result = model.most_similar(u"足球") In [5]: for e in result: print e[0], e[1] ....: 聯賽 0.65538161993 甲級 0.653042972088 籃球 0.596754670143 俱樂部 0.587228953838 乙級 0.58406317234 足球隊 0.556015253067 亞足聯 0.530800580978 allsvenskan 0.52497625351 表明隊 0.521494746208 甲組 0.51778960228 In [6]: result = model.most_similar(u"男人") In [7]: for e in result: print e[0], e[1] ....: 女人 0.77537125349 傢伙 0.617369174957 媽媽 0.567102909088 漂亮 0.560832381248 잘했어 0.540875017643 謊話 0.538448691368 爸爸 0.53660941124 傻瓜 0.535608053207 예쁘다 0.535151124001 mc劉 0.529670000076 In [8]: result = model.most_similar(u"女人") In [9]: for e in result: print e[0], e[1] ....: 男人 0.77537125349 個人某 0.589010596275 媽媽 0.576344847679 잘했어 0.562340974808 美麗 0.555426716805 爸爸 0.543958246708 新娘 0.543640494347 謊話 0.540272831917 妞兒 0.531066179276 老婆 0.528521537781 In [10]: result = model.most_similar(u"青蛙") In [11]: for e in result: print e[0], e[1] ....: 老鼠 0.559612870216 烏龜 0.489831030369 蜥蜴 0.478990525007 貓 0.46728849411 鱷魚 0.461885392666 蟾蜍 0.448014199734 猴子 0.436584025621 白雪公主 0.434905380011 蚯蚓 0.433413207531 螃蟹 0.4314712286 In [12]: result = model.most_similar(u"姨夫") In [13]: for e in result: print e[0], e[1] ....: 堂伯 0.583935439587 祖父 0.574735701084 妃所生 0.569327116013 內弟 0.562012672424 早卒 0.558042645454 曕 0.553856015205 胤禎 0.553288519382 陳潛 0.550716996193 愔之 0.550510883331 叔父 0.550032019615 In [14]: result = model.most_similar(u"衣服") In [15]: for e in result: print e[0], e[1] ....: 鞋子 0.686688780785 穿着 0.672499775887 衣物 0.67173999548 大衣 0.667605519295 褲子 0.662670075893 內褲 0.662210345268 裙子 0.659705817699 西裝 0.648508131504 洋裝 0.647238850594 圍裙 0.642895817757 In [16]: result = model.most_similar(u"公安局") In [17]: for e in result: print e[0], e[1] ....: 司法局 0.730189085007 公安廳 0.634275555611 公安 0.612798035145 房管局 0.597343325615 商業局 0.597183346748 軍管會 0.59476184845 體育局 0.59283208847 財政局 0.588721752167 戒毒所 0.575558543205 新聞辦 0.573395550251 In [18]: result = model.most_similar(u"鐵道部") In [19]: for e in result: print e[0], e[1] ....: 盛光祖 0.565509021282 交通部 0.548688530922 批覆 0.546967327595 劉志軍 0.541010737419 立項 0.517836689949 報送 0.510296344757 計委 0.508456230164 水利部 0.503531932831 國務院 0.503227233887 經貿委 0.50156635046 In [20]: result = model.most_similar(u"清華大學") In [21]: for e in result: print e[0], e[1] ....: 北京大學 0.763922810555 化學系 0.724210739136 物理系 0.694550514221 數學系 0.684280991554 中山大學 0.677202701569 復旦 0.657914161682 師範大學 0.656435549259 哲學系 0.654701948166 生物系 0.654403865337 中文系 0.653147578239 In [22]: result = model.most_similar(u"衛視") In [23]: for e in result: print e[0], e[1] ....: 湖南 0.676812887192 中文臺 0.626506924629 収蔵 0.621356606483 黃金檔 0.582251906395 cctv 0.536769032478 安徽 0.536752820015 非同凡響 0.534517168999 唱響 0.533438682556 最強音 0.532605051994 金鷹 0.531676828861 In [26]: result = model.most_similar(u"林丹") In [27]: for e in result: print e[0], e[1] ....: 黃綜翰 0.538035452366 蔣燕皎 0.52646958828 劉鑫 0.522252976894 韓晶娜 0.516120731831 王曉理 0.512289524078 王適 0.508560419083 楊影 0.508159279823 陳躍 0.507353425026 龔智超 
0.503159761429 李敬元 0.50262516737 In [28]: result = model.most_similar(u"語言學") In [29]: for e in result: print e[0], e[1] ....: 社會學 0.632598280907 人類學 0.623406708241 歷史學 0.618442356586 比較文學 0.604823827744 心理學 0.600066184998 人文科學 0.577783346176 社會心理學 0.575571238995 政治學 0.574541330338 地理學 0.573896467686 哲學 0.573873817921 In [30]: result = model.most_similar(u"計算機") In [31]: for e in result: print e[0], e[1] ....: 自動化 0.674171924591 應用 0.614087462425 自動化系 0.611132860184 材料科學 0.607891201973 集成電路 0.600370049477 技術 0.597518980503 電子學 0.591316461563 建模 0.577238917351 工程學 0.572855889797 微電子 0.570086717606 In [32]: model.similarity(u"計算機", u"自動化") Out[32]: 0.67417196002404789 In [33]: model.similarity(u"女人", u"男人") Out[33]: 0.77537125129824813 In [34]: model.doesnt_match(u"早餐 晚餐 午飯 中心".split()) Out[34]: u'\u4e2d\u5fc3' In [35]: print model.doesnt_match(u"早餐 晚餐 午飯 中心".split()) 中心 |
There are good cases and bad cases, and the bad cases may well outnumber the good ones; this depends on the size of the corpus, the quality of the segmenter, and so on. The experiment stops here for now. As for what word2vec is actually good for: beyond computing word similarity, what practitioners care about more is how word2vec performs in concrete application tasks, which is the more interesting question, and I welcome discussion on it.
Source: "我愛自然語言處理" (I Love Natural Language Processing): www.52nlp.cn