I previously wrote 《中英文維基百科語料上的Word2Vec實驗》 (Experiments with Word2Vec on Chinese and English Wikipedia corpora). Recently quite a few readers have left questions under that article, and some of my recent work has also involved Word2Vec, so I did some homework again: I went back over the Word2Vec material, tried gensim's updated interfaces, and googled "wikipedia word2vec" or "維基百科 word2vec" for English and Chinese resources. Most of what turned up follows the same route as that old article: extract the Wikipedia corpus with gensim's preprocessing module gensim.corpora.WikiCorpus, store the text one article per line, and then train a word vector model with gensim's Word2Vec module. Here I offer an alternative way to process the Wikipedia corpus, train a word vector model, and compute word similarity. For background on Word2Vec, if your English is up to it, I recommend starting with this article: Getting started with Word2Vec.
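For comparison, here is a minimal sketch of that WikiCorpus route (my own illustration, not code from the earlier article; the output file name wiki.en.txt is just a placeholder):

# Minimal sketch: stream plain text out of the compressed dump with
# gensim's WikiCorpus, writing one space-joined article per line.
from gensim.corpora import WikiCorpus

# Passing an empty dictionary skips the (slow) vocabulary-building pass
wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2', dictionary={})
with open('wiki.en.txt', 'w') as output:
    for tokens in wiki.get_texts():
        output.write(' '.join(tokens) + '\n')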
This time we will work with the English Wikipedia corpus only. As before, the first step is to download Wikipedia's latest packaged XML dump. In the directory of the latest English dumps, https://dumps.wikimedia.org/enwiki/latest/ , find and download "enwiki-latest-pages-articles.xml.bz2". This full English Wikipedia dump was packaged around April 4, 2017 and is about 13 GB; downloading it at home with wget over a 100 Mbps China Telecom line took me about 3 hours, which is decent.
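If the connection drops partway through a download this size, wget can resume it; for example:

wget -c https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2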
The next step is to process this compressed XML dump of the English Wikipedia. This time we use WikiExtractor:
WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump.
The tool is written in Python and requires Python 2.7 or Python 3.3+ but no additional library.
WikiExtractor is a Python script dedicated to extracting and cleaning text from Wikipedia dump data. It supports Python 2.7 or Python 3.3+, has no extra dependencies, and is very easy to install and use:
Install: git clone https://github.com/attardi/wikiextractor.git
cd wikiextractor/
sudo python setup.py install
Usage: WikiExtractor.py -o enwiki enwiki-latest-pages-articles.xml.bz2
......
INFO: 53665431  Pampapaul
INFO: 53665433  Charles Frederick Zimpel
INFO: Finished 11-process extraction of 5375019 articles in 8363.5s (642.7 art/s)
The whole process took a little over 2 hours and extracted roughly 5.37 million articles. For my machine configuration, see 《深度學習主機攢機小記》 (Notes on putting together a deep learning box).
The extracted files are split in order and stored across multiple subdirectories:
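The layout looks roughly like this (illustrative; WikiExtractor names the subdirectories AA, AB, AC, … and the files inside them wiki_00 through wiki_99):

enwiki/AA/wiki_00, enwiki/AA/wiki_01, ..., enwiki/AB/wiki_00, ...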
Each subdirectory in turn holds a number of files named wiki_num, each about 1 MB; this size can be controlled with the -b parameter:
-b n[KMG], --bytes n[KMG] maximum bytes per output file (default 1M)
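For example, to produce roughly 10 MB output files instead of the 1 MB default, the earlier call would become:

WikiExtractor.py -o enwiki -b 10M enwiki-latest-pages-articles.xml.bz2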
Let's look at the actual contents of wiki_00:
Anarchism
Anarchism is a political philosophy that advocates self-governed societies based on voluntary institutions. These are often described as stateless societies, although several authors have defined them more specifically as institutions based on non-hierarchical free associations. Anarchism holds the state to be undesirable, unnecessary, and harmful.
...
Criticisms of anarchism include moral criticisms and pragmatic criticisms. Anarchism is often evaluated as unfeasible or utopian by its critics.
Autism
Autism is a neurodevelopmental disorder characterized by impaired social interaction, verbal and non-verbal communication, and restricted and repetitive behavior. Parents usually notice signs in the first two years of their child's life. These signs often develop gradually, though some children with autism reach their developmental milestones at a normal pace and then regress. The diagnostic criteria require that symptoms become apparent in early childhood, typically before age three.
...
...
Each wiki_num file in turn holds a number of docs, and every doc carries tags such as id, url and title, so they are easy to tell apart.
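Schematically, each doc is wrapped like this (the attribute values here are illustrative, not copied from the dump):

<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">
Anarchism

Anarchism is a political philosophy that advocates ...
</doc>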
Here we process the English Wikipedia data following the "memory-friendly iterator" approach from the word2vec tutorial by gensim's author. The code is below and has also been pushed to GitHub: train_word2vec_with_gensim.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Pan Yang (panyangnlp@gmail.com)
# Copyright 2017 @ Yu Zhen

import gensim
import logging
import multiprocessing
import os
import re
import sys

from pattern.en import tokenize
from time import time

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)


def cleanhtml(raw_html):
    # Strip the <doc ...> / </doc> tags WikiExtractor wraps around articles
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', raw_html)
    return cleantext


class MySentences(object):
    """Memory-friendly iterator: streams one tokenized line at a time."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for root, dirs, files in os.walk(self.dirname):
            for filename in files:
                file_path = root + '/' + filename
                for line in open(file_path):
                    sline = line.strip()
                    if sline == "":
                        continue
                    rline = cleanhtml(sline)
                    tokenized_line = ' '.join(tokenize(rline))
                    # Keep lowercase alphabetic tokens only
                    is_alpha_word_line = [word for word in
                                          tokenized_line.lower().split()
                                          if word.isalpha()]
                    yield is_alpha_word_line


if __name__ == '__main__':
    if len(sys.argv) != 2:
        print "Please use python train_word2vec_with_gensim.py data_path"
        exit()
    data_path = sys.argv[1]
    begin = time()

    sentences = MySentences(data_path)
    model = gensim.models.Word2Vec(sentences,
                                   size=200,
                                   window=10,
                                   min_count=10,
                                   workers=multiprocessing.cpu_count())
    model.save("data/model/word2vec_gensim")
    model.wv.save_word2vec_format("data/model/word2vec_org",
                                  "data/model/vocabulary",
                                  binary=False)

    end = time()
    print "Total procesing time: %d seconds" % (end - begin)
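Incidentally, the tokenization step above depends on pattern.en.tokenize. If you would rather avoid the pattern dependency, a hypothetical nltk-based variant of just that step might look like the sketch below (my illustration, not the author's code; it assumes nltk is installed and the 'punkt' tokenizer models have been downloaded):

from nltk.tokenize import sent_tokenize, word_tokenize

def tokenize_line(rline):
    # Split into sentences first, then into words, roughly mirroring
    # pattern.en.tokenize, which yields one space-separated string per sentence.
    words = []
    for sent in sent_tokenize(rline):
        words.extend(word_tokenize(sent))
    return ' '.join(words)

Inside MySentences.__iter__, the line tokenized_line = ' '.join(tokenize(rline)) would then become tokenized_line = tokenize_line(rline).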
Note that word tokenization here uses the English tokenize module from pattern; nltk's word_tokenize module could be used instead with a small change (as sketched above), though nltk does not handle the word tokenization of some sentence-final words very well. We set the word vector dimensionality to 200, the window size to 10, and the minimum word count to 10, and filter out punctuation and non-English tokens via isalpha(). Now we can train a Word2Vec model on the English Wikipedia with this script: python train_word2vec_with_gensim.py enwiki
2017-04-22 14:31:04,703 : INFO : collecting all words and their counts
2017-04-22 14:31:04,704 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-04-22 14:31:06,442 : INFO : PROGRESS: at sentence #10000, processed 480546 words, keeping 33925 word types
2017-04-22 14:31:08,104 : INFO : PROGRESS: at sentence #20000, processed 983240 words, keeping 51765 word types
2017-04-22 14:31:09,685 : INFO : PROGRESS: at sentence #30000, processed 1455218 words, keeping 64982 word types
2017-04-22 14:31:11,349 : INFO : PROGRESS: at sentence #40000, processed 1957479 words, keeping 76112 word types
......
2017-04-23 02:50:59,844 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-04-23 02:50:59,844 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-04-23 02:50:59,854 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-04-23 02:50:59,854 : INFO : training on 8903084745 raw words (6742578791 effective words) took 37805.2s, 178351 effective words/s
2017-04-23 02:50:59,855 : INFO : saving Word2Vec object under data/model/word2vec_gensim, separately None
2017-04-23 02:50:59,855 : INFO : not storing attribute syn0norm
2017-04-23 02:50:59,855 : INFO : storing np array 'syn0' to data/model/word2vec_gensim.wv.syn0.npy
2017-04-23 02:51:00,241 : INFO : storing np array 'syn1neg' to data/model/word2vec_gensim.syn1neg.npy
2017-04-23 02:51:00,574 : INFO : not storing attribute cum_table
2017-04-23 02:51:13,886 : INFO : saved data/model/word2vec_gensim
2017-04-23 02:51:13,886 : INFO : storing vocabulary in data/model/vocabulary
2017-04-23 02:51:17,480 : INFO : storing 868777x200 projection weights into data/model/word2vec_org
Total procesing time: 44476 seconds
Training took a little over 12 hours, and the trained files are stored under data/model:
Let's test this English Wikipedia Word2Vec model:
textminer@textminer:/opt/wiki/data$ ipython
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
Type "copyright", "credits" or "license" for more information.

IPython 2.4.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: from gensim.models import Word2Vec

In [2]: en_wiki_word2vec_model = Word2Vec.load('data/model/word2vec_gensim')
First, let's query the most similar words (word similarity) for a few terms:
word:
In [3]: en_wiki_word2vec_model.most_similar('word')
Out[3]:
[('phrase', 0.8129693269729614),
 ('meaning', 0.7311851978302002),
 ('words', 0.7010501623153687),
 ('adjective', 0.6805518865585327),
 ('noun', 0.6461974382400513),
 ('suffix', 0.6440576314926147),
 ('verb', 0.6319557428359985),
 ('loanword', 0.6262609958648682),
 ('proverb', 0.6240501403808594),
 ('pronunciation', 0.6105246543884277)]
similarity:

In [4]: en_wiki_word2vec_model.most_similar('similarity')
Out[4]:
[('similarities', 0.8517599701881409),
 ('resemblance', 0.786037266254425),
 ('resemblances', 0.7496883869171143),
 ('affinities', 0.6571112275123596),
 ('differences', 0.6465682983398438),
 ('dissimilarities', 0.6212711930274963),
 ('correlation', 0.6071442365646362),
 ('dissimilarity', 0.6062943935394287),
 ('variation', 0.5970577001571655),
 ('difference', 0.5928016901016235)]
nlp:
In [5]: en_wiki_word2vec_model.most_similar('nlp')
Out[5]:
[('neurolinguistic', 0.6698148250579834),
 ('psycholinguistic', 0.6388964056968689),
 ('connectionism', 0.6027182936668396),
 ('semantics', 0.5866401195526123),
 ('connectionist', 0.5865628719329834),
 ('bandler', 0.5837364196777344),
 ('phonics', 0.5733655691146851),
 ('psycholinguistics', 0.5613113641738892),
 ('bootstrapping', 0.559638261795044),
 ('psychometrics', 0.5555593967437744)]
learn:

In [6]: en_wiki_word2vec_model.most_similar('learn')
Out[6]:
[('teach', 0.7533557415008545),
 ('understand', 0.71148681640625),
 ('discover', 0.6749690771102905),
 ('learned', 0.6599283218383789),
 ('realize', 0.6390970349311829),
 ('find', 0.6308424472808838),
 ('know', 0.6171890497207642),
 ('tell', 0.6146825551986694),
 ('inform', 0.6008728742599487),
 ('instruct', 0.5998791456222534)]
man:
In [7]: en_wiki_word2vec_model.most_similar('man')
Out[7]:
[('woman', 0.7243080735206604),
 ('boy', 0.7029494047164917),
 ('girl', 0.6441491842269897),
 ('stranger', 0.63275545835495),
 ('drunkard', 0.6136815547943115),
 ('gentleman', 0.6122575998306274),
 ('lover', 0.6108279228210449),
 ('thief', 0.609005331993103),
 ('beggar', 0.6083744764328003),
 ('person', 0.597919225692749)]
Now let's look at a few other related interfaces:
In [8]: en_wiki_word2vec_model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
Out[8]: [('queen', 0.7752252817153931)]

In [9]: en_wiki_word2vec_model.similarity('woman', 'man')
Out[9]: 0.72430799548282099

In [10]: en_wiki_word2vec_model.doesnt_match("breakfast cereal dinner lunch".split())
Out[10]: 'cereal'
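Since the training script also saved the vectors in the plain word2vec text format (word2vec_org, plus a separate vocabulary file with word counts), they can be reloaded without the full trainable model. A minimal sketch, assuming gensim's KeyedVectors interface (available since gensim 1.0):

from gensim.models import KeyedVectors

# Load the text-format vectors written by save_word2vec_format;
# fvocab points at the separately stored vocabulary (count) file.
word_vectors = KeyedVectors.load_word2vec_format('data/model/word2vec_org',
                                                 fvocab='data/model/vocabulary',
                                                 binary=False)
print word_vectors.most_similar('word', topn=3)

Loading this way is handy when you only need similarity queries rather than further training.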
I have tidied up the code for this article together with the code for the earlier 《中英文維基百科語料上的Word2Vec實驗》 and set up a Wikipedia_Word2vec project on GitHub; interested readers can take a look.
Note: this is an original article. When reposting, please credit the source and keep the link to 52nlp (我愛自然語言處理, "I Love Natural Language Processing"): http://www.52nlp.cn