Word2Vec in Practice with Gensim. This article is part of the author's Data Science and Machine Learning Handbook for Programmers; the accompanying code is in gensim.ipynb. Recommended prerequisite reading: Python Syntax Overview and Machine Learning Development Environment Setup, and the Scikit-Learn Cheat Sheet.
Gensim's Word2Vec model expects a list of tokenized sentences as input, i.e. a two-dimensional array. Here we temporarily use plain Python lists, although they consume a lot of RAM when the input dataset is large. Gensim itself only requires an iterable of ordered sentences, so in engineering practice we can use a custom generator that keeps only one sentence in memory at a time.
# import word2vec
from gensim.models import word2vec

# configure logging
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# the toy dataset
raw_sentences = ["the quick brown fox jumps over the lazy dogs", "yoyoyo you go home now to sleep"]

# tokenize: split each sentence into a list of words
sentences = [s.split() for s in raw_sentences]

# build the model
model = word2vec.Word2Vec(sentences, min_count=1)

# compare the similarity of two words
model.similarity('dogs', 'you')
Calling Word2Vec here to create the model actually iterates over the data twice: the first pass counts word frequencies to build the internal vocabulary structure, and the second pass trains the neural network. The two steps can also be run separately, which gives manual control for non-repeatable streams (e.g. streaming data from Kafka):
model = gensim.models.Word2Vec(iter=1)  # an empty model, no training yet
model.build_vocab(some_sentences)       # can be a non-repeatable, 1-pass generator
model.train(other_sentences)            # can be a non-repeatable, 1-pass generator
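The snippet above follows the older Gensim 0.x API. As a hedged sketch for newer Gensim releases (1.0 and later, an assumption about your installed version), train() must be given explicit corpus counts and an epoch number:

import gensim

model = gensim.models.Word2Vec(min_count=1)      # an empty model, no training yet
model.build_vocab(some_sentences)                # first pass: build the vocabulary
model.train(other_sentences,
            total_examples=model.corpus_count,   # sentence count recorded by build_vocab;
                                                 # pass the true count of other_sentences if it differs
            epochs=5)                            # explicit number of training passes

Here some_sentences and other_sentences are the same placeholder names used in the snippet above.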
min_count
model = Word2Vec(sentences, min_count=10) # default value is 5
The appropriate minimum word frequency differs with corpus size. In a larger corpus, for example, we usually want to ignore words that appear only once or twice, which we can control through the min_count parameter. In general, a reasonable value lies somewhere between 0 and 100.
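As a quick hedged illustration (model_all and model_rare are hypothetical names, and the wv.vocab attribute assumes a Gensim 1.x–3.x install; Gensim 4.x exposes wv.key_to_index instead), raising min_count simply shrinks the vocabulary:

# Assumes `sentences` is a reasonably large tokenized corpus; on the tiny two-sentence
# example above, min_count=10 would leave the vocabulary empty and raise an error.
model_all = word2vec.Word2Vec(sentences, min_count=1)
model_rare = word2vec.Word2Vec(sentences, min_count=10)
# words seen fewer than min_count times are dropped from the vocabulary
print(len(model_all.wv.vocab), len(model_rare.wv.vocab))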
size
The size parameter sets the dimensionality of the word vectors (the size of the neural network's hidden layer); Word2Vec's default is 100. Larger values require more training data but can improve overall accuracy; reasonable settings range from the tens to several hundred.
model = Word2Vec(sentences, size=200) # default value is 100
workers
The workers parameter sets the number of worker threads used for parallel training; it only takes effect when Cython is installed:
model = Word2Vec(sentences, workers=4) # default = 1 worker = no parallelization
In real training scenarios we usually train on a much larger corpus. Taking the official Word2Vec text8 corpus as an example, we only need to change the corpus fed to the model:
sentences = word2vec.Text8Corpus('text8')
model = word2vec.Word2Vec(sentences, size=200)
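If the text8 file is not already on disk, here is a minimal sketch for fetching and unzipping it with the Python 3 standard library (the URL comes from the Text8Corpus docstring quoted below):

import urllib.request
import zipfile

# download the zipped corpus and extract the plain-text 'text8' file into the current directory
urllib.request.urlretrieve('http://mattmahoney.net/dc/text8.zip', 'text8.zip')
with zipfile.ZipFile('text8.zip') as zf:
    zf.extractall('.')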
The sentences in this corpus are already tokenized, so they can be used directly. The author hit an error the first time using this class, so the relevant Gensim source code is pasted below for reference; it also makes it easier to adapt the approach for other corpora later:
# excerpt from gensim/models/word2vec.py; relies on gensim's `utils` module and the MAX_WORDS_IN_BATCH constant
class Text8Corpus(object):
    """Iterate over sentences from the "text8" corpus, unzipped from http://mattmahoney.net/dc/text8.zip ."""
    def __init__(self, fname, max_sentence_length=MAX_WORDS_IN_BATCH):
        self.fname = fname
        self.max_sentence_length = max_sentence_length

    def __iter__(self):
        # the entire corpus is one gigantic line -- there are no sentence marks at all
        # so just split the sequence of tokens arbitrarily: 1 sentence = 1000 tokens
        sentence, rest = [], b''
        with utils.smart_open(self.fname) as fin:
            while True:
                text = rest + fin.read(8192)  # avoid loading the entire file (=1 line) into RAM
                if text == rest:  # EOF
                    words = utils.to_unicode(text).split()
                    sentence.extend(words)  # return the last chunk of words, too (may be shorter/longer)
                    if sentence:
                        yield sentence
                    break
                last_token = text.rfind(b' ')  # last token may have been split in two... keep for next iteration
                words, rest = (utils.to_unicode(text[:last_token]).split(),
                               text[last_token:].strip()) if last_token >= 0 else ([], text)
                sentence.extend(words)
                while len(sentence) >= self.max_sentence_length:
                    yield sentence[:self.max_sentence_length]
                    sentence = sentence[self.max_sentence_length:]
As mentioned above, when the input corpus is very large or spread across multiple files on disk, we can feed the model an iterator instead of reading everything into memory at once, which saves RAM:
import os
import gensim

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()

sentences = MySentences('/some/directory')  # a memory-friendly iterator
model = gensim.models.Word2Vec(sentences)
We can save a trained model to disk and load it back later:

model.save('text8.model')
2015-02-24 11:19:26,059 : INFO : saving Word2Vec object under text8.model, separately None
2015-02-24 11:19:26,060 : INFO : not storing attribute syn0norm
2015-02-24 11:19:26,060 : INFO : storing numpy array 'syn0' to text8.model.syn0.npy
2015-02-24 11:19:26,742 : INFO : storing numpy array 'syn1' to text8.model.syn1.npy

model1 = Word2Vec.load('text8.model')

model.save_word2vec_format('text.model.bin', binary=True)
2015-02-24 11:19:52,341 : INFO : storing 71290x200 projection weights into text.model.bin

model1 = word2vec.Word2Vec.load_word2vec_format('text.model.bin', binary=True)
2015-02-24 11:22:08,185 : INFO : loading projection weights from text.model.bin
2015-02-24 11:22:10,322 : INFO : loaded (71290, 200) matrix from text.model.bin
2015-02-24 11:22:10,322 : INFO : precomputing L2-norms of word weight vectors
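A hedged note for newer Gensim releases (1.0 and later, an assumption about your install): the word2vec text/binary format helpers have moved to the KeyedVectors object, so the equivalent calls look roughly like this:

from gensim.models import KeyedVectors

# save only the word vectors in the classic word2vec binary format
model.wv.save_word2vec_format('text.model.bin', binary=True)
# load them back as a standalone KeyedVectors instance (lookup only, no further training)
wv = KeyedVectors.load_word2vec_format('text.model.bin', binary=True)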
Word2Vec's best-known capability is inferring related words in a semantic way:
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.50882536)]

model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'

model.similarity('woman', 'man')
0.73723527

model.most_similar(['man'])
[(u'woman', 0.5686948895454407), (u'girl', 0.4957364797592163), (u'young', 0.4457539916038513), (u'luckiest', 0.4420626759529114), (u'serpent', 0.42716869711875916), (u'girls', 0.42680859565734863), (u'smokes', 0.4265017509460449), (u'creature', 0.4227582812309265), (u'robot', 0.417464017868042), (u'mortal', 0.41728296875953674)]
If we want the raw vector representation of a word, we can access it directly by indexing:
model['computer']  # raw NumPy vector of a word
array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)
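In newer Gensim releases the recommended access path is through the wv attribute (indexing the model directly was deprecated in 1.0 and removed in 4.0, an assumption about your installed version):

vector = model.wv['computer']  # the same raw NumPy vector, accessed via the KeyedVectors object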
Word2Vec training is unsupervised, so there is no objective evaluation metric of the kind used in supervised learning; how good a model is depends mostly on the end application. Google has published roughly 20,000 syntactic and semantic test cases, each following the pattern "A is to B as C is to D" (the questions-words.txt file used below):
model.accuracy('/tmp/questions-words.txt')
2014-02-01 22:14:28,387 : INFO : family: 88.9% (304/342)
2014-02-01 22:29:24,006 : INFO : gram1-adjective-to-adverb: 32.4% (263/812)
2014-02-01 22:36:26,528 : INFO : gram2-opposite: 50.3% (191/380)
2014-02-01 23:00:52,406 : INFO : gram3-comparative: 91.7% (1222/1332)
2014-02-01 23:13:48,243 : INFO : gram4-superlative: 87.9% (617/702)
2014-02-01 23:29:52,268 : INFO : gram5-present-participle: 79.4% (691/870)
2014-02-01 23:57:04,965 : INFO : gram7-past-tense: 67.1% (995/1482)
2014-02-02 00:15:18,525 : INFO : gram8-plural: 89.6% (889/992)
2014-02-02 00:28:18,140 : INFO : gram9-plural-verbs: 68.7% (482/702)
2014-02-02 00:28:18,140 : INFO : total: 74.3% (5654/7614)
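A hedged note: in newer Gensim releases (3.4 and later, an assumption about your install) the analogy benchmark lives on the KeyedVectors object as evaluate_word_analogies, which returns the overall score directly:

# returns the overall accuracy plus per-section details
score, sections = model.wv.evaluate_word_analogies('/tmp/questions-words.txt')
print(score)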
It is worth stressing again that good scores on this test set do not guarantee Word2Vec will perform well in your actual application; you still need to evaluate it in context.