GitHub download address for the official word2vec: https://github.com/svn2github/word2vec
Environment: Linux (Ubuntu 14.04 LTS) with git installed; gcc version 4.8.4
Installation on Linux:
% git clone https://github.com/svn2github/word2vec.git
% cd word2vec
% make
Command-line options:
-train <file>
Use text data from <file> to train the model
-output <file>
Use <file> to save the resulting word vectors / word clusters
-size <int>
Set size of word vectors; default is 100
-window <int>
Set max skip length between words; default is 5
-sample <float>
Set threshold for occurrence of words. Those that appear with higher frequency in the training data
will be randomly down-sampled; default is 1e-3, useful range is (0, 1e-5)
-hs <int>
Use Hierarchical Softmax; default is 0 (not used)
-negative <int>
Number of negative examples; default is 5, common values are 3 - 10 (0 = not used)
-threads <int>
Use <int> threads (default 12)
-iter <int>
Run more training iterations (default 5)
-min-count <int>
This will discard words that appear less than <int> times; default is 5
-alpha <float>
Set the starting learning rate; default is 0.025 for skip-gram and 0.05 for CBOW
-classes <int>
Output word classes rather than word vectors; default number of classes is 0 (vectors are written)
-debug <int>
Set the debug mode (default = 2 = more info during training)
-binary <int>
Save the resulting vectors in binary mode; default is 0 (off)
-save-vocab <file>
The vocabulary will be saved to <file>
-read-vocab <file>
The vocabulary will be read from <file>, not constructed from the training data
-cbow <int>
Use the continuous bag of words model; default is 1 (use 0 for skip-gram model)
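To make the -sample option a little more concrete, here is a minimal Python sketch of the down-sampling rule. It assumes the rule used in word2vec.c is keep = (sqrt(f / s) + 1) * s / f, where f is the word's share of all training tokens and s is the -sample threshold; the function name keep_probability is purely illustrative and not part of the tool.

```python
import math


def keep_probability(word_freq, sample=1e-3):
    """Chance that a single occurrence of a word is kept during training.

    word_freq: the word's count divided by the total number of training tokens.
    sample:    the value passed to -sample (0 disables down-sampling).
    Assumed rule: (sqrt(f / s) + 1) * s / f, capped at 1.
    """
    if sample <= 0 or word_freq <= 0:
        return 1.0  # down-sampling disabled, or nothing to down-sample
    ratio = word_freq / sample
    # (sqrt(ratio) + 1) / ratio is the same expression as (sqrt(f/s) + 1) * s / f
    return min(1.0, (math.sqrt(ratio) + 1.0) / ratio)
```

Under this assumption, a word that makes up 10% of the corpus keeps roughly 11% of its occurrences at the default threshold, while words at or below the threshold are always kept.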
After that, all that remains is to prepare the training corpus: concatenate the word-segmented files into a single line, then train:
./word2vec -train fudan_corpus_final -output fudan_100_skip.bin -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -binary 1 -sample 1e-4 -threads 20 -iter 15
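The corpus preparation step mentioned above (joining the segmented files into one line) could be sketched in Python roughly as follows. This assumes the segmented files are UTF-8 text with tokens separated by whitespace; the function name merge_segmented_files and its parameters are hypothetical, not something the original workflow prescribes.

```python
def merge_segmented_files(input_paths, output_path, encoding="utf-8"):
    """Join the whitespace-separated tokens of several files into one line.

    Returns the number of tokens written. Blank lines are skipped, and every
    line break in the inputs is replaced by a single space.
    """
    token_count = 0
    with open(output_path, "w", encoding=encoding) as out:
        for path in input_paths:
            with open(path, encoding=encoding) as src:
                for line in src:
                    tokens = line.split()
                    if not tokens:
                        continue  # skip empty lines
                    if token_count:
                        out.write(" ")  # separator between consecutive chunks
                    out.write(" ".join(tokens))
                    token_count += len(tokens)
        out.write("\n")  # end the single line with one newline
    return token_count
```

For example, merge_segmented_files(["part1.txt", "part2.txt"], "fudan_corpus_final") would produce a file that can be passed to -train; the two input file names here are placeholders.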
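The generated "fudan_100_skip.bin" file can be converted to plain-text (txt) form with gensim:
from gensim.models import KeyedVectors
# In gensim 1.0 and later, load_word2vec_format lives on KeyedVectors;
# older releases exposed it as word2vec.Word2Vec.load_word2vec_format.
model = KeyedVectors.load_word2vec_format('path/to/GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)
Note: on Windows, first switch to the gensim environment (activate gensim), then run the code.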
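However, the gensim-based approach above did not work for me, so I used a native method instead, taken from http://stackoverflow.com/questions/27324292/convert-word2vec-bin-file-to-text
Copy the C code from that link and save it as readbin.c.
Because readbin.c uses the math library, compile it with:
% gcc -o readbin readbin.c -lm
Then run it to convert the bin file to a txt file:
% ./readbin fudan_100_skip.bin fudan_100.txt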
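For readers who would rather avoid both gensim and a C compiler, the same conversion can be sketched with the Python standard library alone. This is not the code from the Stack Overflow link. It assumes the usual layout written by word2vec with -binary 1: an ASCII header line "vocab_size vector_size", then for each word its bytes, one space, vector_size 4-byte floats, and a newline. It also assumes the floats are little-endian, as on x86 machines. The function name convert_bin_to_txt is illustrative.

```python
import struct


def convert_bin_to_txt(bin_path, txt_path, encoding="utf-8"):
    """Convert a word2vec binary vector file to the plain-text layout.

    Returns the number of words converted.
    """
    with open(bin_path, "rb") as src, open(txt_path, "w", encoding=encoding) as dst:
        # Header: "<vocab_size> <vector_size>\n" in ASCII.
        vocab_size, vector_size = (int(x) for x in src.readline().split())
        dst.write(f"{vocab_size} {vector_size}\n")

        fmt = f"<{vector_size}f"  # assumption: little-endian 32-bit floats
        byte_len = struct.calcsize(fmt)

        for _ in range(vocab_size):
            # Read the word byte by byte up to the separating space.
            raw_word = bytearray()
            while True:
                ch = src.read(1)
                if not ch:
                    raise EOFError("file ended in the middle of a word")
                if ch == b" ":
                    break
                if ch != b"\n":  # skip the newline left after the previous vector
                    raw_word.extend(ch)
            word = raw_word.decode(encoding, errors="replace")

            data = src.read(byte_len)
            if len(data) != byte_len:
                raise EOFError("file ended in the middle of a vector")
            values = struct.unpack(fmt, data)

            dst.write(word + " " + " ".join(f"{v:.6f}" for v in values) + "\n")
    return vocab_size
```

Calling convert_bin_to_txt("fudan_100_skip.bin", "fudan_100.txt") would then play the same role as the ./readbin command, with one word and its vector per output line.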