Package Contents

To train your own GloVe vectors, first prepare your corpus as a single text file with all words separated by a single space. If your corpus has multiple documents, concatenate them with a single space between documents. If your documents are particularly short, padding the gap between documents with e.g. 5 "dummy" words may produce better vectors. Once you have created your corpus, you can train GloVe vectors using the following 4 tools. An example is included in demo.sh, which you can modify as necessary.

The four main tools in this package are:

1) vocab_count
This tool requires an input corpus that already consists of whitespace-separated tokens. Use something like the Stanford Tokenizer first on raw text. From the corpus, it constructs unigram counts and optionally thresholds the resulting vocabulary based on total vocabulary size or minimum frequency count.

2) cooccur
Constructs word-word cooccurrence statistics from a corpus. The user should supply a vocabulary file, as produced by vocab_count, and may specify a variety of parameters, as described by running ./build/cooccur.

3) shuffle
Shuffles the binary file of cooccurrence statistics produced by cooccur. For large files, the file is automatically split into chunks, each of which is shuffled and stored on disk before being merged and shuffled together. The user may specify a number of parameters, as described by running ./build/shuffle.

4) glove
Trains the GloVe model on the specified cooccurrence data, which typically will be the output of the shuffle tool. The user should supply a vocabulary file, as given by vocab_count, and may specify a number of other parameters, which are described by running ./build/glove.
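The corpus format described above (one file, single spaces between words, documents joined by a space or by a few dummy words) can be sketched in Python. This is only an illustration and not part of the GloVe package: the function name `build_corpus`, its parameters, and the placeholder token `DUMMY` are all assumptions made for the example.

```python
def build_corpus(documents, pad_words=0, dummy="DUMMY"):
    """Join already-tokenized documents into one single-space-separated string.

    pad_words and dummy are hypothetical names: pad_words is how many
    placeholder tokens to put between consecutive documents.
    """
    # Collapse any run of whitespace (including newlines) to a single space.
    cleaned = [" ".join(doc.split()) for doc in documents]
    # Skip documents that are empty after cleaning.
    cleaned = [doc for doc in cleaned if doc]
    if pad_words > 0:
        separator = " " + " ".join([dummy] * pad_words) + " "
    else:
        separator = " "
    return separator.join(cleaned)


if __name__ == "__main__":
    docs = ["the cat sat", "on  the\nmat"]
    print(build_corpus(docs, pad_words=2))
```

The resulting string can be written to a text file and used as the corpus. Whether padding helps, and how many dummy words to use, depends on the data; the README only suggests it for particularly short documents.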
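To train your own GloVe word vectors, first prepare a single file containing your corpus, in which all words are separated by a single space. If the corpus consists of multiple documents, join them with a single space between each pair. If the documents are very short, padding the gaps between them with 5 "dummy" words may produce better word vectors. Once the corpus is ready, you can train GloVe vectors with the following 4 tools. demo.sh contains an example, which you can modify as needed.

The four main tools in the toolkit are:

(1) vocab_count
This tool requires that the input corpus is already in whitespace-separated form, so tokenize raw text first with something like the Stanford Tokenizer. It counts the unigrams in the corpus and optionally thresholds the resulting vocabulary by total vocabulary size or by minimum frequency count.

(2) cooccur
Builds word-word cooccurrence statistics from the corpus. The user should supply a vocabulary file produced by vocab_count and may specify a number of parameters, as described when running ./build/cooccur.

(3) shuffle
Shuffles the binary file of cooccurrence statistics produced by cooccur. A large file is split into chunks automatically; each chunk is shuffled and stored on disk, and the chunks are then merged and shuffled together. The user may specify a number of parameters, as described when running ./build/shuffle.

(4) glove
Trains the GloVe model on the specified cooccurrence data, which is usually the output of the shuffle tool. The user should supply a vocabulary file produced by vocab_count and may specify a number of other parameters, as described when running ./build/glove.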
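The unigram counting and thresholding that vocab_count is described as performing can be illustrated with a short Python sketch. This is not the tool's implementation: the function name `count_vocab`, the parameter names `min_count` and `max_vocab`, and the alphabetical tie-breaking are assumptions chosen for the example.

```python
from collections import Counter


def count_vocab(corpus, min_count=1, max_vocab=None):
    """Return (word, count) pairs for a whitespace-tokenized corpus string.

    min_count drops words seen fewer than that many times; max_vocab
    keeps only the most frequent entries. Both names are hypothetical.
    """
    counts = Counter(corpus.split())
    # Sort by descending count; breaking ties alphabetically is an assumption.
    items = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Threshold by minimum frequency.
    items = [(word, count) for word, count in items if count >= min_count]
    # Threshold by total vocabulary size.
    if max_vocab is not None:
        items = items[:max_vocab]
    return items


if __name__ == "__main__":
    for word, count in count_vocab("the cat sat on the mat the end", max_vocab=3):
        print(word, count)
```

For real training data, use the compiled vocab_count tool itself, since cooccur and glove expect the vocabulary file it produces.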