Processing an English Corpus with word2vec Word Vectors

 

Introduction to word2vec

         Official word2vec site: https://code.google.com/p/word2vec/

  • word2vec is an open-source tool from Google that computes the distance between words, given an input set of words.
  • It converts terms into vector form, reducing the processing of text to vector operations in a vector space; similarity in that space then stands in for the semantic similarity of the text.
  • word2vec measures cosine similarity, which in general lies between -1 and 1; the larger the value, the more closely two words are related.
  • Word vectors: words represented by a Distributed Representation, also commonly called a "Word Representation" or "Word Embedding".
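The cosine measure described above can be sketched in a few lines of plain Python (an illustration of the formula only, not gensim's actual implementation):

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two toy 3-dimensional "word vectors" (made-up numbers, purely to
# illustrate the computation; real word2vec vectors have 100+ dimensions):
v1 = [0.9, 0.1, 0.3]
v2 = [0.8, 0.2, 0.35]
print(cosine_similarity(v1, v2))  # close to 1: the vectors point the same way
```

Vectors pointing in the same direction score 1, orthogonal vectors score 0, and opposite vectors score -1.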

Usage

Both running and testing require the text8 and questions-words.txt files. Corpus download: http://mattmahoney.net/dc/text8.zip
The corpus is UTF-8 encoded and stored as a single line. Training info: training on 85026035 raw words (62529137 effective words) took 197.4s, 316692 effective words/s

word2vec command-line parameters explained

-train  training data
-output  output file, containing one vector per word
-cbow  whether to use the CBOW model: 0 for skip-gram, 1 for CBOW. The default is skip-gram; CBOW is faster, while skip-gram generally gives better results
-size  dimensionality of the output word vectors
-window  training window size; 8 means each word considers the 8 words before and the 8 words after it (the code actually picks a random effective window size per word, up to the given value)
-negative  number of negative samples for the negative-sampling (NEG) method; 0 disables it
-hs  whether to use hierarchical softmax (HS): 0 disables it, 1 enables it
-sample  subsampling threshold; the more frequent a word is in the training data, the more likely it is to be down-sampled
-binary  whether the output file is stored in binary: 0 for plain text (which can be opened and inspected), 1 for binary, as in vectors.bin
-alpha  initial learning rate
-min-count  minimum frequency, default 5; words occurring fewer times than this threshold in the corpus are discarded
-classes  number of word clusters; the source code shows the clustering uses k-means
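Most of these C-tool flags have counterparts in the keyword arguments of gensim's Python `Word2Vec` constructor, which the script below uses. The mapping sketched here reflects the 2016-era gensim API and should be verified against your installed version (in gensim 4.0 and later, for example, `size` was renamed `vector_size`):

```python
# Rough correspondence between the C tool's flags and gensim Word2Vec
# keyword arguments (2016-era gensim names; verify against your version):
FLAG_TO_GENSIM_KWARG = {
    "-train":     "sentences",   # gensim takes an iterable of token lists
    "-size":      "size",        # renamed `vector_size` in gensim >= 4.0
    "-window":    "window",
    "-cbow":      "sg",          # inverted: -cbow 1 -> sg=0 (CBOW), -cbow 0 -> sg=1 (skip-gram)
    "-hs":        "hs",
    "-negative":  "negative",
    "-sample":    "sample",
    "-alpha":     "alpha",
    "-min-count": "min_count",
}
print(sorted(FLAG_TO_GENSIM_KWARG))
```

Note the inversion for `-cbow`: gensim's flag is `sg` (skip-gram), so `-cbow 1` corresponds to `sg=0`.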

 

 

Code

# -*- coding: utf-8 -*-

"""
Purpose: test gensim usage
Date: 2016-05-02 18:00:00
"""

from gensim.models import word2vec
import logging

# Main program
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = word2vec.Text8Corpus("data/text8")  # load the corpus
model = word2vec.Word2Vec(sentences, size=200)  # train with gensim's defaults: window=5, sg=0 (CBOW)

# Compute the similarity/relatedness of two words
y1 = model.similarity("woman", "man")
print u"Similarity between woman and man:", y1
print "--------\n"

# List the words most related to a given word
y2 = model.most_similar("good", topn=20)  # the 20 most related words
print u"Words most related to good:\n"
for item in y2:
    print item[0], item[1]
print "--------\n"

# Solve analogies
print ' "boy" is to "father" as "girl" is to ...? \n'
y3 = model.most_similar(['girl', 'father'], ['boy'], topn=3)
for item in y3:
    print item[0], item[1]
print "--------\n"

more_examples = ["he his she", "big bigger bad", "going went being"]
for example in more_examples:
    a, b, x = example.split()
    predicted = model.most_similar([x, b], [a])[0][0]
    print "'%s' is to '%s' as '%s' is to '%s'" % (a, b, x, predicted)
print "--------\n"

# Find the word that does not belong
y4 = model.doesnt_match("breakfast cereal dinner lunch".split())
print u"Odd one out:", y4
print "--------\n"

# Save the model for later reuse
model.save("text8.model")
# Corresponding load:
# model_2 = word2vec.Word2Vec.load("text8.model")

# Store the word vectors in a format the original C tool can parse
model.save_word2vec_format("text8.model.bin", binary=True)
# Corresponding load:
# model_3 = word2vec.Word2Vec.load_word2vec_format("text8.model.bin", binary=True)

if __name__ == "__main__":
    pass
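The `most_similar(['girl', 'father'], ['boy'])` call above is the vector-offset trick: add the positive vectors, subtract the negative ones, and return the nearest remaining word. A toy illustration with hand-made 3-dimensional vectors (the dimensions male/female/parent are invented here purely for demonstration, not learned embeddings):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hand-made "embeddings": dimensions [male, female, parent].
vecs = {
    "boy":      [1.0, 0.0, 0.0],
    "girl":     [0.0, 1.0, 0.0],
    "father":   [1.0, 0.0, 1.0],
    "mother":   [0.0, 1.0, 1.0],
    "daughter": [0.0, 1.0, 0.2],
    "uncle":    [1.0, 0.0, 0.8],
}

# girl + father - boy, as in most_similar(positive=['girl', 'father'], negative=['boy'])
query = [g + f - b for g, f, b in zip(vecs["girl"], vecs["father"], vecs["boy"])]

# Nearest word to the query, excluding the three input words:
candidates = {w: v for w, v in vecs.items() if w not in ("girl", "father", "boy")}
best = max(candidates, key=lambda w: cosine(query, candidates[w]))
print(best)  # -> mother
```

With real 200-dimensional trained vectors the same arithmetic produces the "mother / grandmother / daughter" ranking shown in the output below.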

Output when run on Ubuntu 16.04

2016-5-2 18:56:19,332 : INFO : collecting all words and their counts
2016-5-2 18:56:19,334 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2016-5-2 18:56:27,431 : INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
2016-5-2 18:56:27,740 : INFO : min_count=5 retains 71290 unique words (drops 182564)
2016-5-2 18:56:27,740 : INFO : min_count leaves 16718844 word corpus (98% of original 17005207)
2016-5-2 18:56:27,914 : INFO : deleting the raw counts dictionary of 253854 items
2016-5-2 18:56:27,947 : INFO : sample=0.001 downsamples 38 most-common words
2016-5-2 18:56:27,947 : INFO : downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
2016-5-2 18:56:27,947 : INFO : estimated required memory for 71290 words and 200 dimensions: 149709000 bytes
2016-5-2 18:56:28,176 : INFO : resetting layer weights
2016-5-2 18:56:29,074 : INFO : training model with 3 workers on 71290 vocabulary and 200 features, using sg=0 hs=0 sample=0.001 negative=5
2016-5-2 18:56:29,074 : INFO : expecting 1701 sentences, matching count from corpus used for vocabulary survey
2016-5-2 18:56:30,086 : INFO : PROGRESS: at 0.86% examples, 531932 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:56:31,088 : INFO : PROGRESS: at 1.72% examples, 528872 words/s, in_qsize 5, out_qsize 0
... (similar PROGRESS lines omitted) ...
2016-5-2 18:58:23,989 : INFO : PROGRESS: at 99.91% examples, 543692 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:58:24,067 : INFO : worker thread finished; awaiting finish of 2 more threads
2016-5-2 18:58:24,083 : INFO : worker thread finished; awaiting finish of 1 more threads
2016-5-2 18:58:24,086 : INFO : worker thread finished; awaiting finish of 0 more threads
2016-5-2 18:58:24,086 : INFO : training on 85026035 raw words (62534095 effective words) took 115.0s, 543725 effective words/s
2016-5-2 18:58:24,086 : INFO : precomputing L2-norms of word weight vectors
Similarity between woman and man: 0.699695936218
--------
Words most related to good:

bad 0.721469461918
poor 0.567566931248
safe 0.534923613071
luck 0.518905758858
courage 0.510788619518
useful 0.498157411814
quick 0.497716665268
easy 0.497328162193
everyone 0.485905945301
pleasure 0.483758479357
true 0.482762247324
simple 0.480014979839
practical 0.479516804218
fair 0.479104012251
happy 0.476968646049
wrong 0.476797521114
reasonable 0.476701617241
you 0.475801795721
fun 0.472196519375
helpful 0.471719056368
--------

 "boy" is to "father" as "girl" is to ...? 

mother 0.76334130764
grandmother 0.690031766891
daughter 0.684129178524
--------

'he' is to 'his' as 'she' is to 'her'
'big' is to 'bigger' as 'bad' is to 'worse'
'going' is to 'went' as 'being' is to 'was'
--------

Odd one out: cereal
--------

2016-5-2 18:58:24,185 : INFO : saving Word2Vec object under text8.model, separately None
2016-5-2 18:58:24,185 : INFO : storing numpy array 'syn1neg' to text8.model.syn1neg.npy
2016-5-2 18:58:24,235 : INFO : not storing attribute syn0norm
2016-5-2 18:58:24,235 : INFO : storing numpy array 'syn0' to text8.model.syn0.npy
2016-5-2 18:58:24,278 : INFO : not storing attribute cum_table
2016-5-2 18:58:25,083 : INFO : storing 71290x200 projection weights into text8.model.bin

Commonly used corpus resources

Below are some good Chinese corpora that can be downloaded online, for researchers to study and use.
(1) CASIA (Institute of Automation, Chinese Academy of Sciences) Chinese and English news corpora: http://www.datatang.com/data/13484
The Chinese news classification corpus was collected from Phoenix, Sina, NetEase, Tencent and other news portals. The English news classification corpus is the ModApte split of Reuters-21578.
(2) Sogou Chinese news corpus: http://www.sogou.com/labs/dl/c.html
Contains a large amount of Sohu news text with corresponding category labels. Versions of several sizes are available for download.
(3) Li Ronglu's Chinese corpus: http://www.datatang.com/data/11968
About 240 MB when compressed.
(4) Tan Songbo's Chinese text classification corpus: http://www.datatang.com/data/11970
Contains not only broad categories such as economy and sports, but also finer subcategories under each; sports, for example, includes basketball, football, and so on. It can serve as a corpus for hierarchical classification and is very practical. This mirror requires no download credits (Tan Songbo's homepage): http://www.searchforum.org.cn/tansongbo/corpus1.PHP
(5) NetEase categorized text data: http://www.datatang.com/data/11965
4000 documents in six categories, including sports, automobiles, and so on.
(6) Chinese text classification corpus: http://www.datatang.com/data/11963
Text in categories such as Arts and Literature.
(7) A more complete Sogou text classification corpus: http://www.sogou.com/labs/dl/c.html
Text classification corpora released by Sogou Labs, available for free download in several sizes.
(8) 2002 Chinese web page classification training set: http://www.datatang.com/data/15021

In the autumn of 2002, the Tianwang group of Peking University's Network and Distributed Systems Lab mobilized dozens of students from different majors to hand-pick a brand-new, large-scale, hierarchically organized sample set of Chinese web pages. It contains 11678 training page instances and 3630 test page instances, spread over 11 top-level categories.

 

Commonly used word segmentation tools

Segment the corpus and remove stopwords. Commonly used segmentation tools include:

StandardAnalyzer (Chinese and English), ChineseAnalyzer (Chinese), CJKAnalyzer (Chinese and English), IKAnalyzer (Chinese and English; also handles Korean and Japanese), paoding (Chinese), MMAnalyzer (Chinese and English), MMSeg4j (Chinese and English), imdict (Chinese and English), NLTK (Chinese and English), Jieba (Chinese and English).
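For an English corpus like text8, "segmentation" is just whitespace tokenization, and the interesting step is stopword removal. A minimal sketch (the stopword set here is a tiny illustrative sample, not a real list; Chinese text would first need one of the segmenters above, e.g. Jieba's `jieba.cut`, before this filtering step):

```python
# Tiny illustrative stopword set -- a real pipeline would load a full list.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "over"}

def preprocess(line):
    # Lowercase, split on whitespace, and drop stopwords.
    return [tok for tok in line.lower().split() if tok not in STOPWORDS]

print(preprocess("The quick brown fox jumps over the lazy dog"))
```

A gensim training corpus is then simply an iterable of such token lists.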

A demo corpus resource

Raw corpus: http://pan.baidu.com/s/1nviuFc1
Training corpus: http://pan.baidu.com/s/1kVEmNTd
