word2vec official site: https://code.google.com/p/word2vec/
Both training and testing need the text8 and questions-words.txt files. Corpus download: http://mattmahoney.net/dc/text8.zip
The corpus is UTF-8 encoded and stored as a single line. Training statistics: training on 85026035 raw words (62529137 effective words) took 197.4s, 316692 effective words/s
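Because the corpus is one long line with no sentence boundaries, a reader has to cut it into pseudo-sentences before training. The sketch below shows one way to do this with the standard library only. The name `iter_chunks` and its parameters are illustrative assumptions, not part of word2vec; a chunk size of 10000 words would be consistent with the "1701 sentences" that gensim reports for the 17005207-word corpus in the log further down.

```python
import io


def iter_chunks(stream, chunk_words=10000, read_size=65536):
    """Yield lists of at most `chunk_words` tokens from a whitespace-separated stream."""
    words = []
    tail = ""  # partial token carried over between reads
    while True:
        block = stream.read(read_size)
        if not block:
            break
        text = tail + block
        parts = text.split()
        # If the block ended in the middle of a token, hold that piece back.
        if parts and not text[-1].isspace():
            tail = parts.pop()
        else:
            tail = ""
        words.extend(parts)
        while len(words) >= chunk_words:
            yield words[:chunk_words]
            words = words[chunk_words:]
    if tail:
        words.append(tail)
    if words:
        yield words


# Tiny stand-in for the real one-line corpus file (hypothetical sample text).
demo = io.StringIO("anarchism originated as a term of abuse first used against")
chunks = list(iter_chunks(demo, chunk_words=4, read_size=7))
print(chunks)
```

Reading in fixed-size blocks keeps memory use flat even though the file has no line breaks; for the real corpus one would open the file with `io.open("data/text8", encoding="utf-8")` instead of the in-memory sample.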
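-train the training data file
-output the output file for the results, i.e. the vector of every word
-cbow selects the model: 0 means skip-gram, 1 means CBOW. In the version used here the default is skip-gram (later releases of word2vec.c default to 1, so check the help output of your build). CBOW trains faster; skip-gram usually gives somewhat better vectors
-size the dimensionality of the output word vectors
-window the training window size; 8 means that up to 8 words before and 8 words after each word are considered (the code also shrinks the window at random for every position, so the effective window is at most the configured value; the default is 5)
-negative whether to use negative sampling (NEG): 0 disables it, and a positive value is the number of negative samples drawn per training example
-hs whether to use hierarchical softmax (HS): 0 disables it, 1 enables it
-sample the subsampling threshold: the more frequent a word is in the training data, the more likely its occurrences are to be randomly dropped
-binary whether the result file is stored in binary: 0 means plain text (which can be opened and inspected), 1 means binary, i.e. the vectors.bin storage format
-alpha the learning rate
-min-count the minimum frequency, 5 by default; a word that occurs fewer times than this threshold in the corpus is discarded
-classes the number of word clusters; the source code shows that the clustering uses k-means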
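To make the -window, -sample and -min-count descriptions above more concrete, here is a minimal sketch of the three rules in plain Python. It follows my reading of word2vec.c; the function names are hypothetical, and Python's `random` stands in for the C program's own random number generator.

```python
import math
import random


def effective_window(window, rng):
    """Window actually used at one position: an integer in [1, window]."""
    b = rng.randrange(window)  # plays the role of `b = next_random % window`
    return window - b


def keep_probability(count, total_words, sample):
    """Probability of keeping one occurrence of a word seen `count` times."""
    if sample <= 0:
        return 1.0  # subsampling disabled
    threshold = sample * total_words
    p = (math.sqrt(count / threshold) + 1) * threshold / count
    return min(p, 1.0)


def prune_vocab(counts, min_count=5):
    """Drop words that occur fewer than `min_count` times."""
    return {w: c for w, c in counts.items() if c >= min_count}


rng = random.Random(0)
windows = [effective_window(8, rng) for _ in range(1000)]
print(min(windows), max(windows))

# A word that makes up 10% of a 1,000,000-word corpus is mostly dropped,
# while a rare word is always kept.
print(keep_probability(100000, 1000000, 0.001))
print(keep_probability(10, 1000000, 0.001))

print(prune_vocab({"the": 10, "rare": 2}))
```

The random shrinking means that nearby context words are used more often than distant ones, which is why the configured window is only an upper bound.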
    # -*- coding: utf-8 -*-

    """
    Purpose: try out gensim
    Date: 2016-05-02 18:00:00
    """

    from __future__ import print_function  # print() works on Python 2 and 3

    import logging

    from gensim.models import word2vec

    # Written against the 2016-era gensim API. In gensim 4.x, `size` is named
    # `vector_size` and the query/export methods used below live on `model.wv`.

    # Main program
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    sentences = word2vec.Text8Corpus("data/text8")  # load the corpus
    # gensim's default is sg=0, i.e. CBOW (the log shows "using sg=0");
    # pass sg=1 to train skip-gram. The default window is 5.
    model = word2vec.Word2Vec(sentences, size=200)

    # Similarity / relatedness of two words
    y1 = model.similarity("woman", "man")
    print(u"Similarity between woman and man:", y1)
    print("--------\n")

    # Words most related to a given word
    y2 = model.most_similar("good", topn=20)  # the 20 most related words
    print(u"Words most similar to good:\n")
    for item in y2:
        print(item[0], item[1])
    print("--------\n")

    # Analogies
    print(' "boy" is to "father" as "girl" is to ...? \n')
    y3 = model.most_similar(['girl', 'father'], ['boy'], topn=3)
    for item in y3:
        print(item[0], item[1])
    print("--------\n")

    more_examples = ["he his she", "big bigger bad", "going went being"]
    for example in more_examples:
        a, b, x = example.split()
        predicted = model.most_similar([x, b], [a])[0][0]
        print("'%s' is to '%s' as '%s' is to '%s'" % (a, b, x, predicted))
    print("--------\n")

    # Find the word that does not belong
    y4 = model.doesnt_match("breakfast cereal dinner lunch".split())
    print(u"Odd one out:", y4)
    print("--------\n")

    # Save the model for reuse
    model.save("text8.model")
    # Matching way to load it:
    # model_2 = word2vec.Word2Vec.load("text8.model")

    # Store the word vectors in a form the C tools can parse
    model.save_word2vec_format("text8.model.bin", binary=True)
    # Matching way to load it:
    # model_3 = word2vec.Word2Vec.load_word2vec_format("text8.model.bin", binary=True)

    if __name__ == "__main__":
        pass
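The similarity and analogy queries in the script reduce to cosine similarity between word vectors. The following sketch reproduces the idea on hand-made two-dimensional toy vectors; the vectors, the helper names `unit`, `cosine` and `analogy`, and the exact scoring are assumptions for illustration, and gensim's own implementation is vectorised and may differ in detail.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity of two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine to (sum of positive units - sum of negative units)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    skip = set(positive) | set(negative)  # never return the query words
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in skip]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy vectors: first axis loosely "gender", second axis loosely "generation".
toy = {
    "boy": [1.0, 0.0],
    "girl": [-1.0, 0.0],
    "father": [1.0, 1.0],
    "mother": [-1.0, 1.0],
    "son": [1.0, -0.2],
    "daughter": [-1.0, -0.2],
}

print(cosine(toy["father"], toy["mother"]))
print(analogy(toy, positive=["girl", "father"], negative=["boy"], topn=3))
```

On these toy vectors the top answer to "boy is to father as girl is to ...?" is "mother", mirroring the shape of the real query in the script.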
    2016-5-2 18:56:19,332 : INFO : collecting all words and their counts
    2016-5-2 18:56:19,334 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
    2016-5-2 18:56:27,431 : INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
    2016-5-2 18:56:27,740 : INFO : min_count=5 retains 71290 unique words (drops 182564)
    2016-5-2 18:56:27,740 : INFO : min_count leaves 16718844 word corpus (98% of original 17005207)
    2016-5-2 18:56:27,914 : INFO : deleting the raw counts dictionary of 253854 items
    2016-5-2 18:56:27,947 : INFO : sample=0.001 downsamples 38 most-common words
    2016-5-2 18:56:27,947 : INFO : downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
    2016-5-2 18:56:27,947 : INFO : estimated required memory for 71290 words and 200 dimensions: 149709000 bytes
    2016-5-2 18:56:28,176 : INFO : resetting layer weights
    2016-5-2 18:56:29,074 : INFO : training model with 3 workers on 71290 vocabulary and 200 features, using sg=0 hs=0 sample=0.001 negative=5
    2016-5-2 18:56:29,074 : INFO : expecting 1701 sentences, matching count from corpus used for vocabulary survey
    2016-5-2 18:56:30,086 : INFO : PROGRESS: at 0.86% examples, 531932 words/s, in_qsize 6, out_qsize 0
    2016-5-2 18:56:31,088 : INFO : PROGRESS: at 1.72% examples, 528872 words/s, in_qsize 5, out_qsize 0
    2016-5-2 18:56:32,108 : INFO : PROGRESS: at 2.68% examples, 549248 words/s, in_qsize 6, out_qsize 0
    ... (109 similar PROGRESS lines omitted) ...
    2016-5-2 18:58:22,994 : INFO : PROGRESS: at 99.07% examples, 543908 words/s, in_qsize 4, out_qsize 2
    2016-5-2 18:58:23,989 : INFO : PROGRESS: at 99.91% examples, 543692 words/s, in_qsize 6, out_qsize 0
    2016-5-2 18:58:24,067 : INFO : worker thread finished; awaiting finish of 2 more threads
    2016-5-2 18:58:24,083 : INFO : worker thread finished; awaiting finish of 1 more threads
    2016-5-2 18:58:24,086 : INFO : worker thread finished; awaiting finish of 0 more threads
    2016-5-2 18:58:24,086 : INFO : training on 85026035 raw words (62534095 effective words) took 115.0s, 543725 effective words/s
    2016-5-2 18:58:24,086 : INFO : precomputing L2-norms of word weight vectors
    Similarity between woman and man: 0.699695936218
    --------
    Words most similar to good:

    bad 0.721469461918
    poor 0.567566931248
    safe 0.534923613071
    luck 0.518905758858
    courage 0.510788619518
    useful 0.498157411814
    quick 0.497716665268
    easy 0.497328162193
    everyone 0.485905945301
    pleasure 0.483758479357
    true 0.482762247324
    simple 0.480014979839
    practical 0.479516804218
    fair 0.479104012251
    happy 0.476968646049
    wrong 0.476797521114
    reasonable 0.476701617241
    you 0.475801795721
    fun 0.472196519375
    helpful 0.471719056368
    --------

     "boy" is to "father" as "girl" is to ...?

    mother 0.76334130764
    grandmother 0.690031766891
    daughter 0.684129178524
    --------

    'he' is to 'his' as 'she' is to 'her'
    'big' is to 'bigger' as 'bad' is to 'worse'
    'going' is to 'went' as 'being' is to 'was'
    --------

    Odd one out: cereal
    --------

    2016-5-2 18:58:24,185 : INFO : saving Word2Vec object under text8.model, separately None
    2016-5-2 18:58:24,185 : INFO : storing numpy array 'syn1neg' to text8.model.syn1neg.npy
    2016-5-2 18:58:24,235 : INFO : not storing attribute syn0norm
    2016-5-2 18:58:24,235 : INFO : storing numpy array 'syn0' to text8.model.syn0.npy
    2016-5-2 18:58:24,278 : INFO : not storing attribute cum_table
    2016-5-2 18:58:25,083 : INFO : storing 71290x200 projection weights into text8.model.bin
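The last log line writes the vectors in the word2vec binary layout. As a rough sketch of that layout as I understand it from word2vec.c: a text header "vocab_size dimensions", then for each word its UTF-8 bytes, a space, the raw 32-bit floats, and a newline. The function names below are hypothetical, and little-endian float order is an assumption (the C tool writes floats in the machine's native order).

```python
import io
import struct


def write_word2vec_binary(stream, vectors):
    """Write {word: [floats]} in the assumed word2vec binary layout."""
    dim = len(next(iter(vectors.values())))
    stream.write("{} {}\n".format(len(vectors), dim).encode("utf-8"))
    for word, vec in vectors.items():
        stream.write(word.encode("utf-8") + b" ")
        stream.write(struct.pack("<%df" % dim, *vec))  # little-endian float32
        stream.write(b"\n")


def read_word2vec_binary(stream):
    """Read the layout written above back into {word: [floats]}."""
    vocab_size, dim = (int(x) for x in stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        chars = []
        while True:
            ch = stream.read(1)
            if ch == b" " or ch == b"":
                break
            if ch != b"\n":  # skip the newline that ends the previous record
                chars.append(ch)
        word = b"".join(chars).decode("utf-8")
        vectors[word] = list(struct.unpack("<%df" % dim, stream.read(4 * dim)))
    return vectors


# Round trip through an in-memory buffer; the values are exactly
# representable as float32, so they come back unchanged.
sample_vectors = {"good": [0.5, -1.25], "bad": [2.0, 0.0]}
buf = io.BytesIO()
write_word2vec_binary(buf, sample_vectors)
buf.seek(0)
loaded = read_word2vec_binary(buf)
print(loaded)
```

The vector bytes are read as one fixed-size block, so byte values that happen to look like a space or newline inside a float do not confuse the parser.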
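Below are some good Chinese corpora that can be downloaded online, listed here for researchers to study with.
(1). Chinese and English news corpora from the Institute of Automation, Chinese Academy of Sciences http://www.datatang.com/data/13484
The Chinese news classification corpus was collected from the news sections of Phoenix (ifeng), Sina, NetEase and Tencent. The English news classification corpus is the ModApte version of Reuters-21578.
(2). Sogou Chinese news corpus http://www.sogou.com/labs/dl/c.html
Contains a large amount of Sohu news text with the corresponding category labels. Versions of different sizes are available for download.
(3). Li Ronglu's Chinese corpus http://www.datatang.com/data/11968
About 240 MB when compressed.
(4). Tan Songbo's Chinese text classification corpus http://www.datatang.com/data/11970
It contains not only top-level categories such as economy and sports; each top-level category also has specific subcategories, for example sports contains basketball, football and so on. It can serve as a corpus for hierarchical classification and is very practical. This address requires no download points (Tan Songbo's homepage): http://www.searchforum.org.cn/tansongbo/corpus1.PHP
(5). NetEase categorized text data http://www.datatang.com/data/11965
Contains 4000 text items in six top-level categories such as sports and cars.
(6). Chinese text classification corpus http://www.datatang.com/data/11963
Contains corpus texts in categories such as Arts and Literature.
(7). A more complete Sogou text classification corpus http://www.sogou.com/labs/dl/c.html
A text classification corpus published by Sogou Labs, with versions of different sizes available for free download.
(8). 2002 Chinese web page classification training set http://www.datatang.com/data/15021
In the autumn of 2002, the Tianwang group of the Network and Distributed Systems Laboratory at Peking University mobilized dozens of students from different majors to hand-pick a new large-scale Chinese web page sample set based on a hierarchical model. It contains 11678 training web page instances and 3630 test web page instances, spread over 11 top-level categories.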
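Segment the corpus into words and remove stop words. Commonly used segmentation tools include:
StandardAnalyzer (Chinese and English), ChineseAnalyzer (Chinese), CJKAnalyzer (Chinese and English), IKAnalyzer (Chinese and English, also compatible with Korean and Japanese), paoding (Chinese), MMAnalyzer (Chinese and English), MMSeg4j (Chinese and English), imdict (Chinese and English), NLTK (Chinese and English), Jieba (Chinese and English).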
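The preprocessing step can be sketched as follows. The function `preprocess`, the sample sentences and the stop-word set are all made up for illustration; the default segmenter simply splits on whitespace as a stand-in, and for real Chinese text one would pass in a proper segmenter, for example `jieba.lcut` (assuming Jieba is installed).

```python
def preprocess(documents, segment=None, stop_words=frozenset()):
    """Return one space-joined line of kept tokens per document."""
    if segment is None:
        segment = str.split  # stand-in; for Chinese pass e.g. jieba.lcut
    lines = []
    for doc in documents:
        tokens = [t.strip() for t in segment(doc)]
        kept = [t for t in tokens if t and t not in stop_words]
        lines.append(" ".join(kept))
    return lines


# Pre-segmented sample sentences and a tiny hypothetical stop-word list.
docs = ["我 喜歡 自然 語言 處理", "這 是 一個 測試 的 語料"]
stop = {"的", "是", "這"}
for line in preprocess(docs, stop_words=stop):
    print(line)
```

Writing the tokens of each document as one space-separated line gives a file that word2vec and gensim can read directly as training input.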
Raw corpus http://pan.baidu.com/s/1nviuFc1
Training corpus http://pan.baidu.com/s/1kVEmNTd