Reposted from: SevenBlue
Pre-trained vectors trained on part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in this paper (arXiv:1310.4546, listed in the references below).
download link | source link
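The Google News vectors ship as a single binary file in word2vec format. A minimal loading sketch with gensim, assuming the archive has been unpacked to its usual filename GoogleNews-vectors-negative300.bin:

from gensim.models import KeyedVectors

# Load the binary word2vec file; this reads several GB into memory.
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

print(wv["king"].shape)                  # (300,) -- one 300-dimensional vector per token
print(wv.most_similar("king", topn=5))   # nearest neighbours by cosine similarity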
1 million word vectors trained on Wikipedia 2017, the UMBC webbase corpus, and the statmt.org news dataset (16B tokens); a loading sketch follows the three fastText entries below.
download link | source link
1 million word vectors trained with subword information on Wikipedia 2017, the UMBC webbase corpus, and the statmt.org news dataset (16B tokens).
download link | source link
2 million word vectors trained on Common Crawl (600B tokens).
download link | source link
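The three English fastText downloads above are plain-text .vec files in word2vec format, so they all load the same way. A minimal sketch, assuming the unzipped file wiki-news-300d-1M.vec (filename taken from the fastText download page):

from gensim.models import KeyedVectors

# .vec files are text with a "count dimension" header line, so no conversion is needed.
wv = KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec", binary=False)
print(wv.most_similar("computer", topn=5))

Note that .vec exports contain only whole-word vectors; to compose vectors for out-of-vocabulary words from character n-grams you need a .bin model, as in the last entry of this list.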
Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download); a loading sketch follows these four entries.
Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB download)
Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)
Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors, 1.42 GB download)
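These four entries appear to be the standard Stanford GloVe releases. GloVe text files have no header line, so gensim needs to be told to skip it; a minimal sketch, assuming the 6B package has been unzipped to glove.6B.100d.txt:

from gensim.models import KeyedVectors

# GloVe's native format is word2vec text minus the header; gensim >= 4.0 accepts
# no_header=True, while older versions can use gensim.scripts.glove2word2vec first.
wv = KeyedVectors.load_word2vec_format("glove.6B.100d.txt", binary=False, no_header=True)
print(wv.most_similar("frog", topn=5))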
Wikipedia database (Chinese), vector size 300, corpus size 1 GB, vocabulary size 50,101, Jieba tokenizer
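Because this corpus was segmented with Jieba, query text should be segmented the same way before lookup. A minimal sketch; the filename zh_wiki_300d.txt is hypothetical, so substitute whatever the download actually provides and adjust binary= to match:

import jieba
from gensim.models import KeyedVectors

# zh_wiki_300d.txt is a placeholder name for the downloaded vector file.
wv = KeyedVectors.load_word2vec_format("zh_wiki_300d.txt", binary=False)

for token in jieba.lcut("自然语言处理"):   # segment the query with Jieba, as in training
    if token in wv:                          # only whole-word lookups are possible here
        print(token, wv.most_similar(token, topn=3))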
Trained on Common Crawl and Wikipedia using fastText. These models were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5, and 10 negatives. We used the Stanford word segmenter for tokenization.
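These Common Crawl + Wikipedia models are distributed both as .vec text and as .bin binaries; the .bin keeps the character n-gram buckets, so vectors can be composed even for unseen words. A minimal sketch with gensim, assuming the Chinese model file cc.zh.300.bin (filename as on the fastText crawl-vectors page):

from gensim.models.fasttext import load_facebook_vectors

# Loads the full fastText binary, including subword (character n-gram) information.
wv = load_facebook_vectors("cc.zh.300.bin")

print(wv["自然语言处理"][:5])   # composed from character n-grams even if the token is OOV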
https://github.com/Hironsan/awesome-embedding-models
http://ahogrammer.com/2017/01/20/the-list-of-pretrained-word-embeddings/
https://code.google.com/archive/p/word2vec/
https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
https://fasttext.cc/docs/en/english-vectors.html
https://arxiv.org/pdf/1310.4546.pdf