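Original post: www.hijerry.cn/p/54554.htm…html
Last winter I worked through the 2017 offering of cs224n and did its three assignments, which used TensorFlow. This year cs224n has been released again, with five assignments in total, using PyTorch. The lecturer is still Manning. I really like him: his lectures are lively, fun, and rather endearing, haha.
The task of Assignment 1 is to explore word vectors. It computes word similarity in two ways, with a count-based co-occurrence matrix and with prediction-based word2vec, and studies properties such as synonyms and antonyms. Working through them at the code level makes them easier to remember.
The assignment is an ipynb file, so it has to be opened with Jupyter; see chaibubble's post on how to open ipynb files.
Note: Python version >= 3.5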
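Word vectors are a basic building block of downstream NLP tasks (question answering, text generation, translation, and so on), and their quality can strongly affect downstream performance. Here we explore two kinds of word vectors: those derived from a co-occurrence matrix and those produced by word2vec.
Terminology: "word vectors" and "word embeddings" are usually interchangeable. The word "embedding" inherently means encoding words into a low-dimensional space. "Conceptually, it means embedding a high-dimensional space, with one dimension per word in the vocabulary, into a continuous vector space of much lower dimension, where each word or phrase is mapped to a vector of real numbers." (Wikipedia)
Most word vector models are based on one idea:
You shall know a word by the company it keeps (Firth, J. R. 1957:11)
The core of most word vector implementations is that similar words, i.e. (near-)synonyms, appear in similar contexts. Here we introduce one strategy built on this idea, the co-occurrence matrix (for more information see here or here).
This part implements the following: given a corpus, compute word vectors from the co-occurrence matrix, so that every word in the corpus gets a vector. The steps are as follows:
Find the distinct words in the corpus and count them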
def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): list of distinct words across the corpus, sorted (using python 'sorted' function)
            num_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    num_corpus_words = -1
    # ------------------
    # Write your implementation here.
    corpus = [w for sent in corpus for w in sent]
    corpus_words = list(set(corpus))
    corpus_words = sorted(corpus_words)
    num_corpus_words = len(corpus_words)
    # ------------------
    return corpus_words, num_corpus_words
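As a quick sanity check, the same flatten, deduplicate, sort logic can be run on a tiny corpus. The sketch below restates it compactly under the hypothetical name `distinct_words_sketch`; the two-document corpus is made up purely for illustration.

```python
def distinct_words_sketch(corpus):
    """Return the sorted distinct words of a corpus and their count."""
    # Flatten the list of documents, deduplicate with a set, then sort.
    corpus_words = sorted({w for sent in corpus for w in sent})
    return corpus_words, len(corpus_words)


# Hypothetical toy corpus: two tokenized documents.
toy_corpus = [
    "START All that glitters isn't gold END".split(" "),
    "START All's well that ends well END".split(" "),
]
words, num_words = distinct_words_sketch(toy_corpus)
print(words, num_words)
```

Python's default string ordering puts uppercase letters before lowercase ones, so tokens such as "START" and "END" sort ahead of the ordinary words.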
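Compute the co-occurrence matrix for the given corpus. Specifically, for each word w, count how often every other word appears within window_size words before and after it.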
import numpy as np

def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.
              For example, if we take the document "START All that glitters is not gold END" with window size of 4,
              "All" will co-occur with "START", "that", "glitters", "is", and "not".
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (numpy matrix of shape (number of corpus words, number of corpus words)):
                Co-occurrence matrix of word counts.
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2Ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, num_words = distinct_words(corpus)
    M = None
    word2Ind = {}
    # ------------------
    # Write your implementation here.
    M = np.zeros(shape=(num_words, num_words), dtype=np.int32)
    for i in range(num_words):
        word2Ind[words[i]] = i
    for sent in corpus:
        for p in range(len(sent)):
            ci = word2Ind[sent[p]]
            # preceding
            for w in sent[max(0, p - window_size):p]:
                wi = word2Ind[w]
                M[ci][wi] += 1
            # subsequent
            for w in sent[p + 1:p + 1 + window_size]:
                wi = word2Ind[w]
                M[ci][wi] += 1
    # ------------------
    return M, word2Ind
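The example from the docstring can be checked by hand. The sketch below is an assumption-labeled variant that stores counts in a `Counter` keyed by (center, context) pairs instead of a NumPy matrix; the helper name `co_occurrence_counts` is hypothetical and not part of the assignment.

```python
from collections import Counter


def co_occurrence_counts(corpus, window_size=4):
    """Count (center, context) pairs within `window_size` words on each side."""
    counts = Counter()
    for sent in corpus:
        for p, center in enumerate(sent):
            lo = max(0, p - window_size)
            # Words before the center plus words after it, clipped at the edges.
            context = sent[lo:p] + sent[p + 1:p + 1 + window_size]
            for w in context:
                counts[(center, w)] += 1
    return counts


doc = "START All that glitters is not gold END".split()
counts = co_occurrence_counts([doc], window_size=4)
neighbours_of_all = sorted(w for (c, w) in counts if c == "All")
print(neighbours_of_all)
```

Because the window extends equally in both directions, the count for (a, b) always equals the count for (b, a), which is why the matrix form is symmetric.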
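This step is dimensionality reduction. Question 1.2 produces an N x N matrix (N is the size of the vocabulary). Using the SVD (singular value decomposition) implemented in scikit-learn, we extract from this large matrix a smaller N x k matrix with k features.
Note: numpy, scipy, and scikit-learn all provide SVD implementations, but only scipy and sklearn offer Truncated SVD, and only sklearn provides an efficient randomized algorithm for large-scale SVD. For details, see sklearn.decomposition.TruncatedSVD.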
from sklearn.decomposition import TruncatedSVD

def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurrence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
        Params:
            M (numpy matrix of shape (number of corpus words, number of corpus words)): co-occurrence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """
    n_iters = 10     # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))
    # ------------------
    # Write your implementation here.
    # fit_transform returns U * S, as the docstring asks; n_iter receives n_iters.
    svd = TruncatedSVD(n_components=k, n_iter=n_iters)
    M_reduced = svd.fit_transform(M)
    # ------------------
    print("Done.")
    return M_reduced
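The docstring says the reduced matrix corresponds to U * S. A minimal NumPy sketch of that quantity follows, using an exact SVD (`np.linalg.svd`) rather than scikit-learn's randomized solver; the 4 x 4 count matrix is hypothetical.

```python
import numpy as np

# Hypothetical 4 x 4 symmetric co-occurrence counts.
M = np.array([[0, 2, 1, 0],
              [2, 0, 3, 1],
              [1, 3, 0, 2],
              [0, 1, 2, 0]], dtype=float)
k = 2

# Full SVD: M = U @ diag(S) @ Vt, singular values in descending order.
U, S, Vt = np.linalg.svd(M)

# Keep the top-k components; each row is a k-dimensional word embedding.
M_reduced = U[:, :k] * S[:k]

# U_k * S_k equals M projected onto the top-k right singular vectors.
projected = M @ Vt[:k].T
print(M_reduced.shape, np.allclose(M_reduced, projected))
```

Since the columns of V are orthonormal, M V_k = U_k S_k, so keeping U * S is the same as projecting each row of M onto the k leading directions.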
Using matplotlib, draw an "×" for each word with scatter and write its label with text.
import matplotlib.pyplot as plt

def plot_embeddings(M_reduced, word2Ind, words):
    """ Plot in a scatterplot the embeddings of the words specified in the list "words".
        NOTE: do not plot all the words listed in M_reduced / word2Ind.
        Include a label next to each point.
        Params:
            M_reduced (numpy matrix of shape (number of unique words in the corpus , k)): matrix of k-dimensional word embeddings
            word2Ind (dict): dictionary that maps word to indices for matrix M
            words (list of strings): words whose embeddings we want to visualize
    """
    # ------------------
    # Write your implementation here.
    for w in words:
        x = M_reduced[word2Ind[w]][0]
        y = M_reduced[word2Ind[w]][1]
        plt.scatter(x, y, marker='x')
        plt.text(x, y, w)
    plt.show()
    # ------------------
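Result:
Embed the words in 2 dimensions and normalize each vector to unit length, so that the word vectors end up on the unit circle. Then look for words that lie close together in the plot.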
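The normalization step can be sketched as dividing each row by its Euclidean length. The embedding values below are hypothetical, and the variable names are chosen for illustration only.

```python
import numpy as np

# Hypothetical 2-D embeddings for three words (one row per word).
M_reduced = np.array([[3.0, 4.0],
                      [1.0, 0.0],
                      [-2.0, 2.0]])

# Euclidean length of every row.
M_lengths = np.linalg.norm(M_reduced, axis=1)

# Broadcasting divides each row by its own length.
M_normalized = M_reduced / M_lengths[:, np.newaxis]
print(M_normalized)
```

After this step only the direction of each vector matters, so closeness in the plot reflects the angle between word vectors rather than how frequent the words are.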
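Prediction-based word vectors, such as word2vec, are currently the most popular. Now we explore the word vectors produced by word2vec; for a deeper look, read the original paper.
This part mainly uses gensim to explore word vectors rather than implementing word2vec ourselves. The vectors used are 300-dimensional and were released by Google.
First, use SVD to reduce the 300 dimensions to 2 so that the vectors can be plotted and inspected.
Same as Question 1.5.
Find a word with multiple meanings (for example "leaves" or "scoop") whose top-10 most similar words (by cosine similarity) include words from two different senses. For example, the top 10 for "leaves" includes both "vanishes" (the "departs" sense) and "stalks" (the plant sense).
The word I found is "column": its top 10 includes "columnist" and "article".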
# ------------------
# Write your polysemous word exploration code here.
wv_from_bin.most_similar("column")
# ------------------
Output:
[('columns', 0.767943263053894),
('columnist', 0.6541407108306885),
('article', 0.651928186416626),
('columnists', 0.617466926574707),
('syndicated_column', 0.599014401435852),
('op_ed', 0.588202714920044),
('Op_Ed', 0.5801560282707214),
('op_ed_column', 0.5779396891593933),
('nationally_syndicated_column', 0.572504997253418),
('colum', 0.5595961213111877)]
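Find three words (w1, w2, w3) where w1 and w2 are synonyms and w1 and w3 are antonyms, yet the distance between w1 and w3 is smaller than the distance between w1 and w2. For example: w1="happy", w2="cheerful", w3="sad".
Why would the antonyms be more similar (a smaller distance means more similar)? Because they usually appear in very similar contexts.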
# ------------------
# Write your synonym & antonym exploration code here.
w1 = "love"
w2 = "like"
w3 = "hate"
w1_w2_dist = wv_from_bin.distance(w1, w2)
w1_w3_dist = wv_from_bin.distance(w1, w3)
print("Synonyms {}, {} have cosine distance: {}".format(w1, w2, w1_w2_dist))
print("Antonyms {}, {} have cosine distance: {}".format(w1, w3, w1_w3_dist))
# ------------------
Output:
Synonyms love, like have cosine distance: 0.6328612565994263
Antonyms love, hate have cosine distance: 0.39960432052612305
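For reference, cosine distance is 1 minus cosine similarity, so it ranges from 0 (same direction) to 2 (opposite direction). The sketch below computes it by hand on hypothetical 2-D vectors; `cosine_distance` is an illustrative helper, not the gensim function.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical 2-D vectors, chosen only to illustrate the range of values.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
opposite = cosine_distance([1.0, 1.0], [-1.0, -1.0])
print(same_direction, orthogonal, opposite)
```

On this scale, the 0.40 for love/hate versus 0.63 for love/like above means the antonym pair points in more similar directions than the synonym pair.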
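Questions such as "man is to king as woman is to ___" can also be answered with word2vec. For details on how to use most_similar, see the GenSim documentation.
Here we look for a different analogy: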
# ------------------
# Write your analogy exploration code here.
# man : him :: woman : her
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'him'], negative=['man']))
# ------------------
Output:
[('her', 0.694490909576416),
('she', 0.6385233402252197),
('me', 0.628451406955719),
('herself', 0.6239798665046692),
('them', 0.5843966007232666),
('She', 0.5237804651260376),
('myself', 0.4885627031326294),
('saidshe', 0.48337966203689575),
('he', 0.48184287548065186),
('Gail_Quets', 0.4784894585609436)]
As the output shows, "her" is correctly ranked first.
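An analogy query of this kind can be read as vector arithmetic: add the vectors of the positive words, subtract those of the negative words, and rank the remaining words by cosine similarity to the result. The sketch below is a simplified illustration on made-up 2-D vectors, not gensim's implementation; the vectors and the helpers `unit` and `analogy` are hypothetical.

```python
import numpy as np

# Hypothetical 2-D "embeddings": axis 0 loosely encodes gender, axis 1 royalty.
vectors = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([-1.0, 1.0]),
    "apple": np.array([0.0, -1.0]),
}


def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def analogy(positive, negative):
    """Rank the remaining words by cosine similarity to the combined query."""
    query = (sum(unit(vectors[w]) for w in positive)
             - sum(unit(vectors[w]) for w in negative))
    # The query words themselves are excluded from the candidates.
    candidates = [w for w in vectors if w not in positive + negative]
    scored = [(w, float(unit(vectors[w]) @ unit(query))) for w in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# man : king :: woman : ?
ranking = analogy(positive=["woman", "king"], negative=["man"])
print(ranking)
```

In this toy space "queen" ranks first because subtracting "man" and adding "woman" flips the gender axis while leaving the royalty axis untouched.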
Find an analogy that fails: tree : leaf :: flower : petal
# ------------------
# Write your incorrect analogy exploration code here.
# tree : leaf :: flower : petal
pprint.pprint(wv_from_bin.most_similar(positive=['leaf', 'flower'], negative=['tree']))
# ------------------
Output:
[('floral', 0.5532568693161011),
('marigold', 0.5291938185691833),
('tulip', 0.521312952041626),
('rooted_cuttings', 0.5189826488494873),
('variegation', 0.5136324763298035),
('Asiatic_lilies', 0.5132641792297363),
('gerberas', 0.5106234550476074),
('gerbera_daisies', 0.5101010203361511),
('Verbena_bonariensis', 0.5070016980171204),
('violet', 0.5058108568191528)]
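The output does not contain "petal".
It is important to be aware of the biases in word vectors, such as gender and racial bias. Run the code below and analyze two questions:
(a) Which words are most similar to "woman" and "boss" and most dissimilar to "man"?
(b) Which words are most similar to "man" and "boss" and most dissimilar to "woman"?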
# Run this cell
# Here `positive` indicates the list of words to be similar to and `negative` indicates the list of words to be
# most dissimilar from.
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'boss'], negative=['man']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['man', 'boss'], negative=['woman']))
Output:
[('bosses', 0.5522644519805908),
('manageress', 0.49151360988616943),
('exec', 0.45940813422203064),
('Manageress', 0.45598435401916504),
('receptionist', 0.4474116563796997),
('Jane_Danson', 0.44480544328689575),
('Fiz_Jennie_McAlpine', 0.44275766611099243),
('Coronation_Street_actress', 0.44275566935539246),
('supremo', 0.4409853219985962),
('coworker', 0.43986251950263977)]
[('supremo', 0.6097398400306702),
('MOTHERWELL_boss', 0.5489562153816223),
('CARETAKER_boss', 0.5375303626060486),
('Bully_Wee_boss', 0.5333974361419678),
('YEOVIL_Town_boss', 0.5321705341339111),
('head_honcho', 0.5281980037689209),
('manager_Stan_Ternent', 0.525971531867981),
('Viv_Busby', 0.5256162881851196),
('striker_Gabby_Agbonlahor', 0.5250812768936157),
('BARNSLEY_boss', 0.5238943099975586)]
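For the first analogy, man : woman :: boss : ___, the most fitting answer would be something like "landlady" (a female proprietor), but the top 10 only contains words such as "manageress" and "receptionist".
For the second analogy, woman : man :: boss : ___, I have no idea what most of the output is supposed to be.
The example I picked is: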
# ------------------
# Write your bias exploration code here.
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'doctor'], negative=['man']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['man', 'doctor'], negative=['woman']))
# ------------------
Output:
[('gynecologist', 0.7093892097473145),
('nurse', 0.647728681564331),
('doctors', 0.6471461057662964),
('physician', 0.64389967918396),
('pediatrician', 0.6249487996101379),
('nurse_practitioner', 0.6218312978744507),
('obstetrician', 0.6072014570236206),
('ob_gyn', 0.5986712574958801),
('midwife', 0.5927063226699829),
('dermatologist', 0.5739566683769226)]
[('physician', 0.6463665962219238),
('doctors', 0.5858404040336609),
('surgeon', 0.5723941326141357),
('dentist', 0.552364706993103),
('cardiologist', 0.5413815975189209),
('neurologist', 0.5271126627922058),
('neurosurgeon', 0.5249835848808289),
('urologist', 0.5247740149497986),
('Doctor', 0.5240625143051147),
('internist', 0.5183224081993103)]
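In the first analogy we see "nurse", which is a biased analogy.
What causes bias in word vectors?
The bias comes from the dataset itself.
[1] CS224n: Natural Language Processing with Deep Learning, 2019-03-12.
[2] Bias and discrimination in computer systems: besides killing it, there are other approaches (計算機系統裏的偏見和歧視:除了殺死,還有其餘方法), 2019-03-12.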