Answer: In Q2(b) we already derived that the gradient of the softmax cross-entropy loss with respect to its input is $\hat{y} - y$. The result for this problem is therefore
$$\frac{\partial J}{\partial v_c} = U(\hat{y} - y),$$
where $U = [u_1, u_2, \dots, u_W]$ is the matrix formed by all of the output word vectors (as columns), i.e.
$$\frac{\partial J}{\partial v_c} = \sum_{w=1}^{W} \hat{y}_w u_w - u_o.$$
Answer: Similar to the previous part, the result is
$$\frac{\partial J}{\partial U} = v_c\,(\hat{y} - y)^{\top},$$
i.e. for each output vector
$$\frac{\partial J}{\partial u_w} = (\hat{y}_w - y_w)\,v_c =
\begin{cases} (\hat{y}_o - 1)\,v_c & w = o,\\ \hat{y}_w\,v_c & w \neq o. \end{cases}$$
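As a quick sanity check of the two expressions above, here is a small numpy sketch that compares the analytic gradients with numerical ones (the variable names are illustrative only, not part of the assignment's starter code):

```python
import numpy as np

np.random.seed(0)
W, d, o = 5, 3, 2                         # vocab size, vector dim, target index
U = np.random.randn(d, W)                 # output vectors u_1..u_W as columns
v_c = np.random.randn(d)
y = np.zeros(W); y[o] = 1.0               # one-hot target

def J(U, v):
    z = U.T.dot(v)                        # scores u_w^T v_c
    y_hat = np.exp(z - z.max()); y_hat /= y_hat.sum()
    return -np.log(y_hat[o])              # cross-entropy loss

z = U.T.dot(v_c)
y_hat = np.exp(z - z.max()); y_hat /= y_hat.sum()

dv_analytic = U.dot(y_hat - y)            # dJ/dv_c = U(y_hat - y)
dU_analytic = np.outer(v_c, y_hat - y)    # dJ/dU   = v_c (y_hat - y)^T

eps = 1e-5
dv_numeric = np.array([(J(U, v_c + eps * e) - J(U, v_c - eps * e)) / (2 * eps)
                       for e in np.eye(d)])
dU_numeric = np.zeros_like(U)
for i in range(d):
    for w in range(W):
        E = np.zeros_like(U); E[i, w] = eps
        dU_numeric[i, w] = (J(U + E, v_c) - J(U - E, v_c)) / (2 * eps)

print(np.allclose(dv_analytic, dv_numeric, atol=1e-6))  # True
print(np.allclose(dU_analytic, dU_numeric, atol=1e-6))  # True
```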
Answer: This is again a gradient computation; we can simply reuse our earlier results.
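For reference, writing out the negative-sampling loss and its gradients (with $\sigma$ the sigmoid, $u_o$ the target output vector and $u_k$, $k = 1,\dots,K$, the sampled output vectors); these are exactly the quantities computed by the negSamplingCostAndGradient code further below:
$$J_{\text{neg-sample}}(o, v_c, U) = -\log\sigma(u_o^{\top} v_c) - \sum_{k=1}^{K}\log\sigma(-u_k^{\top} v_c),$$
$$\frac{\partial J}{\partial v_c} = \big(\sigma(u_o^{\top} v_c) - 1\big)\,u_o + \sum_{k=1}^{K}\sigma(u_k^{\top} v_c)\,u_k,$$
$$\frac{\partial J}{\partial u_o} = \big(\sigma(u_o^{\top} v_c) - 1\big)\,v_c, \qquad
\frac{\partial J}{\partial u_k} = \sigma(u_k^{\top} v_c)\,v_c.$$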
Answer: We use $F(o, v_c)$ (where $o$ denotes a word) as a placeholder for the loss function. As the assignment hints, for skip-gram the loss for the context around center word $c$ is
$$J_{\text{skip-gram}}(\text{word}_{c-m,\dots,c+m}) = \sum_{-m \le j \le m,\ j \neq 0} F(w_{c+j},\, v_c),$$
where $v_c$ is our center word vector.
The assignment defines the CBOW loss as
$$J_{\text{CBOW}}(\text{word}_{c-m,\dots,c+m}) = F(w_c,\, \hat{v}), \qquad \hat{v} = \sum_{-m \le j \le m,\ j \neq 0} v_{c+j}.$$
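To complete the answer, the required gradients follow directly in terms of the placeholder $F$; these are the relations the skipgram and cbow implementations below rely on.

Skip-gram:
$$\frac{\partial J_{\text{skip-gram}}}{\partial U} = \sum_{-m \le j \le m,\ j \neq 0}\frac{\partial F(w_{c+j}, v_c)}{\partial U}, \qquad
\frac{\partial J_{\text{skip-gram}}}{\partial v_c} = \sum_{-m \le j \le m,\ j \neq 0}\frac{\partial F(w_{c+j}, v_c)}{\partial v_c}, \qquad
\frac{\partial J_{\text{skip-gram}}}{\partial v_j} = 0 \ \text{for } j \neq c.$$

CBOW:
$$\frac{\partial J_{\text{CBOW}}}{\partial U} = \frac{\partial F(w_c, \hat{v})}{\partial U}, \qquad
\frac{\partial J_{\text{CBOW}}}{\partial v_j} = \frac{\partial F(w_c, \hat{v})}{\partial \hat{v}} \ \text{for } j \in \{c-m,\dots,c+m\}\setminus\{c\}, \qquad
\frac{\partial J_{\text{CBOW}}}{\partial v_j} = 0 \ \text{otherwise.}$$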
First, complete the row-normalization function:
```python
def normalizeRows(x):
    """ Row normalization function

    Implement a function that normalizes each row of a matrix to have
    unit length.
    """
    n = x.shape[0]
    # Divide each row by its L2 norm; the small constant guards against division by zero.
    x /= np.sqrt(np.sum(x**2, axis=1)).reshape((n, 1)) + 1e-30
    return x
```
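A quick check on a small matrix:

```python
import numpy as np

x = normalizeRows(np.array([[3.0, 4.0], [1.0, 2.0]]))
print(x)
# Each row now has unit length: approximately [[0.6, 0.8], [0.4472, 0.8944]].
```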
Next, complete the softmax cost function for word2vec:
```python
def softmaxCostAndGradient(predicted, target, outputVectors, dataset):
    """ Softmax cost function for word2vec models

    Implement the cost and gradients for one predicted word vector
    and one target word vector as a building block for word2vec
    models, assuming the softmax prediction function and cross
    entropy loss.

    Arguments:
    predicted -- numpy ndarray, predicted word vector (\hat{v} in
                 the written component)
    target -- integer, the index of the target word
    outputVectors -- "output" vectors (as rows) for all tokens
    dataset -- needed for negative sampling, unused here.

    Return:
    cost -- cross entropy cost for the softmax word prediction
    gradPred -- the gradient with respect to the predicted word vector
    grad -- the gradient with respect to all the other word vectors

    We will not provide starter code for this function, but feel
    free to reference the code you previously wrote for this
    assignment!
    """
    # One forward/backward pass for a single predicted vector and the current target word.
    # Forward pass: scores, softmax probabilities, and cross-entropy cost.
    v_hat = predicted
    z = np.dot(outputVectors, v_hat)
    preds = softmax(z)
    cost = -np.log(preds[target])

    # Backward pass: the gradient of the cost w.r.t. the scores is (y_hat - y).
    z = preds.copy()
    z[target] -= 1.0

    # np.outer flattens both arguments to 1-D and returns their outer product:
    # the first argument indexes the rows of the result, the second the columns.
    grad = np.outer(z, v_hat)              # gradient w.r.t. all output vectors
    gradPred = np.dot(outputVectors.T, z)  # gradient w.r.t. the predicted vector

    return cost, gradPred, grad
```
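Note how the code lines up with the written answers: `z = preds.copy(); z[target] -= 1.0` is exactly $\hat{y} - y$; `np.outer(z, v_hat)` stacks $(\hat{y}_w - y_w)\,v_c$ as the rows of `grad` (output vectors are stored as rows here); and `np.dot(outputVectors.T, z)` is $\sum_w (\hat{y}_w - y_w)\,u_w$, the gradient with respect to the center vector.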
The loss function and gradients for word2vec with negative sampling:
```python
def negSamplingCostAndGradient(predicted, target, outputVectors, dataset,
                               K=10):
    """ Negative sampling cost function for word2vec models

    Implement the cost and gradients for one predicted word vector
    and one target word vector as a building block for word2vec
    models, using the negative sampling technique. K is the sample
    size.

    Note: See test_word2vec below for dataset's initialization.

    Arguments/Return Specifications: same as softmaxCostAndGradient
    """
    # Sampling of indices is done for you. Do not modify this if you
    # wish to match the autograder and receive points!
    # Draw K negative samples for this window.
    indices = [target]
    indices.extend(getNegativeSamples(target, dataset, K))

    grad = np.zeros(outputVectors.shape)
    gradPred = np.zeros(predicted.shape)
    cost = 0

    # Maximize the probability of the true target word.
    z = sigmoid(np.dot(outputVectors[target], predicted))
    cost -= np.log(z)
    grad[target] += predicted * (z - 1.0)
    gradPred = outputVectors[target] * (z - 1.0)

    # Minimize the probability that the sampled (negative) words appear
    # near the center word.
    for k in range(K):
        sample = indices[k + 1]
        z = sigmoid(np.dot(outputVectors[sample], predicted))
        cost -= np.log(1.0 - z)
        grad[sample] += predicted * z
        gradPred += outputVectors[sample] * z

    return cost, gradPred, grad
```
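Compared with the full softmax version, only the $K+1$ rows of `grad` touched by the target word and the sampled words are non-zero, which is what makes negative sampling so much cheaper on a large vocabulary.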
Now complete the skip-gram model:
```python
def skipgram(currentWord, C, contextWords, tokens, inputVectors, outputVectors,
             dataset, word2vecCostAndGradient=softmaxCostAndGradient):
    """ Skip-gram model in word2vec

    Implement the skip-gram model in this function.

    Arguments:
    currentWord -- a string of the current center word
    C -- integer, context size
    contextWords -- list of no more than 2*C strings, the context words
    tokens -- a dictionary that maps words to their indices in
              the word vector list
    inputVectors -- "input" word vectors (as rows) for all tokens
    outputVectors -- "output" word vectors (as rows) for all tokens
    word2vecCostAndGradient -- the cost and gradient function for
                               a prediction vector given the target
                               word vectors, could be one of the two
                               cost functions you implemented above.

    Return:
    cost -- the cost function value for the skip-gram model
    grad -- the gradient with respect to the word vectors
    """
    cost = 0.0
    gradIn = np.zeros(inputVectors.shape)
    gradOut = np.zeros(outputVectors.shape)

    # Skip-gram predicts the context words within a window around the current
    # center word, maximizing their probabilities given the center vector.
    cword_idx = tokens[currentWord]
    v_hat = inputVectors[cword_idx]

    for i in contextWords:
        # Index of the target (context) word to predict.
        idx = tokens[i]
        c_cost, c_grad_in, c_grad_out = word2vecCostAndGradient(
            v_hat, idx, outputVectors, dataset)
        # Accumulate cost and gradients: the center vector is trained
        # against every context word in the window.
        cost += c_cost
        gradOut += c_grad_out
        gradIn[cword_idx] += c_grad_in

    return cost, gradIn, gradOut
```
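A toy call just to illustrate the interface; the data here is made up, and softmax from the earlier part of the assignment plus the functions defined above are assumed to be available:

```python
import numpy as np

np.random.seed(1)
tokens = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4}
inputVectors = normalizeRows(np.random.randn(5, 3))
outputVectors = normalizeRows(np.random.randn(5, 3))

# dataset is only needed by the negative-sampling loss, so None is fine here.
cost, gradIn, gradOut = skipgram("c", 2, ["a", "b", "e", "d"], tokens,
                                 inputVectors, outputVectors, None)
print(cost, gradIn.shape, gradOut.shape)
```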
Note that Python 3 no longer has cPickle; just use pickle directly.
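A common way to keep the loading utilities working under both versions is an import fallback like the following (a small sketch, not the assignment's original code):

```python
try:
    import cPickle as pickle   # Python 2
except ImportError:
    import pickle              # Python 3
```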
```python
def sgd(f, x0, step, iterations, postprocessing=None, useSaved=False,
        PRINT_EVERY=10):
    """ Stochastic Gradient Descent

    Implement the stochastic gradient descent method in this function.

    Arguments:
    f -- the function to optimize, it should take a single
         argument and yield two outputs, a cost and the gradient
         with respect to the arguments
    x0 -- the initial point to start SGD from
    step -- the step size for SGD
    iterations -- total iterations to run SGD for
    postprocessing -- postprocessing function for the parameters
                      if necessary. In the case of word2vec we will
                      need to normalize the word vectors to have
                      unit length.
    PRINT_EVERY -- specifies how many iterations to output loss

    Return:
    x -- the parameter value after SGD finishes
    """
    # Anneal learning rate every several iterations
    ANNEAL_EVERY = 20000

    if useSaved:
        start_iter, oldx, state = load_saved_params()
        if start_iter > 0:
            x0 = oldx
            # Integer division keeps the original (Python 2) annealing schedule.
            step *= 0.5 ** (start_iter // ANNEAL_EVERY)
        if state:
            random.setstate(state)
    else:
        start_iter = 0

    x = x0

    if not postprocessing:
        postprocessing = lambda x: x

    expcost = None

    for iter in range(start_iter + 1, iterations + 1):
        # Don't forget to apply the postprocessing after every iteration!
        # You might want to print the progress every few iterations.
        cost = None
        ### YOUR CODE HERE
        cost, grad = f(x)
        x -= step * grad
        postprocessing(x)
        ### END YOUR CODE

        if iter % PRINT_EVERY == 0:
            # Exponentially smoothed cost for nicer progress output.
            if not expcost:
                expcost = cost
            else:
                expcost = .95 * expcost + .05 * cost
            print("iter %d: %f" % (iter, expcost))

        if iter % SAVE_PARAMS_EVERY == 0 and useSaved:
            save_params(iter, x)

        if iter % ANNEAL_EVERY == 0:
            step *= 0.5

    return x
```
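As a quick check that the loop works, the starter code exercises sgd on a simple quadratic; here is a sketch in that spirit (SAVE_PARAMS_EVERY is a module-level constant in the starter code, its value is assumed here):

```python
import numpy as np

SAVE_PARAMS_EVERY = 5000

# Cost and gradient of ||x||^2; SGD should drive x towards the zero vector.
quad = lambda x: (np.sum(x ** 2), 2 * x)

x_opt = sgd(quad, np.array([0.5, -0.3]), 0.01, 1000, PRINT_EVERY=100)
print(x_opt)   # close to [0, 0]
```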
Quite a few places need changes to run under Python 3; just fix them one by one as the errors come up. Training took 8.09 hours in total.
```python
def cbow(currentWord, C, contextWords, tokens, inputVectors, outputVectors,
         dataset, word2vecCostAndGradient=softmaxCostAndGradient):
    """ CBOW model in word2vec

    Implement the continuous bag-of-words model in this function.

    Arguments/Return specifications: same as the skip-gram model

    Extra credit: Implementing CBOW is optional, but the gradient
    derivations are not. If you decide not to implement CBOW, remove
    the NotImplementedError.
    """
    # CBOW predicts the center word from the sum of the surrounding word vectors.
    cost = 0
    gradIn = np.zeros(inputVectors.shape)
    gradOut = np.zeros(outputVectors.shape)

    D = inputVectors.shape[1]
    predicted = np.zeros((D,))

    indices = [tokens[cwd] for cwd in contextWords]
    # The input is the sum of the context word vectors.
    for idx in indices:
        predicted += inputVectors[idx, :]

    # tokens[currentWord] is the index of the center word we want to predict.
    cost, gp, gradOut = word2vecCostAndGradient(
        predicted, tokens[currentWord], outputVectors, dataset)

    # Propagate the same gradient back to every context word vector.
    for idx in indices:
        gradIn[idx, :] += gp

    return cost, gradIn, gradOut
```
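On the backward side the only difference from skip-gram is that there is a single call to the cost function, and its gradient `gp` with respect to the summed context vector $\hat{v}$ is added to the `gradIn` row of every context word.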
Reference: https://blog.csdn.net/longxinchen_ml/article/details/51765418