In the article NLP入門(四)命名實體識別(NER), I introduced two tools for implementing named entity recognition: NLTK and Stanford NLP. In this article, we will learn how to implement NER step by step with deep learning tools. If you stick with it to the end, you will definitely get a lot out of it.
OK, without further ado, let's get to it.
Almost all NLP work relies on a solid corpus. The corpus used in this project is shown below (the file is train.txt, 42,000 lines in total; only the first 15 lines are shown here, and the full file can be downloaded from the GitHub address at the end of this article):
played on Monday ( home team in CAPS ) :
VBD IN NNP ( NN NN IN NNP ) :
O O O O O O O O O O
American League
NNP NNP
B-MISC I-MISC
Cleveland 2 DETROIT 1
NNP CD NNP CD
B-ORG O B-ORG O
BALTIMORE 12 Oakland 11 ( 10 innings )
VB CD NNP CD ( CD NN )
B-ORG O B-ORG O O O O O
TORONTO 5 Minnesota 3
TO CD NNP CD
B-ORG O B-ORG O
......
A quick note on the structure of this corpus: it contains 42,000 lines, grouped in threes. Within each group, the first line is an English sentence, the second line gives the part-of-speech tag of each word (for English POS tags, see the article NLP入門(三)詞形還原(Lemmatization)), and the third line is the NER annotation; the meaning of these tags is explained later.
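To make the three-line grouping concrete, here is a minimal, self-contained sketch (independent of the project code that follows) that reads the corpus and zips each sentence's words, POS tags and NER tags together; the file path and the whitespace-based split are assumptions for illustration (the project's own loader splits on tabs).

# Minimal sketch: read train.txt and group every three lines into
# (word, POS tag, NER tag) triples. The path is an assumption.
with open('train.txt', 'r') as f:
    lines = [line.strip() for line in f.readlines()]

sentences = []
for i in range(0, len(lines), 3):
    words = lines[i].split()        # splits on any whitespace (tabs in this corpus)
    pos_tags = lines[i + 1].split()
    ner_tags = lines[i + 2].split()
    sentences.append(list(zip(words, pos_tags, ner_tags)))

print(sentences[1])
# [('American', 'NNP', 'B-MISC'), ('League', 'NNP', 'I-MISC')]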
Our NER project is named DL_4_NER, and its structure is as follows:
The purpose of each file in the project is as follows:
Next, I will walk through the project step by step along with the code files. Once all the steps have been covered, the project is complete, and you will know how to implement named entity recognition (NER) with deep learning.
Let's begin!
The first step is the project configuration and data loading, implemented in utils.py. The complete code is as follows:
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd

# basic settings for the DL_4_NER project
BASE_DIR = "F://NERSystem"
CORPUS_PATH = "%s/train.txt" % BASE_DIR

KERAS_MODEL_SAVE_PATH = '%s/Bi-LSTM-4-NER.h5' % BASE_DIR
WORD_DICTIONARY_PATH = '%s/word_dictionary.pk' % BASE_DIR
InVERSE_WORD_DICTIONARY_PATH = '%s/inverse_word_dictionary.pk' % BASE_DIR
LABEL_DICTIONARY_PATH = '%s/label_dictionary.pk' % BASE_DIR
OUTPUT_DICTIONARY_PATH = '%s/output_dictionary.pk' % BASE_DIR

CONSTANTS = [
             KERAS_MODEL_SAVE_PATH,
             InVERSE_WORD_DICTIONARY_PATH,
             WORD_DICTIONARY_PATH,
             LABEL_DICTIONARY_PATH,
             OUTPUT_DICTIONARY_PATH
             ]

# load the data from the corpus into a pandas DataFrame
def load_data():
    with open(CORPUS_PATH, 'r') as f:
        text_data = [text.strip() for text in f.readlines()]
    text_data = [text_data[k].split('\t') for k in range(0, len(text_data))]
    index = range(0, len(text_data), 3)

    # transform the data into matrix format for the neural network
    # (note: this loop skips the last two sentence groups, hence the
    #  13998 sentences reported below instead of 14000)
    input_data = list()
    for i in range(1, len(index) - 1):
        rows = text_data[index[i-1]:index[i]]
        sentence_no = np.array([i]*len(rows[0]), dtype=str)
        rows.append(sentence_no)
        rows = np.array(rows).T
        input_data.append(rows)

    input_data = pd.DataFrame(np.concatenate([item for item in input_data]),
                              columns=['word', 'pos', 'tag', 'sent_no'])

    return input_data
This code first sets the corpus path CORPUS_PATH, the Keras model save path KERAS_MODEL_SAVE_PATH, and the save paths of the four dictionaries used throughout the project (stored as pickle files): WORD_DICTIONARY_PATH, InVERSE_WORD_DICTIONARY_PATH, LABEL_DICTIONARY_PATH and OUTPUT_DICTIONARY_PATH. The load_data() function then loads the corpus into a pandas DataFrame, whose first 30 rows look like this:
         word  pos     tag sent_no
0      played  VBD       O       1
1          on   IN       O       1
2      Monday  NNP       O       1
3           (    (       O       1
4        home   NN       O       1
5        team   NN       O       1
6          in   IN       O       1
7        CAPS  NNP       O       1
8           )    )       O       1
9           :    :       O       1
10   American  NNP  B-MISC       2
11     League  NNP  I-MISC       2
12  Cleveland  NNP   B-ORG       3
13          2   CD       O       3
14    DETROIT  NNP   B-ORG       3
15          1   CD       O       3
16  BALTIMORE   VB   B-ORG       4
17         12   CD       O       4
18    Oakland  NNP   B-ORG       4
19         11   CD       O       4
20          (    (       O       4
21         10   CD       O       4
22    innings   NN       O       4
23          )    )       O       4
24    TORONTO   TO   B-ORG       5
25          5   CD       O       5
26  Minnesota  NNP   B-ORG       5
27          3   CD       O       5
28  Milwaukee  NNP   B-ORG       6
29          3   CD       O       6
In this DataFrame, the word column holds the words of the corpus, the pos column their part-of-speech tags, the tag column the NER annotations, and the sent_no column the index of the sentence each word belongs to.
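If you want to verify this step interactively, a quick check (purely illustrative) could look like this:

# Quick look at the loaded corpus, run from the project directory.
from utils import load_data

input_data = load_data()
print(input_data.shape)                  # (number of tokens, 4)
print(input_data.head(10))               # word / pos / tag / sent_no columns
print(input_data['sent_no'].nunique())   # number of sentences (13998 here)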
The second step is data exploration, i.e. reviewing the loaded data (input_data). The complete code (data_processing.py) is as follows:
# -*- coding: utf-8 -*-
import pickle
import numpy as np
from collections import Counter
from itertools import accumulate
from operator import itemgetter
import matplotlib.pyplot as plt
import matplotlib as mpl

from utils import BASE_DIR, CONSTANTS, load_data

# font settings for matplotlib plots
mpl.rcParams['font.sans-serif'] = ['SimHei']

# data review
def data_review():
    # load the data
    input_data = load_data()

    # basic review of the data
    sent_num = input_data['sent_no'].astype(np.int).max()
    print("There are %s sentences in total.\n" % sent_num)

    vocabulary = input_data['word'].unique()
    print("There are %d distinct words in total." % len(vocabulary))
    print("The first 10 words are: %s.\n" % vocabulary[:10])

    pos_arr = input_data['pos'].unique()
    print("POS tag list: %s.\n" % pos_arr)

    ner_tag_arr = input_data['tag'].unique()
    print("NER tag list: %s.\n" % ner_tag_arr)

    df = input_data[['word', 'sent_no']].groupby('sent_no').count()
    sent_len_list = df['word'].tolist()
    print("Sentence length / frequency dictionary:\n%s." % dict(Counter(sent_len_list)))

    # bar chart of sentence length versus frequency
    sort_sent_len_dist = sorted(dict(Counter(sent_len_list)).items(), key=itemgetter(0))
    sent_no_data = [item[0] for item in sort_sent_len_dist]
    sent_count_data = [item[1] for item in sort_sent_len_dist]
    plt.bar(sent_no_data, sent_count_data)
    plt.title("Sentence length distribution")
    plt.xlabel("Sentence length")
    plt.ylabel("Frequency")
    plt.savefig("%s/sentence_length_distribution.png" % BASE_DIR)
    plt.close()

    # cumulative distribution function (CDF) of the sentence lengths
    sent_pentage_list = [(count/sent_num) for count in accumulate(sent_count_data)]

    # find the sentence length at the chosen quantile
    quantile = 0.9992
    # print(list(sent_pentage_list))
    for length, per in zip(sent_no_data, sent_pentage_list):
        if round(per, 4) == quantile:
            index = length
            break
    print("\nSentence length at quantile %s: %d." % (quantile, index))

    # plot the CDF
    plt.plot(sent_no_data, sent_pentage_list)
    plt.hlines(quantile, 0, index, colors="c", linestyles="dashed")
    plt.vlines(index, 0, quantile, colors="c", linestyles="dashed")
    plt.text(0, quantile, str(quantile))
    plt.text(index, 0, str(index))
    plt.title("CDF of sentence length")
    plt.xlabel("Sentence length")
    plt.ylabel("Cumulative frequency")
    plt.savefig("%s/sentence_length_cdf.png" % BASE_DIR)
    plt.close()

# data processing
def data_processing():
    # load the data
    input_data = load_data()

    # label and vocabulary lists
    labels, vocabulary = list(input_data['tag'].unique()), list(input_data['word'].unique())

    # dictionaries (index 0 is reserved for padding)
    word_dictionary = {word: i+1 for i, word in enumerate(vocabulary)}
    inverse_word_dictionary = {i+1: word for i, word in enumerate(vocabulary)}
    label_dictionary = {label: i+1 for i, label in enumerate(labels)}
    output_dictionary = {i+1: labels for i, labels in enumerate(labels)}

    # note: this list is zipped against CONSTANTS[1:], so the later scripts
    # load the pickles back in exactly this order
    dict_list = [word_dictionary, inverse_word_dictionary, label_dictionary, output_dictionary]

    # save the dictionaries as pickle files
    for dict_item, path in zip(dict_list, CONSTANTS[1:]):
        with open(path, 'wb') as f:
            pickle.dump(dict_item, f)

#data_review()
Calling data_review() produces the following output:
There are 13998 sentences in total.

There are 24339 distinct words in total.
The first 10 words are: ['played' 'on' 'Monday' '(' 'home' 'team' 'in' 'CAPS' ')' ':'].

POS tag list: ['VBD' 'IN' 'NNP' '(' 'NN' ')' ':' 'CD' 'VB' 'TO' 'NNS' ',' 'VBP' 'VBZ' '.' 'VBG' 'PRP$' 'JJ' 'CC' 'JJS' 'RB' 'DT' 'VBN' '"' 'PRP' 'WDT' 'WRB' 'MD' 'WP' 'POS' 'JJR' 'WP$' 'RP' 'NNPS' 'RBS' 'FW' '$' 'RBR' 'EX' "''" 'PDT' 'UH' 'SYM' 'LS' 'NN|SYM'].

NER tag list: ['O' 'B-MISC' 'I-MISC' 'B-ORG' 'I-ORG' 'B-PER' 'B-LOC' 'I-PER' 'I-LOC' 'sO'].

Sentence length / frequency dictionary:
{1: 177, 2: 1141, 3: 620, 4: 794, 5: 769, 6: 639, 7: 999, 8: 977, 9: 841, 10: 501, 11: 395, 12: 316, 13: 339, 14: 291, 15: 275, 16: 225, 17: 229, 18: 212, 19: 197, 20: 221, 21: 228, 22: 221, 23: 230, 24: 210, 25: 207, 26: 224, 27: 188, 28: 199, 29: 214, 30: 183, 31: 202, 32: 167, 33: 167, 34: 141, 35: 130, 36: 119, 37: 105, 38: 112, 39: 98, 40: 78, 41: 74, 42: 63, 43: 51, 44: 42, 45: 39, 46: 19, 47: 22, 48: 19, 49: 15, 50: 16, 51: 8, 52: 9, 53: 5, 54: 4, 55: 9, 56: 2, 57: 2, 58: 2, 59: 2, 60: 3, 62: 2, 66: 1, 67: 1, 69: 1, 71: 1, 72: 1, 78: 1, 80: 1, 113: 1, 124: 1}.

Sentence length at quantile 0.9992: 60.
This corpus contains 13,998 sentences, two fewer than the expected 42,000 / 3 = 14,000 (the loop in load_data() stops early and drops the last two sentence groups). There are 24,339 distinct words, which is a fairly large vocabulary; note that no preprocessing is applied here, the words are kept exactly as they appear in the corpus (this could be optimized later). For the POS tags, see the article NLP入門(三)詞形還原(Lemmatization). What we need to pay attention to is the NER tag list ['O', 'B-MISC', 'I-MISC', 'B-ORG', 'I-ORG', 'B-PER', 'B-LOC', 'I-PER', 'I-LOC', 'sO']: this project distinguishes four entity types, PER (person), LOC (location), ORG (organization) and MISC, where B marks the beginning of an entity, I marks a token inside (continuing) an entity, O marks a token outside any entity (not counted as part of a named entity), and sO is a special tag peculiar to this corpus.
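To make the B/I/O scheme concrete, here is a small, self-contained helper (purely illustrative, not one of the project files) that turns a tag sequence back into entity spans:

# Illustrative only: reconstruct entity spans from a BIO tag sequence.
def bio_to_entities(words, tags):
    entities, current_words, current_type = [], [], None
    for word, tag in zip(words, tags):
        if tag.startswith('B-'):
            if current_words:                       # close the previous entity
                entities.append((current_type, ' '.join(current_words)))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith('I-') and current_words:
            current_words.append(word)              # continue the current entity
        else:                                       # 'O' or anything else closes it
            if current_words:
                entities.append((current_type, ' '.join(current_words)))
            current_words, current_type = [], None
    if current_words:
        entities.append((current_type, ' '.join(current_words)))
    return entities

print(bio_to_entities(['American', 'League'], ['B-MISC', 'I-MISC']))
# [('MISC', 'American League')]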
Next, let's look at the sentence lengths; this will guide the choice of padding length when we build the model later. The bar chart of sentence length versus frequency is shown below:
As you can see, sentence lengths are almost all below 60, which is also visible in the length/frequency dictionary printed above. So can we pick a principled value to use as the padding length later on? The answer is yes: use a quantile of the cumulative distribution of sentence lengths. Here we choose the 0.9992 quantile, which corresponds to a sentence length of 60, as shown in the figure below:
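Besides reading the value off the CDF plot, you can also compute it directly; a short check with numpy (purely illustrative) gives essentially the same answer:

# Compute the padding length directly from the sentence-length distribution.
import numpy as np
from utils import load_data

input_data = load_data()
sent_lengths = input_data.groupby('sent_no')['word'].count().values
print(np.percentile(sent_lengths, 99.92))   # about 60, matching the CDF reading above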
Next comes the data_processing() function. Its job is to build the word and label dictionaries and save them as pickle files, so that they can be loaded directly later on.
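Once data_processing() has been run, the pickled dictionaries can be reloaded at any time. A quick check (illustrative only, loading them back in the same order that the later scripts use) might look like this:

# Reload the pickled dictionaries written by data_processing().
import pickle
from utils import CONSTANTS

# data_processing() saves the word, inverse_word, label and output dictionaries
# zipped against CONSTANTS[1:], so we load them back in that same order.
with open(CONSTANTS[1], 'rb') as f:
    word_dictionary = pickle.load(f)
with open(CONSTANTS[3], 'rb') as f:
    label_dictionary = pickle.load(f)

print(word_dictionary['played'])   # 1, since 'played' is the first word in the vocabulary
print(label_dictionary['O'])       # 1, since 'O' is the first tag encountered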
In the third step, we build and train a Bi-LSTM model. The complete Python code (Bi_LSTM_Model_training.py) is as follows:
# -*- coding: utf-8 -*-
import pickle
import numpy as np
import pandas as pd
from utils import BASE_DIR, CONSTANTS, load_data
from data_processing import data_processing
from keras.utils import np_utils, plot_model
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Bidirectional, LSTM, Dense, Embedding, TimeDistributed

# prepare the input data for the model
def input_data_for_model(input_shape):
    # load the data
    input_data = load_data()
    # run the data processing step (writes the dictionaries to disk)
    data_processing()

    # load the dictionaries
    with open(CONSTANTS[1], 'rb') as f:
        word_dictionary = pickle.load(f)
    with open(CONSTANTS[2], 'rb') as f:
        inverse_word_dictionary = pickle.load(f)
    with open(CONSTANTS[3], 'rb') as f:
        label_dictionary = pickle.load(f)
    with open(CONSTANTS[4], 'rb') as f:
        output_dictionary = pickle.load(f)

    vocab_size = len(word_dictionary.keys())
    label_size = len(label_dictionary.keys())

    # group the tokens by sentence
    aggregate_function = lambda input: [(word, pos, label) for word, pos, label in
                                        zip(input['word'].values.tolist(),
                                            input['pos'].values.tolist(),
                                            input['tag'].values.tolist())]

    grouped_input_data = input_data.groupby('sent_no').apply(aggregate_function)
    sentences = [sentence for sentence in grouped_input_data]

    # pad the word indices and labels to input_shape (0 is the padding value)
    x = [[word_dictionary[word[0]] for word in sent] for sent in sentences]
    x = pad_sequences(maxlen=input_shape, sequences=x, padding='post', value=0)
    y = [[label_dictionary[word[2]] for word in sent] for sent in sentences]
    y = pad_sequences(maxlen=input_shape, sequences=y, padding='post', value=0)
    y = [np_utils.to_categorical(label, num_classes=label_size + 1) for label in y]

    return x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary

# define the deep learning model: Bi-LSTM
def create_Bi_LSTM(vocab_size, label_size, input_shape, output_dim, n_units, out_act, activation):
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size + 1, output_dim=output_dim,
                        input_length=input_shape, mask_zero=True))
    model.add(Bidirectional(LSTM(units=n_units, activation=activation,
                                 return_sequences=True)))
    model.add(TimeDistributed(Dense(label_size + 1, activation=out_act)))
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# model training
def model_train():
    # split the data into a training set and a test set with a 9:1 ratio
    input_shape = 60
    x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary = input_data_for_model(input_shape)
    train_end = int(len(x)*0.9)
    train_x, train_y = x[0:train_end], np.array(y[0:train_end])
    test_x, test_y = x[train_end:], np.array(y[train_end:])

    # model hyper-parameters
    activation = 'selu'
    out_act = 'softmax'
    n_units = 100
    batch_size = 32
    epochs = 10
    output_dim = 20

    # train the model
    lstm_model = create_Bi_LSTM(vocab_size, label_size, input_shape, output_dim, n_units, out_act, activation)
    lstm_model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=1)

    # save the model
    model_save_path = CONSTANTS[0]
    lstm_model.save(model_save_path)
    plot_model(lstm_model, to_file='%s/LSTM_model.png' % BASE_DIR)

    # evaluate on the test set
    N = test_x.shape[0]  # number of test samples
    avg_accuracy = 0     # average prediction accuracy
    for start, end in zip(range(0, N, 1), range(1, N+1, 1)):
        sentence = [inverse_word_dictionary[i] for i in test_x[start] if i != 0]
        y_predict = lstm_model.predict(test_x[start:end])
        input_sequences, output_sequences = [], []
        for i in range(0, len(y_predict[0])):
            output_sequences.append(np.argmax(y_predict[0][i]))
            input_sequences.append(np.argmax(test_y[start][i]))

        eval = lstm_model.evaluate(test_x[start:end], test_y[start:end])
        print('Test Accuracy: loss = %0.6f accuracy = %0.2f%%' % (eval[0], eval[1] * 100))
        avg_accuracy += eval[1]
        output_sequences = ' '.join([output_dictionary[key] for key in output_sequences if key != 0]).split()
        input_sequences = ' '.join([output_dictionary[key] for key in input_sequences if key != 0]).split()
        output_input_comparison = pd.DataFrame([sentence, output_sequences, input_sequences]).T
        print(output_input_comparison.dropna())
        print('#' * 80)

    avg_accuracy /= N
    print("Average prediction accuracy on the test samples: %.2f%%." % (avg_accuracy * 100))

model_train()
In the code above, input_data_for_model() prepares the data that is fed into the model; its parameter input_shape is the length to which every sentence is padded. create_Bi_LSTM() then builds the Bi-LSTM model, whose structure is shown in the diagram below:
Finally, the model is trained on the prepared data: the original data is split into a training set and a test set with a 9:1 ratio, and training runs for 10 epochs.
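If you want to sanity-check the shapes flowing through the network, you can rebuild it with the same hyper-parameters and print its summary; the vocab_size and label_size values below are taken from the data_review() output above and are used purely for illustration:

# Illustrative only: rebuild the network with the same hyper-parameters
# as create_Bi_LSTM() and inspect the layer output shapes.
from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, Dense, Embedding, TimeDistributed

vocab_size, label_size, input_shape = 24339, 10, 60   # values from data_review() above
model = Sequential()
model.add(Embedding(input_dim=vocab_size + 1, output_dim=20,
                    input_length=input_shape, mask_zero=True))           # (None, 60, 20)
model.add(Bidirectional(LSTM(units=100, activation='selu',
                             return_sequences=True)))                    # (None, 60, 200)
model.add(TimeDistributed(Dense(label_size + 1, activation='softmax')))  # (None, 60, 11)
model.summary()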
Running the training code above for 10 epochs takes roughly 500 seconds. The accuracy on the training set exceeds 99%, and the average accuracy on the test set exceeds 95%. Here are the predictions on the last few test samples:
......(earlier output omitted)

Test Accuracy: loss = 0.000986 accuracy = 100.00%
          0      1      2
0   Cardiff  B-ORG  B-ORG
1         1      O      O
2  Brighton  B-ORG  B-ORG
3         0      O      O
################################################################################
1/1 [==============================] - 0s 10ms/step
Test Accuracy: loss = 0.000274 accuracy = 100.00%
          0      1      2
0  Carlisle  B-ORG  B-ORG
1         0      O      O
2      Hull  B-ORG  B-ORG
3         0      O      O
################################################################################
1/1 [==============================] - 0s 9ms/step
Test Accuracy: loss = 0.000479 accuracy = 100.00%
           0      1      2
0    Chester  B-ORG  B-ORG
1          1      O      O
2  Cambridge  B-ORG  B-ORG
3          1      O      O
################################################################################
1/1 [==============================] - 0s 9ms/step
Test Accuracy: loss = 0.003092 accuracy = 100.00%
            0      1      2
0  Darlington  B-ORG  B-ORG
1           4      O      O
2     Swansea  B-ORG  B-ORG
3           1      O      O
################################################################################
1/1 [==============================] - 0s 8ms/step
Test Accuracy: loss = 0.000705 accuracy = 100.00%
             0      1      2
0       Exeter  B-ORG  B-ORG
1            2      O      O
2  Scarborough  B-ORG  B-ORG
3            2      O      O
################################################################################
Average prediction accuracy on the test samples: 95.55%.
The model's recognition performance on the original data is quite respectable.
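One caveat: the per-token accuracy above also rewards the model for the very frequent 'O' tag, so it tends to look flattering. If you want a stricter entity-level score, one option (not used in this project) is the seqeval package; a minimal sketch, assuming seqeval is installed and that you have collected the true and predicted tag sequences from the test loop, might look like this:

# Optional: entity-level precision/recall/F1 with seqeval (pip install seqeval).
# y_true and y_pred would normally be the tag sequences collected in the test
# loop above; the two short sequences here are placeholders for illustration.
from seqeval.metrics import classification_report, f1_score

y_true = [['B-ORG', 'O', 'B-ORG', 'O']]
y_pred = [['B-ORG', 'O', 'B-ORG', 'O']]
print(f1_score(y_true, y_pred))
print(classification_report(y_true, y_pred))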
After training, BASE_DIR contains the following files:
Finally, we come to what is perhaps the most exciting part of the whole project: testing the model on new data. The complete Python code for predicting new sentences (Bi_LSTM_Model_predict.py) is as follows:
# -*- coding: utf-8 -*-
# Named entity recognition for new data

# import the necessary modules
import pickle
import numpy as np
from utils import CONSTANTS
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
from nltk import word_tokenize

# load the dictionaries
with open(CONSTANTS[1], 'rb') as f:
    word_dictionary = pickle.load(f)
with open(CONSTANTS[4], 'rb') as f:
    output_dictionary = pickle.load(f)

try:
    # data preprocessing
    input_shape = 60
    sent = 'New York is the biggest city in America.'
    new_sent = word_tokenize(sent)
    new_x = [[word_dictionary[word] for word in new_sent]]
    x = pad_sequences(maxlen=input_shape, sequences=new_x, padding='post', value=0)

    # load the trained model
    model_save_path = CONSTANTS[0]
    lstm_model = load_model(model_save_path)

    # model prediction
    y_predict = lstm_model.predict(x)
    ner_tag = []
    for i in range(0, len(new_sent)):
        ner_tag.append(np.argmax(y_predict[0][i]))
    ner = [output_dictionary[i] for i in ner_tag]
    print(new_sent)
    print(ner)

    # drop the tokens whose NER tag is 'O'
    ner_reg_list = []
    for word, tag in zip(new_sent, ner):
        if tag != 'O':
            ner_reg_list.append((word, tag))

    # output the model's NER recognition results
    print("NER recognition results:")
    if ner_reg_list:
        for i, item in enumerate(ner_reg_list):
            if item[1].startswith('B'):
                end = i+1
                # extend the entity while the following tags start with 'I'
                while end <= len(ner_reg_list)-1 and ner_reg_list[end][1].startswith('I'):
                    end += 1

                ner_type = item[1].split('-')[1]
                ner_type_dict = {'PER': 'PERSON: ',
                                 'LOC': 'LOCATION: ',
                                 'ORG': 'ORGANIZATION: ',
                                 'MISC': 'MISC: '
                                 }
                print(ner_type_dict[ner_type],
                      ' '.join([item[0] for item in ner_reg_list[i:end]]))
    else:
        print("The model did not recognize any valid named entities.")

except KeyError as err:
    print("The sentence you entered contains a word that is not in the vocabulary, please try again!")
    print("The word not in the vocabulary is: %s." % err)
The output is:
['New', 'York', 'is', 'the', 'biggest', 'city', 'in', 'America', '.']
['B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O']
NER recognition results:
LOCATION: New York
LOCATION: America
Next, let's test three sentences that I made up myself:
Input:
sent = 'James is a world famous actor, whose home is in London.'
Output:
['James', 'is', 'a', 'world', 'famous', 'actor', ',', 'whose', 'home', 'is', 'in', 'London', '.']
['B-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O']
NER recognition results:
PERSON: James
LOCATION: London
Input:
sent = 'Oxford is in England, Jack is from here.'
Output:
['Oxford', 'is', 'in', 'England', ',', 'Jack', 'is', 'from', 'here', '.']
['B-PER', 'O', 'O', 'B-LOC', 'O', 'B-PER', 'O', 'O', 'O', 'O']
NER recognition results:
PERSON: Oxford
LOCATION: England
PERSON: Jack
Input:
sent = 'I love Shanghai.'
Output:
['I', 'love', 'Shanghai', '.']
['O', 'O', 'B-LOC', 'O']
NER recognition results:
LOCATION: Shanghai
In the examples above, only Oxford is handled poorly: the model tags it as PERSON, while in this sentence it should really be a LOCATION (or ORGANIZATION if it referred to the university).
Next are three sentences taken from CNN and Wikipedia:
Input:
sent = "the US runs the risk of a military defeat by China or Russia"
Output:
['the', 'US', 'runs', 'the', 'risk', 'of', 'a', 'military', 'defeat', 'by', 'China', 'or', 'Russia']
['O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O', 'B-LOC']
NER recognition results:
LOCATION: US
LOCATION: China
LOCATION: Russia
Input:
sent = "Home to the headquarters of the United Nations, New York is an important center for international diplomacy."
Output:
['Home', 'to', 'the', 'headquarters', 'of', 'the', 'United', 'Nations', ',', 'New', 'York', 'is', 'an', 'important', 'center', 'for', 'international', 'diplomacy', '.']
['O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
NER recognition results:
ORGANIZATION: United Nations
LOCATION: New York
Input:
sent = "The United States is a founding member of the United Nations, World Bank, International Monetary Fund."
Output:
['The', 'United', 'States', 'is', 'a', 'founding', 'member', 'of', 'the', 'United', 'Nations', ',', 'World', 'Bank', ',', 'International', 'Monetary', 'Fund', '.']
['O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'B-ORG', 'I-ORG', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O']
NER recognition results:
LOCATION: United States
ORGANIZATION: United Nations
ORGANIZATION: World Bank
ORGANIZATION: International Monetary Fund
All three of these examples are recognized correctly.
That more or less wraps up the project, so it is worth summing it up.
First, the strengths. This project lets you implement NER step by step: apart from the corpus itself, you now basically know the steps needed to build an NER system, and you have a deeper understanding of deep learning models and their applications, so the benefit is clear. Of course, in real-world work it is preparing the corpus that takes the most time, easily 90% of the effort or more, so you need a good corpus before any of this can start.
Now the shortcomings. First, the corpus is not very large; about 14,000 sentences is workable, but no text preprocessing is applied, so some inflected word forms may never make it into the vocabulary. Second, there is no handling of unseen words: as soon as a sentence contains a word outside the vocabulary, the model cannot process it, which is something to improve later (a minimal sketch of one possible fix follows below). Third, the padding length is 60, so for input sentences longer than 60 tokens the tail cannot be recognized effectively.
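Regarding the second shortcoming, one common fix is to reserve an index for unknown words when the dictionaries are built and to fall back to it at prediction time. The sketch below is illustrative only and is not implemented in this project:

# Sketch only: reserve an <UNK> index so unseen words do not raise a KeyError.
# word_dictionary here stands in for the one built by data_processing();
# index 0 stays reserved for padding, so <UNK> takes the next free index.
word_dictionary = {'played': 1, 'on': 2, 'Monday': 3}   # illustrative subset
UNK_INDEX = len(word_dictionary) + 1

def words_to_indices(words, dictionary, unk_index):
    return [dictionary.get(word, unk_index) for word in words]

print(words_to_indices(['played', 'on', 'Friday'], word_dictionary, UNK_INDEX))
# [1, 2, 4]: 'Friday' is unseen and falls back to the <UNK> index

# Note: the Embedding layer's input_dim would need to grow to vocab_size + 2,
# and the model would have to be retrained for the <UNK> index to be meaningful.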
There is therefore plenty of follow-up work to do; building a Chinese NER system is also worth considering.
The project has been uploaded to GitHub at https://github.com/percent4/D... . You are welcome to use it as a reference~
Note: I now run a WeChat official account, Python爬蟲與算法 (WeChat ID: easy_web_scrape). Feel free to follow it~~