How to set up human review for an NLP-based entity recognition model

Date: 2021-01-18


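Organizations across industries hold large volumes of unstructured data that decision-making teams evaluate to obtain entity-based insights. You may also want to add your own proprietary business entity types, such as proprietary part numbers or industry-specific terms. To build a natural language processing (NLP) model for these entities, however, you first need to label data for them.

Amazon SageMaker Ground Truth makes it easy to build highly accurate training datasets for machine learning (ML), and Amazon Comprehend quickly selects the right algorithm and parameters for the model training job. Finally, Amazon Augmented AI (Amazon A2I) lets you audit, review, and augment the resulting predictions.

This post covers how to build a labeled dataset for custom entities using the Ground Truth named entity recognition (NER) labeling feature, how to train a custom entity recognizer with Amazon Comprehend, and how to use the human review mechanism in Amazon A2I to re-check Amazon Comprehend predictions whose confidence falls below a set threshold.

We use a sample Amazon SageMaker Jupyter notebook to walk through the following steps:

Preprocess the input files.
Create a Ground Truth NER labeling job.
Train an Amazon Comprehend custom entity recognizer model.
Set up a human review loop with Amazon A2I to catch low-confidence predictions.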

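Prerequisites

Before you begin, set up the Jupyter notebook with the following steps:

Create a notebook instance in Amazon SageMaker.

Make sure the Amazon SageMaker notebook has the AWS Identity and Access Management (IAM) roles and permissions listed in the prerequisites section of the notebook.

When the notebook is active, choose Open Jupyter.
On the Jupyter dashboard, choose New, then choose Terminal.
In the terminal, enter the following code: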

cd SageMaker
git clone https://github.com/aws-samples/augmentedai-comprehendner-groundtruth

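In the augmentedai-comprehendner-groundtruth folder, choose SageMakerGT-ComprehendNER-A2I-Notebook.ipynb to open the notebook.

You can now run the following steps in the notebook cells.

Preprocess the input files

In this use case, you are reviewing chat messages or submitted tickets and want to know whether they relate to an AWS offering. We use the NER labeling feature in Ground Truth to label SERVICE or VERSION entities in the input messages. We then train an Amazon Comprehend custom entity recognizer to recognize those entities in text such as tweets or ticket comments.

The sample dataset is available at data/rawinput/aws-service-offerings.txt in the GitHub repo. The following screenshot shows an example of the dataset used in this walkthrough.

Preprocessing the file generates the following files:

inputs.csv – Use this file to generate the input manifest file for Ground Truth NER labeling.
train.csv and test.csv – Use these files as input for custom entity training. You can find these files in the Amazon Simple Storage Service (Amazon S3) bucket.

For how the datasets are generated, see steps 1a and 1b in the notebook.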
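
The notebook holds the actual preprocessing code. As a rough sketch only, assuming the raw file contains one sentence per line and that a simple ordered split is acceptable, the training and test sets could be derived along these lines. The function name `split_lines`, the 80/20 ratio, and the sample sentences are illustrative assumptions, not taken from the notebook.

```python
def split_lines(lines, train_fraction=0.8):
    """Drop blank lines, then split into (train, test) lists, preserving order."""
    cleaned = [ln.strip() for ln in lines if ln.strip()]
    cut = int(len(cleaned) * train_fraction)
    return cleaned[:cut], cleaned[cut:]


# Hypothetical sample lines standing in for the raw input file.
sample = [
    "Amazon EC2 provides resizable compute capacity.",
    "",
    "Amazon S3 is an object storage service.",
    "AWS Lambda runs code without provisioning servers.",
    "Amazon RDS supports several database engines.",
    "Amazon SQS is a message queuing service.",
]
train_docs, test_docs = split_lines(sample)
print(len(train_docs), "training lines,", len(test_docs), "test lines")
```

Keeping the split deterministic makes it easier to line up annotation line numbers with the training file later on.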

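Create a Ground Truth NER labeling job

The goal is to annotate and label the sentences in the input file as belonging to one of our custom entities. In this section, you complete the following steps:

Create the manifest file that Ground Truth needs.
Set up the labeling workforce.
Create the labeling job.
Start the labeling job and verify its output.

Create a manifest file

We use the inputs.csv file generated during preprocessing to create the manifest file that the NER labeling feature needs. We name the generated manifest file prefix + -text-input.manifest and use it for data labeling when creating the Ground Truth job. See the following code: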

# Create and upload the input manifest by wrapping each line of the input text file in a "source" field.
# Ground Truth uses the manifest file to determine labeling tasks.
import json

manifest_name = prefix + '-text-input.manifest'
# Remove existing manifest files to avoid duplicate entries (-f: no error if none exist)
!rm -f *.manifest
s3bucket = s3res.Bucket(BUCKET)
with open(manifest_name, 'w', encoding='utf-8') as f:
    for fn in s3bucket.objects.filter(Prefix=prefix + '/input/'):
        fn_obj = s3res.Object(BUCKET, fn.key)
        for line in fn_obj.get()['Body'].read().splitlines():
            # json.dumps escapes quotes and backslashes so every line stays valid JSON
            f.write(json.dumps({"source": line.decode('utf-8')}) + '\n')
# The with block has closed the file by the time it is uploaded
s3.upload_file(manifest_name, BUCKET, prefix + "/manifest/" + manifest_name)

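The NER labeling job requires each line of the input manifest to be in the format {"source": "embedded text"}. The following screenshot shows the contents of the manifest file generated from inputs.csv.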
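
A quick local check of the manifest before uploading it can save a failed labeling job. The sketch below builds manifest lines and verifies that each one parses as JSON with a string `source` field. The helper names `to_manifest_lines` and `validate_manifest` and the sample texts are assumptions made for illustration.

```python
import json


def to_manifest_lines(texts):
    """Return one JSON object per non-blank text, in the {"source": ...} shape."""
    return [json.dumps({"source": t}) for t in texts if t.strip()]


def validate_manifest(manifest_text):
    """Check that every non-empty line is JSON with a string "source" field."""
    for n, raw in enumerate(manifest_text.splitlines(), start=1):
        if not raw.strip():
            continue
        obj = json.loads(raw)  # raises ValueError on malformed JSON
        if not isinstance(obj.get("source"), str):
            raise ValueError(f"line {n}: missing 'source' string")
    return True


# Hypothetical input; note the embedded double quotes in the first text.
lines = to_manifest_lines(['Amazon S3 offers "11 nines" of durability.', "AWS Lambda"])
manifest = "\n".join(lines) + "\n"
print(manifest)
print(validate_manifest(manifest))
```

Serializing with `json.dumps` instead of joining strings by hand keeps lines valid even when the text itself contains quotes.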

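Create a private labeling workforce

With Ground Truth, we use a private workforce to create a labeled dataset.

You can create the private workforce on the Amazon SageMaker console. For instructions, see the section on creating a private work team in Developing NER models with Amazon SageMaker Ground Truth and Amazon Comprehend.

Alternatively, follow the step-by-step instructions in the notebook.

In this walkthrough, we reuse the same private workforce after custom entity training is complete, to label and augment low-confidence data with Amazon A2I.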

Create a labeling job

The next step is to create the NER labeling job. This post recaps the key steps. For more information, see Adding a data labeling workflow for named entity recognition with Amazon SageMaker Ground Truth.

On the Amazon SageMaker console, under Ground Truth, choose Labeling jobs.
Choose Create labeling job.
For Job name, enter a job name.
For Input dataset location, enter the Amazon S3 location of the input manifest file you created earlier (s3://<bucket>/<path-to-your-manifest.json>).
For Output Dataset Location, enter an S3 bucket with an output prefix (for example, s3://<bucket-name>/output).
For IAM role, choose Create a new Role.
Choose Any S3 Bucket.
Choose Create.
For Task category, choose Text.
Choose Named entity recognition.

Choose Next.
For Worker type, choose Private.
For Private Teams, choose the team you created.
In the Named Entity Recognition Labeling Tool section, for Enter a brief description of the task, enter: Highlight the word or group of words and select the corresponding most appropriate label from the right.
In the Instructions box, enter: Your labeling will be used to train an ML model for predictions. Please think carefully on the most appropriate label for the word selection. Remember to label at least 200 annotations per label type.
Choose Bold Italics.
In the Labels section, enter the label names you want to show to your workforce.
Choose Create.

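Start the labeling job

The workers (or you, if you act as the worker) receive an email with login instructions.

Choose the URL provided and enter your user name and password.

You are directed to the labeling task UI.

Complete the labeling task by choosing labels for groups of words.
Choose Submit.

After you have labeled all the entries, the UI exits automatically.
To check the job status, on the Amazon SageMaker console, under Ground Truth, choose Labeling jobs.
Wait until the job status shows Complete.

Verify the annotation output

To verify the annotation output, open your S3 bucket and go to <S3 Bucket Name>/output/<labeling-job-name>/manifests/output/output.manifest. Here you can review the manifest file that Ground Truth created. The following screenshot shows a sample entry from this walkthrough.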
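
To inspect an output manifest entry without the console, a small parser helps. The sketch below reads the same fields that the conversion code later in this post reads (the labeling job name as a key, then `annotations`, `entities`, `startOffset`, `endOffset`, and `label`). The job name `my-ner-job` and the sample entry are hypothetical, and the slice assumes end offsets are exclusive.

```python
import json


def extract_entities(manifest_line, job_name):
    """Return (text, label, start, end) tuples for one output.manifest line."""
    record = json.loads(manifest_line)
    source = record["source"]
    entities = record.get(job_name, {}).get("annotations", {}).get("entities", [])
    return [
        # Assumption: endOffset is exclusive, so a plain slice recovers the span.
        (source[e["startOffset"]:e["endOffset"]], e["label"], e["startOffset"], e["endOffset"])
        for e in entities
    ]


# Hypothetical entry, shaped like the fields the conversion code reads.
sample_line = json.dumps({
    "source": "Amazon EC2 supports version 2.0",
    "my-ner-job": {
        "annotations": {
            "entities": [
                {"label": "service", "startOffset": 0, "endOffset": 10},
                {"label": "version", "startOffset": 28, "endOffset": 31},
            ]
        }
    },
})
for row in extract_entities(sample_line, "my-ner-job"):
    print(row)
```

Printing the sliced text next to each label is a cheap way to spot offsets that do not line up with what the worker highlighted.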

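Train a custom entity model

We can now use the annotated dataset, that is, the output.manifest file that Ground Truth created, to train a custom entity recognizer. This section walks through the steps in the notebook.

Process the annotated dataset

You can provide labels for Amazon Comprehend custom entities through an entity list or through annotations. In this post, we use annotations generated by the Ground Truth labeling job. You need to convert the annotated output.manifest file to the following CSV format:

File, Line, Begin Offset, End Offset, Type
documents.txt, 0, 0, 11, VERSION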
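
Before handing annotations to a training job, it can be worth checking that every row points at a real span. The following sketch validates the five-column layout shown above against the training documents. The function name `check_annotations`, the sample documents, and the treatment of end offsets as exclusive are assumptions for illustration only.

```python
import csv
import io

EXPECTED_HEADER = ["File", "Line", "Begin Offset", "End Offset", "Type"]


def check_annotations(csv_text, documents, allowed_types):
    """Return a list of problems; an empty list means the rows look sane."""
    reader = csv.reader(io.StringIO(csv_text))
    header = [h.strip() for h in next(reader)]
    if header != EXPECTED_HEADER:
        return [f"unexpected header: {header}"]
    problems = []
    for row_no, row in enumerate(reader, start=2):
        if not row:
            continue
        _file, line, begin, end, etype = [c.strip() for c in row]
        line, begin, end = int(line), int(begin), int(end)
        if line >= len(documents):
            problems.append(f"row {row_no}: line {line} not in documents")
        elif not 0 <= begin < end <= len(documents[line]):
            # Assumption: End Offset is exclusive and may equal the line length.
            problems.append(f"row {row_no}: offsets {begin}-{end} out of range")
        if etype not in allowed_types:
            problems.append(f"row {row_no}: unknown type {etype}")
    return problems


# Hypothetical documents and annotations; the second row is deliberately wrong.
docs = ["Version 2.0 of the SDK", "Amazon S3"]
annotations = (
    "File,Line,Begin Offset,End Offset,Type\n"
    "documents.txt,0,0,11,VERSION\n"
    "documents.txt,1,0,40,SERVICE\n"
)
print(check_annotations(annotations, docs, {"SERVICE", "VERSION"}))
```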

Run the following code in the notebook to generate this annotations.csv file:

# Read the output manifest JSON and convert it to the CSV format expected by the Amazon Comprehend custom entity recognizer.
# labeling_job_name, fntrain, and upload_to_s3 are defined in earlier notebook cells.
import json
import csv

# File written by the format conversion code below
csvout = 'annotations.csv'

# newline='' lets the csv module control line endings
with open(csvout, 'w', encoding="utf-8", newline='') as nf:
    csv_writer = csv.writer(nf)
    csv_writer.writerow(["File", "Line", "Begin Offset", "End Offset", "Type"])
    with open("data/groundtruth/output.manifest", "r", encoding="utf-8") as fr:
        for num, line in enumerate(fr):
            lj = json.loads(line)
            # Skip entries that carry no annotations for this labeling job
            if lj and labeling_job_name in lj:
                for ent in lj[labeling_job_name]['annotations']['entities']:
                    csv_writer.writerow([fntrain, num, ent['startOffset'], ent['endOffset'], ent['label'].upper()])

# Both files are closed by their with blocks before the upload
s3_annot_key = "output/" + labeling_job_name + "/comprehend/" + csvout
upload_to_s3(s3_annot_key, csvout)

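The following screenshot shows the contents of the file.

Set up a custom entity recognizer

This post uses the API, but you can also create the recognizer and the batch analysis job on the Amazon Comprehend console. For instructions, see Build a custom entity recognizer using Amazon Comprehend.

Enter the following code. For s3_train_channel, use the train.csv file generated in the preprocessing step to train the recognizer. For s3_annot_channel, use annotations.csv as the labels for training your custom entity recognizer.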

custom_entity_request = {
    "Documents": {
        "S3Uri": s3_train_channel
    },
    "Annotations": {
        "S3Uri": s3_annot_channel
    },
    "EntityTypes": [
        {
            "Type": "SERVICE"
        },
        {
            "Type": "VERSION"
        }
    ]
}

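Create the entity recognizer with CreateEntityRecognizer. The recognizer is trained with the minimum number of training samples in order to produce some of the low-confidence predictions that the Amazon A2I workflow needs. See the following code: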

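import datetime

# Epoch-seconds suffix; timestamp() is portable, unlike the platform-specific "%s" format code
id = str(int(datetime.datetime.now().timestamp()))
create_custom_entity_response = comprehend.create_entity_recognizer(
    RecognizerName=prefix + "-CER",
    DataAccessRoleArn=role,
    InputDataConfig=custom_entity_request,
    LanguageCode="en"
)
# Keep the recognizer ARN; the detection job below passes it as EntityRecognizerArn
jobArn = create_custom_entity_response["EntityRecognizerArn"]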

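When the entity recognizer job is complete, you get a recognizer along with its performance scores. As mentioned earlier, we trained the recognizer with the minimum number of training samples to produce some of the low-confidence predictions needed for the Amazon A2I workflow. You can find these metrics on the Amazon Comprehend console, as shown in the following screenshot.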
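
Training runs asynchronously, so a script needs to wait for it before starting detection. One possible approach, sketched here under assumptions, is to poll `describe_entity_recognizer` until the status becomes `TRAINED`. The helper name `wait_for_recognizer`, the polling interval, and the polling budget are illustrative choices and do not come from the notebook.

```python
import time


def wait_for_recognizer(comprehend_client, recognizer_arn, poll_seconds=60, max_polls=120):
    """Poll DescribeEntityRecognizer until training finishes or fails."""
    for _ in range(max_polls):
        props = comprehend_client.describe_entity_recognizer(
            EntityRecognizerArn=recognizer_arn
        )["EntityRecognizerProperties"]
        status = props["Status"]
        if status == "TRAINED":
            return props
        if status in ("IN_ERROR", "STOPPED", "DELETING"):
            raise RuntimeError(f"training ended with status {status}: {props.get('Message', '')}")
        time.sleep(poll_seconds)
    raise TimeoutError("recognizer was still training after the polling budget")


# Possible usage with the boto3 client and response from this post:
# props = wait_for_recognizer(comprehend, create_custom_entity_response["EntityRecognizerArn"])
# print(props["RecognizerMetadata"]["EvaluationMetrics"])
```

The returned properties should also include the evaluation metrics (precision, recall, F1 score) that the console displays, which makes it possible to log them alongside the run.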

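Create a batch entity detection analysis job to detect entities across a large number of documents.

Use the Amazon Comprehend StartEntitiesDetectionJob operation to detect custom entities in your documents. For instructions on creating a real-time analysis endpoint with a custom entity recognizer, see the post on launching Amazon Comprehend custom entity recognition real-time endpoints.

To use EntityRecognizerArn for custom entity recognition, you must provide access to the recognizer. The ARN is returned in the response to the CreateEntityRecognizer operation.

Run the custom entity detection job by running the following notebook cell, which makes predictions on the test dataset created in the preprocessing step: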

s3_test_channel = 's3://{}/{}'.format(BUCKET, s3_test_key)
s3_output_test_data = 's3://{}/{}'.format(BUCKET, "output/testresults/")
# jobArn is the recognizer ARN returned by CreateEntityRecognizer
test_response = comprehend.start_entities_detection_job(
    InputDataConfig={
        'S3Uri': s3_test_channel,
        'InputFormat': 'ONE_DOC_PER_LINE'
    },
    OutputDataConfig={'S3Uri': s3_output_test_data},
    DataAccessRoleArn=role,
    JobName='a2i-comprehend-gt-blog',
    EntityRecognizerArn=jobArn,
    LanguageCode='en'
)

The following screenshot shows the test results from this walkthrough.
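
The review loop later in this post iterates over flat records with keys such as `ENTITY`, `ORIGINAL_TEXT`, and `CONFIDENCE_SCORE`. The notebook's own conversion is not shown here, so the sketch below is only one way such records might be produced. It assumes the detection output has been extracted to JSON lines that each carry `Entities`, `File`, and `Line`, and that scores between 0 and 1 are rescaled to percentages. The function name `flatten_predictions` and the sample data are hypothetical.

```python
import json


def flatten_predictions(output_lines, documents):
    """Turn detection output lines into one flat record per detected entity."""
    records = []
    for raw in output_lines:
        if not raw.strip():
            continue
        result = json.loads(raw)
        text = documents[result["Line"]]
        for ent in result.get("Entities", []):
            records.append({
                "ORIGINAL_TEXT": text,
                "ENTITY": ent["Text"],
                "TYPE": ent["Type"],
                "BEGIN_OFFSET": ent["BeginOffset"],
                "END_OFFSET": ent["EndOffset"],
                # Assumption: rescale the 0-1 score to a percentage.
                "CONFIDENCE_SCORE": round(ent["Score"] * 100, 2),
            })
    return records


# Hypothetical test document and output line.
test_documents = ["I am using Amazon EC2 version 2.0"]
output_lines = [json.dumps({
    "Entities": [
        {"BeginOffset": 11, "EndOffset": 21, "Score": 0.8741,
         "Text": "Amazon EC2", "Type": "SERVICE"}
    ],
    "File": "test.csv",
    "Line": 0,
})]
data_sketch = flatten_predictions(output_lines, test_documents)
print(data_sketch)
```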

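Set up a human review loop

In this section, we set up a human review loop for low-confidence detections in Amazon A2I, which includes the following steps:

Choose the workforce.
Create a human task UI.
Create a worker task template creator function.
Create the flow definition.
Check the human loop status and wait for reviewers to complete the task.

Choose the workforce

In this post, we use the private workforce created for the Ground Truth labeling job. Use the workforce ARN to set up the workforce for Amazon A2I.

Create a human task UI

Create a human task UI resource from a UI template written in Liquid HTML. This template is used whenever a human loop is required.

The following sample code has been tested and is compatible with Amazon Comprehend entity detection: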

template = """
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<style>
  .highlight {
    background-color: yellow;
  }
</style>
<crowd-entity-annotation
  name="crowd-entity-annotation"
  header="Highlight parts of the text below"
  labels="[{'label': 'service', 'fullDisplayName': 'Service'}, {'label': 'version', 'fullDisplayName': 'Version'}]"
  text="{{ task.input.originalText }}"
>
  <full-instructions header="Named entity recognition instructions">
    <ol>
      <li><strong>Read</strong> the text carefully.</li>
      <li><strong>Highlight</strong> words, phrases, or sections of the text.</li>
      <li><strong>Choose</strong> the label that best matches what you have highlighted.</li>
      <li>To <strong>change</strong> a label, choose highlighted text and select a new label.</li>
      <li>To <strong>remove</strong> a label from highlighted text, choose the X next to the abbreviated label name on the highlighted text.</li>
      <li>You can select all of a previously highlighted text, but not a portion of it.</li>
    </ol>
  </full-instructions>
  <short-instructions>
    Select the word or words in the displayed text corresponding to the entity, label it and click submit
  </short-instructions>
  <div id="recognizedEntities" style="margin-top: 20px">
    <h3>Label the Entity below in the text above</h3>
    <p>{{ task.input.entities }}</p>
  </div>
</crowd-entity-annotation>
<script>
  function highlight(text) {
    var inputText = document.getElementById("inputText");
    var innerHTML = inputText.innerHTML;
    var index = innerHTML.indexOf(text);
    if (index >= 0) {
      innerHTML = innerHTML.substring(0,index) + "<span class='highlight'>" + innerHTML.substring(index,index+text.length) + "</span>" + innerHTML.substring(index + text.length);
      inputText.innerHTML = innerHTML;
    }
  }
  document.addEventListener('all-crowd-elements-ready', () => {
    document
      .querySelector('crowd-entity-annotation')
      .shadowRoot
      .querySelector('crowd-form')
      .form
      .appendChild(recognizedEntities);

  });
</script>

"""

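Create a worker task template creator function

This function is a higher-level abstraction over the Amazon SageMaker package's methods for creating the human review workflow. See the following code: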

def create_task_ui():
    '''
    Creates a Human Task UI resource.

    Returns:
        dict: the CreateHumanTaskUi response, which contains HumanTaskUiArn
    '''
    # sagemaker is the boto3 SageMaker client created earlier in the notebook
    response = sagemaker.create_human_task_ui(
        HumanTaskUiName=taskUIName,
        UiTemplate={'Content': template})
    return response

# Task UI name - this value is unique per account and region. You can also provide your own value here.
taskUIName = prefix + '-ui'
# Create task UI
humanTaskUiResponse = create_task_ui()
humanTaskUiArn = humanTaskUiResponse['HumanTaskUiArn']
print(humanTaskUiArn)

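Create the flow definition

You can specify the following in the flow definition:

The workforce that the tasks are sent to
The labeling instructions that the workforce receives

This post uses the API, but you can also create this workflow definition on the Amazon A2I console.

For more information, see how to create a flow definition.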
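
For orientation, here is a minimal sketch of what a CreateFlowDefinition request for this kind of custom task might look like. The helper name `build_flow_definition_request`, every ARN, the bucket path, and the task title and description are placeholders chosen for illustration. The notebook's own values may differ.

```python
def build_flow_definition_request(name, role_arn, workteam_arn, task_ui_arn, output_s3_path):
    """Assemble keyword arguments for SageMaker's CreateFlowDefinition call."""
    return {
        "FlowDefinitionName": name,
        "RoleArn": role_arn,
        "HumanLoopConfig": {
            "WorkteamArn": workteam_arn,      # who receives the tasks
            "HumanTaskUiArn": task_ui_arn,    # the worker task template
            "TaskCount": 1,                   # workers reviewing each item
            "TaskTitle": "Review low-confidence entities",
            "TaskDescription": "Correct or confirm the entities detected in the text",
        },
        "OutputConfig": {"S3OutputPath": output_s3_path},
    }


# Hypothetical values throughout.
request = build_flow_definition_request(
    name="example-ner-review-flow",
    role_arn="arn:aws:iam::111122223333:role/example-a2i-role",
    workteam_arn="arn:aws:sagemaker:us-east-1:111122223333:workteam/private-crowd/example-team",
    task_ui_arn="arn:aws:sagemaker:us-east-1:111122223333:human-task-ui/example-ui",
    output_s3_path="s3://example-bucket/a2i-results",
)
print(sorted(request))

# With a boto3 SageMaker client named sagemaker, as used in this post:
# flowDefinitionArn = sagemaker.create_flow_definition(**request)["FlowDefinitionArn"]
```

Building the request as a plain dictionary first makes it easy to review or log the configuration before any resource is created.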

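To set the condition that triggers a human loop review, enter the following code (you can set the CONFIDENCE_SCORE_THRESHOLD value to adjust the confidence level that triggers human review):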

import json
import uuid

human_loops_started = []
CONFIDENCE_SCORE_THRESHOLD = 90
for line in data:
    print("Line is: " + str(line))
    begin_offset = line['BEGIN_OFFSET']
    end_offset = line['END_OFFSET']
    # Send only predictions below the threshold to human review
    if line['CONFIDENCE_SCORE'] < CONFIDENCE_SCORE_THRESHOLD:
        humanLoopName = str(uuid.uuid4())
        human_loop_input = {}
        human_loop_input['labels'] = line['ENTITY']
        human_loop_input['entities'] = line['ENTITY']
        human_loop_input['originalText'] = line['ORIGINAL_TEXT']
        start_loop_response = a2i_runtime_client.start_human_loop(
            HumanLoopName=humanLoopName,
            FlowDefinitionArn=flowDefinitionArn,
            HumanLoopInput={
                "InputContent": json.dumps(human_loop_input)
            }
        )
        print(human_loop_input)
        human_loops_started.append(humanLoopName)
        print(f'Score is less than the threshold of {CONFIDENCE_SCORE_THRESHOLD}')
        print(f'Starting human loop with name: {humanLoopName} \n')
    else:
        print('No human loop created. \n')
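
The threshold logic can be exercised without calling AWS. The sketch below separates the selection step from the API call, using the same record keys as the loop above. The helper names `select_for_review` and `to_loop_input` and the sample predictions are assumptions, and the scores are taken to be percentages because the threshold is 90.

```python
import json


def select_for_review(predictions, threshold):
    """Split predictions into (needs_review, auto_accepted) by confidence score."""
    needs_review, accepted = [], []
    for pred in predictions:
        target = needs_review if pred["CONFIDENCE_SCORE"] < threshold else accepted
        target.append(pred)
    return needs_review, accepted


def to_loop_input(pred):
    """Build the JSON string passed as InputContent for one prediction."""
    return json.dumps({
        "labels": pred["ENTITY"],
        "entities": pred["ENTITY"],
        "originalText": pred["ORIGINAL_TEXT"],
    })


# Hypothetical predictions on a 0-100 confidence scale.
predictions = [
    {"ENTITY": "Amazon EC2", "ORIGINAL_TEXT": "I use Amazon EC2", "CONFIDENCE_SCORE": 98.2},
    {"ENTITY": "2.0", "ORIGINAL_TEXT": "Upgraded to 2.0", "CONFIDENCE_SCORE": 61.5},
    {"ENTITY": "S3", "ORIGINAL_TEXT": "Stored in S3", "CONFIDENCE_SCORE": 90.0},
]
review, accepted = select_for_review(predictions, threshold=90)
print(len(review), "for review,", len(accepted), "accepted")
print(to_loop_input(review[0]))
```

A score exactly equal to the threshold is accepted here, matching the strict less-than comparison in the loop above.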

Check the human loop status and wait for reviewers to complete the task

To check the status of the human loops, enter the following code:

completed_human_loops = []
for human_loop_name in human_loops_started:
    resp = a2i_runtime_client.describe_human_loop(HumanLoopName=human_loop_name)
    print(f'HumanLoop Name: {human_loop_name}')
    print(f'HumanLoop Status: {resp["HumanLoopStatus"]}')
    print(f'HumanLoop Output Destination: {resp["HumanLoopOutput"]}')
    print('\n')
    # Collect finished loops so their results can be read later
    if resp["HumanLoopStatus"] == "Completed":
        completed_human_loops.append(resp)
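
Once loops are complete, the reviewed labels live in a JSON object at the output destination. As a sketch under assumptions, the helpers below split an S3 URI into bucket and key and pull the submitted entities out of an already-downloaded output document. The layout (`humanAnswers`, `answerContent`, then a key matching the template's element name) reflects how A2I output is commonly structured but is not shown in this post, and the sample document is hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key/parts' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def reviewed_entities(loop_output, element_name="crowd-entity-annotation"):
    """Collect the entities each worker submitted for one completed loop."""
    collected = []
    for answer in loop_output.get("humanAnswers", []):
        content = answer.get("answerContent", {})
        collected.extend(content.get(element_name, {}).get("entities", []))
    return collected


# Hypothetical output document for a single completed loop.
sample_output = {
    "humanLoopName": "example-loop",
    "inputContent": {"originalText": "Upgraded to 2.0"},
    "humanAnswers": [
        {"answerContent": {"crowd-entity-annotation": {
            "entities": [{"label": "version", "startOffset": 12, "endOffset": 15}]
        }}}
    ],
}
print(split_s3_uri("s3://example-bucket/a2i-results/example-loop/output.json"))
print(reviewed_entities(sample_output))

# Possible usage with a response collected above:
# bucket, key = split_s3_uri(resp["HumanLoopOutput"]["OutputS3Uri"])
```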

Navigate to the private workforce portal (its URL is the output of cell 2 from the previous step in the notebook). See the following code:

workteamName = WORKTEAM_ARN[WORKTEAM_ARN.rfind('/') + 1:]
print("Navigate to the private worker portal and do the tasks. Make sure you've invited yourself to your workteam!")
print('https://' + sagemaker.describe_workteam(WorkteamName=workteamName)['Workteam']['SubDomain'])

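This UI template is similar to the Ground Truth NER labeling feature. Amazon A2I displays the entities identified in the input text (that is, the low-confidence predictions). Workers can then update or validate the entity labels as needed and choose Submit.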