Generating Text Summaries with BERT

Author | Daulet Nurmanbetov
Compiled by | VK
Source | Towards Data Science

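Have you ever needed to condense a long document into a summary, or to write an abstract for one? As you know, this process is tedious and slow for us humans: we have to read the whole document, focus on the important sentences, and finally rewrite those sentences into a coherent summary.

This is where automatic summarization can help. Machine learning has made great progress in summarization, but there is still plenty of room for improvement. Machine summarization generally falls into two types:

Extractive summarization: if an important sentence appears in the original document, extract it.

Abstractive summarization: summarize the important ideas or facts in the document without repeating its wording. This is what we usually have in mind when asked to summarize a document.

I would like to show you some recent results from abstractive summarization with BERT_Sum_Abs, from the paper Text Summarization with Pretrained Encoders by Yang Liu and Mirella Lapata: https://arxiv.org/pdf/1908.08...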

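Performance of BERT abstractive summarization

Summarization aims to compress a document into a shorter version while preserving most of its meaning. The abstractive task requires language generation capabilities to create summaries containing new words and phrases that do not appear in the source document. Extractive summarization is usually defined as a binary classification task whose labels indicate whether a text span (usually a sentence) should be included in the summary.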
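To make the binary-classification framing of extractive summarization concrete, here is a minimal Python sketch. It is not how BERT-based models work: a toy word-frequency score stands in for a trained classifier, and the function names (split_sentences, label_sentences, extractive_summary) are hypothetical names chosen for this illustration.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def label_sentences(text, keep=2):
    """Return (sentence, label) pairs; label 1 means 'include in the summary'.

    A trained extractive model would predict these labels. Here a toy
    word-frequency score stands in for the classifier (an assumption made
    purely for illustration).
    """
    sentences = split_sentences(text)
    word_counts = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        # Average document-level frequency of the words in this sentence.
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(word_counts[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    selected = set(ranked[:keep])
    return [(s, 1 if i in selected else 0) for i, s in enumerate(sentences)]


def extractive_summary(text, keep=2):
    """Join the sentences labelled 1, preserving document order."""
    return " ".join(s for s, label in label_sentences(text, keep) if label == 1)
```

Every sentence in the output is copied verbatim from the input, which is exactly what separates an extractive summary from an abstractive one.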

BERT_Sum_Abs is evaluated on the standard summarization datasets CNN and Daily Mail, which are commonly used for benchmarking. The evaluation metric is the ROUGE F1 score.
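As a rough illustration of what a ROUGE F1 score measures, the sketch below computes ROUGE-N F1 from clipped n-gram overlap between a candidate summary and a reference. It is a simplification that assumes lowercasing and whitespace tokenization, with no stemming; it is not the py-rouge implementation, and the names ngrams and rouge_n_f1 are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n=1):
    """Simplified ROUGE-N F1 between two strings."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipped matches).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())  # share of candidate n-grams matched
    recall = overlap / sum(ref.values())      # share of reference n-grams matched
    return 2 * precision * recall / (precision + recall)
```

Precision rewards summaries that contain little beyond the reference, recall rewards summaries that cover the reference, and F1 balances the two.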

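The results show that BERT_Sum_Abs outperforms most non-Transformer-based models. Better still, the code behind the model is open source, and an implementation is available on GitHub (https://github.com/huggingfac...).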

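Demo and code

Let's walk through an example by summarizing an article. We will use the following article, in which a Federal Reserve official says central bankers are aligned in their response to the coronavirus. Here is the full text: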

The Federal Reserve Bank of New York president, John C. Williams, made clear on Thursday evening that officials viewed the emergency rate cut they approved earlier this week as part of an international push to cushion the economy as the coronavirus threatens global growth.
Mr. Williams, one of the Fed’s three key leaders, spoke in New York two days after the Fed slashed borrowing costs by half a point in its first emergency move since the depths of the 2008 financial crisis. The move came shortly after a call between finance ministers and central bankers from the Group of 7, which also includes Britain, Canada, France, Germany, Italy and Japan.
“Tuesday’s phone call between G7 finance ministers and central bank governors, the subsequent statement, and policy actions by central banks are clear indications of the close alignment at the international level,” Mr. Williams said in a speech to the Foreign Policy Association.
Rate cuts followed in Canada, Asia and the Middle East on Wednesday. The Bank of Japan and European Central Bank — which already have interest rates set below zero — have yet to further cut borrowing costs, but they have pledged to support their economies.
Mr. Williams’s statement is significant, in part because global policymakers were criticized for failing to satisfy market expectations for a coordinated rate cut among major economies. Stock prices temporarily rallied after the Fed’s announcement, but quickly sank again.
Central banks face challenges in offsetting the economic shock of the coronavirus.
Many were already working hard to stoke stronger economic growth, so they have limited room for further action. That makes the kind of carefully orchestrated, lock step rate cut central banks undertook in October 2008 all but impossible.
Interest rate cuts can also do little to soften the near-term hit from the virus, which is forcing the closure of offices and worker quarantines and delaying shipments of goods as infections spread across the globe.
“It’s up to individual countries, individual fiscal policies and individual central banks to do what they were going to do,” Fed Chair Jerome H. Powell said after the cut, noting that different nations had “different situations.”
Mr. Williams reiterated Mr. Powell’s pledge that the Fed would continue monitoring risks in the “weeks and months” ahead. Economists widely expect another quarter-point rate cut at the Fed’s March 18 meeting.
The New York Fed president, whose reserve bank is partly responsible for ensuring financial markets are functioning properly, also promised that the Fed stood ready to act as needed to make sure that everything is working smoothly.
Since September, when an obscure but crucial corner of money markets experienced unusual volatility, the Fed has been temporarily intervening in the market to keep it calm. The goal is to keep cash flowing in the market for overnight and short-term loans between banks and other financial institutions. The central bank has also been buying short-term government debt.
“We remain flexible and ready to make adjustments to our operations as needed to ensure that monetary policy is effectively implemented and transmitted to financial markets and the broader economy,” Mr. Williams said Thursday.

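First, we need to get the model code, install the dependencies, and download the datasets, as shown below. You can easily run these steps on your own Linux machine: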

# Install Hugging Face Transformers
git clone https://github.com/huggingface/transformers && cd transformers
pip install .
pip install nltk py-rouge
cd examples/summarization

#------------------------------
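# Download the raw summarization datasets from Google Drive (Linux).
# The first wget of each pair prints a confirmation code; substitute it for <CONFIRMATION CODE HERE> in the second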
wget --save-cookies cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/Code: \1\n/p'
wget --load-cookies cookies.txt --no-check-certificate 'https://drive.google.com/uc?export=download&confirm=<CONFIRMATION CODE HERE>&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ' -O cnn_stories.tgz

wget --save-cookies cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfM1BxdkxVaTY2bWs' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/Code: \1\n/p'
wget --load-cookies cookies.txt --no-check-certificate 'https://drive.google.com/uc?export=download&confirm=<CONFIRMATION CODE HERE>&id=0BwmD_VLjROrfM1BxdkxVaTY2bWs' -O dailymail_stories.tgz

# Extract the archives
tar -xvf cnn_stories.tgz && tar -xvf dailymail_stories.tgz
rm cnn_stories.tgz dailymail_stories.tgz

# Gather the articles in one place
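# Assumes the archives unpacked to cnn/stories and dailymail/stories in the current directory
mkdir -p bertabs/dataset
mkdir -p bertabs/summaries_out
cp -r cnn/stories/. bertabs/dataset/
cp -r dailymail/stories/. bertabs/dataset/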

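# Pick a subset of articles to summarize
mkdir -p bertabs/dataset2
# Run in a subshell so the working directory is unchanged for the next command
(cd bertabs/dataset && find . -maxdepth 1 -type f | head -1000 | xargs cp -t ../dataset2/)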
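
If you would rather pick the subset from Python than with find, head, and xargs, something like the following sketch should behave similarly. The helper name copy_subset is hypothetical, and unlike the shell pipeline it sorts the file names first so that the selection is reproducible.

```python
import shutil
from pathlib import Path


def copy_subset(src_dir, dst_dir, limit=1000):
    """Copy up to `limit` regular files from src_dir into dst_dir.

    Subdirectories are skipped, mirroring `find . -maxdepth 1 -type f`.
    Returns the names of the copied files.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    # Sort for a deterministic choice (find makes no ordering guarantee).
    files = sorted(p for p in src.iterdir() if p.is_file())
    copied = []
    for path in files[:limit]:
        shutil.copy2(path, dst / path.name)
        copied.append(path.name)
    return copied
```

With the directory layout used in the commands above, the call would be copy_subset("bertabs/dataset", "bertabs/dataset2", limit=1000).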

After running the code above, we run the python command below to summarize the documents in the bertabs/dataset2 directory:

python run_summarization.py \
    --documents_dir bertabs/dataset2 \
    --summaries_output_dir bertabs/summaries_out \
    --batch_size 64 \
    --min_length 50 \
    --max_length 200 \
    --beam_size 5 \
    --alpha 0.95 \
    --block_trigram true \
    --compute_rouge true

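The parameters are as follows:

documents_dir, the folder containing the documents

summaries_output_dir, the folder the summaries are written to. Defaults to the folder containing the documents

batch_size, the batch size per GPU/CPU used when running the model

beam_size, the number of beams to start with for each example

block_trigram, whether to block repeated trigrams in the text generated by beam search

compute_rouge, compute the ROUGE metrics during evaluation. Only available for the CNN/DailyMail dataset

alpha, the alpha value for the length penalty in beam search (the larger the value, the larger the penalty)

min_length, the minimum number of tokens in a summary

max_length, the maximum number of tokens in a summary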
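
Two of these options, alpha and block_trigram, are easier to grasp with a small sketch. The code below assumes the GNMT-style length penalty ((5 + length) / 6) ** alpha, which is a common choice in beam search. Treat that formula as an assumption rather than a statement of what run_summarization.py does internally. All three function names are hypothetical.

```python
def length_penalty(length, alpha):
    """GNMT-style length penalty (assumed form): ((5 + length) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha


def normalized_score(log_prob, length, alpha):
    """Beam score: summed log-probability divided by the length penalty."""
    return log_prob / length_penalty(length, alpha)


def has_repeated_trigram(tokens):
    """Return True if any three-token sequence occurs more than once.

    Trigram blocking discards beam hypotheses for which this is True.
    """
    seen = set()
    for i in range(len(tokens) - 2):
        trigram = tuple(tokens[i:i + 3])
        if trigram in seen:
            return True
        seen.add(trigram)
    return False
```

Under this assumed formula, a larger alpha increases the divisor for hypotheses longer than one token. Because summed log-probabilities are negative, that offsets the handicap long hypotheses accumulate. For example, normalized_score(-10.0, 7, 1.0) is -5.0, whereas with alpha set to 0 the score stays at -10.0.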

Once BERT_Sum_Abs finishes, we get the following summary:

The Fed slashed borrowing costs by half a point in its first emergency move since the depths of the 2008 financial crisis. Rate cuts followed in Canada, Asia and the Middle East on Wednesday. The Bank of Japan and European Central Bank have yet to further cut borrowing costs, but they have pledged to support their economies.

Here is another English article: https://news.stonybrook.edu/n...

The resulting summary is:

The research team focused on the Presymptomatic period during which prevention may be most effective. They showed that communication between brain regions destabilizes with age, typically in the late 40's, and that destabilization associated with poorer cognition. The good news is that we may be able to prevent or reverse these effects with diet, mitigating the impact of encroaching Hypometabolism by exchanging glucose for ketones as fuel for neurons.

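Conclusion

As you can see, BERT is improving many aspects of NLP. This means that every day we see NLP performance getting closer to human level, all while the work remains open source.

Commercial NLP products are approaching: each new NLP model not only sets new benchmark records but is also available for anyone to use. Just as OCR technology was commoditized 10 years ago, NLP will be commoditized in the next few years.

Original article: https://towardsdatascience.co...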

