A Quick Introduction to Natural Language Processing (NLP)


 

https://mp.weixin.qq.com/s/J-vndnycZgwVrSlDCefHZA

 

 

[Overview] Natural language processing (NLP) has become an important branch of artificial intelligence. It studies the theories and methods that allow people and computers to communicate effectively in natural language. This article offers a brief introduction to NLP to help readers get started quickly.

 

Author | George Seif

Translated by | Xiaowen

 

                       

An easy introduction to Natural Language Processing

Using computers to understand human language

Computers are very good at working with standardized and structured data such as database tables and financial records. They can process that data far faster than we humans can. But we humans don't communicate in "structured data", and we don't speak binary! We communicate with words, which are unstructured data.

 

Unfortunately, computers struggle with unstructured data because there are no standardized techniques for processing it. When we program computers in a language such as C, Java, or Python, we are essentially giving the computer a set of rules to operate by. With unstructured data, those rules are quite abstract and hard to define concretely.



 

The internet is full of unstructured natural language; sometimes even Google doesn't know what you're searching for!

 




How humans and computers understand language



Humans have been writing for thousands of years. Over that time, our brains have gained a great deal of experience in understanding natural language. When we read something on a piece of paper or in a blog post on the internet, we understand what it really means in the real world. We feel the emotions the text evokes, and we often picture what that thing would look like in real life.

 

Natural language processing (NLP) is a subfield of artificial intelligence dedicated to enabling computers to understand and process human language, bringing them closer to a human-level understanding of language. Computers don't yet have the intuitive grasp of natural language that humans do; they can't really understand what a piece of language is actually trying to say. In short, computers can't read between the lines.

 

That said, recent advances in machine learning (ML) have enabled computers to do quite a few useful things with natural language! Deep learning lets us write programs that perform tasks such as language translation, semantic understanding, and text summarization. All of these add real-world value by letting you easily understand and run computations over large blocks of text without doing the work by hand.

 

Let's start with a quick primer on how NLP works conceptually. After that, we'll dig into some Python code so you can start using NLP yourself!

 


 

Why NLP is really hard



The process of reading and understanding language is far more complex than it looks at first glance. There is a lot involved in truly grasping what a piece of text means in the real world. For example, what do you think the following piece of text means?

 

"Steph Curry was on fire last night. He totally destroyed the other team"

 

To a person, the meaning of this sentence is obvious. We know Steph Curry is a basketball player, and even if you didn't, we know he plays on some kind of team, probably a sports team. When we see "on fire" and "destroyed", we know it means Steph Curry played very well last night and beat the other team.

 

Computers tend to take things far too literally. Taken literally, we see "Steph Curry" and, based on the capitalization, assume it is a person, a place, or something else important. But then we see that Steph Curry "was on fire"… a computer might tell you that someone literally set Steph Curry on fire last night! …oops. After that, the computer might conclude that Curry destroyed the other team… they no longer exist… great…

 

 

Steph Curry, literally on fire!

 

But not everything machines do has to be this crude. Thanks to machine learning, we can actually do some very clever things to quickly extract and understand information from natural language! Let's look at how to do that in a few lines of code, using a couple of simple Python libraries.

 


 

Solving NLP problems with Python code

 

To see how NLP works, we'll use the following text from Wikipedia as our running example:



Amazon.com, Inc., doing business as Amazon, is an American electronic commerce and cloud computing company based in Seattle, Washington, that was founded by Jeff Bezos on July 5, 1994. The tech giant is the largest Internet retailer in the world as measured by revenue and market capitalization, and second largest after Alibaba Group in terms of total sales. The amazon.com website started as an online bookstore and later diversified to sell video downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, software, video games, electronics, apparel, furniture, food, toys, and jewelry. The company also produces consumer electronics - Kindle e-readers, Fire tablets, Fire TV, and Echo - and is the world's largest provider of cloud infrastructure services (IaaS and PaaS). Amazon also sells certain low-end products under its in-house brand AmazonBasics.

 

A few libraries we'll need



First, we'll install a few useful Python NLP libraries that will help us analyze this text.

 

### Installing spaCy, general Python NLP lib 
 
pip3 install spacy 
 
### Downloading the English dictionary model for spaCy 
 
python3 -m spacy download en_core_web_lg 
 
### Installing textacy, basically a useful add-on to spaCy 
 
pip3 install textacy
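 
Before moving on, a quick sanity check (assuming the commands above finished without errors) is to load the model and print its processing pipeline:
 
### Quick sanity check that spaCy and the English model are available 
import spacy 
nlp = spacy.load('en_core_web_lg')   # raises an OSError if the model was not downloaded 
print(nlp.pipe_names)                # the pipeline components, e.g. tagger/parser/ner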

 

Entity analysis



Now that everything is installed, we can run a quick entity analysis of our text. Entity analysis goes through the text and identifies all of the important words, or "entities", in it. When we say "important", what we really mean is words that carry some kind of real-world semantic meaning or significance.

 

Check out the code below, which runs the full entity analysis for us:

 

# coding: utf-8 

import spacy 

### Load spaCy's English NLP model 
nlp = spacy.load('en_core_web_lg') 

### The text we want to examine 
text = """Amazon.com, Inc., doing business as Amazon, is an American electronic commerce and cloud computing company based in Seattle, Washington, that was founded by Jeff Bezos on July 5, 1994. The tech giant is the largest Internet retailer in the world as measured by revenue and market capitalization, and second largest after Alibaba Group in terms of total sales. The amazon.com website started as an online bookstore and later diversified to sell video downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, software, video games, electronics, apparel, furniture, food, toys, and jewelry. The company also produces consumer electronics - Kindle e-readers, Fire tablets, Fire TV, and Echo - and is the world's largest provider of cloud infrastructure services (IaaS and PaaS). Amazon also sells certain low-end products under its in-house brand AmazonBasics.""" 

### Parse the text with spaCy 
### Our 'document' variable now contains a parsed version of the text 
document = nlp(text) 

### Print out all the named entities that were detected 
for entity in document.ents: 
    print(entity.text, entity.label_)



We first load spaCy's pre-trained ML model and set up the text we want to process, then run the model on the text to extract the entities. When you run that code, you get the following output:



Amazon.com, Inc. ORG 
Amazon ORG 
American NORP 
Seattle GPE 
Washington GPE 
Jeff Bezos PERSON 
July 5, 1994 DATE 
second ORDINAL 
Alibaba Group ORG 
amazon.com ORG 
Fire TV ORG 
Echo -  LOC 
PaaS ORG 
Amazon ORG 
AmazonBasics ORG

 

The codes beside each piece of text[1] are labels indicating the type of entity we're looking at. It looks like our model did a pretty good job! Jeff Bezos is indeed a person, the date is correct, Amazon is an organization, and Seattle and Washington are both geopolitical entities (i.e. countries, cities, states, and so on). The only tricky ones are things like Fire TV and Echo, which are actually products rather than organizations. The model also missed the other products Amazon sells, "video downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, software, video games, electronics, apparel, furniture, food, toys, and jewelry", probably because they appear in one big list and therefore look relatively unimportant.
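 
If you forget what one of these labels stands for, spaCy can print a short description for you (spacy.explain is part of spaCy's public API):
 
### Look up what an entity label means 
import spacy 
print(spacy.explain("GPE"))    # Countries, cities, states 
print(spacy.explain("NORP"))   # Nationalities or religious or political groups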

 

Overall, our model has done what we wanted. Imagine you had a huge document with hundreds of pages of text; this NLP model could quickly give you a sense of what the document is about and what its key entities are.
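 
As a minimal sketch of that idea (reusing the document variable from the code above), counting how often each entity is mentioned gives a rough "who and what is this about" summary of a long document:
 
### Count entity mentions to get a quick overview of a large document 
from collections import Counter 

entity_counts = Counter((ent.text, ent.label_) for ent in document.ents) 
for (entity_text, label), freq in entity_counts.most_common(10): 
    print(entity_text, label, freq)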

 

Doing things with entities

 

Let's try something more applied. Say you have the same block of text as above, but for privacy reasons you want to automatically remove the names of all people and organizations. With spaCy we can write a very handy scrub function that removes any entity categories we don't want to see, as shown below:



# coding: utf-8 

import spacy 

### Load spaCy's English NLP model 
nlp = spacy.load('en_core_web_lg') 

### The text we want to examine 
text = """Amazon.com, Inc., doing business as Amazon, is an American electronic commerce and cloud computing company based in Seattle, Washington, that was founded by Jeff Bezos on July 5, 1994. The tech giant is the largest Internet retailer in the world as measured by revenue and market capitalization, and second largest after Alibaba Group in terms of total sales. The amazon.com website started as an online bookstore and later diversified to sell video downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, software, video games, electronics, apparel, furniture, food, toys, and jewelry. The company also produces consumer electronics - Kindle e-readers, Fire tablets, Fire TV, and Echo - and is the world's largest provider of cloud infrastructure services (IaaS and PaaS). Amazon also sells certain low-end products under its in-house brand AmazonBasics.""" 

### Replace a person or organization entity with the placeholder "[PRIVATE]" 
### Note: ent.merge() and token.string are spaCy 2.x APIs 
def replace_entity_with_placeholder(token): 
    if token.ent_iob != 0 and (token.ent_type_ == "PERSON" or token.ent_type_ == "ORG"): 
        return "[PRIVATE] " 
    else: 
        return token.string 

### Loop through all the entities in a piece of text and apply entity replacement 
def scrub(text): 
    doc = nlp(text) 
    for ent in doc.ents: 
        ent.merge() 
    tokens = map(replace_entity_with_placeholder, doc) 
    return "".join(tokens) 

print(scrub(text))



 

 

It works great! And this is actually a really powerful technique. People use ctrl+f on their computers all the time to find and replace words in a document. But with NLP we can find and replace specific entities, taking their semantic meaning into account rather than just their raw text.
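 
As a small variation on the scrub function above (a quick sketch, reusing the same spaCy 2.x calls such as ent.merge() and token.string), the entity types to hide can be passed in as a parameter, so the same code can redact dates or places instead of names:
 
### Same idea, but the entity types to hide are passed in as a parameter 
def scrub_types(text, types_to_hide): 
    doc = nlp(text) 
    for ent in doc.ents: 
        ent.merge() 
    return "".join( 
        "[PRIVATE] " if token.ent_iob != 0 and token.ent_type_ in types_to_hide 
        else token.string 
        for token in doc 
    ) 

print(scrub_types(text, {"DATE", "GPE"}))   # hide dates and places instead of names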

 

Extracting information from text



The textacy library we installed earlier implements several common NLP information-extraction algorithms on top of spaCy. It lets us do a few things that are a bit more advanced than what we get out of the box.

 

One of the algorithms it implements is semi-structured statement extraction. In essence, it builds on the information that spaCy's NLP model can extract and uses it to pull out more specific information about certain entities! In short, we can extract certain "facts" about an entity of our choosing.

 

Let's see what that looks like in code. For this one, we'll take the full summary from the Wikipedia page for Washington, D.C.



# coding: utf-8 
 
import spacy 
import textacy.extract 
 
### Load spaCy's English NLP model 
nlp = spacy.load('en_core_web_lg') 
 
### The text we want to examine 
text = """Washington, D.C., formally the District of Columbia and commonly referred to as Washington or D.C., is the capital of the United States of America.[4] Founded after the American Revolution as the seat of government of the newly independent country, Washington was named after George Washington, first President of the United States and Founding Father.[5] Washington is the principal city of the Washington metropolitan area, which has a population of 6,131,977.[6] As the seat of the United States federal government and several international organizations, the city is an important world political capital.[7] Washington is one of the most visited cities in the world, with more than 20 million annual tourists.[8][9] 
The signing of the Residence Act on July 16, 1790, approved the creation of a capital district located along the Potomac River on the country's East Coast. The U.S. Constitution provided for a federal district under the exclusive jurisdiction of the Congress and the District is therefore not a part of any state. The states of Maryland and Virginia each donated land to form the federal district, which included the pre-existing settlements of Georgetown and Alexandria. Named in honor of President George Washington, the City of Washington was founded in 1791 to serve as the new national capital. In 1846, Congress returned the land originally ceded by Virginia; in 1871, it created a single municipal government for the remaining portion of the District. 
Washington had an estimated population of 693,972 as of July 2017, making it the 20th largest American city by population. Commuters from the surrounding Maryland and Virginia suburbs raise the city's daytime population to more than one million during the workweek. The Washington metropolitan area, of which the District is the principal city, has a population of over 6 million, the sixth-largest metropolitan statistical area in the country. 
All three branches of the U.S. federal government are centered in the District: U.S. Congress (legislative), President (executive), and the U.S. Supreme Court (judicial). Washington is home to many national monuments and museums, which are primarily situated on or around the National Mall. The city hosts 177 foreign embassies as well as the headquarters of many international organizations, trade unions, non-profit, lobbying groups, and professional associations, including the Organization of American States, AARP, the National Geographic Society, the Human Rights Campaign, the International Finance Corporation, and the American Red Cross. 
A locally elected mayor and a 13‑member council have governed the District since 1973. However, Congress maintains supreme authority over the city and may overturn local laws. D.C. residents elect a non-voting, at-large congressional delegate to the House of Representatives, but the District has no representation in the Senate. The District receives three electoral votes in presidential elections as permitted by the Twenty-third Amendment to the United States Constitution, ratified in 1961.""" 
### Parse the text with spaCy 
### Our 'document' variable now contains a parsed version of text. 
document = nlp(text) 
 
### Extracting semi-structured statements 
statements = textacy.extract.semistructured_statements(document, "Washington") 
 
print("**** Information from Washington's Wikipedia page ****") 
count = 1 
for statement in statements: 
    subject, verb, fact = statement 
    print(str(count) + " - Statement: ", statement) 
    print(str(count) + " - Fact: ", fact) 
    count += 1

 

 

 

Our NLP model found three useful facts about Washington, D.C. in this text:

(1) Washington is the capital of the United States

(2) Washington's population, and the fact that it is a metropolitan area

(3) Its many national monuments and museums

The best part is that these are the most important pieces of information in this block of text!
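 
If you just want an even quicker, shallower skim of the same page, spaCy's built-in noun_chunks iterator (not used in the code above, but part of spaCy's standard API) pulls out the noun phrases:
 
### Noun phrases give another quick, rough summary of the same document 
for chunk in list(document.noun_chunks)[:10]: 
    print(chunk.text)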

 


 

Digging deeper into NLP



That wraps up our quick introduction to NLP. We've learned a lot, but this was only a small taste…

 

NLP has many more great applications, such as language translation, chatbots, and more specific and involved analyses of text documents. Much of today's work is done with deep learning, in particular recurrent neural networks (RNNs) and long short-term memory (LSTM) networks.
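 
As a rough illustration of what such a model can look like in code (a minimal sketch with made-up layer sizes, not something this article covers), here is a tiny LSTM text classifier in Keras:
 
### A minimal LSTM text classifier (hypothetical sizes, for illustration only) 
import tensorflow as tf 

model = tf.keras.Sequential([ 
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),  # 10,000-word vocabulary 
    tf.keras.layers.LSTM(64),                                    # long short-term memory layer 
    tf.keras.layers.Dense(1, activation="sigmoid"),              # e.g. positive/negative sentiment 
]) 
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"]) 
model.summary()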

 

If you want to play with NLP some more yourself, the spaCy docs[2] and the textacy docs[3] are a great place to start! You'll find many examples of ways to work with parsed text and extract very useful information from it. Everything is fast and simple, and you can get a lot of value out of it. When you're ready for bigger and better things, dive into deep learning!

 

References:

[1] https://spacy.io/usage/linguistic-features#entity-types

[2] https://spacy.io/api/doc

[3] http://textacy.readthedocs.io/en/latest/



Original article:

https://towardsdatascience.com/an-easy-introduction-to-natural-language-processing-b1e2801291c1

 



-END-
