Hands-on tutorial: build a natural language processing model in Python and deploy it with Flask

Date: 2019-12-13



Summary: A practical tutorial that shows you how to quickly build a working machine learning application.

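So far we have developed many machine learning models, generated numeric predictions on test data, and evaluated the results. In practice, generating predictions is only one part of a machine learning project, although in my view it is the most important one. Today we will build a natural language processing model for document classification and spam filtering, using machine learning to detect spam SMS text messages. The workflow of our ML system is: offline training -> serving the model as a service -> online prediction.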

1. Train a classifier offline on spam and non-spam messages.

2. Deploy the trained model as a service that serves users.

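When we develop a machine learning model, we need to think about how to deploy it, that is, how to make the model available to other users. Kaggle and data science bootcamps are great for learning how to build and optimize models, but they do not teach engineers how to bring those models to other users. There is a major difference between building a model and actually providing products and services to people.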

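In this article, we focus on building a machine learning model for SMS spam classification and then creating an API for the model with Flask, a Python micro-framework for building web applications. This API lets users access the prediction functionality through HTTP requests. Let's get started!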

Building the ML model

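The data is a collection of SMS messages labeled as spam or ham (legitimate), which can be found here. First, we will use this dataset to build a prediction model that accurately classifies which texts are spam. Naive Bayes classifiers are a popular statistical technique for email filtering, and they typically use bag-of-words features to identify spam. We will therefore build a simple message classifier with naive Bayes.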

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Load the data and drop the unused 'Unnamed' columns
df = pd.read_csv('spam.csv', encoding="latin-1")
df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)
# Encode the labels (the code expects 'class' and 'message' columns): ham -> 0, spam -> 1
df['label'] = df['class'].map({'ham': 0, 'spam': 1})
X = df['message']
y = df['label']
# Turn each message into a bag-of-words count vector
cv = CountVectorizer()
X = cv.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out test set
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
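To make the bag-of-words idea concrete, here is a minimal sketch on a made-up four-message corpus (the messages and labels are invented for illustration and are not taken from spam.csv). It shows that each message becomes a vector of word counts, and that new text has to pass through the same fitted vectorizer before it reaches the classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus (made up for illustration; not taken from spam.csv).
messages = [
    "win a free prize now",
    "free cash prize claim now",
    "are we meeting for lunch today",
    "see you at lunch tomorrow",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# Bag of words: each column counts how often one vocabulary word occurs.
cv = CountVectorizer()
X = cv.fit_transform(messages)
vocabulary = sorted(cv.vocabulary_)

clf = MultinomialNB()
clf.fit(X, labels)

# New text must go through the SAME fitted vectorizer before prediction.
new_messages = ["claim your free prize", "lunch today"]
predictions = clf.predict(cv.transform(new_messages)).tolist()
print(vocabulary)
print(predictions)
```

Words the vectorizer never saw during fitting (such as "your" here) are simply ignored at prediction time.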

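The Naive Bayes classifier is not only easy to implement but also performs very well. After training the model, we want a way to persist it for future use without retraining. To do this, we add the following lines to save the model as a .pkl file for later use.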

import joblib  # sklearn.externals.joblib was removed in recent scikit-learn releases
joblib.dump(clf, 'NB_spam_model.pkl')

We then load and use the saved model:

# joblib.load accepts a file path directly, so no open file handle is left behind
clf = joblib.load('NB_spam_model.pkl')
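One practical detail: the classifier alone cannot score raw text, because the fitted CountVectorizer is also needed to turn a message into features. The sketch below is one possible way to handle this, assuming a tiny made-up training set and a hypothetical file name (NB_spam_bundle.pkl); it saves both objects together and reloads them for a prediction.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny stand-in training set (illustrative only, not spam.csv).
texts = ["free prize now", "win free cash", "lunch at noon", "see you tomorrow"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

cv = CountVectorizer()
clf = MultinomialNB().fit(cv.fit_transform(texts), labels)

# Save the vectorizer and the classifier together so they stay in sync.
# The file name is hypothetical; a temp directory keeps the example self-contained.
model_path = os.path.join(tempfile.mkdtemp(), "NB_spam_bundle.pkl")
joblib.dump({"vectorizer": cv, "classifier": clf}, model_path)

# Later (for example in another process): reload and predict without retraining.
bundle = joblib.load(model_path)
loaded_cv, loaded_clf = bundle["vectorizer"], bundle["classifier"]
reloaded_prediction = int(loaded_clf.predict(loaded_cv.transform(["free cash prize"]))[0])
print(reloaded_prediction)
```

Keep in mind that pickled models should only be loaded from sources you trust, and they are generally tied to the scikit-learn version that produced them.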

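The process above is called "persisting the model in a standard format", meaning the model is stored in a format specific to the development language. The next step is to serve the model from a microservice that exposes an endpoint to receive requests from clients.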

Turning the spam classifier into a web application

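Having prepared the code for classifying SMS messages in the previous section, we will now develop a web application consisting of a simple web page with a form field that lets us enter a message. After the message is submitted to the web application, it renders a new page that tells us whether the message is spam.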

First, we create a folder named SMS-Message-Spam-Detector for this project. Below is the directory tree inside that folder; we will explain each file in turn.

spam.csv
app.py
templates/
        home.html
        result.html
static/
        css/
                styles.css

The templates subdirectory is where Flask looks for the HTML templates it renders for the web browser. In our case there are two HTML files: home.html and result.html.

app.py

The app.py file contains the main code that the Python interpreter executes to run the Flask web application, including the ML code for classifying SMS messages:

from flask import Flask, render_template, request
import pandas as pd
import joblib  # only needed for the optional save/load lines below
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

app = Flask(__name__)

@app.route('/')
def home():
    return render_template('home.html')

@app.route('/predict', methods=['POST'])
def predict():
    # Note: the model is retrained on every request, which is simple but slow
    df = pd.read_csv("spam.csv", encoding="latin-1")
    df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)
    # Features and labels
    df['label'] = df['class'].map({'ham': 0, 'spam': 1})
    X = df['message']
    y = df['label']

    # Extract features with CountVectorizer
    cv = CountVectorizer()
    X = cv.fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

    # Naive Bayes classifier
    clf = MultinomialNB()
    clf.fit(X_train, y_train)
    clf.score(X_test, y_test)
    # Alternative: save the model once, then load it instead of retraining
    # joblib.dump(clf, 'NB_spam_model.pkl')
    # clf = joblib.load('NB_spam_model.pkl')

    # The route only accepts POST, so the form data is always present here
    message = request.form['message']
    data = [message]
    vect = cv.transform(data).toarray()
    my_prediction = clf.predict(vect)
    return render_template('result.html', prediction=my_prediction)

if __name__ == '__main__':
    app.run(debug=True)

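1. We run the application as a single module, so we initialize a new Flask instance with the argument __name__, which lets Flask know that it can find the HTML template folder (templates) in the same directory as the module.

2. Next, we use the route decorator (@app.route('/')) to specify the URL that triggers the home function. Our home function simply renders the home.html file, which lives in the templates folder.

3. Inside the predict function, we load the spam dataset, preprocess the text, train the model, and optionally save it. We then read the new message entered by the user and use our model to predict its label.

4. We use the POST method to send the form data to the server in the request body. By setting debug=True in the app.run call, we also activate Flask's debugger.

5. Finally, the run function starts the application on the server only when the script is executed directly by the Python interpreter, which we ensure with the if statement __name__ == '__main__'.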
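Because predict retrains the model on every request, a common variation is to fit (or load) the model once when the application starts. The following is a minimal sketch of that idea, not the original author's code: the training data is a made-up stand-in for spam.csv, and the /api/predict JSON route is a hypothetical addition that illustrates exposing the prediction over HTTP. Flask's built-in test client is used so the example runs without starting a server.

```python
from flask import Flask, jsonify, request
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stand-in training data; a real app would read spam.csv or load a saved model here.
TEXTS = ["free prize now", "win free cash", "lunch at noon", "see you tomorrow"]
LABELS = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# Fit once at import time instead of on every request.
cv = CountVectorizer()
clf = MultinomialNB().fit(cv.fit_transform(TEXTS), LABELS)

app = Flask(__name__)


@app.route("/api/predict", methods=["POST"])  # hypothetical JSON route
def api_predict():
    # Read the message from the JSON body; fall back to an empty string.
    payload = request.get_json(silent=True) or {}
    message = payload.get("message", "")
    label = int(clf.predict(cv.transform([message]))[0])
    return jsonify({"message": message, "spam": bool(label)})


# Exercise the route in-process with Flask's test client (no server needed).
client = app.test_client()
response = client.post("/api/predict", json={"message": "win a free prize"})
result = response.get_json()
print(result)
```

With this layout each request only pays for vectorizing and scoring one message, and the same fitted vectorizer is reused for every prediction.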

home.html

Below is the content of home.html, which renders a text form where the user can enter a message:

<!DOCTYPE html>
<html>
<head>
    <title>Home</title>
    <!-- <link rel="stylesheet" type="text/css" href="../static/css/styles.css"> -->
    <link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='css/styles.css') }}">
</head>
<body>

    <header>
        <div class="container">
        <div id="brandname">
            Machine Learning App with Flask
        </div>
        <h2>Spam Detector For SMS Messages</h2>

    </div>
    </header>

    <div class="ml-container">

        <form action="{{ url_for('predict')}}" method="POST">
        <p>Enter Your Message Here</p>
        <!-- <input type="text" name="comment"/> -->
        <textarea name="message" rows="4" cols="50"></textarea>
        <br/>

        <input type="submit" class="btn-info" value="predict">

    </form>

    </div>
</body>
</html>

styles.css

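In the head section of home.html, we load the styles.css file. CSS determines the look and feel of HTML documents. styles.css must be saved inside a subdirectory named static (here under static/css/, matching the url_for call in the template), because static is the default directory in which Flask looks for static files such as CSS.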

body{
    font:15px/1.5 Arial, Helvetica,sans-serif;
    padding: 0px;
    background-color:#f4f3f3;
}

.container{
    width:100%;
    margin: auto;
    overflow: hidden;
}

header{
    background:#03A9F4;
    border-bottom:#448AFF 3px solid;
    height:120px;
    width:100%;
    padding-top:30px;

}

.main-header{
            text-align:center;
            background-color: blue;
            height:100px;
            width:100%;
            margin:0px;
        }
#brandname{
    float:left;
    font-size:30px;
    color: #fff;
    margin: 10px;
}

header h2{
    text-align:center;
    color:#fff;

}

.btn-info {background-color: #2196F3;
    height:40px;
    width:100px;} /* Blue */
.btn-info:hover {background: #0b7dda;}

.results{
    border-radius: 15px 50px;
    background: #345fe4;
    padding: 20px; 
    width: 200px;
    height: 150px;
}
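The templates reference the stylesheet through url_for('static', filename='css/styles.css'). As a quick check of what that expression resolves to, the sketch below builds the URL outside of a real request, using Flask's test_request_context to supply the context that url_for needs. It assumes a default Flask app with no custom static folder or URL prefix.

```python
from flask import Flask, url_for

# Static files default to the "static" folder next to this module.
app = Flask(__name__)

# url_for needs a request or application context; test_request_context provides one.
with app.test_request_context():
    stylesheet_url = url_for("static", filename="css/styles.css")

print(stylesheet_url)  # /static/css/styles.css
```

The resulting path maps to static/css/styles.css inside the project folder.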

result.html

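We create a result.html file, which is rendered by the line render_template('result.html', prediction=my_prediction) returned from the predict function defined in app.py, to display the classification result for the text the user submitted through the text field. The result.html file contains the following: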

<!DOCTYPE html>
<html>
<head>
    <title></title>
    <link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='css/styles.css') }}">
</head>
<body>
    <header>
        <div class="container">
        <div id="brandname">
            ML App
        </div>
        <h2>Spam Detector For SMS Messages</h2>        
    </div>
    </header>
    <p style="color:blue;font-size:20px;text-align: center;"><b>Results for Comment</b></p>
    <div class="results">

    {% if prediction == 1%}
    <h2 style="color:red;">Spam</h2>
    {% elif prediction == 0%}
    <h2 style="color:blue;">Not a Spam (It is a Ham)</h2>
    {% endif %}
    </div>
</body>
</html>

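In result.html we can see some syntax that is not normally found in HTML files, for example {% if prediction == 1 %}, {% elif prediction == 0 %}, and {% endif %}. This is Jinja syntax, and it is used to access the prediction that the view function passes to the template.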
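To see the Jinja conditional in isolation, here is a small sketch that renders the same if/elif/endif structure directly with the jinja2 library (which Flask uses under the hood). The template string is a stripped-down version of result.html written for this example, with the HTML tags removed.

```python
from jinja2 import Template

# Same conditional structure as result.html, reduced to the essentials.
template = Template(
    "{% if prediction == 1 %}Spam"
    "{% elif prediction == 0 %}Not a Spam (It is a Ham)"
    "{% endif %}"
)

# The keyword argument plays the role of prediction=my_prediction in render_template.
spam_text = template.render(prediction=1)
ham_text = template.render(prediction=0)
print(spam_text)
print(ham_text)
```

Any other value of prediction matches neither branch, so the template renders an empty string.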

We are almost done!

Once all of the above is in place, you can start the API by double-clicking app.py or by running the following commands from a terminal:

cd SMS-Message-Spam-Detector
python app.py

Once the server has started, open a web browser and navigate to http://127.0.0.1:5000/. You should see a simple page with the message form.

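Congratulations! We have now built an end-to-end machine learning (NLP) application at zero cost. Looking back, the whole process was not complicated at all. With a little patience and the drive to learn, anyone can do it. Open source tools make all of this possible.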

More importantly, we were able to extend our knowledge of machine learning theory into a useful, practical web application!

The complete working source code is available in this repository. Have a great week!

Author: 【方向】


This article is original content from the Yunqi Community and may not be reproduced without permission.
