Anime Review Sentiment Classification with an LSTM in PaddlePaddle

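Background

While collecting material online, I found that most sentiment analysis examples are based on movie reviews or shopping-site reviews, and almost none deal with anime reviews. Anime is one of my hobbies, so for this course experiment I studied fluid-based sentiment analysis and applied it to anime reviews on Bilibili. Most of the study material follows the official PaddlePaddle sentiment analysis tutorial. The data was crawled and preprocessed by me into a dataset similar in form to the IMDB dataset.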

Installation commands

## CPU version install command
pip install -f https://paddlepaddle.org.cn/pip/oschina/cpu paddlepaddle

## GPU version install command
pip install -f https://paddlepaddle.org.cn/pip/oschina/gpu paddlepaddle-gpu

The experiment proceeds in the following steps:

Importing libraries
Dataset and data processing
Model training
Model prediction
Summary

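1 - Importing libraries

First, load the libraries we need:

os: file and path operations
sys: variables and functions related to the Python runtime environment
gzip: compression module for reading and writing compressed files
math: functions for common mathematical operations
paddle: the PaddlePaddle deep learning framework
matplotlib: plotting
from __future__ import print_function: with this line at the top of the file, print must be called with parentheses even under Python 2.x, just as in Python 3.x.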

In[14]

from __future__ import print_function
import os
import sys
import gzip
import math
import paddle
import paddle.fluid as fluid
import unittest
import contextlib
import numpy as np
import io
import matplotlib.pyplot as plt

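2 - Dataset and data processing

In this experiment I use a Bilibili review dataset that I collected and preprocessed myself. In the dataset, word_dict.txt is the dictionary, train_data.txt is the training data, and test_data.txt is the test data.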

In[2]

# Load the dictionary (the file holds a Python dict literal)
with io.open("/home/aistudio/data/data2184/word_dict.txt", "r", encoding="utf-8") as f:
    word_dict = eval(f.read())
    print(len(word_dict))
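
To make the data format concrete, here is a small sketch. The toy dictionary, the sample tokens, the `<unk>` entry and the `encode` helper are made up for illustration. The real files are assumed to hold a Python dict literal (word to id) and a list of `(word_ids, label)` pairs, which is what the `eval` calls and the reader loops suggest. `ast.literal_eval` is used here as a safer stand-in for `eval`, since it only accepts literals.

```python
import ast

# Hypothetical miniature of word_dict.txt: a dict literal mapping word -> id.
word_dict_text = "{'great': 0, 'plot': 1, 'boring': 2, '<unk>': 3}"
# Hypothetical miniature of train_data.txt: a list of (word_ids, label) pairs.
train_data_text = "[([1, 0], 1), ([1, 2], 0)]"

toy_word_dict = ast.literal_eval(word_dict_text)
toy_train_data = ast.literal_eval(train_data_text)


def encode(words, vocab, unk="<unk>"):
    """Map a tokenized review to ids, sending unknown words to the <unk> id."""
    return [vocab.get(w, vocab[unk]) for w in words]


print(len(toy_word_dict))                                      # 4
print(encode(["plot", "great", "masterpiece"], toy_word_dict))  # [1, 0, 3]
```

Each training sample is then just a list of word ids plus a 0/1 label, which matches the `int64` sequence input and label layers defined later.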

Build the data generators for the training and test sets

In[3]

# The network is fairly complex, so BATCH_SIZE should not be set too small, or AI Studio tends to crash
BATCH_SIZE = 8

# Training set generator
def train_generator():
    with io.open("/home/aistudio/data/data2184/train_data.txt", "r", encoding="utf-8") as f:
        train_data = eval(f.read())
        print(len(train_data))
    def reader():
        for word_vector, label in train_data:
            yield word_vector, label
    return reader

# Test set generator
def test_generator():
    with io.open("/home/aistudio/data/data2184/test_data.txt", "r", encoding="utf-8") as f:
        test_data = eval(f.read())
        print(len(test_data))
    def reader():
        for word_vector, label in test_data:
            yield word_vector, label
    return reader




# Split the data into batches; shuffle the training data to reduce correlation between neighbouring samples
train_reader = paddle.batch(
    paddle.reader.shuffle(
        train_generator(),
        buf_size=51200),
    batch_size=BATCH_SIZE)
test_reader = paddle.batch(
    test_generator(),
    batch_size=BATCH_SIZE)

# for data in test_reader():
#     print(data)
#     print(len(data))
dict_dim = len(word_dict)

7732
7732
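
The reader pipeline above wraps a sample generator in a shuffling step and a batching step. The following pure-Python sketch illustrates that idea under simple assumptions. It is not PaddlePaddle's implementation, and `shuffle_reader`, `batch_reader` and `toy_reader` are hypothetical names introduced here.

```python
import random


def shuffle_reader(reader, buf_size, seed=None):
    """Buffer up to buf_size samples, shuffle the buffer, then yield from it."""
    rng = random.Random(seed)

    def new_reader():
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for s in buf:
                    yield s
                buf = []
        # Flush whatever is left in the buffer.
        rng.shuffle(buf)
        for s in buf:
            yield s

    return new_reader


def batch_reader(reader, batch_size):
    """Group samples into lists of batch_size; the last batch may be smaller."""

    def new_reader():
        batch = []
        for sample in reader():
            batch.append(sample)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

    return new_reader


def toy_reader():
    # Ten fake (word_ids, label) samples.
    for i in range(10):
        yield [i, i + 1], i % 2


batches = list(batch_reader(shuffle_reader(toy_reader, buf_size=4, seed=0), batch_size=4)())
print([len(b) for b in batches])  # [4, 4, 2]
```

Note the smaller final batch: any per-batch metric has to be weighted by batch size when it is averaged over the whole set.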

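3 - Model training

Having covered the dataset, we can start training. The process consists of the following steps:

Model configuration
Training and optimization
Prediction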

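1. Model configuration

We first configure the LSTM network.

1. Convert the one-hot word representation into a word embedding
2. Build the LSTM network
3. Compute the accuracy
4. Two architectures are defined here: a plain LSTM network and a stacked bidirectional LSTM network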
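
The first step, turning a one-hot vector into an embedding, amounts to selecting one row of the embedding table. A minimal pure-Python sketch, using a made-up 4-word, 3-dimensional table (all values here are illustrative assumptions):

```python
def one_hot(index, size):
    """Return a one-hot row vector of the given size."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vec_times_matrix(vec, matrix):
    """Multiply a row vector by a matrix stored as a list of rows."""
    cols = len(matrix[0])
    return [sum(vec[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(cols)]


# Hypothetical embedding table: dict_dim = 4 words, emb_dim = 3.
embedding = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

word_id = 2
via_matmul = vec_times_matrix(one_hot(word_id, 4), embedding)
via_lookup = embedding[word_id]
print(via_matmul == via_lookup)  # True
```

This is why the embedding layer takes integer word ids directly: the lookup gives the same result as the matrix product without ever building the one-hot vectors.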

In[4]

# Plain LSTM network
def lstm_net(data, label, dict_dim, emb_dim=128, hid_dim=128, hid_dim2=96, class_dim=2, emb_lr=30.0):
    # Convert word ids to embeddings
    emb = fluid.layers.embedding(
        input=data,
        size=[dict_dim, emb_dim],
        param_attr=fluid.ParamAttr(learning_rate=emb_lr))

    # LSTM layer (dynamic_lstm expects an input of width 4 * hidden size)
    fc0 = fluid.layers.fc(input=emb, size=hid_dim * 4)
    lstm_h, c = fluid.layers.dynamic_lstm(
        input=fc0, size=hid_dim * 4, is_reverse=False)
    
    lstm_max = fluid.layers.sequence_pool(input=lstm_h, pool_type='max')
    lstm_max_tanh = fluid.layers.tanh(lstm_max)
    fc1 = fluid.layers.fc(input=lstm_max_tanh, size=hid_dim2, act='tanh')
    
    prediction = fluid.layers.fc(input=fc1, size=class_dim, act='softmax')
    cost = fluid.layers.cross_entropy(input=prediction, label=label)
    avg_cost = fluid.layers.mean(x=cost)
    acc = fluid.layers.accuracy(input=prediction, label=label)
    return avg_cost, acc, prediction

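# Stacked bidirectional LSTM network
def stacked_lstm_net(data,label, input_dim, class_dim=2, emb_dim=128, hid_dim=512, stacked_num=3):
    # Odd layers run forward and even layers run backward; the last LSTM layer must run forward, so the number of stacked layers must be odd
    assert stacked_num % 2 == 1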

    emb = fluid.layers.embedding(
        input=data, size=[input_dim, emb_dim], is_sparse=True)

    fc1 = fluid.layers.fc(input=emb, size=hid_dim)
    lstm1, cell1 = fluid.layers.dynamic_lstm(input=fc1, size=hid_dim)

    inputs = [fc1, lstm1]

    for i in range(2, stacked_num + 1):
        fc = fluid.layers.fc(input=inputs, size=hid_dim)
        lstm, cell = fluid.layers.dynamic_lstm(
            input=fc, size=hid_dim, is_reverse=(i % 2) == 0)  # odd layers forward, even layers backward
        inputs = [fc, lstm]

    fc_last = fluid.layers.sequence_pool(input=inputs[0], pool_type='max')
    lstm_last = fluid.layers.sequence_pool(input=inputs[1], pool_type='max')

    prediction = fluid.layers.fc(
        input=[fc_last, lstm_last], size=class_dim, act='softmax')
    cost = fluid.layers.cross_entropy(input=prediction, label=label)
    avg_cost = fluid.layers.mean(x=cost)
    acc = fluid.layers.accuracy(input=prediction, label=label)
    return avg_cost, acc, prediction
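
In both networks the fully connected layer in front of each `dynamic_lstm` has the same width as the LSTM's `size` argument. That width is four times the hidden size, because the projected input carries the pre-activations of the four LSTM gates. The sketch below shows one way to write the recurrence and the max pooling over time in pure Python. The gate ordering, the toy weights and the function names are assumptions of this sketch, not PaddlePaddle's internal layout, and biases are omitted.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(gates, c_prev):
    """One LSTM step. `gates` holds 4 * hidden pre-activations (i, f, g, o)."""
    hidden = len(c_prev)
    assert len(gates) == 4 * hidden
    i = [sigmoid(v) for v in gates[0:hidden]]                  # input gate
    f = [sigmoid(v) for v in gates[hidden:2 * hidden]]         # forget gate
    g = [math.tanh(v) for v in gates[2 * hidden:3 * hidden]]   # candidate cell
    o = [sigmoid(v) for v in gates[3 * hidden:4 * hidden]]     # output gate
    c = [f[k] * c_prev[k] + i[k] * g[k] for k in range(hidden)]
    h = [o[k] * math.tanh(c[k]) for k in range(hidden)]
    return h, c


def run_lstm(x_proj_seq, w_h, hidden):
    """Run the recurrence over a sequence of already-projected inputs."""
    h = [0.0] * hidden
    c = [0.0] * hidden
    outputs = []
    for x_proj in x_proj_seq:
        # Add the recurrent contribution W_h * h_{t-1} to the projected input.
        gates = [x_proj[k] + sum(w_h[k][j] * h[j] for j in range(hidden))
                 for k in range(4 * hidden)]
        h, c = lstm_step(gates, c)
        outputs.append(h)
    return outputs


def max_pool_over_time(seq):
    """Per-dimension maximum over all time steps (a 'max' sequence pool)."""
    return [max(step[k] for step in seq) for k in range(len(seq[0]))]


hidden = 2
# Toy recurrent weights, shape (4 * hidden) x hidden.
w_h = [[0.1 * ((k + j) % 3 - 1) for j in range(hidden)] for k in range(4 * hidden)]
# Three time steps of projected input, each of width 4 * hidden = 8.
x_proj_seq = [
    [0.5, -0.5, 0.1, 0.2, 1.0, -1.0, 0.3, 0.0],
    [0.0, 0.4, -0.2, 0.6, -0.7, 0.8, 0.1, 0.9],
    [-0.3, 0.2, 0.5, -0.1, 0.4, 0.6, -0.8, 0.2],
]
outputs = run_lstm(x_proj_seq, w_h, hidden)
pooled = max_pool_over_time(outputs)
print(len(outputs), len(pooled))  # 3 2
```

The pooled vector has a fixed length regardless of how many words the review has, which is what lets the final fully connected layers classify variable-length comments.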

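2. Defining the training procedure

The training procedure follows the usual fluid pattern, which looks like this:

1. Define the input layer
2. Define the label layer
3. Build the network
4. Define the optimizer
5. Define the device, the executor and the feeder
6. Initialize the model parameters
7. Run the two-level training loop
7.1 The outer loop iterates over epochs
7.2 The inner loop iterates over steps
7.3 Save the model parameters at a suitable point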

In[20]

def train(train_reader, word_dict, network, use_cuda, save_dirname, lr=0.2, batch_size=128, pass_num=30):
    # Note: batch_size is not used here; batching is already done by train_reader.

    # Input layer
    data = fluid.layers.data(
        name="words", shape=[1], dtype="int64", lod_level=1)

    # Label layer
    label = fluid.layers.data(name="label", shape=[1], dtype="int64")

    # Network
    cost, acc, prediction = network(data, label, len(word_dict))

    # Optimizer (Adagrad)
    optimizer = fluid.optimizer.Adagrad(learning_rate=lr)
    optimizer.minimize(cost)

    # Device, executor and feeder
    place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
    exe = fluid.Executor(place)
    feeder = fluid.DataFeeder(feed_list=[data, label], place=place)

    # Initialize the model parameters
    exe.run(fluid.default_startup_program())

    # Two-level training loop
    # Outer loop: epochs
    for pass_id in range(pass_num):
        # Inner loop: steps (one batch per step)
        for i, batch in enumerate(train_reader()):
            avg_cost_np, avg_acc_np = exe.run(fluid.default_main_program(),
                                              feed=feeder.feed(batch),
                                              fetch_list=[cost, acc])
            if i % 100 == 0:
                print("Pass {:d},Batch {:d}, cost {:.6f}".format(pass_id, i, np.mean(avg_cost_np)))
        # Save the model after every epoch, overwriting the previous one
        fluid.io.save_inference_model(save_dirname, ["words", "label"], acc, exe)
    print('train end')
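
The optimizer used in `train` is Adagrad, which scales each parameter's step by the accumulated squared gradients. Below is a minimal sketch of the commonly stated update rule on a one-dimensional toy problem. The `eps` value, the toy objective and the function name are assumptions made for illustration.

```python
import math


def adagrad_step(w, grad, accum, lr, eps=1e-6):
    """One Adagrad update for a single scalar parameter."""
    accum = accum + grad * grad                   # running sum of squared gradients
    w = w - lr * grad / (math.sqrt(accum) + eps)  # step shrinks as accum grows
    return w, accum


# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3).
w, accum = 0.0, 0.0
for step in range(200):
    grad = 2.0 * (w - 3.0)
    w, accum = adagrad_step(w, grad, accum, lr=0.5)

print(abs(w - 3.0) < 1e-3)  # True
```

Because the effective step size only ever decreases, the learning rate passed to `train` acts as an upper bound on how far each parameter moves per update.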

3. Training

Set the hyperparameters and call the train function to start training

In[21]

# Do not set pass_num too large: the process can run out of memory and stop unexpectedly.
train(
    train_reader,
    word_dict,
    lstm_net,
    use_cuda=False,
    save_dirname="lstm_model",
    lr=0.001,
    pass_num=2,
    batch_size=4)

Pass 0,Batch 0, cost 0.691885
Pass 0,Batch 100, cost 0.684455
Pass 0,Batch 200, cost 0.662502
Pass 0,Batch 300, cost 0.609410
Pass 0,Batch 400, cost 0.629891
Pass 0,Batch 500, cost 0.553046
Pass 0,Batch 600, cost 0.578969
Pass 0,Batch 700, cost 0.686090
Pass 0,Batch 800, cost 0.729985
Pass 0,Batch 900, cost 0.542598
Pass 1,Batch 0, cost 0.708446
Pass 1,Batch 100, cost 0.599567
Pass 1,Batch 200, cost 0.787641
Pass 1,Batch 300, cost 0.398084
Pass 1,Batch 400, cost 0.478610
Pass 1,Batch 500, cost 0.627605
Pass 1,Batch 600, cost 0.739151
Pass 1,Batch 700, cost 0.730697
Pass 1,Batch 800, cost 0.478778
Pass 1,Batch 900, cost 0.563518
train end
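
The per-batch cost in this log fluctuates a lot, so a per-pass average is easier to read. A small sketch that parses the log lines above with a regular expression (the helper name `mean_cost_per_pass` is hypothetical):

```python
import re
from collections import defaultdict

log_text = """\
Pass 0,Batch 0, cost 0.691885
Pass 0,Batch 100, cost 0.684455
Pass 0,Batch 200, cost 0.662502
Pass 0,Batch 300, cost 0.609410
Pass 0,Batch 400, cost 0.629891
Pass 0,Batch 500, cost 0.553046
Pass 0,Batch 600, cost 0.578969
Pass 0,Batch 700, cost 0.686090
Pass 0,Batch 800, cost 0.729985
Pass 0,Batch 900, cost 0.542598
Pass 1,Batch 0, cost 0.708446
Pass 1,Batch 100, cost 0.599567
Pass 1,Batch 200, cost 0.787641
Pass 1,Batch 300, cost 0.398084
Pass 1,Batch 400, cost 0.478610
Pass 1,Batch 500, cost 0.627605
Pass 1,Batch 600, cost 0.739151
Pass 1,Batch 700, cost 0.730697
Pass 1,Batch 800, cost 0.478778
Pass 1,Batch 900, cost 0.563518
"""

LINE_PATTERN = re.compile(r"Pass (\d+),Batch (\d+), cost ([0-9.]+)")


def mean_cost_per_pass(text):
    """Return {pass_id: mean of the logged costs in that pass}."""
    costs = defaultdict(list)
    for match in LINE_PATTERN.finditer(text):
        costs[int(match.group(1))].append(float(match.group(3)))
    return {p: sum(v) / len(v) for p, v in sorted(costs.items())}


means = mean_cost_per_pass(log_text)
for pass_id, mean_cost in means.items():
    print("Pass {:d}: mean logged cost {:.4f}".format(pass_id, mean_cost))
```

On these twenty logged values the mean drops from about 0.637 in pass 0 to about 0.611 in pass 1, a modest improvement that is hard to see from the individual lines.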

4. Testing

4.1 Defining the test procedure

Set up the device and the executor
Create and use a scope
Load the saved model
Run the test

In[22]

def infer(test_reader, use_cuda, model_path=None):

    # Input layer
    data = fluid.layers.data(
        name="words", shape=[1], dtype="int64", lod_level=1)

    # Label layer
    label = fluid.layers.data(name="label", shape=[1], dtype="int64")

    # Set up the device and the executor
    place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
    exe = fluid.Executor(place)
    feeder = fluid.DataFeeder(feed_list=[data, label], place=place)

    # Create and use a separate scope
    inference_scope = fluid.core.Scope()

    with fluid.scope_guard(inference_scope):
        # Load the saved inference model
        [inference_program, feed_target_names,
         fetch_targets] = fluid.io.load_inference_model(model_path, exe)
        total_acc = 0.0
        total_count = 0
        for batch in test_reader():
            # Run the model; the fetched target is the accuracy on this batch
            acc = exe.run(inference_program,
                          feed=feeder.feed(batch),
                          fetch_list=fetch_targets,
                          return_numpy=True)
            # Weight each batch by its size, since the last batch may be smaller
            total_acc += float(np.mean(acc[0])) * len(batch)
            total_count += len(batch)

        avg_acc = total_acc / total_count
        print("model_path: %s, avg_acc: %f" % (model_path, avg_acc))
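
The test loop multiplies each batch accuracy by the batch size before averaging. A short sketch of why that matters when the final batch is smaller than the rest. The accuracy values and the helper name are made up for illustration.

```python
def weighted_accuracy(batch_results):
    """batch_results: iterable of (batch_accuracy, batch_size) pairs."""
    total_correct = 0.0
    total_count = 0
    for acc, size in batch_results:
        total_correct += acc * size
        total_count += size
    return total_correct / total_count


# Hypothetical run: two full batches of 8 and a final batch of 4.
results = [(0.75, 8), (0.5, 8), (1.0, 4)]

print(weighted_accuracy(results))                    # 0.7
print(sum(a for a, _ in results) / len(results))     # 0.75 (unweighted, biased)
```

The unweighted mean gives the small last batch the same influence as a full one; the weighted version equals the fraction of all samples classified correctly.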

4.2 Running the prediction

Set the required variables and run the prediction

In[23]

model_path = "lstm_model"
infer(test_reader, use_cuda=False, model_path=model_path)

model_path: lstm_model, avg_acc: 0.726332

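To run on a GPU, you need to select the advanced environment in AI Studio, and check that the related settings, such as use_gpu, use_cuda and fluid.CUDAPlace(0), are set correctly.

Click the link to try the project on AI Studio: https://aistudio.baidu.com/aistudio/projectdetail/127565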
