[Deep Learning in Practice] · Open-Source Baseline for the Preliminary Round of the First China ECG Intelligence Competition (Keras, val_acc: 0.88)

My website --> http://www.yansongsong.cn

Project GitHub repo: https://github.com/xiaosongshine/preliminary_challenge_baseline_keras

Competition Overview

In response to the national Healthy China strategy and the policy of promoting the fusion of healthcare and big data, the First China ECG Intelligence Competition has been launched, jointly organized by the School of Clinical Medicine and the Institute for Data Science of Tsinghua University, the Jingjin Gaocun Science and Technology Innovation Park of Wuqing District, Tianjin, and several major hospitals. Registration is open worldwide from today until 24:00 on March 31, 2019, and total prize money is expected to reach one million RMB. The official registration site is now online; universities, hospitals, startup teams, and anyone else committed to the development of ECG AI in China are warmly invited to take part.

Official registration site for the First China ECG Intelligence Competition >> http://mdi.ids.tsinghua.edu.cn

 

Data Description

Download the complete training and test sets: 1000 routine ECG recordings in total, 600 in the training set and 400 in the test set, collected from several public datasets. Teams must use the training data, labeled with the two classes normal/abnormal, to design and implement an algorithm, and then produce predictions on the unlabeled test set.

The ECG signals are sampled at 500 Hz. So that teams can read the data from any programming language, all recordings are stored in MAT format; each file contains the voltage signals of the 12 leads. The training labels are stored in a txt file, where 0 denotes normal and 1 denotes abnormal.

 

Task Analysis

A quick analysis: the preliminary dataset holds 1000 samples in total. The 600 training examples carry labels and can be used to train a model; the 400 test examples are unlabeled and must be predicted with the trained model.

The task is therefore a binary classification problem, and the solution breaks down into the following steps:

  1. Data loading and preprocessing
  2. Building the network model
  3. Training the model
  4. Applying the model and submitting predictions

 

Hands-On Implementation

Based on the analysis above, we break the work into four subtasks. The first step is:

1. Data loading and preprocessing

The ECG signals are sampled at 500 Hz. So that teams can read the data from any programming language, all recordings are stored in MAT format; each file contains the voltage signals of the 12 leads. The training labels are stored in a txt file, where 0 denotes normal and 1 denotes abnormal.

From this description we can tell:

  • The data are stored in MAT files, which determines how we will read them.
  • The sampling rate is 500 Hz. This fact is not used much directly; it simply means 500 points per second, and since each recording turns out to have 5000 points, every sample is a 10-second ECG.
  • There are 12 leads of voltage signals, i.e., 12 measurement channels; roughly, it is like taking a temperature with 12 thermometers to get more reliable information (a figure in the original post briefly illustrates the lead placements). A single lead would be enough to train and predict, but experience says that using more features gives better results, so since 12 leads are provided we should use them all.

 

Data-processing function definitions:

import keras
from scipy.io import loadmat
import matplotlib.pyplot as plt
import glob
import numpy as np
import pandas as pd
import math
import os
from keras.layers import *
from keras.models import *


BASE_DIR = "preliminary/TRAIN/"

# Per-lead normalization: subtract each lead's mean, divide by its max
# (the small epsilon avoids division by zero)
def normalize(v):
    return (v - v.mean(axis=1).reshape((v.shape[0], 1))) / (v.max(axis=1).reshape((v.shape[0], 1)) + 2e-12)

# Load one MAT file and return its normalized feature matrix, transposed to (5000, 12)
def get_feature(wav_file, Lens=12, BASE_DIR=BASE_DIR):
    mat = loadmat(BASE_DIR + wav_file)
    dat = mat["data"]
    feature = dat[0:Lens]
    return normalize(feature).transpose()


# Convert an integer label to one-hot form
def convert2oneHot(index, Lens):
    hot = np.zeros((Lens,))
    hot[index] = 1
    return hot

TXT_DIR = "preliminary/reference.txt"
MANIFEST_DIR = "preliminary/reference.csv"
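A quick check of the one-hot helper (my own addition, just to make the label encoding concrete):

print(convert2oneHot(0, 2))   # [1. 0.]  -> normal
print(convert2oneHot(1, 2))   # [0. 1.]  -> abnormal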

  

Load one sample and plot it:

if __name__ == "__main__":
    dat1 = get_feature("TRAIN101.mat")
    print(dat1.shape)
    # one sample's shape is (5000, 12)
    plt.plot(dat1[:, 0])   # plot the first lead
    plt.show()

  

From this we can see that each lead is a sequence of 5000 points, so with 12 leads every sample is a 12×5000 matrix, much like an image with a resolution of 12×5000 (get_feature transposes it to 5000×12 so that time is the first axis).

All the processing we need is to read each sample, normalize it, and feed it into the network for training.

Label handling:

def create_csv(TXT_DIR=TXT_DIR):
    lists = pd.read_csv(TXT_DIR, sep=r"\t", header=None)
    lists = lists.sample(frac=1)   # shuffle all rows
    lists.to_csv(MANIFEST_DIR, index=None)
    print("Finish save csv")

  

Here I read the labels from reference.txt, shuffle them, and save the result to reference.csv. Be sure to shuffle, or training will go poorly: in the raw file all the leading labels are 1 and all the trailing ones are 0.
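As an aside, shuffling with a fixed seed makes the split reproducible across runs; a minimal sketch (my own variant, not part of the original baseline; the seed value is arbitrary):

def create_csv_seeded(txt_dir=TXT_DIR, seed=2019):
    lists = pd.read_csv(txt_dir, sep=r"\t", header=None)
    lists = lists.sample(frac=1, random_state=seed)   # same shuffle every run
    lists.to_csv(MANIFEST_DIR, index=None)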

Batch generator:

Batch_size = 20
def xs_gen(path=MANIFEST_DIR, batch_size=Batch_size, train=True):

    img_list = pd.read_csv(path)
    if train:
        img_list = np.array(img_list)[:500]    # first 500 samples for training
        print("Found %s train items." % len(img_list))
        print("list 1 is", img_list[0])
        steps = math.ceil(len(img_list) / batch_size)    # batches per epoch
    else:
        img_list = np.array(img_list)[500:]    # last 100 samples for validation
        print("Found %s val items." % len(img_list))
        print("list 1 is", img_list[0])
        steps = math.ceil(len(img_list) / batch_size)    # batches per epoch
    while True:
        for i in range(steps):

            batch_list = img_list[i * batch_size : i * batch_size + batch_size]
            np.random.shuffle(batch_list)
            batch_x = np.array([get_feature(file) for file in batch_list[:, 0]])
            batch_y = np.array([convert2oneHot(label, 2) for label in batch_list[:, 1]])

            yield batch_x, batch_y

  

I read the data with a generator so it can be loaded batch by batch, which speeds up training; you can also load everything into memory at once, as sketched below, depending on your preference.
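For reference, here is a minimal load-everything-at-once sketch (my own addition, reusing get_feature and convert2oneHot from above; the 500/100 split mirrors the generator):

def load_all(path=MANIFEST_DIR):
    img_list = np.array(pd.read_csv(path))
    X = np.array([get_feature(f) for f in img_list[:, 0]])         # (600, 5000, 12)
    Y = np.array([convert2oneHot(l, 2) for l in img_list[:, 1]])   # (600, 2)
    return (X[:500], Y[:500]), (X[500:], Y[500:])

# (train_x, train_y), (val_x, val_y) = load_all()
# model.fit(train_x, train_y, batch_size=Batch_size, epochs=20, validation_data=(val_x, val_y))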

 

2. Building the network model

With the data ready, the next step is the model. I built it with Keras, which is simple and convenient; TensorFlow, PyTorch, or sklearn work just as well, according to taste.

The network could be a CNN, an RNN, an attention-based model, or an ensemble of several. As a starting point, this baseline uses a 1D CNN; I have a separate tutorial on 1D CNNs for background.

Model definition:

TIME_PERIODS = 5000
num_sensors = 12
def build_model(input_shape=(TIME_PERIODS, num_sensors), num_classes=2):
    model = Sequential()
    # model.add(Reshape((TIME_PERIODS, num_sensors), input_shape=input_shape))
    model.add(Conv1D(16, 16, strides=2, activation='relu', input_shape=input_shape))
    model.add(Conv1D(16, 16, strides=2, activation='relu', padding="same"))
    model.add(MaxPooling1D(2))
    model.add(Conv1D(64, 8, strides=2, activation='relu', padding="same"))
    model.add(Conv1D(64, 8, strides=2, activation='relu', padding="same"))
    model.add(MaxPooling1D(2))
    model.add(Conv1D(128, 4, strides=2, activation='relu', padding="same"))
    model.add(Conv1D(128, 4, strides=2, activation='relu', padding="same"))
    model.add(MaxPooling1D(2))
    model.add(Conv1D(256, 2, strides=1, activation='relu', padding="same"))
    model.add(Conv1D(256, 2, strides=1, activation='relu', padding="same"))
    model.add(MaxPooling1D(2))
    model.add(GlobalAveragePooling1D())   # average over time: one 256-d vector per sample
    model.add(Dropout(0.3))
    model.add(Dense(num_classes, activation='softmax'))
    return model

  

The network as printed by model.summary():

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
reshape_1 (Reshape)          (None, 5000, 12)          0
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 2493, 16)          3088
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 1247, 16)          4112
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 623, 16)           0
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 312, 64)           8256
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 156, 64)           32832
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 78, 64)            0
_________________________________________________________________
conv1d_5 (Conv1D)            (None, 39, 128)           32896
_________________________________________________________________
conv1d_6 (Conv1D)            (None, 20, 128)           65664
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 10, 128)           0
_________________________________________________________________
conv1d_7 (Conv1D)            (None, 10, 256)           65792
_________________________________________________________________
conv1d_8 (Conv1D)            (None, 10, 256)           131328
_________________________________________________________________
max_pooling1d_4 (MaxPooling1 (None, 5, 256)            0
_________________________________________________________________
global_average_pooling1d_1 ( (None, 256)               0
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 514
=================================================================
Total params: 344,482
Trainable params: 344,482
Non-trainable params: 0
_________________________________________________________________

The model has relatively few parameters; feel free to change it to your own ideas. The output lengths in the summary can be verified with the small sketch below.
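The first Conv1D uses no padding ('valid') while the rest use padding="same"; here is the length arithmetic (my own addition, following Keras conventions):

import math

def conv1d_out_len(n, kernel, stride, padding="valid"):
    # Output length of a Conv1D layer under Keras conventions
    if padding == "same":
        return math.ceil(n / stride)
    return math.floor((n - kernel) / stride) + 1

print(conv1d_out_len(5000, 16, 2))           # 2493 -> matches conv1d_1
print(conv1d_out_len(2493, 16, 2, "same"))   # 1247 -> matches conv1d_2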

3. Training the model

Training script:

if __name__ == "__main__":
    if os.path.exists(MANIFEST_DIR) == False:
        create_csv()
    train_iter = xs_gen(train=True)
    val_iter = xs_gen(train=False)
    model = build_model()
    print(model.summary())
    # Save a checkpoint whenever val_acc improves
    ckpt = keras.callbacks.ModelCheckpoint(
        filepath='best_model.{epoch:02d}-{val_acc:.2f}.h5',
        monitor='val_acc', save_best_only=True, verbose=1)
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    model.fit_generator(
        generator=train_iter,
        steps_per_epoch=500 // Batch_size,
        epochs=20,
        initial_epoch=0,
        validation_data=val_iter,
        validation_steps=100 // Batch_size,
        callbacks=[ckpt],
        )

  

Training output (best result: loss: 0.0565 - acc: 0.9820 - val_loss: 0.8307 - val_acc: 0.8800):

Epoch 10/20
25/25 [==============================] - 1s 37ms/step - loss: 0.2329 - acc: 0.9040 - val_loss: 0.4041 - val_acc: 0.8700

Epoch 00010: val_acc improved from 0.85000 to 0.87000, saving model to best_model.10-0.87.h5
Epoch 11/20
25/25 [==============================] - 1s 38ms/step - loss: 0.1633 - acc: 0.9380 - val_loss: 0.5277 - val_acc: 0.8300

Epoch 00011: val_acc did not improve from 0.87000
Epoch 12/20
25/25 [==============================] - 1s 40ms/step - loss: 0.1394 - acc: 0.9500 - val_loss: 0.4916 - val_acc: 0.7400

Epoch 00012: val_acc did not improve from 0.87000
Epoch 13/20
25/25 [==============================] - 1s 38ms/step - loss: 0.1746 - acc: 0.9220 - val_loss: 0.5208 - val_acc: 0.8100

Epoch 00013: val_acc did not improve from 0.87000
Epoch 14/20
25/25 [==============================] - 1s 38ms/step - loss: 0.1009 - acc: 0.9720 - val_loss: 0.5513 - val_acc: 0.8000

Epoch 00014: val_acc did not improve from 0.87000
Epoch 15/20
25/25 [==============================] - 1s 38ms/step - loss: 0.0565 - acc: 0.9820 - val_loss: 0.8307 - val_acc: 0.8800

Epoch 00015: val_acc improved from 0.87000 to 0.88000, saving model to best_model.15-0.88.h5
Epoch 16/20
25/25 [==============================] - 1s 38ms/step - loss: 0.0261 - acc: 0.9920 - val_loss: 0.6443 - val_acc: 0.8400

Epoch 00016: val_acc did not improve from 0.88000
Epoch 17/20
25/25 [==============================] - 1s 38ms/step - loss: 0.0178 - acc: 0.9960 - val_loss: 0.7773 - val_acc: 0.8700

Epoch 00017: val_acc did not improve from 0.88000
Epoch 18/20
25/25 [==============================] - 1s 38ms/step - loss: 0.0082 - acc: 0.9980 - val_loss: 0.8875 - val_acc: 0.8600

Epoch 00018: val_acc did not improve from 0.88000
Epoch 19/20
25/25 [==============================] - 1s 37ms/step - loss: 0.0045 - acc: 1.0000 - val_loss: 1.0057 - val_acc: 0.8600

Epoch 00019: val_acc did not improve from 0.88000
Epoch 20/20
25/25 [==============================] - 1s 37ms/step - loss: 0.0012 - acc: 1.0000 - val_loss: 1.1088 - val_acc: 0.8600

Epoch 00020: val_acc did not improve from 0.88000
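The log shows a classic overfitting pattern: training loss keeps falling while val_loss drifts upward after epoch 15. If you want training to halt automatically in that situation, a minimal sketch using Keras's EarlyStopping callback (my own addition, not part of the original script):

from keras.callbacks import EarlyStopping

# Stop once val_acc has not improved for 5 epochs
early = EarlyStopping(monitor='val_acc', patience=5, verbose=1)
# ...then pass callbacks=[ckpt, early] to model.fit_generator()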

 

4. Applying the model and producing predictions

Prediction script:

if __name__ == "__main__":
    PRE_DIR = "sample_codes/answers.txt"
    model = load_model("best_model.15-0.88.h5")
    pre_lists = pd.read_csv(PRE_DIR, sep=r" ", header=None)
    print(pre_lists.head())
    pre_datas = np.array([get_feature(item, BASE_DIR="preliminary/TEST/") for item in pre_lists[0]])
    pre_result = model.predict_classes(pre_datas)   # predicted class per sample: 0 or 1
    print(pre_result.shape)
    pre_lists[1] = pre_result
    pre_lists.to_csv("sample_codes/answers1.txt", index=None, header=None)
    print("predict finish")

  

The first ten predictions:

TEST394,0
TEST313,1
TEST484,0
TEST288,0
TEST261,1
TEST310,0
TEST286,1
TEST367,1
TEST149,1
TEST160,1

Note that my prediction format differs from the official one; you will need to adapt the output to the competition's submission requirements.

 

Outlook

This baseline reaches 88% validation accuracy with the simplest 1D convolutions (the exact number may fluctuate with random initialization). You can also try GRU, attention, or ResNet-style architectures, which should push validation accuracy past 90%; one possible starting point is sketched below.
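For instance, a minimal GRU-based variant (my own sketch, same input shape and training setup as the baseline; the layer sizes are illustrative, not tuned):

def build_gru_model(input_shape=(TIME_PERIODS, num_sensors), num_classes=2):
    model = Sequential()
    # Downsample the 5000-step sequence first so the GRU stays fast
    model.add(Conv1D(32, 16, strides=4, activation='relu', input_shape=input_shape))
    model.add(MaxPooling1D(4))
    model.add(GRU(64))                # summarize the whole sequence into one vector
    model.add(Dropout(0.3))
    model.add(Dense(num_classes, activation='softmax'))
    return model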

My abilities are limited, so criticism and corrections are very welcome.

Personal homepage --> https://xiaosongshine.github.io/

Project GitHub repo: https://github.com/xiaosongshine/preliminary_challenge_baseline_keras

Fork and Star are welcome; if you find this useful, a little encouragement would be much appreciated ><
