ADFA-LD is a host-level intrusion detection dataset released by the Australian Defence Force Academy (the collection covers both Linux and Windows; ADFA-LD is the Linux part). It contains system call (syscall) traces that include intrusion events, where each trace is the sequence of syscalls issued by a single process within a time window.
ADFA-LD already encodes the system calls as features and labels each trace with its attack type. The attack types are listed in the table below.
| Category | Number of traces | Label |
| --- | --- | --- |
| Training | 833 | normal |
| Validation | 4373 | normal |
| Hydra-FTP | 162 | attack |
| Hydra-SSH | 148 | attack |
| Adduser | 91 | attack |
| Java-Meterpreter | 125 | attack |
| Meterpreter | 75 | attack |
| Webshell | 118 | attack |
Each data file in ADFA-LD independently records the sequence of system calls made over a period of time, and each system call is represented by its number (as defined in unistd.h).
```c
/*
 * This file contains the system call numbers, based on the
 * layout of the x86-64 architecture, which embeds the
 * pointer to the syscall in the table.
 *
 * As a basic principle, no duplication of functionality
 * should be added, e.g. we don't use lseek when llseek
 * is present. New architectures should use this file
 * and implement the less feature-full calls in user space.
 */
#ifndef __SYSCALL
#define __SYSCALL(x, y)
#endif

#if __BITS_PER_LONG == 32 || defined(__SYSCALL_COMPAT)
#define __SC_3264(_nr, _32, _64) __SYSCALL(_nr, _32)
#else
#define __SC_3264(_nr, _32, _64) __SYSCALL(_nr, _64)
#endif

#define __NR_io_setup 0
__SYSCALL(__NR_io_setup, sys_io_setup)
#define __NR_io_destroy 1
__SYSCALL(__NR_io_destroy, sys_io_destroy)
#define __NR_io_submit 2
__SYSCALL(__NR_io_submit, sys_io_submit)
#define __NR_io_cancel 3
__SYSCALL(__NR_io_cancel, sys_io_cancel)
#define __NR_io_getevents 4
__SYSCALL(__NR_io_getevents, sys_io_getevents)

/* fs/xattr.c */
#define __NR_setxattr 5
__SYSCALL(__NR_setxattr, sys_setxattr)
#define __NR_lsetxattr 6
__SYSCALL(__NR_lsetxattr, sys_lsetxattr)
#define __NR_fsetxattr 7
__SYSCALL(__NR_fsetxattr, sys_fsetxattr)
#define __NR_getxattr 8
__SYSCALL(__NR_getxattr, sys_getxattr)
#define __NR_lgetxattr 9
__SYSCALL(__NR_lgetxattr, sys_lgetxattr)
```
1. Hydra-FTP: FTP brute-force password-guessing attack
2. Hydra-SSH: SSH brute-force password-guessing attack
3. Adduser
4. Meterpreter: the uploads of Java and Linux executable Meterpreter payloads for the remote compromise of a target host
5. Webshell: privilege escalation using the C100 webshell
Before starting feature engineering, let us first run a brief analysis of the training data and try to find patterns that can guide the subsequent feature engineering.
The trace length reflects the total number of syscalls a process issued from startup until the attack completed (or the process was compromised). We can visualize the probability density function (PDF) of the trace length for each label.
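A minimal sketch of how such a comparison could be produced (hypothetical code, not from the original analysis; it assumes the ../data/ADFA-LD directory layout used by the detection code later in this article and uses matplotlib for the plot):

```python
# -*- coding:utf-8 -*-
# Hypothetical sketch: compare the trace-length distribution of normal and
# attack traces. Paths follow the ../data/ADFA-LD layout used further below.
import os
import matplotlib.pyplot as plt

def trace_lengths(rootdir):
    lengths = []
    for dirpath, _, filenames in os.walk(rootdir):
        for name in filenames:
            with open(os.path.join(dirpath, name)) as f:
                # each file is one trace: a line of space-separated syscall numbers
                lengths.append(len(f.read().split()))
    return lengths

normal_len = trace_lengths("../data/ADFA-LD/Training_Data_Master/")
attack_len = trace_lengths("../data/ADFA-LD/Attack_Data_Master/")

# normalized histograms approximate the PDF of each class
# (older matplotlib versions use normed=True instead of density=True)
plt.hist(normal_len, bins=50, density=True, alpha=0.5, label='normal')
plt.hist(attack_len, bins=50, density=True, alpha=0.5, label='attack')
plt.xlabel('Trace length')
plt.ylabel('density')
plt.legend()
plt.show()
```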
As can be seen from such a plot, the trace lengths fall roughly in the [100, 500] range, but there is no obvious boundary (line or surface) that separates samples with different labels, which suggests that trace length by itself is probably not a good feature.
This step is essentially asking whether the sample set is linearly separable, i.e. whether the underlying regularity in the samples is strong enough; only when the dataset itself carries separable structure can an algorithm hope to model and analyse it.
The syscall trace in each sample is essentially a sequence of tokens (words). We split it into 2-grams and plot the word-frequency histogram of the 2-grams.
We find that in the Adduser class the 2-grams "168 168" and "168 265" occur most frequently, while in the Webshell class the most frequent 2-grams are "5 5" and "5 3". To some extent this indicates that the two classes are separable at the level of 2-gram frequencies.
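For illustration, counts like these can be reproduced with the same CountVectorizer API used later in this article; the traces list below is only a placeholder for the raw trace lines of one attack class, and the non-default token_pattern is an assumption needed to keep single-digit syscall numbers that the default tokenizer would drop:

```python
# -*- coding:utf-8 -*-
# Hypothetical sketch: count 2-grams over the syscall traces of one class
# and print the most frequent ones. "traces" is a placeholder here.
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

traces = [
    "168 168 265 168 168 168 265",   # placeholder trace lines
    "5 5 3 5 5 3 102 5 5",
]

# the default token_pattern ignores single-character tokens, so use \w+ to
# preserve one-digit syscall numbers such as "5" and "3"
vectorizer = CountVectorizer(min_df=1, ngram_range=(2, 2), token_pattern=r'\b\w+\b')
X = vectorizer.fit_transform(traces)

counts = X.sum(axis=0).A1                 # total count of every 2-gram over all traces
vocab = vectorizer.get_feature_names()
print Counter(dict(zip(vocab, counts))).most_common(10)
```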
Relevant Link:
Evaluating host-based anomaly detection systems: A preliminary analysis of ADFA-LD
https://www.unsw.adfa.edu.au/australian-centre-for-cyber-security/cybersecurity/ADFA-IDS-Datasets/
The token set made up of syscall numbers is essentially a sequence of words, so we can apply word models to engineer features from the samples.
The Bag-of-Words model (BoW model) first appeared in Natural Language Processing and Information Retrieval. The model ignores grammar and word order and treats a text simply as a collection of words, assuming that every word occurs independently. BoW represents a piece of text or a document as an unordered bag of words.
First, consider the following two simple text documents:
John likes to watch movies. Mary likes too.
John also likes to watch football games.
Based on the words appearing in these two documents, build the following dictionary:
{"John": 1, "likes": 2,"to": 3, "watch": 4, "movies": 5,"also": 6, "football": 7, "games": 8,"Mary": 9, "too": 10}
The dictionary contains 10 words, each with a unique index (the ordering of the indices carries no meaning), so each document can be represented by a 10-dimensional vector, as follows:
[1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
[1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
These vectors are unrelated to the order in which words appear in the original texts; each component is simply the frequency with which the corresponding dictionary word (whether or not it appears in this particular sample) occurs in the document.
In scikit-learn, bag-of-words feature extraction is implemented by CountVectorizer(), which performs both tokenization and counting in a single class:
```python
# -*- coding:utf-8 -*-
from sklearn.feature_extraction.text import CountVectorizer

if __name__ == '__main__':
    vectorizer = CountVectorizer()
    corpus = [
        'This is the first document.',
        'This is the second second document.',
        'And the third one.',
        'Is this the first document?',
    ]
    X = vectorizer.fit_transform(corpus)
    # show the vocabulary (bag of words) learned from the corpus
    print vectorizer.vocabulary_
    # show the raw corpus encoded as a bag-of-words count matrix
    print X.toarray()
```
Most documents use only a small subset of the words in the corpus, so the resulting matrix contains many zero values (typically more than 99%). For example, a collection of 10,000 short texts (such as emails) may use a total vocabulary of 100,000 words, while each individual document uses only 100 to 1,000 unique words.
To be able to hold such a matrix in memory, and to keep matrix/vector algebra fast, sklearn stores and operates on these features using sparse representations.
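A quick sketch of what this means in practice (fit_transform returns a scipy sparse matrix rather than a dense array):

```python
# -*- coding:utf-8 -*-
# Small sketch: the count matrix is stored sparsely (scipy CSR format).
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.', 'And the third one.']
X = CountVectorizer().fit_transform(corpus)

print type(X)      # scipy.sparse CSR matrix: only non-zero entries are stored
print X.nnz        # number of explicitly stored (non-zero) values
print X.toarray()  # densify only when a dense array is really needed
```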
The bag-of-words vocabulary is fixed at training time, so words that did not appear in the training corpus are completely ignored in later calls to the transform method:
```python
vectorizer.transform(['Something completely new.']).toarray()
# array([[0, 0, 0, 0, 0, 0, 0, 0, 0]]...)
```
To some extent this limits the generalization ability of the bag-of-words model.
In a large text corpus, some words occur very frequently (for example "the", "a", "is" in English) yet carry very little information. We cannot feed these raw counts directly into a classifier, because they would drown out the rarer but more interesting terms. We therefore need to re-weight the integer counts into floating-point values that are better suited to a classifier; this is done with the tf-idf transform.
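For reference, the usual textbook formulation is the following (scikit-learn's TfidfTransformer uses a slightly smoothed variant, but the idea is the same):

$$
\mathrm{tf}(t,d)=\frac{n_{t,d}}{\sum_{k} n_{k,d}},\qquad
\mathrm{idf}(t)=\log\frac{N}{1+\mathrm{df}(t)},\qquad
\mathrm{tfidf}(t,d)=\mathrm{tf}(t,d)\times\mathrm{idf}(t)
$$

where $N$ is the total number of documents in the corpus, $\mathrm{df}(t)$ is the number of documents containing term $t$, and $n_{t,d}$ is the number of times term $t$ occurs in document $d$.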
The more common a word is, the larger the denominator and hence the smaller its inverse document frequency (the closer it gets to 0). The 1 is added to the denominator to avoid division by zero (the case where no document contains the word), and log means taking the logarithm of the result.
As we can see, a term's frequency within a single sample and its frequency across the whole corpus temper each other and together determine the term's weight dynamically.
scikit-learn implements the TF-IDF re-weighting in TfidfTransformer(); TfidfVectorizer, used in the example below, combines tokenization, counting, and the TF-IDF transform in one class:
```python
# -*- coding:utf-8 -*-
from sklearn.feature_extraction.text import TfidfVectorizer

if __name__ == '__main__':
    vectorizer = TfidfVectorizer(min_df=1)
    corpus = [
        'This is the first document.',
        'This is the second second document.',
        'And the third one.',
        'Is this the first document?',
    ]
    tfidf = vectorizer.fit_transform(corpus)
    # show the raw corpus encoded as a TF-IDF vector matrix
    print tfidf.toarray()
    '''
    [[ 0.          0.43877674  0.54197657  0.43877674  0.          0.          0.35872874  0.          0.43877674]
     [ 0.          0.27230147  0.          0.27230147  0.          0.85322574  0.22262429  0.          0.27230147]
     [ 0.55280532  0.          0.          0.          0.55280532  0.          0.28847675  0.55280532  0.        ]
     [ 0.          0.43877674  0.54197657  0.43877674  0.          0.          0.35872874  0.          0.43877674]]
    '''
```
Like the bag-of-words model, TF-IDF abstracts each raw sample into a fixed-length vector; the difference is that the word counts of the bag-of-words model are replaced with TF-IDF weights.
A set of unigrams (i.e. a bag of words) cannot capture phrases and multi-word expressions, so we can use n-grams to combine words over a sliding window and then perform the frequency vectorization on top of the n-grams.
CountVectorizer also implements n-grams; in other words, n-gram extraction is just a parameter option of the count (word-frequency) model:
```python
# -*- coding:utf-8 -*-
from sklearn.feature_extraction.text import CountVectorizer

if __name__ == '__main__':
    vectorizer = CountVectorizer(min_df=1, analyzer='word', ngram_range=(2, 3))
    corpus = [
        'This is the first document.',
        'This is the second second document.',
        'And the third one.',
        'Is this the first document?',
    ]
    tfidf = vectorizer.fit_transform(corpus)
    # the n-gram vocabulary
    print vectorizer.get_feature_names()
    # show the raw corpus encoded as an n-gram count matrix
    print tfidf.toarray()
    '''
    [u'and the', u'and the third', u'first document', u'is the', u'is the first', u'is the second',
     u'is this', u'is this the', u'second document', u'second second', u'second second document',
     u'the first', u'the first document', u'the second', u'the second second', u'the third',
     u'the third one', u'third one', u'this is', u'this is the', u'this the', u'this the first']
    [[0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0]
     [0 0 0 1 0 1 0 0 1 1 1 0 0 1 1 0 0 0 1 1 0 0]
     [1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0]
     [0 0 1 0 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0 1 1]]
    '''
```
Training word2vec means training a shallow neural network that maps every word in the training corpus into a vector space of a specified dimension.
```python
# -*- coding:utf-8 -*-
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
import gensim
import os

if __name__ == '__main__':
    modelpath = "./word2vec.test.txt"
    if os.path.isfile(modelpath):
        # load an existing model
        print "load modeling..."
        model = gensim.models.Word2Vec.load(modelpath)
    else:
        # the corpus must be an iterable of tokenized sentences (lists of words)
        corpus = [
            ['This', 'is', 'the', 'first', 'document.'],
            ['This', 'is', 'the', 'second', 'second', 'document.'],
            ['And', 'the', 'third', 'one.'],
            ['Is', 'this', 'the', 'first', 'document?']
        ]
        # embed every word into a 100-dimensional vector space
        model = gensim.models.Word2Vec(corpus, min_count=1, size=100)
        # save the model
        model.save(modelpath)

    print model['This']
    '''
    load modeling...
    [  4.21720184e-03  -4.96086199e-03   3.77745135e-03   2.94174161e-03
      -1.84197503e-03  -2.94078956e-03   1.41434965e-03  -1.12752395e-03
       3.44854128e-03  -1.56023342e-03   2.58653867e-03   2.33289364e-04
       3.44703044e-03  -2.01581535e-03   4.42115450e-03  -2.88038654e-03
      -2.38809455e-03  -4.50134743e-03  -1.49860769e-03   7.91519240e-04
       4.98433039e-03   1.85355416e-03   2.31889612e-03  -1.69523829e-03
      -3.30593879e-03   4.40168194e-03  -4.88520879e-03   2.60615419e-03
       6.49481721e-04  -2.49359757e-03  -3.32681416e-03   2.01359508e-03
       3.97601305e-03   6.56171120e-04   3.81603022e-03   2.93262041e-04
      -2.28614034e-03  -2.23138509e-03  -2.07091100e-03  -2.18214374e-03
      -1.24846201e-03  -4.72204387e-03   1.10300467e-03   2.74274289e-03
       3.69609370e-05   2.28803046e-03   1.93586131e-03  -3.52792139e-03
       6.02113956e-04  -4.30466002e-03  -1.68499397e-03   4.44801664e-03
       3.73569527e-03  -2.87452945e-03  -4.44274070e-03   1.91680994e-03
       3.03726265e-04  -2.60479492e-03   3.86350509e-03  -3.56708956e-03
      -4.24962817e-03  -2.64985068e-03   4.89832275e-03   4.93438961e-03
      -8.93970719e-04  -4.92232037e-04  -2.22921767e-03  -2.13925354e-03
       3.71658040e-04   2.85526551e-03   3.21991998e-03   3.41509795e-03
      -4.62498562e-03  -2.23036925e-03   4.81000589e-03   3.47611774e-03
      -4.62327013e-03  -2.20024776e-05   4.42962535e-03   2.17637443e-03
       1.95589405e-03   3.56489979e-03   2.77884956e-03  -1.01689191e-03
      -3.14383302e-03   1.79978073e-04  -4.77676420e-03   4.18598717e-03
      -2.46347464e-03  -4.86065960e-03   2.29529128e-03   2.09548216e-06
       4.92842309e-03   4.01797617e-04  -4.82031086e-04   1.20579556e-03
       2.56112689e-04  -1.17955834e-03  -4.68734046e-03   3.14474717e-04]
    '''
```
As can be seen, the basic unit of word2vec vectorization is the word: each word is mapped to a vector of the specified dimension, so a word sequence (a sentence) becomes a vector matrix (number of words × the chosen word2vec embedding dimension). However, most machine-learning algorithms expect each input sample to be a one-dimensional tensor, so we need one more feature-processing step: using the word-vector table to encode the original corpus. There are several ways to do this encoding, for example:
1. Sum all the word vectors of a sample and take the per-dimension mean as the sample's vector.
2. Do method 1, but weight each word vector by its TF-IDF weight first.
Compared with the bag-of-words model, the dimension of a sample (a word sequence) after word2vec embedding equals the dimension of the embedding space (e.g. 100 in the code above), whereas the dimension of a bag-of-words encoding equals the vocabulary size; in most cases the word-vector encoding therefore has a lower dimension than the bag-of-words encoding.
```python
# -*- coding:utf-8 -*-
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
import gensim
from collections import Counter, defaultdict
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import os
import numpy as np


class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.dim = len(word2vec.itervalues().next())

    def fit(self):
        return self

    # for each word in the input sequence, look up its vector in the word-vector table;
    # words not in the table (i.e. unseen during training) are skipped, and if no word
    # is known the sample is encoded as the zero vector
    def transform(self, X):
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])


# and a tf-idf version of the same
class TfidfEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.word2weight = None
        self.dim = len(word2vec.itervalues().next())

    def fit(self, X):
        tfidf = TfidfVectorizer(analyzer=lambda x: x)
        tfidf.fit(X)
        # if a word was never seen - it must be at least as infrequent
        # as any of the known words - so the default idf is the max of
        # known idf's
        max_idf = max(tfidf.idf_)
        self.word2weight = defaultdict(
            lambda: max_idf,
            [(w, tfidf.idf_[i]) for w, i in tfidf.vocabulary_.items()])
        return self

    # multiply each word vector by the word's TF-IDF weight before averaging
    def transform(self, X):
        return np.array([
            np.mean([self.word2vec[w] * self.word2weight[w]
                     for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])


corpus = [
    ['This', 'is', 'the', 'first', 'document.'],
    ['This', 'is', 'the', 'second', 'second', 'document.'],
    ['And', 'the', 'third', 'one.'],
    ['Is', 'this', 'the', 'first', 'document?']
]

if __name__ == '__main__':
    modelpath = "./word2vec.test.txt"
    model = None
    if os.path.isfile(modelpath):
        # load an existing model
        print "load modeling..."
        model = gensim.models.Word2Vec.load(modelpath)
    else:
        # embed every word into a 100-dimensional vector space
        model = gensim.models.Word2Vec(corpus, min_count=1, size=100)
        # save the model
        model.save(modelpath)

    print "word embedding vocab: ", model.wv.vocab.keys()

    # build the word-vector vocabulary
    words_vocab = dict()
    for key in model.wv.vocab.keys():
        nums = map(float, model[key])
        words_vocab[key] = np.array(nums)

    meanVectorizer = MeanEmbeddingVectorizer(words_vocab)  # fit() can be skipped
    # encode the training corpus into one row vector per sample (mean of word vectors)
    corpusVecs = meanVectorizer.transform(corpus)
    for i in range(len(corpus)):
        print corpus[i]
        print corpusVecs[i]
        print ""

    tfidfVectorizer = TfidfEmbeddingVectorizer(words_vocab)
    tfidfVectorizer.fit(corpus)
    # encode the training corpus into one row vector per sample (TF-IDF weighted mean)
    corpusVecs = tfidfVectorizer.transform(corpus)
    for i in range(len(corpus)):
        print corpus[i]
        print corpusVecs[i]
        print ""

    # print words_vocab
    '''
    load modeling...
    [  4.21720184e-03  -4.96086199e-03   3.77745135e-03   2.94174161e-03
      -1.84197503e-03  -2.94078956e-03   1.41434965e-03  -1.12752395e-03
       3.44854128e-03  -1.56023342e-03   2.58653867e-03   2.33289364e-04
       3.44703044e-03  -2.01581535e-03   4.42115450e-03  -2.88038654e-03
      -2.38809455e-03  -4.50134743e-03  -1.49860769e-03   7.91519240e-04
       4.98433039e-03   1.85355416e-03   2.31889612e-03  -1.69523829e-03
      -3.30593879e-03   4.40168194e-03  -4.88520879e-03   2.60615419e-03
       6.49481721e-04  -2.49359757e-03  -3.32681416e-03   2.01359508e-03
       3.97601305e-03   6.56171120e-04   3.81603022e-03   2.93262041e-04
      -2.28614034e-03  -2.23138509e-03  -2.07091100e-03  -2.18214374e-03
      -1.24846201e-03  -4.72204387e-03   1.10300467e-03   2.74274289e-03
       3.69609370e-05   2.28803046e-03   1.93586131e-03  -3.52792139e-03
       6.02113956e-04  -4.30466002e-03  -1.68499397e-03   4.44801664e-03
       3.73569527e-03  -2.87452945e-03  -4.44274070e-03   1.91680994e-03
       3.03726265e-04  -2.60479492e-03   3.86350509e-03  -3.56708956e-03
      -4.24962817e-03  -2.64985068e-03   4.89832275e-03   4.93438961e-03
      -8.93970719e-04  -4.92232037e-04  -2.22921767e-03  -2.13925354e-03
       3.71658040e-04   2.85526551e-03   3.21991998e-03   3.41509795e-03
      -4.62498562e-03  -2.23036925e-03   4.81000589e-03   3.47611774e-03
      -4.62327013e-03  -2.20024776e-05   4.42962535e-03   2.17637443e-03
       1.95589405e-03   3.56489979e-03   2.77884956e-03  -1.01689191e-03
      -3.14383302e-03   1.79978073e-04  -4.77676420e-03   4.18598717e-03
      -2.46347464e-03  -4.86065960e-03   2.29529128e-03   2.09548216e-06
       4.92842309e-03   4.01797617e-04  -4.82031086e-04   1.20579556e-03
       2.56112689e-04  -1.17955834e-03  -4.68734046e-03   3.14474717e-04]
    '''
```
Note that taking the mean also solves the problem of sentences having different lengths: averaging guarantees that the scale of the encoded vector does not depend on sentence length. Averaging is arguably more reasonable than the fix-length-and-padding approach.
Doc2Vec, also called paragraph2vec or sentence embeddings, is an unsupervised algorithm that learns vector representations of sentences/paragraphs/documents; it is an extension of word2vec.
The learned vectors can be used to find similarity between sentences/paragraphs/documents by computing distances between them, and can further be used to label documents (see the usage sketch after the training code below).
```python
# -*- coding:utf-8 -*-
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
import gensim
gensimLabeledSentence = gensim.models.doc2vec.LabeledSentence
import os


# train the model with a collection of documents
class LabeledLineSentence(object):
    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list

    def __iter__(self):
        for idx, doc in enumerate(self.doc_list):
            # gensim trains on words, so both sentences and documents are split into words
            yield gensimLabeledSentence(words=doc.split(), tags=[self.labels_list[idx]])


corpus = [
    'This is the first document.This is the first document.This is the first document.This is the first document.This is the first document.',
    'This is the second second document.This is the second second document.This is the second second document.This is the second second document.',
    'And the third one.And the third one.And the third one.And the third one.And the third one.',
    'Is this the first document?Is this the first document?Is this the first document?Is this the first document?Is this the first document?',
    'This is the first document.This is the first document.This is the first document.This is the first document.This is the first document.',
    'This is the second second document.This is the second second document.This is the second second document.This is the second second document.',
    'And the third one.And the third one.And the third one.And the third one.And the third one.',
    'Is this the first document?Is this the first document?Is this the first document?Is this the first document?Is this the first document?',
    'This is the first document.This is the first document.This is the first document.This is the first document.This is the first document.',
    'This is the second second document.This is the second second document.This is the second second document.This is the second second document.',
    'And the third one.And the third one.And the third one.And the third one.And the third one.',
    'Is this the first document?Is this the first document?Is this the first document?Is this the first document?Is this the first document?'
]
corpus_label = [
    'normal', 'normal', 'normal', 'bad',
    'normal', 'normal', 'normal', 'bad',
    'normal', 'normal', 'normal', 'bad'
]

if __name__ == '__main__':
    modelpath = "./doc2vec.test.txt"
    model = None
    if os.path.isfile(modelpath):
        # load an existing model
        print "load modeling..."
        model = gensim.models.Doc2Vec.load(modelpath)
        # test the model
        print model['normal']
    else:
        # load the sample set
        it = LabeledLineSentence(corpus, corpus_label)
        # train Doc2Vec and save the model
        model = gensim.models.Doc2Vec(size=300, window=10, min_count=5, workers=11,
                                      alpha=0.025, min_alpha=0.025)
        model.build_vocab(it)
        for epoch in range(10):
            model.train(it, total_examples=model.corpus_count, epochs=model.iter)
            model.alpha -= 0.002              # decrease the learning rate
            model.min_alpha = model.alpha     # fix the learning rate, no decay
            model.train(it, total_examples=model.corpus_count, epochs=model.iter)
        model.save(modelpath)
```
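Once the model is trained, the learned paragraph vectors can be looked up by tag and compared. The following is only a minimal usage sketch, assuming the pre-4.0 gensim API used throughout this article (where document vectors live under model.docvecs):

```python
# Hypothetical usage sketch (pre-4.0 gensim API, matching the training code above).
print model.docvecs['normal']                     # paragraph vector learned for the 'normal' tag
print model.docvecs.similarity('normal', 'bad')   # cosine similarity between the two tags
print model.docvecs.most_similar('bad')           # tags closest to 'bad' in the learned space
# infer a vector for a new, unseen document with the trained model
print model.infer_vector(['This', 'is', 'the', 'first', 'document?'])
```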
Relevant Link:
http://scikit-learn.org/stable/modules/feature_extraction.html
http://blog.csdn.net/u010213393/article/details/40987945
http://www.ruanyifeng.com/blog/2013/03/tf-idf.html
http://d0evi1.com/sklearn/feature_extraction/
http://blog.csdn.net/sinsa110/article/details/76855428
http://cloga.info/2014/01/19/sklearn_text_feature_extraction
http://blog.csdn.net/jerr__y/article/details/52967351
http://www.52nlp.cn/中英文維基百科語料上的word2vec實驗
http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/
https://github.com/nadbordrozd/blog_stuff/blob/master/classification_w2v/benchmarking.ipynb
https://rare-technologies.com/doc2vec-tutorial/
http://www.jianshu.com/p/854a59b93e09
http://cs.stanford.edu/~quocle/paragraph_vector.pdf
http://blog.csdn.net/lenbow/article/details/52120230
```python
# -*- coding:utf-8 -*-
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
import os
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def load_one_flle(filename):
    x = []
    with open(filename) as f:
        line = f.readline()
        line = line.strip('\n')
    return line


def load_adfa_training_files(rootdir):
    x = []
    y = []
    list = os.listdir(rootdir)
    for i in range(0, len(list)):
        path = os.path.join(rootdir, list[i])
        if os.path.isfile(path):
            x.append(load_one_flle(path))
            y.append(0)
    return x, y


def dirlist(path, allfile):
    filelist = os.listdir(path)
    for filename in filelist:
        filepath = os.path.join(path, filename)
        if os.path.isdir(filepath):
            dirlist(filepath, allfile)
        else:
            allfile.append(filepath)
    return allfile


def load_adfa_webshell_files(rootdir):
    x = []
    y = []
    allfile = dirlist(rootdir, [])
    for file in allfile:
        if re.match(r"\.\./data/ADFA-LD/Attack_Data_Master/Web_Shell_\d+\\UAD-W*", file):
            x.append(load_one_flle(file))
            y.append(1)
    return x, y


if __name__ == '__main__':
    x1, y1 = load_adfa_training_files("../data/ADFA-LD/Training_Data_Master/")    # training set (normal)
    x2, y2 = load_adfa_webshell_files("../data/ADFA-LD/Attack_Data_Master/")      # training set (attack)
    x3, y3 = load_adfa_training_files("../data/ADFA-LD/Validation_Data_Master/")  # validation set (normal)

    # mix the normal (white) and attack (black) samples for the training set
    x_train = x1 + x2
    y_train = y1 + y2
    x_validate = x3 + x2
    y_validate = y3 + y2

    # bag-of-words model: only count occurrences of single words
    vectorizer = CountVectorizer(min_df=1)
    vecmodel = vectorizer.fit(x_train)                      # build the vocabulary from the training set
    x_train = vecmodel.transform(x_train).toarray()         # encode the training set as count vectors
    x_validate = vecmodel.transform(x_validate).toarray()   # encode the validation set with the same vocabulary

    # fit a KNN model on the training set
    clf = KNeighborsClassifier(n_neighbors=4).fit(x_train, y_train)
    scores = cross_val_score(clf, x_train, y_train, n_jobs=-1, cv=10)
    # reflects how well the KNN model fits the training data
    print "Training accurate: "
    print scores
    print np.mean(scores)

    # Make predictions using the validate set
    # print x_train.shape
    # print x_validate.shape
    y_pre = clf.predict(x_validate)
    print "Predict result: ", y_pre
    # prediction accuracy
    print "Prediction accurate: %2f" % np.mean(y_pre == y_validate)
```
This reaches roughly 93% accuracy on the validation set.
```python
# -*- coding:utf-8 -*-
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
import os
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def load_one_flle(filename):
    x = []
    with open(filename) as f:
        line = f.readline()
        line = line.strip('\n')
    return line


def load_adfa_training_files(rootdir):
    x = []
    y = []
    list = os.listdir(rootdir)
    for i in range(0, len(list)):
        path = os.path.join(rootdir, list[i])
        if os.path.isfile(path):
            x.append(load_one_flle(path))
            y.append(0)
    return x, y


def dirlist(path, allfile):
    filelist = os.listdir(path)
    for filename in filelist:
        filepath = os.path.join(path, filename)
        if os.path.isdir(filepath):
            dirlist(filepath, allfile)
        else:
            allfile.append(filepath)
    return allfile


def load_adfa_webshell_files(rootdir):
    x = []
    y = []
    allfile = dirlist(rootdir, [])
    for file in allfile:
        if re.match(r"\.\./data/ADFA-LD/Attack_Data_Master/Web_Shell_\d+\\UAD-W*", file):
            x.append(load_one_flle(file))
            y.append(1)
    return x, y


if __name__ == '__main__':
    x1, y1 = load_adfa_training_files("../data/ADFA-LD/Training_Data_Master/")    # training set (normal)
    x2, y2 = load_adfa_webshell_files("../data/ADFA-LD/Attack_Data_Master/")      # training set (attack)
    x3, y3 = load_adfa_training_files("../data/ADFA-LD/Validation_Data_Master/")  # validation set (normal)

    # mix the normal (white) and attack (black) samples for the training set
    x_train = x1 + x2
    y_train = y1 + y2
    x_validate = x3 + x2
    y_validate = y3 + y2

    # TF-IDF model
    vectorizer = TfidfVectorizer(min_df=1)
    vecmodel = vectorizer.fit(x_train)   # build the vocabulary from the training set
    print "vocabulary_: "
    print vecmodel.vocabulary_

    x_train = vecmodel.transform(x_train).toarray()
    x_validate = vecmodel.transform(x_validate).toarray()
    print "x_train[0]: ", x_train[0]
    print "x_validate[0]: ", x_validate[0]

    # fit a KNN model on the training set
    clf = KNeighborsClassifier(n_neighbors=4).fit(x_train, y_train)
    # reflects how well the KNN model fits the training data
    y_train_pre = clf.predict(x_train)
    print "Train result: ", y_train_pre
    print "Train accurate: %2f" % np.mean(y_train_pre == y_train)

    # Make predictions using the validate set
    # print x_train.shape
    # print x_validate.shape
    y_valid_pre = clf.predict(x_validate)
    print "Predict result: ", y_valid_pre
    # prediction accuracy
    print "Prediction accurate: %2f" % np.mean(y_valid_pre == y_validate)
```
```python
# -*- coding:utf-8 -*-
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
import os
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def load_one_flle(filename):
    x = []
    with open(filename) as f:
        line = f.readline()
        line = line.strip('\n')
    return line


def load_adfa_training_files(rootdir):
    x = []
    y = []
    list = os.listdir(rootdir)
    for i in range(0, len(list)):
        path = os.path.join(rootdir, list[i])
        if os.path.isfile(path):
            x.append(load_one_flle(path))
            y.append(0)
    return x, y


def dirlist(path, allfile):
    filelist = os.listdir(path)
    for filename in filelist:
        filepath = os.path.join(path, filename)
        if os.path.isdir(filepath):
            dirlist(filepath, allfile)
        else:
            allfile.append(filepath)
    return allfile


def load_adfa_webshell_files(rootdir):
    x = []
    y = []
    allfile = dirlist(rootdir, [])
    for file in allfile:
        if re.match(r"\.\./data/ADFA-LD/Attack_Data_Master/Web_Shell_\d+\\UAD-W*", file):
            x.append(load_one_flle(file))
            y.append(1)
    return x, y


if __name__ == '__main__':
    x1, y1 = load_adfa_training_files("../data/ADFA-LD/Training_Data_Master/")    # training set (normal)
    x2, y2 = load_adfa_webshell_files("../data/ADFA-LD/Attack_Data_Master/")      # training set (attack)
    x3, y3 = load_adfa_training_files("../data/ADFA-LD/Validation_Data_Master/")  # validation set (normal)

    # mix the normal (white) and attack (black) samples for the training set
    x_train = x1 + x2
    y_train = y1 + y2
    x_validate = x3 + x2
    y_validate = y3 + y2

    # n-gram model
    vectorizer = CountVectorizer(min_df=1, analyzer='word', ngram_range=(2, 3))
    vecmodel = vectorizer.fit(x_train)
    x_train = vecmodel.transform(x_train).toarray()
    x_validate = vecmodel.transform(x_validate).toarray()

    # fit a KNN model on the training set
    clf = KNeighborsClassifier(n_neighbors=4).fit(x_train, y_train)
    scores = cross_val_score(clf, x_train, y_train, n_jobs=-1, cv=10)
    # reflects how well the KNN model fits the training data
    print "Training accurate: "
    print scores
    print np.mean(scores)

    # Make predictions using the validate set
    y_pre = clf.predict(x_validate)
    print "Predict result: ", y_pre
    # prediction accuracy
    print "Prediction accurate: %2f" % np.mean(y_pre == y_validate)
```
```python
# -*- coding:utf-8 -*-
import re
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
import gensim
from collections import Counter, defaultdict
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import os
import numpy as np


class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.dim = len(word2vec.itervalues().next())

    def fit(self):
        return self

    # for each word in the input sequence, look up its vector in the word-vector table;
    # words not in the table (i.e. unseen during training) are skipped, and if no word
    # is known the sample is encoded as the zero vector
    def transform(self, X):
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])


def load_one_flle(filename):
    x = []
    with open(filename) as f:
        line = f.readline()
        x = line.strip('\n').split()
    return x


def load_adfa_training_files(rootdir):
    x = []
    y = []
    list = os.listdir(rootdir)
    for i in range(0, len(list)):
        path = os.path.join(rootdir, list[i])
        if os.path.isfile(path):
            x.append(load_one_flle(path))
            y.append(0)
    return x, y


def dirlist(path, allfile):
    filelist = os.listdir(path)
    for filename in filelist:
        filepath = os.path.join(path, filename)
        if os.path.isdir(filepath):
            dirlist(filepath, allfile)
        else:
            allfile.append(filepath)
    return allfile


def load_adfa_webshell_files(rootdir):
    x = []
    y = []
    allfile = dirlist(rootdir, [])
    for file in allfile:
        if re.match(r"\.\./data/ADFA-LD/Attack_Data_Master/Web_Shell_\d+\\UAD-W*", file):
            x.append(load_one_flle(file))
            y.append(1)
    return x, y


if __name__ == '__main__':
    x1, y1 = load_adfa_training_files("../data/ADFA-LD/Training_Data_Master/")    # training set (normal)
    x2, y2 = load_adfa_webshell_files("../data/ADFA-LD/Attack_Data_Master/")      # training set (attack)
    x3, y3 = load_adfa_training_files("../data/ADFA-LD/Validation_Data_Master/")  # validation set (normal)

    # mix the normal (white) and attack (black) samples for the training set
    x_train = x1 + x2
    y_train = y1 + y2
    x_validate = x3 + x2
    y_validate = y3 + y2

    modelpath = "./word2vec.test.txt"
    model = None
    if os.path.isfile(modelpath):
        # load an existing model
        print "load modeling..."
        model = gensim.models.Word2Vec.load(modelpath)
    else:
        # embed every word into a 100-dimensional vector space
        model = gensim.models.Word2Vec(x_train, min_count=1, size=100)
        # save the model
        model.save(modelpath)

    print "word embedding vocab: ", model.wv.vocab.keys()

    # build the word-vector vocabulary
    words_vocab = dict()
    for key in model.wv.vocab.keys():
        nums = map(float, model[key])
        words_vocab[key] = np.array(nums)

    meanVectorizer = MeanEmbeddingVectorizer(words_vocab)
    # encode the training corpus into one row vector per sample (mean of word vectors)
    x_trainVecs = meanVectorizer.transform(x_train)
    # for i in range(len(x_train)):
    #     print x_train[i]
    #     print x_trainVecs[i]
    #     print ""

    # encode the validation corpus into one row vector per sample (mean of word vectors)
    x_validateVecs = meanVectorizer.transform(x_validate)
    # for i in range(len(x_train)):
    #     print x_validate[i]
    #     print x_validateVecs[i]
    #     print ""

    # fit a KNN model on the training set
    clf = KNeighborsClassifier(n_neighbors=4).fit(x_trainVecs, y_train)
    scores = cross_val_score(clf, x_trainVecs, y_train, n_jobs=-1, cv=10)
    # reflects how well the KNN model fits the training data
    print "Training accurate: "
    print scores
    print np.mean(scores)

    # Make predictions using the validate set
    y_pre = clf.predict(x_validateVecs)
    print "Predict result: ", y_pre
    # prediction accuracy
    print "Prediction accurate: %2f" % np.mean(y_pre == y_validate)
```
https://arxiv.org/pdf/1611.01726.pdf - LSTM-BASED SYSTEM-CALL LANGUAGE MODELING AND ROBUST ENSEMBLE METHOD FOR DESIGNING HOST-BASED INTRUSION DETECTION SYSTEMS
http://www.internationaljournalssrg.org/IJCSE/2015/Volume2-Issue6/IJCSE-V2I6P109.pdf - Review of A Semantic Approach to Host-based Intrusion Detection Systems Using Contiguous and Dis-contiguous System Call Patterns
http://www.ijirst.org/articles/IJIRSTV1I11121.pdf - A Host Based Intrusion Detection System Using Improved Extreme Learning Machine
Copyright (c) 2017 LittleHann All rights reserved