[hyperscan] Example walkthrough: pcapscan

Date: 2019-12-14

Tags: hyperscan, example walkthrough, pcapscan

Example location: <hyperscan source>/examples/pcapscan.cc
Reference: http://01org.github.io/hyperscan/dev-reference/api_files.html

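1. Overview

This example implements a simple benchmark that measures packet-matching performance.

pcapscan uses libpcap to read packets from a pcap file, matches them against the regular expressions listed in a rule file, and prints the match results along with some statistics. It runs and compares two scanning modes, BLOCK and STREAM. In BLOCK mode each packet is scanned on its own. In STREAM mode the packets are first grouped into flows by their five-tuple, and the data of each flow is scanned as one stream. STREAM mode can therefore find matches that span packet boundaries. For example, when matching abc, if a is at the end of one packet and bc is at the start of the next packet in the same flow, STREAM mode reports the match and BLOCK mode does not.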
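The boundary case can be reproduced without Hyperscan. The following is only a toy sketch: it uses std::string::find in place of a real pattern matcher, and the names countIn, countPerChunk and countAcrossChunks are made up for this illustration (they are not part of pcapscan or of the Hyperscan API).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Count (possibly overlapping) occurrences of needle in text.
static std::size_t countIn(const std::string &text, const std::string &needle) {
    std::size_t hits = 0;
    for (std::size_t pos = text.find(needle); pos != std::string::npos;
         pos = text.find(needle, pos + 1)) {
        ++hits;
    }
    return hits;
}

// BLOCK-like behaviour: every chunk is searched on its own, so nothing
// is remembered between chunks.
std::size_t countPerChunk(const std::vector<std::string> &chunks,
                          const std::string &needle) {
    assert(!needle.empty());
    std::size_t hits = 0;
    for (const auto &chunk : chunks) {
        hits += countIn(chunk, needle);
    }
    return hits;
}

// STREAM-like behaviour: the last needle.size() - 1 bytes are carried over
// to the next chunk, so a match split across a boundary is still found.
// The carried tail is shorter than the needle, so no match is counted twice.
std::size_t countAcrossChunks(const std::vector<std::string> &chunks,
                              const std::string &needle) {
    assert(!needle.empty());
    std::size_t hits = 0;
    std::string tail;
    for (const auto &chunk : chunks) {
        const std::string window = tail + chunk;
        hits += countIn(window, needle);
        const std::size_t keep = std::min(window.size(), needle.size() - 1);
        tail = window.substr(window.size() - keep);
    }
    return hits;
}
```

With the two chunks "xxa" and "bcxx", countPerChunk finds no occurrence of abc while countAcrossChunks finds one. Hyperscan keeps its own compact stream state rather than buffering raw bytes like this; the sketch only shows why some state has to survive between packets.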

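This example demonstrates the following hyperscan concepts:

Compiling multiple patterns
Unlike the simplegrep example, pcapscan reads and compiles multiple regular expressions from a rule file. At run time the compiled database matches all of the patterns in parallel (rather than one pattern per scan).
Stream-mode matching
This covers setting up stream state and using the match callback in stream mode.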

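2. Source walkthrough

The sections below walk through the pcapscan source in the order the code executes.

2.1 Compilation

The function buildDatabase compiles the regular expressions from the rule file. Its mode parameter selects BLOCK or STREAM mode (HS_MODE_BLOCK or HS_MODE_STREAM).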

static hs_database_t *buildDatabase(const vector<const char *> &expressions,
                                    const vector<unsigned> flags,
                                    const vector<unsigned> ids,
                                    unsigned int mode) {
    hs_database_t *db;
    hs_compile_error_t *compileErr;
    hs_error_t err;

    Clock clock;
    clock.start();

    err = hs_compile_multi(expressions.data(), flags.data(), ids.data(),
                           expressions.size(), mode, nullptr, &db, &compileErr);
    clock.stop();

    if (err != HS_SUCCESS) {
        if (compileErr->expression < 0) {
            // The error does not refer to a particular expression.
            cerr << "ERROR: " << compileErr->message << endl;
        } else {
            cerr << "ERROR: Pattern '" << expressions[compileErr->expression]
                 << "' failed compilation with error: " << compileErr->message
                 << endl;
        }
        // As the compileErr pointer points to dynamically allocated memory, if
        // we get an error, we must be sure to release it. This is not
        // necessary when no error is detected.
        hs_free_compile_error(compileErr);
        exit(-1);
    }
//...
}

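The core of this code is the call to hs_compile_multi, which compiles multiple regular expressions. As the code shows, BLOCK and STREAM modes use the same API and differ only in the mode argument. Its prototype is: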

hs_error_t hs_compile_multi(const char *const * expressions, 
                            const unsigned int * flags, 
                            const unsigned int * ids, 
                            unsigned int elements, 
                            unsigned int mode, 
                            const hs_platform_info_t * platform, 
                            hs_database_t ** db, 
                            hs_compile_error_t ** error)

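Here expressions is the array of regular expression strings, flags and ids are the arrays of flags and IDs corresponding to expressions, and elements is the number of expressions. The remaining parameters have the same meaning as those of hs_compile, covered in the previous example.

One thing to note is the ids parameter, the array of regular expression IDs. Each expression here has a unique ID, so when a match occurs the match callback receives that ID and can tell the caller which expression matched. If NULL is passed for ids, every expression gets ID 0.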
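Since hs_compile_multi takes three parallel arrays, it helps to keep them in step. Here is a minimal sketch of one way to do that; PatternSet and cStringView are hypothetical names invented for this illustration and do not appear in pcapscan.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container that keeps the three parallel arrays together.
struct PatternSet {
    std::vector<std::string> patterns; // regular expression text
    std::vector<unsigned> flags;       // one flag word per pattern
    std::vector<unsigned> ids;         // one ID per pattern

    void add(const std::string &pattern, unsigned flag, unsigned id) {
        patterns.push_back(pattern);
        flags.push_back(flag);
        ids.push_back(id);
    }
};

// Build the array of C strings that a C API expects.
// The pointers are only valid while set.patterns is left unmodified.
std::vector<const char *> cStringView(const PatternSet &set) {
    assert(set.patterns.size() == set.flags.size());
    assert(set.patterns.size() == set.ids.size());
    std::vector<const char *> out;
    out.reserve(set.patterns.size());
    for (const auto &pattern : set.patterns) {
        out.push_back(pattern.c_str());
    }
    return out;
}
```

The result of cStringView(set).data(), together with set.flags.data(), set.ids.data() and set.patterns.size(), lines up with the first four arguments of the prototype above.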

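2.2 Preparing scratch space

The Benchmark constructor allocates enough temporary space (scratch space) for the scans that follow. There are two points worth noting: 1) BLOCK and STREAM mode scans share a single scratch; 2) to make that scratch large enough for both, hs_alloc_scratch is called twice, once per database, and on the second call hyperscan grows the scratch if it finds the existing space too small.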

public:
    Benchmark(const hs_database_t *streaming, const hs_database_t *block)
        : db_streaming(streaming), db_block(block), scratch(nullptr),
          matchCount(0) {
        // Allocate enough scratch space to handle either streaming or block
        // mode, so we only need the one scratch region.
        hs_error_t err = hs_alloc_scratch(db_streaming, &scratch);
        if (err != HS_SUCCESS) {
            cerr << "ERROR: could not allocate scratch space. Exiting." << endl;
            exit(-1);
        }
        // This second call will increase the scratch size if more is required
        // for block mode.
        err = hs_alloc_scratch(db_block, &scratch);
        if (err != HS_SUCCESS) {
            cerr << "ERROR: could not allocate scratch space. Exiting." << endl;
            exit(-1);
        }
    }

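2.3 Reading packets and classifying them into streams

Benchmark::readStreams reads all packets from the pcap file (in practice the encapsulation must be ethernet-ipv4-tcp/udp) and groups them into streams by five-tuple. The main code is: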

while ((pktData = pcap_next(pcapHandle, &pktHeader)) != nullptr) {
    unsigned int offset = 0, length = 0;
    if (!payloadOffset(pktData, &offset, &length)) {
        continue;
    }

    // Valid TCP or UDP packet
    const struct ip *iphdr = (const struct ip *)(pktData
            + sizeof(struct ether_header));
    const char *payload = (const char *)pktData + offset;

    size_t id = stream_map.insert(std::make_pair(FiveTuple(iphdr),
                                  stream_map.size())).first->second;

    packets.push_back(string(payload, length));
    stream_ids.push_back(id);
}

Note that the stream_ids vector stores the stream ID of each packet.
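The insert-or-look-up idiom on stream_map is compact enough to be easy to misread, so here is a standalone sketch of it. FlowKey, FlowKeyHash and streamIdFor are hypothetical stand-ins written for this illustration; pcapscan's own FiveTuple is built from the IP header and may differ in layout and hashing.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical five-tuple key (not pcapscan's FiveTuple).
struct FlowKey {
    std::uint32_t srcAddr;
    std::uint32_t dstAddr;
    std::uint16_t srcPort;
    std::uint16_t dstPort;
    std::uint8_t protocol;

    bool operator==(const FlowKey &o) const {
        return srcAddr == o.srcAddr && dstAddr == o.dstAddr &&
               srcPort == o.srcPort && dstPort == o.dstPort &&
               protocol == o.protocol;
    }
};

// Simple illustrative hash; collisions only cost speed, not correctness.
struct FlowKeyHash {
    std::size_t operator()(const FlowKey &k) const {
        const std::uint64_t addrs =
            (static_cast<std::uint64_t>(k.srcAddr) << 32) | k.dstAddr;
        const std::uint64_t rest =
            (static_cast<std::uint64_t>(k.srcPort) << 24) ^
            (static_cast<std::uint64_t>(k.dstPort) << 8) ^ k.protocol;
        return std::hash<std::uint64_t>()(addrs ^ rest);
    }
};

using StreamMap = std::unordered_map<FlowKey, std::size_t, FlowKeyHash>;

// Return the stream ID for key, assigning the next free ID on first sight.
// insert() leaves an existing entry untouched and returns an iterator to it,
// so a known key gets its old ID and a new key gets map.size() as its ID.
std::size_t streamIdFor(StreamMap &map, const FlowKey &key) {
    return map.insert(std::make_pair(key, map.size())).first->second;
}
```

In this sketch the two directions of a connection have different keys and therefore land in different streams, because source and destination are compared field by field.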

2.4 Opening streams

Because STREAM mode is used, the streams must be opened before scanning; see Benchmark::openStreams:

    // Open a Hyperscan stream for each stream in stream_ids
    void openStreams() {
        streams.resize(stream_map.size());
        for (auto &stream : streams) {
            hs_error_t err = hs_open_stream(db_streaming, 0, &stream);
            if (err != HS_SUCCESS) {
                cerr << "ERROR: Unable to open stream. Exiting." << endl;
                exit(-1);
            }
        }
    }

Here streams has type vector<hs_stream_t *>.

2.5 Scanning

2.5.1 STREAM mode

In Benchmark::scanStreams:

    // Scan each packet (in the ordering given in the PCAP file) through
    // Hyperscan using the streaming interface.
    void scanStreams() {
        for (size_t i = 0; i != packets.size(); ++i) {
            const std::string &pkt = packets[i];
            hs_error_t err = hs_scan_stream(streams[stream_ids[i]],
                                            pkt.c_str(), pkt.length(), 0,
                                            scratch, onMatch, &matchCount);
            if (err != HS_SUCCESS) {
                cerr << "ERROR: Unable to scan packet. Exiting." << endl;
                exit(-1);
            }
        }
    }

The prototype of hs_scan_stream:

hs_error_t hs_scan_stream(hs_stream_t * id, 
                          const char * data, 
                          unsigned int length, 
                          unsigned int flags, 
                          hs_scratch_t * scratch, 
                          match_event_handler onEvent, 
                          void * ctxt)

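Here id is the hs_stream_t pointer for the stream the data belongs to (calling it id does not feel quite right to me). The remaining parameters are the same as for hs_scan.

The streams[stream_ids[i]] used in this call was initialized in the previous step, when the streams were opened.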
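The scan calls pass onMatch as the callback and &matchCount as its context, but the callback body is not shown in this walkthrough. Below is a hedged sketch of what such a callback can look like. It follows the parameter list of Hyperscan's match_event_handler type (id, from, to, flags, context), yet MatchStats and onMatchCounting are names invented here, and hs.h is deliberately not included so the sketch compiles on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical context object: a total plus a per-pattern-ID tally.
struct MatchStats {
    std::size_t total = 0;
    std::map<unsigned int, std::size_t> perId;
};

// Same parameter list as a Hyperscan match callback.
// Returning 0 asks the scan to continue; a non-zero value asks it to stop.
int onMatchCounting(unsigned int id, unsigned long long from,
                    unsigned long long to, unsigned int flags, void *ctx) {
    (void)from;  // start offset, unused in this sketch
    (void)to;    // end offset of the match, unused in this sketch
    (void)flags; // unused in this sketch
    assert(ctx != nullptr);
    MatchStats *stats = static_cast<MatchStats *>(ctx);
    ++stats->total;
    ++stats->perId[id];
    return 0;
}
```

Under these assumptions, a MatchStats object would be passed as the last argument of the scan call in place of &matchCount, and the per-ID tally shows which rule IDs fired.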

2.5.2 BLOCK mode

BLOCK mode is much simpler than STREAM mode. In Benchmark::scanBlock:

    // Scan each packet (in the ordering given in the PCAP file) through
    // Hyperscan using the block-mode interface.
    void scanBlock() {
        for (size_t i = 0; i != packets.size(); ++i) {
            const std::string &pkt = packets[i];
            hs_error_t err = hs_scan(db_block, pkt.c_str(), pkt.length(), 0,
                                     scratch, onMatch, &matchCount);
            if (err != HS_SUCCESS) {
                cerr << "ERROR: Unable to scan packet. Exiting." << endl;
                exit(-1);
            }
        }
    }

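hs_scan was already covered in the simplegrep walkthrough, so it is not repeated here.

2.6 Cleaning up

This includes closing the streams (hs_close_stream), freeing the databases, and so on. Note that hs_close_stream may still perform matching, so its callback can fire (for example for a pattern anchored to the end of the data).

3. STREAM mode summary

Using STREAM mode is somewhat more involved than BLOCK mode. Here is a brief summary in pseudocode: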

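// N is the number of streams, fixed in advance.
// Error checking is omitted for brevity.
hs_database_t *db;
hs_stream_t   *streams[N];
hs_scratch_t  *tmp = NULL;      // must be NULL before the first hs_alloc_scratch
hs_compile_error_t *compileErr;
const char    *pkt;             // payload of the current packet
unsigned int   pktLen;          // its length in bytes

// 1) Compile multiple regular expressions in streaming mode
hs_compile_multi(expressions, flags, ids, elements, HS_MODE_STREAM,
                 NULL, &db, &compileErr);
// 2) Prepare scratch
hs_alloc_scratch(db, &tmp);
// 3) Open the streams
for (i = 0; i < N; i++)
  hs_open_stream(db, 0, &streams[i]);
// 4) Receive a packet and assign it to a stream
stream_id = classify(pkt);
// 5) Scan the packet in its stream
hs_scan_stream(streams[stream_id], pkt, pktLen, 0, tmp, callBack, ctx);
// 6) Clean up; note that hs_close_stream may still report matches
for (i = 0; i < N; i++)
  hs_close_stream(streams[i], tmp, callBack, ctx);
hs_free_scratch(tmp);
hs_free_database(db);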

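hs_database_size() and hs_stream_size() return the size of the database and of the per-stream stream state, respectively. The number and complexity of the regular expressions affect the stream state size, which may grow as they increase. On a system that handles millions of streams with a complex rule file, the memory consumed by stream state can be substantial.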
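To put a number on that, stream state memory scales linearly with the number of open streams. The helper below is a trivial sketch (streamStateBytes is a made-up name); the 25-byte figure is the per-stream size printed by the sample run in section 4, and the one-million stream count is purely an assumption for the sake of the example.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Total stream state memory for streamCount concurrently open streams,
// given the per-stream size reported for the compiled database.
std::uint64_t streamStateBytes(std::uint64_t streamCount,
                               std::uint64_t perStreamBytes) {
    // Guard against overflow of the multiplication.
    assert(perStreamBytes == 0 ||
           streamCount <=
               std::numeric_limits<std::uint64_t>::max() / perStreamBytes);
    return streamCount * perStreamBytes;
}
```

For one million streams at 25 bytes each this gives 25,000,000 bytes, about 25 MB. A rule set whose stream state is a few kilobytes per stream would need gigabytes at the same stream count.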

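4. Building and running

Before running the example, prepare a pcap file and a rule file. The rule file looks like this:

123:/weibo/
456:/[f|F]ile/

Each line holds one regular expression: the expression ID comes before the colon and the PCRE regular expression after it.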
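For readers who want to load such a file in their own code, here is a minimal sketch of a line parser for this format. It is not the parser from pcapscan.cc: Rule and parseRuleLine are hypothetical names, and skipping lines that start with # as well as keeping any letters after the closing slash as an opaque flag string are assumptions of this sketch.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical result type for one rule line.
struct Rule {
    unsigned id = 0;
    std::string regex; // text between the first and the last '/'
    std::string flags; // anything after the last '/', left uninterpreted
};

// Parse "<id>:/<regex>/<flags>". Returns false for blank lines,
// '#' comments (an assumption) and malformed lines.
// The separator must be an ASCII ':'.
bool parseRuleLine(const std::string &line, Rule &out) {
    if (line.empty() || line[0] == '#') {
        return false;
    }
    const std::size_t colon = line.find(':');
    // Require 1 to 9 digits so the ID always fits in an unsigned int.
    if (colon == std::string::npos || colon == 0 || colon > 9) {
        return false;
    }
    for (std::size_t i = 0; i < colon; ++i) {
        if (!std::isdigit(static_cast<unsigned char>(line[i]))) {
            return false;
        }
    }
    const std::string expr = line.substr(colon + 1);
    if (expr.size() < 2 || expr[0] != '/') {
        return false;
    }
    const std::size_t last = expr.rfind('/');
    assert(last != std::string::npos); // expr[0] is '/', so one always exists
    if (last == 0) {
        return false; // no closing slash
    }
    out.id = static_cast<unsigned>(std::stoul(line.substr(0, colon)));
    out.regex = expr.substr(1, last - 1);
    out.flags = expr.substr(last + 1);
    return true;
}
```

Called on 123:/weibo/ this yields ID 123 and the regex weibo. Using the last slash as the closing delimiter lets the regex itself contain slashes.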

Below is the output of building and running it. I used a pcap of Weibo traffic and matched the keyword weibo in it:

zzq@ubuntu14:~/hs_demo$ g++ -o pcapscan pcapscan.cc -std=c++11 -lhs -lpcap  
zzq@ubuntu14:~/hs_demo$ ./pcapscan ptn weibo.pcap 
Pattern file: ptn
Compiling Hyperscan databases with 1 patterns.
Hyperscan streaming mode database compiled in 0.000236959 seconds.
Hyperscan block mode database compiled in 4.8277e-05 seconds.
PCAP input file: weibo.pcap
4 packets in 3 streams, totalling 3641 bytes.
Average packet length: 910 bytes.
Average stream length: 1213 bytes.

Streaming mode Hyperscan database size    : 1000 bytes.
Block mode Hyperscan database size        : 1000 bytes.
Streaming mode Hyperscan stream state size: 25 bytes (per stream).

Streaming mode:

  Total matches: 9
  Match rate:    2.5312 matches/kilobyte
  Throughput (with stream overhead): 2576.33 megabits/sec
  Throughput (no stream overhead):   5444.49 megabits/sec

Block mode:

  Total matches: 9
  Match rate:    2.5312 matches/kilobyte
  Throughput:    16227.30 megabits/sec


WARNING: Input PCAP file is less than 2MB in size.
This test may have been too short to calculate accurate results.
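The summary figures in this run can be cross-checked by hand. The sketch below assumes that the averages are truncating integer divisions and that a kilobyte is 1024 bytes. Both assumptions reproduce the printed numbers, but they are inferred from the output rather than taken from the pcapscan source, and the function names are made up for this check.

```cpp
#include <cassert>
#include <cstddef>

// Average length as a truncating integer division (assumption).
std::size_t averageLength(std::size_t totalBytes, std::size_t count) {
    assert(count > 0);
    return totalBytes / count;
}

// Matches per kilobyte, taking 1 kilobyte as 1024 bytes (assumption).
double matchesPerKilobyte(std::size_t matches, std::size_t totalBytes) {
    assert(totalBytes > 0);
    return static_cast<double>(matches) /
           (static_cast<double>(totalBytes) / 1024.0);
}
```

With 3641 bytes in 4 packets and 3 streams, averageLength gives 910 and 1213, and matchesPerKilobyte(9, 3641) is about 2.5312, which agrees with the output above.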
