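Notes on Cisco's paper on detecting malicious encrypted TLS flows: because the samples are imbalanced, the classifier actually performs poorly, and the 99.9% accuracy figures mean very little. An important assumption behind Cisco's use of DNS and HTTP is that malware relies on DGAs and HTTP C&C (normal HTTP traffic carries images and the like). At first…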

0x00

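This series of notes records the questions and thoughts that came up while reading papers. It is informal and may be loosely structured, so it cannot fully convey the essence of the original paper; it is meant only for my own future review.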

Title: Identifying Encrypted Malware Traffic with Contextual Flow Data
Authors: Blake Anderson (Cisco), David McGrew (Cisco)
Source: AISec '16, Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security
Keywords: Malware; Machine Learning; Transport Layer Security; Network Monitoring

0x01 Problem Statement

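Identifying the type of malware from the encrypted traffic it sends and receives is very much needed.
Traditional feature extraction mostly focuses on packet sizes and a few timing-related parameters. This paper widens the feature set to include the full TLS handshake packets, the DNS flows from the same source as the TLS handshake, and the HTTP flows within a 5-minute window (the latter two are called contextual flows). Feeding the features extracted from this data into a supervised machine learning algorithm yields very high classification accuracy.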

0x02 Approach

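Feature extraction: for the contextual flows, from DNS flows we mainly analyze the responses returned by the DNS server that carry an address, together with the TTL value associated with that address; from HTTP flows we mainly analyze the various fields in the HTTP headers. For the TLS stream, we mainly analyze the information available in the handshake packets. For the remaining packets (ordinary TCP, UDP, and ICMP packets, i.e. observable metadata), we extract their "side-channel information".
Classification: normalize the features and feed them into a supervised learning algorithm.
Testing: evaluate on packets captured in a real network environment.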

0x03 Feature Sources

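TLS flows

A TLS flow is not encrypted at the start of the exchange, because it first has to complete a handshake with the remote server. The unencrypted TLS metadata we can observe includes the clientHello and clientKeyExchange messages. From these messages we can infer information such as the TLS library the client uses, and we find that benign traffic behaves very differently from malicious traffic.

On the client side, we first look at two TLS features: the offered ciphersuites and the advertised TLS extensions. For the former, malicious traffic more often offers the 0x0004 (TLS_RSA_WITH_RC4_128_MD5) suite in the clientHello, whereas benign traffic more often offers 0x002f (TLS_RSA_WITH_AES_128_CBC_SHA). For the latter, most TLS flows advertise 0x000d (signature_algorithms), but benign traffic also uses the following extensions, which are rarely seen in malicious traffic: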

 
• 0x0005 (status request)
• 0x3374 (next protocol negotiation)
• 0xff01 (renegotiation info)
  

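Next, we compare the client public keys in benign and malicious traffic. Benign traffic tends to use 256-bit elliptic-curve public keys, whereas malicious traffic tends to use 2048-bit RSA public keys.

On the server side, the serverHello message gives us the ciphersuite the server selected and the TLS extensions it supports. Benign traffic is more diverse here, while malicious traffic tends to select rather outdated options. The certificate message gives us the server's certificate chain. The number of certificates in the chain is about the same for malicious and benign traffic, but among chains of length 1, about 70% are self-signed in malicious traffic versus about 0.1% in benign traffic.

In addition, the SubjectAltName X.509 extension and the certificate validity period also help separate some benign and malicious traffic.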

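DNS flows

Many malware families use domain generation algorithms to generate domain names at random, which clearly sets them apart from ordinary traffic. This gives us a way in for identifying malicious traffic.

When comparing domain-name lengths, benign domain names roughly follow a Gaussian distribution that peaks at 6 or 7, whereas malicious domain names show a very sharp peak at 6. Looking at the character classes used in domain names, we find that benign domain names use numeric characters more than malicious ones do.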
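The lexical domain-name features mentioned here (length, numeric characters) are easy to compute. The following is only a minimal sketch; the function name `domain_features` is made up for illustration, and measuring the length over the whole name rather than a single label is my assumption, since the notes do not say which the paper uses.

```python
def domain_features(fqdn):
    """Compute simple lexical features of a domain name (illustrative only)."""
    # Assumption: features are measured over the whole name, dots excluded
    # from the non-alphanumeric count.
    name = fqdn.rstrip(".").lower()
    num_digits = sum(ch.isdigit() for ch in name)
    num_non_alnum = sum((not ch.isalnum()) and ch != "." for ch in name)
    return {
        "length": len(name),
        "num_digits": num_digits,
        "num_non_alnum": num_non_alnum,
    }


features = domain_features("cdn-01.example.com")
print(features)
```

Histograms of these values over benign and malicious flows would reproduce the kind of comparison described above.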

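When comparing the number of IP addresses carried in DNS responses, benign traffic most often returns 2 or 8, whereas malicious traffic most often returns 4 or 11. For the TTL values in the responses, the most common values in benign traffic are 60, 300, 20, and 30. In malicious traffic 300 is also common, but 20 and 30 are not, and the value 100 appears frequently even though it almost never appears in benign traffic.

Beyond these indicators, the Alexa ranking also shows how benign and malicious traffic differ in the domains they contact. We sort domains into 6 classes: top-100, top-1,000, top-10,000, top-100,000, top-1,000,000, and unranked. We then find that 86% of the domains in malicious traffic are unranked.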
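The Alexa bucketing can be sketched as a one-hot vector. This is a hedged illustration: the names `ALEXA_LIMITS` and `alexa_one_hot` are hypothetical, and the thresholds assume five top-N tiers from 100 up to 1,000,000 (the last tier is suggested by the "Alexa: top-1,000,000" entry in the weight table further down) plus an unranked class.

```python
# Assumed rank thresholds for the five ranked tiers; index 5 means "unranked".
ALEXA_LIMITS = [100, 1_000, 10_000, 100_000, 1_000_000]


def alexa_one_hot(rank):
    """Return a 6-element one-hot list for an Alexa rank (None = not ranked)."""
    vec = [0] * (len(ALEXA_LIMITS) + 1)
    if rank is not None:
        for i, limit in enumerate(ALEXA_LIMITS):
            if rank <= limit:
                vec[i] = 1
                return vec
    vec[-1] = 1  # unranked, or ranked below the last tier
    return vec


print(alexa_one_hot(5000))   # falls in the top-10,000 tier
print(alexa_one_hot(None))   # unranked
```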

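HTTP flows

In HTTP response headers, the fields malicious traffic uses most are Server, Set-Cookie, and Location, whereas benign traffic most often uses Connection, Expires, and Last-Modified. In HTTP request headers, the fields benign traffic uses most are User-Agent, Accept-Encoding, and Accept-Language.

Looking at field values, the most common Content-Type in benign traffic is image/*, whereas in malicious traffic it is text/*. Other MIME values common in malicious traffic are text/html;charset=UTF-8 and text/html;charset=utf-8.

Malicious traffic tends to report an old version of Nginx as its server, whereas benign traffic tends to report an old version of Apache or Nginx.

The most common User-Agent value in malicious traffic is Opera/9.50(WindowsNT6.0;U;en), followed by various versions of Mozilla/5.0 or Mozilla/4.0; benign traffic generally uses the Windows or OS X versions of Mozilla/5.0.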

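0x04 Feature Extraction Details

Side-channel information

(I did not understand this part; it is related to Markov chains.)
Create a length-256 array that keeps a count for each byte value seen in the payload.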
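My reading is that the 256-entry counting array is the paper's byte distribution (the "BD" in the results below), with one counter per possible byte value. Under that assumption, a minimal sketch looks like this; the function name `byte_distribution` is hypothetical.

```python
def byte_distribution(payloads):
    """Count how often each byte value (0..255) occurs across packet payloads.

    Assumption: `payloads` is an iterable of bytes objects, one per packet.
    """
    counts = [0] * 256
    for payload in payloads:
        for value in payload:  # iterating over bytes yields ints in 0..255
            counts[value] += 1
    return counts


bd = byte_distribution([b"\x00\x00\xff", b"A"])
print(bd[0], bd[65], bd[255])
```

Encrypted payloads should give a nearly flat distribution, which is presumably why this feature helps little on its own and is combined with the others.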

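TLS data

Client-based features: put the 176 ciphersuite types, the TLS extensions, and the public key length into a list, and use a binary vector (containing only 0s and 1s) to mark what this particular flow contains.
Server-based features: same as above.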
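The binary marking described here can be sketched as an indicator vector over a fixed vocabulary. The vocabularies below are shortened, hypothetical stand-ins (the notes mention 176 ciphersuites), and appending the key length as a plain integer is my own assumption about how it is encoded.

```python
# Hypothetical, shortened vocabularies; the real lists are much longer.
CIPHERSUITES = [0x0004, 0x0005, 0x002F, 0x0035, 0xC02F]
EXTENSIONS = [0x0005, 0x000D, 0x3374, 0xFF01]


def indicator(observed, vocabulary):
    """Return 1 for each vocabulary entry present in `observed`, else 0."""
    observed = set(observed)
    return [1 if code in observed else 0 for code in vocabulary]


def tls_client_features(ciphersuites, extensions, key_length_bits):
    """Concatenate ciphersuite bits, extension bits, and the key length."""
    return (
        indicator(ciphersuites, CIPHERSUITES)
        + indicator(extensions, EXTENSIONS)
        + [key_length_bits]  # assumption: raw integer, not one-hot
    )


print(tls_client_features([0x0004, 0x002F], [0x000D], 2048))
```

The same `indicator` helper would serve for the server-side vector, fed with the selected ciphersuite and the extensions from the serverHello.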

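DNS data

As above, the domain-related features are: 32 possible TTL values plus an "other" option, the number of numeric characters, the number of non-alphanumeric characters, the number of IP addresses returned in the DNS response, and 6 indicators for the domain's tier in the Alexa ranking.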
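The "known values plus other" encoding for TTLs can be sketched as follows. The list `COMMON_TTLS` is a hypothetical six-value stand-in for the 32 values the paper uses, which these notes do not enumerate; `dns_response_features` is likewise an illustrative name.

```python
# Hypothetical stand-in for the 32 tracked TTL values.
COMMON_TTLS = [20, 30, 60, 100, 300, 3600]


def ttl_one_hot(ttl):
    """One-hot encode a TTL; the final slot is the 'other' bucket."""
    vec = [0] * (len(COMMON_TTLS) + 1)
    index = COMMON_TTLS.index(ttl) if ttl in COMMON_TTLS else len(COMMON_TTLS)
    vec[index] = 1
    return vec


def dns_response_features(ttl, num_ips):
    """TTL one-hot followed by the number of returned IP addresses."""
    return ttl_one_hot(ttl) + [num_ips]


print(dns_response_features(300, 4))
print(dns_response_features(17, 2))  # 17 is not tracked, so 'other' is set
```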

HTTP data

As above, we select 6 fields that commonly appear in HTTP headers, plus an "other" option.
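To make this concrete, here is a small sketch that turns raw header text into the 6-plus-other indicator vector. Which six fields the paper actually tracks is not stated in these notes; `TRACKED_FIELDS` simply reuses the response fields named earlier and should be treated as a placeholder.

```python
# Placeholder choice of six tracked fields (lower-cased header names).
TRACKED_FIELDS = [
    "server", "set-cookie", "location",
    "connection", "expires", "last-modified",
]


def http_field_features(raw_headers):
    """Return 6 presence bits plus one 'other' bit for untracked fields."""
    names = {
        line.split(":", 1)[0].strip().lower()
        for line in raw_headers.splitlines()
        if ":" in line  # skips the status line and blank lines
    }
    vec = [1 if field in names else 0 for field in TRACKED_FIELDS]
    vec.append(1 if names - set(TRACKED_FIELDS) else 0)
    return vec


raw = "HTTP/1.1 200 OK\r\nServer: nginx\r\nSet-Cookie: a=b\r\nX-Foo: 1\r\n"
print(http_field_features(raw))
```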

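0x05 Results

Classification accuracy by feature combination (SPLT: sequence of packet lengths and times; BD: byte distribution):

SPLT + BD + TLS + HTTP + DNS: 99.933%
SPLT + BD + TLS + HTTP: 99.983%
SPLT + BD + TLS + DNS: 99.968%
TLS + HTTP + DNS: 99.988%
SPLT + BD + TLS: 99.933%
HTTP + DNS: 99.985%
TLS + HTTP: 99.955%
TLS + DNS: 99.883%
HTTP: 99.945%
DNS: 99.496%
TLS: 96.335%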

Addendum:

Machine Learning for Encrypted Malware Traffic Classification: Accounting for Noisy Labels and Non-Stationarity, a KDD 2017 paper by the same authors.

It describes the TLS handshake:

Figure 1 provides a graphical representation of a simple TLS session. The client initially sends a ClientHello message that provides the server with, among other fields, a list of cipher suites and a set of TLS extensions that the client supports. The cipher suite list is ordered by preference of the client, and each cipher suite defines a set of cryptographic algorithms needed for TLS to operate. The set of extensions provides additional information to the server that facilitates extended functionality, e.g., the Server Name Indication extension indicates the hostname of the server that the client is trying to connect to, which is important for virtual hosting. As explained in Section 4, all of the TLS data features used in this paper are taken from the unencrypted ClientHello message. After the ClientHello, the server sends a ServerHello message that contains the selected cipher suite, selected from the client’s offer list, which defines the set of cryptographic algorithms that will be used to secure the exchanged application data. The ServerHello message also contains a list of extensions that the server supports, where this list is a subset of what the client supports. At this time, the server also sends a Certificate message containing the server’s certificate chain, which can be used to authenticate the server.
The client then sends a ClientKeyExchange message that establishes the premaster secret of the TLS session. Then the client and server exchange ChangeCipherSpec messages indicating that future messages will be encrypted with the negotiated cryptographic parameters. Finally, the client and server begin to exchange application data. In Figure 1, red text represents unencrypted messages, and blue text represents encrypted messages. The current TLS 1.2 handshake protocol provides a lot of interesting, unencrypted information. To enhance privacy, TLS 1.3 will be encrypting more of the handshake, e.g., the Certificate message will be encrypted, but the data features used in this paper will still be available. Many important details were omitted for the sake of brevity, but the associated RFC’s provide the full specification [18, 34]. Because TLS encrypts many of the application-specific features, therefore making traditional deep packet inspection infeasible, many researchers have utilized side-channel information to make useful inferences on the TLS traffic [38]. These data features are typically constructed from the individual packet lengths and packet inter-arrival times of the encrypted session. Commonly used features include the mean of the packet lengths, n-gram or Markov chain based features derived from the sequence of packet lengths, or similarly constructed features for the timing information.
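To get a feel for the Markov-chain features derived from packet lengths, here is a rough sketch that bins lengths and estimates a transition matrix between bins. The bin size of 150 bytes and the 10 bins are assumed parameters for illustration, and `length_transition_matrix` is a made-up name, not something from the paper's code.

```python
def length_transition_matrix(lengths, bin_size=150, num_bins=10):
    """Estimate a first-order Markov transition matrix over length bins.

    Assumption: lengths are payload sizes in bytes, in arrival order.
    """
    def to_bin(n):
        # Anything beyond the last bin boundary is clamped into the last bin.
        return min(n // bin_size, num_bins - 1)

    bins = [to_bin(n) for n in lengths]
    counts = [[0] * num_bins for _ in range(num_bins)]
    for current, nxt in zip(bins, bins[1:]):
        counts[current][nxt] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        # Rows with no observed transitions stay all zeros.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


m = length_transition_matrix([100, 200, 100, 1600])
print(m[0][1], m[0][9], m[1][0])
```

Flattening the matrix gives a fixed-size feature vector (100 values with these parameters) regardless of how many packets the flow has; the same construction applies to inter-arrival times.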

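I have always felt that packet size should not be a key feature, but the paper says it is.

Finally, a look at the algorithms' accuracy.

Sample counts: Total 4,287,892 (Enterprise) and 285,895 (Malware), a malicious-to-benign ratio of roughly 7:100.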

Algorithm     Enterprise/Standard   Enterprise/Enhanced   Malware/Standard   Malware/Enhanced
LinReg        99.92%                99.28%                0.00%              58.65%
l2-LogReg     93.35%                98.36%                16.86%             76.13%
l1-LogReg     92.75%                98.97%                19.71%             75.08%
DecTree       97.55%                97.02%                40.98%             83.33%
RandForest    99.53%                99.99%                33.54%             76.79%
SVM           11.94%                99.78%                77.98%             72.62%
MLP           95.90%                99.54%                20.61%             72.53%

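Because the samples are imbalanced, the classification actually performs poorly. Just look at the detection rate and accuracy on the malware data: the best is only about 83%.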
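A quick bit of arithmetic shows why overall accuracy is misleading at this class ratio. The sketch below uses the 7:100 ratio from these notes with made-up confusion-matrix counts; `metrics` is a hypothetical helper, not anything from the paper.

```python
def metrics(tp, fp, tn, fn):
    """Return (accuracy, recall, precision) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, recall, precision


# A classifier that labels everything benign, on 7 malicious : 100 benign.
accuracy, recall, precision = metrics(tp=0, fp=0, tn=100, fn=7)
print(f"accuracy={accuracy:.4f} recall={recall:.4f}")
```

Labeling every flow benign already scores about 93.5% accuracy while detecting no malware at all, which is the same pattern as the LinReg row (99.92% on enterprise data, 0.00% on malware). Per-class recall and precision are the numbers to watch here.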

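Some key points from Identifying Encrypted Malware Traffic with Contextual Flow Data. The paper has many figures on feature extraction that are worth studying closely. Cisco started out with logistic regression, and that is what this paper uses.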

We can see that malware usually offers a set of three obsolete ciphersuites in the clientHello message including 0x0004 (TLS_RSA_WITH_RC4_128_MD5). In the benign traffic we collected, the 0x002f (TLS_RSA_WITH_AES_128_CBC_SHA) ciphersuite was the most offered. Malware also seems to have comparatively little diversity in the client-supported TLS extensions. 0x000d (signature_algorithms) was the only TLS extension supported in the majority of TLS flows. ∼50% of the DMZ traffic also advertised the following extensions, which were rarely seen in the malware dataset:
• 0x0005 (status request)
• 0x3374 (next protocol negotiation)
• 0xff01 (renegotiation info)
Although not shown, the client’s public key length was another client-based data feature that had significant differences. Most of the DMZ traffic used 256-bit elliptic curve cryptography for the public keys, but most of the malicious traffic used 2048-bit RSA public keys. The serverHello and certificate messages can be used to gain information about the server. The serverHello message contains the selected ciphersuite and supported extensions. As one would expect given the type and diversity of the offered ciphersuites and the advertised extensions, the malicious traffic most often selected obsolete ciphersuites. The DMZ traffic contained a wider variety of supported TLS extensions by the servers.

The certificate message passes the server’s certificate chain to the client. We observed that the number of certificates in the chain for the malware and DMZ data were roughly the same. But, if we restrict our focus on the length-1 chains, ∼70% were self-signed for malware and ∼0.1% were self-signed for the DMZ traffic. The number of names in the SubjectAltName (SAN) X.509 extension also differed in the two datasets. For the DMZ traffic, the length of the list was 1 ∼45% of the time. This is in part because a number of Content Distribution Network (CDN) providers, e.g., Akamai, only have one entry. Length-10/12 lists were also common in the DMZ traffic due to some ad services.
Figure 2 also shows the distribution of the validity of the certificates rounded to the nearest day. Similar to the other data features, the period of validity for a server certificate has notable differences in the malicious and DMZ traffic.

Features and their weights:

Weight   Feature
 3.38    DNS Suffix org
 2.99    DNS TTL 3600
 2.62    TLS Ciphersuite TLS_RSA_WITH_RC4_128_SHA
 2.28    HTTP Field accept-encoding
 1.95    TLS Ciphersuite SSL_RSA_FIPS_WITH_3DES_EDE_CBC_SHA
 1.78    HTTP Field location
 1.38    DNS Alexa: None
 1.21    TLS Ciphersuite TLS_RSA_WITH_RC4_128_MD5
 1.12    HTTP Server nginx
 1.11    HTTP Code 404
-2.16    TLS Extension extended_master_secret
-1.65    HTTP Content Type application/octet-stream
-1.61    HTTP Accept Language en-US,en;q=0.5
-1.35    TLS Ciphersuite TLS_DHE_RSA_WITH_DES_CBC_SHA
-1.10    HTTP Content Type text/plain;charset=UTF-8
-0.97    HTTP Server Microsoft-IIS/8.5
-0.95    DNS Alexa: top-1,000,000
-0.91    HTTP User-Agent Microsoft-CryptoAPI/6.1
-0.88    TLS Ciphersuite TLS_ECDHE_ECDSA_WITH_RC4_128_SHA
-0.85    HTTP Content Type application/x-gzip

Table 2: The data features most relevant to the TLS/DNS/HTTP classifier.
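To see how such weights would act in a logistic regression, here is a toy scoring sketch using a handful of the weights from the table. Several things are my assumptions: that positive weights push toward the malware class, that each feature is a 0/1 indicator (the real features are normalized first), and that the bias is zero. The feature keys and `malware_probability` are illustrative names only.

```python
import math

# A few weights copied from the table; keys are made-up identifiers.
WEIGHTS = {
    "dns_suffix_org": 3.38,
    "dns_ttl_3600": 2.99,
    "tls_cs_TLS_RSA_WITH_RC4_128_MD5": 1.21,
    "tls_ext_extended_master_secret": -2.16,
    "http_server_Microsoft-IIS/8.5": -0.97,
}
BIAS = 0.0  # hypothetical; the notes do not give the intercept


def malware_probability(active_features):
    """Logistic score for a flow, given the names of its active features."""
    z = BIAS + sum(WEIGHTS.get(name, 0.0) for name in active_features)
    return 1.0 / (1.0 + math.exp(-z))


print(malware_probability(["dns_suffix_org"]))
print(malware_probability(["tls_ext_extended_master_secret"]))
```

Under these assumptions a single strongly weighted feature already moves the score far from 0.5, which gives a sense of why a few contextual features dominate the classifier.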

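7.2 DNS
The Domain Name System (DNS) [28] is a hierarchical, decentralized means of providing additional information about domain names, in particular the mapping from domain names to IP addresses. Recently, malware has made use of DNS and domain generation algorithms (DGAs) [8] as a robust way to run its command and control channel. There are many earlier results on classifying DNS data as malicious or benign [7,9,24]. None of that work uses DNS data to make inferences about encrypted traffic. Our work also differs in that we illustrate the distributions of different DNS data features, such as TTL values.
7.3 HTTP
The Hypertext Transfer Protocol (HTTP) [17] is an application-level protocol for transferring data on the World Wide Web. As with DNS, threat actors also use HTTP as a command and control channel [29,33]. Some work has focused specifically on features present in HTTP data. In [33], the authors use statistics (such as the average URL length) and string-matching methods on URLs to cluster malware. Again, the specific differences between malware and benign HTTP sessions are not highlighted. [22] analyzes the User-Agent field values specifically. We provide a detailed account of many more HTTP fields and use this information to build a machine learning classifier for encrypted traffic.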

An important assumption behind Cisco's use of DNS and HTTP is that malware relies on DGAs and HTTP C&C.