Meeting Deeper Requirements When Applying SARIF

Date: 2021-04-06

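Abstract: To reduce the cost and complexity of aggregating the results of various analysis tools into a common workflow, the industry has begun adopting the Static Analysis Results Interchange Format (SARIF).

This article is shared from the Huawei Cloud community post "The Bridge Between DevSecOps Tools and Platforms: Advanced SARIF"; original author: Uncle_Tom.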

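1. Introduction

DevSecOps has become an important model for building enterprise-grade secure development. Static scanning tools integrated into the DevSecOps development process play an important role in raising the overall security level of products. To maximize the coverage of their security checks, development teams usually bring in several security scanning tools. This, however, creates more problems for developers and platforms. To reduce the cost and complexity of aggregating the results of the various analysis tools into a common workflow, the industry has begun adopting the Static Analysis Results Interchange Format (SARIF). This is the second, advanced article of a two-part series on applying SARIF, and it describes how SARIF meets deeper requirements in practice. For an introduction to the basics of SARIF, see "The Bridge Between DevSecOps Tools and Platforms: Getting Started with SARIF".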

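2. Advanced SARIF

The previous article covered some basic uses of SARIF. Here we look at how SARIF is applied in more complex scenarios, so that it can serve as a complete reporting solution for static scanning tools.

The latest release (2021.03) of Coverity, a well-known static analysis tool, adds support for displaying Coverity scan results in SARIF format in GitHub repositories. Coverity has thus also adopted the SARIF format.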

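2.1. Using metadata

To keep scan reports from growing too large, information that is used repeatedly should be extracted and stored as metadata, for example rules, rule messages, and the scanned content.

In the example below, the rules and their messages are defined in tool.driver.rules. Each entry in results refers to its rule by ruleId and obtains its message text through message.id. This avoids repeating the same text for every alert produced by the same rule and effectively reduces the size of the report.

The example, which can be viewed in VS Code, is as follows: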

 
  {
  "version": "2.1.0",
  "runs": [
    {
      "tool": {
        "driver": {
          "name": "CodeScanner",
          "rules": [
            {
              "id": "CS0001",
              "messageStrings": {
                "default": {
                  "text": "This is the message text. It might be very long."
                }
              }
            }
          ]
        }
      },
      "results": [
        {
          "ruleId": "CS0001",
          "ruleIndex": 0,
          "message": {
            "id": "default"
          }
        }
      ]
    }
  ]
} 
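
The lookup described above can be sketched in Python. This is only an illustration of how a consumer might resolve message.id against tool.driver.rules; the helper names find_rule and resolve_message are made up for this sketch and are not part of SARIF or of any library.

```python
import json

# The log from the example above, reduced to one line per object.
SARIF_LOG = json.loads("""
{
  "version": "2.1.0",
  "runs": [{
    "tool": {"driver": {"name": "CodeScanner", "rules": [
      {"id": "CS0001",
       "messageStrings": {"default": {"text": "This is the message text. It might be very long."}}}
    ]}},
    "results": [{"ruleId": "CS0001", "ruleIndex": 0, "message": {"id": "default"}}]
  }]
}
""")


def find_rule(run, result):
    """Find the rule for a result: by ruleIndex when usable, otherwise by ruleId."""
    rules = run["tool"]["driver"].get("rules", [])
    index = result.get("ruleIndex", -1)
    if 0 <= index < len(rules):
        return rules[index]
    return next(rule for rule in rules if rule["id"] == result["ruleId"])


def resolve_message(run, result):
    """Return the message text, taking it from the rule metadata when only an id is given."""
    message = result["message"]
    if "text" in message:  # the text is stored inline, nothing to look up
        return message["text"]
    rule = find_rule(run, result)
    return rule["messageStrings"][message["id"]]["text"]


run = SARIF_LOG["runs"][0]
texts = [resolve_message(run, result) for result in run["results"]]
print(texts[0])
```

However many results refer to CS0001, the text itself is stored once in the rule.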
 

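2.2. Using message arguments

Alerts often need to mention the specific variable or function involved in the problem so that users can understand it. Message arguments make this possible by letting the defect message vary per result.

In the example below, the rule's message uses a placeholder ("{0}") to form a template, and each entry in results supplies the corresponding values through the arguments array. The example, which can be viewed in VS Code, is as follows: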

 
  {
  "version": "2.1.0",
  "runs": [
    {
      "tool": {
        "driver": {
          "name": "CodeScanner",
          "rules": [
            {
              "id": "CS0001",
              "messageStrings": {
                "default": {
                  "text": "Variable '{0}' was used without being initialized."
                }
              }
            }
          ]
        }
      },
      "results": [
        {
          "ruleId": "CS0001",
          "ruleIndex": 0,
          "message": {
            "id": "default",
            "arguments": [
              "x"
            ]
          }
        }
      ]
    }
  ]
} 
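
A minimal sketch of the placeholder substitution, assuming only the simple "{0}", "{1}" form shown above. The function name format_message is hypothetical, and literal braces in a template are not handled here.

```python
import re


def format_message(template, arguments):
    """Replace placeholders such as {0} and {1} with the argument at that position."""
    def substitute(match):
        return arguments[int(match.group(1))]
    return re.sub(r"\{(\d+)\}", substitute, template)


# The rule and the result from the example above.
rule = {
    "id": "CS0001",
    "messageStrings": {
        "default": {"text": "Variable '{0}' was used without being initialized."}
    },
}
result = {
    "ruleId": "CS0001",
    "ruleIndex": 0,
    "message": {"id": "default", "arguments": ["x"]},
}

template = rule["messageStrings"][result["message"]["id"]]["text"]
rendered = format_message(template, result["message"]["arguments"])
print(rendered)  # Variable 'x' was used without being initialized.
```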
 

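2.3. Using related locations in messages

Sometimes, to explain why an alert was raised, users need additional reference information that helps them understand the problem, such as where a variable is defined, where tainted data was introduced, or other supporting details.

In the example below, relatedLocations, which accompanies the location of the problem (locations), gives the place where the tainted data was introduced. In a viewer such as VS Code, when the user clicks "here", the tool can jump to the location where the variable expr was introduced.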

 
  {
  "ruleId": "PY2335",
  "message": {
    "text": "Use of tainted variable 'expr' (which entered the system [here](1)) in the insecure function 'eval'."
  },
  "locations": [
    {
      "physicalLocation": {
        "artifactLocation": {
          "uri": "3-Beyond-basics/bad-eval.py"
        },
        "region": {
          "startLine": 4
        }
      }
    }
  ],
  "relatedLocations": [
    {
      "id": 1,
      "message": {
        "text": "The tainted data entered the system here."
      },
      "physicalLocation": {
        "artifactLocation": {
          "uri": "3-Beyond-basics/bad-eval.py"
        },
        "region": {
          "startLine": 3
        }
      }
    }
  ]
} 
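
How a viewer might connect the "[here](1)" link to relatedLocations can be sketched as follows. This is a simplified illustration: the function name resolve_links is hypothetical, and only links whose target is an integer id are handled.

```python
import re

# The result from the example above.
result = {
    "ruleId": "PY2335",
    "message": {
        "text": "Use of tainted variable 'expr' (which entered the system "
                "[here](1)) in the insecure function 'eval'."
    },
    "relatedLocations": [
        {
            "id": 1,
            "message": {"text": "The tainted data entered the system here."},
            "physicalLocation": {
                "artifactLocation": {"uri": "3-Beyond-basics/bad-eval.py"},
                "region": {"startLine": 3},
            },
        }
    ],
}


def resolve_links(result):
    """Map each [text](id) link in the message to the related location with that id."""
    by_id = {loc["id"]: loc for loc in result.get("relatedLocations", [])}
    links = []
    for text, target in re.findall(r"\[([^\]]+)\]\((\d+)\)", result["message"]["text"]):
        physical = by_id[int(target)]["physicalLocation"]
        links.append((text,
                      physical["artifactLocation"]["uri"],
                      physical["region"]["startLine"]))
    return links


links = resolve_links(result)
print(links)  # [('here', '3-Beyond-basics/bad-eval.py', 3)]
```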
 

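2.4. Using defect classification information

Classifying defects is very important both for tools and for analyzing scan results. A tool can manage its rules by defect category, making it easier for users to select the rules they need. Users reading an analysis report can likewise filter results quickly by defect category. A tool can follow an industry standard, such as the widely used Common Weakness Enumeration (CWE), or define its own classification; SARIF supports both.

An example of defect classification: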

 
  {
  "version": "2.1.0",
  "runs": [
    {
      "taxonomies": [
        {
          "name": "CWE",
          "version": "3.2",
          "releaseDateUtc": "2019-01-03",
          "guid": "A9282C88-F1FE-4A01-8137-E8D2A037AB82",
          "informationUri": "https://cwe.mitre.org/data/published/cwe_v3.2.pdf/",
          "downloadUri": "https://cwe.mitre.org/data/xml/cwec_v3.2.xml.zip",
          "organization": "MITRE",
          "shortDescription": {
            "text": "The MITRE Common Weakness Enumeration"
          },
          "taxa": [
            {
              "id": "401",
              "guid": "10F28368-3A92-4396-A318-75B9743282F6",
              "name": "Memory Leak",
              "shortDescription": {
                "text": "Missing Release of Memory After Effective Lifetime"
              },
              "defaultConfiguration": {
                "level": "warning"
              }
            }
          ],
          "isComprehensive": false
        }
      ],
      "tool": {
        "driver": {
          "name": "CodeScanner",
          "supportedTaxonomies": [
            {
              "name": "CWE",
              "guid": "A9282C88-F1FE-4A01-8137-E8D2A037AB82"
            }
          ],
          "rules": [
            {
              "id": "CA2101",
              "shortDescription": {
                "text": "Failed to release dynamic memory."
              },
              "relationships": [
                {
                  "target": {
                    "id": "401",
                    "guid": "10F28368-3A92-4396-A318-75B9743282F6",
                    "toolComponent": {
                      "name": "CWE",
                      "guid": "A9282C88-F1FE-4A01-8137-E8D2A037AB82"
                    }
                  },
                  "kinds": [
                    "superset"
                  ]
                }
              ]
            }
          ]
        }
      },
      "results": [
        {
          "ruleId": "CA2101",
          "message": {
            "text": "Memory allocated in variable 'p' was not released."
          },
          "taxa": [
            {
              "id": "401",
              "guid": "10F28368-3A92-4396-A318-75B9743282F6",
              "toolComponent": {
                "name": "CWE",
                "guid": "A9282C88-F1FE-4A01-8137-E8D2A037AB82"
              }
            }
          ]
        }
      ]
    }
  ]
} 
 

2.4.1. Introducing an industry classification standard (runs.taxonomies)

Definition of taxonomies:

 
   "taxonomies": {
    "description": "An array of toolComponent objects relevant to a taxonomy in which results are categorized.",
    "type": "array",
    "minItems": 0,
    "uniqueItems": true,
    "default": [],
    "items": {
      "$ref": "#/definitions/toolComponent"
    }
  }, 
 

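The taxonomies node is an array, so several classification standards can be defined. Its items refer to the toolComponent definition under definitions, which is the same definition used by the tool's scan engine (tool.driver) and tool extensions (tool.extensions) described earlier. The reason for this design is that the engine and the results are closely related, and sharing one definition keeps their properties consistent.

Definition of an industry-standard classification (standard taxonomy)
In the example, the runs.taxonomies node declares the industry classification standard CWE. The properties of the taxonomies entry describe the standard. The following are only samples; see the SARIF specification for the details:

name: name of the standard;
version: version;
releaseDateUtc: release date;
guid: unique identifier, so that other parts of the log can refer to this standard;
informationUri: location of the documentation for the standard;
downloadUri: download address;
organization: publishing organization;
shortDescription: short description of the standard.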

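2.4.2. Introducing custom classifications (runs.taxonomies.taxa)

taxa is an array node. To keep the report small, there is no need to place every classification entry under taxa; listing only the entries relevant to this scan is enough. This is also why the isComprehensive property, which states whether the list is complete, defaults to false.

In the example, the taxa node introduces one category that the tool needs, CWE-401 (memory leak), and identifies it uniquely with guid and id so that the tool can refer to it later from rules or defects.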
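
A minimal sketch of trimming a taxonomy to the entries a scan actually refers to. The function name prune_taxonomy and the two-entry CWE table are assumptions made for this illustration; a real tool would start from its own complete category list.

```python
def prune_taxonomy(full_taxonomy, referenced_ids):
    """Return a copy of the taxonomy that keeps only the referenced taxa."""
    pruned = dict(full_taxonomy)  # shallow copy; the input is left untouched
    pruned["taxa"] = [taxon for taxon in full_taxonomy["taxa"]
                      if taxon["id"] in referenced_ids]
    pruned["isComprehensive"] = False  # the list is deliberately incomplete
    return pruned


# Illustrative category table, far smaller than the real CWE list.
full_cwe = {
    "name": "CWE",
    "version": "3.2",
    "taxa": [
        {"id": "401", "name": "Memory Leak"},
        {"id": "476", "name": "NULL Pointer Dereference"},
    ],
}

# Only CWE-401 is referenced by the rules and results of this scan.
report_taxonomy = prune_taxonomy(full_cwe, {"401"})
print([taxon["id"] for taxon in report_taxonomy["taxa"]])  # ['401']
```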

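2.4.3. Associating the tool with industry classification standards (tool.driver.supportedTaxonomies)

The tool object is associated with the declared industry classification standards through the tool.driver.supportedTaxonomies node. The elements of the supportedTaxonomies array are toolComponentReference objects, because a taxonomy is itself a toolComponent object. The toolComponentReference.guid property matches the guid property of the taxonomy object defined in run.taxonomies[].

In the example, the supportedTaxonomies entry has the name CWE, which indicates that this tool supports the CWE taxonomy. It references the guid of taxonomies[0], A9282C88-F1FE-4A01-8137-E8D2A037AB82, which associates the tool with the industry classification standard CWE.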
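
The guid match between supportedTaxonomies and run.taxonomies can be sketched as below. The function name supported_taxonomies is hypothetical, and comparing GUIDs case-insensitively is an assumption of this sketch.

```python
# The parts of the example run that matter for the association.
run = {
    "taxonomies": [
        {
            "name": "CWE",
            "guid": "A9282C88-F1FE-4A01-8137-E8D2A037AB82",
            "taxa": [{"id": "401", "name": "Memory Leak"}],
        }
    ],
    "tool": {
        "driver": {
            "name": "CodeScanner",
            "supportedTaxonomies": [
                {"name": "CWE", "guid": "A9282C88-F1FE-4A01-8137-E8D2A037AB82"}
            ],
        }
    },
}


def supported_taxonomies(run):
    """Return the taxonomy objects that the driver's supportedTaxonomies refer to."""
    by_guid = {taxonomy["guid"].lower(): taxonomy
               for taxonomy in run.get("taxonomies", [])}
    references = run["tool"]["driver"].get("supportedTaxonomies", [])
    return [by_guid[ref["guid"].lower()]
            for ref in references
            if ref.get("guid", "").lower() in by_guid]


names = [taxonomy["name"] for taxonomy in supported_taxonomies(run)]
print(names)  # ['CWE']
```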

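2.5. Associating rules with defect classifications (rule.relationships)

Rules are defined under the tool.driver.rules node. rules is an array, and each element is a reportingDescriptor object that defines one rule;
The relationships property of each rule (reportingDescriptor) is an array, each element of which is a reportingDescriptorRelationship object. This object establishes a relationship from the rule to another reportingDescriptor object. The target of the relationship can be a taxon in a taxonomy (as in this example) or another rule in another tool component;
The target property of a relationship (reportingDescriptorRelationship) identifies the target of the relationship. Its value is a reportingDescriptorReference object, which refers to a reportingDescriptor within a toolComponent;
The toolComponent property of the reportingDescriptorReference object is a toolComponentReference object that points to a taxonomy listed in the tool's supportedTaxonomies.

[Figure: association between the rule and the defect classification in the example]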

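2.5.1. Classification in scan results (result.taxa)

In the scan results (run.results), each result has a taxa property. taxa is an array, each element of which is a reportingDescriptorReference object that specifies a classification of the defect. This works the same way as the association between a rule and its classification. It also shows that result.taxa can be omitted, with the defect's classification obtained through its rule instead.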
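
A sketch of that fallback: use result.taxa when it is present, otherwise follow the rule's relationships. To stay short it matches the taxonomy by toolComponent.name and the taxon by id rather than by guid, which is a simplification chosen for this example; the function name taxa_for_result is hypothetical.

```python
# A reduced form of the example run; the result carries no taxa of its own.
run = {
    "taxonomies": [
        {"name": "CWE", "taxa": [{"id": "401", "name": "Memory Leak"}]}
    ],
    "tool": {"driver": {"name": "CodeScanner", "rules": [
        {
            "id": "CA2101",
            "relationships": [
                {"target": {"id": "401", "toolComponent": {"name": "CWE"}},
                 "kinds": ["superset"]}
            ],
        }
    ]}},
    "results": [
        {"ruleId": "CA2101",
         "message": {"text": "Memory allocated in variable 'p' was not released."}}
    ],
}


def taxa_for_result(run, result):
    """Return labels such as 'CWE-401: Memory Leak' for one result."""
    references = result.get("taxa")
    if references is None:
        # No taxa on the result: derive them from the rule's relationships.
        rules = {rule["id"]: rule
                 for rule in run["tool"]["driver"].get("rules", [])}
        rule = rules.get(result.get("ruleId"), {})
        references = [rel["target"] for rel in rule.get("relationships", [])]
    taxonomies = {taxonomy["name"]: taxonomy
                  for taxonomy in run.get("taxonomies", [])}
    labels = []
    for ref in references:
        taxonomy = taxonomies[ref["toolComponent"]["name"]]
        taxon = next(t for t in taxonomy["taxa"] if t["id"] == ref["id"])
        labels.append(f'{taxonomy["name"]}-{taxon["id"]}: {taxon["name"]}')
    return labels


labels = taxa_for_result(run, run["results"][0])
print(labels)  # ['CWE-401: Memory Leak']
```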

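2.6. Code flow

Some tools detect problems by simulating the execution of a program, sometimes across multiple threads of execution. SARIF represents the simulated execution as a sequence of locations, called a code flow. A SARIF code flow contains one or more thread flows, each of which describes the code locations visited on a single thread of execution in chronological order.

2.6.1. Code flows of a defect (result.codeFlows)

Because a defect may have more than one code flow, the optional result.codeFlows property is an array of codeFlow objects.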

 
   "result": {
      "description": "A result produced by an analysis tool.",
      "additionalProperties": false,
      "type": "object",
      "properties": {

        ... ...
        "codeFlows": {
          "description": "An array of 'codeFlow' objects relevant to the result.",
          "type": "array",
          "minItems": 0,
          "uniqueItems": false,
          "default": [],
          "items": {
            "$ref": "#/definitions/codeFlow"
          }
        },
      }
   } 
 

2.6.2. Thread flows of a code flow (codeFlow.threadFlows)

As the definition of codeFlow shows, each code flow consists of an array of thread flows (threadFlows), and threadFlows is required.

 
   "codeFlow": {
      "description": "A set of threadFlows which together describe a pattern of code execution relevant to detecting a result.",
      "additionalProperties": false,
      "type": "object",
      "properties": {

        "message": {
          "description": "A message relevant to the code flow.",
          "$ref": "#/definitions/message"
        },

        "threadFlows": {
          "description": "An array of one or more unique threadFlow objects, each of which describes the progress of a program through a thread of execution.",
          "type": "array",
          "minItems": 1,
          "uniqueItems": false,
          "items": {
            "$ref": "#/definitions/threadFlow"
          }
        },
      },

      "required": [ "threadFlows" ]
    }, 
 

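2.6.3. Thread flow (threadFlow) and thread flow location (threadFlowLocation)

In each thread flow (threadFlow), an array of locations (locations) describes how the tool walked through the code during analysis.

Definition of a thread flow (threadFlow):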

 
   "threadFlow": {
      "description": "Describes a sequence of code locations that specify a path through a single thread of execution such as an operating system or fiber.",
      "type": "object",
      "additionalProperties": false,
      "properties": {

        "id": {
        ...

        "message": {
        ...  

        "initialState": {
        ...

        "immutableState": {
        ...

        "locations": {
          "description": "A temporally ordered array of 'threadFlowLocation' objects, each of which describes a location visited by the tool while producing the result.",
          "type": "array",
          "minItems": 1,
          "uniqueItems": false,
          "items": {
            "$ref": "#/definitions/threadFlowLocation"
          }
        },

        "properties": {
        ...
      },

      "required": [ "locations" ]
    }, 
 

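Definition of a thread flow location (threadFlowLocation):
Each element of locations is a threadFlowLocation, which represents one code location visited by the tool. Its location property, of type location, gives the position being analyzed. A location can carry both physical and logical location information, so codeFlow can also represent analysis flows over binaries.

threadFlowLocation also has a state property, which can store the values of variables and expressions or symbol table information, or be used to describe a state machine.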

 
   "threadFlowLocation": {
      "description": "A location visited by an analysis tool while simulating or monitoring the execution of a program.",
      "additionalProperties": false,
      "type": "object",
      "properties": {

        "index": {
          "description": "The index within the run threadFlowLocations array.",
        ...
 
        "location": {
          "description": "The code location.",
          "$ref": "#/definitions/location"
        },

        "state": {
          "description": "A dictionary, each of whose keys specifies a variable or expression, the associated value of which represents the variable or expression value. For an annotation of kind 'continuation', for example, this dictionary might hold the current assumed values of a set of global variables.",
          "type": "object",
          "additionalProperties": {
            "$ref": "#/definitions/multiformatMessageString"
          }
        },
        ...
      }
    }, 
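
The three definitions above fit together as codeFlow, threadFlow, threadFlowLocation. The following sketch builds that nesting from a list of steps. The helper names make_thread_flow and make_code_flow, the file src/app.py and the variable names are all made up for the illustration; only the property names come from the schema excerpts.

```python
import json


def make_thread_flow(steps):
    """Build a threadFlow from (uri, line, state, nesting_level) tuples."""
    if not steps:
        raise ValueError("a threadFlow needs at least one location")
    locations = []
    for uri, line, state, nesting_level in steps:
        locations.append({
            "location": {
                "physicalLocation": {
                    "artifactLocation": {"uri": uri},
                    "region": {"startLine": line},
                }
            },
            # Each state value is a multiformatMessageString, hence the "text" key.
            "state": {name: {"text": value} for name, value in state.items()},
            "nestingLevel": nesting_level,
        })
    return {"locations": locations}


def make_code_flow(thread_flows, text=None):
    """Build a codeFlow; threadFlows is required and must not be empty."""
    if not thread_flows:
        raise ValueError("a codeFlow needs at least one threadFlow")
    code_flow = {"threadFlows": list(thread_flows)}
    if text is not None:
        code_flow["message"] = {"text": text}
    return code_flow


thread_flow = make_thread_flow([
    ("src/app.py", 10, {"user_input": "42"}, 0),  # hypothetical source of the data
    ("src/app.py", 25, {"value": "42"}, 1),       # hypothetical use inside a callee
])
code_flow = make_code_flow([thread_flow], "Path from input to use.")
print(json.dumps(code_flow, indent=2))
```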
 

2.6.4. Code flow example

Reference code:

 
# 3-Beyond-basics/bad-eval-with-code-flow.py

print("Hello, world!")
expr = input("Expression> ")
use_input(expr)

def use_input(raw_input):
   print(eval(raw_input)) 
 

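The code above is an example of code injection in Python.

On line 4, the user input is assigned to the variable expr;
On line 5, expr is passed as the first argument to the function use_input;
On line 8, print outputs the result, but the argument is first processed by eval(). Because the input is passed to eval without any validation, this can introduce a code injection vulnerability.

The scan result below captures this analysis so that users can follow how the problem arises.

Scan result: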

 
  {
  "version": "2.1.0",
  "runs": [
    {
      "tool": {
        "driver": {
          "name": "PythonScanner"
        }
      },
      "results": [
        {
          "ruleId": "PY2335",
          "message": {
            "text": "Use of tainted variable 'raw_input' in the insecure function 'eval'."
          },
          "locations": [
            {
              "physicalLocation": {
                "artifactLocation": {
                  "uri": "3-Beyond-basics/bad-eval-with-code-flow.py"
                },
                "region": {
                  "startLine": 8
                }
              }
            }
          ],
          "codeFlows": [
            {
              "message": {
                "text": "Tracing the path from user input to insecure usage."
              },
              "threadFlows": [
                {
                  "locations": [
                    {
                      "message": {
                        "text": "The tainted data enters the system here."
                      },
                      "location": {
                        "physicalLocation": {
                          "artifactLocation": {
                            "uri": "3-Beyond-basics/bad-eval-with-code-flow.py"
                          },
                          "region": {
                            "startLine": 4
                          }
                        }
                      },
                      "state": {
                        "expr": {
                          "text": "42"
                        }
                      },
                      "nestingLevel": 0
                    },
                    {
                      "message": {
                        "text": "The tainted data is used insecurely here."
                      },
                      "location": {
                        "physicalLocation": {
                          "artifactLocation": {
                            "uri": "3-Beyond-basics/bad-eval-with-code-flow.py"
                          },
                          "region": {
                            "startLine": 8
                          }
                        }
                      },
                      "state": {
                        "raw_input": {
                          "text": "42"
                        }
                      },
                      "nestingLevel": 1
                    }
                  ]
                }
              ]
            }
          ]
        }
      ]
    }
  ]
} 
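
A viewer could turn the thread flow above into an indented trace along these lines. This is a sketch only; the function name render_thread_flow is hypothetical. The example places message next to location, so the sketch looks there first and then falls back to location.message.

```python
URI = "3-Beyond-basics/bad-eval-with-code-flow.py"

# The thread flow from the scan result above.
thread_flow = {
    "locations": [
        {
            "message": {"text": "The tainted data enters the system here."},
            "location": {"physicalLocation": {
                "artifactLocation": {"uri": URI}, "region": {"startLine": 4}}},
            "state": {"expr": {"text": "42"}},
            "nestingLevel": 0,
        },
        {
            "message": {"text": "The tainted data is used insecurely here."},
            "location": {"physicalLocation": {
                "artifactLocation": {"uri": URI}, "region": {"startLine": 8}}},
            "state": {"raw_input": {"text": "42"}},
            "nestingLevel": 1,
        },
    ]
}


def render_thread_flow(thread_flow):
    """Return one text line per step, indented by the step's nestingLevel."""
    lines = []
    for step in thread_flow["locations"]:
        location = step["location"]
        physical = location["physicalLocation"]
        where = f'{physical["artifactLocation"]["uri"]}:{physical["region"]["startLine"]}'
        message = step.get("message") or location.get("message") or {}
        state = ", ".join(f'{name}={value["text"]}'
                          for name, value in step.get("state", {}).items())
        indent = "  " * step.get("nestingLevel", 0)
        lines.append(f'{indent}{where} {message.get("text", "")} [{state}]')
    return lines


trace = render_thread_flow(thread_flow)
print("\n".join(trace))
```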
 

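This is only a simple example. With SARIF's codeFlow, far more complex analyses can be represented, helping users understand a problem better and then judge and fix it quickly.

2.7. Defect fingerprints (fingerprint)

In a large software project, an analysis tool can produce thousands of results in a single run. To handle that many results, defect management needs to record the existing defects, establish a scan baseline, and then work through the existing problems. Later scans must compare their results against the baseline to tell whether new problems have been introduced. To decide whether a result from a later run is logically the same as a result in the baseline, an algorithm must use information specific to the result to build a stable identifier, which we call a fingerprint. Because the fingerprint characterizes one defect and distinguishes it from others, we also call it the defect fingerprint.

A defect fingerprint should be built from relatively stable information about the defect:

the name of the tool that produced the result;
the rule ID;
the file system path of the analysis target. This should be the path relative to the project itself, without the leading part that says where the project is stored, because that location can differ from machine to machine;
defect feature values (partialFingerprints).

Each SARIF result provides a set of properties for storing defect fingerprints, so that a defect management system can use them to identify a defect uniquely.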

 
   "result": {
      "description": "A result produced by an analysis tool.",
      "additionalProperties": false,
      "type": "object",
      "properties": {
        ... ...
        "guid": {
          "description": "A stable, unique identifier for the result in the form of a GUID.",
          "type": "string",
          "pattern": "^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}$"
        },

        "correlationGuid": {
          "description": "A stable, unique identifier for the equivalence class of logically identical results to which this result belongs, in the form of a GUID.",
          "type": "string",
          "pattern": "^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}$"
        },

        "occurrenceCount": {
          "description": "A positive integer specifying the number of times this logically unique result was observed in this run.",
          "type": "integer",
          "minimum": 1
        },

        "partialFingerprints": {
          "description": "A set of strings that contribute to the stable, unique identity of the result.",
          "type": "object",
          "additionalProperties": {
            "type": "string"
          }
        },

        "fingerprints": {
          "description": "A set of strings each of which individually defines a stable, unique identity for the result.",
          "type": "object",
          "additionalProperties": {
            "type": "string"
          }
        },
        ... ...
      }
    } 
 

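In some cases the intrinsic information of a defect is not enough to identify the result uniquely. We then need to add attribute values that are strongly tied to the defect as extra input to the fingerprint computation, so that the resulting fingerprint is unique. This is somewhat like a salt in cryptographic algorithms, except that the value must be reproducible: the next scan must derive the same input for the same defect and therefore the same fingerprint. For example, if a tool checks documents for sensitive words and reports "xxx should not be used in documents.", the word itself can serve as a feature value of the defect.

SARIF provides the partialFingerprints property to hold such feature values, so that analysis tools and other components in the SARIF ecosystem can use this information. A defect management system can add them to the fingerprint it builds for each result. In the example above, the tool could set the value of a property in the partialFingerprints object to the prohibited word. The defect management system should include the information in partialFingerprints in its fingerprint computation.

Only attributes strongly tied to the nature of the defect should be added to partialFingerprints, and their values should be relatively stable. The line number where the defect occurs, for instance, is a poor input to the fingerprint: it changes often, because developers may add or delete lines above the problem line before the next scan. The same problem would then be reported on a different line, change the computed fingerprint, and produce a mismatch during comparison.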
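
One possible way to combine these inputs is sketched below. SARIF does not prescribe a fingerprint algorithm, so the use of SHA-256, the separator, and the names compute_fingerprint, DocScanner, DOC0001 and prohibitedWord/v1 are all assumptions made for this illustration.

```python
import hashlib


def compute_fingerprint(tool_name, rule_id, relative_path, partial_fingerprints):
    """Hash the stable inputs of a result; the line number is left out on purpose."""
    parts = [tool_name, rule_id, relative_path]
    # Sort the partial fingerprints so that dictionary order cannot change the hash.
    parts += [f"{name}={value}"
              for name, value in sorted(partial_fingerprints.items())]
    payload = "\x1f".join(parts).encode("utf-8")  # unit separator between fields
    return hashlib.sha256(payload).hexdigest()


# The sensitive-word example: the word itself is the feature value.
first_scan = compute_fingerprint(
    "DocScanner", "DOC0001", "docs/guide.md", {"prohibitedWord/v1": "xxx"})

# The next scan finds the same problem a few lines further down; since the
# line number is not an input, the fingerprint is unchanged.
second_scan = compute_fingerprint(
    "DocScanner", "DOC0001", "docs/guide.md", {"prohibitedWord/v1": "xxx"})

# A different word in the same file gives a different fingerprint.
other_word = compute_fingerprint(
    "DocScanner", "DOC0001", "docs/guide.md", {"prohibitedWord/v1": "yyy"})

print(first_scan == second_scan, first_scan == other_word)  # True False
```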

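Although we try to find uniquely identifying features for every defect, and add some variable feature attributes as well, it remains hard to design an algorithm that yields a truly stable fingerprint. In the example above, if the same sensitive word occurs several times in one file, we still cannot give each alert a unique identifier. The function name can be added as another input, since function names are relatively stable within a program and help narrow down where in the file the problem occurs, but the same problem can still occur several times within one function. So however hard we try to tell alerts apart, identical fingerprints will still occur in real scans.

Fortunately, for practical purposes a fingerprint does not have to be absolutely stable. It only needs to be stable enough to keep the number of results wrongly reported as "new" low enough that the development team can manage them without excessive effort.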
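
A baseline comparison that tolerates duplicate fingerprints can be sketched with multisets. The function name compare_to_baseline and the short fingerprint strings are made up for the illustration; the three labels are borrowed from the values of SARIF's result.baselineState property.

```python
from collections import Counter


def compare_to_baseline(baseline_fingerprints, current_fingerprints):
    """Classify fingerprints as new, unchanged or absent, counting duplicates."""
    baseline = Counter(baseline_fingerprints)
    current = Counter(current_fingerprints)
    return {
        "new": sorted((current - baseline).elements()),
        "unchanged": sorted((current & baseline).elements()),
        "absent": sorted((baseline - current).elements()),
    }


# "b2" stands for a fingerprint shared by several identical findings.
baseline_run = ["a1", "b2", "b2", "c3"]
current_run = ["b2", "b2", "b2", "d4"]

report = compare_to_baseline(baseline_run, current_run)
print(report)
# {'new': ['b2', 'd4'], 'unchanged': ['b2', 'b2'], 'absent': ['a1', 'c3']}
```

Counting occurrences instead of using plain sets means a third finding with the shared fingerprint "b2" is still reported as new, even though it cannot be told apart from the other two.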

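3. Summary

SARIF defines a common, standard output format for static scanning tools and can meet the various reporting requirements of such tools;
For integrating static scanning tools into a DevSecOps platform, SARIF reduces the cost and complexity of aggregating scan results into a common workflow;
SARIF also makes it possible for IDEs to integrate the results of different scanners and to provide a single defect handling module for displaying and fixing them. Tool vendors can then concentrate on finding problems and spend less effort adapting to each IDE;
SARIF has become an OASIS standard and is supported in the tools of major static analysis vendors such as Microsoft and GrammaTech. U.S. DHS and U.S. NIST also require SARIF as the report format in some evaluations and competitions of static checking tools;
Although SARIF is currently designed mainly for the results of static scanning tools, its generality has allowed some dynamic analysis tool vendors to apply it successfully as well.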