intel：spectre&Meltdown側信道攻擊（四）—— cache mapping

時間 2020-08-01

標籤 intel spectre&meltdown spectre meltdown 信道攻擊 cache mapping 欄目 Intel 简体版

原文原文鏈接

　　前面簡單介紹了row hammer攻擊的原理和方法，爲了更好理解這種底層硬件類攻擊，今天介紹一下cpu的cache mapping；html

　　衆所周知，cpu從內存讀數據，最開始用的是虛擬地址，須要經過分頁機制，將虛擬地址轉換成物理地址，而後從物理地址（默認是DRAM，俗稱內存條）讀數據；但內存條速度和cpu相差近百倍，由此誕生了L1\L2\L3 cache；cpu取數據時，會先從各個層級的cache去找，沒有的再從內存取；那麼問題來了，L3 cache裏面有set、slice、line等模塊將整個cache劃分紅一個一個64byte的cache line，cpu是怎麼根據物理地址從L3 cache中取數據的了？好比8MB的L3 cache，一共有8MB/64byte = 2,097,152個cache line，cpu怎麼根據物理地址精確地找到目標cache line了？git

　　一、直接映射（單路相連）github

　　假如物理地址是0x654，這個地址對應的L3 cache的哪一個存儲單元了？先看一種最簡單的狀況：算法

假若有8個cache line，須要3bit遍歷，中間標黃的010就是cache line之間的index；
假如每一個cache line 長度是8byte，一樣只須要3bit就能遍歷全部bbyte，標藍的就是cache line內部的offset
剩下標綠的11001就是tag；cpu額外有個tag array，經過index取出tag array中的tag，和11001對比，若是是，說明這個byte就是該物理地址對應的存儲單元，能夠立刻取數據了，這叫cache hit；不然稱爲cache miss；

　　直接映射有缺陷：若是兩個物理地址的index和offset都同樣，但tag不一樣，也會映射到同一個cache line，增長了刷新cache的時間成本。由此產生了改進的方法，express

　　二、兩路相連apache

　　和1的直連比，僅僅把tag array和cache line組均分紅2分，offset和index尋址不變，僅僅是tag對比改變：這裏因爲分了兩組，因此會有2個tag，只要物理地址的tag和其中一個相同，就算cache hit；至關於多了一次tag比對的機會，增長了命中機率；好比物理地址的tag=0x32，和tag array左邊那個是同樣的，那麼cache line就用way0的；緩存

若是繼續分組，好比4組，就是4way；8組就是8way了，以此類推（後面我在kali上作實驗，查到cache是4 way的，也就是說每一個物理地址的tag都有4次對比的機會，命中的機率仍是蠻大的）；架構

　再舉例，好比緩存總大小32 KB，由4路（4slice，或則說4core）組相連cache，cache line大小是32 Bytes，該怎麼劃分了？app

　總大小32KB，由4路，每路8KB；
每一個cache line 32byte，那麼一共有8KB/32byte=256個，因此index至少8bit；
每一個cache line 32byte，offset至少5bit；

　　整個規劃架構以下：負載均衡

　　三、全鏈接

　　全部的cache line都在一個組內，所以地址中不須要index部分；可根據地址中的tag部分和全部的cache line對應的tag進行比較（硬件上可能並行比較也可能串行比較），哪一個tag比較相等，就命中某個cache line，因此在全相連緩存中，任意地址的數據能夠緩存在任意的cache line；但這麼作成本很高；

　　四、前面介紹3中cache mapping的方法，一旦出現cache miss，cpu會怎麼作了？

　　假設咱們有一個64 Bytes大小直接映射緩存，cache line大小是8 Bytes，採用寫分配和寫回機制。當CPU從地址0x2a讀取一個字節，cache中的數據將會如何變化呢？假設當前cache狀態以下圖所示(tag旁邊valid一欄的數字1表明合法。0表明非法。後面Dirty的1表明dirty，0表明沒有寫過數據，即非dirty)；

　　根據index找到對應的cache line，對應的tag部分valid bit是合法的，可是tag的值不相等，所以發生cache miss。此時咱們須要從地址0x28（8字節對齊）地址加載8字節數據到該cache line中（cache line是緩存最小的讀寫單元）；可是，咱們發現當前cache line的dirty bit置位（表示），因此cache line裏面的數據不能被簡單的丟棄；因爲採用寫回機制，因此咱們須要將cache中的數據0x11223344寫到地址0x0128地址（tag:0x04 index:101 offset:010，鏈接起來就是100 101 010=0x12a，考慮到8字節對齊，就從0x128開始）；

　　當寫回操做完成，再將主存中0x28地址開始的8個字節加載到該cache line中，並清除dirty bit。而後根據offset找到0x52返回給CPU；

　　五、 cache mapping測試

　　https://github.com/google/rowhammer-test/tree/master/cache_analysis 這裏有現成的代碼，能夠直接用；

　　核心思路：分配虛擬空間->轉成物理地址->每隔一頁再生成物理地址->這兩個地址在同一個cache set嗎? -> 若是是就保留->從該保留的地址讀10次數據，保留每次耗時->取中位數;

　　本人vmware虛擬機實驗環境kali下查看cpu L3緩存（這裏用index2表示）的ways_of_associate是4，關聯度就是4；cache line是64byte，那麼物理地址的0~5bit就是offset，6~7bit就是index；下面的代碼中uintptr_t next_addr = buf + page_size，產生新物理地址時直接在上一個物理地址上加0x1000，低12bit是沒變的，offset和index是同樣的，因此新舊物理地址都在同一個cache set；

// Copyright 2015, Google, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#include <algorithm>

// This program attempts to pick sets of memory locations that map to
// the same L3 cache set.  It tests whether they really do map to the
// same cache set by timing accesses to them and outputting a CSV file
// of times that can be graphed.  This program assumes a 2-core Sandy
// Bridge CPU.


// Dummy variable to attempt to prevent compiler and CPU from skipping
// memory accesses.
int g_dummy;

namespace {

const int page_size = 0x1000;
int g_pagemap_fd = -1;

// Extract the physical page number from a Linux /proc/PID/pagemap entry.
uint64_t frame_number_from_pagemap(uint64_t value) {
  return value & ((1ULL << 54) - 1);
}

void init_pagemap() {
  g_pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
  assert(g_pagemap_fd >= 0);
}

/*虛擬地址轉成物理地址*/
uint64_t get_physical_addr(uintptr_t virtual_addr) {
  uint64_t value;
  /*virtual_addr=16<<20；page_size=4096，sizeof(value)=8，offset=4096*8*/
  off_t offset = (virtual_addr / page_size) * sizeof(value);
  int got = pread(g_pagemap_fd, &value, sizeof(value), offset);
  assert(got == 8);

  // Check the "page present" flag.
  assert(value & (1ULL << 63));

  uint64_t frame_num = frame_number_from_pagemap(value);
  return (frame_num * page_size) | (virtual_addr & (page_size - 1));
}

// Execute a CPU memory barrier.  This is an attempt to prevent memory
// accesses from being reordered, in case reordering affects what gets
// evicted from the cache.  It's also an attempt to ensure we're
// measuring the time for a single memory access.
//
// However, this appears to be unnecessary on Sandy Bridge CPUs, since
// we get the same shape graph without this.
inline void mfence() {
  asm volatile("mfence");
}

// Measure the time taken to access the given address, in nanoseconds.
int time_access(uintptr_t ptr) {
  struct timespec ts0;
  int rc = clock_gettime(CLOCK_MONOTONIC, &ts0);
  assert(rc == 0);

  g_dummy += *(volatile int *) ptr;
  mfence();

  struct timespec ts;
  rc = clock_gettime(CLOCK_MONOTONIC, &ts);
  assert(rc == 0);
  return (ts.tv_sec - ts0.tv_sec) * 1000000000
         + (ts.tv_nsec - ts0.tv_nsec);
}

// Given a physical memory address, this hashes the address and
// returns the number of the cache slice that the address maps to.
//
// This assumes a 2-core Sandy Bridge CPU.
//
// "bad_bit" lets us test whether this hash function is correct.  It
// inverts whether the given bit number is included in the set of
// address bits to hash.   不一樣cpu架構的hash算法不一樣，做者是基於sandy brige架構的，其餘架構好比ivy、hashwell、coffe lake可能沒法運行或邏輯錯誤；
int get_cache_slice(uint64_t phys_addr, int bad_bit) {
  // On a 4-core machine, the CPU's hash function produces a 2-bit
  // cache slice number, where the two bits are defined by "h1" and
  // "h2":
  //
  // h1 function:
  //   static const int bits[] = { 18, 19, 21, 23, 25, 27, 29, 30, 31 };
  // h2 function:
  //   static const int bits[] = { 17, 19, 20, 21, 22, 23, 24, 26, 28, 29, 31 };
  //
  // This hash function is described in the paper "Practical Timing
  // Side Channel Attacks Against Kernel Space ASLR".
  //
  // On a 2-core machine, the CPU's hash function produces a 1-bit
  // cache slice number which appears to be the XOR of h1 and h2.

  // XOR of h1 and h2: 這些位依次作檢驗，根據不一樣的0或1來決定存放不一樣的slice，以此達到負載均衡的目的
  static const int bits[] = { 17, 18, 20, 22, 24, 25, 26, 27, 28, 30 };

  int count = sizeof(bits) / sizeof(bits[0]);
  int hash = 0;
  //分別測試bits各個元素指向的位是1仍是0
  for (int i = 0; i < count; i++) {
    hash ^= (phys_addr >> bits[i]) & 1;//h1
  }
  if (bad_bit != -1) {
    /*phys_addr中，bad_bit位是1嗎?若是是，hash不變；若是不是，hash=1；
    好比phys_addr=0x1234，bad_bit=17，那麼(phys_addr>>bad_bit)&1=0,hash=1;
    好比phys_addr=0x8234，bad_bit=15，那麼(phys_addr>>bad_bit)&1=1,hash不變;
    */
    hash ^= (phys_addr >> bad_bit) & 1;//h1 xor h2
  }
  return hash;//hash初始值是0，這裏只能是0或1，由於這是2-core cpu，slice只能是0或1；
}

/*
一、低17位相同
二、hash相同
*/
bool in_same_cache_set(uint64_t phys1, uint64_t phys2, int bad_bit) {
  // For Sandy Bridge, the bottom 17 bits determine the cache set
  // within the cache slice (or the location within a cache line).
  uint64_t mask = ((uint64_t) 1 << 17) - 1;//1FFFF，只保留低17位，其他清零
  return ((phys1 & mask) == (phys2 & mask) && //兩個物理地址低17位相同
          get_cache_slice(phys1, bad_bit) == get_cache_slice(phys2, bad_bit));
}

int timing(int addr_count, int bad_bit) {
  size_t size = 16 << 20;
  uintptr_t buf =
    (uintptr_t) mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);//分配內存
  assert(buf);

  uintptr_t addrs[addr_count];
  addrs[0] = buf;
  uintptr_t phys1 = get_physical_addr(addrs[0]);

  // Pick a set of addresses which we think belong to the same cache set；
  /*本人CPU是intel core-i7 8750, 用cpu-z查是coffee lake架構，L3=9M，12way,cahe line=64byte（0~5位是offset）;
  cache line總數=9M/64byte=147456個；cache set數量=cache line總數/way = 12288，須要17bit,因此物理地址的6~23bit是index，用來尋找cache set的
  物理地址增長0x1000，低12bit沒變，原做者的offset和index沒變（本人的cpu6~23bit是index，會致使set改變），映射到的set應該是同樣的；
  但第13位依次加1，會致使物理地址的tag(從第10位開始)不一樣，由此映射到同一set下不一樣的slice
  */
  uintptr_t next_addr = buf + page_size;
  uintptr_t end_addr = buf + size;
  int found = 1;
  while (found < addr_count) {
    assert(next_addr < end_addr);
    uintptr_t addr = next_addr;
    //從buf開始取第一個物理地址，每隔1頁再取物理地址，看看這些物理地址在不在同一個cache set
    next_addr += page_size;
    uint64_t phys2 = get_physical_addr(addr);
    if (in_same_cache_set(phys1, phys2, bad_bit)) {
      addrs[found] = addr;
      found++;
    }
  }

  // Time memory accesses.
  int runs = 10;
  int times[runs];
  for (int run = 0; run < runs; run++) {
    // Ensure the first address is cached by accessing it.
    g_dummy += *(volatile int *) addrs[0];
    mfence();
    // Now pull the other addresses through the cache too.
    for (int i = 1; i < addr_count; i++) {
      g_dummy += *(volatile int *) addrs[i];
    }
    mfence();
    // See whether the first address got evicted from the cache by
    // timing accessing it. 若是時間很長，說明第一個地址已經被從cache set驅逐出去了；
    times[run] = time_access(addrs[0]);
  }
  // Find the median time.  We use the median in order to discard
  // outliers.  We want to discard outlying slow results which are
  // likely to be the result of other activity on the machine.
  //
  // We also want to discard outliers where memory was accessed
  // unusually quickly.  These could be the result of the CPU's
  // eviction policy not using an exact LRU policy.
  std::sort(times, &times[runs]);
  int median_time = times[runs / 2];

  int rc = munmap((void *) buf, size);
  assert(rc == 0);

  return median_time;
}

int timing_mean(int addr_count, int bad_bit) {
  int runs = 10;
  int sum_time = 0;
  for (int i = 0; i < runs; i++)
    sum_time += timing(addr_count, bad_bit);
  return sum_time / runs;
}

} // namespace

int main() {
  init_pagemap();

  // Turn off stdout caching.
  setvbuf(stdout, NULL, _IONBF, 0);

  // For a 12-way cache, we want to pick 13 addresses belonging to the
  // same cache set.  Measure the effect of picking more addresses to
  // test whether in_same_cache_set() is correctly determining whether
  // addresses belong to the same cache set.
  //，這裏用超過12個的物理地址作測試
  //會致使第一個物理地址的緩存被從cache set驅逐(eviction)，再次讀該物理地址
  //時耗時明顯增長
  int max_addr_count = 13 * 4;

  bool test_bad_bits = true;

  printf("Address count");
  printf(",Baseline hash (no bits changed)");
  if (test_bad_bits) {
    for (int bad_bit = 17; bad_bit < 32; bad_bit++) {
      printf(",Change bit %i", bad_bit);
    }
  }
  printf("\n");

  for (int addr_count = 0; addr_count < max_addr_count; addr_count++) {
    printf("%i", addr_count);
    printf(",%i", timing_mean(addr_count, -1));
    if (test_bad_bits) {
      for (int bad_bit = 17; bad_bit < 32; bad_bit++) {
        printf(",%i", timing_mean(addr_count, bad_bit));
      }
    }
    printf("\n");
  }
  return 0;
}