The previous post gave a brief introduction to the principles and techniques of row hammer attacks. To better understand this kind of low-level hardware attack, this post looks at CPU cache mapping.
As is well known, when the CPU reads data from memory it starts with a virtual address, which the paging mechanism translates into a physical address; the data is then read from that physical address (by default in DRAM, i.e. the memory modules). But DRAM is nearly a hundred times slower than the CPU, which is why the L1/L2/L3 caches were introduced. When fetching data, the CPU first searches each cache level and only goes to main memory on a miss. So here is the question: the L3 cache is organized into sets, slices, and lines, carving the whole cache into 64-byte cache lines. How does the CPU use a physical address to fetch data from the L3 cache? For example, an 8 MB L3 cache has 8 MB / 64 bytes = 131,072 cache lines in total — how does the CPU locate the exact target cache line from a physical address?
1. Direct mapping (one-way set associative)
Suppose the physical address is 0x654 — which storage unit in the L3 cache does this address correspond to? Let's start with the simplest case:
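As a minimal sketch of the field split, assuming a made-up direct-mapped cache with 64-byte lines and 64 sets (the real geometry depends on the CPU, so these numbers are purely illustrative), the address is simply cut into offset, index, and tag:

#include <cstdint>
#include <cstdio>

int main() {
  // Hypothetical direct-mapped cache: 64-byte lines, 64 sets (assumed values).
  const uint64_t line_size = 64;   // -> low 6 bits are the offset
  const uint64_t num_sets  = 64;   // -> next 6 bits are the index
  uint64_t addr = 0x654;

  uint64_t offset = addr % line_size;              // byte within the cache line
  uint64_t index  = (addr / line_size) % num_sets; // which cache line to use
  uint64_t tag    = addr / (line_size * num_sets); // stored with the line, compared on lookup

  printf("addr=0x%llx offset=0x%llx index=0x%llx tag=0x%llx\n",
         (unsigned long long) addr, (unsigned long long) offset,
         (unsigned long long) index, (unsigned long long) tag);
  return 0;
}

With these assumed sizes, 0x654 gives offset 0x14, index 0x19, tag 0: the index selects one specific cache line, and the stored tag decides hit or miss.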
Direct mapping has a drawback: two physical addresses with the same index and offset but different tags map to the same cache line, so they keep evicting each other, which raises the cost of refilling the cache. This led to an improved scheme:
2. Two-way set associative
Compared with the direct mapping in section 1, the tag array and the cache lines are simply split evenly into two ways; offset and index addressing stay the same, only the tag comparison changes. Because there are two ways, each set now holds 2 tags, and as long as the physical address's tag matches either of them it counts as a cache hit — effectively one extra chance at a tag match, which raises the hit probability. For example, if the physical address's tag is 0x32 and it matches the tag in the left tag array, the cache line in way0 is used.
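A small sketch of that lookup, with invented tag values (0x32 in way0, 0x7e in way1) chosen only to mirror the example above:

#include <cstdint>
#include <cstdio>

struct Way { bool valid; uint64_t tag; };

int main() {
  // One set of a hypothetical 2-way cache; tags are made up for the example.
  Way set[2] = { {true, 0x32}, {true, 0x7e} };
  uint64_t addr_tag = 0x32;  // tag field extracted from the physical address

  int hit_way = -1;
  for (int w = 0; w < 2; w++) {            // compare against every way in the set
    if (set[w].valid && set[w].tag == addr_tag) {
      hit_way = w;
      break;
    }
  }
  if (hit_way >= 0)
    printf("cache hit in way%d\n", hit_way); // here: way0
  else
    printf("cache miss\n");
  return 0;
}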
If we keep splitting, 4 groups gives a 4-way cache, 8 groups gives 8-way, and so on. (In my experiment on Kali later in this post, the cache turned out to be 4-way, which means every physical address's tag gets 4 comparison chances, so the hit probability is fairly high.)
Another example: the total cache size is 32 KB, organized as a 4-way set-associative cache with 32-byte cache lines. How should it be partitioned?
The overall layout is as follows:
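Working through the numbers of this example (all values come from the paragraph above, the field widths follow directly from them):

#include <cstdio>

int main() {
  const unsigned cache_size = 32 * 1024; // 32 KB total
  const unsigned ways       = 4;
  const unsigned line_size  = 32;        // bytes per cache line

  unsigned lines = cache_size / line_size;   // 1024 cache lines in total
  unsigned sets  = lines / ways;             // 256 sets, each holding 4 lines
  // 32-byte lines -> 5 offset bits (bits 0-4)
  // 256 sets      -> 8 index bits  (bits 5-12)
  // the remaining high bits form the tag
  printf("lines=%u sets=%u\n", lines, sets);
  return 0;
}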
3. Fully associative
All cache lines belong to a single set, so the address needs no index field. The tag field of the address is compared with the tag of every cache line (in hardware this may be done in parallel or serially); whichever tag matches is the hit, so in a fully associative cache, data at any address can be cached in any cache line. Doing this, however, is very expensive in hardware.
4. Given the three cache mapping schemes above, what does the CPU do once a cache miss occurs?
Suppose we have a 64-byte direct-mapped cache with 8-byte cache lines, using write-allocate and write-back. When the CPU reads one byte from address 0x2a, how does the cache change? Assume the current cache state is as shown below (in the valid column next to the tag, 1 means valid and 0 means invalid; a Dirty value of 1 means the line is dirty, 0 means it has not been written, i.e. it is clean).
Using the index we find the corresponding cache line; its valid bit is set, but the stored tag does not match, so a cache miss occurs. We therefore need to load 8 bytes starting at address 0x28 (8-byte aligned) into this cache line (a cache line is the smallest unit the cache reads or writes). However, the cache line's dirty bit is set (indicating it has been modified), so its contents cannot simply be discarded. Because the cache is write-back, we must first write the cached data 0x11223344 back to address 0x128 (tag 0x04, index 0b101, offset 0b010; concatenated, 100 101 010 = 0x12a, and with 8-byte alignment the write-back starts at 0x128).
Once the write-back completes, the 8 bytes starting at main memory address 0x28 are loaded into the cache line and the dirty bit is cleared. Then the offset is used to locate the byte 0x52, which is returned to the CPU.
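The field split used in this walk-through (64-byte direct-mapped cache, 8-byte lines, hence 8 sets: 3 offset bits and 3 index bits) can be checked with a few lines:

#include <cstdint>
#include <cstdio>

int main() {
  // 64-byte direct-mapped cache, 8-byte lines -> 8 sets,
  // 3 offset bits (0-2), 3 index bits (3-5), tag = remaining high bits.
  uint64_t addr = 0x2a;
  uint64_t offset = addr & 0x7;         // 0b010
  uint64_t index  = (addr >> 3) & 0x7;  // 0b101
  uint64_t tag    = addr >> 6;          // 0x00, differs from the stored tag 0x04 -> miss

  // Write-back target of the victim line: tag 0x04, index 0b101, offset 0b010.
  uint64_t victim = (0x04ULL << 6) | (0x5ULL << 3) | 0x2;  // = 0x12a
  printf("offset=%llu index=%llu tag=%llu victim=0x%llx aligned=0x%llx\n",
         (unsigned long long) offset, (unsigned long long) index,
         (unsigned long long) tag, (unsigned long long) victim,
         (unsigned long long) (victim & ~0x7ULL));          // aligned down to 0x128
  return 0;
}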
5. Cache mapping test
Ready-made code is available at https://github.com/google/rowhammer-test/tree/master/cache_analysis and can be used directly.
Core idea: allocate virtual memory -> translate it to a physical address -> take the address one page further and translate that too -> are the two addresses in the same cache set? -> if so, keep the new address -> read the kept addresses, timing each of 10 runs -> take the median.
In my VMware VM running Kali, ways_of_associativity for the CPU's L3 cache (exposed here as index2) is 4, i.e. the associativity is 4; the cache line is 64 bytes, so bits 0~5 of the physical address are the offset and bits 6~7 are the index. In the code below, uintptr_t next_addr = buf + page_size steps the virtual address by one page (0x1000), so the low 12 bits of the resulting physical address are unchanged; offset and index therefore stay the same, and the old and new physical addresses fall into the same cache set.
// Copyright 2015, Google, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//   http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#include <algorithm>

// This program attempts to pick sets of memory locations that map to
// the same L3 cache set. It tests whether they really do map to the
// same cache set by timing accesses to them and outputting a CSV file
// of times that can be graphed. This program assumes a 2-core Sandy
// Bridge CPU.

// Dummy variable to attempt to prevent compiler and CPU from skipping
// memory accesses.
int g_dummy;

namespace {

const int page_size = 0x1000;

int g_pagemap_fd = -1;

// Extract the physical page number from a Linux /proc/PID/pagemap entry.
uint64_t frame_number_from_pagemap(uint64_t value) {
  return value & ((1ULL << 54) - 1);
}

void init_pagemap() {
  g_pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
  assert(g_pagemap_fd >= 0);
}

/* Convert a virtual address into a physical address. */
uint64_t get_physical_addr(uintptr_t virtual_addr) {
  uint64_t value;
  /* e.g. virtual_addr = 16 << 20; with page_size = 4096 and sizeof(value) = 8, offset = 4096 * 8 */
  off_t offset = (virtual_addr / page_size) * sizeof(value);
  int got = pread(g_pagemap_fd, &value, sizeof(value), offset);
  assert(got == 8);

  // Check the "page present" flag.
  assert(value & (1ULL << 63));

  uint64_t frame_num = frame_number_from_pagemap(value);
  return (frame_num * page_size) | (virtual_addr & (page_size - 1));
}

// Execute a CPU memory barrier. This is an attempt to prevent memory
// accesses from being reordered, in case reordering affects what gets
// evicted from the cache. It's also an attempt to ensure we're
// measuring the time for a single memory access.
//
// However, this appears to be unnecessary on Sandy Bridge CPUs, since
// we get the same shape graph without this.
inline void mfence() {
  asm volatile("mfence");
}

// Measure the time taken to access the given address, in nanoseconds.
int time_access(uintptr_t ptr) {
  struct timespec ts0;
  int rc = clock_gettime(CLOCK_MONOTONIC, &ts0);
  assert(rc == 0);

  g_dummy += *(volatile int *) ptr;
  mfence();

  struct timespec ts;
  rc = clock_gettime(CLOCK_MONOTONIC, &ts);
  assert(rc == 0);
  return (ts.tv_sec - ts0.tv_sec) * 1000000000
         + (ts.tv_nsec - ts0.tv_nsec);
}

// Given a physical memory address, this hashes the address and
// returns the number of the cache slice that the address maps to.
//
// This assumes a 2-core Sandy Bridge CPU.
//
// "bad_bit" lets us test whether this hash function is correct. It
// inverts whether the given bit number is included in the set of
// address bits to hash.
不一樣cpu架構的hash算法不一樣,做者是基於sandy brige架構的,其餘架構好比ivy、hashwell、coffe lake可能沒法運行或邏輯錯誤; int get_cache_slice(uint64_t phys_addr, int bad_bit) { // On a 4-core machine, the CPU's hash function produces a 2-bit // cache slice number, where the two bits are defined by "h1" and // "h2": // // h1 function: // static const int bits[] = { 18, 19, 21, 23, 25, 27, 29, 30, 31 }; // h2 function: // static const int bits[] = { 17, 19, 20, 21, 22, 23, 24, 26, 28, 29, 31 }; // // This hash function is described in the paper "Practical Timing // Side Channel Attacks Against Kernel Space ASLR". // // On a 2-core machine, the CPU's hash function produces a 1-bit // cache slice number which appears to be the XOR of h1 and h2. // XOR of h1 and h2: 這些位依次作檢驗,根據不一樣的0或1來決定存放不一樣的slice,以此達到負載均衡的目的 static const int bits[] = { 17, 18, 20, 22, 24, 25, 26, 27, 28, 30 }; int count = sizeof(bits) / sizeof(bits[0]); int hash = 0; //分別測試bits各個元素指向的位是1仍是0 for (int i = 0; i < count; i++) { hash ^= (phys_addr >> bits[i]) & 1;//h1 } if (bad_bit != -1) { /*phys_addr中,bad_bit位是1嗎?若是是,hash不變;若是不是,hash=1; 好比phys_addr=0x1234,bad_bit=17,那麼(phys_addr>>bad_bit)&1=0,hash=1; 好比phys_addr=0x8234,bad_bit=15,那麼(phys_addr>>bad_bit)&1=1,hash不變; */ hash ^= (phys_addr >> bad_bit) & 1;//h1 xor h2 } return hash;//hash初始值是0,這裏只能是0或1,由於這是2-core cpu,slice只能是0或1; } /* 一、低17位相同 二、hash相同 */ bool in_same_cache_set(uint64_t phys1, uint64_t phys2, int bad_bit) { // For Sandy Bridge, the bottom 17 bits determine the cache set // within the cache slice (or the location within a cache line). uint64_t mask = ((uint64_t) 1 << 17) - 1;//1FFFF,只保留低17位,其他清零 return ((phys1 & mask) == (phys2 & mask) && //兩個物理地址低17位相同 get_cache_slice(phys1, bad_bit) == get_cache_slice(phys2, bad_bit)); } int timing(int addr_count, int bad_bit) { size_t size = 16 << 20; uintptr_t buf = (uintptr_t) mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);//分配內存 assert(buf); uintptr_t addrs[addr_count]; addrs[0] = buf; uintptr_t phys1 = get_physical_addr(addrs[0]); // Pick a set of addresses which we think belong to the same cache set; /*本人CPU是intel core-i7 8750, 用cpu-z查是coffee lake架構,L3=9M,12way,cahe line=64byte(0~5位是offset); cache line總數=9M/64byte=147456個;cache set數量=cache line總數/way = 12288,須要17bit,因此物理地址的6~23bit是index,用來尋找cache set的 物理地址增長0x1000,低12bit沒變,原做者的offset和index沒變(本人的cpu6~23bit是index,會致使set改變),映射到的set應該是同樣的; 但第13位依次加1,會致使物理地址的tag(從第10位開始)不一樣,由此映射到同一set下不一樣的slice */ uintptr_t next_addr = buf + page_size; uintptr_t end_addr = buf + size; int found = 1; while (found < addr_count) { assert(next_addr < end_addr); uintptr_t addr = next_addr; //從buf開始取第一個物理地址,每隔1頁再取物理地址,看看這些物理地址在不在同一個cache set next_addr += page_size; uint64_t phys2 = get_physical_addr(addr); if (in_same_cache_set(phys1, phys2, bad_bit)) { addrs[found] = addr; found++; } } // Time memory accesses. int runs = 10; int times[runs]; for (int run = 0; run < runs; run++) { // Ensure the first address is cached by accessing it. g_dummy += *(volatile int *) addrs[0]; mfence(); // Now pull the other addresses through the cache too. for (int i = 1; i < addr_count; i++) { g_dummy += *(volatile int *) addrs[i]; } mfence(); // See whether the first address got evicted from the cache by // timing accessing it. 若是時間很長,說明第一個地址已經被從cache set驅逐出去了; times[run] = time_access(addrs[0]); } // Find the median time. We use the median in order to discard // outliers. We want to discard outlying slow results which are // likely to be the result of other activity on the machine. 
  //
  // We also want to discard outliers where memory was accessed
  // unusually quickly. These could be the result of the CPU's
  // eviction policy not using an exact LRU policy.
  std::sort(times, &times[runs]);
  int median_time = times[runs / 2];

  int rc = munmap((void *) buf, size);
  assert(rc == 0);

  return median_time;
}

int timing_mean(int addr_count, int bad_bit) {
  int runs = 10;
  int sum_time = 0;
  for (int i = 0; i < runs; i++)
    sum_time += timing(addr_count, bad_bit);
  return sum_time / runs;
}

}  // namespace

int main() {
  init_pagemap();

  // Turn off stdout caching.
  setvbuf(stdout, NULL, _IONBF, 0);

  // For a 12-way cache, we want to pick 13 addresses belonging to the
  // same cache set. Measure the effect of picking more addresses to
  // test whether in_same_cache_set() is correctly determining whether
  // addresses belong to the same cache set.
  // Testing with more than 12 physical addresses causes the first address's
  // line to be evicted from the cache set, so re-reading that address takes
  // noticeably longer.
  int max_addr_count = 13 * 4;
  bool test_bad_bits = true;

  printf("Address count");
  printf(",Baseline hash (no bits changed)");
  if (test_bad_bits) {
    for (int bad_bit = 17; bad_bit < 32; bad_bit++) {
      printf(",Change bit %i", bad_bit);
    }
  }
  printf("\n");

  for (int addr_count = 0; addr_count < max_addr_count; addr_count++) {
    printf("%i", addr_count);
    printf(",%i", timing_mean(addr_count, -1));
    if (test_bad_bits) {
      for (int bad_bit = 17; bad_bit < 32; bad_bit++) {
        printf(",%i", timing_mean(addr_count, bad_bit));
      }
    }
    printf("\n");
  }
  return 0;
}
In the code, set the number of addresses to try to int max_addr_count = 5 * 4 (my cache is 4-way) and compare results for several address counts around that threshold (e.g. 3~7). (The original author's cache was 12-way; repeating the test with different address counts, the access time jumps sharply once the count exceeds 13, i.e. cache misses spike.)
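To pick a sensible max_addr_count for your own machine, the cache geometry can be read from sysfs. This is just a sketch: the standard Linux attribute files are assumed, and index3 is usually the L3, but as noted above for my VM the index numbering can differ (check the "level" file):

#include <cstdio>
#include <fstream>
#include <string>

// Read one value from a sysfs cache attribute of cpu0.
static std::string read_attr(const std::string &name) {
  std::ifstream f("/sys/devices/system/cpu/cpu0/cache/index3/" + name);
  std::string value;
  std::getline(f, value);  // empty if the attribute does not exist
  return value;
}

int main() {
  printf("level: %s\n", read_attr("level").c_str());
  printf("ways_of_associativity: %s\n", read_attr("ways_of_associativity").c_str());
  printf("coherency_line_size: %s\n", read_attr("coherency_line_size").c_str());
  printf("number_of_sets: %s\n", read_attr("number_of_sets").c_str());
  return 0;
}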
References:
http://lackingrhoticity.blogspot.com/2015/04/l3-cache-mapping-on-sandy-bridge-cpus.html — L3 cache mapping on Sandy Bridge CPUs
https://zhuanlan.zhihu.com/p/102293437 Cache的基本原理
Reverse Engineering Intel Last-Level Cache Complex Addressing Using Performance Counters
Mapping the Intel Last-Level Cache
Finally, here is a mind map to tie the key points together: