A new approach to tracking Linux filesystem disk requests

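While developing a small tool to trace when, and which part of, each file is first accessed during a program's initialization, I had the chance to apply some newly acquired eBPF capabilities. The request came with a few constraints: the information had to be independent of the buffered I/O method used (synchronous, aio, memory-mapped); it had to be trivial to correlate the block data with the file being accessed; and, most importantly, the tracing code must not have a large performance impact on the observed system.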

What is the best level to install the probe?
I started by investigating where we would want to place the probe:

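Tracing at the Block Layer level
A tracer in the Block Layer examines requests in terms of disk blocks, which are not directly correlated to filesystem files. Tracing at this level gives great insight into which areas of the disk are being read or written, and whether they are physically organized in a contiguous manner. It does not, however, give you a higher-level, file-based view of the system. Other tools already exist for tracing block-level accesses, such as the eBPF script biosnoop and the traditional blktrace.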

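Tracing at the filesystem level
A filesystem-level tracer exposes data in terms of file blocks, each of which resolves to one or more data blocks on disk. In the example scenario of an Ext4 filesystem with a 4 KiB file block size, one system page may correspond to 1 physical block (on a 4K disk) or to 8 physical blocks on a disk with 512-byte sectors. Tracing at the filesystem level lets us look at the file in terms of offsets, so we can ignore disk fragmentation. Different installations may fragment the disk differently, and from the application's point of view we should not care about the disk layout so much as about which file blocks of data we need, in case we want to optimize by prefetching them.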
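The arithmetic behind the file block and sector relationship can be sketched in a few lines of Python. The function names here are hypothetical helpers written for illustration, not part of any tool mentioned in this article, and the byte range is relative to the start of the file, not to a location on disk.

```python
def sectors_per_file_block(fs_block_size: int, sector_size: int) -> int:
    """How many disk sectors back a single filesystem block."""
    if fs_block_size % sector_size != 0:
        raise ValueError("file block size must be a multiple of the sector size")
    return fs_block_size // sector_size


def file_block_byte_range(block: int, fs_block_size: int = 4096) -> tuple:
    """Byte range [start, end) inside the file covered by a file block."""
    start = block * fs_block_size
    return start, start + fs_block_size


if __name__ == "__main__":
    print(sectors_per_file_block(4096, 4096))  # 4K-sector disk
    print(sectors_per_file_block(4096, 512))   # 512-byte-sector disk
    print(file_block_byte_range(34))
```

Working in file offsets like this is what makes the trace independent of how a particular installation lays the file out on disk.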

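Tracing at the page cache level
The page cache is a structure that sits between the VFS/memory-mapping system and the filesystem layer, and it manages the sections of memory that have already been read from disk. By tracing the insertion and removal of pages in this cache, we get the "first access" behavior for free. The first time a new page is accessed, it is brought into the cache, and further accesses do not need to go to disk. If the page is eventually dropped from the cache because it is no longer needed, a new use has to go to the disk again, and a new access entry is logged. This is exactly the functionality we are looking for in our performance investigation scenario.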
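To make the "first access for free" idea concrete, here is a toy model in Python. It is only a sketch of the behavior described above, with a hypothetical `ToyPageCache` class; the real kernel page cache is far more involved (reclaim policy, locking, read-ahead).

```python
class ToyPageCache:
    """Logs an entry only when a page is not already resident (a miss)."""

    def __init__(self):
        self._resident = set()   # (inode, page_index) pairs currently cached
        self.miss_log = []       # what a miss-only tracer would record

    def access(self, inode: int, page_index: int) -> bool:
        """Return True on a cache hit, False on a miss (which gets logged)."""
        key = (inode, page_index)
        if key in self._resident:
            return True
        self._resident.add(key)
        self.miss_log.append(key)
        return False

    def evict(self, inode: int, page_index: int) -> None:
        """Drop a page, so the next access goes to 'disk' again."""
        self._resident.discard((inode, page_index))


if __name__ == "__main__":
    cache = ToyPageCache()
    cache.access(14, 34)   # miss: logged
    cache.access(14, 34)   # hit: not logged
    cache.evict(14, 34)
    cache.access(14, 34)   # miss again after eviction: logged
    print(cache.miss_log)
```

Repeated accesses to a resident page never reach the log, which is why tracing only misses keeps the overhead low.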

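The probe
The probe we implemented traces page cache misses inside the kernel's page cache handling functions, so that it identifies the first request for a block before the request is even submitted to the disk. Further requests to the same memory region (as long as the data is still mapped) result in a cache hit, which we do not care about and do not trace. This keeps our code from interfering with those later accesses, which greatly reduces the impact the probe can have on performance.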

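By tracing the page cache, we can also distinguish blocks requested directly by the user program from blocks requested by the Read Ahead logic in the kernel. Knowing which blocks were read ahead is very interesting information for application developers and system administrators, because it lets them tune their systems or applications to prefetch the blocks they want in a sane manner.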

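As with any eBPF application, the code is quite simple. If we ignore some of the boilerplate needed to get the eBPF program compiled, the probe boils down to the following function: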

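int fblktrace_read_pages(struct pt_regs *ctx, struct address_space *mapping,
                         struct list_head *pages, struct page *page,
                         unsigned nr_pages, bool is_readahead)
{
    u64 index;
    unsigned blkbits = mapping->host->i_blkbits;   /* log2 of the file block size */
    unsigned long ino = mapping->host->i_ino;      /* inode that owns the mapping */
    u64 block_in_file;

    /* Bounded loop: report at most 32 pages per call. */
    for (int i = 0; i < 32 && nr_pages--; i++) {
        if (pages) {
            /* A list of pages was passed in: walk it backwards. */
            pages = pages->prev;
            page = container_of(pages, struct page, lru);
        }
        index = page->index;
        /* Page index -> file block; the 12 assumes 4 KiB pages. */
        block_in_file = (unsigned long)index << (12 - blkbits);

        /* Equal to the page index when file blocks are also 4 KiB. */
        bpf_trace_printk("=> inode: %ld: FSBLK=%lu BSIZ=%lu %s\\n",
                         ino, block_in_file, 1 << blkbits,
                         is_readahead ? "[RA]" : "");
    }
    return 0;
}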

The function above is installed as a tracer for the function ext4_mpage_readpages, using the following snippet:

b.attach_kprobe(event="ext4_mpage_readpages", fn_name="fblktrace_read_pages")

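The probe runs every time the kernel page cache asks the filesystem to fetch some pages from disk for the first time. The region to be read is identified indirectly by the index of the page within the file's address space mapping. We use that information to calculate the offset into the file, in file blocks, and pass it to the print function together with the inode number that identifies the file.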
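The conversion from a page index to a file block is a single shift. The Python sketch below mirrors that calculation under the same assumption the probe makes, namely 4 KiB pages (a page shift of 12); the function names are hypothetical.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages, as hard-coded in the probe


def page_index_to_file_block(index: int, blkbits: int,
                             page_shift: int = PAGE_SHIFT) -> int:
    """First file block covered by the page with the given index.

    blkbits is log2 of the file block size (12 for 4 KiB blocks).
    """
    if blkbits > page_shift:
        raise ValueError("file block larger than a page is not handled here")
    return index << (page_shift - blkbits)


def page_index_to_byte_offset(index: int, page_shift: int = PAGE_SHIFT) -> int:
    """Byte offset in the file where the page starts."""
    return index << page_shift


if __name__ == "__main__":
    print(page_index_to_file_block(34, 12))  # 4 KiB blocks
    print(page_index_to_file_block(34, 10))  # 1 KiB blocks
    print(page_index_to_byte_offset(34))
```

With 4 KiB file blocks the shift is zero, so the page index and the file block number coincide, which matches the BSIZ of 4096 seen in the traces.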

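Usage example
For demonstration purposes, we wrote a small program called touchblk, which reads a file in two ways: using the synchronous read/write system calls, and using the mmap facility. In both cases we read two arbitrarily chosen regions of the file, block 34 followed by block 100.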
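The source of touchblk is not shown here, so the following is only a hypothetical Python sketch of what such a test program could look like: it touches blocks 34 and 100 of a file, once with positional reads and once through a memory mapping. The 4096-byte block size and all function names are assumptions.

```python
import mmap
import os
import tempfile

BLOCK_SIZE = 4096      # assumed file block size
BLOCKS = (34, 100)     # the two blocks touched in the article


def touch_with_pread(path, blocks=BLOCKS, block_size=BLOCK_SIZE):
    """Read one byte at the start of each block with a read system call."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return [os.pread(fd, 1, blk * block_size) for blk in blocks]
    finally:
        os.close(fd)


def touch_with_mmap(path, blocks=BLOCKS, block_size=BLOCK_SIZE):
    """Read one byte at the start of each block through a memory mapping."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return [m[blk * block_size: blk * block_size + 1] for blk in blocks]


def demo(nblocks=128):
    """Build a scratch image where block N is filled with byte N % 256."""
    with tempfile.NamedTemporaryFile(delete=False) as img:
        for n in range(nblocks):
            img.write(bytes([n % 256]) * BLOCK_SIZE)
        path = img.name
    try:
        return touch_with_pread(path), touch_with_mmap(path)
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(demo())
```

On a real run the page cache would have to be cold for the file (a freshly written image is usually still cached), otherwise the accesses are hits and a miss-based probe reports nothing.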

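To run the probe, you need to install the eBPF compiler provided by the BCC tools package. Beyond running this example, the BCC package, which is already available in many Linux distributions, includes a large collection of eBPF-based probe examples that you can use to learn the tool and to write utilities specific to your needs.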

The bcc compiler is provided by the iovisor project:

https://github.com/iovisor/bcc

Now, let's look at the probe in action.

read/write system calls

[remote:root@fblktrace ~]$ ./fblktrace
Printing...
touchblk-2143 [002] d... 91137.791064:0:=> Open inode 14:fname = test.img
touchblk-2143 [002] .N.. 91137.811093:0:=> inode:14:FSBLK = 34 BSIZ = 4096 [RA]
touchblk-2143 [002] .... 91137.828293:0:=> inode:14:FSBLK = 100 BSIZ = 4096 [RA]

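The output above shows exactly the behavior of the test application I described earlier. Since read/write does not necessarily trigger read-ahead, only two entries are shown, one for each of the exact blocks we were looking for. The timestamp and the inode number are also given. To improve the output, a second probe was installed to map inode numbers to file names, but that is obviously not required; it is only there to make the user's life easier.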
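Because the probe prints through the kernel trace pipe, post-processing its output is a matter of parsing text lines. The sketch below is one possible parser in Python; the `Access` record and `parse_line` helper are hypothetical names, and the regular expression is written to tolerate the varying whitespace seen in the listings in this article.

```python
import re
from typing import NamedTuple, Optional


class Access(NamedTuple):
    task: str          # "comm-pid" as printed by the trace pipe
    timestamp: float   # seconds, as printed
    inode: int
    fsblk: int         # file block number
    block_size: int
    readahead: bool    # True when the line carries the [RA] tag


_LINE = re.compile(
    r"^(?P<task>\S+)\s+\[\d+\].*?(?P<ts>\d+\.\d+):.*?"
    r"inode:\s*(?P<ino>\d+):\s*FSBLK\s*=\s*(?P<blk>\d+)\s+"
    r"BSIZ\s*=\s*(?P<bsiz>\d+)\s*(?P<ra>\[RA\])?"
)


def parse_line(line: str) -> Optional[Access]:
    """Parse one block-access line; return None for any other line."""
    m = _LINE.search(line.strip())
    if m is None:
        return None
    return Access(m["task"], float(m["ts"]), int(m["ino"]),
                  int(m["blk"]), int(m["bsiz"]), m["ra"] is not None)


if __name__ == "__main__":
    sample = ("touchblk-2143 [002] .N.. 91137.811093:0:"
              "=> inode:14:FSBLK = 34 BSIZ = 4096 [RA]")
    print(parse_line(sample))
```

Lines that do not describe a block access, such as the "Open inode" line or the "Printing..." banner, simply return None and can be skipped.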

Memory-mapped access
Below is the output of the memory-mapped version. It is quite long...

[remote:root@fblktrace ~]$ ./fblktrace
Printing...
touchblk-2147 [003] d... 91258.462486:0:=> Open inode 14:fname = image
touchblk-2147 [003] .... 91258.480927:0:=> inode:14:FSBLK = 18 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480940:0:=> inode:14:FSBLK = 19 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480942:0:=> inode:14:FSBLK = 20 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480943:0:=> inode:14:FSBLK = 21 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480944:0:=> inode:14:FSBLK = 22 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480945:0:=> inode:14:FSBLK = 23 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480946:0:=> inode:14:FSBLK = 24 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480947:0:=> inode:14:FSBLK = 25 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480948:0:=> inode:14:FSBLK = 26 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480949:0:=> inode:14:FSBLK = 27 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480952:0:=> inode:14:FSBLK = 28 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480952:0:=> inode:14:FSBLK = 29 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480954:0:=> inode:14:FSBLK = 30 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480955:0:=> inode:14:FSBLK = 31 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480955:0:=> inode:14:FSBLK = 32 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480956:0:=> inode:14:FSBLK = 33 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480957:0:=> inode:14:FSBLK = 34 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480958:0:=> inode:14:FSBLK = 35 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480959:0:=> inode:14:FSBLK = 36 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480960:0:=> inode:14:FSBLK = 37 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480961:0:=> inode:14:FSBLK = 38 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480962:0:=> inode:14:FSBLK = 39 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480963:0:=> inode:14:FSBLK = 40 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480966:0:=> inode:14:FSBLK = 41 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480967:0:=> inode:14:FSBLK = 42 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480968:0:=> inode:14:FSBLK = 43 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480969:0:=> inode:14:FSBLK = 44 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480970:0:=> inode:14:FSBLK = 45 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480971:0:=> inode:14:FSBLK = 46 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480972:0:=> inode:14:FSBLK = 47 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480973:0:=> inode:14:FSBLK = 48 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.480974:0:=> inode:14:FSBLK = 49 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498554:0:=> inode:14:FSBLK = 84 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498565:0:=> inode:14:FSBLK = 85 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498566:0:=> inode:14:FSBLK = 86 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498567:0:=> inode:14:FSBLK = 87 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498568:0:=> inode:14:FSBLK = 88 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498569:0:=> inode:14:FSBLK = 89 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498570:0:=> inode:14:FSBLK = 90 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498573:0:=> inode:14:FSBLK = 91 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498574:0:=> inode:14:FSBLK = 92 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498575:0:=> inode:14:FSBLK = 93 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498576:0:=> inode:14:FSBLK = 94 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498577:0:=> inode:14:FSBLK = 95 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498578:0:=> inode:14:FSBLK = 96 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498579:0:=> inode:14:FSBLK = 97 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498580:0:=> inode:14:FSBLK = 98 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498581:0:=> inode:14:FSBLK = 99 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498582:0:=> inode:14:FSBLK = 100 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498583:0:=> inode:14:FSBLK = 101 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498584:0:=> inode:14:FSBLK = 102 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498585:0:=> inode:14:FSBLK = 103 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498586:0:=> inode:14:FSBLK = 104 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498587:0:=> inode:14:FSBLK = 105 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498588:0:=> inode:14:FSBLK = 106 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498589:0:=> inode:14:FSBLK = 107 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498591:0:=> inode:14:FSBLK = 108 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498592:0:=> inode:14:FSBLK = 109 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498593:0:=> inode:14:FSBLK = 110 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498594:0:=> inode:14:FSBLK = 111 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498595:0:=> inode:14:FSBLK = 112 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498596:0:=> inode:14:FSBLK = 113 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498597:0:=> inode:14:FSBLK = 114 BSIZ = 4096 [RA]
touchblk-2147 [003] .... 91258.498598:0:=> inode:14:FSBLK = 115 BSIZ = 4096 [RA]

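Why is it so much longer than the read()/write() system call example? Because the kernel, in order to optimize expensive I/O operations, assumes that when a specific address is accessed in a file being read non-sequentially, the nearby regions will be needed soon, so it performs read-ahead (RA) I/O operations.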

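The kernel cannot assume that the next region needed will immediately follow the accessed block, so it tries to fetch the neighbors both before and after the target block. The number of neighbors to look at is defined by a filesystem-specific parameter in sysfs.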

[krisman@dilma sda2]$ cat /sys/fs/ext4/sda2/inode_readahead_blks
32

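This parameter instructs the kernel to load 32 blocks around the target block during read-ahead. If you go back to the output of the second version of the example and count the blocks read for each of the two accesses, you will see that exactly 32 blocks were read per access: the target block, the 16 blocks immediately before it, and the 15 immediately after it. This gives a very interesting insight into how the read-ahead mechanism works.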
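Counting the window by hand is error-prone, so here is a small Python check. The block ranges are taken from the memory-mapped trace above (blocks 18 to 49 around target 34, and 84 to 115 around target 100); `readahead_window` is a hypothetical helper name.

```python
def readahead_window(blocks, target):
    """Return (blocks before target, blocks after target, total blocks)."""
    blocks = list(blocks)
    before = sum(1 for b in blocks if b < target)
    after = sum(1 for b in blocks if b > target)
    return before, after, len(blocks)


if __name__ == "__main__":
    # FSBLK values observed in the memory-mapped trace, per access.
    first = range(18, 50)     # blocks 18..49, target 34
    second = range(84, 116)   # blocks 84..115, target 100
    print(readahead_window(first, 34))    # (16, 15, 32)
    print(readahead_window(second, 100))  # (16, 15, 32)
```

Both accesses show the same shape: a 32-block window with the target roughly in the middle.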

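Other types of I/O and limitations
This approach captures I/O accesses as they pass through the page cache, so other, non-buffered mechanisms (such as Direct I/O) are not traced. The example is also restricted to ext4, but it could easily be extended to any other Linux filesystem.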

The full code
As usual, the full code is made available in the only way we know: under a free software license in a public repository. Enjoy!

https://gitlab.collabora.com/krisman/bcc/blob/master/tools/fblktrace.py
