The ELF Executable File Format in Detail: Reading the File Header and the Program Header Table

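Hijacking the entry point of an ELF file is impossible without a solid grasp of how ELF files are structured and how they are loaded. The internal layout is complex and the loading logic is hard to follow, so we will take it slice by slice and work through this difficult topic one small piece at a time.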

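In this installment we learn how to read the file header and the program header table. Let's start with the overall structure of an ELF file.

The most important parts of the ELF format are its sections, above all the code section and the data section, named .text and .data. Each section is described by an entry in the section header table. Several sections are grouped into a larger unit, a segment (the "program" in "program header"), and each segment is described by an entry in the program header table. The most detailed description of the ELF data structures is the man page below, which is well worth a careful read: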
https://man7.org/linux/man-pages/man5/elf.5.html

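This time we read two parts of the ELF file. The first is the file header, which records a lot of important information about the file, such as the platform it targets and the CPU type it supports. The command readelf -h prints the header of a given ELF file.

The corresponding data structure is: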

    
 
    #define EI_NIDENT 16

    typedef struct {
        unsigned char e_ident[EI_NIDENT];
        uint16_t      e_type;
        uint16_t      e_machine;
        uint32_t      e_version;
        ElfN_Addr     e_entry;
        ElfN_Off      e_phoff;
        ElfN_Off      e_shoff;
        uint32_t      e_flags;
        uint16_t      e_ehsize;
        uint16_t      e_phentsize;
        uint16_t      e_phnum;
        uint16_t      e_shentsize;
        uint16_t      e_shnum;
        uint16_t      e_shstrndx;
    } ElfN_Ehdr;

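Each byte of the e_ident array has its own meaning. Byte 0 must be 0x7f, and bytes 1 to 3 hold the three characters 'E', 'L', 'F'. Byte 4 says whether the file is a 32-bit or a 64-bit object, byte 5 says whether its data is encoded little-endian or big-endian, and byte 6 is the ELF version, which is 1 in most cases.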
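The byte layout just described can be checked with a few lines of Python. This is only a minimal sketch: `describe_ident` and the hand-built `sample` bytes are hypothetical names introduced for illustration, not part of the article's parser.

```python
# Hypothetical helper for illustration; not part of the article's code.
ELF_MAGIC = b"\x7fELF"  # byte 0 is 0x7f, bytes 1-3 are 'E', 'L', 'F'

def describe_ident(e_ident: bytes) -> dict:
    """Decode the first seven bytes of e_ident into readable fields."""
    if len(e_ident) < 7:
        raise ValueError("e_ident needs at least 7 bytes")
    if e_ident[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file: bad magic number")
    classes = {1: "ELF32", 2: "ELF64"}
    encodings = {1: "little endian", 2: "big endian"}
    return {
        "class": classes.get(e_ident[4], "invalid"),   # byte 4
        "data": encodings.get(e_ident[5], "invalid"),  # byte 5
        "version": e_ident[6],                         # byte 6
    }

# Hand-built sample: 32-bit, little-endian, version 1, zero-padded to 16 bytes.
sample = b"\x7fELF" + bytes([1, 1, 1]) + bytes(9)
print(describe_ident(sample))
# {'class': 'ELF32', 'data': 'little endian', 'version': 1}
```

Indexing a `bytes` object yields integers directly, so no `struct` call is needed for these single-byte fields.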

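The next two bytes, e_type, give the type of the ELF file: an executable, a shared object (dynamic library), or a relocatable file, that is, a compiled binary that has not yet been linked. e_machine identifies the CPU architecture the file targets. e_entry is the virtual address of the first instruction to run once the file has been loaded into memory. e_phoff is the offset of the program header table from the start of the file; we need this value later to read the program headers. e_shoff is the offset of the section header table within the file.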

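Sections and program headers are logically related: one program header covers one or more sections and tells the system how to place them into memory. Section contents come in many types, the most important being .text and .data, which hold code and data respectively. e_flags holds processor-specific flags; it is usually 0 and we do not need it for now.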

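e_ehsize is the size of the ELF header structure itself. e_phentsize is the size of one entry in the program header table; each entry describes the attributes of one segment. e_phnum is the number of entries in the program header table. e_shentsize is the size of one section header entry, which describes the properties of one section, and e_shnum is the number of section header entries. Finally, e_shstrndx is the index of the section header entry for the section name string table.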
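Because every field has a fixed width, the whole 32-bit header can also be decoded in a single `struct.unpack` call. The sketch below assumes a little-endian ELF32 file; `EHDR32_FORMAT`, `Elf32Ehdr` and `parse_ehdr32` are hypothetical names, and the sample header is synthetic rather than taken from a real binary.

```python
import struct
from collections import namedtuple

# "<" = little-endian, no padding; 16s = e_ident, H = uint16_t, I = uint32_t.
# Assumption: 32-bit little-endian file (ElfN_Addr and ElfN_Off are 4 bytes).
EHDR32_FORMAT = "<16sHHIIIIIHHHHHH"

Elf32Ehdr = namedtuple(
    "Elf32Ehdr",
    "e_ident e_type e_machine e_version e_entry e_phoff e_shoff e_flags "
    "e_ehsize e_phentsize e_phnum e_shentsize e_shnum e_shstrndx",
)

def parse_ehdr32(data: bytes) -> Elf32Ehdr:
    """Unpack the first 52 bytes of a file as an Elf32_Ehdr."""
    size = struct.calcsize(EHDR32_FORMAT)
    return Elf32Ehdr._make(struct.unpack(EHDR32_FORMAT, data[:size]))

# Synthetic header for demonstration only: ET_EXEC (2) for EM_386 (3).
sample = struct.pack(
    EHDR32_FORMAT,
    b"\x7fELF\x01\x01\x01" + bytes(9),  # e_ident
    2, 3, 1,                            # e_type, e_machine, e_version
    0x08048000, 52, 0, 0,               # e_entry, e_phoff, e_shoff, e_flags
    52, 32, 1, 40, 0, 0,                # sizes, counts, e_shstrndx
)
hdr = parse_ehdr32(sample)
print(struct.calcsize(EHDR32_FORMAT))              # 52
print(hex(hdr.e_entry), hdr.e_phoff, hdr.e_phnum)  # 0x8048000 52 1
```

The computed size of 52 bytes is the value that e_ehsize reports for a 32-bit file, which makes for a quick sanity check on the format string.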

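We do not need to care about many of the fields in this structure. The ones that matter are those related to the program header table and the section header table, and their use is explained in detail later on. First, here is how to parse the ELF header in Python. The ELF file parsed by the code is available at https://pan.baidu.com/s/1YbApA8J_68E1UlLHpAtc9A (password: ao1d). The header parser is implemented as follows: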

    
 
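    import struct

    elf32_path = "/content/drive/My Drive/elf32/hello_world.o"

    # e_type values
    ET_REL = 1   # relocatable file (.o)
    ET_EXEC = 2  # executable file
    ET_DYN = 3   # shared object (dynamic library)

    # e_ident[4]: file class
    ELFCLASSNONE = 0
    ELFCLASS32 = 1
    ELFCLASS64 = 2

    # e_ident[5]: data encoding
    LITTLE_ENDIAN = 1
    BIG_ENDIAN = 2

    # Supported CPU types (e_machine)
    MACHINE_EM_386 = 3   # Intel 80386
    MACHINE_EM_860 = 7   # Intel 80860
    MACHINE_EM_S370 = 9  # IBM System/370
    VERSION_CURRENT = 1

    # p_type values
    PT_NONE = 0     # unused entry (PT_NULL in elf.h)
    PT_LOAD = 1     # the segment must be loaded into memory
    PT_DYNAMIC = 2  # dynamic linking information
    PT_INTERP = 3   # path of the program interpreter (the dynamic linker)
    PT_NOTE = 4     # auxiliary notes
    PT_SHLIB = 5    # reserved, should not appear
    PT_PHDR = 6     # describes the program header table itself

    # p_flags bits; each flag is its own bit so they can be tested with &
    PF_X = 1  # executable
    PF_W = 2  # writable
    PF_R = 4  # readable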
    def read_elf_header(binary_data):
        magic = binary_data[0:16]  # e_ident
        print("Magic: ", magic)
        elf_class = magic[4]  # 32-bit or 64-bit
        if elf_class == ELFCLASS32:
            print("class ELF32")
        elif elf_class == ELFCLASS64:
            print("class ELF64")
        endian = magic[5]
        if endian == LITTLE_ENDIAN:
            print("little endian")
        elif endian == BIG_ENDIAN:
            print("big endian")
        version = magic[6]
        if version == VERSION_CURRENT:
            print("Version Current")

        # Decode multi-byte fields in the file's own byte order, as unsigned values.
        # The offsets below follow the 32-bit layout (Elf32_Ehdr).
        order = ">" if endian == BIG_ENDIAN else "<"
        o_class = struct.unpack(order + "H", binary_data[16:18])[0]
        file_type = "type: "
        if o_class == ET_REL:
            file_type += "ET_REL"
        elif o_class == ET_EXEC:
            file_type += "ET_EXEC"
        elif o_class == ET_DYN:
            file_type += "ET_DYN"
        print(file_type)

        machine_type = struct.unpack(order + "H", binary_data[18:20])[0]
        if machine_type == MACHINE_EM_386:
            print("Machine: Intel 80386")
        obj_file_version = struct.unpack(order + "I", binary_data[20:24])[0]
        print("object file version: ", obj_file_version)
        virtual_entry = struct.unpack(order + "I", binary_data[24:28])[0]
        print("Entry point address: ", hex(virtual_entry))
        # Offset of the program header table within the file
        program_header_offset = struct.unpack(order + "I", binary_data[28:32])[0]
        print("program header offset: ", program_header_offset)
        # Offset of the section header table within the file
        section_header_offset = struct.unpack(order + "I", binary_data[32:36])[0]
        print("section header offset: ", section_header_offset)
        processor_flag = struct.unpack(order + "I", binary_data[36:40])[0]
        print("processor flag: ", processor_flag)
        this_header_size = struct.unpack(order + "H", binary_data[40:42])[0]
        print("size of this header: ", this_header_size)
        # Size of one program header table entry
        program_header_entry_size = struct.unpack(order + "H", binary_data[42:44])[0]
        print("program header entry size: ", program_header_entry_size)
        # Number of program header table entries
        program_entry_count = struct.unpack(order + "H", binary_data[44:46])[0]
        print("program header entry count: ", program_entry_count)
        # Size of one section header table entry
        section_header_entry_size = struct.unpack(order + "H", binary_data[46:48])[0]
        print("section header entry size: ", section_header_entry_size)
        # Number of section header table entries
        section_header_count = struct.unpack(order + "H", binary_data[48:50])[0]
        print("section header count: ", section_header_count)
        section_string_table = struct.unpack(order + "H", binary_data[50:52])[0]
        print("section string table index: ", section_string_table)
        return (program_header_offset, section_header_offset,
                program_header_entry_size, program_entry_count,
                section_header_entry_size, section_header_count)

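Next we read the program header table, whose contents readelf -l prints. Each program header entry tells the system how to load data from the ELF file into memory. Its data structure is: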

    
 
    typedef struct {
        uint32_t   p_type;
        Elf32_Off  p_offset;
        Elf32_Addr p_vaddr;
        Elf32_Addr p_paddr;
        uint32_t   p_filesz;
        uint32_t   p_memsz;
        uint32_t   p_flags;
        uint32_t   p_align;
    } Elf32_Phdr;

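p_type gives the type of data that the program header entry describes. The most important value is PT_LOAD, which means the described data must be loaded into memory. p_vaddr is the virtual address at which it is loaded, and p_paddr is the corresponding physical address. Readers familiar with computer architecture will know that virtual addresses are mapped to physical addresses through a series of translations; p_paddr is unused in the vast majority of cases. To keep the cognitive load as low as possible we ignore the other p_type values for now and explain them when they come up.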

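p_offset is the offset, from the beginning of the file, of the data (the segment) that the entry describes. p_filesz is the number of bytes the segment occupies in the file, and p_memsz is the number of bytes it occupies once loaded into memory. The two are usually equal, but p_memsz can be larger, typically because the segment contains uninitialized data such as .bss that takes no space in the file; the extra bytes are zero-filled at load time.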
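The relationship between p_offset, p_filesz and p_memsz can be made concrete with a toy loader step. This is only an illustrative sketch under simplifying assumptions: `segment_image` is a hypothetical helper, the "file" is sixteen made-up bytes, and a real loader maps pages rather than copying into a Python bytes object.

```python
def segment_image(file_bytes: bytes, p_offset: int, p_filesz: int, p_memsz: int) -> bytes:
    """Return the in-memory bytes of one segment: file content, then zero fill."""
    if p_memsz < p_filesz:
        raise ValueError("p_memsz must not be smaller than p_filesz")
    content = file_bytes[p_offset:p_offset + p_filesz]
    if len(content) != p_filesz:
        raise ValueError("segment extends past the end of the file")
    # Bytes beyond p_filesz (for example .bss) are not stored in the file
    # and start out as zeros in memory.
    return content + bytes(p_memsz - p_filesz)

fake_file = bytes(range(16))  # made-up file content: 00 01 02 ... 0f
image = segment_image(fake_file, p_offset=4, p_filesz=4, p_memsz=8)
print(image.hex())  # 0405060700000000
```

The first four bytes come from the file at offset 4, and the remaining four exist only in memory.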

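p_flags describes the permissions of the segment as a bit mask: PF_X (0x1) marks executable code, PF_W (0x2) marks writable data, and PF_R (0x4) marks readable data. p_align gives the alignment the segment requires. Values of 0 and 1 mean no alignment is required; otherwise it must be a positive power of 2, and p_vaddr % p_align == p_offset % p_align must hold. These details belong to computer architecture and we set them aside for now. Next, the code that parses the program header table: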
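Both rules in the paragraph above, the flag bits and the alignment constraint, fit into a short sketch. The flag values are the ones defined in elf.h; `flags_to_string` and `alignment_ok` are hypothetical helper names, and the addresses are example values rather than output from a real file.

```python
# Flag bits as defined in elf.h.
PF_X, PF_W, PF_R = 0x1, 0x2, 0x4

def flags_to_string(p_flags: int) -> str:
    """Render p_flags as a three-character string such as 'R-X'."""
    pairs = (("R", PF_R), ("W", PF_W), ("X", PF_X))
    return "".join(ch if p_flags & bit else "-" for ch, bit in pairs)

def alignment_ok(p_vaddr: int, p_offset: int, p_align: int) -> bool:
    """Check the constraint p_vaddr % p_align == p_offset % p_align."""
    if p_align in (0, 1):          # no alignment required
        return True
    if p_align & (p_align - 1):    # not a power of two
        return False
    return p_vaddr % p_align == p_offset % p_align

print(flags_to_string(PF_R | PF_X))                # R-X
print(alignment_ok(0x08048000, 0x0, 0x1000))       # True
print(alignment_ok(0x08048010, 0x0, 0x1000))       # False
```

Since each permission has its own bit, a code segment that is both readable and executable carries the value 5, which is why the flags are tested with `&` rather than `==`.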

    
 
    def readelf_program_header(binary_data, size, count):
        print("there are {} program header entries".format(count))
        for i in range(count):
            # Slice out the i-th entry without overwriting binary_data
            entry = binary_data[size * i : size * (i + 1)]
            # "<I": unsigned 32-bit, assuming a little-endian ELF32 file (e.g. i386)
            program_type = struct.unpack("<I", entry[0:4])[0]
            if program_type == PT_NONE:
                print("header type: PT_NONE")
            elif program_type == PT_LOAD:
                # Loadable segments hold code and data; these are the ones we care about
                print("header type: PT_LOAD")
            elif program_type == PT_DYNAMIC:
                print("header type: PT_DYNAMIC")
            elif program_type == PT_INTERP:
                print("header type: PT_INTERP")
            elif program_type == PT_NOTE:
                print("header type: PT_NOTE")
            elif program_type == PT_SHLIB:
                print("header type: PT_SHLIB")
            elif program_type == PT_PHDR:
                print("header type: PT_PHDR")
            else:
                print("header type hex: ", hex(program_type))

            header_offset = struct.unpack("<I", entry[4:8])[0]
            print("program header content offset: ", hex(header_offset))
            virtual_addr = struct.unpack("<I", entry[8:12])[0]
            print("program header content virtual address: ", hex(virtual_addr))
            physical_addr = struct.unpack("<I", entry[12:16])[0]
            print("program header content physical address: ", hex(physical_addr))
            header_file_size = struct.unpack("<I", entry[16:20])[0]
            print("program header file size: ", header_file_size)
            header_memory_size = struct.unpack("<I", entry[20:24])[0]
            print("program header memory size: ", header_memory_size)
            header_flags = struct.unpack("<I", entry[24:28])[0]
            if header_flags & PF_X:
                print("this segment can be executed")
            if header_flags & PF_R:
                print("this segment can be read")
            if header_flags & PF_W:
                print("this segment can be written")
            header_align = struct.unpack("<I", entry[28:32])[0]
            print("align value: ", header_align)

Finally, we connect the two parts:

    
 
    with open(elf32_path, 'rb') as f:
        binary_data = f.read()
        elf32_info = read_elf_header(binary_data)
        program_header_offset = elf32_info[0]
        program_header_entry_size = elf32_info[2]
        program_header_entry_count = elf32_info[3]
        print("header offset: ", program_header_offset)
        readelf_program_header(binary_data[program_header_offset:],
                               program_header_entry_size,
                               program_header_entry_count)

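After running the code above, the output closely matches what readelf -h and readelf -l report. A deep understanding of the ELF file structure and its loading process is the foundation for binary hijacking on Linux. The process is tedious and touches a lot of hardware and architecture knowledge that is rarely needed day to day. Whether you can chew through these dry details decides whether you have the perseverance to go far on the technical path and eventually stand out.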


This article is shared from the WeChat official account Coding迪斯尼 (gh_c9f933e7765d).