stackoverflow: Purpose of memory alignment (original + translation)

Source: https://stackoverflow.com/questions/381244/purpose-of-memory-alignment
Translation: joeyhtml

The memory subsystem on a modern processor is restricted to accessing memory at the granularity and alignment of its word size; this is the case for a number of reasons.
[Translator's note: as I understand it, this amounts to two constraints: a read must start at a suitably aligned address, and each access transfers one word.]
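[Translator's note: the following C11 snippet is not part of the original answer; it simply prints the alignment the compiler assumes for a few common types, which is what the granularity/alignment restriction above is about. The exact numbers depend on the target ABI.]

#include <stdio.h>
#include <stdalign.h>

int main(void) {
    // Typical values on a 32-bit target: char = 1, short = 2, int = 4.
    printf("alignof(char)  = %zu\n", alignof(char));
    printf("alignof(short) = %zu\n", alignof(short));
    printf("alignof(int)   = %zu\n", alignof(int));
    return 0;
}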

Speed

Modern processors have multiple levels of cache memory that data must be pulled through; supporting single-byte reads would make the memory subsystem throughput tightly bound to the execution unit throughput (aka cpu-bound); this is all reminiscent of how PIO mode was surpassed by DMA for many of the same reasons in hard drives.
The CPU always reads at its word size (4 bytes on a 32-bit processor), so when you do an unaligned address access -- on a processor that supports it -- the processor is going to read multiple words. The CPU will read each word of memory that your requested address straddles. This causes an amplification of up to 2X the number of memory transactions required to access the requested data.
[Translator's note: the word size is 8 bytes on a 64-bit processor, and some processors do not support unaligned memory access at all.]
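[Translator's note: the following helper is an illustration added for this translation, not part of the original answer. It counts how many naturally aligned words an access touches, which is where the up-to-2X figure comes from.]

#include <stdio.h>

// Number of aligned words touched by an access of `size` bytes starting
// at byte address `addr`, on a machine with `word`-byte words.
unsigned words_touched(unsigned addr, unsigned size, unsigned word) {
    unsigned first = addr / word;              // index of the first word covered
    unsigned last  = (addr + size - 1) / word; // index of the last word covered
    return last - first + 1;
}

int main(void) {
    // With a 4-byte word: an aligned 4-byte read touches 1 word,
    // while the same read starting at address 0x0001 straddles 2 words.
    printf("%u\n", words_touched(0x0000, 4, 4)); // prints 1
    printf("%u\n", words_touched(0x0001, 4, 4)); // prints 2
    return 0;
}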
Because of this, it can very easily be slower to read two bytes than four. For example, say you have a struct in memory that looks like this:
[Translator's note: for example, if the two bytes straddle two words they take two reads, while four bytes that sit in one aligned word take only one.]

struct mystruct {
    char c; // one byte
    int i; // four bytes
    short s; // two bytes
};

On a 32-bit processor it would most likely be aligned like shown here:
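[Translator's note: the layout below is this translation's illustration of "like shown here", assuming a typical 32-bit ABI where int must start on a 4-byte boundary.]

offset 0x0000: c        (1 byte)
offset 0x0001: padding  (3 bytes, so that i starts on a 4-byte boundary)
offset 0x0004: i        (4 bytes)
offset 0x0008: s        (2 bytes)
offset 0x000A: padding  (2 bytes, so the struct size stays a multiple of 4)
sizeof(struct mystruct) == 12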

The processor can read each of these members in one transaction.
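[Translator's note: one way to verify this layout on a given compiler; this program is an addition for this translation, not part of the original answer.]

#include <stdio.h>
#include <stddef.h>

struct mystruct {
    char c;  // one byte
    int i;   // four bytes
    short s; // two bytes
};

int main(void) {
    // On a typical 32-bit ABI this prints offsets 0, 4, 8 and a total
    // size of 12, showing the padding the compiler inserts.
    printf("c at %zu, i at %zu, s at %zu, sizeof = %zu\n",
           offsetof(struct mystruct, c),
           offsetof(struct mystruct, i),
           offsetof(struct mystruct, s),
           sizeof(struct mystruct));
    return 0;
}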
Say you had a packed version of the struct, maybe from the network where it was packed for transmission efficiency; it might look something like this:
[Translator's note: the packed struct is essentially the unaligned version.]
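[Translator's note: the packed layout below is this translation's illustration, chosen to be consistent with the byte addresses used in the discussion that follows. With GCC or Clang such a layout can typically be requested with __attribute__((packed)); MSVC offers #pragma pack(1).]

offset 0x0000: c (1 byte)
offset 0x0001: i (4 bytes, straddling the aligned words at 0x0000 and 0x0004)
offset 0x0005: s (2 bytes, inside the word at 0x0004)
total size: 7 bytes, no padding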

Reading the first byte is going to be the same.
When you ask the processor to give you 16 bits from 0x0005 it will have to read a word from 0x0004 and shift left 1 byte to place it in a 16-bit register; some extra work, but most can handle that in one cycle.
[Translator's note: presumably one bus cycle.]
When you ask for 32 bits from 0x0001 you'll get a 2X amplification. The processor will read from 0x0000 into the result register and shift left 1 byte, then read again from 0x0004 into a temporary register, shift right 3 bytes, then OR it with the result register.
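[Translator's note: the sketch below, added for this translation, shows what the hardware (or a trap handler) effectively does for the 32-bit read at 0x0001: two aligned word reads combined with shifts and an OR. It assumes a little-endian machine, where the shift directions come out as the mirror image of the description above.]

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Emulate an unaligned 32-bit load at byte offset 1 with two aligned loads.
uint32_t load32_at_offset1(const uint32_t *mem) {
    uint32_t lo = mem[0];           // aligned word at 0x0000
    uint32_t hi = mem[1];           // aligned word at 0x0004
    return (lo >> 8) | (hi << 24);  // drop byte 0, splice in byte 4
}

int main(void) {
    uint8_t bytes[8] = {0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77};
    uint32_t words[2];
    memcpy(words, bytes, sizeof words);  // word-aligned copy of the buffer
    printf("0x%08" PRIx32 "\n", load32_at_offset1(words)); // prints 0x44332211
    return 0;
}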

Range

For any given address space, if the architecture can assume that the 2 LSBs are always 0 (e.g., 32-bit machines) then it can access 4 times more memory (the 2 saved bits can represent 4 distinct states). Taking the 2 LSBs off of an address gives you a 4-byte alignment, also referred to as a stride of 4 bytes. Each time an address is incremented it is effectively incrementing bit 2, not bit 0, i.e., the last 2 bits will always continue to be 00.
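[Translator's note: a small illustration, added for this translation, of the stride-of-4 idea: if addresses on the bus are word indices rather than byte addresses, the same number of bits reaches four times as much memory.]

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

// With 4-byte alignment the bus only needs the word index; the two low
// bits of the byte address are implicitly 00.
uint32_t byte_to_word_index(uint32_t byte_addr) { return byte_addr >> 2; }
uint32_t word_index_to_byte(uint32_t word_index) { return word_index << 2; }

int main(void) {
    // A 30-bit word index covers 2^30 words = 4 GiB of bytes, where a
    // 30-bit byte address would only cover 1 GiB.
    uint32_t addr = 0x0000100C;               // already a multiple of 4
    uint32_t idx  = byte_to_word_index(addr); // 0x00000403
    printf("word index 0x%08" PRIx32 " -> byte address 0x%08" PRIx32 "\n",
           idx, word_index_to_byte(idx));
    return 0;
}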
This can even affect the physical design of the system. If the address bus needs 2 fewer bits, there can be 2 fewer pins on the CPU, and 2 fewer traces on the circuit board.

Atomicity

The CPU can operate on an aligned word of memory atomically, meaning that no other instruction can interrupt that operation. This is critical to the correct operation of many lock-free data structures and other concurrency paradigms.
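[Translator's note: a minimal C11 sketch, added for this translation, of why alignment matters for atomicity: an _Atomic object is kept naturally aligned by the compiler, so a word-sized load or store happens in one memory transaction and cannot be observed half-written.]

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

// A naturally aligned 32-bit counter; concurrent increments cannot tear.
static _Atomic uint32_t counter;

int main(void) {
    atomic_fetch_add(&counter, 1);                    // one indivisible read-modify-write
    printf("%u\n", (unsigned)atomic_load(&counter));  // prints 1
    return 0;
}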

Conclusion

The memory system of a processor is quite a bit more complex and involved than described here; a discussion on how an x86 processor actually addresses memory can help (many processors work similarly).
There are many more benefits to adhering to memory alignment that you can read at this IBM article.
A computer's primary use is to transform data. Modern memory architectures and technologies have been optimized over decades to facilitate getting more data in, out, and between more and faster execution units, in a highly reliable way.
