What Every Programmer Should Know About Memory

The original article is excellent, but the quality of the existing translation is not great, so this post revises it on the basis of that translation.

Reposted from http://www.oschina.net/translate/what-every-programmer-should-know-about-memory-part1

 

Editor's introduction: Ulrich Drepper recently approached us asking if we would be interested in publishing a lengthy document he had written on how memory and software interact. We did not have to look at the text for long to realize that it would be of interest to many LWN readers. Memory usage is often the determining factor in how software performs, but good information on how to avoid memory bottlenecks is hard to find. This series of articles should change that situation.

The original document prints out at over 100 pages. We will be splitting it into about seven segments, each run 1-2 weeks after its predecessor. Once the entire series is out, Ulrich will be releasing the full text.

Reformatting the text from the original LaTeX has been a bit of a challenge, but the results, hopefully, will be good. For ease of online reading, Ulrich's footnotes have been placed {inline in the text}. Hyperlinked cross-references (and [bibliography references]) will not be possible until the full series is published.

Many thanks to Ulrich for allowing LWN to publish this material; we hope that it will lead to more memory-efficient software across our systems in the near future.

 

Translation

Editor's note: Ulrich Drepper recently asked us whether we would be interested in publishing a lengthy document he had written on memory. It did not take much reading to realize that LWN's readers would enjoy it. Memory usage is often the decisive factor in software performance, yet good articles on how to avoid memory bottlenecks are hard to find. This series should help.

His original text is long, over 100 pages. We have split it into seven installments, publishing one every week or two. Once all seven are out, Ulrich will release the full text.

Reformatting the original was quite a challenge; hopefully the result turned out well. For ease of online reading, Ulrich's footnotes have been {placed inline in the text}, while cross-reference hyperlinks (and [bibliography references]) will only be available once the full series is published.

Many thanks to Ulrich for letting LWN publish this material; we hope it leads to more memory-efficient software in the near future.

 

1 Introduction

In the early days computers were much simpler. The various components of a system, such as the CPU, memory, mass storage, and network interfaces, were developed together and, as a result, were quite balanced in their performance. For example, the memory and network interfaces were not (much) faster than the CPU at providing data.

This situation changed once the basic structure of computers stabilized and hardware developers concentrated on optimizing individual subsystems. Suddenly the performance of some components of the computer fell significantly behind and bottlenecks developed. This was especially true for mass storage and memory subsystems which, for cost reasons, improved more slowly relative to other components.

Translation

1 Introduction

Early computers were much simpler than today's. The various components of a system, such as the CPU, memory, mass storage, and network interfaces, were developed together and therefore performed quite evenly. For example, memory and the network interfaces were not (much) faster than the CPU at providing data.

This changed once the basic structure of computers stabilized and hardware developers began concentrating on optimizing individual subsystems. The performance of some components then fell far behind and they became bottlenecks. For cost reasons, the mass storage and memory subsystems improved more slowly than other components.

 

The slowness of mass storage has mostly been dealt with using software techniques: operating systems keep most often used (and most likely to be used) data in main memory, which can be accessed at a rate orders of magnitude faster than the hard disk. Cache storage was added to the storage devices themselves, which requires no changes in the operating system to increase performance. {Changes are needed, however, to guarantee data integrity when using storage device caches.} For the purposes of this paper, we will not go into more details of software optimizations for the mass storage access.

Unlike storage subsystems, removing the main memory as a bottleneck has proven much more difficult and almost all solutions require changes to the hardware. Today these changes mainly come in the following forms:

  • RAM hardware design (speed and parallelism).
  • Memory controller designs.
  • CPU caches.
  • Direct memory access (DMA) for devices.

 

Translation

The slowness of mass storage is mostly dealt with using software techniques: the operating system keeps the most frequently used (and most likely to be used) data in main memory, which can be accessed orders of magnitude faster than the hard disk, or a cache is added to the storage device itself, which raises performance without requiring changes to the operating system. {Some changes are still needed, however, to guarantee data integrity when these caches are used.} These topics are outside the scope of this paper and will not be covered further.

Unlike the storage subsystem, removing main memory as a bottleneck is very difficult, and almost every solution requires changes to the hardware. Today these changes mainly come in the following forms:

  • RAM hardware design (speed and parallelism)
  • Memory controller design
  • CPU caches
  • Direct memory access (DMA) for devices
For the most part, this document will deal with CPU caches and some effects of memory controller design. In the process of exploring these topics, we will explore DMA and bring it into the larger picture. However, we will start with an overview of the design for today's commodity hardware. This is a prerequisite to understanding the problems and the limitations of efficiently using memory subsystems. We will also learn about, in some detail, the different types of RAM and illustrate why these differences still exist.

 

This document is in no way all inclusive and final. It is limited to commodity hardware and further limited to a subset of that hardware. Also, many topics will be discussed in just enough detail for the goals of this paper. For such topics, readers are recommended to find more detailed documentation.

Translation

This paper deals mostly with CPU caches and some of the effects of memory controller design. In exploring these topics we will also look at DMA and fit it into the larger picture. First, however, we start with an overview of the design of today's commodity hardware; this is a prerequisite for understanding the problems and limitations of using the memory subsystem efficiently. We will also look, in some detail, at the different types of RAM and explain why these differences still exist.

This document is in no way exhaustive or final. It is limited to commodity hardware, and further to a subset of that hardware. Many topics are discussed only in enough detail for the goals of this paper; for those topics, readers are encouraged to find more detailed documentation.

 

When it comes to operating-system-specific details and solutions, the text exclusively describes Linux. At no time will it contain any information about other OSes. The author has no interest in discussing the implications for other OSes. If the reader thinks s/he has to use a different OS they have to go to their vendors and demand they write documents similar to this one.

One last comment before the start. The text contains a number of occurrences of the term 「usually」 and other, similar qualifiers. The technology discussed here exists in many, many variations in the real world and this paper only addresses the most common, mainstream versions. It is rare that absolute statements can be made about this technology, thus the qualifiers.

Translation

When it comes to operating-system-specific details and solutions, the text describes Linux exclusively. It will never contain information about other OSes; the author has no interest in discussing them. Readers who feel they have to use a different OS should demand that their vendors write a document like this one for it.

One last comment before we start: the text contains many occurrences of the term 「usually」 and similar qualifiers. The technology discussed here exists in many different real-world variations, and this paper only addresses the most common, mainstream versions, so absolute statements can rarely be made.

 

1.1 Document Structure

This document is mostly for software developers. It does not go into enough technical details of the hardware to be useful for hardware-oriented readers. But before we can go into the practical information for developers a lot of groundwork must be laid.

To that end, the second section describes random-access memory (RAM) in technical detail. This section's content is nice to know but not absolutely critical to be able to understand the later sections. Appropriate back references to the section are added in places where the content is required so that the anxious reader could skip most of this section at first.

The third section goes into a lot of details of CPU cache behavior. Graphs have been used to keep the text from being as dry as it would otherwise be. This content is essential for an understanding of the rest of the document. Section 4 describes briefly how virtual memory is implemented. This is also required groundwork for the rest.

Translation

1.1 Document Structure

This document is written mainly for software developers. It does not go deep enough into hardware details to be useful to hardware-oriented readers. But before we can get to the practical information for developers, a good deal of groundwork has to be laid.

On that basis, the second section describes random-access memory (RAM) in technical detail. Its content is good to know but not strictly required for understanding the later sections; back references are added where the material is needed, so impatient readers can skip most of it at first.

The third section covers CPU cache behavior in detail. Graphs are used to keep the text from being too dry. This part is essential for understanding the rest of the paper. Section 4 briefly describes how virtual memory is implemented, which is also necessary background for what follows.

 

Section 5 goes into a lot of detail about Non Uniform Memory Access (NUMA) systems.

Section 6 is the central section of this paper. It brings together all the previous sections' information and gives programmers advice on how to write code which performs well in the various situations. The very impatient reader could start with this section and, if necessary, go back to the earlier sections to freshen up the knowledge of the underlying technology.

Section 7 introduces tools which can help the programmer do a better job. Even with a complete understanding of the technology it is far from obvious where in a non-trivial software project the problems are. Some tools are necessary.

In section 8 we finally give an outlook of technology which can be expected in the near future or which might just simply be good to have.

 

Translation

Section 5 goes into detail about Non Uniform Memory Access (NUMA) systems.

Section 6 is the central part of this paper. It pulls together the information from the earlier sections and gives programmers advice on how to write code that performs well in various situations. Very impatient readers can start with this section and, when necessary, go back to the earlier chapters to refresh the background knowledge.

Section 7 introduces tools that help programmers do a better job. Even with a complete understanding of the technology, it is far from obvious where the problems in a non-trivial software project are; some tools are necessary.

Finally, section 8 gives an outlook on technology we can expect in the near future, or that would simply be good to have.

 

1.2 Reporting Problems

The author intends to update this document for some time. This includes updates made necessary by advances in technology but also to correct mistakes. Readers willing to report problems are encouraged to send email.

1.3 Thanks

I would like to thank Johnray Fuller and especially Jonathan Corbet for taking on part of the daunting task of transforming the author's form of English into something more traditional. Markus Armbruster provided a lot of valuable input on problems and omissions in the text.

1.4 About this Document

The title of this paper is an homage to David Goldberg's classic paper 「What Every Computer Scientist Should Know About Floating-Point Arithmetic」 [goldberg]. Goldberg's paper is still not widely known, although it should be a prerequisite for anybody daring to touch a keyboard for serious programming.

 

Translation

1.2 Reporting Problems

The author intends to update this document from time to time, both to keep up with advances in technology and to correct mistakes. Readers willing to report problems are very welcome to send email.

1.3 Thanks

I would like to thank Johnray Fuller and especially Jonathan Corbet for turning the author's English into a more conventional form. Markus Armbruster provided a lot of valuable input on problems and omissions in the text.

1.4 About this Document

The title of this paper is an homage to David Goldberg's classic paper 「What Every Computer Scientist Should Know About Floating-Point Arithmetic」 [goldberg]. Goldberg's paper is still not widely known, although it should be a prerequisite for anyone who dares touch a keyboard for serious programming.

 

2 Commodity Hardware Today

Understanding commodity hardware is important because specialized hardware is in retreat. Scaling these days is most often achieved horizontally instead of vertically, meaning today it is more cost-effective to use many smaller, connected commodity computers instead of a few really large and exceptionally fast (and expensive) systems. This is the case because fast and inexpensive network hardware is widely available. There are still situations where the large specialized systems have their place and these systems still provide a business opportunity, but the overall market is dwarfed by the commodity hardware market. Red Hat, as of 2007, expects that for future products, the 「standard building blocks」 for most data centers will be a computer with up to four sockets, each filled with a quad core CPU that, in the case of Intel CPUs, will be hyper-threaded. {Hyper-threading enables a single processor core to be used for two or more concurrent executions with just a little extra hardware.} This means the standard system in the data center will have up to 64 virtual processors. Bigger machines will be supported, but the quad socket, quad CPU core case is currently thought to be the sweet spot and most optimizations are targeted for such machines.

Translation

2 Commodity Hardware Today

Since specialized hardware is in retreat, understanding commodity hardware matters. Scaling these days is mostly done horizontally: large numbers of small, interconnected commodity computers are used instead of a few huge, extremely fast (but extremely expensive) systems, because fast and cheap network hardware is widely available. Large specialized systems still have their place, but the commodity hardware market dwarfs them. In 2007, Red Hat expected that the 「standard building block」 of future data centers would be a computer with up to four sockets, each holding a quad-core CPU which, in Intel's case, would be hyper-threaded. {Hyper-threading lets a single processor core handle two or more concurrent executions with only a little extra hardware.} That means the standard data center system will have up to 64 virtual processors. Bigger machines will be supported, but the quad-socket, quad-core configuration is considered the sweet spot, and most optimizations target such machines.

 

Large differences exist in the structure of commodity computers. That said, we will cover more than 90% of such hardware by concentrating on the most important differences. Note that these technical details tend to change rapidly, so the reader is advised to take the date of this writing into account.

Over the years the personal computers and smaller servers standardized on a chipset with two parts: the Northbridge and Southbridge. Figure 2.1 shows this structure.

 

Figure 2.1: Structure with Northbridge and Southbridge

All CPUs (two in the previous example, but there can be more) are connected via a common bus (the Front Side Bus, FSB) to the Northbridge. The Northbridge contains, among other things, the memory controller, and its implementation determines the type of RAM chips used for the computer. Different types of RAM, such as DRAM, Rambus, and SDRAM, require different memory controllers.

Translation

Large differences exist between commodity computers. Still, by concentrating on the most important differences we can cover more than 90% of such hardware. Note that these technical details change rapidly, so readers should take the date of writing into account.

Over the years, personal computers and smaller servers standardized on a chipset made of two parts: the Northbridge and the Southbridge, shown in Figure 2.1.

Figure 2.1: Structure with Northbridge and Southbridge

The CPUs are connected via a common bus (the Front Side Bus, FSB) to the Northbridge. Among other things, the Northbridge contains the memory controller, whose implementation determines the type of RAM chips used in the computer. Different types of RAM, such as DRAM, Rambus, and SDRAM, require different memory controllers.

 

To reach all other system devices, the Northbridge must communicate with the Southbridge. The Southbridge, often referred to as the I/O bridge, handles communication with devices through a variety of different buses. Today the PCI, PCI Express, SATA, and USB buses are of most importance, but PATA, IEEE 1394, serial, and parallel ports are also supported by the Southbridge. Older systems had AGP slots which were attached to the Northbridge. This was done for performance reasons related to insufficiently fast connections between the Northbridge and Southbridge. However, today the PCI-E slots are all connected to the Southbridge.

Such a system structure has a number of noteworthy consequences:

  • All data communication from one CPU to another must travel over the same bus used to communicate with the Northbridge.
  • All communication with RAM must pass through the Northbridge.
  • The RAM has only a single port. {We will not discuss multi-port RAM in this document as this type of RAM is not found in commodity hardware, at least not in places where the programmer has access to it. It can be found in specialized hardware such as network routers which depend on utmost speed.}
  • Communication between a CPU and a device attached to the Southbridge is routed through the Northbridge.

 

Translation

To reach all other system devices, the Northbridge communicates with the Southbridge. The Southbridge, often called the I/O bridge, talks to devices through a variety of buses. The most important today are PCI, PCI Express, SATA, and USB; the Southbridge also supports PATA, IEEE 1394, serial, and parallel ports. Older systems had AGP slots attached to the Northbridge, a measure taken because the link between Northbridge and Southbridge was not fast enough. Today's PCI-E slots are all connected to the Southbridge.

This structure has a number of noteworthy consequences:

  • All data from one CPU to another must travel over the same bus used to communicate with the Northbridge.
  • All communication with RAM must pass through the Northbridge.
  • The RAM has only a single port. {We will not discuss multi-port RAM in this document since commodity hardware does not use it, at least not anywhere the programmer can reach it. It is found in specialized hardware such as network routers, which depend on utmost speed.}
  • Communication between a CPU and a device attached to the Southbridge is routed through the Northbridge.
A couple of bottlenecks are immediately apparent in this design. One such bottleneck involves access to RAM for devices. In the earliest days of the PC, all communication with devices on either bridge had to pass through the CPU, negatively impacting overall system performance. To work around this problem some devices became capable of direct memory access (DMA). DMA allows devices, with the help of the Northbridge, to store and receive data in RAM directly without the intervention of the CPU (and its inherent performance cost). Today all high-performance devices attached to any of the buses can utilize DMA. While this greatly reduces the workload on the CPU, it also creates contention for the bandwidth of the Northbridge as DMA requests compete with RAM access from the CPUs. This problem, therefore, must be taken into account.

Translation

With this design, bottlenecks appear immediately. The first involves device access to RAM. In the earliest days of the PC, all communication with devices on either bridge had to pass through the CPU, which hurt overall system performance badly. To work around this, some devices gained the ability to do direct memory access (DMA). DMA allows a device, with the help of the Northbridge, to read and write RAM directly without involving the CPU. Today all high-performance devices on any of the buses can use DMA. While DMA greatly reduces the CPU's workload, it competes with the CPUs for Northbridge bandwidth, which has to be taken into account.

 

A second bottleneck involves the bus from the Northbridge to the RAM. The exact details of the bus depend on the memory types deployed. On older systems there is only one bus to all the RAM chips, so parallel access is not possible. Recent RAM types require two separate buses (or channels as they are called for DDR2, see Figure 2.8) which doubles the available bandwidth. The Northbridge interleaves memory access across the channels. More recent memory technologies (FB-DRAM, for instance) add more channels.

With limited bandwidth available, it is important to schedule memory access in ways that minimize delays. As we will see, processors are much faster and must wait to access memory, despite the use of CPU caches. If multiple hyper-threads, cores, or processors access memory at the same time, the wait times for memory access are even longer. This is also true for DMA operations.

Translation

The second bottleneck is the bus from the Northbridge to the RAM. Its exact details depend on the memory type. Older systems have a single bus to all the RAM chips, so parallel access is not possible. Recent RAM types require two separate buses (or channels, as DDR2 calls them; see Figure 2.8), which doubles the available bandwidth. The Northbridge interleaves memory accesses across the channels. Newer memory technologies (FB-DRAM, for instance) add even more channels.

With limited bandwidth available, memory accesses must be scheduled in a way that minimizes delays. As we will see, processors are much faster than memory and have to wait for it, despite the CPU caches. If multiple hyper-threads, cores, or processors access memory at the same time, the waits get even longer. The same is true for DMA operations.

 

There is more to accessing memory than concurrency, however. Access patterns themselves also greatly influence the performance of the memory subsystem, especially with multiple memory channels. Refer to Section 2.2 for more details of RAM access patterns.

On some more expensive systems, the Northbridge does not actually contain the memory controller. Instead the Northbridge can be connected to a number of external memory controllers (in the following example, four of them).

Figure 2.2: Northbridge with External Controllers

 

Translation

Beyond concurrency, the access patterns themselves also greatly influence the performance of the memory subsystem, especially with multiple memory channels. See Section 2.2 for more on RAM access patterns.

On some more expensive systems, the Northbridge does not contain the memory controller itself; instead it is connected to several external memory controllers (four of them in the following example).

Figure 2.2: Northbridge with External Controllers

The advantage of this architecture is that more than one memory bus exists and therefore total bandwidth increases. This design also supports more memory. Concurrent memory access patterns reduce delays by simultaneously accessing different memory banks. This is especially true when multiple processors are directly connected to the Northbridge, as in Figure 2.2. For such a design, the primary limitation is the internal bandwidth of the Northbridge, which is phenomenal for this architecture (from Intel). { For completeness it should be mentioned that such a memory controller arrangement can be used for other purposes such as 「memory RAID」 which is useful in combination with hotplug memory.}

Translation

The advantage of this architecture is that more than one memory bus exists, so total bandwidth increases, and more memory can be supported. Accessing different memory banks simultaneously also reduces delays, which is especially effective when multiple processors are connected directly to the Northbridge as in Figure 2.2. The primary limitation of such a design is the internal bandwidth of the Northbridge, which is phenomenal for this architecture (from Intel). {For completeness it should be mentioned that such a memory controller arrangement can also be used for other purposes, such as 「memory RAID」, which is useful in combination with hotplug memory.}

 

Using multiple external memory controllers is not the only way to increase memory bandwidth. One other increasingly popular way is to integrate memory controllers into the CPUs and attach memory to each CPU. This architecture is made popular by SMP systems based on AMD's Opteron processor. Figure 2.3 shows such a system. Intel will have support for the Common System Interface (CSI) starting with the Nehalem processors; this is basically the same approach: an integrated memory controller with the possibility of local memory for each processor.

Figure 2.3: Integrated Memory Controller

With an architecture like this there are as many memory banks available as there are processors. On a quad-CPU machine the memory bandwidth is quadrupled without the need for a complicated Northbridge with enormous bandwidth. Having a memory controller integrated into the CPU has some additional advantages; we will not dig deeper into this technology here.

Translation

Using external memory controllers is not the only option; another increasingly popular approach is to integrate the controllers into the CPU and attach memory directly to each CPU. This architecture became popular with SMP systems based on AMD's Opteron processor and is shown in Figure 2.3. Intel will support the Common System Interface (CSI) starting with the Nehalem processors, which is basically the same approach: an integrated memory controller with local memory for each processor.

Figure 2.3: Integrated Memory Controller

With such an architecture there are as many memory banks as there are processors. On a quad-CPU machine, memory bandwidth is quadrupled without needing a complicated Northbridge with enormous bandwidth. Integrating the memory controller into the CPU has some further advantages which we will not go into here.

 

There are disadvantages to this architecture, too. First of all, because the machine still has to make all the memory of the system accessible to all processors, the memory is not uniform anymore (hence the name NUMA - Non-Uniform Memory Architecture - for such an architecture). Local memory (memory attached to a processor) can be accessed with the usual speed. The situation is different when memory attached to another processor is accessed. In this case the interconnects between the processors have to be used. To access memory attached to CPU2 from CPU1 requires communication across one interconnect. When the same CPU accesses memory attached to CPU4 two interconnects have to be crossed.

Each such communication has an associated cost. We talk about 「NUMA factors」 when we describe the extra time needed to access remote memory. The example architecture in Figure 2.3 has two levels for each CPU: immediately adjacent CPUs and one CPU which is two interconnects away. With more complicated machines the number of levels can grow significantly. There are also machine architectures (for instance IBM's x445 and SGI's Altix series) where there is more than one type of connection. CPUs are organized into nodes; within a node the time to access the memory might be uniform or have only small NUMA factors. The connection between nodes can be very expensive, though, and the NUMA factor can be quite high.

Translation

There are disadvantages too. First, since the machine still has to make all of the system's memory accessible to all processors, memory is no longer uniform (hence the name NUMA, Non-Uniform Memory Architecture). A processor can access its local memory (the memory attached to it) at normal speed, but accessing memory attached to another processor requires the interconnects between the processors. For CPU 1 to access the memory of CPU 2 takes one interconnect; to access the memory of CPU 4 it has to cross two interconnects.

Each such communication has a cost. We speak of 「NUMA factors」 to describe the extra time needed to access remote memory. In Figure 2.3 each CPU has two levels: immediately adjacent CPUs, and a CPU that is two interconnects away. In more complicated machines the number of levels grows considerably. Some machine architectures, for instance IBM's x445 and SGI's Altix series, even have more than one type of connection: CPUs are organized into nodes, within which memory access times are uniform or have only small NUMA factors, while the connections between nodes are very expensive and the NUMA factors can be quite high.
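To get a feel for the NUMA factors of a concrete machine, the node distance table exported by the kernel can be queried with libnuma. The following minimal sketch is my own illustration, not part of the original text; it assumes libnuma is installed and the program is linked with -lnuma. By convention a distance of 10 means local memory, and larger values mean proportionally more expensive access.

```c
/* Print the NUMA distance ("NUMA factor") matrix of the machine. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "this system has no NUMA support\n");
        return 1;
    }

    int max = numa_max_node();
    for (int from = 0; from <= max; ++from) {
        for (int to = 0; to <= max; ++to)
            printf("%4d", numa_distance(from, to));  /* 10 == local node */
        printf("\n");
    }
    return 0;
}
```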

 

Commodity NUMA machines exist today and will likely play an even greater role in the future. It is expected that, from late 2008 on, every SMP machine will use NUMA. The costs associated with NUMA make it important to recognize when a program is running on a NUMA machine. In Section 5 we will discuss more machine architectures and some technologies the Linux kernel provides for these programs.

Beyond the technical details described in the remainder of this section, there are several additional factors which influence the performance of RAM. They are not controllable by software, which is why they are not covered in this section. The interested reader can learn about some of these factors in Section 2.1. They are really only needed to get a more complete picture of RAM technology and possibly to make better decisions when purchasing computers.

The following two sections discuss hardware details at the gate level and the access protocol between the memory controller and the DRAM chips. Programmers will likely find this information enlightening since these details explain why RAM access works the way it does. It is optional knowledge, though, and the reader anxious to get to topics with more immediate relevance for everyday life can jump ahead to Section 2.2.5.

Translation

Commodity NUMA machines exist today and will likely play an even greater role in the future. It is expected that from late 2008 on, every SMP machine will use NUMA. The costs associated with NUMA make it important to recognize when a program is running on a NUMA machine. Section 5 discusses more architectures, as well as some technologies the Linux kernel provides for such programs.

Beyond the technical details covered in the rest of this section, there are several other factors that influence RAM performance. They cannot be controlled by software, so they are not covered here; the interested reader can find some of them in Section 2.1. They mainly serve to complete the picture of RAM technology, and possibly to help in making better decisions when buying computers.

The following two sections discuss hardware details at the gate level and the access protocol between the memory controller and the DRAM chips. This knowledge explains why RAM access works the way it does, and programmers may find it enlightening. It is optional, though; readers anxious to get to more immediately relevant topics can jump ahead to Section 2.2.5.

 

2.1 RAM Types

There have been many types of RAM over the years and each type varies, sometimes significantly, from the other. The older types are today really only interesting to the historians. We will not explore the details of those. Instead we will concentrate on modern RAM types; we will only scrape the surface, exploring some details which are visible to the kernel or application developer through their performance characteristics.

The first interesting details are centered around the question why there are different types of RAM in the same machine. More specifically, why there are both static RAM (SRAM {In other contexts SRAM might mean 「synchronous RAM」.}) and dynamic RAM (DRAM). The former is much faster and provides the same functionality. Why is not all RAM in a machine SRAM? The answer is, as one might expect, cost. SRAM is much more expensive to produce and to use than DRAM. Both these cost factors are important, the second one increasing in importance more and more. To understand these difference we look at the implementation of a bit of storage for both SRAM and DRAM.

In the remainder of this section we will discuss some low-level details of the implementation of RAM. We will keep the level of detail as low as possible. To that end, we will discuss the signals at a 「logic level」 and not at a level a hardware designer would have to use. That level of detail is unnecessary for our purpose here.

Translation

2.1 RAM Types

Over the years there have been many types of RAM, each differing from the others, sometimes significantly. The very old types are of interest only to historians and we will not examine them. Instead we concentrate on modern RAM types, scratching only the surface to look at some details that kernel and application developers can see through their performance characteristics.

The first interesting question is why there are different types of RAM in the same machine, or more specifically, why there is both static RAM (SRAM {SRAM can also mean 「synchronous RAM」.}) and dynamic RAM (DRAM). The former is much faster and provides the same functionality, so why not use SRAM for everything? The answer, as one might expect, is cost: SRAM is much more expensive to produce and to use than DRAM. Both cost factors matter, and the second is becoming ever more important. To understand the difference we look at how a single bit of storage is implemented in SRAM and in DRAM.

In the rest of this section we discuss some low-level details of RAM implementation. We will keep the level of detail as low as possible, discussing signals at the 「logic level」 rather than at the level a hardware designer would use, since that is unnecessary for our purposes.

 

2.1.1 Static RAM

 

 

Figure 2.4: 6-T Static RAM

 

Figure 2.4 shows the structure of a 6 transistor SRAM cell. The core of this cell is formed by the four transistors M1 to M4 which form two cross-coupled inverters. They have two stable states, representing 0 and 1 respectively. The state is stable as long as power on Vdd is available.

If access to the state of the cell is needed the word access line WL is raised. This makes the state of the cell immediately available for reading on BL and BL. If the cell state must be overwritten the BL and BL lines are first set to the desired values and then WL is raised. Since the outside drivers are stronger than the four transistors (M1 through M4) this allows the old state to be overwritten.

See [sramwiki] for a more detailed description of the way the cell works. For the following discussion it is important to note that

  • one cell requires six transistors. There are variants with four transistors but they have disadvantages.
  • maintaining the state of the cell requires constant power.
  • the cell state is available for reading almost immediately once the word access line WL is raised. The signal is as rectangular (changing quickly between the two binary states) as other transistor-controlled signals.
  • the cell state is stable, no refresh cycles are needed.

There are other, slower and less power-hungry, SRAM forms available, but those are not of interest here since we are looking at fast RAM. These slow variants are mainly interesting because they can be more easily used in a system than dynamic RAM because of their simpler interface.

 

Translation

2.1.1 Static RAM

Figure 2.4: 6-T Static RAM

Figure 2.4 shows a 6-transistor SRAM cell. Its core is the four transistors M1 to M4, which form two cross-coupled inverters. They have two stable states, representing 0 and 1, and the state is stable as long as power on Vdd is available.

To access the state of the cell, the word access line WL is raised; the state can then be read on BL and BL. To overwrite the state, BL and BL are first set to the desired values and then WL is raised. Since the external drivers are stronger than the four internal transistors, the old state is overwritten.

See [sramwiki] for more details on how the cell works. For the following discussion it is important to note that:

  • one cell requires six transistors (there are four-transistor variants, but they have disadvantages);
  • maintaining the state of the cell requires constant power;
  • the cell state is available for reading almost immediately once WL is raised, and the signal is as rectangular as other transistor-controlled signals (it changes quickly between the two binary states);
  • the cell state is stable; no refresh cycles are needed.

There are other, slower and less power-hungry forms of SRAM, but since we are interested in fast RAM they are not considered here. Their main appeal is a simpler interface, which makes them easier to use in a system than dynamic RAM.

 

2.1.2 Dynamic RAM

Dynamic RAM is, in its structure, much simpler than static RAM. Figure 2.5 shows the structure of a usual DRAM cell design. All it consists of is one transistor and one capacitor. This huge difference in complexity of course means that it functions very differently than static RAM.

 

Figure 2.5: 1-T Dynamic RAM

A dynamic RAM cell keeps its state in the capacitor C. The transistor M is used to guard the access to the state. To read the state of the cell the access line AL is raised; this either causes a current to flow on the data line DL or not, depending on the charge in the capacitor. To write to the cell the data line DL is appropriately set and then AL is raised for a time long enough to charge or drain the capacitor.

There are a number of complications with the design of dynamic RAM. The use of a capacitor means that reading the cell discharges the capacitor. The procedure cannot be repeated indefinitely, the capacitor must be recharged at some point. Even worse, to accommodate the huge number of cells (chips with 10^9 or more cells are now common) the capacity of the capacitor must be low (in the femto-farad range or lower). A fully charged capacitor holds a few 10's of thousands of electrons. Even though the resistance of the capacitor is high (a couple of tera-ohms) it only takes a short time for the capacity to dissipate. This problem is called 「leakage」.

Translation

2.1.2 Dynamic RAM

Dynamic RAM is structurally much simpler than static RAM. Figure 2.5 shows the structure of a typical DRAM cell: it consists of just one transistor and one capacitor. This huge difference in complexity of course means it works very differently from static RAM.

Figure 2.5: 1-T Dynamic RAM

A dynamic RAM cell keeps its state in the capacitor C, and the transistor M guards access to it. To read the cell, the access line AL is raised; a current may or may not flow on the data line DL, depending on whether the capacitor is charged. To write, DL is set appropriately and then AL is raised long enough to charge or drain the capacitor.

The design of dynamic RAM has several complications. Because reading the cell discharges the capacitor, the procedure cannot be repeated indefinitely; the capacitor must be recharged at some point.

Worse, to accommodate the huge number of cells (chips with 10^9 or more cells are now common), the capacitance must be tiny (in the femto-farad range or lower). A fully charged capacitor holds only a few tens of thousands of electrons. Even though the capacitor's resistance is high (a couple of tera-ohms), it takes only a short time for the charge to dissipate. This problem is called 「leakage」.

 

This leakage is why a DRAM cell must be constantly refreshed. For most DRAM chips these days this refresh must happen every 64ms. During the refresh cycle no access to the memory is possible. For some workloads this overhead might stall up to 50% of the memory accesses (see [highperfdram]).

A second problem resulting from the tiny charge is that the information read from the cell is not directly usable. The data line must be connected to a sense amplifier which can distinguish between a stored 0 or 1 over the whole range of charges which still have to count as 1.

A third problem is that charging and draining a capacitor is not instantaneous. The signals received by the sense amplifier are not rectangular, so a conservative estimate as to when the output of the cell is usable has to be used. The formulas for charging and discharging a capacitor are

 

[Formulas]
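The formulas themselves did not survive the conversion of the original text; what is meant are the standard RC charge and discharge relations, reproduced here for reference:

\[ Q_{\mathrm{charge}}(t) = Q_0 \left(1 - e^{-t/RC}\right), \qquad Q_{\mathrm{discharge}}(t) = Q_0\, e^{-t/RC} \]

where Q_0 is the full charge. After one RC time constant only about 63% of the charge has moved, which is why the X-axis of Figure 2.6 below is measured in units of RC.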

 

Translation

This leakage is why today's DRAM chips must be refreshed every 64ms. During a refresh cycle no access to the memory is possible; for some workloads this overhead can stall up to 50% of the memory accesses (see [highperfdram]).

A second consequence of the tiny charge is that the information read from a cell is not directly usable: the data line must be connected to a sense amplifier that can distinguish a stored 0 from a 1 across the whole range of charges that still count as a 1.

A third problem is that charging and draining a capacitor takes time, so the signals received by the sense amplifier are not rectangular, and a conservative estimate of when the cell's output is usable is needed. The formulas for charging and discharging a capacitor are

[Formulas]
This means it takes some time (determined by the capacity C and resistance R) for the capacitor to be charged and discharged. It also means that the current which can be detected by the sense amplifiers is not immediately available. Figure 2.6 shows the charge and discharge curves. The X-axis is measured in units of RC (resistance multiplied by capacitance) which is a unit of time.

 

 

Figure 2.6: Capacitor Charge and Discharge Timing

Unlike the static RAM case where the output is immediately available when the word access line is raised, it will always take a bit of time until the capacitor discharges sufficiently. This delay severely limits how fast DRAM can be.

The simple approach has its advantages, too. The main advantage is size. The chip real estate needed for one DRAM cell is many times smaller than that of an SRAM cell. The SRAM cells also need individual power for the transistors maintaining the state. The structure of the DRAM cell is also simpler and more regular which means packing many of them close together on a die is simpler.

Overall, the (quite dramatic) difference in cost wins. Except in specialized hardware — network routers, for example — we have to live with main memory which is based on DRAM. This has huge implications on the programmer which we will discuss in the remainder of this paper. But first we need to look into a few more details of the actual use of DRAM cells.

 

Translation

This means it takes some time (determined by the capacitance C and resistance R) for the capacitor to charge and discharge, and the current detectable by the sense amplifier is not available immediately. Figure 2.6 shows the charge and discharge curves; the X-axis is measured in units of RC, which is a unit of time.

Unlike static RAM, where the output is available as soon as the word access line is raised, dynamic RAM always needs a little time until the capacitor has discharged sufficiently. This delay severely limits how fast DRAM can be.

The simple design has its advantages too. The main one is size: a DRAM cell takes many times less chip real estate than an SRAM cell, and SRAM cells also need individual power for the transistors that maintain the state. The DRAM cell structure is also simpler and more regular, which makes it easier to pack many of them closely together on a die.

All in all, the (quite dramatic) difference in cost wins. Except in specialized hardware such as network routers, our machines use DRAM for main memory. This has huge implications for us programmers, which we will discuss later in this paper, but first we need to look at a few more details of how DRAM cells are actually used.

 

2.1.3 DRAM Access

A program selects a memory location using a virtual address. The processor translates this into a physical address and finally the memory controller selects the RAM chip corresponding to that address. To select the individual memory cell on the RAM chip, parts of the physical address are passed on in the form of a number of address lines.

It would be completely impractical to address memory locations individually from the memory controller: 4GB of RAM would require 2^32 address lines. Instead the address is passed encoded as a binary number using a smaller set of address lines. The address passed to the DRAM chip this way must be demultiplexed first. A demultiplexer with N address lines will have 2^N output lines. These output lines can be used to select the memory cell. Using this direct approach is no big problem for chips with small capacities.

Translation

2.1.3 DRAM Access

A program selects a memory location using a virtual address. The processor translates this into a physical address, and finally the memory controller selects the RAM chip corresponding to that address. To select the individual memory cell on the RAM chip, parts of the physical address are passed on in the form of a number of address lines.

Addressing memory locations individually from the memory controller would be completely impractical: 4GB of RAM would need 2^32 address lines. Instead the address is passed as a binary number over a smaller set of address lines and must first be demultiplexed on the DRAM chip. A demultiplexer with N address lines has 2^N output lines, which can be used to select a memory cell. This direct approach is not a big problem for chips with small capacities.

 

But if the number of cells grows this approach is not suitable anymore. A chip with 1Gbit {I hate those SI prefixes. For me a giga-bit will always be 2^30 and not 10^9 bits.} capacity would need 30 address lines and 2^30 select lines. The size of a demultiplexer increases exponentially with the number of input lines when speed is not to be sacrificed. A demultiplexer for 30 address lines needs a whole lot of chip real estate in addition to the complexity (size and time) of the demultiplexer. Even more importantly, transmitting 30 impulses on the address lines synchronously is much harder than transmitting 「only」 15 impulses. Fewer lines have to be laid out at exactly the same length or timed appropriately. {Modern DRAM types like DDR3 can automatically adjust the timing but there is a limit as to what can be tolerated.}

 

Figure 2.7: Dynamic RAM Schematic

Figure 2.7 shows a DRAM chip at a very high level. The DRAM cells are organized in rows and columns. They could all be aligned in one row but then the DRAM chip would need a huge demultiplexer. With the array approach the design can get by with one demultiplexer and one multiplexer of half the size. {Multiplexers and demultiplexers are equivalent and the multiplexer here needs to work as a demultiplexer when writing. So we will drop the differentiation from now on.} This is a huge saving on all fronts. In the example the address lines a0 and a1 through the row address selection (RAS) demultiplexer select the address lines of a whole row of cells. When reading, the content of all cells is thusly made available to the column address selection (CAS) {The line over the name indicates that the signal is negated} multiplexer. Based on the address lines a2 and a3 the content of one column is then made available to the data pin of the DRAM chip. This happens many times in parallel on a number of DRAM chips to produce a total number of bits corresponding to the width of the data bus.

Translation

But as the number of cells grows this approach no longer works. A chip with 1Gbit capacity {I hate those SI prefixes; for me a giga-bit will always be 2^30 and not 10^9 bits.} would need 30 address lines and 2^30 select lines. The size of a demultiplexer grows exponentially with the number of input lines if speed is not to be sacrificed; a demultiplexer for 30 address lines needs a great deal of chip real estate on top of its own complexity (in size and time). Even more importantly, transmitting 30 pulses synchronously on the address lines is much harder than transmitting 「only」 15: fewer lines have to be laid out to exactly the same length or timed appropriately. {Modern DRAM types like DDR3 can adjust the timing automatically, but there is a limit to what can be tolerated.}

Figure 2.7 shows a DRAM chip at a very high level. The DRAM cells are organized in rows and columns. They could all be placed in a single row, but then the chip would need a huge demultiplexer. With the array approach the design gets by with one demultiplexer and one multiplexer of half the size. {Multiplexers and demultiplexers are equivalent, and the multiplexer here must work as a demultiplexer when writing, so from now on we drop the distinction.} This is a huge saving on all fronts. In the example, the address lines a0 and a1 select, through the row address selection (RAS) demultiplexer, a whole row of cells. When reading, the contents of all those cells are made available to the column address selection (CAS) multiplexer; based on the address lines a2 and a3, the content of one column is then made available on the data pin of the DRAM chip. This happens in parallel on a number of DRAM chips to produce a total number of bits matching the width of the data bus.

 

For writing, the new cell value is put on the data bus and, when the cell is selected using the RAS and CAS, it is stored in the cell. A pretty straightforward design. There are in reality — obviously — many more complications. There need to be specifications for how much delay there is after the signal before the data will be available on the data bus for reading. The capacitors do not unload instantaneously, as described in the previous section. The signal from the cells is so weak that it needs to be amplified. For writing it must be specified how long the data must be available on the bus after the RAS and CAS is done to successfully store the new value in the cell (again, capacitors do not fill or drain instantaneously). These timing constants are crucial for the performance of the DRAM chip. We will talk about this in the next section.

A secondary scalability problem is that having 30 address lines connected to every RAM chip is not feasible either. Pins of a chip are a precious resources. It is 「bad」 enough that the data must be transferred as much as possible in parallel (e.g., in 64 bit batches). The memory controller must be able to address each RAM module (collection of RAM chips). If parallel access to multiple RAM modules is required for performance reasons and each RAM module requires its own set of 30 or more address lines, then the memory controller needs to have, for 8 RAM modules, a whopping 240+ pins only for the address handling.

Translation

For writes, the new value is put on the data bus, and when the cell is selected using RAS and CAS it is stored in the cell. A fairly straightforward design. In reality there are, obviously, many more complications. There have to be specifications for how long after the signals the data is available on the data bus for reading; as described in the previous section, the capacitors do not discharge instantaneously, and the signal from the cells is so weak that it has to be amplified. For writes it must be specified how long the data must remain on the bus after the RAS and CAS signals for the new value to be stored reliably in the cell (again, capacitors do not charge or drain instantaneously). These timing constants are crucial for DRAM performance, and we discuss them in the next section.

A second scalability problem is that having 30 address lines connected to every RAM chip is not feasible either. The pins of a chip are a precious resource; it is 「bad」 enough that data has to be transferred in parallel as much as possible (e.g., in 64-bit batches). The memory controller must be able to address each RAM module (a collection of RAM chips). If parallel access to multiple RAM modules is required for performance and each module needs its own 30 or more address lines, then for 8 modules the memory controller would need a whopping 240+ pins just for the address handling.

 

To counter these secondary scalability problems DRAM chips have, for a long time, multiplexed the address itself. That means the address is transferred in two parts. The first part (consisting of address bits a0 and a1 in the example in Figure 2.7) selects the row. This selection remains active until revoked. Then the second part, address bits a2 and a3, selects the column. The crucial difference is that only two external address lines are needed. A few more lines are needed to indicate when the RAS and CAS signals are available but this is a small price to pay for cutting the number of address lines in half. This address multiplexing brings its own set of problems, though. We will discuss them in Section 2.2.

 

Translation

For a long time, DRAM chips have therefore multiplexed the address itself to counter these secondary scalability problems. This means the address is transferred in two parts. The first part, address bits a0 and a1 in Figure 2.7, selects the row; this selection stays active until it is revoked. The second part, address bits a2 and a3, selects the column. The crucial difference is that only two external address lines are needed. A few extra lines indicate when the RAS and CAS signals are valid, but that is a small price to pay for halving the number of address lines. Address multiplexing brings its own set of problems, though, which we discuss in Section 2.2.
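To illustrate the multiplexed addressing just described, here is a small sketch that splits a cell index into the part sent during the RAS phase and the part sent during the CAS phase. The 2-bit row and column widths mirror the toy example of Figure 2.7; real chips use many more bits and the memory controller chooses the exact split, so the numbers are purely illustrative.

```c
/* Hypothetical 4x4 cell array as in Figure 2.7: two row bits and two column
 * bits, carried one after the other over the same two external address
 * lines (row part latched with RAS, column part latched with CAS). */
#include <stdio.h>

#define ROW_BITS 2
#define COL_BITS 2

int main(void)
{
    for (unsigned cell = 0; cell < (1u << (ROW_BITS + COL_BITS)); ++cell) {
        unsigned row = cell >> COL_BITS;              /* upper bits, RAS phase */
        unsigned col = cell & ((1u << COL_BITS) - 1); /* lower bits, CAS phase */
        printf("cell %2u -> row %u (RAS), column %u (CAS)\n", cell, row, col);
    }
    return 0;
}
```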

 

2.1.4 Conclusions

Do not worry if the details in this section are a bit overwhelming. The important things to take away from this section are:

  • there are reasons why not all memory is SRAM
  • memory cells need to be individually selected to be used
  • the number of address lines is directly responsible for the cost of the memory controller, motherboards, DRAM module, and DRAM chip
  • it takes a while before the results of the read or write operation are available

The following section will go into more details about the actual process of accessing DRAM memory. We are not going into more details of accessing SRAM, which is usually directly addressed. This happens for speed and because the SRAM memory is limited in size. SRAM is currently used in CPU caches and on-die where the connections are small and fully under control of the CPU designer. CPU caches are a topic which we discuss later but all we need to know is that SRAM cells have a certain maximum speed which depends on the effort spent on the SRAM. The speed can vary from only slightly slower than the CPU core to one or two orders of magnitude slower.

 

Translation

2.1.4 Conclusions

Do not worry if the details in this section were a bit overwhelming. The important points to take away are:

  • there are reasons why not all memory is SRAM
  • memory cells must be individually selected in order to be used
  • the number of address lines directly drives the cost of the memory controller, motherboard, DRAM module, and DRAM chip
  • it takes a while before the result of a read or write operation is available

The following section goes into more detail about the actual process of accessing DRAM. We will not go into more detail about accessing SRAM, which is usually addressed directly; this is done for speed, and because SRAM is limited in size. SRAM is currently used in CPU caches and on-die, where the connections are small and fully under the control of the CPU designer. CPU caches are a topic we discuss later; all we need to know here is that SRAM cells have a certain maximum speed, depending on the effort spent on the SRAM, ranging from only slightly slower than the CPU core to one or two orders of magnitude slower.

 

2.2 DRAM Access Technical Details

In the section introducing DRAM we saw that DRAM chips multiplex the addresses in order to save resources. We also saw that accessing DRAM cells takes time since the capacitors in those cells do not discharge instantaneously to produce a stable signal; we also saw that DRAM cells must be refreshed. Now it is time to put this all together and see how all these factors determine how the DRAM access has to happen.

We will concentrate on current technology; we will not discuss asynchronous DRAM and its variants as they are simply not relevant anymore. Readers interested in this topic are referred to [highperfdram] and [arstechtwo]. We will also not talk about Rambus DRAM (RDRAM) even though the technology is not obsolete. It is just not widely used for system memory. We will concentrate exclusively on Synchronous DRAM (SDRAM) and its successors Double Data Rate DRAM (DDR).

Translation

2.2 DRAM Access Technical Details

In the section introducing DRAM we saw that DRAM chips multiplex the address in order to save resources. We also saw that accessing DRAM cells takes time, because the capacitors in the cells do not discharge instantaneously, and that DRAM cells must be refreshed constantly. In this section we put these factors together and see how they determine the way DRAM access has to happen.

We concentrate on current technology and will not discuss asynchronous DRAM and its variants, which are simply no longer relevant; interested readers are referred to [highperfdram] and [arstechtwo]. Nor will we discuss Rambus DRAM (RDRAM); although the technology is not obsolete, it is not widely used for system memory. We focus exclusively on Synchronous DRAM (SDRAM) and its successor, Double Data Rate DRAM (DDR).

 

Synchronous DRAM, as the name suggests, works relative to a time source. The memory controller provides a clock, the frequency of which determines the speed of the Front Side Bus (FSB) — the memory controller interface used by the DRAM chips. As of this writing, frequencies of 800MHz, 1,066MHz, or even 1,333MHz are available with higher frequencies (1,600MHz) being announced for the next generation. This does not mean the frequency used on the bus is actually this high. Instead, today's buses are double- or quad-pumped, meaning that data is transported two or four times per cycle. Higher numbers sell so the manufacturers like to advertise a quad-pumped 200MHz bus as an 「effective」 800MHz bus.

For SDRAM today each data transfer consists of 64 bits — 8 bytes. The transfer rate of the FSB is therefore 8 bytes multiplied by the effective bus frequency (6.4GB/s for the quad-pumped 200MHz bus). That sounds a lot but it is the burst speed, the maximum speed which will never be surpassed. As we will see now the protocol for talking to the RAM modules has a lot of downtime when no data can be transmitted. It is exactly this downtime which we must understand and minimize to achieve the best performance.

Translation

Synchronous DRAM, as the name suggests, works relative to a time source. The memory controller provides a clock whose frequency determines the speed of the Front Side Bus (FSB), the interface the memory controller presents to the DRAM chips. As of this writing, FSB frequencies of 800MHz, 1066MHz, and even 1333MHz are available, with 1600MHz announced for the next generation. This does not mean the clock actually runs that fast: today's buses are double- or quad-pumped, transferring data two or four times per cycle. Because higher numbers sell better, manufacturers like to advertise a quad-pumped 200MHz bus as an 「effective」 800MHz bus.

For today's SDRAM, each data transfer consists of 64 bits, or 8 bytes, so the transfer rate of the FSB is 8 bytes times the effective bus frequency (6.4GB/s for a quad-pumped 200MHz bus). That sounds like a lot, but it is the burst speed, a maximum that is never exceeded. As we will see, the protocol for talking to the RAM modules includes a lot of downtime during which no data can be transferred. It is exactly this downtime that we must understand and minimize to achieve the best performance.
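The burst figure quoted above is just a couple of multiplications; a minimal sketch, using the quad-pumped 200MHz bus from the text as the assumed configuration:

```c
/* Burst transfer rate of a 64-bit FSB: 8 bytes per transfer times the
 * effective (double- or quad-pumped) bus frequency. */
#include <stdio.h>

int main(void)
{
    const double bus_mhz = 200.0;  /* real clock of the quad-pumped bus */
    const double pump    = 4.0;    /* transfers per clock cycle         */
    const double width   = 8.0;    /* bytes per transfer (64-bit bus)   */

    double effective_mhz = bus_mhz * pump;                  /* "800MHz" FSB */
    double rate_gb_s     = effective_mhz * 1e6 * width / 1e9;

    printf("effective FSB %.0fMHz, burst rate %.1fGB/s\n",
           effective_mhz, rate_gb_s);                       /* 800MHz, 6.4GB/s */
    return 0;
}
```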

 

2.2.1 Read Access Protocol

Figure 2.8: SDRAM Read Access Timing

Figure 2.8 shows the activity on some of the connectors of a DRAM module which happens in three differently colored phases. As usual, time flows from left to right. A lot of details are left out. Here we only talk about the bus clock, RAS and CAS signals, and the address and data buses. A read cycle begins with the memory controller making the row address available on the address bus and lowering the RAS signal. All signals are read on the rising edge of the clock (CLK) so it does not matter if the signal is not completely square as long as it is stable at the time it is read. Setting the row address causes the RAM chip to start latching the addressed row.

The CAS signal can be sent after tRCD (RAS-to-CAS Delay) clock cycles. The column address is then transmitted by making it available on the address bus and lowering the CAS line. Here we can see how the two parts of the address (more or less halves, nothing else makes sense) can be transmitted over the same address bus.

Translation

2.2.1 Read Access Protocol

Figure 2.8: SDRAM Read Access Timing

Figure 2.8 shows the activity on some of the connectors of a DRAM module, in three differently colored phases. As usual, time flows from left to right. Many details are left out; here we only deal with the bus clock, the RAS and CAS signals, and the address and data buses. A read cycle begins with the memory controller placing the row address on the address bus and lowering the RAS signal. All signals are read on the rising edge of the clock (CLK), so it does not matter if a signal is not a perfect square wave, as long as it is stable at the moment it is read. Setting the row address causes the RAM chip to latch the addressed row.

The CAS signal can be sent after tRCD (RAS-to-CAS delay) clock cycles. The memory controller then puts the column address on the address bus and lowers the CAS line. Here we can see how the two parts of the address are transmitted over the same address bus.

 

Now the addressing is complete and the data can be transmitted. The RAM chip needs some time to prepare for this. The delay is usually called CAS Latency (CL). In Figure 2.8 the CAS latency is 2. It can be higher or lower, depending on the quality of the memory controller, motherboard, and DRAM module. The latency can also have half values. With CL=2.5 the first data would be available at the first falling flank in the blue area.

With all this preparation to get to the data it would be wasteful to only transfer one data word. This is why DRAM modules allow the memory controller to specify how much data is to be transmitted. Often the choice is between 2, 4, or 8 words. This allows filling entire lines in the caches without a new RAS/CAS sequence. It is also possible for the memory controller to send a new CAS signal without resetting the row selection. In this way, consecutive memory addresses can be read from or written to significantly faster because the RAS signal does not have to be sent and the row does not have to be deactivated (see below). Keeping the row 「open」 is something the memory controller has to decide. Speculatively leaving it open all the time has disadvantages with real-world applications (see [highperfdram]). Sending new CAS signals is only subject to the Command Rate of the RAM module (usually specified as Tx, where x is a value like 1 or 2; it will be 1 for high-performance DRAM modules which accept new commands every cycle).

In this example the SDRAM spits out one word per cycle. This is what the first generation does. DDR is able to transmit two words per cycle. This cuts down on the transfer time but does not change the latency. In principle, DDR2 works the same although in practice it looks different. There is no need to go into the details here. It is sufficient to note that DDR2 can be made faster, cheaper, more reliable, and is more energy efficient (see [ddrtwo] for more information).

Translation

Now the addressing is complete and the data can be transmitted, but the RAM chip still needs some preparation time, called the CAS latency (CL). In Figure 2.8 the CAS latency is 2. It can be higher or lower, depending on the quality of the memory controller, motherboard, and DRAM module, and it can also take half values: with CL=2.5, the first data would be available at the first falling edge in the blue area.

With all this preparation, transferring only a single word would be wasteful, which is why DRAM modules let the memory controller specify how much data to transfer: 2, 4, or 8 words. This allows entire cache lines to be filled without a new RAS/CAS sequence. The memory controller can also send a new CAS signal without resetting the row selection, so consecutive addresses can be read or written much faster, since the RAS signal does not have to be sent and the row does not have to be deactivated (see below). Whether to keep the row 「open」 is a decision for the memory controller; speculatively leaving it open all the time has disadvantages with real-world applications (see [highperfdram]). Sending new CAS signals is only subject to the module's command rate (usually written Tx, where x is 1 or 2; high-performance DRAM modules are T1, accepting a new command every cycle).

In the figure, the SDRAM outputs one word per cycle, which is what the first generation does. DDR can transmit two words per cycle, cutting the transfer time but not the latency. DDR2 works the same way in principle, even if it looks different in practice; there is no need to go into the details, it is enough to know that DDR2 is faster, cheaper, more reliable, and more energy efficient (see [ddrtwo]).
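To make the cycle counts concrete, the following sketch converts the tRCD and CL values of Figure 2.8 into an absolute access time. The 200MHz bus clock is assumed purely for illustration; none of the numbers come from a real datasheet.

```c
/* Time from the RAS signal until the first data word is on the bus:
 * tRCD cycles (RAS-to-CAS delay) plus CL cycles (CAS latency). */
#include <stdio.h>

int main(void)
{
    const double bus_mhz = 200.0;  /* real bus clock, not the pumped rate */
    const int    trcd    = 2;      /* RAS-to-CAS delay in cycles          */
    const int    cl      = 2;      /* CAS latency in cycles               */

    double cycle_ns = 1000.0 / bus_mhz;          /* 5ns per cycle at 200MHz */
    printf("first data after %d cycles = %.1fns\n",
           trcd + cl, (trcd + cl) * cycle_ns);
    return 0;
}
```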

 

2.2.2 Precharge and Activation

Figure 2.8 does not cover the whole cycle. It only shows parts of the full cycle of accessing DRAM. Before a new RAS signal can be sent the currently latched row must be deactivated and the new row must be precharged. We can concentrate here on the case where this is done with an explicit command. There are improvements to the protocol which, in some situations, allow this extra step to be avoided. The delays introduced by precharging still affect the operation, though.

 

Figure 2.9: SDRAM Precharge and Activation

Figure 2.9 shows the activity starting from one CAS signal to the CAS signal for another row. The data requested with the first CAS signal is available as before, after CL cycles. In the example two words are requested which, on a simple SDRAM, takes two cycles to transmit. Alternatively, imagine four words on a DDR chip.

Translation

2.2.2 Precharge and Activation

Figure 2.8 is not complete: it shows only part of the full cycle of accessing DRAM. Before a new RAS signal can be sent, the currently latched row must be deactivated and the new row must be precharged. Here we concentrate on the case where this is triggered by an explicit command; the protocol has improvements that allow this extra step to be skipped in some situations, but the delays introduced by precharging still affect the operation.

Figure 2.9: SDRAM Precharge and Activation

Figure 2.9 shows the activity from one CAS signal to the CAS signal for another row. The data requested with the first CAS signal is available after CL cycles, as before. In the example, two words are requested, which on a simple SDRAM takes two cycles to transmit; alternatively, think of four words on a DDR chip.

 

Even on DRAM modules with a command rate of one the precharge command cannot be issued right away. It is necessary to wait as long as it takes to transmit the data. In this case it takes two cycles. This happens to be the same as CL but that is just a coincidence. The precharge signal has no dedicated line; instead, some implementations issue it by lowering the Write Enable (WE) and RAS line simultaneously. This combination has no useful meaning by itself (see [micronddr] for encoding details).

Once the precharge command is issued it takes tRP (Row Precharge time) cycles until the row can be selected. In Figure 2.9 much of the time (indicated by the purplish color) overlaps with the memory transfer (light blue). This is good! But tRP is larger than the transfer time and so the next RAS signal is stalled for one cycle.

Translation

Even on a DRAM module with a command rate of one, the precharge command cannot be issued right away: it is necessary to wait until the data has been transmitted, which here takes two cycles. This happens to equal CL, but that is just a coincidence. The precharge signal has no dedicated line; some implementations issue it by lowering the Write Enable (WE) and RAS lines simultaneously, a combination that has no useful meaning by itself (see [micronddr] for the encoding details).

Once the precharge command has been issued, it takes tRP (row precharge time) cycles until the row can be selected. In Figure 2.9 much of this time (the purplish part) overlaps with the memory transfer (light blue), which is good; but tRP is larger than the transfer time, so the next RAS signal is stalled for one cycle.

 

If we were to continue the timeline in the diagram we would find that the next data transfer happens 5 cycles after the previous one stops. This means the data bus is only in use two cycles out of seven. Multiply this with the FSB speed and the theoretical 6.4GB/s for a 800MHz bus become 1.8GB/s. That is bad and must be avoided. The techniques described in Section 6 help to raise this number. But the programmer usually has to do her share.

There is one more timing value for a SDRAM module which we have not discussed. In Figure 2.9 the precharge command was only limited by the data transfer time. Another constraint is that an SDRAM module needs time after a RAS signal before it can precharge another row (denoted as tRAS). This number is usually pretty high, in the order of two or three times the tRP value. This is a problem if, after a RAS signal, only one CAS signal follows and the data transfer is finished in a few cycles. Assume that in Figure 2.9 the initial CAS signal was preceded directly by a RAS signal and that tRAS is 8 cycles. Then the precharge command would have to be delayed by one additional cycle since the sum of tRCD, CL, and tRP (since it is larger than the data transfer time) is only 7 cycles.

Translation

If we continued the timeline in the diagram, we would find that the next data transfer starts 5 cycles after the previous one ends. This means the data bus is in use for only 2 cycles out of every 7. Multiply that by the FSB speed and the theoretical 6.4GB/s of an 800MHz bus becomes 1.8GB/s, which is bad and must be avoided. The techniques described in Section 6 help to raise this number, but the programmer usually has to contribute as well.

There is one more timing value for an SDRAM module that we have not discussed. In Figure 2.9 the precharge command was only limited by the data transfer time, but there is another constraint: after a RAS signal, an SDRAM module needs some time before it can precharge another row (denoted tRAS). This number is usually quite large, on the order of two to three times tRP. It becomes a problem if, after a RAS signal, only one CAS signal follows and the data transfer finishes within a few cycles. Suppose that in Figure 2.9 the first CAS signal were immediately preceded by a RAS signal and that tRAS is 8 cycles: then the precharge command would have to be delayed by one additional cycle, since the sum of tRCD, CL, and tRP (which is larger than the data transfer time) is only 7 cycles.
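The drop from 6.4GB/s to 1.8GB/s is simply the fraction of bus cycles that actually carry data; a short sketch of the arithmetic with the cycle counts from the Figure 2.9 discussion:

```c
/* Effective bandwidth when only 2 of every 7 bus cycles transfer data. */
#include <stdio.h>

int main(void)
{
    const double burst_gb_s   = 6.4;  /* quad-pumped 200MHz, 64-bit bus    */
    const double busy_cycles  = 2.0;  /* cycles actually transferring data */
    const double total_cycles = 7.0;  /* cycles from one burst to the next */

    printf("effective rate: %.1fGB/s\n",
           burst_gb_s * busy_cycles / total_cycles);        /* about 1.8GB/s */
    return 0;
}
```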

 

DDR modules are often described using a special notation: w-x-y-z-T. For instance: 2-3-2-8-T1. This means:

w 2 CAS Latency (CL)
x 3 RAS-to-CAS delay (tRCD)
y 2 RAS Precharge (tRP)
z 8 Active to Precharge delay (tRAS)
T T1 Command Rate

There are numerous other timing constants which affect the way commands can be issued and are handled. Those five constants are in practice sufficient to determine the performance of the module, though.

It is sometimes useful to know this information for the computers in use to be able to interpret certain measurements. It is definitely useful to know these details when buying computers since they, along with the FSB and SDRAM module speed, are among the most important factors determining a computer's speed.

The very adventurous reader could also try to tweak a system. Sometimes the BIOS allows changing some or all these values. SDRAM modules have programmable registers where these values can be set. Usually the BIOS picks the best default value. If the quality of the RAM module is high it might be possible to reduce the one or the other latency without affecting the stability of the computer. Numerous overclocking websites all around the Internet provide ample of documentation for doing this. Do it at your own risk, though and do not say you have not been warned.

Translation

DDR modules are often described with the notation w-x-y-z-T, for instance 2-3-2-8-T1, which means:

w 2 CAS latency (CL)
x 3 RAS-to-CAS delay (tRCD)
y 2 RAS precharge time (tRP)
z 8 Active-to-precharge delay (tRAS)
T T1 Command rate

There are of course other timing constants that affect how commands are issued and handled, but in practice these five are enough to determine the performance of a module.

This information can be useful when interpreting measurements on the computers in use, and it is definitely useful when buying computers, because together with the FSB and SDRAM module speed it is among the most important factors determining a computer's speed.

Adventurous readers can also try to tune their systems. Some BIOSes allow some or all of these values to be changed; SDRAM modules have programmable registers where they can be set. Usually the BIOS picks the best default values. If the RAM module is of high quality, it may be possible to reduce one or another latency without affecting system stability. Numerous overclocking sites on the Internet provide ample documentation on how to do this; do it at your own risk, though, and do not say you were not warned.
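As an illustration, the w-x-y-z-T values map naturally onto a small structure. The field names in this sketch are my own, not a standard API; it encodes the 2-3-2-8-T1 example and derives two of the quantities discussed above.

```c
/* The five timing parameters behind the "w-x-y-z-T" notation, filled in
 * with the 2-3-2-8-T1 example from the text. All values are bus cycles. */
#include <stdio.h>

struct ddr_timings {
    int cl;            /* w: CAS latency               */
    int trcd;          /* x: RAS-to-CAS delay          */
    int trp;           /* y: RAS precharge time        */
    int tras;          /* z: active-to-precharge delay */
    int command_rate;  /* T: cycles between commands   */
};

int main(void)
{
    const struct ddr_timings t = { .cl = 2, .trcd = 3, .trp = 2,
                                   .tras = 8, .command_rate = 1 };

    /* Cycles from row activation to the first data word, and the minimum
     * time a row stays occupied before the same bank can be activated
     * again (tRAS + tRP, often called the row cycle time). */
    printf("2-3-2-8-T%d: first data after %d cycles, row cycle >= %d cycles\n",
           t.command_rate, t.trcd + t.cl, t.tras + t.trp);
    return 0;
}
```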

 

2.2.3 Recharging

A mostly-overlooked topic when it comes to DRAM access is recharging. As explained in Section 2.1.2, DRAM cells must constantly be refreshed. This does not happen completely transparently for the rest of the system. At times when a row {Rows are the granularity this happens with despite what [highperfdram] and other literature says (see [micronddr]).} is recharged no access is possible. The study in [highperfdram] found that 「[s]urprisingly, DRAM refresh organization can affect performance dramatically」.

Each DRAM cell must be refreshed every 64ms according to the JEDEC specification. If a DRAM array has 8,192 rows this means the memory controller has to issue a refresh command on average every 7.8125µs (refresh commands can be queued so in practice the maximum interval between two requests can be higher). It is the memory controller's responsibility to schedule the refresh commands. The DRAM module keeps track of the address of the last refreshed row and automatically increases the address counter for each new request.

There is really not much the programmer can do about the refresh and the points in time when the commands are issued. But it is important to keep this part to the DRAM life cycle in mind when interpreting measurements. If a critical word has to be retrieved from a row which currently is being refreshed the processor could be stalled for quite a long time. How long each refresh takes depends on the DRAM module.

Translation

2.2.3 Recharging

Refreshing is a mostly overlooked topic when it comes to DRAM access. As explained in Section 2.1.2, DRAM cells must be refreshed constantly, and a row cannot be accessed while it is being refreshed. The study in [highperfdram] found that 「surprisingly, DRAM refresh organization can affect performance dramatically」.

According to the JEDEC specification, each DRAM cell must be refreshed every 64ms. For a DRAM array with 8,192 rows this means the memory controller has to issue a refresh command on average every 7.8125µs (in practice, refresh commands can be queued, so the maximum interval between two requests can be larger). Scheduling the refresh commands is the memory controller's responsibility; the DRAM module keeps track of the address of the last refreshed row and automatically increments it for each new request.

The programmer can do nothing about the refresh or the points in time when the commands are issued, but it is important to keep this part of the DRAM life cycle in mind when interpreting measurements. If a critical word has to be retrieved from a row that is currently being refreshed, the processor can be stalled for quite a long time. How long each refresh takes depends on the DRAM module.
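The 7.8125µs figure is just the refresh period divided by the number of rows; a tiny sketch of the arithmetic with the 64ms and 8,192-row numbers from the text:

```c
/* Average interval between refresh commands: 64ms spread over 8,192 rows. */
#include <stdio.h>

int main(void)
{
    const double refresh_period_ms = 64.0;
    const int    rows              = 8192;

    printf("one refresh command every %.4fus on average\n",
           refresh_period_ms * 1000.0 / rows);               /* 7.8125us */
    return 0;
}
```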

 

2.2.4 Memory Types

It is worth spending some time on the current and soon-to-be current memory types in use. We will start with SDR (Single Data Rate) SDRAMs since they are the basis of the DDR (Double Data Rate) SDRAMs. SDRs were pretty simple. The memory cells and the data transfer rate were identical.

Figure 2.10: SDR SDRAM Operation

In Figure 2.10 the DRAM cell array can output the memory content at the same rate it can be transported over the memory bus. If the DRAM cell array can operate at 100MHz, the data transfer rate of the bus is thus 100Mb/s. The frequency f for all components is the same. Increasing the throughput of the DRAM chip is expensive since the energy consumption rises with the frequency. With a huge number of array cells this is prohibitively expensive. {Power = Dynamic Capacity × Voltage² × Frequency.} In reality it is even more of a problem since increasing the frequency usually also requires increasing the voltage to maintain stability of the system. DDR SDRAM (called DDR1 retroactively) manages to improve the throughput without increasing any of the involved frequencies.

Figure 2.11: DDR1 SDRAM Operation

 

Translation

2.2.4 Memory Types

It is worth spending some time on the memory types in use now and those about to arrive. We start with SDR (Single Data Rate) SDRAM, since it is the basis of the DDR (Double Data Rate) SDRAMs. SDR was very simple: the memory cells and the data transfer rate were identical.

Figure 2.10: SDR SDRAM Operation

In Figure 2.10, the DRAM cell array outputs memory content at the same rate at which it can be transported over the memory bus. If the cell array operates at 100MHz, the data transfer rate of the bus is 100Mb/s. The frequency f is the same for all components. Raising the throughput of the DRAM chip is expensive, because power consumption rises with the frequency {Power = dynamic capacitance × voltage² × frequency}, and with a huge number of array cells this is prohibitively expensive. In reality it is even worse, because raising the frequency usually also requires raising the voltage to keep the system stable. DDR SDRAM (retroactively called DDR1) manages to improve the throughput without increasing any of the frequencies involved.

Figure 2.11: DDR1 SDRAM Operation
The difference between SDR and DDR1 is, as can be seen in Figure 2.11 and guessed from the name, that twice the amount of data is transported per cycle. I.e., the DDR1 chip transports data on the rising and falling edge. This is sometimes called a 「double-pumped」 bus. To make this possible without increasing the frequency of the cell array a buffer has to be introduced. This buffer holds two bits per data line. This in turn requires that, in the cell array in Figure 2.7, the data bus consists of two lines. Implementing this is trivial: one only has to use the same column address for two DRAM cells and access them in parallel. The changes to the cell array to implement this are also minimal.

 

The SDR DRAMs were known simply by their frequency (e.g., PC100 for 100MHz SDR). To make DDR1 DRAM sound better the marketers had to come up with a new scheme since the frequency did not change. They came with a name which contains the transfer rate in bytes a DDR module (they have 64-bit busses) can sustain:

100MHz × 64bit × 2 = 1,600MB/s

 

Translation

As Figure 2.11 shows, and as the name suggests, the difference between SDR and DDR1 is that DDR1 transports twice the data per cycle: on the rising edge and on the falling edge. This is sometimes called a 「double-pumped」 bus. To make this possible without raising the cell array frequency, a buffer is introduced which holds two bits per data line; this in turn requires that the data bus in the cell array of Figure 2.7 consist of two lines. Implementing this is trivial: one simply uses the same column address for two DRAM cells and accesses them in parallel. The changes to the cell array are minimal.

SDR DRAM was named simply after its frequency (e.g., PC100 for 100MHz SDR). To make DDR1 sound better, the marketers needed a new scheme, since the frequency had not changed. The new name contains the transfer rate in bytes that a DDR module (with its 64-bit bus) can sustain:

100MHz × 64 bit × 2 = 1,600MB/s
Hence a DDR module with 100MHz frequency is called PC1600. With 1600 > 100 all marketing requirements are fulfilled; it sounds much better although the improvement is really only a factor of two. { I will take the factor of two but I do not have to like the inflated numbers.}

 

Figure 2.12: DDR2 SDRAM Operation

To get even more out of the memory technology DDR2 includes a bit more innovation. The most obvious change that can be seen in Figure 2.12 is the doubling of the frequency of the bus. Doubling the frequency means doubling the bandwidth. Since this doubling of the frequency is not economical for the cell array it is now required that the I/O buffer gets four bits in each clock cycle which it then can send on the bus. This means the changes to the DDR2 modules consist of making only the I/O buffer component of the DIMM capable of running at higher speeds. This is certainly possible and will not require measurably more energy, it is just one tiny component and not the whole module. The names the marketers came up with for DDR2 are similar to the DDR1 names only in the computation of the value the factor of two is replaced by four (we now have a quad-pumped bus). Figure 2.13 shows the names of the modules in use today.

Array Freq.   Bus Freq.   Data Rate     Name (Rate)   Name (FSB)
133MHz 266MHz 4,256MB/s PC2-4200 DDR2-533
166MHz 333MHz 5,312MB/s PC2-5300 DDR2-667
200MHz 400MHz 6,400MB/s PC2-6400 DDR2-800
250MHz 500MHz 8,000MB/s PC2-8000 DDR2-1000
266MHz 533MHz 8,512MB/s PC2-8500 DDR2-1066

Figure 2.13: DDR2 Module Names

 

譯者信息

因而,100MHz頻率的DDR模塊就被稱爲PC1600。因爲1600 > 100,營銷方面的需求獲得了知足,聽起來很是棒,但實際上僅僅只是提高了兩倍而已。{我接受兩倍這個事實,但不喜歡相似的數字膨脹戲法。}

 
圖2.12: DDR2 SDRAM的操做

爲了更進一步,DDR2有了更多的創新。在圖2.12中,最明顯的變化是,總線的頻率加倍了。頻率的加倍意味着帶寬的加倍。若是對單元陣列的頻率加倍,顯然是不經濟的,所以DDR2要求I/O緩衝區在每一個時鐘週期讀取4位。也就是說,DDR2的變化僅在於使I/O緩衝區運行在更高的速度上。這是可行的,並且耗電也不會顯著增長。DDR2的命名與DDR1相仿,只是將因子2替換成4(四泵總線)。圖2.13顯示了目前經常使用的一些模塊的名稱。

陣列頻率   總線頻率   數據率       名稱(速率)   名稱(FSB)
133MHz 266MHz 4,256MB/s PC2-4200 DDR2-533
166MHz 333MHz 5,312MB/s PC2-5300 DDR2-667
200MHz 400MHz 6,400MB/s PC2-6400 DDR2-800
250MHz 500MHz 8,000MB/s PC2-8000 DDR2-1000
266MHz 533MHz 8,512MB/s PC2-8500 DDR2-1066

圖2.13: DDR2模塊名
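A small sketch of how these names are computed (my own illustration, not code from the paper): a module has a 64-bit bus, i.e. 8 bytes per transfer, and performs 2 transfers per array-frequency cycle for DDR1 and 4 for DDR2; the resulting MB/s value is what ends up in the PC or PC2 name.

  #include <stdio.h>

  /* sustained transfer rate in MB/s for a 64-bit module */
  static unsigned transfer_rate_mb(unsigned array_freq_mhz, unsigned transfers_per_cycle)
  {
      return array_freq_mhz * transfers_per_cycle * 8;   /* 8 bytes per transfer */
  }

  int main(void)
  {
      printf("DDR1, 100MHz array: PC%u\n",   transfer_rate_mb(100, 2));  /* PC1600   */
      printf("DDR2, 200MHz array: PC2-%u\n", transfer_rate_mb(200, 4));  /* PC2-6400 */
      return 0;
  }

Some of the official names round the intermediate frequencies slightly differently, which is why, e.g., PC2-4200 is not exactly 133 × 4 × 8.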
There is one more twist to the naming. The FSB speed used by CPU, motherboard, and DRAM module is specified by using the effective frequency. I.e., it factors in the transmission on both flanks of the clock cycle and thereby inflates the number. So, a 133MHz module with a 266MHz bus has an FSB 「frequency」 of 533MHz.

 

The specification for DDR3 (the real one, not the fake GDDR3 used in graphics cards) calls for more changes along the lines of the transition to DDR2. The voltage will be reduced from 1.8V for DDR2 to 1.5V for DDR3. Since the power consumption equation is calculated using the square of the voltage this alone brings a 30% improvement. Add to this a reduction in die size plus other electrical advances and DDR3 can manage, at the same frequency, to get by with half the power consumption. Alternatively, with higher frequencies, the same power envelope can be hit. Or with double the capacity the same heat emission can be achieved.

譯者信息

在命名方面還有一個擰巴的地方。FSB速度是用有效頻率來標記的,即把上升、降低沿均傳輸數據的因素考慮進去,所以數字被撐大了。因此,擁有266MHz總線的133MHz模塊有着533MHz的FSB「頻率」。

DDR3要求更多的改變(這裏指真正的DDR3,而不是圖形卡中假冒的GDDR3)。電壓從1.8V降低到1.5V。因爲耗電是與電壓的平方成正比,所以能夠節約30%的電力。加上管芯(die)的縮小和電氣方面的其它進展,DDR3能夠在保持相同頻率的狀況下,下降一半的電力消耗。或者,在保持相同耗電的狀況下,達到更高的頻率。又或者,在保持相同熱量排放的狀況下,實現容量的翻番。
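The roughly 30% figure follows directly from the footnoted relation P = dynamic capacity × voltage² × frequency; a minimal sketch of that arithmetic (mine, with the capacitance folded into an arbitrary constant):

  #include <stdio.h>

  /* relative dynamic power at a given voltage and frequency */
  static double rel_power(double voltage, double freq_mhz)
  {
      return voltage * voltage * freq_mhz;
  }

  int main(void)
  {
      double ddr2 = rel_power(1.8, 400.0);   /* DDR2 voltage, some fixed frequency */
      double ddr3 = rel_power(1.5, 400.0);   /* DDR3 voltage, same frequency       */

      printf("DDR3/DDR2 power ratio: %.2f\n", ddr3 / ddr2);   /* (1.5/1.8)^2 ~= 0.69 */
      return 0;
  }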

 

The cell array of DDR3 modules will run at a quarter of the speed of the external bus which requires an 8 bit I/O buffer, up from 4 bits for DDR2. See Figure 2.14 for the schematics.

 

Figure 2.14: DDR3 SDRAM Operation

Initially DDR3 modules will likely have slightly higher CAS latencies just because the DDR2 technology is more mature. This would cause DDR3 to be useful only at frequencies which are higher than those which can be achieved with DDR2, and, even then, mostly when bandwidth is more important than latency. There is already talk about 1.3V modules which can achieve the same CAS latency as DDR2. In any case, the possibility of achieving higher speeds because of faster buses will outweigh the increased latency.

譯者信息

DDR3模塊的單元陣列將運行在外部總線速度的四分之一上,這要求DDR3的I/O緩衝區從DDR2的4位提高到8位。見圖2.14。

 
圖2.14: DDR3 SDRAM的操做

一開始,DDR3可能會有較高的CAS時延,由於DDR2的技術相比之下更爲成熟。所以,DDR3可能只會用在DDR2沒法達到的高頻率下,並且主要是帶寬比時延更重要的場景。此前已經有關於1.3V模塊的討論,它能夠達到與DDR2相同的CAS時延。不管如何,更快的總線所帶來的更高速度,其價值會超過時延增長帶來的影響。

 

One possible problem with DDR3 is that, for 1,600Mb/s transfer rate or higher, the number of modules per channel may be reduced to just one. In earlier versions this requirement held for all frequencies, so one can hope that the requirement will at some point be lifted for all frequencies. Otherwise the capacity of systems will be severely limited.

Figure 2.15 shows the names of the expected DDR3 modules. JEDEC agreed so far on the first four types. Given that Intel's 45nm processors have an FSB speed of 1,600Mb/s, the 1,866Mb/s is needed for the overclocking market. We will likely see more of this towards the end of the DDR3 lifecycle.

Array Freq.   Bus Freq.   Data Rate     Name (Rate)   Name (FSB)
100MHz 400MHz 6,400MB/s PC3-6400 DDR3-800
133MHz 533MHz 8,512MB/s PC3-8500 DDR3-1066
166MHz 667MHz 10,667MB/s PC3-10667 DDR3-1333
200MHz 800MHz 12,800MB/s PC3-12800 DDR3-1600
233MHz 933MHz 14,933MB/s PC3-14900 DDR3-1866

Figure 2.15: DDR3 Module Names

 

譯者信息

DDR3可能會有一個問題,即在1,600Mb/s或更高速率下,每一個通道的模塊數可能會被限制爲1。在早期版本中,這一限制對全部頻率都成立,因此咱們但願它往後能在全部頻率上被取消,不然系統容量將會受到嚴重的限制。

圖2.15顯示了咱們預計中各DDR3模塊的名稱。JEDEC目前贊成了前四種。因爲Intel的45nm處理器的FSB速度是1,600Mb/s,1,866Mb/s主要面向超頻市場。在DDR3生命週期接近尾聲時,咱們極可能會看到更多這類型號。

陣列頻率   總線頻率   數據速率     名稱(速率)   名稱(FSB)
100MHz 400MHz 6,400MB/s PC3-6400 DDR3-800
133MHz 533MHz 8,512MB/s PC3-8500 DDR3-1066
166MHz 667MHz 10,667MB/s PC3-10667 DDR3-1333
200MHz 800MHz 12,800MB/s PC3-12800 DDR3-1600
233MHz 933MHz 14,933MB/s PC3-14900 DDR3-1866

圖2.15: DDR3模塊名
All DDR memory has one problem: the increased bus frequency makes it hard to create parallel data busses. A DDR2 module has 240 pins. All connections to data and address pins must be routed so that they have approximately the same length. Even more of a problem is that, if more than one DDR module is to be daisy-chained on the same bus, the signals get more and more distorted for each additional module. The DDR2 specification allows only two modules per bus (aka channel), the DDR3 specification only one module for high frequencies. With 240 pins per channel a single Northbridge cannot reasonably drive more than two channels. The alternative is to have external memory controllers (as in Figure 2.2) but this is expensive.

 

What this means is that commodity motherboards are restricted to hold at most four DDR2 or DDR3 modules. This restriction severely limits the amount of memory a system can have. Even old 32-bit IA-32 processors can handle 64GB of RAM and memory demand even for home use is growing, so something has to be done.

譯者信息

全部的DDR內存都有一個問題:不斷增加的頻率使得創建並行數據總線變得十分困難。一個DDR2模塊有240根引腳。全部到地址和數據引腳的連線必須被佈置得差很少同樣長。更大的問題是,若是多於一個DDR模塊經過菊花鏈鏈接在同一個總線上,每一個模塊所接收到的信號隨着模塊的增長會變得愈來愈扭曲。DDR2規範容許每條總線(又稱通道)鏈接最多兩個模塊,DDR3在高頻率下只容許每一個通道鏈接一個模塊。每條通道多達240根引腳使得單個北橋沒法以合理的方式驅動兩個以上的通道。替代方案是增長外部內存控制器(如圖2.2),但這會提升成本。

這意味着商品主板所搭載的DDR2或DDR3模塊數將被限制在最多四條,這嚴重限制了系統的最大內存容量。即便是老舊的32位IA-32處理器也能夠使用64GB內存。即便是家庭對內存的需求也在不斷增加,因此,某些事必須開始作了。

 

One answer is to add memory controllers into each processor as explained in Section 2. AMD does it with the Opteron line and Intel will do it with their CSI technology. This will help as long as the reasonable amount of memory a processor is able to use can be connected to a single processor. In some situations this is not the case and this setup will introduce a NUMA architecture and its negative effects. For some situations another solution is needed.

Intel's answer to this problem for big server machines, at least for the next years, is called Fully Buffered DRAM (FB-DRAM). The FB-DRAM modules use the same components as today's DDR2 modules which makes them relatively cheap to produce. The difference is in the connection with the memory controller. Instead of a parallel data bus FB-DRAM utilizes a serial bus (Rambus DRAM had this back when, too, and SATA is the successor of PATA, as is PCI Express for PCI/AGP). The serial bus can be driven at a much higher frequency, reverting the negative impact of the serialization and even increasing the bandwidth. The main effects of using a serial bus are

  1. more modules per channel can be used.
  2. more channels per Northbridge/memory controller can be used.
  3. the serial bus is designed to be fully-duplex (two lines).

 

譯者信息

一種解法是,在處理器中加入內存控制器,咱們在第2節中曾經介紹過。AMD的Opteron系列和Intel的CSI技術就是採用這種方法。只要咱們能把處理器要求的內存鏈接處處理器上,這種解法就是有效的。若是不能,按照這種思路就會引入NUMA架構,固然同時也會引入它的缺點。而在有些狀況下,咱們須要其它解法。

Intel針對大型服務器方面的解法(至少在將來幾年),是被稱爲全緩衝DRAM(FB-DRAM)的技術。FB-DRAM採用與今天的DDR2模塊相同的器件,所以造價相對低廉。不一樣之處在於它們與內存控制器的鏈接方式。FB-DRAM沒有用並行總線,而用了串行總線(當年的Rambus DRAM也用過串行總線,而SATA是PATA的繼任者,就像PCI Express取代PCI/AGP同樣)。串行總線能夠工做在高得多的頻率上,從而抵消串行化帶來的負面影響,甚至還能增長帶寬。使用串行總線後:

  1. 每一個通道能夠使用更多的模塊。
  2. 每一個北橋/內存控制器能夠使用更多的通道。
  3. 串行總線是全雙工的(兩條線)。
An FB-DRAM module has only 69 pins, compared with the 240 for DDR2. Daisy chaining FB-DRAM modules is much easier since the electrical effects of the bus can be handled much better. The FB-DRAM specification allows up to 8 DRAM modules per channel.

 

Compared with the connectivity requirements of a dual-channel Northbridge it is now possible to drive 6 channels of FB-DRAM with fewer pins: 2×240 pins versus 6×69 pins. The routing for each channel is much simpler which could also help reducing the cost of the motherboards.

Fully duplex parallel busses are prohibitively expensive for the traditional DRAM modules, duplicating all those lines is too costly. With serial lines (even if they are differential, as FB-DRAM requires) this is not the case and so the serial bus is designed to be fully duplexed, which means, in some situations, that the bandwidth is theoretically doubled alone by this. But it is not the only place where parallelism is used for bandwidth increase. Since an FB-DRAM controller can run up to six channels at the same time the bandwidth can be increased even for systems with smaller amounts of RAM by using FB-DRAM. Where a DDR2 system with four modules has two channels, the same capacity can be handled via four channels using an ordinary FB-DRAM controller. The actual bandwidth of the serial bus depends on the type of DDR2 (or DDR3) chips used on the FB-DRAM module.

譯者信息

FB-DRAM只有69個腳。經過菊花鏈方式鏈接多個FB-DRAM也很簡單。FB-DRAM規範容許每一個通道鏈接最多8個模塊。

與雙通道北橋的連線需求相比,採用FB-DRAM後,北橋能夠用更少的引腳驅動6個通道:6x69對比2x240。每一個通道的佈線也更爲簡單,有助於下降主板的成本。

全雙工的並行總線過於昂貴。而換成串行線後,這再也不是一個問題,所以串行總線按全雙工來設計的,這也意味着,在某些狀況下,僅靠這一特性,總線的理論帶寬已經翻了一倍。還不止於此。因爲FB-DRAM控制器可同時鏈接6個通道,所以能夠利用它來增長某些小內存系統的帶寬。對於一個雙通道、4模塊的DDR2系統,咱們能夠用一個普通FB-DRAM控制器,用4通道來實現相同的容量。串行總線的實際帶寬取決於在FB-DRAM模塊中所使用的DDR2(或DDR3)芯片的類型。

 

We can summarize the advantages like this:

                DDR2       FB-DRAM
Pins            240        69
Channels        2          6
DIMMs/Channel   2          8
Max Memory      16GB       192GB
Throughput      ~10GB/s    ~40GB/s

There are a few drawbacks to FB-DRAMs if multiple DIMMs on one channel are used. The signal is delayed—albeit minimally—at each DIMM in the chain, which means the latency increases. But for the same amount of memory with the same frequency FB-DRAM can always be faster than DDR2 and DDR3 since only one DIMM per channel is needed; for large memory systems DDR simply has no answer using commodity components.

 

譯者信息
咱們能夠像這樣總結這些優點:

                DDR2       FB-DRAM
引腳            240        69
通道            2          6
每通道DIMM數    2          8
最大內存        16GB       192GB
吞吐量          ~10GB/s    ~40GB/s
若是在單個通道上使用多個DIMM,會有一些問題。信號在每一個DIMM上都會有延遲(儘管很小),也就是說,延遲是遞增的。不過,若是在相同頻率和相同容量上進行比較,FB-DRAM老是能快過DDR2及DDR3,由於FB-DRAM只須要在每一個通道上使用一個DIMM便可。而若是說到大型內存系統,那麼DDR更是沒有商用組件的解決方案。
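The Max Memory row of the table is simply channels × DIMMs per channel × DIMM capacity; a quick sketch of that multiplication (mine, assuming 4GB DIMMs, a figure the text does not state):

  #include <stdio.h>

  int main(void)
  {
      const unsigned dimm_gb = 4;   /* assumed capacity of a single DIMM */

      unsigned ddr2_gb = 2 /* channels */ * 2 /* DIMMs per channel */ * dimm_gb;
      unsigned fb_gb   = 6 /* channels */ * 8 /* DIMMs per channel */ * dimm_gb;

      printf("DDR2:    %u GB\n", ddr2_gb);   /* 16GB  */
      printf("FB-DRAM: %u GB\n", fb_gb);     /* 192GB */
      return 0;
  }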

 

2.2.5 Conclusions

This section should have shown that accessing DRAM is not an arbitrarily fast process. At least not fast compared with the speed the processor is running and with which it can access registers and cache. It is important to keep in mind the differences between CPU and memory frequencies. An Intel Core 2 processor running at 2.933GHz and a 1.066GHz FSB have a clock ratio of 11:1 (note: the 1.066GHz bus is quad-pumped). Each stall of one cycle on the memory bus means a stall of 11 cycles for the processor. For most machines the actual DRAMs used are slower, thusly increasing the delay. Keep these numbers in mind when we are talking about stalls in the upcoming sections.

The timing charts for the read command have shown that DRAM modules are capable of high sustained data rates. Entire DRAM rows could be transported without a single stall. The data bus could be kept occupied 100%. For DDR modules this means two 64-bit words transferred each cycle. With DDR2-800 modules and two channels this means a rate of 12.8GB/s.

But, unless designed this way, DRAM access is not always sequential. Non-continuous memory regions are used which means precharging and new RAS signals are needed. This is when things slow down and when the DRAM modules need help. The sooner the precharging can happen and the RAS signal sent the smaller the penalty when the row is actually used.

Hardware and software prefetching (see Section 6.3) can be used to create more overlap in the timing and reduce the stall. Prefetching also helps shift memory operations in time so that there is less contention at later times, right before the data is actually needed. This is a frequent problem when the data produced in one round has to be stored and the data required for the next round has to be read. By shifting the read in time, the write and read operations do not have to be issued at basically the same time.

譯者信息

2.2.5 結論

經過本節,你們應該瞭解到訪問DRAM的過程並非一個快速的過程。至少與處理器的速度相比,或與處理器訪問寄存器及緩存的速度相比,DRAM的訪問不算快。你們還須要記住CPU和內存的頻率是不一樣的。運行在2.933GHz的Intel Core 2處理器與1.066GHz的FSB之間的時鐘比率是11:1(注: 1.066GHz的總線爲四泵總線)。那麼,內存總線上延遲一個週期意味着處理器延遲11個週期。絕大多數機器使用的DRAM更慢,所以延遲更大。在後續的章節中,當咱們討論延遲這個問題時,請把以上的數字記在內心。

前文中讀命令的時序圖代表,DRAM模塊能夠支持高速數據傳輸。每一個完整行能夠被毫無延遲地傳輸。數據總線能夠100%被佔。對DDR而言,意味着每一個週期傳輸2個64位字。對於DDR2-800模塊和雙通道而言,意味着12.8GB/s的速率。

可是,除非是特殊設計,DRAM的訪問並不老是順序的。訪問不連續的內存區意味着須要預充電和新的RAS信號。因而,各類速度開始慢下來,DRAM模塊急需幫助。預充電和RAS信號發出得越早,真正用到該行時所受的懲罰就越小。

硬件和軟件的預取(參見第6.3節)能夠在時序中製造更多的重疊區,下降延遲。預取還能夠轉移內存操做的時間,從而減小爭用。咱們經常遇到的問題是,在這一輪中生成的數據須要被存儲,而下一輪的數據須要被讀出來。經過轉移讀取的時間,讀和寫就不須要同時發出了。
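Both numbers quoted in these conclusions can be reproduced with a few lines of arithmetic; the sketch below (my own, not from the paper) derives the 11:1 ratio and the 12.8GB/s figure:

  #include <stdio.h>

  int main(void)
  {
      /* 2.933GHz core and a quad-pumped 1.066GHz FSB, i.e. a 266.5MHz bus clock */
      double core_mhz = 2933.0;
      double bus_mhz  = 1066.0 / 4.0;
      printf("core cycles per bus cycle: %.1f\n", core_mhz / bus_mhz);   /* ~11.0 */

      /* DDR2-800: 800M transfers/s, 8 bytes each, two channels */
      double gb_per_s = 800e6 * 8.0 * 2.0 / 1e9;
      printf("dual-channel DDR2-800: %.1f GB/s\n", gb_per_s);            /* 12.8  */
      return 0;
  }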

 

2.3 Other Main Memory Users

Beside the CPUs there are other system components which can access the main memory. High-performance cards such as network and mass-storage controllers cannot afford to pipe all the data they need or provide through the CPU. Instead, they read or write the data directly from/to the main memory (Direct Memory Access, DMA). In Figure 2.1 we can see that the cards can talk through the South- and Northbridge directly with the memory. Other buses, like USB, also require FSB bandwidth—even though they do not use DMA—since the Southbridge is connected to the Northbridge through the FSB, too.

While DMA is certainly beneficial, it means that there is more competition for the FSB bandwidth. In times with high DMA traffic the CPU might stall more than usual while waiting for data from the main memory. There are ways around this given the right hardware. With an architecture as in Figure 2.3 one can make sure the computation uses memory on nodes which are not affected by DMA. It is also possible to attach a Southbridge to each node, equally distributing the load on the FSB of all the nodes. There are a myriad of possibilities. In Section 6 we will introduce techniques and programming interfaces which help achieving the improvements which are possible in software.

譯者信息

2.3 主存的其它用戶

除了CPU外,系統中還有其它一些組件也能夠訪問主存。高性能網卡或大規模存儲控制器是沒法承受經過CPU來傳輸數據的,它們通常直接對內存進行讀寫(直接內存訪問,DMA)。在圖2.1中能夠看到,它們能夠經過南橋和北橋直接訪問內存。另外,其它總線,好比USB等也須要FSB帶寬,即便它們並不使用DMA,但南橋仍要經過FSB鏈接到北橋。

DMA固然有很大的優勢,但也意味着FSB帶寬會有更多的競爭。在有大量DMA流量的狀況下,CPU在訪問內存時必然會有更大的延遲。咱們能夠用一些硬件來解決這個問題。例如,經過圖2.3中的架構,咱們能夠挑選不受DMA影響的節點,讓它們的內存爲咱們的計算服務。還能夠在每一個節點上鍊接一個南橋,將FSB的負荷均勻地分擔到每一個節點上。除此之外,還有許多其它方法。咱們將在第6節中介紹一些技術和編程接口,它們可以幫助咱們經過軟件的方式改善這個問題。

 

Finally it should be mentioned that some cheap systems have graphics systems without separate, dedicated video RAM. Those systems use parts of the main memory as video RAM. Since access to the video RAM is frequent (for a 1024x768 display with 16 bpp at 60Hz we are talking 94MB/s) and system memory, unlike RAM on graphics cards, does not have two ports this can substantially influence the systems performance and especially the latency. It is best to ignore such systems when performance is a priority. They are more trouble than they are worth. People buying those machines know they will not get the best performance.

Continue to:

  • Part 2 (CPU caches)
  • Part 3 (Virtual memory)
  • Part 4 (NUMA systems)
  • Part 5 (What programmers can do - cache optimization)
  • Part 6 (What programmers can do - multi-threaded optimizations)
  • Part 7 (Memory performance tools)
  • Part 8 (Future technologies)
  • Part 9 (Appendices and bibliography)
譯者信息

最後,還須要提一下某些廉價系統,它們的圖形系統沒有專用的顯存,而是採用主存的一部分做爲顯存。因爲對顯存的訪問很是頻繁(例如,對於1024x768、16bpp、60Hz的顯示設置來講,須要94MB/s的數據速率),而主存不像顯卡上的顯存那樣擁有兩個端口,所以這種配置會對系統性能、尤爲是時延形成至關大的影響。若是你們對系統性能要求比較高,最好不要採用這種配置。這種系統帶來的問題超過了自己的價值。人們在購買它們時已經作好了性能不佳的心理準備。
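The 94MB/s figure for such a shared-memory graphics setup is just resolution × bytes per pixel × refresh rate; a one-line check (mine):

  #include <stdio.h>

  int main(void)
  {
      double bytes_per_sec = 1024.0 * 768.0 * 2.0 /* 16bpp */ * 60.0 /* Hz */;
      printf("scanout bandwidth: %.1f MB/s\n", bytes_per_sec / 1e6);   /* ~94.4 */
      return 0;
  }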

繼續閱讀:

 

Part 2 CPU Cache

 

CPUs are today much more sophisticated than they were only 25 years ago. In those days, the frequency of the CPU core was at a level equivalent to that of the memory bus. Memory access was only a bit slower than register access. But this changed dramatically in the early 90s, when CPU designers increased the frequency of the CPU core but the frequency of the memory bus and the performance of RAM chips did not increase proportionally. This is not due to the fact that faster RAM could not be built, as explained in the previous section. It is possible but it is not economical. RAM as fast as current CPU cores is orders of magnitude more expensive than any dynamic RAM.

譯者信息

如今的CPU比25年前要精密得多了。在那個年代,CPU的頻率與內存總線的頻率基本在同一層面上。內存的訪問速度僅比寄存器慢那麼一點點。可是,這一局面在上世紀90年代被打破了。CPU的頻率大大提高,但內存總線的頻率與內存芯片的性能卻沒有獲得成比例的提高。並非由於造不出更快的內存,只是由於太貴了。內存若是要達到目前CPU那樣的速度,那麼它的造價恐怕要貴上好幾個數量級。

 

If the choice is between a machine with very little, very fast RAM and a machine with a lot of relatively fast RAM, the second will always win given a working set size which exceeds the small RAM size and the cost of accessing secondary storage media such as hard drives. The problem here is the speed of secondary storage, usually hard disks, which must be used to hold the swapped out part of the working set. Accessing those disks is orders of magnitude slower than even DRAM access.

Fortunately it does not have to be an all-or-nothing decision. A computer can have a small amount of high-speed SRAM in addition to the large amount of DRAM. One possible implementation would be to dedicate a certain area of the address space of the processor as containing the SRAM and the rest the DRAM. The task of the operating system would then be to optimally distribute data to make use of the SRAM. Basically, the SRAM serves in this situation as an extension of the register set of the processor.

譯者信息

若是有兩個選項讓你選擇,一個是速度很是快、但容量很小的內存,一個是速度還算快、但容量不少的內存,若是你的工做集比較大,超過了前一種狀況,那麼人們老是會選擇第二個選項。緣由在於輔存(通常爲磁盤)的速度。因爲工做集超過主存,那麼必須用輔存來保存交換出去的那部分數據,而輔存的速度每每要比主存慢上好幾個數量級。

好在這問題也並不全然是非甲即乙的選擇。在配置大量DRAM的同時,咱們還能夠配置少許SRAM。將地址空間的某個部分劃給SRAM,剩下的部分劃給DRAM。通常來講,SRAM能夠看成擴展的寄存器來使用。

 

While this is a possible implementation, it is not viable. Ignoring the problem of mapping the physical resources of such SRAM-backed memory to the virtual address spaces of the processes (which by itself is terribly hard) this approach would require each process to administer in software the allocation of this memory region. The size of the memory region can vary from processor to processor (i.e., processors have different amounts of the expensive SRAM-backed memory). Each module which makes up part of a program will claim its share of the fast memory, which introduces additional costs through synchronization requirements. In short, the gains of having fast memory would be eaten up completely by the overhead of administering the resources.

譯者信息

上面的作法看起來彷佛可行,但實際上並不可行。且不說把這種以SRAM爲後備的內存映射到進程的虛擬地址空間自己就很是困難,這種作法還要求每一個進程用軟件來管理這塊內存區域的分配。內存區域的大小還會因處理器而異(也就是說,不一樣處理器所配備的昂貴SRAM內存容量不一樣)。組成程序的每一個模塊都要索取屬於自身的快速內存,這又因同步需求而引入了額外的開銷。簡而言之,快速內存帶來的好處徹底被管理這些資源的額外開銷給抵消了。

 

So, instead of putting the SRAM under the control of the OS or user, it becomes a resource which is transparently used and administered by the processors. In this mode, SRAM is used to make temporary copies of (to cache, in other words) data in main memory which is likely to be used soon by the processor. This is possible because program code and data has temporal and spatial locality. This means that, over short periods of time, there is a good chance that the same code or data gets reused. For code this means that there are most likely loops in the code so that the same code gets executed over and over again (the perfect case for spatial locality). Data accesses are also ideally limited to small regions. Even if the memory used over short time periods is not close together there is a high chance that the same data will be reused before long (temporal locality). For code this means, for instance, that in a loop a function call is made and that function is located elsewhere in the address space. The function may be distant in memory, but calls to that function will be close in time. For data it means that the total amount of memory used at one time (the working set size) is ideally limited but the memory used, as a result of the random access nature of RAM, is not close together. Realizing that locality exists is key to the concept of CPU caches as we use them today.

譯者信息

所以,SRAM是由處理器自動使用和管理的一種資源,而不是由OS或者用戶管理的。在這種模式下,SRAM用來臨時複製(換句話說,緩存)主存中有可能很快被處理器用到的數據。這之因此可行,是由於程序的代碼和數據具備時間局部性和空間局部性。也就是說,在較短的時間內,一樣的代碼或數據頗有可能被重複使用。對代碼而言,這意味着代碼中極可能存在循環,使得相同的代碼一次又一次地執行(空間局部性的絕佳例子)。理想狀況下,數據的訪問也侷限在一個小的區域內。即便短期內使用的內存並不相鄰,一樣的數據也頗有可能在不久以後被再次用到(時間局部性)。以代碼爲例,循環體內調用了一個位於地址空間中其它位置的函數,這個函數在內存中可能離得很遠,但對它的調用在時間上是接近的。對數據而言,這意味着某一時刻使用的內存總量(工做集大小)理想狀況下是有限的,但因爲RAM的隨機訪問特性,所使用的內存並不是緊挨在一塊兒。正是由於局部性的存在,咱們才提煉出今天的CPU緩存概念。

 

A simple computation can show how effective caches can theoretically be. Assume access to main memory takes 200 cycles and access to the cache memory take 15 cycles. Then code using 100 data elements 100 times each will spend 2,000,000 cycles on memory operations if there is no cache and only 168,500 cycles if all data can be cached. That is an improvement of 91.5%.

The size of the SRAM used for caches is many times smaller than the main memory. In the author's experience with workstations with CPU caches the cache size has always been around 1/1000th of the size of the main memory (today: 4MB cache and 4GB main memory). This alone does not constitute a problem. If the size of the working set (the set of data currently worked on) is smaller than the cache size it does not matter. But computers do not have large main memories for no reason. The working set is bound to be larger than the cache. This is especially true for systems running multiple processes where the size of the working set is the sum of the sizes of all the individual processes and the kernel.

譯者信息

咱們先用一個簡單的計算來展現一下高速緩存的效率。假設,訪問主存須要200個週期,而訪問高速緩存須要15個週期。若是使用100個數據元素100次,那麼在沒有高速緩存的狀況下,須要2000000個週期,而在有高速緩存、並且全部數據都已被緩存的狀況下,只須要168500個週期。節約了91.5%的時間。

用做高速緩存的SRAM容量比主存小得多。以個人經驗來講,高速緩存的大小通常是主存的千分之一左右(目前通常是4GB主存、4MB緩存)。這一點自己並非什麼問題。只是,計算機通常都會有比較大的主存,所以工做集的大小老是會大於緩存。特別是那些運行多進程的系統,它的工做集大小是全部進程加上內核的總和。
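The 168,500-cycle figure above comes from counting one miss plus 99 cache hits per element; a tiny sketch of that arithmetic (mine, not from the paper):

  #include <stdio.h>

  int main(void)
  {
      const long elements = 100, uses = 100;
      const long mem_cycles = 200, cache_cycles = 15;

      long no_cache   = elements * uses * mem_cycles;              /* every access goes to memory      */
      long with_cache = elements * mem_cycles                      /* first use of each element: miss  */
                      + elements * (uses - 1) * cache_cycles;      /* remaining 99 uses: hits          */

      printf("no cache:   %ld cycles\n", no_cache);                /* 2,000,000 */
      printf("with cache: %ld cycles\n", with_cache);              /*   168,500 */
      printf("saving:     %.3f%%\n",
             100.0 * (no_cache - with_cache) / no_cache);          /* ~91.5%    */
      return 0;
  }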

 

What is needed to deal with the limited size of the cache is a set of good strategies to determine what should be cached at any given time. Since not all data of the working set is used at exactly the same time we can use techniques to temporarily replace some data in the cache with other data. And maybe this can be done before the data is actually needed. This prefetching would remove some of the costs of accessing main memory since it happens asynchronously with respect to the execution of the program. All these techniques and more can be used to make the cache appear bigger than it actually is. We will discuss them in Section 3.3. Once all these techniques are exploited it is up to the programmer to help the processor. How this can be done will be discussed in Section 6.

譯者信息

要應對高速緩存容量有限的問題,就須要制定一套好的策略,來決定在某一時刻什麼數據應該被緩存。因爲工做集中的數據並非全都在同一時刻被使用,咱們能夠用一些技術把緩存中的部分數據臨時替換成其它數據,並且這也許能夠在數據真正被用到以前就完成。這種預取因爲與程序的執行是異步的,因此能夠減小一部分訪問主存的代價。全部這些技術加在一塊兒,可使高速緩存看起來比實際更大。咱們將在3.3節討論這些技術。用足這些技術以後,剩下的就要靠程序員來幫助處理器了,如何作到這一點將在第6節討論。

 

3.1 CPU Caches in the Big Picture

Before diving into technical details of the implementation of CPU caches some readers might find it useful to first see in some more details how caches fit into the 「big picture」 of a modern computer system.

 

Figure 3.1: Minimum Cache Configuration

Figure 3.1 shows the minimum cache configuration. It corresponds to the architecture which could be found in early systems which deployed CPU caches. The CPU core is no longer directly connected to the main memory. {In even earlier systems the cache was attached to the system bus just like the CPU and the main memory. This was more a hack than a real solution.} All loads and stores have to go through the cache. The connection between the CPU core and the cache is a special, fast connection. In a simplified representation, the main memory and the cache are connected to the system bus which can also be used for communication with other components of the system. We introduced the system bus as 「FSB」 which is the name in use today; see Section 2.2. In this section we ignore the Northbridge; it is assumed to be present to facilitate the communication of the CPU(s) with the main memory.

譯者信息

3.1 高速緩存的位置

在深刻介紹高速緩存的技術細節以前,有必要說明一下它在現代計算機系統中所處的位置。

 
圖3.1: 最簡單的高速緩存配置圖

圖3.1展現了最簡單的高速緩存配置。早期的一些系統就是相似的架構。在這種架構中,CPU核心再也不直連到主存。{在一些更早的系統中,高速緩存像CPU與主存同樣連到系統總線上。那種作法更像是一種hack,而不是真正的解決方案。}數據的讀取和存儲都通過高速緩存。CPU核心與高速緩存之間是一條特殊的快速通道。在簡化的表示法中,主存與高速緩存都連到系統總線上,這條總線同時還用於與其它組件通訊。咱們管這條總線叫「FSB」——就是如今稱呼它的術語,參見第2.2節。在這一節裏,咱們將忽略北橋。

 

Even though computers for the last several decades have used the von Neumann architecture, experience has shown that it is of advantage to separate the caches used for code and for data. Intel has used separate code and data caches since 1993 and never looked back. The memory regions needed for code and data are pretty much independent of each other, which is why independent caches work better. In recent years another advantage emerged: the instruction decoding step for the most common processors is slow; caching decoded instructions can speed up the execution, especially when the pipeline is empty due to incorrectly predicted or impossible-to-predict branches.

譯者信息

在過去的幾十年裏,雖然計算機一直使用馮諾伊曼結構,但經驗代表,把用於代碼和用於數據的高速緩存分開是頗有好處的。自1993年以來,Intel一直使用獨立的代碼和數據高速緩存,從未改變。因爲代碼和數據所需的內存區域幾乎是相互獨立的,因此獨立的緩存工做得更好。近年來,獨立緩存的另外一個優點也顯現出來:對大多數常見的處理器來講,指令解碼這一步是比較慢的;把解碼後的指令緩存起來能夠加快執行速度,尤爲是在因爲分支預測錯誤或沒法預測而致使管線爲空的時候。

 

Soon after the introduction of the cache, the system got more complicated. The speed difference between the cache and the main memory increased again, to a point that another level of cache was added, bigger and slower than the first-level cache. Only increasing the size of the first-level cache was not an option for economical reasons. Today, there are even machines with three levels of cache in regular use. A system with such a processor looks like Figure 3.2. With the increase on the number of cores in a single CPU the number of cache levels might increase in the future even more.

Figure 3.2: Processor with Level 3 Cache

Figure 3.2 shows three levels of cache and introduces the nomenclature we will use in the remainder of the document. L1d is the level 1 data cache, L1i the level 1 instruction cache, etc. Note that this is a schematic; the data flow in reality need not pass through any of the higher-level caches on the way from the core to the main memory. CPU designers have a lot of freedom designing the interfaces of the caches. For programmers these design choices are invisible.

譯者信息

在高速緩存出現後不久,系統變得更加複雜。高速緩存與主存之間的速度差別進一步拉大,直到加入了另外一級緩存。新加入的這一級緩存比第一級緩存更大,可是更慢。因爲加大一級緩存的作法從經濟上考慮是行不通的,因此有了二級緩存,甚至如今的有些系統擁有三級緩存,如圖3.2所示。隨着單個CPU中核數的增長,將來甚至可能會出現更多層級的緩存。

 
圖3.2: 三級緩存的處理器

圖3.2展現了三級緩存,並介紹了本文將使用的一些術語。L1d是一級數據緩存,L1i是一級指令緩存,等等。請注意,這只是示意圖,真正的數據流並不須要流經上級緩存。CPU的設計者們在設計高速緩存的接口時擁有很大的自由。而程序員是看不到這些設計選項的。

 

In addition we have processors which have multiple cores and each core can have multiple 「threads」. The difference between a core and a thread is that separate cores have separate copies of (almost {Early multi-core processors even had separate 2nd level caches and no 3rd level cache.}) all the hardware resources. The cores can run completely independently unless they are using the same resources—e.g., the connections to the outside—at the same time. Threads, on the other hand, share almost all of the processor's resources. Intel's implementation of threads has only separate registers for the threads and even that is limited, some registers are shared. The complete picture for a modern CPU therefore looks like Figure 3.3.

Figure 3.3: Multi processor, multi-core, multi-thread

In this figure we have two processors, each with two cores, each of which has two threads. The threads share the Level 1 caches. The cores (shaded in the darker gray) have individual Level 1 caches. All cores of the CPU share the higher-level caches. The two processors (the two big boxes shaded in the lighter gray) of course do not share any caches. All this will be important, especially when we are discussing the cache effects on multi-process and multi-thread applications.

譯者信息

另外,咱們有多核CPU,每一個核心能夠有多個「線程」。核心與線程的不一樣之處在於,核心擁有獨立的硬件資源({早期的多核CPU甚至有獨立的二級緩存。})。在不一樣時使用相同資源(好比,通往外界的鏈接)的狀況下,核心能夠徹底獨立地運行。而線程只是共享資源。Intel的線程只有獨立的寄存器,並且還有限制——不是全部寄存器都獨立,有些是共享的。綜上,現代CPU的結構就像圖3.3所示。

 
圖3.3 多處理器、多核心、多線程

在上圖中,有兩個處理器,每一個處理器有兩個核心,每一個核心有兩個線程。線程們共享一級緩存。核心(以深灰色表示)有各自獨立的一級緩存,而同一個CPU的全部核心共享更高級別的緩存。兩個處理器(淡灰色的兩個大方框)之間不共享任何緩存。這些信息很重要,特別是在討論多進程和多線程狀況下緩存的影響時尤爲重要。

 

3.2 Cache Operation at High Level

To understand the costs and savings of using a cache we have to combine the knowledge about the machine architecture and RAM technology from Section 2 with the structure of caches described in the previous section.

By default all data read or written by the CPU cores is stored in the cache. There are memory regions which cannot be cached but this is something only the OS implementers have to be concerned about; it is not visible to the application programmer. There are also instructions which allow the programmer to deliberately bypass certain caches. This will be discussed in Section 6.

譯者信息

3.2 高層次的緩存操做

要理解使用緩存的成本與收益,咱們必須把第2節中關於機器體系結構和RAM技術的知識,與上一節所描述的緩存結構結合起來。

默認狀況下,CPU核心所讀寫的全部數據都會存儲在緩存中。固然,也有一些內存區域是沒法被緩存的,但這只是操做系統的實現者才須要關心的事情,對應用程序員來講是不可見的。另外,還有一些指令容許程序員有意地繞過某些緩存,這將在第6節中討論。

 

If the CPU needs a data word the caches are searched first. Obviously, the cache cannot contain the content of the entire main memory (otherwise we would need no cache), but since all memory addresses are cacheable, each cache entry is tagged using the address of the data word in the main memory. This way a request to read or write to an address can search the caches for a matching tag. The address in this context can be either the virtual or physical address, varying based on the cache implementation.

Since the tag requires space in addition to the actual memory, it is inefficient to chose a word as the granularity of the cache. For a 32-bit word on an x86 machine the tag itself might need 32 bits or more. Furthermore, since spatial locality is one of the principles on which caches are based, it would be bad to not take this into account. Since neighboring memory is likely to be used together it should also be loaded into the cache together. Remember also what we learned in Section 2.2.1: RAM modules are much more effective if they can transport many data words in a row without a new CAS or even RAS signal. So the entries stored in the caches are not single words but, instead, 「lines」 of several contiguous words. In early caches these lines were 32 bytes long; now the norm is 64 bytes. If the memory bus is 64 bits wide this means 8 transfers per cache line. DDR supports this transport mode efficiently.

譯者信息

若是CPU須要訪問某個字(word),先檢索緩存。很顯然,緩存不可能容納主存全部內容(不然還須要主存幹嗎)。系統用字的內存地址來對緩存條目進行標記。若是須要讀寫某個地址的字,那麼根據標籤來檢索緩存便可。這裏用到的地址能夠是虛擬地址,也能夠是物理地址,取決於緩存的具體實現。

標籤是須要額外空間的,用字做爲緩存的粒度顯然毫無效率。好比,在x86機器上,32位字的標籤可能須要32位,甚至更長。另外一方面,因爲空間局部性的存在,與當前地址相鄰的地址有很大可能會被一塊兒訪問。再回憶下2.2.1節——內存模塊在傳輸位於同一行上的多份數據時,因爲不須要發送新CAS信號,甚至不須要發送RAS信號,所以能夠實現很高的效率。基於以上的緣由,緩存條目並不存儲單個字,而是存儲若干連續字組成的「線」。在早期的緩存中,線長是32字節,如今通常是64字節。對於64位寬的內存總線,每條線須要8次傳輸。而DDR對於這種傳輸模式的支持更爲高效。

 

When memory content is needed by the processor the entire cache line is loaded into the L1d. The memory address for each cache line is computed by masking the address value according to the cache line size. For a 64 byte cache line this means the low 6 bits are zeroed. The discarded bits are used as the offset into the cache line. The remaining bits are in some cases used to locate the line in the cache and as the tag. In practice an address value is split into three parts. For a 32-bit address it might look as follows:

With a cache line size of 2^O the low O bits are used as the offset into the cache line. The next S bits select the 「cache set」. We will go into more detail soon on why sets, and not single slots, are used for cache lines. For now it is sufficient to understand there are 2^S sets of cache lines. This leaves the top 32 - S - O = T bits which form the tag. These T bits are the value associated with each cache line to distinguish all the aliases {All cache lines with the same S part of the address are known by the same alias.} which are cached in the same cache set. The S bits used to address the cache set do not have to be stored since they are the same for all cache lines in the same set.

譯者信息

當處理器須要內存中的某塊數據時,整條緩存線被裝入L1d。緩存線的地址經過對內存地址進行掩碼操做生成。對於64字節的緩存線,是將低6位置0。這些被丟棄的位做爲線內偏移量。其它的位做爲標籤,並用於在緩存內定位。在實踐中,咱們將地址分爲三個部分。32位地址的狀況以下:

若是緩存線長度爲2^O,那麼地址的低O位用做線內偏移量。接下來的S位用來選擇「緩存集」。後面咱們會詳細說明爲何緩存線使用的是集,而不是單個槽位,如今只須要明白一共有2^S個緩存集就夠了。剩下的32 - S - O = T位組成標籤。這T位是與每條緩存線關聯的值,用來區分緩存在同一個緩存集中的全部「別名」{地址中S部分相同的全部緩存線被稱爲有相同的別名。}。用於尋址緩存集的S位不須要存儲,由於同一集中全部緩存線的S部分都是相同的。
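A minimal sketch of this split (the parameters are assumptions for illustration: 64-byte lines, 8,192 sets, 32-bit addresses; real caches differ):

  #include <stdio.h>
  #include <stdint.h>

  #define LINE_BITS 6    /* O: 2^6  = 64-byte cache lines */
  #define SET_BITS  13   /* S: 2^13 = 8,192 cache sets    */

  int main(void)
  {
      uint32_t addr   = 0x12345678;
      uint32_t offset = addr & ((1u << LINE_BITS) - 1);               /* low O bits  */
      uint32_t set    = (addr >> LINE_BITS) & ((1u << SET_BITS) - 1); /* next S bits */
      uint32_t tag    = addr >> (LINE_BITS + SET_BITS);               /* top T bits  */

      printf("address 0x%08x -> tag 0x%x, set %u, offset %u\n",
             (unsigned)addr, (unsigned)tag, (unsigned)set, (unsigned)offset);
      return 0;
  }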

 

When an instruction modifies memory the processor still has to load a cache line first because no instruction modifies an entire cache line at once (exception to the rule: write-combining as explained in Section 6.1). The content of the cache line before the write operation therefore has to be loaded. It is not possible for a cache to hold partial cache lines. A cache line which has been written to and which has not been written back to main memory is said to be 「dirty」. Once it is written the dirty flag is cleared.

To be able to load new data in a cache it is almost always first necessary to make room in the cache. An eviction from L1d pushes the cache line down into L2 (which uses the same cache line size). This of course means room has to be made in L2. This in turn might push the content into L3 and ultimately into main memory. Each eviction is progressively more expensive. What is described here is the model for an exclusive cache as is preferred by modern AMD and VIA processors. Intel implements inclusive caches {This generalization is not completely correct. A few caches are exclusive and some inclusive caches have exclusive cache properties.} where each cache line in L1d is also present in L2. Therefore evicting from L1d is much faster. With enough L2 cache the disadvantage of wasting memory for content held in two places is minimal and it pays off when evicting. A possible advantage of an exclusive cache is that loading a new cache line only has to touch the L1d and not the L2, which could be faster.

譯者信息

當某條指令修改內存時,仍然要先裝入緩存線,由於任何指令都不可能同時修改整條線(只有一個例外——第6.1節中將會介紹的寫合併(write-combine))。所以須要在寫操做前先把緩存線裝載進來。若是緩存線被寫入,但尚未寫回主存,那就是所謂的「髒了」。髒了的線一旦寫回主存,髒標記即被清除。

爲了裝入新數據,基本上老是要先在緩存中清理出位置。L1d將內容逐出L1d,推入L2(線長相同)。固然,L2也須要清理位置。因而L2將內容推入L3,最後L3將它推入主存。這種逐出操做一級比一級昂貴。這裏所說的是現代AMD和VIA處理器所採用的獨佔型緩存(exclusive cache)。而Intel採用的是包容型緩存(inclusive cache),{並不徹底正確,Intel有些緩存是獨佔型的,還有一些緩存具備獨佔型緩存的特色。}L1d的每條線同時存在於L2裏。對這種緩存,逐出操做就很快了。若是有足夠L2,對於相同內容存在不一樣地方形成內存浪費的缺點能夠降到最低,並且在逐出時很是有利。而獨佔型緩存在裝載新數據時只須要操做L1d,不須要碰L2,所以會比較快。

 

The CPUs are allowed to manage the caches as they like as long as the memory model defined for the processor architecture is not changed. It is, for instance, perfectly fine for a processor to take advantage of little or no memory bus activity and proactively write dirty cache lines back to main memory. The wide variety of cache architectures among the processors for the x86 and x86-64, between manufacturers and even within the models of the same manufacturer, are testament to the power of the memory model abstraction.

In symmetric multi-processor (SMP) systems the caches of the CPUs cannot work independently from each other. All processors are supposed to see the same memory content at all times. The maintenance of this uniform view of memory is called 「cache coherency」. If a processor were to look simply at its own caches and main memory it would not see the content of dirty cache lines in other processors. Providing direct access to the caches of one processor from another processor would be terribly expensive and a huge bottleneck. Instead, processors detect when another processor wants to read or write to a certain cache line.

譯者信息

只要不改變處理器體系結構所定義的內存模型,CPU就能夠按照本身喜歡的方式來管理高速緩存。例如,處理器徹底能夠利用內存總線空閒的時機,主動把髒的高速緩存行寫回主存。x86和x86-64處理器之間、各製造商之間、甚至同一製造商的不一樣型號之間,緩存架構五花八門,這恰恰證實了內存模型這一抽象的威力。

在對稱多處理器(SMP)系統中,各CPU的高速緩存不能各自獨立地工做:在任何時候,全部的處理器都應該看到相同的內存內容。維護這種統一的內存視圖被稱爲「高速緩存一致性」。若是處理器只是簡單地查看本身的高速緩存和主存,它是看不到其它處理器中髒緩存行的內容的。而讓一個處理器直接訪問另外一個處理器的高速緩存,代價會很是昂貴,也會成爲巨大的瓶頸。所以,處理器採用的辦法是偵測其它處理器什麼時候要讀取或寫入某條緩存線。

 

If a write access is detected and the processor has a clean copy of the cache line in its cache, this cache line is marked invalid. Future references will require the cache line to be reloaded. Note that a read access on another CPU does not necessitate an invalidation, multiple clean copies can very well be kept around.

More sophisticated cache implementations allow another possibility to happen. If the cache line which another processor wants to read from or write to is currently marked dirty in the first processor's cache a different course of action is needed. In this case the main memory is out-of-date and the requesting processor must, instead, get the cache line content from the first processor. Through snooping, the first processor notices this situation and automatically sends the requesting processor the data. This action bypasses main memory, though in some implementations the memory controller is supposed to notice this direct transfer and store the updated cache line content in main memory. If the access is for writing the first processor then invalidates its copy of the local cache line.

譯者信息

若是偵測到一個寫訪問,並且本處理器的緩存中保存着該緩存線的一個乾淨副本,那麼這個副本就會被標記爲無效。之後再引用這條緩存線時,須要從新加載。須要注意的是,另外一個CPU上的讀訪問並不須要這種無效化操做,多個乾淨副本徹底能夠同時存在。

更複雜的緩存實現還要考慮另外一種可能性:若是另外一個處理器想要讀寫的緩存線,在第一個處理器的緩存中恰好被標記爲髒,就須要不一樣的處理方式。這種狀況下,主存中的內容已通過時,發出請求的處理器必須從第一個處理器那裏獲取緩存線的內容。第一個處理器經過嗅探(snooping)注意到這種狀況,並自動把數據發送給發出請求的處理器。這個動做繞過了主存,不過在某些實現中,存儲控制器應當注意到這種直接傳輸,並把更新後的緩存線內容寫入主存。若是這個訪問是寫操做,第一個處理器隨即把本地緩存線的副本標記爲無效。

 

Over time a number of cache coherency protocols have been developed. The most important is MESI, which we will introduce in Section 3.3.4. The outcome of all this can be summarized in a few simple rules:

  • A dirty cache line is not present in any other processor's cache.
  • Clean copies of the same cache line can reside in arbitrarily many caches.

If these rules can be maintained, processors can use their caches efficiently even in multi-processor systems. All the processors need to do is to monitor each others' write accesses and compare the addresses with those in their local caches. In the next section we will go into a few more details about the implementation and especially the costs.

譯者信息

隨着時間的推移,一大批緩存一致性協議已經創建。其中,最重要的是MESI,咱們將在第3.3.4節進行介紹。以上結論能夠歸納爲幾個簡單的規則: 
  • 一個髒緩存線不存在於任何其餘處理器的緩存之中。
  • 同一緩存線中的乾淨拷貝能夠駐留在任意多個其餘緩存之中。
若是遵照這些規則,處理器甚至能夠在多處理器系統中更加有效的使用它們的緩存。全部的處理器須要作的就是監控其餘每個寫訪問和比較本地緩存中的地址。在下一節中,咱們將介紹更多細節方面的實現,尤爲是存儲開銷方面的細節。 

 

Finally, we should at least give an impression of the costs associated with cache hits and misses. These are the numbers Intel lists for a Pentium M:

To Where Cycles
Register <= 1
L1d ~3
L2 ~14
Main Memory ~240

These are the actual access times measured in CPU cycles. It is interesting to note that for the on-die L2 cache a large part (probably even the majority) of the access time is caused by wire delays. This is a physical limitation which can only get worse with increasing cache sizes. Only process shrinking (for instance, going from 60nm for Merom to 45nm for Penryn in Intel's lineup) can improve those numbers.

譯者信息

最後,咱們至少應該關注高速緩存命中或未命中帶來的消耗。下面是英特爾奔騰 M 的數據:

To Where Cycles
Register <= 1
L1d ~3
L2 ~14
Main Memory ~240

 

這是以CPU週期計的實際訪問時間。有趣的是,對於片上的L2高速緩存,訪問時間的很大一部分(甚至多是大部分)是由線路延遲引發的。這是一個物理限制,並且隨着緩存尺寸的增大只會變得更糟。只有製程的縮小(例如,Intel產品線中從Merom的60納米到Penryn的45納米)才能改善這些數字。
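One way to get a feeling for these numbers is to combine them into an average access cost; in the sketch below only the cycle counts come from the table, while the hit rates are made-up values for illustration:

  #include <stdio.h>

  int main(void)
  {
      /* cycle counts from the Pentium M table above */
      const double l1d_cycles = 3, l2_cycles = 14, mem_cycles = 240;

      /* assumed fractions of loads served by each level (illustrative only) */
      const double p_l1d = 0.90, p_l2 = 0.07, p_mem = 0.03;

      double avg = p_l1d * l1d_cycles + p_l2 * l2_cycles + p_mem * mem_cycles;
      printf("average load cost: %.1f cycles\n", avg);   /* ~10.9 with these rates */
      return 0;
  }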

 

The numbers in the table look high but, fortunately, the entire cost does not have to be paid for each occurrence of the cache load and miss. Some parts of the cost can be hidden. Today's processors all use internal pipelines of different lengths where the instructions are decoded and prepared for execution. Part of the preparation is loading values from memory (or cache) if they are transferred to a register. If the memory load operation can be started early enough in the pipeline, it may happen in parallel with other operations and the entire cost of the load might be hidden. This is often possible for L1d; for some processors with long pipelines for L2 as well.

There are many obstacles to starting the memory read early. It might be as simple as not having sufficient resources for the memory access or it might be that the final address of the load becomes available late as the result of another instruction. In these cases the load costs cannot be hidden (completely).

譯者信息

表格中的數字看起來很高,可是,幸運的是,並非每次緩存加載和緩存失效都須要付出所有這些成本,其中一部分成本是能夠被隱藏的。如今的處理器內部都使用長短不一的管線,指令在管線中被解碼,併爲執行作準備。若是數據要被傳送到寄存器,那麼準備工做的一部分就是從內存(或高速緩存)加載數據。若是內存加載操做能在管線中足夠早地啓動,它就能夠與其它操做並行進行,這樣加載的所有開銷就可能被隱藏起來。對L1d來講這經常是可能的;對某些管線很長的處理器,L2也能夠作到。 

提前啓動內存讀取有許多障礙。有可能只是沒有足夠的資源來進行這次內存訪問,也有可能加載的最終地址要依賴另外一條指令的結果,很晚才能得到。在這些狀況下,加載的開銷就沒法被(徹底)隱藏。 

 

For write operations the CPU does not necessarily have to wait until the value is safely stored in memory. As long as the execution of the following instructions appears to have the same effect as if the value were stored in memory there is nothing which prevents the CPU from taking shortcuts. It can start executing the next instruction early. With the help of shadow registers which can hold values no longer available in a regular register it is even possible to change the value which is to be stored in the incomplete write operation.

Figure 3.4: Access Times for Random Writes

For an illustration of the effects of cache behavior see Figure 3.4. We will talk about the program which generated the data later; it is a simple simulation of a program which accesses a configurable amount of memory repeatedly in a random fashion. Each data item has a fixed size. The number of elements depends on the selected working set size. The Y–axis shows the average number of CPU cycles it takes to process one element; note that the scale for the Y–axis is logarithmic. The same applies in all the diagrams of this kind to the X–axis. The size of the working set is always shown in powers of two.

譯者信息

對於寫操做,CPU並不須要等待數據被安全地放入內存。只要指令具備相似的效果,就沒有什麼東西能夠阻止CPU走捷徑了。它能夠早早地執行下一條指令,甚至能夠在影子寄存器(shadow register)的幫助下,更改這個寫操做將要存儲的數據。

 
圖3.4: 隨機寫操做的訪問時間

圖3.4展現了緩存的效果。關於產生圖中數據的程序,咱們會在稍後討論。這裏大體說下,這個程序是連續隨機地訪問某塊大小可配的內存區域。每一個數據項的大小是固定的。數據項的多少取決於選擇的工做集大小。Y軸表示處理每一個元素平均須要多少個CPU週期,注意它是對數刻度。X軸也是一樣,工做集的大小都以2的n次方表示。

 

The graph shows three distinct plateaus. This is not surprising: the specific processor has L1d and L2 caches, but no L3. With some experience we can deduce that the L1d is 2^13 bytes in size and that the L2 is 2^20 bytes in size. If the entire working set fits into the L1d the cycles per operation on each element is below 10. Once the L1d size is exceeded the processor has to load data from L2 and the average time springs up to around 28. Once the L2 is not sufficient anymore the times jump to 480 cycles and more. This is when many or most operations have to load data from main memory. And worse: since data is being modified dirty cache lines have to be written back, too.

This graph should give sufficient motivation to look into coding improvements which help improve cache usage. We are not talking about a few measly percent here; we are talking about orders-of-magnitude improvements which are sometimes possible. In Section 6 we will discuss techniques which allow writing more efficient code. The next section goes into more details of CPU cache designs. The knowledge is good to have but not necessary for the rest of the paper. So this section could be skipped.

譯者信息

圖中有三個比較明顯的不一樣階段。很正常,這個處理器有L1d和L2,沒有L3。根據經驗能夠推測出,L1d有2^13字節(8KB),而L2有2^20字節(1MB)。若是整個工做集均可以放入L1d,那麼每一個元素的操做只需不到10個週期。若是工做集超過L1d,處理器不得不從L2獲取數據,因而時間攀升到28個週期左右。若是工做集更大,超過了L2,那麼時間進一步暴漲到480個週期以上。這時候,許多甚至大部分操做將不得不從主存中獲取數據。更糟糕的是,若是修改了數據,還須要將這些髒了的緩存線寫回內存。

看了這個圖,你們應該會有足夠的動力去檢查代碼、改進緩存的利用方式了吧?這裏的性能改善可不僅是微不足道的幾個百分點,而是幾個數量級呀。在第6節中,咱們將介紹一些編寫高效代碼的技巧。而下一節將進一步深刻緩存的設計。雖然精彩,但並非必修課,你們能夠選擇性地跳過。

 

3.3 CPU Cache Implementation Details

Cache implementers have the problem that each cell in the huge main memory potentially has to be cached. If the working set of a program is large enough this means there are many main memory locations which fight for each place in the cache. Previously it was noted that a ratio of 1-to-1000 for cache versus main memory size is not uncommon.

3.3.1 Associativity

It would be possible to implement a cache where each cache line can hold a copy of any memory location. This is called a fully associative cache. To access a cache line the processor core would have to compare the tags of each and every cache line with the tag for the requested address. The tag would be comprised of the entire part of the address which is not the offset into the cache line (that means, S in the figure on Section 3.2 is zero).

譯者信息

3.3 CPU緩存實現的細節

緩存的實現者們都要面對一個問題——主存中每個單元均可能需被緩存。若是程序的工做集很大,就會有許多內存位置爲了緩存而打架。前面咱們曾經提過緩存與主存的容量比,1:1000也十分常見。

3.3.1 關聯性

咱們可讓緩存的每條線能存聽任何內存地址的數據。這就是所謂的全關聯緩存(fully associative cache)。對於這種緩存,處理器爲了訪問某條線,將不得不檢索全部線的標籤。而標籤則包含了整個地址,而不只僅只是線內偏移量(也就意味着,圖3.2中的S爲0)。

 

 

 

There are caches which are implemented like this but, by looking at the numbers for an L2 in use today, will show that this is impractical. Given a 4MB cache with 64B cache lines the cache would have 65,536 entries. To achieve adequate performance the cache logic would have to be able to pick from all these entries the one matching a given tag in just a few cycles. The effort to implement this would be enormous.

Figure 3.5: Fully Associative Cache Schematics

For each cache line a comparator is needed to compare the large tag (note, S is zero). The letter next to each connection indicates the width in bits. If none is given it is a single bit line. Each comparator has to compare two T-bit-wide values. Then, based on the result, the appropriate cache line content is selected and made available. This requires merging as many sets of O data lines as there are cache buckets. The number of transistors needed to implement a single comparator is large especially since it must work very fast. No iterative comparator is usable. The only way to save on the number of comparators is to reduce the number of them by iteratively comparing the tags. This is not suitable for the same reason that iterative comparators are not: it takes too long.

譯者信息

確實有一些高速緩存是這樣實現的,可是,只要看看如今在用的L2緩存的參數,就知道這並不現實。給定4MB的高速緩存和64B的緩存線,這個緩存將有65,536個條目。爲了達到足夠的性能,緩存邏輯必須可以在短短几個時鐘週期內,從全部這些條目中挑出匹配給定標籤的那一個。實現這一點的工做量將是巨大的。

Figure 3.5: 全關聯高速緩存原理圖

每條高速緩存線都須要一個比較器來比較很長的標籤(注意,這裏S爲零)。每條連線旁邊的字母表示位寬,沒有標註的是單比特線。每一個比較器都要比較兩個T位寬的值,而後根據比較結果選出相應緩存線的內容並輸出,這須要把與緩存桶數量同樣多的O組數據線合併起來(譯註:即把各路輸出接入多路選擇器)。實現一個比較器所需的晶體管數量就已經很是多,尤爲是它還必須很是快,並且沒有辦法使用迭代式的比較器。要節省比較器的數量,惟一的途徑是用迭代的方式逐個比較標籤,但這和迭代比較器不可用的緣由同樣:太耗時間。

 

Fully associative caches are practical for small caches (for instance, the TLB caches on some Intel processors are fully associative) but those caches are small, really small. We are talking about a few dozen entries at most.

For L1i, L1d, and higher level caches a different approach is needed. What can be done is to restrict the search. In the most extreme restriction each tag maps to exactly one cache entry. The computation is simple: given the 4MB/64B cache with 65,536 entries we can directly address each entry by using bits 6 to 21 of the address (16 bits). The low 6 bits are the index into the cache line.

譯者信息

全關聯高速緩存對於小緩存來講是實用的(例如,某些Intel處理器中的TLB緩存就是全關聯的),但這些緩存都很小,很是小,最多也就幾十個條目。

對於L1i、L1d和更高級別的緩存,則須要採用不一樣的方法。能作的就是限制搜索的範圍。最極端的限制是,每一個標籤恰好映射到一個緩存條目。計算很簡單:給定的4MB/64B緩存有65,536個條目,咱們能夠直接用地址的bit 6到bit 21(16位)來尋址緩存的每個條目,地址的低6位則是緩存線內的索引。

 

Figure 3.6: Direct-Mapped Cache Schematics

Such a direct-mapped cache is fast and relatively easy to implement as can be seen in Figure 3.6. It requires exactly one comparator, one multiplexer (two in this diagram where tag and data are separated, but this is not a hard requirement on the design), and some logic to select only valid cache line content. The comparator is complex due to the speed requirements but there is only one of them now; as a result more effort can be spent on making it fast. The real complexity in this approach lies in the multiplexers. The number of transistors in a simple multiplexer grows with O(log N), where N is the number of cache lines. This is tolerable but might get slow, in which case speed can be increased by spending more real estate on transistors in the multiplexers to parallelize some of the work and to increase the speed. The total number of transistors can grow slowly with a growing cache size which makes this solution very attractive. But it has a drawback: it only works well if the addresses used by the program are evenly distributed with respect to the bits used for the direct mapping. If they are not, and this is usually the case, some cache entries are heavily used and therefore repeatedly evicted while others are hardly used at all or remain empty.

譯者信息

Figure 3.6: 直接映射緩存原理圖

從圖3.6能夠看出,這種直接映射的高速緩存速度快,並且實現起來比較容易。它只須要一個比較器、一個多路選擇器(圖中有兩個,由於標籤和數據是分開的,但這並非設計上的硬性要求),以及一些用來選出有效緩存線內容的邏輯。因爲對速度的要求,比較器很複雜,但如今只須要一個,因此能夠花更多精力讓它變快。這種方法真正的複雜性在於多路選擇器。一個簡單多路選擇器中晶體管的數量以O(log N)的速度增加,其中N是緩存線的數目。這能夠容忍,但可能會變慢;這種狀況下能夠在多路選擇器上投入更多晶體管,把部分工做並行化以提升速度。隨着緩存容量的增加,晶體管總數只是緩慢增長,這讓這種方案很是有吸引力。但它有一個缺點:只有當程序所用地址中參與直接映射的那些位分佈均勻時,它才能工做得好。若是分佈不均勻(這每每是常態),一些緩存條目被頻繁使用、於是反覆被逐出,而另外一些則幾乎不被使用,或者一直是空的。

 

Figure 3.7: Set-Associative Cache Schematics

This problem can be solved by making the cache set associative. A set-associative cache combines the features of the full associative and direct-mapped caches to largely avoid the weaknesses of those designs. Figure 3.7 shows the design of a set-associative cache. The tag and data storage are divided into sets which are selected by the address. This is similar to the direct-mapped cache. But instead of only having one element for each set value in the cache a small number of values is cached for the same set value. The tags for all the set members are compared in parallel, which is similar to the functioning of the fully associative cache.

譯者信息

Figure 3.7: 組關聯高速緩存原理圖

能夠經過把高速緩存作成組關聯(set associative)來解決這個問題。組關聯緩存結合了全關聯緩存和直接映射緩存的特色,在很大程度上避免了這兩種設計的弱點。圖3.7顯示了組關聯緩存的設計。標籤和數據存儲被分紅多個組,由地址來選擇組,這一點相似於直接映射緩存。但不一樣的是,同一個組編號下並非只有一個緩存元素,而是能夠緩存少數幾個值。全部組成員的標籤會被並行比較,這一點相似於全關聯緩存的工做方式。

 

The result is a cache which is not easily defeated by unfortunate or deliberate selection of addresses with the same set numbers and at the same time the size of the cache is not limited by the number of comparators which can be implemented in parallel. If the cache grows it is (in this figure) only the number of columns which increases, not the number of rows. The number of rows only increases if the associativity of the cache is increased. Today processors are using associativity levels of up to 16 for L2 caches or higher. L1 caches usually get by with 8.

L2 Cache Size   Associativity
                Direct              2                   4                   8
                CL=32    CL=64      CL=32    CL=64      CL=32    CL=64      CL=32    CL=64
512k 27,794,595 20,422,527 25,222,611 18,303,581 24,096,510 17,356,121 23,666,929 17,029,334
1M 19,007,315 13,903,854 16,566,738 12,127,174 15,537,500 11,436,705 15,162,895 11,233,896
2M 12,230,962 8,801,403 9,081,881 6,491,011 7,878,601 5,675,181 7,391,389 5,382,064
4M 7,749,986 5,427,836 4,736,187 3,159,507 3,788,122 2,418,898 3,430,713 2,125,103
8M 4,731,904 3,209,693 2,690,498 1,602,957 2,207,655 1,228,190 2,111,075 1,155,847
16M 2,620,587 1,528,592 1,958,293 1,089,580 1,704,878 883,530 1,671,541 862,324

Table 3.1: Effects of Cache Size, Associativity, and Line Size

Given our 4MB/64B cache and 8-way set associativity the cache we are left with has 8,192 sets and only 13 bits of the tag are used in addressing the cache set. To determine which (if any) of the entries in the cache set contains the addressed cache line 8 tags have to be compared. That is feasible to do in very short time. With an experiment we can see that this makes sense.

譯者信息

這樣獲得的緩存,不會輕易被恰好(或故意)選中同一組編號的地址所擊敗,同時緩存的大小也再也不受限於可以並行實現的比較器的數目。若是緩存變大,(在圖中)只是列數增長,行數並不增長;只有當關聯度提升時,行數纔會增長。今天的處理器,L2及更高級別的緩存使用的關聯度高達16,L1緩存一般是8。

L2緩存大小      關聯度
                直接映射            2                   4                   8
                CL=32    CL=64      CL=32    CL=64      CL=32    CL=64      CL=32    CL=64
512k 27,794,595 20,422,527 25,222,611 18,303,581 24,096,510 17,356,121 23,666,929 17,029,334
1M 19,007,315 13,903,854 16,566,738 12,127,174 15,537,500 11,436,705 15,162,895 11,233,896
2M 12,230,962 8,801,403 9,081,881 6,491,011 7,878,601 5,675,181 7,391,389 5,382,064
4M 7,749,986 5,427,836 4,736,187 3,159,507 3,788,122 2,418,898 3,430,713 2,125,103
8M 4,731,904 3,209,693 2,690,498 1,602,957 2,207,655 1,228,190 2,111,075 1,155,847
16M 2,620,587 1,528,592 1,958,293 1,089,580 1,704,878 883,530 1,671,541 862,324

Table 3.1: 高速緩存大小、關聯度、段大小的影響

給定咱們的4MB/64B、8路組關聯的高速緩存,剩下的緩存有8,192個緩存集,只須要用13位就能夠尋址緩存集。要肯定緩存集中是否有(以及哪一個)條目包含所尋址的緩存線,須要比較8個標籤,這在很短的時間內就能完成。經過下面的實驗,咱們能夠看到這是有意義的。
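A short sketch (mine) of the geometry behind these numbers: for a 4MB cache with 64-byte lines, the number of sets and the resulting S and T widths for the direct-mapped and the 8-way set-associative case.

  #include <stdio.h>

  /* integer log2 for exact powers of two */
  static unsigned log2u(unsigned x)
  {
      unsigned n = 0;
      while (x >>= 1)
          n++;
      return n;
  }

  static void geometry(const char *name, unsigned size, unsigned line, unsigned ways)
  {
      unsigned sets = size / (line * ways);   /* cache size = line size x ways x sets */
      unsigned O    = log2u(line);
      unsigned S    = log2u(sets);
      printf("%s: %6u sets, O=%u, S=%2u, T=%2u\n", name, sets, O, S, 32 - S - O);
  }

  int main(void)
  {
      geometry("direct-mapped", 4u << 20, 64, 1);   /* 65,536 sets, S=16, T=10 */
      geometry("8-way        ", 4u << 20, 64, 8);   /*  8,192 sets, S=13, T=13 */
      return 0;
  }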

 

 

Table 3.1 shows the number of L2 cache misses for a program (gcc in this case, the most important benchmark of them all, according to the Linux kernel people) for changing cache size, cache line size, and associativity set size. In Section 7.2 we will introduce the tool to simulate the caches as required for this test.

Just in case this is not yet obvious, the relationship of all these values is that the cache size is

cache line size × associativity × number of sets

The addresses are mapped into the cache by using

O = log₂ cache line size
S = log₂ number of sets

in the way the figure in Section 3.2 shows.

Figure 3.8: Cache Size vs Associativity (CL=32)

Figure 3.8 makes the data of the table more comprehensible. It shows the data for a fixed cache line size of 32 bytes. Looking at the numbers for a given cache size we can see that associativity can indeed help to reduce the number of cache misses significantly. For an 8MB cache going from direct mapping to 2-way set associative cache saves almost 44% of the cache misses. The processor can keep more of the working set in the cache with a set associative cache compared with a direct mapped cache.

譯者信息

表3.1顯示了一個程序(本例中是GCC,按照Linux內核開發者的說法,它是全部基準中最重要的一個)在改變緩存大小、緩存線大小和關聯度時,L2高速緩存未命中的數量。在7.2節中,咱們將介紹模擬這項測試所需緩存的工具。

 萬一這還不是很明顯,全部這些值之間的關係是高速緩存的大小爲:

cache line size × associativity × number of sets 

地址按如下方式映射到高速緩存:

O = log₂ cache line size
S = log₂ number of sets

也就是第3.2節中的圖所示的方式。

Figure 3.8: 緩存大小 vs 關聯度 (CL=32)

圖3.8表中的數據更易於理解。它顯示一個固定的32個字節大小的高速緩存行的數據。對於一個給定的高速緩存大小,咱們能夠看出,關聯性,的確能夠幫助明顯減小高速緩存未命中的數量。對於8MB的緩存,從直接映射到2路組相聯,能夠減小近44%的高速緩存未命中。組相聯高速緩存和直接映射緩存相比,該處理器能夠把更多的工做集保持在緩存中。

 

In the literature one can occasionally read that introducing associativity has the same effect as doubling cache size. This is true in some extreme cases as can be seen in the jump from the 4MB to the 8MB cache. But it certainly is not true for further doubling of the associativity. As we can see in the data, the successive gains are much smaller. We should not completely discount the effects, though. In the example program the peak memory use is 5.6M. So with a 8MB cache there are unlikely to be many (more than two) uses for the same cache set. With a larger working set the savings can be higher as we can see from the larger benefits of associativity for the smaller cache sizes.

In general, increasing the associativity of a cache above 8 seems to have little effects for a single-thread workload. With the introduction of multi-core processors which use a shared L2 the situation changes. Now you basically have two programs hitting on the same cache which causes the associativity in practice to be halved (or quartered for quad-core processors). So it can be expected that, with increasing numbers of cores, the associativity of the shared caches should grow. Once this is not possible anymore (16-way set associativity is already hard) processor designers have to start using shared L3 caches and beyond, while L2 caches are potentially shared by a subset of the cores.

譯者信息

在文獻中,偶爾能夠讀到這樣的說法:引入關聯度與把高速緩存的容量加倍具備相同的效果。在從4MB躍升到8MB緩存這種極端狀況下,這是正確的,但對關聯度的進一步加倍就確定不成立了。正如數據所顯示的,後續的收益要小得多。不過,咱們也不該該徹底忽視這個效果。示例程序的內存使用峯值是5.6M,所以在8MB緩存下,同一個緩存組不太可能被屢次(兩次以上)用到。而從較小緩存容量下關聯度帶來的更大收益能夠看出,工做集越大,能節省的就越多。

通常來講,把高速緩存的關聯度增長到8以上,對單線程工做負載彷佛影響不大。隨着使用共享L2的多核處理器的出現,形勢發生了變化。如今基本上是兩個程序在使用同一個緩存,實際效果至關於關聯度減半(四核處理器則是減爲四分之一)。所以能夠預期,隨着核數的增長,共享緩存的關聯度也應該隨之增加。一旦這再也不可行(16路組相聯已經很難實現),處理器設計者就不得不開始使用共享的L3甚至更高級別的緩存,而L2緩存則可能只由一部分核共享。

 

Another effect we can study in Figure 3.8 is how the increase in cache size helps with performance. This data cannot be interpreted without knowing about the working set size. Obviously, a cache as large as the main memory would lead to better results than a smaller cache, so there is in general no limit to the largest cache size with measurable benefits.

As already mentioned above, the size of the working set at its peak is 5.6M. This does not give us any absolute number of the maximum beneficial cache size but it allows us to estimate the number. The problem is that not all the memory used is contiguous and, therefore, we have, even with a 16M cache and a 5.6M working set, conflicts (see the benefit of the 2-way set associative 16MB cache over the direct mapped version). But it is a safe bet that with the same workload the benefits of a 32MB cache would be negligible. But who says the working set has to stay the same? Workloads are growing over time and so should the cache size. When buying machines, and one has to choose the cache size one is willing to pay for, it is worthwhile to measure the working set size. Why this is important can be seen in the figures on Figure 3.10.

Figure 3.9: Test Memory Layouts

Two types of tests are run. In the first test the elements are processed sequentially. The test program follows the pointer n but the array elements are chained so that they are traversed in the order in which they are found in memory. This can be seen in the lower part of Figure 3.9. There is one back reference from the last element. In the second test (upper part of the figure) the array elements are traversed in a random order. In both cases the array elements form a circular single-linked list.

譯者信息

從圖3.8中,咱們還能夠研究增長緩存大小對性能有多大幫助。若是不瞭解工做集的大小,就沒法解讀這些數據。很顯然,與主存同樣大的緩存會比小緩存帶來更好的結果,因此通常來講,能帶來可測收益的最大緩存容量是沒有上限的。

上文已經說過,示例中最大的工做集爲5.6M。它並無給出最佳緩存大小值,但咱們能夠估算出來。問題主要在於內存的使用並不連續,所以,即便是16M的緩存,在處理5.6M的工做集時也會出現衝突(參見2路集合關聯式16MB緩存vs直接映射式緩存的優勢)。無論怎樣,咱們能夠有把握地說,在一樣5.6M的負載下,緩存從16MB升到32MB基本已沒有多少提升的餘地。可是,工做集是會變的。若是工做集不斷增大,緩存也須要隨之增大。在購買計算機時,若是須要選擇緩存大小,必定要先衡量工做集的大小。緣由能夠參見圖3.10。

 
圖3.9: 測試的內存分佈狀況

咱們執行兩項測試。第一項測試是按順序地訪問全部元素。測試程序循着指針n進行訪問,而全部元素是連接在一塊兒的,從而使它們的被訪問順序與在內存中排布的順序一致,如圖3.9的下半部分所示,末尾的元素有一個指向首元素的引用。而第二項測試(見圖3.9的上半部分)則是按隨機順序訪問全部元素。在上述兩個測試中,全部元素都構成一個單向循環鏈表。

 

3.3.2 Measurements of Cache Effects

All the figures are created by measuring a program which can simulate working sets of arbitrary size, read and write access, and sequential or random access. We have already seen some results in Figure 3.4. The program creates an array corresponding to the working set size of elements of this type:

  struct l {
    struct l *n;
    long int pad[NPAD];
  };

All entries are chained in a circular list using the n element, either in sequential or random order. Advancing from one entry to the next always uses the pointer, even if the elements are laid out sequentially. The pad element is the payload and it can grow arbitrarily large. In some tests the data is modified, in others the program only performs read operations.

In the performance measurements we are talking about working set sizes. The working set is made up of an array of struct l elements. A working set of 2^N bytes contains

2^N/sizeof(struct l)

elements. Obviously sizeof(struct l) depends on the value of NPAD. For 32-bit systems, NPAD=7 means the size of each array element is 32 bytes, for 64-bit systems the size is 64 bytes.

譯者信息

3.3.2 Cache的性能測試

全部這些圖都來自對一個測試程序的測量,該程序能夠模擬任意大小的工做集、讀和寫訪問、順序和隨機訪問。咱們已經在圖3.4中看到了一些結果。程序會建立一個與工做集大小相對應的數組,數組元素的類型以下:

  struct l {
    struct l *n;
    long int pad[NPAD];
  };

全部節點經過n字段以順序或隨機的順序連成一個環形鏈表。即便元素在內存中是順序排列的,從一個節點前進到下一個節點也老是經過指針完成。pad字段是有效載荷,能夠任意增大。在一些測試中,數據會被修改,而在另外一些測試中,程序只執行讀操做。

在性能測試中,咱們所說的工做集是由struct l類型的元素組成的數組。2^N字節的工做集包含

2^N/sizeof(struct l)

個元素. 顯然sizeof(struct l) 的值取決於NPAD的大小。在32位系統上,NPAD=7意味着數組的每一個元素的大小爲32字節,在64位系統上,NPAD=7意味着數組的每一個元素的大小爲64字節。
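
To make the setup concrete, here is a minimal sketch of how a sequentially chained working set of this kind could be built. The helper name and the plain malloc allocation are illustrative only; the actual test program is not shown in this excerpt.

  #include <stdlib.h>

  #define NPAD 7

  struct l {
    struct l *n;
    long int pad[NPAD];
  };

  /* Build a circular, sequentially ordered list of nelem elements.  Every
     element points to its successor in memory and the last one points back
     to the first, as in the lower part of Figure 3.9. */
  struct l *build_sequential(size_t nelem)
  {
    struct l *arr;
    size_t i;

    if (nelem == 0 || (arr = malloc(nelem * sizeof(struct l))) == NULL)
      return NULL;
    for (i = 0; i < nelem - 1; ++i)
      arr[i].n = &arr[i + 1];
    arr[nelem - 1].n = &arr[0];     /* back reference from the last element */
    return arr;
  }

  /* A working set of 2^N bytes corresponds to nelem = (1 << N) / sizeof(struct l). */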

 

Single Threaded Sequential Access

The simplest case is a simple walk over all the entries in the list. The list elements are laid out sequentially, densely packed. Whether the order of processing is forward or backward does not matter, the processor can deal with both directions equally well. What we measure here—and in all the following tests—is how long it takes to handle a single list element. The time unit is a processor cycle. Figure 3.10 shows the result. Unless otherwise specified, all measurements are made on a Pentium 4 machine in 64-bit mode which means the structure l with NPAD=0 is eight bytes in size.

Figure 3.10: Sequential Read Access, NPAD=0

Figure 3.11: Sequential Read for Several Sizes

 

譯者信息

單線程順序訪問

最簡單的狀況就是遍歷鏈表中順序存儲的全部節點。鏈表元素順序排列、緊密相鄰。不管是從前向後仍是從後向前處理,對處理器來說都同樣。在這裏以及後面的全部測試中,咱們測量的都是處理單個鏈表元素所須要的時間,以CPU時鐘週期爲計時單位。圖3.10顯示了測試結果。除非有特殊說明,全部的測試都是在Pentium 4的64位模式下進行的,這意味着NPAD=0時結構體l的大小爲8字節。

圖 3.10: 順序讀訪問, NPAD=0

圖 3.11: 多種元素大小下的順序讀
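
The quantity plotted in these figures is the average number of cycles spent per list element. A rough sketch of such a measurement loop is shown below; it assumes an x86 processor and the __rdtsc() intrinsic from x86intrin.h (GCC/Clang), and it reuses struct l and the list builder from the sketch above. The real benchmark is more careful than this, so treat it as an illustration only.

  #include <stdint.h>
  #include <x86intrin.h>            /* __rdtsc(), GCC/Clang on x86 */

  /* Walk the circular list `rounds` times and return the average number of
     time-stamp-counter ticks per element.  The dependent load p = p->n is
     exactly what the tests in this section measure. */
  double cycles_per_element(struct l *head, size_t nelem, size_t rounds)
  {
    struct l *p = head;
    size_t steps = rounds * nelem;
    uint64_t start, stop;
    size_t i;

    start = __rdtsc();
    for (i = 0; i < steps; ++i)
      p = p->n;
    stop = __rdtsc();

    /* Keep p alive so the compiler cannot optimize the walk away. */
    __asm__ __volatile__("" : : "r"(p) : "memory");

    return (double)(stop - start) / (double)steps;
  }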

The first two measurements are polluted by noise. The measured workload is simply too small to filter the effects of the rest of the system out. We can safely assume that the values are all at the 4 cycles level. With this in mind we can see three distinct levels:

 

  • Up to a working set size of 2^14 bytes.
  • From 2^15 bytes to 2^20 bytes.
  • From 2^21 bytes and up.

These steps can be easily explained: the processor has a 16kB L1d and 1MB L2. We do not see sharp edges in the transition from one level to the other because the caches are used by other parts of the system as well and so the cache is not exclusively available for the program data. Specifically the L2 cache is a unified cache and also used for the instructions (NB: Intel uses inclusive caches).

譯者信息

一開始的兩個測試數據受到了噪聲的污染。因爲它們的工做負荷過小,沒法濾除系統中其它部分的影響。咱們能夠認爲這些值都處於4個週期的水平。基於這一點,整個圖能夠劃分爲比較明顯的三個部分:

  • 工做集不超過2^14字節的。
  • 工做集從2^15字節到2^20字節的。
  • 工做集從2^21字節起的。

這樣的結果很容易解釋——是由於處理器有16KB的L1d和1MB的L2。而在這三個部分之間,並無很是銳利的邊緣,這是由於系統的其它部分也在使用緩存,咱們的測試程序並不能獨佔緩存的使用。尤爲是L2,它是統一式的緩存,處理器的指令也會使用它(注: Intel使用的是包容式緩存)。

 

 

 

What is perhaps not quite expected are the actual times for the different working set sizes. The times for the L1d hits are expected: load times after an L1d hit are around 4 cycles on the P4. But what about the L2 accesses? Once the L1d is not sufficient to hold the data one might expect it would take 14 cycles or more per element since this is the access time for the L2. But the results show that only about 9 cycles are required. This discrepancy can be explained by the advanced logic in the processors. In anticipation of using consecutive memory regions, the processor prefetches the next cache line. This means that when the next line is actually used it is already halfway loaded. The delay required to wait for the next cache line to be loaded is therefore much less than the L2 access time.

譯者信息

測試的實際耗時可能會出乎你們的意料。L1d的部分跟咱們預想的差很少,在一臺P4上耗時爲4個週期左右。但L2的結果則出乎意料。你們可能以爲須要14個週期以上,但實際只用了9個週期。這要歸功於處理器先進的處理邏輯,當它使用連續的內存區時,會預先讀取下一條緩存線的數據。這樣一來,當真正使用下一條線的時候,其實已經早已讀完一半了,因而真正的等待耗時會比L2的訪問時間少不少。

 

The effect of prefetching is even more visible once the working set size grows beyond the L2 size. Before we said that a main memory access takes 200+ cycles. Only with effective prefetching is it possible for the processor to keep the access times as low as 9 cycles. As we can see from the difference between 200 and 9, this works out nicely.

We can observe the processor while prefetching, at least indirectly. In Figure 3.11 we see the times for the same working set sizes but this time we see the graphs for different sizes of the structure l. This means we have fewer but larger elements in the list. The different sizes have the effect that the distance between the n elements in the (still consecutive) list grows. In the four cases of the graph the distance is 0, 56, 120, and 248 bytes respectively.

譯者信息

在工做集超過L2的大小以後,預取的效果更明顯了。前面咱們說過,主存的訪問須要耗時200個週期以上。但在預取的幫助下,實際耗時保持在9個週期左右。200 vs 9,效果很是不錯。

咱們能夠觀察到預取的行爲,至少能夠間接地觀察到。圖3.11展現的是相同工做集大小下的耗時,但此次的幾條曲線對應不一樣大小的結構體l,也就是說鏈表中的元素更少但更大。元素變大使(仍然連續存放的)鏈表中相鄰n元素之間的距離變大了。圖中4條線對應的元素間距分別是0、56、120和248字節。

 

At the bottom we can see the line from the previous graph, but this time it appears more or less as a flat line. The times for the other cases are simply so much worse. We can see in this graph, too, the three different levels and we see the large errors in the tests with the small working set sizes (ignore them again). The lines more or less all match each other as long as only the L1d is involved. There is no prefetching necessary so all element sizes just hit the L1d for each access.

For the L2 cache hits we see that the three new lines all pretty much match each other but that they are at a higher level (about 28). This is the level of the access time for the L2. This means prefetching from L2 into L1d is basically disabled. Even with NPAD=7 we need a new cache line for each iteration of the loop; for NPAD=0, instead, the loop has to iterate eight times before the next cache line is needed. The prefetch logic cannot load a new cache line every cycle. Therefore we see a stall to load from L2 in every iteration.

譯者信息

圖中最下面的這一條線來自前一個圖,但在這裏它差很少變成了一條平直的線,其它三種狀況的耗時則要差不少。這個圖一樣能夠分紅比較明顯的三個階段,在小工做集下也一樣有比較大的測量誤差(請再次忽略它們)。只要只涉及L1d,這些線條就基本重合:這時不須要預取,各類元素大小下每次訪問都能直接命中L1d。

在L2階段,三條新加的線基本重合,並且耗時比老的那條線高不少,大約在28個週期左右,差很少就是L2的訪問時間。這代表,從L2到L1d的預取並無生效。這是由於,對於最下面的線(NPAD=0),因爲結構小,8次循環後才須要訪問一條新緩存線,而上面三條線對應的結構比較大,拿相對最小的NPAD=7來講,光是一次循環就須要訪問一條新線,更不用說更大的NPAD=15和31了。而預取邏輯是沒法在每一個週期裝載新線的,所以每次循環都須要從L2讀取,咱們看到的就是從L2讀取的時延。

 

It gets even more interesting once the working set size exceeds the L2 capacity. Now all four lines vary widely. The different element sizes play obviously a big role in the difference in performance. The processor should recognize the size of the strides and not fetch unnecessary cache lines for NPAD=15 and 31 since the element size is smaller than the prefetch window (see Section 6.3.1). Where the element size is hampering the prefetching efforts is a result of a limitation of hardware prefetching: it cannot cross page boundaries. We are reducing the effectiveness of the hardware scheduler by 50% for each size increase. If the hardware prefetcher were allowed to cross page boundaries and the next page is not resident or valid the OS would have to get involved in locating the page. That means the program would experience a page fault it did not initiate itself. This is completely unacceptable since the processor does not know whether a page is not present or does not exist. In the latter case the OS would have to abort the process. In any case, given that, for NPAD=7 and higher, we need one cache line per list element the hardware prefetcher cannot do much. There simply is no time to load the data from memory since all the processor does is read one word and then load the next element.

譯者信息

更有趣的是工做集超過L2容量後的階段。如今,4條線遠遠地拉開了,元素的大小變成了主角,左右了性能。處理器應能識別每一步(stride)的大小,不去爲NPAD=15和31獲取那些實際並不須要的緩存線(參見6.3.1)。元素大小對預取的約束根源於硬件預取的限制——它沒法跨越頁邊界。若是容許預取器跨越頁邊界,而下一頁不存在或無效,那麼OS還得去尋找它。這意味着,程序須要遭遇一次並不是由它本身產生的頁錯誤,這是徹底不能接受的。在NPAD=7或者更大的時候,因爲每一個元素都至少須要一條緩存線,預取器已經幫不上忙了:處理器每讀一個字就要裝載下一個元素,根本沒有時間從內存預先裝載數據。

 

Another big reason for the slowdown are the misses of the TLB cache. This is a cache where the results of the translation of a virtual address to a physical address are stored, as is explained in more detail in Section 4. The TLB cache is quite small since it has to be extremely fast. If more pages are accessed repeatedly than the TLB cache has entries for the translation from virtual to physical address has to be constantly repeated. This is a very costly operation. With larger element sizes the cost of a TLB lookup is amortized over fewer elements. That means the total number of TLB entries which have to be computed per list element is higher.

譯者信息

另外一個致使慢下來的緣由是TLB緩存的未命中。TLB是存儲虛實地址映射的緩存,參見第4節。爲了保持快速,TLB只有很小的容量。若是有大量頁被反覆訪問,超出了TLB緩存容量,就會致使反覆地進行地址翻譯,這會耗費大量時間。TLB查找的代價分攤到全部元素上,若是元素越大,那麼元素的數量越少,每一個元素承擔的那一份就越多。

 

To observe the TLB effects we can run a different test. For one measurement we lay out the elements sequentially as usual. We use NPAD=7 for elements which occupy one entire cache line. For the second measurement we place each list element on a separate page. The rest of each page is left untouched and we do not count it in the total for the working set size. {Yes, this is a bit inconsistent because in the other tests we count the unused part of the struct in the element size and we could define NPAD so that each element fills a page. In that case the working set sizes would be very different. This is not the point of this test, though, and since prefetching is ineffective anyway this makes little difference.} The consequence is that, for the first measurement, each list iteration requires a new cache line and, for every 64 elements, a new page. For the second measurement each iteration requires loading a new cache line which is on a new page.

Figure 3.12: TLB Influence for Sequential Read

 

譯者信息

爲了觀察TLB的影響,咱們能夠運行另一種測試。第一項測量仍是像日常同樣把元素順序排列,並使用NPAD=7,讓每一個元素佔滿一整條緩存線。第二項測量則把每一個鏈表元素放在一個單獨的頁上,每頁其他的部分保持不動,也不把它計入工做集的大小。{是的,這樣作有點不一致,由於在其它測試中,結構體裏未使用的部分也被計入元素大小,咱們本能夠把NPAD定義得讓每一個元素佔滿一整頁,那樣工做集的大小就會很是不一樣。不過這不是此次測試的重點,並且反正預取也沒有什麼效果,因此差異不大。}其結果是:第一項測量中,鏈表每前進一步就須要一條新的緩存線,並且每64個元素須要一個新頁;第二項測量中,每前進一步都須要裝載一條位於新頁上的新緩存線。

圖 3.12: TLB 對順序讀的影響

The result can be seen in Figure 3.12. The measurements were performed on the same machine as Figure 3.11. Due to limitations of the available RAM the working set size had to be restricted to 2^24 bytes which requires 1GB to place the objects on separate pages. The lower, red curve corresponds exactly to the NPAD=7 curve in Figure 3.11. We see the distinct steps showing the sizes of the L1d and L2 caches. The second curve looks radically different. The important feature is the huge spike starting when the working set size reaches 2^13 bytes. This is when the TLB cache overflows. With an element size of 64 bytes we can compute that the TLB cache has 64 entries. There are no page faults affecting the cost since the program locks the memory to prevent it from being swapped out.

譯者信息

結果見圖3.12。該測試與圖3.11是在同一臺機器上進行的。因爲可用RAM有限,工做集大小被限制在2^24字節,由於把對象放在單獨的頁上須要1GB內存。圖中下方的紅色曲線正好對應圖3.11中NPAD=7的曲線,咱們能夠看到體現L1d和L2大小的明顯臺階。第二條曲線看上去徹底不一樣,其最重要的特色是當工做集達到2^13字節時開始的大幅攀升,這就是TLB緩存溢出的時刻。由於元素大小爲64字節,咱們能夠算出TLB緩存有64個條目。程序鎖定了內存以防止它被換出,所以耗時不會受到頁錯誤的影響。
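
For the second measurement the layout with one element per page can be produced with mmap; a minimal sketch follows. It assumes struct l from before and a Linux-style mmap/mlock interface; the page size handling and error handling are simplified.

  #include <stddef.h>
  #include <sys/mman.h>
  #include <unistd.h>

  /* Place nelem elements so that each starts on its own page and chain them
     into a circular list.  Only the first cache line of every page is ever
     touched, so every step of the walk needs a distinct TLB entry. */
  struct l *build_page_per_element(size_t nelem)
  {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);   /* typically 4096 */
    char *mem = mmap(NULL, page * nelem, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    size_t i;

    if (mem == MAP_FAILED)
      return NULL;
    /* The text notes that the memory is locked so page faults do not distort
       the numbers; mlock may require a raised RLIMIT_MEMLOCK. */
    mlock(mem, page * nelem);
    for (i = 0; i < nelem; ++i)
      ((struct l *)(mem + i * page))->n =
          (struct l *)(mem + ((i + 1) % nelem) * page);
    return (struct l *)mem;
  }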

 

As can be seen the number of cycles it takes to compute the physical address and store it in the TLB is very high. The graph in Figure 3.12 shows the extreme case, but it should now be clear that a significant factor in the slowdown for larger NPAD values is the reduced efficiency of the TLB cache. Since the physical address has to be computed before a cache line can be read for either L2 or main memory the address translation penalties are additive to the memory access times. This in part explains why the total cost per list element for NPAD=31 is higher than the theoretical access time for the RAM.

Figure 3.13: Sequential Read and Write, NPAD=1

We can glimpse a few more details of the prefetch implementation by looking at the data of test runs where the list elements are modified. Figure 3.13 shows three lines. The element width is in all cases 16 bytes. The first line is the now familiar list walk which serves as a baseline. The second line, labeled 「Inc」, simply increments the pad[0] member of the current element before going on to the next. The third line, labeled 「Addnext0」, takes the pad[0] list element of the next element and adds it to the pad[0] member of the current list element.

譯者信息

能夠看出,計算物理地址並把結果存入TLB所花費的週期數是很是高的。圖3.12展現的是極端狀況,但如今應該很清楚:較大NPAD值下速度變慢的一個重要因素,就是TLB緩存效率的下降。因爲在讀取L2或主存的緩存線以前必須先計算出物理地址,地址轉換的開銷是疊加在內存訪問時間之上的。這在必定程度上解釋了爲何NPAD=31時每一個列表元素的總開銷比理論上的RAM訪問時間還要高。


圖3.13 NPAD等於1時的順序讀和寫

經過觀察鏈表元素被修改時的測試數據,咱們能夠窺見預取實現的一些更詳細的細節。圖3.13顯示了三條曲線,全部狀況下元素寬度都爲16字節。第一條曲線是你們已經熟悉的鏈表遍歷,在這裏做爲基線。第二條曲線,標記爲「Inc」,在前進到下一個元素以前,僅僅把當前元素的pad[0]成員加一。第三條曲線,標記爲「Addnext0」,取出下一個元素的pad[0]值,把它加到當前元素的pad[0]成員上。

 

The naïve assumption would be that the 「Addnext0」 test runs slower because it has more work to do. Before advancing to the next list element a value from that element has to be loaded. This is why it is surprising to see that this test actually runs, for some working set sizes, faster than the 「Inc」 test. The explanation for this is that the load from the next list element is basically a forced prefetch. Whenever the program advances to the next list element we know for sure that element is already in the L1d cache. As a result we see that the 「Addnext0」 performs as well as the simple 「Follow」 test as long as the working set size fits into the L2 cache.

The 「Addnext0」 test runs out of L2 faster than the 「Inc」 test, though. It needs more data loaded from main memory. This is why the 「Addnext0」 test reaches the 28 cycles level for a working set size of 221 bytes. The 28 cycles level is twice as high as the 14 cycles level the 「Follow」 test reaches. This is easy to explain, too. Since the other two tests modify memory an L2 cache eviction to make room for new cache lines cannot simply discard the data. Instead it has to be written to memory. This means the available bandwidth on the FSB is cut in half, hence doubling the time it takes to transfer the data from main memory to L2.

譯者信息

單純從直覺上看,你們可能會覺得「Addnext0」更慢,由於它要作的事情更多——在前進到下個元素以前還須要裝載下一個元素的值。因此看到下面的結果會有些意外:在某些工做集大小下,「Addnext0」反而比「Inc」更快。緣由在於,從下一個鏈表元素裝載數據實際上構成了一次強制預取。每當程序前進到下一個元素時,咱們能夠肯定該元素已經在L1d裏了。所以,只要工做集小於L2,「Addnext0」的性能基本就能與簡單的「Follow」測試媲美。

可是,「Addnext0」比「Inc」更早地用完L2,由於它須要從主存裝載更多的數據。因此在工做集達到2^21字節時,「Addnext0」的耗時達到了28個週期,是同期「Follow」測試14個週期的兩倍。這個兩倍也很好解釋:另外兩個測試涉及對內存的修改,所以L2在逐出緩存線爲新緩存線騰地方時,不能簡單地把數據一扔了事,而必須將它們寫入內存。因而FSB的可用帶寬變成了一半,從主存把數據傳輸到L2的耗時也就變成了原來的兩倍。
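
The loop bodies of the three variants differ only in what they do with pad[0]; the following sketch shows the 「Inc」 and 「Addnext0」 inner loops (struct l as before, here with NPAD=1 so the element is 16 bytes wide). It is meant to illustrate why the extra load in 「Addnext0」 acts like a prefetch, not to reproduce the benchmark exactly.

  #include <stddef.h>

  /* "Inc": increment the payload of the current element, then advance. */
  void walk_inc(struct l *head, size_t steps)
  {
    struct l *p = head;
    size_t i;
    for (i = 0; i < steps; ++i) {
      p->pad[0] += 1;
      p = p->n;
    }
  }

  /* "Addnext0": add the next element's pad[0] to the current one.  The load
     of p->n->pad[0] touches the element visited in the following iteration,
     so it behaves like a forced prefetch. */
  void walk_addnext0(struct l *head, size_t steps)
  {
    struct l *p = head;
    size_t i;
    for (i = 0; i < steps; ++i) {
      p->pad[0] += p->n->pad[0];
      p = p->n;
    }
  }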

 

Figure 3.14: Advantage of Larger L2/L3 Caches

One last aspect of the sequential, efficient cache handling is the size of the cache. This should be obvious but it still should be pointed out. Figure 3.14 shows the timing for the Increment benchmark with 128-byte elements (NPAD=15 on 64-bit machines). This time we see the measurement from three different machines. The first two machines are P4s, the last one a Core2 processor. The first two differentiate themselves by having different cache sizes. The first processor has a 32k L1d and an 1M L2. The second one has 16k L1d, 512k L2, and 2M L3. The Core2 processor has 32k L1d and 4M L2.

譯者信息
 
圖3.14: 更大L2/L3緩存的優點

決定順序式緩存處理性能的另外一個重要因素是緩存容量。雖然這一點比較明顯,但仍是值得一說。圖3.14展現了128字節長元素的測試結果(64位機,NPAD=15)。此次咱們比較三臺不一樣計算機的曲線,兩臺P4,一臺Core 2。兩臺P4的區別是緩存容量不一樣,一臺是32k的L1d和1M的L2,一臺是16K的L1d、512k的L2和2M的L3。Core 2那臺則是32k的L1d和4M的L2。

 

The interesting part of the graph is not necessarily how well the Core2 processor performs relative to the other two (although it is impressive). The main point of interest here is the region where the working set size is too large for the respective last level cache and the main memory gets heavily involved.

Set Size    Sequential                                                        Random
            L2 Hit    L2 Miss   #Iter    Ratio Miss/Hit   L2 Accesses/Iter    L2 Hit     L2 Miss    #Iter   Ratio Miss/Hit   L2 Accesses/Iter
2^20        88,636    843       16,384   0.94%            5.5                 30,462     4,721      1,024   13.42%           34.4
2^21        88,105    1,584     8,192    1.77%            10.9                21,817     15,151     512     40.98%           72.2
2^22        88,106    1,600     4,096    1.78%            21.9                22,258     22,285     256     50.03%           174.0
2^23        88,104    1,614     2,048    1.80%            43.8                27,521     26,274     128     48.84%           420.3
2^24        88,114    1,655     1,024    1.84%            87.7                33,166     29,115     64      46.75%           973.1
2^25        88,112    1,730     512      1.93%            175.5               39,858     32,360     32      44.81%           2,256.8
2^26        88,112    1,906     256      2.12%            351.6               48,539     38,151     16      44.01%           5,418.1
2^27        88,114    2,244     128      2.48%            705.9               62,423     52,049     8       45.47%           14,309.0
2^28        88,120    2,939     64       3.23%            1,422.8             81,906     87,167     4       51.56%           42,268.3
2^29        88,137    4,318     32       4.67%            2,889.2             119,079    163,398    2       57.84%           141,238.5

Table 3.2: L2 Hits and Misses for Sequential and Random Walks, NPAD=0

As expected, the larger the last level cache is the longer the curve stays at the low level corresponding to the L2 access costs. The important part to notice is the performance advantage this provides. The second processor (which is slightly older) can perform the work on the working set of 2^20 bytes twice as fast as the first processor. All thanks to the increased last level cache size. The Core2 processor with its 4M L2 performs even better.

譯者信息

圖中最有趣的地方,並非Core 2如何大勝兩臺P4,而是工做集開始增加到連末級緩存也放不下、須要主存熱情參與以後的部分。

Set Size    Sequential                                                        Random
            L2 Hit    L2 Miss   #Iter    Ratio Miss/Hit   L2 Accesses/Iter    L2 Hit     L2 Miss    #Iter   Ratio Miss/Hit   L2 Accesses/Iter
2^20        88,636    843       16,384   0.94%            5.5                 30,462     4,721      1,024   13.42%           34.4
2^21        88,105    1,584     8,192    1.77%            10.9                21,817     15,151     512     40.98%           72.2
2^22        88,106    1,600     4,096    1.78%            21.9                22,258     22,285     256     50.03%           174.0
2^23        88,104    1,614     2,048    1.80%            43.8                27,521     26,274     128     48.84%           420.3
2^24        88,114    1,655     1,024    1.84%            87.7                33,166     29,115     64      46.75%           973.1
2^25        88,112    1,730     512      1.93%            175.5               39,858     32,360     32      44.81%           2,256.8
2^26        88,112    1,906     256      2.12%            351.6               48,539     38,151     16      44.01%           5,418.1
2^27        88,114    2,244     128      2.48%            705.9               62,423     52,049     8       45.47%           14,309.0
2^28        88,120    2,939     64       3.23%            1,422.8             81,906     87,167     4       51.56%           42,268.3
2^29        88,137    4,318     32       4.67%            2,889.2             119,079    163,398    2       57.84%           141,238.5

表3.2: 順序訪問與隨機訪問時L2命中與未命中的狀況,NPAD=0

與咱們預計的類似,最末級緩存越大,曲線停留在L2訪問耗時區的時間越長。在2^20字節的工做集時,第二臺P4(更老一些)比第一臺P4要快上一倍,這要徹底歸功於更大的末級緩存。而Core 2拜它巨大的4M L2所賜,表現更爲卓越。

 

For a random workload this might not mean that much. But if the workload can be tailored to the size of the last level cache the program performance can be increased quite dramatically. This is why it sometimes is worthwhile to spend the extra money for a processor with a larger cache.

Single Threaded Random Access Measurements

We have seen that the processor is able to hide most of the main memory and even L2 access latency by prefetching cache lines into L2 and L1d. This can work well only when the memory access is predictable, though.

Figure 3.15: Sequential vs Random Read, NPAD=0

If the access is unpredictable or random the situation is quite different. Figure 3.15 compares the per-list-element times for the sequential access (same as in Figure 3.10) with the times when the list elements are randomly distributed in the working set. The order is determined by the linked list which is randomized. There is no way for the processor to reliably prefetch data. This can only work by chance if elements which are used shortly after one another are also close to each other in memory.

譯者信息

對於隨機的工做負荷而言,可能沒有這麼驚人的效果,可是,若是咱們能將工做負荷進行一些裁剪,讓它匹配末級緩存的容量,就徹底能夠獲得很是大的性能提高。也是因爲這個緣由,有時候咱們須要多花一些錢,買一個擁有更大緩存的處理器。

單線程隨機訪問模式的測量

前面咱們已經看到,處理器可以經過把緩存線預取到L2和L1d中,隱藏大部分訪問主存、甚至是訪問L2的時延。不過,只有在內存訪問可預測的狀況下,這一點才能奏效。

 
圖3.15: 順序讀取vs隨機讀取,NPAD=0

可是,若是換成隨機訪問或者不可預測的訪問,狀況就大不相同了。圖3.15比較了順序讀取與隨機讀取的耗時狀況。

換成隨機以後,處理器沒法再有效地預取數據,只有少數狀況下靠運氣恰好碰到前後訪問的兩個元素挨在一塊兒的情形。
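
The random order is produced simply by chaining the array elements according to a shuffled permutation before the walk starts. A minimal sketch (using rand() only for brevity; the real program may well use a different generator):

  #include <stddef.h>
  #include <stdlib.h>

  /* Chain the nelem array elements in a random order and close the circle.
     A Fisher-Yates shuffle of an index array gives the permutation. */
  void link_random(struct l *arr, size_t nelem)
  {
    size_t *order;
    size_t i, j, tmp;

    if (nelem == 0)
      return;
    order = malloc(nelem * sizeof(size_t));
    if (order == NULL)
      return;
    for (i = 0; i < nelem; ++i)
      order[i] = i;
    for (i = nelem - 1; i > 0; --i) {
      j = (size_t)rand() % (i + 1);
      tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }
    for (i = 0; i < nelem; ++i)
      arr[order[i]].n = &arr[order[(i + 1) % nelem]];
    free(order);
  }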

 

There are two important points to note in Figure 3.15. First, the large number of cycles needed for growing working set sizes. The machine makes it possible to access the main memory in 200-300 cycles but here we reach 450 cycles and more. We have seen this phenomenon before (compare Figure 3.11). The automatic prefetching is actually working to a disadvantage here.

The second interesting point is that the curve is not flattening at various plateaus as it has been for the sequential access cases. The curve keeps on rising. To explain this we can measure the L2 access of the program for the various working set sizes. The result can be seen in Figure 3.16 and Table 3.2.

譯者信息

圖3.15中有兩個須要關注的地方。首先,在大的工做集下須要很是多的週期。這臺機器訪問主存的時間大約爲200-300個週期,但圖中的耗時甚至超過了450個週期。咱們前面已經觀察到過相似現象(對比圖3.11)。這說明,處理器的自動預取在這裏起到了反效果。

其次,表明隨機訪問的曲線在各個階段不像順序訪問那樣保持平坦,而是不斷攀升。爲了解釋這個問題,咱們測量了程序在不一樣工做集下對L2的訪問狀況。結果如圖3.16和表3.2。

 

The figure shows that, when the working set size is larger than the L2 size, the cache miss ratio (L2 misses / L2 access) starts to grow. The curve has a similar form to the one in Figure 3.15: it rises quickly, declines slightly, and starts to rise again. There is a strong correlation with the cycles per list element graph. The L2 miss rate will grow until it eventually reaches close to 100%. Given a large enough working set (and RAM) the probability that any of the randomly picked cache lines is in L2 or is in the process of being loaded can be reduced arbitrarily.

The increasing cache miss rate alone explains some of the costs. But there is another factor. Looking at Table 3.2 we can see in the L2/#Iter columns that the total number of L2 uses per iteration of the program is growing. Each working set is twice as large as the one before. So, without caching we would expect double the main memory accesses. With caches and (almost) perfect predictability we see the modest increase in the L2 use shown in the data for sequential access. The increase is due to the increase of the working set size and nothing else.

Figure 3.16: L2d Miss Ratio

Figure 3.17: Page-Wise Randomization, NPAD=7

 

譯者信息

從圖中能夠看出,當工做集大小超過L2時,未命中率(L2未命中次數/L2訪問次數)開始上升。整條曲線的走向與圖3.15有些類似: 先急速爬升,隨後緩緩下滑,最後再度爬升。它與耗時圖有緊密的關聯。L2未命中率會一直爬升到100%爲止。只要工做集足夠大(而且內存也足夠大),就能夠將緩存線位於L2內或處於裝載過程當中的可能性降到很是低。

緩存未命中率的攀升已經能夠解釋一部分的開銷。除此之外,還有一個因素。觀察表3.2的L2/#Iter列,能夠看到每一個循環對L2的使用次數在增加。因爲工做集每次爲上一次的兩倍,若是沒有緩存的話,內存的訪問次數也將是上一次的兩倍。在按順序訪問時,因爲緩存的幫助及完美的預見性,對L2使用的增加比較平緩,徹底取決於工做集的增加速度。

 
圖3.16: L2d未命中率
 
圖3.17: 頁意義上(Page-Wise)的隨機化,NPAD=7
For random access the per-element time increases by more than 100% for each doubling of the working set size. This means the average access time per list element increases since the working set size only doubles. The reason behind this is a rising rate of TLB misses. In Figure 3.17 we see the cost for random accesses for NPAD=7. Only this time the randomization is modified. While in the normal case the entire list is randomized as one block (indicated by the label ∞) the other 11 curves show randomizations which are performed in smaller blocks. For the curve labeled '60' each set of 60 pages (245,760 bytes) is randomized individually. That means all list elements in the block are traversed before going over to an element in the next block. This has the effect that the number of TLB entries which are used at any one time is limited.

譯者信息

而換成隨機訪問後,單位耗時的增加超過了工做集的增加,根源是TLB未命中率的上升。圖3.17描繪的是NPAD=7時隨機訪問的耗時狀況。這一次,咱們修改了隨機訪問的方式。正常狀況下是把整個列表做爲一個塊進行隨機(以∞表示),而其它11條線則是在小一些的塊裏進行隨機。例如,標籤爲'60'的線表示以60頁(245,760字節)爲單位進行隨機。先遍歷完這個塊裏的全部元素,再訪問另外一個塊。這樣一來,能夠保證任意時刻使用的TLB條目數都是有限的。

 

The element size for NPAD=7 is 64 bytes, which corresponds to the cache line size. Due to the randomized order of the list elements it is unlikely that the hardware prefetcher has any effect, most certainly not for more than a handful of elements. This means the L2 cache miss rate does not differ significantly from the randomization of the entire list in one block. The performance of the test with increasing block size approaches asymptotically the curve for the one-block randomization. This means the performance of this latter test case is significantly influenced by the TLB misses. If the TLB misses can be lowered the performance increases significantly (in one test we will see later up to 38%).

譯者信息NPAD=7對應於64字節,正好等於緩存線的長度。因爲元素順序隨機,硬件預取不可能有任何效果,特別是在元素較多的狀況下。這意味着,分塊隨機時的L2未命中率與整個列表隨機時的未命中率沒有本質的差異。隨着塊的增大,曲線逐漸逼近整個列表隨機對應的曲線。這說明,在這個測試裏,性能受到TLB命中率的影響很大,若是咱們能提升TLB命中率,就能大幅度地提高性能(在後面的一個例子裏,性能提高了38%之多)。

 

3.3.3 Write Behavior

Before we start looking at the cache behavior when multiple execution contexts (threads or processes) use the same memory we have to explore a detail of cache implementations. Caches are supposed to be coherent and this coherency is supposed to be completely transparent for the userlevel code. Kernel code is a different story; it occasionally requires cache flushes.

This specifically means that, if a cache line is modified, the result for the system after this point in time is the same as if there were no cache at all and the main memory location itself had been modified. This can be implemented in two ways or policies:

  • write-through cache implementation;
  • write-back cache implementation.

 

譯者信息

3.3.3 寫入時的行爲

在咱們開始研究多個執行上下文(線程或進程)使用相同內存時的緩存行爲以前,先來看一下緩存實現的一些細節。緩存必須是一致的,並且這種一致性必須對用戶級代碼徹底透明。內核代碼則是另外一回事,它偶爾須要對緩存進行沖刷(flush)。

這意味着,若是對緩存線進行了修改,那麼在這個時間點以後,系統的結果應該是與沒有緩存的狀況下是相同的,即主存的對應位置也已經被修改的狀態。這種要求能夠經過兩種方式或策略實現:

  • 寫通(write-through)
  • 寫回(write-back)
The write-through cache is the simplest way to implement cache coherency. If the cache line is written to, the processor immediately also writes the cache line into main memory. This ensures that, at all times, the main memory and cache are in sync. The cache content could simply be discarded whenever a cache line is replaced. This cache policy is simple but not very fast. A program which, for instance, modifies a local variable over and over again would create a lot of traffic on the FSB even though the data is likely not used anywhere else and might be short-lived.

 

The write-back policy is more sophisticated. Here the processor does not immediately write the modified cache line back to main memory. Instead, the cache line is only marked as dirty. When the cache line is dropped from the cache at some point in the future the dirty bit will instruct the processor to write the data back at that time instead of just discarding the content.

譯者信息

寫通比較簡單。當修改緩存線時,處理器當即將它寫入主存。這樣能夠保證主存與緩存的內容永遠保持一致。當緩存線被替代時,只須要簡單地將它丟棄便可。這種策略很簡單,可是速度比較慢。若是某個程序反覆修改一個本地變量,可能致使FSB上產生大量數據流,而無論這個變量是否是有人在用,或者是否是短時間變量。

寫回比較複雜。當修改緩存線時,處理器再也不立刻將它寫入主存,而是打上已弄髒(dirty)的標記。當之後某個時間點緩存線被丟棄時,這個已弄髒標記會通知處理器把數據寫回到主存中,而不是簡單地扔掉。
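
The difference between the two policies can be captured in a toy model of a single cache line with a dirty flag. The sketch below is only an illustration of the bookkeeping described above, not of how real hardware is built; all names and the 64-byte line size are assumptions.

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  #define LINE_SIZE 64

  struct cache_line {
    uint8_t  data[LINE_SIZE];
    uint64_t tag;               /* which memory block the line holds   */
    bool     valid;
    bool     dirty;             /* only meaningful for write-back      */
  };

  /* Write-through: the store updates the line and main memory immediately,
     so an evicted line can always simply be discarded. */
  void store_write_through(struct cache_line *cl, uint8_t *memory,
                           size_t off, uint8_t value)
  {
    cl->data[off] = value;
    memory[cl->tag * LINE_SIZE + off] = value;
  }

  /* Write-back: the store only marks the line dirty ... */
  void store_write_back(struct cache_line *cl, size_t off, uint8_t value)
  {
    cl->data[off] = value;
    cl->dirty = true;
  }

  /* ... and the memory copy is only updated when the line is evicted. */
  void evict(struct cache_line *cl, uint8_t *memory)
  {
    if (cl->valid && cl->dirty)
      memcpy(memory + cl->tag * LINE_SIZE, cl->data, LINE_SIZE);
    cl->valid = cl->dirty = false;
  }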

 

Write-back caches have the chance to be significantly better performing, which is why most memory in a system with a decent processor is cached this way. The processor can even take advantage of free capacity on the FSB to store the content of a cache line before the line has to be evacuated. This allows the dirty bit to be cleared and the processor can just drop the cache line when the room in the cache is needed.

But there is a significant problem with the write-back implementation. When more than one processor (or core or hyper-thread) is available and accessing the same memory it must still be assured that both processors see the same memory content at all times. If a cache line is dirty on one processor (i.e., it has not been written back yet) and a second processor tries to read the same memory location, the read operation cannot just go out to the main memory. Instead the content of the first processor's cache line is needed. In the next section we will see how this is currently implemented.

譯者信息

寫回有機會得到明顯更好的性能,所以在配備像樣處理器的系統中,絕大部分內存都採用這種方式緩存。處理器甚至能夠利用FSB上的空閒帶寬,在緩存線必須被逐出以前就先把它的內容寫回主存。這樣髒標記就能夠被清除,當緩存須要空間時,處理器只要直接丟棄該緩存線便可。

但寫回也有一個很大的問題。當有多個處理器(或核心、超線程)訪問同一塊內存時,必須確保它們在任何時候看到的都是相同的內容。若是緩存線在其中一個處理器上弄髒了(修改了,但還沒寫回主存),而第二個處理器恰好要讀取同一個內存地址,那麼這個讀操做不能直接去讀主存,而須要讀第一個處理器的緩存線。在下一節中,咱們將研究如何實現這種需求。

 

Before we get to this there are two more cache policies to mention:

  • write-combining; and
  • uncacheable.

Both these policies are used for special regions of the address space which are not backed by real RAM. The kernel sets up these policies for the address ranges (on x86 processors using the Memory Type Range Registers, MTRRs) and the rest happens automatically. The MTRRs are also usable to select between write-through and write-back policies.

Write-combining is a limited caching optimization more often used for RAM on devices such as graphics cards. Since the transfer costs to the devices are much higher than the local RAM access it is even more important to avoid doing too many transfers. Transferring an entire cache line just because a word in the line has been written is wasteful if the next operation modifies the next word. One can easily imagine that this is a common occurrence, the memory for horizontal neighboring pixels on a screen are in most cases neighbors, too. As the name suggests, write-combining combines multiple write accesses before the cache line is written out. In ideal cases the entire cache line is modified word by word and, only after the last word is written, the cache line is written to the device. This can speed up access to RAM on devices significantly.

譯者信息

在此以前,還有其它兩種緩存策略須要提一下:

  • 寫入合併
  • 不可緩存

這兩種策略用於真實內存不支持的特殊地址區,內核爲地址區設置這些策略(x86處理器利用內存類型範圍寄存器MTRR),餘下的部分自動進行。MTRR還可用於寫通和寫回策略的選擇。

寫入合併是一種有限的緩存優化策略,更多地用於顯卡等設備之上的內存。因爲設備的傳輸開銷比本地內存要高的多,所以避免進行過多的傳輸顯得尤其重要。若是僅僅由於修改了緩存線上的一個字,就傳輸整條線,而下個操做恰好是修改線上的下一個字,那麼此次傳輸就過於浪費了。而這偏偏對於顯卡來講是比較常見的情形——屏幕上水平鄰接的像素每每在內存中也是靠在一塊兒的。顧名思義,寫入合併是在寫出緩存線前,先將多個寫入訪問合併起來。在理想的狀況下,緩存線被逐字逐字地修改,只有當寫入最後一個字時,纔將整條線寫入內存,從而極大地加速內存的訪問。

 

Finally there is uncacheable memory. This usually means the memory location is not backed by RAM at all. It might be a special address which is hardcoded to have some functionality outside the CPU. For commodity hardware this most often is the case for memory mapped address ranges which translate to accesses to cards and devices attached to a bus (PCIe etc). On embedded boards one sometimes finds such a memory address which can be used to turn an LED on and off. Caching such an address would obviously be a bad idea. LEDs in this context are used for debugging or status reports and one wants to see this as soon as possible. The memory on PCIe cards can change without the CPU's interaction, so this memory should not be cached.

譯者信息最後來說一下不可緩存的內存。通常指的是不被RAM支持的內存位置,它能夠是硬編碼的特殊地址,承擔CPU之外的某些功能。對於商用硬件來講,比較常見的是映射到外部卡或設備的地址。在嵌入式主板上,有時也有相似的地址,用來開關LED。對這些地址進行緩存顯然沒有什麼意義。好比上述的LED,通常是用來調試或報告狀態,顯然應該儘快點亮或關閉。而對於那些PCI卡上的內存,因爲不須要CPU的干涉便可更改,也不應緩存。

 

3.3.4 Multi-Processor Support

In the previous section we have already pointed out the problem we have when multiple processors come into play. Even multi-core processors have the problem for those cache levels which are not shared (at least the L1d).

It is completely impractical to provide direct access from one processor to the cache of another processor. The connection is simply not fast enough, for a start. The practical alternative is to transfer the cache content over to the other processor in case it is needed. Note that this also applies to caches which are not shared on the same processor.

譯者信息

3.3.4 多處理器支持

在上節中,咱們已經指出了多個處理器開始發揮做用時會遇到的問題。即便是多核處理器,只要有不被全部核共享的緩存級別(至少L1d是這樣),也會有一樣的問題。

直接提供從一個處理器到另外一個處理器緩存的高速訪問是徹底不切實際的,首先鏈接速度就根本不夠快。實際的作法是,在須要時把緩存內容轉移到另外一個處理器。須要注意的是,這一樣適用於同一個處理器上不共享的緩存。

 

The question now is when does this cache line transfer have to happen? This question is pretty easy to answer: when one processor needs a cache line which is dirty in another processor's cache for reading or writing. But how can a processor determine whether a cache line is dirty in another processor's cache? Assuming it is just because a cache line is loaded by another processor would be suboptimal (at best). Usually the majority of memory accesses are read accesses and the resulting cache lines are not dirty. Processor operations on cache lines are frequent (of course, why else would we have this paper?) which means broadcasting information about changed cache lines after each write access would be impractical.

譯者信息如今的問題是,當該高速緩存線轉移的時候會發生什麼?這個問題回答起來至關容易:當一個處理器須要在另外一個處理器的高速緩存中讀或者寫的髒的高速緩存線的時候。但怎樣處理器怎樣肯定在另外一個處理器的緩存中的高速緩存線是髒的?假設它僅僅是由於一個高速緩存線被另外一個處理器加載將是次優的(最好的)。一般狀況下,大多數的內存訪問是隻讀的訪問和產生高速緩存線,並不髒。在高速緩存線上處理器頻繁的操做(固然,不然爲何咱們有這樣的文件呢?),也就意味着每一次寫訪問後,都要廣播關於高速緩存線的改變將變得不切實際。

 

What developed over the years is the MESI cache coherency protocol (Modified, Exclusive, Shared, Invalid). The protocol is named after the four states a cache line can be in when using the MESI protocol:

  • Modified: The local processor has modified the cache line. This also implies it is the only copy in any cache.
  • Exclusive: The cache line is not modified but known to not be loaded into any other processor's cache.
  • Shared: The cache line is not modified and might exist in another processor's cache.
  • Invalid: The cache line is invalid, i.e., unused.

This protocol developed over the years from simpler versions which were less complicated but also less efficient. With these four states it is possible to efficiently implement write-back caches while also supporting concurrent use of read-only data on different processors.

譯者信息

多年來,人們開發除了MESI緩存一致性協議(MESI=Modified, Exclusive, Shared, Invalid,變動的、獨佔的、共享的、無效的)。協議的名稱來自協議中緩存線能夠進入的四種狀態:

  • 變動的: 本地處理器修改了緩存線。同時暗示,它是全部緩存中惟一的拷貝。
  • 獨佔的: 緩存線沒有被修改,並且沒有被裝入其它處理器緩存。
  • 共享的: 緩存線沒有被修改,但可能已被裝入其它處理器緩存。
  • 無效的: 緩存線無效,即,未被使用。

MESI協議開發了不少年,最初的版本比較簡單,可是效率也比較差。如今的版本經過以上4個狀態能夠有效地實現寫回式緩存,同時支持不一樣處理器對只讀數據的併發訪問。
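
The transition rules listed above can be written down as a small table-like function. The sketch below is a deliberately simplified model of the rules as described in the text (it ignores the bus messages themselves and the write-back of Modified data), not an implementation of a real coherency controller.

  enum mesi { INVALID, SHARED, EXCLUSIVE, MODIFIED };

  /* State of our copy after a load issued by the local processor.
     others_have_line is the result of the snoop: did any other cache
     already hold the line? */
  enum mesi local_read(enum mesi s, int others_have_line)
  {
    if (s == INVALID)
      return others_have_line ? SHARED : EXCLUSIVE;
    return s;               /* M, E and S are unchanged by a local read */
  }

  /* A local store always ends in Modified.  From Shared (or Invalid) an RFO
     message has to go out on the bus first so other copies are invalidated;
     the E->M transition needs no bus traffic at all. */
  enum mesi local_write(enum mesi s)
  {
    (void)s;
    return MODIFIED;
  }

  /* Our copy when another processor announces a read: a Modified line is
     forwarded (and written to memory) and becomes Shared; E and S also end
     up Shared; an Invalid entry stays Invalid. */
  enum mesi remote_read(enum mesi s)
  {
    return s == INVALID ? INVALID : SHARED;
  }

  /* Our copy when another processor announces a write (an RFO): whatever we
     had is invalidated. */
  enum mesi remote_write(enum mesi s)
  {
    (void)s;
    return INVALID;
  }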

 

Figure 3.18: MESI Protocol Transitions

The state changes are accomplished without too much effort by the processors listening, or snooping, on the other processors' work. Certain operations a processor performs are announced on external pins and thus make the processor's cache handling visible to the outside. The address of the cache line in question is visible on the address bus. In the following description of the states and their transitions (shown in Figure 3.18) we will point out when the bus is involved.

Initially all cache lines are empty and hence also Invalid. If data is loaded into the cache for writing the cache changes to Modified. If the data is loaded for reading the new state depends on whether another processor has the cache line loaded as well. If this is the case then the new state is Shared, otherwise Exclusive.

譯者信息

 
圖3.18: MESI協議的狀態躍遷圖

在協議中,處理器經過監聽(snoop)其它處理器的活動,不需太多額外工做便可完成狀態變動。處理器會把某些操做發佈在外部引腳上,使外部能夠瞭解到它的緩存處理過程,而相關緩存線的地址則能夠在地址總線上看到。在下文講述各個狀態及其躍遷(如圖3.18所示)時,咱們會指出總線參與的時機。

一開始,全部緩存線都是空的,即處於無效(Invalid)狀態。當有數據裝進緩存供寫入時,緩存變爲變動(Modified)狀態。若是有數據裝進緩存供讀取,那麼新狀態取決於其它處理器是否也已經裝載了同一條緩存線:若是是,那麼新狀態爲共享(Shared)狀態,不然爲獨佔(Exclusive)狀態。

 

If a Modified cache line is read from or written to on the local processor, the instruction can use the current cache content and the state does not change. If a second processor wants to read from the cache line the first processor has to send the content of its cache to the second processor and then it can change the state to Shared. The data sent to the second processor is also received and processed by the memory controller which stores the content in memory. If this did not happen the cache line could not be marked as Shared. If the second processor wants to write to the cache line the first processor sends the cache line content and marks the cache line locally as Invalid. This is the infamous 「Request For Ownership」 (RFO) operation. Performing this operation in the last level cache, just like the I→M transition is comparatively expensive. For write-through caches we also have to add the time it takes to write the new cache line content to the next higher-level cache or the main memory, further increasing the cost.

譯者信息若是本地處理器對某條Modified緩存線進行讀寫,那麼直接使用緩存內容,狀態保持不變。若是另外一個處理器但願讀它,那麼第一個處理器將內容發給第一個處理器,而後能夠將緩存狀態置爲Shared。而發給第二個處理器的數據由內存控制器接收,並放入內存中。若是這一步沒有發生,就不能將這條線置爲Shared。若是第二個處理器但願的是寫,那麼第一個處理器將內容發給它後,將緩存置爲Invalid。這就是臭名昭著的"請求全部權(Request For Ownership,RFO)"操做。在末級緩存執行RFO操做的代價比較高。若是是寫通式緩存,還要加上將內容寫入上一層緩存或主存的時間,進一步提高了代價。

 

If a cache line is in the Shared state and the local processor reads from it no state change is necessary and the read request can be fulfilled from the cache. If the cache line is locally written to the cache line can be used as well but the state changes to Modified. It also requires that all other possible copies of the cache line in other processors are marked as Invalid. Therefore the write operation has to be announced to the other processors via an RFO message. If the cache line is requested for reading by a second processor nothing has to happen. The main memory contains the current data and the local state is already Shared. In case a second processor wants to write to the cache line (RFO) the cache line is simply marked Invalid. No bus operation is needed.

譯者信息對於Shared緩存線,本地處理器的讀取操做並不須要修改狀態,並且能夠直接從緩存知足。而本地處理器的寫入操做則須要將狀態置爲Modified,並且須要將緩存線在其它處理器的全部拷貝置爲Invalid。所以,這個寫入操做須要經過RFO消息發通知其它處理器。若是第二個處理器請求讀取,無事發生。由於主存已經包含了當前數據,並且狀態已經爲Shared。若是第二個處理器須要寫入,則將緩存線置爲Invalid。不須要總線操做。

 

The Exclusive state is mostly identical to the Shared state with one crucial difference: a local write operation does not have to be announced on the bus. The local cache copy is known to be the only one. This can be a huge advantage so the processor will try to keep as many cache lines as possible in the Exclusive state instead of the Shared state. The latter is the fallback in case the information is not available at that moment. The Exclusive state can also be left out completely without causing functional problems. It is only the performance that will suffer since the E→M transition is much faster than the S→M transition.

From this description of the state transitions it should be clear where the costs specific to multi-processor operations are. Yes, filling caches is still expensive but now we also have to look out for RFO messages. Whenever such a message has to be sent things are going to be slow.

譯者信息

Exclusive狀態與Shared狀態很像,只有一個不一樣之處: 在Exclusive狀態時,本地寫入操做不須要在總線上聲明,由於本地的緩存是系統中惟一的拷貝。這是一個巨大的優點,因此處理器會盡可能將緩存線保留在Exclusive狀態,而不是Shared狀態。只有在信息不可用時,才退而求其次選擇shared。放棄Exclusive不會引發任何功能缺失,但會致使性能降低,由於E→M要遠遠快於S→M。

從以上的說明中應該已經能夠看出,在多處理器環境下,哪一步的代價比較大了。填充緩存的代價固然仍是很高,但咱們還須要留意RFO消息。一旦涉及RFO,操做就快不起來了。

 

There are two situations when RFO messages are necessary:

  • A thread is migrated from one processor to another and all the cache lines have to be moved over to the new processor once.
  • A cache line is truly needed in two different processors. {At a smaller level the same is true for two cores on the same processor. The costs are just a bit smaller. The RFO message is likely to be sent many times.}

In multi-thread or multi-process programs there is always some need for synchronization; this synchronization is implemented using memory. So there are some valid RFO messages. They still have to be kept as infrequent as possible. There are other sources of RFO messages, though. In Section 6 we will explain these scenarios. The Cache coherency protocol messages must be distributed among the processors of the system. A MESI transition cannot happen until it is clear that all the processors in the system have had a chance to reply to the message. That means that the longest possible time a reply can take determines the speed of the coherency protocol. {Which is why we see nowadays, for instance, AMD Opteron systems with three sockets. Each processor is exactly one hop away given that the processors only have three hyperlinks and one is needed for the Southbridge connection.} Collisions on the bus are possible, latency can be high in NUMA systems, and of course sheer traffic volume can slow things down. All good reasons to focus on avoiding unnecessary traffic.

譯者信息

RFO在兩種狀況下是必需的:

  • 線程從一個處理器遷移到另外一個處理器,須要將全部緩存線移到新處理器。
  • 某條緩存線確實須要被兩個不一樣的處理器使用。{對於同一處理器的兩個核心,也有一樣的狀況,只是代價稍低。RFO消息可能會被發送屢次。}

多線程或多進程的程序老是須要同步,而這種同步依賴內存來實現。所以,有些RFO消息是合理的,但仍然須要儘可能下降發送頻率。除此之外,還有其它來源的RFO。在第6節中,咱們將解釋這些場景。緩存一致性協議的消息必須發給系統中全部處理器。只有當協議肯定已經給過全部處理器響應機會以後,才能進行狀態躍遷。也就是說,協議的速度取決於最長響應時間。{這也是如今能看到三插槽AMD Opteron系統的緣由。這類系統只有三個超級鏈路(hyperlink),其中一個鏈接南橋,每一個處理器之間都只有一跳的距離。}總線上可能會發生衝突,NUMA系統的延時很大,突發的流量會拖慢通訊。這些都是讓咱們避免無謂流量的充足理由。
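
The second situation is easy to provoke from user code: let two threads repeatedly write to variables that happen to sit on the same cache line, so the line keeps bouncing between the caches via RFO messages. The pthread sketch below is illustrative only; the 64-byte line size and the iteration count are assumptions, and the padded variant is the usual way to make the RFO traffic disappear.

  #include <pthread.h>
  #include <stdio.h>

  #define ITERS 100000000UL

  /* Both counters live in the same cache line: every increment forces an
     RFO that pulls the line away from the other core. */
  static struct { volatile long a, b; } same_line;

  /* Padding to an assumed 64-byte line keeps the counters apart, so each
     core can keep its line in the Exclusive/Modified state. */
  static struct { volatile long v; char pad[64 - sizeof(long)]; } padded[2];

  static void *bump_a(void *arg)
  {
    unsigned long i;
    (void)arg;
    for (i = 0; i < ITERS; ++i)
      same_line.a++;
    return NULL;
  }

  static void *bump_b(void *arg)
  {
    unsigned long i;
    (void)arg;
    for (i = 0; i < ITERS; ++i)
      same_line.b++;
    return NULL;
  }

  int main(void)
  {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld %ld\n", same_line.a, same_line.b);
    /* Re-running with padded[0].v and padded[1].v instead of same_line.a/b
       removes the RFO traffic and is noticeably faster. */
    (void)padded;
    return 0;
  }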

 

There is one more problem related to having more than one processor in play. The effects are highly machine specific but in principle the problem always exists: the FSB is a shared resource. In most machines all processors are connected via one single bus to the memory controller (see Figure 2.1). If a single processor can saturate the bus (as is usually the case) then two or four processors sharing the same bus will restrict the bandwidth available to each processor even more.

Even if each processor has its own bus to the memory controller as in Figure 2.2 there is still the bus to the memory modules. Usually this is one bus but, even in the extended model in Figure 2.2, concurrent accesses to the same memory module will limit the bandwidth.

The same is true with the AMD model where each processor can have local memory. Yes, all processors can concurrently access their local memory quickly. But multi-thread and multi-process programs--at least from time to time--have to access the same memory regions to synchronize.

譯者信息

此外,關於多處理器還有一個問題。雖然它的影響與具體機器密切相關,但根源是惟一的——FSB是共享的。在大多數狀況下,全部處理器經過惟一的總線鏈接到內存控制器(參見圖2.1)。若是一個處理器就能佔滿總線(十分常見),那麼共享總線的兩個或四個處理器顯然只會獲得更有限的帶寬。

即便每一個處理器有本身鏈接內存控制器的總線,如圖2.2,但還須要通往內存模塊的總線。通常狀況下,這種總線只有一條。退一步說,即便像圖2.2那樣不止一條,對同一個內存模塊的併發訪問也會限制它的帶寬。

對於每一個處理器擁有本地內存的AMD模型來講,也是一樣的問題。的確,全部處理器能夠很是快速地同時訪問它們本身的內存。可是,多線程呢?多進程呢?它們仍然須要經過訪問同一塊內存來進行同步。

 

Concurrency is severely limited by the finite bandwidth available for the implementation of the necessary synchronization. Programs need to be carefully designed to minimize accesses from different processors and cores to the same memory locations. The following measurements will show this and the other cache effects related to multi-threaded code.

Multi Threaded Measurements

To ensure that the gravity of the problems introduced by concurrently using the same cache lines on different processors is understood, we will look here at some more performance graphs for the same program we used before. This time, though, more than one thread is running at the same time. What is measured is the fastest runtime of any of the threads. This means the time for a complete run when all threads are done is even higher. The machine used has four processors; the tests use up to four threads. All processors share one bus to the memory controller and there is only one bus to the memory modules.

譯者信息

對同步來講,有限的帶寬嚴重地制約着併發度。程序須要更加謹慎的設計,將不一樣處理器訪問同一塊內存的機會降到最低。如下的測試展現了這一點,還展現了與多線程代碼相關的其它效果。

多線程測量

爲了幫助你們理解問題的嚴重性,咱們來看一些曲線圖,主角也是前文的那個程序。只不過這一次,咱們同時運行多個線程,並測量這些線程中最快的那個的運行時間。也就是說,等它們所有運行完須要更長的時間。咱們用的機器有4個處理器,測試最多跑4個線程。全部處理器共享同一條通往內存控制器的總線,通往內存模塊的總線也只有一條。
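
A minimal sketch of such a multi-threaded harness is shown below. It reuses the hypothetical cycles_per_element() helper from the earlier sketch, starts one walker per thread and reports the fastest of them, which is the value plotted in the following figures. Thread pinning and the question of whether the threads walk private or shared lists are left out here.

  #include <pthread.h>
  #include <stddef.h>

  struct task {
    struct l *head;             /* list this thread walks                 */
    size_t    nelem;
    size_t    rounds;
    double    cycles;           /* result: cycles per list element        */
  };

  static void *worker(void *arg)
  {
    struct task *t = arg;
    t->cycles = cycles_per_element(t->head, t->nelem, t->rounds);
    return NULL;
  }

  /* Run nthreads walkers in parallel (at most 4 on the test machine) and
     return the best cycles-per-element value among them. */
  double run_threads(struct task *tasks, int nthreads)
  {
    pthread_t tid[4];
    double best;
    int i;

    for (i = 0; i < nthreads; ++i)
      pthread_create(&tid[i], NULL, worker, &tasks[i]);
    for (i = 0; i < nthreads; ++i)
      pthread_join(tid[i], NULL);

    best = tasks[0].cycles;
    for (i = 1; i < nthreads; ++i)
      if (tasks[i].cycles < best)
        best = tasks[i].cycles;
    return best;
  }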

 

Figure 3.19: Sequential Read Access, Multiple Threads

Figure 3.19 shows the performance for sequential read-only access for 128 bytes entries (NPAD=15 on 64-bit machines). For the curve for one thread we can expect a curve similar to Figure 3.11. The measurements are for a different machine so the actual numbers vary.

The important part in this figure is of course the behavior when running multiple threads. Note that no memory is modified and no attempts are made to keep the threads in sync when walking the linked list. Even though no RFO messages are necessary and all the cache lines can be shared, we see up to an 18% performance decrease for the fastest thread when two threads are used and up to 34% when four threads are used. Since no cache lines have to be transported between the processors this slowdown is solely caused by one or both of the two bottlenecks: the shared bus from the processor to the memory controller and bus from the memory controller to the memory modules. Once the working set size is larger than the L3 cache in this machine all three threads will be prefetching new list elements. Even with two threads the available bandwidth is not sufficient to scale linearly (i.e., have no penalty from running multiple threads).

譯者信息
 
圖3.19: 順序讀操做,多線程

圖3.19展現了順序讀訪問時的性能,元素爲128字節長(64位計算機,NPAD=15)。對於單線程的曲線,咱們預計是與圖3.11類似,只不過是換了一臺機器,因此實際的數字會有些小差異。

更重要的部分固然是多線程的環節。因爲是隻讀,不會去修改內存,不會嘗試同步。但即便不須要RFO,並且全部緩存線均可共享,性能仍然分別降低了18%(雙線程)和34%(四線程)。因爲不須要在處理器之間傳輸緩存,所以這裏的性能降低徹底由如下兩個瓶頸之一或同時引發: 一是從處理器到內存控制器的共享總線,二是從內存控制器到內存模塊的共享總線。當工做集超過L3後,三種狀況下都要預取新元素,而即便是雙線程,可用的帶寬也沒法知足線性擴展(無懲罰)。

 

When we modify memory things get even uglier. Figure 3.20 shows the results for the sequential Increment test.

Figure 3.20: Sequential Increment, Multiple Threads

This graph is using a logarithmic scale for the Y axis. So, do not be fooled by the apparently small differences. We still have about a 18% penalty for running two threads and now an amazing 93% penalty for running four threads. This means the prefetch traffic together with the write-back traffic is pretty much saturating the bus when four threads are used.

We use the logarithmic scale to show the results for the L1d range. What can be seen is that, as soon as more than one thread is running, the L1d is basically ineffective. The single-thread access times exceed 20 cycles only when the L1d is not sufficient to hold the working set. When multiple threads are running, those access times are hit immediately, even with the smallest working set sizes.

譯者信息

當加入修改以後,場面更加難看了。圖3.20展現了順序遞增測試的結果。

 
圖3.20: 順序遞增,多線程

圖中Y軸採用的是對數刻度,不要被看起來很小的差值欺騙了。如今,雙線程的性能懲罰仍然是18%,但四線程的懲罰飆升到了93%!緣由在於,採用四線程時,預取的流量與寫回的流量加在一塊兒,佔滿了整個總線。

咱們用對數刻度來展現L1d範圍的結果。能夠發現,當超過一個線程後,L1d就無力了。單線程時,僅當工做集超過L1d時訪問時間纔會超過20個週期,而多線程時,即便在很小的工做集狀況下,訪問時間也達到了那個水平。

 

One aspect of the problem is not shown here. It is hard to measure with this specific test program. Even though the test modifies memory and we therefore must expect RFO messages we do not see higher costs for the L2 range when more than one thread is used. The program would have to use a large amount of memory and all threads must access the same memory in parallel. This is hard to achieve without a lot of synchronization which would then dominate the execution time.

Figure 3.21: Random Addnextlast, Multiple Threads

Finally in Figure 3.21 we have the numbers for the Addnextlast test with random access of memory. This figure is provided mainly to show the appallingly high numbers. It now takes around 1,500 cycles to process a single list element in the extreme case. The use of more threads is even more questionable. We can summarize the efficiency of multiple thread use in a table.

#Threads Seq Read Seq Inc Rand Add
2 1.69 1.69 1.54
4 2.98 2.07 1.65

Table 3.3: Efficiency for Multiple Threads

The table shows the efficiency for the multi-thread run with the largest working set size in the three figures on Figure 3.21. The number shows the best possible speed-up the test program incurs for the largest working set size by using two or four threads. For two threads the theoretical limits for the speed-up are 2 and, for four threads, 4. The numbers for two threads are not that bad. But for four threads the numbers for the last test show that it is almost not worth it to scale beyond two threads. The additional benefit is minuscule. We can see this more easily if we represent the data in Figure 3.21 a bit differently.

 

譯者信息

這裏並無揭示問題的另外一方面,主要是用這個程序很難進行測量。問題是這樣的,咱們的測試程序修改了內存,因此本應看到RFO的影響,但在結果中,咱們並無在L2階段看到更大的開銷。緣由在於,要看到RFO的影響,程序必須使用大量內存,並且全部線程必須同時訪問同一塊內存。若是沒有大量的同步,這是很難實現的,而若是加入同步,則會佔滿執行時間。

 
圖3.21: 隨機的Addnextlast,多線程

最後,在圖3.21中,咱們展現了隨機訪問的Addnextlast測試的結果。這裏主要是爲了讓你們感覺一下這些巨大到爆的數字。極端狀況下,甚至用了1500個週期才處理完一個元素。若是加入更多線程,真是不可想象哪。咱們把多線程的效能總結了一下:

 

#Threads Seq Read Seq Inc Rand Add
2 1.69 1.69 1.54
4 2.98 2.07 1.65
表3.3: 多線程的效能

這個表展現了圖3.21中多線程運行大工做集時的效能。表中的數字表示測試程序在使用多線程處理大工做集時可能達到的最大加速因子。雙線程和四線程的理論最大加速因子分別是2和4。從表中數據來看,雙線程的結果還能接受,但四線程的結果代表,擴展到雙線程以上是沒有什麼意義的,帶來的收益能夠忽略不計。只要咱們把圖3.21換個方式呈現,就能夠很容易看清這一點。

 

Figure 3.22: Speed-Up Through Parallelism

The curves in Figure 3.22 show the speed-up factors, i.e., relative performance compared to the code executed by a single thread. We have to ignore the smallest sizes, the measurements are not accurate enough. For the range of the L2 and L3 cache we can see that we indeed achieve almost linear acceleration. We almost reach factors of 2 and 4 respectively. But as soon as the L3 cache is not sufficient to hold the working set the numbers crash. They crash to the point that the speed-up of two and four threads is identical (see the fourth column in Table 3.3). This is one of the reasons why one can hardly find motherboard with sockets for more than four CPUs all using the same memory controller. Machines with more processors have to be built differently (see Section 5).

譯者信息

 
圖3.22: 經過並行化實現的加速因子

圖3.22中的曲線展現了加速因子,即多線程相對於單線程所能得到的相對性能。測量值的精確度有限,所以咱們須要忽略最小的那幾個工做集。能夠看到,在L2與L3範圍內,多線程基本能夠作到線性加速,雙線程和四線程分別接近2和4的加速因子。可是,一旦工做集的大小超出L3,曲線就崩塌了,雙線程和四線程降到了基本相同的數值(參見表3.3中第4列)。也是部分因爲這個緣由,咱們不多看到帶4個以上CPU插槽、卻共享同一個內存控制器的主板。若是須要配置更多處理器,就只能選擇其它的實現方式(參見第5節)。

 

These numbers are not universal. In some cases even working sets which fit into the last level cache will not allow linear speed-ups. In fact, this is the norm since threads are usually not as decoupled as is the case in this test program. On the other hand it is possible to work with large working sets and still take advantage of more than two threads. Doing this requires thought, though. We will talk about some approaches in Section 6.

Special Case: Hyper-Threads

Hyper-Threads (sometimes called Symmetric Multi-Threading, SMT) are implemented by the CPU and are a special case since the individual threads cannot really run concurrently. They all share almost all the processing resources except for the register set. Individual cores and CPUs still work in parallel but the threads implemented on each core are limited by this restriction. In theory there can be many threads per core but, so far, Intel's CPUs at most have two threads per core. The CPU is responsible for time-multiplexing the threads. This alone would not make much sense, though. The real advantage is that the CPU can schedule another hyper-thread when the currently running hyper-thread is delayed. In most cases this is a delay caused by memory accesses.

譯者信息

惋惜,上圖中的數據並非廣泛狀況。在某些狀況下,即便工做集可以放入末級緩存,也沒法實現線性加速。實際上,這反而是正常的,由於普通的線程都有必定的耦合關係,不會像咱們的測試程序這樣徹底獨立。而反過來講,即便是很大的工做集,即便是兩個以上的線程,也是能夠經過並行化受益的,可是須要程序員的聰明才智。咱們會在第6節進行一些介紹。

特例: 超線程

由CPU實現的超線程(有時又叫對稱多線程,SMT)是一種比較特殊的狀況,每一個線程並不能真正併發地運行。它們共享着除寄存器組外的幾乎全部處理資源。每一個核心和CPU仍然是並行工做的,但核心上實現的線程則受到這個限制。理論上,每一個核心能夠有不少線程,不過到目前爲止,Intel的CPU每核最多隻有兩個線程。CPU負責對各線程進行時分複用,不過僅僅這樣並無多大意義。它真正的優點在於,當當前運行的超線程發生延遲時,CPU能夠調度另外一個超線程運行。這種延遲通常是由內存訪問引發的。

 

If two threads are running on one hyper-threaded core the program is only more efficient than the single-threaded code if the combined runtime of both threads is lower than the runtime of the single-threaded code. This is possible by overlapping the wait times for different memory accesses which usually would happen sequentially. A simple calculation shows the minimum requirement on the cache hit rate to achieve a certain speed-up.

The execution time for a program can be approximated with a simple model with only one level of cache as follows (see [htimpact]):

Texe = N[(1-Fmem)·Tproc + Fmem·(Ghit·Tcache + (1-Ghit)·Tmiss)]

The meaning of the variables is as follows:

N = Number of instructions.
Fmem = Fraction of N that access memory.
Ghit = Fraction of loads that hit the cache.
Tproc = Number of cycles per instruction.
Tcache = Number of cycles for cache hit.
Tmiss = Number of cycles for cache miss.
Texe = Execution time for program.

 

譯者信息

若是兩個線程運行在一個超線程核心上,那麼只有當兩個線程合起來運行時間少於單線程運行時間時,效率纔會比較高。咱們能夠將一般前後發生的內存訪問疊合在一塊兒,以實現這個目標。有一個簡單的計算公式,能夠幫助咱們計算若是須要某個加速因子,最少須要多少的緩存命中率。

程序的執行時間能夠經過一個只有一級緩存的簡單模型來進行估算(參見[htimpact]):

  Texe = N[(1-Fmem)·Tproc + Fmem·(Ghit·Tcache + (1-Ghit)·Tmiss)]

各變量的含義以下:

N = 指令數
Fmem = N中訪問內存的比例
Ghit = 命中緩存的比例
Tproc = 每條指令所用的週期數
Tcache = 緩存命中所用的週期數
Tmiss = 緩存未命中所用的週期數
Texe = 程序的執行時間
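
Plugging numbers into the model is straightforward. The sketch below just evaluates Texe for a range of hit rates; all parameter values in it are made-up illustrations, not measurements from the machines used in this paper.

  #include <stdio.h>

  /* Texe = N[(1-Fmem)*Tproc + Fmem*(Ghit*Tcache + (1-Ghit)*Tmiss)] */
  double texe(double N, double Fmem, double Ghit,
              double Tproc, double Tcache, double Tmiss)
  {
    return N * ((1.0 - Fmem) * Tproc +
                Fmem * (Ghit * Tcache + (1.0 - Ghit) * Tmiss));
  }

  int main(void)
  {
    /* Illustrative values: 10^9 instructions, 30% of them memory accesses,
       1 cycle per instruction, 4 cycles for a hit, 200 for a miss. */
    double N = 1e9, Fmem = 0.3, Tproc = 1, Tcache = 4, Tmiss = 200;
    double Ghit;

    for (Ghit = 0.60; Ghit <= 1.0001; Ghit += 0.05)
      printf("Ghit=%.2f  Texe=%.3g cycles\n",
             Ghit, texe(N, Fmem, Ghit, Tproc, Tcache, Tmiss));
    return 0;
  }
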
For it to make any sense to use two threads the execution time of each of the two threads must be at most half of that of the single-threaded code. The only variable on either side is the number of cache hits. If we solve the equation for the minimum cache hit rate required to not slow down the thread execution by 50% or more we get the graph in Figure 3.23.

 

Figure 3.23: Minimum Cache Hit Rate For Speed-Up

The X–axis represents the cache hit rate Ghit of the single-thread code. The Y–axis shows the required cache hit rate for the multi-threaded code. This value can never be higher than the single-threaded hit rate since, otherwise, the single-threaded code would use that improved code, too. For single-threaded hit rates below 55% the program can in all cases benefit from using threads. The CPU is more or less idle enough due to cache misses to enable running a second hyper-thread.

譯者信息

要讓使用雙線程有意義,兩個線程中每一個線程的執行時間最多隻能是單線程代碼的一半。等式兩邊惟一的變量就是緩存命中數。若是咱們求解「要使線程執行速度的降低不超過50%所需的最小緩存命中率」這個等式,就能夠獲得圖3.23中的曲線。

圖 3.23: 得到加速所需的最小緩存命中率

X軸表示單線程代碼的緩存命中率Ghit,Y軸表示多線程代碼所需的緩存命中率。Y值永遠不會高於單線程的命中率,不然單線程代碼也能夠採用改進後的那份代碼。當單線程代碼的命中率低於55%時,程序在任何狀況下都能從多線程中獲益:因爲緩存未命中,CPU多少有足夠的空閒時間去運行第二個超線程。

 

The green area is the target. If the slowdown for the thread is less than 50% and the workload of each thread is halved the combined runtime might be less than the single-thread runtime. For the modeled system here (numbers for a P4 with hyper-threads were used) a program with a hit rate of 60% for the single-threaded code requires a hit rate of at least 10% for the dual-threaded program. That is usually doable. But if the single-threaded code has a hit rate of 95% then the multi-threaded code needs a hit rate of at least 80%. That is harder. Especially, and this is the problem with hyper-threads, because now the effective cache size (L1d here, in practice also L2 and so on) available to each hyper-thread is cut in half. Both hyper-threads use the same cache to load their data. If the working set of the two threads is non-overlapping the original 95% hit rate could also be cut in half and is therefore much lower than the required 80%.

譯者信息綠色區域是咱們的目標。若是線程的速度沒有慢過50%,而每一個線程的工做量只有原來的一半,那麼它們合起來的耗時應該會少於單線程的耗時。對咱們用的示例系統來講(使用超線程的P4機器),若是單線程代碼的命中率爲60%,那麼多線程代碼至少要達到10%才能得到收益。這個要求通常來講仍是能夠作到的。可是,若是單線程代碼的命中率達到了95%,那麼多線程代碼要作到80%才行。這就很難了。並且,這裏還涉及到超線程,在兩個超線程的狀況下,每一個超線程只能分到一半的有效緩存。由於全部超線程是使用同一個緩存來裝載數據的,若是兩個超線程的工做集沒有重疊,那麼原始的95%也會被打對摺——47%,遠低於80%。

 

Hyper-Threads are therefore only useful in a limited range of situations. The cache hit rate of the single-threaded code must be low enough that given the equations above and reduced cache size the new hit rate still meets the goal. Then and only then can it make any sense at all to use hyper-threads. Whether the result is faster in practice depends on whether the processor is sufficiently able to overlap the wait times in one thread with execution times in the other threads. The overhead of parallelizing the code must be added to the new total runtime and this additional cost often cannot be neglected.

In Section 6.3.4 we will see a technique where threads collaborate closely and the tight coupling through the common cache is actually an advantage. This technique can be applicable to many situations if only the programmers are willing to put in the time and energy to extend their code.

譯者信息

所以,超線程只在有限的場合纔有用。單線程代碼的緩存命中率必須低到必定程度,使得在緩存容量減半後,按照上面的等式獲得的新命中率仍能達到目標。只有在這種狀況下,使用超線程纔有意義。在實踐中,採用超線程可否獲得更快的結果,取決於處理器可否把一個線程的等待時間與其它線程的執行時間充分地重疊起來。並行化代碼自己也有開銷,這部分開銷要加到新的總運行時間裏,並且每每不能忽略。

在6.3.4節中,咱們會介紹一種技術,它將多個線程經過公用緩存緊密地耦合起來。這種技術適用於許多場合,前提是程序員們樂意花費時間和精力擴展本身的代碼。

 

What should be clear is that if the two hyper-threads execute completely different code (i.e., the two threads are treated like separate processors by the OS to execute separate processes) the cache size is indeed cut in half which means a significant increase in cache misses. Such OS scheduling practices are questionable unless the caches are sufficiently large. Unless the workload for the machine consists of processes which, through their design, can indeed benefit from hyper-threads it might be best to turn off hyper-threads in the computer's BIOS. {Another reason to keep hyper-threads enabled is debugging. SMT is astonishingly good at finding some sets of problems in parallel code.}

譯者信息若是兩個超線程執行徹底不一樣的代碼(兩個線程就像被當成兩個處理器,分別執行不一樣進程),那麼緩存容量就真的會降爲一半,致使緩衝未命中率大爲攀升,這一點應該是很清楚的。這樣的調度機制是頗有問題的,除非你的緩存足夠大。因此,除非程序的工做集設計得比較合理,可以確實從超線程獲益,不然仍是建議在BIOS中把超線程功能關掉。{咱們可能會由於另外一個緣由 開啓 超線程,那就是調試,由於SMT在查找並行代碼的問題方面真的很是好用。}

 

3.3.5 Other Details

So far we talked about the address as consisting of three parts, tag, set index, and cache line offset. But what address is actually used? All relevant processors today provide virtual address spaces to processes, which means that there are two different kinds of addresses: virtual and physical.

The problem with virtual addresses is that they are not unique. A virtual address can, over time, refer to different physical memory addresses. The same address in different process also likely refers to different physical addresses. So it is always better to use the physical memory address, right?

譯者信息

3.3.5 其它細節

咱們已經介紹了地址的組成,即標籤、集合索引和偏移三個部分。那麼,實際會用到什麼樣的地址呢?目前,處理器通常都向進程提供虛擬地址空間,意味着咱們有兩種不一樣的地址: 虛擬地址和物理地址。

虛擬地址有個問題——並不惟一。隨着時間的變化,虛擬地址能夠變化,指向不一樣的物理地址。同一個地址在不一樣的進程裏也能夠表示不一樣的物理地址。那麼,是否是用物理地址會比較好呢?

 

The problem here is that instructions use virtual addresses and these have to be translated with the help of the Memory Management Unit (MMU) into physical addresses. This is a non-trivial operation. In the pipeline to execute an instruction the physical address might only be available at a later stage. This means that the cache logic has to be very quick in determining whether the memory location is cached. If virtual addresses could be used the cache lookup can happen much earlier in the pipeline and in case of a cache hit the memory content can be made available. The result is that more of the memory access costs could be hidden by the pipeline.

譯者信息問題是,指令使用的是虛擬地址,它們必須在內存管理單元(MMU)的協助下被翻譯成物理地址,而這並非一個簡單的操做。在執行指令的管線(pipeline)中,物理地址可能要到比較靠後的階段才能獲得。這意味着,緩存邏輯必須在很短的時間裏判斷地址是否已被緩存。而若是能夠使用虛擬地址,緩存查找就能夠在管線中更早的階段進行,一旦命中,就能夠立刻使用內存的內容。其結果是,管線能夠隱藏更多內存訪問的開銷。

 

Processor designers are currently using virtual address tagging for the first level caches. These caches are rather small and can be cleared without too much pain. At least partial clearing the cache is necessary if the page table tree of a process changes. It might be possible to avoid a complete flush if the processor has an instruction which specifies the virtual address range which has changed. Given the low latency of L1i and L1d caches (~3 cycles) using virtual addresses is almost mandatory.

For larger caches including L2, L3, ... caches physical address tagging is needed. These caches have a higher latency and the virtual→physical address translation can finish in time. Because these caches are larger (i.e., a lot of information is lost when they are flushed) and refilling them takes a long time due to the main memory access latency, flushing them often would be costly.

譯者信息

處理器的設計人員們如今使用虛擬地址來標記第一級緩存。這些緩存很小,很容易被清空。在進程頁表樹發生變動的狀況下,至少是須要清空部分緩存的。若是處理器擁有指定變動地址範圍的指令,那麼能夠避免緩存的徹底刷新。因爲一級緩存L1i及L1d的時延都很小(~3週期),基本上必須使用虛擬地址。

對於更大的緩存,包括L2和L3等,則須要以物理地址做爲標籤。由於這些緩存的時延比較大,虛擬到物理地址的映射能夠在容許的時間裏完成,並且因爲主存時延的存在,從新填充這些緩存會消耗比較長的時間,刷新的代價比較昂貴。

 

It should, in general, not be necessary to know about the details of the address handling in those caches. They cannot be changed and all the factors which would influence the performance are normally something which should be avoided or is associated with high cost. Overflowing the cache capacity is bad and all caches run into problems early if the majority of the used cache lines fall into the same set. The latter can be avoided with virtually addressed caches but is impossible for user-level processes to avoid for caches addressed using physical addresses. The only detail one might want to keep in mind is to not map the same physical memory location to two or more virtual addresses in the same process, if at all possible.

譯者信息通常來講,咱們並不須要瞭解這些緩存處理地址的細節。咱們沒法更改它們,而那些可能影響性能的因素,要麼是應該避免的,要麼代價很高。超出緩存的容量是很糟糕的;若是大部分被使用的緩存線都落入同一個緩存集,全部緩存都會很早就出問題。後一個問題能夠經過以虛擬地址索引的緩存來避免,但對於以物理地址索引的緩存,用戶級進程是沒法避免的。咱們惟一須要記住的細節是:若是可能的話,儘可能不要在同一個進程裏把同一塊物理內存映射到兩個以上的虛擬地址。
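
To make that last advice concrete, the sketch below shows the most common way such aliasing arises in the first place: mapping the same file region twice with mmap, which puts one physical page at two virtual addresses in the same process. The file name is a placeholder; the point is simply that a single mapping should be preferred.

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
      int fd = open("/tmp/example-data", O_RDWR);   /* hypothetical file */
      if (fd < 0) return 1;

      /* Two MAP_SHARED mappings of the same file offset refer to the same
         physical page under two different virtual addresses.            */
      void *a = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      void *b = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      if (a == MAP_FAILED || b == MAP_FAILED) return 1;

      printf("same page visible at %p and at %p\n", a, b);

      /* Prefer working through a single mapping (a) instead of touching
         the data through both aliases.                                  */
      munmap(b, 4096);
      munmap(a, 4096);
      close(fd);
      return 0;
  }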

 

Another detail of the caches which is rather uninteresting to programmers is the cache replacement strategy. Most caches evict the Least Recently Used (LRU) element first. This is always a good default strategy. With larger associativity (and associativity might indeed grow further in the coming years due to the addition of more cores) maintaining the LRU list becomes more and more expensive and we might see different strategies adopted.

As for the cache replacement there is not much a programmer can do. If the cache is using physical address tags there is no way to find out how the virtual addresses correlate with the cache sets. It might be that cache lines in all logical pages are mapped to the same cache sets, leaving much of the cache unused. If anything, it is the job of the OS to arrange that this does not happen too often.

譯者信息

另外一個細節對程序員們來講比較乏味,那就是緩存的替換策略。大多數緩存會優先逐出最近最少使用(Least Recently Used,LRU)的元素。這每每是一個效果比較好的策略。在關聯性很大的狀況下(隨着之後核心數的增長,關聯性勢必會變得愈來愈大),維護LRU列表變得愈來愈昂貴,因而咱們開始看到其它的一些策略。

在緩存的替換策略方面,程序員能夠作的事情很少。若是緩存使用物理地址做爲標籤,咱們是沒法找出虛擬地址與緩存集之間關聯的。有可能會出現這樣的情形: 全部邏輯頁中的緩存線都映射到同一個緩存集,而其它大部分緩存卻空閒着。即便有這種狀況,也只能依靠OS進行合理安排,避免頻繁出現。

 

With the advent of virtualization things get even more complicated. Now not even the OS has control over the assignment of physical memory. The Virtual Machine Monitor (VMM, aka hypervisor) is responsible for the physical memory assignment.

The best a programmer can do is to a) use logical memory pages completely and b) use page sizes as large as meaningful to diversify the physical addresses as much as possible. Larger page sizes have other benefits, too, but this is another topic (see Section 4).

譯者信息

虛擬化的出現使得這一切變得更加複雜。如今不只操做系統能夠控制物理內存的分配。虛擬機監視器(VMM,也稱爲 hypervisor)也負責分配內存。

對程序員來講,最好 a) 徹底用滿所使用的邏輯內存頁面,b) 在有意義的範圍內使用盡量大的頁面,以儘可能分散物理地址。更大的頁面還有其餘好處,不過這是另外一個話題(見第4節)。
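
On Linux, one way to follow advice b) is to request explicit huge pages. The sketch below is a minimal example using mmap with MAP_HUGETLB; it assumes the administrator has reserved huge pages (for instance via /proc/sys/vm/nr_hugepages), otherwise the call simply fails.

  #include <stdio.h>
  #include <sys/mman.h>

  #define HUGE_2M (2UL * 1024 * 1024)

  int main(void)
  {
      void *p = mmap(NULL, HUGE_2M, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
      if (p == MAP_FAILED) {
          perror("mmap(MAP_HUGETLB)");
          return 1;
      }
      /* The whole 2MB region is backed by a single page, i.e. one TLB
         entry instead of 512 entries for 4kB pages.                    */
      munmap(p, HUGE_2M);
      return 0;
  }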

 

3.4 Instruction Cache

Not just the data used by the processor is cached; the instructions executed by the processor are also cached. However, this cache is much less problematic than the data cache. There are several reasons:

  • The quantity of code which is executed depends on the size of the code that is needed. The size of the code in general depends on the complexity of the problem. The complexity of the problem is fixed.
  • While the program's data handling is designed by the programmer the program's instructions are usually generated by a compiler. The compiler writers know about the rules for good code generation.
  • Program flow is much more predictable than data access patterns. Today's CPUs are very good at detecting patterns. This helps with prefetching.
  • Code always has quite good spatial and temporal locality

There are a few rules programmers should follow but these mainly consist of rules on how to use the tools. We will discuss them in Section 6. Here we talk only about the technical details of the instruction cache.

譯者信息

3.4 指令緩存

其實,不光處理器使用的數據被緩存,它們執行的指令也是被緩存的。只不過,指令緩存的問題相對來講要少得多,由於:

  • 執行的代碼量取決於代碼大小。而代碼大小一般取決於問題複雜度。問題複雜度則是固定的。
  • 程序的數據處理邏輯是程序員設計的,而程序的指令倒是編譯器生成的。編譯器的做者知道如何生成優良的代碼。
  • 程序的流向比數據訪問模式更容易預測。現現在的CPU很擅長模式檢測,對預取頗有利。
  • 代碼永遠都有良好的時間局部性和空間局部性。

有一些準則是須要程序員們遵照的,但大都是關於如何使用工具的,咱們會在第6節介紹它們。而在這裏咱們只介紹一下指令緩存的技術細節。

 

Ever since the core clock of CPUs increased dramatically and the difference in speed between cache (even first level cache) and core grew, CPUs have been pipelined. That means the execution of an instruction happens in stages. First an instruction is decoded, then the parameters are prepared, and finally it is executed. Such a pipeline can be quite long (> 20 stages for Intel's Netburst architecture). A long pipeline means that if the pipeline stalls (i.e., the instruction flow through it is interrupted) it takes a while to get up to speed again. Pipeline stalls happen, for instance, if the location of the next instruction cannot be correctly predicted or if it takes too long to load the next instruction (e.g., when it has to be read from memory).

譯者信息隨着CPU的核心頻率大幅上升,緩存與核心的速度差越拉越大,CPU的處理開始管線化。也就是說,指令的執行分紅若干階段。首先,對指令進行解碼,隨後,準備參數,最後,執行它。這樣的管線能夠很長(例如,Intel的Netburst架構超過了20個階段)。在管線很長的狀況下,一旦發生延誤(即指令流中斷),須要很長時間才能恢復速度。管線延誤發生在這樣的狀況下: 下一條指令未能正確預測,或者裝載下一條指令耗時過長(例如,須要從內存讀取時)。

 

As a result CPU designers spend a lot of time and chip real estate on branch prediction so that pipeline stalls happen as infrequently as possible.

On CISC processors the decoding stage can also take some time. The x86 and x86-64 processors are especially affected. In recent years these processors therefore do not cache the raw byte sequence of the instructions in L1i but instead they cache the decoded instructions. L1i in this case is called the 「trace cache」. Trace caching allows the processor to skip over the first steps of the pipeline in case of a cache hit which is especially good if the pipeline stalled.

譯者信息

爲了解決這個問題,CPU的設計人員們在分支預測上投入大量時間和芯片資產(chip real estate),以下降管線延誤的出現頻率。

在CISC處理器上,指令的解碼階段也須要一些時間。x86及x86-64處理器尤其嚴重。近年來,這些處理器再也不將指令的原始字節序列存入L1i,而是緩存解碼後的版本。這樣的L1i被叫作「追蹤緩存(trace cache)」。追蹤緩存能夠在命中的狀況下讓處理器跳過管線最初的幾個階段,在管線發生延誤時尤爲有用。

 

As said before, the caches from L2 on are unified caches which contain both code and data. Obviously here the code is cached in the byte sequence form and not decoded.

To achieve the best performance there are only a few rules related to the instruction cache:

  1. Generate code which is as small as possible. There are exceptions when software pipelining for the sake of using pipelines requires creating more code or where the overhead of using small code is too high.
  2. Whenever possible, help the processor to make good prefetching decisions. This can be done through code layout or with explicit prefetching.

These rules are usually enforced by the code generation of a compiler. There are a few things the programmer can do and we will talk about them in Section 6.

譯者信息

前面說過,L2以上的緩存是統一緩存,既保存代碼,也保存數據。顯然,這裏保存的代碼是原始字節序列,而不是解碼後的形式。

在提升性能方面,與指令緩存相關的只有不多的幾條準則:

  1. 生成儘可能少的代碼。也有一些例外,如出於管線化的目的須要更多的代碼,或使用小代碼會帶來太高的額外開銷。
  2. 儘可能幫助處理器做出良好的預取決策,能夠經過代碼佈局或顯式預取來實現。

這些準則通常會由編譯器的代碼生成階段強制執行。至於程序員能夠參與的部分,咱們會在第6節介紹。

 

3.4.1 Self Modifying Code

In early computer days memory was a premium. People went to great lengths to reduce the size of the program to make more room for program data. One trick frequently deployed was to change the program itself over time. Such Self Modifying Code (SMC) is occasionally still found, these days mostly for performance reasons or in security exploits.

SMC should in general be avoided. Though it is generally executed correctly there are boundary cases in which it is not and it creates performance problems if not done correctly. Obviously, code which is changed cannot be kept in the trace cache which contains the decoded instructions. But even if the trace cache is not used because the code has not been executed at all (or for some time) the processor might have problems. If an upcoming instruction is changed while it already entered the pipeline the processor has to throw away a lot of work and start all over again. There are even situations where most of the state of the processor has to be tossed away.

譯者信息

3.4.1 自修改的代碼

在計算機的早期歲月裏,內存十分昂貴。人們想方設法,只爲了儘可能壓縮程序容量,給數據多留一些空間。其中一種經常使用的技巧,就是在運行過程當中修改程序自身,稱爲自修改代碼(SMC)。如今,有時候咱們還能看到它,通常是出於提升性能的目的,也有的是爲了攻擊安全漏洞。

通常狀況下,應該避免使用SMC。雖然它一般都能被正確執行,但也存在一些會出錯的邊界狀況,並且若是使用不當還會帶來性能問題。顯然,發生改變的代碼是沒法放入追蹤緩存(追蹤緩存放的是解碼後的指令)的。即便由於代碼徹底沒有執行過(或有段時間沒執行)而沒有用到追蹤緩存,處理器也可能會遇到問題。若是一條已經進入管線的指令被修改了,處理器只能扔掉目前的成果,從新開始。在某些狀況下,甚至須要丟棄處理器的大部分狀態。

 

Finally, since the processor assumes - for simplicity reasons and because it is true in 99.9999999% of all cases - that the code pages are immutable, the L1i implementation does not use the MESI protocol but instead a simplified SI protocol. This means if modifications are detected a lot of pessimistic assumptions have to be made.

It is highly advised to avoid SMC whenever possible. Memory is not such a scarce resource anymore. It is better to write separate functions instead of modifying one function according to specific needs. Maybe one day SMC support can be made optional and we can detect exploit code trying to modify code this way. If SMC absolutely has to be used, the write operations should bypass the cache as to not create problems with data in L1d needed in L1i. See Section 6.1 for more information on these instructions.

譯者信息

最後,因爲處理器認爲代碼頁是不可修改的(這是出於簡單化的考慮,並且在99.9999999%狀況下確實是正確的),L1i用到並非MESI協議,而是一種簡化後的SI協議。這樣一來,若是萬一檢測到修改的狀況,就須要做出大量悲觀的假設。

所以,對於SMC,強烈建議能不用就不用。如今內存已經再也不是一種那麼稀缺的資源了。最好是寫多個函數,而不要根據須要把一個函數改來改去。也許有一天能夠把SMC變成可選項,咱們就能經過這種方式檢測入侵代碼。若是必定要用SMC,應該讓寫操做越過緩存,以避免因爲L1i須要L1d裏的數據而產生問題。更多細節,請參見6.1節。
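
On x86, the cache-bypassing writes mentioned above are typically issued with non-temporal SSE stores; the document discusses these instructions in Section 6.1. The sketch below is only an illustration of such a store, writing to an ordinary data buffer; making code pages writable in the first place is a separate matter.

  #include <emmintrin.h>
  #include <stddef.h>

  /* Copy `bytes` bytes using non-temporal stores so the written data
     bypasses the caches.  `dst` must be 16-byte aligned and `bytes` a
     multiple of 16.                                                   */
  void bypass_write(void *dst, const void *src, size_t bytes)
  {
      __m128i       *d = dst;
      const __m128i *s = src;

      for (size_t i = 0; i < bytes / 16; ++i) {
          __m128i v = _mm_loadu_si128(s + i);
          _mm_stream_si128(d + i, v);   /* store does not go through the cache */
      }
      _mm_sfence();                     /* make the streamed stores visible    */
  }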

 

Normally on Linux it is easy to recognize programs which contain SMC. All program code is write-protected when built with the regular toolchain. The programmer has to perform significant magic at link time to create an executable where the code pages are writable. When this happens, modern Intel x86 and x86-64 processors have dedicated performance counters which count uses of self-modifying code. With the help of these counters it is quite easily possible to recognize programs with SMC even if the program will succeed due to relaxed permissions.

3.5 Cache Miss Factors

We have already seen that when memory accesses miss the caches, the costs skyrocket. Sometimes this is not avoidable and it is important to understand the actual costs and what can be done to mitigate the problem.

譯者信息

在Linux上,判斷程序是否包含SMC是很容易的。利用正常工具鏈(toolchain)構建的程序代碼都是寫保護(write-protected)的。程序員須要在連接時施展某些關鍵的魔術才能生成可寫的代碼頁。出現這種狀況時,現代的Intel x86和x86-64處理器都有統計SMC使用狀況的專用計數器。經過這些計數器,咱們能夠很容易識別出包含SMC的程序,即便它由於權限放寬而可以正常運行。

3.5 緩存未命中的因素

咱們已經看過內存訪問沒有命中緩存時,那陡然猛漲的高昂代價。可是有時候,這種狀況又是沒法避免的,所以咱們須要對真正的代價有所認識,並學習如何緩解這種局面。

 

 

3.5.1 Cache and Memory Bandwidth

To get a better understanding of the capabilities of the processors we measure the bandwidth available in optimal circumstances. This measurement is especially interesting since different processor versions vary widely. This is why this section is filled with the data of several different machines. The program to measure performance uses the SSE instructions of the x86 and x86-64 processors to load or store 16 bytes at once. The working set is increased from 1kB to 512MB just as in our other tests and it is measured how many bytes per cycle can be loaded or stored.

Figure 3.24: Pentium 4 Bandwidth

Figure 3.24 shows the performance on a 64-bit Intel Netburst processor. For working set sizes which fit into L1d the processor is able to read the full 16 bytes per cycle, i.e., one load instruction is performed per cycle (the movaps instruction moves 16 bytes at once). The test does not do anything with the read data, we test only the read instructions themselves. As soon as the L1d is not sufficient anymore the performance goes down dramatically to less than 6 bytes per cycle. The step at 2^18 bytes is due to the exhaustion of the DTLB cache which means additional work for each new page. Since the reading is sequential prefetching can predict the accesses perfectly and the FSB can stream the memory content at about 5.3 bytes per cycle for all sizes of the working set. The prefetched data is not propagated into L1d, though. These are of course numbers which will never be achievable in a real program. Think of them as practical limits.

譯者信息

3.5.1 緩存與內存帶寬 

爲了更好地理解處理器的能力,咱們測量了各類理想環境下可以達到的帶寬值。因爲不一樣處理器的版本差異很大,因此這個測試比較有趣,也由於如此,這一節都快被測試數據灌滿了。咱們使用了x86和x86-64處理器的SSE指令來裝載和存儲數據,每次16字節。工做集則與其它測試同樣,從1kB增長到512MB,測量的具體對象是每一個週期所處理的字節數。

 
圖3.24: P4的帶寬

圖3.24展示了一顆64位Intel Netburst處理器的性能圖表。當工做集可以徹底放入L1d時,處理器的每一個週期能夠讀取完整的16字節數據,即每一個週期執行一條裝載指令(movaps指令,每次移動16字節的數據)。測試程序並不對數據進行任何處理,只是測試讀取指令自己。當工做集增大,沒法再徹底放入L1d時,性能開始急劇降低,跌至每週期不足6字節。在2^18字節工做集處出現的臺階是因爲DTLB緩存耗盡,所以須要對每一個新頁施加額外處理。因爲這裏的讀取是按順序的,預取機制能夠完美地工做,而FSB能以5.3字節/週期的速度傳輸內容。但預取的數據並不進入L1d。固然,真實世界的程序永遠沒法達到以上的數字,但咱們能夠將它們看做一系列實際上的極限值。
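
The inner loop of such a read test might look like the sketch below. This is only an illustration in the spirit of the description above (one aligned 16-byte SSE load per iteration), not the author's actual benchmark, which also has to take care of timing and of defeating compiler optimizations.

  #include <emmintrin.h>
  #include <stddef.h>

  /* Read `size` bytes (a multiple of 16) from a 16-byte aligned buffer,
     one 16-byte SSE load per iteration.                                */
  int read_working_set(const void *buf, size_t size)
  {
      const __m128i *p = buf;
      __m128i acc = _mm_setzero_si128();

      for (size_t i = 0; i < size / 16; ++i)
          acc = _mm_or_si128(acc, _mm_load_si128(p + i));

      /* Fold to a scalar so the loads cannot be optimized away. */
      return _mm_cvtsi128_si32(acc);
  }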

 

What is more astonishing than the read performance is the write and copy performance. The write performance, even for small working set sizes, does not ever rise above 4 bytes per cycle. This indicates that, in these Netburst processors, Intel elected to use a Write-Through mode for L1d where the performance is obviously limited by the L2 speed. This also means that the performance of the copy test, which copies from one memory region into a second, non-overlapping memory region, is not significantly worse. The necessary read operations are so much faster and can partially overlap with the write operations. The most noteworthy detail of the write and copy measurements is the low performance once the L2 cache is not sufficient anymore. The performance drops to 0.5 bytes per cycle! That means write operations are by a factor of ten slower than the read operations. This means optimizing those operations is even more important for the performance of the program.

譯者信息更使人驚訝的是寫操做和複製操做的性能。即便是在很小的工做集下,寫操做也始終沒法達到4字節/週期的速度。這意味着,Intel爲Netburst處理器的L1d選擇了寫通(write-through)模式,因此寫入性能受到L2速度的限制。同時,這也意味着,複製測試的性能不會比寫入測試差太多(複製測試是將某塊內存的數據拷貝到另外一塊不重疊的內存區),由於讀操做很快,能夠與寫操做實現部分重疊。最值得關注的地方是,兩個操做在工做集沒法徹底放入L2後出現了嚴重的性能滑坡,降到了0.5字節/週期!比讀操做慢了10倍!顯然,若是要提升程序性能,優化這兩個操做更爲重要。

 

In Figure 3.25 we see the results on the same processor but with two threads running, one pinned to each of the two hyper-threads of the processor.

Figure 3.25: P4 Bandwidth with 2 Hyper-Threads

The graph is shown at the same scale as the previous one to illustrate the differences and the curves are a bit jittery simply because of the problem of measuring two concurrent threads. The results are as expected. Since the hyper-threads share all the resources except the registers each thread has only half the cache and bandwidth available. That means even though each thread has to wait a lot and could award the other thread with execution time this does not make any difference since the other thread also has to wait for the memory. This truly shows the worst possible use of hyper-threads.
譯者信息

再來看圖3.25,它來自同一顆處理器,只是運行雙線程,每一個線程分別運行在處理器的一個超線程上。

 
圖3.25: P4開啓兩個超線程時的帶寬表現

圖3.25採用了與圖3.24相同的刻度,以方便比較二者的差別。圖3.25中的曲線抖動更多,是因爲採用雙線程的緣故。結果正如咱們預期,因爲超線程共享着幾乎全部資源(僅除寄存器外),因此每一個超線程只能獲得一半的緩存和帶寬。因此,即便每一個線程都要花上許多時間等待內存,從而把執行時間讓給另外一個線程,也是無濟於事——由於另外一個線程也一樣須要等待。這裏偏偏展現了使用超線程時可能出現的最壞狀況。

Figure 3.26: Core 2 Bandwidth

Compared to Figures 3.24 and 3.25 the results in Figures 3.26 and 3.27 look quite different for an Intel Core 2 processor. This is a dual-core processor with shared L2 which is four times as big as the L2 on the P4 machine. This only explains the delayed drop-off of the write and copy performance, though.

There are much bigger differences. The read performance throughout the working set range hovers around the optimal 16 bytes per cycle. The drop-off in the read performance after 2^20 bytes is again due to the working set being too big for the DTLB. Achieving these high numbers means the processor is not only able to prefetch the data and transport the data in time. It also means the data is prefetched into L1d.

 

譯者信息
 
圖3.26: Core 2的帶寬表現

再來看Core 2處理器的狀況。看看圖3.26和圖3.27,再對比下P4的圖3.24和3.25,能夠看出不小的差別。Core 2是一顆雙核處理器,有着共享的L2,容量是P4機器L2的4倍。但更大的L2只能解釋寫/複製操做的性能降低出現較晚的現象。

固然還有更大的不一樣。能夠看到,讀操做的性能在整個工做集範圍內一直穩定在16字節/週期左右,在2^20字節處的降低一樣是因爲DTLB的耗盡引發。可以達到這麼高的數字,不但代表處理器可以預取數據,而且按時完成傳輸,並且還意味着,預取的數據是被裝入L1d的。

 

The write and copy performance is dramatically different, too. The processor does not have a Write-Through policy; written data is stored in L1d and only evicted when necessary. This allows for write speeds close to the optimal 16 bytes per cycle. Once L1d is not sufficient anymore the performance drops significantly. As with the Netburst processor, the write performance is significantly lower. Due to the high read performance the difference is even higher here. In fact, when even the L2 is not sufficient anymore the speed difference increases to a factor of 20! This does not mean the Core 2 processors perform poorly. To the contrary, their performance is always better than the Netburst core's.

Figure 3.27: Core 2 Bandwidth with 2 Threads

In Figure 3.27, the test runs two threads, one on each of the two cores of the Core 2 processor. Both threads access the same memory, not necessarily perfectly in sync, though. The results for the read performance are not different from the single-threaded case. A few more jitters are visible which is to be expected in any multi-threaded test case.

譯者信息

寫/複製操做的性能與P4相比,也有很大差別。處理器沒有采用寫通策略,寫入的數據留在L1d中,只在必要時才逐出。這使得寫操做的速度能夠逼近16字節/週期。一旦工做集超過L1d,性能即飛速降低。因爲Core 2讀操做的性能很是好,因此二者的差值顯得特別大。當工做集超過L2時,二者的差值甚至超過20倍!但這並不表示Core 2的性能很差,相反,Core 2永遠都比Netburst強。

 
圖3.27: Core 2運行雙線程時的帶寬表現

在圖3.27中,啓動雙線程,各自運行在Core 2的一個核心上。它們訪問相同的內存,但不須要完美同步。從結果上看,讀操做的性能與單線程並沒有區別,只是多了一些多線程狀況下常見的抖動。

 

The interesting point is the write and copy performance for working set sizes which would fit into L1d. As can be seen in the figure, the performance is the same as if the data had to be read from the main memory. Both threads compete for the same memory location and RFO messages for the cache lines have to be sent. The problematic point is that these requests are not handled at the speed of the L2 cache, even though both cores share the cache. Once the L1d cache is not sufficient anymore modified entries are flushed from each core's L1d into the shared L2. At that point the performance increases significantly since now the L1d misses are satisfied by the L2 cache and RFO messages are only needed when the data has not yet been flushed. This is why we see a 50% reduction in speed for these sizes of the working set. The asymptotic behavior is as expected: since both cores share the same FSB each core gets half the FSB bandwidth which means for large working sets each thread's performance is about half that of the single threaded case.

譯者信息有趣的地方來了——當工做集小於L1d時,寫操做與複製操做的性能不好,就好像數據須要從內存讀取同樣。兩個線程彼此競爭着同一個內存位置,因而不得不頻頻發送RFO消息。問題的根源在於,雖然兩個核心共享着L2,但沒法以L2的速度處理RFO請求。而當工做集超過L1d後,性能出現了迅猛提高。這是由於,因爲L1d容量不足,因而將被修改的條目刷新到共享的L2。因爲L1d的未命中能夠由L2知足,只有那些還沒有刷新的數據才須要RFO,因此出現了這樣的現象。這也是這些工做集狀況下速度降低一半的緣由。這種漸進式的行爲也與咱們期待的一致: 因爲每一個核心共享着同一條FSB,每一個核心只能獲得一半的FSB帶寬,所以對於較大的工做集來講,每一個線程的性能大體至關於單線程時的一半。

 

Because there are significant differences between the processor versions of one vendor it is certainly worthwhile looking at the performance of other vendors' processors, too. Figure 3.28 shows the performance of an AMD family 10h Opteron processor. This processor has 64kB L1d, 512kB L2, and 2MB of L3. The L3 cache is shared between all cores of the processor. The results of the performance test can be seen in Figure 3.28.

Figure 3.28: AMD Family 10h Opteron Bandwidth

The first detail one notices about the numbers is that the processor is capable of handling two instructions per cycle if the L1d cache is sufficient. The read performance exceeds 32 bytes per cycle and even the write performance is, with 18.7 bytes per cycle, high. The read curve flattens quickly, though, and is, with 2.3 bytes per cycle, pretty low. The processor for this test does not prefetch any data, at least not efficiently.

譯者信息

因爲同一個廠商的不一樣處理器之間都存在着巨大差別,咱們沒有理由不去研究一下其它廠商處理器的性能。圖3.28展現了AMD家族10h Opteron處理器的性能。這顆處理器有64kB的L1d、512kB的L2和2MB的L3,其中L3緩存由全部核心所共享。

 
圖3.28: AMD家族10h Opteron的帶寬表現

你們首先應該會注意到,在L1d緩存足夠的狀況下,這個處理器每一個週期能處理兩條指令。讀操做的性能超過了32字節/週期,寫操做也達到了18.7字節/週期。可是,不久,讀操做的曲線就急速降低,跌到2.3字節/週期,很是差。處理器在這個測試中並無預取數據,或者說,沒有有效地預取數據。

 

The write curve on the other hand performs according to the sizes of the various caches. The peak performance is achieved for the full size of the L1d, going down to 6 bytes per cycle for L2, to 2.8 bytes per cycle for L3, and finally .5 bytes per cycle if not even L3 can hold all the data. The performance for the L1d cache exceeds that of the (older) Core 2 processor, the L2 access is equally fast (with the Core 2 having a larger cache), and the L3 and main memory access is slower.

The copy performance cannot be better than either the read or write performance. This is why we see the curve initially dominated by the read performance and later by the write performance.

譯者信息

另外一方面,寫操做的曲線隨各級緩存的容量逐級變化。峯值性能出如今工做集徹底放入L1d時,隨後在L2階段降到6字節/週期,在L3階段降到2.8字節/週期,最後,在連L3也容納不下全部數據時,降到0.5字節/週期。它在L1d階段超過了(較老的)Core 2,在L2階段二者至關(Core 2的L2更大一些),而在L3及主存階段則比Core 2慢。

複製的性能既沒法超越讀操做的性能,也沒法超越寫操做的性能。所以,它的曲線先是被讀性能壓制,隨後又被寫性能壓制。

 

The multi-thread performance of the Opteron processor is shown in Figure 3.29.

Figure 3.29: AMD Fam 10h Bandwidth with 2 Threads

The read performance is largely unaffected. Each thread's L1d and L2 works as before and the L3 cache is in this case not prefetched very well either. The two threads do not unduly stress the L3 for their purpose. The big problem in this test is the write performance. All data the threads share has to go through the L3 cache. This sharing seems to be quite inefficient since even if the L3 cache size is sufficient to hold the entire working set the cost is significantly higher than an L3 access. Comparing this graph with Figure 3.27 we see that the two threads of the Core 2 processor operate at the speed of the shared L2 cache for the appropriate range of working set sizes. This level of performance is achieved for the Opteron processor only for a very small range of the working set sizes and even here it approaches only the speed of the L3 which is slower than the Core 2's L2.
譯者信息

圖3.29顯示的是Opteron處理器在多線程時的性能表現。

 
圖3.29: AMD Fam 10h在雙線程時的帶寬表現

讀操做的性能沒有受到很大的影響。每一個線程的L1d和L2表現與單線程下相仿,L3的預取也依然表現不佳。兩個線程並無過度爭搶L3。問題比較大的是寫操做的性能。兩個線程共享的全部數據都須要通過L3,而這種共享看起來效率不好。即便L3的容量足以容納整個工做集,所須要的開銷仍然遠高於一次L3訪問。再來看圖3.27,能夠發現,在必定的工做集範圍內,Core 2處理器能以共享L2緩存的速度進行處理。而Opteron處理器只能在很小的一個範圍內實現類似的性能,並且,它僅僅只能達到L3的速度,沒法與Core 2的L2相比。

3.5.2 Critical Word Load

Memory is transferred from the main memory into the caches in blocks which are smaller than the cache line size. Today 64 bits are transferred at once and the cache line size is 64 or 128 bytes. This means 8 or 16 transfers per cache line are needed.

The DRAM chips can transfer those 64-bit blocks in burst mode. This can fill the cache line without any further commands from the memory controller and the possibly associated delays. If the processor prefetches cache lines this is probably the best way to operate.

譯者信息

3.5.2 關鍵字加載

內存以比緩存線還小的塊從主存向緩存傳送。現在每次可一次性傳送64位,而緩存線的大小爲64或128字節。這意味着每條緩存線須要8或16次傳送。

DRAM芯片能夠以突發模式(burst mode)傳送這些64位的塊。這樣不須要內存控制器發出進一步的指令(以及可能伴隨的延遲),就能夠將緩存線填滿。若是處理器是在預取緩存線,這多是最好的工做方式。

 

 

If a program's cache access of the data or instruction caches misses (that means, it is a compulsory cache miss, because the data is used for the first time, or a capacity cache miss, because the limited cache size requires eviction of the cache line) the situation is different. The word inside the cache line which is required for the program to continue might not be the first word in the cache line. Even in burst mode and with double data rate transfer the individual 64-bit blocks arrive at noticeably different times. Each block arrives 4 CPU cycles or more later than the previous one. If the word the program needs to continue is the eighth of the cache line, the program has to wait an additional 30 cycles or more after the first word arrives.

Things do not necessarily have to be like this. The memory controller is free to request the words of the cache line in a different order. The processor can communicate which word the program is waiting on, the critical word, and the memory controller can request this word first. Once the word arrives the program can continue while the rest of the cache line arrives and the cache is not yet in a consistent state. This technique is called Critical Word First & Early Restart.

譯者信息

若是程序在訪問數據或指令緩存時沒有命中(多是強制性未命中,由於數據是第一次被使用;也多是容量性未命中,由於有限的緩存容量致使緩存線被逐出),狀況就不同了。程序繼續運行所須要的字,並不必定是緩存線中的第一個字。即便是在突發模式和雙倍數據傳輸率下,各個64位數據塊到達的時間也有明顯的前後:每一個數據塊都比前一塊晚4個或更多的CPU週期。若是程序須要的是緩存線中的第8個字,那麼在首字抵達後,它還要額外等待30個以上的週期。

固然,這樣的等待並非必需的。事實上,內存控制器能夠按不一樣順序去請求緩存線中的字。當處理器告訴它,程序須要緩存中具體某個字,即「關鍵字(critical word)」時,內存控制器就會先請求這個字。一旦請求的字抵達,雖然緩存線的剩餘部分還在傳輸中,緩存的狀態尚未達成一致,但程序已經能夠繼續運行。這種技術叫作關鍵字優先及較早重啓(Critical Word First & Early Restart)。

 

Processors nowadays implement this technique but there are situations when that is not possible. If the processor prefetches data the critical word is not known. Should the processor request the cache line during the time the prefetch operation is in flight it will have to wait until the critical word arrives without being able to influence the order.

Figure 3.30: Critical Word at End of Cache Line

Even with these optimizations in place the position of the critical word on a cache line matters. Figure 3.30 shows the Follow test for sequential and random access. Shown is the slowdown of running the test with the pointer which is chased in the first word versus the case when the pointer is in the last word. The element size is 64 bytes, corresponding the cache line size. The numbers are quite noisy but it can be seen that, as soon as the L2 is not sufficient to hold the working set size, the performance of the case where the critical word is at the end is about 0.7% slower. The sequential access appears to be affected a bit more. This would be consistent with the aforementioned problem when prefetching the next cache line.

譯者信息

如今的處理器都已經實現了這一技術,但有時沒法運用。好比,預取操做的時候,並不知道哪一個是關鍵字。若是在預取的中途請求某條緩存線,處理器只能等待,並不能更改請求的順序。

 
圖3.30: 關鍵字位於緩存線尾時的表現

即便有了這些優化,關鍵字在緩存線中的位置仍然會影響性能。圖3.30展示的是順序訪問與隨機訪問下Follow測試的結果,比較的是被追逐的指針位於元素第一個字與位於最後一個字這兩種狀況的性能差別。元素大小爲64字節,等於緩存線的長度。圖中的噪聲比較多,但仍然能夠看出,當工做集超過L2後,關鍵字位於線尾狀況下的性能要比線首狀況下低0.7%左右。順序訪問受到的影響更大一些,這與前面提到的預取下一條緩存線時可能遇到的問題是相符的。
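
The two element layouts being compared might look like the sketch below. The declarations are assumptions for illustration (a 64-byte element on a 64-bit machine with the chased pointer either first or last), not the author's exact test code.

  struct elem_first {                /* critical word (the pointer) is   */
      struct elem_first *next;       /* the first word of the cache line */
      long pad[7];
  };

  struct elem_last {
      long pad[7];
      struct elem_last *next;        /* critical word is the last word   */
  };

  /* Pointer chase over a cyclic list of `elem_first`; an analogous loop
     is used for `elem_last`.  Each step touches a new cache line.      */
  static void *chase(struct elem_first *p, long steps)
  {
      while (steps-- > 0)
          p = p->next;
      return p;                      /* keep the loop from being removed */
  }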

 

3.5.3 Cache Placement

Where the caches are placed in relationship to the hyper-threads, cores, and processors is not under control of the programmer. But programmers can determine where the threads are executed and then it becomes important how the caches relate to the used CPUs.

Here we will not go into details of when to select what cores to run the threads. We will only describe architecture details which the programmer has to take into account when setting the affinity of the threads.

譯者信息

3.5.3 緩存設定

緩存相對於超線程、核心和處理器的位置,不在程序員的控制範圍以內。可是程序員能夠決定線程在哪裏執行,因而,緩存與所使用的CPU之間的關係就變得很是重要。

這裏咱們不會深刻探討何時該選擇哪些核心來運行線程的細節,只描述程序員在設置線程親和性(affinity)時必須考慮的體系結構細節。
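
On Linux, the affinity mentioned above can be set per thread. A minimal sketch (the core number is an arbitrary example):

  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>

  /* Pin the calling thread to one core so it stays close to that
     core's caches.  Returns 0 on success.                          */
  static int pin_to_core(int core)
  {
      cpu_set_t set;

      CPU_ZERO(&set);
      CPU_SET(core, &set);
      return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
  }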

 

Hyper-threads, by definition share everything but the register set. This includes the L1 caches. There is not much more to say here. The fun starts with the individual cores of a processor. Each core has at least its own L1 caches. Aside from this there are today not many details in common:

  • Early multi-core processors had separate L2 caches and no higher caches.
  • Later Intel models had shared L2 caches for dual-core processors. For quad-core processors we have to deal with separate L2 caches for each pair of two cores. There are no higher level caches.
  • AMD's family 10h processors have separate L2 caches and a unified L3 cache.

 

譯者信息

超線程,按照定義,共享除寄存器組之外的一切,也包括L1緩存。這裏沒有什麼能夠多說的。樂趣來自處理器中各個獨立的核心。每一個核心都至少擁有本身的L1緩存。除此以外,現今的處理器之間的共同點並很少:

  • 早期多核心處理器有獨立的 L2 緩存且沒有更高層級的緩存。
  • 以後英特爾的雙核處理器型號擁有共享的L2緩存。對於四核處理器,則是每兩個核心共享一個L2緩存,且沒有更高層級的緩存。
  • AMD 家族的 10h 處理器有獨立的 L2 緩存以及一個統一的L3 緩存。
A lot has been written in the propaganda material of the processor vendors about the advantage of their respective models. Separate L2 caches have an advantage if the working sets handled by the cores do not overlap. This works well for single-threaded programs. Since this is still often the reality today this approach does not perform too badly. But there is always some overlap. The caches all contain the most actively used parts of the common runtime libraries which means some cache space is wasted.

 

Completely sharing all caches beside L1 as Intel's dual-core processors do can have a big advantage. If the working set of the threads working on the two cores overlaps significantly the total available cache memory is increased and working sets can be bigger without performance degradation. If the working sets do not overlap Intel's Advanced Smart Cache management is supposed to prevent any one core from monopolizing the entire cache.

譯者信息

關於各類處理器模型的優勢,已經在它們各自的宣傳手冊裏寫得夠多了。在每一個核心的工做集互不重疊的狀況下,獨立的L2擁有必定的優點,單線程的程序能夠表現優良。考慮到目前實際環境中仍然存在大量相似的狀況,這種方法的表現並不會太差。不過,無論怎樣,咱們總會遇到工做集重疊的狀況。若是每一個緩存都保存着某些通用運行庫的經常使用部分,那麼很顯然是一種浪費。

若是像Intel的雙核處理器那樣,共享除L1外的全部緩存,則會有一個很大的優勢。若是兩個核心的工做集重疊的部分較多,那麼綜合起來的可用緩存容量會變大,從而容許容納更大的工做集而不致使性能的降低。若是二者的工做集並不重疊,那麼則是由Intel的高級智能緩存管理(Advanced Smart Cache management)發揮功用,防止其中一個核心壟斷整個緩存。

 

If both cores use about half the cache for their respective working sets there is some friction, though. The cache constantly has to weigh the two cores' cache use and the evictions performed as part of this rebalancing might be chosen poorly. To see the problems we look at the results of yet another test program.

Figure 3.31: Bandwidth with two Processes

The test program has one process constantly reading or writing, using SSE instructions, a 2MB block of memory. 2MB was chosen because this is half the size of the L2 cache of this Core 2 processor. The process is pinned to one core while a second process is pinned to the other core. This second process reads and writes a memory region of variable size. The graph shows the number of bytes per cycle which are read or written. Four different graphs are shown, one for each combination of the processes reading and writing. The read/write graph is for the background process, which always uses a 2MB working set to write, and the measured process with variable working set to read.

譯者信息

即便每一個核心只使用一半的緩存,也會有一些摩擦。緩存須要不斷衡量每一個核心的用量,在進行逐出操做時可能會做出一些比較差的決定。咱們來看另外一個測試程序的結果。

 
圖3.31: 兩個進程的帶寬表現

此次,測試程序啓動兩個進程:第一個進程不斷用SSE指令讀/寫一個2MB的內存數據塊。選擇2MB,是由於它正好是這顆Core 2處理器L2緩存的一半。這個進程被固定在一個核心上,第二個進程則被固定在另外一個核心上,它讀/寫一個大小可變的內存區域。圖中顯示的是每一個週期讀/寫的字節數,共有4條曲線,分別對應兩個進程讀寫操做的不一樣組合。例如,標記爲讀/寫(read/write)的曲線表明的是後臺進程進行寫操做(固定2MB工做集),而被測量進程進行讀操做(工做集大小可變)。

 

The interesting part of the graph is the part between 2^20 and 2^23 bytes. If the L2 cache of the two cores were completely separate we could expect that the performance of all four tests would drop between 2^21 and 2^22 bytes, that means, once the L2 cache is exhausted. As we can see in Figure 3.31 this is not the case. For the cases where the background process is writing this is most visible. The performance starts to deteriorate before the working set size reaches 1MB. The two processes do not share memory and therefore the processes do not cause RFO messages to be generated. These are pure cache eviction problems. The smart cache handling has its problems with the effect that the experienced cache size per core is closer to 1MB than the 2MB per core which are available. One can only hope that, if caches shared between cores remain a feature of upcoming processors, the algorithm used for the smart cache handling will be fixed.

譯者信息圖中最有趣的是2^20到2^23字節之間的部分。若是兩個核心的L2是徹底獨立的,那麼全部4種狀況下的性能降低均應發生在2^21到2^22字節之間,也就是L2緩存耗盡的時候。但從圖上來看,實際狀況並非這樣,特別是後臺進程進行寫操做時尤其明顯。當工做集達到1MB(2^20字節)時,性能就開始惡化。兩個進程並無共享內存,所以不會產生RFO消息,這徹底是緩存逐出操做引發的問題。這種智能緩存處理機制有一個問題:每一個核心實際能用到的緩存更接近1MB,而不是每核心理論上可用的2MB。若是核心間共享緩存仍然是將來處理器的特性,咱們惟有但願智能緩存處理所用的算法會被修正。

 

Having a quad-core processor with two L2 caches was just a stop-gap solution before higher-level caches could be introduced. This design provides no significant performance advantage over separate sockets and dual-core processors. The two cores communicate via the same bus which is, at the outside, visible as the FSB. There is no special fast-track data exchange.

The future of cache design for multi-core processors will lie in more layers. AMD's 10h family of processors make the start. Whether we will continue to see lower level caches be shared by a subset of the cores of a processor remains to be seen. The extra levels of cache are necessary since the high-speed and frequently used caches cannot be shared among many cores. The performance would be impacted. It would also require very large caches with high associativity. Both numbers, the cache size and the associativity, must scale with the number of cores sharing the cache. Using a large L3 cache and reasonably-sized L2 caches is a reasonable trade-off. The L3 cache is slower but it is ideally not as frequently used as the L2 cache.

For programmers all these different designs mean complexity when making scheduling decisions. One has to know the workloads and the details of the machine architecture to achieve the best performance. Fortunately we have support to determine the machine architecture. The interfaces will be introduced in later sections.

譯者信息

推出擁有雙L2緩存的4核處理器僅僅只是一種臨時措施,是開發更高級緩存以前的替代方案。與獨立插槽及雙核處理器相比,這種設計並無帶來多少性能提高。兩個核心是經過同一條總線(被外界看做FSB)進行通訊,並無什麼特別快的數據交換通道。

將來,針對多核處理器的緩存將會包含更多層次。AMD的10h家族是一個開始,至於會不會有更低級共享緩存的出現,還須要咱們拭目以待。咱們有必要引入更多級別的緩存,由於頻繁使用的高速緩存不可能被許多核心共用,不然會對性能形成很大的影響。咱們也須要更大的高關聯性緩存,它們的數量、容量和關聯性都應該隨着共享核心數的增加而增加。巨大的L3和適度的L2應該是一種比較合理的選擇。L3雖然速度較慢,但也較少使用。

對於程序員來講,不一樣的緩存設計就意味着調度決策時的複雜性。爲了達到最高的性能,咱們必須掌握工做負載的狀況,必須瞭解機器架構的細節。好在咱們在判斷機器架構時仍是有一些支援力量的,咱們會在後面的章節介紹這些接口。

 

3.5.4 FSB Influence

The FSB plays a central role in the performance of the machine. Cache content can only be stored and loaded as quickly as the connection to the memory allows. We can show how much so by running a program on two machines which only differ in the speed of their memory modules. Figure 3.32 shows the results of the Addnext0 test (adding the content of the next element's pad[0] element to the own pad[0] element) for NPAD=7 on a 64-bit machine. Both machines have Intel Core 2 processors, the first uses 667MHz DDR2 modules, the second 800MHz modules (a 20% increase).

Figure 3.32: Influence of FSB Speed

The numbers show that, when the FSB is really stressed for large working set sizes, we indeed see a large benefit. The maximum performance increase measured in this test is 18.2%, close to the theoretical maximum. What this shows is that a faster FSB indeed can pay off big time. It is not critical when the working set fits into the caches (and these processors have a 4MB L2). It must be kept in mind that we are measuring one program here. The working set of a system comprises the memory needed by all concurrently running processes. This way it is easily possible to exceed 4MB memory or more with much smaller programs.

譯者信息

3.5.4 FSB的影響

FSB在性能中扮演了核心角色。緩存數據的存取速度受制於內存通道的速度。咱們作一個測試,在兩臺機器上分別跑同一個程序,這兩臺機器除了內存模塊的速度有所差別,其它徹底相同。圖3.32展現了Addnext0測試(將下一個元素的pad[0]加到當前元素的pad[0]上)在這兩臺機器上的結果(NPAD=7,64位機器)。兩臺機器都採用Core 2處理器,一臺使用667MHz的DDR2內存,另外一臺使用800MHz的DDR2內存(比前一臺增加20%)。

 
圖3.32: FSB速度的影響

圖上的數字代表,當工做集大到對FSB形成壓力的程度時,高速FSB確實會帶來巨大的好處。在咱們的測試中,性能的最大提高達到了18.2%,接近理論上的極限。而當工做集比較小,能夠徹底放入緩存時(這些處理器有4MB的L2),FSB的做用並不大。必須記住,這裏咱們只測量了一個程序。實際系統的工做集包括全部同時運行的進程所需的內存,因此即便是小得多的程序,也很容易就會超過4MB的緩存容量。
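
For reference, the Addnext0 access pattern described above can be sketched as follows. The list-element declaration is an assumption based on the description (NPAD=7 on a 64-bit machine gives 64-byte elements); it is not necessarily the author's exact code.

  #define NPAD 7

  struct l {
      struct l *n;          /* pointer to the next element           */
      long int  pad[NPAD];  /* payload; pad[0] is the field accessed */
  };

  /* Addnext0: add the next element's pad[0] to the current element's
     pad[0], then advance.  The list is assumed to be circular.       */
  static void addnext0(struct l *head, long rounds)
  {
      while (rounds-- > 0) {
          struct l *p = head;
          do {
              p->pad[0] += p->n->pad[0];
              p = p->n;
          } while (p != head);
      }
  }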

 

Today some of Intel's processors support FSB speeds up to 1,333MHz which would mean another 60% increase. The future is going to see even higher speeds. If speed is important and the working set sizes are larger, fast RAM and high FSB speeds are certainly worth the money. One has to be careful, though, since even though the processor might support higher FSB speeds the motherboard/Northbridge might not. It is critical to check the specifications.

譯者信息現在,Intel的一些處理器支持的前端總線(FSB)速度高達1,333MHz,這意味着還能再提高60%的速度。將來還會出現更高的速度。若是速度很重要,並且工做集又比較大,那麼快速的RAM和高FSB速度確定是值得投資的。不過必須當心,由於即便處理器支持更高的FSB速度,主板/北橋芯片也可能不支持。檢查規格說明相當重要。

 

Part 3 

Editor's note: this is the third installment of Ulrich Drepper's "What every programmer should know about memory" document; this section talks about virtual memory, and TLB performance in particular. Those who have not read part 1 and part 2 may wish to do so now. As always, please send typo reports and the like to lwn@lwn.net rather than posting them as comments here.]

 

4 Virtual Memory

The virtual memory subsystem of a processor implements the virtual address spaces provided to each process. This makes each process think it is alone in the system. The list of advantages of virtual memory are described in detail elsewhere so they will not be repeated here. Instead this section concentrates on the actual implementation details of the virtual memory subsystem and the associated costs.

譯者信息

[編輯注:本文是Ulrich Drepper的「每一個程序員應該瞭解的內存方面的知識」文章的第三部分;這一部分談論了虛擬內存,特別是TLB性能。沒有閱讀第1部分第2部分的人可能如今就想讀一讀了。和往常同樣,請將排字錯誤報告之類發送到lwn@lwn.net,而不要發送到這裏的評論]

4 虛擬內存

處理器的虛擬內存子系統爲每一個進程實現了虛擬地址空間。這讓每一個進程認爲它在系統中是獨立的。虛擬內存的種種優勢在別的地方已經有很詳細的描述,這裏就不重複了。本節集中討論虛擬內存子系統實際的實現細節,以及相關的開銷。

 

A virtual address space is implemented by the Memory Management Unit (MMU) of the CPU. The OS has to fill out the page table data structures, but most CPUs do the rest of the work themselves. This is actually a pretty complicated mechanism; the best way to understand it is to introduce the data structures used to describe the virtual address space.

The input to the address translation performed by the MMU is a virtual address. There are usually few—if any—restrictions on its value. Virtual addresses are 32-bit values on 32-bit systems, and 64-bit values on 64-bit systems. On some systems, for instance x86 and x86-64, the addresses used actually involve another level of indirection: these architectures use segments which simply cause an offset to be added to every logical address. We can ignore this part of address generation, it is trivial and not something that programmers have to care about with respect to performance of memory handling. {Segment limits on x86 are performance-relevant but that is another story.}

 

譯者信息

虛擬地址空間是由CPU的內存管理單元(MMU)實現的。OS必須填充頁表數據結構,但大多數CPU本身完成了剩下的工做。這事實上是一個至關複雜的機制;理解它的最好方法,是先介紹用來描述虛擬地址空間的數據結構。

由MMU進行地址翻譯的輸入是虛擬地址。一般對它的取值不多有限制(若是有的話)。虛擬地址在32位系統中是32位的數值,在64位系統中是64位的數值。在一些系統上,例如x86和x86-64,所使用的地址實際上還包含另外一層間接:這些架構使用分段,分段只是簡單地給每一個邏輯地址加上一個位移。咱們能夠忽略這一部分的地址生成,它很瑣碎,也不是程序員在內存處理性能方面須要關心的東西。{x86的段限制是與性能相關的,但那是另外一回事了}

 

4.1 Simplest Address Translation

The interesting part is the translation of the virtual address to a physical address. The MMU can remap addresses on a page-by-page basis. Just as when addressing cache lines, the virtual address is split into distinct parts. These parts are used to index into various tables which are used in the construction of the final physical address. For the simplest model we have only one level of tables.

Figure 4.1: 1-Level Address Translation

Figure 4.1 shows how the different parts of the virtual address are used. A top part is used to select an entry in a Page Directory; each entry in that directory can be individually set by the OS. The page directory entry determines the address of a physical memory page; more than one entry in the page directory can point to the same physical address. The complete physical address of the memory cell is determined by combining the page address from the page directory with the low bits from the virtual address. The page directory entry also contains some additional information about the page such as access permissions.

譯者信息

4.1 最簡單的地址轉換

有趣的地方在於虛擬地址到物理地址的轉換。MMU能夠以頁爲單位從新映射地址。與尋址緩存線時同樣,虛擬地址被分割爲不一樣的部分,這些部分被用做多個表的索引,而這些表用來構造最終的物理地址。最簡單的模型只有一級表。

Figure 4.1: 1-Level Address Translation

圖 4.1 顯示了虛擬地址的不一樣部分是如何使用的。高位部分用於選擇頁目錄中的一個條目;目錄中的每一個條目能夠由OS分別設置。頁目錄條目決定了物理內存頁的地址;頁目錄中能夠有多個條目指向同一個物理地址。完整的物理內存地址由頁目錄中獲得的頁地址和虛擬地址的低位部分合並而成。頁目錄條目還包含一些關於頁面的附加信息,如訪問權限。

 

The data structure for the page directory is stored in memory. The OS has to allocate contiguous physical memory and store the base address of this memory region in a special register. The appropriate bits of the virtual address are then used as an index into the page directory, which is actually an array of directory entries.

For a concrete example, this is the layout used for 4MB pages on x86 machines. The Offset part of the virtual address is 22 bits in size, enough to address every byte in a 4MB page. The remaining 10 bits of the virtual address select one of the 1024 entries in the page directory. Each entry contains a 10 bit base address of a 4MB page which is combined with the offset to form a complete 32 bit address.

譯者信息

頁目錄的數據結構存儲在內存中。OS必須分配連續的物理內存,並將這個地址範圍的基地址存入一個特殊的寄存器。而後虛擬地址的適當的位被用來做爲頁目錄的索引,這個頁目錄事實上是目錄條目的列表。

做爲一個具體的例子,下面是x86機器上4MB頁面的佈局。虛擬地址的位移部分是22位,足以尋址一個4MB頁內的每個字節。剩下的10位用來在頁目錄的1024個條目中選擇一個。每一個條目包含一個4MB頁的10位基地址,它與位移合併起來,造成一個完整的32位地址。
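
In code, the split described above amounts to a shift and a mask. A minimal sketch (the virtual address is an arbitrary example):

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
      uint32_t vaddr  = 0x0804a123u;               /* arbitrary example        */
      uint32_t index  = vaddr >> 22;               /* page directory entry     */
      uint32_t offset = vaddr & ((1u << 22) - 1);  /* byte within the 4MB page */

      printf("directory index %u, page offset 0x%06x\n",
             (unsigned)index, (unsigned)offset);
      return 0;
  }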

 

4.2 Multi-Level Page Tables

4MB pages are not the norm, they would waste a lot of memory since many operations an OS has to perform require alignment to memory pages. With 4kB pages (the norm on 32-bit machines and, still, often on 64-bit machines), the Offset part of the virtual address is only 12 bits in size. This leaves 20 bits as the selector of the page directory. A table with 2^20 entries is not practical. Even if each entry would be only 4 bytes the table would be 4MB in size. With each process potentially having its own distinct page directory much of the physical memory of the system would be tied up for these page directories.

The solution is to use multiple levels of page tables. These can then represent a sparse huge page directory where regions which are not actually used do not require allocated memory. The representation is therefore much more compact, making it possible to have the page tables for many processes in memory without impacting performance too much.

譯者信息

4.2 多級頁表

4MB的頁並非常態,它們會浪費不少內存,由於OS須要執行的許多操做都要求按內存頁對齊。對於4kB的頁(32位機器上的常態,即便在64位機器上也常常如此),虛擬地址的位移部分只有12位,這留下20位做爲頁目錄的選擇子。具備2^20個條目的表是不實際的:即便每一個條目只有4字節,這個表也要4MB大小。因爲每一個進程均可能有本身獨立的頁目錄,系統中大量的物理內存將被這些頁目錄佔用。

解決辦法是使用多級頁表。多級頁表可以表示一個稀疏的巨大頁目錄,其中實際沒有用到的區域不須要分配內存。這種表示方式更加緊湊,使得內存中能夠同時保存許多進程的頁表,而不會對性能形成太大影響。

 

Today the most complicated page table structures comprise four levels. Figure 4.2 shows the schematics of such an implementation.

Figure 4.2: 4-Level Address Translation

The virtual address is, in this example, split into at least five parts. Four of these parts are indexes into the various directories. The level 4 directory is referenced using a special-purpose register in the CPU. The content of the level 4 to level 2 directories is a reference to next lower level directory. If a directory entry is marked empty it obviously need not point to any lower directory. This way the page table tree can be sparse and compact. The entries of the level 1 directory are, just like in Figure 4.1, partial physical addresses, plus auxiliary data like access permissions.

譯者信息

今天最複雜的頁表結構由四級構成。圖4.2顯示了這樣一個實現的原理圖。

Figure 4.2: 4-Level Address Translation

在這個例子中,虛擬地址被至少分爲五個部分,其中四個部分是不一樣目錄的索引。第4級目錄經過CPU中一個專用寄存器來引用。第4級到第2級目錄的內容是對下一級目錄的引用。若是一個目錄條目標記爲空,顯然它就不須要指向任何更低一級的目錄,這樣頁表樹就能保持稀疏和緊湊。正如圖4.1中那樣,第1級目錄的條目是物理地址的一部分,外加訪問權限之類的輔助數據。

 

To determine the physical address corresponding to a virtual address the processor first determines the address of the highest level directory. This address is usually stored in a register. Then the CPU takes the index part of the virtual address corresponding to this directory and uses that index to pick the appropriate entry. This entry is the address of the next directory, which is indexed using the next part of the virtual address. This process continues until it reaches the level 1 directory, at which point the value of the directory entry is the high part of the physical address. The physical address is completed by adding the page offset bits from the virtual address. This process is called page tree walking. Some processors (like x86 and x86-64) perform this operation in hardware, others need assistance from the OS.

譯者信息爲了肯定虛擬地址對應的物理地址,處理器先肯定最高級目錄的地址。這個地址通常保存在一個寄存器裏。而後CPU取出虛擬地址中對應這個目錄的索引部分,並用這個索引選出相應的條目。這個條目是下一級目錄的地址,再由虛擬地址的下一部分來索引。這個過程一直持續到第1級目錄,此時目錄條目的值就是物理地址的高位部分,再加上虛擬地址中的頁面位移,物理地址就完整了。這個過程稱爲頁表遍歷。一些處理器(像x86和x86-64)用硬件執行這個操做,其它的則須要OS的協助。
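
A sketch of how the virtual address is sliced during such a walk, using the common x86-64 layout with 4kB pages (9 bits of index per level plus a 12-bit page offset) as an assumed example:

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
      uint64_t vaddr = 0x00007f41deadb000ull;      /* arbitrary example       */

      unsigned l4  = (vaddr >> 39) & 0x1ff;        /* level 4 directory index */
      unsigned l3  = (vaddr >> 30) & 0x1ff;        /* level 3                 */
      unsigned l2  = (vaddr >> 21) & 0x1ff;        /* level 2                 */
      unsigned l1  = (vaddr >> 12) & 0x1ff;        /* level 1                 */
      unsigned off = (unsigned)(vaddr & 0xfff);    /* offset in the 4kB page  */

      printf("L4=%u L3=%u L2=%u L1=%u offset=0x%03x\n", l4, l3, l2, l1, off);
      return 0;
  }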

 

Each process running on the system might need its own page table tree. It is possible to partially share trees but this is rather the exception. It is therefore good for performance and scalability if the memory needed by the page table trees is as small as possible. The ideal case for this is to place the used memory close together in the virtual address space; the actual physical addresses used do not matter. A small program might get by with using just one directory at each of levels 2, 3, and 4 and a few level 1 directories. On x86-64 with 4kB pages and 512 entries per directory this allows the addressing of 2MB with a total of 4 directories (one for each level). 1GB of contiguous memory can be addressed with one directory for levels 2 to 4 and 512 directories for level 1.

Assuming all memory can be allocated contiguously is too simplistic, though. For flexibility reasons the stack and the heap area of a process are, in most cases, allocated at pretty much opposite ends of the address space. This allows either area to grow as much as possible if needed. This means that there are most likely two level 2 directories needed and correspondingly more lower level directories.

譯者信息

系統中運行的每一個進程可能都須要本身的頁表樹。部分共享頁表樹是可能的,但這屬於例外。所以,若是頁表樹所需的內存儘量小,將對性能與可擴展性有利。理想的狀況是把使用的內存在虛擬地址空間中緊湊地放在一塊兒;實際使用的物理地址則可有可無。一個小程序可能只須要第2、3、4級各一個目錄和少許第1級目錄就能應付過去。在一個採用4kB頁面、每一個目錄512個條目的x86-64機器上,總共4個目錄(每級一個)就能夠尋址2MB。1GB的連續內存能夠用第2到第4級各一個目錄加上512個第1級目錄來尋址。

可是,假設全部內存均可以連續分配未免太簡單了。出於靈活性的緣由,大多數狀況下,一個進程的棧與堆區域被分配在地址空間中幾乎相反的兩端,這樣任何一個區域均可以在須要時儘量地增加。這意味着最有可能須要兩個第2級目錄,以及相應更多的更低一級的目錄。

 

But even this does not always match current practice. For security reasons the various parts of an executable (code, data, heap, stack, DSOs, aka shared libraries) are mapped at randomized addresses [nonselsec]. The randomization extends to the relative position of the various parts; that implies that the various memory regions in use in a process are widespread throughout the virtual address space. By applying some limits to the number of bits of the address which are randomized the range can be restricted, but it certainly, in most cases, will not allow a process to run with just one or two directories for levels 2 and 3.

If performance is really much more important than security, randomization can be turned off. The OS will then usually at least load all DSOs contiguously in virtual memory.

譯者信息

但即便這樣也不老是符合當前的實際狀況。出於安全的緣由,可執行程序的各個部分(代碼、數據、堆、棧、DSO,即共享庫)被映射到隨機化的地址[nonselsec]。隨機化還延伸到各部分的相對位置;這意味着一個進程使用的各個內存區域遍及於整個虛擬地址空間。經過對被隨機化的地址位數加以限制,能夠縮小這個範圍,但在大多數狀況下,這固然不會讓一個進程只用一兩個第2級和第3級目錄就能運行。

若是性能真的遠比安全重要,能夠關閉隨機化。這時OS一般至少會把全部DSO連續地裝載到虛擬內存中。

 

4.3 Optimizing Page Table Access

All the data structures for the page tables are kept in the main memory; this is where the OS constructs and updates the tables. Upon creation of a process or a change of a page table the CPU is notified. The page tables are used to resolve every virtual address into a physical address using the page table walk described above. More to the point: at least one directory for each level is used in the process of resolving a virtual address. This requires up to four memory accesses (for a single access by the running process) which is slow. It is possible to treat these directory table entries as normal data and cache them in L1d, L2, etc., but this would still be far too slow.

譯者信息

4.3 優化頁表訪問

頁表的全部數據結構都保存在主存中,OS在那裏建立和更新這些表。當建立一個進程或者頁表發生變化時,CPU會獲得通知。頁表經過上面描述的頁表遍歷方式,把每個虛擬地址解析成物理地址。更確切地說:解析一個虛擬地址時,每一級至少要用到一個目錄。這須要最多四次內存訪問(對運行中的進程的一次訪問來講),很是慢。雖然能夠像普通數據同樣處理這些目錄表條目,把它們緩存在L1d、L2等緩存中,但那仍然太慢了。

 

 

From the earliest days of virtual memory, CPU designers have used a different optimization. A simple computation can show that only keeping the directory table entries in the L1d and higher cache would lead to horrible performance. Each absolute address computation would require a number of L1d accesses corresponding to the page table depth. These accesses cannot be parallelized since they depend on the previous lookup's result. This alone would, on a machine with four page table levels, require at the very least 12 cycles. Add to that the probability of an L1d miss and the result is nothing the instruction pipeline can hide. The additional L1d accesses also steal precious bandwidth to the cache.

譯者信息從虛擬內存的早期開始,CPU的設計者就採用了一種不一樣的優化。簡單的計算就能代表,只把目錄表條目保存在L1d及更高級的緩存裏,會致使可怕的性能。每次計算絕對地址,都須要與頁表深度相對應次數的L1d訪問。這些訪問不能並行,由於它們依賴於前一次查詢的結果。僅這一點,在四級頁表的機器上就至少須要12個週期。再算上L1d未命中的可能性,其結果是指令流水線根本沒法隱藏這些延遲。額外的L1d訪問還會消耗珍貴的緩存帶寬。

 

So, instead of just caching the directory table entries, the complete computation of the address of the physical page is cached. For the same reason that code and data caches work, such a cached address computation is effective. Since the page offset part of the virtual address does not play any part in the computation of the physical page address, only the rest of the virtual address is used as the tag for the cache. Depending on the page size this means hundreds or thousands of instructions or data objects share the same tag and therefore same physical address prefix.

The cache into which the computed values are stored is called the Translation Look-Aside Buffer (TLB). It is usually a small cache since it has to be extremely fast. Modern CPUs provide multi-level TLB caches, just as for the other caches; the higher-level caches are larger and slower. The small size of the L1TLB is often made up for by making the cache fully associative, with an LRU eviction policy. Recently, this cache has been growing in size and, in the process, was changed to be set associative. As a result, it might not be the oldest entry which gets evicted and replaced whenever a new entry has to be added.

譯者信息

因此,被緩存的再也不是單個目錄表條目,而是物理頁地址的完整計算結果。基於與代碼和數據緩存可以奏效相同的緣由,這種對地址計算結果的緩存也是高效的。因爲虛擬地址的頁面位移部分不參與物理頁地址的計算,只有虛擬地址的其他部分被用做緩存的標籤。根據頁面大小的不一樣,這意味着成百上千條指令或數據對象共享同一個標籤,所以也共享同一個物理地址前綴。

保存這些計算結果的緩存叫作旁路轉換緩存(Translation Look-Aside Buffer,TLB)。由於它必須很是快,因此一般是一個很小的緩存。現代CPU像對待其它緩存同樣,提供了多級TLB緩存;級別越高的緩存越大也越慢。小小的L1級TLB一般採用全相聯映像,並使用LRU逐出策略。最近這種緩存的容量變大了,並且在此過程當中改爲了組相聯。其結果之一就是,當必須添加一個新條目的時候,被逐出並替換的可能再也不是最舊的條目。

 

As noted above, the tag used to access the TLB is a part of the virtual address. If the tag has a match in the cache, the final physical address is computed by adding the page offset from the virtual address to the cached value. This is a very fast process; it has to be since the physical address must be available for every instruction using absolute addresses and, in some cases, for L2 look-ups which use the physical address as the key. If the TLB lookup misses the processor has to perform a page table walk; this can be quite costly.

Prefetching code or data through software or hardware could implicitly prefetch entries for the TLB if the address is on another page. This cannot be allowed for hardware prefetching because the hardware could initiate page table walks that are invalid. Programmers therefore cannot rely on hardware prefetching to prefetch TLB entries. It has to be done explicitly using prefetch instructions. TLBs, just like data and instruction caches, can appear in multiple levels. Just as for the data cache, the TLB usually appears in two flavors: an instruction TLB (ITLB) and a data TLB (DTLB). Higher-level TLBs such as the L2TLB are usually unified, as is the case with the other caches.

 

譯者信息

正如上面提到的,用來訪問TLB的標籤是虛擬地址的一部分。若是標籤在緩存中有匹配,就把虛擬地址中的頁面位移加到緩存的值上,計算出最終的物理地址。這是一個很是快的過程;也必須如此,由於每條使用絕對地址的指令都須要物理地址,並且在某些狀況下,L2查找也以物理地址做爲關鍵字。若是TLB查詢未命中,處理器就必須執行一次頁表遍歷,其代價可能很是大。

經過軟件或硬件預取代碼或數據時,若是地址位於另外一個頁面,可能會隱式地預取TLB的條目。硬件預取不容許這樣作,由於硬件可能會發起非法的頁表遍歷。所以,程序員不能依賴硬件預取機制來預取TLB條目,而必須使用預取指令顯式地完成。就像數據和指令緩存同樣,TLB也能夠有多個級別。也和數據緩存同樣,TLB一般有兩種形式:指令TLB(ITLB)和數據TLB(DTLB)。更高級別的TLB(如L2TLB)一般是統一的,就像其它緩存的情形同樣。

 

4.3.1 Caveats Of Using A TLB

The TLB is a processor-core global resource. All threads and processes executed on the processor core use the same TLB. Since the translation of virtual to physical addresses depends on which page table tree is installed, the CPU cannot blindly reuse the cached entries if the page table is changed. Each process has a different page table tree (but not the threads in the same process) as does the kernel and the VMM (hypervisor) if present. It is also possible that the address space layout of a process changes. There are two ways to deal with this problem:

  • The TLB is flushed whenever the page table tree is changed.
  • The tags for the TLB entries are extended to additionally and uniquely identify the page table tree they refer to.

In the first case the TLB is flushed whenever a context switch is performed. Since, in most OSes, a switch from one thread/process to another one requires executing some kernel code, TLB flushes are restricted to entering and leaving the kernel address space. On virtualized systems it also happens when the kernel has to call the VMM and on the way back. If the kernel and/or VMM does not have to use virtual addresses, or can reuse the same virtual addresses as the process or kernel which made the system/VMM call, the TLB only has to be flushed if, upon leaving the kernel or VMM, the processor resumes execution of a different process or kernel.

Flushing the TLB is effective but expensive. When executing a system call, for instance, the kernel code might be restricted to a few thousand instructions which touch, perhaps, a handful of new pages (or one huge page, as is the case for Linux on some architectures). This work would replace only as many TLB entries as pages are touched. For Intel's Core2 architecture with its 128 ITLB and 256 DTLB entries, a full flush would mean that more than 100 and 200 entries (respectively) would be flushed unnecessarily. When the system call returns to the same process, all those flushed TLB entries could be used again, but they will be gone. The same is true for often-used code in the kernel or VMM. On each entry into the kernel the TLB has to be filled from scratch even though the page tables for the kernel and VMM usually do not change and, therefore, TLB entries could, in theory, be preserved for a very long time. This also explains why the TLB caches in today's processors are not bigger: programs most likely will not run long enough to fill all these entries.

This fact, of course, did not escape the CPU architects. One possibility to optimize the cache flushes is to individually invalidate TLB entries. For instance, if the kernel code and data fall into a specific address range, only the pages falling into this address range have to be evicted from the TLB. This only requires comparing tags and, therefore, is not very expensive. This method is also useful in case a part of the address space is changed, for instance, through a call to munmap.
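
As a simple illustration of such a partial address space change, the sketch below maps four anonymous pages on Linux and then unmaps the two middle ones with munmap; only the translations for the unmapped range need to be dropped, not the whole TLB.

#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, 4 * page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    /* Drop the two middle pages again.  Only this address range has to be
     * invalidated in the TLB; the entries for the first and last page (and
     * everything else in the process) can stay.                            */
    munmap(p + page, 2 * page);

    munmap(p, page);
    munmap(p + 3 * page, page);
    return 0;
}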

A much better solution is to extend the tag used for the TLB access. If, in addition to the part of the virtual address, a unique identifier for each page table tree (i.e., a process's address space) is added, the TLB does not have to be completely flushed at all. The kernel, VMM, and the individual processes all can have unique identifiers. The only issue with this scheme is that the number of bits available for the TLB tag is severely limited, while the number of address spaces is not. This means some identifier reuse is necessary. When this happens the TLB has to be partially flushed (if this is possible). All entries with the reused identifier must be flushed but this is, hopefully, a much smaller set.

This extended TLB tagging is of general advantage when multiple processes are running on the system. If the memory use (and hence TLB entry use) of each of the runnable processes is limited, there is a good chance the most recently used TLB entries for a process are still in the TLB when it gets scheduled again. But there are two additional advantages:

  1. Special address spaces, such as those used by the kernel and VMM, are often only entered for a short time; afterward control is often returned to the address space which initiated the call. Without tags, two TLB flushes are performed. With tags the calling address space's cached translations are preserved and, since the kernel and VMM address space do not often change TLB entries at all, the translations from previous system calls, etc. can still be used.
  2. When switching between two threads of the same process no TLB flush is necessary at all. Without extended TLB tags the entry into the kernel destroys the first thread's TLB entries, though.

 

Some processors have, for some time, implemented these extended tags. AMD introduced a 1-bit tag extension with the Pacifica virtualization extensions. This 1-bit Address Space ID (ASID) is, in the context of virtualization, used to distinguish the VMM's address space from that of the guest domains. This allows the OS to avoid flushing the guest's TLB entries every time the VMM is entered (for instance, to handle a page fault) or the VMM's TLB entries when control returns to the guest. The architecture will allow the use of more bits in the future. Other mainstream processors will likely follow suit and support this feature.

 

4.3.2 Influencing TLB Performance

There are a couple of factors which influence TLB performance. The first is the size of the pages. Obviously, the larger a page is, the more instructions or data objects will fit into it. So a larger page size reduces the overall number of address translations which are needed, meaning that fewer entries in the TLB cache are needed. Most architectures allow the use of multiple different page sizes; some sizes can be used concurrently. For instance, the x86/x86-64 processors have a normal page size of 4kB but they can also use 4MB and 2MB pages respectively. IA-64 and PowerPC allow sizes like 64kB as the base page size.

The use of large page sizes brings some problems with it, though. The memory regions used for the large pages must be contiguous in physical memory. If the unit size for the administration of physical memory is raised to the size of the virtual memory pages, the amount of wasted memory will grow. All kinds of memory operations (like loading executables) require alignment to page boundaries. This means that, on average, each mapping wastes half the page size in physical memory. This waste can easily add up; it thus puts an upper limit on the reasonable unit size for physical memory allocation.
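
To put an illustrative number on this (the figures are made up for the example): with 4kB allocation units a mapping wastes on average 2kB, so a process with 100 file and library mappings loses about 200kB. If the allocation unit were raised to 2MB, the same 100 mappings would waste about 1MB each, roughly 100MB in total, which is why the unit size for general physical memory allocation is kept small.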

It is certainly not practical to increase the unit size to 2MB to accommodate large pages on x86-64. This is just too large a size. But this in turn means that each large page has to be comprised of many smaller pages. And these small pages have to be contiguous in physical memory. Allocating 2MB of contiguous physical memory with a unit page size of 4kB can be challenging. It requires finding a free area with 512 contiguous pages. This can be extremely difficult (or impossible) after the system runs for a while and physical memory becomes fragmented.

On Linux it is therefore necessary to pre-allocate these big pages at system start time using the special hugetlbfs filesystem. A fixed number of physical pages are reserved for exclusive use as big virtual pages. This ties down resources which might not always be used. It is also a limited pool; increasing it normally means restarting the system. Still, huge pages are the way to go in situations where performance is at a premium, resources are plentiful, and cumbersome setup is not a big deterrent. Database servers are an example.
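
A minimal sketch of how such pre-allocated huge pages can be used from a program, assuming the administrator has reserved pages and mounted hugetlbfs at the hypothetical mount point /mnt/huge (for instance with "echo 512 > /proc/sys/vm/nr_hugepages" and "mount -t hugetlbfs none /mnt/huge"), and assuming the 2MB huge page size of x86-64:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)   /* assumed: x86-64 2MB pages */

int main(void)
{
    /* a file in the hugetlbfs mount is backed by huge pages */
    int fd = open("/mnt/huge/example", O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return 1;

    /* the whole 2MB mapping is covered by a single TLB entry */
    char *p = mmap(NULL, HUGE_PAGE_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    p[0] = 1;                                /* touch the page */

    munmap(p, HUGE_PAGE_SIZE);
    close(fd);
    unlink("/mnt/huge/example");
    return 0;
}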

Increasing the minimum virtual page size (as opposed to optional big pages) has its problems, too. Memory mapping operations (loading applications, for example) must conform to these page sizes. No smaller mappings are possible. The locations of the various parts of an executable have, for most architectures, a fixed relationship. If the page size is increased beyond what has been taken into account when the executable or DSO was built, the load operation cannot be performed. It is important to keep this limitation in mind. Figure 4.3 shows how the alignment requirements of an ELF binary can be determined. It is encoded in the ELF program header.

$ eu-readelf -l /bin/ls
Program Headers:
  Type   Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
...
  LOAD   0x000000 0x0000000000400000 0x0000000000400000 0x0132ac 0x0132ac R E 0x200000
  LOAD   0x0132b0 0x00000000006132b0 0x00000000006132b0 0x001a71 0x001a71 RW  0x200000
...

Figure 4.3: ELF Program Header Indicating Alignment Requirements

In this example, an x86-64 binary, the value is 0x200000 = 2,097,152 = 2MB which corresponds to the maximum page size supported by the processor.

There is a second effect of using larger page sizes: the number of levels of the page table tree is reduced. Since the part of the virtual address corresponding to the page offset increases, there are not that many bits left which need to be handled through page directories. This means that, in case of a TLB miss, the amount of work which has to be done is reduced.

Beyond using large page sizes, it is possible to reduce the number of TLB entries needed by moving data which is used at the same time to fewer pages. This is similar to some optimizations for cache use we talked about above. Only now the alignment required is large. Given that the number of TLB entries is quite small this can be an important optimization.
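
A hedged illustration of this idea: the structure below is purely hypothetical, but it shows how data that is used together on a hot path can be packed into one block and allocated with page alignment (posix_memalign), so that all of it is covered by a single TLB entry instead of one entry per scattered object. The 4kB alignment is an assumption.

#include <stdlib.h>

/* Hypothetical hot data: counters, a lookup table, and flags that are all
 * touched on the fast path.  Together they fit into one 4kB page (about
 * 2.8kB on a 64-bit system), so one TLB entry covers all of them.         */
struct hot_data {
    long counters[64];
    int  lookup[512];
    char flags[256];
};

struct hot_data *alloc_hot_data(void)
{
    void *p = NULL;

    /* allocate page-aligned so the block does not straddle two pages */
    if (posix_memalign(&p, 4096, sizeof(struct hot_data)) != 0)
        return NULL;
    return p;
}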

4.4 Impact Of Virtualization

Virtualization of OS images will become more and more prevalent; this means another layer of memory handling is added to the picture. Virtualization of processes (basically jails) or OS containers does not fall into this category since only one OS is involved. Technologies like Xen or KVM enable the execution of independent OS images, with or without help from the processor. In these situations there is one piece of software alone which directly controls access to the physical memory.

Figure 4.4: Xen Virtualization Model

In the case of Xen (see Figure 4.4) the Xen VMM is that piece of software. The VMM does not implement many of the other hardware controls itself, though. Unlike VMMs on other, earlier systems (and the first release of the Xen VMM) the hardware outside of memory and processors is controlled by the privileged Dom0 domain. Currently, this is basically the same kernel as the unprivileged DomU kernels and, as far as memory handling is concerned, they do not differ. Important here is that the VMM hands out physical memory to the Dom0 and DomU kernels which, themselves, then implement the usual memory handling as if they were running directly on a processor.

To implement the separation of the domains which is required for the virtualization to be complete, the memory handling in the Dom0 and DomU kernels does not have unrestricted access to physical memory. The VMM does not hand out memory by giving out individual physical pages and letting the guest OSes handle the addressing; this would not provide any protection against faulty or rogue guest domains. Instead, the VMM creates its own page table tree for each guest domain and hands out memory using these data structures. The good thing is that access to the administrative information of the page table tree can be controlled. If the code does not have appropriate privileges it cannot do anything.

This access control is exploited in the virtualization Xen provides, regardless of whether para- or hardware (aka full) virtualization is used. The guest domains construct their page table trees for each process in a way which is intentionally quite similar for para- and hardware virtualization. Whenever the guest OS modifies its page tables the VMM is invoked. The VMM then uses the updated information in the guest domain to update its own shadow page tables. These are the page tables which are actually used by the hardware. Obviously, this process is quite expensive: each modification of the page table tree requires an invocation of the VMM. While changes to the memory mapping are not cheap without virtualization they become even more expensive now.

The additional costs can be really large, considering that the changes from the guest OS to the VMM and back themselves are already quite expensive. This is why the processors are starting to have additional functionality to avoid the creation of shadow page tables. This is good not only because of speed concerns but it also reduces memory consumption by the VMM. Intel has Extended Page Tables (EPTs) and AMD calls it Nested Page Tables (NPTs). Basically both technologies have the page tables of the guest OSes produce virtual physical addresses. These addresses must then be further translated, using the per-domain EPT/NPT trees, into actual physical addresses. This will allow memory handling at almost the speed of the no-virtualization case since most VMM entries for memory handling are removed. It also reduces the memory use of the VMM since now only one page table tree for each domain (as opposed to process) has to be maintained.

The results of the additional address translation steps are also stored in the TLB. That means the TLB does not store the virtual physical address but, instead, the complete result of the lookup. It was already explained that AMD's Pacifica extension introduced the ASID to avoid TLB flushes on each entry. The number of bits for the ASID is one in the initial release of the processor extensions; this is just enough to differentiate VMM and guest OS. Intel has virtual processor IDs (VPIDs) which serve the same purpose, only there are more of them. But the VPID is fixed for each guest domain and therefore it cannot be used to mark separate processes and avoid TLB flushes at that level, too.

The amount of work needed for each address space modification is one problem with virtualized OSes. There is another problem inherent in VMM-based virtualization, though: there is no way around having two layers of memory handling. But memory handling is hard (especially when taking complications like NUMA into account, see Section 5). The Xen approach of using a separate VMM makes optimal (or even good) handling hard since all the complications of a memory management implementation, including "trivial" things like discovery of memory regions, must be duplicated in the VMM. The OSes have fully-fledged and optimized implementations; one really wants to avoid duplicating them.

Figure 4.5: KVM Virtualization Model

 

This is why carrying the VMM/Dom0 model to its conclusion is such an attractive alternative. Figure 4.5 shows how the KVM Linux kernel extensions try to solve the problem. There is no separate VMM running directly on the hardware and controlling all the guests; instead, a normal Linux kernel takes over this functionality. This means the complete and sophisticated memory handling functionality in the Linux kernel is used to manage the memory of the system. Guest domains run alongside the normal user-level processes in what the creators call "guest mode". The virtualization functionality, para- or full virtualization, is controlled by another user-level process, the KVM VMM. This is just another process which happens to control a guest domain using the special KVM device the kernel implements.

 

The benefit of this model over the separate VMM of the Xen model is that, even though there are still two memory handlers at work when guest OSes are used, there only needs to be one implementation, that in the Linux kernel. It is not necessary to duplicate the same functionality in another piece of code like the Xen VMM. This leads to less work, fewer bugs, and, perhaps, less friction where the two memory handlers touch since the memory handler in a Linux guest makes the same assumptions as the memory handler in the outer Linux kernel which runs on the bare hardware.

Overall, programmers must be aware that, with virtualization used, the cost of memory operations is even higher than without virtualization. Any optimization which reduces this work will pay off even more in virtualized environments. Processor designers will, over time, reduce the difference more and more through technologies like EPT and NPT but it will never completely go away.
