Lucene on 64-bit platforms should use MMapDirectory [reposted]

Starting with version 3.1, Lucene and Solr use MMapDirectory by default on 64-bit Windows and Solaris systems; since version 3.3 this also applies to 64-bit Linux. This change confused some Lucene and Solr users, because suddenly their systems behaved differently than before. On the mailing lists, some users asked why they were suddenly using far more resources than before, and a number of experts started telling people not to use MMapDirectory. From the perspective of the Lucene committers, however, MMapDirectory is absolutely the best choice on these platforms.

In this blog post I will try to explain some basics about virtual memory and how they are used to improve Lucene's performance. Once you understand this, you will see that the people telling you not to use MMapDirectory are wrong. In the second part I will list some configuration details that help you avoid errors like "mmap failed" and situations where a poorly sized Java heap keeps Lucene from reaching optimal performance.

Virtual Memory[1]

Let's start with the operating system kernel. Since the 1970s, software has done I/O like this: whenever you need data from disk, you issue a syscall to the kernel, pass in a pointer to some buffer, and then read from or write to disk. If you don't want to issue huge numbers of syscalls (because syscalls from a user process are expensive), you use larger buffers, so each call reads more data and the disk is touched less often. This is also one of the reasons why some people suggest loading the whole Lucene index into the Java heap (using RAMDirectory).
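To make that pattern concrete, here is a minimal sketch in plain JDK I/O (the file path is a hypothetical stand-in for an index file): every read() call is a syscall that copies bytes from the kernel into our own buffer, so a larger buffer means fewer syscalls.

```java
import java.io.FileInputStream;
import java.io.IOException;

public class BufferedSyscallRead {
    public static void main(String[] args) throws IOException {
        byte[] buffer = new byte[1 << 20]; // 1 MiB buffer: far fewer read() syscalls than a tiny one
        long total = 0;
        try (FileInputStream in = new FileInputStream("/path/to/index/_0.frq")) { // hypothetical file
            int n;
            while ((n = in.read(buffer)) != -1) { // each call is a syscall copying bytes from kernel space
                total += n; // ... parse buffer[0..n) and run program logic here
            }
        }
        System.out.println("read " + total + " bytes");
    }
}
```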

But all modern operating systems - Linux, Windows (NT+), Mac OS X, and Solaris - provide a much better way to do I/O: they use sophisticated file system caches and memory management to buffer the data for you. The most important feature here is called virtual memory, and it is a very good way to handle huge data sets such as a Lucene index. Virtual memory is an integral part of computer architecture; implementing it requires hardware support, usually in the form of a memory management unit (MMU) built into the CPU. It works very simply: every process gets its own virtual address space, into which all libraries, heap and stack space are mapped. In most cases this virtual address space starts at offset zero, which makes loading the program code easy because the code's address pointers never change. Every process sees a large, contiguous, linear address space; it is called "virtual memory" because this address space has nothing whatsoever to do with physical memory - it just looks like memory to the process. A process can access this virtual address space as if it were real memory, without caring that many other processes are using memory at the same time. The underlying OS works together with the MMU to map these virtual addresses to real memory. This is done with page tables, which are backed by TLBs (translation lookaside buffers, located in the MMU hardware, which cache frequently accessed pages). In this way the OS can distribute the memory requests of all processes across the physical memory actually available, completely transparently to the running programs.

Schematic drawing of virtual memory
(image from Wikipedia [1] http://en.wikipedia.org/wiki/File:Virtual_memory.svg, licensed by CC BY-SA 3.0)

With this virtualization in place, the OS can do one more thing: when physical memory runs low, it can decide to swap out pages that are no longer used, freeing physical memory. When a process tries to access a virtual address that has been paged out, the page is reloaded into memory. The user process does not have to do anything; memory management is completely transparent to it. This is great for applications, because they don't have to care whether there is enough memory. Of course, it also brings some problems for very memory-hungry applications like Lucene.

 

Lucene & Virtual Memory

Let's look at an example: suppose we load the whole index into "memory" (which is actually virtual memory). If we allocate a RAMDirectory and load all the index files into it, we are working against the OS. The OS itself tries hard to optimize disk access, so it already caches all disk I/O in physical memory. Now we copy all of that content, which should have stayed in the cache, into our own virtual address space, consuming huge amounts of physical memory. Since physical memory is limited, the OS may decide to swap out our huge RAMDirectory, which puts it back on disk (in the OS swap file). In effect we are fighting the OS kernel, and the result is that it pages the data we so laboriously read from disk right back out to disk. So RAMDirectory is not a good way to optimize index loading time. On top of that, RAMDirectory has problems related to garbage collection and concurrency: because the data resides in swap space, it is very expensive for the Java GC to clean it up. The result is lots of disk I/O, slow index access, and minute-long pauses caused by the struggling garbage collector.

 

If instead of caching the index in a RAMDirectory we use NIOFSDirectory or SimpleFSDirectory, we have a different problem: our code has to perform many syscalls to copy data from the disk or the file system cache into buffers on the Java heap, and this I/O happens on every single search request.
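For illustration, here is a minimal sketch of that approach against the Lucene 3.x-era API this post is about (the index path and field name are hypothetical); every search performed through this reader copies blocks from the disk or file system cache into Java heap buffers.

```java
import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.NIOFSDirectory;

public class NioDirectorySearch {
    public static void main(String[] args) throws Exception {
        // Lucene 3.x-style API; the index path is a hypothetical placeholder.
        NIOFSDirectory dir = new NIOFSDirectory(new File("/path/to/index"));
        IndexReader reader = IndexReader.open(dir);          // reads go through syscalls into heap buffers
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs hits = searcher.search(new TermQuery(new Term("body", "lucene")), 10);
        System.out.println("hits: " + hits.totalHits);
        reader.close();
        dir.close();
    }
}
```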

 

Memory Mapping Files

The solution to the problems above is MMapDirectory, which uses virtual memory and mmap to access the files on disk.

With the approaches described in the first half of this article, we rely on syscalls to copy data between the file system cache and the Java heap. So how can we access the file system cache directly? That is exactly what mmap does!

Put simply, MMapDirectory treats the Lucene index like a swap file. The mmap() syscall tells the OS to map the whole index file into the virtual address space, so to Lucene it looks as if the index were in memory. Lucene can then access the index on disk as if it were a huge byte[] array (in Java this is wrapped in the ByteBuffer interface). When Lucene accesses the index through this virtual address space, no syscalls are needed; the CPU's MMU and TLB handle all the mapping. If the data is still only on disk, the MMU raises an interrupt and the OS loads it into the file system cache. If it is already in the cache, the MMU/TLB map it directly, which is just a memory access and therefore very fast. The programmer does not need to care about paging in/out; all of that is left to the OS. Furthermore, there is no concurrency overhead; the only drawback is that Java's ByteBuffer wrapper around the byte[] is a bit slower, but it is the only way to use mmap from Java. Another big advantage is that all memory management is handled by the OS, so there are no GC issues.
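For comparison, here is a minimal sketch of opening the same (hypothetical) index through MMapDirectory, again against the Lucene 3.x-era API; note that since 3.1/3.3, FSDirectory.open() already picks MMapDirectory by default on 64-bit Windows, Solaris and Linux.

```java
import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.MMapDirectory;

public class MMapDirectorySearch {
    public static void main(String[] args) throws Exception {
        // Explicit construction for clarity; FSDirectory.open() returns MMapDirectory
        // by default on 64-bit platforms. The index path is hypothetical.
        MMapDirectory dir = new MMapDirectory(new File("/path/to/index"));
        IndexReader reader = IndexReader.open(dir);           // index files are mmapped, not copied to heap
        IndexSearcher searcher = new IndexSearcher(reader);
        System.out.println("docs: " + reader.numDocs());
        System.out.println("hits: "
                + searcher.search(new TermQuery(new Term("body", "lucene")), 10).totalHits);
        reader.close();
        dir.close();
    }
}
```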

What does this all mean to our Lucene/Solr application?

  • To avoid competing with the OS for memory, give the JVM as little heap as possible (the -Xmx option) and leave index access entirely to the OS cache. This is also friendlier to the JVM's garbage collector.
  • Leave as much memory as possible to the OS, so the file system cache is large and the chance of swapping is low.

 

----------------------------------------------------------------

 

Don’t be afraid – Some clarification to common misunderstandings

Since version 3.1, Apache Lucene and Solr use MMapDirectory by default on 64bit Windows and Solaris systems; since version 3.3 also for 64bit Linux systems. This change led to some confusion among Lucene and Solr users, because suddenly their systems started to behave differently than in previous versions. On the Lucene and Solr mailing lists a lot of posts arrived from users asking why their Java installation is suddenly consuming three times their physical memory, or system administrators complaining about heavy resource usage. Also consultants were starting to tell people that they should not use MMapDirectory and change their solrconfig.xml to work instead with slow SimpleFSDirectory or NIOFSDirectory (which is much slower on Windows, caused by a JVM bug #6265734). From the point of view of the Lucene committers, who carefully decided that using MMapDirectory is the best for those platforms, this is rather annoying, because they know that Lucene/Solr can work with much better performance than before. Common misinformation about the background of this change causes suboptimal installations of this great search engine everywhere.

In this blog post, I will try to explain the basic operating system facts regarding virtual memory handling in the kernel and how this can be used to largely improve performance of Lucene ("VIRTUAL MEMORY for DUMMIES"). It will also clarify why the blog and mailing list posts done by various people are wrong and contradict the purpose of MMapDirectory. In the second part I will show you some configuration details and settings you should take care of to prevent errors like "mmap failed" and suboptimal performance because of stupid Java heap allocation.

Virtual Memory[1]

Let’s start with your operating system’s kernel: The naive approach to do I/O in software is the way you have done it since the 1970s – the pattern is simple: whenever you have to work with data on disk, you execute a syscall to your operating system kernel, passing a pointer to some buffer (e.g. a byte[] array in Java), and transfer some bytes from/to disk. After that you parse the buffer contents and do your program logic. If you don’t want to do too many syscalls (because those may cost a lot of processing power), you generally use large buffers in your software, so synchronizing the data in the buffer with your disk needs to be done less often. This is one reason why some people suggest loading the whole Lucene index into Java heap memory (e.g., by using RAMDirectory).

But all modern operating systems like Linux, Windows (NT+), MacOS X, or Solaris provide a much better approach to this 1970s style of code by using their sophisticated file system caches and memory management features. A feature called "virtual memory" is a good alternative to handle very large and space intensive data structures like a Lucene index. Virtual memory is an integral part of a computer architecture; implementations require hardware support, typically in the form of a memory management unit (MMU) built into the CPU. The way it works is very simple: Every process gets its own virtual address space where all libraries, heap and stack space are mapped into. This address space in most cases also starts at offset zero, which simplifies loading the program code because no relocation of address pointers needs to be done. Every process sees a large unfragmented linear address space it can work on. It is called "virtual memory" because this address space has nothing to do with physical memory, it just looks like that to the process. Software can then access this large address space as if it were real memory without knowing that there are other processes also consuming memory and having their own virtual address space. The underlying operating system works together with the MMU (memory management unit) in the CPU to map those virtual addresses to real memory once they are accessed for the first time. This is done using so-called page tables, which are backed by TLBs located in the MMU hardware (translation lookaside buffers, which cache frequently accessed pages). By this, the operating system is able to distribute all running processes’ memory requirements across the real available memory, completely transparent to the running programs.
 
By using this virtualization, there is one more thing the operating system can do: If there is not enough physical memory, it can decide to "swap out" pages no longer used by the processes, freeing physical memory for other processes or for caching more important file system operations. Once a process tries to access a virtual address that was paged out, it is reloaded to main memory and made available to the process. The process does not have to do anything, it is completely transparent. This is a good thing for applications because they don’t need to know anything about the amount of memory available; but it also leads to problems for very memory intensive applications like Lucene.

Lucene & Virtual Memory

Let’s take the example of loading the whole index or large parts of it into "memory" (we already know, it is only virtual memory). If we allocate a RAMDirectory and load all index files into it, we are working against the operating system: The operating system tries to optimize disk accesses, so it already caches all disk I/O in physical memory. We copy all these cache contents into our own virtual address space, consuming horrible amounts of physical memory (and we must wait for the copy operation to take place!). As physical memory is limited, the operating system may, of course, decide to swap out our large RAMDirectory, and where does it land? – On disk again (in the OS swap file)! In fact, we are fighting against our O/S kernel, which pages out all the stuff we loaded from disk [2]. So RAMDirectory is not a good idea to optimize index loading times! Additionally, RAMDirectory has more problems related to garbage collection and concurrency. Because the data resides in swap space, Java’s garbage collector has a hard job freeing the memory in its own heap management. This leads to high disk I/O, slow index access times, and minute-long latency in your searching code caused by the garbage collector going crazy.
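As an illustration of the pattern described above (and explicitly not a recommendation), here is a minimal sketch that copies an on-disk index into a RAMDirectory using the Lucene 3.x-era constructor; the index path is hypothetical.

```java
import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class RamDirectoryAntiPattern {
    public static void main(String[] args) throws Exception {
        FSDirectory onDisk = FSDirectory.open(new File("/path/to/index")); // hypothetical path
        // Copies every index file onto the Java heap - duplicating what the O/S
        // file system cache already holds, and creating lots of work for the GC.
        RAMDirectory inHeap = new RAMDirectory(onDisk);
        IndexReader reader = IndexReader.open(inHeap);
        System.out.println("docs now duplicated on the heap: " + reader.numDocs());
        reader.close();
        onDisk.close();
    }
}
```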

On the other hand, if we don’t use  RAMDirectory to buffer our index and use NIOFSDirectory or  SimpleFSDirectory, we have to pay another price: Our code has to do a lot of syscalls to the O/S kernel to copy blocks of data between the disk or filesystem cache and our buffers residing in Java heap. This needs to be done on every search request, over and over again.

Memory Mapping Files

The solution to the above issues is MMapDirectory, which uses virtual memory and a kernel feature called "mmap" [3] to access the disk files.

In our previous approaches, we were relying on using a syscall to copy the data between the file system cache and our local Java heap. How about directly accessing the file system cache? This is what mmap does!
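Outside of Lucene, this is roughly what mmap looks like from plain Java (a sketch; the file name is hypothetical): FileChannel.map() hands back a MappedByteBuffer that is backed directly by the file system cache, and reading from it needs no further syscalls.

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapSketch {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("/path/to/index/_0.tim", "r"); // hypothetical file
             FileChannel channel = raf.getChannel()) {
            // Map the whole file into our virtual address space; no data is copied yet.
            // A single Java mapping is limited to 2 GB, which is why Lucene maps large files in chunks.
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            // Reads are plain memory accesses; the kernel pages data in on demand.
            byte first = buf.get(0);
            System.out.println("mapped " + buf.capacity() + " bytes, first byte = " + first);
        }
    }
}
```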

Basically, mmap does the same as handling the Lucene index as a swap file. The mmap() syscall tells the O/S kernel to virtually map our whole index files into the previously described virtual address space, and make them look like RAM available to our Lucene process. We can then access our index file on disk just like it would be a large byte[] array (in Java this is encapsulated by a ByteBuffer interface to make it safe for use by Java code). If we access this virtual address space from the Lucene code we don’t need to do any syscalls, the processor’s MMU and TLB handle all the mapping for us. If the data is only on disk, the MMU will cause an interrupt and the O/S kernel will load the data into file system cache. If it is already in cache, the MMU/TLB map it directly to the physical memory in file system cache. It is now just a native memory access, nothing more! We don’t have to take care of paging in/out of buffers, all this is managed by the O/S kernel. Furthermore, we have no concurrency issue; the only overhead over a standard byte[] array is some wrapping caused by Java’s ByteBuffer interface (it is still slower than a real byte[] array, but that is the only way to use mmap from Java and it is much faster than all other directory implementations shipped with Lucene). We also waste no physical memory, as we operate directly on the O/S cache, avoiding all the Java GC issues described before.

What does this all mean to our Lucene/Solr application?
  • We should not work against the operating system anymore, so allocate as little heap space as possible (-Xmx Java option). Remember, our index accesses rely directly on the O/S cache! This is also very friendly to the Java garbage collector.
  • Free as much physical memory as possible to be available for the O/S kernel as file system cache. Remember, our Lucene code works directly on it, reducing the amount of paging/swapping between disk and memory. Allocating too much heap to our Lucene application hurts performance! Lucene does not require it with MMapDirectory.

Why does this only work as expected on operating systems and Java virtual machines with 64bit?

One limitation of 32bit platforms is the size of pointers: they can refer to any address within 0 and 2^32-1, which is 4 Gigabytes. Most operating systems limit that address space to 3 Gigabytes because the remaining address space is reserved for use by device hardware and similar things. This means the overall linear address space provided to any process is limited to 3 Gigabytes, so you cannot map any file larger than that into this "small" address space to be available as a big byte[] array. And once you have mapped that one large file, there is no virtual space (addresses, like "house numbers") available anymore. As physical memory sizes in current systems have already gone beyond that size, there is no address space available to make use of for mapping files without wasting resources (in our case "address space", not physical memory!).

On 64bit platforms this is different: 2^64-1 is a very large number, in excess of 18 quintillion bytes, so there is no real limit in address space. Unfortunately, most hardware (the MMU, the CPU’s bus system) and operating systems limit this address space to 47 bits for user mode applications (Windows: 43 bits) [4]. But there is still plenty of address space available to map terabytes of data.

Common misunderstandings

If you have read carefully what I have told you about virtual memory, you can easily verify that the following is true:
  • MMapDirectory does not consume additional memory and the size of mapped index files is not limited by the physical memory available on your server. By mmap()ing files, we only reserve address space, not memory! Remember, address space on 64bit platforms is for free!
  • MMapDirectory will not load the whole index into physical memory. Why should it do this? We just ask the operating system to map the file into address space for easy access, by no means are we requesting more. Java and the O/S provide the option to try loading the whole file into RAM (if enough is available), but Lucene does not use that option (we may add this possibility in a later version).
  • MMapDirectory does not overload the server when "top" reports horrible amounts of memory. "top" (on Linux) has three columns related to memory: "VIRT", "RES", and "SHR". The first one (VIRT, virtual) reports allocated virtual address space (and that one is for free on 64bit platforms!). This number can be several times your index size or your physical memory when merges are running in IndexWriter. If you have only one IndexReader open it should be approximately equal to allocated heap space (-Xmx) plus index size. It does not show physical memory used by the process. The second column (RES, resident) shows how much (physical) memory the process has allocated for operating and should be in the range of your Java heap size. The last column (SHR, shared) shows how much of the allocated virtual address space is shared with other processes. If you have several Java applications using MMapDirectory to access the same index, you will see this number go up. Generally, you will also see the space needed by shared system libraries, JAR files, and the process executable itself (which are also mmapped).

How to configure my operating system and Java VM to make optimal use of MMapDirectory?

First of all, default settings in Linux distributions and Solaris/Windows are perfectly fine. But there are some paranoid system administrators around who want to control everything (with a lack of understanding) and limit the maximum amount of virtual address space that can be allocated by applications. So please check that "ulimit -v" and "ulimit -m" both report "unlimited", otherwise it may happen that MMapDirectory reports "mmap failed" while opening your index. If this error still happens on systems with lots of very large indexes, each of those with many segments, you may need to tune your kernel parameters in /etc/sysctl.conf: The default value of vm.max_map_count is 65530; you may need to raise it. I think for Windows and Solaris systems there are similar settings available, but it is up to the reader to find out how to use them.

For configuring your Java VM, you should rethink your memory requirements: Give only the really needed amount of heap space and leave as much as possible to the O/S. As a rule of thumb: Don’t use more than ¼ of your physical memory as heap space for Java running Lucene/Solr, keep the remaining memory free for the operating system cache. If you have more applications running on your server, adjust accordingly. As usual the more physical memory the better, but you don’t need as much physical memory as your index size. The kernel does a good job in paging in frequently used pages from your index.
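Here is a small sketch of how you might sanity-check that rule of thumb from inside the JVM. It uses the com.sun.management extension of OperatingSystemMXBean, which is an assumption about your runtime (it is available on HotSpot-based JVMs, not guaranteed everywhere).

```java
import java.lang.management.ManagementFactory;

public class HeapRuleOfThumb {
    public static void main(String[] args) {
        long maxHeap = Runtime.getRuntime().maxMemory(); // roughly the effective -Xmx setting
        // HotSpot-specific extension; assumed to be present on this JVM.
        com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        long physical = os.getTotalPhysicalMemorySize();
        System.out.printf("heap max: %d MB, physical RAM: %d MB%n", maxHeap >> 20, physical >> 20);
        if (maxHeap > physical / 4) {
            System.out.println("Heap is more than 1/4 of physical RAM - consider a smaller -Xmx"
                    + " so the O/S file system cache can keep more of the index resident.");
        }
    }
}
```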

A good way to check that you have configured your system optimally is to look at both "top" (and correctly interpret it, see above) and the similar command "iotop" (which can be installed, e.g., on Ubuntu Linux with "apt-get install iotop"). If your system does lots of swap in/swap out for the Lucene process, reduce the heap size, you possibly used too much. If you see lots of disk I/O, buy more RUM (Simon Willnauer) so mmapped files don’t need to be paged in/out all the time, and finally: buy SSDs.

Happy mmapping!