The C10K problem (reposted translation)

 

It's time for web servers to handle ten thousand clients simultaneously, don't you think? After all, the web is a big place now.

Computers are powerful these days, too. For roughly $1200 you can buy a machine with a 1000MHz processor, 2 GB of RAM, and a 1000Mbit/sec network card. Let's see -- at 20000 clients, that's 50KHz, 100Kbytes, and 50Kbit/sec per client. It shouldn't take any more horsepower than that to read four kilobytes from disk and send them to the network once a second for each of those twenty thousand clients. So hardware is no longer the bottleneck. (That works out to $0.08 per client, by the way. Those $100/client licensing fees some operating systems charge are starting to look a little heavy!)

Back in 1999, one of the busiest ftp sites, cdrom.com, could handle only 10000 clients simultaneously despite having a gigabit of network bandwidth. By 2001 the same speed was being offered by several ISPs, who expected the trend to become more and more common as large business customers signed up.

And the thin client model seems to be coming back into fashion -- this time with the server running out on the Internet, serving thousands of clients.

With these considerations in mind, here are some notes on configuring operating systems and writing code to support thousands of network clients. The discussion centers on Unix-like operating systems, which are my personal area of interest, though Windows gets a bit of coverage too.

Contents

 

Related Sites

In October 2003, Felix von Leitner put together an excellent web page and presentation about network scalability, complete with benchmarks comparing different networking system calls and operating systems. One of his observations is that the 2.6 Linux kernel really does beat the 2.4 kernel, and there are many good graphs that should give OS developers food for thought.
(See also the Slashdot comments; it'll be interesting to see whether anyone does followup benchmarks improving on Felix's results.)

Book to Read First

If you haven't yet read W. Richard Stevens' Unix Network Programming, Volume 1, please get hold of a copy as soon as possible. It describes many of the I/O strategies used in writing high-performance servers and the pitfalls of each, and even covers the "thundering herd" problem. You should also read Jeff Darcy's notes on high-performance server design.
(Another book which might be more helpful for those who are *using* rather than *writing* a web server is Building Scalable Web Sites by Cal Henderson.)

I/O frameworks

Listed below are a few prepackaged libraries that abstract some of the common techniques described here, insulating your code from the operating system and making it more portable.

  • ACE, a heavyweight C++ I/O framework. It implements I/O strategies and a number of other useful things in an object-oriented way; in particular, its Reactor is an OO way of doing nonblocking I/O, and its Proactor is an OO way of doing asynchronous I/O.
  • ASIO, a C++ I/O framework that is becoming part of the Boost library; it's like ACE updated for the STL era.
  • libevent, a lightweight C I/O framework by Niels Provos. It supports kqueue and select, and poll and epoll support was on the way (and had already landed by the time this translation was made). I believe it is level-triggered only, which has both good and bad sides. Niels has a nice graph of the time needed to handle one event as a function of the number of connections, and it shows kqueue and sys_epoll as the clear winners. (A minimal sketch using libevent's classic API follows this list.)
  • My own attempts at lightweight frameworks (sadly, not kept up to date):
    • Poller is a lightweight C++ I/O framework that implements a level-triggered readiness API on top of whichever underlying readiness API you choose (poll, select, /dev/poll, kqueue, or sigio), which makes it handy for benchmarking the different APIs against one another. The documentation linked below illustrates how to use each of these readiness APIs.
    • rn is a lightweight C I/O framework, my second attempt after Poller. It is easy to use in commercial applications and in non-C++ applications, and it has shipped in several commercial products.
  • In April 2000, Matt Welsh wrote a paper on how to balance worker-thread and event-driven techniques when building scalable servers; the paper describes his Sandstorm I/O framework.
  • Cory Nelson's Scale! library - an asynchronous socket, file, and pipe I/O library for Windows.
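
To give a taste of how such a framework looks in use, here is a minimal sketch of a read callback written against libevent's classic 1.x-style API (event_set/event_add/event_dispatch). The helper names on_readable and watch_fd are made up for illustration, and newer libevent 2.x code would use event_base_new()/event_new() instead.

    /* Minimal sketch: a level-triggered read callback with the classic
     * libevent 1.x API.  One static watcher only, for illustration. */
    #include <event.h>
    #include <unistd.h>

    static void on_readable(int fd, short which, void *arg)
    {
        char buf[4096];
        ssize_t n = read(fd, buf, sizeof buf);
        if (n <= 0) {                          /* EOF or error */
            event_del((struct event *)arg);    /* stop watching */
            close(fd);
            return;
        }
        /* ... handle n bytes of data ... */
    }

    int watch_fd(int fd)
    {
        static struct event ev;                /* one watcher, illustration only */
        event_set(&ev, fd, EV_READ | EV_PERSIST, on_readable, &ev);
        return event_add(&ev, NULL);           /* NULL timeout = wait forever */
    }

    /* Somewhere in main(): event_init(); watch_fd(sock); event_dispatch(); */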

I/O Strategies

Designers of networking software have many options. Here are a few:

  • Whether and how to issue multiple I/O calls from a single thread:
    • Don't; use blocking/synchronous calls throughout, and possibly use multiple threads or processes to achieve concurrency.
    • Use nonblocking calls (e.g. write() on a socket set to O_NONBLOCK) to start I/O, and readiness notification (e.g. poll() or /dev/poll) to know when it's OK to start the next I/O on that channel. This is generally usable only with network I/O, not disk I/O.
    • Use asynchronous calls (e.g. aio_write()) to start I/O, and completion notification (e.g. signals or completion ports) to know when the I/O has finished. This works with both network and disk I/O.
  • How to control the code servicing each client:
    • one process for each client (the classic Unix approach, in use since 1980 or so)
    • one OS-level thread handles many clients; each client is controlled by:
      • a user-level thread (e.g. GNU state threads, classic Java with green threads)
      • a state machine (a bit esoteric, but popular in some circles; my favorite)
      • a continuation (a bit esoteric, but popular in some circles)
    • one OS-level thread for each client (e.g. classic Java with native threads)
    • one OS-level thread for each active client (e.g. Tomcat with an Apache front end; NT completion ports; thread pools)
  • Whether to use standard OS services, or to put some of the code into the kernel (e.g. in a custom driver, kernel module, or VxD).

The following five combinations seem to be the most popular:

  1. Serve many clients with each thread, and use nonblocking I/O and level-triggered readiness notification
  2. Serve many clients with each thread, and use nonblocking I/O and readiness change notification
  3. Serve many clients with each server thread, and use asynchronous I/O
  4. Serve one client with each server thread, and use blocking I/O
  5. Build the server code into the kernel

1. Serve many clients with each thread, and use nonblocking I/O and level-triggered readiness notification

... set the network handles to nonblocking mode, and use select() or poll() to tell which handle has data waiting. This is the traditional favorite. With this scheme, the kernel tells you whether a file descriptor is ready, regardless of whether you have done anything with that descriptor since the last time the kernel told you about it. (The name "level triggered" comes from computer hardware design; it's the opposite of "edge triggered". Jonathon Lemon introduced the two terms in his paper on kqueue().)

Note: it's particularly important to remember that readiness notification from the kernel is only a hint; the file descriptor might not actually be ready when you try to read from it. That's why you must use nonblocking mode when using readiness notification.
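
A minimal sketch of this strategy in C, assuming a listening socket listen_fd that is already bound and listening (error handling is trimmed, and FD_SETSIZE still caps the number of descriptors, as discussed below):

    #include <sys/select.h>
    #include <sys/socket.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <errno.h>

    static void set_nonblocking(int fd)
    {
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
    }

    void serve(int listen_fd)
    {
        fd_set watched;
        int maxfd = listen_fd;
        FD_ZERO(&watched);
        FD_SET(listen_fd, &watched);
        set_nonblocking(listen_fd);

        for (;;) {
            fd_set readable = watched;          /* select() modifies its argument */
            if (select(maxfd + 1, &readable, NULL, NULL, NULL) < 0)
                continue;                       /* e.g. EINTR */

            for (int fd = 0; fd <= maxfd; fd++) {
                if (!FD_ISSET(fd, &readable))
                    continue;
                if (fd == listen_fd) {          /* new connection */
                    int client = accept(listen_fd, NULL, NULL);
                    if (client >= 0) {
                        set_nonblocking(client);
                        FD_SET(client, &watched);
                        if (client > maxfd) maxfd = client;
                    }
                } else {                        /* readiness is only a hint */
                    char buf[4096];
                    ssize_t n = read(fd, buf, sizeof buf);
                    if (n > 0) {
                        /* ... process the data ... */
                    } else if (n == 0 || (errno != EAGAIN && errno != EWOULDBLOCK)) {
                        close(fd);              /* EOF or real error */
                        FD_CLR(fd, &watched);
                    }
                }
            }
        }
    }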

An important bottleneck with this method is that read() or sendfile() from disk blocks if the page is not in memory at the moment; setting nonblocking mode on a disk file descriptor has no effect. The same problem affects memory-mapped disk files. The first time a server needs disk I/O, its process blocks, all clients must wait, and that raw nonthreaded performance goes to waste.
This is what asynchronous I/O is for, but on systems that lack AIO, worker threads or worker processes that do the disk I/O can also get around the bottleneck. One approach is to use memory-mapped files: if mincore() indicates that I/O is needed, ask a worker thread to do the I/O, and continue handling network traffic. Jef Poskanzer mentions that Pai, Druschel, and Zwaenepoel's Flash web server uses this trick; they gave a talk about it at Usenix '99. It looks like mincore() is available in FreeBSD and Solaris, but it is not part of the Single Unix Specification. It became available in Linux as of kernel 2.3.51, thanks to Chuck Lever.

In November 2003 on the freebsd-hackers list, Vivek Pei reported very good results from using system-wide profiling of their Flash web server to attack its bottlenecks. One bottleneck they found was mincore() (guess that wasn't such a good idea after all); another was sendfile() blocking on disk access. They modified sendfile() to return something like EWOULDBLOCK when the page being fetched is not yet in memory, which improved performance. The end result of their optimizations is a SpecWeb99 score of about 800 on a 1GHz/1GB FreeBSD box, which is better than anything on file at spec.org.

There are several ways for a single thread to tell which of a set of nonblocking sockets are ready:

  • The traditional select() 
    Unfortunately, select() is limited to FD_SETSIZE handles. This limit is compiled into the standard library and user programs. (Some versions of the C library let you raise this limit at user-program compile time.)

    See Poller_select (cc, h) for an example of how to use select() interchangeably with other readiness notification schemes.

     

  • The traditional poll() 
    There is no hardcoded limit on the number of file descriptors poll() can handle, but it does get slow with thousands of them, since most of the descriptors are idle at any one time, and scanning through thousands of descriptors takes time.

    Some operating systems (e.g. Solaris 8) speed up poll() with a technique called poll hinting, which Niels Provos implemented and benchmarked in 1999.

    See Poller_poll (cc, h, benchmarks) for an example of how to use poll() interchangeably with other readiness notification schemes.

     

  • /dev/poll
    This is the recommended poll replacement for Solaris.

    The idea behind /dev/poll is to take advantage of the fact that poll() is usually called many times with the same arguments. With /dev/poll, you open /dev/poll to get a file descriptor, then write the file descriptors you care about to that descriptor; from then on, you simply read the set of currently ready file descriptors back from it.

    /dev/poll appeared quietly in Solaris 7 (see patchid 106541), but its first public appearance was in Solaris 8. With 750 clients, this has 10% of the overhead of poll().

    There have been various attempts to implement /dev/poll on Linux, but none of them measure up to epoll; using /dev/poll on Linux is not recommended.

    See Poller_devpoll (cc, h, benchmarks) for an example of how to use /dev/poll interchangeably with many other readiness notification schemes. (Caution - the example is for Linux /dev/poll, might not work right on Solaris.)

     

  • kqueue()
    This is the recommended poll replacement on FreeBSD (and, soon, NetBSD).

    kqueue() can provide either level-triggered or edge-triggered notification; see below.
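
Here is a minimal kqueue()/kevent() registration sketch (FreeBSD/NetBSD and friends); the helper names are just illustrative. EV_ADD alone gives level-triggered behavior, while adding EV_CLEAR makes the filter edge-triggered, as discussed in the next section.

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>

    int watch_read(int kq, int fd)
    {
        struct kevent change;
        EV_SET(&change, fd, EVFILT_READ, EV_ADD /* | EV_CLEAR */, 0, 0, NULL);
        return kevent(kq, &change, 1, NULL, 0, NULL);   /* register only */
    }

    int wait_for_events(int kq, struct kevent *events, int nevents)
    {
        /* Blocks until at least one descriptor is ready; ev.ident is the fd,
         * and ev.data is the number of bytes available for EVFILT_READ. */
        return kevent(kq, NULL, 0, events, nevents, NULL);
    }

    /* Usage: int kq = kqueue(); watch_read(kq, sock); then loop on wait_for_events(). */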

2. Serve many clients with each thread, and use nonblocking I/O and readiness change notification

Readiness change notification (or edge-triggered readiness notification) means you give the kernel a file descriptor, and later, when that descriptor transitions from not ready to ready, the kernel notifies you. After that, it assumes you know the descriptor is ready and will not send another readiness notification of that kind for it until you do something that causes the descriptor to become not ready again (e.g. until you hit EWOULDBLOCK on a send, recv, or accept call, or a send or recv transfers fewer bytes than requested).

When you use readiness change notification, you must be prepared for spurious events, since a common implementation is to signal readiness whenever any packet is received, regardless of whether the file descriptor was already ready.

This is the opposite of level-triggered readiness notification. It's a bit less forgiving of programming mistakes, since if you miss just one event, the connection that event was for gets stuck forever. Nevertheless, I have found that edge-triggered readiness notification made programming nonblocking clients with OpenSSL easier, so it's worth trying.

[Banga, Mogul, Drusha '99] describes this kind of model in detail.

There are several APIs that let an application receive "this file descriptor is now ready" notifications, among them kqueue() in edge-triggered mode and Linux's epoll.
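
For example, here is a minimal sketch of Linux's epoll used in edge-triggered mode (EPOLLIN | EPOLLET), assuming the sockets have already been set to O_NONBLOCK; note the drain-until-EAGAIN loop that the semantics described above require. Helper names are illustrative only.

    #include <sys/epoll.h>
    #include <unistd.h>
    #include <errno.h>

    int add_edge_triggered(int epfd, int fd)
    {
        struct epoll_event ev = {0};
        ev.events = EPOLLIN | EPOLLET;      /* readable, edge-triggered */
        ev.data.fd = fd;                    /* fd must already be nonblocking */
        return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
    }

    void event_loop(int epfd)
    {
        struct epoll_event events[64];
        for (;;) {
            int n = epoll_wait(epfd, events, 64, -1);
            for (int i = 0; i < n; i++) {
                int fd = events[i].data.fd;
                char buf[4096];
                for (;;) {                  /* read until the kernel says EAGAIN */
                    ssize_t r = read(fd, buf, sizeof buf);
                    if (r > 0) {
                        /* ... process r bytes ... */
                    } else if (r == 0) {
                        close(fd);          /* peer closed the connection */
                        break;
                    } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
                        break;              /* drained; wait for the next edge */
                    } else {
                        close(fd);          /* real error */
                        break;
                    }
                }
            }
        }
    }

    /* Usage: int epfd = epoll_create1(0); add_edge_triggered(epfd, sock); event_loop(epfd); */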

3. Serve many clients with each server thread, and use asynchronous I/O

This has not yet become popular on Unix, probably because few operating systems support asynchronous I/O, and possibly also because it requires rethinking your application. Under standard Unix, asynchronous I/O is provided by the "aio_" interface, which associates a signal and a value with each I/O operation. Signals and their values are queued and delivered efficiently to the user process. Asynchronous I/O is an extension of the POSIX 1003.1b realtime standard and is also part of the Single Unix Specification, version 2.

AIO is normally used with edge-triggered completion notification, i.e. a signal is queued when the operation is complete. (It can also be used with level-triggered completion notification by calling aio_suspend(), though I suspect few people do that.)
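
As an illustration, here is a minimal sketch of queueing an aio_read() whose completion is delivered as a queued realtime signal carrying a pointer to the control block. The helper names are made up; a real server would keep one aiocb per outstanding request and a sigwaitinfo() loop (or signal handler) to collect completions.

    #include <aio.h>
    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    static char buf[4096];
    static struct aiocb cb;

    int start_read(int fd)
    {
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof buf;
        cb.aio_offset = 0;
        cb.aio_sigevent.sigev_notify          = SIGEV_SIGNAL;
        cb.aio_sigevent.sigev_signo           = SIGRTMIN;   /* queued RT signal */
        cb.aio_sigevent.sigev_value.sival_ptr = &cb;        /* which request */
        return aio_read(&cb);                               /* returns immediately */
    }

    /* Called once the completion signal arrives (or from a sigwaitinfo() loop): */
    ssize_t finish_read(struct aiocb *req)
    {
        if (aio_error(req) != 0)
            return -1;                 /* still in progress, or failed */
        return aio_return(req);        /* bytes actually read */
    }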

glibc 2.1 and later provide a generic implementation written for standards compliance rather than performance.

Ben LaHaise's Linux AIO implementation was merged into kernel 2.5.32. It doesn't use kernel threads and has a very efficient underlying API, but it doesn't yet support sockets. (There is also an AIO patch for the 2.4 kernels, but the 2.5/2.6 implementation differs somewhat.) More information:

Suparna suggests having a look at the AIO API first.

Red Hat AS and SUSE SLES both provide a high-performance implementation on the 2.4 kernel; it is similar to, but not quite the same as, the 2.6 kernel implementation.

In February 2006, a new attempt was made at network AIO; see Evgeniy Polyakov's kevent-based AIO.

In 1999, SGI implemented high-speed AIO for Linux. As of version 1.1, it is said to work well with both disk I/O and network sockets, and it uses kernel threads. It is still useful for people who can't wait for Ben's AIO to support sockets.

The O'Reilly book "POSIX.4: Programming for the Real World" gives a good introduction to aio.

A tutorial for the earlier, nonstandard aio implementation is available here; it's worth a look, but keep in mind you'll need to mentally convert "aioread" to "aio_read".

Note that AIO doesn't provide a way to open files without blocking on disk I/O; if you care about the sleep caused by opening a disk file, Linus suggests calling open() in a separate thread rather than wishing for an aio_open() system call.

Under Windows, asynchronous I/O is associated with the terms "Overlapped I/O" and "IOCP" (I/O Completion Port). Microsoft's IOCP combines techniques from the prior art like asynchronous I/O (like aio_write) and queued completion notification (like when using the aio_sigevent field with aio_write) with a new idea of holding back some requests to try to keep the number of running threads associated with a single IOCP constant. For more information, see Inside I/O Completion Ports by Mark Russinovich at sysinternals.com, Jeffrey Richter's book "Programming Server-Side Applications for Microsoft Windows 2000" (Amazon, MSPress), U.S. patent #06223207, or MSDN.
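
A rough Win32 sketch of the pattern, assuming sockets created in overlapped mode; this is only an outline of the queue-plus-worker-threads idea described above, not production code, and the type and function names beyond the Win32 API itself are made up.

    #include <winsock2.h>
    #include <windows.h>

    typedef struct {
        OVERLAPPED ov;          /* must be first so we can cast it back */
        WSABUF     wsabuf;
        char       buf[4096];
    } PerIoData;

    HANDLE create_port(void)
    {
        /* 0 = let Windows pick the concurrency (number of CPUs). */
        return CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
    }

    BOOL attach_socket(HANDLE port, SOCKET s)
    {
        /* The completion key here is simply the socket itself. */
        return CreateIoCompletionPort((HANDLE)s, port, (ULONG_PTR)s, 0) != NULL;
    }

    DWORD WINAPI worker(LPVOID arg)
    {
        HANDLE port = (HANDLE)arg;
        for (;;) {
            DWORD      nbytes;
            ULONG_PTR  key;
            OVERLAPPED *ov;
            if (!GetQueuedCompletionStatus(port, &nbytes, &key, &ov, INFINITE))
                continue;                       /* failed I/O or timeout */
            PerIoData *io = (PerIoData *)ov;    /* recover per-operation state */
            /* ... handle nbytes of data for socket (SOCKET)key, then post the
             * next WSARecv() using &io->ov as the OVERLAPPED ... */
            (void)io; (void)nbytes;
        }
        return 0;
    }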

4. Serve one client with each server thread, and use blocking I/O

... let read() and write() block. The drawback is that each client needs a full stack, which costs memory, and many operating systems still have trouble handling more than a few hundred threads. If each thread gets a 2MB stack, you run out of the 1GB of user-accessible virtual memory on a 32-bit machine at 512 threads (2^30 / 2^21 = 512); this applies to Linux as normally run on x86. You can reduce the per-thread stack size, but since most thread libraries cannot grow a thread's stack once it has been created, doing so means designing your program to minimize its stack use. You can also move to a 64-bit processor.

The thread libraries in Linux, FreeBSD, and Solaris keep improving, and 64-bit processors are starting to reach mainstream users. Perhaps in the not-too-distant future, those who prefer one thread per client will be able to serve 10000 clients that way. For now, though, if you want to support more clients, you're better off using one of the other approaches.
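
A minimal sketch of this model with POSIX threads, using detached threads and an explicit (small) stack size to keep the per-client memory cost down; listen_fd is assumed to be a socket that is already listening, and the echo loop is only a stand-in for real per-client work.

    #include <pthread.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static void *handle_client(void *arg)
    {
        int fd = (int)(long)arg;
        char buf[4096];
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0)   /* plain blocking I/O */
            (void)write(fd, buf, n);                  /* ... echo, for example ... */
        close(fd);
        return NULL;
    }

    void serve(int listen_fd)
    {
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
        pthread_attr_setstacksize(&attr, 64 * 1024);  /* keep per-client cost low */

        for (;;) {
            int client = accept(listen_fd, NULL, NULL);
            if (client < 0)
                continue;
            pthread_t tid;
            if (pthread_create(&tid, &attr, handle_client, (void *)(long)client) != 0)
                close(client);                        /* couldn't spawn; drop client */
        }
    }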

For an unabashedly pro-thread viewpoint, see Why Events Are A Bad Idea (for High-concurrency Servers) by von Behren, Condit, and Brewer, UCB, presented at HotOS IX. Anyone from the anti-thread camp care to point out a paper that rebuts this one? :-)

LinuxThreads

LinuxThreads is the name of the standard Linux thread library. It has been integrated into glibc since glibc 2.0 and is mostly POSIX-compliant, but it falls a bit short on performance and signal support.

NGPT: Next Generation Posix Threads for Linux

NGPT is a project started by IBM to bring better POSIX-compliant threading support to Linux. It is now at stable version 2.2 and works well... but the NGPT team has announced that they are putting the NGPT codebase into support-only mode, because they feel this is the best way to support the community for the long term. The NGPT team will continue to work on improving Linux thread support, but now focused on NPTL. (Kudos to the NGPT team for their good work and the graceful way they conceded to NPTL.)

NPTL: Native Posix Thread Library for Linux

NPTL is a project by Ulrich Drepper (the main maintainer of glibc) and Ingo Molnar to bring world-class POSIX threading support to Linux.

As of October 5, 2003, NPTL has been merged into the glibc CVS tree as an add-on directory (just like linuxthreads), so it will very likely ship with the next release of glibc.

Red Hat 9 was the first distribution to include NPTL (which was a bit inconvenient for some users, but somebody had to break the ice...).

NPTL links:

Here is my attempt at describing the history of NPTL (see also Jerry Cooperstein's article):

In March 2002, Bill Abt of the NGPT team, glibc maintainer Ulrich Drepper, and others met to figure out what to do about LinuxThreads. One idea that came out of the meeting was to improve mutex performance; Rusty Russell et al. subsequently implemented fast userspace mutexes (futexes), which are now used by both NGPT and NPTL. Most of the participants figured NGPT should be merged into glibc.

Ulrich Drepper, though, didn't much like NGPT, and figured he could do better. (For anyone who has ever tried to contribute a patch to glibc, this may not come as a surprise :-) Over the next few months, Ulrich Drepper, Ingo Molnar, and others worked on glibc and kernel changes that became the Native Posix Threads Library (NPTL). NPTL uses all the kernel enhancements designed for NGPT, plus a few new ones. Ingo Molnar described the kernel enhancements as follows:

NPTL uses three kernel features introduced by NGPT: getpid() returning the PID, CLONE_THREAD, and futexes; NPTL also uses (and relies on) a much wider set of new kernel features developed as part of this project.

Some of the items NGPT introduced into the kernel were also modified, cleaned up, and extended, such as thread-group handling (CLONE_THREAD). [The CLONE_THREAD changes which impacted NGPT's compatibility got synced with the NGPT folks, to make sure NGPT does not break in any unacceptable way.]

The kernel features developed for and used by NPTL are described in the design whitepaper, http://people.redhat.com/drepper/nptl-design.pdf ...

A short list: TLS support, various clone extensions (CLONE_SETTLS, CLONE_SETTID, CLONE_CLEARTID), POSIX thread-signal handling, sys_exit() extension (release TID futex upon VM-release), the sys_exit_group() system-call, sys_execve() enhancements and support for detached threads.

There was also work put into extending the PID space - eg. procfs crashed due to 64K PID assumptions, max_pid, and pid allocation scalability work. Plus a number of performance-only improvements were done as well.

In essence the new features are a no-compromises approach to 1:1 threading - the kernel now helps in everything where it can improve threading, and we precisely do the minimally necessary set of context switches and kernel calls for every basic threading primitive.

One big difference between NGPT and NPTL is that NPTL is a 1:1 threading model, while NGPT is an M:N model (see below). In spite of this, Ulrich's initial benchmarks seem to show that NPTL really is much faster than NGPT. (The NGPT team is looking forward to seeing Ulrich's benchmark code to verify the result.)

FreeBSD threading support

FreeBSD supports both LinuxThreads and a userspace threading library. In addition, an M:N implementation called KSE was introduced in FreeBSD 5.0. For details, see www.unobvious.com/bsd/freebsd-threads.html.

On March 25, 2003, Jeff Roberson posted on freebsd-arch:

... Thanks to the foundation provided by Julian, David Xu, Mini, Dan Eischen, and everyone else who has participated in KSE and libpthread development, Mini and I have developed a 1:1 threading implementation that can run in parallel with KSE without interfering with it in any way. It actually helps bring M:N threading closer by testing out shared bits. ...

And in July 2006, Robert Watson proposed that the 1:1 threading implementation become the default in FreeBSD 7.x:

I know this has been discussed before, but I think it should be revisited as 7.x moves forward. In many common application workloads and specific benchmarks, libthr performs noticeably better than libpthread. libthr is also implemented on a larger number of our platforms, while libpthread works on only a few. And the main reason is that the first recommendation we make to MySQL and other heavy thread users is "switch to libthr", which is suggestive, also! ... So the strawman proposal is: make libthr the default threading library on 7.x.

NetBSD threading support

According to Noriyuki Soda:

A kernel-supported M:N thread library based on the Scheduler Activations model was merged into NetBSD-current on January 18, 2003.

For details, see An Implementation of Scheduler Activations on the NetBSD Operating System, presented by Nathan J. Williams, Wasabi Systems, Inc., at FREENIX '02.

Solaris threading support

Solaris threading support is still evolving... From Solaris 2 to Solaris 8 the default thread library used an M:N model, but Solaris 9 switched to 1:1 threading by default. See Sun's multithreaded programming guide and Sun's note about Java and Solaris threading.

Java threading support in JDK 1.3.x and earlier

As everyone knows, up through JDK 1.3.x Java offered no way to handle network connections other than one thread per client. Volanomark is a nice microbenchmark that measures message throughput per second at various numbers of simultaneous connections. As of May 2003, JDK 1.3 implementations could actually handle ten thousand simultaneous connections, but with severely degraded performance. Table 4 shows which JVMs can handle 10000 connections, and how performance suffers as the number of connections grows.

Note: 1:1 threading vs. M:N threading

There is a choice when implementing a thread library: you can either put all the threading support in the kernel (the 1:1 threading model), or you can move a fair bit of it into userspace (the M:N threading model). At one point M:N was thought to offer better performance, but it is so hard to get right that most people have moved away from it.

5. Build the server code into the kernel

Novell and Microsoft are both said to have done this at various times, at least one NFS implementation does it, khttpd does it for Linux and static web pages, and Ingo Molnar's "TUX" (Threaded linUX webserver) is a fast and scalable kernel-space HTTP server for Linux. Ingo announced on September 1, 2000 that an alpha version of TUX could be downloaded from ftp://ftp.redhat.com/pub/redhat/tux, and explained how to join the mailing list for more information.
The linux-kernel mailing list has discussed the pros and cons of this approach; the consensus seems to be that instead of moving web servers into the kernel, the kernel should gain only the smallest possible hooks needed to improve web server performance, so that other kinds of servers can benefit too. See, for example, Zach Brown's remarks comparing userland and kernel http servers. The 2.4 Linux kernel apparently gives user programs enough power already: the X15 server runs about as fast as TUX, without any kernel modifications.

 

Comments

Richard Gooch has written a paper discussing I/O options.

In 2001, Tim Brecht and Michal Ostrowski measured various strategies for simple select-based servers; the data is worth a look.

In 2003, Tim Brecht posted the source code for userver, a small web server assembled from several servers written by Abhishek Chandra, David Mosberger, David Pariag, and Michal Ostrowski. It can use select(), poll(), epoll(), or sigio.

Back in March 1999, Dean Gaudet posted:

I keep being asked, "Why don't you use a select/event-based model? It's clearly the fastest." ...

The reasoning was "it's too hard to understand, and the payoff isn't clear", but a few months later, as the model became better understood, people became willing to use it.

Mark Russinovich wrote an editorial and an article discussing I/O strategy issues in the 2.2 Linux kernel. It's worth reading, even though it seems misinformed in places. In particular, he seems to think that Linux 2.2's asynchronous I/O (see F_SETSIG above) doesn't notify the user process when data is ready, only when new connections arrive, which seems like a bizarre misunderstanding. See also the comments on an earlier draft, Ingo Molnar's rebuttal of April 30, 1999, Russinovich's comments of May 2, 1999, a rebuttal from Alan Cox, and various posts to linux-kernel. I suspect he was trying to say that Linux doesn't support asynchronous disk I/O, which used to be true, but now that SGI has implemented KAIO, it's no longer true.

See the pages at sysinternals.com and MSDN for information on "completion ports", said to be unique to NT. In a nutshell, win32's "overlapped I/O" turned out to be too low-level to be convenient, and a "completion port" is a wrapper that provides a queue of completion events, plus scheduling magic that tries to keep the number of running threads constant by allowing more threads to pick up completion events if other threads that had picked up completion events from this port are sleeping (perhaps doing blocking I/O).

See also OS/400's support for I/O completion ports.

Back in September 1999, there was a very interesting discussion on the linux-kernel mailing list titled "15,000 Simultaneous Connections" (which continued into a second week). Highlights:

  • Ed Hall posted a few notes on his own experience: he has achieved >1000 connects per second on a UP P2/333 running Solaris. His code uses a small pool of threads (1 or 2 per CPU), each of which manages a large number of client connections using an event model.
  • Mike Jagdis posted an analysis of poll/select overhead, and said "The current select/poll implementation can be improved significantly, especially in the blocking case, but the overhead will still increase with the number of descriptors because select/poll does not, and cannot, remember what descriptors are interesting. This would be easy to fix with a new API. Suggestions are welcome..."
  • Mike posted about his work on improving select() and poll().
  • Mike posted a bit about a possible API to replace poll()/select(): "How about a 'device like' API where you write 'pollfd like' structs, the 'device' listens for events and delivers 'pollfd like' structs representing them when you read it? ... "
  • Rogier Wolff suggested using "the API that the digital guys suggested", http://www.cs.rice.edu/~gaurav/papers/usenix99.ps
  • Joerg Pommnitz pointed out that any new API along these lines should be able to wait for not just file descriptor events, but also signals and maybe SYSV-IPC. Our synchronization primitives should certainly be able to do what Win32's WaitForMultipleObjects can, at least.
  • Stephen Tweedie asserted that the combination of F_SETSIG, queued realtime signals, and sigwaitinfo() was a superset of the API proposed in http://www.cs.rice.edu/~gaurav/papers/usenix99.ps. He also mentions that you keep the signal blocked at all times if you're interested in performance; instead of the signal being delivered asynchronously, the process grabs the next one from the queue with sigwaitinfo().
  • Jayson Nordwick compared completion ports with the F_SETSIG synchronous event model, and concluded they're pretty similar.
  • Alan Cox noted that an older rev of SCT's SIGIO patch is included in 2.3.18ac.
  • Jordan Mendelson posted some example code showing how to use F_SETSIG.
  • Stephen C. Tweedie continued the comparison of completion ports and F_SETSIG, and noted: "With a signal dequeuing mechanism, your application is going to get signals destined for various library components if libraries are using the same mechanism," but the library can set up its own signal handler, so this shouldn't affect the program (much).
  • Doug Royer noted that he'd gotten 100,000 connections on Solaris 2.6 while he was working on the Sun calendar server. Others chimed in with estimates of how much RAM that would require on Linux, and what bottlenecks would be hit.

Interesting reading!

 

Limits on open filehandles

  • Any Unix: the limits set by ulimit or setrlimit. (A setrlimit() sketch appears after this list.)
  • Solaris: see the Solaris FAQ, question 3.46 (or thereabouts; they renumber the questions periodically).
  • FreeBSD:

    Edit /boot/loader.conf, add the line
    set kern.maxfiles=XXXX
    where XXXX is the desired system limit on file descriptors, and reboot. Thanks to an anonymous reader, who wrote in to say he'd achieved far more than 10000 connections on FreeBSD 4.3, and says
    "FWIW: You can't actually tune the maximum number of connections in FreeBSD trivially, via sysctl.... You have to do it in the /boot/loader.conf file. 
    The reason for this is that the zalloci() calls for initializing the sockets and tcpcb structures zones occurs very early in system startup, in order that the zone be both type stable and that it be swappable. 
    You will also need to set the number of mbufs much higher, since you will (on an unmodified kernel) chew up one mbuf per connection for tcptempl structures, which are used to implement keepalive."
    Another reader says
    "As of FreeBSD 4.4, the tcptempl structure is no longer allocated; you no longer have to worry about one mbuf being chewed up per connection."
    See also:
  • OpenBSD: A reader says
    "In OpenBSD, an additional tweak is required to increase the number of open filehandles available per process: the openfiles-cur parameter in  /etc/login.conf needs to be increased. You can change kern.maxfiles either with sysctl -w or in sysctl.conf but it has no effect. This matters because as shipped, the login.conf limits are a quite low 64 for nonprivileged processes, 128 for privileged."
  • Linux: See Bodo Bauer's /proc documentation. On 2.4 kernels:
    echo 32768 > /proc/sys/fs/file-max 
    increases the system limit on open files, and
    ulimit -n 32768
    increases the current process' limit.

    On 2.2.x kernels,

    echo 32768 > /proc/sys/fs/file-max
    echo 65536 > /proc/sys/fs/inode-max 
    increases the system limit on open files, and
    ulimit -n 32768
    increases the current process' limit.

    I verified that a process on Red Hat 6.0 (2.2.5 or so plus patches) can open at least 31000 file descriptors this way. Another fellow has verified that a process on 2.2.12 can open at least 90000 file descriptors this way (with appropriate limits). The upper bound seems to be available memory. 
    Stephen C. Tweedie posted about how to set ulimit limits globally or per-user at boot time using initscript and pam_limit. 
    In older 2.2 kernels, though, the number of open files per process is still limited to 1024, even with the above changes. 
    See also Oskar's 1998 post, which talks about the per-process and system-wide limits on file descriptors in the 2.0.36 kernel.
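
As mentioned in the ulimit/setrlimit item above, a server can also raise its own soft descriptor limit at startup, up to whatever hard or system-wide limit the settings above allow; a minimal sketch:

    /* Minimal sketch: raise this process's open-file limit to its hard limit.
     * The hard limit itself can only be raised by root (or via the system
     * settings described above). */
    #include <sys/resource.h>
    #include <stdio.h>

    int raise_fd_limit(void)
    {
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
            return -1;
        rl.rlim_cur = rl.rlim_max;             /* soft limit up to the hard limit */
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
            return -1;
        printf("open file limit now %ld\n", (long)rl.rlim_cur);
        return 0;
    }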

Limits on threads

On any architecture, you may need to reduce the amount of stack space allocated for each thread to avoid running out of virtual memory. You can set this at runtime with pthread_attr_init() if you're using pthreads.

  • Solaris: it supports as many threads as will fit in memory, I hear.
  • Linux 2.6 kernels with NPTL: /proc/sys/vm/max_map_count may need to be increased to go above 32000 or so threads. (You'll need to use very small stack threads to get anywhere near that number of threads, though, unless you're on a 64 bit processor.) See the NPTL mailing list, e.g. the thread with subject "Cannot create more than 32K threads?", for more info.
  • Linux 2.4: /proc/sys/kernel/threads-max is the max number of threads; it defaults to 2047 on my Red Hat 8 system. You can set increase this as usual by echoing new values into that file, e.g. "echo 4000 > /proc/sys/kernel/threads-max"
  • Linux 2.2: Even the 2.2.13 kernel limits the number of threads, at least on Intel. I don't know what the limits are on other architectures. Mingo posted a patch for 2.1.131 on Intel that removed this limit. It appears to be integrated into 2.3.20.

    See also Volano's detailed instructions for raising file, thread, and FD_SET limits in the 2.2 kernel. Wow. This document steps you through a lot of stuff that would be hard to figure out yourself, but is somewhat dated.

  • Java: See Volano's detailed benchmark info, plus their info on how to tune various systems to handle lots of threads.

Java issues

Up through JDK 1.3, Java's standard networking libraries mostly offered the one-thread-per-client model. There was a way to do nonblocking reads, but no way to do nonblocking writes.

In May 2001, JDK 1.4 introduced the package java.nio to provide full support for nonblocking I/O (and some other goodies). See the release notes for some caveats. Try it out and give Sun feedback!

HP's java also includes a Thread Polling API.

In 2000, Matt Welsh implemented nonblocking sockets for Java; his performance benchmarks show that they have advantages over blocking sockets in servers handling many (up to 10000) connections. His class library is called java-nbio; it's part of the Sandstorm project. Benchmarks showing performance with 10000 connections are available.

See also Dean Gaudet's essay on the subject of Java, network I/O, and threads, and the paper by Matt Welsh on events vs. worker threads.

Before NIO, there were several proposals for improving Java's networking APIs:

  • Matt Welsh's Jaguar system proposes preserialized objects, new Java bytecodes, and memory management changes to allow the use of asynchronous I/O with Java.
  • Interfacing Java to the Virtual Interface Architecture, by C-C. Chang and T. von Eicken, proposes memory management changes to allow the use of asynchronous I/O with Java.
  • JSR-51 was the Sun project that came up with the java.nio package. Matt Welsh participated (who says Sun doesn't listen?).

Other tips

  • Zero-Copy
    Normally, data gets copied many times on its way from here to there. Any scheme that eliminates these copies to the bare physical minimum is called "zero-copy".
    • Thomas Ogrisegg's zero-copy send patch for mmaped files under Linux 2.4.17-2.4.20. Claims it's faster than sendfile().
    • IO-Lite is a proposal for a set of I/O primitives that gets rid of the need for many copies.
    • Alan Cox noted that zero-copy is sometimes not worth the trouble back in 1999. (He did like sendfile(), though.)
    • Ingo implemented a form of zero-copy TCP in the 2.4 kernel for TUX 1.0 in July 2000, and says he'll make it available to userspace soon.
    • Drew Gallatin and Robert Picco have added some zero-copy features to FreeBSD; the idea seems to be that if you call write() or read() on a socket, the pointer is page-aligned, and the amount of data transferred is at least a page, *and* you don't immediately reuse the buffer, memory management tricks will be used to avoid copies. But see followups to this message on linux-kernel for people's misgivings about the speed of those memory management tricks.

      According to a note from Noriyuki Soda:

      Sending side zero-copy is supported since NetBSD-1.6 release by specifying "SOSEND_LOAN" kernel option. This option is now default on NetBSD-current (you can disable this feature by specifying "SOSEND_NO_LOAN" in the kernel option on NetBSD_current). With this feature, zero-copy is automatically enabled, if data more than 4096 bytes are specified as data to be sent.
    • The sendfile() system call can implement zero-copy networking.
      The sendfile() function in Linux and FreeBSD lets you tell the kernel to send part or all of a file. This lets the OS do it as efficiently as possible. It can be used equally well in servers using threads or servers using nonblocking I/O. (In Linux, it's poorly documented at the moment; use _syscall4 to call it. Andi Kleen is writing new man pages that cover this. See also Exploring The sendfile System Call by Jeff Tranter in Linux Gazette issue 91.) Rumor has it, ftp.cdrom.com benefitted noticeably from sendfile().

      A zero-copy implementation of sendfile() is on its way for the 2.4 kernel. See LWN Jan 25 2001.

      One developer using sendfile() with Freebsd reports that using POLLWRBAND instead of POLLOUT makes a big difference.

      Solaris 8 (as of the July 2001 update) has a new system call 'sendfilev'. A copy of the man page is here.. The Solaris 8 7/01 release notes also mention it. I suspect that this will be most useful when sending to a socket in blocking mode; it'd be a bit of a pain to use with a nonblocking socket.

  • Avoid small frames by using writev (or TCP_CORK)
    A new socket option under Linux, TCP_CORK, tells the kernel to avoid sending partial frames, which helps a bit e.g. when there are lots of little write() calls you can't bundle together for some reason. Unsetting the option flushes the buffer. Better to use writev(), though... (A sendfile()/TCP_CORK sketch appears after this list.)

    See LWN Jan 25 2001 for a summary of some very interesting discussions on linux-kernel about TCP_CORK and a possible alternative MSG_MORE.

  • Behave sensibly on overload.
    [Provos, Lever, and Tweedie 2000] notes that dropping incoming connections when the server is overloaded improved the shape of the performance curve, and reduced the overall error rate. They used a smoothed version of "number of clients with I/O ready" as a measure of overload. This technique should be easily applicable to servers written with select, poll, or any system call that returns a count of readiness events per call (e.g. /dev/poll or sigtimedwait4()).
  • Some programs can benefit from using non-Posix threads.
    Not all threads are created equal. The clone() function in Linux (and its friends in other operating systems) lets you create a thread that has its own current working directory, for instance, which can be very helpful when implementing an ftp server. See Hoser FTPd for an example of the use of native threads rather than pthreads.
  • Caching your own data can sometimes be a win.
    "Re: fix for hybrid server problems" by Vivek Sadananda Pai (vivek@cs.rice.edu) on new-httpd, May 9th, states:

    "I've compared the raw performance of a select-based server with a multiple-process server on both FreeBSD and Solaris/x86. On microbenchmarks, there's only a marginal difference in performance stemming from the software architecture. The big performance win for select-based servers stems from doing application-level caching. While multiple-process servers can do it at a higher cost, it's harder to get the same benefits on real workloads (vs microbenchmarks). I'll be presenting those measurements as part of a paper that'll appear at the next Usenix conference. If you've got postscript, the paper is available athttp://www.cs.rice.edu/~vivek/flash99/"

Other limits

  • Old system libraries might use 16 bit variables to hold file handles, which causes trouble above 32767 handles. glibc2.1 should be ok.
  • Many systems use 16 bit variables to hold process or thread id's. It would be interesting to port the Volano scalability benchmark to C, and see what the upper limit on number of threads is for the various operating systems.
  • Too much thread-local memory is preallocated by some operating systems; if each thread gets 1MB, and total VM space is 2GB, that creates an upper limit of 2000 threads.
  • Look at the performance comparison graph at the bottom of http://www.acme.com/software/thttpd/benchmarks.html. Notice how various servers have trouble above 128 connections, even on Solaris 2.6? Anyone who figures out why, let me know. 
    Note: if the TCP stack has a bug that causes a short (200ms) delay at SYN or FIN time, as Linux 2.2.0-2.2.6 had, and the OS or http daemon has a hard limit on the number of connections open, you would expect exactly this behavior. There may be other causes.

Kernel Issues

For Linux, it looks like kernel bottlenecks are being fixed constantly. See Linux Weekly News, Kernel Traffic, the Linux-Kernel mailing list, and my Mindcraft Redux page.

In March 1999, Microsoft sponsored a benchmark comparing NT to Linux at serving large numbers of http and smb clients, in which they failed to see good results from Linux. See also my article on Mindcraft's April 1999 Benchmarks for more info.

See also The Linux Scalability Project. They're doing interesting work, including Niels Provos' hinting poll patch, and some work on the thundering herd problem.

See also Mike Jagdis' work on improving select() and poll(); here's Mike's post about it.

Mohit Aron (aron@cs.rice.edu) writes that rate-based clocking in TCP can improve HTTP response time over 'slow' connections by 80%.

Measuring Server Performance

Two tests in particular are simple, interesting, and hard:

  1. raw connections per second (how many 512 byte files per second can you serve?)
  2. total transfer rate on large files with many slow clients (how many 28.8k modem clients can simultaneously download from your server before performance goes to pot?)

Jef Poskanzer has published benchmarks comparing many web servers. See http://www.acme.com/software/thttpd/benchmarks.html for his results.

I also have a few old notes about comparing thttpd to Apache that may be of interest to beginners.

Chuck Lever keeps reminding us about Banga and Druschel's paper on web server benchmarking. It's worth a read.

IBM has an excellent paper titled Java server benchmarks [Baylor et al, 2000]. It's worth a read.

Examples

Interesting select()-based servers

Interesting /dev/poll-based servers

  • N. Provos, C. Lever, "Scalable Network I/O in Linux," May, 2000. [FREENIX track, Proc. USENIX 2000, San Diego, California (June, 2000).] Describes a version of thttpd modified to support /dev/poll. Performance is compared with phhttpd.

Interesting kqueue()-based servers

Interesting realtime signal-based servers

  • Chromium's X15. This uses the 2.4 kernel's SIGIO feature together with sendfile() and TCP_CORK, and reportedly achieves higher speed than even TUX. The source is available under a community source (not open source) license. See the original announcement by Fabio Riccardi.
  • Zach Brown's phhttpd - "a quick web server that was written to showcase the sigio/siginfo event model. consider this code highly experimental and yourself highly mental if you try and use it in a production environment." Uses the siginfo features of 2.3.21 or later, and includes the needed patches for earlier kernels. Rumored to be even faster than khttpd. See his post of 31 May 1999 for some notes.

Interesting thread-based servers

Interesting in-kernel servers

Other interesting links

 


Changelog

$Log: c10k.html,v $
Revision 1.212  2006/09/02 14:52:13  dank
added asio
Revision 1.211  2006/07/27 10:28:58  dank
Link to Cal Henderson's book.
Revision 1.210  2006/07/27 10:18:58  dank
Listify polyakov links, add Drepper's new proposal, note that FreeBSD 7 might move to 1:1
Revision 1.209  2006/07/13 15:07:03  dank
link to Scale! library, updated Polyakov links
Revision 1.208  2006/07/13 14:50:29  dank
Link to Polyakov's patches
Revision 1.207  2003/11/03 08:09:39  dank
Link to Linus's message deprecating the idea of aio_open
Revision 1.206  2003/11/03 07:44:34  dank
link to userver
Revision 1.205  2003/11/03 06:55:26  dank
Link to Vivek Pei's new Flash paper, mention great specweb99 score

Copyright 1999-2006 Dan Kegel
dank@kegel.com
Last updated: 2 Sept 2006
[Return to www.kegel.com]
