http://www.cppblog.com/isware/archive/2011/07/19/151390.html
http://pl.atyp.us/content/tech/servers.html
This article shares some of the experience I have accumulated over many years of server development. The "servers" discussed here are, more precisely, service programs that handle large numbers of discrete messages or requests per second; network servers most commonly fit that description, but not every network program is a server in the strict sense. Since "high-performance request-handling program" is a terrible title, for simplicity I will just say "server" below.

This article is not about mildly parallel applications; handling several tasks at once within a single program is commonplace now. Your browser, for example, probably does some things in parallel, but that kind of parallelism is not much of a challenge. The real challenge appears when the server's own architecture limits performance, and you have to improve the architecture to improve the system. For a browser running with gigabytes of memory on a gigahertz CPU, a few concurrent downloads over a DSL line is not that kind of challenge. The focus here is not on applications that sip through a straw but on those that drink from a firehose, operating right at the edge of what the hardware can do, where how you do it really matters.

Some people will inevitably question some of my views and suggestions, or think they have a better way; that cannot be avoided. I am not trying to play God here; these are simply approaches that have worked for me, not only for performance but also for keeping code easy to debug and to extend. They may work differently for you, and if something else suits you better, that is fine. But be warned: nearly every alternative to the suggestions made here is something I have tried, with discouraging results. Your own clever idea may well have fared better in those experiments, but if you encourage me to recount them here, the innocent readers will only be annoyed. You wouldn't want to hurt them, would you?

The rest of this article covers the four biggest killers of server performance:

1) Data copies
2) Context switches
3) Memory allocation
4) Lock contention

Some other important factors will be mentioned at the end of the article, but these four are the main ones. If your server can handle most requests with no data copies, no context switches, no memory allocation and no lock contention, I can promise that its performance will be excellent.
This section will be short, because most people have already absorbed the lesson about data copies. Everybody knows that copying data is bad; it seems obvious, but only because you learned it very early in your career, and you learned it because people started spreading the word decades ago. That was certainly true for me. Nowadays it is covered in every university course and in every how-to document; even the marketing brochures have turned "zero copy" into a buzzword.

Even though the harm of data copies is obvious, people still overlook them, because the code that does the copying is often hidden and disguised. Do you know whether the libraries or drivers you call copy data? The answer is often "more than you think". Guess what "programmed I/O" on a PC actually refers to. A hash function is an example of a disguised copy: it has all the memory-access cost of a copy plus extra computation. Once hashing is pointed out to be effectively "copying plus", it seems obvious that it should be avoided, but I know of some very smart people who had to learn that the hard way. If you really want to get rid of data copies, whether because they are hurting your server's performance or because you want to show off "zero copy" techniques at a hacker conference, you will have to track down every place a copy can happen yourself instead of trusting the advertising.

One way to avoid data copies is to pass around buffer descriptors (or descriptors for buffer chains) instead of plain buffer pointers. Each buffer descriptor should consist of the following:

- a pointer to the buffer and the length of the whole buffer;
- a pointer to the live data inside the buffer and its length, or an offset and length;
- pointers to other buffer descriptors, forming a doubly linked list;
- a reference count.

Now, instead of copying data in memory, code can simply increment the reference count on the appropriate descriptor. This works extremely well under some conditions, including the way a typical network protocol stack operates, but it can also become a serious headache. Generally speaking, adding buffers at the beginning or end of a chain is easy, as is taking a reference on a whole buffer or freeing an entire chain at once. Adding a buffer in the middle, freeing buffers one by one, or taking a reference to part of a buffer is harder, and trying to split or combine chains will drive you insane.

I do not recommend using this technique everywhere, because walking a descriptor chain every time you want to find a particular piece of data can be even worse than a copy. The technique is best applied to the large data blocks in a program: allocate those separately, as described above, so they never need to be copied and never get in the way of the rest of the server. (Copying large blocks burns CPU time and slows down the other threads running concurrently.)

The last point about data copies: do not go to extremes avoiding them. I have seen far too much code that avoids a copy by doing something even worse, such as forcing a context switch or breaking up a large I/O request. Data copies are expensive, but avoiding them is subject to diminishing returns; doubling the complexity of your code just to remove the last few copies is usually time better spent elsewhere.
Compared with the obvious cost of data copies, a surprising number of people completely ignore the effect of context switches on performance. In my experience, context switches, not data copies, are the real killer behind most meltdowns under heavy load: the system spends more time switching between threads than it spends inside any thread doing useful work. The surprising thing is that, at one level, the cause is entirely obvious. The number one cause of context switches is having more active threads than processors. As the ratio of active threads to processors grows, the number of context switches grows too, linearly if you are lucky but more often exponentially. This simple fact explains why thread-per-connection designs scale so poorly. The only realistic option for a scalable system is to limit the number of active threads to no more than the number of processors. One variant of this approach is to use only a single thread ever; while that avoids context thrashing and the need for locks, it also cannot exploit more than one processor's worth of total throughput, so unless the program is non-CPU-bound (usually network-I/O-bound) the more practical approach should be used instead.

The first thing a program with a modest number of threads has to work out is how to make one thread handle multiple connections. This usually means a front end based on select/poll, asynchronous I/O, signals or completion ports, with an event-driven framework behind it. Many arguments have been had over which front-end API is best; Dan Kegel's C10K paper is a good reference in this area. Personally I think select/poll and signals are ugly hacks and lean toward AIO or completion ports, but in practice it does not matter much. Except perhaps for select(), they all work reasonably well, and none of them does much about what happens beyond the outermost layer of the front end.

The simplest conceptual model of a multi-threaded event-driven server has a request queue at its center: client requests are read by one or more listener threads and put on the queue, and one or more worker threads take them off and process them. Conceptually this is a fine model, and many people really do implement their code this way. What is the problem? The number two cause of context switches is moving the handling of a request from one thread to another. Some people even make the original thread send the reply, which adds insult to injury: now every request costs at least two context switches. It is very important to use a "symmetric" approach that avoids the handoff from listener thread to worker thread and back again. Whether that means partitioning connections among threads, or having all threads take turns acting as listener for every connection, matters far less.

Even an instant into the future, there is no way to know how many threads will be active in the server. After all, a request can arrive on any connection at any moment, and "background" threads dedicated to special tasks can wake up at any time. If you do not know how many threads are active, how can you limit their number? In my experience, one of the simplest approaches is also one of the most effective: use an old-fashioned counting semaphore that each thread must hold while it does real work. If the limit has already been reached, a thread in listen mode may incur one extra context switch as it wakes up (because a connection arrived) and then finds the semaphore full and blocks on it; but once all listen-mode threads are blocked this way they stop competing for resources until one of the running threads releases the semaphore, so the effect on the system is negligible. More importantly, this method keeps the threads that sleep most of the time from occupying a slot in the active-thread count, which is more elegant than the alternatives.
Once request processing has been split into two stages (listen and work), it is natural to split it into even more stages (more threads) later. In the simplest case a request completes one stage and then the next (the reply, say). In practice things are more complicated: a stage might represent a fork between two different execution paths, or it might simply produce a reply itself (for example returning a cached value). Each stage therefore needs to know what should happen next; there are three possibilities, indicated by the return value of the stage's dispatch function:

- the request needs to be passed on to another stage (return a descriptor or pointer identifying it);
- the request is complete (return "done");
- the request is blocked (return "request blocked"); as in the previous case, it stays put until another thread frees the resource it needs.

Note that in this model a request is queued within a stage, in a single thread, rather than being handed between two threads. This avoids constantly putting the request on the next stage's queue and then immediately taking it off again to run it: a lot of queue activity, and locking, for nothing.

This way of breaking a complex task into smaller cooperating pieces may look familiar; that is because it is very old. My approach has its roots in the "Communicating Sequential Processes" (CSP) concept described by C.A.R. Hoare in 1978, whose foundations go back to Per Brinch Hansen and Matthew Conway in 1963, before I was born! However, when Hoare coined the term CSP he meant "process" in an abstract mathematical sense, and a CSP process has no necessary relation to the operating-system entity of the same name. In my view, the common way of implementing CSP, as thread-like cooperating coroutines inside a single OS thread, gives you the headaches of concurrency while giving up the scalability.
A practical example of the staged-execution idea evolving in a saner direction is Matt Welsh's SEDA. SEDA is such a good example of "server architecture done right" that its characteristics are worth commenting on:

1. SEDA's batching tends to emphasize one stage processing multiple requests, while my approach tends to emphasize one request being processed across multiple stages.
2. In my view, SEDA's one significant flaw is that it gives each stage its own thread pool, with threads reassigned between stages only "in the background" in response to load; as a result, the context switches caused by reasons #1 and #2 above are still plentiful.
3. SEDA in Java was useful as a pure research project, but in practical settings I think it is rarely the approach chosen.
Allocating and freeing memory are among the most common operations in an application, so many clever tricks have been invented to make allocation more efficient. But no amount of cleverness makes up for the fact that, in many situations, a general-purpose allocator is simply inefficient. So here are three suggestions for reducing trips to the system allocator.

Suggestion one is preallocation. We all know that limiting a program's functionality through static allocation is poor design, but there are many other forms of preallocation that are perfectly good. The usual reason is that one trip through the system allocator beats several, even if some memory is wasted along the way. If you can establish that the program will use at most a certain number of items, preallocating them at startup is a reasonable choice. Even when you cannot, preallocating everything a request handler might need up front beats allocating each piece only when it is needed. Allocating several items contiguously in one trip through the allocator can also greatly reduce error-handling code. When memory is tight, preallocation may not be an option, but in all but the most extreme environments it is a clear win.

Suggestion two is to use a lookaside list for objects that are frequently allocated and freed. The basic idea is to put recently freed objects on a list instead of actually freeing them; when one is needed again soon, it is taken off the list rather than allocated from the system. An extra benefit of lookaside lists is that they can skip the initialization and cleanup of complex objects.

It is usually a bad idea to let a lookaside list grow without bound, never freeing anything even when the program is idle, so some kind of periodic "sweep" of inactive objects is necessary, without introducing complicated locking or contention. A sound compromise is a lookaside list that actually consists of two independently locked lists: a "new" list and an "old" list. Allocation prefers the new list and falls back to the old one; objects are always freed onto the new list. The sweeper thread works like this:

1. Lock both lists.
2. Save the head of the old list.
3. Make the (formerly) new list the old list, emptying the new list.
4. Unlock.
5. At leisure, free every object on the list whose head was saved in step 2.

In a system like this, objects are only freed when they have truly gone unused; the freeing is delayed by at least one sweep interval (the period between sweeper runs), but usually by no more than two. The sweeper does its freeing without contending with ordinary threads for locks. In theory the same approach can be extended to more than two stages, but I have not yet seen it done.

One concern with lookaside lists is that keeping an object on the list requires a list pointer (a list node), which can increase memory use. Even when that is the case, the savings more than make up for the extra memory.

The third suggestion involves locks, which we have not discussed yet, but I will toss it in anyway. Even with lookaside lists, lock contention is often the biggest cost in memory allocation. The answer is to keep thread-private lookaside lists, so that threads never contend with one another. Going further, one list per processor is even better, though that only helps with non-preemptive threading. If you want to go all the way, a private lookaside list can even be combined with a shared list.
Efficient locking schemes are notoriously hard to design; I call them Scylla and Charybdis (see the appendix). On one side, simplistic (coarse-grained) locking serializes work that could proceed in parallel, hurting concurrency and scalability; on the other, complicated (fine-grained) locking erodes performance through the space the locks occupy and the time the lock operations take. Lean toward coarse-grained locks and you get deadlocks; lean toward fine-grained locks and you get races. Between the two there is a narrow channel that is both correct and efficient, but where is it?

Because locking tends to be bound up with program logic, it is essentially impossible to design a good locking scheme without changing how the program works. That is why people hate locks, and why they make excuses for their non-scalable single-threaded designs.

Almost every locking scheme starts as "one big lock around everything" plus the hope that performance will not suffer. When the hope is dashed, as it almost always is, the big lock is split into smaller ones and we pray again, and then the whole process repeats (many small locks split into even smaller ones), presumably until performance is acceptable. Each iteration typically adds 20-50% more complexity and locking overhead in exchange for a 5-10% reduction in lock contention. The end result, with luck, is still a modest gain, but real regressions are not unusual. The designer is left baffled: "I made the locks fine-grained just like the books said, so why is performance still terrible?"

In my experience, that approach is wrong at its foundation. Picture the solution space as a mountain range, with the good solutions as peaks and the bad ones as valleys. A climber who starts from the "big lock" is cut off from the peaks by all manner of valleys, gullies, lesser hills and dead ends; it is a classic bad way to climb, and from such a starting point it would be easier to go downhill than to reach the summit. So what is the right way to the top?

The first thing to do is draw a map of your program's locking, with two axes:

- The vertical axis of the map is code. If you are using a staged architecture with the branches straightened out (the request stages described earlier), you probably already have such a diagram, like the OSI seven-layer protocol diagrams everyone has seen.
- The horizontal axis is data sets. Every stage of a request should have a data set that belongs to it.

You now have a grid in which each cell represents a particular data set needed in a particular stage. The most important rule to observe is this: two requests should never contend unless they need the same data set in the same stage. If you can stick to that rule strictly, you have already won half the battle.

Once the grid is defined, every type of lock in your system can be identified. Your next goal is to make sure those locks are spread as evenly as possible along both axes; this part is entirely application-specific. You have to work like a diamond cutter, using your knowledge of the program to find the natural "cleavage lines" between request stages and data sets. Sometimes they are easy to find; sometimes they are hard, and you have to keep going back over the design to spot them. Splitting code into stages is a complicated matter of program design and I have no good advice to offer there, but for defining data sets here are a few suggestions:

- If you can number requests sequentially, hash them, or associate them with a transaction ID, those values partition the data well.
- Sometimes it is better to assign requests to data sets dynamically, based on which data set has the most resources available, rather than by some intrinsic property of the request, much as the multiple integer units of a modern CPU divide work among themselves.
- It is very useful to make the data set each stage uses different from the others, so that data contended for in one stage is guaranteed not to be contended for in another.

If you have divided your "locking space" (that is, the distribution of locks) both vertically and horizontally, and made sure the locks are spread evenly across the grid, congratulations: you have a good scheme. You are now at a good climbing spot; to keep the metaphor, there is a gentle slope in front of you leading toward the summit, but you are not there yet. Now is the time to collect statistics on lock contention and see what to improve: split stages and data sets in different ways, measure again, and repeat until you are satisfied. Do that, and a boundless view will open up beneath your feet.
I have now covered the four main influences on performance. There are still a few other important points, most of which come down to your platform or environment:

- How does your storage subsystem perform with large versus small reads and writes, and with random versus sequential access? How well does it do read-ahead and write-behind?
- How efficient is the network protocol you are using? Can performance be improved by tuning parameters? Are there facilities such as TCP_CORK, MSG_PUSH or Nagle toggling that can be used to avoid sending tiny messages?
- Does your system support scatter/gather I/O (for example readv/writev)? Using it can improve performance and also avoid the pain of buffer chains (see the discussion of data copies in the first section). (Translator's note: a DMA transfer requires the source and destination physical addresses of each block to be contiguous. On some architectures, such as IA, memory that is contiguous in virtual address space is not necessarily physically contiguous, so a transfer may have to be split into several pieces. With block DMA, the device raises an interrupt after each physically contiguous block and the host sets up the next one. With scatter/gather, a list describing the physically non-contiguous memory is handed to the DMA master, which walks the list, transfers block after block, and raises a single interrupt at the end; scatter/gather is therefore clearly more efficient than block DMA.)
- What is your system's page size? What is the cache size? Is it worth aligning to those boundaries? How expensive are system calls and context switches?
- Do your lock primitives suffer from starvation? Does your event mechanism have "thundering herd" problems? Does your wakeup/sleep mechanism have the unfortunate behavior that when X wakes Y, the context switches to Y immediately even though X still has work to finish?

I have considered many such points here, and I am sure you have too. In a particular situation it may not be worth doing anything about a given one, but it is usually worth at least thinking about it. If the answers are not in the system manuals, go and find them: write a test program to measure them; writing such test code is good training in itself. If your code has to run on several platforms, abstract the relevant pieces into a per-platform library, and on the platform that supports a given feature you will have gained a head start.

Know the "why" of your own code as well: work out what its important high-level operations are and what they cost under different conditions. This is not traditional profiling; it is about design, not about a particular implementation. Low-level optimization is always the last resort of a botched design.
(Translator's note: the following passage is not in the original; it is the translator's own addition about the translation.)

[Appendix: Odysseus (also rendered "Odyssey") was, in myth, the king of the island of Ithaca and the hero of the two great epics, the Iliad and the Odyssey (Greek history from roughly the 11th to the 9th century BC is called the "Homeric age"; the Homeric epics, comprising the Iliad and the Odyssey, are a celebrated masterpiece of the ancient world). Odysseus fought in the famous Trojan War, where he was known for his bravery and cunning; to win the war he devised the famous Trojan Horse (which in the West later became a byword for a gift sent to destroy the recipient). After the fall of Troy he endured many dangers on the voyage home, recounted in Homer's Odyssey; the episode of Scylla and Charybdis is among the most terrifying.

According to legend, Scylla and Charybdis were a female monster and a whirlpool of Greek myth. Scylla lived in a cave in the strait between Italy and Sicily, with Charybdis dwelling directly opposite, and together they preyed on all who sailed past. Homer describes Scylla as having twelve misshapen legs and six snake-like necks, each bearing a dreadful head with a gaping mouth and three rows of teeth, ready to crush her prey. Charybdis formed a great whirlpool beneath a cliff, surging up three times a day amid towering waves and spray, and swallowing every ship passing by as the water fell back. When Odysseus's ship approached Charybdis, the sea boiled like a pot on a fire, throwing up sheets of white spray; when the tide withdrew, the water churned, the roar was deafening, and the dark, muddy floor of the cavern lay exposed. While the crew stared at this dreadful sight and the helmsman carefully steered left around the whirlpool, Scylla suddenly appeared and snatched six of the companions; Odysseus watched them writhe in the monster's jaws, struggling with hands and feet, before they were chewed to pieces. The rest barely escaped through the perilous passage between Charybdis and Scylla and, after many further trials, finally reached home on Ithaca.

The story is widely told among linguists and translators. The noted Soviet translation scholar Barkhudarov once compared "Scylla and Charybdis" to literal versus free translation: "Figuratively speaking, the translator must always tack between literal and free translation, as between Scylla and Charybdis, seeking between those two shores a channel that is narrow yet deep enough to reach the ideal destination: maximally equivalent translation."

The German linguist Humboldt said something similar: "I am convinced that any translation is without doubt an attempt to solve an impossible task, for every translator is bound to founder on one of two rocks: either he keeps too exactly to the original at the expense of his own language's character, or he keeps too closely to his own language's character at the expense of the original. Anything in between is not merely hard to achieve but simply impossible."

For a long time it was held that translation had to choose one of two extremes: word-for-word (literal) translation, or free translation, the Scylla and Charybdis of the craft. Today "Scylla and Charybdis" has become a byword for a double danger, the monster and the whirlpool; to be "between Scylla and Charybdis" is to be threatened from both sides, beset by peril, and here it stands for how hard it is for a translator to steer between literal and free translation.]
The purpose of this document is to share some ideas that I've developed over the years about how to develop a certain kind of application for which the term "server" is only a weak approximation. More accurately, I'll be writing about a broad class of programs that are designed to handle very large numbers of discrete messages or requests per second. Network servers most commonly fit this definition, but not all programs that do are really servers in any sense of the word. For the sake of simplicity, though, and because "High-Performance Request-Handling Programs" is a really lousy title, we'll just say "server" and be done with it.
I will not be writing about "mildly parallel" applications, even though multitasking within a single program is now commonplace. The browser you're using to read this probably does some things in parallel, but such low levels of parallelism really don't introduce many interesting challenges. The interesting challenges occur when the request-handling infrastructure itself is the limiting factor on overall performance, so that improving the infrastructure actually improves performance. That's not often the case for a browser running on a gigahertz processor with a gigabyte of memory doing six simultaneous downloads over a DSL line. The focus here is not on applications that sip through a straw but on those that drink from a firehose, on the very edge of hardware capabilities where how you do it really does matter.
Some people will inevitably take issue with some of my comments and suggestions, or think they have an even better way. Fine. I'm not trying to be the Voice of God here; these are just methods that I've found to work for me, not only in terms of their effects on performance but also in terms of their effects on the difficulty of debugging or extending code later. Your mileage may vary. If something else works better for you that's great, but be warned that almost everything I suggest here exists as an alternative to something else that I tried once only to be disgusted or horrified by the results. Your pet idea might very well feature prominently in one of these stories, and innocent readers might be bored to death if you encourage me to start telling them. You wouldn't want to hurt them, would you?
The rest of this article is going to be centered around what I'll call the Four Horsemen of Poor Performance:

1. Data copies
2. Context switches
3. Memory allocation
4. Lock contention
There will also be a catch-all section at the end, but these are the biggest performance-killers. If you can handle most requests without copying data, without a context switch, without going through the memory allocator and without contending for locks, you'll have a server that performs well even if it gets some of the minor parts wrong.
This could be a very short section, for one very simple reason: most people have learned this lesson already. Everybody knows data copies are bad; it's obvious, right? Well, actually, it probably only seems obvious because you learned it very early in your computing career, and that only happened because somebody started putting out the word decades ago. I know that's true for me, but I digress. Nowadays it's covered in every school curriculum and in every informal how-to. Even the marketing types have figured out that "zero copy" is a good buzzword.
Despite the after-the-fact obviousness of copies being bad, though, there still seem to be nuances that people miss. The most important of these is that data copies are often hidden and disguised. Do you really know whether any code you call in drivers or libraries does data copies? It's probably more than you think. Guess what "Programmed I/O" on a PC refers to. An example of a copy that's disguised rather than hidden is a hash function, which has all the memory-access cost of a copy and also involves more computation. Once it's pointed out that hashing is effectively "copying plus" it seems obvious that it should be avoided, but I know at least one group of brilliant people who had to figure it out the hard way. If you really want to get rid of data copies, either because they really are hurting performance or because you want to put "zero-copy operation" on your hacker-conference slides, you'll need to track down a lot of things that really are data copies but don't advertise themselves as such.
The tried and true method for avoiding data copies is to use indirection, and pass buffer descriptors (or chains of buffer descriptors) around instead of mere buffer pointers. Each descriptor typically consists of the following:

- a pointer to, and the length of, the whole buffer;
- a pointer to, and the length of, the part of the buffer that's actually filled (or an offset and length);
- forward and back pointers to other buffer descriptors, forming a doubly linked chain;
- a reference count.
Now, instead of copying a piece of data to make sure it stays in memory, code can simply increment a reference count on the appropriate buffer descriptor. This can work extremely well under some conditions, including the way that a typical network protocol stack operates, but it can also become a really big headache. Generally speaking, it's easy to add buffers at the beginning or end of a chain, to add references to whole buffers, and to deallocate a whole chain at once. Adding in the middle, deallocating piece by piece, or referring to partial buffers will each make life increasingly difficult. Trying to split or combine buffers will simply drive you insane.
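To make the descriptor idea concrete, here is a minimal C sketch of such a structure; the field and function names are invented for illustration, and a real implementation would make the reference count atomic or protect it with whatever lock already guards the chain.

```c
#include <stddef.h>
#include <stdlib.h>

/* Sketch of a reference-counted buffer descriptor; names are illustrative. */
struct buf_desc {
    unsigned char   *base;        /* start of the underlying buffer          */
    size_t           cap;         /* total size of the underlying buffer     */
    unsigned char   *data;        /* start of the valid ("live") bytes       */
    size_t           len;         /* number of valid bytes                   */
    struct buf_desc *prev, *next; /* doubly linked chain                     */
    int              refcnt;      /* reference count; make atomic or guard
                                     with the chain's lock in real code      */
};

/* Take an extra reference instead of copying the payload. */
static struct buf_desc *buf_ref(struct buf_desc *b)
{
    b->refcnt++;
    return b;
}

/* Drop a reference; free the descriptor and its storage on the last one. */
static void buf_unref(struct buf_desc *b)
{
    if (--b->refcnt == 0) {
        free(b->base);            /* assumes base was heap-allocated */
        free(b);
    }
}
```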
I don't actually recommend using this approach for everything, though. Why not? Because it gets to be a huge pain when you have to walk through descriptor chains every time you want to look at a header field. There really are worse things than data copies. I find that the best thing to do is to identify the large objects in a program, such as data blocks, make sure those get allocated separately as described above so that they don't need to be copied, and not sweat too much about the other stuff.
This brings me to my last point about data copies: don't go overboard avoiding them. I've seen way too much code that avoids data copies by doing something even worse, like forcing a context switch or breaking up a large I/O request. Data copies are expensive, and when you're looking for places to avoid redundant operations they're one of the first things you should look at, but there is a point of diminishing returns. Combing through code and then making it twice as complicated just to get rid of that last few data copies is usually a waste of time that could be better spent in other ways.
Whereas everyone thinks it's obvious that data copies are bad, I'm often surprised by how many people totally ignore the effect of context switches on performance. In my experience, context switches are actually behind more total "meltdowns" at high load than data copies; the system starts spending more time going from one thread to another than it actually spends within any thread doing useful work. The amazing thing is that, at one level, it's totally obvious what causes excessive context switching. The #1 cause of context switches is having more active threads than you have processors. As the ratio of active threads to processors increases, the number of context switches also increases - linearly if you're lucky, but often exponentially. This very simple fact explains why multi-threaded designs that have one thread per connection scale very poorly. The only realistic alternative for a scalable system is to limit the number of active threads so it's (usually) less than or equal to the number of processors. One popular variant of this approach is to use only one thread, ever; while such an approach does avoid context thrashing, and avoids the need for locking as well, it is also incapable of achieving more than one processor's worth of total throughput and thus remains beneath contempt unless the program will be non-CPU-bound (usually network-I/O-bound) anyway.
The first thing that a "thread-frugal" program has to do is figure out how it's going to make one thread handle multiple connections at once. This usually implies a front end that uses select/poll, asynchronous I/O, signals or completion ports, with an event-driven structure behind that. Many "religious wars" have been fought, and continue to be fought, over which of the various front-end APIs is best. Dan Kegel's C10K paper is a good resource in this area. Personally, I think all flavors of select/poll and signals are ugly hacks, and therefore favor either AIO or completion ports, but it actually doesn't matter that much. They all - except maybe select() - work reasonably well, and don't really do much to address the matter of what happens past the very outermost layer of your program's front end.
The simplest conceptual model of a multi-threaded event-driven server has a queue at its center; requests are read by one or more "listener" threads and put on queues, from which one or more "worker" threads will remove and process them. Conceptually, this is a good model, but all too often people actually implement their code this way. Why is this wrong? Because the #2 cause of context switches is transferring work from one thread to another. Some people even compound the error by requiring that the response to a request be sent by the original thread - guaranteeing not one but two context switches per request. It's very important to use a "symmetric" approach in which a given thread can go from being a listener to a worker to a listener again without ever changing context. Whether this involves partitioning connections between threads or having all threads take turns being listener for the entire set of connections seems to matter a lot less.
Usually, it's not possible to know how many threads will be active even one instant into the future. After all, requests can come in on any connection at any moment, or "background" threads dedicated to various maintenance tasks could pick that moment to wake up. If you don't know how many threads are active, how can you limit how many are active? In my experience, one of the most effective approaches is also one of the simplest: use an old-fashioned counting semaphore which each thread must hold whenever it's doing "real work". If the thread limit has already been reached then each listen-mode thread might incur one extra context switch as it wakes up and then blocks on the semaphore, but once all listen-mode threads have blocked in this way they won't continue contending for resources until one of the existing threads "retires" so the system effect is negligible. More importantly, this method handles maintenance threads - which sleep most of the time and therefore don't count against the active thread count - more gracefully than most alternatives.
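A hedged sketch of that semaphore-based throttle, using POSIX semaphores; the limit and the wrapper function are illustrative assumptions rather than anything prescribed here.

```c
#include <semaphore.h>

#define MAX_ACTIVE 4                /* would normally be the CPU count */

static sem_t active_slots;          /* initialised once at startup */

void init_throttle(void)
{
    sem_init(&active_slots, 0, MAX_ACTIVE);
}

/* Each listener/worker wraps its "real work" like this.  If MAX_ACTIVE
 * threads are already busy, the caller blocks here (one extra context
 * switch) and stops competing for the CPU until a slot frees up. */
void handle_request(void (*work)(void *), void *arg)
{
    sem_wait(&active_slots);        /* acquire an activity slot */
    work(arg);                      /* do the actual request processing */
    sem_post(&active_slots);        /* retire: wake one blocked thread */
}
```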
Once the processing of requests has been broken up into two stages (listener and worker) with multiple threads to service the stages, it's natural to break up the processing even further into more than two stages. In its simplest form, processing a request thus becomes a matter of invoking stages successively in one direction, and then in the other (for replies). However, things can get more complicated; a stage might represent a "fork" between two processing paths which involve different stages, or it might generate a reply (e.g. a cached value) itself without invoking further stages. Therefore, each stage needs to be able to specify "what should happen next" for a request. There are three possibilities, represented by return values from the stage's dispatch function:

- the request needs to be passed on to another stage (return a descriptor or pointer identifying it);
- the request has been completed (return "done");
- the request is blocked (return "request blocked"); as in the previous case, the request stays where it is until another thread frees the resource it needs.
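One possible shape for such a dispatch function, sketched in C with invented names; the point is only that the return value tells the caller whether to keep driving the request in the same thread, stop, or park it.

```c
/* Hedged sketch; the types and names are illustrative, not from the article. */
struct request;                       /* opaque request being processed */
struct stage;

enum stage_result {
    STAGE_NEXT,     /* request should be passed to another stage           */
    STAGE_DONE,     /* request is finished                                 */
    STAGE_BLOCKED   /* request must wait for a resource; resumed elsewhere */
};

struct stage {
    /* Returns what should happen next; on STAGE_NEXT it also fills in
     * *next_stage so the same thread can keep going without a handoff. */
    enum stage_result (*dispatch)(struct stage *self, struct request *req,
                                  struct stage **next_stage);
};

/* Drive a request through as many stages as possible in one thread. */
void run_request(struct stage *s, struct request *req)
{
    struct stage *next;
    for (;;) {
        switch (s->dispatch(s, req, &next)) {
        case STAGE_NEXT:    s = next; break;   /* stay in this thread       */
        case STAGE_DONE:    return;
        case STAGE_BLOCKED: return;            /* requeued when unblocked   */
        }
    }
}
```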
Note that, in this model, queuing of requests is done within stages, not between stages. This avoids the common silliness of constantly putting a request on a successor stage's queue, then immediately invoking that successor stage and dequeuing the request again; I call that lots of queue activity - and locking - for nothing.
If this idea of separating a complex task into multiple smaller communicating parts seems familiar, that's because it's actually very old. My approach has its roots in the Communicating Sequential Processes concept elucidated by C.A.R. Hoare in 1978, based in turn on ideas from Per Brinch Hansen and Matthew Conway going back to 1963 - before I was born! However, when Hoare coined the term CSP he meant "process" in the abstract mathematical sense, and a CSP process need bear no relation to the operating-system entities of the same name. In my opinion, the common approach of implementing CSP via thread-like coroutines within a single OS thread gives the user all of the headaches of concurrency with none of the scalability.
A contemporary example of the staged-execution idea evolved in a saner direction is Matt Welsh's SEDA. In fact, SEDA is such a good example of "server architecture done right" that it's worth commenting on some of its specific characteristics (especially where those differ from what I've outlined above):

1. SEDA's batching tends to emphasize processing multiple requests within a stage, while my approach tends to emphasize processing a single request through multiple stages.
2. In my opinion, SEDA's one significant flaw is that it allocates a separate thread pool to each stage, with threads reallocated between stages only "in the background" in response to load; as a result, the #1 and #2 causes of context switches noted above are still very much present.
3. SEDA was useful as a pure research project in Java, but in practical settings it is rarely the approach people choose.
Allocating and freeing memory is one of the most common operations in many applications. Accordingly, many clever tricks have been developed to make general-purpose memory allocators more efficient. However, no amount of cleverness can make up for the fact that the very generality of such allocators inevitably makes them far less efficient than the alternatives in many cases. I therefore have three suggestions for how to avoid the system memory allocator altogether.
Suggestion #1 is simple preallocation. We all know that static allocation is bad when it imposes artificial limits on program functionality, but there are many other forms of preallocation that can be quite beneficial. Usually the reason comes down to the fact that one trip through the system memory allocator is better than several, even when some memory is "wasted" in the process. Thus, if it's possible to assert that no more than N items could ever be in use at once, preallocation at program startup might be a valid choice. Even when that's not the case, preallocating everything that a request handler might need right at the beginning might be preferable to allocating each piece as it's needed; aside from the possibility of allocating multiple items contiguously in one trip through the system allocator, this often greatly simplifies error-recovery code. If memory is very tight then preallocation might not be an option, but in all but the most extreme circumstances it generally turns out to be a net win.
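As a small illustration of the preallocation idea, here is a hedged C sketch that grabs a fixed pool of request contexts in one trip through the allocator at startup; the structure layout and pool size are assumptions, and a production version would keep a free list instead of scanning.

```c
#include <stdlib.h>

#define MAX_REQUESTS 1024           /* assumed upper bound on concurrent requests */

/* Illustrative per-request context. */
struct req_ctx {
    char  header[8192];
    char *body;
    int   in_use;
};

static struct req_ctx *pool;

/* One trip through the allocator at startup instead of one per request;
 * if this fails we can bail out before accepting any connections, which
 * keeps the per-request error handling trivial. */
int preallocate_contexts(void)
{
    pool = calloc(MAX_REQUESTS, sizeof *pool);
    return pool ? 0 : -1;
}

struct req_ctx *acquire_context(void)
{
    for (int i = 0; i < MAX_REQUESTS; i++)      /* naive scan; a real server
                                                   would keep a free list   */
        if (!pool[i].in_use) { pool[i].in_use = 1; return &pool[i]; }
    return NULL;    /* at capacity: caller rejects or queues the request */
}
```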
Suggestion #2 is to use lookaside lists for objects that are allocated and freed frequently. The basic idea is to put recently-freed objects onto a list instead of actually freeing them, in the hope that if they're needed again soon they need merely be taken off the list instead of being allocated from system memory. As an additional benefit, transitions to/from a lookaside list can often be implemented to skip complex object initialization/finalization.
It's generally undesirable to have lookaside lists grow without bound, never actually freeing anything even when your program is idle. Therefore, it's usually necessary to have some sort of periodic "sweeper" task to free inactive objects, but it would also be undesirable if the sweeper introduced undue locking complexity or contention. A good compromise is therefore a system in which a lookaside list actually consists of separately locked "old" and "new" lists. Allocation is done preferentially from the new list, then from the old list, and from the system only as a last resort; objects are always freed onto the new list. The sweeper thread operates as follows:

1. Lock both lists.
2. Save the head of the old list.
3. Make the (previously) new list the old list, emptying the new list.
4. Unlock.
5. At leisure, free everything on the list whose head was saved in step 2.
Objects in this sort of system are only actually freed when they have not been needed for at least one full sweeper interval, but always less than two. Most importantly, the sweeper does most of its work without holding any locks to contend with regular threads. In theory, the same approach can be generalized to more than two stages, but I have yet to find that useful.
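A minimal sketch of the two-list lookaside scheme in C, assuming fixed-size objects and inventing all names; error handling and object initialization are omitted.

```c
#include <pthread.h>
#include <stdlib.h>

#define OBJECT_SIZE 256                      /* assumed fixed object size */

struct node { struct node *next; /* payload follows the header */ };

/* "New" and "old" lists with independent locks, as described above. */
static struct node *new_head, *old_head;
static pthread_mutex_t new_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t old_lock = PTHREAD_MUTEX_INITIALIZER;

void lookaside_free(struct node *n)          /* always free onto "new" */
{
    pthread_mutex_lock(&new_lock);
    n->next = new_head;
    new_head = n;
    pthread_mutex_unlock(&new_lock);
}

struct node *lookaside_alloc(void)           /* prefer "new", then "old" */
{
    struct node *n;
    pthread_mutex_lock(&new_lock);
    if ((n = new_head)) new_head = n->next;
    pthread_mutex_unlock(&new_lock);
    if (n) return n;
    pthread_mutex_lock(&old_lock);
    if ((n = old_head)) old_head = n->next;
    pthread_mutex_unlock(&old_lock);
    return n ? n : malloc(OBJECT_SIZE);      /* system allocator last */
}

void sweeper_pass(void)                      /* runs once per sweep interval */
{
    pthread_mutex_lock(&new_lock);           /* 1. lock both lists          */
    pthread_mutex_lock(&old_lock);
    struct node *stale = old_head;           /* 2. save old list head       */
    old_head = new_head;                     /* 3. new list becomes old     */
    new_head = NULL;
    pthread_mutex_unlock(&old_lock);         /* 4. unlock                   */
    pthread_mutex_unlock(&new_lock);
    while (stale) {                          /* 5. free at leisure, with    */
        struct node *next = stale->next;     /*    no locks held            */
        free(stale);
        stale = next;
    }
}
```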
One concern with using lookaside lists is that the list pointers might increase object size. In my experience, most of the objects that I'd use lookaside lists for already contain list pointers anyway, so it's kind of a moot point. Even if the pointers were only needed for the lookaside lists, though, the savings in terms of avoided trips through the system memory allocator (and object initialization) would more than make up for the extra memory.
Suggestion #3 actually has to do with locking, which we haven't discussed yet, but I'll toss it in anyway. Lock contention is often the biggest cost in allocating memory, even when lookaside lists are in use. One solution is to maintain multiple private lookaside lists, such that there's absolutely no possibility of contention for any one list. For example, you could have a separate lookaside list for each thread. One list per processor can be even better, due to cache-warmth considerations, but only works if threads cannot be preempted. The private lookaside lists can even be combined with a shared list if necessary, to create a system with extremely low allocation overhead.
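A sketch of the per-thread variant using C11 thread-local storage; it assumes all objects are the same size (at least as large as a list node) and falls back to the system allocator on a miss.

```c
#include <stdlib.h>

struct node { struct node *next; };

/* Per-thread private free list; no locking needed because only the
 * owning thread ever touches it. */
static _Thread_local struct node *private_free;

void *fast_alloc(size_t size)       /* size assumed constant across calls */
{
    struct node *n = private_free;
    if (n) {                        /* hit: no lock, no system call */
        private_free = n->next;
        return n;
    }
    return malloc(size);            /* miss: shared/system path */
}

void fast_free(void *p)
{
    struct node *n = p;
    n->next = private_free;
    private_free = n;
}
```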
Efficient locking schemes are notoriously hard to design, because of what I call Scylla and Charybdis after the monsters in the Odyssey. Scylla is locking that's too simplistic and/or coarse-grained, serializing activities that can or should proceed in parallel and thus sacrificing performance and scalability; Charybdis is overly complex or fine-grained locking, with space for locks and time for lock operations again sapping performance. Near Scylla are shoals representing deadlock and livelock conditions; near Charybdis are shoals representing race conditions. In between, there's a narrow channel that represents locking which is both efficient and correct...or is there? Since locking tends to be deeply tied to program logic, it's often impossible to design a good locking scheme without fundamentally changing how the program works. This is why people hate locking, and try to rationalize their use of non-scalable single-threaded approaches.
Almost every locking scheme starts off as "one big lock around everything" and a vague hope that performance won't suck. When that hope is dashed, and it almost always is, the big lock is broken up into smaller ones and the prayer is repeated, and then the whole process is repeated, presumably until performance is adequate. Often, though, each iteration increases complexity and locking overhead by 20-50% in return for a 5-10% decrease in lock contention. With luck, the net result is still a modest increase in performance, but actual decreases are not uncommon. The designer is left scratching his head (I use "his" because I'm a guy myself; get over it). "I made the locks finer grained like all the textbooks said I should," he thinks, "so why did performance get worse?"
In my opinion, things got worse because the aforementioned approach is fundamentally misguided. Imagine the "solution space" as a mountain range, with high points representing good solutions and low points representing bad ones. The problem is that the "one big lock" starting point is almost always separated from the higher peaks by all manner of valleys, saddles, lesser peaks and dead ends. It's a classic hill-climbing problem; trying to get from such a starting point to the higher peaks only by taking small steps and never going downhill almost never works. What's needed is a fundamentally different way of approaching the peaks.
The first thing you have to do is form a mental map of your program's locking. This map has two axes:

- The vertical axis of the map represents code: the stages a request passes through. If you're using a staged architecture with the branches straightened out, you probably already have a diagram of these divisions, like the OSI-style protocol-stack diagrams everyone has seen.
- The horizontal axis represents data sets. At every stage of a request there should be a data set the stage operates on.
You now have a grid, where each cell represents a particular data set in a particular processing stage. What's most important is the following rule: two requests should not be in contention unless they are in the same data set and the same processing stage. If you can manage that, you've already won half the battle.
Once you've defined the grid, every type of locking your program does can be plotted, and your next goal is to ensure that the resulting dots are as evenly distributed along both axes as possible. Unfortunately, this part is very application-specific. You have to think like a diamond-cutter, using your knowledge of what the program does to find the natural "cleavage lines" between stages and data sets. Sometimes they're obvious to start with. Sometimes they're harder to find, but seem more obvious in retrospect. Dividing code into stages is a complicated matter of program design, so there's not much I can offer there, but here are some suggestions for how to define data sets:

- If you can number requests sequentially, hash them, or associate them with a transaction ID, those values partition the data nicely.
- Sometimes it's better to assign requests to data sets dynamically, based on which data set currently has the most resources available, rather than by some intrinsic property of the request, much as the multiple integer units in a modern CPU divide up work.
- It's very useful to make sure the data sets used in different stages are distinct, so that data contended for in one stage is guaranteed not to be contended for in another.
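To make the first suggestion concrete, here is a hedged sketch of hashing a request or transaction ID onto one of several independently locked stripes, so that requests landing on different stripes never contend; the stripe count and names are arbitrary.

```c
#include <pthread.h>
#include <stdint.h>

#define N_STRIPES 16                       /* assumed number of partitions */

static pthread_mutex_t stripe_lock[N_STRIPES];

void stripes_init(void)
{
    for (int i = 0; i < N_STRIPES; i++)
        pthread_mutex_init(&stripe_lock[i], NULL);
}

static pthread_mutex_t *lock_for(uint64_t request_id)
{
    return &stripe_lock[request_id % N_STRIPES];
}

void update_state(uint64_t request_id /*, ... */)
{
    pthread_mutex_t *l = lock_for(request_id);
    pthread_mutex_lock(l);
    /* touch only the data belonging to this stripe */
    pthread_mutex_unlock(l);
}
```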
If you've divided your "locking space" both vertically and horizontally, and made sure that lock activity is spread evenly across the resulting cells, you can be pretty sure that your locking is in pretty good shape. There's one more step, though. Do you remember the "small steps" approach I derided a few paragraphs ago? It still has its place, because now you're at a good starting point instead of a terrible one. In metaphorical terms you're probably well up the slope on one of the mountain range's highest peaks, but you're probably not at the top of one. Now is the time to collect contention statistics and see what you need to do to improve, splitting stages and data sets in different ways and then collecting more statistics until you're satisfied. If you do all that, you're sure to have a fine view from the mountaintop.
As promised, I've covered the four biggest performance problems in server design. There are still some important issues that any particular server will need to address, though. Mostly, these come down to knowing your platform/environment:

- How does your storage subsystem perform with large vs. small requests, and with sequential vs. random access? How good is it at read-ahead and write-behind?
- How efficient is the network protocol you're using? Can it be tuned for better performance? Are there facilities such as TCP_CORK, MSG_PUSH or Nagle toggling that help avoid sending tiny messages?
- Does your system support scatter/gather I/O (e.g. readv/writev)? Using it can improve performance and also avoid the pain of buffer chains (see the data-copies section above).
- What are your page size and cache-line size? Is it worth aligning to those boundaries? How expensive are system calls and context switches?
- Do your lock primitives suffer from starvation? Does your event mechanism have "thundering herd" problems? Does your wakeup/sleep mechanism have the awful behavior that when X wakes Y, the context switches to Y immediately even though X still had work to finish?
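As an illustration of the scatter/gather point above, a small sketch using POSIX writev to send a header and a separately allocated body in one system call, without first copying them into a contiguous buffer; a real server would loop on short writes.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send an HTTP-style header and body from two separate buffers in one call. */
ssize_t send_response(int fd, const char *hdr, const void *body, size_t body_len)
{
    struct iovec iov[2];
    iov[0].iov_base = (void *)hdr;
    iov[0].iov_len  = strlen(hdr);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len  = body_len;
    return writev(fd, iov, 2);      /* may still write partially */
}
```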
I'm sure I could think of many more questions in this vein. I'm sure you could too. In any particular situation it might not be worthwhile to do anything about any one of these issues, but it's usually worth at least thinking about them. If you don't know the answers - many of which you will not find in the system documentation - find out. Write a test program or micro-benchmark to find the answers empirically; writing such code is a useful skill in and of itself anyway. If you're writing code to run on multiple platforms, many of these questions correlate with points where you should probably be abstracting functionality into per-platform libraries so you can realize a performance gain on that one platform that supports a particular feature.
The "know the answers" theory applies to your own code, too. Figure out what the important high-level operations in your code are, and time them under different conditions. This is not quite the same as traditional profiling; it's about measuring design elements, not actual implementations. Low-level optimization is generally the last resort of someone who screwed up the design.
We have seen different models for socket I/O--and file I/O, in the case of a web server for static content. Now we need models that merge I/O operations and CPU-bound activities such as request parsing and request handling into general server architectures.
There are traditionally two competitive server architectures--one is based on threads, the other on events. Over time, more sophisticated variants emerged, sometimes combining both approaches. There has been a long controversy, whether threads or events are generally the better fundament for high performance web servers [Ous96,vB03a,Wel01]. After more than a decade, this argument has been now reinforced, thanks to new scalability challenges and the trend towards multi-core CPUs.
Before we evaluate the different approaches, we introduce the general architectures, the corresponding patterns in use and give some real world examples.
The thread-based approach basically associates each incoming connection with a separate thread (resp. process). In this way, synchronous blocking I/O is the natural way of dealing with I/O. It is a common approach that is well supported by many programming languages. It also leads to a straightforward programming model, because all tasks necessary for request handling can be coded sequentially. Moreover, it provides a simple mental abstraction by isolating requests and hiding concurrency. Real concurrency is achieved by employing multiple threads/processes at the same time.
Conceptually, multi-process and multi-threaded architectures share the same principles: each new connection is handled by a dedicated activity.
A traditional approach to UNIX-based network servers is the process-per-connection model, using a dedicated process for handling a connection [Ste03]. This model has also been used for the first HTTP server, CERN httpd. Because processes do not share memory, they isolate different requests from one another as a matter of course. Being rather heavyweight structures, processes are costly to create, so servers often employ a strategy called preforking. When using preforking, the main server process forks several handler processes preemptively on start-up, as shown in figure 4.1. Often, the (thread-safe) socket descriptor is shared among all processes, and each process blocks for a new connection, handles the connection and then waits for the next connection.
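A minimal preforking sketch in C along the lines of figure 4.1; the port, backlog and worker count are arbitrary, and all error handling is omitted.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define NUM_WORKERS 8   /* assumed pool size */

/* The parent creates the listening socket, then forks workers that all
 * block in accept() on the shared descriptor. */
int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    bind(lfd, (struct sockaddr *)&addr, sizeof addr);
    listen(lfd, 128);

    for (int i = 0; i < NUM_WORKERS; i++) {
        if (fork() == 0) {                       /* worker process */
            for (;;) {
                int cfd = accept(lfd, NULL, NULL);
                if (cfd < 0) continue;
                /* ... read request, write response ... */
                close(cfd);                      /* then wait for the next one */
            }
        }
    }
    for (;;) pause();                            /* parent just supervises */
}
```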
Some multi-process servers also measure the load and spawn additional processes when needed. However, it is important to note that the heavyweight structure of a process limits the maximum number of simultaneous connections. The large memory footprint as a result of the connection-process mapping leads to a concurrency/memory trade-off. Especially in the case of long-running, partially inactive connections (e.g. long-polling notification requests), the multi-process architecture provides only limited scalability for concurrent requests.
The popular Apache web server provides a robust multi-processing module that is based on process preforking, Apache-MPM prefork. It is still the default multi-processing module for UNIX-based setups of Apache.
When reasonable threading libraries became available, new server architectures emerged that replaced heavyweight processes with more lightweight threads. In effect, they employ a thread-per-connection model. Although following the same principles, the multi-threaded approach has several important differences. First of all, multiple threads share the same address space and hence share global variables and state. This makes it possible to implement mutual features for all request handlers, such as a shared cache for cacheable responses inside the web server. Obviously, correct synchronization and coordination is then required. Another consequence of the more lightweight structure of threads is their smaller memory footprint. Compared to the full-blown memory size of an entire process, a thread only consumes limited memory (i.e. the thread stack). Also, threads require fewer resources for creation/termination. We have already seen that the dimensions of a process are a severe problem in case of high concurrency. Threads are generally a more efficient replacement when mapping connections to activities.
In practice, it is a common architecture to place a single dispatcher thread (sometimes also called acceptor thread) in front of a pool of threads for connection handling [Ste03], as shown in figure 4.2. Thread pools are a common way of bounding the maximum number of threads inside the server. The dispatcher blocks on the socket for new connections. Once established, the connection is passed to a queue of incoming connections. Threads from the thread pool take connections from the queue, execute the requests and wait for new connections in the queue. When the queue is also bounded, the maximum number of awaiting connections can be restricted. Additional connections will be rejected. While this strategy limits the concurrency, it provides more predictable latencies and prevents total overload.
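A sketch of the bounded queue sitting between the dispatcher and the worker pool, using a mutex and condition variables; the capacity is an assumed constant, and a server that prefers to reject rather than block would return an error instead of waiting on not_full.

```c
#include <pthread.h>

#define QUEUE_CAP 256   /* assumed bound on waiting connections */

/* Bounded queue of accepted connection descriptors: one dispatcher
 * producing, a pool of workers consuming. */
static int queue[QUEUE_CAP];
static int head, tail, count;
static pthread_mutex_t qlock     = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;

void enqueue_conn(int fd)               /* called by the dispatcher */
{
    pthread_mutex_lock(&qlock);
    while (count == QUEUE_CAP)          /* queue bounded: wait (or reject) */
        pthread_cond_wait(&not_full, &qlock);
    queue[tail] = fd;
    tail = (tail + 1) % QUEUE_CAP;
    count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&qlock);
}

int dequeue_conn(void)                  /* called by each worker thread */
{
    pthread_mutex_lock(&qlock);
    while (count == 0)
        pthread_cond_wait(&not_empty, &qlock);
    int fd = queue[head];
    head = (head + 1) % QUEUE_CAP;
    count--;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&qlock);
    return fd;
}
```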
Apache-MPM worker is a multi-processing module for the Apache web server that combines processes and threads. The module spawns several processes and each process in turn manages its own pool of threads.
Multi-threaded servers using a thread-per-connection model are easy to implement and follow a simple strategy. Synchronous, blocking I/O operations can be used as a natural way of expressing I/O access. The operating system overlaps multiple threads via preemptive scheduling. In most cases, a blocking I/O operation will trigger scheduling at the latest, causing a context switch and allowing the next thread to continue. This is a sound model for decent concurrency, and also appropriate when a reasonable amount of CPU-bound operations must be executed. Furthermore, multiple CPU cores can be used directly, as threads and processes are scheduled to all cores available.
Under heavy load, a multi-threaded web server consumes large amounts of memory (due to a single thread stack for each connection), and constant context switching causes considerable losses of CPU time. An indirect penalty thereof is increased chance of CPU cache misses. Reducing the absolute number of threads improves the per-thread performance, but limits the overall scalability in terms of maximum simultaneous connections.
As an alternative to synchronous blocking I/O, the event-driven approach is also common in server architectures. Due to the asynchronous/non-blocking call semantics, other models than the previously outlined thread-per-connection model are needed. A common model is the mapping of a single thread to multiple connections. The thread then handles all occurring events from I/O operations of these connections and requests. As shown in figure 4.3, new events are queued and the thread executes a so-called event loop--dequeuing events from the queue, processing the event, then taking the next event or waiting for new events to be pushed. Thus, the work executed by a thread is very similar to that of a scheduler, multiplexing multiple connections to a single flow of execution.
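A skeleton of such an event loop on Linux using epoll; the per-connection state machine, non-blocking socket setup and error handling are left out, so this is only a sketch of the control flow.

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Single-threaded event loop multiplexing all connections. */
void event_loop(int listen_fd)
{
    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);

    for (;;) {
        struct epoll_event events[64];
        int n = epoll_wait(ep, events, 64, -1);   /* block until events */
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {                /* new connection */
                int cfd = accept(listen_fd, NULL, NULL);
                if (cfd >= 0) {
                    struct epoll_event cev = { .events = EPOLLIN, .data.fd = cfd };
                    epoll_ctl(ep, EPOLL_CTL_ADD, cfd, &cev);
                }
            } else {
                /* non-blocking read/parse/respond for this connection;
                   its progress is tracked in a per-fd state machine */
            }
        }
    }
}
```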
Processing an event either requires registered event handler code for specific events, or it is based on the execution of a callback associated to the event in advance. The different states of the connections handled by a thread are organized in appropriate data structures--either explicitly using finite state machines or implicitly via continuations or closures of callbacks. As a result, the control flow of an application following the event-driven style is somewhat inverted. Instead of sequential operations, an event-driven program uses a cascade of asynchronous calls and callbacks that get executed on events. This notion often makes the flow of control less obvious and complicates debugging.
The usage of event-driven server architectures has historically depended on the availability of asynchronous/non-blocking I/O operations on OS level and suitable, high-performance event notification interfaces such as epoll and kqueue. Earlier implementations of event-based servers include the Flash web server by Pai et al. [Pai99].
Different patterns have emerged for event-based I/O multiplexing, recommending solutions for highly concurrent, high-performance I/O handling. The patterns generally address the problem of how a network service can handle multiple concurrent requests.
The Reactor pattern [Sch95] targets synchronous, non-blocking I/O handling and relies on an event notification interface. On startup, an application following this pattern registers a set of resources (e.g. a socket) and events (e.g. a new connection) it is interested in. For each resource event the application is interested in, an appropriate event handler must be provided--a callback or hook method. The core component of the Reactor pattern is a synchronous event demultiplexer that awaits events on the registered resources using a blocking event notification interface. Whenever the synchronous event demultiplexer receives an event (e.g. a new client connection), it notifies a dispatcher and awaits the next event. The dispatcher processes the event by selecting the associated event handler and triggering the callback/hook execution.
The Reactor pattern thus decouples a general framework for event handling and multiplexing from the application-specific event handlers. The original pattern focuses on a single-threaded execution. This requires the event handlers to adhere to the non-blocking style of operations. Otherwise, a blocking operation can suspend the entire application. Other variants of the Reactor pattern use a thread pool for the event handlers. While this improves performance on multi-core platforms, an additional overhead for coordination and synchronization must be taken into account.
In contrast, the Proactor pattern [Pya97] leverages truly asynchronous, non-blocking I/O operations, as provided by interfaces such as POSIX AIO. As a result, the Proactor can be considered an entirely asynchronous variant of the Reactor pattern seen before. It incorporates support for completion events instead of blocking event notification interfaces. A proactive initiator represents the main application thread and is responsible for initiating asynchronous I/O operations. When issuing such an operation, it always registers a completion handler and completion dispatcher. The execution of the asynchronous operation is governed by the asynchronous operation processor, an entity that is part of the OS in practice. When the I/O operation has been completed, the completion dispatcher is notified. Next, the completion handler processes the resulting event.
An important property in terms of scalability compared to the Reactor pattern is the better multithreading support. The execution of completion handlers can easily be handed off to a dedicated thread pool.
Having a single thread running an event loop and waiting for I/O notifications has a different impact on scalability than the thread-based approach outlined before. Not associating connections and threads does dramatically decrease the number of threads of the server--in an extreme case, down to the single event-looping thread plus some OS kernel threads for I/O. We thereby get rid of the overhead of excessive context switching and do not need a thread stack for each connection. This decreases the memory footprint under load and wastes less CPU time on context switching. Ideally, the CPU becomes the only apparent bottleneck of an event-driven network application. Until full saturation of resources is achieved, the event loop scales with increasing throughput. Once the load increases beyond maximum saturation, the event queue begins to pile up as the event-processing thread cannot keep up. Under this condition, the event-driven approach still provides high throughput, but the latencies of requests increase linearly, due to overload. This might be acceptable for temporary load peaks, but permanent overload degrades performance and renders the service unusable. One countermeasure is more resource-aware scheduling and decoupling of event processing, as we will see soon when analysing a stage-based approach.
For the moment, we stay with the event-driven architectures and align them with multi-core architectures. While the thread-based model covers both I/O-based and CPU-based concurrency, the initial event-based architecture solely addresses I/O concurrency. For exploiting multiple CPUs or cores, event-driven servers must be further adapted.
An obvious approach is the instantiation of multiple separate server processes on a single machine. This is often referred to as the N-copy approach for using N instances on a host with N CPUs/cores. In our case a machine would run multiple web server instances and register all instances at the load balancers. A less isolated alternative shares the server socket between all instances, thus requiring some coordination. For instance, an implementation of this approach is available for node.js using the cluster module, which forks multiple instances of an application and shares a single server socket.
The web servers in the architectural model have a specific feature--they are stateless, shared-nothing components. Adding even an internal cache for dynamic requests already requires several changes to the server architecture. For the moment, the easier concurrency model of having a single-threaded server and sequential execution semantics of callbacks can be accepted as part of the architecture. It is exactly this simple execution model that makes single-threaded applications attractive for developers, as the efforts of coordination and synchronization are diminished and the application code (i.e. callbacks) is guaranteed not to run concurrently. On the other hand, this characteristic intrinsically prevents the utilization of multiple processors inside a single event-driven application. Zeldovich et al. have addressed this issue with libasync-smp [Zel03], an asynchronous programming library taking advantage of multiple processes and parallel callback execution. The simple sequential programming model is still preserved. The basic idea is the usage of tokens, so-called colors, assigned to each callback. Callbacks with different colors can be executed in parallel, while serial execution is guaranteed for callbacks with the same color. Assigning a default color to all non-labelled callbacks makes this approach backward compatible to programs without any colors.
Let us extend our web server with a cache, using the coloring for additional concurrency. Reading and parsing a new request are sequential operations, but different requests can be handled at the same time. Thus, each request gets a distinct color (e.g. using the socket descriptor), and the parsing operations of different requests can actually happen in parallel, as they are labelled differently. After having parsed the request, the server must check if the required content is already cached. Otherwise, it must be requested from the application server. Checking the cache, however, touches shared state and must be executed sequentially in order to provide consistency. Hence, the same color label is used for this step for all requests, telling the scheduler to always run these operations serially, and never in parallel. This library also allows the callback to execute partially blocking operations. As long as the operation is not labelled with a shared color, it will not block other callbacks directly. The library is backed by a thread pool and a set of event queues, distinguished by colors. This solution allows developers to adhere to the traditional event-driven programming style, but introduces real concurrency to a certain extent. However, it requires the developer to label callbacks correctly. Reasoning about the flows of execution in an event-driven program is already difficult sometimes, and the additional effort may complicate this further.
The need for scalable architectures and the drawbacks of both general models have led to alternative architectures and libraries incorporating features of both models.
A formative architecture combining threads and events for scalable servers has been designed by Welsh et al. [Wel01], the so-called staged event-driven architecture (SEDA). As a basic concept, it divides the server logic into a series of well-defined stages that are connected by queues, as shown in figure 4.4. Requests are passed from stage to stage during processing. Each stage is backed by a thread or a thread pool, which may be configured dynamically.
The separation favors modularity as the pipeline of stages can be changed and extended easily. Another very important feature of the SEDA design is the resource awareness and explicit control of load. The size of the enqueued items per stage and the workload of the thread pool per stage gives explicit insight into the overall load factor. In case of an overload situation, a server can adjust scheduling parameters or thread pool sizes. Other adaptive strategies include dynamic reconfiguration of the pipeline or deliberate request termination. When resource management, load introspection and adaptivity are decoupled from the application logic of a stage, it is simple to develop well-conditioned services. From a concurrency perspective, SEDA represents a hybrid approach between thread-per-connection multithreading and event-based concurrency. Having a thread (or a thread pool) dequeuing and processing elements resembles an event-driven approach. The usage of multiple stages with independent threads effectively utilizes multiple CPUs or cores and tends toward a multi-threaded environment. From a developer's perspective, the implementation of handler code for a certain stage also resembles more traditional thread programming.
The drawbacks of SEDA are the increased latencies due to queue and stage traversal even in case of minimal load. In a later retrospective [Wel10], Welsh also criticized a missing differentiation of module boundaries (stages) and concurrency boundaries (queues and threads). This distribution triggers too many context switches when a request passes through multiple stages and queues. A better solution groups multiple stages together with a common thread pool. This decreases context switches and improves response times. Stages with I/O operations and comparatively long execution times can still be isolated.
The SEDA model has inspired several implementations, including the generic server framework Apache MINA and enterprise service buses such as Mule ESB.
On the threading side, the Capriccio threading library by von Behren et al. [vB03b], for instance, promises scalable threads for servers by tackling the main thread issues. The problem of extensive context switches is addressed by using non-preemptive scheduling. Threads either yield on I/O operations or on an explicit yield operation. The stack size of each thread is limited based on prior analysis at compile time. This makes it unnecessary to conservatively over-provision stack space up front. However, unbounded loops and the usage of recursive calls render a complete calculation of stack size a priori impossible. As a workaround, checkpoints are inserted into the code that determine whether a stack overflow is about to happen and allocate new stack chunks in that case. The checkpoints are inserted at compile time and are placed in such a manner that there can never be a stack overflow within the code between two checkpoints. Additionally, resource-aware scheduling is applied to prevent thrashing: CPU, memory and file descriptor usage are watched and, combined with a static analysis of the resource usage of threads, scheduling is dynamically adapted.
Also, hybrid libraries combining threads and events have been developed. Li and Zdancewic [Li07] have implemented a combined model for Haskell, based on concurrency monads. The programming language Scala also provides event-driven and multi-threaded concurrency, which can be combined for server implementations.
| | thread-based | event-driven |
|---|---|---|
| connection/request state | thread context | state machine/continuation |
| main I/O model | synchronous/blocking | asynchronous/non-blocking |
| activity flow | thread-per-connection | events and associated handlers |
| primary scheduling strategy | preemptive (OS) | cooperative |
| scheduling component | scheduler (OS) | event loop |
| calling semantics | blocking | dispatching/awaiting events |
Pariag et al. [Par07] have conducted a detailed performance-oriented comparison of thread-based, event-driven and hybrid pipelined servers. The thread-based server (knot) has taken advantage of the aforementioned Capriccio library. The event-driven server (µserver) has been designed to support socket sharing and multiprocessor usage via the N-copy approach. Lastly, the hybrid pipelined server (WatPipe) has been heavily inspired by SEDA, and consists of four stages for serving web requests. Pariag and his team then tested and heavily tuned the three servers. Finally, they benchmarked the servers using different scenarios, including deliberate overload situations. Previous benchmarks have been used to promote either new thread-based or event-driven architectures [Pai99,Wel01,vB03a], often with clear benefits for the new architecture. The extensive benchmark of Pariag et al. revealed that all three architectural models can be used for building highly scalable servers, as long as thorough tuning and (re-)configuration is conducted. The results also showed that event-driven architectures using asynchronous I/O still have a marginal advantage over thread-based architectures.
Event-driven web servers like nginx (e.g. GitHub, WordPress.com), lighttpd (e.g. YouTube, Wikipedia) or Tornado (e.g. Facebook, Quora) are currently very popular and several generic frameworks have emerged that follow this architectural pattern. Such frameworks available for Java include netty and MINA.
Please note that we do not conduct our own benchmarks in this chapter. Nottingham, one of the editors of the HTTP standards, has written an insightful summary of why even-handed server benchmarking is extremely hard and costly [Not11]. Hence, we solely focus on the architectural concepts and design principles of web servers and confine our considerations to the prior results of Pariag et al. [Par07].
http://www.nightmare.com/medusa/medusa.html
Medusa is an architecture for very-high-performance TCP/IP servers (like HTTP, FTP, and NNTP). Medusa is different from most other servers because it runs as a single process, multiplexing I/O with its various client and server connections within a single process/thread.
Medusa is written in Python, a high-level object-oriented language that is particularly well suited to building powerful, extensible servers. Medusa can be extended and modified at run-time, even by the end-user. User 'scripts' can be used to completely change the behavior of the server, and even add in completely new server types.
Most Internet servers are built on a 'forking' model. ('Fork' is a Unix term for starting a new process.) Such servers actually invoke an entire new process for every single client connection. This approach is simple to implement, but does not scale very well to high-load situations. Lots of clients mean a lot of processes, which gobble up large quantities of virtual memory and other system resources. A high-load server thus needs to have a lot of memory. Many popular Internet servers are running with hundreds of megabytes of memory.
The vast majority of Internet servers are I/O bound - for any one process, the CPU is sitting idle 99.9% of the time, usually waiting for input from an external device (in the case of an Internet server, it is waiting for input from the network). This problem is exacerbated by the imbalance between server and client bandwidth: most clients are connecting at relatively low bandwidths (28.8 kbits/sec or less, with network delays and inefficiencies it can be far lower). To a typical server CPU, the time between bytes for such a client seems like an eternity! (Consider that a 200 MHz CPU can perform roughly 50,000 operations for each byte received from such a client).
A simple metaphor for a 'forking' server is that of a supermarket: for every 'customer' being processed [at a cash register], another 'person' must be created to handle each client session. But what if your checkout clerks were so fast they could each individually handle hundreds of customers per second? Since these clerks are almost always waiting for a customer to come through their line, you have a very large staff, sitting around idle 99.9% of the time! Why not replace this staff with a single super-clerk?
This is exactly how Medusa works!
The most obvious advantage to a single long-running server process is a dramatic improvement in performance. There are two types of overhead involved in the forking model:
Starting up a new process is an expensive operation on any operating system. Virtual memory must be allocated, libraries must be initialized, and the operating system now has yet another task to keep track of. This start-up cost is so high that it is actually noticeable to people! For example, the first time you pull up a web page with 15 inline images, while you are waiting for the page to load you may have created and destroyed at least 16 processes on the web server.
Each process also requires a certain amount of virtual memory space to be allocated on its behalf. Even though most operating systems implement a 'copy-on-write' strategy that makes this much less costly than it could be, the end result is still very wasteful. A 100-user FTP server can still easily require hundreds of megabytes of real memory in order to avoid thrashing (excess paging activity due to lack of real memory).
Medusa eliminates both types of overhead. Running as a single process, there is no per-client creation/destruction overhead. This means each client request is answered very quickly. And virtual memory requirements are lowered dramatically. Memory requirements can even be controlled with more precision in order to gain the highest performance possible for a particular machine configuration.
Another major advantage to the single-process model is persistence. Often it is necessary to maintain some sort of state information that is available to each and every client, i.e., a database connection or file pointer. Forking-model servers that need such shared state must arrange some method of getting it - usually via an IPC (inter-process communication) mechanism such as sockets or named pipes. IPC itself adds yet another significant and needless overhead - single-process servers can share such information within a single address space.
Implementing persistence in Medusa is easy - the address space of its process (and thus its open database handles, variables, etc...) is available to each and every client.
Alright, at this point many of my readers will say I'm beating up on a strawman. In fact, they will say, such server architectures are already available - namely Microsoft's Internet Information Server. IIS avoids the above-named problems by using threads. Threads are 'lightweight processes' - they represent multiple concurrent execution paths within a single address space. Threads solve many of the problems mentioned above, but also create new ones:
Threads are required in only a limited number of situations. In many cases where threads seem appropriate, an asynchronous solution can actually be written with less work, and will perform better. Avoiding the use of threads also makes access to shared resources (like database connections) easier to manage, since multi-user locking is not necessary.
Note: In the rare case where threads are actually necessary, Medusa can of course use them, if the host operating system supports them.
Another solution (used by many current HTTP servers on Unix) is to 'pre-spawn' a large number of processes - clients are attached to each server in turn. Although this alleviates the performance problem up to that number of users, it still does not scale well. To reliably and efficiently handle [n] users, [n] processes are still necessary.
Since Medusa is written in Python, it is easily extensible. No separate compilation is necessary. New facilities can be loaded and unloaded into the server without any recompilation or linking, even while the server is running. [For example, Medusa can be configured to automatically upgrade itself to the latest version every so often].
Many of the most popular security holes (popular, at least, among the mischievous) exploit the fact that servers are usually written in a low-level language. Unless such languages are used with extreme care, weaknesses can be introduced that are very difficult to predict and control. One of the favorite loop-holes is the 'memory buffer overflow', used by the Internet Worm (and many others) to gain unwarranted access to Internet servers.
Such problems are virtually non-existent when working in a high-level language like Python, where for example all access to variables and their components is checked at run-time for valid range operations. Even unforeseen errors and operating system bugs can be caught - Python includes a full exception-handling system which promotes the construction of 'highly available' servers. Rather than crashing the entire server, Medusa will often inform the user, log the error, and keep right on running.
The currently available version of Medusa includes integrated World Wide Web (HTTP) and file transfer (FTP) servers. This combined server can solve a major performance problem at any high-load site, by replacing two forking servers with a single non-forking, non-threading server. Multiple servers of each type can also be instantiated.
Also included is a secure 'remote-control' capability, called a monitor server. With this server enabled, authorized users can 'log in' to the running server, and control, manipulate, and examine the server while it is running.
Several extensions are available for the HTTP server, and more will become available over time. Each of these extensions can be loaded/unloaded into the server dynamically.
An API is evolving for users to extend not just the HTTP server but Medusa as a whole, mixing in other server types and new capabilities into existing servers. I am actively encouraging other developers to produce (and if they wish, to market) Medusa extensions. The underlying socket library (and thus the core networking technology of Medusa) is very stable, and has been running virtually unchanged since 1995.
Medusa is available from http://www.nightmare.com/medusa
Feedback, both positive and negative, is much appreciated; please send email to rushing@nightmare.com.