Building Middleware with Netty: High-Concurrency Performance Optimization (Repost)

Date: 2019-11-10


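I have recently been writing a prototype of a backend middleware component whose main job is message dispatch and pass-through. Since it had to be implemented in Java, Netty was the natural first choice for the network framework, and I used Netty 4. Netty really is efficient: without much effort you can reach a fairly high TPS. Along the way I did hit a few problems that I consider classic yet hard to find material about online, so I summarize them here.

1. Context Switch Rate Too High

During load testing I monitored the kernel with nmon and saw the context switch rate climb to over 300,000 per second. That was clearly abnormal, but what in a JVM could cause so many context switches? Going back to my earlier reading notes "Process Scheduling" on the dinosaur book "Operating System Concepts" and to the Wikipedia article on context switches, the causes of a process/thread context switch are:

I/O wait: in a multitasking system, a process issues an I/O request but the I/O device is not ready yet, so the process blocks on I/O and enters the Wait state.
Time slice exhausted: in a time-sharing multitasking system, the time slice the kernel assigned to the process has run out; the process enters the Ready state and waits for the kernel to schedule it again.
Hardware interrupt: in a preemptive time-sharing multitasking system, an I/O device can raise an interrupt at any moment; the CPU stops the currently running process to handle the interrupt, so the process enters the Ready state.

Based on this analysis, I focused on the first and second causes.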
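To tell the first two causes apart in practice, Linux keeps two counters per task in /proc/<pid>/status: voluntary_ctxt_switches (the task gave up the CPU, typically because it blocked on I/O) and nonvoluntary_ctxt_switches (the task was preempted, for example when its time slice ran out). The following is a minimal sketch of reading them, assuming Linux; the class name CtxSwitchStats and the sample text are made up for illustration. For a multi-threaded JVM the per-thread files under /proc/<pid>/task/<tid>/status are probably the more useful input.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: extracts context-switch counters from /proc status text. */
class CtxSwitchStats {

    /** Returns the two counters keyed by field name; fields not present are absent. */
    static Map<String, Long> parse(String statusText) {
        Map<String, Long> result = new HashMap<>();
        for (String line : statusText.split("\n")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "key: value" line
            }
            String key = line.substring(0, colon).trim();
            if (key.equals("voluntary_ctxt_switches")
                    || key.equals("nonvoluntary_ctxt_switches")) {
                result.put(key, Long.parseLong(line.substring(colon + 1).trim()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample in the layout of /proc/<pid>/status.
        String sample = "Name:\tjava\nThreads:\t42\n"
                + "voluntary_ctxt_switches:\t1500\n"
                + "nonvoluntary_ctxt_switches:\t37\n";
        System.out.println(parse(sample));
    }
}
```

A high voluntary count points at blocking (cause one), while a high nonvoluntary count points at too many runnable threads competing for the CPU (cause two).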

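Process vs. Thread Context Switches

My earlier notes covered the causes of process context switches; how do thread context switches differ? Sure enough, StackOverflow has a question on exactly this, "thread context switch vs process context switch":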

"The main distinction between a thread switch and a process switch is that during a thread switch, the virtual memory space remains the same, while it does not during a process switch. Both types involve handing control over to the operating system kernel to perform the context switch. The process of switching in and out of the OS kernel along with the cost of switching out the registers is the largest fixed cost of performing a context switch.
A more fuzzy cost is that a context switch messes with the processors cacheing mechanisms. Basically, when you context switch, all of the memory addresses that the processor 'remembers' in it’s cache effectively become useless. The one big distinction here is that when you change virtual memory spaces, the processor’s Translation Lookaside Buffer (TLB) or equivalent gets flushed making memory accesses much more expensive for a while. This does not happen during a thread switch."

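From the top-voted answer we learn that both process and thread context switches involve entering and leaving the kernel and saving and restoring registers, which is their largest fixed cost. Compared with a process switch, a thread switch is still lighter: the main difference is that the virtual address space stays the same, so CPU caches such as the TLB are not invalidated. Note, however, that another question, "What is the overhead of a context-switch?", mentions that technology introduced by Intel and AMD in 2008 may avoid flushing the TLB. Look into it yourself if you are interested.

1.1 Non-blocking I/O

For the first cause, I/O wait, the most direct fix is to use non-blocking I/O. In Netty this means using NIO on both the server and the client.

A word here on how to proactively write data to a Netty Channel, because everything I found online follows the same pattern: the server writes its response inside a Handler after receiving a request, and the client examples also only send data from a Handler once the Channel becomes active. I need message pass-through, and messages sent to downstream systems are asynchronous and non-blocking, so those examples are of no use. Here is my approach.

On the server side, after a request arrives, obtain the Channel through ctx.channel() in channelRead0() and keep it somewhere, in a ThreadLocal variable or by any other means, as long as the Channel is retained. When response data needs to be returned, write to the held Channel directly. See Section 4 below for details.

The client works the same way: after starting the client, obtain the Channel, and write to it whenever data needs to be sent.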

EventLoopGroup group = new NioEventLoopGroup();
Bootstrap b = new Bootstrap();
b.group(group)
 .channel(NioSocketChannel.class)
 .remoteAddress(host, port)
 .handler(new ChannelInitializer<SocketChannel>() {
     @Override
     protected void initChannel(SocketChannel ch) throws Exception {
         ch.pipeline().addLast(...);
     }
 });
try {
    // Block until the connection is established, then keep the Channel for later writes.
    ChannelFuture future = b.connect().sync();
    this.channel = future.channel();
} catch (InterruptedException e) {
    throw new IllegalStateException("Error when start netty client: addr=[" + addr + "]", e);
}

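1.2 Reducing the Number of Threads

With too many threads each thread gets a smaller time slice; to give every thread a chance to run the CPU must switch among them, and each switch means saving and restoring a thread's context. So I checked the EventLoopGroup used for Netty's I/O workers. As analyzed earlier in "Netty 4 Source Code Analysis: Server Startup", the default number of threads in an EventLoopGroup is twice the number of CPU cores. I therefore set the NioEventLoopGroup thread count by hand to cut down the number of I/O threads.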

private void doStartNettyServer(int port) throws InterruptedException {
    EventLoopGroup bossGroup = new NioEventLoopGroup();
    EventLoopGroup workerGroup = new NioEventLoopGroup(4);
    try {
        ServerBootstrap b = new ServerBootstrap()
                .group(bossGroup, workerGroup)
                .channel(NioServerSocketChannel.class)
                .localAddress(port)
                .childHandler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    public void initChannel(SocketChannel ch) throws Exception {
                        ch.pipeline().addLast(...);
                    }
                });
        // Bind and start to accept incoming connections.
        ChannelFuture f = b.bind(port).sync();
        // Wait until the server socket is closed.
        f.channel().closeFuture().sync();
    } finally {
        bossGroup.shutdownGracefully();
        workerGroup.shutdownGracefully();
    }
}

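Since I was also using Akka as the business thread pool, I looked into how to change Akka's default configuration as well. The way to do it is to create a configuration file named application.conf, which is loaded automatically when the ActorSystem is created. The configuration file below defines a custom dispatcher: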

my-dispatcher {
  # Dispatcher is the name of the event-based dispatcher
  type = Dispatcher
  mailbox-type = "akka.dispatch.SingleConsumerOnlyUnboundedMailbox"
  # What kind of ExecutionService to use
  executor = "fork-join-executor"
  # Configuration for the fork join pool
  fork-join-executor {
    # Min number of threads to cap factor-based parallelism number to
    parallelism-min = 2
    # Parallelism (threads) ... ceil(available processors * factor)
    parallelism-factor = 1.0
    # Max number of threads to cap factor-based parallelism number to
    parallelism-max = 16
  }
  # Throughput defines the maximum number of messages to be
  # processed per actor before the thread jumps to the next actor.
  # Set to 1 for as fair as possible.
  throughput = 100
}

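In short, the key settings are:

parallelism-factor: determines the size of the thread pool (surprisingly, not parallelism-max, which only acts as an upper bound).
throughput: determines how often a thread moves from one actor to the next, much like coroutine switching; 1 is the most frequent and also the fairest setting.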
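The sizing rule spelled out in the configuration comments (ceil(available processors * factor), capped by parallelism-min and parallelism-max) can be written down as a small function. This is only a sketch of that arithmetic; ForkJoinSizing is a hypothetical name, not an Akka class.

```java
/** Hypothetical helper mirroring the sizing rule described in the config comments. */
class ForkJoinSizing {

    /** ceil(processors * factor), clamped to the range [min, max]. */
    static int parallelism(int processors, int min, double factor, int max) {
        int scaled = (int) Math.ceil(processors * factor);
        return Math.min(Math.max(scaled, min), max);
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // Values taken from the my-dispatcher block: min = 2, factor = 1.0, max = 16.
        System.out.println("threads on this machine: " + parallelism(cores, 2, 1.0, 16));
    }
}
```

With the settings above, an 8-core machine gets 8 threads, so changing parallelism-max alone has no effect until the factor-based value exceeds it.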

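Since this article is mainly about Netty, I will not explain these settings in detail; see the sections on Dispatchers and Mailboxes in the official documentation. Creating an Akka actor with a specific Dispatcher is simple. The following shows how to specify the Dispatcher when creating a typed actor.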

TypedActor.get(system).typedActorOf(
    new TypedProps<MyActorImpl>(
        MyActor.class,
        new Creator<MyActorImpl>() {
            @Override
            public MyActorImpl create() throws Exception {
                return new MyActorImpl(XXX);
            }
        }
    ).withDispatcher("my-dispatcher")
);

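1.3 Removing the Business Thread Pool

Despite all these configuration changes, and although jstack confirmed that the thread settings had taken effect, the context switch rate did not improve. So I simply removed the Akka-based business thread pool to cut thread context switches at the root, and the rate dropped at once from 300,000+ to 160,000! After a lot of digging I found a post on the ever-helpful StackOverflow, and one sentence in it opened my eyes: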

And if the recommendation is not to block in the event loop, then this can be done in an application thread. But that would imply an extra context switch. This extra context switch may not be acceptable to latency sensitive applications.

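With this lead I went straight to the Netty source and found that operations such as channel.write() are indeed not necessarily executed on the calling thread. Internally Netty consistently uses executor.inEventLoop() to check whether the current thread is the EventLoop thread; if it is not, the operation is wrapped in a task and handed to the internal executor: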

private void write(Object msg, boolean flush, ChannelPromise promise) {
    AbstractChannelHandlerContext next = findContextOutbound();
    EventExecutor executor = next.executor();
    if (executor.inEventLoop()) {
        next.invokeWrite(msg, promise);
        if (flush) {
            next.invokeFlush();
        }
    } else {
        int size = channel.estimatorHandle().size(msg);
        if (size > 0) {
            ChannelOutboundBuffer buffer = channel.unsafe().outboundBuffer();
            // Check for null as it may be set to null if the channel is closed already
            if (buffer != null) {
                buffer.incrementPendingOutboundBytes(size);
            }
        }
        Runnable task;
        if (flush) {
            task = WriteAndFlushTask.newInstance(next, msg, size, promise);
        } else {
            task = WriteTask.newInstance(next, msg, size, promise);
        }
        safeExecute(executor, task, promise, msg);
    }
}

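The business thread pool turned out to be a double-edged sword. Handing tasks to it for asynchronous execution shortens the time Netty's I/O threads are occupied and relieves pressure on them, but it also increases the number of thread context switches. With the optimizations above, the context switch rate under load finally dropped from 300,000+ per second to about 80,000, a clear improvement!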
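The hand-off that costs the extra context switch can be modelled with plain JDK classes. The sketch below is a simplified stand-in, not Netty's implementation: MiniEventLoop is a hypothetical single-threaded loop whose write() runs the operation inline when the caller is already on the loop thread and queues it otherwise.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical single-threaded loop illustrating the inEventLoop() hand-off. */
class MiniEventLoop {
    private volatile Thread loopThread;
    private final ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "mini-loop");
        t.setDaemon(true);
        loopThread = t; // remember which thread is the loop thread
        return t;
    });

    boolean inEventLoop() {
        return Thread.currentThread() == loopThread;
    }

    /** Runs inline when already on the loop thread; otherwise queues the task. */
    void write(Runnable op) {
        if (inEventLoop()) {
            op.run();
        } else {
            executor.execute(op); // crossing threads: this is the extra switch
        }
    }

    void shutdown() {
        executor.shutdown();
    }

    /** Returns the thread names seen by an outer write and a write nested inside it. */
    static String[] demo() {
        MiniEventLoop loop = new MiniEventLoop();
        String[] names = new String[2];
        CountDownLatch done = new CountDownLatch(1);
        loop.write(() -> {
            names[0] = Thread.currentThread().getName();
            // Already on the loop thread, so this nested write runs inline.
            loop.write(() -> names[1] = Thread.currentThread().getName());
            done.countDown();
        });
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        loop.shutdown();
        return names;
    }

    public static void main(String[] args) {
        String[] names = demo();
        System.out.println(names[0] + " / " + names[1]);
    }
}
```

Every write issued from a business thread takes the queued branch, which is why removing the separate pool and doing light work directly on the I/O thread reduced the switch count.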

2. System Call Overhead

A system call generally involves a mode transition (Mode Transition or Mode Switch) from user space to kernel space, and this transition has a cost of its own.

Mode Switch vs. Context Switch

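StackOverflow really does have a question for everything. We covered thread context switches above; how do they relate to switching between kernel mode and user mode? Does a mode switch count as a kind of context switch? The question "Does there have to be a mode switch for something to qualify as a context switch?" answers this: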

"A mode switch happens inside one process. A context switch involves more than one process (or thread). Context switch happens only in kernel mode. If context switching happens between two user mode processes, first cpu has to change to kernel mode, perform context switch, return back to user mode and so on. So there has to be a mode switch associated with a context switch. But a context switch doesn’t imply a mode switch (could be done by the hardware alone). A mode switch does not require a context switch either."

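A context switch must be carried out in the kernel. Put simply, it works by actively triggering a software interrupt (analogous to a hardware interrupt, which is triggered passively by a device), so a context switch is normally accompanied by a mode switch. Some hardware can perform the switch directly (I do not fully understand this part), and some CPUs do not even have the Ring 0 to 3 privilege levels we usually talk about. A mode switch, for its part, has even less to do with a context switch; going by Wikipedia, the most one can say is that on some systems a context switch may occur during a mode switch.

The system calls Netty makes most often are network I/O operations. To reduce their frequency, the most direct approach is to buffer output and flush the buffer only when a certain data size, number of writes, or time interval is reached.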
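As a rough illustration of that flushing rule, the sketch below batches small writes and pushes them out in one go once a byte or write-count threshold is reached. BatchingWriter and its flushed list (which stands in for the socket, one entry per simulated syscall) are hypothetical; the time-interval trigger is left out for brevity, and the class assumes it is only used from one thread, such as a Channel's EventLoop.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical batching writer: flushes when a size or count threshold is reached. */
class BatchingWriter {
    private final int maxBytes;
    private final int maxWrites;
    private final ByteArrayOutputStream pending = new ByteArrayOutputStream();
    private int writes;

    /** Stands in for the socket; each element represents one simulated syscall. */
    final List<byte[]> flushed = new ArrayList<>();

    BatchingWriter(int maxBytes, int maxWrites) {
        this.maxBytes = maxBytes;
        this.maxWrites = maxWrites;
    }

    /** Buffers the data and flushes only when one of the thresholds is hit. */
    void write(byte[] data) {
        pending.write(data, 0, data.length);
        writes++;
        if (pending.size() >= maxBytes || writes >= maxWrites) {
            flush();
        }
    }

    /** Pushes everything buffered so far to the sink in a single call. */
    void flush() {
        if (pending.size() == 0) {
            return;
        }
        flushed.add(pending.toByteArray());
        pending.reset();
        writes = 0;
    }

    public static void main(String[] args) {
        BatchingWriter w = new BatchingWriter(1024, 50);
        for (int i = 0; i < 120; i++) {
            w.write(new byte[8]);
        }
        w.flush(); // push out the remainder
        System.out.println("120 writes became " + w.flushed.size() + " flushes");
    }
}
```

In Netty terms this corresponds to calling write() several times and flush() once, since write() only queues the message and flush() is what triggers the socket write.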

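For problems such as an undersized buffer or writing too fast, Netty provides the writeBufferLowWaterMark and writeBufferHighWaterMark options: once the pending outbound data exceeds the high water mark, the Channel reports itself as not writable (isWritable() returns false) until the amount falls back below the low water mark, so the application can stop writing and keep the buffer from blowing up. This feels somewhat similar to the traffic shaping feature Netty provides. I have not studied it in depth; interested readers can explore it on their own.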
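The two water marks form a hysteresis band, which a small model makes easier to see. The sketch below is not Netty code: WaterMarkGate is a hypothetical class that only tracks pending bytes and flips a writable flag the way the options are described above.

```java
/** Hypothetical model of high/low water-mark writability (not a Netty class). */
class WaterMarkGate {
    private final long low;
    private final long high;
    private long pendingBytes;
    private boolean writable = true;

    WaterMarkGate(long low, long high) {
        if (low < 0 || high < low) {
            throw new IllegalArgumentException("require 0 <= low <= high");
        }
        this.low = low;
        this.high = high;
    }

    /** Called when bytes are queued for sending. */
    void onQueued(long bytes) {
        pendingBytes += bytes;
        if (pendingBytes > high) {
            writable = false; // producer should stop writing
        }
    }

    /** Called when bytes have actually been written to the socket. */
    void onFlushed(long bytes) {
        pendingBytes -= bytes;
        if (pendingBytes < low) {
            writable = true; // producer may resume
        }
    }

    boolean isWritable() {
        return writable;
    }

    public static void main(String[] args) {
        WaterMarkGate gate = new WaterMarkGate(32 * 1024, 64 * 1024);
        gate.onQueued(70 * 1024);
        System.out.println("writable after queuing 70 KiB: " + gate.isWritable());
    }
}
```

A producer that checks the flag before each write, and resumes when it flips back (Netty signals this through channelWritabilityChanged()), avoids piling up unbounded data in memory.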

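3. Implementing Zero Copy

The book 《Netty權威指南（第二版）》 (The Definitive Guide to Netty, 2nd edition) has a section dedicated to Netty's zero copy, but it covers the zero-copy features inside Netty. What I want to discuss here is how to achieve zero copy in application code, and the most typical scenario is message pass-through. Pass-through does not require fully parsing a message; knowing which downstream system to forward it to is enough. So we can parse only part of the message, leave the message as a whole untouched in the Direct Buffer, and finally write it straight to the Channel connected to the downstream system. Application-level zero copy therefore has two parts: configuring Direct Buffers and passing the buffer along without copying.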
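The idea of parsing only the routing field while leaving the message in place can be shown with the JDK's own direct buffers. The sketch below uses java.nio.ByteBuffer as a stand-in for Netty's ByteBuf and assumes a made-up wire format in which the first 4 bytes are a route id; PassThroughRouter is a hypothetical name.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical pass-through helper working on a direct buffer without copying it. */
class PassThroughRouter {

    /** Reads the route id with an absolute get, so the buffer position is untouched. */
    static int peekRouteId(ByteBuffer msg) {
        return msg.getInt(msg.position());
    }

    /** Returns a view that shares the same memory; no payload bytes are copied. */
    static ByteBuffer forwardView(ByteBuffer msg) {
        return msg.duplicate();
    }

    public static void main(String[] args) {
        ByteBuffer msg = ByteBuffer.allocateDirect(16);
        msg.putInt(7).put("payload".getBytes(StandardCharsets.US_ASCII));
        msg.flip();

        int route = peekRouteId(msg);
        ByteBuffer outbound = forwardView(msg);
        System.out.println("route=" + route + ", bytes to forward=" + outbound.remaining());
    }
}
```

Netty's ByteBuf offers the same kind of absolute accessor, getInt(index), which reads without moving readerIndex.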

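3.1 Memory Pool

Another benefit of using Netty is memory management. A single line of configuration is enough to get the benefits of a memory pool. Under the hood, Netty implements a Java version of the jemalloc memory allocator (remember the one bundled with Redis?), which does all the dirty work for us!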

ServerBootstrap b = new ServerBootstrap()
        .group(bossGroup, workerGroup)
        .channel(NioServerSocketChannel.class)
        .localAddress(port)
        .childOption(ChannelOption.ALLOCATOR, PooledByteBufAllocator.DEFAULT)
        .childHandler(new ChannelInitializer<SocketChannel>() {
            @Override
            public void initChannel(SocketChannel ch) throws Exception {
                ch.pipeline().addLast(...);
            }
        });

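3.2 Application-level Zero Copy

By default, Netty releases the ByteBuf automatically. That is, when our overridden channelRead0() returns, the ByteBuf has served its purpose and SimpleChannelInboundHandler releases it (if it is pooled, it goes back to the memory pool).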

public abstract class SimpleChannelInboundHandler<I> extends ChannelInboundHandlerAdapter {
    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
        boolean release = true;
        try {
            if (acceptInboundMessage(msg)) {
                @SuppressWarnings("unchecked")
                I imsg = (I) msg;
                channelRead0(ctx, imsg);
            } else {
                release = false;
                ctx.fireChannelRead(msg);
            }
        } finally {
            if (autoRelease && release) {
                ReferenceCountUtil.release(msg);
            }
        }
    }
}

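Netty uses reference counting to decide when to reclaim a buffer, so to keep using a ByteBuf without Netty releasing it, we must increase its reference count. As long as any Handler in the ChannelPipeline calls ByteBuf.retain() to raise the count by 1, Netty will not free it. We then release it in the Encoder of the downstream client once the message has been sent, which gives the effect of zero-copy pass-through: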

public class RespEncoder extends MessageToByteEncoder<Msg> {
    @Override
    protected void encode(ChannelHandlerContext ctx, Msg msg, ByteBuf out) throws Exception {
        // msg.getRaw() is the ByteBuf that was retained upstream.
        out.writeBytes(msg.getRaw(), 0, msg.getRaw().readerIndex());
        // Balance the earlier retain() now that the bytes have been handed over.
        msg.getRaw().release();
    }
}
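The retain/release bookkeeping behind this pattern can be modelled without Netty. The sketch below is a simplified stand-in for a reference-counted buffer; RefCounted is a hypothetical class, though its retain(), release() and refCnt() deliberately mirror the method names on Netty's ByteBuf.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical reference-counted resource, modelled on the retain/release contract. */
class RefCounted {
    private final AtomicInteger refCnt = new AtomicInteger(1); // creator holds one reference

    /** Adds one reference; fails if the resource has already been freed. */
    RefCounted retain() {
        for (;;) {
            int c = refCnt.get();
            if (c == 0) {
                throw new IllegalStateException("already released");
            }
            if (refCnt.compareAndSet(c, c + 1)) {
                return this;
            }
        }
    }

    /** Drops one reference; returns true when this call freed the resource. */
    boolean release() {
        for (;;) {
            int c = refCnt.get();
            if (c == 0) {
                throw new IllegalStateException("already released");
            }
            if (refCnt.compareAndSet(c, c - 1)) {
                return c == 1;
            }
        }
    }

    int refCnt() {
        return refCnt.get();
    }

    public static void main(String[] args) {
        RefCounted buf = new RefCounted();
        buf.retain();                           // inbound handler wants to keep the buffer
        boolean freedEarly = buf.release();     // automatic release after channelRead0()
        boolean freedAtEnd = buf.release();     // downstream encoder is done with it
        System.out.println(freedEarly + " then " + freedAtEnd);
    }
}
```

Each retain() must be matched by exactly one release(); a missing release leaks pooled memory, and an extra one frees a buffer that is still in use.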

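4. State Handling under Concurrency

Both the asynchronous writes to a held Channel from Section 1.1 and the rule-based buffer flushing from Section 2 involve keeping state. If that state is accessed concurrently, watch out for race conditions to avoid conflicting or lost updates.

4.1 Holding the Channel

How do we hold on to the Channel in a server-side Handler? My approach is to create a Session object that holds the Channel in channelActive(), or on the first call to channelRead0(). As analyzed earlier in "Netty 4 Source Code Analysis: Request Handling", in the Netty 4 thread model several clients may be served by one EventLoop thread, but any one client is served by exactly one EventLoop thread. Each client has its own Handler instance, which is used until the connection is closed.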

public class FrontendHandler extends SimpleChannelInboundHandler<Msg> {
    private Session session;

    @Override
    public void channelActive(ChannelHandlerContext ctx) throws Exception {
        session = factory.createSession(ctx.channel());
        super.channelActive(ctx);
    }

    @Override
    protected void channelRead0(final ChannelHandlerContext ctx, Msg msg) throws Exception {
        session.handleRequest(msg);
    }

    @Override
    public void channelInactive(ChannelHandlerContext ctx) throws Exception {
        session = null;
        super.channelInactive(ctx);
    }
}

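4.2 Decoder State

Because of TCP packet coalescing and fragmentation (sticky and split packets), a Decoder inevitably has to keep some intermediate parsing state. Netty uses the same Decoder instance for the whole lifetime of a client connection, so the intermediate state must be reset after each message is parsed, otherwise later messages will be parsed incorrectly.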

public class RespDecoder extends ReplayingDecoder<Void> {
    public RespDecoder() {
        doCleanUp();
    }

    @Override
    protected void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out) throws Exception {
        if (doParseMsg(in)) {
            doSendToHandler(out);
            // Reset the intermediate state so the next message starts clean.
            doCleanUp();
        }
    }
}
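To make the reset step concrete, here is a self-contained sketch of a stateful decoder outside Netty. It assumes a made-up framing of one length byte followed by that many payload bytes; FrameDecoder is a hypothetical class, not the RespDecoder above.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical decoder for frames of [1-byte length][payload], fed in arbitrary chunks. */
class FrameDecoder {
    private int expected;                 // payload length, -1 until the header byte is read
    private ByteArrayOutputStream body;   // payload bytes collected so far

    FrameDecoder() {
        reset();
    }

    private void reset() {
        expected = -1;
        body = new ByteArrayOutputStream();
    }

    /** Feeds one network chunk and returns every frame completed by it. */
    List<byte[]> feed(byte[] chunk) {
        List<byte[]> out = new ArrayList<>();
        for (byte b : chunk) {
            if (expected < 0) {
                expected = b & 0xFF;      // header byte starts a new frame
            } else {
                body.write(b);
            }
            if (body.size() == expected) {
                out.add(body.toByteArray());
                reset();                  // without this the next frame would be misparsed
            }
        }
        return out;
    }

    public static void main(String[] args) {
        FrameDecoder decoder = new FrameDecoder();
        // One frame split across two chunks, with the start of a second frame glued on.
        System.out.println(decoder.feed(new byte[] {3, 'a', 'b'}).size());
        System.out.println(decoder.feed(new byte[] {'c', 2, 'x'}).size());
    }
}
```

The intermediate fields survive between calls so a split frame can be completed, and they are cleared as soon as a frame is emitted so a coalesced frame that follows starts from a clean state.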

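5. Summary

5.1 The Ever-changing Netty

Before summarizing, let me grumble about Netty's release pace, which I both love and hate. From Netty 3 to Netty 4 the API went through a major earthquake; many sample programs online are based on Netty 3, so when learning Netty 4 I found that many examples no longer ran. Beyond the API, the changes to Netty's internal thread model and so on go without saying. I thought I could relax once I was on Netty 4, but in Netty 5 the thread model changed yet again! Here is what the official documentation says; take note if you plan to upgrade.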

Even more flexible thread model

In Netty 4.x each EventLoop is tightly coupled with a fixed thread that executes all I/O events of its registered Channels and any tasks submitted to it. Starting with version 5.0 an EventLoop does no longer use threads directly but instead makes use of an Executor abstraction. That is, it takes an Executor object as a parameter in its constructor and instead of polling for I/O events in an endless loop each iteration is now a task that is submitted to this Executor. Netty 4.x would simply spawn its own threads and completely ignore the fact that it’s part of a larger system. Starting with Netty 5.0, developers can run Netty and the rest of the system in the same thread pool and potentially improve performance by applying better scheduling strategies and through less scheduling overhead (due to fewer threads). It shall be mentioned, that this change does not in any way affect the way ChannelHandlers are developed. From a developer’s point of view, the only thing that changes is that it’s no longer guaranteed that a ChannelHandler will always be executed by the same thread. It is, however, guaranteed that it will never be executed by two or more threads at the same time. Furthermore, Netty will also take care of any memory visibility issues that might occur. So there’s no need to worry about thread-safety and volatile variables within a ChannelHandler.

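According to the official documentation, Netty no longer guarantees that a given Handler instance always runs on the same thread, so using ThreadLocal inside a Handler becomes a risky practice!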
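The risk is easy to reproduce with plain JDK executors. In the sketch below, ThreadLocalPitfall is a hypothetical class in which two single-thread executors stand in for two different threads that might run the same Handler; a value stored in a ThreadLocal on one is simply not there on the other.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical demo: ThreadLocal state does not follow work to another thread. */
class ThreadLocalPitfall {
    private static final ThreadLocal<String> SESSION = new ThreadLocal<>();

    /** Sets the value on one thread and returns what a second thread reads. */
    static String readOnOtherThread() {
        ExecutorService first = Executors.newSingleThreadExecutor();
        ExecutorService second = Executors.newSingleThreadExecutor();
        try {
            Runnable store = () -> SESSION.set("session-1");
            Callable<String> load = SESSION::get;
            first.submit(store).get();          // "handler" invoked on thread A
            return second.submit(load).get();   // same "handler" invoked on thread B
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            first.shutdown();
            second.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("value seen on the other thread: " + readOnOtherThread());
    }
}
```

State kept in a field of the per-connection Handler, as FrontendHandler does in Section 4.1, does not depend on which thread happens to run the Handler.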

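5.2 High-Concurrency Programming Techniques

After all this tinkering, TPS finally rose from a few thousand to about 50,000. I learned a lot about network programming and performance tuning that I did not know before, which was quite satisfying! To summarize, the optimization strategies for high-concurrency middleware are:

Thread count control: under high concurrency, context switching becomes very noticeable when there are many threads, and threads beyond the number of CPU cores bring no benefit. Unless the operations are particularly time-consuming, a business thread pool does more harm than good. Netty 5 gives us the chance to supply the underlying thread pool, which allows better control over the thread count and scheduling policy of the whole middleware.
Non-blocking I/O: to do more with fewer threads, avoiding blocking is a must.
Fewer system calls: although a mode switch costs far less than a context switch, we should still keep frequent syscalls to a minimum.
Zero-copy data: copying data out of the Direct Buffer into user-space heap memory on every pass-through adds up to a considerable cost, so avoid it.
Shared state protection: how concurrent access is handled inside the middleware is also key to performance.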

Original article: http://blog.csdn.net/dc_726/article/details/48978891