一次鞭辟入裏的 Log4j2 異步日誌輸出阻塞問題的定位

「本文已參與好文召集令活動,點擊查看:後端、大前端雙賽道投稿,2萬元獎池等你挑戰!html

一次鞭辟入裏的 Log4j2 日誌輸出阻塞問題的定位

問題現象

線上某個應用的某個實例忽然出現某些次請求服務響應極慢的狀況,有幾回請求超過 60s 才返回,而且經過日誌發現,服務線程並無作什麼很重的操做。這種狀況斷斷續續持續了半小時左右。前端

核心問題定位

因爲半小時後,服務實例恢復正常,而且出現請求極慢的次數並不太多,沒有能及時打印出現問題的時候線程堆棧採集相關信息。可是,咱們有本身的法寶,JFR(關於 JFR,請參考個人另外一系列JFR全解)。java

JFR 很是適合用來過後覆盤定位問題,而且配置得當的話,性能損耗極小,而且不像 APM 系統同樣須要額外的採集以及須要統一的服務進程進行整合展示。咱們隨用隨取便可,可是 JFR 面向的是單進程的問題定位,找到問題進程,定位跨進程業務問題鏈路追蹤仍是須要 APM 系統的。git

咱們經過以下命令採集對應時間段的 JFR:github

jcmd 進程pid JFR.dump begin=2021-07-04T17:30:00.000Z end=2021-07-04T18:30:00.000Z
複製代碼

咱們選擇的時間跨度比出問題的時間跨度大一些,這樣保證底層問題出現的時候也被採集寫入這個 dump 文件。web

對於歷史某些請求響應慢,我通常按照以下流程去看:數據庫

  1. 是否有 STW(Stop-the-world,參考個人另外一篇文章:JVM相關 - SafePoint 與 Stop The World 全解):
  2. 是否有 GC 致使的長時間 STW
  3. 是否有其餘緣由致使進程全部線程進入 safepoint 致使 STW
  4. 是否 IO 花了太長時間,例如調用其餘微服務,訪問各類存儲(硬盤,數據庫,緩存等等)
  5. 是否在某些鎖上面阻塞太長時間?
  6. 是否 CPU 佔用太高,哪些線程致使的?

接下來咱們來詳細分析,首先是 GC,從下圖看,並無能嚴重影響線程 STW 很長時間的: imageapache

而後查看 safepoint 相關,發現也沒有什麼影響:編程

image

接着,咱們查看 IO 相關的事件,也並無發現業務相關的阻塞讀:後端

image

image

而後,咱們查看是否在某些鎖上面阻塞太長時間,終於發現異常:

image

經過堆棧,咱們發現是** log4j2 打印日誌卡住了**。

問題分析

1. Java Monitor Blocked 的 Monitor Address 對於同一個對象也是會改變的

首先,對於 Java Monitor Blocked 這個事件,咱們能夠經過 Monitor Address 來判斷是不是同一個鎖。例如這裏咱們發現這些線程都是阻塞在 FFFF 4C09 3188 這個地址的對象,這個地址是基於對象在 Java 堆中的相對地址得出的。

可是對於同一個對象,這個地址並不會是不變的,在每次 GC 掃描整理到這個對象對象的時候,若是有回收內存,通常這個地址就會改變。因爲咱們使用的是 G1,因此並非每次 GC 都會掃描到這個對象,可是若是地址發生改變,必定是由於發生了 GC

2. log4j2 異步日誌原理簡介

在咱們的應用配置中,採用的日誌框架是 log4j2,採用的配置是異步日誌配置。這裏簡單說一下 Log4j2 異步日誌的原理:Log4j2 異步日誌基於高性能數據結構 Disruptor,Disruptor 是一個環形 buffer,作了不少性能優化(具體原理能夠參考個人另外一系列:高併發數據結構(disruptor)),Log4j2 對此的應用以下所示:

image

簡單來講,多線程經過 log4j2 的門面類 org.apache.logging.log4j.Logger 進行日誌輸出,被封裝成爲一個 org.apache.logging.log4j.core.LogEvent,放入到 Disruptor 的環形 buffer 中。在消費端有一個單線程消費這些 LogEvent 寫入對應的 Appender,咱們這裏只有一個 Appender,其配置是:

<RollingFile name="file" append="true"
             filePattern="./app.log-%d{yyyy.MM.dd.HH}">
    <PatternLayout pattern="${logFormat}"/>
    <Policies>
        <TimeBasedTriggeringPolicy interval="1" modulate="true"/>
    </Policies>
    <DirectWriteRolloverStrategy maxFiles="72"/>
</RollingFile>
複製代碼

異步日誌 logger 配置爲(配置 includeLocation 爲 false,避免每次打印日誌須要獲取調用堆棧的性能損耗):

<Asyncroot level="info" includeLocation="false">
    <appender-ref ref="file"/>
</Asyncroot>
複製代碼

log4j component 額外配置爲:

log4j2.component.properties:

# 當沒有日誌的時候,消費線程經過 Block 等待日誌事件到來,這樣 CPU 佔用最少
AsyncLoggerConfig.WaitStrategy=Block
複製代碼

3. log4j2 disruptor 的 RingBuffer 的大小

既然是一個環形 Buffer,它的容量是有限的,咱們這裏沒有修改它的大小,走的是默認配置,查看其源碼:

AsyncLoggerConfigDisruptor.java

public synchronized void start() {
    //省略其餘代碼
    ringBufferSize = DisruptorUtil.calculateRingBufferSize("AsyncLoggerConfig.RingBufferSize");
    disruptor = new Disruptor<>(factory, ringBufferSize, threadFactory, ProducerType.MULTI, waitStrategy);
}
複製代碼

DisruptorUtil.java

private static final int RINGBUFFER_MIN_SIZE = 128;
private static final int RINGBUFFER_DEFAULT_SIZE = 256 * 1024;
private static final int RINGBUFFER_NO_GC_DEFAULT_SIZE = 4 * 1024;

static int calculateRingBufferSize(final String propertyName) {
    //是否啓用了 ThreadLocal,若是是則爲 4 kB,不是則爲 256 kB
    int ringBufferSize = Constants.ENABLE_THREADLOCALS ? RINGBUFFER_NO_GC_DEFAULT_SIZE : RINGBUFFER_DEFAULT_SIZE;
    //讀取系統變量,以及 log4j2.component.properties 文件獲取 propertyName(這裏是 AsyncLoggerConfig.RingBufferSize) 這個配置
    final String userPreferredRBSize = PropertiesUtil.getProperties().getStringProperty(propertyName,
            String.valueOf(ringBufferSize));
    try {
        int size = Integer.parseInt(userPreferredRBSize);
        //若是小於 128 字節則按照 128 字節設置
        if (size < RINGBUFFER_MIN_SIZE) {
            size = RINGBUFFER_MIN_SIZE;
            LOGGER.warn("Invalid RingBufferSize {}, using minimum size {}.", userPreferredRBSize,
                    RINGBUFFER_MIN_SIZE);
        }
        ringBufferSize = size;
    } catch (final Exception ex) {
        LOGGER.warn("Invalid RingBufferSize {}, using default size {}.", userPreferredRBSize, ringBufferSize);
    }
    //取最近的 2 的 n 次方,由於對於 2 的 n 次方取餘等於對於 2^n-1 取與運算,這樣更快
    return Integers.ceilingNextPowerOfTwo(ringBufferSize);
}
複製代碼

若是啓用了 ThreadLocal 這種方式生成 LogEvent,每次不新生成的 LogEvent 用以前的,用 ThreadLocal 存儲的,這樣避免了建立新的 LogEvent。可是考慮下面這種狀況:

logger.info("{}", someObj);
複製代碼

這樣會形成強引用,致使若是線程沒有新的日誌,這個 someObj 一直不能回收。因此針對 Web 應用,log4j2 默認是不啓用 ThreadLocal 的 方式生成 LogEvent

Constants.java

public static final boolean IS_WEB_APP = PropertiesUtil.getProperties().getBooleanProperty(
            "log4j2.is.webapp", isClassAvailable("javax.servlet.Servlet"));
public static final boolean ENABLE_THREADLOCALS = !IS_WEB_APP && PropertiesUtil.getProperties().getBooleanProperty(
            "log4j2.enable.threadlocals", true);
複製代碼

由此,能夠看出,咱們的 RingBuffer 的大小爲 256 kB

4. RingBuffer 滿了 log4j2 會發生什麼?

當 RingBuffer 滿了,若是在 log4j2.component.properties 配置了 AsyncLoggerConfig.SynchronizeEnqueueWhenQueueFull=false,則會 Wait(實際上是 park) 在 Disruptor 的 produce 方法上,等待消費出下一個能夠生產的環形 buffer 槽;默認這個配置爲 true,即全部生產日誌的線程嘗試獲取全局中的同一個鎖(private final Object queueFullEnqueueLock = new Object();):

DisruptorUtil.java

static final boolean ASYNC_CONFIG_SYNCHRONIZE_ENQUEUE_WHEN_QUEUE_FULL = PropertiesUtil.getProperties()
        .getBooleanProperty("AsyncLoggerConfig.SynchronizeEnqueueWhenQueueFull", true);
複製代碼
private boolean synchronizeEnqueueWhenQueueFull() {
    return DisruptorUtil.ASYNC_CONFIG_SYNCHRONIZE_ENQUEUE_WHEN_QUEUE_FULL
            // Background thread must never block
            && backgroundThreadId != Thread.currentThread().getId();
}

private final Object queueFullEnqueueLock = new Object();

private void enqueue(final LogEvent logEvent, final AsyncLoggerConfig asyncLoggerConfig) {
    //若是 AsyncLoggerConfig.SynchronizeEnqueueWhenQueueFull=true,默認就是 true
    if (synchronizeEnqueueWhenQueueFull()) {
        synchronized (queueFullEnqueueLock) {
            disruptor.getRingBuffer().publishEvent(translator, logEvent, asyncLoggerConfig);
        }
    } else {
        //若是 AsyncLoggerConfig.SynchronizeEnqueueWhenQueueFull=false
        disruptor.getRingBuffer().publishEvent(translator, logEvent, asyncLoggerConfig);
    }
}
複製代碼

默認配置的時候,異常堆棧和咱們 JFR 中看到的同樣,舉個例子:

"Thread-0" #27 [13136] prio=5 os_prio=0 cpu=0.00ms elapsed=141.08s tid=0x0000022d6f2fbcc0 nid=0x3350 waiting for monitor entry  [0x000000399bcfe000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueue(AsyncLoggerConfigDisruptor.java:375)
	- waiting to lock <merged>(a java.lang.Object)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueueEvent(AsyncLoggerConfigDisruptor.java:330)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfig.logInBackgroundThread(AsyncLoggerConfig.java:159)
	at org.apache.logging.log4j.core.async.EventRoute$1.logMessage(EventRoute.java:46)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfig.handleQueueFull(AsyncLoggerConfig.java:149)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfig.logToAsyncDelegate(AsyncLoggerConfig.java:136)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfig.log(AsyncLoggerConfig.java:116)
	at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:460)
	at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:82)
	at org.apache.logging.log4j.core.Logger.log(Logger.java:162)
	at org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2190)
	at org.apache.logging.log4j.spi.AbstractLogger.logMessageTrackRecursion(AbstractLogger.java:2144)
	at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2127)
	at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2003)
	at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1975)
	at org.apache.logging.log4j.spi.AbstractLogger.info(AbstractLogger.java:1312)
	省略業務方法堆棧
複製代碼

配置爲 false 的時候,堆棧是這個樣子的:

"Thread-0" #27 [18152] prio=5 os_prio=0 cpu=0.00ms elapsed=5.68s tid=0x000002c1fa120e00 nid=0x46e8 runnable  [0x000000eda8efe000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@17-loom/Native Method)
	at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17-loom/LockSupport.java:410)
	at com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:136)
	at com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:105)
	at com.lmax.disruptor.RingBuffer.publishEvent(RingBuffer.java:524)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueue(AsyncLoggerConfigDisruptor.java:379)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueueEvent(AsyncLoggerConfigDisruptor.java:330)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfig.logInBackgroundThread(AsyncLoggerConfig.java:159)
	at org.apache.logging.log4j.core.async.EventRoute$1.logMessage(EventRoute.java:46)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfig.handleQueueFull(AsyncLoggerConfig.java:149)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfig.logToAsyncDelegate(AsyncLoggerConfig.java:136)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfig.log(AsyncLoggerConfig.java:116)
	at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:460)
	at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:82)
	at org.apache.logging.log4j.core.Logger.log(Logger.java:162)
	at org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2190)
	at org.apache.logging.log4j.spi.AbstractLogger.logMessageTrackRecursion(AbstractLogger.java:2144)
	at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2127)
	at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2003)
	at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1975)
	at org.apache.logging.log4j.spi.AbstractLogger.info(AbstractLogger.java:1312)
複製代碼

5. 爲什麼會滿,咱們的消費線程當時在幹啥?

咱們來看下當時的消費線程是否有異常,因爲當時硬盤 io 看系統監控並無異常因此這個線程很大多是 Runnable 的,不斷的在寫入日誌。同時,咱們知道,寫入文件 java 底層基於 native 調用,因此咱們查看 JFR 的 native 方法採集。使用事件查看器中的 Method Profiling Sample Native,右鍵點擊,並選擇「使用所選類型事件建立新頁」:

image

在建立出來的新頁中,按照線程分組查看,查看 Log4j2 的 disruptor 消費線程,能夠得出下圖: image

能夠看出:在出問題的時間段,採集到的寫入日誌文件的事件數量明顯增多不少,而且,經過堆棧能夠看到:

"Log4j2-TF-2-AsyncLoggerConfig-1" #26 [23680] daemon prio=5 os_prio=0 cpu=406.25ms elapsed=2.89s tid=0x0000022d6f3d4080 nid=0x5c80 runnable  [0x000000399bbfe000]
   java.lang.Thread.State: RUNNABLE
	at java.io.FileOutputStream.writeBytes(java.base@17-loom/Native Method)
	at java.io.FileOutputStream.write(java.base@17-loom/FileOutputStream.java:365)
	at org.apache.logging.log4j.core.appender.OutputStreamManager.writeToDestination(OutputStreamManager.java:261)
	- eliminated <0x000000070ee0af40> (a org.apache.logging.log4j.core.appender.rolling.RollingFileManager)
	at org.apache.logging.log4j.core.appender.FileManager.writeToDestination(FileManager.java:272)
	- eliminated <0x000000070ee0af40> (a org.apache.logging.log4j.core.appender.rolling.RollingFileManager)
	at org.apache.logging.log4j.core.appender.rolling.RollingFileManager.writeToDestination(RollingFileManager.java:236)
	- eliminated <0x000000070ee0af40> (a org.apache.logging.log4j.core.appender.rolling.RollingFileManager)
	at org.apache.logging.log4j.core.appender.OutputStreamManager.flushBuffer(OutputStreamManager.java:293)
	- locked <0x000000070ee0af40> (a org.apache.logging.log4j.core.appender.rolling.RollingFileManager)
	at org.apache.logging.log4j.core.appender.OutputStreamManager.flush(OutputStreamManager.java:302)
	- locked <0x000000070ee0af40> (a org.apache.logging.log4j.core.appender.rolling.RollingFileManager)
	at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.directEncodeEvent(AbstractOutputStreamAppender.java:199)
	at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.tryAppend(AbstractOutputStreamAppender.java:190)
	at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.append(AbstractOutputStreamAppender.java:181)
	at org.apache.logging.log4j.core.appender.RollingFileAppender.append(RollingFileAppender.java:312)
	at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:156)
	at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:129)
	at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:120)
	at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84)
	at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:543)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfig.callAppenders(AsyncLoggerConfig.java:127)
	at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:502)
	at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:485)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfig.log(AsyncLoggerConfig.java:121)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfig.logToAsyncLoggerConfigsOnCurrentThread(AsyncLoggerConfig.java:169)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor$Log4jEventWrapperHandler.onEvent(AsyncLoggerConfigDisruptor.java:111)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor$Log4jEventWrapperHandler.onEvent(AsyncLoggerConfigDisruptor.java:97)
	at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:168)
	at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
	at java.lang.Thread.run(java.base@17-loom/Thread.java:1521)
複製代碼

log4j2 調用了不少次 flush, JFR 顯示採集到的事件,每次都調用了 flush。每次調用 flush,都會形成文件真正寫的 native 調用。而消費慢的緣由,其實就是這種 native 調用太多,系統寫入不過來了。

問題解決

咱們能夠經過如下四個方向解決這個問題:

  1. 減小日誌輸出,精簡日誌,這是比較基礎的方式,也是比較簡單的方式,可是身爲一個技術人,咱們不能知足於此
  2. 增長硬盤 io,這個也是比較基礎簡單的解決方式
  3. 咱們是否能夠減小這個 flush 呢?答案是能夠的,咱們能夠配置 Appender 的 immediateFlush 爲 false
  4. 增長監控,針對堆棧包含 org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueue 的 java monitor block 事件進行監控,若是發現時間過長或者數量不少的事件則報警或者重建進程

1. 配置 Appender 的 immediateFlush 爲 false

咱們能夠配置 Appender 的 immediateFlush 爲 false,例如:

<RollingFile name="file" append="true"
             filePattern="./app.log-%d{yyyy.MM.dd.HH}"
             immediateFlush="false">
    <PatternLayout pattern="${logFormat}"/>
    <Policies>
        <TimeBasedTriggeringPolicy interval="1" modulate="true"/>
    </Policies>
    <DirectWriteRolloverStrategy maxFiles="72"/>
</RollingFile>
複製代碼

這裏的原理對應源碼:

AbstractOutputStreamAppender.java

protected void directEncodeEvent(final LogEvent event) {
    getLayout().encode(event, manager);
    //若是配置了 immdiateFlush (默認爲 true)或者當前事件是 EndOfBatch
    if (this.immediateFlush || event.isEndOfBatch()) {
        manager.flush();
    }
}
複製代碼

那麼對於 Log4j2 Disruptor 異步日誌來講,何時 LogEventEndOfBatch 呢?是在消費到的 index 等於生產發佈到的最大 index 的時候,這也是比較符合性能設計考慮,即在沒有消費完的時候,儘量地不 flush,消費完當前全部的時候再去 flush:

BatchEventProcessor.java

private void processEvents()
{
    T event = null;
    long nextSequence = sequence.get() + 1L;

    while (true)
    {
        try
        {
            final long availableSequence = sequenceBarrier.waitFor(nextSequence);
            if (batchStartAware != null)
            {
                batchStartAware.onBatchStart(availableSequence - nextSequence + 1);
            }

            while (nextSequence <= availableSequence)
            {
                event = dataProvider.get(nextSequence);
                //這裏 nextSequence == availableSequence 就是 EndOfBatch
                eventHandler.onEvent(event, nextSequence, nextSequence == availableSequence);
                nextSequence++;
            }

            sequence.set(availableSequence);
        }
        catch (final TimeoutException e)
        {
            notifyTimeout(sequence.get());
        }
        catch (final AlertException ex)
        {
            if (running.get() != RUNNING)
            {
                break;
            }
        }
        catch (final Throwable ex)
        {
            exceptionHandler.handleEventException(ex, nextSequence, event);
            sequence.set(nextSequence);
            nextSequence++;
        }
    }
}

複製代碼

2. 增長基於 JFR 事件監控

這個須要 Java 14 以上的版本

Configuration config = Configuration.getConfiguration("default");
//設置監控的鎖 block 時間超過多少就會採集
config.getSettings().put("jdk.JavaMonitorEnter#threshold", "1s");
try (var es = new RecordingStream(config)) {
    es.onEvent("jdk.JavaMonitorEnter", recordedEvent -> {
        //若是堆棧包含咱們關注的,則報警
        if (recordedEvent.getStackTrace().getFrames().stream().anyMatch(recordedFrame -> recordedFrame.toString().contains("org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueue")))  {
            System.out.println("Alarm: " + recordedEvent);
        }
    });
    es.start();
}
複製代碼

微信搜索「個人編程喵」關注公衆號,每日一刷,輕鬆提高技術,斬獲各類offer

相關文章
相關標籤/搜索