「本文已參與好文召集令活動,點擊查看:後端、大前端雙賽道投稿,2萬元獎池等你挑戰!」html
線上某個應用的某個實例忽然出現某些次請求服務響應極慢的狀況,有幾回請求超過 60s 才返回,而且經過日誌發現,服務線程並無作什麼很重的操做。這種狀況斷斷續續持續了半小時左右。前端
因爲半小時後,服務實例恢復正常,而且出現請求極慢的次數並不太多,沒有能及時打印出現問題的時候線程堆棧採集相關信息。可是,咱們有本身的法寶,JFR(關於 JFR,請參考個人另外一系列JFR全解)。java
JFR 很是適合用來過後覆盤定位問題,而且配置得當的話,性能損耗極小,而且不像 APM 系統同樣須要額外的採集以及須要統一的服務進程進行整合展示。咱們隨用隨取便可,可是 JFR 面向的是單進程的問題定位,找到問題進程,定位跨進程業務問題鏈路追蹤仍是須要 APM 系統的。git
咱們經過以下命令採集對應時間段的 JFR:github
jcmd 進程pid JFR.dump begin=2021-07-04T17:30:00.000Z end=2021-07-04T18:30:00.000Z
複製代碼
咱們選擇的時間跨度比出問題的時間跨度大一些,這樣保證底層問題出現的時候也被採集寫入這個 dump 文件。web
對於歷史某些請求響應慢,我通常按照以下流程去看:數據庫
接下來咱們來詳細分析,首先是 GC,從下圖看,並無能嚴重影響線程 STW 很長時間的: apache
而後查看 safepoint 相關,發現也沒有什麼影響:編程
接着,咱們查看 IO 相關的事件,也並無發現業務相關的阻塞讀:後端
而後,咱們查看是否在某些鎖上面阻塞太長時間,終於發現異常:
經過堆棧,咱們發現是** log4j2 打印日誌卡住了**。
首先,對於 Java Monitor Blocked 這個事件,咱們能夠經過 Monitor Address 來判斷是不是同一個鎖。例如這裏咱們發現這些線程都是阻塞在 FFFF 4C09 3188
這個地址的對象,這個地址是基於對象在 Java 堆中的相對地址得出的。
可是對於同一個對象,這個地址並不會是不變的,在每次 GC 掃描整理到這個對象對象的時候,若是有回收內存,通常這個地址就會改變。因爲咱們使用的是 G1,因此並非每次 GC 都會掃描到這個對象,可是若是地址發生改變,必定是由於發生了 GC
在咱們的應用配置中,採用的日誌框架是 log4j2,採用的配置是異步日誌配置。這裏簡單說一下 Log4j2 異步日誌的原理:Log4j2 異步日誌基於高性能數據結構 Disruptor,Disruptor 是一個環形 buffer,作了不少性能優化(具體原理能夠參考個人另外一系列:高併發數據結構(disruptor)),Log4j2 對此的應用以下所示:
簡單來講,多線程經過 log4j2 的門面類 org.apache.logging.log4j.Logger
進行日誌輸出,被封裝成爲一個 org.apache.logging.log4j.core.LogEvent
,放入到 Disruptor 的環形 buffer 中。在消費端有一個單線程消費這些 LogEvent 寫入對應的 Appender,咱們這裏只有一個 Appender,其配置是:
<RollingFile name="file" append="true"
filePattern="./app.log-%d{yyyy.MM.dd.HH}">
<PatternLayout pattern="${logFormat}"/>
<Policies>
<TimeBasedTriggeringPolicy interval="1" modulate="true"/>
</Policies>
<DirectWriteRolloverStrategy maxFiles="72"/>
</RollingFile>
複製代碼
異步日誌 logger 配置爲(配置 includeLocation
爲 false,避免每次打印日誌須要獲取調用堆棧的性能損耗):
<Asyncroot level="info" includeLocation="false">
<appender-ref ref="file"/>
</Asyncroot>
複製代碼
log4j component 額外配置爲:
log4j2.component.properties
:
# 當沒有日誌的時候,消費線程經過 Block 等待日誌事件到來,這樣 CPU 佔用最少
AsyncLoggerConfig.WaitStrategy=Block
複製代碼
既然是一個環形 Buffer,它的容量是有限的,咱們這裏沒有修改它的大小,走的是默認配置,查看其源碼:
AsyncLoggerConfigDisruptor.java
public synchronized void start() {
//省略其餘代碼
ringBufferSize = DisruptorUtil.calculateRingBufferSize("AsyncLoggerConfig.RingBufferSize");
disruptor = new Disruptor<>(factory, ringBufferSize, threadFactory, ProducerType.MULTI, waitStrategy);
}
複製代碼
private static final int RINGBUFFER_MIN_SIZE = 128;
private static final int RINGBUFFER_DEFAULT_SIZE = 256 * 1024;
private static final int RINGBUFFER_NO_GC_DEFAULT_SIZE = 4 * 1024;
static int calculateRingBufferSize(final String propertyName) {
//是否啓用了 ThreadLocal,若是是則爲 4 kB,不是則爲 256 kB
int ringBufferSize = Constants.ENABLE_THREADLOCALS ? RINGBUFFER_NO_GC_DEFAULT_SIZE : RINGBUFFER_DEFAULT_SIZE;
//讀取系統變量,以及 log4j2.component.properties 文件獲取 propertyName(這裏是 AsyncLoggerConfig.RingBufferSize) 這個配置
final String userPreferredRBSize = PropertiesUtil.getProperties().getStringProperty(propertyName,
String.valueOf(ringBufferSize));
try {
int size = Integer.parseInt(userPreferredRBSize);
//若是小於 128 字節則按照 128 字節設置
if (size < RINGBUFFER_MIN_SIZE) {
size = RINGBUFFER_MIN_SIZE;
LOGGER.warn("Invalid RingBufferSize {}, using minimum size {}.", userPreferredRBSize,
RINGBUFFER_MIN_SIZE);
}
ringBufferSize = size;
} catch (final Exception ex) {
LOGGER.warn("Invalid RingBufferSize {}, using default size {}.", userPreferredRBSize, ringBufferSize);
}
//取最近的 2 的 n 次方,由於對於 2 的 n 次方取餘等於對於 2^n-1 取與運算,這樣更快
return Integers.ceilingNextPowerOfTwo(ringBufferSize);
}
複製代碼
若是啓用了 ThreadLocal 這種方式生成 LogEvent,每次不新生成的 LogEvent 用以前的,用 ThreadLocal 存儲的,這樣避免了建立新的 LogEvent。可是考慮下面這種狀況:
logger.info("{}", someObj);
複製代碼
這樣會形成強引用,致使若是線程沒有新的日誌,這個 someObj
一直不能回收。因此針對 Web 應用,log4j2 默認是不啓用 ThreadLocal 的 方式生成 LogEvent:
public static final boolean IS_WEB_APP = PropertiesUtil.getProperties().getBooleanProperty(
"log4j2.is.webapp", isClassAvailable("javax.servlet.Servlet"));
public static final boolean ENABLE_THREADLOCALS = !IS_WEB_APP && PropertiesUtil.getProperties().getBooleanProperty(
"log4j2.enable.threadlocals", true);
複製代碼
由此,能夠看出,咱們的 RingBuffer 的大小爲 256 kB。
當 RingBuffer 滿了,若是在 log4j2.component.properties
配置了 AsyncLoggerConfig.SynchronizeEnqueueWhenQueueFull=false
,則會 Wait(實際上是 park) 在 Disruptor 的 produce 方法上,等待消費出下一個能夠生產的環形 buffer 槽;默認這個配置爲 true,即全部生產日誌的線程嘗試獲取全局中的同一個鎖(private final Object queueFullEnqueueLock = new Object();
):
static final boolean ASYNC_CONFIG_SYNCHRONIZE_ENQUEUE_WHEN_QUEUE_FULL = PropertiesUtil.getProperties()
.getBooleanProperty("AsyncLoggerConfig.SynchronizeEnqueueWhenQueueFull", true);
複製代碼
private boolean synchronizeEnqueueWhenQueueFull() {
return DisruptorUtil.ASYNC_CONFIG_SYNCHRONIZE_ENQUEUE_WHEN_QUEUE_FULL
// Background thread must never block
&& backgroundThreadId != Thread.currentThread().getId();
}
private final Object queueFullEnqueueLock = new Object();
private void enqueue(final LogEvent logEvent, final AsyncLoggerConfig asyncLoggerConfig) {
//若是 AsyncLoggerConfig.SynchronizeEnqueueWhenQueueFull=true,默認就是 true
if (synchronizeEnqueueWhenQueueFull()) {
synchronized (queueFullEnqueueLock) {
disruptor.getRingBuffer().publishEvent(translator, logEvent, asyncLoggerConfig);
}
} else {
//若是 AsyncLoggerConfig.SynchronizeEnqueueWhenQueueFull=false
disruptor.getRingBuffer().publishEvent(translator, logEvent, asyncLoggerConfig);
}
}
複製代碼
默認配置的時候,異常堆棧和咱們 JFR 中看到的同樣,舉個例子:
"Thread-0" #27 [13136] prio=5 os_prio=0 cpu=0.00ms elapsed=141.08s tid=0x0000022d6f2fbcc0 nid=0x3350 waiting for monitor entry [0x000000399bcfe000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueue(AsyncLoggerConfigDisruptor.java:375)
- waiting to lock <merged>(a java.lang.Object)
at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueueEvent(AsyncLoggerConfigDisruptor.java:330)
at org.apache.logging.log4j.core.async.AsyncLoggerConfig.logInBackgroundThread(AsyncLoggerConfig.java:159)
at org.apache.logging.log4j.core.async.EventRoute$1.logMessage(EventRoute.java:46)
at org.apache.logging.log4j.core.async.AsyncLoggerConfig.handleQueueFull(AsyncLoggerConfig.java:149)
at org.apache.logging.log4j.core.async.AsyncLoggerConfig.logToAsyncDelegate(AsyncLoggerConfig.java:136)
at org.apache.logging.log4j.core.async.AsyncLoggerConfig.log(AsyncLoggerConfig.java:116)
at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:460)
at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:82)
at org.apache.logging.log4j.core.Logger.log(Logger.java:162)
at org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2190)
at org.apache.logging.log4j.spi.AbstractLogger.logMessageTrackRecursion(AbstractLogger.java:2144)
at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2127)
at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2003)
at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1975)
at org.apache.logging.log4j.spi.AbstractLogger.info(AbstractLogger.java:1312)
省略業務方法堆棧
複製代碼
配置爲 false 的時候,堆棧是這個樣子的:
"Thread-0" #27 [18152] prio=5 os_prio=0 cpu=0.00ms elapsed=5.68s tid=0x000002c1fa120e00 nid=0x46e8 runnable [0x000000eda8efe000]
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@17-loom/Native Method)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17-loom/LockSupport.java:410)
at com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:136)
at com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:105)
at com.lmax.disruptor.RingBuffer.publishEvent(RingBuffer.java:524)
at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueue(AsyncLoggerConfigDisruptor.java:379)
at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueueEvent(AsyncLoggerConfigDisruptor.java:330)
at org.apache.logging.log4j.core.async.AsyncLoggerConfig.logInBackgroundThread(AsyncLoggerConfig.java:159)
at org.apache.logging.log4j.core.async.EventRoute$1.logMessage(EventRoute.java:46)
at org.apache.logging.log4j.core.async.AsyncLoggerConfig.handleQueueFull(AsyncLoggerConfig.java:149)
at org.apache.logging.log4j.core.async.AsyncLoggerConfig.logToAsyncDelegate(AsyncLoggerConfig.java:136)
at org.apache.logging.log4j.core.async.AsyncLoggerConfig.log(AsyncLoggerConfig.java:116)
at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:460)
at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:82)
at org.apache.logging.log4j.core.Logger.log(Logger.java:162)
at org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2190)
at org.apache.logging.log4j.spi.AbstractLogger.logMessageTrackRecursion(AbstractLogger.java:2144)
at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2127)
at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2003)
at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1975)
at org.apache.logging.log4j.spi.AbstractLogger.info(AbstractLogger.java:1312)
複製代碼
咱們來看下當時的消費線程是否有異常,因爲當時硬盤 io 看系統監控並無異常,因此這個線程很大多是 Runnable 的,不斷的在寫入日誌。同時,咱們知道,寫入文件 java 底層基於 native 調用,因此咱們查看 JFR 的 native 方法採集。使用事件查看器中的 Method Profiling Sample Native,右鍵點擊,並選擇「使用所選類型事件建立新頁」:
在建立出來的新頁中,按照線程分組查看,查看 Log4j2 的 disruptor 消費線程,能夠得出下圖:
能夠看出:在出問題的時間段,採集到的寫入日誌文件的事件數量明顯增多不少,而且,經過堆棧能夠看到:
"Log4j2-TF-2-AsyncLoggerConfig-1" #26 [23680] daemon prio=5 os_prio=0 cpu=406.25ms elapsed=2.89s tid=0x0000022d6f3d4080 nid=0x5c80 runnable [0x000000399bbfe000]
java.lang.Thread.State: RUNNABLE
at java.io.FileOutputStream.writeBytes(java.base@17-loom/Native Method)
at java.io.FileOutputStream.write(java.base@17-loom/FileOutputStream.java:365)
at org.apache.logging.log4j.core.appender.OutputStreamManager.writeToDestination(OutputStreamManager.java:261)
- eliminated <0x000000070ee0af40> (a org.apache.logging.log4j.core.appender.rolling.RollingFileManager)
at org.apache.logging.log4j.core.appender.FileManager.writeToDestination(FileManager.java:272)
- eliminated <0x000000070ee0af40> (a org.apache.logging.log4j.core.appender.rolling.RollingFileManager)
at org.apache.logging.log4j.core.appender.rolling.RollingFileManager.writeToDestination(RollingFileManager.java:236)
- eliminated <0x000000070ee0af40> (a org.apache.logging.log4j.core.appender.rolling.RollingFileManager)
at org.apache.logging.log4j.core.appender.OutputStreamManager.flushBuffer(OutputStreamManager.java:293)
- locked <0x000000070ee0af40> (a org.apache.logging.log4j.core.appender.rolling.RollingFileManager)
at org.apache.logging.log4j.core.appender.OutputStreamManager.flush(OutputStreamManager.java:302)
- locked <0x000000070ee0af40> (a org.apache.logging.log4j.core.appender.rolling.RollingFileManager)
at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.directEncodeEvent(AbstractOutputStreamAppender.java:199)
at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.tryAppend(AbstractOutputStreamAppender.java:190)
at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.append(AbstractOutputStreamAppender.java:181)
at org.apache.logging.log4j.core.appender.RollingFileAppender.append(RollingFileAppender.java:312)
at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:156)
at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:129)
at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:120)
at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84)
at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:543)
at org.apache.logging.log4j.core.async.AsyncLoggerConfig.callAppenders(AsyncLoggerConfig.java:127)
at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:502)
at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:485)
at org.apache.logging.log4j.core.async.AsyncLoggerConfig.log(AsyncLoggerConfig.java:121)
at org.apache.logging.log4j.core.async.AsyncLoggerConfig.logToAsyncLoggerConfigsOnCurrentThread(AsyncLoggerConfig.java:169)
at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor$Log4jEventWrapperHandler.onEvent(AsyncLoggerConfigDisruptor.java:111)
at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor$Log4jEventWrapperHandler.onEvent(AsyncLoggerConfigDisruptor.java:97)
at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:168)
at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
at java.lang.Thread.run(java.base@17-loom/Thread.java:1521)
複製代碼
log4j2 調用了不少次 flush, JFR 顯示採集到的事件,每次都調用了 flush。每次調用 flush,都會形成文件真正寫的 native 調用。而消費慢的緣由,其實就是這種 native 調用太多,系統寫入不過來了。
咱們能夠經過如下四個方向解決這個問題:
immediateFlush
爲 false。org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueue
的 java monitor block 事件進行監控,若是發現時間過長或者數量不少的事件則報警或者重建進程immediateFlush
爲 false咱們能夠配置 Appender 的 immediateFlush
爲 false,例如:
<RollingFile name="file" append="true"
filePattern="./app.log-%d{yyyy.MM.dd.HH}"
immediateFlush="false">
<PatternLayout pattern="${logFormat}"/>
<Policies>
<TimeBasedTriggeringPolicy interval="1" modulate="true"/>
</Policies>
<DirectWriteRolloverStrategy maxFiles="72"/>
</RollingFile>
複製代碼
這裏的原理對應源碼:
AbstractOutputStreamAppender.java
protected void directEncodeEvent(final LogEvent event) {
getLayout().encode(event, manager);
//若是配置了 immdiateFlush (默認爲 true)或者當前事件是 EndOfBatch
if (this.immediateFlush || event.isEndOfBatch()) {
manager.flush();
}
}
複製代碼
那麼對於 Log4j2 Disruptor 異步日誌來講,何時 LogEvent
是 EndOfBatch
呢?是在消費到的 index 等於生產發佈到的最大 index 的時候,這也是比較符合性能設計考慮,即在沒有消費完的時候,儘量地不 flush,消費完當前全部的時候再去 flush:
private void processEvents()
{
T event = null;
long nextSequence = sequence.get() + 1L;
while (true)
{
try
{
final long availableSequence = sequenceBarrier.waitFor(nextSequence);
if (batchStartAware != null)
{
batchStartAware.onBatchStart(availableSequence - nextSequence + 1);
}
while (nextSequence <= availableSequence)
{
event = dataProvider.get(nextSequence);
//這裏 nextSequence == availableSequence 就是 EndOfBatch
eventHandler.onEvent(event, nextSequence, nextSequence == availableSequence);
nextSequence++;
}
sequence.set(availableSequence);
}
catch (final TimeoutException e)
{
notifyTimeout(sequence.get());
}
catch (final AlertException ex)
{
if (running.get() != RUNNING)
{
break;
}
}
catch (final Throwable ex)
{
exceptionHandler.handleEventException(ex, nextSequence, event);
sequence.set(nextSequence);
nextSequence++;
}
}
}
複製代碼
這個須要 Java 14 以上的版本
Configuration config = Configuration.getConfiguration("default");
//設置監控的鎖 block 時間超過多少就會採集
config.getSettings().put("jdk.JavaMonitorEnter#threshold", "1s");
try (var es = new RecordingStream(config)) {
es.onEvent("jdk.JavaMonitorEnter", recordedEvent -> {
//若是堆棧包含咱們關注的,則報警
if (recordedEvent.getStackTrace().getFrames().stream().anyMatch(recordedFrame -> recordedFrame.toString().contains("org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueue"))) {
System.out.println("Alarm: " + recordedEvent);
}
});
es.start();
}
複製代碼
微信搜索「個人編程喵」關注公衆號,每日一刷,輕鬆提高技術,斬獲各類offer: