Notes from Troubleshooting a Memory Leak in a Message-Consuming Service

We run an online message-consuming service, xxx-consumer, built on the spring-kafka framework. Its main thread pulls batches of messages produced by the trading system from the Kafka queue, then submits them to a worker thread pool to be processed one by one.

public abstract class AbstractMessageDispatchListener implements
        BatchAcknowledgingMessageListener<String, Msg>, ApplicationListener<ApplicationReadyEvent> {

    private static final Logger logger = LoggerFactory.getLogger(AbstractMessageDispatchListener.class);

    private ThreadPoolExecutor executor;

    public abstract MessageWorker chooseWorker(ConsumerRecord<String, Msg> data);

    @Override
    public void onMessage(List<ConsumerRecord<String, Msg>> datas, Acknowledgment acknowledgment) {
        List<Future<?>> futureList = new ArrayList<>(datas.size());
        try {
            CountDownLatch countDownLatch = new CountDownLatch(datas.size());
            for (ConsumerRecord<String, Msg> data : datas) {
                Future<?> future = executor.submit(new Worker(data, countDownLatch));
                futureList.add(future);
            }

            // Wait up to 18s (20s minus a 2s safety margin) for all workers to finish.
            countDownLatch.await(20000L - 2000, TimeUnit.MILLISECONDS);
            long countDownLatchCount = countDownLatch.getCount();
            if (countDownLatchCount > 0) {
                // Some records did not finish in time; skip the ack so the batch is redelivered.
                return;
            }
            acknowledgment.acknowledge();
        } catch (Exception e) {
            logger.error("onMessage error ", e);
        } finally {
            // Cancel any workers that are still running.
            for (Future<?> future : futureList) {
                if (future.isDone() || future.isCancelled()) {
                    continue;
                }
                future.cancel(true);
            }
        }
    }

    @Override
    public void onApplicationEvent(ApplicationReadyEvent event) {
        ThreadFactoryBuilder builder = new ThreadFactoryBuilder();
        builder.setNameFormat(this.getClass().getSimpleName() + "-pool-%d");
        builder.setDaemon(false);
        executor = new ThreadPoolExecutor(12,
                12 * 2,
                60L,
                TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(100),
                builder.build());
    }

    private class Worker implements Runnable {
        private final ConsumerRecord<String, Msg> data;
        private final CountDownLatch countDownLatch;

        Worker(ConsumerRecord<String, Msg> data, CountDownLatch countDownLatch) {
            this.data = data;
            this.countDownLatch = countDownLatch;
        }

        @Override
        public void run() {
            try {
                MessageWorker worker = chooseWorker(data);
                worker.work(data.value());
            } finally {
                countDownLatch.countDown();
            }
        }
    }
}

 

1. Problem Background

One morning the xxx-consumer service fired a large number of alerts. Manual inspection found 300k+ unprocessed messages. The business logs looked normal, but the GC logs showed a large number of full GCs, so our initial read was that full GC was slowing message processing and causing the backlog.

 

2. Heap Analysis

We reviewed the JVM memory metrics for the past month and found that old-generation memory was never being reclaimed (the drop on September 22 was due to a deployment restart), so we suspected a memory leak.

We exported a heap snapshot with the command <jmap -dump:format=b,file=/home/work/app/xxx-consumer/logs/jmap_dump.hprof -F <pid>> and opened jmap_dump.hprof in Memory Analyzer (MAT), which flagged an obvious leak suspect:
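As an aside, the same kind of snapshot can be taken in-process. Below is a minimal sketch (the HeapDumper class name and file path are our own; the MBean name and dumpHeap signature are the JDK's standard com.sun.management API, equivalent to what jmap does):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    // Dumps the heap to the given .hprof file; live=true restricts the dump
    // to reachable objects (similar to jmap's :live option).
    public static void dump(String path, boolean live) throws Exception {
        HotSpotDiagnosticMXBean mxBean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        mxBean.dumpHeap(path, live);
    }

    public static void main(String[] args) throws Exception {
        dump("jmap_dump.hprof", true);
    }
}
```

Note that dumpHeap fails if the target file already exists, so pick a fresh path each time.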

Drilling into the thread details, we found a huge number of ThreadLocalScope objects chained together by circular references:

 

We also noticed the FakeSpan class from the distributed-tracing (dd-trace-java) jar, so our initial judgment was that the in-house Kafka plugin extension in dd-trace-java had a memory-leak bug.

 

3. Code Analysis

Reading the Kafka plugin code in dd-trace-java, its processing flow is as follows:

First batch of messages

  1. (SpringKafkaConsumerInstrumentation:L22) When BatchAcknowledgingMessageListener.onMessage is entered, the main thread creates scope00 = ThreadLocalScope(Type_BatchMessageListener_Value, toRestore=null)

  2. (ExecutorInstrumentation:L21, L47) When the messages are submitted to the thread pool, the worker thread creates scope10 = ThreadLocalScope(Type_BatchMessageListener_Value, toRestore=null)

  3. (SpringKafkaConsumerInstrumentation:L68) When the worker thread reads the message (ConsumerRecord.value), it creates scope11 = ThreadLocalScope(Type_ConsumberRecord_Value, toRestore=scope10)

  4. (ExecutorInstrumentation:L54) After the worker thread finishes the message, it calls scope10.close(); but scopeManager.tlsScope.get() == scope11, so the guard at ThreadLocalScope:L19 fires, and neither scope10 nor scope11 can be GCed

  5. (SpringKafkaConsumerInstrumentation:L42) When BatchAcknowledgingMessageListener.onMessage exits, the main thread calls scope00.close(), and scope00 can be GCed

Second batch of messages

  1. (SpringKafkaConsumerInstrumentation:L22) When BatchAcknowledgingMessageListener.onMessage is entered, the main thread creates scope01 = ThreadLocalScope(Type_BatchMessageListener_Value, toRestore=null)

  2. (ExecutorInstrumentation:L21, L47) When the messages are submitted to the thread pool, the worker thread creates scope12 = ThreadLocalScope(Type_BatchMessageListener_Value, toRestore=scope11)

  3. (SpringKafkaConsumerInstrumentation:L68) When the worker thread reads the message (ConsumerRecord.value), it creates scope13 = ThreadLocalScope(Type_ConsumberRecord_Value, toRestore=scope12)

  4. (ExecutorInstrumentation:L54) After the worker thread finishes the message, it calls scope12.close(); but scopeManager.tlsScope.get() == scope13, so the guard at ThreadLocalScope:L19 fires, and neither scope12 nor scope13 can be GCed

  5. (SpringKafkaConsumerInstrumentation:L42) When BatchAcknowledgingMessageListener.onMessage exits, the main thread calls scope01.close(), and scope01 can be GCed

 As shown above, the ThreadLocalScope instances created on the main thread are GCed correctly, while those created on the pooled threads are pinned by the toRestore reference chain and can never be collected, which is the memory leak.
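The faulty pattern can be reproduced with a minimal sketch. The classes below are simplified stand-ins modeled on opentracing-java's ThreadLocalScope, not the actual dd-trace-java code; the guard in close() plays the role of ThreadLocalScope:L19:

```java
// Minimal model of a thread-local scope stack whose close() refuses to
// restore when called out of order, so the chain of scopes keeps growing.
public class LeakDemo {
    static class ScopeManager {
        final ThreadLocal<Scope> tlsScope = new ThreadLocal<>();
        Scope activate() { return new Scope(this); }
    }

    static class Scope implements AutoCloseable {
        final ScopeManager manager;
        final Scope toRestore; // previously active scope; forms a chain

        Scope(ScopeManager manager) {
            this.manager = manager;
            this.toRestore = manager.tlsScope.get();
            manager.tlsScope.set(this);
        }

        @Override
        public void close() {
            // Guard analogous to ThreadLocalScope:L19: if we are not the
            // active scope, bail out without restoring anything.
            if (manager.tlsScope.get() != this) {
                return;
            }
            manager.tlsScope.set(toRestore);
        }

        int chainDepth() {
            int depth = 1;
            for (Scope s = toRestore; s != null; s = s.toRestore) depth++;
            return depth;
        }
    }

    public static void main(String[] args) {
        ScopeManager mgr = new ScopeManager();

        // Batch 1 on a pooled worker thread: scope10, then scope11 on top.
        Scope scope10 = mgr.activate();
        Scope scope11 = mgr.activate();
        scope10.close();                // out of order: guard fires, no-op
        // scope11 stays active and pins scope10 via toRestore.

        // Batch 2 on the same pooled thread: the chain keeps growing.
        Scope scope12 = mgr.activate(); // toRestore == scope11
        Scope scope13 = mgr.activate(); // toRestore == scope12
        scope12.close();                // out of order again: no-op

        System.out.println(mgr.tlsScope.get().chainDepth()); // prints 4
    }
}
```

Each batch handled by the pooled thread adds two more scopes to the chain and removes none, so the chain (and everything each scope references) grows without bound.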

 

4. Conclusion

RecoredValueAdvice does not destroy the objects it creates itself; instead it relies on BatchMessageListenerAdvice to destroy them.

But (SpringKafkaConsumerInstrumentation:L27) when BatchAcknowledgingMessageListener.onMessage exits, only the ThreadLocalScope created on the main thread is closed, never the ones created on the pooled threads. The worker-thread scopes therefore stay chained together by their circular references and cannot be GCed, causing the memory leak.
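A hedged sketch of the fix, with illustrative names rather than the actual dd-trace-java patch: whichever advice activates a scope should close that same scope in a finally block on the same thread, so scopes always unwind in LIFO order and the guard in close() never short-circuits:

```java
// Each frame opens and closes its own scope (try-with-resources enforces
// LIFO order), so the thread-local always unwinds back to its prior value.
public class ScopeFixSketch {
    static final ThreadLocal<Scope> TLS = new ThreadLocal<>();

    static final class Scope implements AutoCloseable {
        final Scope toRestore = TLS.get();
        Scope() { TLS.set(this); }
        // Unconditional restore is safe because LIFO order is guaranteed.
        @Override public void close() { TLS.set(toRestore); }
    }

    static void processRecord(Runnable work) {
        try (Scope scope = new Scope()) { // opened and closed in the same frame
            work.run();
        }
    }

    public static void main(String[] args) {
        processRecord(() -> processRecord(() -> {}));
        System.out.println(TLS.get()); // prints null: nothing left pinned
    }
}
```

With this discipline, nothing created on a pooled thread outlives the batch that created it, regardless of how many batches the thread processes.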
