Troubleshooting a RocketMQ Delayed-Message Failure [Field Notes]

Contents

1. Problem Description
2. Temporary Workaround
3. Root Cause Analysis
   1. Searching the Logs
   2. Querying the Message
   3. A Quick Recap of the Mechanism
   4. The Source Behind the Error
4. The Fix
5. Is That the End?

1. Problem Description

The community edition of RocketMQ supports 18 delay levels, and a message at each level is consumed precisely at its scheduled time. We had even tested specifically whether the delivery intervals were accurate, and the results showed they were. Yet this normally dependable feature broke: a business colleague reported that delayed messages on one production cluster were never consumed, while the development and test environments were fine. Every environment runs the same version, RocketMQ 4.5.2. Strange!
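As a refresher, this is roughly how the business side produces such a delayed message. The sketch below is only illustrative: the producer group and name server address are placeholders, while the topic and delay level (DELAY=4, i.e. 30s in the default level table) match the message queried later in this post.

// Minimal sketch of sending a delayed message (placeholder group / name server).
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.client.producer.SendResult;
import org.apache.rocketmq.common.message.Message;

public class DelayProducerDemo {
    public static void main(String[] args) throws Exception {
        DefaultMQProducer producer = new DefaultMQProducer("delay_producer_group"); // hypothetical group
        producer.setNamesrvAddr("x.x.x.x:9876");                                    // placeholder address
        producer.start();

        Message msg = new Message("melon_online_test", "hello delay".getBytes());
        // Delay level 4 = 30s in the default table "1s 5s 10s 30s 1m 2m ... 2h"
        msg.setDelayTimeLevel(4);
        SendResult result = producer.send(msg);
        System.out.printf("msgId=%s status=%s%n", result.getMsgId(), result.getSendStatus());

        producer.shutdown();
    }
}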

2. Temporary Workaround

We moved the business's topic and consumer over to another production cluster, and delayed messages were consumed normally there.

3. Root Cause Analysis

1. Searching the Logs

We found storeerror.log constantly flooded with the entries below, and noticed that the requested offset was far beyond the size of what was actually stored!

2020-03-18 16:38:06 WARN ScheduleMessageTimerThread - Offset not matched. Request offset: 509548460, firstOffset: 0, lastOffset: 6000000, mappedFileSize: 6000000, mappedFiles count: 1
2020-03-18 16:38:06 WARN ScheduleMessageTimerThread - Offset not matched. Request offset: 665682360, firstOffset: 0, lastOffset: 6000000, mappedFileSize: 6000000, mappedFiles count: 1
2020-03-18 16:38:07 WARN ScheduleMessageTimerThread - Offset not matched. Request offset: 509548460, firstOffset: 0, lastOffset: 6000000, mappedFileSize: 6000000, mappedFiles count: 1
2020-03-18 16:38:07 WARN ScheduleMessageTimerThread - Offset not matched. Request offset: 665682360, firstOffset: 0, lastOffset: 6000000, mappedFileSize: 6000000, mappedFiles count: 1
2020-03-18 16:38:07 WARN ScheduleMessageTimerThread - Offset not matched. Request offset: 509548460, firstOffset: 0, lastOffset: 6000000, mappedFileSize: 6000000, mappedFiles count: 1
2020-03-18 16:38:07 WARN ScheduleMessageTimerThread - Offset not matched. Request offset: 665682360, firstOffset: 0, lastOffset: 6000000, mappedFileSize: 6000000, mappedFiles count: 1
2020-03-18 16:38:07 WARN ScheduleMessageTimerThread - Offset not matched. Request offset: 509548460, firstOffset: 0, lastOffset: 6000000, mappedFileSize: 6000000, mappedFiles count: 1
2. Querying the Message

Querying the delayed message that had been sent showed that it was stored successfully, and that it was sitting in the internal topic "SCHEDULE_TOPIC_XXXX".

bin/mqadmin queryMsgById -n x.x.x.x:9876  -i 0A6F160300002A9F000000BACA44720C
RocketMQLog:WARN No appenders could be found for logger (io.netty.util.internal.PlatformDependent0).
RocketMQLog:WARN Please initialize the logger system properly.
OffsetID: 0A6F160300002A9F000000BACA44720C
OffsetID: 0A6F160300002A9F000000BACA44720C
Topic: SCHEDULE_TOPIC_XXXX
Tags: [null]
Keys: [rocketmq.key.-1351724170]
Queue ID: 3
Queue Offset: 142
CommitLog Offset: 802257400332
Reconsume Times: 0
Born Timestamp: 2020-03-18 15:27:37,512
Store Timestamp: 2020-03-18 15:27:37,513
Born Host: 10.x.x.x:56650
Store Host: 10.x.x.x:10911
System Flag: 0
Properties: {REAL_TOPIC=melon_online_test, KEYS=rocketmq.key.-1351724170, UNIQ_KEY=0A6F15AB6305070DEA4E5ADD60280023, WAIT=true, DELAY=4, REAL_QID=1}
Message Body Path: /tmp/rocketmq/msgbodys/0A6F15AB6305070DEA4E5ADD60280023
org.apache.rocketmq.client.exception.MQClientException: CODE: 17 DESC: No topic route info in name server for the topic: SCHEDULE_TOPIC_XXXX
See http://rocketmq.apache.org/docs/faq/ for further details.
at org.apache.rocketmq.client.impl.MQClientAPIImpl.getTopicRouteInfoFromNameServer(MQClientAPIImpl.java:1351)
at org.apache.rocketmq.client.impl.MQClientAPIImpl.getTopicRouteInfoFromNameServer(MQClientAPIImpl.java:1321)
at org.apache.rocketmq.tools.admin.DefaultMQAdminExtImpl.examineTopicRouteInfo(DefaultMQAdminExtImpl.java:305)
at org.apache.rocketmq.tools.admin.DefaultMQAdminExtImpl.queryTopicConsumeByWho(DefaultMQAdminExtImpl.java:619)
at org.apache.rocketmq.tools.admin.DefaultMQAdminExtImpl.messageTrackDetail(DefaultMQAdminExtImpl.java:776)
at org.apache.rocketmq.tools.admin.DefaultMQAdminExt.messageTrackDetail(DefaultMQAdminExt.java:435)
at org.apache.rocketmq.tools.command.message.QueryMsgByIdSubCommand.printMsg(QueryMsgByIdSubCommand.java:145)
at org.apache.rocketmq.tools.command.message.QueryMsgByIdSubCommand.queryById(QueryMsgByIdSubCommand.java:49)
at org.apache.rocketmq.tools.command.message.QueryMsgByIdSubCommand.execute(QueryMsgByIdSubCommand.java:252)
at org.apache.rocketmq.tools.command.MQAdminStartup.main0(MQAdminStartup.java:138)
at org.apache.rocketmq.tools.command.MQAdminStartup.main(MQAdminStartup.java:89)
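(The trailing exception only means the admin tool failed when it went on to look up which consumers subscribe to SCHEDULE_TOPIC_XXXX, for which the name server has no route; the message itself was found and printed.) For reference, the same lookup can also be done programmatically through the admin API instead of mqadmin. A minimal sketch, assuming the same placeholder name server address as above:

// Minimal sketch: fetch the delayed message by its offset message ID via the admin API.
import org.apache.rocketmq.common.message.MessageExt;
import org.apache.rocketmq.tools.admin.DefaultMQAdminExt;

public class ViewDelayedMessage {
    public static void main(String[] args) throws Exception {
        DefaultMQAdminExt admin = new DefaultMQAdminExt();
        admin.setNamesrvAddr("x.x.x.x:9876"); // placeholder address
        admin.start();
        try {
            MessageExt msg = admin.viewMessage("0A6F160300002A9F000000BACA44720C");
            // For a pending delayed message the topic is SCHEDULE_TOPIC_XXXX, while the
            // real topic and queue are kept in the REAL_TOPIC / REAL_QID properties.
            System.out.printf("topic=%s, delayLevel=%s, realTopic=%s%n",
                msg.getTopic(),
                msg.getProperty("DELAY"),
                msg.getProperty("REAL_TOPIC"));
        } finally {
            admin.shutdown();
        }
    }
}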
3. A Quick Recap of the Mechanism

When a delayed message is stored, its topic is replaced with "SCHEDULE_TOPIC_XXXX". The broker sets up a timer task for each delay level; when messages of a level become due, their original topic is restored so that consumers can receive them. Given the logs above, plus the query result showing the message already stored under "SCHEDULE_TOPIC_XXXX", we can conclude that the scheduling step is what failed: the requested offset far exceeded the maximum offset, the lookup kept erroring out, and the original topic was never restored.
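For orientation, here is a rough sketch of that mapping and restoration step (simplified, with illustrative names, not the broker's exact code): delay level N is backed by queue N-1 of SCHEDULE_TOPIC_XXXX, and when a message is due its original topic and queue id are recovered from the REAL_TOPIC / REAL_QID properties visible in the query output above.

// Simplified sketch of the per-level scheduling idea; names are illustrative.
import org.apache.rocketmq.common.message.MessageExt;

public class ScheduleSketch {

    // In the broker, delay level N is backed by queue N-1 of SCHEDULE_TOPIC_XXXX.
    static int delayLevelToQueueId(int delayLevel) {
        return delayLevel - 1;
    }

    // When a message is due, its real topic and queue id are restored from the
    // properties stashed at store time (REAL_TOPIC / REAL_QID).
    static void restoreRealTopic(MessageExt msg) {
        String realTopic = msg.getProperty("REAL_TOPIC"); // e.g. melon_online_test
        String realQueueId = msg.getProperty("REAL_QID"); // e.g. 1
        msg.setTopic(realTopic);
        msg.setQueueId(Integer.parseInt(realQueueId));
        // The rebuilt message is then written back to the commit log so that
        // ordinary consumers of the real topic can see it.
    }
}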

4. The Source Behind the Error
public MappedFile findMappedFileByOffset(final long offset, final boolean returnFirstOnNotFound) {
    try {
        MappedFile firstMappedFile = this.getFirstMappedFile();
        MappedFile lastMappedFile = this.getLastMappedFile();
        if (firstMappedFile != null && lastMappedFile != null) {
            // This is where the warning comes from: every lookup arrives with
            // offset=509548460 even though the total size is only 6000000,
            // so the range check fails each time and null is returned.
            if (offset < firstMappedFile.getFileFromOffset() || offset >= lastMappedFile.getFileFromOffset() + this.mappedFileSize) {
                LOG_ERROR.warn("Offset not matched. Request offset: {}, firstOffset: {}, lastOffset: {}, mappedFileSize: {}, mappedFiles count: {}",
                    offset,
                    firstMappedFile.getFileFromOffset(),
                    lastMappedFile.getFileFromOffset() + this.mappedFileSize,
                    this.mappedFileSize,
                    this.mappedFiles.size());
            } else {
                // ...
            }
        }
    } catch (Exception e) {
        log.error("findMappedFileByOffset Exception", e);
    }

    return null;
}

Call site: ScheduleMessageService#executeOnTimeup

// Fetch the message content from the commit log by its physical offset
MessageExt msgExt = ScheduleMessageService.this.defaultMessageStore.lookMessageByOffset(
    offsetPy, sizePy);

// ...

// After processing, advance the logical offset for this delay level
nextOffset = offset + (i / ConsumeQueue.CQ_STORE_UNIT_SIZE);
ScheduleMessageService.this.updateOffset(this.delayLevel, nextOffset);

Here offsetPy comes from the ConsumeQueue; delayOffset.json, in turn, persists the scheduling progress (the logical offset) for each delay level.
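To make the offset arithmetic concrete: each ConsumeQueue entry is a fixed 20-byte unit (8-byte commit-log offset, 4-byte size, 8-byte tagsCode, which for SCHEDULE_TOPIC_XXXX holds the delivery timestamp), and the logical offset persisted in delayOffset.json advances by one per unit. A simplified sketch, with illustrative names, of how the timer walks one slice:

// Simplified sketch of walking a delay-level ConsumeQueue slice.
// Only CQ_STORE_UNIT_SIZE and the nextOffset formula mirror the real code;
// the method name and buffer handling are illustrative.
import java.nio.ByteBuffer;

public class ConsumeQueueScanSketch {
    static final int CQ_STORE_UNIT_SIZE = 20; // 8 + 4 + 8 bytes per entry

    // 'buffer' stands for a ConsumeQueue slice (assumed to be a multiple of 20 bytes);
    // 'offset' is the logical offset loaded from delayOffset.json for this level.
    static long scan(ByteBuffer buffer, long offset) {
        final int size = buffer.remaining();
        int i = 0;
        for (; i < size; i += CQ_STORE_UNIT_SIZE) {
            long offsetPy = buffer.getLong();   // physical commit-log offset of the message
            int sizePy = buffer.getInt();       // stored message size
            long tagsCode = buffer.getLong();   // for SCHEDULE_TOPIC_XXXX: delivery timestamp
            // ... lookMessageByOffset(offsetPy, sizePy) and re-deliver when due ...
        }
        // Logical progress advances by one per 20-byte unit; this is what
        // updateOffset() persists into delayOffset.json.
        return offset + (i / CQ_STORE_UNIT_SIZE);
    }
}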

4. The Fix

將"delayOffset.json"和"consumequeue/SCHEDULE_TOPIC_XXXX"移到其餘目錄,至關於刪除;逐臺重啓broker節點。重啓結束後,通過驗證,延遲消息功能正常發送和消費。this

5. Is That the End?

At this point you may still be wondering: how could the offset ever grow as large as 665682360? After all, the offset is incremented entirely by the broker's own scheduling; it is not reported by clients. I recall RocketMQ hitting a similar problem two years ago, back on version 4.1, when delayed messages stopped working in the test environment. The issue is rare, the kind you might not see even once a year, and I will bring it to the community for discussion later.



This article was originally shared on the WeChat official account 瓜農老梁 (gh_01130ae30a83).
