An Account of Resolving a Production OOM (Configuration Center)

Date: 2019-12-05


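Background: As Best Diamond matured and was promoted more widely, more and more internal projects adopted it for centralized configuration management. Once it met their expectations in their own test environments, they gradually rolled it out to production.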

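Incident: During a production release of one of the company's core internal systems, the Best Diamond client logged that it had failed to connect to the server, so the project's configuration files could not be pulled in time. (After pulling configuration from the server, the configuration center client stores it on the local machine. When the server's configuration cannot be pulled because of a network problem or for any other reason, the client automatically reads the existing local copy, so the project still started normally. That was luck, though: had the project changed a variable's value on the Best Diamond server, the consequences could have been severe.)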

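Investigation:

In the logging platform we looked at the Best Diamond server logs and found that one server instance had logged: OutOfMemoryError: GC overhead limit exceeded. On seeing this error we were fairly sure that a memory leak was the cause.

Temporary fix: Because this was production, we restarted the server immediately so that users' production environments could keep working normally. (We did not dump the memory before restarting, which was a serious operational mistake and made the later investigation harder.)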

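Cause analysis:

1. Why had this problem never appeared in the test environment? The test environment has far more client connections than production, so what differs between the two?

2. Why had production run for more than a year without this problem, only for it to appear now?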

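Further investigation: We reviewed the commit history and the production release log and pulled the code for the live version (note: several people maintain this code). As mentioned above, the memory at the time of the incident had not been dumped to a file (a lesson for the future), which made the analysis harder, but we could not ignore the problem or wait for the next outage to investigate. Since we suspected a memory leak, there had to be a case the code failed to handle, so we dumped the heap of another production server running the same version with: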

jmap -dump:live,format=b,file=dump.hprof 24971    # 24971 is the server's PID
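The same kind of live-object dump can also be triggered from inside the JVM, which can help when nobody is available to run jmap before a restart. The following is a minimal sketch, assuming a HotSpot-based JVM; the class name HeapDumper and the method dumpLiveHeap are hypothetical and are not part of Best Diamond.

```java
import com.sun.management.HotSpotDiagnosticMXBean;

import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;

// Hypothetical helper (not from the article): writes a heap dump of the
// current JVM, similar in spirit to `jmap -dump:live,format=b,file=...`.
final class HeapDumper {

    private HeapDumper() {
    }

    // Dumps only live (reachable) objects to the given file.
    // The target must not exist yet; recent JDKs also require a ".hprof" suffix.
    static File dumpLiveHeap(File target) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            // second argument true = include only live objects
            bean.dumpHeap(target.getAbsolutePath(), true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target;
    }
}
```

A call such as HeapDumper.dumpLiveHeap(new File("/tmp/diamond-server.hprof")) could be wired to an admin endpoint or a shutdown script; the path here is only an example. The resulting file opens in jvisualvm just like a jmap dump.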

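We then analyzed the dumped heap file with jvisualvm:

The analysis showed that String, Date, and ClientInfo each had more than a million instances (char[] shows up because String is implemented internally with a char[]; judging by the instance counts, we ruled out the program using char[] on its own), together occupying 60% of the JVM's memory. We therefore suspected that ClientInfo instances were leaking and looked at the code: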

package com.best.diamond.model.netty;

import java.util.Date;

public class ClientInfo {

    private String address;

    private Date connectTime;

    public ClientInfo(String address, Date connectTime) {
        this.address = address;
        this.connectTime = connectTime;
    }

    public String getAddress() {
        return address;
    }

    public void setAddress(String address) {
        this.address = address;
    }

    public Date getConnectTime() {
        return connectTime;
    }

    public void setConnectTime(Date connectTime) {
        this.connectTime = connectTime;
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result
                + ((address == null) ? 0 : address.hashCode());
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        ClientInfo other = (ClientInfo) obj;
        if (address == null) {
            if (other.address != null)
                return false;
        }
        else if (!address.equals(other.address))
            return false;
        return true;
    }

}

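Sure enough, ClientInfo has two fields, one String and one Date, so we concluded that ClientInfo was the culprit behind the memory leak.

Reading further, we found that ClientInfo is used in many places. After checking and ruling them out one by one, we narrowed it down to DiamondServerHandler, our own implementation of a Netty ChannelHandler. The main code of the class is as follows: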

@Sharable
public class DiamondServerHandler extends SimpleChannelInboundHandler<String> {

    private final static String HEARTBEAT = "heartbeat";

    private final static String DIAMOND = "bestdiamond=";

    private final static Logger logger = LoggerFactory.getLogger(DiamondServerHandler.class);

    private final static Charset CHARSET = Charset.forName("UTF-8");

    public static ConcurrentHashMap<ClientKey, List<ClientInfo>> clients = new ConcurrentHashMap<>();

    private ConcurrentHashMap<String /*client address*/, ChannelHandlerContext> channels = new ConcurrentHashMap<String, ChannelHandlerContext>();

    @Autowired
    private ProjectConfigService projectConfigService;

    @Autowired
    private ProjectModuleService projectModuleService;

            modules = new ArrayList<>(Arrays.asList(StringUtils.split(facet.getModules(), Constants.COMMA)));
        }
        if (CollectionUtils.isEmpty(modules)) {
            throw new ServiceException(StatusCode.ILLEGAL_ARGUMENT, "Module has not been maintained yet.");
        }
        ClientKey key = new ClientKey();
        key.setProjCode(facet.getProjCode());
        key.setProfile(facet.getProfile());
        key.setModules(modules);
        List<ClientInfo> addrs = clients.get(key);
        if (addrs == null) {
            addrs = new ArrayList<>();
        }
        String clientAddress = ctx.channel().remoteAddress().toString().substring(1);
        ClientInfo clientInfo = new ClientInfo(clientAddress, new Date());
        addrs.add(clientInfo);
        clients.put(key, addrs);
        channels.put(clientAddress, ctx);
        return facet;
    }

    @Override
    }

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception {
        ctx.close();
    }

    @Override
    public void channelInactive(ChannelHandlerContext ctx) throws Exception {
        super.channelInactive(ctx);
        String address = ctx.channel().remoteAddress().toString();
        channels.remove(address);
        String watchConfigs = removeClientConnectionInfo(ctx);
        removeConfigWatchNode(watchConfigs, address);
        for (List<ClientInfo> infoList : clients.values()) {
            for (ClientInfo client : infoList) {
                if (address.equals(client.getAddress())) {
                    infoList.remove(client);
                    break;
                }
            }
        }
        logger.info(ctx.channel().remoteAddress() + " 斷開鏈接。");
    }

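The code shows that ClientInfo is where the server stores some information about each client: an instance is created when a client connects to the server and is meant to be released when the connection closes. Attentive readers will have noticed that the address value used as the key for storing and removing ClientInfo is handled inconsistently. When storing: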

String clientAddress = ctx.channel().remoteAddress().toString().substring(1);

When removing:

String address = ctx.channel().remoteAddress().toString();

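As a result, the ClientInfo of a disconnected client is never removed, and that is where the memory leaks. (Note: DiamondServerHandler carries Netty's @Sharable annotation, so it is shared by all channels; the reference to it never becomes invalid, so it is never garbage collected.)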
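The mismatch is easy to reproduce outside Netty. The sketch below is illustrative only: AddressKeyDemo, keyOf, and the sample address 10.0.0.1:5000 are hypothetical names and values, and a plain map of strings stands in for the handler's collections. It stores an entry under the address with the first character stripped, tries to remove it under the unstripped address, and then repeats the exercise with one shared helper that derives the key.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical demo class (not from the article).
final class AddressKeyDemo {

    private AddressKeyDemo() {
    }

    // Single place that derives the key; mirrors the article's substring(1).
    static String keyOf(InetSocketAddress remote) {
        return remote.toString().substring(1);
    }

    // Sample remote address built from a raw IP, so no DNS lookup is involved.
    static InetSocketAddress sampleAddress() {
        try {
            InetAddress ip = InetAddress.getByAddress(new byte[] {10, 0, 0, 1});
            return new InetSocketAddress(ip, 5000);
        } catch (UnknownHostException e) {
            throw new IllegalStateException(e);
        }
    }

    // Store with the stripped key, remove with the unstripped one (the bug).
    static int entriesLeftWithMismatchedKeys() {
        Map<String, String> registry = new ConcurrentHashMap<>();
        InetSocketAddress remote = sampleAddress();
        registry.put(remote.toString().substring(1), "client");
        registry.remove(remote.toString()); // different string, nothing removed
        return registry.size();
    }

    // Store and remove through the same helper (the fix).
    static int entriesLeftWithSharedKey() {
        Map<String, String> registry = new ConcurrentHashMap<>();
        InetSocketAddress remote = sampleAddress();
        registry.put(keyOf(remote), "client");
        registry.remove(keyOf(remote));
        return registry.size();
    }

    public static void main(String[] args) {
        System.out.println("mismatched keys leave " + entriesLeftWithMismatchedKeys() + " entry");
        System.out.println("shared key leaves " + entriesLeftWithSharedKey() + " entries");
    }
}
```

Deriving the key in one helper means the connect and disconnect paths cannot drift apart again, which is the essence of the one-line fix in the handler.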

The adjusted code:

@Sharable
public class DiamondServerHandler extends SimpleChannelInboundHandler<String> {

    private final static String HEARTBEAT = "heartbeat";

    private final static String DIAMOND = "bestdiamond=";

    private final static Logger logger = LoggerFactory.getLogger(DiamondServerHandler.class);

    private final static Charset CHARSET = Charset.forName("UTF-8");

    private static Object locker = new Object();

    public static ConcurrentHashMap<ClientKey, List<ClientInfo>> clients = new ConcurrentHashMap<>();

    private ConcurrentHashMap<String /*client address*/, ChannelHandlerContext> channels = new ConcurrentHashMap<String, ChannelHandlerContext>();

    @Autowired
    private ProjectConfigService projectConfigService;

    @Autowired
    private ProjectModuleService projectModuleService;

            modules = new ArrayList<>(Arrays.asList(StringUtils.split(facet.getModules(), Constants.COMMA)));
        }
        if (CollectionUtils.isEmpty(modules)) {
            throw new ServiceException(StatusCode.ILLEGAL_ARGUMENT, "Module has not been maintained yet.");
        }
        ClientKey key = new ClientKey();
        key.setProjCode(facet.getProjCode());
        key.setProfile(facet.getProfile());
        key.setModules(modules);
        List<ClientInfo> addrs = clients.get(key);
        synchronized (locker) {
            if (null == addrs) {
                addrs = new ArrayList<>();
            }
        }
        String clientAddress = ctx.channel().remoteAddress().toString().substring(1);
        ClientInfo clientInfo = new ClientInfo(clientAddress, new Date());
        addrs.add(clientInfo);
        clients.put(key, addrs);
        channels.put(clientAddress, ctx);
        return facet;
    }

    @Override
    }

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception {
        ctx.close();
    }

    @Override
    public void channelInactive(ChannelHandlerContext ctx) throws Exception {
        super.channelInactive(ctx);
        String address = ctx.channel().remoteAddress().toString().substring(1);
        channels.remove(address);
        String watchConfigs = removeClientConnectionInfo(ctx);
        removeConfigWatchNode(watchConfigs, address);
        for (List<ClientInfo> infoList : clients.values()) {
            for (ClientInfo client : infoList) {
                if (address.equals(client.getAddress())) {
                    infoList.remove(client);
                    break;
                }
            }
        }
        logger.info(ctx.channel().remoteAddress() + " 斷開鏈接。");
    }

After the change we released and tested it, and the problem was resolved.
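One thing the adjusted code may still leave open is thread safety: the synchronized block only guards the creation of a local list, so two channels registering under the same key at the same moment could each create a list and one could overwrite the other, and the ArrayList itself is mutated from several event-loop threads. The following is a possible hardening, not the author's fix. ClientRegistry is a hypothetical class, with a String group key standing in for ClientKey and a String address standing in for ClientInfo.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch (not from the article): a registry that is safe to
// update from many threads without an external lock.
final class ClientRegistry {

    private final ConcurrentMap<String, List<String>> clients = new ConcurrentHashMap<>();

    // computeIfAbsent creates the list atomically, at most once per key,
    // so concurrent registrations can never replace each other's list.
    void register(String groupKey, String address) {
        clients.computeIfAbsent(groupKey, k -> new CopyOnWriteArrayList<>()).add(address);
    }

    // removeIf on CopyOnWriteArrayList is atomic per list and does not
    // disturb threads that are iterating the same list.
    void unregister(String address) {
        for (List<String> addresses : clients.values()) {
            addresses.removeIf(address::equals);
        }
    }

    // Number of addresses currently registered under the group key.
    int size(String groupKey) {
        List<String> addresses = clients.get(groupKey);
        return addresses == null ? 0 : addresses.size();
    }
}
```

CopyOnWriteArrayList copies its backing array on every write, so it suits lists that are read far more often than they change. With clients reconnecting very frequently, a concurrent set such as ConcurrentHashMap.newKeySet() may be the better trade-off.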

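That is not the end of the story, though. Even with this leak, more than a million instances still seemed too many. Further investigation confirmed that a legacy setup was responsible: early on we used the company's HA to make the server highly available, and the HA proxy machine dropped the connection and reconnected every minute. The operations team told us this is one of HA's mechanisms.

The pitfall here: we chose Netty precisely to maintain long-lived connections between the server and its clients, and this HA mechanism works against that. We later adjusted the architecture so that client connections are no longer proxied through the front-end HA, which is also a more reasonable design, and that finally resolved the problem.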

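Summary:

1. When an OOM occurs, the cause is in the code in the vast majority of cases. Before anything else, dump a memory snapshot of the JVM at that moment so the problem can be located.

2. Using tools to analyze the dumped memory file makes the investigation more efficient. jvisualvm is only the most basic tool here; we later used other visual tools to resolve other outages (to be written up later).

3. Once the problem has been located and fixed, you still need to find the colleague who committed the code and make sure they take note.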