一個神奇的bug：OOM？優雅終止線程？系統內存佔用較高？

時間 2021-06-01

標籤 java linux sql apache segmentfault api 數組安全 app jvm 欄目 Java 简体版

原文原文鏈接

摘要：該項目是DAYU平臺的數據開發（DLF），數據開發中一個重要的功能就是ETL（數據清洗）。ETL由源端到目的端，中間的業務邏輯通常由用戶本身編寫的SQL模板實現，velocity是其中涉及的一種模板語言。

Velocity之OOM

Velocity的基本使用

Velocity模板語言的基本使用代碼以下：java

1. 初始化模板引擎

2. 獲取模板文件

3. 設置變量

4. 輸出

在ETL業務中，Velocity模板的輸出是用戶的ETL SQL語句集，至關於.sql文件。這裏官方提供的api須要傳入一個java.io.Writer類的對象用於存儲模板的生成的SQL語句集。而後，這些語句集會根據咱們的業務作SQL語句的拆分，逐個執行。linux

java.io.Writer類是一個抽象類，在JDK1.8中有多種實現，包括但不只限於如下幾種：sql

因爲雲環境對用戶文件讀寫建立等權限的安全性要求比較苛刻，所以，咱們使用了java.io.StringWriter，其底層是StringBuffer對象，StringBuffer底層是char數組。apache

簡單模板Hellovelocity.vm：segmentfault

; "複製代碼")api

set($iAMVariable = 'good!')

set($person.password = '123')

Welcome ${name} to velocity.com
today is ${date}數組

foreach($one in $list)

$one

end

Name: ${person.name}
Password: ${person.password}安全

; "複製代碼")app

HelloVelocity.java

; "複製代碼")jvm

package com.xlf;

import org.apache.velocity.Template;
import org.apache.velocity.VelocityContext;
import org.apache.velocity.app.VelocityEngine;
import org.apache.velocity.runtime.RuntimeConstants;
import org.apache.velocity.runtime.resource.loader.ClasspathResourceLoader;

import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Date;
import java.util.List; public class HelloVelocity { public static void main(String[] args) { // 初始化模板引擎

VelocityEngine ve = new VelocityEngine();
    ve.setProperty(RuntimeConstants.RESOURCE_LOADER, "classpath");
    ve.setProperty("classpath.resource.loader.class", ClasspathResourceLoader.class.getName());
    ve.init(); // 獲取模板文件
    Template template = ve.getTemplate("Hellovelocity.vm");
    VelocityContext ctx = new VelocityContext(); // 設置變量
    ctx.put("name", "velocity");
    ctx.put("date", (new Date()));

    List temp = new ArrayList();
    temp.add("Hey");
    temp.add("Volecity!");
    ctx.put("list", temp);

    Person person = new Person();
    ctx.put("person", person); // 輸出
    StringWriter sw = new StringWriter();
    template.merge(ctx, sw);
    System.out.println(sw.toString());
}

}

; "複製代碼")

控制檯輸出

OOM重現

大模板文件BigVelocity.template.vm

（文件字數超出博客限制，稍後在附件中給出~~）

模板文件自己就379kb不算大，關鍵在於其中定義了一個包含90000多個元素的String數組，數組的每一個元素都是」1」，而後寫了79層嵌套循環，循環的每一層都是遍歷該String數組；最內層循環調用了一次:

show table;

這意味着這個模板將生成包含96372的79次方個SQL語句，其中每個SQL語句都是：

show table;

將如此巨大的字符量填充進StringWriter對象裏面，至少須要10的380屢次方GB的內存空間，這幾乎是不現實的。所以OOM溢出是必然的。

BigVelocity.java

; "複製代碼")

package com.xlf;

import java.io.StringWriter; public class BigVelocity { public static void main(String[] args) { // 初始化模板引擎

VelocityEngine ve = new VelocityEngine();
    ve.setProperty(RuntimeConstants.RESOURCE_LOADER, "classpath");
    ve.setProperty("classpath.resource.loader.class", ClasspathResourceLoader.class.getName());
    ve.init(); // 獲取模板文件
    Template template = ve.getTemplate("BigVelocity.template.vm");
    VelocityContext ctx = new VelocityContext();
    StringWriter sw = new StringWriter();
    template.merge(ctx, sw);
}

}

; "複製代碼")

控制檯輸出

OOM緣由分析

Velocity模板生成的結果寫入StringWriter對象中，如前面分析，其底層是一個char數組。直接產生OOM的代碼在於java.util.Array.copyOf()函數：

StringWriter底層char數組容量極限測試

StringWriterOOMTest.java

; "複製代碼")

package com.xlf;

import java.io.StringWriter; public class StringWriterOOMTest { public static void main(String[] args) {

System.out.println("The maximum value of Integer is: " + Integer.MAX_VALUE);
    StringWriter sw = new StringWriter(); int count = 0; for (int i = 0; i < 100000; i++) { for (int j = 0; j < 100000; j++) {
            sw.write("This will cause OOMn");
            System.out.println("sw.getBuffer().length(): " + sw.getBuffer().length() + ", count: " + (++count));
        }
    }
}

}

; "複製代碼")

Jvm參數設置（參考硬件配置）

環境：JDK8 + Windows10臺式機 + 32GB內存 + 1TB SSD + i7-8700

若是你的硬件配置不充分，請勿輕易嘗試！

測試結果

StringWriterOOMTest運行時的整個進程內存大小在Windows任務管理器中達10300多MB時，程序中止。

控制檯輸出

測試結果分析

char數組元素最大值不會超過Integer.MAX_VALUE，回事很是接近的一個值，我這裏相差20多。網上搜索了一番，比較靠譜的說法是：確實比Integer.MAX_VALUE小一點，不會等於Integer.MAX_VALUE，是由於char[]對象還有一些別的空間佔用，好比對象頭，應該說是這些空間加起來不能超過Integer.MAX_VALUE。若是有讀者感興趣，能夠自行探索下別的類型數組的元素個數。我這裏也算是一點拙見，拋磚引玉。

OOM解決方案

緣由總結

經過上面一系列重現與分析，咱們知道了OOM的根本緣由是模板文件渲染而成的StringWriter對象過大。具體表如今：

若是系統沒有足夠大的內存空間分配給JVM，會致使OOM，由於這部份內存並非無用內存，JVM不能回收
若是系統有足夠大的內存空間分配給JVM，char數組中的元素個數在接近於MAX_VALUE會拋出OOM錯誤。

解決方案

前面分析過，出於安全的緣由，咱們只能用StringWriter對象去接收模板渲染結果的輸出。不能用文件。因此只能在StringWriter自己去作文章進行改進了：

繼承StringWriter類，重寫其write方法爲：

; "複製代碼")

StringWriter sw = new StringWriter() { public void write(String str) { int length = this.getBuffer().length() + str.length(); // 限制大小爲10MB

if (length > 10 * 1024 * 1024) { this.getBuffer().delete(0, this.getBuffer().length()); throw new RuntimeException("Velocity template size exceeds limit!");
    } this.getBuffer().append(str);
}

};

; "複製代碼")

其餘代碼保持不變

BigVelocitySolution.java

; "複製代碼")

package com.xlf;

import java.io.StringWriter; public class BigVelocitySolution { public static void main(String[] args) { // 初始化模板引擎

VelocityEngine ve = new VelocityEngine();
    ve.setProperty(RuntimeConstants.RESOURCE_LOADER, "classpath");
    ve.setProperty("classpath.resource.loader.class", ClasspathResourceLoader.class.getName());
    ve.init(); // 獲取模板文件
    Template template = ve.getTemplate("BigVelocity.template.vm");
    VelocityContext ctx = new VelocityContext();
    StringWriter sw = new StringWriter() { public void write(String str) { int length = this.getBuffer().length() + str.length(); // 限制大小爲10MB
            if (length > 10 * 1024 * 1024) { this.getBuffer().delete(0, this.getBuffer().length()); throw new RuntimeException("Velocity template size exceeds limit!");
            } this.getBuffer().append(str);
        }
    };
    template.merge(ctx, sw);
}

}

; "複製代碼")

控制檯輸出

若是velocity模板渲染後的sql語句集大小在容許的範圍內，這些語句集會根據咱們的業務作SQL語句的拆分，逐句執行。

如何優雅終止線程

在後續逐句執行sql語句的過程當中，每一句sql都是調用的周邊服務（DLI，OBS，MySql等）去執行的，結果每次都會返回給咱們的做業開發調度服務（DLF）後臺。咱們的DLF平臺支持及時中止做業的功能，也就是說假如這個做業在調度過程當中要執行10000條SQL，我要在中途中止不執行後面的SQL了——這樣的功能是支持的。

在修改上面提到OOM那個bug並經過測試後，測試同窗發現咱們的做業沒法中止下來，換句話說，咱們做業所在的java線程沒法中止。

線程中止失敗重現

一番debug與代碼深刻研讀以後，發現咱們項目中確實是調用了對應的線程對象的interrupt方法thread.interrupt();去終止線程的。

那麼爲何調用了interrupt方法依舊沒法終止線程？

TestForInterruptedException.java

; "複製代碼")

package com.xlf; public class TestForInterruptedException { public static void main(String[] args) {

StringBuilder sb = new StringBuilder(); for (int i = 0; i < 10; i++) {
        sb.append("show tables;n");
    } int i = 0; for (String str : sb.toString().split("n")) { if (i > 4) {
            Thread.currentThread().interrupt();
            System.out.println(i + " after interrupt");
        }
        System.out.println(str);
        System.out.println(i++);
    }

}

}

; "複製代碼")

控制檯輸出

測試結果分析

TestForInterruptedException.main函數中作的事情足夠簡單，先產生一個大一點的字符串，拆分紅10段小字符串，for循環中逐段打印小字符串；並企圖從第5段（初始段爲0）開始，去終止線程。結果發現線程並無終止！

這是怎麼回事？爲何調用了線程的interrupt方法並無終止線程？或者說是由於jvm須要一點時間去響應這個方法？其實並不是如此，感興趣的同窗能夠把循環次數加的更大一些，在循環開始幾回就進行interrupt，你會發現結果仍是這樣。

通過一番探索，線程終止的方法無外乎兩種：

使用該Thread對象的stop()方法能讓線程立刻中止，可是這種方法太過於暴力，實際上並不會被使用到，詳見JDK1.8的註釋：
Deprecated. This method is inherently unsafe. Stopping a thread with Thread.stop causes it to unlock all of the monitors that it has locked (as a natural consequence of the unchecked ThreadDeath exception propagating up the stack). If any of the objects previously protected by these monitors were in an inconsistent state, the damaged objects become visible to other threads, potentially resulting in arbitrary behavior. Many uses of stop should be replaced by code that simply modifies some variable to indicate that the target thread should stop running. The target thread should check this variable regularly, and return from its run method in an orderly fashion if the variable indicates that it is to stop running. If the target thread waits for long periods (on a condition variable, for example), the interrupt method should be used to interrupt the wait…
第二種方法就是上面JDK註釋中提到的設置標誌位的作法。這類作法又分爲兩種，不管哪種都須要去被終止的線程自己去「主動」地判斷該標誌位的狀態：

設置一個常規的標誌位，好比：boolean類型變量的true/ false, 根據變量的狀態去決定線程是否繼續運行——代碼裏去主動判斷變量狀態。這種通常用在循環中，檢測到相應狀態就break, return或者throw exception。
使用Thead類的實例方法interrupt去終止該thread對象表明的線程。可是interrupt方法本質上也是設置了一箇中斷標識位，並且該標誌位一旦被捕獲（讀取），「大部分時候」就會被重置（失效）。所以它並不保證線程必定可以中止，並且不保證立刻可以中止，有以下幾類狀況：
interrupt方法設置的中斷標識位後，若是該線程日後的程序執行邏輯中執行了Object類的wait/join/sleep，這3個方法會及時捕獲interrupt標誌位，重置並拋出InterruptedException。
相似於上一點，java.nio.channels包下的InterruptibleChannel類也會去主動捕獲interrupt標誌位，即線程處於InterruptibleChannel的I/O阻塞中也會被中斷，以後標誌位一樣會被重置，而後channel關閉，拋出java.nio.channels.ClosedByInterruptException；一樣的例子還有java.nio.channels.Selector，詳見JavaDoc
Thread類的實例方法isInterrupted()也能去捕獲中斷標識位並重置標識位，這個方法用在須要判斷程序終止的地方，能夠理解爲主動且顯式地去捕獲中斷標識位。
值得注意的是：拋出與捕獲InterruptedException並不涉及線程標識位的捕獲與重置
怎麼理解我前面說的中斷標識位一旦被捕獲，「大部分時候」就會被重置？Thread類中有private native boolean isInterrupted(boolean ClearInterrupted);當傳參爲false時就能在中斷標識位被捕獲後不重置。然而通常狀況它只會用於兩個地方
Thread類的static方法：此處會重置中斷標識位，並且沒法指定某個線程對象，只能是當前線程去判斷

Thread類的實例方法：這個方法也是經常使用的判斷線程中斷標識位的方法，並且不會重置標識位。

小結

要終止線程，目前JDK中可行的作法有：

本身設置變量去標識一個線程是否已中斷
合理利用JDK自己的線程中斷標識位去判斷線程是否中斷

這兩個作法都須要後續作相應處理好比去break循環，return方法或者拋出異常等等。

線程什麼時候終止？

線程終止緣由通常來說有兩種：

線程執行完他的正常代碼邏輯，天然結束。
線程執行中拋出Throwable對象且不被顯式捕獲，JVM會終止線程。衆所周知：Throwable類是Exception和Error的父類！

線程異常終止ExplicitlyCatchExceptionAndDoNotThrow.java

; "複製代碼")

package com.xlf; public class ExplicitlyCatchExceptionAndDoNotThrow { public static void main(String[] args) throws Exception {

boolean flag = true;
    System.out.println("Main started！"); try { throw new InterruptedException();
    } catch (InterruptedException exception) {
        System.out.println("InterruptedException is caught!");
    }
    System.out.println("Main doesn't stop！"); try { throw new Throwable();
    } catch (Throwable throwable) {
        System.out.println("Throwable is caught!");
    }
    System.out.println("Main is still here！"); if (flag) { throw new Exception("Main is dead!");
    }
    System.out.println("You'll never see this!");
}

}

; "複製代碼")

控制檯輸出

測試結果分析

這個測試驗證了前面關於線程異常終止的結論：

線程執行中拋出Throwable對象且不被顯式捕獲，JVM會終止線程。

優雅手動終止線程

線程執行中須要手動終止，最好的作法就是設置標識位（能夠是interrupt也能夠是本身定義的），而後及時捕獲標識位並拋出異常，在業務邏輯的最後去捕獲異常並作一些收尾的清理動做：好比統計任務執行失敗成功的比例，或者關閉某些流等等。這樣，程序的執行就兼顧到了正常與異常的狀況並獲得了優雅的處理。

TerminateThreadGracefully.java

; "複製代碼")

package com.xlf; public class TerminateThreadGracefully { public static void main(String[] args) {

StringBuilder sb = new StringBuilder(); for (int i = 0; i < 10; i++) {
        sb.append("show tables;n");
    } int i = 0; try { for (String str : sb.toString().split("n")) { if (i > 4) {
                Thread.currentThread().interrupt(); if (Thread.currentThread().isInterrupted()) { throw new InterruptedException();
                }
                System.out.println(i + " after interrupt");
            }
            System.out.println(str);
            System.out.println(i++);
        }
    } catch (InterruptedException exception) { // TODO：此處可能作一些清理工做
        System.out.println(Thread.currentThread().isInterrupted());
    }
    System.out.println("Thread main stops normally!");
}

}

; "複製代碼")

控制檯輸出

爲什麼項目中的線程終止失敗？

咱們項目中確實是調用了對應的線程對象的interrupt方法thread.interrupt();去終止線程的。

那麼爲何線程不能相應中斷標識位並終止呢？

回到咱們項目的業務邏輯：

整個job分爲模板讀取、渲染以及SQL執行三個階段，通常而言前兩個階段時間會比較快。在後續逐句執行sql語句的過程當中，每一句sql都是調用的周邊服務（DLI，OBS，MySql等）去執行的，結果每次都會返回給咱們的做業開發調度服務（DLF）後臺。咱們的DLF平臺支持及時中止做業的功能，也就是說假如這個做業在調度過程當中要執行10000條SQL，我要在中途中止不執行後面的SQL了——這樣的功能是支持的。

所以問題就出在了SQL執行的過程。通過屢次debug發現：在SQL執行過程當中須要每次都往OBS（華爲自研，第三方包）中寫log，該過程不可略去。調用該線程對象的interrupt方法thread.interrupt()，interrupt標識位最先被OBS底層用到的java.util.concurrent. CountDownLatch類的await()方法捕獲到，重置標識位並拋出異常，而後在一層層往上拋的時候被轉變成了別的異常類型，並且不能根據最終拋的異常類型去判斷是不是因爲咱們手動終止job引發的。

對於第三方包OBS根據本身的底層邏輯去處理CountDownLatch拋的異常，這本無可厚非。可是咱們的程序終止不了！爲了達到終止線程的作法，我在其中加入了一個自定義的標誌變量，當調用thread.interrupt()的時候去設置變量的狀態，並在幾個關鍵點好比OBS寫log以後去判斷個人自定義標識位的狀態，若是狀態改變了就拋出RuntimeException（能夠不被捕獲，最小化改動代碼）。而且爲了能重用線程池裏的線程對象，在每次job開始的地方去從重置這一自定義標識位。最終達到了優雅手動終止job的目的。

這一部分的源碼涉及項目細節就不貼出來了，可是相關的邏輯前面已經代碼展現過。

系統內存佔用較高且不許確

在線程中運行過程當中定義的普通的局部變量，非ThreadLocal型，通常而言會隨着線程結束而獲得回收。我所遇到的現象是上面的那個線程沒法中止的bug解決以後，線程停下來了，可是在linux上運行top命令相應進程內存佔用仍是很高。

首先我用jmap -histo:alive pid命令對jvm進行進行了強制GC，發現此時堆內存確實基本上沒用到多少（不關老年帶仍是年輕帶都大概是1%左右。）可是top命令看到的佔用大概在18% * 7G（linux總內存）左右。
其次，我用了jcmd命令去對對外內存進行分析，排斥了堆外內存泄露的問題
而後接下來就是用jstack命令查看jvm進程的各個線程都是什麼樣的狀態。與job有關的幾個線程所有是waiting on condition狀態（線程結束，線程池將他們掛起的）。
那麼，如今獲得一個初步的結論就是：無論是該jvm進程用到的堆內存仍是堆外內存，都很小（相對於top命令顯式的18% * 8G佔用量而言）。因此是否能夠猜測：jvm只是向操做系統申請了這麼多內存暫時沒有歸還回去，留待下次線程池有新任務時繼續複用呢？本文最後一部分試驗就圍繞着一點展開。

現象重現

在以下試驗中

設置jvm參數爲：

-Xms100m -Xmx200m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

其意義在於：

限制jvm初始內存爲100M，最大堆內存爲200M。並在jvm發生垃圾回收時及時打印詳細的GC信息以及時間戳。而個人代碼裏要作的事情就是重現jvm內存不夠而不得不發生垃圾回收。同時觀察操做系統層面該java進程的內存佔用。

SystemMemoryOccupiedAndReleaseTest.java

; "複製代碼")

package com.xlf;

import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit; public class SystemMemoryOccupiedAndReleaseTest { public static void main(String[] args) { try {

System.out.println("start");
        Thread.sleep(5000);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }

    ThreadPoolExecutor executor = new ThreadPoolExecutor(3, 3, 30, TimeUnit.SECONDS, new SynchronousQueue<Runnable>(), new ThreadFactory() { public Thread newThread(Runnable r) { return new Thread(r);
            }
        }, new ThreadPoolExecutor.AbortPolicy()); try {
        System.out.println("(executor已初始化):");
        Thread.sleep(1000);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }

    Thread t1 = new Thread(new Runnable() {
        {
            System.out.println("t1 已經初始化");
        }
        @Override public void run() { byte[] b = new byte[100 * 1024 * 1024];
            System.out.println("t1分配了100M空間給數組"); try {
                Thread.sleep(5000);
            } catch (InterruptedException e) {
                e.printStackTrace(); throw new RuntimeException("t1 stop");
            }
            System.out.println("t1 stop");
        }
    }, "t1"); try {
        Thread.sleep(1000);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }

    Thread t2 = new Thread(new Runnable() {
        {
            System.out.println("t2 已經初始化");
        }
        @Override public void run() { byte[] b = new byte[100 * 1024 * 1024];
            System.out.println("t2分配了100M空間給數組"); try {
                Thread.sleep(5000);
            } catch (InterruptedException e) {
                e.printStackTrace(); throw new RuntimeException("t2 stop");
            }
            System.out.println("t2 stop");
        }
    }, "t2"); try {
        Thread.sleep(1000);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }

    Thread t3 = new Thread(new Runnable() {
        {
            System.out.println("t3 已經初始化");
        }
        @Override public void run() { byte[] b = new byte[100 * 1024 * 1024];
            System.out.println("t3分配了100M空間給數組"); try {
                Thread.sleep(5000);
            } catch (InterruptedException e) {
                e.printStackTrace(); throw new RuntimeException("t3 stop");
            }
            System.out.println("t3 stop");
        }
    }, "t3"); try {
        Thread.sleep(1000);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }

    executor.execute(t1);
    System.out.println("t1 executed!"); try {
        Thread.sleep(10000);
    } catch (InterruptedException e) {
        e.printStackTrace();

    }
    executor.execute(t2);
    System.out.println("t2 executed!"); try {
        Thread.sleep(10000);
    } catch (InterruptedException e) {
        e.printStackTrace();

    }
    executor.execute(t3);
    System.out.println("t3 executed!"); try {
        Thread.sleep(10000);
    } catch (InterruptedException e) {
        e.printStackTrace();

    }

    System.out.println("jmap -histo:live pid by cmd:"); try {
        Thread.sleep(20000);
    } catch (InterruptedException e) {
        e.printStackTrace();

    }
    System.out.println("After jmap!"); // You may run jmap -heap pid using cmd here // executor.shutdown();

}
}

VelocityExperiment.zip 19.40KB