Optimizing the Compression of 20 MB of Files from 30 Seconds to 1 Second

Date: 2019-11-06


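There was a requirement to take 10 photos sent from the front end, process them on the back end, compress them into a single archive, and send it out over a network stream. I had never compressed files with Java before, so I found an example online and adapted it. It worked, but as the uploaded images grew larger, the time spent grew sharply. In the end I measured that compressing 20 MB of files took 30 seconds. The compression code is as follows.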

public static void zipFileNoBuffer() {
    File zipFile = new File(ZIP_FILE);
    try (ZipOutputStream zipOut = new ZipOutputStream(new FileOutputStream(zipFile))) {
        // start time
        long beginTime = System.currentTimeMillis();

        for (int i = 0; i < 10; i++) {
            try (InputStream input = new FileInputStream(JPG_FILE)) {
                zipOut.putNextEntry(new ZipEntry(FILE_NAME + i));
                int temp = 0;
                while ((temp = input.read()) != -1) {
                    zipOut.write(temp);
                }
            }
        }
        printInfo(beginTime);
    } catch (Exception e) {
        e.printStackTrace();
    }
}


For the test I used a 2 MB image and looped ten times. The printed result is below; it took about 30 seconds.

fileSize:20M
consum time:29599


First optimization: from 30 seconds to 2 seconds

The first idea for optimization was to use a buffer, BufferedInputStream. The read() method of FileInputStream reads only one byte at a time, as the source code itself states.

/**
 * Reads a byte of data from this input stream. This method blocks
 * if no input is yet available.
 *
 * @return     the next byte of data, or <code>-1</code> if the end of the
 *             file is reached.
 * @exception  IOException  if an I/O error occurs.
 */
public native int read() throws IOException;


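This is a native method that interacts with the underlying operating system to read data from disk. Calling a native method and going through the operating system for every single byte is very expensive. For example, if we have 30000 bytes of data, FileInputStream needs 30000 native calls to fetch them, whereas with a buffer (assuming the buffer is large enough to hold all 30000 bytes) a single call is enough. The first call to read() on the buffered stream reads the data from disk into memory in bulk, and the bytes are then handed back one at a time from memory.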

BufferedInputStream wraps an internal byte array that holds the data; its default size is 8192 bytes.
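The effect of the buffer can be made visible with a small experiment. The sketch below is not from the original code: CountingInputStream is a hypothetical helper, and an in-memory ByteArrayInputStream stands in for the file, so it only counts how many calls reach the underlying stream rather than measuring real disk or native-call cost.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class ReadCallCounter {

    /** Hypothetical wrapper that counts how often the underlying stream is asked for data. */
    static class CountingInputStream extends FilterInputStream {
        long calls = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    /** Reads everything one byte at a time and returns how many calls reached the source. */
    static long countCalls(int size, boolean buffered) throws IOException {
        CountingInputStream source = new CountingInputStream(new ByteArrayInputStream(new byte[size]));
        try (InputStream in = buffered ? new BufferedInputStream(source) : source) {
            while (in.read() != -1) {
                // consume byte by byte, like the loops in the article
            }
        }
        return source.calls;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("unbuffered calls: " + countCalls(30000, false));
        System.out.println("buffered calls:   " + countCalls(30000, true));
    }
}
```

Without the buffer, every byte costs one call to the source. With the buffer, the source is only asked for a new block each time the internal array runs empty.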

The optimized code is as follows.

public static void zipFileBuffer() {
    File zipFile = new File(ZIP_FILE);
    try (ZipOutputStream zipOut = new ZipOutputStream(new FileOutputStream(zipFile));
            BufferedOutputStream bufferedOutputStream = new BufferedOutputStream(zipOut)) {
        // start time
        long beginTime = System.currentTimeMillis();
        for (int i = 0; i < 10; i++) {
            try (BufferedInputStream bufferedInputStream = new BufferedInputStream(new FileInputStream(JPG_FILE))) {
                zipOut.putNextEntry(new ZipEntry(FILE_NAME + i));
                int temp = 0;
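                while ((temp = bufferedInputStream.read()) != -1) {
                    bufferedOutputStream.write(temp);
                }
                // Push any bytes still in the buffer into the current entry
                // before the next putNextEntry() starts a new one
                bufferedOutputStream.flush();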
            }
        }
        printInfo(beginTime);
    } catch (Exception e) {
        e.printStackTrace();
    }
}


Output

------Buffer
fileSize:20M
consum time:1808


As you can see, this is already much faster than the first version, which used FileInputStream directly.
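A related variation, not part of the original code, is to copy in blocks with an explicit byte array instead of one byte per call. The sketch below assumes an in-memory payload and made-up entry names ("entry-0", "entry-1", ...) so that it runs without any files; the 8192-byte block size simply mirrors the default buffer size mentioned above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class BlockCopyZip {

    /** Copies the whole stream in 8 KB blocks instead of one byte per call. */
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] block = new byte[8192];
        int n;
        while ((n = in.read(block)) != -1) {
            out.write(block, 0, n);
        }
    }

    /** Zips the same payload `copies` times into an in-memory archive. */
    static byte[] zip(byte[] payload, int copies) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zipOut = new ZipOutputStream(bytes)) {
            for (int i = 0; i < copies; i++) {
                zipOut.putNextEntry(new ZipEntry("entry-" + i)); // hypothetical entry name
                copy(new ByteArrayInputStream(payload), zipOut);
                zipOut.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] archive = zip(new byte[2 * 1024 * 1024], 10);
        System.out.println("archive bytes: " + archive.length);
    }
}
```

Because the data goes straight into the ZipOutputStream in blocks, there is no extra output buffer sitting above the zip stream that would need flushing between entries.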

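Second optimization: from 2 seconds to 1 second

Using a buffer already met my needs, but in the spirit of putting what I had learned into practice, I wanted to try optimizing further with NIO.

Using Channel

Why use Channel? NIO introduced Channel and ByteBuffer. Because their structure matches the way the operating system performs I/O more closely, they are noticeably faster than traditional IO. A Channel is like a mine full of coal, and a ByteBuffer is the truck sent to the mine. In other words, all of our interaction with the data goes through the ByteBuffer.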

In NIO, three classes can produce a FileChannel: FileInputStream, FileOutputStream, and RandomAccessFile, which can both read and write.
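To make the mine-and-truck picture concrete, here is a minimal sketch of the usual read/flip/write/clear loop between two channels. It is an illustration only: the class and method names are made up, and in-memory streams wrapped with Channels.newChannel stand in for real files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;

class ChannelCopy {

    /** Moves all bytes from src to dst through one reusable ByteBuffer (the "truck"). */
    static long copy(ReadableByteChannel src, WritableByteChannel dst) throws IOException {
        ByteBuffer truck = ByteBuffer.allocate(8192);
        long total = 0;
        while (src.read(truck) != -1) {
            truck.flip();                  // switch from filling to draining
            while (truck.hasRemaining()) {
                total += dst.write(truck); // a write may not drain everything at once
            }
            truck.clear();                 // ready to be filled again
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "coal from the mine".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(Channels.newChannel(new ByteArrayInputStream(data)),
                Channels.newChannel(sink));
        System.out.println(copied + " bytes copied");
    }
}
```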

The source code is as follows.

public static void zipFileChannel() {
    // start time
    long beginTime = System.currentTimeMillis();
    File zipFile = new File(ZIP_FILE);
    try (ZipOutputStream zipOut = new ZipOutputStream(new FileOutputStream(zipFile));
            WritableByteChannel writableByteChannel = Channels.newChannel(zipOut)) {
        for (int i = 0; i < 10; i++) {
            try (FileChannel fileChannel = new FileInputStream(JPG_FILE).getChannel()) {
                zipOut.putNextEntry(new ZipEntry(i + SUFFIX_FILE));
                fileChannel.transferTo(0, FILE_SIZE, writableByteChannel);
            }
        }
        printInfo(beginTime);
    } catch (Exception e) {
        e.printStackTrace();
    }
}


Notice that no ByteBuffer is used to move the data here. Instead the transferTo method is used, which connects the two channels directly.

This method is potentially much more efficient than a simple loop
that reads from this channel and writes to the target channel.  Many
operating systems can transfer bytes directly from the filesystem cache
to the target channel without actually copying them.


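This is the description from the source. Roughly, it says that transferTo is potentially much more efficient than a loop that reads from one Channel and writes to another. Many operating systems can transfer bytes directly from the filesystem cache to the target Channel without an actual copy step.

The copy step here means moving the data from kernel space to user space.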
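One detail worth keeping in mind: transferTo returns the number of bytes it actually transferred, which may be fewer than requested. A defensive caller can therefore loop until the whole file has been sent. The following is a minimal sketch under that assumption; the helper name transferFully and the temporary file are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class TransferFully {

    /** Calls transferTo repeatedly until every byte of the file has reached dst. */
    static long transferFully(FileChannel src, WritableByteChannel dst) throws IOException {
        long position = 0;
        long size = src.size();
        while (position < size) {
            // transferTo reports how many bytes were really moved in this call
            position += src.transferTo(position, size - position, dst);
        }
        return position;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("transfer-demo", ".bin"); // hypothetical test file
        try {
            Files.write(tmp, new byte[64 * 1024]);
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            try (FileChannel src = FileChannel.open(tmp, StandardOpenOption.READ);
                    WritableByteChannel dst = Channels.newChannel(sink)) {
                System.out.println("transferred: " + transferFully(src, dst));
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```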

As the output shows, this is somewhat faster than the buffered version.

------Channel
fileSize:20M
consum time:1416


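Kernel space and user space

So why is the step from kernel space to user space slow? First we need to understand what kernel space and user space are. To protect core system resources, commonly used operating systems are designed around four privilege levels (rings); the further in, the greater the privilege. Ring 0 is called kernel space and is used to access critical resources. Ring 3 is called user space.

User mode and kernel mode: a thread running in kernel space is in kernel mode, and a thread running in user space is in user mode.

What happens when an application (applications always run in user mode) needs to access a core resource? It has to call one of the interfaces exposed by the kernel, which is known as a system call. Suppose our application needs to access a file on disk. The application calls a system call interface such as open, the kernel then accesses the file on disk, and the file content is returned to the application.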

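Direct and non-direct buffers

Since reading a file from disk takes so much effort, is there a simpler way that lets our application work with the disk file directly, without the kernel relaying the data? There is: setting up a direct buffer.

Non-direct buffer: this is the case described above, where the kernel acts as a middleman and every transfer has to be relayed through it.
Direct buffer: a direct buffer does not need kernel space as an intermediate copy step. Instead, a region of physical memory is allocated and mapped into both the kernel address space and the user address space, and data moves between the application and the disk through that region.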
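In the Java API the two kinds of buffer are created with ByteBuffer.allocate (heap, non-direct) and ByteBuffer.allocateDirect (direct). The short sketch below only shows how to create each kind and tell them apart; the describe helper is made up for illustration, and the sketch does not measure any performance difference.

```java
import java.nio.ByteBuffer;

class BufferKinds {

    /** Returns a label for the kind of buffer passed in. */
    static String describe(ByteBuffer buffer) {
        return buffer.isDirect() ? "direct" : "heap";
    }

    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.allocate(1024);         // backed by a byte[] on the JVM heap
        ByteBuffer direct = ByteBuffer.allocateDirect(1024); // allocated outside the JVM heap

        System.out.println("allocate():       " + describe(heap));
        System.out.println("allocateDirect(): " + describe(direct));
    }
}
```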

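If direct buffers are so fast, why not use them everywhere? Direct buffers have the following drawbacks:

They are not safe.
They cost more, because the space is not allocated directly inside the JVM. This memory can only be reclaimed through the garbage collection mechanism, and we cannot control when that happens.
Once data has been written to the physical memory buffer, the program loses control over it: when the data is finally written to disk is decided by the operating system alone, and the application can no longer intervene.

In summary, using the transferTo method amounts to setting up a direct buffer, which is why performance improved so much.

Using memory-mapped files

Another feature introduced with NIO is the memory-mapped file. Why is it fast? For the same reason as above: a direct buffer is set up in memory and the program interacts with the data directly. The source code is as follows.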

//Version 4: using a memory-mapped file
public static void zipFileMap() {
    // start time
    long beginTime = System.currentTimeMillis();
    File zipFile = new File(ZIP_FILE);
    try (ZipOutputStream zipOut = new ZipOutputStream(new FileOutputStream(zipFile));
            WritableByteChannel writableByteChannel = Channels.newChannel(zipOut)) {
        for (int i = 0; i < 10; i++) {

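            zipOut.putNextEntry(new ZipEntry(i + SUFFIX_FILE));

            // Map the file into memory; the file and its channel are closed
            // once the mapped bytes have been written to the zip stream
            try (RandomAccessFile file = new RandomAccessFile(JPG_FILE_PATH, "r");
                    FileChannel channel = file.getChannel()) {
                MappedByteBuffer mappedByteBuffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, FILE_SIZE);
                writableByteChannel.write(mappedByteBuffer);
            }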
        }
        printInfo(beginTime);
    } catch (Exception e) {
        e.printStackTrace();
    }
}


The output is as follows.

---------Map
fileSize:20M
consum time:1305


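The speed is about the same as with Channel.

Using Pipe

A Java NIO pipe is a one-way data connection between two threads. A Pipe has a source channel and a sink channel: data is read from the source channel and written to the sink channel. According to the description in the source, whether a writing thread blocks until another thread reads those bytes from the pipe depends on the system. On the reading side, if there is no data to read, a blocking read waits until the writing thread has written some data or the channel is closed.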

Whether or not a thread writing bytes to a pipe will block until another
 thread reads those bytes
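Before the zip version, here is a minimal sketch of a Pipe on its own: one thread writes a message into the sink channel and closes it, while the calling thread reads from the source channel until it sees end-of-stream. The class name, method name, and message are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.charset.StandardCharsets;

class PipeDemo {

    /** Sends the message through a Pipe from a writer thread and returns what the reader received. */
    static String roundTrip(String message) throws Exception {
        Pipe pipe = Pipe.open();

        // Writer thread: pushes the bytes into the sink, then closes it to signal end-of-stream
        Thread writer = new Thread(() -> {
            try (Pipe.SinkChannel sink = pipe.sink()) {
                ByteBuffer out = ByteBuffer.wrap(message.getBytes(StandardCharsets.UTF_8));
                while (out.hasRemaining()) {
                    sink.write(out);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        writer.start();

        // Reader (current thread): drains the source until read() reports -1
        ByteArrayOutputStream received = new ByteArrayOutputStream();
        try (Pipe.SourceChannel source = pipe.source()) {
            ByteBuffer in = ByteBuffer.allocate(64);
            while (source.read(in) != -1) {
                in.flip();
                received.write(in.array(), 0, in.limit());
                in.clear();
            }
        }
        writer.join();
        return new String(received.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("hello through the pipe"));
    }
}
```

Closing the sink matters here: it is what lets the reader's loop terminate instead of waiting for more data.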


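The effect I wanted is this: one thread compresses the images into the pipe while another thread reads from the pipe and writes the result to the zip file. The source code is as follows.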

//Version 5: using Pipe
public static void zipFilePip() {

    long beginTime = System.currentTimeMillis();
    try(WritableByteChannel out = Channels.newChannel(new FileOutputStream(ZIP_FILE))) {
        Pipe pipe = Pipe.open();
        // asynchronous task: compress into the sink side of the pipe
        CompletableFuture.runAsync(()->runTask(pipe));

        // get the readable (source) channel
        ReadableByteChannel readableByteChannel = pipe.source();
        ByteBuffer buffer = ByteBuffer.allocate(((int) FILE_SIZE)*10);
        while (readableByteChannel.read(buffer)>= 0) {
            buffer.flip();
            out.write(buffer);
            buffer.clear();
        }
    }catch (Exception e){
        e.printStackTrace();
    }
    printInfo(beginTime);

}

// asynchronous task
public static void runTask(Pipe pipe) {

    try(ZipOutputStream zos = new ZipOutputStream(Channels.newOutputStream(pipe.sink()));
            WritableByteChannel out = Channels.newChannel(zos)) {
        System.out.println("Begin");
        for (int i = 0; i < 10; i++) {
            zos.putNextEntry(new ZipEntry(i+SUFFIX_FILE));

            // try-with-resources closes the channel even if transferTo throws
            try (FileChannel jpgChannel = new FileInputStream(new File(JPG_FILE_PATH)).getChannel()) {
                jpgChannel.transferTo(0, FILE_SIZE, out);
            }
        }
    }catch (Exception e){
        e.printStackTrace();
    }
}
