MongoDB Query Optimization for Data Export Scenarios

Introduction

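A while ago I ran into a data export scenario that got progressively slower as it ran: exporting 1 million records took 40 to 60 minutes, and the logs showed that the time per batch kept climbing as well.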

Cause

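Looking at the code, the export runs in batches, much like front-end pagination, and is implemented with skip + limit. What is the problem with this approach? The documentation for these two methods says: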

The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to get the offset or skip position before beginning to return results. As the offset (e.g. pageNumber above) increases, cursor.skip() will become slower and more CPU intensive. With larger collections, cursor.skip() may become IO bound.

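In short, skip() gets slower and slower as the page number grows. For an export, though, there should be no need to recompute the offset on every batch and repeat work that has already been done. My understanding was that we should be able to obtain a pointer and walk through the results step by step, and a quick search confirmed that this is indeed possible.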
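To make the cost of repeated skipping concrete, here is a minimal sketch of a simplified cost model. It assumes the server walks one entry for every skipped document and one for every returned document, and it ignores indexes, caching, and network time. The class and method names (`SkipCostDemo`, `skipLimitCost`, `cursorCost`) are hypothetical and exist only for this illustration.

```java
/** Simplified cost model: counts how many documents the server has to walk. */
final class SkipCostDemo {

    /** Total documents walked when every page is fetched with skip(offset).limit(pageSize). */
    static long skipLimitCost(long total, long pageSize) {
        long walked = 0;
        for (long offset = 0; offset < total; offset += pageSize) {
            long returned = Math.min(pageSize, total - offset);
            // Assumption: the server walks past `offset` documents, then reads the page.
            walked += offset + returned;
        }
        return walked;
    }

    /** Total documents walked when a single cursor is iterated from start to end. */
    static long cursorCost(long total) {
        return total;
    }

    public static void main(String[] args) {
        long total = 1_000_000;
        long pageSize = 8_000;
        System.out.println("skip+limit walks: " + skipLimitCost(total, pageSize));
        System.out.println("cursor walks:     " + cursorCost(total));
    }
}
```

Under this model, exporting 1,000,000 documents in pages of 8,000 walks 63,000,000 documents with skip + limit, against 1,000,000 with a single cursor. The real ratio depends on the query plan and the storage engine, so treat the numbers as an illustration of the growth rate only.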

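We can add a method to the persistence layer that returns a cursor for the upper layer to iterate over. That way we no longer re-walk the part of the result set that has already been exported, which reduces the total work from O(N²) to O(N). We can also specify a batchSize, which sets how many documents are fetched from MongoDB in one round trip. Note that, according to the driver Javadoc below, a batch is also capped by size (usually 4MB).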

/**
     * <p>Limits the number of elements returned in one batch. A cursor 
     * typically fetches a batch of result objects and store them
     * locally.</p>
     *
     * <p>If {@code batchSize} is positive, it represents the size of each batch of objects retrieved. It can be adjusted to optimize
     * performance and limit data transfer.</p>
     *
     * <p>If {@code batchSize} is negative, it will limit of number objects returned, that fit within the max batch size limit (usually
     * 4MB), and cursor will be closed. For example if {@code batchSize} is -10, then the server will return a maximum of 10 documents and
     * as many as can fit in 4MB, then close the cursor. Note that this feature is different from limit() in that documents must fit within
     * a maximum size, and it removes the need to send a request to close the cursor server-side.</p>
*/

For example, with batchSize configured as 8000 here, the MongoDB client fetches up to that many documents in each batch:

[Screenshot: clipboard.png]

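A simple local test showed a dramatic improvement. Exporting 300,000 records with the previous approach, the later pages averaged about 500ms each and the total time was 60039ms. With the optimized approach, each batch averaged between 100ms and 200ms and the total time was 16667ms (including the time spent in business logic).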

Usage

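// Open one cursor and let the driver fetch up to 8000 documents per batch
DBCursor dbCursor = collection.find(query).batchSize(8000);
try {
  while (dbCursor.hasNext()) {
    DBObject nextItem = dbCursor.next();
    // business logic
    ... 
  }
} finally {
  // release the server-side cursor even if the export fails midway
  dbCursor.close();
}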

Now let's look at what hasNext does internally:

    @Override
    public boolean hasNext() {
        if (closed) {
            throw new IllegalStateException("Cursor has been closed");
        }

        if (nextBatch != null) {
            return true;
        }

        if (limitReached()) {
            return false;
        }

        while (serverCursor != null) {
            // this sends a command to MongoDB to fetch more data
            getMore();
            if (nextBatch != null) {
                return true;
            }
        }

        return false;
    }
    
    
    private void getMore() {
        Connection connection = connectionSource.getConnection();
        try {
            if (serverIsAtLeastVersionThreeDotTwo(connection.getDescription())) {
                try {
                    // this sends a getMore command; the next batch is decoded from the reply's "nextBatch" field
                    initFromCommandResult(connection.command(namespace.getDatabaseName(),
                                                             asGetMoreCommandDocument(),
                                                             false,
                                                             new NoOpFieldNameValidator(),
                                                             CommandResultDocumentCodec.create(decoder, "nextBatch")));
                } catch (MongoCommandException e) {
                    throw translateCommandException(e, serverCursor);
                }
            } else {
                initFromQueryResult(connection.getMore(namespace, serverCursor.getId(),
                                                       getNumberToReturn(limit, batchSize, count),
                                                       decoder));
            }
            if (limitReached()) {
                killCursor(connection);
            }
        } finally {
            connection.release();
        }
    }

Finally, initFromCommandResult takes the result and parses it into BSON objects.
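The hasNext/getMore interplay above can be imitated with a small in-memory sketch. This is not the driver's class: `BatchedCursorSketch` is a hypothetical stand-in in which a plain `List` plays the role of the server-side result set and each call to the private `getMore()` counts as one round trip. Unlike the real driver, where the first batch arrives with the initial query, this sketch counts every batch as a round trip.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical in-memory imitation of a batched cursor; not the MongoDB driver's implementation. */
final class BatchedCursorSketch<T> implements Iterator<T> {
    private final List<T> source;              // stands in for the server-side result set
    private final int batchSize;               // maximum documents per simulated round trip
    private final Deque<T> nextBatch = new ArrayDeque<>();
    private int serverPosition = 0;            // where the simulated server cursor points
    private int roundTrips = 0;

    BatchedCursorSketch(List<T> source, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.source = source;
        this.batchSize = batchSize;
    }

    @Override
    public boolean hasNext() {
        if (!nextBatch.isEmpty()) {
            return true;                       // still serving the locally buffered batch
        }
        if (serverPosition >= source.size()) {
            return false;                      // the simulated server cursor is exhausted
        }
        getMore();                             // buffer is empty: fetch the next batch
        return !nextBatch.isEmpty();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return nextBatch.removeFirst();
    }

    /** Continues from the current position; nothing already returned is walked again. */
    private void getMore() {
        int end = Math.min(serverPosition + batchSize, source.size());
        nextBatch.addAll(source.subList(serverPosition, end));
        serverPosition = end;
        roundTrips++;
    }

    int roundTrips() {
        return roundTrips;
    }

    public static void main(String[] args) {
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 20_000; i++) {
            docs.add(i);
        }
        BatchedCursorSketch<Integer> cursor = new BatchedCursorSketch<>(docs, 8_000);
        int seen = 0;
        while (cursor.hasNext()) {
            cursor.next();
            seen++;
        }
        System.out.println(seen + " documents in " + cursor.roundTrips() + " round trips");
    }
}
```

The point of the design is that the position is kept between batches, so each fetch resumes where the previous one stopped instead of starting over from the beginning as skip() does.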

Summary

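  1. When writing code day to day, it is best to add timing instrumentation to every method, interface, or even finer-grained unit. The timings can be logged at debug level, so that with a logging framework such as log4j or logback you can change the level dynamically and inspect them whenever needed, which makes targeted optimization much easier. Spring's StopWatch utility class, for example, can do this.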

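  2. For the scenario in this article, first check whether the code logic is at fault, then check whether the database is the problem (for example a missing index or too much data), and only then work out a targeted optimization, rather than diving straight into writing code.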
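The first point above can be sketched with the JDK alone. The example below is a stand-in for Spring's StopWatch, not its API: `TimingSketch` and `timed` are hypothetical names, and `java.util.logging` at `FINE` level takes the place of a debug-level log4j/logback logger.

```java
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical timing helper; a JDK-only stand-in for a StopWatch plus a debug-level logger. */
final class TimingSketch {
    private static final Logger LOG = Logger.getLogger(TimingSketch.class.getName());

    /** Runs the task, logs its duration at FINE (debug) level, and returns its result. */
    static <T> T timed(String label, Supplier<T> task) {
        long start = System.nanoTime();
        try {
            return task.get();
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            // Only emitted when the logger level is FINE or lower, so it is cheap to leave in place.
            LOG.log(Level.FINE, "{0} took {1} ms", new Object[] {label, elapsedMs});
        }
    }

    public static void main(String[] args) {
        int sum = timed("sum-batch", () -> {
            int total = 0;
            for (int i = 1; i <= 8_000; i++) {
                total += i;
            }
            return total;
        });
        System.out.println("sum = " + sum);
    }
}
```

Wrapping each batch of the export in a call like `timed("export-batch", ...)` would show in the logs whether the time per batch grows as the export progresses, which was the symptom that pointed to skip() here.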
