[ElasticSearch] Java API: Scroll Search (Scroll API)

Date: 2019-11-18


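A search request normally returns a single "page" of results, delivered to the user in one response however large it is. The Scroll API lets us retrieve large numbers of results (or even all of them): we run an initial search and then keep pulling results from Elasticsearch in batches until none are left. This is similar to a cursor in a traditional database.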

The Scroll API is not intended for real-time user requests, but for processing large amounts of data. The results returned from a scroll request reflect the state of the index at the time the initial search request was made, like a snapshot in time. Subsequent changes to documents (index, update, or delete) only affect later search requests.
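
The cursor and snapshot behaviour described above can be illustrated without a cluster. The following is a minimal in-memory sketch, not Elasticsearch code: SnapshotCursor and nextBatch are hypothetical names invented for this illustration. It copies the data when the cursor is opened, hands it out in fixed-size batches, and returns an empty batch once nothing is left.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical in-memory model of a scroll; NOT part of the Elasticsearch API.
class SnapshotCursor {
    private final List<String> snapshot; // state frozen when the cursor is opened
    private final int batchSize;
    private int position = 0;

    SnapshotCursor(List<String> liveIndex, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        // Copy the data: changes made to liveIndex afterwards are not visible here.
        this.snapshot = new ArrayList<>(liveIndex);
        this.batchSize = batchSize;
    }

    // Returns the next batch, or an empty list once the snapshot is exhausted.
    List<String> nextBatch() {
        if (position >= snapshot.size()) {
            return Collections.emptyList();
        }
        int end = Math.min(position + batchSize, snapshot.size());
        List<String> batch = new ArrayList<>(snapshot.subList(position, end));
        position = end;
        return batch;
    }

    public static void main(String[] args) {
        List<String> liveIndex = new ArrayList<>(Arrays.asList("doc1", "doc2", "doc3"));
        SnapshotCursor cursor = new SnapshotCursor(liveIndex, 2);
        liveIndex.add("doc4"); // added after the cursor was opened
        List<String> batch;
        while (!(batch = cursor.nextBatch()).isEmpty()) {
            System.out.println(batch);
        }
    }
}
```

In this model the document added after the cursor was opened is never returned, which mirrors the snapshot semantics: it would only show up in a later search.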

1. Regular search request

Suppose we want to fetch a large amount of data in one go. The code below asks for 58,000 documents in a single request:

/**
 * Regular search.
 * @param client Elasticsearch client
 */
public static void search(Client client) {
    String index = "simple-index";
    String type = "simple-type";
    // Search criteria
    SearchRequestBuilder searchRequestBuilder = client.prepareSearch();
    searchRequestBuilder.setIndices(index);
    searchRequestBuilder.setTypes(type);
    // Deliberately larger than the result window limit
    searchRequestBuilder.setSize(58000);
    // Execute
    SearchResponse searchResponse = searchRequestBuilder.get();
    // Search results
    SearchHit[] searchHits = searchResponse.getHits().getHits();
    for (SearchHit searchHit : searchHits) {
        String source = searchHit.getSource().toString();
        logger.info("--------- search source {}", source);
    }
}

Result:

Caused by: QueryPhaseExecutionException[Result window is too large, from + size must be less than or equal to: [10000] but was [58000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter.]
at org.elasticsearch.search.internal.DefaultSearchContext.preProcess(DefaultSearchContext.java:212)
at org.elasticsearch.search.query.QueryPhase.preProcess(QueryPhase.java:103)
at org.elasticsearch.search.SearchService.createContext(SearchService.java:676)
at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:620)
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:371)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
... 3 more

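From the above we can see that a single search request may cover at most 10000 results (from + size must not exceed index.max_result_window). Our request exceeds that limit, so it fails, and the exception message advises using the Scroll API when requesting large data sets.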
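
The limit in the error message boils down to comparing from + size against index.max_result_window. The check can be sketched as follows. ResultWindowCheck and validate are hypothetical names for this illustration, and this is not Elasticsearch's actual implementation, only the arithmetic stated in the message above, with the value 10000 taken from that message.

```java
// Hypothetical illustration of the result-window rule; NOT Elasticsearch source code.
class ResultWindowCheck {
    // Value reported in the error message above.
    static final int DEFAULT_MAX_RESULT_WINDOW = 10000;

    // Throws if from + size exceeds the configured result window.
    static void validate(int from, int size, int maxResultWindow) {
        long window = (long) from + size; // long avoids int overflow
        if (window > maxResultWindow) {
            throw new IllegalArgumentException(
                "Result window is too large, from + size must be less than or equal to: ["
                    + maxResultWindow + "] but was [" + window + "]");
        }
    }

    public static void main(String[] args) {
        validate(0, 10000, DEFAULT_MAX_RESULT_WINDOW); // exactly at the limit: accepted
        try {
            validate(0, 58000, DEFAULT_MAX_RESULT_WINDOW); // the request from the example above
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Note that the rule applies to from + size together, so deep pagination with a small size hits the same limit as one very large page.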

2. Requesting with the Scroll API

To use scrolling, the initial search request should specify the scroll parameter, which tells Elasticsearch how long to keep the search context alive (the scroll timeout).

searchRequestBuilder.setScroll(new TimeValue(60000));

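The code below sets the search criteria and the scroll settings, such as the scroll keep-alive (via setScroll()). We read the scroll ID from the SearchResponse object with getScrollId(); it is used in the next request.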

/**
 * Search using scroll.
 * @param client Elasticsearch client
 * @return the scroll ID to use for the next request
 */
public static String searchByScroll(Client client) {
    String index = "simple-index";
    String type = "simple-type";
    // Search criteria
    SearchRequestBuilder searchRequestBuilder = client.prepareSearch();
    searchRequestBuilder.setIndices(index);
    searchRequestBuilder.setTypes(type);
    // Keep the search context alive for 30 seconds (value is in milliseconds)
    searchRequestBuilder.setScroll(new TimeValue(30000));
    // Execute
    SearchResponse searchResponse = searchRequestBuilder.get();
    String scrollId = searchResponse.getScrollId();
    logger.info("--------- searchByScroll scrollID {}", scrollId);
    // First batch of results
    SearchHit[] searchHits = searchResponse.getHits().getHits();
    for (SearchHit searchHit : searchHits) {
        String source = searchHit.getSource().toString();
        logger.info("--------- searchByScroll source {}", source);
    }
    return scrollId;
}

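The scroll ID returned by the request above can be passed to the scroll API to retrieve the next batch of results. This request does not need to specify the index and type again; they were already given in the original search request.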

Each call to the scroll API returns the next batch of results until there are no more results left to return, i.e. the hits array is empty.

/**
 * Fetch documents by scroll ID.
 * @param client Elasticsearch client
 * @param scrollId scroll ID returned by the initial search request
 */
public static void searchByScrollId(Client client, String scrollId) {
    TimeValue timeValue = new TimeValue(30000);
    SearchScrollRequestBuilder searchScrollRequestBuilder;
    SearchResponse response;
    while (true) {
        logger.info("--------- searchByScroll scrollID {}", scrollId);
        searchScrollRequestBuilder = client.prepareSearchScroll(scrollId);
        // Renew the scroll keep-alive
        searchScrollRequestBuilder.setScroll(timeValue);
        // Execute
        response = searchScrollRequestBuilder.get();
        // Stop once no results are returned, i.e. the hits array is empty
        SearchHit[] searchHits = response.getHits().getHits();
        if (searchHits.length == 0) {
            break;
        }
        // Results of this batch
        for (SearchHit searchHit : searchHits) {
            String source = searchHit.getSource().toString();
            logger.info("--------- searchByScroll source {}", source);
        }
        // Only the most recent scroll ID should be used
        scrollId = response.getScrollId();
    }
}

Note:

The initial search request and each subsequent scroll request return a new _scroll_id; only the most recent _scroll_id should be used.

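In my tests, every subsequent scroll request returned the same scroll ID, so I do not fully understand the note above. If you can explain it, please let me know. Thanks.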

If we keep using a scroll ID after its scroll timeout has expired, the request fails:

Caused by: SearchContextMissingException[No search context found for id [2861]]
at org.elasticsearch.search.SearchService.findContext(SearchService.java:613)
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:403)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryScrollTransportHandler.messageReceived(SearchServiceTransportAction.java:384)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryScrollTransportHandler.messageReceived(SearchServiceTransportAction.java:381)
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
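
The interplay between the keep-alive and this error can be modelled in a few lines. The sketch below is a simplified in-memory model, not Elasticsearch code: ScrollContextRegistry, open, and touch are hypothetical names, and time is passed in explicitly so the behaviour is easy to follow. It assumes, as in the loop above where setScroll() is called on every scroll request, that each request renews the expiry, and that a request arriving after the expiry finds no context.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of scroll keep-alive handling; NOT Elasticsearch code.
class ScrollContextRegistry {
    // scroll ID -> point in time (ms) at which the context expires
    private final Map<String, Long> expiryByScrollId = new HashMap<>();

    // Registers a context that stays alive until nowMillis + keepAliveMillis.
    void open(String scrollId, long nowMillis, long keepAliveMillis) {
        expiryByScrollId.put(scrollId, nowMillis + keepAliveMillis);
    }

    // Simulates one scroll request: fails if the context is gone, otherwise renews it.
    void touch(String scrollId, long nowMillis, long keepAliveMillis) {
        Long expiry = expiryByScrollId.get(scrollId);
        if (expiry == null || nowMillis > expiry) {
            expiryByScrollId.remove(scrollId);
            throw new IllegalStateException(
                "No search context found for id [" + scrollId + "]");
        }
        expiryByScrollId.put(scrollId, nowMillis + keepAliveMillis);
    }

    public static void main(String[] args) {
        ScrollContextRegistry registry = new ScrollContextRegistry();
        registry.open("2861", 0L, 30000L);
        registry.touch("2861", 20000L, 30000L); // within 30 s: renewed until 50 s
        registry.touch("2861", 45000L, 30000L); // still alive thanks to the renewal
        try {
            registry.touch("2861", 90000L, 30000L); // too late: context expired
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The practical consequence is that the keep-alive only has to cover the time needed to process one batch, not the whole data set.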

3. Clearing the scroll ID

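Although the search context is removed automatically once the scroll timeout has passed, keeping a scroll open is costly, so as soon as we no longer need a scroll we should clear it with the Clear-Scroll API.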

/**
 * Clear several scroll IDs.
 * @param client Elasticsearch client
 * @param scrollIdList scroll IDs to clear
 * @return true if the clear request succeeded
 */
public static boolean clearScroll(Client client, List<String> scrollIdList) {
    ClearScrollRequestBuilder clearScrollRequestBuilder = client.prepareClearScroll();
    clearScrollRequestBuilder.setScrollIds(scrollIdList);
    ClearScrollResponse response = clearScrollRequestBuilder.get();
    return response.isSucceeded();
}

/**
 * Clear a single scroll ID.
 * @param client Elasticsearch client
 * @param scrollId scroll ID to clear
 * @return true if the clear request succeeded
 */
public static boolean clearScroll(Client client, String scrollId) {
    ClearScrollRequestBuilder clearScrollRequestBuilder = client.prepareClearScroll();
    clearScrollRequestBuilder.addScrollId(scrollId);
    ClearScrollResponse response = clearScrollRequestBuilder.get();
    return response.isSucceeded();
}
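
One way to make sure a scroll is always cleared, even when processing a batch throws, is to put the clear call in a finally block. The sketch below shows that pattern against a small hypothetical interface rather than the real client: ScrollSession, ScrollDrainer, and drain are names invented for this illustration. In real code nextBatch() would wrap a scroll request such as the one in searchByScrollId, and clear() would wrap clearScroll.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical abstraction over a scroll; NOT part of the Elasticsearch API.
interface ScrollSession {
    List<String> nextBatch(); // an empty list means no results are left
    void clear();             // releases the search context
}

class ScrollDrainer {
    // Processes every hit, then clears the scroll whether or not the handler failed.
    static int drain(ScrollSession session, Consumer<String> handler) {
        int processed = 0;
        try {
            while (true) {
                List<String> batch = session.nextBatch();
                if (batch.isEmpty()) {
                    break;
                }
                for (String hit : batch) {
                    handler.accept(hit);
                    processed++;
                }
            }
        } finally {
            session.clear(); // runs on normal completion and on exceptions
        }
        return processed;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a real scroll, used only for demonstration.
        Iterator<List<String>> batches = Arrays.asList(
            Arrays.asList("doc1", "doc2"), Arrays.asList("doc3")).iterator();
        ScrollSession session = new ScrollSession() {
            @Override public List<String> nextBatch() {
                return batches.hasNext() ? batches.next() : Collections.<String>emptyList();
            }
            @Override public void clear() {
                System.out.println("scroll cleared");
            }
        };
        int processed = drain(session, hit -> System.out.println("hit: " + hit));
        System.out.println("processed " + processed);
    }
}
```

Keeping the cleanup in one place means callers cannot forget it, at the cost of tying the lifetime of the scroll to a single drain call.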