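Our data analysis platform had been running normally, but I recently noticed, rather late, that log index data had not been generated for many days. Current state: one machine, a single-node Elasticsearch (ES below) cluster, 1000+ shards, about 200 GB of data.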
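Checking CPU and memory usage on the server with top showed the ES process at over 90% CPU, which clearly indicated a problem.
Memory usage was also extremely high: the server has 8 GB of RAM and only about 150 MB was free, so ES was the obvious suspect.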
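Checking the ES cluster health showed a status of red, which means some primary shards are unavailable. In my case historical data could still be queried, but no new index data could be created.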
    curl "http://localhost:9200/_cluster/health?pretty"
    {
      "cluster_name" : "elasticsearch",
      "status" : "red",
      "timed_out" : false,
      "number_of_nodes" : 1,
      "number_of_data_nodes" : 1,
      "active_primary_shards" : 663,
      "active_shards" : 663,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 6,
      "delayed_unassigned_shards" : 0,
      "number_of_pending_tasks" : 0,
      "number_of_in_flight_fetch" : 0,
      "task_max_waiting_in_queue_millis" : 0,
      "active_shards_percent_as_number" : 99.10313901345292
    }
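Checking the status of each index showed many of them in the red state and therefore unavailable. Because too many indices were open, ES consumed a large amount of CPU and memory, which made logstash unavailable. New index data could not be created, and data was lost as a result.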
    curl -XGET "http://localhost:9200/_cat/indices?v"
    health status index         pri rep docs.count docs.deleted store.size pri.store.size
    red    open   jr-2016.12.20   3   0
    red    open   jr-2016.12.21   3   0
    red    open   jr-2016.12.22   3   0
    red    open   jr-2016.12.23   3   0
    red    open   jr-2016.12.24   3   0
    red    open   jr-2016.12.25   3   0
    red    open   jr-2016.12.26   3   0
    red    open   jr-2016.12.27   3   0
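With hundreds of indices, scanning the listing by eye gets tedious. Here is a minimal sketch for pulling out just the red index names. It assumes the default column order shown above (health, status, index); the function name `red_indices` and the `ES_URL` variable are my own, not part of ES.

```shell
#!/usr/bin/env bash
# Print the names of red indices from `_cat/indices?v` output read on stdin.
# Assumes the default column order: health, status, index, ...
# The header row is skipped naturally because its first field is "health".
red_indices() {
  awk '$1 == "red" { print $3 }'
}

# Example against a live node (ES_URL is an assumed variable):
#   ES_URL="http://localhost:9200"
#   curl -s "$ES_URL/_cat/indices?v" | red_indices
```

The function only reads stdin, so it can be tried on a saved copy of the listing before pointing it at a live node.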
Exception thrown when querying ES:
    [2018-08-06 18:27:24,553][DEBUG][action.search ] [Godfrey Calthrop] All shards failed for phase: [query]
    [jr-2018.08.06][[jr-2018.08.06][2]] NoShardAvailableActionException[null]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.start(AbstractSearchAsyncAction.java:129)
        at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:115)
        at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:47)
        at org.elasticsearch.action.support.TransportAction.doExecute(TransportAction.java:149)
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:137)
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:85)
        at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:58)
        at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:359)
        at org.elasticsearch.client.FilterClient.doExecute(FilterClient.java:52)
        at org.elasticsearch.rest.BaseRestHandler$HeadersAndContextCopyClient.doExecute(BaseRestHandler.java:83)
        at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:359)
        at org.elasticsearch.client.support.AbstractClient.search(AbstractClient.java:582)
        at org.elasticsearch.rest.action.search.RestSearchAction.handleRequest(RestSearchAction.java:85)
        at org.elasticsearch.rest.BaseRestHandler.handleRequest(BaseRestHandler.java:54)
        at org.elasticsearch.rest.RestController.executeHandler(RestController.java:205)
        at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:166)
        at org.elasticsearch.http.HttpServer.internalDispatchRequest(HttpServer.java:128)
        at org.elasticsearch.http.HttpServer$Dispatcher.dispatchRequest(HttpServer.java:86)
        at org.elasticsearch.http.netty.NettyHttpServerTransport.dispatchRequest(NettyHttpServerTransport.java:449)
        at org.elasticsearch.http.netty.HttpRequestHandler.messageReceived(HttpRequestHandler.java:61)
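From the checks above, the likely cause was too many historical indices left in the open state, which drove ES CPU and memory usage so high that the service became unavailable.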
    # Close indices that are no longer needed to reduce memory usage
    curl -XPOST "http://localhost:9200/index_name/_close"
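Closing indices one at a time does not scale to hundreds of daily indices. The sketch below shows one way to batch it, assuming indices are named PREFIX-YYYY.MM.DD like the jr-2016.12.20 ones above. The function names, the `ES_URL` variable and the `DRY_RUN` switch are hypothetical helpers of mine; the only ES call used is the `_close` endpoint from the command above.

```shell
#!/usr/bin/env bash
# Read index names on stdin and print those named PREFIX-YYYY.MM.DD whose date
# is strictly earlier than the cutoff (also YYYY.MM.DD). Zero-padded dates in
# this format sort correctly as plain strings.
indices_older_than() {
  local prefix="$1" cutoff="$2"
  awk -v p="$prefix-" -v c="$cutoff" '
    index($0, p) == 1 {
      d = substr($0, length(p) + 1)
      # Appending "" forces a string comparison rather than a numeric one.
      if (d ~ /^[0-9][0-9][0-9][0-9]\.[0-9][0-9]\.[0-9][0-9]$/ && (d "") < (c ""))
        print $0
    }'
}

# Close each index named on stdin. ES_URL is an assumed variable; set
# DRY_RUN=1 to print the requests instead of sending them.
close_indices() {
  local name
  while IFS= read -r name; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "POST $ES_URL/$name/_close"
    else
      curl -s -XPOST "$ES_URL/$name/_close"
    fi
  done
}

# Example (dry run first, then drop DRY_RUN to actually close):
#   ES_URL="http://localhost:9200"
#   curl -s "$ES_URL/_cat/indices?h=index" | indices_older_than jr 2018.01.01 | DRY_RUN=1 close_indices
```

Running with DRY_RUN=1 first is worth the extra step: a closed index cannot be searched until it is reopened, so it is better to see the list before sending anything.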
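After closing the non-hot indices, the cluster health was still red. It then occurred to me that a single red index could keep the whole cluster red, and sure enough, that was the case: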
    curl -XGET "http://10.252.148.85:9200/_cluster/health?level=indices"
    {
      "cluster_name": "elasticsearch",
      "status": "red",
      "timed_out": false,
      "number_of_nodes": 1,
      "number_of_data_nodes": 1,
      "active_primary_shards": 660,
      "active_shards": 660,
      "relocating_shards": 0,
      "initializing_shards": 0,
      "unassigned_shards": 9,
      "delayed_unassigned_shards": 0,
      "number_of_pending_tasks": 0,
      "number_of_in_flight_fetch": 0,
      "task_max_waiting_in_queue_millis": 0,
      "active_shards_percent_as_number": 98.65470852017937,
      "indices": {
        "jr-2018.08.06": {
          "status": "red",
          "number_of_shards": 3,
          "number_of_replicas": 0,
          "active_primary_shards": 0,
          "active_shards": 0,
          "relocating_shards": 0,
          "initializing_shards": 0,
          "unassigned_shards": 3
        }
      }
    }
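The index-level health output tells you which index is red; the `_cat/shards` endpoint can narrow that down to individual shards. A small sketch, assuming the default `_cat/shards` column order (index, shard, prirep, state, ...); the function name `unassigned_shards` and the `ES_URL` variable are my own.

```shell
#!/usr/bin/env bash
# Print "index shard prirep" for every UNASSIGNED row of `_cat/shards` output
# read on stdin. Assumes the default column order: index, shard, prirep, state.
unassigned_shards() {
  awk '$4 == "UNASSIGNED" { print $1, $2, $3 }'
}

# Example against a live node (ES_URL is an assumed variable):
#   ES_URL="http://10.252.148.85:9200"
#   curl -s "$ES_URL/_cat/shards" | unassigned_shards
```

In the prirep column, p marks a primary shard and r a replica; an unassigned primary is what turns an index red.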
Fix: delete this index. It was dirty data created while I was troubleshooting, so I deleted it outright.
curl -XDELETE 'http://10.252.148.85:9200/jr-2018.08.06'
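When running ES as a single node, keep an eye on index status and server monitoring, and promptly clean up or close indices that are no longer needed, so this situation does not happen again.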
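As a starting point for that kind of monitoring, here is a rough sketch of a health check that could run from cron. It reads the `_cluster/health` response on stdin and exits non-zero unless the status is green or yellow. The function names and `ES_URL` are assumptions of mine, and the status is extracted with a plain text match rather than a real JSON parser, so treat it as illustrative only.

```shell
#!/usr/bin/env bash
# Extract the first "status" value from _cluster/health JSON on stdin.
# This is a text match, not a JSON parser; the top-level status comes first
# in the responses shown in this post.
health_status() {
  grep -o '"status" *: *"[a-z]*"' | head -n 1 | sed 's/.*"\([a-z]*\)"$/\1/'
}

# Return 0 when the status is green or yellow, 1 otherwise
# (red, or no usable answer at all).
check_health() {
  local status
  status="$(health_status)"
  case "$status" in
    green|yellow) echo "OK: cluster status is $status"; return 0 ;;
    *) echo "ALERT: cluster status is '${status:-unknown}'"; return 1 ;;
  esac
}

# Example (ES_URL is an assumed variable; hook your own alerting after ||):
#   ES_URL="http://localhost:9200"
#   curl -s "$ES_URL/_cluster/health" | check_health || echo "notify someone"
```

Treating an empty answer as an alert is deliberate: in my incident ES was too overloaded to be useful, and a check that stays silent when the node does not respond would have missed it.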