Query Efficiency of Different Schemas for Storing Time Series Data in Elasticsearch

Date: 2019-11-08


Here we use exactly the same test data as before to examine which schema to choose when storing time series in Elasticsearch.

One doc per data point

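As before, we start with the simplest schema. In Elasticsearch you first create an index, and the index then has mappings. A mapping is the equivalent of a table schema. The configuration used to create it is as follows: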

settings = {
    'number_of_shards': 1,
    'number_of_replicas': 0,
    'index.query.default_field': 'timestamp',
    'index.mapping.ignore_malformed': False,
    'index.mapping.coerce': False,
    'index.query.parse.allow_unmapped_fields': False,
}
mappings = {
    'testdata': {
        '_source': {'enabled': False},
        '_all': {'enabled': False},
        'properties': {
            'timestamp': {
                'type': 'date',
                'index': 'no',
                'store': False,
                'dynamic': 'strict',
                'doc_values': True,
                'fielddata': {'format': 'doc_values'}
            },
            'vAppid': {
                'type': 'string',
                'index': 'no',
                'store': False,
                'dynamic': 'strict',
                'doc_values': True,
                'fielddata': {'format': 'doc_values'}
            },
            'iResult': {
                'type': 'string',
                'index': 'no',
                'store': False,
                'dynamic': 'strict',
                'doc_values': True,
                'fielddata': {'format': 'doc_values'}
            },
            'vCmdid': {
                'type': 'string',
                'index': 'no',
                'store': False,
                'dynamic': 'strict',
                'doc_values': True,
                'fielddata': {'format': 'doc_values'}
            },
            'dProcessTime': {
                'type': 'integer',
                'index': 'no',
                'store': False,
                'dynamic': 'strict',
                'doc_values': True,
                'fielddata': {'format': 'doc_values'}
            },
            'totalCount': {
                'type': 'integer',
                'index': 'no',
                'store': False,
                'dynamic': 'strict',
                'doc_values': True,
                'fielddata': {'format': 'doc_values'}
            }
        }
    }
}

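Although this schema does not apply the more advanced optimization of packing by time range, several Elasticsearch-specific settings deserve attention. First, _source is disabled, so the original JSON document is not stored a second time. Second, _all is disabled as well. In addition, store is False for every field, so no field is stored separately. When we tested MongoDB earlier, none of the fields were indexed, so to keep the comparison fair the indexes are turned off here too. With all of that turned off, where does the data live? In doc_values. During aggregations, doc_values are used to quickly look up a column's values for a batch of document ids. On disk, doc_values are files stored column by column with compression, which is very efficient.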
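To make the idea of column-oriented storage concrete, here is a toy Python model of looking up a column's values by document id during an aggregation. It is only an illustration: the names `to_columns` and `sum_by` and the three sample rows are made up for this sketch, and Lucene's actual on-disk doc_values encoding is far more elaborate.

```python
# Toy model of column-oriented storage. This is NOT how Lucene encodes
# doc_values on disk; it only shows the "doc id -> column value" lookup.
rows = [
    {'timestamp': 1428789600000, 'vCmdid': '19', 'totalCount': 5},
    {'timestamp': 1428789600000, 'vCmdid': '14', 'totalCount': 7},
    {'timestamp': 1428789660000, 'vCmdid': '19', 'totalCount': 3},
]


def to_columns(rows):
    """Turn row dicts into one list per field, indexed by doc id.

    Assumes every row has the same fields, so the lists stay aligned.
    """
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


def sum_by(columns, key_field, value_field, doc_ids):
    """Group the given doc ids by key_field and sum value_field per group."""
    totals = {}
    for doc_id in doc_ids:
        key = columns[key_field][doc_id]
        totals[key] = totals.get(key, 0) + columns[value_field][doc_id]
    return totals


columns = to_columns(rows)
totals = sum_by(columns, 'timestamp', 'totalCount', range(len(rows)))
```

The aggregation only touches the two columns it needs, which is the property that makes a columnar layout attractive for this kind of query.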

So how much disk space is used after importing more than 8 million rows?

size: 198Mi (198Mi)
docs: 8,385,335 (8,385,335)

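Remarkably, the 8.38 million rows that took 3G of disk space in MongoDB occupy only 198M once imported into es. Even with indexes added on all dimension fields, the growth is very small: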

size: 233Mi (233Mi)
docs: 8,385,335 (8,385,335)
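As a rough sanity check on these figures, the following sketch works out the average number of bytes per document from the sizes reported above. It assumes that `Mi` means mebibytes and reads the article's rounded "3G" as 3 GiB; both are assumptions, and the helper name `bytes_per_doc` is made up here.

```python
MIB = 1024 * 1024
DOCS = 8385335  # document count reported above


def bytes_per_doc(size_mib, doc_count):
    """Average on-disk bytes per document for an index of size_mib MiB."""
    return size_mib * float(MIB) / doc_count


no_index = bytes_per_doc(198, DOCS)     # doc_values only
with_index = bytes_per_doc(233, DOCS)   # dimension fields indexed as well

# "3G" is a rounded figure in the article; treated here as 3 GiB (assumption).
mongo_ratio = (3 * 1024) / 198.0
```

Under these assumptions the index comes to roughly 25 bytes per document without indexes and roughly 29 with them, about a fifteenth of the MongoDB footprint.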

And what about query efficiency?

q = {
    'aggs': {
        'timestamp': {
            'terms': {'field': 'timestamp'},
            'aggs': {
                'totalCount': {'sum': {'field': 'totalCount'}}
            }
        }
    },
}
res = es.search(index="wentao-test1", doc_type='testdata', body=q, search_type='count')

Again we aggregate by time and take the sum of totalCount for each period. The result is:

{u'_shards': {u'failed': 0, u'successful': 1, u'total': 1}, u'aggregations': {u'timestamp': {u'buckets': [{u'doc_count': 38304,     u'key': 1428789900000,     u'key_as_string': u'2015-04-11T22:05:00.000Z',     u'totalCount': {u'value': 978299.0}},
    {u'doc_count': 38020,     u'key': 1428789960000,     u'key_as_string': u'2015-04-11T22:06:00.000Z',     u'totalCount': {u'value': 970089.0}},
    {u'doc_count': 37865,     u'key': 1428789660000,     u'key_as_string': u'2015-04-11T22:01:00.000Z',     u'totalCount': {u'value': 917908.0}},
    {u'doc_count': 37834,     u'key': 1428789840000,     u'key_as_string': u'2015-04-11T22:04:00.000Z',     u'totalCount': {u'value': 931039.0}},
    {u'doc_count': 37780,     u'key': 1428790140000,     u'key_as_string': u'2015-04-11T22:09:00.000Z',     u'totalCount': {u'value': 972810.0}},
    {u'doc_count': 37761,     u'key': 1428790020000,     u'key_as_string': u'2015-04-11T22:07:00.000Z',     u'totalCount': {u'value': 953866.0}},
    {u'doc_count': 37738,     u'key': 1428790080000,     u'key_as_string': u'2015-04-11T22:08:00.000Z',     u'totalCount': {u'value': 969901.0}},
    {u'doc_count': 37598,     u'key': 1428789600000,     u'key_as_string': u'2015-04-11T22:00:00.000Z',     u'totalCount': {u'value': 919538.0}},
    {u'doc_count': 37541,     u'key': 1428789720000,     u'key_as_string': u'2015-04-11T22:02:00.000Z',     u'totalCount': {u'value': 920581.0}},
    {u'doc_count': 37518,     u'key': 1428789780000,     u'key_as_string': u'2015-04-11T22:03:00.000Z',     u'totalCount': {u'value': 924791.0}}],   u'doc_count_error_upper_bound': 0,   u'sum_other_doc_count': 8007376}}, u'hits': {u'hits': [], u'max_score': 0.0, u'total': 8385335}, u'timed_out': False, u'took': 1033}

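It took only 1 second; the same query took 9 seconds in MongoDB. Is it fast because Elasticsearch is a parallel database? When creating the index we deliberately set the number of shards to 1, so only one machine took part in this query. Out of curiosity I also tried an index with 6 shards. With 6 shards the total size is 259M (including indexes), and the query above takes only 200ms. Of course, the MongoDB and es machines used in these tests were not exactly the same, so part of the difference may be due to hardware.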
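Two details of the response above are easy to miss. The terms aggregation orders buckets by doc_count rather than by time, and with its default size it returns only the top 10 buckets, which is why sum_other_doc_count is 8007376. A minimal sketch of putting the returned buckets back into chronological order follows; the helper name `buckets_by_time` is hypothetical, and the sample is three buckets copied from the response above and trimmed to the fields used here.

```python
def buckets_by_time(response, agg_name='timestamp', metric='totalCount'):
    """Return (key_as_string, metric value) pairs sorted by bucket key."""
    buckets = response['aggregations'][agg_name]['buckets']
    ordered = sorted(buckets, key=lambda b: b['key'])
    return [(b['key_as_string'], b[metric]['value']) for b in ordered]


# Three buckets taken from the response shown above.
sample = {'aggregations': {'timestamp': {'buckets': [
    {'doc_count': 38304, 'key': 1428789900000,
     'key_as_string': '2015-04-11T22:05:00.000Z',
     'totalCount': {'value': 978299.0}},
    {'doc_count': 38020, 'key': 1428789960000,
     'key_as_string': '2015-04-11T22:06:00.000Z',
     'totalCount': {'value': 970089.0}},
    {'doc_count': 37598, 'key': 1428789600000,
     'key_as_string': '2015-04-11T22:00:00.000Z',
     'totalCount': {'value': 919538.0}},
]}}}

series = buckets_by_time(sample)
```

The queries later in this article pass 'size': 0 to the terms aggregation, which in the Elasticsearch 1.x line that this mapping syntax suggests asks for all buckets instead of the top 10.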

The second query is more complex: filter by vAppid, then aggregate on the two dimensions timestamp and vCmdid. The query is:

q = {
    'query': {
        'constant_score': {
            'filter': {
                'bool': {
                    'must_not': {
                        'term': {'vAppid': ''}
                    }
                }
            }
        },
    },
    'aggs': {
        'timestamp': {
            'terms': {'field': 'timestamp'},
            'aggs': {
                'vCmdid': {
                    'terms': {'field': 'vCmdid'},
                    'aggs': {
                        'totalCount': {'sum': {'field': 'totalCount'}}
                    }
                }
            }
        }
    },
}
res = es.search(index="wentao-test3", doc_type='testdata', body=q, search_type='count')

constant_score skips the scoring phase. The result is:

{u'_shards': {u'failed': 0, u'successful': 1, u'total': 1}, u'aggregations': {u'timestamp': {u'buckets': [{u'doc_count': 38304,     u'key': 1428789900000,     u'key_as_string': u'2015-04-11T22:05:00.000Z',     u'vCmdid': {u'buckets': [{u'doc_count': 7583,        u'key': u'10000',        u'totalCount': {u'value': 241108.0}},
       {u'doc_count': 4122, u'key': u'19', u'totalCount': {u'value': 41463.0}},
       {u'doc_count': 2312, u'key': u'14', u'totalCount': {u'value': 41289.0}},
       {u'doc_count': 2257, u'key': u'18', u'totalCount': {u'value': 57845.0}},
       {u'doc_count': 1723,        u'key': u'1002',        u'totalCount': {u'value': 33844.0}},
       {u'doc_count': 1714,        u'key': u'2006',        u'totalCount': {u'value': 33681.0}},
       {u'doc_count': 1646,        u'key': u'2004',        u'totalCount': {u'value': 28374.0}},
       {u'doc_count': 1448, u'key': u'13', u'totalCount': {u'value': 32187.0}},
       {u'doc_count': 1375, u'key': u'3', u'totalCount': {u'value': 32976.0}},
       {u'doc_count': 1346,        u'key': u'2008',        u'totalCount': {u'value': 45932.0}}],      u'doc_count_error_upper_bound': 0,      u'sum_other_doc_count': 12778}},
    ... // ignore
    {u'doc_count': 37518,     u'key': 1428789780000,     u'key_as_string': u'2015-04-11T22:03:00.000Z',     u'vCmdid': {u'buckets': [{u'doc_count': 7456,        u'key': u'10000',        u'totalCount': {u'value': 234565.0}},
       {u'doc_count': 4049, u'key': u'19', u'totalCount': {u'value': 39884.0}},
       {u'doc_count': 2308, u'key': u'14', u'totalCount': {u'value': 39939.0}},
       {u'doc_count': 2263, u'key': u'18', u'totalCount': {u'value': 57121.0}},
       {u'doc_count': 1731,        u'key': u'1002',        u'totalCount': {u'value': 32309.0}},
       {u'doc_count': 1695,        u'key': u'2006',        u'totalCount': {u'value': 33299.0}},
       {u'doc_count': 1649,        u'key': u'2004',        u'totalCount': {u'value': 28429.0}},
       {u'doc_count': 1423, u'key': u'13', u'totalCount': {u'value': 30672.0}},
       {u'doc_count': 1340,        u'key': u'2008',        u'totalCount': {u'value': 45051.0}},
       {u'doc_count': 1308, u'key': u'3', u'totalCount': {u'value': 32076.0}}],      u'doc_count_error_upper_bound': 0,      u'sum_other_doc_count': 12296}}],   u'doc_count_error_upper_bound': 0,   u'sum_other_doc_count': 8007376}}, u'hits': {u'hits': [], u'max_score': 0.0, u'total': 8385335}, u'timed_out': False, u'took': 2235}

The query took only 2.2 seconds, whereas it took 21.4 seconds on MongoDB. Running the same query on the 6-shard index takes just 0.6 seconds.

One doc per time range

As with the _._._._.v structure used earlier in MongoDB, the data is nested by dimension inside sub-documents.

The schema is as follows:

mappings = {
    'testdata': {
        '_source': {'enabled': False},
        '_all': {'enabled': False},
        'properties': {
            'max_timestamp': {
                'type': 'date',
                'index': 'not_analyzed',
                'store': False,
                'dynamic': 'strict',
                'doc_values': False,
                'fielddata': {'format': 'disabled'}
            },
            'min_timestamp': {
                'type': 'date',
                'index': 'not_analyzed',
                'store': False,
                'dynamic': 'strict',
                'doc_values': False,
                'fielddata': {'format': 'disabled'}
            },
            'count': {
                'type': 'integer',
                'index': 'no',
                'store': False,
                'dynamic': 'strict',
                'doc_values': False,
                'fielddata': {'format': 'disabled'}
            },
            'sum_totalCount': {
                'type': 'integer',
                'index': 'no',
                'store': False,
                'dynamic': 'strict',
                'doc_values': False,
                'fielddata': {'format': 'disabled'}
            },
            'sum_dProcessTime': {
                'type': 'integer',
                'index': 'no',
                'store': False,
                'dynamic': 'strict',
                'doc_values': False,
                'fielddata': {'format': 'disabled'}
            },
            '_': {  # timestamp
                'type': 'nested',
                'properties': {
                    'd': {'type': 'date', 'index': 'not_analyzed', 'store': False, 'fielddata': {'format': 'fst'}},
                    'c': {'type': 'integer', 'index': 'not_analyzed', 'store': False, 'fielddata': {'format': 'fst'}},
                    '0': {'type': 'integer', 'index': 'not_analyzed', 'store': False, 'fielddata': {'format': 'fst'}},
                    '1': {'type': 'integer', 'index': 'not_analyzed', 'store': False, 'fielddata': {'format': 'fst'}},
                    '_': {  # vAppid
                        'type': 'nested',
                        'properties': {
                            'd': {'type': 'string', 'index': 'not_analyzed', 'store': False, 'fielddata': {'format': 'fst'}},
                            '_': {  # iResult
                                'type': 'nested',
                                'properties': {
                                    'd': {'type': 'string', 'index': 'not_analyzed', 'store': False, 'fielddata': {'format': 'fst'}},
                                    '_': {  # vCmdid
                                        'type': 'nested',
                                        'properties': {
                                            'd': {'type': 'string', 'index': 'not_analyzed', 'store': False, 'fielddata': {'format': 'fst'}},
                                            'v': {  # values
                                                'type': 'nested',
                                                'properties': {
                                                    '0': {'type': 'integer', 'index': 'not_analyzed', 'store': False, 'fielddata': {'format': 'fst'}},
                                                    '1': {'type': 'integer', 'index': 'not_analyzed', 'store': False, 'fielddata': {'format': 'fst'}}
                                                }
                                            }
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

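The key point of this schema is the chain of nested sub-documents. Each member of a nested document must have either doc_values or index enabled, otherwise its data is not saved. Because doc_values take more space, we chose not to store doc values.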
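To show what a document under this mapping might look like, here is a minimal sketch that packs flat data points into the nested layout. It follows the comments in the mapping (`_` levels for timestamp, vAppid, iResult and vCmdid, then `v` for the values, with `0` as totalCount and `1` as dProcessTime). The function names `pack` and `_child`, the sample points, and the reading of `c` as a per-timestamp point count are assumptions of this sketch, not the author's actual import code.

```python
def _child(entries, key):
    """Find or create the entry whose 'd' equals key in a list of entries."""
    for entry in entries:
        if entry['d'] == key:
            return entry
    entry = {'d': key}
    entries.append(entry)
    return entry


def pack(points):
    """Pack flat points into one nested document (hypothetical helper)."""
    doc = {
        'min_timestamp': min(p['timestamp'] for p in points),
        'max_timestamp': max(p['timestamp'] for p in points),
        'count': len(points),
        'sum_totalCount': sum(p['totalCount'] for p in points),
        'sum_dProcessTime': sum(p['dProcessTime'] for p in points),
        '_': [],
    }
    for p in points:
        ts = _child(doc['_'], p['timestamp'])
        ts['c'] = ts.get('c', 0) + 1                  # assumed: point count
        ts['0'] = ts.get('0', 0) + p['totalCount']    # precomputed sum
        ts['1'] = ts.get('1', 0) + p['dProcessTime']  # precomputed sum
        app = _child(ts.setdefault('_', []), p['vAppid'])
        result = _child(app.setdefault('_', []), p['iResult'])
        cmd = _child(result.setdefault('_', []), p['vCmdid'])
        cmd.setdefault('v', []).append(
            {'0': p['totalCount'], '1': p['dProcessTime']})
    return doc


# Made-up sample points; timestamps are epoch milliseconds.
points = [
    {'timestamp': 1428789600000, 'vAppid': 'a', 'iResult': '0',
     'vCmdid': '19', 'totalCount': 5, 'dProcessTime': 10},
    {'timestamp': 1428789600000, 'vAppid': 'a', 'iResult': '0',
     'vCmdid': '14', 'totalCount': 7, 'dProcessTime': 20},
    {'timestamp': 1428789660000, 'vAppid': '', 'iResult': '1',
     'vCmdid': '19', 'totalCount': 3, 'dProcessTime': 30},
]
doc = pack(points)
```

The raw values stay at the deepest level under `v`, while the sums kept at the timestamp level are what make the cheap precomputed query possible.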

The data in MongoDB:

{
    "sharded" : false,
    "primary" : "shard2_RS",
    "ns" : "wentao_test.sparse_precomputed_no_appid",
    "count" : 39,
    "size" : 2.68435e+08,
    "avgObjSize" : 6.88294e+06,
    "storageSize" : 2.75997e+08,
    "numExtents" : 3,
    "nindexes" : 1,
    "lastExtentSize" : 1.58548e+08,
    "paddingFactor" : 1.0000000000000000,
    "systemFlags" : 1,
    "userFlags" : 1,
    "totalIndexSize" : 8176,
    "indexSizes" : {
        "_id_" : 8176
    },
    "ok" : 1.0000000000000000,
    "$gleStats" : {
        "lastOpTime" : Timestamp(1429187664, 3),
        "electionId" : ObjectId("54c9f324adaa0bd054140fda")
    }
}

There are only 39 documents, with a size of 270M. After importing the data into es:

size: 74.6Mi (74.6Mi)
docs: 9,355,029 (9,355,029)

The document count becomes 9.35 million, because sub-documents are also counted as documents in es, while the size is only 74M. The query is:

q = {
    'aggs': {
        'expanded_timestamp': {
            'nested': {'path': '_'},
            'aggs': {
                'grouped_timestamp': {
                    'terms': {'field': '_.d', 'size': 0},
                    'aggs': {
                        'totalCount': {'sum': {'field': '_.0'}}
                    }
                }
            }
        }
    }
}
res = es.search(index="wentao-test4", doc_type='testdata', body=q, search_type='count')

Note that _.0 is the precomputed sum of totalCount for each period. The nesting order of the dimension fields is timestamp => vAppid => iResult => vCmdid => values (0 as totalCount, 1 as dProcessTime).

{u'_shards': {u'failed': 0, u'successful': 1, u'total': 1}, u'aggregations': {u'expanded_timestamp': {u'doc_count': 743,   u'grouped_timestamp': {u'buckets': [{u'doc_count': 8,      u'key': 1428790140000,      u'key_as_string': u'2015-04-11T22:09:00.000Z',      u'totalCount': {u'value': 972810.0}},
     ... // ignore
     {u'doc_count': 1,      u'key': 1428793140000,      u'key_as_string': u'2015-04-11T22:59:00.000Z',      u'totalCount': {u'value': 83009.0}}],    u'doc_count_error_upper_bound': 0,    u'sum_other_doc_count': 0}}}, u'hits': {u'hits': [], u'max_score': 0.0, u'total': 39}, u'timed_out': False, u'took': 56}

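The query took only 0.056 seconds. Using precomputed values is not a fair comparison, though. The same result can also be computed from the raw values: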

q = {
    'aggs': {
        'per_id': {
            'terms': {'field': '_uid'},
            'aggs': {
                'expanded_timestamp': {
                    'nested': {'path': '_'},
                    'aggs': {
                        'grouped_timestamp': {
                            'terms': {'field': '_.d'},
                            'aggs': {
                                'expanded_vAppid': {
                                    'nested': {'path': '_._._._.v'},
                                    'aggs': {
                                        'totalCount': {'sum': {'field': '_._._._.v.0'}}
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    },
}

This uses multi-level expansion and finally sums _._._._.v.0. The result is the same as summing _.0. It took 0.548 seconds.

Next, let's test the query that filters by vAppid and aggregates on both time and vCmdid. This one is rather unwieldy to write:

q = {
    'aggs': {
        'expanded_timestamp': {
            'nested': {'path': '_'},
            'aggs': {
                'grouped_timestamp': {
                    'terms': {'field': '_.d', 'size': 0},
                    'aggs': {
                        'expanded_to_vAppid': {
                            'nested': {'path': '_._'},
                            'aggs': {
                                'vAppid_not_empty': {
                                    'filter': {
                                        'bool': {
                                            'must_not': {
                                                'term': {'_._.d': ''}
                                            }
                                        }
                                    },
                                    'aggs': {
                                        'expanded_to_vCmdid': {
                                            'nested': {'path': '_._._._'},
                                            'aggs': {
                                                'ts_and_vCmdid': {
                                                    'terms': {'field': '_._._._.d', 'size': 0},  # _._._._.d is vCmdid
                                                    'aggs': {
                                                        'expanded_to_values': {
                                                            'nested': {'path': '_._._._.v'},
                                                            'aggs': {
                                                                'totalCount': {'sum': {'field': '_._._._.v.0'}}
                                                            }
                                                        }
                                                    }
                                                }
                                            }
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

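The query takes 3.2 seconds, slower than querying data stored in the raw format. In practice, however, the precomputed values are far more likely to be used, and cases that require unpacking the raw values are rare.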
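Since deeply nested aggregation bodies like the one above are tedious to write by hand, one option is to assemble them from small helper functions. The following is a minimal sketch that rebuilds the same query; `nested`, `terms` and `filtered` are hypothetical helpers defined here and are not part of the Elasticsearch Python client.

```python
def nested(path, aggs):
    """Wrap sub-aggregations in a nested aggregation on the given path."""
    return {'nested': {'path': path}, 'aggs': aggs}


def terms(field, aggs, size=0):
    """Terms aggregation on field with the given sub-aggregations."""
    return {'terms': {'field': field, 'size': size}, 'aggs': aggs}


def filtered(filter_body, aggs):
    """Filter aggregation that narrows the docs seen by sub-aggregations."""
    return {'filter': filter_body, 'aggs': aggs}


total_count = {'totalCount': {'sum': {'field': '_._._._.v.0'}}}
vappid_not_empty = {'bool': {'must_not': {'term': {'_._.d': ''}}}}

q = {'aggs': {
    'expanded_timestamp': nested('_', {
        'grouped_timestamp': terms('_.d', {
            'expanded_to_vAppid': nested('_._', {
                'vAppid_not_empty': filtered(vappid_not_empty, {
                    'expanded_to_vCmdid': nested('_._._._', {
                        'ts_and_vCmdid': terms('_._._._.d', {
                            'expanded_to_values': nested('_._._._.v',
                                                         total_count),
                        }),
                    }),
                }),
            }),
        }),
    }),
}}
```

Each helper returns a plain dict, so the resulting `q` can be passed as the search body in the same way as the hand-written version.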