【01】Using Elasticsearch as a Database: Table Schema Definition

Date: 2019-11-06

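Elasticsearch has very good query performance and a very powerful query syntax. In certain scenarios it can replace an RDBMS for OLAP workloads. However, its official query language is not SQL but a DSL of Elasticsearch's own design. The DSL has two main parts: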

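Query DSL (https://www.elastic.co/guide/...): the equivalent of the WHERE clause in SQL; it provides many different ways to filter documents
Aggregation DSL (https://www.elastic.co/guide/...): the equivalent of the GROUP BY clause in SQL; it groups documents by some criteria and computes metrics such as sum and average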
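To make the analogy concrete, here is a minimal sketch of how one SQL statement might map onto a request body that combines both DSLs. The field names `exchange`, `sector`, and `market_cap` come from the sample data used in this article; the value `nasdaq` and the aggregation names `by_sector` and `total_market_cap` are made up for illustration.

```python
import json

# SQL being imitated (illustrative only):
#   SELECT sector, SUM(market_cap) FROM symbol
#   WHERE exchange = 'nasdaq' GROUP BY sector
request_body = {
    "size": 0,  # return only aggregation results, not the matching documents
    # Query DSL: plays the role of WHERE
    "query": {"term": {"exchange": "nasdaq"}},
    # Aggregation DSL: plays the role of GROUP BY plus the metric
    "aggs": {
        "by_sector": {
            "terms": {"field": "sector"},
            "aggs": {"total_market_cap": {"sum": {"field": "market_cap"}}},
        }
    },
}

print(json.dumps(request_body, indent=2))
```

The nesting is the part worth noticing: the metric sits inside the bucket aggregation, whereas in SQL the grouping column and the aggregate function appear side by side.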

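Frankly, these two DSLs are hard to learn and understand, and even once mastered they are tedious to write, but they are very powerful. This series of articles has two goals: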

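Experiment with and learn the syntax and semantics of the Elasticsearch aggregation DSL by analogy with SQL concepts
Implement a translator in Python that lets you use SQL to do what the Elasticsearch aggregation DSL does. This small script can serve as a handy tool in day-to-day work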

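Basic Elasticsearch knowledge (such as what a document is and what an index is) is not covered here. Our focus is on learning the query and aggregation syntax. In this chapter, we first prepare the sample data. The sample data chosen is the list of all US stocks (http://www.nasdaq.com/screeni...). It was chosen because it has fairly rich dimensions (IPO year, sector, exchange, etc.) and numeric fields to aggregate on (last sale price, market capitalization). The data is downloaded in CSV format (https://github.com/taowen/es-...), and there is an import script (https://github.com/taowen/es-...).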
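The linked script is the authoritative version. Purely as an illustration of what an import of this kind could look like, the sketch below turns CSV rows into the newline-delimited body expected by the Elasticsearch bulk API. The sample rows, the CSV column names, and the helper `to_bulk_lines` are all hypothetical, and the `_type` key in the action line reflects older Elasticsearch versions.

```python
import csv
import io
import json

# Made-up two-row sample; the real CSV's columns and values may differ.
SAMPLE_CSV = """symbol,name,sector,exchange
AAA,Example Corp,Technology,nasdaq
BBB,Sample Inc,Finance,nyse
"""


def to_bulk_lines(csv_text, index="symbol", doc_type="symbol"):
    """Yield bulk API lines: one action line, then one document line, per CSV row."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Use the ticker as the document id so re-importing overwrites instead of duplicating.
        action = {"index": {"_index": index, "_type": doc_type, "_id": row["symbol"]}}
        yield json.dumps(action)
        yield json.dumps(row)


# A bulk request body must end with a trailing newline.
bulk_body = "\n".join(to_bulk_lines(SAMPLE_CSV)) + "\n"
print(bulk_body)
```

The resulting string would be sent as the body of a POST to the `_bulk` endpoint. Note that `csv.DictReader` yields every value as a string, so a real import would convert numeric columns before indexing.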

Below is the mapping used for the Elasticsearch import (the equivalent of a table schema definition in a relational database):

{
    "symbol": {
        "properties": {
            "sector": {
                "index": "not_analyzed", 
                "type": "string"
            }, 
            "market_cap": {
                "index": "not_analyzed", 
                "type": "long"
            }, 
            "name": {
                "index": "analyzed", 
                "type": "string"
            }, 
            "ipo_year": {
                "index": "not_analyzed", 
                "type": "integer"
            }, 
            "exchange": {
                "index": "not_analyzed", 
                "type": "string"
            }, 
            "symbol": {
                "index": "not_analyzed", 
                "type": "string"
            }, 
            "last_sale": {
                "index": "not_analyzed", 
                "type": "long"
            }, 
            "industry": {
                "index": "not_analyzed", 
                "type": "string"
            }
        }, 
        "_source": {
            "enabled": true
        }, 
        "_all": {
            "enabled": false
        }
    }
}

When using Elasticsearch as a database, I default to the following settings:

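Set all fields to not_analyzed (the symbol mapping above makes one exception and leaves name analyzed)
Enable _source, so that individual fields do not have to be stored separately; in most cases this is more efficient
Disable _all, because lookups are always k=v queries against a known field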
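As a rough sketch, these three defaults could be applied mechanically when generating a mapping of the kind shown above. The helper `build_mapping` and its parameters are hypothetical names introduced for this example, not part of any Elasticsearch client.

```python
def build_mapping(doc_type, fields, analyzed=()):
    """Build a mapping dict that applies the three defaults listed above.

    fields maps field name -> Elasticsearch type; analyzed names the
    exceptions that should remain full-text analyzed.
    """
    properties = {}
    for name, es_type in fields.items():
        properties[name] = {
            "type": es_type,
            # Default every field to not_analyzed unless explicitly exempted.
            "index": "analyzed" if name in analyzed else "not_analyzed",
        }
    return {
        doc_type: {
            "properties": properties,
            "_source": {"enabled": True},  # keep the original document
            "_all": {"enabled": False},    # no catch-all search field
        }
    }


mapping = build_mapping(
    "symbol",
    {"symbol": "string", "name": "string", "market_cap": "long"},
    analyzed=("name",),
)
```

Generating the mapping this way keeps the exceptions explicit: anything not listed in `analyzed` is treated as an exact-match column.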

After running python import-symbol.py to import the data, run:

GET http://127.0.0.1:9200/symbol/_count

which returns:

{"count":6714,"_shards":{"total":3,"successful":3,"failed":0}}

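The count confirms that the documents have been imported into the index. Besides the stock list, we can also import historical stock prices into the database. This data set is fairly large, so it is hosted on a network drive for download (https://yunpan.cn/cxRN6gLX7f9md access code 571c) (http://pan.baidu.com/s/1nufbLMx access code bes2). Run python import-quote.py to import it, using the following mapping: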

"quote": {
    "_all": {
      "enabled": false
    },
    "_source": {
      "enabled": true
    }, 
    "properties": {
      "date": {
        "format": "strict_date_optional_time||epoch_millis",
        "type": "date"
      },
      "volume": {
        "type": "long"
      },
      "symbol": {
        "index": "not_analyzed",
        "type": "string"
      },
      "high": {
        "type": "long"
      },
      "low": {
        "type": "long"
      },
      "adj_close": {
        "type": "long"
      },
      "close": {
        "type": "long"
      },
      "open": {
        "type": "long"
      }
    }
  }

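From the mapping's point of view, this is very similar to a table schema definition. Apart from the concepts of _source, _all, and analyzed, there is basically no difference. The biggest differences when using Elasticsearch as a database are the index/mapping relationship and index wildcards.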