Solr vs. MySQL Query Performance Comparison

Date: 2019-11-17

Tags: solr, mysql, query performance comparison; Category: MySQL


Test environment

This article briefly compares the query speed of Solr and MySQL.

Test data size: 10,407,608 rows (Solr Num Docs: 10407608)

Ordinary queries

All MySQL query times given here include the time spent fetching the data from the MySQL server.

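One of the most frequent queries in our project fetches the data for a given time range. Running it in SQL and fetching the data takes about 30 s: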

SELECT * FROM `tf_hotspotdata_copy_test` WHERE collectTime BETWEEN '2014-12-06 00:00:00' AND '2014-12-10 21:31:55';

After creating an index on collectTime, the same query takes 2 s, which is much faster.

Solr index fields:

<!--Index Field for HotSpot-->
<field name="CollectTime" type="tdate" indexed="true" stored="true"/>
<field name="IMSI" type="string" indexed="true" stored="true"/>
<field name="IMEI" type="string" indexed="true" stored="true"/>
<field name="DeviceID" type="string" indexed="true" stored="true"/>

The Solr query with the same condition takes 72 ms:

"status": 0,
    "QTime": 72,
    "params": {
      "indent": "true",
      "q": "CollectTime:[2014-12-06T00:00:00.000Z TO 2014-12-10T21:31:55.000Z]",
      "_": "1434617215202",
      "wt": "json"
    }

That is a big improvement in query time. Let's try it from SolrJ code:

SolrQuery params = new SolrQuery();
params.set("q", timeQueryString);
params.set("fq", queryString);
params.set("start", 0); 
params.set("rows", Integer.MAX_VALUE);
params.setFields(retKeys);
QueryResponse response = server.query(params);

Querying through SolrJ and fetching the result set (220,296 documents, 5 fields returned) takes about 12 s.

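Why does it take so long? The "QTime" above is only the time spent searching the index. To actually fetch the matching result set from the Solr server, Solr has to read the stored fields (disk I/O) and then send them to the client over HTTP (network I/O). Both steps are expensive, the disk I/O especially.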

Time comparison:

Query	Time
MySQL (no index)	30 s
MySQL (with index)	2 s
SolrJ (select query)	12 s

How can this be optimized? Let's see how long it takes to fetch only the IDs:

SQL query returning only id, with no index on collectTime: about 10 s:

SELECT id FROM `tf_hotspotdata_copy_test` WHERE collectTime BETWEEN '2014-12-06 00:00:00' AND '2014-12-10 21:31:55';

The same SQL query returning only id, with an index on collectTime: 0.337 s, which is very fast.

SolrJ query returning only id: about 7 s, somewhat faster than before.

id Size: 220296

Time: 7340

Time comparison:

Query (IDs only)	Time
MySQL (no index)	10 s
MySQL (with index)	0.337 s
SolrJ (select query)	7 s

Let's keep optimizing.

Some similar questions about SolrJ being slow when fetching large result sets:

http://stackoverflow.com/questions/28181821/solr-performance#

http://grokbase.com/t/lucene/solr-user/11aysnde25/query-time-help

http://lucene.472066.n3.nabble.com/Solrj-performance-bottleneck-td2682797.html

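None of these threads offers a good solution; the usual advice is to paginate. But we need to pull a large amount of data for comparison and analysis, so pagination does not help us.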

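I happened to come across an answer pointing out that Solr queries use the "/select" request handler by default, and that the "/export" request handler can be used to export a result set instead. Here is how the Solr documentation describes it: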

It's possible to export fully sorted result sets using a special rank query parser and response writer specifically designed to work together to handle scenarios that involve sorting and exporting millions of records. This uses a stream sorting technique that begins to send records within milliseconds and continues to stream results until the entire result set has been sorted and exported.

Solr already defines this requestHandler:

<requestHandler name="/export" class="solr.SearchHandler">
  <lst name="invariants">
    <str name="rq">{!xport}</str>
    <str name="wt">xsort</str>
    <str name="distrib">false</str>
  </lst>
  <arr name="components">
    <str>query</str>
  </arr>
</requestHandler>

Using /export requires the fields to be indexed with docValues:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" docValues="true"/>
<field name="CollectTime" type="tdate" indexed="true" stored="true" docValues="true"/>
<field name="IMSI" type="string" indexed="true" stored="true" docValues="true"/>
<field name="IMEI" type="string" indexed="true" stored="true" docValues="true"/>
<field name="DeviceID" type="string" indexed="true" stored="true" docValues="true"/>

An /export request must specify a field to sort on, and only the following types are supported:

Sort fields must be one of the following types: int,float,long,double,string

Field types that can be returned by an export:

Export fields must either be one of the following types: int,float,long,double,string

Querying and fetching the data with SolrJ:

        SolrQuery params = new SolrQuery();
        params.set("q", timeQueryString);
        params.set("fq", queryString);
        params.set("start", 0);
        params.set("rows", Integer.MAX_VALUE);
        params.set("sort", "id asc");
        params.setHighlight(false);
        params.set("qt", "/export");
        params.setFields(retKeys);
        QueryResponse response = server.query(params);

This runs into a bug:

org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://192.8.125.30:8985/solr/hotspot: Expected mime type application/octet-stream but got application/json.

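SolrJ cannot parse the result set. Looking at the source, the cause is that the Content-Type returned by the Solr server does not match the one SolrJ checks for when parsing. The content type in SolrJ's BinaryResponseParser is hard-coded: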

public class BinaryResponseParser extends ResponseParser {
    public static final String BINARY_CONTENT_TYPE = "application/octet-stream";

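I could not find a quick fix for this bug, so I fell back to issuing the HTTP request myself. I wrote a simple client on top of HttpClient that sends the request and parses the JSON response, then measured its speed: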

    String url = "http://192.8.125.30:8985/solr/hotspot/export?q=CollectTime%3A[2014-12-06T00%3A00%3A00.000Z+TO+2014-12-10T21%3A31%3A55.000Z]&sort=id+asc&fl=id&wt=json&indent=true";
    long s = System.currentTimeMillis();
    SolrHttpJsonClient client = new SolrHttpJsonClient();
    SolrQueryResult result = client.getQueryResultByGet(url);
    System.out.println("Size: "+result.getResponse().getNumFound());
    long e = System.currentTimeMillis();
    System.out.println("Time: "+(e-s));

With the same query condition, fetching the 220,296 results takes about 2 s. That is roughly on par with MySQL after indexing, which is acceptable for now.
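The export URL in the snippet above was written by hand. As a rough sketch, it could also be assembled with the JDK alone. The class and method names here (ExportUrlSketch, toSolrDate, rangeQuery, exportUrl) and the localhost address are made up for illustration, and the code assumes the timestamps are already in UTC.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of SolrJ: builds an /export request URL.
final class ExportUrlSketch {
    private static final DateTimeFormatter SQL_STYLE =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
    private static final DateTimeFormatter SOLR_STYLE =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");

    // "2014-12-06 00:00:00" -> "2014-12-06T00:00:00.000Z".
    // Assumption: the input is already UTC, so no zone conversion happens.
    public static String toSolrDate(String sqlStyle) {
        return LocalDateTime.parse(sqlStyle, SQL_STYLE).format(SOLR_STYLE);
    }

    // Builds a range query such as CollectTime:[from TO to].
    public static String rangeQuery(String field, String from, String to) {
        return field + ":[" + toSolrDate(from) + " TO " + toSolrDate(to) + "]";
    }

    // /export needs q, sort and fl; wt=json mirrors the URL used in the article.
    public static String exportUrl(String coreUrl, String q, String sort, String fl) {
        return coreUrl + "/export"
                + "?q=" + enc(q)
                + "&sort=" + enc(sort)
                + "&fl=" + enc(fl)
                + "&wt=json";
    }

    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        String q = rangeQuery("CollectTime", "2014-12-06 00:00:00", "2014-12-10 21:31:55");
        System.out.println(exportUrl("http://localhost:8983/solr/hotspot", q, "id asc", "id"));
    }
}
```

Unlike the hand-written URL, this version also percent-encodes the square brackets; both forms decode to the same query string.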

Why is fetching data through docValues so fast?

DocValues is a column-oriented storage format, and this layout lowers the cost of random reads.

Traditional row-oriented storage looks like this:

(In the original figure, 1 and 2 are docids, and the colors denote different fields.)

Changed to column-oriented storage, it looks like this:

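Column-oriented storage splits one file into several, one per column. Each file is ordered by docid, so once the docid is known, its offset within the file can be computed directly. In other words, each docid costs one random read.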
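The offset calculation can be illustrated with a toy column of fixed-width values. This is only a sketch of the idea: the class ColumnOffsetSketch is hypothetical, and Lucene's real docValues encodings are more compact than one 8-byte slot per document.

```java
import java.nio.ByteBuffer;

// Toy model of a single column: one 8-byte value per document, in docid order.
final class ColumnOffsetSketch {
    // Packs the values so that the value of docid i starts at byte i * 8.
    public static ByteBuffer writeColumn(long[] valuesByDocId) {
        ByteBuffer column = ByteBuffer.allocate(valuesByDocId.length * Long.BYTES);
        for (long v : valuesByDocId) {
            column.putLong(v);
        }
        return column;
    }

    // One random read per docid: offset = docid * width.
    // The absolute getLong(index) does not depend on the buffer position.
    public static long readValue(ByteBuffer column, int docId) {
        return column.getLong(docId * Long.BYTES);
    }

    public static void main(String[] args) {
        ByteBuffer column = writeColumn(new long[] {10L, 20L, 30L});
        System.out.println(readValue(column, 2));
    }
}
```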

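So how does this layout make random reads faster? The secret is that Lucene reads these files through a memory-mapped byte buffer, that is, mmap. With this kind of file access, the operating system caches the file in memory. When there is enough memory, accessing the file is equivalent to accessing memory, so a random read is no longer a disk operation but a random read from memory.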
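Java exposes the same mechanism through FileChannel.map. The sketch below writes a toy column to a temporary file and reads one value back through a memory-mapped buffer. It illustrates mmap-style access in general, not Lucene's actual code path, and the class MmapColumnSketch is hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Reads a fixed-width column file through a memory-mapped buffer.
final class MmapColumnSketch {
    // Writes one 8-byte value per docid into a temporary file.
    public static Path writeColumnFile(long[] valuesByDocId) {
        try {
            Path file = Files.createTempFile("column-", ".bin");
            file.toFile().deleteOnExit();
            ByteBuffer buf = ByteBuffer.allocate(valuesByDocId.length * Long.BYTES);
            for (long v : valuesByDocId) {
                buf.putLong(v);
            }
            Files.write(file, buf.array());
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Maps the file and reads the value at offset docid * 8.
    // Once the OS has the pages cached, this read is served from memory.
    public static long readValue(Path columnFile, int docId) {
        try (FileChannel ch = FileChannel.open(columnFile, StandardOpenOption.READ)) {
            MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            return mapped.getLong(docId * Long.BYTES);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = writeColumnFile(new long[] {7L, 8L, 9L});
        System.out.println(readValue(file, 1));
    }
}
```

A real reader would map the file once and reuse the buffer; remapping on every read, as done here for brevity, would throw away most of the benefit.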

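Then why can't row-oriented storage take advantage of mmap in the same way? Because with row-oriented storage a single file holds the data of many columns. Such a file is usually very large and exceeds the size of the operating system's file cache. Column-oriented storage splits the columns into many separate files, so only the columns actually in use need to be cached, and rarely used column data does not waste memory.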

Note that export fields only support the types int, float, long, double, and string. If your query results contain only fields of these types, fetching the data this way is much faster.

Below is the speed comparison between Solr's "/select" and "/export".

Time comparison:

Query	Time
MySQL (no index)	30 s
MySQL (with index)	2 s
SolrJ (select query)	12 s
SolrJ (export query)	2 s

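In our project, paginated queries use the select handler, and queries that need to fetch a large result set in one go use the export handler. We did not go with indexing the query column in MySQL, because the data volume grows every day. Once it reaches hundreds of millions of rows, an index no longer solves the problem well, and the project has other query requirements as well.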

Grouped queries

Let's look at another query requirement. Suppose we want the distribution of data across devices (deviceID):

In SQL this takes 33 s:

SELECT deviceID,Count(*) FROM `tf_hotspotdata_copy_test` GROUP BY deviceID;

After creating an index on CollectTime, the same query takes only 14 s.

Now look at Solr's facet query: it takes only 540 ms, dramatically faster.

SolrQuery query = new SolrQuery();
query.set("q", "*:*");
query.setFacet(true);
query.addFacetField("DeviceID");
QueryResponse response = server.query(query);
FacetField idFacetField = response.getFacetField("DeviceID");
List<Count> idCounts = idFacetField.getValues();
for (Count count : idCounts) {
    System.out.println(count.getName()+": "+count.getCount());
}

Time comparison:

Query (statistics)	Time
MySQL (no index)	33 s
MySQL (with index)	14 s
SolrJ (facet query)	0.54 s
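For readers less familiar with faceting: a field facet on DeviceID returns, for each distinct value, the number of matching documents, which is the same result shape as the GROUP BY query. The plain-Java sketch below shows that computation on an in-memory list. It is only meant to show what is being counted, not how Solr computes it internally, and the class FacetCountSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory equivalent of SELECT deviceID, COUNT(*) ... GROUP BY deviceID.
final class FacetCountSketch {
    // Counts how many times each device id occurs; keys come back sorted.
    public static Map<String, Long> countByValue(List<String> deviceIds) {
        Map<String, Long> counts = new TreeMap<>();
        for (String id : deviceIds) {
            counts.merge(id, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("1013", "1014", "1013");
        for (Map.Entry<String, Long> e : countByValue(ids).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```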

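If we want statistics for one device over a time range, bucketed by hour, week, month, or year, Solr handles that conveniently too. For example, the following counts the data on device 1013 per day: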

    String startTime = "2014-12-06 00:00:00";
    String endTime = "2014-12-16 21:31:55";   
    SolrQuery query = new SolrQuery();
    query.set("q", "DeviceID:1013");
    query.setFacet(true);
    Date start = DateFormatHelper.ToSolrSearchDate(DateFormatHelper.StringToDate(startTime));
    Date end = DateFormatHelper.ToSolrSearchDate(DateFormatHelper.StringToDate(endTime));
    query.addDateRangeFacet("CollectTime", start, end, "+1DAY");
    QueryResponse response = server.query(query);

    List<RangeFacet> dateFacetFields = response.getFacetRanges();
    for (RangeFacet facetField : dateFacetFields) {
        List<org.apache.solr.client.solrj.response.RangeFacet.Count> dateCounts= facetField.getCounts();
        for (org.apache.solr.client.solrj.response.RangeFacet.Count count : dateCounts) {
            System.out.println(count.getValue()+": "+count.getCount());
        }
    }
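To make the "+1DAY" gap concrete, the following sketch lists the bucket start times that a one-day gap produces between the start and end used above. It uses java.time only, and the class DayBucketSketch is hypothetical. How the last, partial bucket is treated in Solr depends on the facet.range.hardend parameter, which this sketch does not model.

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;

// Lists the start of each one-day bucket in [start, end).
final class DayBucketSketch {
    public static List<LocalDateTime> bucketStarts(LocalDateTime start, LocalDateTime end) {
        List<LocalDateTime> starts = new ArrayList<>();
        // Step forward one day at a time until the bucket would begin at or after end.
        for (LocalDateTime t = start; t.isBefore(end); t = t.plusDays(1)) {
            starts.add(t);
        }
        return starts;
    }

    public static void main(String[] args) {
        LocalDateTime start = LocalDateTime.of(2014, 12, 6, 0, 0, 0);
        LocalDateTime end = LocalDateTime.of(2014, 12, 16, 21, 31, 55);
        for (LocalDateTime t : bucketStarts(start, end)) {
            System.out.println(t);
        }
    }
}
```

For the range in the article this yields eleven buckets, starting on 2014-12-06 and ending with the one that begins on 2014-12-16.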