Spatial Search Based on Solr (2)

Date: 2019-11-13


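This article continues the look at spatial search with Solr + Lucene, covering the details of index construction and querying with Cartesian Tiers and GeoHash.

Solr supports quite a few built-in distance functions, but indexing and querying on coordinates mainly relies on two approaches:

(1) GeoHash

(2) Cartesian Tiers + GeoHash

The source code for both can be found in lucene-spatial.jar. Below I go through index construction and querying for these two approaches. It is all code analysis, so interested readers can read on.

GeoHash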

Index construction phase

Define the geohash field type in schema.xml:

<fieldtype name="geohash" class="solr.GeoHashField"/>

Then, when building the index, use the GeoHashUtils class from lucene-spatial.jar:

String geoHash = GeoHashUtils.encode(latitude, longitude); // the geohash algorithm turns the latitude/longitude into a base32 string
document.addField("geohash", geoHash); // store the base32 encoding of the latitude/longitude in the index
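To make the encoding step concrete, here is a minimal sketch of the standard geohash algorithm: it repeatedly halves the longitude and latitude ranges, interleaves the resulting bits starting with longitude, and emits one base32 character per 5 bits. The class name GeoHashSketch and the precision parameter are assumptions made for this illustration; this is not the GeoHashUtils source, whose encode(latitude, longitude) takes no precision argument.

```java
final class GeoHashSketch {
    // Standard geohash base32 alphabet (no a, i, l, o).
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    // precision = number of base32 characters to produce (hypothetical parameter).
    static String encode(double latitude, double longitude, int precision) {
        double[] latRange = {-90.0, 90.0};
        double[] lonRange = {-180.0, 180.0};
        StringBuilder hash = new StringBuilder(precision);
        boolean useLongitude = true; // bits alternate, starting with longitude
        int bits = 0;
        int bitCount = 0;
        while (hash.length() < precision) {
            double[] range = useLongitude ? lonRange : latRange;
            double value = useLongitude ? longitude : latitude;
            double mid = (range[0] + range[1]) / 2;
            bits <<= 1;
            if (value >= mid) {
                bits |= 1;       // upper half: emit 1 and raise the lower bound
                range[0] = mid;
            } else {
                range[1] = mid;  // lower half: emit 0 and lower the upper bound
            }
            useLongitude = !useLongitude;
            if (++bitCount == 5) { // every 5 bits become one base32 character
                hash.append(BASE32.charAt(bits));
                bits = 0;
                bitCount = 0;
            }
        }
        return hash.toString();
    }
}
```

For example, GeoHashSketch.encode(42.6, -5.6, 5) yields "ezs42". Nearby points tend to share a prefix, which is the property the prefix-based schemes rely on.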

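Query phase

Configure the query parser (QP) in solrconfig.xml; it parses the Query in the user's request.

The query syntax is {!spatial sfield=geofield pt=latitude,longitude d=xx sphere_radius=xx}

sfield: name of the geohash field

pt: latitude,longitude string

d: spherical distance

sphere_radius: radius of the sphere (in the parse() code this defaults to DistanceUtils.EARTH_MEAN_RADIUS_KM)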
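As a small illustration of the syntax, the sketch below assembles such a filter string in Java. The helper name spatialFilter and the field name "geohash" are assumptions for this example; the parser name "spatial" is taken from the syntax above, and the parameters are written space-separated, as Solr local params normally are.

```java
import java.util.Locale;

final class SpatialQuerySketch {
    // Builds a local-params filter string such as
    // {!spatial sfield=geohash pt=45.15,-93.85 d=5.0}
    // (hypothetical helper, not part of Solr).
    static String spatialFilter(String field, double lat, double lon, double distanceKm) {
        return String.format(Locale.ROOT, "{!spatial sfield=%s pt=%s,%s d=%s}",
                field, lat, lon, distanceKm);
    }
}
```

The resulting string would typically be sent as a filter query (fq) parameter.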

Next, let's see how the QP parses the query above and produces a GeoHash-based Query. The following code comes from the parse() method of SpatialFilterQParser:

// the geohash field type necessarily implements SpatialQueryable
if (type instanceof SpatialQueryable) {
    double radius = localParams.getDouble(SpatialParams.SPHERE_RADIUS, DistanceUtils.EARTH_MEAN_RADIUS_KM); // sphere radius
    // pointStr = latitude/longitude string, dist = distance, DistanceUnits.KILOMETERS = distance unit
    SpatialOptions opts = new SpatialOptions(pointStr, dist, sf, measStr, radius, DistanceUnits.KILOMETERS);
    opts.bbox = bbox;
    // build the Query through GeoHashField
    result = ((SpatialQueryable)type).createSpatialQuery(this, opts);
}

The core method here is GeoHashField's createSpatialQuery(), which builds the geohash-based Query. Expanded, it looks like this:

public Query createSpatialQuery(QParser parser, SpatialOptions options) {
    double [] point = new double[0];
    try {
        // parse the latitude/longitude
        point = DistanceUtils.parsePointDouble(null, options.pointStr, 2);
    } catch (InvalidGeoException e) {
        throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, e);
    }
    // encode the latitude/longitude as base32; for how the encoding works, see the geohash algorithm analysis section
    String geohash = GeoHashUtils.encode(point[0], point[1]);
    //TODO: optimize this
    return new SolrConstantScoreQuery(new ValueSourceRangeFilter(new GeohashHaversineFunction(getValueSource(options.field, parser),
            new LiteralValueSource(geohash), options.radius), "0", String.valueOf(options.distance), true, true));
}

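The source carries the author's own marker, TODO: optimize this. Reading this implementation, I had doubts as well. The overall flow uses a Lucene Filter to filter the matching docIds, but the range it has to filter over looks to me like a performance problem, which is probably why the TODO: optimize this is there.

Now for the core processing flow. Lucene's query pipeline is Query -> Weight -> Scorer, and the Scorer is what actually iterates over the result set. This case is no exception: SolrConstantScoreQuery -> ConstantWeight -> ConstantScorer. The Query creates the Weight and the Weight creates the Scorer; readers familiar with Lucene know this well, so I will not repeat it here. The ConstantScorer walks a DocIdSetIterator to obtain the docIds that satisfy the condition. That DocIdSetIterator is supplied by the ValueSourceRangeFilter in the code above, which filters out documents that are not within the specified spherical distance. ValueSourceRangeFilter is not the class that does the real work, though; it hands the filtering to GeohashHaversineFunction, as the following ValueSourceRangeFilter code shows: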

public DocIdSet getDocIdSet(final Map context, final IndexReader reader) throws IOException {
    return new DocIdSet() {
        // lowerVal=0, upperVal=distance, includeLower=true, includeUpper=true
        @Override
        public DocIdSetIterator iterator() throws IOException {
            // valueSource = GeohashHaversineFunction, the class that actually filters the documents
            return valueSource.getValues(context, reader).getRangeScorer(reader, lowerVal, upperVal, includeLower, includeUpper);
        }
    };
}

So let's continue with GeohashHaversineFunction, starting with its getRangeScorer() method. The core part is:

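if (includeLower && includeUpper) {
    return new ValueSourceScorer(reader, this) {
        @Override
        public boolean matchesValue(int doc) {
            // compute the distance between the document's latitude/longitude and the query point
            float docVal = floatVal(doc);
            // return true if docVal (the spherical distance between the target and the query point)
            // is within the given distance, i.e. the target lies within the search radius
            return docVal >= l && docVal <= u;
        }
    };
}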

That leads to GeohashHaversineFunction.floatVal(), which computes the spherical distance. It ultimately calls the distance() method, shown below:

protected double distance(int doc, DocValues gh1DV, DocValues gh2DV) {
    double result = 0;
    String h1 = gh1DV.strVal(doc); // base32 encoding of the document's latitude/longitude
    String h2 = gh2DV.strVal(doc); // base32 encoding of the query latitude/longitude
    if (h1 != null && h2 != null && h1.equals(h2) == false){
        //TODO: If one of the hashes is a literal value source, seems like we could cache it
        //and avoid decoding every time
        double[] h1Pair = GeoHashUtils.decode(h1); // base32 decode
        double[] h2Pair = GeoHashUtils.decode(h2);
        // compute the spherical distance between the two latitude/longitude pairs
        result = DistanceUtils.haversine(Math.toRadians(h1Pair[0]), Math.toRadians(h1Pair[1]),
                Math.toRadians(h2Pair[0]), Math.toRadians(h2Pair[1]), radius);
    } else if (h1 == null || h2 == null){
        result = Double.MAX_VALUE;
    }
    // return the spherical distance between the two points
    return result;
}
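For readers who want to see what the haversine step computes, here is a minimal self-contained sketch of the haversine great-circle formula. The class name HaversineSketch, the method signature (degrees in, same unit as the radius out) and the rounded radius of 6371.0 km are assumptions for this illustration; it is not the DistanceUtils.haversine source, which in the code above takes radians.

```java
final class HaversineSketch {
    // Approximate mean Earth radius in km (assumed value for this sketch).
    static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle distance between two points given in degrees.
    static double haversine(double lat1Deg, double lon1Deg,
                            double lat2Deg, double lon2Deg, double radius) {
        double lat1 = Math.toRadians(lat1Deg);
        double lat2 = Math.toRadians(lat2Deg);
        double dLat = lat2 - lat1;
        double dLon = Math.toRadians(lon2Deg - lon1Deg);
        double sinLat = Math.sin(dLat / 2);
        double sinLon = Math.sin(dLon / 2);
        // a = sin^2(dLat/2) + cos(lat1) * cos(lat2) * sin^2(dLon/2)
        double a = sinLat * sinLat + Math.cos(lat1) * Math.cos(lat2) * sinLon * sinLon;
        // clamp to 1.0 to guard against rounding slightly above the asin domain
        return 2 * radius * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }
}
```

Two points a quarter of the way around the equator, (0, 0) and (0, 90), come out at radius * PI / 2, which is a convenient sanity check. Note that every call involves several trigonometric operations, so running it once per document is not cheap.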

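So the whole query flow walks every docId in the index, starting from docId = 0, and checks whether the spherical distance between that document's latitude/longitude and the query point is within the distance given in the query. If it is, the docId is returned; if not, it is filtered out.

Note that this really is every docId, which is why I consider this filtering approach unreliable, and it is perhaps what the author meant by needing further optimization. If you are unsure why every docId ends up being examined, look at the nextDoc() and advance() methods of ValueSourceScorer; it should be clear after reading them. That concludes the walkthrough of Solr's GeoHash-based query implementation.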
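The cost of that full scan can be sketched with a simplified model of a ValueSourceScorer-style iterator. Everything here (the class FullScanSketch, the ScanningIterator type, the checked counter and the precomputed distances array) is a hypothetical stand-in, not Solr code; it only assumes, as described above, that nextDoc() advances one docId at a time and tests each one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

final class FullScanSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Simplified stand-in for a scorer that tests every docId in turn.
    static final class ScanningIterator {
        private final int maxDoc;
        private final IntPredicate matchesValue;
        private int doc = -1;
        int checked = 0; // how many docIds were actually tested

        ScanningIterator(int maxDoc, IntPredicate matchesValue) {
            this.maxDoc = maxDoc;
            this.matchesValue = matchesValue;
        }

        int nextDoc() {
            while (doc < maxDoc - 1) {
                doc++;
                checked++;
                if (matchesValue.test(doc)) {
                    return doc;
                }
            }
            doc = NO_MORE_DOCS;
            return doc;
        }
    }

    // Collects the docIds whose distance lies in [0, upper].
    static List<Integer> collect(ScanningIterator it) {
        List<Integer> hits = new ArrayList<>();
        for (int d = it.nextDoc(); d != NO_MORE_DOCS; d = it.nextDoc()) {
            hits.add(d);
        }
        return hits;
    }

    public static void main(String[] args) {
        double[] distances = {1.0, 7.5, 3.0, 12.0, 5.0}; // pretend per-document distances in km
        ScanningIterator it = new ScanningIterator(distances.length,
                doc -> distances[doc] >= 0 && distances[doc] <= 5.0);
        System.out.println(collect(it) + " matched, " + it.checked + " docs checked");
    }
}
```

However few documents match, the number of docIds checked equals the size of the index, and in the real code each check also decodes two geohashes and evaluates a haversine distance.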
