Getting a Distinct Count in Solr

               

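Today I ran into a problem: when the data was first loaded into Solr, no record counts were computed, and now I need the total number of distinct values across several tables.
Solr's ISearch has no Distinct feature. One idea was to use Solr facets as a GROUP BY, but the facet can return only 100 entries and the data definitely has more than 100 groups, so I rejected that approach.
Later I found "Solr Count Distinct" online. It is something Solr has already released as part of its request scripting (Solr Search Requests), and it provides exactly this kind of functionality.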

A 100% accurate count of distinct values (count distinct) is not generally possible without actually observing all of the values together. However there are a number of ways to estimate the count.

"unique" Facet Function
  The unique facet function is Solr's fastest implementation to calculate the number of distinct values.
  It always provides exact counts on a single Solr node. For distributed search over multiple nodes, it provides exact counts when the number of values per node does not exceed 100 (by default).

When the number of unique values does exceed 100 in any given shard, the following algorithm is used:

It estimates the count by sending the top 100 results from each shard along with the total exact "unique" count for each shard.
  totalSeen is the number of actual results we saw from all shards (i.e. not deduped yet).
  uniqueSeen is the number of unique values we saw from all shards (i.e. deduped).
  notSeen is the number of unique values from each shard that were not sent (because of the 100 cutoff).
  factor = uniqueSeen / totalSeen (i.e. what fraction of values that we saw were unique)
  estimate = uniqueSeen + ( notSeen * factor ) (i.e. we simply apply the factor to the number of values we didn’t see)
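
The estimation steps above can be sketched in Python as follows. This is only an illustration of the arithmetic, not Solr's actual code: the function name `estimate_unique`, the `cutoff` parameter, and the choice of which 100 values each shard sends (here simply the first ones in sorted order) are assumptions made for the example.

```python
def estimate_unique(shards, cutoff=100):
    """Estimate the distinct count across shards using the formula above.

    shards: one list of values per shard (hypothetical input format).
    cutoff: how many distinct values each shard sends (assumed to be 100).
    """
    total_seen = 0          # values received from all shards, not deduped
    unique_values = set()   # values received from all shards, deduped
    not_seen = 0            # per-shard distinct values that were not sent

    for values in shards:
        distinct = sorted(set(values))   # exact distinct values on this shard
        sent = distinct[:cutoff]         # assumption: the first `cutoff` are sent
        total_seen += len(sent)
        unique_values.update(sent)
        not_seen += len(distinct) - len(sent)

    unique_seen = len(unique_values)
    if total_seen == 0:
        return 0.0
    factor = unique_seen / total_seen    # fraction of seen values that were unique
    return unique_seen + not_seen * factor


# Two shards with 150 distinct values each, 50 of which overlap.
overlapping = [list(range(0, 150)), list(range(100, 250))]
print(estimate_unique(overlapping))
```

With this input the true distinct count is 250, but the sketch returns 300: the overlapping values fall outside what each shard sends, so the factor stays at 1. When every shard stays under the cutoff, the result is exact.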
Example use:

$ curl http://localhost:8983/solr/techproducts/query -d '
q=*:*&
json.facet={
  x : "unique(manu_exact)"    // manu_exact is the manufacturer indexed as a single string
}'

For more facet functions, adding facet functions to each facet bucket, or sorting by facet function, see Solr Facet Functions.


Aggregation Functions
Faceting involves breaking up the domain into multiple buckets and providing information about each bucket.
There are multiple aggregation functions / statistics that can be used:

Aggregation   Example                             Effect
sum           sum(sales)                          summation of numeric values
avg           avg(popularity)                     average of numeric values
sumsq         sumsq(rent)                         sum of squares
min           min(salary)                         minimum value
max           max(mul(price,popularity))          maximum value
unique        unique(state)                       number of unique values (count distinct)
hll           hll(state)                          number of unique values using the HyperLogLog algorithm
percentile    percentile(salary,50,75,99,99.9)    calculates percentiles
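
Several of these aggregations can be requested together in one json.facet parameter. The Python sketch below only builds the form-encoded request body and does not contact a Solr server; the helper name `build_facet_body` and the facet labels `exact`, `approx` and `total` are made up for illustration, and the field names are the ones used in the table's examples.

```python
import json
from urllib.parse import urlencode


def build_facet_body(query, aggregations):
    """Return a form-encoded body carrying q and json.facet.

    aggregations maps a label of your choice to a facet function string,
    for example {"exact": "unique(state)"}.
    """
    facet = json.dumps(aggregations)
    return urlencode({"q": query, "json.facet": facet})


body = build_facet_body(
    "*:*",
    {
        "exact": "unique(state)",   # count distinct
        "approx": "hll(state)",     # HyperLogLog-based count
        "total": "sum(sales)",      # sum of a numeric field
    },
)
print(body)
```

The resulting string could then be posted to a collection's query endpoint in the same way as the curl examples in this post.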

Here is an example I wrote:

curl "http://192.168.1.1:8080/solr/xxshard/query?q=*:*" -d 'json.facet={ x:"unique(RB040002)" }'

Detailed usage and the other features are covered in the original articles below:

http://yonik.com/solr-count-distinct/
http://yonik.com/solr-facet-functions/
