SolrCloud Wiki Translation (3): Shards & Indexing Data

When your data is too large for one node, you can break it up and store it in sections by creating one or more shards. Each shard is a portion of the logical index, or core, and it's the set of all nodes containing that section of the index.

A shard is a way of splitting a core over a number of "servers", or nodes. For example, you might have a shard for data that represents each state, or for different categories that are likely to be searched independently but are often combined.

Before SolrCloud, Solr supported Distributed Search, which allowed one query to be executed across multiple shards, so the query was executed against the entire Solr index and no documents would be missed from the search results. Splitting the core across shards is therefore not an exclusively SolrCloud concept. There were, however, several problems with the distributed approach that necessitated improvement with SolrCloud:

  1. Splitting of the core into shards was somewhat manual.
  2. There was no support for distributed indexing, which meant that you needed to explicitly send documents to a specific shard; Solr couldn't figure out on its own which shards to send documents to.
  3. There was no load balancing or failover, so if you got a high number of queries, you needed to figure out where to send them, and if one shard died it was just gone.

SolrCloud fixes all those problems. There is support for distributing both the index process and the queries automatically, and ZooKeeper provides failover and load balancing. Additionally, every shard can also have multiple replicas for additional robustness.

Unlike Solr 3.x, in SolrCloud there are no masters or slaves. Instead, there are leaders and replicas. Leaders are automatically elected, initially on a first-come-first-served basis, and then based on the ZooKeeper leader-election process described at http://zookeeper.apache.org/doc/trunk/recipes.html#sc_leaderElection.

If a leader goes down, one of its replicas is automatically elected as the new leader. As each node is started, it's assigned to the shard with the fewest replicas. When there's a tie, it's assigned to the shard with the lowest shard ID.

When a document is sent to a machine for indexing, the system first determines whether the machine is a replica or a leader.

  • If the machine is a replica, the document is forwarded to the leader for processing.
  • If the machine is a leader, SolrCloud determines which shard the document should go to, forwards the document to the leader for that shard, indexes the document for this shard, and forwards the index notation to itself and any replicas (see the sketch after this list).
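
As a minimal sketch of this flow, the snippet below posts a small batch of documents to a single SolrCloud node and lets the cluster handle routing and replication. The node address, collection name, and field names are illustrative assumptions; the /update JSON endpoint and the commit parameter are standard Solr.

```python
import requests

SOLR_NODE = "http://localhost:8983/solr"  # any node of the cluster (assumed address)
COLLECTION = "collection1"                # assumed collection name

# Documents can be posted to any node; if this node is a replica, the update
# is forwarded to its leader, and the leader of the target shard distributes
# the indexed document to its replicas.
docs = [
    {"id": "12345", "title": "first document"},
    {"id": "67890", "title": "second document"},
]

resp = requests.post(
    f"{SOLR_NODE}/{COLLECTION}/update",
    params={"commit": "true", "wt": "json"},
    json=docs,
)
resp.raise_for_status()
print(resp.json())
```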


Document Routing

Solr 4.1 added the ability to co-locate documents to improve query performance.

Solr 4.5 has added the ability to specify the router implementation with the router.name parameter. If you use the "compositeId" router, you can send documents with a prefix in the document ID which will be used to calculate the hash Solr uses to determine the shard a document is sent to for indexing. The prefix can be anything you'd like it to be (it doesn't have to be the shard name, for example), but it must be consistent so Solr behaves consistently. For example, if you wanted to co-locate documents for a customer, you could use the customer name or ID as the prefix. If your customer is "IBM", for example, with a document with the ID "12345", you would insert the prefix into the document id field: "IBM!12345". The exclamation mark ('!') is critical here, as it defines the shard to direct the document to.
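
A brief sketch of indexing with prefixed IDs, using the "IBM!" prefix from the example above; the node address, collection name, and extra fields are assumptions, while the ID format and the /update endpoint follow standard Solr conventions.

```python
import requests

# All documents sharing the "IBM!" prefix hash to the same shard, so this
# customer's data is co-located (assumed node address and collection name).
docs = [
    {"id": "IBM!12345", "customer": "IBM", "body": "first IBM document"},
    {"id": "IBM!12346", "customer": "IBM", "body": "second IBM document"},
]

requests.post(
    "http://localhost:8983/solr/collection1/update",
    params={"commit": "true"},
    json=docs,
).raise_for_status()
```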

Then at query time, you include the prefix(es) in your query with the _route_ parameter (i.e., q=solr&_route_=IBM!) to direct queries to specific shards. In some situations, this may improve query performance because it avoids the network latency of querying all the shards.
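
For instance, the query below restricts the search to the shard(s) holding documents with the "IBM!" prefix; the host and collection name are again illustrative.

```python
import requests

params = {
    "q": "solr",
    "_route_": "IBM!",  # only the shard(s) owning the "IBM!" prefix are queried
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/collection1/select", params=params)
print(resp.json()["response"]["numFound"])
```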

The _route_ parameter replaces shard.keys, which has been deprecated and will be removed in a future Solr release.

If you do not want to influence how documents are stored, you don't need to specify a prefix in your document ID.

If you created the collection and defined the "implicit" router at the time of creation, you can additionally define a router.field parameter to use a field from each document to identify the shard where the document belongs. If the field specified is missing in the document, however, the document will be rejected. You could also use the _route_ parameter to name a specific shard.
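
A sketch of creating such a collection through the Collections API; the collection name, shard names, and routing field are assumptions, while action=CREATE, router.name, shards, and router.field are the documented parameters.

```python
import requests

params = {
    "action": "CREATE",
    "name": "logs",                  # assumed collection name
    "router.name": "implicit",       # shards are addressed explicitly, not by hash
    "shards": "shard1,shard2",       # implicit routing requires named shards
    "router.field": "shard_label",   # documents missing this field are rejected
    "replicationFactor": "1",
    "maxShardsPerNode": "2",         # allows both shards on a single test node
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/admin/collections", params=params)
print(resp.json())
```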

Shard Splitting

Until Solr 4.3, when you created a collection in SolrCloud, you had to decide on the number of shards at creation time and you could not change it later. It can be difficult to know in advance the number of shards that you need, particularly when organizational requirements can change at a moment's notice, and the cost of finding out later that you chose wrong can be high, involving creating new cores and re-indexing all of your data.

The ability to split shards is in the Collections API. It currently allows splitting a shard into two pieces. The existing shard is left as-is, so the split action effectively makes two copies of the data as new shards. (Translator's note: in other words, the data of the original shard is divided into two new shards as copies.) You can delete the old shard at a later time when you're ready.

Note: for a detailed walkthrough of how shard splitting works and how to use it, see the article on searchhub.

More details on how to use shard splitting are in the section on the Collections API.
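
As a quick illustration of that API call, the request below splits shard1 of a collection into two sub-shards (named shard1_0 and shard1_1 by Solr); the host and collection name are assumptions.

```python
import requests

params = {
    "action": "SPLITSHARD",
    "collection": "collection1",  # assumed collection name
    "shard": "shard1",            # parent shard; Solr creates shard1_0 and shard1_1
    "wt": "json",
}
# Splitting can take a long time on large shards, so allow a generous timeout.
resp = requests.get(
    "http://localhost:8983/solr/admin/collections",
    params=params,
    timeout=600,
)
print(resp.json())
```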

End of article.
