[Original] Big Data Basics: ElasticSearch (1) Introduction, Installation, and Usage

Date: 2019-11-21

ElasticSearch 6.6.0

Official site: https://www.elastic.co/

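I. Introduction

Put simply, ElasticSearch is a distributed wrapper around Lucene that adds the concepts of shard (each shard is a sub-index, and is itself a Lucene index) and replica. As a result, Lucene concepts such as index and document also appear in ElasticSearch.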

Elasticsearch is a highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze big volumes of data quickly and in near real time. It is generally used as the underlying engine/technology that powers applications that have complex search features and requirements.

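ES is a scalable, open-source full-text search and analytics engine. It can store, search, and analyze large volumes of data in near real time, and it typically serves as the underlying engine that lets applications meet complex search requirements.

Core Concepts

1 Cluster Concepts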

Cluster

A cluster is a collection of one or more nodes (servers) that together holds your entire data and provides federated indexing and search capabilities across all nodes. A cluster is identified by a unique name which by default is "elasticsearch". This name is important because a node can only be part of a cluster if the node is set up to join the cluster by its name.

A Cluster consists of one or more Nodes. The Cluster as a whole holds all of the data and provides indexing and search. Each cluster has a unique name, which defaults to elasticsearch.

Node

A node is a single server that is part of your cluster, stores your data, and participates in the cluster’s indexing and search capabilities. Just like a cluster, a node is identified by a name which by default is a random Universally Unique IDentifier (UUID) that is assigned to the node at startup. You can define any node name you want if you do not want the default. This name is important for administration purposes where you want to identify which servers in your network correspond to which nodes in your Elasticsearch cluster.
A node can be configured to join a specific cluster by the cluster name. By default, each node is set up to join a cluster named elasticsearch which means that if you start up a number of nodes on your network and, assuming they can discover each other, they will all automatically form and join a single cluster named elasticsearch.

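A Node is a single server that is part of the cluster; it stores data and participates in the cluster's indexing and search. Each Node also has a unique name, which defaults to a UUID.
A Node joins a cluster by specifying that cluster's name.

2 Index Concepts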

Index

An index is a collection of documents that have somewhat similar characteristics. An index is identified by a name (that must be all lowercase) and this name is used to refer to the index when performing indexing, search, update, and delete operations against the documents in it.

An Index is a collection of documents with similar characteristics. An Index also has a name, which must be all lowercase.
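The lowercase rule is easy to check before a request is sent. The following is a minimal sketch; the helper name `is_valid_index_name` is hypothetical, and it checks only the all-lowercase rule mentioned above, not any other naming restriction Elasticsearch may enforce.

```shell
# Hypothetical helper: succeeds only if the name is non-empty and
# contains no uppercase letters. It covers just the lowercase rule.
is_valid_index_name() {
  case "$1" in
    "") return 1 ;;                 # reject empty names
    *[[:upper:]]*) return 1 ;;      # reject any uppercase letter
    *) return 0 ;;
  esac
}

is_valid_index_name "customer" && echo "customer: ok"
is_valid_index_name "Customer" || echo "Customer: rejected"
```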

Type (deprecated)

A type used to be a logical category/partition of your index to allow you to store different types of documents in the same index. It is no longer possible to create multiple types in an index, and the whole concept of types will be removed in a later version.

A type is a logical concept representing a category or partition within an index. Types are deprecated and will be removed.

Deprecation timeline:
Indices created in Elasticsearch 6.0.0 or later may only contain a single mapping type. Indices created in 5.x with multiple mapping types will continue to function as before in Elasticsearch 6.x. Types will be deprecated in APIs in Elasticsearch 7.0.0, and completely removed in 8.0.0.

Reason for deprecation:
Initially, we spoke about an "index" being similar to a "database" in an SQL database, and a "type" being equivalent to a "table".
This was a bad analogy that led to incorrect assumptions. In an SQL database, tables are independent of each other. The columns in one table have no bearing on columns with the same name in another table. This is not the case for fields in a mapping type.

Document

A document is a basic unit of information that can be indexed. This document is expressed in JSON (JavaScript Object Notation) which is a ubiquitous internet data interchange format.

A document is the basic unit of information that can be indexed, expressed in JSON.
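To make the idea of indexing a JSON document concrete, here is a sketch that builds the request for the document index API. The node address `http://localhost:9200`, the index name `customer`, the ID `1`, the sample body, and the helper names `doc_url` and `index_doc` are all assumptions for illustration; the `_doc` type name follows the single-type convention of 6.x described above.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"   # assumed address of a running node

# Build the URL of a document in a 6.x index (single mapping type "_doc").
doc_url() {
  printf '%s/%s/_doc/%s' "$ES_URL" "$1" "$2"
}

# Index (create or overwrite) a JSON document; needs a reachable node.
index_doc() {
  curl -s -X PUT "$(doc_url "$1" "$2")" \
       -H 'Content-Type: application/json' \
       -d "$3"
}

# Usage (not run here):
#   index_doc customer 1 '{"name": "John Doe"}'
doc_url customer 1; echo
```

Only the URL builder runs when the block is executed, so it can be tried without a cluster.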

3 Other

Near Realtime (NRT)

Elasticsearch is a near-realtime search platform. What this means is there is a slight latency (normally one second) from the time you index a document until the time it becomes searchable.

ES is a near-real-time search platform: the delay between indexing a document and that document becoming searchable is small, normally about one second.

Shards

An index can potentially store a large amount of data that can exceed the hardware limits of a single node. For example, a single index of a billion documents taking up 1TB of disk space may not fit on the disk of a single node or may be too slow to serve search requests from a single node alone.
To solve this problem, Elasticsearch provides the ability to subdivide your index into multiple pieces called shards. When you create an index, you can simply define the number of shards that you want. Each shard is in itself a fully-functional and independent "index" that can be hosted on any node in the cluster.

Sharding is important for two primary reasons:

It allows you to horizontally split/scale your content volume
It allows you to distribute and parallelize operations across shards (potentially on multiple nodes) thus increasing performance/throughput

The mechanics of how a shard is distributed and also how its documents are aggregated back into search requests are completely managed by Elasticsearch and is transparent to you as the user.

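A very large index can exceed the hardware limits of a single machine. ES addresses this by dividing the index into multiple sub-indexes, each called a shard. The number of shards can be specified when the index is created, and each shard is an independent index in its own right.
Shards matter for two reasons: 1) horizontal scaling; 2) distributed, parallel operations.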

Replicas

In a network/cloud environment where failures can be expected anytime, it is very useful and highly recommended to have a failover mechanism in case a shard/node somehow goes offline or disappears for whatever reason. To this end, Elasticsearch allows you to make one or more copies of your index’s shards into what are called replica shards, or replicas for short.

Replication is important for two primary reasons:

It provides high availability in case a shard/node fails. For this reason, it is important to note that a replica shard is never allocated on the same node as the original/primary shard that it was copied from.
It allows you to scale out your search volume/throughput since searches can be executed on all replicas in parallel.

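Failures (hardware or network) are unavoidable in a networked environment, so failover matters. ES lets you configure zero or more copies of each shard, called replicas.
Replicas matter for two reasons: 1) high availability; 2) parallel search across replicas.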
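Because the replica count is a per-index setting, it can be adjusted on an existing index through the index settings API. The sketch below is illustrative only: the node address, the index name `customer`, and the helper names `replica_settings` and `set_replicas` are assumptions.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"   # assumed address of a running node

# JSON body that sets the number of replicas per primary shard.
replica_settings() {
  printf '{"index":{"number_of_replicas":%d}}' "$1"
}

# Apply the setting to an existing index; needs a reachable node.
set_replicas() {
  curl -s -X PUT "$ES_URL/$1/_settings" \
       -H 'Content-Type: application/json' \
       -d "$(replica_settings "$2")"
}

# Usage (not run here):
#   set_replicas customer 0    # 0 means no replicas
replica_settings 2; echo
```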

Summary

To summarize, each index can be split into multiple shards. An index can also be replicated zero (meaning no replicas) or more times. Once replicated, each index will have primary shards (the original shards that were replicated from) and replica shards (the copies of the primary shards).

The number of shards and replicas can be defined per index at the time the index is created. After the index is created, you may also change the number of replicas dynamically anytime. You can change the number of shards for an existing index using the _shrink and _split APIs, however this is not a trivial task and pre-planning for the correct number of shards is the optimal approach.

By default, each index in Elasticsearch is allocated 5 primary shards and 1 replica which means that if you have at least two nodes in your cluster, your index will have 5 primary shards and another 5 replica shards (1 complete replica) for a total of 10 shards per index.

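Each index can be split into multiple shards, and each index can be replicated zero or more times. Once replicated, an index has primary shards (the originals that are copied from) and replica shards (the copies of the primaries).
The number of shards and replicas can be set when the index is created. The number of replicas can be changed at any time afterward; the number of shards can only be changed through the _shrink and _split APIs.
In the 6.x release covered here, the default number of shards is 5 and the default number of replicas is 1.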
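The shard arithmetic from the summary can be sketched in a few lines. The helper names `total_shards` and `index_settings` are hypothetical, and the node address and index name in the usage comment are assumptions.

```shell
# Total shards for an index = primaries * (1 + replicas per primary).
total_shards() {
  echo $(( $1 * (1 + $2) ))
}

# JSON body for creating an index with explicit shard and replica counts.
index_settings() {
  printf '{"settings":{"number_of_shards":%d,"number_of_replicas":%d}}' "$1" "$2"
}

total_shards 5 1            # 6.x defaults: prints 10
index_settings 3 2; echo

# Usage against a running node (not run here):
#   curl -X PUT "http://localhost:9200/customer" \
#        -H 'Content-Type: application/json' -d "$(index_settings 3 2)"
```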

Each Elasticsearch shard is a Lucene index. There is a maximum number of documents you can have in a single Lucene index. As of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents. You can monitor shard sizes using the _cat/shards API.

Each ES shard is a Lucene index, and a Lucene index has a limit on the number of documents it can hold. An index that would exceed that limit per shard needs more shards.
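The per-shard limit gives a rough lower bound on the number of primary shards for a given document count. This is a back-of-the-envelope sketch: the helper name `min_shards` is hypothetical, and it assumes documents are spread perfectly evenly across shards, which is an idealization.

```shell
# Per-shard document limit quoted above: Integer.MAX_VALUE - 128.
LUCENE_MAX_DOCS=$(( 2147483647 - 128 ))

# Minimum number of primary shards for a document count (ceiling division).
min_shards() {
  echo $(( ($1 + LUCENE_MAX_DOCS - 1) / LUCENE_MAX_DOCS ))
}

echo "$LUCENE_MAX_DOCS"     # prints 2147483519
min_shards 5000000000       # 5 billion documents: prints 3
```

In practice shard counts are usually driven by size and throughput well before this hard limit is reached.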

II. Installation

1 Installing with Docker

# docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:6.6.0

2 Installing with Ambari

For details, see: http://www.javashuo.com/article/p-ugvdlakh-dz.html

3 Manual installation from tar

# curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.6.2.tar.gz
# tar -xvf elasticsearch-6.6.2.tar.gz
# cd elasticsearch-6.6.2

Configuration file

config/elasticsearch.yml

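Here you can configure the cluster name, the data directories (as with HDFS, multiple disks can be configured to improve performance), and so on.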
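As an illustration of those settings, the sketch below writes a sample configuration to a temporary file rather than touching a real installation. The cluster name `es_demo`, the node name `node-1`, and the two data paths are made-up placeholder values; `cluster.name`, `node.name`, and `path.data` are the setting names used in elasticsearch.yml.

```shell
# Write a sample config to a temporary file (all values are placeholders).
sample_conf="$(mktemp)"
cat > "$sample_conf" <<'EOF'
cluster.name: es_demo
node.name: node-1
# Several data paths, ideally on separate disks
path:
  data:
    - /data1/es
    - /data2/es
EOF

cat "$sample_conf"
```

Copying such a file over config/elasticsearch.yml is left out on purpose, since the right values depend on the machine.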

Start

# bin/elasticsearch

4 Manual installation with yum

# yum install elasticsearch

Configuration file

/etc/elasticsearch/elasticsearch.yml

Start

# service elasticsearch start

After startup, query the server

# curl http://$es_server:9200

{
  "name" : "y8NAQ9f",
  "cluster_name" : "es_jdc",
  "cluster_uuid" : "fZU17hfjTKO2IDUwnGmbEg",
  "version" : {
    "number" : "6.6.0",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "a9861f4",
    "build_date" : "2019-01-24T11:27:09.439740Z",
    "build_snapshot" : false,
    "lucene_version" : "7.6.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}
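A quick way to confirm which version answered is to pull the "number" field out of that response. The helper name `es_version` is hypothetical, and this is a plain-text sketch that relies on the layout of the response shown above rather than real JSON parsing; a JSON-aware tool would be more robust.

```shell
# Extract the value of "number" from the root endpoint's JSON on stdin.
es_version() {
  sed -n 's/.*"number"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Trimmed sample of the response shown above.
sample='{ "name" : "y8NAQ9f", "version" : { "number" : "6.6.0", "lucene_version" : "7.6.0" } }'
printf '%s\n' "$sample" | es_version

# Against a live node (address is an assumption, not run here):
#   curl -s http://localhost:9200 | es_version
```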