Notes from Exploring ElasticSearch
Installing Elasticsearch is very simple. It's a server for indexing and searching text.
Elasticsearch is a standalone Java app and can easily be started from the command line. A copy can be obtained from the elasticsearch download page.
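On Linux or macOS, starting it looks roughly like this (a sketch; the archive name depends on the version you download, 1.4.2 is assumed here to match the server response shown further down):
# after downloading the .zip from the download page
unzip elasticsearch-1.4.2.zip
cd elasticsearch-1.4.2
./bin/elasticsearch   # starts the server in the foreground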
Microsoft Windows:
Download the .zip version and unpack it to a folder. Navigate to the bin folder, then double-click elasticsearch.bat to run it.
If the server starts successfully, you'll see output in the terminal like this:
[2015-02-04 20:43:12,747][INFO ][node ] [Joe Fixit] started
P.S.: There's a problem you may run into. If the terminal prints messages like this:
[2014-12-17 09:31:03,820][WARN ][cluster.routing.allocation.decider]
[logstash test] high disk watermark [10%] exceeded on
[7drCr113QgSM8wcjNss_Mg][Blur] free: 632.3mb[8.4%], shards will be
relocated away from this node
[2014-12-17 09:31:03,820][INFO ][cluster.routing.allocation.decider]
[logstash test] high disk watermark exceeded on one or more nodes,
rerouting shards
It just means there isn't enough space on your current disk, so you only need to delete some files to free up space.
After you've started your server, you can ensure it's running properly by opening your browser to the URL: http://localhost:9200. You should see a page like this:
{ "status" : 200, "name" : "Joe Fixit", "cluster_name" : "elasticsearch", "version" : { "number" : "1.4.2", "build_hash" : "927caff6f05403e936c20bf4529f144f0c89fd8c", "build_timestamp" : "2014-12-16T14:11:12Z", "build_snapshot" : false, "lucene_version" : "4.10.2" }, "tagline" : "You Know, for Search" }
You're free to use any tool you wish to query elasticsearch; for example, you can install curl (and Cygwin, if you're on Windows) and query it from the command line.
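For instance, the same status check from above can be done with curl (a sketch; adjust the host and port if yours differ):
curl -XGET 'http://localhost:9200/?pretty'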
But if you're reading the book Exploring ElasticSearch, it's worth installing the tool made by the author: elastic-hammer. You can find detailed information on GitHub: https://github.com/andrewvc/elastic-hammer. It's easy to install as a plugin, run from elasticsearch's bin folder:
./plugin -install andrewvc/elastic-hammer
Once installed, open it in your browser at http://<yourelasticsearchserver>/_plugin/elastic-hammer/. By default, <yourelasticsearchserver> is just localhost:9200.
To upgrade the plugin later, remove it and install it again:
./plugin -remove elastic-hammer; ./plugin -install andrewvc/elastic-hammer
Modeling Data
field: the smallest individual unit of data.
documents: collections of fields; they comprise the base unit of storage in elasticsearch.
The primary data-format elasticsearch uses is JSON. A sample document:
{ "_id" : 1, "handle" : "ron", "hobbies" : ["hacking", "the great outdoors"], "computer" : {"cpu" : "pentium pro", "mhz" : 200} }
A user-defined type is analogous to a database schema. Types are defined with the Mapping API:
{ "user" : { "properties" : { "handle" : {"type" : "string"}, "age" : {"type" : "integer"}, "hobbies" : {"type" : "string"}, "computer" : { "properties" : { "cpu" : {"type" : string}, "speed" : {"type" : "integer"} } } } } }
Basic CRUD
The full CRUD lifecycle in elasticsearch is Create, Read, Update, Delete. We'll create an index, then a type, and finally a document within that index using that type. The URL scheme is consistent across these operations: most URLs have the form /index/type/docid, and special operations on a given namespace are prefixed with an underscore.
// create an index named 'planet'
PUT /planet
// create a type called 'hacker'
PUT /planet/hacker/_mapping
{
  "hacker" : {
    "properties" : {
      "handle" : {"type" : "string"},
      "age" : {"type" : "long"}
    }
  }
}
// create a document
PUT /planet/hacker/1
{"handle" : "jean-michea", "age" : 18}
// retrieve the document
GET /planet/hacker/1
// update the document's age field
POST /planet/hacker/1/_update
{"doc" : {"age" : 19}}
// delete the document
DELETE /planet/hacker/1
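If you'd rather use curl than elastic-hammer, the same requests can be issued from the command line. For example, creating and retrieving the document might look like this (a sketch, assuming the server is on localhost:9200):
# create (or replace) document 1 of type 'hacker' in the 'planet' index
curl -XPUT 'http://localhost:9200/planet/hacker/1' -d '{"handle" : "jean-michea", "age" : 18}'
# fetch it back
curl -XGET 'http://localhost:9200/planet/hacker/1?pretty'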
Search Data
First, create our schema:
// Delete the document
DELETE /planet/hacker/1
// Delete any existing indexes named planet
DELETE /planet
// Create our index
PUT /planet/
{
  "mappings" : {
    "hacker" : {
      "properties" : {
        "handle" : {"type" : "string"},
        "hobbies" : {"type" : "string", "analyzer" : "snowball"}
      }
    }
  }
}
Then, seed some data from the hacker_planet.eloader dataset.
The data repository is available at http://github.com/andrewvc/ee-datasets. After cloning the repository, you can load examples into your server by executing the included elastic-loader.jar program, providing the address of your elasticsearch server and the path to the data file. For example, to load the hacker_planet dataset, open a command prompt in the ee-datasets folder and run:
java -jar elastic-loader.jar http://localhost:9200 datasets/hacker_planet.eloader
Finally, we can perform our search:
// Do the search
POST /planet/hacker/_search
{
  "query" : {
    "match" : {
      "hobbies" : "rollerblading"
    }
  }
}
The above code performs a search for those who like rollerblading out of the 3 users we've created in the database.
Searches in elasticsearch are handled by the aptly named search API, which is exposed through the _search endpoint.
For example:
// index search
POST /planet/_search
...
// document type search
POST /planet/hacker/_search
...
The skeleton of a more complex search:
// Load Dataset: hacker_planet.eloader
POST /planet/_search
{
  "from" : 0,
  "size" : 15,
  "query" : {"match_all" : {}},
  "sort" : {"handle" : "desc"},
  "filter" : {"term" : {"_all" : "coding"}},
  "facets" : {
    "hobbies" : {
      "terms" : {
        "field" : "hobbies"
      }
    }
  }
}
All elasticsearch queries boil down to the task of combining these pieces: restricting the set of matching documents, scoring and sorting them, and optionally computing facets over the results.
Text Analysis
Elasticsearch has a toolbox with which we can slice and dice words so they can be searched efficiently. Utilizing these tools, we can narrow our search space and find common ground between linguistically similar terms.
The Snowball analyzer is great at figuring out what the stems of English words are. The stem of a word is its root.
The process by which documents are analyzed is as follows: when a document is indexed, the text of each analyzed field is run through that field's analyzer, and the resulting tokens, rather than the raw text, are what get stored in the index and matched at query time.
The easiest way to see analysis in action is with the Analyze API:
GET /_analyze?analyzer=snowball&text=candles%20candle&pretty=true
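The response should show that both words reduce to the same stem. Abridged, it looks roughly like this (offsets and other token metadata omitted):
{
  "tokens" : [
    { "token" : "candl", ... },
    { "token" : "candl", ... }
  ]
}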
An analyzer is really a three-stage pipeline comprised of the following execution steps:
1. Character filtering: transforms the raw input text (stripping HTML, for instance).
2. Tokenization: splits the text into individual tokens.
3. Token filtering: transforms each token (lowercasing or stemming it, for instance).
Let's dive in by building a custom analyzer for tokenizing CSV data. Custom analyzers can be stored at the index level, either during or after index creation. Let's create one:
// Create the index
PUT /recipes
// Close the index for settings update
POST /recipes/_close
// Create the analyzer
PUT /recipes/_settings
{
  "index" : {
    "analysis" : {
      "tokenizer" : {
        "comma" : {"type" : "pattern", "pattern" : ","}
      },
      "analyzer" : {
        "recipe_csv" : {
          "type" : "custom",
          "tokenizer" : "comma",
          "filter" : ["trim", "lowercase"]
        }
      }
    }
  }
}
// Reopen the index
POST /recipes/_open
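To check that the new analyzer behaves as expected, you can point the Analyze API at the recipes index (the sample text here is just an illustration):
// Run some sample CSV input through the custom analyzer
GET /recipes/_analyze?analyzer=recipe_csv&text=Flour,%20Brown%20Sugar,%20Salt&pretty=true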
Faceting
Facets are always attached to a query, letting you return aggregate statistics alongside regular query results. We'll create a database of movies and return facets based on the movies' genres alongside standard query results. As usual, we need to load the movie_db.eloader dataset into the elasticsearch server.
Simple movie mapping:
// Load Dataset: movie_db.eloader
GET /movie_db/movie/_mapping?pretty=true
{
  "movie" : {
    "properties" : {
      "actors" : {"type" : "string", "analyzer" : "standard", "position_offset_gap" : 100},
      "genre" : {"type" : "string", "index" : "not_analyzed"},
      "release_year" : {"type" : "integer", "index" : "not_analyzed"},
      "title" : {"type" : "string", "analyzer" : "snowball"},
      "description" : {"type" : "string", "analyzer" : "snowball"}
    }
  }
}
Simple terms faceting:
// Load Dataset: movie_db.eloader
POST /movie_db/_search
{
  "query" : {"match" : {"description" : "hacking"}},
  "facets" : {
    "genre" : {
      "terms" : {"field" : "genre", "size" : 10}
    }
  }
}
This query searches for movies whose description contains "hacking". Alongside the regular hits, it returns a genre facet showing which genres those matching movies belong to and how many matching movies fall into each genre.
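The facet portion of the response is shaped roughly like this (the genre names and counts below are purely illustrative; the real values depend on the movie_db dataset):
"facets" : {
  "genre" : {
    "_type" : "terms",
    "terms" : [
      {"term" : "sci-fi", "count" : 1},
      {"term" : "thriller", "count" : 1}
    ]
  }
}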