Elasticsearch is a real-time distributed search engine. It is used for full-text search, structured search, analytics, and all three in combination.
Here are some use cases.
- Wikipedia uses Elasticsearch to provide full-text search with highlighted search snippets, and search-as-you-type and did-you-mean suggestions.
- Stack Overflow combines full-text search with geolocation queries and uses more-like-this to find related questions and answers.
- GitHub uses Elasticsearch to query 130 billion lines of code.
Elasticsearch is an open-source search engine built on top of Apache Lucene™, a full-text search-engine library.
The distributed document store enables it to process large volumes of data in parallel, quickly finding the best matches for queries.
RESTful API with JSON over HTTP
Clients communicate with Elasticsearch through its RESTful API, sending JSON over HTTP.
Elasticsearch provides official clients for several languages (Groovy, JavaScript, .NET, PHP, Perl, Python, and Ruby), and there are numerous community-provided clients and integrations, all of which are listed in Elasticsearch Clients.
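As a rough illustration of JSON over HTTP, the sketch below builds a search request with Python's standard library only, without sending it. The address localhost:9200 and the index name "employees" are assumptions made for the example, not values taken from these notes.

```python
import json
import urllib.request

# Assumption: a node is reachable at localhost:9200; adjust for your cluster.
BASE_URL = "http://localhost:9200"


def build_search_request(index, body):
    """Build (but do not send) a JSON-over-HTTP search request."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "employees" is a hypothetical index name used only for illustration.
request = build_search_request("employees", {"query": {"match_all": {}}})
print(request.get_method(), request.full_url)
# To actually send it against a running node: urllib.request.urlopen(request)
```

An official client wraps the same kind of request behind language-specific method calls.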
Phrase search
Elasticsearch supports matching exact sequences of words or phrases.
For instance, we could perform a query that will match only employee records that contain the phrase “rock climbing”.
Elasticsearch supports highlighting snippets of text from each search result so the user can see why the document matched the query.
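A phrase search with highlighting might look like the request body sketched below. The field name "about" is an assumption for illustration; substitute whichever text field holds the employee description.

```python
import json


def phrase_search_body(field, phrase):
    """Request body: match an exact phrase and highlight it in the results."""
    return {
        # match_phrase requires the words to appear together, in order.
        "query": {"match_phrase": {field: phrase}},
        # Ask for highlighted snippets from the same field.
        "highlight": {"fields": {field: {}}},
    }


# "about" is a hypothetical field name.
body = phrase_search_body("about", "rock climbing")
print(json.dumps(body, indent=2))
```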
Analytics
Elasticsearch supports aggregations to generate sophisticated analytics.
Aggregations allow hierarchical rollups too.
For example, you can find the average age of employees who share a particular interest.
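The average-age example could be expressed as a nested aggregation, sketched below as a request body. The field names "interests" and "age" and the aggregation names are assumptions; depending on the mapping, the terms aggregation may need to target a keyword-style field rather than an analyzed text field.

```python
import json

# Hypothetical field names: "interests" (one term per interest) and "age".
aggregation_body = {
    "size": 0,  # return only aggregation results, no individual hits
    "aggs": {
        "all_interests": {
            # One bucket per distinct interest.
            "terms": {"field": "interests"},
            # Nested rollup: average age inside each interest bucket.
            "aggs": {"avg_age": {"avg": {"field": "age"}}},
        }
    },
}

print(json.dumps(aggregation_body, indent=2))
```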
Scalability
Elasticsearch can scale out to hundreds (or even thousands) of servers and handle petabytes of data.
Elasticsearch hides the complexity of distributed systems.
Search across entities
A single search can span all documents in the cluster. Elasticsearch forwards the search request in parallel to a primary or replica of every shard in the cluster, gathers the results to select the overall top 10, and returns them to the client.
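The gather step can be pictured with a toy simulation: each shard returns its own best hits, and the coordinating side merges them into one overall top 10. This is only a sketch of the idea with made-up documents and scores, not how Elasticsearch is implemented.

```python
import heapq


def shard_top_hits(shard_docs, size):
    """Each shard returns its own top `size` hits, best score first."""
    return heapq.nlargest(size, shard_docs, key=lambda hit: hit["score"])


def gather(shards, size=10):
    """Merge the per-shard results and keep the overall top `size` hits."""
    per_shard = [shard_top_hits(docs, size) for docs in shards]
    merged = [hit for hits in per_shard for hit in hits]
    return heapq.nlargest(size, merged, key=lambda hit: hit["score"])


# Three hypothetical shards with eight documents each and synthetic scores.
shards = [
    [{"id": f"s{s}-d{d}", "score": ((s * 7 + d * 3) % 11) / 10} for d in range(8)]
    for s in range(3)
]

top = gather(shards)
for hit in top:
    print(hit["id"], hit["score"])
```

Because every shard returns at least as many hits as the final page size, the merged list always contains the true overall top 10.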
Sorting and relevance
By default, results are returned sorted by relevance.
Did-you-mean suggestions
Elasticsearch uses the query domain-specific language, or query DSL, to expose most of the power of Lucene.
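To give a feel for the query DSL, here is a small request body that combines a full-text clause with a structured filter. The field names "about" and "age" are assumptions, and the exact clause syntax has changed between Elasticsearch versions, so treat this as a sketch rather than a version-specific reference.

```python
import json

# Hypothetical fields: "about" (full text) and "age" (numeric).
query_body = {
    "query": {
        "bool": {
            # Scored full-text clause.
            "must": {"match": {"about": "rock climbing"}},
            # Non-scoring structured clause: only employees older than 30.
            "filter": {"range": {"age": {"gt": 30}}},
        }
    }
}

print(json.dumps(query_body, indent=2))
```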
Configuring Analyzers
Index settings are used to configure existing analyzers or to create new custom analyzers specific to an index.
The default analyzer is a good choice for most Western languages.
It consists of the following:
- The standard tokenizer, which splits the input text on word boundaries
- The standard token filter, which is intended to tidy up the tokens emitted by the tokenizer (but currently does nothing)
- The lowercase token filter, which converts all tokens into lowercase
- The stop token filter, which removes stopwords.
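The chain above can be imitated in a few lines of Python. This is a rough approximation for intuition only: the regular expression stands in for the real word-boundary tokenizer, and the stopword list is a small made-up sample rather than the list Elasticsearch ships with.

```python
import re

# Assumption: a small illustrative stopword list, not Elasticsearch's own.
STOPWORDS = frozenset({"a", "an", "and", "is", "of", "the", "to"})


def analyze(text, stopwords=STOPWORDS):
    """Toy analysis chain: tokenize, lowercase, then drop stopwords."""
    tokens = re.findall(r"\w+", text)                   # split on word boundaries
    tokens = [token.lower() for token in tokens]        # lowercase token filter
    return [t for t in tokens if t not in stopwords]    # stop token filter


print(analyze("The QUICK brown fox is jumping"))
# → ['quick', 'brown', 'fox', 'jumping']
```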
A custom analyzer can be created to combine the following functions, which are executed in sequence, into a single package:
Character filters
- Character filters are used to “tidy up” a string before it is tokenized. For instance, the html_strip character filter can remove all HTML tags and convert HTML entities like &Aacute; into the corresponding Unicode character Á.
Tokenizers
- The keyword tokenizer outputs exactly the same string as it received, without any tokenization. The whitespace tokenizer splits text on whitespace only. The pattern tokenizer can be used to split text on a matching regular expression.
Token filters
- Stemming token filters “stem” words to their root form. The asciifolding filter removes diacritics, converting a term like "très" into "tres". The ngram and edge_ngram token filters can produce tokens suitable for partial matching or autocomplete. The synonym token filter makes it easy to handle synonyms.
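Putting the three stages together, a custom analyzer is declared in the index settings, for example in the body sent when creating an index. The sketch below assumes a hypothetical analyzer name "my_analyzer"; the character filter, tokenizer, and token filters are the built-in ones mentioned above.

```python
import json

# "my_analyzer" is a hypothetical name chosen for this example.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    # Stage 1: character filters run on the raw string.
                    "char_filter": ["html_strip"],
                    # Stage 2: exactly one tokenizer splits the string into tokens.
                    "tokenizer": "standard",
                    # Stage 3: token filters run in the order listed.
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        }
    }
}

print(json.dumps(index_settings, indent=2))
```

The order of the "filter" list matters, since each token filter receives the output of the one before it.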
Fuzzy Query
Elasticsearch supports fuzzy queries, which treat two words that are “fuzzily” similar as if they were the same word. It also supports phonetic matching, which can find words that sound similar even if their spelling differs.
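One common way to ask for fuzzy matching is the fuzziness option on a match query, sketched below as a request body. The field name "text" and the misspelled search term are assumptions for illustration.

```python
import json


def fuzzy_match_body(field, text, fuzziness="AUTO"):
    """Request body for a match query that tolerates small spelling differences."""
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    # "AUTO" lets the allowed edit distance depend on term length.
                    "fuzziness": fuzziness,
                }
            }
        }
    }


# "text" is a hypothetical field; "surprize" is a deliberate misspelling.
fuzzy_body = fuzzy_match_body("text", "surprize")
print(json.dumps(fuzzy_body, indent=2))
```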