Elasticsearch has a unique way of indexing data. Understanding the storage process will help you see each of its stages and how they contribute to search speed.
According to Elastic, “Elasticsearch is an open-source, distributed, and RESTful search and analytics engine capable of solving numerous use cases.”
Elasticsearch achieves fast search responses because, instead of scanning the text directly, it searches an index. It can do this because, when indexing data, Elasticsearch performs a series of analysis steps and conversions and then builds what is called an inverted index, against which searches are run.
Here is a detailed look at this process, so you can get a better perspective on each stage.
The following image shows you “the big picture” of the storage process.
Now that you have the bigger picture in mind, we can review each of the stages.
An index can be thought of as a collection of documents, where each document is a collection of fields. When you create one or more documents, Elasticsearch creates the index for you. You can store documents in different ways: by submitting a text file with correctly formatted JSON, or by sending a POST request whose body is the JSON document, including its properties. The bulk API extends this option to more than one JSON document per request.
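To make the bulk option concrete, here is a small sketch of how a bulk request body is assembled: the `_bulk` endpoint expects newline-delimited JSON, alternating an action line with the document itself. The index name `articles` and the documents are hypothetical, used only for illustration.

```python
import json

# Hypothetical documents to be indexed into an index named "articles".
docs = [
    {"title": "Indexing in Elasticsearch", "views": 120},
    {"title": "Inverted indexes explained", "views": 85},
]

def build_bulk_body(index_name, documents):
    """Build the newline-delimited JSON body used by the _bulk endpoint:
    one action line followed by one source line per document."""
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline

body = build_bulk_body("articles", docs)
print(body)
```

You would send this body in a single POST to `/_bulk`, indexing both documents in one round trip instead of two.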
Another relevant, yet optional, feature is that you don’t have to specify every field when you create the document, because Elasticsearch is schema-less.
Once you create the document, the analysis phase begins: documents are filtered, processed (character filters, lowercasing, stop-word removal, stemming), and tokenized. Here’s an example.
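The chain of stages above can be sketched in plain Python. This is a deliberately toy imitation of an analyzer, not Elasticsearch’s real implementation: the stop-word list and the suffix-stripping “stemmer” are simplistic stand-ins for the configurable filters Elasticsearch actually ships.

```python
import re

# Toy stop-word list; real analyzers use configurable, language-aware lists.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is"}

def analyze(text):
    """Mimic the analysis chain: character filter, tokenizer, token filters."""
    # Character filter: strip HTML-like tags from the raw text.
    text = re.sub(r"<[^>]+>", " ", text)
    # Tokenizer: split the text into word tokens.
    tokens = re.findall(r"\w+", text)
    # Token filters: lowercase, drop stop words, crude suffix stemming.
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in STOPWORDS]
    tokens = [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]
    return tokens

print(analyze("The <b>Dogs</b> are barking loudly"))
# → ['dog', 'are', 'bark', 'loudly']
```

The output tokens, not the original text, are what get written into the index in the next stage.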
Indexing: Inverted Index
Now that the documents are analyzed, it is time to index them. Elasticsearch is built on Lucene, which in turn is based on inverted indexes: an inverted index lists every unique word that appears across the documents and identifies all of the documents in which each word occurs.
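A minimal sketch of that data structure looks like this: each term maps to the set of document ids containing it, which is why looking up a word is fast regardless of how many documents exist.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map every unique term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Two toy documents keyed by id (already analyzed into plain words).
docs = {
    1: "elasticsearch stores an inverted index",
    2: "an inverted index lists unique words",
}
inv = build_inverted_index(docs)
print(sorted(inv["inverted"]))  # → [1, 2]
print(sorted(inv["stores"]))    # → [1]
```

Searching for a term is now a dictionary lookup followed by set operations, instead of a scan over every document.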
The resulting inverted index entries are held in a buffer until it fills up, at which point they are written into a segment. Many segments form a shard. As I said before, Elasticsearch is schema-less: at this point it infers a model for the document you stored, detecting fields and adding them with the appropriate Elasticsearch datatype. Once a segment is written, its documents become searchable.
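The datatype inference can be sketched as a simple mapping from JSON value types to Elasticsearch field types. This is a rough approximation of dynamic mapping; the real rules are richer (date detection, numeric string detection, nested objects, and so on), and the `infer_mapping` helper here is hypothetical.

```python
def infer_mapping(document):
    """Guess an Elasticsearch field datatype for each value, roughly the way
    dynamic mapping does. The real rules also detect dates, nested objects,
    geo points, etc. -- this is only a sketch."""
    type_map = {bool: "boolean", int: "long", float: "double", str: "text"}
    return {field: type_map.get(type(value), "object")
            for field, value in document.items()}

print(infer_mapping({"title": "Hello", "views": 42, "score": 4.5, "draft": False}))
# → {'title': 'text', 'views': 'long', 'score': 'double', 'draft': 'boolean'}
```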
The storage process represents a lot of work; the analysis phase and the construction of inverted indexes take time. Nevertheless, it’s worth it.
You can see this reflected when searching: queries become very fast, returning results in near real time. Retrieving data at that speed makes Elasticsearch a suitable option even for real-time systems that demand a heavy workload and low latency.
Elasticsearch invests heavily in how it stores information, but each stage and every filter is necessary to achieve its search performance. Inverted indexes are key to that speed, and their combination with the rest of the Elastic stack creates a powerful tool for solving a wide variety of needs that involve real-time search.
Want to perform a search as quickly as you index a document? Use Elasticsearch.
In this article, I have shown only the tip of the iceberg of Elasticsearch. If you are interested in learning more, have a look at their product page and their quick-start guides.
If you have any questions feel free to reach me at [email protected].