
Understanding the architecture
To understand how Elasticsearch works, we need to learn about its architecture.
To understand how indices, types, documents, and fields work together, let's refer to the following figure:

As seen in the preceding figure, an index contains one or more types. A type can be thought of as a table in a relational database. A type has one or more documents, and each document has one or more fields. Fields are key-value pairs.
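For instance, a hypothetical document in an index named catalog, under a type named product, might look as follows; each field, such as name or price, is a key-value pair (the index, type, and field names here are purely illustrative):
{
  "name": "Wireless Mouse",
  "category": "electronics",
  "price": 25.99,
  "in_stock": true
}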
A cluster has one or more nodes. Clusters are identified by their names; by default, the cluster name is elasticsearch. If you set up multiple Elasticsearch instances on the same network and do not want them all to form a single cluster, you should give them different cluster names; otherwise all nodes will join the same cluster. Similar to a cluster, a node also has a name. We can assign a node a name and a cluster name to join. If we don't provide a cluster name, the node will automatically search for and join the cluster named elasticsearch.
If we don't provide a node name, a node ID is assigned, which is a random Universally Unique Identifier (UUID), and the node will use the first seven characters of this automatically generated node ID as its name, which remains unique for each node.
It might happen that an index stores a huge amount of data that exceeds the hardware limits of a single node. For such cases, an index can be divided into shards.
There are two types of shards: primary and replica. Each document, when indexed, is first added to a primary shard and then to one or more replica shards. If more than one node is set up in the cluster, replica shards are allocated on different nodes from their primaries.
By default, Elasticsearch creates five primary shards and one replica shard for each primary shard. Thus, for an index, if not specified, a total of 10 shards will be created. We can change this configuration and we will learn about it with the Indices API later in this chapter.
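As a quick preview of what we will cover with the Indices API, the number of primary shards and replicas can be specified per index when it is created. The index name my_index and the values below are purely illustrative:
PUT my_index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  }
}
This request would create three primary shards with two replicas each, for a total of nine shards.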
Recommended cluster configurations
There are a few configurations that should be taken care of for a proper cluster setup. As we have already discussed, we should provide a cluster name:
cluster.name: my-cluster-name
Similarly, each node should be given a unique name in order to be easily recognized:
node.name: node-1
There are other settings as well that are related to data path, discovery of nodes, and so on.
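For reference, an elasticsearch.yml file for a node in such a cluster might also contain settings along the following lines; the paths and host addresses shown here are only examples:
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
discovery.zen.ping.unicast.hosts: ["192.168.1.10", "192.168.1.11"]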
Minimum master nodes
We can specify the minimum number of master-eligible nodes that must be visible before a master can be elected and the cluster can operate. Elasticsearch has a default for this, but when you are planning an Elasticsearch cluster, always consider this setting explicitly.
The reason for using this setting is to prevent data loss. In the event of a network failure, or for some other reason, part of the cluster may elect another master while the first one is still up but unreachable. This situation is known as split brain: the cluster is divided into two separate clusters, which can cause data loss.
To avoid the split-brain problem, use this setting when setting up the cluster.
The value of this setting can be derived as: (number of master-eligible nodes / 2) + 1
So, if you are going to set up a cluster with four master-eligible nodes, you should set this value to 3:
discovery.zen.minimum_master_nodes: 3
All such nodes should also have the node.master setting with the value true:
node.master: true
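Putting these settings together, a minimal elasticsearch.yml for one of the master-eligible nodes in the four-node example might look like the following; the cluster and node names are, of course, your choice:
cluster.name: my-cluster-name
node.name: node-1
node.master: true
discovery.zen.minimum_master_nodes: 3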
We will learn more about the types of nodes in the modules section of Chapter 8, Elasticsearch APIs.
Local cluster settings
There will be times when you want to set up multiple instances on the same server. These instances can also share the same data directory, which contains indices, shards, cluster metadata, and so on. In order to share the data among instances, each node can use the following setting:
path.data: /path/to/data/directory
Usually, when Elasticsearch is extracted from a ZIP/TAR archive and run as is, it uses the data folder inside the Elasticsearch directory. As a best practice, you should keep the data directory outside the Elasticsearch installation directory.
In case you want a node to behave as a data node only, you should set the following:
node.data: true
node.master: false
Elasticsearch, by default, will not allow more than one node to share the same data directory. If you want to share the same data, use this setting:
node.max_local_storage_nodes: 2
The preceding setting will allow two instances to share the same data path. You should, however, remember that different types of nodes, such as master and data nodes, should not share the same data directory.
Tip
Setting up multiple instances of Elasticsearch on a single machine is not a recommended configuration for a cluster.
Understanding document processing
As we know, whenever an index is created, primary shards and replica shards are created. Each primary shard can have multiple replicas, and this complete group is known as a replication group. In the replication group, the primary shard acts as the entry point for any document indexing operation. The primary shard makes sure that the document and the operation are valid. If everything is all right, the operation is performed, and the primary shard then replicates the same operation to the replica shards. All of these responsibilities belong to the primary shard.
A document does not have to be replicated to every replica shard; instead, Elasticsearch maintains a list of the shard copies that should receive the operation. This list is called the in-sync copies list and is maintained by the master node. By maintaining this list, it is guaranteed that the operation has been performed on all of these copies before it is acknowledged to the user. This processing model is known as the data replication model and is based on the primary-backup model.
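If you are curious which shard copies are currently considered in-sync for an index, the cluster state API exposes this information; the index name my_index and the use of the filter_path parameter below are just one way to narrow down the response:
GET _cluster/state?filter_path=metadata.indices.my_index.in_sync_allocations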
In case of a failure of a primary shard, the node holding it sends a message to the master informing it of the failure. During this period, no indexing operations take place; they wait for the master to promote one of the replica shards to be the new primary.
Apart from this, the master node also checks the health of all shards, and it can decide to demote a primary shard based on poor health (probably because of a network failure or disconnection). The master then instructs another node to start building a new shard so that the system can be restored to a healthy state, that is, the green status.
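You can check whether the cluster has returned to a healthy state by querying the cluster health API:
GET _cluster/health
The response includes a status field whose value is green when all primary and replica shards are allocated, yellow when all primaries are allocated but one or more replicas are not, and red when at least one primary shard is unallocated.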