Elasticsearch In Practice

Elasticsearch (Source: technocratsid.com)

Index

As mentioned earlier, Elasticsearch has no database-level grouping above the index, so we create an index (the rough equivalent of a table) directly. How, then, do we use one Elasticsearch installation for more than one application? A common practice is to prefix the index name with the application’s name, for example: applicationname_indexname. This way, index names do not conflict between applications.

Creating an Index

To create an Index, we can make an HTTP call like so:

PUT /{index_name}

The rules for index names are as follows:

  • They must be lowercase.
  • They cannot contain \, /, *, ?, ", <, >, |, a space, a comma, #, or a colon. The characters -, +, and _ are allowed, but not at the start.
  • They cannot be longer than 255 bytes.

By creating unique index names, you ensure that data from different applications is stored separately and can be queried independently.
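The naming rules and the prefix convention can be checked before a request is ever sent. The following is a minimal sketch, not part of any Elasticsearch client: the helper names `is_valid_index_name` and `prefixed_index` are hypothetical, and the set of forbidden characters is an assumption based on the rules above.

```python
# Characters assumed to be rejected anywhere in an index name.
_FORBIDDEN = set('\\/*?"<>| ,#:')


def is_valid_index_name(name: str) -> bool:
    """Approximate check of the index naming rules (a sketch, not the server's validator)."""
    if not name or name in (".", ".."):
        return False
    if name != name.lower():              # must be lowercase
        return False
    if name[0] in "-_+":                  # these may not come first
        return False
    if any(ch in _FORBIDDEN for ch in name):
        return False
    return len(name.encode("utf-8")) <= 255   # the limit is in bytes, not characters


def prefixed_index(application: str, index: str) -> str:
    """Build '<application>_<index>' and validate the result."""
    name = f"{application}_{index}".lower()
    if not is_valid_index_name(name):
        raise ValueError(f"invalid index name: {name!r}")
    return name


print(prefixed_index("shop", "customers"))
print(is_valid_index_name("_orders"))
print(is_valid_index_name("Orders"))
```

Validating on the client side only gives earlier feedback; Elasticsearch still has the final say when the index is created.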

Before we start practicing, install the elasticsearch package:

%pip install elasticsearch

Import the packages we need:

from elasticsearch import Elasticsearch
import time
import json

Create a connection to Elasticsearch. Make sure Elasticsearch is running on your computer’s localhost or on Google Colab.

es = Elasticsearch([{'host': 'localhost', 'port': 9200, 'scheme': 'http'}])

Creating an Index

# Create index customers
# PUT http://localhost:9200/customers
response = es.options(ignore_status=[400]).indices.create(index='customers')
print(json.dumps(response.body, indent=4))
# Create index products
# PUT http://localhost:9200/products
response = es.options(ignore_status=[400]).indices.create(index='products')
print(json.dumps(response.body, indent=4))
# Create index orders
# PUT http://localhost:9200/orders
response = es.options(ignore_status=[400]).indices.create(index='orders')
print(json.dumps(response.body, indent=4))
# Get All Indexes
# GET http://localhost:9200/_cat/indices?v
response = es.cat.indices(v=True)
print(response)

Deleting an Index

To delete an index that you’ve created, you can simply use the DELETE HTTP method. Deleting an index will automatically remove all data associated with that index. Here’s how you can do it:

DELETE /{index_name}

Just replace {index_name} with the name of the index you wish to delete. But be careful, this operation is irreversible. Once an index is deleted, all the data within it is permanently lost unless you have a backup or recovery mechanism in place.

Delete Index

# Delete index customers
# DELETE http://localhost:9200/customers
response = es.options(ignore_status=[400, 404]).indices.delete(index='customers')
print(json.dumps(response.body, indent=4))
# Delete index products
# DELETE http://localhost:9200/products
response = es.options(ignore_status=[400, 404]).indices.delete(index='products')
print(json.dumps(response.body, indent=4))
# Delete index orders
# DELETE http://localhost:9200/orders
response = es.options(ignore_status=[400, 404]).indices.delete(index='orders')
print(json.dumps(response.body, indent=4))
# Get All Indexes
# GET http://localhost:9200/_cat/indices?v
response = es.cat.indices(v=True)
print(response)

Dynamic Mapping in Elasticsearch

In Elasticsearch, defining the schema of an index is known as mapping. By default, a feature called Dynamic Mapping is enabled, where Elasticsearch auto-detects the data type of each JSON attribute and creates a mapping accordingly.

While convenient, it’s generally recommended to manually create mappings for better control over data indexing and querying. This ensures proper interpretation and storage of your data.

Dynamic Field Mapping

This feature auto-detects the data type of a field in a JSON document and assigns a corresponding Elasticsearch data type:

JSON Data Type     Elasticsearch Data Type
null               No field added
true / false       boolean
double             float
long               long
array              Depends on the first non-null value in the array
string             date, float, long, or text (auto-detected)
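To make the table concrete, here is a rough Python imitation of those rules. It is only a sketch for building intuition: the function name `guess_dynamic_type` is hypothetical, the logic runs locally rather than in Elasticsearch, and date and number detection for strings is deliberately left out.

```python
def guess_dynamic_type(value):
    """Return the type dynamic mapping would roughly pick, or None if no field is added."""
    if value is None:
        return None                     # null: no field is added
    if isinstance(value, bool):         # test bool before int, since bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, list):
        # An array takes the type of its first non-null element.
        for item in value:
            if item is not None:
                return guess_dynamic_type(item)
        return None
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        return "text"                   # date and number detection are ignored in this sketch
    raise TypeError(f"unsupported value: {value!r}")


document = {
    "name": "Product 1",
    "price": 10000,
    "in_stock": True,
    "rating": 4.5,
    "tags": [None, "new"],
    "discount": None,
}
for field, value in document.items():
    print(field, "->", guess_dynamic_type(value))
```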

Date Detection

By default, Elasticsearch checks whether a string value looks like a date and, if it does, maps the field as a date. The patterns it tries come from the dynamic_date_formats attribute, whose default is strict_date_optional_time together with yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z. This feature is active by default but can be deactivated by setting the date_detection attribute in the mapping to false. The recognized date formats can be changed by adjusting the dynamic_date_formats attribute in the mapping.
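The two attributes mentioned here live at the top level of an index's mappings. The sketch below only builds that mappings object as a Python dict; the helper name `date_detection_mappings` and the example formats are assumptions for illustration, and the result would still have to be sent to Elasticsearch as the mappings of an index.

```python
import json


def date_detection_mappings(enabled=True, formats=None):
    """Build the top-level mapping attributes that control date detection."""
    mappings = {"date_detection": enabled}
    if formats:
        mappings["dynamic_date_formats"] = list(formats)
    return mappings


# Turn date detection off entirely.
print(json.dumps(date_detection_mappings(enabled=False), indent=4))

# Keep detection on, but recognise only these (example) patterns.
print(json.dumps(
    date_detection_mappings(formats=["yyyy-MM-dd HH:mm:ss", "yyyy-MM-dd"]),
    indent=4,
))
```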

Update Dynamic Mapping for Date

# Update customers mapping (re-create the indices first if you ran the delete example above)
# PUT http://localhost:9200/customers/_mapping
response = es.indices.put_mapping(index='customers', body={
    'properties': {
        'date': {
            'type': 'date',
            'format': 'yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||yyyy/MM/dd HH:mm:ss||yyyy/MM/dd'
        },
        'register_at': {
            'type': 'date',
            'format': 'yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||yyyy/MM/dd HH:mm:ss||yyyy/MM/dd'
        }
    }
})
print(json.dumps(response.body, indent=4))

# Get Mapping
# GET http://localhost:9200/customers/_mapping
response = es.indices.get_mapping(index='customers')
print(json.dumps(response.body, indent=4))
# Update products mapping
# PUT http://localhost:9200/products/_mapping
response = es.indices.put_mapping(index='products', body={
    'properties': {
        'date': {
            'type': 'date',
            'format': 'yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||yyyy/MM/dd HH:mm:ss||yyyy/MM/dd'
        }
    }
})
print(json.dumps(response.body, indent=4))

# Get Mapping
# GET http://localhost:9200/products/_mapping
response = es.indices.get_mapping(index='products')
print(json.dumps(response.body, indent=4))
# Update orders mapping
# PUT http://localhost:9200/orders/_mapping
response = es.indices.put_mapping(index='orders', body={
    'properties': {
        'date': {
            'type': 'date',
            'format': 'yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||yyyy/MM/dd HH:mm:ss||yyyy/MM/dd'
        }
    }
})
print(json.dumps(response.body, indent=4))

# Get Mapping
# GET http://localhost:9200/orders/_mapping
response = es.indices.get_mapping(index='orders')
print(json.dumps(response.body, indent=4))

Number Detection

Although JSON has a number data type, sometimes users send numbers in a string format, such as “100” or “12.12”. In these cases, Elasticsearch can detect these string-formatted numbers and map them to actual number data types (long or float).

By default, automatic number detection is not active in Elasticsearch. If you want to activate it, you need to change the numeric_detection attribute in the mapping to true.

Once it is active, when a string value arrives for an attribute that has no mapping yet, Elasticsearch tries to parse it as a number (either long or float). If the conversion is successful, Elasticsearch uses the corresponding number data type for the field.
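The idea behind number detection can be imitated in a few lines of Python. This is only an approximation under stated assumptions: `detect_numeric` is a hypothetical helper, and Python's `int()` and `float()` are more lenient than Elasticsearch's parser (they accept surrounding whitespace, for instance), so edge cases may differ.

```python
def detect_numeric(text):
    """Return ('long', value) or ('float', value) if text parses as a number, else None."""
    try:
        return ("long", int(text))       # whole numbers are tried first
    except ValueError:
        pass
    try:
        return ("float", float(text))    # then decimal numbers
    except ValueError:
        return None                      # not a number: the field stays a string type


for raw in ["100", "12.12", "Product 1"]:
    print(repr(raw), "->", detect_numeric(raw))
```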

# Update customers mapping
# PUT http://localhost:9200/customers/_mapping
response = es.indices.put_mapping(index='customers', body={
    'numeric_detection': True
})
print(json.dumps(response.body, indent=4))

# Get Mapping
# GET http://localhost:9200/customers/_mapping
response = es.indices.get_mapping(index='customers')
print(json.dumps(response.body, indent=4))
# Update products mapping
# PUT http://localhost:9200/products/_mapping
response = es.indices.put_mapping(index='products', body={
    'numeric_detection': True
})
print(json.dumps(response.body, indent=4))

# Get Mapping
# GET http://localhost:9200/products/_mapping
response = es.indices.get_mapping(index='products')
print(json.dumps(response.body, indent=4))
# Update orders mapping
# PUT http://localhost:9200/orders/_mapping
response = es.indices.put_mapping(index='orders', body={
    'numeric_detection': True
})
print(json.dumps(response.body, indent=4))

# Get Mapping
# GET http://localhost:9200/orders/_mapping
response = es.indices.get_mapping(index='orders')
print(json.dumps(response.body, indent=4))

Create API

The Create API is used to add new data to Elasticsearch.

The Create API is insert-only: it creates a new document only if a document with the provided _id does not already exist. Attempting to create a document with an _id that already exists will result in a conflict error (HTTP 409).

To use the Create API, you use either the POST or PUT HTTP method with the following endpoint:

POST/PUT /<index_name>/_create/<id>

Here, <index_name> is the name of the index where you want to create the document, and <id> is the unique identifier you want to assign to the new document. If the document is successfully created, Elasticsearch will return a confirmation response.

Create Customer

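# Insert customers aditira
# POST http://localhost:9200/customers/_create/aditira
# es.create() maps to the _create endpoint; 409 (conflict) is ignored so the cell can be re-run
response = es.options(ignore_status=[409]).create(index='customers', id='aditira', body={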
    'name': 'Aditira Jamhuri',
    'register_at': '2023-11-30 00:00:00',
})
print(json.dumps(response.body, indent=4))

# Get Mapping
# GET http://localhost:9200/customers/_mapping
response = es.indices.get_mapping(index='customers')
print(json.dumps(response.body, indent=4))

Create Product

# Insert products 1
# POST http://localhost:9200/products/_create/1
response = es.options(ignore_status=[409]).create(index='products', id='1', body={
    'name': 'Product 1',
    'price': 10000,
})
print(json.dumps(response.body, indent=4))
# Insert products 2
# POST http://localhost:9200/products/_create/2
response = es.options(ignore_status=[409]).create(index='products', id='2', body={
    'name': 'Product 2',
    'price': 20000,
})
print(json.dumps(response.body, indent=4))
# Get Mapping
# GET http://localhost:9200/products/_mapping
response = es.indices.get_mapping(index='products')
print(json.dumps(response.body, indent=4))

Create Order

# Insert orders 1
# POST http://localhost:9200/orders/_create/1
response = es.options(ignore_status=[409]).create(index='orders', id='1', body={
    "order_date": "2023-12-01 00:00:00",
    "customer_id": "aditira",
    "total": 40000,
    "items": [
        {
            "product_id": "1",
            "price": 10000,
            "quantity": 2
        },
        {
            "product_id": "2",
            "price": 20000,
            "quantity": 1
        }
    ]
})
print(json.dumps(response.body, indent=4))

# Get Mapping
# GET http://localhost:9200/orders/_mapping
response = es.indices.get_mapping(index='orders')
print(json.dumps(response.body, indent=4))

Get API

Once you’ve stored data in Elasticsearch using the Create API, you can retrieve this data using the Get API.

The Get API returns the requested data along with its associated metadata, such as the _id, index name, document version, and so on.

If the data you’re trying to retrieve is not available (i.e., there is no document with the requested _id in the specified index), the HTTP response code will be 404 Not Found.

To use the Get API, you make an HTTP GET request to the following endpoint:

GET /<index_name>/_doc/<id>

In this endpoint, <index_name> is the name of the index from which you want to retrieve data, and <id> is the unique identifier of the document you want to retrieve. If the document is found, Elasticsearch will return the document and its metadata in the response.

Get Document

# Get Customers aditira
# GET http://localhost:9200/customers/_doc/aditira
response = es.get(index='customers', id='aditira')
print(json.dumps(response.body, indent=4))

Get Source API

If you’re interested in retrieving the document data but do not wish to receive the metadata associated with the document, you can use the Get Source API.

To use the Get Source API, you make an HTTP GET request to the following endpoint:

GET /<index_name>/_source/<id>

In this endpoint, <index_name> is the name of the index from which you want to retrieve data, and <id> is the unique identifier of the document you want to retrieve.

This call will return only the actual data that you inserted, without any of the metadata information, including the _id. This is because the _id is already included in the URL where you’re making the HTTP call. As such, if the document is found, Elasticsearch will return the document data in the response, without any metadata.
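The difference between the two endpoints is easiest to see on a sample response. The dict below is an illustrative Get API response, not captured output: the metadata values are made up for the example and real responses carry a few more metadata fields. The hypothetical helper `split_metadata` separates what the Get Source API would drop from what it would return.

```python
# Illustrative Get API response; metadata values are invented for the example.
get_response = {
    "_index": "customers",
    "_id": "aditira",
    "_version": 1,
    "found": True,
    "_source": {
        "name": "Aditira Jamhuri",
        "register_at": "2023-11-30 00:00:00",
    },
}


def split_metadata(response):
    """Separate a Get API response into (metadata, source)."""
    metadata = {key: value for key, value in response.items() if key != "_source"}
    return metadata, response.get("_source")


metadata, source = split_metadata(get_response)
print("metadata keys:", sorted(metadata))
print("source only:  ", source)   # this part is what the Get Source API returns
```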

Get Source Document

# Get Source Customers aditira
# GET http://localhost:9200/customers/_source/aditira
response = es.get_source(index='customers', id='aditira')
print(json.dumps(response.body, indent=4))

Check Exists

There may be cases where you only want to check if a document exists in an index, without needing to retrieve the document’s data. In such cases, you can use the Get API, but with the HTTP method HEAD instead of GET.

To check if a document exists, you make an HTTP HEAD request to the following endpoint:

HEAD /<index_name>/_doc/<id>

In this endpoint, <index_name> is the name of the index where you’re checking for the document, and <id> is the unique identifier of the document you’re checking for.

Elasticsearch will return a 200 OK response without any body if the document exists. If the document does not exist, it will return a 404 Not Found response. This is a quick and efficient way to check for the existence of a document without retrieving or transferring any data.

If data exists

# Check Customers aditira
# HEAD http://localhost:9200/customers/_doc/aditira
response = es.exists(index='customers', id='aditira')
print(json.dumps(response.body, indent=4))

If the data does not exist

# Check Customers wrong
# HEAD http://localhost:9200/customers/_doc/wrong
response = es.exists(index='customers', id='wrong')
print(json.dumps(response.body, indent=4))

Multi Get API

Elasticsearch provides a Multi Get API that allows you to retrieve multiple documents at once. This is useful when you need to fetch documents from different indices in a single API call.

You can use the Multi Get API with the following RESTful API endpoints:

POST /_mget
POST /<index_name>/_mget

In these endpoints, <index_name> is the name of the index from which you want to retrieve documents. If you omit the <index_name>, the request body must state the index of each document individually.

The _mget endpoint accepts a request body that specifies the documents to retrieve. The request body should be a JSON object that contains an ids array, like this:

{
  "ids" : ["1", "2", "3", "4"]
}

If you’re using <index_name>/_mget, then all IDs in the ids array will be retrieved from the specified index. If you’re using /_mget, you specify the index for each document in a docs array instead, like this:

{
  "docs" : [
    {
      "_index" : "index1",
      "_id" : "1"
    },
    {
      "_index" : "index2",
      "_id" : "2"
    }
  ]
}

In this case, Elasticsearch will retrieve each document from the specified index.
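Both request body shapes can be produced from plain Python data. The helper names `mget_ids_body` and `mget_docs_body` below are hypothetical and only build the JSON body; sending it is left to the client.

```python
import json


def mget_ids_body(ids):
    """Body for /<index_name>/_mget: every ID is looked up in the index named in the URL."""
    return {"ids": [str(doc_id) for doc_id in ids]}


def mget_docs_body(pairs):
    """Body for /_mget: each (index, id) pair names its own index."""
    return {"docs": [{"_index": index, "_id": str(doc_id)} for index, doc_id in pairs]}


print(json.dumps(mget_ids_body([1, 2]), indent=4))
print(json.dumps(mget_docs_body([("products", 1), ("customers", "aditira")]), indent=4))
```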

The Multi Get API is a powerful tool that can significantly reduce the number of API calls you need to make when working with multiple documents across multiple indices.

Multiget Document

# Multiget products
# POST http://localhost:9200/products/_mget
response = es.mget(index='products', body={
    'ids': ['1', '2']
})
print(json.dumps(response.body, indent=4))

Search API

While the Get API is used for retrieving a single document using its _id, the Search API in Elasticsearch is used when you want to search for documents without knowing their _id. The Search API is quite complex, offering a wide range of querying and filtering options that allow you to perform full-text search, term-based search, and much more.

To use the Search API, you can use the following RESTful API endpoints:

POST /_search
POST /<index_name>/_search

In these endpoints, <index_name> is the name of the index that you want to search in. If you don’t specify an index name, Elasticsearch will search all indices.

The _search endpoint accepts a request body that defines the search query. This search query is written in Elasticsearch’s Query DSL (domain-specific language), which is a flexible and powerful language for defining queries.

Here’s a simple example of a search query:

{
  "query": {
    "match": {
      "field_name": "search term"
    }
  }
}

In this query, Elasticsearch will return documents where field_name matches the “search term”.

It’s important to note that this is just the tip of the iceberg when it comes to the Search API. It supports a wide range of querying and filtering options, allowing you to perform complex searches on your data. More advanced features of the Search API will be covered in later discussions.
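A search round trip has two halves: building the query body and reading the matching documents out of the response. The sketch below covers both with hypothetical helpers (`match_query`, `hit_sources`). The sample response is illustrative and trimmed to the fields the helper reads, with invented score values.

```python
def match_query(field, text):
    """Build a minimal match query body for the Search API."""
    return {"query": {"match": {field: text}}}


def hit_sources(response):
    """Return the _source of every hit in a search response."""
    return [hit["_source"] for hit in response["hits"]["hits"]]


# Illustrative, trimmed search response; scores and totals are made up.
sample_response = {
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "hits": [
            {
                "_index": "products",
                "_id": "1",
                "_score": 1.0,
                "_source": {"name": "Product 1", "price": 10000},
            }
        ],
    }
}

print(match_query("name", "Product 1"))
print(hit_sources(sample_response))
```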

Search Document

# Search products
# POST http://localhost:9200/products/_search
response = es.search(index='products', body={
    "query": {
        "match": {
            "price": 10000
        }
    }
})
print(json.dumps(response.body, indent=4))

Pagination

When you are working with large amounts of data, it can be useful to break up the results of a search query into manageable chunks, or “pages”. Elasticsearch’s Search API supports pagination through the use of query parameters.

There are two important parameters for pagination:

  • from: This parameter determines the starting document from where the results should be returned. The count starts from 0. So, if you set from to 10, Elasticsearch will skip the first 10 results.
  • size: This parameter determines the number of search hits to return. By default, Elasticsearch returns 10 results per page. If you want more or fewer results, you can change the size parameter.

Here’s an example of how to use these parameters in a search query:

{
  "from" : 0, "size" : 20,
  "query": {
    "match": {
      "field_name": "search term"
    }
  }
}

In this example, Elasticsearch will return the first 20 documents that match the search term. If you want to get the next 20 documents, you can change from to 20:

{
  "from" : 20, "size" : 20,
  "query": {
    "match": {
      "field_name": "search term"
    }
  }
}

This way, you can navigate through the search results page by page. It’s important to note that the maximum value of from + size is 10000 by default (the index.max_result_window setting). If you need to go deeper, use the search_after parameter or the scroll API, or increase this limit.
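Applications usually think in page numbers rather than offsets. The following sketch converts a 1-based page number into from and size values and refuses pages that would cross the default 10000 window. The function name `page_params` and the 1-based numbering are choices made for this example.

```python
MAX_RESULT_WINDOW = 10_000   # default upper bound for from + size


def page_params(page, size=10, max_window=MAX_RESULT_WINDOW):
    """Translate a 1-based page number into from/size values for a search request."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    start = (page - 1) * size
    if start + size > max_window:
        raise ValueError(f"from + size would exceed {max_window}")
    return {"from": start, "size": size}


print(page_params(1, size=20))   # first page
print(page_params(2, size=20))   # second page

try:
    page_params(501, size=20)    # would need from=10000, size=20
except ValueError as error:
    print("rejected:", error)
```

The returned dict can be merged into a search body next to the query.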

Search Document with Pagination

# Search products page 1
# POST http://localhost:9200/products/_search?size=1&from=0
response = es.search(index='products', body={
    'size': 1,
    'from': 0
})
print(json.dumps(response.body, indent=4))
# Search products page 2
# POST http://localhost:9200/products/_search?size=1&from=1
response = es.search(index='products', body={
    'size': 1,
    'from': 1
})
print(json.dumps(response.body, indent=4))

Sorting

The Search API in Elasticsearch also supports sorting of the search results. This is done using the sort parameter, which can be passed in the query string or in the request body.

In the query string, the sort parameter’s value is written as:

<field>:<direction>

In this, <field> is the name of the field you want to sort by and <direction> can be either asc (for ascending order) or desc (for descending order).

In the request body, sort is an array of field and direction entries. For example, if you have a timestamp field in your documents and you want to sort the results in descending order of timestamp, you can do this:

{
  "query": {
    "match": {
      "field_name": "search term"
    }
  },
  "sort" : [
    { "timestamp" : {"order" : "desc"}}
  ]
}

If you need to sort by more than one field, you can specify multiple fields separated by commas. The sorting will be applied in the order you specify. For example:

{
  "query": {
    "match": {
      "field_name": "search term"
    }
  },
  "sort" : [
    { "field1" : {"order" : "asc"}},
    { "field2" : {"order" : "desc"}},
    { "field3" : {"order" : "asc"}}
  ]
}

In this example, Elasticsearch first sorts the documents in ascending order by field1. For documents where field1 is the same, it then sorts in descending order by field2, and so on. This way, you can fine-tune the order of your search results to meet your needs.
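The compact <field>:<direction> notation and the request body form carry the same information, so one can be generated from the other. A minimal sketch follows; the helper name `sort_clause` is hypothetical, and treating a missing direction as asc is a choice made for this example.

```python
def sort_clause(*specs):
    """Turn 'field:direction' strings into a sort list for the request body."""
    clause = []
    for spec in specs:
        field, _, direction = spec.partition(":")
        direction = direction or "asc"          # assumption: default to ascending
        if direction not in ("asc", "desc"):
            raise ValueError(f"unknown sort direction: {direction!r}")
        clause.append({field: {"order": direction}})
    return clause


# Order matters: the second field only breaks ties left by the first.
print(sort_clause("field1:asc", "field2:desc"))
```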

Search Document with Sorting

# Search products sort by price
# POST http://localhost:9200/products/_search?sort=price:asc
response = es.search(index='products', body={
    'sort': [
        { 'price': 'asc' }
    ]
})
print(json.dumps(response.body, indent=4))
# Search products sort by name
# POST http://localhost:9200/products/_search?sort=name.keyword:asc
# (sort on the keyword sub-field, because text fields cannot be sorted directly)
response = es.search(index='products', body={
    'sort': [
        { 'name.keyword': 'asc' }
    ]
})
print(json.dumps(response.body, indent=4))

Index API

In the past, the Index API was commonly used to create documents in Elasticsearch, but in newer versions, the Create API is often used instead. However, the Index API still has its uses.

The Index API has a “create or replace” nature. This means that if a document with the specified ID does not exist, it will be created. If a document with the specified ID already exists, the existing document will be replaced (deleted and created anew) with the new document.

This is a key difference between the Index API and the Create API. With the Create API, a conflict error occurs if a document with the specified ID already exists. With the Index API, no such error occurs, and the existing document is simply replaced.

To use the Index API, you can use the following RESTful API command:

POST /<index_name>/_doc/<id>
PUT /<index_name>/_doc/<id>

Here, <index_name> is the name of the index where you want to create or replace the document, and <id> is the ID of the document.

In the body of the request, you would include the full content of the new document you want to create or replace. For example:

PUT /my_index/_doc/1
{
  "field1": "value1",
  "field2": "value2"
}

In this example, a document with ID 1 and the specified fields and values will be created in my_index. If a document with ID 1 already exists, it will be replaced with the new document.

Index Document

# Save product 3
# POST http://localhost:9200/products/_doc/3
response = es.index(index='products', id='3', body={
    'name': 'Product 3',
    'price': 30000,
})
print(json.dumps(response.body, indent=4))

# Save product 4
# POST http://localhost:9200/products/_doc/4
response = es.index(index='products', id='4', body={
    'name': 'Product 4',
    'price': 40000,
})
print(json.dumps(response.body, indent=4))

# Save product 5
# POST http://localhost:9200/products/_doc/5
response = es.index(index='products', id='5', body={
    'name': 'Product 5',
    'price': 50000,
})
print(json.dumps(response.body, indent=4))

Choose Create API or Index API?

Your choice between Create API and Index API depends on your needs:

  • Index API: This is the most common choice among programmers because it’s safe from conflicts if a document with the same ID already exists in Elasticsearch. The Index API will replace the existing document, rather than producing an error.

  • Create API: Use this if you don’t want to replace an existing document. If a document with the same ID already exists, the Create API will produce an error. To avoid the error, check first whether the document exists (for example with a HEAD request), or catch the conflict and handle it.
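The two behaviours can be compared side by side with a small in-memory stand-in. This is a toy model, not Elasticsearch: `FakeIndex` and `ConflictError` are hypothetical names, and a plain dict plays the role of the index.

```python
class ConflictError(Exception):
    """Stand-in for the conflict error returned by the Create API."""


class FakeIndex:
    """Toy in-memory index that mimics create vs. index semantics."""

    def __init__(self):
        self._docs = {}

    def create(self, doc_id, document):
        # Create: refuse to touch an existing document.
        if doc_id in self._docs:
            raise ConflictError(f"document {doc_id!r} already exists")
        self._docs[doc_id] = dict(document)
        return "created"

    def index(self, doc_id, document):
        # Index: create the document, or replace it wholesale if it exists.
        result = "updated" if doc_id in self._docs else "created"
        self._docs[doc_id] = dict(document)
        return result

    def get(self, doc_id):
        return self._docs[doc_id]


store = FakeIndex()
print(store.create("1", {"name": "Product 1", "price": 10000}))
print(store.index("1", {"name": "Product 1 v2"}))
print(store.get("1"))                      # price is gone: the document was replaced

try:
    store.create("1", {"name": "Product 1 again"})
except ConflictError as error:
    print("conflict:", error)
```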

Update API

The Index API behaves much like a strict librarian. Upon an update, it replaces the old document entirely, demanding all attributes anew. If you only submit the updated attribute, it overwrites the rest, potentially causing data loss.

Need to tweak a few attributes without a full rewrite? Use the Update API, your friendly editor:

POST /<index_name>/_update/<id>

But beware, if the document doesn’t exist, Elasticsearch returns a 404 Not Found error. Always check your document’s existence before updating.
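The contrast between replacing a document and partially updating it can be shown on plain dicts. The sketch below assumes the partial document is merged into the stored one, with nested objects merged recursively and all other values overwritten. The function name `apply_partial_update` is hypothetical and the merge runs locally, not in Elasticsearch.

```python
def apply_partial_update(existing, partial):
    """Merge a partial document into an existing one without mutating either."""
    merged = dict(existing)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_partial_update(merged[key], value)   # merge nested objects
        else:
            merged[key] = value                                      # overwrite everything else
    return merged


stored = {"name": "Product 5", "price": 50000}
change = {"price": 50000000}

# Index API style: the new body replaces the stored document, so 'name' is lost.
replaced = dict(change)
# Update API style: only the submitted attribute changes.
updated = apply_partial_update(stored, change)

print("replaced:", replaced)
print("updated: ", updated)
```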

Update Document

# Update product 5
# POST http://localhost:9200/products/_update/5
response = es.update(index='products', id='5', body={
    'doc': {
        'price': 50000000
    }
})
print(json.dumps(response.body, indent=4))

# Get product 5
response = es.get(index='products', id='5')
print(json.dumps(response.body, indent=4))

Delete API

To remove a document, the Delete API is your digital shredder. Utilize it through this RESTful API command:

DELETE /<index_name>/_doc/<id>

But remember, if you attempt to delete a document that doesn’t exist, Elasticsearch will return a 404 Not Found error. Always make sure the document is in your collection before trying to remove it.

Insert Spammer

# Insert customer spammer
# POST http://localhost:9200/customers/_create/spammer
response = es.options(ignore_status=[409]).create(index='customers', id='spammer', body={
    'name': 'Spammer',
    'register_at': '2023-12-06 00:00:00',
})
print(json.dumps(response.body, indent=4))

# Get customer spammer
response = es.get_source(index='customers', id='spammer')
print(json.dumps(response.body, indent=4))

Delete Spammer

# Delete customer spammer
# DELETE http://localhost:9200/customers/_doc/spammer
response = es.options(ignore_status=[404]).delete(index='customers', id='spammer')
print(json.dumps(response.body, indent=4))

# Get customer spammer
response = es.options(ignore_status=[404]).get_source(index='customers', id='spammer')
print(json.dumps(response.body, indent=4))

Bulk API

When dealing with many operations in Elasticsearch, the Bulk API shines by speeding up the process instead of manually handling each operation. The Bulk API bundles various operations together, including create, index, update, or delete. To use the Bulk API, apply the following RESTful API commands:

POST /_bulk
POST /<index_name>/_bulk

This allows more efficient handling of large volumes of operations.

Request Body Format for the Bulk API

action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
....
action_and_meta_data\n
optional_source\n
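In other words, the body is newline-delimited JSON: one JSON object per line, with a newline after every line including the last. The sketch below serializes a list of operations into that shape by hand. The helper name `to_ndjson` is hypothetical, and it is only needed when the raw body is built manually, since a client library may do this serialization itself.

```python
import json


def to_ndjson(operations):
    """Serialize bulk operations: one JSON object per line, each ending in a newline."""
    return "".join(json.dumps(operation) + "\n" for operation in operations)


operations = [
    {"create": {"_index": "customers", "_id": "dina"}},        # action and metadata
    {"name": "Dina", "register_at": "2023-11-30 00:00:00"},    # source for the create
    {"delete": {"_index": "customers", "_id": "spammer"}},     # delete has no source line
]

payload = to_ndjson(operations)
print(payload, end="")
```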

Bulk Document

# Bulk documents
# POST http://localhost:9200/_bulk
response = es.bulk(body=[
    {'create': {'_index': 'customers', '_id': 'dina'}},
    {'name': 'Dina', 'register_at': '2023-11-30 00:00:00'},
    {'index': {'_index': 'customers', '_id': 'yuni'}},
    {'name': 'Yuni', 'register_at': '2023-11-30 00:00:00'},
    {'update': {'_index': 'products', '_id': '1'}},
    {'doc': {'price': 250000}},
    {'create': {'_index': 'customers', '_id': 'spammer'}},
    {'name': 'Spammer', 'register_at': '2023-12-06 00:00:00'},
    {'delete': {'_index': 'customers', '_id': 'spammer'}},
])
print(json.dumps(response.body, indent=4))
# Get document dina
# GET http://localhost:9200/customers/_doc/dina
response = es.get(index='customers', id='dina')
print(json.dumps(response.body, indent=4))

# Get document yuni
# GET http://localhost:9200/customers/_doc/yuni
response = es.get(index='customers', id='yuni')
print(json.dumps(response.body, indent=4))

# Get document spammer
# GET http://localhost:9200/customers/_doc/spammer
response = es.options(ignore_status=[404]).get(index='customers', id='spammer')
print(json.dumps(response.body, indent=4))

# Get document product 1
# GET http://localhost:9200/products/_doc/1
response = es.get(index='products', id='1')
print(json.dumps(response.body, indent=4))