Elasticsearch reindex - How to decrease the reindex time by more than 90% and simplify reindex process

Mar 15, 2023

Five years ago, as many times before, I had to reindex a cluster with 20 indices. One of the indices held 3 TB of primary data. A routine job, you would say? Not this time! The cluster was unstable, and the process was repeatedly interrupted by node failures. Instead of 2 days it took me almost 3 weeks, and I spent a great deal of that time monitoring the process, starting the next one, and rerunning terminated processes from scratch.

I started thinking about how to optimize the reindex process. First, I wrote a simple script that monitored the reindex, started the next process, and reran terminated ones. The script did not speed up the reindex itself, but it cut the time I had to spend on it. After that, I started looking for optimizations that would make the reindex itself faster.

Finally, I thought it would be great to develop some user-friendly application to facilitate the reindexing.

In this story I want to share a review of the existing Elasticsearch options and present an open-source application that simplifies the reindex process and saves you time. According to the benchmarks, the application can cut reindex time by as much as 90%!

You can find the project on GitHub:

https://github.com/dbeast-co/Reindex

You can find the User guide on my site:

https://dbeast.co/reindex-for-elasticsearch

You can download the application from:

https://github.com/dbeast-co/Reindex/releases/tag/release

About the reindex process in Elasticsearch

Elasticsearch lets you reindex your data, that is, copy documents from one index to another. There are several common use cases for reindexing in Elasticsearch, including:

· Upgrading Elasticsearch

When you upgrade Elasticsearch, reindexing your data is often necessary to ensure compatibility with the new version, because new features may require changes to the index mapping or data structure.

· Changing the index settings or mapping

If you need to change the settings or the mapping of an existing index, you can create a new index with the updated settings or mapping and reindex the data into it.

· Combining multiple indices

If you have multiple indices that contain similar data, you may want to combine them into a single index. You can create a new index and reindex the existing indices’ data into the new one.

· Removing unused fields

Over time, an index may accumulate unused fields that are no longer needed. Reindexing allows you to remove these fields and optimize the index.

· Migrating data to a different cluster

If you need to move your data to a different Elasticsearch cluster, you can reindex your data into the new cluster.

· Fixing data issues

Reindexing can also be useful for fixing data issues, such as removing duplicates or correcting errors in the data.

Reindex API — Main features

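The Reindex API copies documents from a source index to a destination index. It does not copy the settings or mappings of the source index, so create and configure the destination index before you start.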

You can specify various parameters to control the behavior of the copy operation, such as the number of documents to copy at a time, or whether to preserve the original document IDs or not.

These are some key features of the Reindex API:

· Filtering

You can specify a query to filter the documents you are going to copy. This can be useful if you only want to copy a subset of documents from the source index.

· Transformation

You can use a script to transform documents during the copy process. This can be useful if you need to modify the structure of your documents or apply some other type of data transformation.

· Error handling

The Reindex API can handle errors that occur during the copy process, such as conflicts or document failures.

· Progress tracking

You can track the reindex process via the _tasks API, which provides detailed information about the progress of the copy operation, including the number of documents copied and the time elapsed.

While reindexing is a useful tool in Elasticsearch, it is not very user-friendly, and a running reindex can sometimes be terminated by Elasticsearch.
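The features above map onto fields of the request body sent to `POST _reindex`. The following is a minimal sketch that only builds that body as a Python dictionary and does not contact a cluster. The helper name `build_reindex_body`, the index names, and the field `unused_field` are illustrative assumptions, not part of the application described in this article.

```python
import json


def build_reindex_body(source_index, dest_index, query=None, script_source=None,
                       proceed_on_conflicts=False, batch_size=1000):
    """Build a JSON body for POST _reindex (hypothetical helper for illustration)."""
    # batch_size maps to source.size, the number of documents copied per batch.
    source = {"index": source_index, "size": batch_size}
    if query is not None:
        # Filtering: only documents matching the query are copied.
        source["query"] = query
    body = {"source": source, "dest": {"index": dest_index}}
    if script_source is not None:
        # Transformation: a Painless script applied to every copied document.
        body["script"] = {"lang": "painless", "source": script_source}
    if proceed_on_conflicts:
        # Error handling: count version conflicts instead of aborting on the first one.
        body["conflicts"] = "proceed"
    return body


body = build_reindex_body(
    "logs-old",   # assumed source index name
    "logs-new",   # assumed destination index name
    query={"range": {"@timestamp": {"gte": "2023-01-01", "lt": "2023-02-01"}}},
    script_source="ctx._source.remove('unused_field')",
    proceed_on_conflicts=True,
)
print(json.dumps(body, indent=2))
```

Sending this body with `wait_for_completion=false` makes Elasticsearch return a task ID instead of blocking, and that ID is what you then look up through the _tasks API.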

These are some common issues that might come up when you run the Reindex API:

· Reindexing multiple indices takes your time and attention

If you have to reindex multiple indices and cannot run them at the same time (with separate reindex requests), you must wait for one process to finish before you can start the next.

· Reindex terminates because of Elasticsearch problems

If a node fails or the cluster runs into trouble, Elasticsearch terminates the reindex process and you have to start it again… From scratch! Just imagine doing that with a huge index…

· Reindex process monitoring is tiresome

You can monitor the reindex process via the _tasks API, but that takes a lot of time and attention and is hardly user-friendly.
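To give a feel for what manual monitoring involves, here is a small sketch that summarizes a `GET _tasks/<task_id>` response for a reindex task. The function name `reindex_progress` is hypothetical, and the sample response is hand-written for illustration rather than captured from a real cluster; the sketch assumes the status object carries the `total`, `created`, `updated`, and `deleted` counters.

```python
def reindex_progress(task_response):
    """Summarize a GET _tasks/<task_id> response for a reindex task."""
    status = task_response["task"]["status"]
    # Documents handled so far; an approximation that ignores noops and conflicts.
    processed = status["created"] + status["updated"] + status["deleted"]
    total = status["total"]
    percent = 100.0 * processed / total if total else 0.0
    return {
        "completed": task_response.get("completed", False),
        "processed": processed,
        "total": total,
        "percent": round(percent, 1),
    }


# Hand-written sample response (assumption, not real cluster output).
sample = {
    "completed": False,
    "task": {
        "status": {"total": 2000, "created": 450, "updated": 50, "deleted": 0, "batches": 1}
    },
}
print(reindex_progress(sample))
```

Doing this by hand means polling every task, for every index, until each one completes.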

General Elasticsearch recommendations for speeding up the reindexing process

Elasticsearch provides several recommendations for speeding up reindexing. In my experience, they can cut the reindex time by up to 50%.

1. Disable the refresh interval

The refresh interval controls how often Elasticsearch makes newly indexed data visible to search. Every refresh costs disk I/O and CPU, so frequent refreshes during a reindex lower the ingest rate. Setting the interval to -1 disables refresh for the duration of the reindex. The refresh interval is a dynamic setting, so you can restore the original value once the reindex completes.

Once the destination index has been created, change the refresh interval from “Dev tools”:

PUT INDEX_NAME/_settings
{
  "refresh_interval": "-1"
}

With the same command, you can restore the refresh interval to the value your index normally uses (for logs and metrics I use 30 seconds, but you can set your own value).

PUT INDEX_NAME/_settings
{
  "refresh_interval": "30s"
}

2. Disable replicas

With replicas disabled, every document is indexed only once, on the primary shard, which reduces the load on your Elasticsearch cluster and improves the indexing speed. As soon as the reindexing is complete, re-enable replicas to restore data redundancy.

Once the destination index has been created, change the number of replicas from “Dev tools”:

PUT INDEX_NAME/_settings
{
  "number_of_replicas": 0
}

With the same command, you can restore the number of replicas to the original value of your index (the Elasticsearch default is 1).

PUT INDEX_NAME/_settings
{
  "number_of_replicas": 1
}
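Both settings can be changed in a single request before the reindex and restored in a single request afterwards. The sketch below only builds the two requests as plain Python tuples; the helper name `reindex_settings_plan` and the index name `logs-new` are assumptions made for illustration, and the restore values (30s, 1 replica) are the ones used in the examples above.

```python
def reindex_settings_plan(index_name, refresh_interval="30s", replicas=1):
    """Return the (method, path, body) requests to send before and after a reindex."""
    path = f"{index_name}/_settings"
    # Before the reindex: no refreshes and no replicas on the destination index.
    before = ("PUT", path, {"index": {"refresh_interval": "-1", "number_of_replicas": 0}})
    # After the reindex: restore the values the index normally runs with.
    after = ("PUT", path, {"index": {"refresh_interval": refresh_interval,
                                     "number_of_replicas": replicas}})
    return before, after


before, after = reindex_settings_plan("logs-new")
for method, path, body in (before, after):
    print(method, path, body)
```

Keeping the two requests together makes it harder to forget the restore step, which matters because an index left with zero replicas has no redundancy.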

All the above tricks are well known and highly useful, but real life can still be full of surprises…

So, with all of the above in place, let us see…

How we can do it all with the Reindex application.

The application is essentially a dedicated UI that lets you reindex your data without calling the Elasticsearch Reindex API by hand. It runs in a web browser, so you can use it on a local or a remote machine. Under the hood it still uses the Reindex API: all it does is send Elasticsearch REST requests containing the source, destination, and reindex parameters, built according to the selected reindex algorithm.

This application is very easy to start, and it makes the rest of the reindex process very simple.

To start a new reindex process:

1. Fill in the following settings:

· Source Elasticsearch cluster

· The indices that you want to reindex

· Destination parameters (index, alias, etc.)

· Parallel process parameters

· Reindex algorithm

· Project name

2. Save and validate the settings.

3. Press the “Start” button.

That’s all… The reindex is started😊!

Simple, right?

Later you can monitor the reindex process with the monitoring dashboard.

In the monitoring dashboard you will find the reindex status, the number of documents already reindexed, the currently running processes, the failed processes (with the reason for failure), and the processes waiting in the queue.

If your reindex fails (for example, due to an Elasticsearch problem), you do not have to start it from scratch. You can restart only the failed frames.

All right, so far we have seen how to save time and ease the monitoring process.

But where is the speed-up factor?

As I said earlier, we use the Reindex API for the reindex process and send Elasticsearch REST requests that depend on the reindex algorithm. And this does the trick! For time series data, we implement a “time oriented” algorithm that splits your index into frames based on its “Date field”.

For example:

You have a huge index (for example, 500GB) that contains your intraday logs. As bad luck would have it, you did not plan for multiple shards at the start, and now all your data sits in a single shard. Searches are slow, so you decide to reindex into a new index with 10 shards (50GB per shard, as recommended by Elasticsearch).

You can run this reindex the regular way from “Dev tools”, but then you get a single process that takes a long time. The application instead uses a splitting algorithm to run several processes in parallel, which accelerates the reindex.

“Time oriented” reindex algorithm: defines how the process is split.

“Date field”: the name of the field that holds the date of the logs (for example, @timestamp).

“Time frame”: the frame size, in minutes, used to split the requests (for example, 60 minutes). For an index that holds one day of logs, the application will split the reindex into 24 requests (one request per 60 minutes).
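The splitting idea can be sketched in a few lines: cut the time range into fixed-size frames and give each frame its own reindex request with a range query on the date field. This is a minimal illustration under assumed names (`split_into_frames`, `frame_reindex_body`, `logs-old`, `logs-new`), not the application's actual implementation.

```python
from datetime import datetime, timedelta


def split_into_frames(start, end, frame_minutes):
    """Yield (frame_start, frame_end) pairs covering the half-open range [start, end)."""
    step = timedelta(minutes=frame_minutes)
    current = start
    while current < end:
        frame_end = min(current + step, end)
        yield current, frame_end
        current = frame_end


def frame_reindex_body(source_index, dest_index, date_field, frame_start, frame_end):
    """Build a POST _reindex body restricted to one time frame."""
    return {
        "source": {
            "index": source_index,
            # gte/lt keeps adjacent frames from overlapping.
            "query": {"range": {date_field: {
                "gte": frame_start.isoformat(),
                "lt": frame_end.isoformat(),
            }}},
        },
        "dest": {"index": dest_index},
    }


day_start = datetime(2023, 3, 15)  # assumed day covered by the index
frames = list(split_into_frames(day_start, day_start + timedelta(days=1), 60))
bodies = [frame_reindex_body("logs-old", "logs-new", "@timestamp", s, e) for s, e in frames]
print(len(bodies))  # 24
```

Because every frame is an independent request, a failure affects only that frame, and the frames can be sent to Elasticsearch in parallel.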

“Date format”: the format of the date stored in that field, as defined in the index mapping.

“Number of concurrently processed indices”: the number of indices that you want to process at the same time. In our case we have only one index, so there is no need to change the default value.

For our purpose, the second parallelism parameter is the important one:

“Total number of threads per index”: the number of requests sent to Elasticsearch at the same time. Choose the number of threads based on the free cluster resources (CPU).

Let us suppose that your cluster has plenty of free resources and you set 10 threads.

This means that as soon as the process starts, your cluster will handle up to 10 parallel reindex processes. As soon as one of them finishes, the application sends the next one.

The benefit of several parallel processes is not linear, but it gives a huge boost to the indexing speed.
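The dispatch behaviour described here (a fixed number of requests in flight, with the next one sent as soon as a slot frees up) is what a thread pool provides. The sketch below assumes a blocking `send_frame` callable that performs one reindex request; here it is replaced by a stand-in, `fake_send`, that simulates one failure. All names are hypothetical and this is not the application's code.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_frames(frames, send_frame, threads=10):
    """Run send_frame(frame) for every frame, with at most `threads` running at once.

    Returns (succeeded, failed); failed maps each failed frame to its error
    message so that only those frames need to be resubmitted.
    """
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = {pool.submit(send_frame, frame): frame for frame in frames}
        for future in as_completed(futures):
            frame = futures[future]
            try:
                future.result()
                succeeded.append(frame)
            except Exception as error:
                failed[frame] = str(error)
    return succeeded, failed


def fake_send(frame):
    # Stand-in for a blocking POST _reindex call; frame 7 simulates a node failure.
    if frame == 7:
        raise RuntimeError("node left the cluster")


done, failed = run_frames(range(24), fake_send, threads=10)
print(len(done), failed)  # 23 {7: 'node left the cluster'}
```

Restarting only the failed parts then amounts to calling `run_frames(failed, send_frame)` again with the keys of the `failed` mapping.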

WARNING! Too many parallel threads can cause excessive CPU usage and slow down your cluster!

This is only one example, but there are many more cases in which the application can boost the reindex speed.

The reindex application provides the following features:

· Option to split the reindex request into multiple requests and their parallel processing

· Option to set the number of concurrently processed indices (which need not equal the total number of indices) and the number of concurrently processed data frames per index (when you use the Time oriented algorithm). When one process ends, the application moves on to the next one.

· Reindex to the same/remote cluster

· Reindex multiple indices into one index

· Reindex to the alias

· Reindex to the index/indices named with prefix + original_index_name

· Reindex to the index/indices named with original_index_name + suffix

· Reindex to the index/indices named with original_index_name — suffix

· Reindex to ILM rollover alias with an option to create the first ILM index (for example, index-000001 for size rollover, or %3CINDEX_NAME-%7Bnow%2Fd%7D-000001%3E for time series indices)

· Remote reindex with the same index name

· Restart failed reindex parts
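The first-index name for time series rollover shown in the list above is a URL-encoded date math expression. As a quick check, Python's standard library can encode and decode it:

```python
from urllib.parse import quote, unquote

# Date math index name; INDEX_NAME is the placeholder used in the list above.
index_name = "<INDEX_NAME-{now/d}-000001>"

# safe="" also encodes "/", which is required inside a request path.
encoded = quote(index_name, safe="")
print(encoded)           # %3CINDEX_NAME-%7Bnow%2Fd%7D-000001%3E
print(unquote(encoded))  # <INDEX_NAME-{now/d}-000001>
```

The encoded form is what goes into the request path when the index is created, since characters such as `<`, `{`, and `/` cannot appear there unescaped.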

Hope you will enjoy our reindex application!

Feel free to send us your feedback on GitHub:

https://github.com/dbeast-co/Reindex/discussions

https://github.com/dbeast-co/Reindex/issues