Fixing 5s Latency in Elasticsearch 8: Shard Sizing & ILM War Stories

The alerting channel screamed at 2:00 AM on a Friday. Our production ELK Stack—responsible for ingesting 2TB of log data and powering the site's internal search—had ground to a halt. Simple term queries that usually took milliseconds were timing out at 5 seconds, and the Data Nodes were sitting at 95% JVM Heap usage despite CPU utilization being relatively low. We weren't just seeing a slowdown; we were seeing a cluster on the brink of a garbage collection death spiral.

The Sharding Trap: Why "More" Isn't Better

To understand the failure, we need to look at our environment. We were running Elasticsearch 8.6 on AWS `r6g.2xlarge` instances (64GB RAM). Our indexing strategy was a naive "daily index" pattern (`logs-2025-12-26`). While this is easy to set up, it ignores data volume fluctuations. On quiet Sundays, an index might be 5GB. On Black Friday, it ballooned to 800GB.

The root cause wasn't the data volume itself, but the Sharding Strategy. Because we statically defined 10 primary shards per index to "future-proof" for high traffic, we ended up with thousands of tiny 500MB shards for older data and massive, unstable shards for peak days. In Elasticsearch, every shard carries a baseline of heap overhead for Lucene segment metadata, regardless of how much data it holds.

Analysis Log: `GET /_cat/shards` revealed over 4,000 shards for just 5TB of data. The cluster state had grown enormous, causing the master node to time out when publishing updates.

This is a classic "Oversharding" scenario. The overhead of managing thousands of Lucene instances choked the JVM heap and left little RAM for the query cache or the operating system's file system cache.
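
Before touching anything, it helps to quantify the problem. A minimal diagnostic sketch along the lines of what surfaced this for us; the `logs-*` pattern and the column selection are illustrative, and both requests run as-is in the Kibana Dev Tools console:

// List every shard under logs-*, smallest first, to expose the long tail of tiny shards.
GET _cat/shards/logs-*?v&h=index,shard,prirep,store,node&s=store:asc

// Cross-check total shard count and JVM heap pressure for the whole cluster.
GET _cluster/stats?human&filter_path=indices.count,indices.shards.total,nodes.jvm.mem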

The Failed "Force Merge" Attempt

My first reaction was to panic-fix the symptom. I thought, "If we have too many segments, let's merge them," and ran `POST /_forcemerge?max_num_segments=1`, which, with no index target, hits every index in the cluster, including the ones actively receiving writes. This was a catastrophic mistake. Force merging causes massive I/O pressure, and doing it on indices that were actively ingesting blocked the indexing threads, which caused the Logstash buffers to fill up and eventually drop data. We learned the hard way: never force merge an index that is still being written to.
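
If you are ever tempted to force merge anyway, check the dedicated `force_merge` thread pool first; a growing queue means merges are already backing up and competing with indexing for disk I/O. A quick sketch (the column selection is illustrative):

// A non-empty queue on the force_merge pool is a sign merges are piling up.
GET _cat/thread_pool/force_merge?v&h=node_name,name,active,queue,rejected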

The Fix: ILM and Dynamic Rollover

The solution required shifting from time-based indices to size-based indices using Index Lifecycle Management (ILM). With this approach, an index only "rolls over" to a new one when its primary shards hit an optimal size (usually 30GB to 50GB), regardless of whether that takes 1 hour or 1 week.

This allows us to maintain a perfect balance: shards are large enough to be efficient but small enough to move easily during rebalancing.

// This policy moves data through phases to optimize cost and performance.
PUT _ilm/policy/logs_hot_warm_policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            // Roll over when the primary shard hits 50GB
            // OR the index is 7 days old (whichever comes first)
            "max_primary_shard_size": "50gb",
            "max_age": "7d"
          }
        }
      },
      "warm": {
        // Move to warm nodes after 7 days
        "min_age": "7d", 
        "actions": {
          "shrink": {
            // Condense shards to save metadata overhead
            "number_of_shards": 1
          },
          "forcemerge": {
            // NOW it is safe to merge, as the index is read-only
            "max_num_segments": 1
          },
          "allocate": {
             // Move to nodes tagged with node.attr.data: warm
             // (cheaper hardware, e.g., HDD or less RAM)
             "require": {
               "data": "warm"
             }
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

In the configuration above, the `hot` phase handles ingestion. The critical setting is `max_primary_shard_size: "50gb"`: it ensures that even during traffic spikes we roll over to a fresh index automatically, preventing any single shard from becoming a "hotspot" that drags down a node. Once the data enters the `warm` phase, we use `shrink` and `forcemerge` to optimize it for read-only access, drastically reducing the heap footprint.
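
The policy does nothing on its own until it is attached to new indices. That wiring happens in an index template, which also declares the rollover alias the hot phase writes through. A minimal sketch, assuming a `logs-*` pattern and an alias named `logs-write-alias`; the shard and replica counts are illustrative placeholders:

PUT _index_template/logs_template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      // Keep the static shard count small; rollover now controls shard size.
      "number_of_shards": 2,
      "number_of_replicas": 1,
      "index.lifecycle.name": "logs_hot_warm_policy",
      "index.lifecycle.rollover_alias": "logs-write-alias"
    }
  }
}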

Performance Impact & Verification

We applied this policy and reindexed the data. The results were immediate: by consolidating the tiny shards and capping the huge ones, we stabilized heap usage. As a welcome side effect, the faster responses improved the metrics we track for internal Search Engine Optimization (specifically LCP and TTFB on pages that rely on search results).
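
Migrating the legacy daily indices meant reindexing them through the new write alias so ILM could take over immediately. A sketch of that step, where the source pattern and the background-task option are assumptions on my part:

// Run as a background task; pull the legacy daily indices into the
// ILM-managed series via the write alias.
POST _reindex?wait_for_completion=false
{
  "source": { "index": "logs-2025-12-*" },
  "dest": { "index": "logs-write-alias" }
}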

| Metric | Before Optimization | After ILM & Sharding Fix |
| --- | --- | --- |
| Avg Query Latency | 5,200 ms | 120 ms |
| Shard Count | 4,200+ | 380 |
| Heap Usage (Peak) | 96% (Constant GC) | 65% (Healthy) |
| Data Node CPU | 30% (I/O wait high) | 45% (Efficient processing) |

The reduction in shard count by over 90% meant that the cluster state updates were instantaneous. The Master node stopped timing out, and the data nodes could dedicate RAM to file system caching rather than just keeping shard metadata alive. This is the essence of Elasticsearch Tuning: it's rarely about changing a boolean flag in `elasticsearch.yml` and almost always about data architecture.
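
The easiest way to keep verifying that the policy is doing its job is ILM's explain API, which reports the phase, action, and step each managed index is currently in, along with any step errors. A sketch (the `logs-*` target is illustrative):

// Per-index lifecycle status: current phase/action/step and any errors.
GET logs-*/_ilm/explain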

Edge Cases & Critical Warnings

While ILM is powerful, it introduces complexity that can bite you if ignored. The most common pitfall is "Write Alias" confusion: when using ILM with rollover, you must write to an alias (e.g., `logs-write-alias`), not to the concrete index name.

Warning: If your application or Logstash accidentally writes to `logs-2025-01-01-000001` directly instead of to the alias, the rollover mechanism is effectively bypassed. ILM only rotates the alias pointer; direct writes to a concrete index skip that rotation entirely, so that single index keeps growing without bound even after the alias has moved on.
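
The safest guard is to bootstrap the very first index yourself with the alias flagged as its write index, and then point Logstash and the application exclusively at the alias. A minimal sketch using a simplified index name (the date-math name from the warning above works the same way):

// Create the first index in the series and mark the alias as its write index.
// ILM creates the next indices in the series on each rollover.
PUT logs-000001
{
  "aliases": {
    "logs-write-alias": {
      "is_write_index": true
    }
  }
}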

Additionally, beware of the `shrink` action when disk space is tight. To shrink an index, Elasticsearch must first relocate a copy of every primary shard onto a single node, then build the smaller index there (copying segments whenever the filesystem cannot hard-link them). In the worst case you temporarily need roughly double the disk space for that index on one node. If your warm nodes are already running at 85% disk capacity, the default disk watermarks will block that relocation and the shrink action will fail.
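
Before the warm phase runs, it is worth confirming that at least one warm node has the headroom to host the shrunken copy. A quick sketch using the allocation cat API (the column selection is illustrative):

// Per-node shard counts and disk usage; the shrink target node needs room
// for a full extra copy of the index while the shrink is in flight.
GET _cat/allocation?v&h=node,shards,disk.indices,disk.used,disk.avail,disk.percent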

Conclusion

Resolving the latency issues in our ELK Stack required moving away from intuitive "daily indices" to a disciplined, calculated approach using Index Lifecycle Management. By aligning our Sharding Strategy with actual data volume rather than calendar dates, we reduced heap pressure and regained query performance. If you are managing a cluster larger than a few terabytes, static index patterns are a technical debt you cannot afford to keep.
