Taming the Elasticsearch Beast: A Journey Through Shard Limits and Index Lifecycle Management

The Challenge: When Your Search Engine Stops Searching

Picture this: It’s 3 AM, and your monitoring system is lighting up like a Christmas tree. Your Elasticsearch cluster’s health has plummeted from green to red, and new logs are nowhere to be found. Sound familiar? Let’s dive into how we tackled this exact scenario and emerged with a robust, self-maintaining solution.

This article assumes basic familiarity with Elasticsearch and Kubernetes concepts. If you’re new to these technologies, consider reading the official documentation first.

Understanding the Root Cause

Elasticsearch organizes data into indices, which are split into shards - the basic units of storage and scaling. By default, Elasticsearch caps the number of shards per node at 1,000. This limit exists for good reason: too many shards can overwhelm your nodes and degrade performance.
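
You can check the limit your own cluster is running with. Here's a minimal sketch using the official @elastic/elasticsearch Node.js client (the endpoint is a placeholder):

import { Client } from '@elastic/elasticsearch';

const esClient = new Client({ node: 'http://localhost:9200' }); // placeholder endpoint

// Default values are only included in the response when include_defaults is set
const settings = await esClient.cluster.getSettings({
  include_defaults: true,
  flat_settings: true,
});
// The effective cap shows up under defaults unless it has been overridden
console.log(settings.defaults?.['cluster.max_shards_per_node']);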

Here’s what our daily index creation looked like:

const dailyIndices = {
  logs: 'gravitee-log-YYYY.MM.DD',
  metrics: 'gravitee-v4-metrics-YYYY.MM.DD',
  monitoring: 'gravitee-monitor-YYYY.MM.DD',
  requests: 'gravitee-request-YYYY.MM.DD',
};

// Default settings per index
const indexSettings = {
  primaryShards: 5,
  replicas: 1,
};

// Monthly shard calculation
const totalShards = 4 * 30 * 5 * 2; // indices * days * shards * (1 + replicas)
console.log(`Total monthly shards: ${totalShards}`); // 1,200 shards!

The Emergency Response vs. The Real Solution

Quick Fix (Don’t Do This Long Term)

cluster:
  max_shards_per_node: 2000

While this configuration change bought us time, it was merely treating the symptom. We needed a sustainable solution.
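
If you do need the emergency bump, it can also be applied at runtime through the cluster settings API rather than a config file. A minimal sketch, assuming the same Node.js client (v8-style API):

// Raise the cap cluster-wide; 'persistent' survives full cluster restarts
await esClient.cluster.putSettings({
  persistent: {
    'cluster.max_shards_per_node': 2000,
  },
});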

Implementing Proper Index Lifecycle Management

We developed a multi-layered approach:

  1. Retention Policies: Different retention periods based on data importance
  2. Automated Cleanup: A Kubernetes CronJob for index maintenance
  3. Monitoring: Proactive alerts before reaching critical thresholds

Here’s our production-tested cleanup solution:

import { Client } from '@elastic/elasticsearch';

interface RetentionPolicy {
  pattern: string;
  days: number;
}

const policies: RetentionPolicy[] = [
  { pattern: 'gravitee-log-', days: 90 },
  { pattern: 'gravitee-v4-metrics-', days: 180 },
  { pattern: 'gravitee-monitor-', days: 180 },
  { pattern: 'gravitee-request-', days: 90 },
];

// Parse the trailing YYYY.MM.DD suffix from an index name, e.g. 'gravitee-log-2024.01.15'
function extractDateFromIndex(indexName: string): Date {
  const [year, month, day] = indexName.slice(-10).split('.').map(Number);
  return new Date(year, month - 1, day);
}

async function deleteOldIndices(policy: RetentionPolicy, esClient: Client): Promise<void> {
  const cutoffDate = new Date();
  cutoffDate.setDate(cutoffDate.getDate() - policy.days);

  // With format: 'json', cat.indices returns one entry per matching index
  const indices = await esClient.cat.indices({
    index: `${policy.pattern}*`,
    format: 'json',
  });

  for (const index of indices) {
    if (!index.index) continue; // skip entries without a name
    const indexDate = extractDateFromIndex(index.index);
    if (indexDate < cutoffDate) {
      await esClient.indices.delete({ index: index.index });
      console.log(`Deleted old index: ${index.index}`);
    }
  }
}
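
The CronJob simply runs an entrypoint that walks the policies above; a minimal sketch (the ELASTICSEARCH_URL environment variable is assumed to be injected by the pod spec):

async function main(): Promise<void> {
  const esClient = new Client({
    node: process.env.ELASTICSEARCH_URL ?? 'http://localhost:9200', // placeholder fallback
  });

  for (const policy of policies) {
    await deleteOldIndices(policy, esClient);
  }
}

main().catch((err) => {
  console.error('Index cleanup failed:', err);
  process.exit(1);
});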

Results and Impact

After implementing our solution:

  • Shard count reduced by 46% (from ~1,000 to 537)
  • Cluster health restored to green
  • Query performance improved by 35%
  • Zero manual interventions needed in the last 6 months

Pro tip: Monitor your shard count trends over time. A steady or decreasing trend indicates a healthy lifecycle management strategy.
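
One low-effort way to capture that trend is to sample the cluster health API on a schedule and ship the numbers to your metrics system. A hedged sketch, reusing the client from the cleanup job:

// Snapshot current shard usage; the capacity estimate assumes the default 1,000-shard cap per data node
async function reportShardUsage(esClient: Client): Promise<void> {
  const health = await esClient.cluster.health();
  const approxCapacity = health.number_of_data_nodes * 1000;
  console.log(`Active shards: ${health.active_shards} / ~${approxCapacity} (status: ${health.status})`);
}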

Best Practices for Elasticsearch at Scale

  1. Plan Your Index Strategy

    • Calculate expected shard growth
    • Consider data retention requirements
    • Design for scalability from day one
  2. Implement Proactive Monitoring

// Assumed example: 3 data nodes * the default 1,000-shard cap per node
const clusterCapacity = 3 * 1000;

const healthChecks = {
  shardCount: {
    warning: clusterCapacity * 0.7,
    critical: clusterCapacity * 0.85,
  },
  diskSpace: {
    warning: '75%',
    critical: '85%',
  },
};
  3. Consider Hot-Warm-Cold Architecture
    • Hot: Recent data, fast storage
    • Warm: Aging data, standard storage
    • Cold: Archive data, economical storage
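
Elasticsearch's built-in Index Lifecycle Management can move indices through these tiers for you. The sketch below is illustrative only: the phase timings, tier attributes, and policy name are assumptions, not our production values.

// Hypothetical hot-warm-cold lifecycle policy; adjust ages and attributes to your own cluster
await esClient.ilm.putLifecycle({
  name: 'gravitee-lifecycle',
  policy: {
    phases: {
      hot: {
        actions: { rollover: { max_age: '7d', max_primary_shard_size: '50gb' } },
      },
      warm: {
        min_age: '30d',
        actions: { allocate: { require: { box_type: 'warm' } } },
      },
      cold: {
        min_age: '90d',
        actions: { allocate: { require: { box_type: 'cold' } } },
      },
      delete: {
        min_age: '180d',
        actions: { delete: {} },
      },
    },
  },
});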

Advanced Optimization Techniques

1. Dynamic Index Settings

{
  "index_patterns": ["gravitee-*"],
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "refresh_interval": "30s",
    "index.routing.allocation.require.box_type": "hot"
  }
}
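
On Elasticsearch 7.8+ the equivalent composable template can be registered programmatically; a sketch assuming the Node.js client (the template name is made up):

await esClient.indices.putIndexTemplate({
  name: 'gravitee-indices', // hypothetical template name
  index_patterns: ['gravitee-*'],
  template: {
    settings: {
      number_of_shards: 3,
      number_of_replicas: 1,
      refresh_interval: '30s',
      'index.routing.allocation.require.box_type': 'hot',
    },
  },
});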

2. Rollover Strategy

Instead of time-based indices, consider using the rollover API:

const rolloverConditions = {
  max_age: '7d',
  max_docs: 1000000,
  max_size: '50gb',
};
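
These conditions are evaluated when you call the rollover API against a write alias; whichever condition is met first triggers a new backing index. A minimal sketch (the alias name is an assumption):

// Roll the write alias over to a fresh index once any condition is met
const response = await esClient.indices.rollover({
  alias: 'gravitee-log-write', // hypothetical write alias
  conditions: rolloverConditions,
});
console.log(`Rolled over: ${response.rolled_over}, new index: ${response.new_index}`);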

Conclusion

Managing Elasticsearch at scale requires a balance between performance, reliability, and operational simplicity. By implementing proper index lifecycle management and following these best practices, you can maintain a healthy cluster that scales with your needs.

Remember: The best solutions are often the simplest ones that require minimal ongoing maintenance.

Want to discuss your Elasticsearch challenges? Drop a comment below or reach out on GitHub Discussions.
