Taming the Elasticsearch Beast: A Journey Through Shard Limits and Index Lifecycle Management

The Challenge: When Your Search Engine Stops Searching
Picture this: It’s 3 AM, and your monitoring system is lighting up like a Christmas tree. Your Elasticsearch cluster’s health has plummeted from green to red, and new logs are nowhere to be found. Sound familiar? Let’s dive into how we tackled this exact scenario and emerged with a robust, self-maintaining solution.
This article assumes basic familiarity with Elasticsearch and Kubernetes concepts. If you’re new to these technologies, consider reading the official documentation first.
Understanding the Root Cause
Elasticsearch organizes data into indices, which are split into shards - the basic units of storage and scaling. By default, the cluster.max_shards_per_node setting allows 1,000 open shards per data node, enforced as a cluster-wide limit. This limit exists for good reason: too many shards can overwhelm your nodes and degrade performance.
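Before anything breaks, it is worth knowing how much headroom you have. Here is a minimal check, assuming the official @elastic/elasticsearch client (v8) and a placeholder endpoint:
import { Client } from '@elastic/elasticsearch';

const esClient = new Client({ node: 'http://localhost:9200' }); // placeholder endpoint

// Compare the current number of open shards against the default cap of 1,000 per data node.
async function checkShardHeadroom(): Promise<void> {
  const health = await esClient.cluster.health();
  const capacity = health.number_of_data_nodes * 1000; // default cluster.max_shards_per_node
  console.log(`${health.active_shards} of ~${capacity} allowed shards in use (status: ${health.status})`);
}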
Here’s what our daily index creation looked like:
const dailyIndices = {
  logs: 'gravitee-log-YYYY.MM.DD',
  metrics: 'gravitee-v4-metrics-YYYY.MM.DD',
  monitoring: 'gravitee-monitor-YYYY.MM.DD',
  requests: 'gravitee-request-YYYY.MM.DD',
};

// Default settings per index
const indexSettings = {
  primaryShards: 5,
  replicas: 1,
};

// Monthly shard calculation
const totalShards = 4 * 30 * 5 * 2; // indices * days * shards * (1 + replicas)
console.log(`Total monthly shards: ${totalShards}`); // 1,200 shards!
The Emergency Response vs. The Real Solution
Quick Fix (Don’t Do This Long Term)
cluster:
  max_shards_per_node: 2000
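Because cluster.max_shards_per_node is a dynamic setting, the same change can be applied at runtime through the cluster settings API instead of restarting nodes. A minimal sketch with the JS client:
import { Client } from '@elastic/elasticsearch';

// Emergency-only: raise the cluster-wide cap. Remove this override once real lifecycle management is in place.
async function raiseShardLimit(esClient: Client): Promise<void> {
  await esClient.cluster.putSettings({
    persistent: { 'cluster.max_shards_per_node': 2000 },
  });
}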
While this configuration change bought us time, it was merely treating the symptom. We needed a sustainable solution.
Implementing Proper Index Lifecycle Management
We developed a multi-layered approach:
- Retention Policies: Different retention periods based on data importance
- Automated Cleanup: A Kubernetes CronJob for index maintenance
- Monitoring: Proactive alerts before reaching critical thresholds
Here’s our production-tested cleanup solution:
import { Client } from '@elastic/elasticsearch';

interface RetentionPolicy {
  pattern: string;
  days: number;
}

const policies: RetentionPolicy[] = [
  { pattern: 'gravitee-log-', days: 90 },
  { pattern: 'gravitee-v4-metrics-', days: 180 },
  { pattern: 'gravitee-monitor-', days: 180 },
  { pattern: 'gravitee-request-', days: 90 },
];

// Index names end in YYYY.MM.DD (e.g. gravitee-log-2025.01.31); parse that suffix into a Date.
function extractDateFromIndex(indexName: string): Date {
  const [year, month, day] = indexName.slice(-10).split('.').map(Number);
  return new Date(year, month - 1, day);
}

async function deleteOldIndices(policy: RetentionPolicy, esClient: Client): Promise<void> {
  const cutoffDate = new Date();
  cutoffDate.setDate(cutoffDate.getDate() - policy.days);

  // cat.indices with format: 'json' returns one record per matching index.
  const indices = await esClient.cat.indices({
    index: `${policy.pattern}*`,
    format: 'json',
  });

  for (const index of indices) {
    if (!index.index) continue;
    const indexDate = extractDateFromIndex(index.index);
    if (indexDate < cutoffDate) {
      await esClient.indices.delete({ index: index.index });
      console.log(`Deleted old index: ${index.index}`);
    }
  }
}
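The Kubernetes CronJob mentioned above simply runs a small entry point that walks every policy. A possible shape, where the ES_URL environment variable is an assumption for illustration:
import { Client } from '@elastic/elasticsearch';

async function main(): Promise<void> {
  // ES_URL is a hypothetical environment variable injected by the CronJob spec.
  const esClient = new Client({ node: process.env.ES_URL ?? 'http://localhost:9200' });
  for (const policy of policies) {
    await deleteOldIndices(policy, esClient);
  }
}

main().catch((err) => {
  console.error('Index cleanup failed:', err);
  process.exit(1);
});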
Results and Impact
After implementing our solution:
- Shard count reduced by 46% (from ~1,000 to 537)
- Cluster health restored to green
- Query performance improved by 35%
- Zero manual interventions needed in the last 6 months
Pro tip: Monitor your shard count trends over time. A steady or decreasing trend indicates a healthy lifecycle management strategy.
Best Practices for Elasticsearch at Scale
- Plan Your Index Strategy
  - Calculate expected shard growth
  - Consider data retention requirements
  - Design for scalability from day one
- Implement Proactive Monitoring (example thresholds below; a check against them is sketched after this list)
// clusterCapacity = data nodes × cluster.max_shards_per_node (1,000 by default);
// the value here is a placeholder for illustration.
const clusterCapacity = 2 * 1000;

const healthChecks = {
  shardCount: {
    warning: clusterCapacity * 0.7,
    critical: clusterCapacity * 0.85,
  },
  diskSpace: {
    warning: '75%',
    critical: '85%',
  },
};
- Consider Hot-Warm-Cold Architecture
  - Hot: Recent data, fast storage
  - Warm: Aging data, standard storage
  - Cold: Archive data, economical storage
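As referenced in the monitoring item above, a cron-friendly check can compare live cluster stats against those thresholds. A sketch, reusing the healthChecks definition from the earlier snippet:
import { Client } from '@elastic/elasticsearch';

// Compare the live shard count against the warning/critical thresholds defined above.
async function checkThresholds(esClient: Client): Promise<void> {
  const health = await esClient.cluster.health();
  if (health.active_shards >= healthChecks.shardCount.critical) {
    console.error(`CRITICAL: ${health.active_shards} shards active`);
  } else if (health.active_shards >= healthChecks.shardCount.warning) {
    console.warn(`WARNING: ${health.active_shards} shards active`);
  }
}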
Advanced Optimization Techniques
1. Dynamic Index Settings
{
  "index_patterns": ["gravitee-*"],
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "refresh_interval": "30s",
    "index.routing.allocation.require.box_type": "hot"
  }
}
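One way to register such a template is via the JS client; the sketch below assumes Elasticsearch 7.8+ composable index templates, and the template name is hypothetical:
import { Client } from '@elastic/elasticsearch';

async function registerTemplate(esClient: Client): Promise<void> {
  await esClient.indices.putIndexTemplate({
    name: 'gravitee-template', // hypothetical template name
    index_patterns: ['gravitee-*'],
    template: {
      settings: {
        number_of_shards: 3,
        number_of_replicas: 1,
        refresh_interval: '30s',
        'index.routing.allocation.require.box_type': 'hot',
      },
    },
  });
}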
2. Rollover Strategy
Instead of time-based indices, consider using the rollover API:
const rolloverConditions = {
  max_age: '7d',
  max_docs: 1000000,
  max_size: '50gb',
};
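With rollover, writes go through an alias and a new backing index is created whenever any condition is met. A sketch, assuming a hypothetical write alias named gravitee-log-write:
import { Client } from '@elastic/elasticsearch';

async function rolloverIfNeeded(esClient: Client): Promise<void> {
  const response = await esClient.indices.rollover({
    alias: 'gravitee-log-write', // hypothetical write alias
    conditions: rolloverConditions,
  });
  console.log(`Rolled over: ${response.rolled_over}, new index: ${response.new_index}`);
}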
Conclusion
Managing Elasticsearch at scale requires a balance between performance, reliability, and operational simplicity. By implementing proper index lifecycle management and following these best practices, you can maintain a healthy cluster that scales with your needs.
Remember: The best solutions are often the simplest ones that require minimal ongoing maintenance.
Want to discuss your Elasticsearch challenges? Drop a comment below or reach out on GitHub Discussions.