Here is a checklist of Elasticsearch must-knows that I collected over the past year.
Newer releases generally perform better. In our tests, indexing performance on 5.6.10 was ~20% higher than on 5.1.1.
Memory per node
Do not go over 32GB of heap; 31GB is likely safe, since above roughly 32GB the JVM can no longer use compressed object pointers.
Number of nodes
Two 16GB nodes are generally better than one 32GB node, and two 8GB nodes are generally better than one 16GB node, but not always. Run your own load tests on the scenarios you care about.
Number of shards
More shards generally mean faster queries but slower aggregations.
Try to avoid very large shards.
Index bulk size and thread count
Load test with your expected data; start with a bulk size of 500-1000 documents across several threads.
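As a starting point for such a load test, here is a minimal sketch of cutting documents into bulk-sized batches and serialising each batch as an NDJSON body for the `_bulk` endpoint. The helper names (`chunked`, `bulk_body`) and the sample documents are made up for illustration, and the empty action metadata assumes the index (and, on 5.x, the type) is given in the request URL.

```python
import json
from typing import Iterable, Iterator, List


def chunked(docs: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield lists of at most `size` documents."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


def bulk_body(batch: List[dict]) -> str:
    """Serialise one batch as NDJSON: an action line, then the document source.

    The _bulk endpoint requires the body to end with a newline.
    """
    lines = []
    for doc in batch:
        lines.append(json.dumps({"index": {}}))  # index/type assumed to be in the URL
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


# Hypothetical sample data: 2500 small documents, bulk size 1000.
docs = [{"id": i, "msg": f"event {i}"} for i in range(2500)]
bodies = [bulk_body(batch) for batch in chunked(docs, 1000)]
print(len(bodies), "bulk requests prepared")
```

Each body can then be posted from a worker in a thread pool; varying the batch size and the number of workers is what the load test should measure.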
A lower refresh interval means lower indexing performance, but new documents become visible to search sooner.
Using async with a longer interval increases indexing speed, at the risk of losing data.
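The two trade-offs above can be written down as index settings. This is a sketch under the assumption that they refer to `index.refresh_interval` and the `index.translog.*` settings; the values are placeholders to tune with your own load test, and whether a given setting can be changed on a live index depends on your version.

```python
import json

# Placeholder values, not recommendations.
index_settings = {
    "index": {
        # Default is 1s. A longer interval indexes faster, but new
        # documents take longer to become searchable.
        "refresh_interval": "30s",
        "translog": {
            # Default is "request" (fsync before acknowledging each request).
            # "async" fsyncs in the background, so a crash can lose the
            # operations written since the last sync.
            "durability": "async",
            # How often the translog is fsynced in async mode (default 5s).
            "sync_interval": "10s",
        },
    }
}

print(json.dumps(index_settings, indent=2))
```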
Fully merge the segments of your read-only indices.
Choose a proper codec:
best_compression has the best compression ratio but the slowest speed.
Split a huge range query into smaller ones to save memory and response time, unless you need all the results at the same time.
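One way to split a large time range is to walk it in fixed-size windows and issue one range query per window. The sketch below assumes a date field named `timestamp` (hypothetical) and uses half-open windows so that no document is matched twice.

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple


def split_range(start: datetime, end: datetime,
                step: timedelta) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive half-open windows [lo, hi) that cover [start, end)."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    lo = start
    while lo < end:
        hi = min(lo + step, end)  # the last window may be shorter
        yield lo, hi
        lo = hi


def range_query(field: str, lo: datetime, hi: datetime) -> dict:
    """Build a search body for one window; gte/lt keeps windows from overlapping."""
    return {"query": {"range": {field: {"gte": lo.isoformat(),
                                        "lt": hi.isoformat()}}}}


# One week of data, queried two days at a time.
windows = list(split_range(datetime(2019, 1, 1), datetime(2019, 1, 8),
                           timedelta(days=2)))
queries = [range_query("timestamp", lo, hi) for lo, hi in windows]
print(len(queries), "smaller queries")
```

The windows can be run one after another, processing and discarding each result before the next, which is where the memory saving comes from.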
Use text if you need full-text search; otherwise
keyword is generally good enough.
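For illustration, a mapping fragment that makes this choice per field might look like the following. The field names `title` and `status` are made up.

```python
import json

mapping = {
    "properties": {
        # Analyzed into terms at index time; use for full-text search.
        "title": {"type": "text"},
        # Stored as a single exact value; use for filters, sorting, aggregations.
        "status": {"type": "keyword"},
    }
}

print(json.dumps(mapping, indent=2))
```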
Use SSDs for the hot data zone if possible.
RAID can also boost speed. In our tests, a cluster of nodes with 10-disk RAID5 was 10-20% faster than nodes using raw multiple data paths with replicas enabled.
If Elasticsearch is not the only CPU-heavy program on the server, try binding processes to specific CPUs, e.g. via taskset.
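If your startup scripts are written in Python, a small helper can build the taskset command line. This is only a sketch: `./my_other_service` is a placeholder program, and which CPU ids to hand out is something to decide for your own machine.

```python
from typing import List, Sequence


def taskset_argv(cpus: Sequence[int], command: Sequence[str]) -> List[str]:
    """Build an argv that launches `command` pinned to the given CPU ids."""
    if not cpus:
        raise ValueError("need at least one CPU id")
    cpu_list = ",".join(str(c) for c in sorted(set(cpus)))
    # `taskset -c <list> <cmd>` runs <cmd> with its CPU affinity set to <list>.
    return ["taskset", "-c", cpu_list, *command]


argv = taskset_argv([0, 1, 2, 3], ["./my_other_service"])
print(" ".join(argv))
```

The resulting list can be passed to `subprocess.run(argv)` on a Linux host where taskset is installed.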
Compare GC algorithms by testing. We first used G1 and then switched back to CMS, because nodes running G1 sometimes just crashed.
Limit off-heap usage by
Disable or minimize swapping.
Unlimit or turn up
Enable core dumps, but don’t forget to clean up old dump files.
Aggregation bucket size
Limit the bucket size, especially in multi-level nested aggregations, to avoid a combinatorial explosion of buckets.
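To see how quickly nested buckets multiply, the sketch below builds a nested terms aggregation with an explicit `size` at every level and computes the worst-case number of leaf buckets as the product of those sizes. The field names and sizes are invented for the example.

```python
from math import prod
from typing import List, Optional, Tuple

Levels = List[Tuple[str, int]]  # (field, size), outermost level first


def nested_terms_agg(levels: Levels) -> Optional[dict]:
    """Build nested terms aggregations, each with an explicit bucket size."""
    agg = None
    for field, size in reversed(levels):  # build from the innermost level out
        level = {"terms": {"field": field, "size": size}}
        if agg is not None:
            level["aggs"] = agg
        agg = {f"by_{field}": level}
    return agg


def worst_case_leaf_buckets(levels: Levels) -> int:
    """Upper bound on innermost buckets: the product of the per-level sizes."""
    return prod(size for _, size in levels)


levels = [("country", 50), ("city", 20), ("device", 10)]
body = {"size": 0, "aggs": nested_terms_agg(levels)}
print(worst_case_leaf_buckets(levels), "leaf buckets in the worst case")
```

Three modest-looking sizes already allow 10,000 leaf buckets here, so it is worth checking this number before adding another level.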
Enable sniffing so that indexing is spread across the cluster nodes.
Use long-lived connections, and set up a retry and reconnect strategy.
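A retry strategy can be as simple as exponential backoff around each request. The sketch below is client-agnostic: `with_retry` and `flaky` are made-up names, and which exceptions count as retryable depends on the client library you use.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retry(
    call: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call()`, retrying on the given errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # This is also the place to rebuild the connection if needed.
            sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")


# Simulated request that fails twice before succeeding.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("node unreachable")
    return "ok"


delays = []
result = with_retry(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```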
Monitor disk space usage. Raise an alarm when usage is high, and automatically delete data or stop storing when it is really high.
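A two-threshold disk check could look like the sketch below. The 80% and 95% thresholds and the action names are illustrative only; what "delete" or "stop storing" means is up to your own pipeline.

```python
import shutil


def disk_action(used_fraction: float, warn_at: float = 0.80,
                stop_at: float = 0.95) -> str:
    """Map a used-space fraction to an action name."""
    if not 0.0 <= used_fraction <= 1.0:
        raise ValueError("used_fraction must be between 0 and 1")
    if used_fraction >= stop_at:
        return "stop_or_delete"  # really high: free space or stop indexing
    if used_fraction >= warn_at:
        return "alarm"           # high: notify someone
    return "ok"


def path_used_fraction(path: str) -> float:
    """Fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


# Check the root filesystem; point this at your data path instead.
print(disk_action(path_used_fraction("/")))
```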
A cluster node may get stuck in a deadlock with an error like
java.io.IOException: failed to read.+file:(.+).st; this can happen after an accidental power loss or disk corruption. Monitor the error log and delete the
Sometimes the host IP changes without Elasticsearch being informed. Detect this and update the setting.
Index allocation status
Shards may fail to be assigned, although it’s rare. Monitor for UNASSIGNED shards and reassign/reroute them.
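A monitor for this can filter the output of `GET /_cat/shards?format=json` for rows in the UNASSIGNED state. In the sketch below the rows are made-up sample data in that shape, and fetching them from the cluster is left out.

```python
from typing import Dict, List


def unassigned_shards(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the _cat/shards rows whose state is UNASSIGNED."""
    return [row for row in rows if row.get("state") == "UNASSIGNED"]


# Made-up sample rows; real rows carry more columns (docs, store, ip, node).
rows = [
    {"index": "logs-2019.02.26", "shard": "0", "prirep": "p", "state": "STARTED"},
    {"index": "logs-2019.02.26", "shard": "0", "prirep": "r", "state": "UNASSIGNED"},
    {"index": "logs-2019.02.27", "shard": "1", "prirep": "p", "state": "STARTED"},
]

for row in unassigned_shards(rows):
    print("unassigned:", row["index"], row["shard"], row["prirep"])
```

Once a shard shows up here, one option is the cluster reroute API, for example `POST /_cluster/reroute?retry_failed=true` to retry allocations that previously failed.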
Auto restart is generally useful; customize your startup/cron script accordingly.
Throughput, latency, GC status, CPU, memory, disk and so on.
Protect your service from the outside world with a firewall, X-Pack, or Search Guard.
Disable the wildcard/regex delete feature unless you are fully confident.
27 Feb 2019