Elasticsearch Best Practice

Here is a checklist of Elasticsearch must-knows that I collected over the past year.

Elasticsearch version
More recent releases generally perform better. In our tests, indexing performance on 5.6.10 was ~20% higher than on 5.1.1.
Memory per node
Do not go over 32GB of JVM heap; 31GB is likely safe.
Number of nodes
Two 16GB nodes are generally better than one 32GB node, and two 8GB nodes are generally better than one 16GB node, but not always. Load test the scenarios you care about.
Number of shards
More shards generally mean faster queries but slower aggregations.
Try to avoid very large shards.
Index bulk size and thread count
Load test with your expected data; start with a bulk size of 500-1000 documents across several threads.
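One way to set up such a load test is to cut the documents into fixed-size batches and hand each batch to a worker thread. The following is a minimal sketch, not a definitive client: send_bulk is a hypothetical placeholder that only counts documents so the code runs without a cluster, and the defaults (500 documents, 4 threads) are just starting points to tune.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def chunked(docs, bulk_size):
    """Yield consecutive slices of at most bulk_size documents."""
    for start in range(0, len(docs), bulk_size):
        yield docs[start:start + bulk_size]


def bulk_body(batch, index):
    """Build the newline-delimited JSON body of one _bulk request."""
    lines = []
    for doc in batch:
        # On 5.x/6.x the action line also needs a "_type" entry.
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


def send_bulk(body):
    """Hypothetical sender: replace with a real POST to /_bulk."""
    return body.count("\n") // 2  # two lines per document


def index_all(docs, index, bulk_size=500, threads=4):
    """Send all batches from a thread pool; return the document count."""
    bodies = [bulk_body(batch, index) for batch in chunked(docs, bulk_size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return sum(pool.map(send_bulk, bodies))
```

Timing index_all for a few combinations of bulk_size and threads against a test cluster gives a first picture of where throughput stops improving.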
Refresh interval
A lower interval makes new documents searchable sooner but lowers indexing performance.
Translog durability
Using async durability with a longer sync interval increases indexing speed, at the risk of losing data.
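The refresh and translog trade-offs above map to three index settings. A minimal sketch of a settings body for a write-heavy index follows; the 30s values are illustrative assumptions, not recommendations, and depending on the version some of these settings may only be accepted at index creation.

```python
import json


def ingest_settings(refresh="30s", sync_interval="30s"):
    """Settings body that favours indexing speed over freshness and durability."""
    return {
        "settings": {
            # How often new documents become visible to search.
            "index.refresh_interval": refresh,
            # Fsync the translog in the background instead of per request.
            "index.translog.durability": "async",
            # Writes acknowledged within this window can be lost on a crash.
            "index.translog.sync_interval": sync_interval,
        }
    }


print(json.dumps(ingest_settings(), indent=2))
```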
Merge segments
Fully merge the segments of your read-only indices.
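A full merge is requested through the _forcemerge endpoint with max_num_segments=1. The sketch below only builds the URL to POST to; the base URL and index name in the usage line are assumptions for illustration.

```python
from urllib.parse import quote, urlencode


def force_merge_url(base_url, index, max_num_segments=1):
    """Return the URL to POST to in order to force merge one index."""
    query = urlencode({"max_num_segments": max_num_segments})
    return f"{base_url.rstrip('/')}/{quote(index, safe='')}/_forcemerge?{query}"


print(force_merge_url("http://localhost:9200", "logs-2019.02"))
```

Run it only against indices that no longer receive writes, since a force merge is expensive and new writes would create fresh segments again.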
Codec
Choose a proper codec; best_compression has the best compression ratio but the slowest speed.
Query
Split a huge range query into smaller ones to save memory and response time, unless you need all the results at the same time.
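Splitting can be as simple as walking the range in fixed steps and issuing one range query per window. A minimal sketch, assuming a date field named @timestamp (a hypothetical name) and a 6-hour window chosen only for illustration:

```python
from datetime import datetime, timedelta


def split_range(start, end, step):
    """Yield half-open [lo, hi) windows that together cover [start, end)."""
    lo = start
    while lo < end:
        hi = min(lo + step, end)
        yield lo, hi
        lo = hi


def range_query(field, lo, hi):
    """Build a range query body for one window."""
    return {
        "query": {
            "range": {field: {"gte": lo.isoformat(), "lt": hi.isoformat()}}
        }
    }


queries = [
    range_query("@timestamp", lo, hi)
    for lo, hi in split_range(
        datetime(2019, 2, 1), datetime(2019, 2, 2), timedelta(hours=6)
    )
]
```

Using gte on the lower bound and lt on the upper bound keeps the windows from overlapping, so no document is returned twice.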
Field type
Use text if you need full-text search; otherwise keyword is generally good enough.
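In a mapping this comes down to the type given to each field. The sketch below builds only the properties part; the field names are hypothetical, and where this object nests under "mappings" depends on the version (5.x and 6.x put it under a type name).

```python
def field_mapping(full_text_fields, exact_fields):
    """Map analysed fields to text and exact-match fields to keyword."""
    props = {name: {"type": "text"} for name in full_text_fields}
    props.update({name: {"type": "keyword"} for name in exact_fields})
    return {"properties": props}


mapping = field_mapping(["message"], ["host", "status"])
```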
RAID/SSD
Use SSD in hot data zone if possible.
RAID can also boost speed. In our tests, a cluster of nodes with 10-disk RAID 5 arrays performed 10-20% better than nodes using raw multiple data paths with replicas enabled.
CPU binding
If Elasticsearch is not the only CPU-heavy program on the server, try to bind processes to specific CPUs, e.g. via taskset.
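As a rough illustration, the taskset invocation for an already running process can be assembled like this. It is a sketch that only builds the argument list; the PID and CPU numbers are assumptions, and -a is included so the affinity applies to all threads of the JVM rather than just the main one.

```python
def taskset_command(pid, cpus):
    """Build a taskset call that pins every thread of pid to the given CPUs."""
    cpu_list = ",".join(str(cpu) for cpu in sorted(set(cpus)))
    return ["taskset", "-a", "-cp", cpu_list, str(pid)]


# Hypothetical PID; pass the list to subprocess.run on a Linux host.
print(" ".join(taskset_command(1234, [0, 1, 2, 3])))
```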

Garbage Collectors
Compare GC algorithms by testing them. We first used G1 and then switched back to CMS, because nodes running G1 sometimes just crashed.
Off-heap
Limit off-heap usage with -XX:MaxDirectMemorySize.
OS
Disable or minimize swapping.
Remove or raise the memlock and nofile limits.
Enable core dumps, but don’t forget to clean up old ones.
Aggregation bucket size
Limit the bucket size, especially in multi-level nested aggregations, to avoid a combinatorial explosion of buckets.
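The explosion is easy to see by multiplying the size of each level: the number of innermost buckets can reach the product of all sizes. A minimal sketch with hypothetical field names and sizes (math.prod needs Python 3.8+):

```python
from math import prod


def nested_terms(levels):
    """Build nested terms aggregations from (field, size) pairs, outermost first."""
    agg = None
    for field, size in reversed(levels):
        node = {"terms": {"field": field, "size": size}}
        if agg is not None:
            node["aggs"] = agg
        agg = {f"by_{field}": node}
    return agg


def worst_case_leaf_buckets(levels):
    """Upper bound on innermost buckets: the product of all sizes."""
    return prod(size for _, size in levels)


levels = [("country", 10), ("city", 20), ("user", 50)]
body = {"size": 0, "aggs": nested_terms(levels)}
print(worst_case_leaf_buckets(levels))  # 10 * 20 * 50 = 10000
```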
Sniffing
Enable sniffing so that the client spreads indexing requests across the cluster nodes.
Connection
Use long-lived connections, and set up a retry and reconnect strategy.
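A retry strategy can be sketched independently of any client library. In the example below, fn stands for any call that may raise ConnectionError and rebuild is a hypothetical hook that recreates the connection; the attempt count and delays are illustrative.

```python
import time


def call_with_retry(fn, attempts=3, base_delay=0.5, rebuild=None, sleep=time.sleep):
    """Call fn, retrying on ConnectionError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            if rebuild is not None:
                rebuild()  # e.g. drop and recreate the long-lived connection
            sleep(base_delay * 2 ** attempt)
```

The sleep parameter is injectable only so the backoff can be tested without waiting.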
Disk space
Monitor disk space usage. Raise an alarm when it is high, and automatically delete old data or stop storing new data when it is really high.
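The two-level policy can be sketched with the standard library alone. The thresholds (80% and 95%) are assumptions for illustration; what "alarm" and "stop" actually trigger is left to your monitoring setup.

```python
import shutil


def used_fraction(path):
    """Return the used share of the filesystem holding path, from 0.0 to 1.0."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def disk_action(fraction, warn_at=0.80, stop_at=0.95):
    """Map disk usage to an action: ok, alarm, or stop."""
    if fraction >= stop_at:
        return "stop"  # delete old indices or stop storing new data
    if fraction >= warn_at:
        return "alarm"
    return "ok"
```

A cron job could call disk_action(used_fraction(data_path)) for each data path and act on the result.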
Deadlock
A cluster node may get stuck on an error like java.io.IOException: failed to read.+file:(.+).st. This can happen after an accidental power loss or disk corruption. Monitor the error log and delete the .st file.
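The pattern above can be turned into a small log scanner. This is a sketch that assumes the log is available as an iterable of lines; it uses the same pattern with the literal dots escaped and only reports the file paths, leaving the deletion to the operator.

```python
import re

# Same pattern as in the text, with the literal dots escaped.
STATE_READ_ERROR = re.compile(
    r"java\.io\.IOException: failed to read.+file:(.+)\.st"
)


def corrupt_state_files(log_lines):
    """Return the .st file paths named in matching error lines."""
    found = []
    for line in log_lines:
        match = STATE_READ_ERROR.search(line)
        if match:
            found.append(match.group(1) + ".st")
    return found
```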
IP
Sometimes the host IP changes without Elasticsearch being informed. Detect this and update the settings.
Index allocate status
Shards may fail to be assigned, although this is rare. Monitor for UNASSIGNED shards and reassign/reroute them.
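A monitor can poll _cat/shards?format=json and filter on the state column. The sketch below assumes the response has already been fetched and parsed into a list of dicts; the sample rows are made up for illustration.

```python
def unassigned_shards(cat_shards_rows):
    """Return (index, shard, prirep) for every UNASSIGNED row."""
    return [
        (row["index"], row["shard"], row["prirep"])
        for row in cat_shards_rows
        if row.get("state") == "UNASSIGNED"
    ]


# Made-up rows in the shape of a parsed _cat/shards?format=json response.
rows = [
    {"index": "logs", "shard": "0", "prirep": "p", "state": "STARTED"},
    {"index": "logs", "shard": "0", "prirep": "r", "state": "UNASSIGNED"},
]
print(unassigned_shards(rows))
```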
Auto restart
Auto restart is generally useful; customize your startup/cron script for it.
Other monitoring
Throughput, latency, GC status, CPU, memory, disk and so on.

920X/930X
Protect your service from the outside world with a firewall, X-Pack, or Search Guard.
Wildcards
Disable the wildcard/regex delete feature unless you are fully confident in what you are doing.
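One setting that covers this is action.destructive_requires_name, which makes destructive index operations require explicit index names. A minimal sketch of the body to PUT to _cluster/settings follows; making it persistent rather than transient is an assumption.

```python
import json


def require_explicit_names():
    """Cluster settings body that rejects wildcard and _all index deletes."""
    return {"persistent": {"action.destructive_requires_name": True}}


print(json.dumps(require_explicit_names()))
```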

27 Feb 2019