Troubleshooting
The short list of things that go wrong in production, what they look like, and what to do about them. Each symptom points to the metric to check, the log line to look for, and the config key to change.
Symptom: ingest throughput drops after a few hours
Likely cause: merge pressure. Segments are piling up faster than the merger can consolidate them, and the memtable is flushing small segments.
Check:
$ curl -s http://127.0.0.1:8080/v1/metrics | grep -E 'segment_count|merge_duration'
xerj_segment_count{index="logs"} 2847
xerj_merge_duration_seconds_count{index="logs"} 183
xerj_merge_duration_seconds_sum{index="logs"} 1847.2
Fix: raise [merge] max_concurrent (1 → 2–4), raise io_rate_mb_per_sec (100 → 250–500 on NVMe), and raise [storage] flush_size_mb so flushes produce bigger starter segments.
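The two merge counters in the check are easier to read as one number: sum divided by count gives the average merge duration. A minimal sketch, assuming the endpoint returns Prometheus-style text as in the sample output and that only one index is reported. avg_merge_seconds is a hypothetical helper name, not something XERJ ships.

```shell
# avg_merge_seconds (hypothetical helper): read metrics text on stdin and
# print xerj_merge_duration_seconds_sum / _count, i.e. seconds per merge.
# Assumes a single index; with several, the last one seen wins.
avg_merge_seconds() {
  awk '
    /^xerj_merge_duration_seconds_sum/   { sum = $NF }
    /^xerj_merge_duration_seconds_count/ { count = $NF }
    END {
      if (count > 0) printf "%.2f\n", sum / count
      else print "n/a"
    }
  '
}

# Usage:
#   curl -s http://127.0.0.1:8080/v1/metrics | avg_merge_seconds
```

With the sample values above this prints 10.09 (1847.2 s over 183 merges). Watch the trend alongside xerj_segment_count rather than comparing against a fixed threshold.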
Symptom: queries time out under load
Likely cause: too many concurrent queries or a single query that grew too large.
Check: look for "query cancelled: max_query_memory_mb exceeded" in the logs, or for active_searches pegged at max_concurrent_searches.
$ journalctl -u xerj --since "10 min ago" | grep -E 'cancel|timeout|rejected'
Fix: if it's memory — raise [limits] max_query_memory_mb (512 → 1024 or 2048 for aggregation-heavy workloads). If it's concurrency — raise max_concurrent_searches, but check that the host actually has headroom first.
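To tell the two causes apart, it can help to count the matching log lines by kind. A sketch that assumes the memory cancellation message appears verbatim as quoted in the check. count_query_failures is a hypothetical helper, and its catch-all bucket reuses the grep pattern from the command above.

```shell
# count_query_failures (hypothetical helper): read log lines on stdin and
# count memory-limit cancellations separately from other cancel/timeout/
# rejected lines.
count_query_failures() {
  awk '
    /max_query_memory_mb exceeded/ { memory++; next }
    /cancel|timeout|rejected/      { other++ }
    END { printf "memory=%d other=%d\n", memory, other }
  '
}

# Usage:
#   journalctl -u xerj --since "10 min ago" | count_query_failures
```

A high memory count points at max_query_memory_mb. A high count in the other bucket is more consistent with the concurrency limit.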
Symptom: node RAM climbs forever
Likely cause: HNSW index growing without quantization, or too many indices with large flush_size_mb.
Check:
$ curl -s http://127.0.0.1:8080/v1/metrics | grep memory_usage
xerj_memory_usage_bytes 14200000000   # 14 GB and climbing
Fix: set [vector] hnsw_offload_threshold = 1000000 so an index is automatically quantized to scalar4 once it exceeds 1 M vectors. Or lower [storage] flush_size_mb so each memtable holds less in RAM before it flushes.
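"Climbing" is easier to act on as a rate. A minimal sketch that takes two readings of xerj_memory_usage_bytes and the seconds between them. mem_growth_mib_per_hour is a hypothetical helper, and the second reading in the example is made up for illustration.

```shell
# mem_growth_mib_per_hour FIRST_BYTES SECOND_BYTES SECONDS (hypothetical helper)
# Prints the growth between two memory readings, scaled to MiB per hour.
mem_growth_mib_per_hour() {
  awk -v first="$1" -v second="$2" -v secs="$3" 'BEGIN {
    printf "%.1f\n", (second - first) * 3600 / secs / 1048576
  }'
}

# Taking a reading:
#   curl -s http://127.0.0.1:8080/v1/metrics | awk '/^xerj_memory_usage_bytes/ {print $NF}'
#
# Example with an illustrative second reading taken 600 s after the first:
#   mem_growth_mib_per_hour 14200000000 14500000000 600   # prints 1716.6
```

A rate that stays positive across several flush cycles suggests real growth rather than memtables filling and draining.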
Symptom: "WAL replay failed" on restart
Likely cause: the server was killed mid-fsync (power loss, OOM kill). The tail WAL file is torn.
Check: read the first log lines after the restart. "wal replay: truncating torn tail at offset N" is benign (XERJ truncates the torn suffix and continues). A hard "wal replay: checksum mismatch at offset N, refusing to start" is not.
Fix: if truncation worked on its own, there is nothing to do: the last few seconds of writes are lost but the index is consistent. If the server refused to start, run xerj verify --data-dir /var/lib/xerj --repair-wal, which truncates the WAL at the last valid entry.
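The two log messages can be sorted mechanically, which is handy in a restart script. A sketch that matches only the message prefixes quoted above. wal_replay_status and its three output labels are hypothetical.

```shell
# wal_replay_status (hypothetical helper): read boot log lines on stdin and
# print one of:
#   needs-repair  - a checksum mismatch was logged (server refused to start)
#   truncated-ok  - a torn tail was truncated and replay continued
#   clean         - neither message was seen
wal_replay_status() {
  awk '
    /wal replay: checksum mismatch/    { fatal = 1 }
    /wal replay: truncating torn tail/ { truncated = 1 }
    END {
      if (fatal)          print "needs-repair"
      else if (truncated) print "truncated-ok"
      else                print "clean"
    }
  '
}

# Usage (logs from the current boot):
#   journalctl -u xerj -b | wal_replay_status
```

Only needs-repair calls for the xerj verify command above. The other two outcomes need no action.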
Symptom: "disk full" during a merge
Likely cause: no reservation for merge scratch space. A merge of two N-sized segments needs 2N free until the merge completes.
Fix: lower [merge] max_segment_mb so individual merges stay smaller, or free disk. Once the merge retries and succeeds, old segments are unlinked.
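The 2N rule can be checked before a merge runs. A minimal sketch that generalizes it to "free space must cover the sum of the input segment sizes". That generalization is an assumption, and merge_has_headroom is a hypothetical helper.

```shell
# merge_has_headroom FREE_MB SEGMENT_MB... (hypothetical helper)
# Succeeds (exit 0) when FREE_MB is at least the sum of the segment sizes,
# which for two N-sized segments is the 2N scratch space described above.
merge_has_headroom() {
  free_mb=$1
  shift
  needed_mb=0
  for size_mb in "$@"; do
    needed_mb=$((needed_mb + size_mb))
  done
  [ "$free_mb" -ge "$needed_mb" ]
}

# Usage, reading free space on the data volume in MB:
#   free=$(df -Pk /var/lib/xerj | awk 'NR==2 { print int($4 / 1024) }')
#   merge_has_headroom "$free" 2048 2048 || echo "not enough scratch space"
```

The segment sizes passed in are whatever you expect the merger to pick up. Capping them with max_segment_mb keeps the required headroom predictable.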
Symptom: cluster flapping — leader changes every few seconds
Likely cause: network latency between peers is high enough that heartbeats miss. Raft responds by calling a new election.
Check:
$ journalctl -u xerj --since "5 min ago" | grep -E 'term|election|leader'
... raft: election timeout, starting new term 17
... raft: received higher term 18, stepping down
Fix: raise [cluster] tick_ms from 50 to 150 or 250, which gives heartbeats more room on a slow network. Never drop the tick interval below the RTT between your worst pair of nodes.
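The RTT rule is easy to verify with ping. A sketch that parses the summary line printed by common ping implementations ("rtt min/avg/max/mdev = ..." on Linux, "round-trip min/avg/max/stddev = ..." on BSD and macOS) and compares the max against tick_ms. Both helper names are hypothetical, and the peer hostname in the usage comment is a placeholder.

```shell
# max_rtt_ms (hypothetical helper): read ping output on stdin and print the
# max round-trip time in milliseconds from the summary line.
max_rtt_ms() {
  awk -F'/' '/^(rtt|round-trip) / { print $6 }'
}

# tick_covers_rtt TICK_MS RTT_MS (hypothetical helper)
# Succeeds (exit 0) when the tick interval is not below the measured RTT.
tick_covers_rtt() {
  awk -v tick="$1" -v rtt="$2" 'BEGIN { exit !(tick + 0 >= rtt + 0) }'
}

# Usage (peer-2.example is a placeholder for one of your nodes):
#   rtt=$(ping -c 10 peer-2.example | max_rtt_ms)
#   tick_covers_rtt 50 "$rtt" || echo "tick_ms is below RTT ($rtt ms)"
```

Run it against every peer and size tick_ms against the slowest result, since one slow pair is enough to trigger elections.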
Symptom: search returns stale results
Likely cause: recent docs are still in the memtable, and the query is being served by a replica that hasn't replicated them yet.
Fix: pass ?preference=primary on the search query to force routing to the primary shard, or lower [storage] flush_interval_secs so the memtable flushes more often.
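If searches are issued from scripts, the parameter can be appended in one place. A minimal sketch: with_primary_preference is a hypothetical helper, and SEARCH_URL stands for whatever search URL you already call, since no endpoint path is assumed here.

```shell
# with_primary_preference URL (hypothetical helper)
# Prints URL with preference=primary appended, using "&" when the URL
# already has a query string and "?" otherwise.
with_primary_preference() {
  case "$1" in
    *\?*) printf '%s&preference=primary\n' "$1" ;;
    *)    printf '%s?preference=primary\n' "$1" ;;
  esac
}

# Usage (SEARCH_URL is your existing search request URL):
#   curl -s "$(with_primary_preference "$SEARCH_URL")"
```

Routing every search to the primary shifts read load onto it, so this fits best on the requests that must see their own recent writes.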
Getting help
Collect diagnostics before filing a bug. This command builds a self-contained tarball with the config, recent logs, and a metrics snapshot:
$ xerj support-bundle --out /tmp/xerj-support-$(date +%s).tar.gz
wrote /tmp/xerj-support-1745000000.tar.gz (384 KiB)
contents:
  config.toml
  logs/xerj.log.gz
  metrics.txt
  cluster-health.json
  indices-stats.json
Source · engine/crates/server/src/main.rs · engine/crates/common/src/metrics.rs