BENCHMARKS · XERJ V1.0.0-RC.1 VS ELASTICSEARCH 8.13.4 · RUN 2026-07-04

EVERY CELL.
BOTH DIRECTIONS.

One harness, two engines, identical workload: 91 measured dimensions across ingest, every query / aggregation / pipeline family, mixed read-under-write, kNN, and disk. As of 2026-07-04 the score is 43 WIN · 33 LOSE · 15 N/A for XERJ. The losses are published next to the wins — this run was truncated by a search_after defect that the benchmark itself uncovered, and every casualty row is shown below, scored against us.

XERJ WINS
43
Ratio > 1× · lower latency or higher throughput
XERJ LOSSES
33
Published, tracked, and each one fails CI
NOT SCORED
15
Unsupported on one side or result mismatch
ES-YAML CONFORMANCE
1326/0
Passed / failed · guardrail for every perf change
01·SETUP

SAME BOX.
SAME WORK.

Both engines run as a single node on the same machine, security off, queried over localhost. No containers, no network hop, no cluster coordination on either side. Whatever this box gives one engine, it gives the other.

MACHINE
AMD Ryzen AI Max+ 395 (w/ Radeon 8060S) · 32 hardware threads (nproc) · 119 GiB RAM (free -g) · Linux
XERJ
xerj v1.0.0-rc.1 · release build (cargo build --release) · --insecure, fresh data dir · port :9200
ELASTICSEARCH
8.13.4 official tarball · xpack.security.enabled: false, discovery.type: single-node · 4 GB heap (-Xms4g -Xmx4g) · port :9201
CORPUS
Real LLM-telemetry events (model, status, latency_ms, cost_usd, tokens, tenant, timestamps) — not synthetic filler. Reads run against 1M docs; ingest is measured at 100k and 1M docs with 1 and 8 concurrent clients.
TOPOLOGY
1 node vs 1 node · same machine · localhost · identical request bodies, byte for byte
02·METHODOLOGY

DESIGNED TO BE
HARD TO GAME.

The harness is demo/playbooks/bench-matrix.mjs — one file, Node builtins only, checked into the repo. Its rules:

OPEN-LOOP LOAD
Read requests fire on a fixed 200 req/s cadence at t0 + i/rate, independent of when earlier responses return. A slow engine cannot slow the clock down and flatter its own tail.
SAMPLING
Per family: 15 untimed warmup calls, then p50 over 120 timed iterations. Each request has a 15 s bound — an engine that hangs is recorded as collapsed and scored, not silently dropped.
FEASIBILITY PROBE
Every family is probed first. A 4xx classifies it unsupported for that engine; the probe's result signal (hit totals, agg shape) is captured for the correctness check.
RESULT-SIGNAL MISMATCH
If the two engines return materially different results (e.g. 0 hits vs 277,449), the row is scored N/A (mismatch): an engine that returns wrong or empty results cannot win on latency.
IDENTICAL WORK
track_total_hits: true is injected into every _search body on both sides. ES caps hit totals at 10,000 by default; forcing exact totals means neither engine wins by short-circuiting the count.
VERDICTS
Ratios are normalized so >1× always means XERJ is better (lower latency, higher docs/s, smaller disk, higher recall). Any LOSE row makes the runner exit non-zero — the scorecard is a CI gate, not a brochure.
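The open-loop rule above can be sketched as follows. This is a minimal illustration, not the code in bench-matrix.mjs: the names fireOffsetMs, timed, and runOpenLoop are hypothetical, and two details are assumptions (latency is measured from the moment of launch, and a failed or timed-out request is recorded as null).

```javascript
// Hypothetical sketch of open-loop scheduling; not the code in bench-matrix.mjs.

// Offset in ms from t0 at which request i is due (t0 + i/rate).
function fireOffsetMs(i, ratePerSec) {
  return (i * 1000) / ratePerSec;
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Resolve to the request latency in ms, or null if it fails or exceeds timeoutMs.
// Assumption: latency is measured from the actual launch time.
function timed(send, i, timeoutMs) {
  const started = performance.now();
  let timer;
  const bound = new Promise((resolve) => {
    timer = setTimeout(() => resolve(null), timeoutMs);
  });
  const request = Promise.resolve()
    .then(() => send(i))
    .then(() => performance.now() - started, () => null);
  return Promise.race([request, bound]).finally(() => clearTimeout(timer));
}

// Launch `count` requests on a fixed cadence. Each launch waits only for the
// clock, never for an earlier response, so a slow engine cannot stretch the schedule.
async function runOpenLoop(send, { count = 120, ratePerSec = 200, timeoutMs = 15000 } = {}) {
  const t0 = performance.now();
  const inFlight = [];
  for (let i = 0; i < count; i++) {
    const wait = t0 + fireOffsetMs(i, ratePerSec) - performance.now();
    if (wait > 0) await sleep(wait);
    inFlight.push(timed(send, i, timeoutMs)); // launched, not awaited
  }
  return Promise.all(inFlight);
}
```

At 200 req/s, request 1 is due 5 ms after t0 and request 200 is due at 1,000 ms, whatever the engine is doing at the time. A closed loop would instead wait for each response before sending the next, which lets a stalled engine hide its own tail.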
HOW WE ALMOST FOOLED OURSELVES·THE QUERY-CACHE MIRAGE · 2026-07-01

Earlier head-to-heads showed XERJ winning reads 1.3–2.2× — and the numbers were a mirage. Those benchmarks repeated the same query against a static index, so XERJ's result cache served every call after the first. We were measuring cache hits, not query execution: uncached, a match_all size:10 actually took 2.28 seconds, because hit materialization scanned every matching document instead of the top from+size. We published the finding (demo/playbooks/CRITICAL_FINDING_read_perf_cache_mirage.md), fixed the O(N) path to O(from+size), and hardened the harness with the mismatch detection and identical-work rules above. That is the point of printing the LOSE column: a benchmark that can embarrass you is the only kind that can be trusted when it doesn't.
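The two hardening rules named above (identical work and mismatch detection) might look roughly like this. All function names are hypothetical, and the 1% tolerance is an assumed placeholder: this page does not state how the harness defines "materially different".

```javascript
// Hypothetical helpers; names and the tolerance are illustrative assumptions.

// Return a copy of a _search body with exact hit counting forced on,
// so neither engine can stop counting early.
function withExactTotals(body) {
  return { ...body, track_total_hits: true };
}

// Pull the hit total out of a _search response. Elasticsearch reports
// hits.total as { value, relation }; a bare number is also accepted here.
function hitTotal(response) {
  const total = response?.hits?.total;
  return typeof total === "number" ? total : total?.value ?? null;
}

// Treat two totals as mismatched when they differ by more than `tolerance`
// (relative to the larger one). The 1% default is an assumption.
function isMismatch(a, b, tolerance = 0.01) {
  if (a === null || b === null) return true;
  const larger = Math.max(a, b);
  if (larger === 0) return false;
  return Math.abs(a - b) / larger > tolerance;
}
```

With these definitions, the multi_match row (0 hits vs 277,449) is a mismatch under any reasonable tolerance, so its latency is never compared.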

03·FULL RESULTS · 2026-07-04 · NO CHERRY-PICKING

THE WHOLE
SCORECARD.

All 91 rows from demo/playbooks/SCORECARD.md, unedited. Latency rows are p50 in milliseconds (p99 for the mixed read-under-write rows); ingest rows are docs/s. Read this run's caveat first: partway through, the search_after family (its 9,924 ms row below) triggered a defect that drove XERJ to an out-of-memory kill. Rows recorded as collapsed or unsupported on the XERJ side after that point are casualties of that crash: the engine was down or dying when those families ran, so they do not show that XERJ is slower on those families. The collapsed rows are scored LOSE anyway, because an engine that dies mid-benchmark loses those rows; the unsupported rows are scored N/A, as the table shows. A fix is in flight (see What's Next); the matrix will be re-run and this page updated when it lands.
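For readers unsure how a row ends up as a number or as "collapsed", here is a minimal sketch of the summarizing step. The names p50 and summarize are hypothetical, and the rule that a single timed-out sample marks the whole family as collapsed is an assumption, not a documented harness behavior.

```javascript
// Hypothetical sketch; not the code in bench-matrix.mjs.

// Median (p50) of a non-empty array of latencies in ms.
function p50(samples) {
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Summarize one family's timed samples. A null sample stands for a request
// that hit the 15 s bound. Assumption: any such sample marks the family as collapsed.
function summarize(samples) {
  if (samples.some((s) => s === null)) return { status: "collapsed" };
  return { status: "ok", p50Ms: p50(samples) };
}
```

A collapsed family still produces a row; it simply has no latency to put in the ratio column.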

| DIMENSION | XERJ | ELASTICSEARCH | RATIO | VERDICT |
| --- | --- | --- | --- | --- |
| ingest 100k × c1 (docs/s) | 64,451 | 48,979 | 1.32× | WIN |
| ingest 100k × c8 (docs/s) | 233,602 | 199,589 | 1.17× | WIN |
| ingest 1m × c1 (docs/s) | 65,225 | 69,328 | 0.94× | LOSE |
| ingest 1m × c8 (docs/s) | 205,273 | 374,490 | 0.55× | LOSE |
| read q: match_all (p50 ms) | 2.14 | 5.07 | 2.37× | WIN |
| read q: match_none (p50 ms) | 2.71 | 3.10 | 1.14× | WIN |
| read q: match(model) (p50 ms) | 1.93 | 4.58 | 2.37× | WIN |
| read q: match_phrase(top_doc) (p50 ms) | 1.87 | 4.39 | 2.35× | WIN |
| read q: match_phrase_prefix (p50 ms) | 2.80 | unsupported (400) | | N/A |
| read q: match_bool_prefix (p50 ms) | 1.93 | 6.34 | 3.29× | WIN |
| read q: multi_match (p50 ms), result mismatch: hits 0 vs 277449 | 2.04 | 7.32 | mismatch | N/A |
| read q: combined_fields (p50 ms) | 2.05 | unsupported (400) | | N/A |
| read q: query_string (p50 ms), result mismatch: hits 0 vs 273706 | 2.30 | 11.89 | mismatch | N/A |
| read q: simple_query_string (p50 ms), result mismatch: hits 0 vs 987529 | 2.11 | 3.15 | mismatch | N/A |
| read q: more_like_this (p50 ms) | 1.78 | 3.17 | 1.78× | WIN |
| read q: term(status) (p50 ms) | 2.51 | 2.65 | 1.06× | WIN |
| read q: terms(model) (p50 ms) | 2.23 | 6.42 | 2.88× | WIN |
| read q: range(latency_ms) (p50 ms) | 1.95 | 6.12 | 3.14× | WIN |
| read q: range(@timestamp) (p50 ms) | 2.24 | 2.63 | 1.17× | WIN |
| read q: range(cost_usd) (p50 ms) | 1.14 | 6.54 | 5.75× | WIN |
| read q: prefix(model) (p50 ms) | 1.71 | 11.00 | 6.43× | WIN |
| read q: wildcard(model) (p50 ms) | 1.40 | 10.42 | 7.43× | WIN |
| read q: regexp(model) (p50 ms) | collapsed | 11.38 | | LOSE |
| read q: fuzzy(model) (p50 ms), result mismatch: hits 0 vs 277449 | 1.62 | 3.11 | mismatch | N/A |
| read q: exists(cost_usd) (p50 ms) | 1.41 | 2.50 | 1.77× | WIN |
| read q: ids (p50 ms) | 1.10 | 2.24 | 2.03× | WIN |
| read q: term(cache_hit) (p50 ms) | 1.25 | 2.37 | 1.90× | WIN |
| read q: bool must+filter+should+must_not (p50 ms) | 0.82 | 20.54 | 25.10× | WIN |
| read q: constant_score (p50 ms) | 1.36 | 2.27 | 1.67× | WIN |
| read q: boosting (p50 ms) | 0.41 | 35.77 | 87.78× | WIN |
| read q: dis_max (p50 ms) | 1.49 | 6.27 | 4.20× | WIN |
| read q: function_score (p50 ms) | 1.15 | 35.40 | 30.84× | WIN |
| read q: pinned (p50 ms) | 0.98 | 15.67 | 15.99× | WIN |
| read agg: avg (p50 ms) | 0.89 | 2.66 | 2.97× | WIN |
| read agg: sum (p50 ms) | 1.21 | 2.57 | 2.12× | WIN |
| read agg: min (p50 ms) | 1.11 | 2.24 | 2.01× | WIN |
| read agg: max (p50 ms) | 0.62 | 2.75 | 4.47× | WIN |
| read agg: stats (p50 ms) | 1.01 | 2.53 | 2.51× | WIN |
| read agg: extended_stats (p50 ms) | 0.75 | 2.49 | 3.33× | WIN |
| read agg: value_count (p50 ms) | 1.21 | 2.18 | 1.81× | WIN |
| read agg: cardinality (p50 ms) | 0.98 | 2.52 | 2.58× | WIN |
| read agg: percentiles (p50 ms) | 1.16 | 2.05 | 1.77× | WIN |
| read agg: percentile_ranks (p50 ms) | 1.19 | 2.33 | 1.97× | WIN |
| read agg: median_absolute_deviation (p50 ms) | 2.46 | 2.79 | 1.13× | WIN |
| read agg: matrix_stats (p50 ms) | 2.09 | 2.61 | 1.25× | WIN |
| read agg: scripted_metric (p50 ms) | 2.05 | 2.13 | 1.04× | WIN |
| read agg: top_hits (sub) (p50 ms) | collapsed | 2.96 | | LOSE |
| read agg: terms (p50 ms) | collapsed | 2.12 | | LOSE |
| read agg: rare_terms (p50 ms) | 2.01 | 1.96 | 0.97× | LOSE |
| read agg: significant_terms (p50 ms) | 1.55 | 2.03 | 1.31× | WIN |
| read agg: histogram (p50 ms) | 2.66 | 2.71 | 1.02× | WIN |
| read agg: date_histogram (p50 ms) | 2.31 | 2.53 | 1.10× | WIN |
| read agg: auto_date_histogram (p50 ms) | 2.16 | 2.10 | 0.97× | LOSE |
| read agg: variable_width_histogram (p50 ms) | collapsed | 3.56 | | LOSE |
| read agg: range (p50 ms) | collapsed | 2.30 | | LOSE |
| read agg: date_range (p50 ms) | collapsed | 3.18 | | LOSE |
| read agg: filter (p50 ms) | collapsed | 2.59 | | LOSE |
| read agg: filters (p50 ms) | collapsed | 3.25 | | LOSE |
| read agg: missing (p50 ms) | 1.21 | 2.89 | 2.39× | WIN |
| read agg: global (p50 ms), result mismatch: hits 257 vs 12471 | 1.20 | 2.77 | mismatch | N/A |
| read agg: adjacency_matrix (p50 ms) | collapsed | 3.10 | | LOSE |
| read agg: composite (p50 ms) | collapsed | 2.94 | | LOSE |
| read agg: random_sampler (p50 ms) | collapsed | 6.48 | | LOSE |
| read agg: terms+avg(cost) (p50 ms) | collapsed | 2.62 | | LOSE |
| read pipe: sum_bucket (p50 ms) | collapsed | 3.08 | | LOSE |
| read pipe: avg_bucket (p50 ms) | collapsed | 3.00 | | LOSE |
| read pipe: max_bucket (p50 ms) | collapsed | 3.34 | | LOSE |
| read pipe: stats_bucket (p50 ms) | collapsed | 2.68 | | LOSE |
| read pipe: percentiles_bucket (p50 ms) | collapsed | 2.87 | | LOSE |
| read pipe: derivative (p50 ms) | collapsed | 3.56 | | LOSE |
| read pipe: cumulative_sum (p50 ms) | collapsed | 2.53 | | LOSE |
| read pipe: moving_fn (p50 ms) | collapsed | 3.84 | | LOSE |
| read pipe: serial_diff (p50 ms) | collapsed | 3.25 | | LOSE |
| read pipe: bucket_script (p50 ms) | collapsed | 3.26 | | LOSE |
| read pipe: bucket_selector (p50 ms) | collapsed | 3.35 | | LOSE |
| read pipe: bucket_sort (p50 ms) | collapsed | 3.59 | | LOSE |
| read feat: sort-heavy (p50 ms) | 2.19 | 20.04 | 9.13× | WIN |
| read feat: deep from+size (from 500) (p50 ms) | 2.36 | 3.43 | 1.45× | WIN |
| read feat: search_after (p50 ms) | 9924.66 | 18.76 | 0.00× | LOSE |
| read feat: highlight (p50 ms) | collapsed | 2.94 | | LOSE |
| read feat: _count (p50 ms) | collapsed | 1.74 | | LOSE |
| read feat: _msearch (p50 ms) | collapsed | 2.34 | | LOSE |
| read feat: _mget (p50 ms) | collapsed | 2.27 | | LOSE |
| mixed match_all (p99 ms, under write) | unsupported | 20.22 | | N/A |
| mixed bool (p99 ms, under write) | unsupported | 17.36 | | N/A |
| mixed range (p99 ms, under write) | unsupported | 37.85 | | N/A |
| mixed terms (p99 ms, under write) | unsupported | 6.15 | | N/A |
| mixed cardinality (p99 ms, under write) | unsupported | 21.44 | | N/A |
| kNN k=10 (p50 ms) | error | 4.84 | | N/A |
| kNN recall@10 | error | 100.0% | | N/A |
| index on-disk size | unsupported | 819.4 MB | | N/A |

Beyond the crash casualties, the genuine performance losses in this run are: ingest at 1M docs with 1 client (0.94×), ingest at 1M docs with 8 clients (0.55×), rare_terms (0.97×), auto_date_histogram (0.97×), and search_after itself (0.00×). All five are tracked work items.
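The ratios quoted here follow from the normalization rule in the methodology: divide so that a value above 1× always favors XERJ. A small sketch, with hypothetical function names and one assumption (a ratio of exactly 1 is treated as LOSE, since only values above 1× are defined as better):

```javascript
// Hypothetical sketch of ratio normalization; names are illustrative.

// higherIsBetter: true for docs/s and recall, false for latency and disk size.
function normalizedRatio(xerj, es, higherIsBetter) {
  return higherIsBetter ? xerj / es : es / xerj;
}

// Assumption: only a ratio strictly above 1 counts as a WIN.
function verdictFor(ratio) {
  return ratio > 1 ? "WIN" : "LOSE";
}

// Worked rows from the table above.
const ingest1mC8 = normalizedRatio(205273, 374490, true); // throughput: XERJ / ES
const matchAll = normalizedRatio(2.14, 5.07, false);      // latency: ES / XERJ
console.log(ingest1mC8.toFixed(2), verdictFor(ingest1mC8));
console.log(matchAll.toFixed(2), verdictFor(matchAll));
```

Flipping the division for latency is what lets one column carry both throughput and latency rows without a reader having to remember which direction is good.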

04·REPRODUCE IT

FOUR COMMANDS.
YOUR MACHINE.

Everything on this page regenerates from the repo. No hosted harness, no private dataset, no hand-tuned engine flags.

$ git clone https://github.com/xerj-org/xerj && cd xerj
$ cargo build --release --manifest-path engine/Cargo.toml
$ bash scratchpad/es_up.sh
$ bash scratchpad/run_scorecard.sh --docs 100k,1m --clients 1,8 --knn --mixed
es_up.sh
Downloads the official Elasticsearch 8.13.4 linux-x86_64 tarball (cached after the first run), writes a single-node config with security off on port :9201, boots it with a 4 GB heap, and polls until it answers 200. Idempotent — re-running is a no-op if ES is already up.
run_scorecard.sh
Boots the release XERJ binary on :9200 with a fresh data dir, keeps it alive for the whole run, then executes the matrix against both engines and shuts XERJ down. Exits non-zero if any row is a LOSE.
HARNESS
demo/playbooks/bench-matrix.mjs — the matrix runner and scorecard generator (Node 24, no dependencies). Output lands in demo/playbooks/SCORECARD.md, which is the exact file this page's table is rendered from.

Your absolute numbers will differ with hardware; the ratios and verdicts are the claim. If your run disagrees with this page, file an issue with your SCORECARD.md — that is precisely what the harness is for.
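When comparing a local run against this page, two checks are useful: whether the run would fail the CI gate, and which verdicts changed. The sketch below models a run as a Map from dimension name to verdict. That shape, the function names, and the exit status of 1 are assumptions for illustration; the page only says the runner exits non-zero, and parsing SCORECARD.md is left out.

```javascript
// Hypothetical helpers; a run is modelled as a Map of dimension -> verdict.

// Exit status for a run: non-zero when any row is a LOSE.
// Assumption: the non-zero status is 1.
function exitCodeFor(run) {
  return [...run.values()].includes("LOSE") ? 1 : 0;
}

// Rows whose verdict differs between the published run and a local run.
function diffVerdicts(published, local) {
  const changed = [];
  for (const [dimension, verdict] of published) {
    const other = local.has(dimension) ? local.get(dimension) : "missing";
    if (other !== verdict) changed.push({ dimension, published: verdict, local: other });
  }
  return changed;
}
```

A non-empty result from diffVerdicts is the kind of disagreement worth attaching to an issue alongside the SCORECARD.md that produced it.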

05·KNOWN ISSUES & WHAT'S NEXT

THE RED CELLS
ARE THE ROADMAP.

search_after OOM — found by this benchmark

The deep-pagination search_after family exposed a defect that ballooned memory until the kernel killed XERJ mid-run — the single largest distortion in this scorecard (28 of the 33 LOSE rows are its casualties). The fix is in flight on a dedicated branch; the matrix re-runs, and this page is republished, when it lands. Finding this class of bug is what the benchmark exists to do.

Ingest at 1M docs — the real performance gap

XERJ wins ingest at 100k docs (1.32× single-client, 1.17× at 8 clients) but loses at 1M: 0.94× single-client and 0.55× at 8 clients. The phased plan (demo/playbooks/BEAT_ES_MASTER_PLAN.md): a no-reparse flush that threads already-parsed documents into segment building instead of re-parsing them twice; search-pool isolation so background flush and merge can't starve foreground queries; and a freeze-and-swap flush that swaps in a fresh memtable atomically so writers never stall behind a drain. Definition of done: every cell in the scorecard green, enforced by CI — any new LOSE fails the build.
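The freeze-and-swap idea in that plan can be illustrated with a toy model. This is a conceptual sketch only, not XERJ's engine code: the Memtable and Writer classes and the flushSegment callback are hypothetical, and in a multi-threaded engine the swap would need an atomic pointer exchange or a short lock, which single-threaded JavaScript gets for free.

```javascript
// Conceptual sketch of freeze-and-swap; all names are hypothetical.

class Memtable {
  constructor() {
    this.docs = [];
    this.frozen = false;
  }
  add(doc) {
    if (this.frozen) throw new Error("memtable is frozen");
    this.docs.push(doc);
  }
}

class Writer {
  // flushSegment: hypothetical callback that turns a batch of docs into a segment.
  constructor(flushSegment) {
    this.active = new Memtable();
    this.flushSegment = flushSegment;
  }
  write(doc) {
    this.active.add(doc);
  }
  // Freeze the current memtable, swap in a fresh one, then drain the frozen one.
  // Writes that arrive during the drain land in the new memtable.
  flush() {
    const frozen = this.active;
    frozen.frozen = true;
    this.active = new Memtable();
    return this.flushSegment(frozen.docs);
  }
}
```

The point of the swap is ordering: the fresh memtable is in place before the drain starts, so a writer never has to wait for the drain to finish.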

Families this corpus can't measure

Skipped, not hidden — each needs a purpose-built index the flat telemetry corpus lacks: geo_* queries and aggregations (no geo_point/geo_shape field), ip_range / ip_prefix (no ip field), nested / has_child / has_parent (flat corpus, no join mapping), span_* (needs a positional text field), significant_text (corpus fields are keyword), semantic / hybrid retriever (needs a dense_vector field — kNN is covered separately by --knn on a purpose-built index), and percolate (parses but no-ops — not benchmarkable for correctness). Purpose-built corpora for these families are planned follow-ups.

The standing guardrail

Every performance change must hold ES-YAML REST conformance at 1326 passed / 0 failed. Speed bought with correctness is not a win, and it does not merge.

06·CHANGELOG
2026-07-04
First public full-matrix publication. XERJ v1.0.0-rc.1 vs Elasticsearch 8.13.4 · 91 dimensions · 43 WIN / 33 LOSE / 15 N/A · run truncated by the search_after defect documented above.