BENCHMARKS · XERJ V1.0.0-RC.1 VS ELASTICSEARCH 8.13.4 · RUN 2026-07-04

EVERY CELL.
BOTH DIRECTIONS.

One harness, two engines, identical workload: 91 measured dimensions across ingest, every query / aggregation / pipeline family, mixed read-under-write, kNN, and disk. As of 2026-07-04 the score is 43 WIN · 33 LOSE · 15 N/A for XERJ. The losses are published next to the wins — this run was truncated by a search_after defect that the benchmark itself uncovered, and every casualty row is shown below, scored against us.

XERJ WINS

43

Ratio > 1× · lower latency or higher throughput

XERJ LOSSES

33

Published, tracked, and each one fails CI

NOT SCORED

15

Unsupported or errored on one side, or result mismatch

ES-YAML CONFORMANCE

1326/0

Passed / failed · guardrail for every perf change

01·SETUP

SAME BOX.
SAME WORK.

Both engines run as a single node on the same machine, security off, queried over localhost. No containers, no network hop, no cluster coordination on either side. Whatever this box gives one engine, it gives the other.

MACHINE

AMD Ryzen AI Max+ 395 (w/ Radeon 8060S) · 32 hardware threads (nproc) · 119 GiB RAM (free -g) · Linux

XERJ

xerj v1.0.0-rc.1 · release build (cargo build --release) · --insecure, fresh data dir · port :9200

ELASTICSEARCH

8.13.4 official tarball · xpack.security.enabled: false, discovery.type: single-node · 4 GB heap (-Xms4g -Xmx4g) · port :9201

CORPUS

Real LLM-telemetry events (model, status, latency_ms, cost_usd, tokens, tenant, timestamps) — not synthetic filler. Reads run against 1M docs; ingest is measured at 100k and 1M docs with 1 and 8 concurrent clients.

TOPOLOGY

1 node vs 1 node · same machine · localhost · identical request bodies, byte for byte

02·METHODOLOGY

DESIGNED TO BE
HARD TO GAME.

The harness is demo/playbooks/bench-matrix.mjs — one file, Node builtins only, checked into the repo. Its rules:

OPEN-LOOP LOAD

Read requests fire on a fixed 200 req/s cadence at t0 + i/rate, independent of when earlier responses return. A slow engine cannot slow the clock down and flatter its own tail.
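A minimal sketch of the open-loop idea, not the harness's actual code: send times are derived from t0 and the rate alone. The names scheduledOffsetsMs, runOpenLoop and send are hypothetical, and measuring latency from the intended send time (rather than the actual one) is an assumption about how tail latency would be kept honest.

```javascript
// Open-loop schedule: request i is due at t0 + i/rate, regardless of
// when earlier responses come back.
function scheduledOffsetsMs(ratePerSec, count) {
  const offsets = [];
  for (let i = 0; i < count; i++) {
    offsets.push((i * 1000) / ratePerSec); // i/rate, expressed in milliseconds
  }
  return offsets;
}

// Fire `send(i)` at each scheduled offset without awaiting earlier calls.
// `send` is a hypothetical async function supplied by the caller.
function runOpenLoop(ratePerSec, count, send) {
  const t0 = performance.now();
  const pending = scheduledOffsetsMs(ratePerSec, count).map(
    (offset, i) =>
      new Promise((resolve) => {
        const intended = t0 + offset;
        const delay = Math.max(0, intended - performance.now());
        setTimeout(() => {
          // Latency is counted from the intended send time (assumption),
          // so a late timer or a backed-up engine shows up in the number.
          Promise.resolve()
            .then(() => send(i))
            .then(() => resolve({ i, ok: true, ms: performance.now() - intended }))
            .catch(() => resolve({ i, ok: false, ms: performance.now() - intended }));
        }, delay);
      })
  );
  return Promise.all(pending);
}
```

At 200 req/s the first three requests are due at 0 ms, 5 ms and 10 ms after t0, whether or not request 0 has returned.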

SAMPLING

Per family: 15 untimed warmup calls, then p50 over 120 timed iterations. Each request has a 15 s bound — an engine that hangs is recorded as collapsed and scored, not silently dropped.
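The sampling rule could be sketched roughly as follows. The constants come from the paragraph above; the nearest-rank percentile definition, the use of null for a timed-out request, and the choice to mark a family collapsed on any timeout are assumptions for illustration.

```javascript
const WARMUP_CALLS = 15;         // untimed warmup calls per family
const TIMED_ITERATIONS = 120;    // timed samples per family
const REQUEST_BOUND_MS = 15_000; // per-request bound

// Nearest-rank percentile over a sorted copy of the samples (assumed definition).
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Drop the warmup calls and keep only the timed window.
function timedSamples(allLatenciesMs) {
  return allLatenciesMs.slice(WARMUP_CALLS, WARMUP_CALLS + TIMED_ITERATIONS);
}

// Summarize one family. A null entry marks a request that hit the bound.
// Treating any timeout as a collapse is an assumption, not the harness's rule.
function summarizeFamily(allLatenciesMs) {
  const timed = timedSamples(allLatenciesMs);
  const timedOut = timed.filter((ms) => ms === null || ms >= REQUEST_BOUND_MS).length;
  if (timedOut > 0) return { status: "collapsed", timedOut };
  return { status: "ok", p50: percentile(timed, 50) };
}
```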

FEASIBILITY PROBE

Every family is probed first. A 4xx classifies it unsupported for that engine; the probe's result signal (hit totals, agg shape) is captured for the correctness check.
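A sketch of how a probe result might be classified, under stated assumptions: classifyProbe and resultSignal are hypothetical names, the "error" label for non-4xx failures is a guess at what the table's error cells mean, and the signal captured here (hit total plus aggregation names) is a simplification of "hit totals, agg shape". The hits.total object with a value field is the standard Elasticsearch search response shape.

```javascript
// Pull a comparable signal out of a parsed _search response body.
function resultSignal(body) {
  const total = body?.hits?.total;
  // hits.total is an object { value, relation } in ES 7+; accept a bare number too.
  const hits = typeof total === "number" ? total : total?.value ?? null;
  const aggNames = body?.aggregations ? Object.keys(body.aggregations).sort() : [];
  return { hits, aggNames };
}

// Classify one engine's probe from the HTTP status and parsed body.
function classifyProbe(status, body) {
  if (status >= 400 && status < 500) {
    return { supported: false, reason: `unsupported (${status})` };
  }
  if (status < 200 || status >= 300) {
    return { supported: false, reason: `error (${status})` }; // assumed label
  }
  return { supported: true, signal: resultSignal(body) };
}
```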

RESULT-SIGNAL MISMATCH

If the two engines return materially different results (e.g. 0 hits vs 277,449), the row is scored N/A mismatch — an engine that returns wrong or empty results cannot win on latency.
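One way the mismatch check could look. The page only says "materially different", so the relative tolerance below (relTolerance, default 1%) is a hypothetical knob chosen for illustration, as are the function names.

```javascript
// True when two hit totals differ by more than a relative tolerance.
// relTolerance is an assumption; the source does not state a threshold.
function isMismatch(hitsA, hitsB, relTolerance = 0.01) {
  if (hitsA === null || hitsB === null) return false; // no signal to compare
  const larger = Math.max(hitsA, hitsB);
  if (larger === 0) return false; // both engines found nothing
  return Math.abs(hitsA - hitsB) / larger > relTolerance;
}

// Build the note that accompanies a mismatched row.
function mismatchNote(hitsXerj, hitsEs) {
  return isMismatch(hitsXerj, hitsEs)
    ? `result mismatch: hits ${hitsXerj} vs ${hitsEs}`
    : null;
}
```

With this rule, 0 vs 277449 and 257 vs 12471 both count as mismatches, so neither row can be scored as a latency win.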

IDENTICAL WORK

track_total_hits: true is injected into every _search body on both sides. ES caps hit totals at 10,000 by default; forcing exact totals means neither engine wins by short-circuiting the count.
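The injection itself is small. A sketch, with withExactTotals and prepareRequest as hypothetical helper names; track_total_hits is the real Elasticsearch search parameter named above.

```javascript
// Return a copy of a _search body with exact hit counting forced on.
// Spreading first means an existing track_total_hits value is overridden.
function withExactTotals(searchBody) {
  return { ...searchBody, track_total_hits: true };
}

// Apply the rule only to _search requests; other endpoints pass through.
function prepareRequest(path, body) {
  const isSearch = path.split("?")[0].endsWith("/_search");
  return isSearch ? withExactTotals(body) : body;
}
```

The same prepared body is then sent to both engines, so the counting work is the same on each side.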

VERDICTS

Ratios are normalized so >1× always means XERJ is better (lower latency, higher docs/s, smaller disk, higher recall). Any LOSE row makes the runner exit non-zero — the scorecard is a CI gate, not a brochure.
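The normalization can be written in a few lines. This is a sketch with hypothetical function names; treating a ratio of exactly 1 as LOSE is an assumption, since the page only defines WIN as a ratio above 1×.

```javascript
// better = "lower" for latency and disk size, "higher" for docs/s and recall.
// The result is > 1 exactly when XERJ is the better engine.
function normalizedRatio(better, xerj, es) {
  return better === "lower" ? es / xerj : xerj / es;
}

// Ties scored as LOSE is an assumption, not a documented rule.
function verdict(ratio) {
  return ratio > 1 ? "WIN" : "LOSE";
}

// CI gate: any LOSE row makes the run fail.
function exitCode(verdicts) {
  return verdicts.includes("LOSE") ? 1 : 0;
}

// Two rows from the scorecard:
const matchAll = normalizedRatio("lower", 2.14, 5.07);       // p50 ms
const ingest1m8 = normalizedRatio("higher", 205273, 374490); // docs/s
console.log(matchAll.toFixed(2), verdict(matchAll));   // 2.37 WIN
console.log(ingest1m8.toFixed(2), verdict(ingest1m8)); // 0.55 LOSE
```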

HOW WE ALMOST FOOLED OURSELVES · THE QUERY-CACHE MIRAGE · 2026-07-01

Earlier head-to-heads showed XERJ winning reads 1.3–2.2× — and the numbers were a mirage. Those benchmarks repeated the same query against a static index, so XERJ's result cache served every call after the first. We were measuring cache hits, not query execution: uncached, a match_all size:10 actually took 2.28 seconds, because hit materialization scanned every matching document instead of the top from+size. We published the finding (demo/playbooks/CRITICAL_FINDING_read_perf_cache_mirage.md), fixed the O(N) path to O(from+size), and hardened the harness with the mismatch detection and identical-work rules above. That is the point of printing the LOSE column: a benchmark that can embarrass you is the only kind that can be trusted when it doesn't.
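To make the O(N) versus O(from+size) distinction concrete, here is an illustrative sketch, not XERJ's implementation. It keeps a bounded list of the best from+size matches and runs the expensive step only for the requested page. pageOfHits and loadSource are hypothetical names; loadSource stands in for whatever it costs to materialize a document.

```javascript
// Select the best (from + size) matches by score, then materialize only the page.
// `matches` is an array of { id, score }; `loadSource(id)` is the costly step.
function pageOfHits(matches, from, size, loadSource) {
  const k = from + size;
  if (size <= 0 || k <= 0) return [];
  const top = []; // sorted by descending score, never longer than k
  for (const m of matches) {
    // Skip anything that cannot enter a full top-k window.
    if (top.length === k && m.score <= top[k - 1].score) continue;
    // Binary search for the insertion point (ties keep arrival order).
    let lo = 0;
    let hi = top.length;
    while (lo < hi) {
      const mid = (lo + hi) >> 1;
      if (top[mid].score >= m.score) lo = mid + 1;
      else hi = mid;
    }
    top.splice(lo, 0, m);
    if (top.length > k) top.pop();
  }
  // Only the requested page pays the materialization cost.
  return top.slice(from, k).map((m) => ({ ...m, source: loadSource(m.id) }));
}
```

Every match is still scored once, but a match_all over a million documents with size 10 calls loadSource ten times rather than a million.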

03·FULL RESULTS · 2026-07-04 · NO CHERRY-PICKING

THE WHOLE
SCORECARD.

All 91 rows from demo/playbooks/SCORECARD.md, unedited. Latency rows are p50 in milliseconds unless the row label says otherwise (the mixed read-under-write rows are p99); ingest rows are docs/s. Read this run's caveat first: partway through, the search_after family (its 9,924 ms row below) triggered a defect that drove XERJ to an out-of-memory kill. Rows recorded as collapsed or unsupported on the XERJ side after that point are casualties of that crash: the engine was down or dying when those families ran, so no slower latency was actually measured. The collapsed rows are scored LOSE anyway, because an engine that dies mid-benchmark loses those rows; the unsupported and error rows (mixed, kNN, on-disk size) fall under the not-scored rule and show N/A. A fix is in flight (see What's Next); the matrix will be re-run and this page updated when it lands.

| DIMENSION | XERJ | ELASTICSEARCH | RATIO | VERDICT |
|---|---|---|---|---|
| ingest 100k × c1 (docs/s) | 64,451 | 48,979 | 1.32× | WIN |
| ingest 100k × c8 (docs/s) | 233,602 | 199,589 | 1.17× | WIN |
| ingest 1m × c1 (docs/s) | 65,225 | 69,328 | 0.94× | LOSE |
| ingest 1m × c8 (docs/s) | 205,273 | 374,490 | 0.55× | LOSE |
| read q: match_all (p50 ms) | 2.14 | 5.07 | 2.37× | WIN |
| read q: match_none (p50 ms) | 2.71 | 3.10 | 1.14× | WIN |
| read q: match(model) (p50 ms) | 1.93 | 4.58 | 2.37× | WIN |
| read q: match_phrase(top_doc) (p50 ms) | 1.87 | 4.39 | 2.35× | WIN |
| read q: match_phrase_prefix (p50 ms) | 2.80 | unsupported (400) | — | N/A |
| read q: match_bool_prefix (p50 ms) | 1.93 | 6.34 | 3.29× | WIN |
| read q: multi_match (p50 ms) · result mismatch: hits 0 vs 277449 | 2.04 | 7.32 | mismatch | N/A |
| read q: combined_fields (p50 ms) | 2.05 | unsupported (400) | — | N/A |
| read q: query_string (p50 ms) · result mismatch: hits 0 vs 273706 | 2.30 | 11.89 | mismatch | N/A |
| read q: simple_query_string (p50 ms) · result mismatch: hits 0 vs 987529 | 2.11 | 3.15 | mismatch | N/A |
| read q: more_like_this (p50 ms) | 1.78 | 3.17 | 1.78× | WIN |
| read q: term(status) (p50 ms) | 2.51 | 2.65 | 1.06× | WIN |
| read q: terms(model) (p50 ms) | 2.23 | 6.42 | 2.88× | WIN |
| read q: range(latency_ms) (p50 ms) | 1.95 | 6.12 | 3.14× | WIN |
| read q: range(@timestamp) (p50 ms) | 2.24 | 2.63 | 1.17× | WIN |
| read q: range(cost_usd) (p50 ms) | 1.14 | 6.54 | 5.75× | WIN |
| read q: prefix(model) (p50 ms) | 1.71 | 11.00 | 6.43× | WIN |
| read q: wildcard(model) (p50 ms) | 1.40 | 10.42 | 7.43× | WIN |
| read q: regexp(model) (p50 ms) | collapsed | 11.38 | — | LOSE |
| read q: fuzzy(model) (p50 ms) · result mismatch: hits 0 vs 277449 | 1.62 | 3.11 | mismatch | N/A |
| read q: exists(cost_usd) (p50 ms) | 1.41 | 2.50 | 1.77× | WIN |
| read q: ids (p50 ms) | 1.10 | 2.24 | 2.03× | WIN |
| read q: term(cache_hit) (p50 ms) | 1.25 | 2.37 | 1.90× | WIN |
| read q: bool must+filter+should+must_not (p50 ms) | 0.82 | 20.54 | 25.10× | WIN |
| read q: constant_score (p50 ms) | 1.36 | 2.27 | 1.67× | WIN |
| read q: boosting (p50 ms) | 0.41 | 35.77 | 87.78× | WIN |
| read q: dis_max (p50 ms) | 1.49 | 6.27 | 4.20× | WIN |
| read q: function_score (p50 ms) | 1.15 | 35.40 | 30.84× | WIN |
| read q: pinned (p50 ms) | 0.98 | 15.67 | 15.99× | WIN |
| read agg: avg (p50 ms) | 0.89 | 2.66 | 2.97× | WIN |
| read agg: sum (p50 ms) | 1.21 | 2.57 | 2.12× | WIN |
| read agg: min (p50 ms) | 1.11 | 2.24 | 2.01× | WIN |
| read agg: max (p50 ms) | 0.62 | 2.75 | 4.47× | WIN |
| read agg: stats (p50 ms) | 1.01 | 2.53 | 2.51× | WIN |
| read agg: extended_stats (p50 ms) | 0.75 | 2.49 | 3.33× | WIN |
| read agg: value_count (p50 ms) | 1.21 | 2.18 | 1.81× | WIN |
| read agg: cardinality (p50 ms) | 0.98 | 2.52 | 2.58× | WIN |
| read agg: percentiles (p50 ms) | 1.16 | 2.05 | 1.77× | WIN |
| read agg: percentile_ranks (p50 ms) | 1.19 | 2.33 | 1.97× | WIN |
| read agg: median_absolute_deviation (p50 ms) | 2.46 | 2.79 | 1.13× | WIN |
| read agg: matrix_stats (p50 ms) | 2.09 | 2.61 | 1.25× | WIN |
| read agg: scripted_metric (p50 ms) | 2.05 | 2.13 | 1.04× | WIN |
| read agg: top_hits (sub) (p50 ms) | collapsed | 2.96 | — | LOSE |
| read agg: terms (p50 ms) | collapsed | 2.12 | — | LOSE |
| read agg: rare_terms (p50 ms) | 2.01 | 1.96 | 0.97× | LOSE |
| read agg: significant_terms (p50 ms) | 1.55 | 2.03 | 1.31× | WIN |
| read agg: histogram (p50 ms) | 2.66 | 2.71 | 1.02× | WIN |
| read agg: date_histogram (p50 ms) | 2.31 | 2.53 | 1.10× | WIN |
| read agg: auto_date_histogram (p50 ms) | 2.16 | 2.10 | 0.97× | LOSE |
| read agg: variable_width_histogram (p50 ms) | collapsed | 3.56 | — | LOSE |
| read agg: range (p50 ms) | collapsed | 2.30 | — | LOSE |
| read agg: date_range (p50 ms) | collapsed | 3.18 | — | LOSE |
| read agg: filter (p50 ms) | collapsed | 2.59 | — | LOSE |
| read agg: filters (p50 ms) | collapsed | 3.25 | — | LOSE |
| read agg: missing (p50 ms) | 1.21 | 2.89 | 2.39× | WIN |
| read agg: global (p50 ms) · result mismatch: hits 257 vs 12471 | 1.20 | 2.77 | mismatch | N/A |
| read agg: adjacency_matrix (p50 ms) | collapsed | 3.10 | — | LOSE |
| read agg: composite (p50 ms) | collapsed | 2.94 | — | LOSE |
| read agg: random_sampler (p50 ms) | collapsed | 6.48 | — | LOSE |
| read agg: terms+avg(cost) (p50 ms) | collapsed | 2.62 | — | LOSE |
| read pipe: sum_bucket (p50 ms) | collapsed | 3.08 | — | LOSE |
| read pipe: avg_bucket (p50 ms) | collapsed | 3.00 | — | LOSE |
| read pipe: max_bucket (p50 ms) | collapsed | 3.34 | — | LOSE |
| read pipe: stats_bucket (p50 ms) | collapsed | 2.68 | — | LOSE |
| read pipe: percentiles_bucket (p50 ms) | collapsed | 2.87 | — | LOSE |
| read pipe: derivative (p50 ms) | collapsed | 3.56 | — | LOSE |
| read pipe: cumulative_sum (p50 ms) | collapsed | 2.53 | — | LOSE |
| read pipe: moving_fn (p50 ms) | collapsed | 3.84 | — | LOSE |
| read pipe: serial_diff (p50 ms) | collapsed | 3.25 | — | LOSE |
| read pipe: bucket_script (p50 ms) | collapsed | 3.26 | — | LOSE |
| read pipe: bucket_selector (p50 ms) | collapsed | 3.35 | — | LOSE |
| read pipe: bucket_sort (p50 ms) | collapsed | 3.59 | — | LOSE |
| read feat: sort-heavy (p50 ms) | 2.19 | 20.04 | 9.13× | WIN |
| read feat: deep from+size (from 500) (p50 ms) | 2.36 | 3.43 | 1.45× | WIN |
| read feat: search_after (p50 ms) | 9924.66 | 18.76 | 0.00× | LOSE |
| read feat: highlight (p50 ms) | collapsed | 2.94 | — | LOSE |
| read feat: _count (p50 ms) | collapsed | 1.74 | — | LOSE |
| read feat: _msearch (p50 ms) | collapsed | 2.34 | — | LOSE |
| read feat: _mget (p50 ms) | collapsed | 2.27 | — | LOSE |
| mixed match_all (p99 ms, under write) | unsupported | 20.22 | — | N/A |
| mixed bool (p99 ms, under write) | unsupported | 17.36 | — | N/A |
| mixed range (p99 ms, under write) | unsupported | 37.85 | — | N/A |
| mixed terms (p99 ms, under write) | unsupported | 6.15 | — | N/A |
| mixed cardinality (p99 ms, under write) | unsupported | 21.44 | — | N/A |
| kNN k=10 (p50 ms) | error | 4.84 | — | N/A |
| kNN recall@10 | error | 100.0% | — | N/A |
| index on-disk size | unsupported | 819.4 MB | — | N/A |

Beyond the crash casualties, the honest performance losses in this run are ingest 1M × 1 client (0.94×), ingest 1M × 8 clients (0.55×), rare_terms (0.97×), auto_date_histogram (0.97×), and search_after itself. The first two are throughput losses and the rest are latency losses; all five are tracked work items.
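The split between crash casualties and measured losses follows from the XERJ cell alone. A small sketch over five LOSE rows copied from the table above; lossKind is a hypothetical helper, and reading "collapsed" as "casualty" applies only to this truncated run.

```javascript
// A sample of LOSE rows; `xerj` is the raw XERJ cell from the scorecard.
const loseRows = [
  { dimension: "ingest 1m × c8 (docs/s)", xerj: "205,273" },
  { dimension: "read agg: rare_terms (p50 ms)", xerj: "2.01" },
  { dimension: "read feat: search_after (p50 ms)", xerj: "9924.66" },
  { dimension: "read agg: terms (p50 ms)", xerj: "collapsed" },
  { dimension: "read pipe: derivative (p50 ms)", xerj: "collapsed" },
];

// A LOSE with a number was measured; a LOSE marked collapsed never produced one.
function lossKind(row) {
  return row.xerj === "collapsed" ? "crash casualty" : "measured loss";
}

const measured = loseRows
  .filter((row) => lossKind(row) === "measured loss")
  .map((row) => row.dimension);
console.log(measured.length, "measured,", loseRows.length - measured.length, "casualties");
```

Applied to the full table, the same rule gives the 5 measured losses listed above and 28 collapsed rows.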

04·REPRODUCE IT

FOUR COMMANDS.
YOUR MACHINE.

Everything on this page regenerates from the repo. No hosted harness, no private dataset, no hand-tuned engine flags.

$ git clone https://github.com/xerj-org/xerj && cd xerj
$ cargo build --release --manifest-path engine/Cargo.toml
$ bash scratchpad/es_up.sh
$ bash scratchpad/run_scorecard.sh --docs 100k,1m --clients 1,8 --knn --mixed

es_up.sh

Downloads the official Elasticsearch 8.13.4 linux-x86_64 tarball (cached after the first run), writes a single-node config with security off on port :9201, boots it with a 4 GB heap, and polls until it answers 200. Idempotent — re-running is a no-op if ES is already up.
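The "polls until it answers 200" step is a standard readiness loop. Below is a Node rendering of the idea, not the contents of es_up.sh: waitUntilUp, its attempts and delayMs options, and esProbe are all hypothetical names. The probe is injected so the loop can be exercised without a running node.

```javascript
// Poll `probe` until it resolves to HTTP 200 or the attempts run out.
// `probe` is any async function that resolves to a status code.
async function waitUntilUp(probe, { attempts = 60, delayMs = 1000 } = {}) {
  for (let i = 0; i < attempts; i++) {
    try {
      if ((await probe()) === 200) return true;
    } catch {
      // Connection refused while the node is still booting: keep polling.
    }
    if (i < attempts - 1) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return false;
}

// Example probe for the ES node on :9201 (global fetch needs Node 18+).
const esProbe = () => fetch("http://localhost:9201/").then((res) => res.status);
```

Calling waitUntilUp(esProbe) against an already-running node returns on the first attempt, which is what makes a re-run a no-op.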

run_scorecard.sh

Boots the release XERJ binary on :9200 with a fresh data dir, keeps it alive for the whole run, then executes the matrix against both engines and shuts XERJ down. Exits non-zero if any row is a LOSE.

HARNESS

demo/playbooks/bench-matrix.mjs — the matrix runner and scorecard generator (Node 24, no dependencies). Output lands in demo/playbooks/SCORECARD.md, which is the exact file this page's table is rendered from.

Your absolute numbers will differ with hardware; the ratios and verdicts are the claim. If your run disagrees with this page, file an issue with your SCORECARD.md — that is precisely what the harness is for.

05·KNOWN ISSUES & WHAT'S NEXT

THE RED CELLS
ARE THE ROADMAP.

search_after OOM — found by this benchmark

The deep-pagination search_after family exposed a defect that ballooned memory until the kernel killed XERJ mid-run — the single largest distortion in this scorecard (28 of the 33 LOSE rows are its casualties). The fix is in flight on a dedicated branch; the matrix re-runs, and this page is republished, when it lands. Finding this class of bug is what the benchmark exists to do.

Ingest at 1M docs — the real performance gap

XERJ wins ingest at 100k docs (1.32× single-client, 1.17× at 8 clients) but loses at 1M: 0.94× single-client and 0.55× at 8 clients. The phased plan (demo/playbooks/BEAT_ES_MASTER_PLAN.md): a no-reparse flush that threads already-parsed documents into segment building instead of re-parsing them twice; search-pool isolation so background flush and merge can't starve foreground queries; and a freeze-and-swap flush that swaps in a fresh memtable atomically so writers never stall behind a drain. Definition of done: every cell in the scorecard green, enforced by CI — any new LOSE fails the build.
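The freeze-and-swap idea can be shown in miniature. This is a conceptual sketch only, with hypothetical names, and it says nothing about how the engine will implement it: in single-threaded JavaScript the swap is trivially atomic, while a multi-threaded engine would need an atomic pointer swap or equivalent.

```javascript
// Writers always append to `active`. A flush freezes the current table and
// swaps in an empty one, so draining never blocks new writes.
class SwappableMemtable {
  constructor() {
    this.active = new Map();
  }

  put(id, doc) {
    this.active.set(id, doc);
  }

  // Hand back the frozen table for draining and start a fresh one.
  freezeAndSwap() {
    const frozen = this.active;
    this.active = new Map();
    return frozen;
  }
}

const table = new SwappableMemtable();
table.put("a", { model: "m1" });
const frozen = table.freezeAndSwap(); // flush can now drain `frozen`
table.put("b", { model: "m2" });      // lands in the fresh table, no stall
console.log([...frozen.keys()], [...table.active.keys()]); // [ 'a' ] [ 'b' ]
```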

Families this corpus can't measure

Skipped, not hidden — each needs a purpose-built index the flat telemetry corpus lacks: geo_* queries and aggregations (no geo_point/geo_shape field), ip_range / ip_prefix (no ip field), nested / has_child / has_parent (flat corpus, no join mapping), span_* (needs a positional text field), significant_text (corpus fields are keyword), semantic / hybrid retriever (needs a dense_vector field — kNN is covered separately by --knn on a purpose-built index), and percolate (parses but no-ops — not benchmarkable for correctness). Purpose-built corpora for these families are planned follow-ups.

The standing guardrail

Every performance change must hold ES-YAML REST conformance at 1326 passed / 0 failed. Speed bought with correctness is not a win, and it does not merge.

06·CHANGELOG

2026-07-04

First public full-matrix publication. XERJ v1.0.0-rc.1 vs Elasticsearch 8.13.4 · 91 dimensions · 43 WIN / 33 LOSE / 15 N/A · run truncated by the search_after defect documented above.
