04 · ENGINE

Clustering

XERJ ships an embedded Raft implementation — no etcd, no ZooKeeper, no Consul. One binary is still one binary; cluster mode is a config switch, not a separate process. Raft replicates cluster metadata only: index schemas, shard assignments, and the node roster. Actual index data lives under the storage layer and is replicated per-shard outside the Raft group.

What you get

Leader election — exactly one leader per term, automatic failover within an election timeout when a leader is lost.
Log replication — every metadata change (create index, reassign shard, add node) is proposed by the leader and replicated to a quorum before it's committed.
Log safety — commits are durable across crashes. A node rejoining the cluster catches up from the leader's log without losing data.
Shard routing — writes are forwarded to the node holding the primary for that shard. Reads can hit any active replica.
Region split / merge — for range-partitioned indices, regions split when they exceed a byte threshold and merge when neighbours drop below it.

Node lifecycle

Every node has a NodeState in the replicated metadata:

ActiveAccepting reads and writes. Eligible for shard assignment.

DrainingGracefully shedding load. New shards are not assigned here; existing shards stay until moved.

DownUnreachable or explicitly removed. Shards owned by a Down node are reassigned to Active peers.

Bootstrapping a 3-node cluster

Start three nodes, each with its own data dir, and point them at each other. On first startup they hold a Raft election and one becomes leader:

# /etc/xerj/node-a.toml  (run on 10.0.0.11)
[server]
data_dir     = "/var/lib/xerj"
bind_address = "10.0.0.11"

[cluster]
enabled = true
port    = 9300                            # intra-cluster Raft + search port
peers   = [
  "a=10.0.0.11:9300",                     # this node
  "b=10.0.0.12:9300",
  "c=10.0.0.13:9300",
]
tick_ms = 50                              # Raft tick interval

# /etc/xerj/node-b.toml  (run on 10.0.0.12)
[server]
data_dir     = "/var/lib/xerj"
bind_address = "10.0.0.12"

[cluster]
enabled = true
port    = 9300
peers   = [
  "a=10.0.0.11:9300",
  "b=10.0.0.12:9300",                     # this node
  "c=10.0.0.13:9300",
]
tick_ms = 50

The peers list uses "<node_id>=<host>:<port>". The current node's id is inferred from the entry whose address matches its own bind_address:port. Start all three boxes:

$ xerj --config /etc/xerj/node-a.toml     # on 10.0.0.11
$ xerj --config /etc/xerj/node-b.toml     # on 10.0.0.12
$ xerj --config /etc/xerj/node-c.toml     # on 10.0.0.13

Within a second the cluster elects a leader. You can check the status on any node:

$ curl -sH "Authorization: ApiKey $XERJ_API_KEY" \
    http://10.0.0.11:8080/v1/cluster/health | jq
{
  "state":   "active",
  "leader":  "b",
  "term":    3,
  "nodes": [
    { "id": "a", "address": "10.0.0.11:8100", "state": "active" },
    { "id": "b", "address": "10.0.0.12:8100", "state": "active" },
    { "id": "c", "address": "10.0.0.13:8100", "state": "active" }
  ],
  "indices": 4,
  "shards":  24
}

Creating a sharded index

Specify shards at create time. Shards are assigned to nodes round-robin; the leader replicates the assignment through Raft so every node sees the same mapping:

$ curl -sXPUT -H "Authorization: ApiKey $XERJ_API_KEY" \
    -H "Content-Type: application/json" \
    http://10.0.0.11:8080/v1/indices/events \
    -d '{
      "shards": 6,
      "schema": {
        "@timestamp": "date",
        "host":       "keyword",
        "message":    "text"
      }
    }'

Draining a node for maintenance

Before rebooting a node, mark it draining so the cluster migrates its shards to peers first:

$ curl -sXPOST -H "Authorization: ApiKey $XERJ_API_KEY" \
    http://10.0.0.11:8080/v1/cluster/nodes/b/drain
{ "state": "draining", "shards_remaining": 8 }

# ... wait for shards_remaining to hit 0 ...

$ systemctl restart xerj    # on node b

# After the reboot, mark it active again:
$ curl -sXPOST -H "Authorization: ApiKey $XERJ_API_KEY" \
    http://10.0.0.11:8080/v1/cluster/nodes/b/activate

Regions — range partitioning

For range-partitioned indices (typically logs and time-series), XERJ splits the key space into regions. Each region is owned by exactly one node. When a region crosses split_bytes_threshold it splits into two; when two neighbouring regions together fall below merge_bytes_threshold they merge back. The region manager rebalances regions across nodes when load skew exceeds the configured tolerance.

Regions are transparent to clients — writes and reads use the normal index name; the cluster router looks up the correct region and forwards the request.

Failure modes

Leader crashFollowers time out, hold an election, a new leader is chosen. In-flight proposals from the old term are retried by clients.

Minority partitionThe minority side loses quorum and refuses writes. The majority continues. When the partition heals, the minority catches up from the leader's log.

Majority partitionCluster metadata is read-only until quorum is restored. Existing shards continue to serve reads from any active node.

Permanent node lossMark the node down with POST /v1/cluster/nodes/:id/remove. The router reassigns its shards to peers.

When not to cluster

If one node handles your load, don't cluster. One box avoids network round-trips, needs no quorum, and is trivial to back up. Clustering earns its complexity when you need more CPU or disk than a single box can give you, when you need HA without a hot standby pair, or when you're geo-distributed. The default config ships with [cluster] enabled = false.

Source · engine/crates/cluster/src/node.rs · engine/crates/cluster/src/metadata.rs · engine/crates/cluster/src/regions.rs

◀ PREVCompression

NEXT ▶Metrics