# vpp-maglev Design Document

## Metadata

| | |
| --- | --- |
| **Status** | Retrofit — describes shipped behavior as of `v0.9.5` |
| **Author** | Pim van Pelt `<pim@ipng.ch>` |
| **Last updated** | 2026-04-15 |
| **Audience** | Operators and contributors who will read the source tree next |

The key words **MUST**, **MUST NOT**, **SHOULD**, **SHOULD NOT**, and **MAY** are used as described in [RFC 2119](https://datatracker.ietf.org/doc/html/rfc2119), and are reserved in this document for requirements that are actually enforced in code or by an external dependency. Plain-language descriptions of what the system or an operator can do are written in lowercase — "can", "will", "does" — and should not be read as normative.

## Summary

`vpp-maglev` is a control plane for the VPP `lb` (Maglev load balancer) plugin. A single daemon — `maglevd` — probes a fleet of backends, maintains an authoritative view of their health, and programs the VPP dataplane so that traffic hashed to a given VIP lands only on healthy backends. Operators drive the system through `maglevc` (an interactive CLI) or `maglevd-frontend` (a read-only web dashboard with an optional authenticated admin surface). A small companion binary, `maglevt`, validates VIPs from outside the control plane by sending live HTTP probes and reporting failover behavior.

## Background

VPP's `lb` plugin implements Maglev consistent hashing inside the dataplane: a VIP is backed by a pool of Application Servers (ASes), each with an integer weight in `[0, 100]`, and incoming flows are hashed onto a bucket ring so that weight changes disturb as few existing flows as possible. The plugin knows nothing about backend health; if an AS dies while it holds buckets, traffic to those buckets is black-holed until something external tells `lb` to remove or re-weight the AS.

`vpp-maglev` is that external thing. Before `vpp-maglev`, operators maintained VIP configurations by hand and reacted to incidents with `vppctl`. The project replaces that loop with a daemon that owns the health story, reconciles it with the dataplane, and exposes the result through a uniform gRPC API so that CLIs, dashboards, and scripts all read the same source of truth.

## Goals and Non-Goals

### Product Goals

1. **Accurate backend health.** Detect that a backend is up, degraded, or down quickly enough to keep user-visible error rates low, and avoid flapping under transient faults.
2. **Correct VPP state.** The set of VIPs and per-AS weights in VPP converges to the configured intent, filtered by current health, for every supported failure mode.
3. **Restart neutrality.** Restarting `maglevd` with VPP already up MUST NOT cause traffic to be black-holed while health probes warm up.
4. **Operator control.** A human can pause, drain, or weight-shift a backend in seconds without editing config files.
5. **Uniform observability.** Every state transition, VPP API call, and probe result is emitted as a structured log, a Prometheus metric, or a streaming event — ideally all three.
6. **One source of truth.** Every other component (CLI, web frontend, scripts) reads `maglevd` through one typed interface. There is no secondary control plane.

### Non-Goals

- `vpp-maglev` is not a VPP installer or packaging layer. It assumes VPP is already running with the `lb` plugin loaded.
- It does not implement its own dataplane fast path. All forwarding stays in VPP; `maglevd` only programs the plugin.
- It is not a generic service mesh. There is no L7 routing, cert issuance, service discovery, or east-west policy — only VIPs, pools, and backends.
- It is not a config store. Configuration is a YAML file on disk; the gRPC API can check and reload it but cannot author it.
- It does not secure its own transport. gRPC runs insecure by default; TLS, mTLS, or firewalls are the operator's responsibility.

## Requirements

Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that later sections can cite it.

### Functional Requirements

**FR-1 Health checking**

- **FR-1.1** The system supports ICMP, TCP, HTTP, and HTTPS health checks, each with its own protocol-specific success criteria.
- **FR-1.2** Each health check MUST apply HAProxy rise/fall semantics with operator-configurable thresholds.
- **FR-1.3** A health check MAY declare distinct `interval`, `fast-interval`, and `down-interval` values so that recovery from a degraded or down state is faster than steady-state polling.
- **FR-1.4** Each probe attempt is bounded by a configurable per-probe timeout, independent of the scheduling interval.
- **FR-1.5** If the configuration sets `healthchecker.netns`, every probe MUST execute inside the named Linux network namespace.
- **FR-1.6** The first probe result against a newly-created backend forces an immediate transition out of `Unknown`, without waiting for `rise` or `fall` consecutive results.
- **FR-1.7** A backend MAY omit its `healthcheck` reference to declare itself **static**. A static backend is not probed and is treated as permanently Up; it still participates in pool failover and still honors operator Pause and Disable overrides.

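To make the interval and threshold knobs concrete, here is an illustrative health-check stanza. Only the knob names (`interval`, `fast-interval`, `down-interval`, the per-probe timeout, and the rise/fall thresholds) come from this document; the surrounding YAML shape is a sketch, not the authoritative schema.

```yaml
healthchecks:
  http-ok:
    type: http            # FR-1.1: icmp | tcp | http | https
    rise: 3               # FR-1.2: consecutive successes to go Up
    fall: 2               # FR-1.2: consecutive failures to go Down
    interval: 5s          # FR-1.3: nominal cadence
    fast-interval: 1s     # FR-1.3: while in the degraded zone
    down-interval: 10s    # FR-1.3: while fully down
    timeout: 2s           # FR-1.4: bounds one attempt, not the cadence
```
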
**FR-2 Aggregation and pool failover**

- **FR-2.1** A frontend MAY reference one or more named pools. Each referenced pool MUST contain at least one `(backend, configured-weight)` tuple; an empty pool is a configuration error and is rejected at load time.
- **FR-2.2** At any time, at most one pool — the first, in configuration order, that contains a healthy backend with non-zero configured weight — is active; backends in other pools contribute zero effective weight.
- **FR-2.3** The effective weight of a `(frontend, pool, backend)` tuple is the configured weight when the backend is Up **and** the pool is active, and zero in every other case.
- **FR-2.4** A frontend's aggregate state is Up when at least one backend has non-zero effective weight, Unknown when every referenced backend is still Unknown (or the frontend references no backends), and Down otherwise.

**FR-3 Operator control**

- **FR-3.1** Operators can pause and resume individual backends at runtime. Pausing stops the probe worker, freezes the rise/fall counter, and drives effective weight to zero in **every** pool and **every** frontend that references the backend. Existing flows are not torn down; this is a soft drain.
- **FR-3.2** Operators can disable and re-enable individual backends at runtime. Disabling drives effective weight to zero in **every** pool and **every** frontend that references the backend, and MUST cause existing flows to be torn down on the next VPP sync.
- **FR-3.3** Operators can set the configured weight of a specific `(frontend, pool, backend)` tuple at runtime.
- **FR-3.4** Operator overrides (Pause, Disable) and operator weight mutations survive a configuration **reload** (`SIGHUP`) as long as the underlying backend and tuple still exist in the new configuration.
- **FR-3.5** Operator overrides and operator weight mutations do **not** survive a `maglevd` **restart**. After a restart, the YAML configuration file is authoritative for every backend and every tuple: paused backends come back unpaused, disabled backends come back enabled, mutated weights revert to the configured value. Operators who need persistent changes must edit the config file.

**FR-4 VPP reconciliation**

- **FR-4.1** For every backend state transition that changes an effective weight, `maglevd` pushes the resulting AS state into VPP for every affected VIP.
- **FR-4.2** `maglevd` runs a periodic full reconciliation on a configurable cadence (default thirty seconds) as a safety net against missed events and VPP restarts.
- **FR-4.3** Weight-to-zero is communicated to VPP as a graceful drain by default; transitions to Disabled and transitions to Down while `flush-on-down` is true MUST tear existing flows down on the next sync.
- **FR-4.4** `maglevd` tolerates VPP disconnects by auto-reconnecting and resuming reconciliation once the connection is re-established.

**FR-5 Configuration**

- **FR-5.1** Configuration is loaded from a single YAML file specified at startup and referenced by all later operations.
- **FR-5.2** Configuration validation distinguishes **parse errors** (malformed YAML) from **semantic errors** (structural invariants) and MUST report each with its own exit code from `--check`: 0 (OK), 1 (parse), 2 (semantic).
- **FR-5.3** `maglevd` reloads its configuration on `SIGHUP` without restarting the process, without restarting unchanged probe workers, and without losing operator overrides (see FR-3.4).
- **FR-5.4** A parse or semantic error encountered during reload MUST leave the running configuration in place.
- **FR-5.5** The same validation and reload semantics are also reachable through gRPC (`CheckConfig`, `ReloadConfig`).

**FR-6 Observability**

- **FR-6.1** All logs are emitted as structured JSON on stdout at a configurable level.
- **FR-6.2** `maglevd` exposes Prometheus metrics for probe outcomes, probe latency, backend state transitions, VPP API traffic, and VPP LB sync mutations.
- **FR-6.3** A streaming gRPC API multiplexes log entries, backend transitions, and frontend aggregate transitions to any number of subscribers with per-subscriber filters.
- **FR-6.4** Per-VIP packet counters from VPP's stats segment are surfaced through both the gRPC API and the Prometheus surface.

**FR-7 Clients and peripheral tools**

- **FR-7.1** An interactive CLI (`maglevc`) provides a tab-completing shell and a one-shot command mode, both backed by the same command tree.
- **FR-7.2** A web frontend (`maglevd-frontend`) can multiplex more than one `maglevd` in a single process and present their combined state.
- **FR-7.3** The web frontend partitions its HTTP surface into a public read-only path (`/view/`) and an authenticated mutating path (`/admin/`). If credentials are not configured, `/admin/` MUST NOT be advertised (the path returns 404).
- **FR-7.4** An out-of-band tester (`maglevt`) probes configured VIPs from outside the control plane, measures latency, and tallies a configurable response header.

### Non-Functional Requirements

**NFR-1 Availability and reliability**

- **NFR-1.1** A `maglevd` outage MUST NOT stop the dataplane. While `maglevd` is absent, VPP continues to forward traffic with its last-programmed state.
- **NFR-1.2** Restarting `maglevd` with VPP up MUST NOT black-hole new flows during the probe warm-up window; this is enforced by the startup warmup state machine described under `maglevd`.
- **NFR-1.3** The warmup clock is tied to process start and MUST NOT be reset by VPP reconnects or configuration reloads.
- **NFR-1.4** A `maglevd`-side reload with a broken file MUST NOT interrupt any running probe.

**NFR-2 Determinism and correctness**

- **NFR-2.1** Two `maglevd` instances given the same configuration and the same backend state MUST issue the same sequence of `lb_as_add_del` calls to VPP, so that VPP's bucket assignment is stable across process swaps. This is the job of the deterministic AS ordering rule.
- **NFR-2.2** Configuration reload MUST be atomic: either every change in the new file takes effect, or none of them do.
- **NFR-2.3** Probe scheduling SHOULD apply bounded jitter so that, after a daemon restart or a configuration reload, probes do not phase-lock to the wall clock.
- **NFR-2.4** Operator mutations, event-driven syncs, and periodic full syncs against VPP MUST be serialized with respect to one another; they MUST NOT interleave.

**NFR-3 Performance and scalability**

- **NFR-3.1** Probing N backends costs roughly N goroutines doing mostly idle waits; there is no central probe scheduler.
- **NFR-3.2** Event fan-out, transition history, and per-subscriber event queues MUST all be bounded; no structure grows without limit under sustained load.
- **NFR-3.3** VPP stats snapshots are published as an atomic pointer so that Prometheus scrapes and gRPC counter reads are wait-free.
- **NFR-3.4** A gRPC subscriber that cannot keep up MUST be dropped rather than blocking the central fan-out.

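A minimal Go sketch of the NFR-3.3 publication pattern, using illustrative names rather than `maglevd`'s actual types: the writer builds a fresh immutable snapshot and swaps one pointer, and readers load whatever is current without ever blocking the writer.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// statsSnapshot is an immutable view of per-VIP counters. The stats
// reader goroutine builds a fresh snapshot on each cycle and swaps the
// pointer; scrapers load whichever snapshot is current (NFR-3.3).
type statsSnapshot struct {
	packetsPerVIP map[string]uint64
}

var current atomic.Pointer[statsSnapshot]

// publish atomically replaces the current snapshot.
func publish(s *statsSnapshot) { current.Store(s) }

// read never blocks; it may observe a snapshot one cycle stale.
func read() *statsSnapshot { return current.Load() }

func main() {
	publish(&statsSnapshot{packetsPerVIP: map[string]uint64{"192.0.2.1": 42}})
	fmt.Println(read().packetsPerVIP["192.0.2.1"]) // prints 42
}
```

The map inside a snapshot is never mutated after publication, which is what makes the lock-free read safe.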
**NFR-4 Security**

- **NFR-4.1** `maglevd` runs with only the Linux capabilities it actually needs: `CAP_NET_RAW` only when ICMP probes are in use, `CAP_SYS_ADMIN` only when `healthchecker.netns` is set.
- **NFR-4.2** gRPC transport security is explicitly out of scope; the daemon runs insecure by default and deployments SHOULD front it with a firewall, a trusted network, or a TLS-terminating sidecar.
- **NFR-4.3** The web frontend's mutating surface MUST be hidden entirely (HTTP 404) when either of its basic-auth environment variables is unset.

**NFR-5 Operability**

- **NFR-5.1** Every CLI flag on every binary SHOULD have an environment-variable equivalent so that the binaries can be driven purely through env in container deployments.
- **NFR-5.2** `maglevd --check` MUST provide a stable exit-code contract (0 / 1 / 2) for use by packaging scripts and `ExecStartPre` handlers.
- **NFR-5.3** Dashboards can track state in real time through the streaming event interface rather than by tight polling.
- **NFR-5.4** `maglevc` and `maglevd-frontend` MUST NOT maintain any authoritative state of their own; all truth lives in `maglevd`.

## Architecture Overview

### Process Model

The system ships as three independent executables plus one optional companion tester:

- **`maglevd`** — the long-running daemon. Hosts both the health checker and the VPP control plane.
- **`maglevc`** — short-lived CLI client.
- **`maglevd-frontend`** — long-running web dashboard (optional).
- **`maglevt`** — short-lived out-of-band probe TUI (optional).

VPP itself is a fourth moving part, but it is an external dependency, not part of the `vpp-maglev` codebase.

### Data Flow

Configuration flows **in** from a YAML file on disk (read by `maglevd`) and from runtime mutations issued over gRPC by `maglevc` or `maglevd-frontend`. Health state flows **out** of `maglevd` in three directions: into VPP (as AS weight changes), into Prometheus (as metrics), and into gRPC clients (as streaming events and snapshot reads). Traffic counters flow **back in** from VPP's stats segment and are surfaced through the same gRPC and Prometheus channels. No component writes to VPP except `maglevd`. No component serves `maglevd`'s state except `maglevd` itself.

## Components

### maglevd

`maglevd` is the entire control plane. It is a single Go process that bundles three internal concerns — a fleet of probe workers, a VPP reconciler, and a gRPC server — around one shared, versioned view of `(config, backend state, frontend state)`.

#### Responsibilities

- Load and validate configuration; accept reloads on `SIGHUP` (FR-5.3, FR-5.4).
- Run one health-check worker per backend defined in config (NFR-3.1).
- Maintain each backend's rise/fall counter and derive its state (FR-1.2, FR-1.6).
- Aggregate backend state into per-frontend state, honoring pool-based failover and per-backend operator overrides (FR-2.x, FR-3.x).
- Connect to VPP's binary API and stats socket, reconnecting automatically on disconnect (FR-4.4).
- Compute a desired VPP `lb` state from current configuration and health, and drive VPP to match it (FR-4.1, FR-4.2).
- Expose the whole picture through a gRPC service and a Prometheus `/metrics` endpoint (FR-6.x).

#### Probe Types and Intervals

Four probe types are supported (FR-1.1):

- **ICMP** — sends an echo request, expects a matching reply. This probe type MUST have access to a raw socket, which requires `CAP_NET_RAW` (NFR-4.1).
- **TCP** — establishes a TCP connection and immediately closes it. No payload is exchanged.
- **HTTP** — issues a request against a configured path, matches the response code against a configured numeric range, and optionally matches the response body against a regular expression.
- **HTTPS** — HTTP over TLS with configurable SNI and an option to skip certificate verification.

Each health check configures three candidate intervals (FR-1.3): the nominal `interval`, an optional faster `fast-interval` used while the counter is in its degraded zone, and an optional slower `down-interval` used while the backend is fully down. If an optional interval is not set, the nominal interval is used. Every scheduled sleep receives bounded random jitter; this is the mechanism that satisfies NFR-2.3.

Each probe also has a `timeout` (FR-1.4). The probe-level timeout bounds a single attempt; the interval bounds the time between the **start** of consecutive attempts, with the actual probe latency deducted from the next sleep so that slow probes do not push the schedule later and later.

If the configuration sets `healthchecker.netns`, every probe of every type MUST run inside that Linux network namespace (FR-1.5). Entering a netns requires `CAP_SYS_ADMIN`; without it, probes will fail and the backend will go down. This is a deliberate deployment choice, not a bug — see the security subsection below.

#### Rise/Fall State Machine

Each backend carries a single integer counter in the closed range `[0, rise + fall − 1]`. A backend is considered **Up** when the counter is at or above `rise`, and **Down** otherwise. A successful probe increments the counter, saturating at the maximum; a failing probe decrements it, saturating at zero. This is the HAProxy hysteresis model adapted to a single scalar (FR-1.2).

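A compact sketch of that counter, with illustrative names. The document specifies only the *state* after the first result (FR-1.6), not the counter value; jumping straight to a saturation bound is an assumption made here for simplicity:

```go
package main

import "fmt"

// riseFall models the single saturating counter described above:
// Up when counter >= rise, Down otherwise.
type riseFall struct {
	rise, fall int
	counter    int  // in [0, rise+fall-1]
	seen       bool // false until the first probe result (Unknown)
}

// observe folds one probe result into the counter. The first result
// forces a definitive state immediately (FR-1.6, assumed here to jump
// to a saturation bound); later results move one step at a time.
func (r *riseFall) observe(success bool) {
	top := r.rise + r.fall - 1
	if !r.seen {
		r.seen = true
		if success {
			r.counter = top
		} else {
			r.counter = 0
		}
		return
	}
	if success && r.counter < top {
		r.counter++
	} else if !success && r.counter > 0 {
		r.counter--
	}
}

func (r *riseFall) up() bool { return r.seen && r.counter >= r.rise }

func main() {
	b := &riseFall{rise: 3, fall: 2}
	b.observe(true) // first result: immediately Up
	fmt.Println(b.up()) // prints true
	b.observe(false) // one failure is hysteresis, not a transition
	fmt.Println(b.up()) // prints true
	b.observe(false)
	b.observe(false)
	fmt.Println(b.up()) // prints false
}
```
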
Four additional states overlay the rise/fall logic:

- **Unknown** — the backend has not yet produced any probe result since `maglevd` started (or since it was re-added by a reload). An Unknown backend contributes zero effective weight and the transition to Up or Down is taken on the *first* result rather than after `rise` or `fall` consecutive results (FR-1.6). This asymmetric rule lets fresh daemons discover the world quickly while still requiring hysteresis for steady-state flaps.
- **Paused** — operator override (FR-3.1). The probe worker is stopped and the counter is frozen. Effective weight is zero in every pool and every frontend that references the backend, but existing flows are not torn down; this is a soft drain.
- **Disabled** — operator override (FR-3.2). The probe worker is stopped and effective weight is zero in every pool and every frontend that references the backend. Unlike Paused, Disabled causes existing flows to be torn down on the next VPP sync (FR-4.3).
- **Removed** — the backend was deleted by a configuration reload. Its final transition is emitted on the event stream and then all references are dropped.

Backends declared **static** (no `healthcheck` reference in config, FR-1.7) bypass the rise/fall machinery entirely. They are not probed, their counter is not maintained, and they enter Up on startup via a single synthetic pass. They still participate in pool-failover weight computation like any other backend and still honor operator Pause and Disable overrides.

Operator overrides and operator weight mutations are held in process memory only. They survive a `SIGHUP` reload (FR-3.4) but do **not** survive a daemon restart (FR-3.5): when `maglevd` starts, the YAML file is the sole source of truth, and any earlier runtime mutation is gone. Operators who need durable changes must commit them to the configuration file.

#### Aggregation to Frontend State

A frontend references one or more named pools. Each referenced pool contains one or more backends with a per-reference configured weight in `[0, 100]` (FR-2.1). The effective weight that `maglevd` computes for a given `(frontend, pool, backend)` tuple is (FR-2.3):

- The configured weight, if the backend is Up **and** the backend's pool is the active pool (see below).
- Zero in every other case.

The active pool is the first pool, in configuration order, that contains at least one Up backend whose configured weight is non-zero (FR-2.2). If no pool is active (e.g. all backends are Down), every backend contributes zero weight and the frontend's aggregate state is Down. A frontend with no backends at all, or with every referenced backend still in Unknown, is itself Unknown. A frontend with at least one non-zero effective weight is Up (FR-2.4).

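The active-pool and effective-weight rules reduce to two small functions. Types and names below are illustrative, and operator overrides and static backends are deliberately left out of the sketch:

```go
package main

import "fmt"

// Illustrative types; the real maglevd structures differ.
type backendRef struct {
	name   string
	weight int // configured weight, in [0, 100]
}

type pool struct {
	name     string
	backends []backendRef
}

// activePool returns the index of the first pool, in configuration
// order, that contains an Up backend with non-zero configured weight
// (FR-2.2), or -1 when no pool qualifies.
func activePool(pools []pool, up map[string]bool) int {
	for i, p := range pools {
		for _, b := range p.backends {
			if up[b.name] && b.weight > 0 {
				return i
			}
		}
	}
	return -1
}

// effectiveWeight applies FR-2.3 to backend b living in pool poolIdx:
// the configured weight only when the backend is Up and its pool is
// the active one, zero in every other case.
func effectiveWeight(pools []pool, poolIdx int, b backendRef, up map[string]bool) int {
	if poolIdx == activePool(pools, up) && up[b.name] {
		return b.weight
	}
	return 0
}

func main() {
	pools := []pool{
		{name: "primary", backends: []backendRef{{"a", 100}, {"b", 100}}},
		{name: "standby", backends: []backendRef{{"c", 50}}},
	}
	up := map[string]bool{"a": false, "b": false, "c": true}
	fmt.Println(activePool(pools, up))                               // primary all-Down: standby is active
	fmt.Println(effectiveWeight(pools, 1, pools[1].backends[0], up)) // standby carries its configured weight
	up["a"] = true
	fmt.Println(effectiveWeight(pools, 1, pools[1].backends[0], up)) // primary reclaims; standby drops to zero
}
```
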
Whether effective weight zero also flushes existing flows depends on the cause (FR-4.3):

- Up in a non-active pool: weight zero, **no** flush (standby pool).
- Down while `flush-on-down` is true: weight zero, flush.
- Disabled: weight zero, flush, always.
- Paused or Unknown: weight zero, no flush.

#### VPP Reconciliation

`maglevd` treats VPP's LB configuration as a desired-state reconciliation target. The desired state is a pure function of `(current config, current backend state)`; the observed state is read back from VPP through the `lb` plugin's binary API. A sync operation diffs the two and issues the minimal set of `lb_vip_add_del`, `lb_as_add_del`, and `lb_as_set_weight` messages to make them match.

Two triggers drive a sync:

1. **Event-driven, single VIP** (FR-4.1). When the health checker emits a backend transition, the reconciler recomputes desired state for every frontend that references that backend and syncs those VIPs. This is the primary path for convergence during incidents.
2. **Periodic, full** (FR-4.2). A background loop runs a full sync on a configurable interval (default thirty seconds). This is the safety net that closes gaps left by missed events, VPP restarts, or bugs in the event path.

For determinism (NFR-2.1), whenever a sync operation iterates over ASes it does so in a total order defined by the numeric representation of the AS address, with IPv4 addresses ordered before IPv6. Two `maglevd` instances given the same input MUST therefore issue the same `lb_as_add_del` sequence, which in turn means VPP produces the same bucket-to-AS assignment regardless of which instance is driving.

Operator mutations, event-driven syncs, and periodic full syncs are serialized through a single mutex at the VPP-call boundary (NFR-2.4); they never interleave.

#### Startup Warmup and Restart Neutrality

A naive sync loop would, on restart, immediately synthesize a desired state in which every backend is Unknown, map every backend through the effective-weight rules to zero, and push "zero weight everywhere" into VPP before a single probe had completed. The result would be a multi-second black hole on every `maglevd` restart. NFR-1.2 forbids this, and the warmup state machine is how it is enforced.

The warmup has three phases, keyed off two configurable delays `startup-min-delay` (default five seconds) and `startup-max-delay` (default thirty seconds):

1. **Hands-off.** From process start to `startup-min-delay`, the reconciler MUST NOT write anything to VPP at all. Event-driven syncs are suppressed; the periodic full sync is suppressed.
2. **Per-VIP release.** From `startup-min-delay` to `startup-max-delay`, a VIP becomes eligible for sync the moment every backend it references has produced at least one probe result (i.e. none are Unknown). Eligible VIPs are released individually so that healthy VIPs converge as fast as their slowest backend, without being held back by unrelated slow VIPs.
3. **Watchdog.** At `startup-max-delay`, any VIPs still held are released unconditionally by a final full sync. This bounds the worst-case blackout to `startup-max-delay` rather than "as long as the slowest backend takes".

The warmup clock is tied to process start, not to VPP reconnect or configuration reload (NFR-1.3). Reconnecting to a flapping VPP does not re-enter warmup, and `SIGHUP` does not re-enter warmup.

Setting both delays to zero disables the warmup entirely, which is useful for tests but SHOULD NOT be done in production.

#### Configuration and Reload

Configuration lives in a single YAML file (FR-5.1), typically `/etc/vpp-maglev/maglev.yaml`. It is validated in two distinct phases (FR-5.2): a **parse** phase that catches YAML errors, and a **semantic** phase that enforces structural invariants such as:

- Every frontend whose VIPs share an address MUST use backends of the same address family (IPv4 or IPv6), because VPP picks an encap type per VIP and mixing families on one VIP is not supported.
- Every backend referenced by a frontend MUST exist.
- Every referenced health check MUST exist.
- Every pool referenced by a frontend MUST contain at least one backend (FR-2.1).
- VPP LB knobs MUST satisfy plugin constraints: `flow-timeout` in `[1s, 120s]`, `sticky-buckets-per-core` a power of two, `sync-interval` strictly positive, `startup-max-delay` not less than `startup-min-delay`.
- `transition-history` MUST be at least one.

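As an illustrative fragment that satisfies these invariants: the knob names (`flow-timeout`, `sticky-buckets-per-core`, `sync-interval`, `startup-min-delay`, `startup-max-delay`, `transition-history`) and the pool and reference rules come from this document, while the overall YAML layout is an assumption for illustration, not the authoritative schema.

```yaml
lb:
  flow-timeout: 40s              # must be in [1s, 120s]
  sticky-buckets-per-core: 1024  # must be a power of two
  sync-interval: 30s             # must be strictly positive
  startup-min-delay: 5s
  startup-max-delay: 30s         # must be >= startup-min-delay
transition-history: 128          # must be at least one
backends:
  web1: { address: 10.0.0.1, healthcheck: http-ok }
  web2: { address: 10.0.0.2 }    # no healthcheck: static, permanently Up
pools:
  primary:                       # a referenced pool may not be empty
    - { backend: web1, weight: 100 }
frontends:
  www:
    vip: 192.0.2.10
    pools: [primary]             # every referenced pool and backend must exist
```
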
`maglevd --check` runs both phases and exits with code 0 on success, 1 on parse errors, and 2 on semantic errors (NFR-5.2). This exit code contract is what packaging scripts and systemd `ExecStartPre` rely on.

On `SIGHUP` the same two-phase validation runs against the file on disk. If either phase fails, `maglevd` MUST log the error and leave the running configuration untouched (FR-5.4, NFR-1.4). On success, the delta is applied atomically (NFR-2.2): new backends spawn workers, removed backends have their workers stopped and emit a terminal `Removed` event, changed backends restart their workers, and metadata-only changes (address, weight, enable flag) are updated in place without restarting anything. Operator overrides (Pause, Disable) survive reloads (FR-3.4) but — to repeat the point from FR-3.5 — do **not** survive a daemon restart.

#### Lifecycle, Signals, and Security

`maglevd` handles three signals:

- **`SIGHUP`** triggers a configuration reload as described above.
- **`SIGTERM`** and **`SIGINT`** initiate a graceful shutdown: the gRPC server drains, stream subscribers are released, probe workers are cancelled, and the VPP connection is closed. VPP's last-programmed state is not torn down; traffic continues to flow (NFR-1.1).

`maglevd` requires two Linux capabilities, each tied to a specific feature (NFR-4.1):

- **`CAP_NET_RAW`** is required if and only if any configured health check is of type ICMP. Without it, raw-socket creation will fail and all ICMP probes will error out.
- **`CAP_SYS_ADMIN`** is required if and only if `healthchecker.netns` is set. The kernel's `setns(CLONE_NEWNET)` call requires it; without it, every probe will fail on namespace entry.

The shipped Debian unit grants both capabilities through `AmbientCapabilities` and `CapabilityBoundingSet`, which is why the package "just works" out of the box. Hand-run invocations SHOULD set capabilities explicitly (e.g. via `setcap`) rather than running as root.

`maglevd` does not secure its own gRPC listener (NFR-4.2). Operators SHOULD bind the listener to loopback, to a control-plane VRF, or behind a firewall, depending on their threat model. The design deliberately pushes transport security out of the binary on the theory that every deployment already has an answer for it.

#### Interfaces

**Presents.**

- **A gRPC service on a TCP listener** (default `:9090`). This is the *only* programmatic interface to `maglevd`. Every other component talks to `maglevd` through this interface and no other. The service has read-only methods (`List*`, `Get*`, `CheckConfig`), mutating methods (`PauseBackend`, `ResumeBackend`, `EnableBackend`, `DisableBackend`, `SetFrontendPoolBackendWeight`, `ReloadConfig`, `SyncVPPLBState`), and a single streaming method (`WatchEvents`) that multiplexes log entries and state transitions to any number of subscribers with per-subscriber filters (FR-6.3). gRPC reflection is enabled by default so that ad-hoc tooling can introspect the service.
- **A Prometheus `/metrics` HTTP endpoint** on a separate listener (default `:9091`) (FR-6.2). Counters are updated inline as probes run and VPP calls complete; gauges are computed on each scrape from the current checker and VPP state, so there is no sampling lag.
- **Structured JSON logs on stdout**, via `log/slog`, at a configurable level (FR-6.1). Key events — daemon start, config load, VPP connect/disconnect, backend transitions, LB sync mutations, warmup milestones — are logged at `info` or higher so that a default-level deployment has enough to post-mortem an incident.
- **Process exit codes** from `--check`: 0, 1, or 2 as described above (NFR-5.2). These form a small but load-bearing interface to packaging and systemd.

**Consumes.**

- **A YAML configuration file** on disk, passed via `--config` or `MAGLEV_CONFIG`. This is the declarative source of truth for intent; everything the operator mutates at runtime is a delta on top of it, and every runtime delta is lost on a daemon restart (FR-3.5).
- **VPP's binary API socket** (default `/run/vpp/api.sock`). The connection auto-reconnects on drop (FR-4.4), and while disconnected, the reconciler queues no work — the next periodic sync closes any gap.
- **VPP's stats segment socket** (default `/run/vpp/stats.sock`). Read periodically (five-second cadence) for per-VIP packet and byte counters (FR-6.4). Readers are non-blocking (NFR-3.3); a stale snapshot is always available.
- **The Linux kernel's namespace subsystem**, when `healthchecker.netns` is set. Requires `CAP_SYS_ADMIN`.
- **Raw sockets**, for ICMP probes. Requires `CAP_NET_RAW`.

### VPP Dataplane

The VPP dataplane is not part of the `vpp-maglev` codebase, but
it is the component every other piece revolves around, and its
contract with `maglevd` defines what `maglevd` is allowed to do.

#### Responsibilities

VPP's `lb` plugin implements Maglev consistent hashing in the
forwarding fast path. It owns:

- **Global configuration** — an IPv4 source address and an IPv6
source address used as the outer header for GRE-encapsulated
traffic to ASes, the number of sticky buckets per worker core,
and a per-flow idle timeout.
- **A set of VIPs**, each identified by an address prefix, an IP
protocol, and a port. A VIP carries an encap type (GRE4 or
GRE6, picked by the family of the AS addresses) and a flag
for source-IP sticky hashing.
- **A set of ASes per VIP**, each identified by address, with an
integer weight in `[0, 100]`, a `used`/`flushed` state, and a
bucket count derived from the Maglev ring.

It does **not** own: health, configuration intent, operator
overrides, transition history, or metrics. Those belong to
`maglevd`.

#### Interfaces

**Presents.**

- **A binary API** (GoVPP-style message exchange) for reading
and mutating VIP and AS state. `maglevd` is the sole user.
- **A stats segment** with per-VIP counters from the LB plugin
(existing-flow, first-flow, untracked, no-server) and
per-prefix FIB counters. The LB plugin bypasses the FIB for
forwarded packets, so per-backend traffic counters are not
available; this is a known limitation that operators consuming
metrics need to understand.
- **The forwarded-traffic fast path itself**, which is the whole
reason this project exists.

**Consumes.**

- `maglevd`'s binary-API writes — nothing else. There is no
third party programming `lb` state in a working deployment.

### maglevc

`maglevc` is the interactive and scripting CLI. It is a
short-lived client with no persistent state and no background
work (NFR-5.4).

#### Responsibilities

- Provide a human-readable tab-completing shell for `maglevd`
(FR-7.1).
- Dispatch one-shot commands for scripts and automation.
- Render state snapshots (frontends, backends, health checks,
VPP LB state, VPP counters) with optional ANSI color.
- Stream events in real time (`watch events`) with filters.

#### Interaction Model

With no positional arguments, `maglevc` starts a readline-based
REPL with a nested command tree: `show`, `set`, `watch`,
`config`, plus the usual `help`, `exit`, `quit`. Tab completion
is built from the same command tree the dispatcher uses, so
completion can never drift from the actual command set. With
positional arguments, `maglevc` executes one command against the
server and exits — in this mode color is off by default so that
pipes and logs stay clean, but `--color=true` can be set
explicitly.

#### Interfaces

**Presents.**

- **An interactive TTY shell** and a **one-shot command mode**.
Humans and scripts are the only consumers; there is no API,
no socket, no file output.

**Consumes.**

- **`maglevd`'s gRPC service**, over insecure credentials by
default. `maglevc` MUST NOT talk to VPP directly, MUST NOT
read the config file directly, and MUST NOT maintain any
state of its own across invocations (NFR-5.4). Everything it
shows and everything it mutates goes through the gRPC API.

### maglevd-frontend

`maglevd-frontend` is an optional web dashboard (FR-7.2). Unlike
`maglevc`, it is a long-running process: it holds open gRPC
streams, caches snapshots, and serves HTTP.

#### Responsibilities

- Connect to one or more `maglevd` servers simultaneously.
- Maintain a cached view of each server's state: frontends,
backends, health checks, VPP LB state, and VPP counters.
- Serve a SolidJS single-page application and a JSON API to
browsers.
- Stream live updates to browsers so that dashboards update
without polling (NFR-5.3).
- Expose an optional authenticated mutation surface (FR-7.3).

#### Multi-Server Multiplexing

A single `maglevd-frontend` process accepts a comma-separated
list of gRPC server addresses. For each one, it runs an
independent pool of goroutines: one to stream events, one to
refresh list-oriented data on a roughly one-second cadence, one
to refresh per-health-check detail, and one (debounced on
incoming events) to refresh VPP LB state and counters. Failures
on one server MUST NOT block the others, and the served JSON
state always reports per-server connection status so that the
SPA can mark partially-available views.

All per-server event streams publish into a single shared event
broker with a bounded replay buffer (capped both in time and in
event count, satisfying NFR-3.2). The broker assigns each event
a monotonic `epoch-seq` identifier so that browsers reconnecting
a dropped Server-Sent-Events stream can resume from where they
left off without a full refresh — and so that a broker restart,
which reshuffles the epoch, forces a full refresh rather than
silently handing out ambiguous IDs.

#### Read-Only and Admin Surfaces

The HTTP surface is partitioned into two paths (FR-7.3):

- **`/view/`** serves the SPA and a read-only JSON API. It is
always publicly accessible: there is no auth, and there are
no mutation endpoints under it at all. The design intent is
that `/view/` can be exposed to a broader audience (NOC,
management UIs, screens on walls) without risk.
- **`/admin/`** serves the SPA entry point and the mutating
JSON API behind HTTP basic auth. Credentials come from
`MAGLEV_FRONTEND_USER` and `MAGLEV_FRONTEND_PASSWORD`. If
either is unset or empty, the `/admin/` path MUST return 404
(NFR-4.3) — the admin surface is not merely locked, it is
not advertised. This makes accidental exposure self-limiting:
forgetting to set the env vars disables admin rather than
leaving it open.

Both surfaces talk to the same underlying cache; the difference
is only what endpoints exist.

#### Interfaces

**Presents.**

- **An HTTP listener** (default `:8080`) serving:
  - `/view/` — the SolidJS SPA (embedded in the binary).
  - `/view/api/*` — read-only JSON endpoints for version,
    server list, aggregated state, and per-server state.
  - `/view/api/events` — an SSE stream bridged from the
    internal event broker, with `Last-Event-ID` replay.
  - `/admin/` — the SPA entry point, gated on basic auth.
  - `/admin/api/*` — mutating JSON endpoints that translate
    to gRPC mutations against the appropriate `maglevd`.
  - `/healthz` — a liveness probe.

**Consumes.**

- **One or more `maglevd` gRPC services.** As with `maglevc`,
this is the *only* way `maglevd-frontend` reaches into the
system. It MUST NOT read the YAML config file and MUST NOT
talk to VPP directly (NFR-5.4).
- **Two environment variables**, `MAGLEV_FRONTEND_USER` and
`MAGLEV_FRONTEND_PASSWORD`, for the optional admin surface.

### maglevt

`maglevt` is a small out-of-band probe TUI (FR-7.4). It is not
part of the control loop at all; it is a validation tool that
an operator runs on a laptop, a jump host, or a monitoring box
to see VIPs the way a client sees them.

#### Responsibilities

- Read one or more `maglev.yaml` files and enumerate TCP-style
VIPs from the `frontends` section.
- Probe each VIP at a configurable interval with a real HTTP or
HTTPS request against a configurable path.
- Measure latency (min/max/average and a handful of
percentiles) and success rate over a rolling window.
- Tally the value of a configurable response header (by
default, `X-IPng-Frontend`) so that operators can see which
backend actually served each request. Because keep-alives are
disabled by default, this tally reflects fresh Maglev hashing
decisions rather than a pinned connection.

#### Scope Boundary

`maglevt` is intentionally decoupled from `maglevd`. It does
not talk gRPC, it does not read the VPP stats segment, and it
does not know or care whether the target VIPs are actually
served by the `vpp-maglev` control plane at all — it simply
probes addresses. This makes it useful in at least three
scenarios: validating a `maglevd` restart end-to-end from a
client perspective, debugging pool failover by watching the
header tally reshuffle, and sanity-checking that a given VIP is
reachable across deployments when the gRPC control plane is
unavailable or out of reach.

#### Interfaces

**Presents.**

- **A full-screen TUI** built on Bubble Tea, with a
deterministic grid layout and a few interactive toggles (e.g.
reverse-DNS lookup). There is no machine-readable output; if
you need metrics, use Prometheus on `maglevd`.

**Consumes.**

- **One or more YAML configuration files**, which it parses
with the same library `maglevd` uses. Only the subset of the
schema describing frontends is actually consumed; unknown
fields are ignored. Duplicate VIPs discovered across files
are de-duplicated by `(scheme, address, port)` so that
multi-file deployments don't double-probe.
- **The outbound network**, directly. No special capabilities
are required — `maglevt` is a plain HTTP client.

## Operational Concerns

### Configuration Reload Semantics

Reload is triggered by `SIGHUP` to `maglevd`, or by the
`ReloadConfig` gRPC method. Both paths run the same validation
as `--check`. A reload MUST NOT partially apply (NFR-2.2):
either every change in the new file takes effect, or none of
them do. A reload MUST NOT restart unchanged probe workers; the
probe state machine is preserved precisely because operators
use reloads as a routine operation and expect backends whose
health-check definitions did not change to simply keep running.

Operator overrides (Pause, Disable) survive a reload as long as
the backend still exists in the new config (FR-3.4). A backend
that disappears from the new config transitions to `Removed`
and its worker is stopped; if it reappears in a later reload it
starts again in `Unknown` with a fresh counter.

A daemon **restart** is different from a reload. On restart,
the YAML configuration is the sole source of truth: every
runtime override is gone, every runtime weight mutation is gone
(FR-3.5). Operators who need an override to persist across
restarts must commit the intended state to the config file.

### Failure Modes

- **VPP restart.** `maglevd` detects the disconnect, enters a
reconnect loop, and on reconnect reads VPP's version and
current state (FR-4.4). The warmup clock is not reset by VPP
reconnects (NFR-1.3) — a flapping VPP does not cause
`maglevd` to go hands-off every time. The next periodic full
sync pushes the current desired state into the freshly
restarted plugin.
- **`maglevd` restart with VPP up.** Handled by the warmup
state machine (NFR-1.2): new flows see the last-programmed
weights until probes catch up, not zeros.
- **`maglevd` restart with VPP also down.** VPP comes back
first, `maglevd` comes back second, warmup gates pushing
anything until probes converge. This is the worst-case path,
bounded by `startup-max-delay`.
- **Configuration reload with a broken file.** The reload is
rejected; the running configuration is retained; an error
is logged (FR-5.4). No probes are interrupted (NFR-1.4).
- **Probe namespace disappears.** Entering the namespace fails,
the probe is counted as a failure, and the backend
eventually transitions Down under normal rise/fall rules.
There is no special-case handling; this is by design, because
an operator removing the netns while `maglevd` is running is
an operational error that SHOULD manifest as a visible Down,
not as silent success.
- **gRPC subscriber too slow.** Per-subscriber event queues
are bounded (NFR-3.2). A subscriber that cannot keep up MUST
be dropped rather than backing up the central fan-out
(NFR-3.4).
- **Mid-flight weight mutation during sync.** Operator weight
changes and reconciler sync both route through the same
state-protected code path, so mutations are serialized rather
than interleaved with VPP writes (NFR-2.4).

### Observability

**Structured logging** (FR-6.1). All logs are slog-formatted
JSON written to stdout. The default level is `info`, which is
sized to produce one or two lines per incident rather than per
probe. The `debug` level dumps every probe attempt and every
VPP binary-API message, and is intended for post-mortem
investigation.

**Prometheus metrics** (FR-6.2, FR-6.4). `maglevd` exposes two
kinds of metric: inline metrics (probe-outcome counters,
probe-latency histograms, backend state-transition counters,
and VPP API and LB sync counters), updated as the events
happen; and on-demand gauges for current backend state,
rise/fall counter values, configured weights, VPP connection
status, VPP uptime, VPP info labels, and per-VIP LB plugin
counters. Gauges are sampled from live state on every scrape,
so there is no sampling staleness.

**Streaming events** (FR-6.3). The gRPC `WatchEvents` method
multiplexes three event families into one stream: log events
(the same structured logs the daemon writes to stdout), backend
transitions (one per affected frontend, since a single backend
may participate in multiple frontends), and frontend aggregate
transitions (Up/Down/Unknown flips at the frontend level).
Clients MAY filter by event family and by minimum log level.
The web frontend consumes this stream and re-publishes it to
browsers over SSE, with an epoch-seq replay buffer layered on
top.

### Security and Capabilities

`maglevd` needs `CAP_NET_RAW` for ICMP probes and
`CAP_SYS_ADMIN` for netns entry (NFR-4.1). Neither is optional
for the feature that needs it, and neither is required
otherwise; operators who use neither feature MAY run `maglevd`
as an unprivileged user with no capabilities at all.

`maglevd-frontend` needs no special capabilities — it is a
plain HTTP client of `maglevd` and a plain HTTP server for
browsers. It does handle user credentials (basic auth), which
are read from the environment and held in process memory;
operators SHOULD terminate the frontend behind a TLS reverse
proxy if it is exposed beyond a trusted network.

`maglevc` and `maglevt` need no special capabilities.

All gRPC traffic runs insecure by default (NFR-4.2). Securing
transport is an operational decision, not a build-time one;
deployments that require mTLS SHOULD terminate gRPC at a
sidecar or colocate control and data plane on a trusted
segment.

### Concurrency Model

The concurrency model inside `maglevd` is deliberately simple
and local:

- Each backend owns exactly one probe worker goroutine
(NFR-3.1). Workers do not share state with each other.
- All events — transitions and log records — travel through a
single central channel which is then fanned out to bounded
per-subscriber queues (NFR-3.2). The fan-out is the only
place where multiple subscribers can observe the same event.
- The configuration pointer is swapped atomically on reload
(NFR-2.2); readers take a read lock for the duration of a
single access, so the live config is always internally
consistent even mid-reload.
- The VPP stats snapshot is published as an atomic pointer
(NFR-3.3), so Prometheus scrapes and gRPC reads of counters
are wait-free.
- Reconciliation holds a mutex around VPP calls, which
serializes operator mutations, event-driven syncs, and
periodic full syncs against each other (NFR-2.4). This is
intentional: the order in which VPP sees mutations matters
for determinism, and serializing them is cheap at the scale
of control-plane events.

Deadlock avoidance is structural rather than audited:
dependencies between subsystems are one-way. The checker does
not call into VPP; the reconciler reads checker state and calls
VPP; VPP never calls back. `maglevd-frontend` and `maglevc`
only read from `maglevd` over gRPC. There is no cycle in the
wait-for graph.

## Alternatives Considered

This is a retrofit of a shipped system, so the alternatives
here are the ones the code actively rejects, not speculative
designs.

- **Several probe schedulers sharing one goroutine pool.**
Rejected in favor of one goroutine per backend. The
per-backend model is trivially correct, has no shared state,
and scales linearly with backend count at a cost of a few
kilobytes per backend.
- **`maglevd-frontend` as a sidecar per `maglevd`.** Rejected
in favor of one frontend speaking to many daemons. A single
dashboard pane across a fleet is the common operator
request; pushing multi-server logic into the frontend keeps
the daemon simple.
- **Operator actions expressed as config edits plus SIGHUP.**
Rejected in favor of direct gRPC mutations. Pausing a
backend during an incident should not require editing a
file, and the effect should survive subsequent reloads
(FR-3.4) — though, by deliberate design, not a daemon
restart (FR-3.5).
- **Persisting operator overrides across daemon restarts.**
Rejected in favor of making the YAML config file the sole
source of truth on startup (FR-3.5). Persisting runtime
overrides would require an on-disk side store and a clear
policy for what happens when the side store and the config
file disagree; keeping the daemon stateless on startup is
simpler and harder to get wrong.
- **Synchronous full sync after every transition.** Rejected
in favor of event-driven single-VIP syncs with a periodic
full sync as a safety net (FR-4.1, FR-4.2). Full syncs are
cheap but not free, and the blast radius of a transient bug
in the desired-state computation is smaller when
per-transition work only touches one VIP.
- **Letting `maglevt` read `maglevd`'s gRPC.** Rejected in
favor of reading VIPs from the YAML file directly, so that
`maglevt` remains useful when `maglevd` itself is the thing
being investigated.

## Open Questions

- **Mutual TLS for gRPC.** Currently insecure by default. A
future version may wire in standard mTLS support once a
credential-management story is picked.
- **Per-AS traffic counters.** The VPP `lb` plugin bypasses
the FIB and therefore does not produce per-AS traffic
counters. Surfacing real per-backend byte/packet counts
would require a VPP-side change.
- **High-availability of the control plane.** Two `maglevd`
instances on the same VPP would interleave writes harmlessly
thanks to determinism (NFR-2.1), but there is no leader
election and no formal story about which instance owns which
VIPs. Today, operators run a single `maglevd` per VPP host.