diff --git a/README.md b/README.md index 1d9022e..72ec682 100644 --- a/README.md +++ b/README.md @@ -2,7 +2,7 @@ Health checker, gRPC control plane, CLI, and web dashboard for the VPP `lb` (load-balancer) plugin. Runs as a set of three binaries under one -Debian package: +Debian package, plus an out-of-band tester built alongside: - **`maglevd`** — the long-running health-checker daemon. Probes backends (HTTP, TCP, ICMP), tracks their aggregate state, programs the VPP @@ -14,6 +14,12 @@ Debian package: SolidJS Single-Page-App; connects to one or more maglevds over gRPC and serves a live HTTP view (read-only `/view/` and optional basic-auth `/admin/` with mutating commands). +- **`maglevt`** — optional out-of-band VIP probe TUI. Reads a + `maglev.yaml` and hits each frontend on a live HTTP path, reporting + latency and a configurable response-header tally so operators can see + failover as it happens. Does not talk gRPC; useful for validating a + `maglevd` restart end-to-end from a client perspective. Built by + `make` but not installed by the Debian package. ## Build and install @@ -94,6 +100,9 @@ deployments. ## Documentation +- [docs/design.md](docs/design.md) — architecture, components, and + numbered functional / non-functional requirements. Start here if + you want the big picture before diving into the code. - A minimal configuration file in [debian/maglev.yaml](debian/maglev.yaml) shows every knob. - [docs/user-guide.md](docs/user-guide.md) — flags, signals, and diff --git a/docs/design.md b/docs/design.md new file mode 100644 index 0000000..22c43b3 --- /dev/null +++ b/docs/design.md @@ -0,0 +1,1076 @@ +# vpp-maglev Design Document + +## Metadata + +| | | +| --- | --- | +| **Status** | Retrofit — describes shipped behavior as of `v0.9.5` | +| **Author** | Pim van Pelt `` | +| **Last updated** | 2026-04-15 | +| **Audience** | Operators and contributors who will read the source tree next | + +The key words **MUST**, **MUST NOT**, **SHOULD**, **SHOULD NOT**, and +**MAY** are used as described in +[RFC 2119](https://datatracker.ietf.org/doc/html/rfc2119), and are +reserved in this document for requirements that are actually enforced +in code or by an external dependency. Plain-language descriptions of +what the system or an operator can do are written in lowercase — +"can", "will", "does" — and should not be read as normative. + +## Summary + +`vpp-maglev` is a control plane for the VPP `lb` (Maglev load +balancer) plugin. A single daemon — `maglevd` — probes a fleet of +backends, maintains an authoritative view of their health, and +programs the VPP dataplane so that traffic hashed to a given VIP +lands only on healthy backends. Operators drive the system through +`maglevc` (an interactive CLI) or `maglevd-frontend` (a read-only +web dashboard with an optional authenticated admin surface). A small +companion binary, `maglevt`, validates VIPs from outside the control +plane by sending live HTTP probes and reporting failover behavior. + +## Background + +VPP's `lb` plugin implements Maglev consistent hashing inside the +dataplane: a VIP is backed by a pool of Application Servers (ASes), +each with an integer weight in `[0, 100]`, and incoming flows are +hashed onto a bucket ring so that weight changes disturb as few +existing flows as possible. The plugin knows nothing about backend +health; if an AS dies while it holds buckets, traffic to those +buckets is black-holed until something external tells `lb` to remove +or re-weight the AS. + +`vpp-maglev` is that external thing. 
Before `vpp-maglev`, operators +maintained VIP configurations by hand and reacted to incidents with +`vppctl`. The project replaces that loop with a daemon that owns the +health story, reconciles it with the dataplane, and exposes the +result through a uniform gRPC API so that CLIs, dashboards, and +scripts all read the same source of truth. + +## Goals and Non-Goals + +### Product Goals + +1. **Accurate backend health.** Detect that a backend is up, + degraded, or down quickly enough to keep user-visible error rates + low, and avoid flapping under transient faults. +2. **Correct VPP state.** The set of VIPs and per-AS weights in VPP + converges to the configured intent, filtered by current health, + for every supported failure mode. +3. **Restart neutrality.** Restarting `maglevd` with VPP already up + MUST NOT cause traffic to be black-holed while health probes warm + up. +4. **Operator control.** A human can pause, drain, or weight-shift + a backend in seconds without editing config files. +5. **Uniform observability.** Every state transition, VPP API call, + and probe result is emitted as a structured log, a Prometheus + metric, or a streaming event — ideally all three. +6. **One source of truth.** Every other component (CLI, web + frontend, scripts) reads `maglevd` through one typed interface. + There is no secondary control plane. + +### Non-Goals + +- `vpp-maglev` is not a VPP installer or packaging layer. It assumes + VPP is already running with the `lb` plugin loaded. +- It does not implement its own dataplane fast path. All forwarding + stays in VPP; `maglevd` only programs the plugin. +- It is not a generic service mesh. There is no L7 routing, cert + issuance, service discovery, or east-west policy — only VIPs, + pools, and backends. +- It is not a config store. Configuration is a YAML file on disk; + the gRPC API can check and reload it but cannot author it. +- It does not secure its own transport. gRPC runs insecure by + default; TLS, mTLS, or firewalls are the operator's + responsibility. + +## Requirements + +Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) +so that later sections can cite it. + +### Functional Requirements + +**FR-1 Health checking** + +- **FR-1.1** The system supports ICMP, TCP, HTTP, and HTTPS health + checks, each with its own protocol-specific success criteria. +- **FR-1.2** Each health check MUST apply HAProxy rise/fall + semantics with operator-configurable thresholds. +- **FR-1.3** A health check MAY declare distinct `interval`, + `fast-interval`, and `down-interval` values so that recovery from + a degraded or down state is faster than steady-state polling. +- **FR-1.4** Each probe attempt is bounded by a configurable + per-probe timeout, independent of the scheduling interval. +- **FR-1.5** If the configuration sets `healthchecker.netns`, every + probe MUST execute inside the named Linux network namespace. +- **FR-1.6** The first probe result against a newly-created backend + forces an immediate transition out of `Unknown`, without waiting + for `rise` or `fall` consecutive results. +- **FR-1.7** A backend MAY omit its `healthcheck` reference to + declare itself **static**. A static backend is not probed and is + treated as permanently Up; it still participates in pool failover + and still honors operator Pause and Disable overrides. + +**FR-2 Aggregation and pool failover** + +- **FR-2.1** A frontend MAY reference one or more named pools. 
+ Each referenced pool MUST contain at least one + `(backend, configured-weight)` tuple; an empty pool is a + configuration error and is rejected at load time. +- **FR-2.2** At any time, exactly one pool — the first, in + configuration order, that contains a healthy backend with + non-zero configured weight — is active; backends in other pools + contribute zero effective weight. +- **FR-2.3** The effective weight of a `(frontend, pool, backend)` + tuple is the configured weight when the backend is Up **and** the + pool is active, and zero in every other case. +- **FR-2.4** A frontend's aggregate state is Up when at least one + backend has non-zero effective weight, Unknown when every + referenced backend is still Unknown (or the frontend references + no backends), and Down otherwise. + +**FR-3 Operator control** + +- **FR-3.1** Operators can pause and resume individual backends at + runtime. Pausing stops the probe worker, freezes the rise/fall + counter, and drives effective weight to zero in **every** pool + and **every** frontend that references the backend. Existing + flows are not torn down; this is a soft drain. +- **FR-3.2** Operators can disable and re-enable individual + backends at runtime. Disabling drives effective weight to zero in + **every** pool and **every** frontend that references the + backend, and MUST cause existing flows to be torn down on the + next VPP sync. +- **FR-3.3** Operators can set the configured weight of a specific + `(frontend, pool, backend)` tuple at runtime. +- **FR-3.4** Operator overrides (Pause, Disable) and operator + weight mutations survive a configuration **reload** (`SIGHUP`) + as long as the underlying backend and tuple still exist in the + new configuration. +- **FR-3.5** Operator overrides and operator weight mutations do + **not** survive a `maglevd` **restart**. After a restart, the + YAML configuration file is authoritative for every backend and + every tuple: paused backends come back unpaused, disabled + backends come back enabled, mutated weights revert to the + configured value. Operators who need persistent changes must + edit the config file. + +**FR-4 VPP reconciliation** + +- **FR-4.1** For every backend state transition that changes an + effective weight, `maglevd` pushes the resulting AS state into + VPP for every affected VIP. +- **FR-4.2** `maglevd` runs a periodic full reconciliation on a + configurable cadence (default thirty seconds) as a safety net + against missed events and VPP restarts. +- **FR-4.3** Weight-to-zero is communicated to VPP as a graceful + drain by default; transitions to Disabled and transitions to + Down while `flush-on-down` is true MUST tear existing flows down + on the next sync. +- **FR-4.4** `maglevd` tolerates VPP disconnects by auto-reconnecting + and resuming reconciliation once the connection is + re-established. + +**FR-5 Configuration** + +- **FR-5.1** Configuration is loaded from a single YAML file + specified at startup and referenced by all later operations. +- **FR-5.2** Configuration validation distinguishes **parse + errors** (malformed YAML) from **semantic errors** (structural + invariants) and MUST report each with its own exit code from + `--check`: 0 (OK), 1 (parse), 2 (semantic). +- **FR-5.3** `maglevd` reloads its configuration on `SIGHUP` + without restarting the process, without restarting unchanged + probe workers, and without losing operator overrides (see + FR-3.4). 
+- **FR-5.4** A parse or semantic error encountered during reload + MUST leave the running configuration in place. +- **FR-5.5** The same validation and reload semantics are also + reachable through gRPC (`CheckConfig`, `ReloadConfig`). + +**FR-6 Observability** + +- **FR-6.1** All logs are emitted as structured JSON on stdout at + a configurable level. +- **FR-6.2** `maglevd` exposes Prometheus metrics for probe + outcomes, probe latency, backend state transitions, VPP API + traffic, and VPP LB sync mutations. +- **FR-6.3** A streaming gRPC API multiplexes log entries, backend + transitions, and frontend aggregate transitions to any number of + subscribers with per-subscriber filters. +- **FR-6.4** Per-VIP packet counters from VPP's stats segment are + surfaced through both the gRPC API and the Prometheus surface. + +**FR-7 Clients and peripheral tools** + +- **FR-7.1** An interactive CLI (`maglevc`) provides a + tab-completing shell and a one-shot command mode, both backed + by the same command tree. +- **FR-7.2** A web frontend (`maglevd-frontend`) can multiplex more + than one `maglevd` in a single process and present their + combined state. +- **FR-7.3** The web frontend partitions its HTTP surface into a + public read-only path (`/view/`) and an authenticated mutating + path (`/admin/`). If credentials are not configured, `/admin/` + MUST NOT be advertised (the path returns 404). +- **FR-7.4** An out-of-band tester (`maglevt`) probes configured + VIPs from outside the control plane, measures latency, and + tallies a configurable response header. + +### Non-Functional Requirements + +**NFR-1 Availability and reliability** + +- **NFR-1.1** A `maglevd` outage MUST NOT stop the dataplane. + While `maglevd` is absent, VPP continues to forward traffic + with its last-programmed state. +- **NFR-1.2** Restarting `maglevd` with VPP up MUST NOT black-hole + new flows during the probe warm-up window; this is enforced by + the startup warmup state machine described under `maglevd`. +- **NFR-1.3** The warmup clock is tied to process start and MUST + NOT be reset by VPP reconnects or configuration reloads. +- **NFR-1.4** A `maglevd`-side reload with a broken file MUST NOT + interrupt any running probe. + +**NFR-2 Determinism and correctness** + +- **NFR-2.1** Two `maglevd` instances given the same configuration + and the same backend state MUST issue the same sequence of + `lb_as_add_del` calls to VPP, so that VPP's bucket assignment is + stable across process swaps. This is the job of the + deterministic AS ordering rule. +- **NFR-2.2** Configuration reload MUST be atomic: either every + change in the new file takes effect, or none of them do. +- **NFR-2.3** Probe scheduling SHOULD apply bounded jitter so + that, after a daemon restart or a configuration reload, probes + do not phase-lock to the wall clock. +- **NFR-2.4** Operator mutations, event-driven syncs, and + periodic full syncs against VPP MUST be serialized with respect + to one another; they MUST NOT interleave. + +**NFR-3 Performance and scalability** + +- **NFR-3.1** Probing N backends costs roughly N goroutines doing + mostly idle waits; there is no central probe scheduler. +- **NFR-3.2** Event fan-out, transition history, and + per-subscriber event queues MUST all be bounded; no structure + grows without limit under sustained load. +- **NFR-3.3** VPP stats snapshots are published as an atomic + pointer so that Prometheus scrapes and gRPC counter reads are + wait-free. 
+- **NFR-3.4** A gRPC subscriber that cannot keep up MUST be + dropped rather than blocking the central fan-out. + +**NFR-4 Security** + +- **NFR-4.1** `maglevd` runs with only the Linux capabilities it + actually needs: `CAP_NET_RAW` only when ICMP probes are in use, + `CAP_SYS_ADMIN` only when `healthchecker.netns` is set. +- **NFR-4.2** gRPC transport security is explicitly out of scope; + the daemon runs insecure by default and deployments SHOULD + front it with a firewall, a trusted network, or a + TLS-terminating sidecar. +- **NFR-4.3** The web frontend's mutating surface MUST be hidden + entirely (HTTP 404) when either of its basic-auth environment + variables is unset. + +**NFR-5 Operability** + +- **NFR-5.1** Every CLI flag on every binary SHOULD have an + environment-variable equivalent so that the binaries can be + driven purely through env in container deployments. +- **NFR-5.2** `maglevd --check` MUST provide a stable exit-code + contract (0 / 1 / 2) for use by packaging scripts and + `ExecStartPre` handlers. +- **NFR-5.3** Dashboards can track state in real time through the + streaming event interface rather than by tight polling. +- **NFR-5.4** `maglevc` and `maglevd-frontend` MUST NOT maintain + any authoritative state of their own; all truth lives in + `maglevd`. + +## Architecture Overview + +### Process Model + +The system ships as three independent executables plus one optional +companion tester: + +- **`maglevd`** — the long-running daemon. Hosts both the health + checker and the VPP control plane. +- **`maglevc`** — short-lived CLI client. +- **`maglevd-frontend`** — long-running web dashboard (optional). +- **`maglevt`** — short-lived out-of-band probe TUI (optional). + +VPP itself is a fourth moving part, but it is an external +dependency, not part of the `vpp-maglev` codebase. + +### Data Flow + +Configuration flows **in** from a YAML file on disk (read by +`maglevd`) and from runtime mutations issued over gRPC by `maglevc` +or `maglevd-frontend`. Health state flows **out** of `maglevd` in +three directions: into VPP (as AS weight changes), into Prometheus +(as metrics), and into gRPC clients (as streaming events and +snapshot reads). Traffic counters flow **back in** from VPP's stats +segment and are surfaced through the same gRPC and Prometheus +channels. No component writes to VPP except `maglevd`. No component +serves `maglevd`'s state except `maglevd` itself. + +## Components + +### maglevd + +`maglevd` is the entire control plane. It is a single Go process +that bundles three internal concerns — a fleet of probe workers, a +VPP reconciler, and a gRPC server — around one shared, versioned +view of `(config, backend state, frontend state)`. + +#### Responsibilities + +- Load and validate configuration; accept reloads on `SIGHUP` + (FR-5.3, FR-5.4). +- Run one health-check worker per backend defined in config + (NFR-3.1). +- Maintain each backend's rise/fall counter and derive its state + (FR-1.2, FR-1.6). +- Aggregate backend state into per-frontend state, honoring + pool-based failover and per-backend operator overrides + (FR-2.x, FR-3.x). +- Connect to VPP's binary API and stats socket, reconnecting + automatically on disconnect (FR-4.4). +- Compute a desired VPP `lb` state from current configuration and + health, and drive VPP to match it (FR-4.1, FR-4.2). +- Expose the whole picture through a gRPC service and a Prometheus + `/metrics` endpoint (FR-6.x). 
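+
+As a mental model for how the configuration objects described in the
+following subsections reference one another (backends name a health
+check, pools list backends with per-reference weights, frontends list
+pools in priority order), a minimal configuration has roughly the
+shape below. The key spellings are illustrative only; the shipped
+`debian/maglev.yaml` remains the authoritative reference for the
+schema.
+
+```yaml
+# Illustrative shape only; see debian/maglev.yaml for the real keys.
+healthchecks:
+  http-ok:
+    type: http
+    interval: 3s
+    timeout: 1s
+    rise: 3
+    fall: 2
+backends:
+  web1:
+    address: 192.0.2.10
+    healthcheck: http-ok  # omit to declare a static backend (FR-1.7)
+  web2:
+    address: 192.0.2.11
+    healthcheck: http-ok
+pools:
+  primary:
+    - backend: web1
+      weight: 100
+    - backend: web2
+      weight: 100
+frontends:
+  www:
+    vip: 203.0.113.1/32
+    protocol: tcp
+    port: 80
+    pools: [primary]      # first pool with a healthy backend is active
+```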
+ +#### Probe Types and Intervals + +Four probe types are supported (FR-1.1): + +- **ICMP** — sends an echo request, expects a matching reply. This + probe type MUST have access to a raw socket, which requires + `CAP_NET_RAW` (NFR-4.1). +- **TCP** — establishes a TCP connection and immediately closes + it. No payload is exchanged. +- **HTTP** — issues a request against a configured path, matches + the response code against a configured numeric range, and + optionally matches the response body against a regular + expression. +- **HTTPS** — HTTP over TLS with configurable SNI and an option to + skip certificate verification. + +Each health check configures three candidate intervals (FR-1.3): +the nominal `interval`, an optional faster `fast-interval` used +while the counter is in its degraded zone, and an optional slower +`down-interval` used while the backend is fully down. If an +optional interval is not set, the nominal interval is used. Every +scheduled sleep receives bounded random jitter; this is the +mechanism that satisfies NFR-2.3. + +Each probe also has a `timeout` (FR-1.4). The probe-level timeout +bounds a single attempt; the interval bounds the time between the +**start** of consecutive attempts, with the actual probe latency +deducted from the next sleep so that slow probes do not push the +schedule later and later. + +If the configuration sets `healthchecker.netns`, every probe of +every type MUST run inside that Linux network namespace (FR-1.5). +Entering a netns requires `CAP_SYS_ADMIN`; without it, probes will +fail and the backend will go down. This is a deliberate deployment +choice, not a bug — see the security subsection below. + +#### Rise/Fall State Machine + +Each backend carries a single integer counter in the closed range +`[0, rise + fall − 1]`. A backend is considered **Up** when the +counter is at or above `rise`, and **Down** otherwise. A successful +probe increments the counter, saturating at the maximum; a failing +probe decrements it, saturating at zero. This is the HAProxy +hysteresis model adapted to a single scalar (FR-1.2). + +Four additional states overlay the rise/fall logic: + +- **Unknown** — the backend has not yet produced any probe result + since `maglevd` started (or since it was re-added by a reload). + An Unknown backend contributes zero effective weight and the + transition to Up or Down is taken on the *first* result rather + than after `rise` or `fall` consecutive results (FR-1.6). This + asymmetric rule lets fresh daemons discover the world quickly + while still requiring hysteresis for steady-state flaps. +- **Paused** — operator override (FR-3.1). The probe worker is + stopped and the counter is frozen. Effective weight is zero in + every pool and every frontend that references the backend, but + existing flows are not torn down; this is a soft drain. +- **Disabled** — operator override (FR-3.2). The probe worker is + stopped and effective weight is zero in every pool and every + frontend that references the backend. Unlike Paused, Disabled + causes existing flows to be torn down on the next VPP sync + (FR-4.3). +- **Removed** — the backend was deleted by a configuration reload. + Its final transition is emitted on the event stream and then + all references are dropped. + +Backends declared **static** (no `healthcheck` reference in +config, FR-1.7) bypass the rise/fall machinery entirely. They are +not probed, their counter is not maintained, and they enter Up on +startup via a single synthetic pass. 
They still participate in +pool-failover weight computation like any other backend and still +honor operator Pause and Disable overrides. + +Operator overrides and operator weight mutations are held in +process memory only. They survive a `SIGHUP` reload (FR-3.4) but +do **not** survive a daemon restart (FR-3.5): when `maglevd` +starts, the YAML file is the sole source of truth, and any +earlier runtime mutation is gone. Operators who need durable +changes must commit them to the configuration file. + +#### Aggregation to Frontend State + +A frontend references one or more named pools. Each referenced +pool contains one or more backends with a per-reference configured +weight in `[0, 100]` (FR-2.1). The effective weight that `maglevd` +computes for a given `(frontend, pool, backend)` tuple is +(FR-2.3): + +- The configured weight, if the backend is Up **and** the + backend's pool is the active pool (see below). +- Zero in every other case. + +The active pool is the first pool, in configuration order, that +contains at least one Up backend whose configured weight is +non-zero (FR-2.2). If no pool is active (e.g. all backends are +Down), every backend contributes zero weight and the frontend's +aggregate state is Down. A frontend with no backends at all, or +with every referenced backend still in Unknown, is itself Unknown. +A frontend with at least one non-zero effective weight is Up +(FR-2.4). + +Whether effective weight zero also flushes existing flows depends +on the cause (FR-4.3): + +- Up in a non-active pool: weight zero, **no** flush (standby + pool). +- Down while `flush-on-down` is true: weight zero, flush. +- Disabled: weight zero, flush, always. +- Paused or Unknown: weight zero, no flush. + +#### VPP Reconciliation + +`maglevd` treats VPP's LB configuration as a desired-state +reconciliation target. The desired state is a pure function of +`(current config, current backend state)`; the observed state is +read back from VPP through the `lb` plugin's binary API. A sync +operation diffs the two and issues the minimal set of +`lb_vip_add_del`, `lb_as_add_del`, and `lb_as_set_weight` messages +to make them match. + +Two triggers drive a sync: + +1. **Event-driven, single VIP** (FR-4.1). When the health checker + emits a backend transition, the reconciler recomputes desired + state for every frontend that references that backend and + syncs those VIPs. This is the primary path for convergence + during incidents. +2. **Periodic, full** (FR-4.2). A background loop runs a full + sync on a configurable interval (default thirty seconds). + This is the safety net that closes gaps left by missed events, + VPP restarts, or bugs in the event path. + +For determinism (NFR-2.1), whenever a sync operation iterates +over ASes it does so in a total order defined by the numeric +representation of the AS address, with IPv4 addresses ordered +before IPv6. Two `maglevd` instances given the same input MUST +therefore issue the same `lb_as_add_del` sequence, which in turn +means VPP produces the same bucket-to-AS assignment regardless of +which instance is driving. + +Operator mutations, event-driven syncs, and periodic full syncs +are serialized through a single mutex at the VPP-call boundary +(NFR-2.4); they never interleave. 
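+
+As an illustration of the deterministic ordering rule (NFR-2.1), the
+sketch below (assumed types, not the shipped code) sorts AS addresses
+IPv4-first and then numerically; any comparator with this property is
+enough to make the iteration order, and therefore the
+`lb_as_add_del` sequence, reproducible across instances:
+
+```go
+// Sketch of the AS ordering rule (NFR-2.1): IPv4 sorts strictly
+// before IPv6, then addresses compare by numeric representation.
+package main
+
+import (
+    "bytes"
+    "fmt"
+    "net/netip"
+    "sort"
+)
+
+func sortASes(ases []netip.Addr) {
+    sort.Slice(ases, func(i, j int) bool {
+        a, b := ases[i], ases[j]
+        if a.Is4() != b.Is4() {
+            return a.Is4() // IPv4 before IPv6
+        }
+        ab, bb := a.As16(), b.As16()
+        return bytes.Compare(ab[:], bb[:]) < 0
+    })
+}
+
+func main() {
+    ases := []netip.Addr{
+        netip.MustParseAddr("2001:db8::10"),
+        netip.MustParseAddr("192.0.2.11"),
+        netip.MustParseAddr("192.0.2.10"),
+    }
+    sortASes(ases)
+    fmt.Println(ases) // [192.0.2.10 192.0.2.11 2001:db8::10]
+}
+```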
+ +#### Startup Warmup and Restart Neutrality + +A naive sync loop would, on restart, immediately synthesize a +desired state in which every backend is Unknown, map every +backend through the effective-weight rules to zero, and push +"zero weight everywhere" into VPP before a single probe had +completed. The result would be a multi-second black hole on +every `maglevd` restart. NFR-1.2 forbids this, and the warmup +state machine is how it is enforced. + +The warmup has three phases, keyed off two configurable delays +`startup-min-delay` (default five seconds) and `startup-max-delay` +(default thirty seconds): + +1. **Hands-off.** From process start to `startup-min-delay`, the + reconciler MUST NOT write anything to VPP at all. Event-driven + syncs are suppressed; the periodic full sync is suppressed. +2. **Per-VIP release.** From `startup-min-delay` to + `startup-max-delay`, a VIP becomes eligible for sync the moment + every backend it references has produced at least one probe + result (i.e. none are Unknown). Eligible VIPs are released + individually so that healthy VIPs converge as fast as their + slowest backend, without being held back by unrelated slow + VIPs. +3. **Watchdog.** At `startup-max-delay`, any VIPs still held are + released unconditionally by a final full sync. This bounds the + worst-case blackout to `startup-max-delay` rather than "as long + as the slowest backend takes". + +The warmup clock is tied to process start, not to VPP reconnect +or configuration reload (NFR-1.3). Reconnecting to a flapping VPP +does not re-enter warmup, and `SIGHUP` does not re-enter warmup. + +Setting both delays to zero disables the warmup entirely, which +is useful for tests but SHOULD NOT be done in production. + +#### Configuration and Reload + +Configuration lives in a single YAML file (FR-5.1), typically +`/etc/vpp-maglev/maglev.yaml`. It is validated in two distinct +phases (FR-5.2): a **parse** phase that catches YAML errors, and +a **semantic** phase that enforces structural invariants such as: + +- Every frontend whose VIPs share an address MUST use backends of + the same address family (IPv4 or IPv6), because VPP picks an + encap type per VIP and mixing families on one VIP is not + supported. +- Every backend referenced by a frontend MUST exist. +- Every referenced health check MUST exist. +- Every pool referenced by a frontend MUST contain at least one + backend (FR-2.1). +- VPP LB knobs MUST satisfy plugin constraints: `flow-timeout` + in `[1s, 120s]`, `sticky-buckets-per-core` a power of two, + `sync-interval` strictly positive, `startup-max-delay` not less + than `startup-min-delay`. +- `transition-history` MUST be at least one. + +`maglevd --check` runs both phases and exits with code 0 on +success, 1 on parse errors, and 2 on semantic errors (NFR-5.2). +This exit code contract is what packaging scripts and systemd +`ExecStartPre` rely on. + +On `SIGHUP` the same two-phase validation runs against the file +on disk. If either phase fails, `maglevd` MUST log the error and +leave the running configuration untouched (FR-5.4, NFR-1.4). On +success, the delta is applied atomically (NFR-2.2): new backends +spawn workers, removed backends have their workers stopped and +emit a terminal `Removed` event, changed backends restart their +workers, and metadata-only changes (address, weight, enable flag) +are updated in place without restarting anything. 
Operator +overrides (Pause, Disable) survive reloads (FR-3.4) but — to +repeat the point from FR-3.5 — do **not** survive a daemon +restart. + +#### Lifecycle, Signals, and Security + +`maglevd` handles three signals: + +- **`SIGHUP`** triggers a configuration reload as described + above. +- **`SIGTERM`** and **`SIGINT`** initiate a graceful shutdown: + the gRPC server drains, stream subscribers are released, probe + workers are cancelled, and the VPP connection is closed. VPP's + last-programmed state is not torn down; traffic continues to + flow (NFR-1.1). + +`maglevd` requires two Linux capabilities, each tied to a +specific feature (NFR-4.1): + +- **`CAP_NET_RAW`** is required if and only if any configured + health check is of type ICMP. Without it, raw-socket creation + will fail and all ICMP probes will error out. +- **`CAP_SYS_ADMIN`** is required if and only if + `healthchecker.netns` is set. The kernel's `setns(CLONE_NEWNET)` + call requires it; without it, every probe will fail on + namespace entry. + +The shipped Debian unit grants both capabilities through +`AmbientCapabilities` and `CapabilityBoundingSet`, which is why +the package "just works" out of the box. Hand-run invocations +SHOULD set capabilities explicitly (e.g. via `setcap`) rather +than running as root. + +`maglevd` does not secure its own gRPC listener (NFR-4.2). +Operators SHOULD bind the listener to loopback, to a +control-plane VRF, or behind a firewall, depending on their +threat model. The design deliberately pushes transport security +out of the binary on the theory that every deployment already +has an answer for it. + +#### Interfaces + +**Presents.** + +- **A gRPC service on a TCP listener** (default `:9090`). This + is the *only* programmatic interface to `maglevd`. Every other + component talks to `maglevd` through this interface and no + other. The service has read-only methods (`List*`, `Get*`, + `WatchEvents`, `CheckConfig`), mutating methods + (`PauseBackend`, `ResumeBackend`, `EnableBackend`, + `DisableBackend`, `SetFrontendPoolBackendWeight`, + `ReloadConfig`, `SyncVPPLBState`), and a single streaming + method (`WatchEvents`) that multiplexes log entries and state + transitions to any number of subscribers with per-subscriber + filters (FR-6.3). gRPC reflection is enabled by default so + that ad-hoc tooling can introspect the service. +- **A Prometheus `/metrics` HTTP endpoint** on a separate + listener (default `:9091`) (FR-6.2). Counters are updated + inline as probes run and VPP calls complete; gauges are + computed on each scrape from the current checker and VPP + state, so there is no sampling lag. +- **Structured JSON logs on stdout**, via `log/slog`, at a + configurable level (FR-6.1). Key events — daemon start, config + load, VPP connect/disconnect, backend transitions, LB sync + mutations, warmup milestones — are logged at `info` or higher + so that a default-level deployment has enough to post-mortem + an incident. +- **Process exit codes** from `--check`: 0, 1, or 2 as described + above (NFR-5.2). These form a small but load-bearing interface + to packaging and systemd. + +**Consumes.** + +- **A YAML configuration file** on disk, passed via `--config` + or `MAGLEV_CONFIG`. This is the declarative source of truth + for intent; everything the operator mutates at runtime is a + delta on top of it, and every runtime delta is lost on a + daemon restart (FR-3.5). +- **VPP's binary API socket** (default `/run/vpp/api.sock`). 
+ The connection auto-reconnects on drop (FR-4.4), and while + disconnected, the reconciler silently queues no work — the + next periodic sync closes any gap. +- **VPP's stats segment socket** (default `/run/vpp/stats.sock`). + Read periodically (five-second cadence) for per-VIP packet + and byte counters (FR-6.4). Readers are non-blocking + (NFR-3.3); a stale snapshot is always available. +- **The Linux kernel's namespace subsystem**, when + `healthchecker.netns` is set. Requires `CAP_SYS_ADMIN`. +- **Raw sockets**, for ICMP probes. Requires `CAP_NET_RAW`. + +### VPP Dataplane + +The VPP dataplane is not part of the `vpp-maglev` codebase, but +it is the component every other piece revolves around, and its +contract with `maglevd` defines what `maglevd` is allowed to do. + +#### Responsibilities + +VPP's `lb` plugin implements Maglev consistent hashing in the +forwarding fast path. It owns: + +- **Global configuration** — an IPv4 source address and an IPv6 + source address used as the outer header for GRE-encapsulated + traffic to ASes, the number of sticky buckets per worker core, + and a per-flow idle timeout. +- **A set of VIPs**, each identified by an address prefix, an IP + protocol, and a port. A VIP carries an encap type (GRE4 or + GRE6, picked by the family of the AS addresses) and a flag + for source-IP sticky hashing. +- **A set of ASes per VIP**, each identified by address, with an + integer weight in `[0, 100]`, a `used`/`flushed` state, and a + bucket count derived from the Maglev ring. + +It does **not** own: health, configuration intent, operator +overrides, transition history, or metrics. Those belong to +`maglevd`. + +#### Interfaces + +**Presents.** + +- **A binary API** (GoVPP-style message exchange) for reading + and mutating VIP and AS state. `maglevd` is the sole user. +- **A stats segment** with per-VIP counters from the LB plugin + (existing-flow, first-flow, untracked, no-server) and + per-prefix FIB counters. The LB plugin bypasses the FIB for + forwarded packets, so per-backend traffic counters are not + available; this is a known limitation that operators consuming + metrics need to understand. +- **The forwarded-traffic fast path itself**, which is the whole + reason this project exists. + +**Consumes.** + +- `maglevd`'s binary-API writes — nothing else. There is no + third party programming `lb` state in a working deployment. + +### maglevc + +`maglevc` is the interactive and scripting CLI. It is a +short-lived client with no persistent state and no background +work (NFR-5.4). + +#### Responsibilities + +- Provide a human-readable tab-completing shell for `maglevd` + (FR-7.1). +- Dispatch one-shot commands for scripts and automation. +- Render state snapshots (frontends, backends, health checks, + VPP LB state, VPP counters) with optional ANSI color. +- Stream events in real time (`watch events`) with filters. + +#### Interaction Model + +With no positional arguments, `maglevc` starts a readline-based +REPL with a nested command tree: `show`, `set`, `watch`, +`config`, plus the usual `help`, `exit`, `quit`. Tab completion +is built from the same command tree the dispatcher uses, so +completion can never drift from the actual command set. With +positional arguments, `maglevc` executes one command against the +server and exits — in this mode color is off by default so that +pipes and logs stay clean, but `--color=true` can be set +explicitly. + +#### Interfaces + +**Presents.** + +- **An interactive TTY shell** and a **one-shot command mode**. 
+ Humans and scripts are the only consumers; there is no API, + no socket, no file output. + +**Consumes.** + +- **`maglevd`'s gRPC service**, over insecure credentials by + default. `maglevc` MUST NOT talk to VPP directly, MUST NOT + read the config file directly, and MUST NOT maintain any + state of its own across invocations (NFR-5.4). Everything it + shows and everything it mutates goes through the gRPC API. + +### maglevd-frontend + +`maglevd-frontend` is an optional web dashboard (FR-7.2). Unlike +`maglevc`, it is a long-running process: it holds open gRPC +streams, caches snapshots, and serves HTTP. + +#### Responsibilities + +- Connect to one or more `maglevd` servers simultaneously. +- Maintain a cached view of each server's state: frontends, + backends, health checks, VPP LB state, and VPP counters. +- Serve a SolidJS single-page application and a JSON API to + browsers. +- Stream live updates to browsers so that dashboards update + without polling (NFR-5.3). +- Expose an optional authenticated mutation surface (FR-7.3). + +#### Multi-Server Multiplexing + +A single `maglevd-frontend` process accepts a comma-separated +list of gRPC server addresses. For each one, it runs an +independent pool of goroutines: one to stream events, one to +refresh list-oriented data on a roughly one-second cadence, one +to refresh per-health-check detail, and one (debounced on +incoming events) to refresh VPP LB state and counters. Failures +on one server MUST NOT block the others, and the served JSON +state always reports per-server connection status so that the +SPA can mark partially-available views. + +All per-server event streams publish into a single shared event +broker with a bounded replay buffer (capped both in time and in +event count, satisfying NFR-3.2). The broker assigns each event +a monotonic `epoch-seq` identifier so that browsers reconnecting +a dropped Server-Sent-Events stream can resume from where they +left off without a full refresh — and so that a broker restart, +which reshuffles the epoch, forces a full refresh rather than +silently handing out ambiguous IDs. + +#### Read-Only and Admin Surfaces + +The HTTP surface is partitioned into two paths (FR-7.3): + +- **`/view/`** serves the SPA and a read-only JSON API. It is + always publicly accessible: there is no auth, and there are + no mutation endpoints under it at all. The design intent is + that `/view/` can be exposed to a broader audience (NOC, + management UIs, screens on walls) without risk. +- **`/admin/`** serves the SPA entry point and the mutating + JSON API behind HTTP basic auth. Credentials come from + `MAGLEV_FRONTEND_USER` and `MAGLEV_FRONTEND_PASSWORD`. If + either is unset or empty, the `/admin/` path MUST return 404 + (NFR-4.3) — the admin surface is not merely locked, it is + not advertised. This makes accidental exposure self-limiting: + forgetting to set the env vars disables admin rather than + leaving it open. + +Both surfaces talk to the same underlying cache; the difference +is only what endpoints exist. + +#### Interfaces + +**Presents.** + +- **An HTTP listener** (default `:8080`) serving: + - `/view/` — the SolidJS SPA (embedded in the binary). + - `/view/api/*` — read-only JSON endpoints for version, + server list, aggregated state, and per-server state. + - `/view/api/events` — an SSE stream bridged from the + internal event broker, with `Last-Event-ID` replay. + - `/admin/` — the SPA entry point, gated on basic auth. 
+ - `/admin/api/*` — mutating JSON endpoints that translate + to gRPC mutations against the appropriate `maglevd`. + - `/healthz` — a liveness probe. + +**Consumes.** + +- **One or more `maglevd` gRPC services.** As with `maglevc`, + this is the *only* way `maglevd-frontend` reaches into the + system. It MUST NOT read the YAML config file and MUST NOT + talk to VPP directly (NFR-5.4). +- **Two environment variables**, `MAGLEV_FRONTEND_USER` and + `MAGLEV_FRONTEND_PASSWORD`, for the optional admin surface. + +### maglevt + +`maglevt` is a small out-of-band probe TUI (FR-7.4). It is not +part of the control loop at all; it is a validation tool that +an operator runs on a laptop, a jump host, or a monitoring box +to see VIPs the way a client sees them. + +#### Responsibilities + +- Read one or more `maglev.yaml` files and enumerate TCP-style + VIPs from the `frontends` section. +- Probe each VIP at a configurable interval with a real HTTP or + HTTPS request against a configurable path. +- Measure latency (min/max/average and a handful of + percentiles) and success rate over a rolling window. +- Tally the value of a configurable response header (by + default, `X-IPng-Frontend`) so that operators can see which + backend actually served each request. Because keep-alives are + disabled by default, this tally reflects fresh Maglev hashing + decisions rather than a pinned connection. + +#### Scope Boundary + +`maglevt` is intentionally decoupled from `maglevd`. It does +not talk gRPC, it does not read the VPP stats segment, and it +does not know or care whether the target VIPs are actually +served by the `vpp-maglev` control plane at all — it simply +probes addresses. This makes it useful in at least three +scenarios: validating a `maglevd` restart end-to-end from a +client perspective, debugging pool failover by watching the +header tally reshuffle, and sanity-checking that a given VIP is +reachable across deployments when the gRPC control plane is +unavailable or out of reach. + +#### Interfaces + +**Presents.** + +- **A full-screen TUI** built on Bubble Tea, with a + deterministic grid layout and a few interactive toggles (e.g. + reverse-DNS lookup). There is no machine-readable output; if + you need metrics, use Prometheus on `maglevd`. + +**Consumes.** + +- **One or more YAML configuration files**, which it parses + with the same library `maglevd` uses. Only the subset of the + schema describing frontends is actually consumed; unknown + fields are ignored. Duplicate VIPs discovered across files + are de-duplicated by `(scheme, address, port)` so that + multi-file deployments don't double-probe. +- **The outbound network**, directly. No special capabilities + are required — `maglevt` is a plain HTTP client. + +## Operational Concerns + +### Configuration Reload Semantics + +Reload is triggered by `SIGHUP` to `maglevd`, or by the +`ReloadConfig` gRPC method. Both paths run the same validation +as `--check`. A reload MUST NOT partially apply (NFR-2.2): +either every change in the new file takes effect, or none of +them do. A reload MUST NOT restart unchanged probe workers; the +probe state machine is preserved precisely because operators +use reloads as a routine operation and expect backends whose +health-check definitions did not change to simply keep running. + +Operator overrides (Pause, Disable) survive a reload as long as +the backend still exists in the new config (FR-3.4). 
A backend +that disappears from the new config transitions to `Removed` +and its worker is stopped; if it reappears in a later reload it +starts again in `Unknown` with a fresh counter. + +A daemon **restart** is different from a reload. On restart, +the YAML configuration is the sole source of truth: every +runtime override is gone, every runtime weight mutation is gone +(FR-3.5). Operators who need an override to persist across +restarts must commit the intended state to the config file. + +### Failure Modes + +- **VPP restart.** `maglevd` detects the disconnect, enters a + reconnect loop, and on reconnect reads VPP's version and + current state (FR-4.4). The warmup clock is not reset by VPP + reconnects (NFR-1.3) — a flapping VPP does not cause + `maglevd` to go hands-off every time. The next periodic full + sync pushes the current desired state into the freshly + restarted plugin. +- **`maglevd` restart with VPP up.** Handled by the warmup + state machine (NFR-1.2): new flows see the last-programmed + weights until probes catch up, not zeros. +- **`maglevd` restart with VPP also down.** VPP comes back + first, `maglevd` comes back second, warmup gates pushing + anything until probes converge. This is the worst-case path, + bounded by `startup-max-delay`. +- **Configuration reload with a broken file.** The reload is + rejected; the running configuration is retained; an error + is logged (FR-5.4). No probes are interrupted (NFR-1.4). +- **Probe namespace disappears.** Entering the namespace fails, + the probe is counted as a failure, and the backend + eventually transitions Down under normal rise/fall rules. + There is no special-case handling; this is by design, because + an operator removing the netns while `maglevd` is running is + an operational error that SHOULD manifest as a visible Down, + not as silent success. +- **gRPC subscriber too slow.** Per-subscriber event queues + are bounded (NFR-3.2). A subscriber that cannot keep up MUST + be dropped rather than backing up the central fan-out + (NFR-3.4). +- **Mid-flight weight mutation during sync.** Operator weight + changes and reconciler sync both route through the same + state-protected code path, so mutations are serialized rather + than interleaved with VPP writes (NFR-2.4). + +### Observability + +**Structured logging** (FR-6.1). All logs are slog-formatted +JSON written to stdout. The default level is `info`, which is +sized to produce one or two lines per incident rather than per +probe. The `debug` level dumps every probe attempt and every +VPP binary-API message, and is intended for post-mortem +investigation. + +**Prometheus metrics** (FR-6.2, FR-6.4). `maglevd` exposes four +classes of metric: inline counters for probe outcomes, +probe-latency histograms, backend state-transition counters, +and VPP API and LB sync counters; and on-demand gauges for +current backend state, rise/fall counter values, configured +weights, VPP connection status, VPP uptime, VPP info labels, +and per-VIP LB plugin counters. Gauges are sampled from live +state on every scrape, so there is no sampling staleness. + +**Streaming events** (FR-6.3). The gRPC `WatchEvents` method +multiplexes three event families into one stream: log events +(the same structured logs the daemon writes to stdout), backend +transitions (one per affected frontend, since a single backend +may participate in multiple frontends), and frontend aggregate +transitions (Up/Down/Unknown flips at the frontend level). +Clients MAY filter by event family and by minimum log level. 
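+
+Internally, delivery to each subscriber is bounded: a subscriber that
+cannot drain its queue is dropped rather than allowed to stall the
+stream (NFR-3.2, NFR-3.4). A minimal sketch of that non-blocking
+fan-out pattern (assumed shape, not the shipped code):
+
+```go
+// Sketch of bounded fan-out with drop-on-slow (NFR-3.4). Locking is
+// elided for brevity; a full subscriber queue removes the subscriber
+// instead of blocking the central event channel.
+package main
+
+import "fmt"
+
+type event struct{ msg string }
+
+type broker struct{ subs map[chan event]struct{} }
+
+func (b *broker) publish(ev event) {
+    for ch := range b.subs {
+        select {
+        case ch <- ev: // room in this subscriber's bounded queue
+        default: // slow subscriber: drop it, never block the fan-out
+            delete(b.subs, ch)
+            close(ch)
+        }
+    }
+}
+
+func main() {
+    b := &broker{subs: map[chan event]struct{}{}}
+    sub := make(chan event, 16) // bounded per-subscriber queue
+    b.subs[sub] = struct{}{}
+    b.publish(event{msg: "backend web1: Up -> Down"})
+    fmt.Println((<-sub).msg)
+}
+```
+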
+The web frontend consumes this stream and re-publishes it to +browsers over SSE, with an epoch-seq replay buffer layered on +top. + +### Security and Capabilities + +`maglevd` needs `CAP_NET_RAW` for ICMP probes and +`CAP_SYS_ADMIN` for netns entry (NFR-4.1). Neither is optional +for the feature that needs it, and neither is required +otherwise; operators who use neither feature MAY run `maglevd` +as an unprivileged user with no capabilities at all. + +`maglevd-frontend` needs no special capabilities — it is a +plain HTTP client of `maglevd` and a plain HTTP server for +browsers. It does handle user credentials (basic auth), which +are read from the environment and held in process memory; +operators SHOULD terminate the frontend behind a TLS reverse +proxy if it is exposed beyond a trusted network. + +`maglevc` and `maglevt` need no special capabilities. + +All gRPC traffic runs insecure by default (NFR-4.2). Securing +transport is an operational decision, not a build-time one; +deployments that require mTLS SHOULD terminate gRPC at a +sidecar or colocate control and data plane on a trusted +segment. + +### Concurrency Model + +The concurrency model inside `maglevd` is deliberately simple +and local: + +- Each backend owns exactly one probe worker goroutine + (NFR-3.1). Workers do not share state with each other. +- All events — transitions and log records — travel through a + single central channel which is then fanned out to bounded + per-subscriber queues (NFR-3.2). The fan-out is the only + place where multiple subscribers can observe the same event. +- The configuration pointer is swapped atomically on reload + (NFR-2.2); readers take a read lock for the duration of a + single access, so the live config is always internally + consistent even mid-reload. +- The VPP stats snapshot is published as an atomic pointer + (NFR-3.3), so Prometheus scrapes and gRPC reads of counters + are wait-free. +- Reconciliation holds a mutex around VPP calls, which + serializes operator mutations, event-driven syncs, and + periodic full syncs against each other (NFR-2.4). This is + intentional: the order in which VPP sees mutations matters + for determinism, and serializing them is cheap at the scale + of control-plane events. + +Deadlock avoidance is structural rather than audited: +dependencies between subsystems are one-way. The checker does +not call into VPP; the reconciler reads checker state and calls +VPP; VPP never calls back. `maglevd-frontend` and `maglevc` +only read from `maglevd` over gRPC. There is no cycle in the +wait-for graph. + +## Alternatives Considered + +This is a retrofit of a shipped system, so the alternatives +here are the ones the code actively rejects, not speculative +designs. + +- **Several probe schedulers sharing one goroutine pool.** + Rejected in favor of one goroutine per backend. The + per-backend model is trivially correct, has no shared state, + and scales linearly with backend count at a cost of a few + kilobytes per backend. +- **`maglevd-frontend` as a sidecar per `maglevd`.** Rejected + in favor of one frontend speaking to many daemons. A single + dashboard pane across a fleet is the common operator + request; pushing multi-server logic into the frontend keeps + the daemon simple. +- **Operator actions expressed as config edits plus SIGHUP.** + Rejected in favor of direct gRPC mutations. 
Pausing a + backend during an incident should not require editing a + file, and the effect should survive subsequent reloads + (FR-3.4) — though, by deliberate design, not a daemon + restart (FR-3.5). +- **Persisting operator overrides across daemon restarts.** + Rejected in favor of making the YAML config file the sole + source of truth on startup (FR-3.5). Persisting runtime + overrides would require an on-disk side store and a clear + policy for what happens when the side store and the config + file disagree; keeping the daemon stateless on startup is + simpler and harder to get wrong. +- **Synchronous full sync after every transition.** Rejected + in favor of event-driven single-VIP syncs with a periodic + full sync as a safety net (FR-4.1, FR-4.2). Full syncs are + cheap but not free, and the blast radius of a transient bug + in the desired-state computation is smaller when + per-transition work only touches one VIP. +- **Letting `maglevt` read `maglevd`'s gRPC.** Rejected in + favor of probing the YAML file directly so that `maglevt` + remains useful when `maglevd` itself is the thing being + investigated. + +## Open Questions + +- **Mutual TLS for gRPC.** Currently insecure by default. A + future version may wire in standard mTLS support once a + credential-management story is picked. +- **Per-AS traffic counters.** The VPP `lb` plugin bypasses + the FIB and therefore does not produce per-AS traffic + counters. Surfacing real per-backend byte/packet counts + would require a VPP-side change. +- **High-availability of the control plane.** Two `maglevd` + instances on the same VPP would interleave writes harmlessly + thanks to determinism (NFR-2.1), but there is no leader + election and no formal story about which instance owns which + VIPs. Today, operators run a single `maglevd` per VPP host. diff --git a/docs/user-guide.md b/docs/user-guide.md index 8a33c86..34118f1 100644 --- a/docs/user-guide.md +++ b/docs/user-guide.md @@ -535,3 +535,61 @@ Nginx, HAProxy, or any proxy in front of `maglevd-frontend` must: the live-stream property. See `maglevd-frontend(8)` for the full reference. + +--- + +## maglevt + +`maglevt` is an optional out-of-band VIP probe TUI. It reads one or +more `maglev.yaml` files, enumerates the configured TCP/HTTP frontends, +and probes each one on a configurable HTTP path at a configurable +interval. It does not talk gRPC and does not depend on a running +`maglevd` — it's a purely client-side view of the VIPs, driven entirely +from the config file on disk. + +It's useful for a handful of things in particular: + +- Validating a `maglevd` restart end-to-end from a client perspective: + the probe tally keeps running regardless of what the control plane + is doing, so a brief blip or a missed failover is visible directly. +- Debugging pool failover: with keep-alives off, every probe opens a + fresh TCP connection and is reshuffled by VPP's Maglev hash, so the + response-header tally visibly reshuffles the moment a standby pool + takes over. +- Sanity-checking VIP reachability across multi-site deployments, + especially when the gRPC control plane isn't reachable from the + machine you're debugging on. + +`maglevt` is built by `make` alongside the other binaries but is not +shipped in the Debian package; run it from the `build/` tree or copy +it onto the host by hand. + +### Flags + +| Flag | Environment variable | Default | Description | +|---|---|---|---| +| `--config` | — | `/etc/vpp-maglev/maglev.yaml` | Path to a `maglev.yaml` file. 
Repeatable; also accepts a comma-separated list. Frontends are unioned across files and de-duplicated by `(address, protocol, port)`. | +| `--interval` | — | `100ms` | Probe interval per VIP, with ±10% jitter applied per probe to avoid phase-locking. | +| `--timeout` | — | `2s` | Per-request timeout. | +| `--host` | — | (VIP address) | Override for the HTTP `Host` header. Defaults to the VIP address literal. | +| `--uri` / `--path` | — | `/.well-known/ipng/healthz` | HTTP request path used in the GET. `--path` is an alias for `--uri`. | +| `--header` | — | `X-IPng-Frontend` | Response header whose value is extracted and tallied, so you can see which backend served each request. | +| `--insecure` | — | `true` | Skip TLS verification for HTTPS frontends. | +| `--keepalive` / `-k` | — | `false` | Enable HTTP keep-alives. Off by default so every probe opens a fresh connection — required for failover visibility, because a pinned keep-alive would mask a Maglev reshuffle. | +| `--filter` | — | — | Regular expression; only probe frontends whose name matches. | +| `--version` | — | — | Print version, commit hash, and build date, then exit. | + +### UI + +The TUI is built with Bubble Tea and shows a deterministic grid — +one tile per `(scheme, address, port)` VIP, IPv6 before IPv4 and +HTTPS before HTTP, so the layout is stable across runs and across +machines. Each tile carries a rolling latency summary (min, max, +average, plus a few percentiles), running success and failure +counts, and a tally of the configured response-header values seen +from that VIP. Press `d` to toggle reverse-DNS resolution on the +addresses shown in the tile headers; press `q` or `Ctrl-C` to +exit. + +There is no machine-readable output. If you need metrics, scrape +Prometheus on `maglevd` instead.
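+
+### Example
+
+A typical run from the build tree might look like the following
+(paths and values are illustrative; every flag is listed in the
+table above):
+
+```
+./build/maglevt --config /etc/vpp-maglev/maglev.yaml \
+    --interval 250ms --filter '^www'
+```
+
+This probes each frontend whose name matches `^www` four times per
+second against the default health path and tallies the default
+`X-IPng-Frontend` response header per VIP.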