# vpp-maglev Design Document

## Metadata

| | |
| --- | --- |
| **Status** | Retrofit — describes shipped behavior as of `v0.9.5` |
| **Author** | Pim van Pelt `<pim@ipng.ch>` |
| **Last updated** | 2026-04-15 |
| **Audience** | Operators and contributors who will read the source tree next |

The key words **MUST**, **MUST NOT**, **SHOULD**, **SHOULD NOT**, and **MAY** are used as described in [RFC 2119](https://datatracker.ietf.org/doc/html/rfc2119), and are reserved in this document for requirements that are actually enforced in code or by an external dependency. Plain-language descriptions of what the system or an operator can do are written in lowercase — "can", "will", "does" — and should not be read as normative.

## Summary

`vpp-maglev` is a control plane for the VPP `lb` (Maglev load balancer) plugin. A single daemon — `maglevd` — probes a fleet of backends, maintains an authoritative view of their health, and programs the VPP dataplane so that traffic hashed to a given VIP lands only on healthy backends. Operators drive the system through `maglevc` (an interactive CLI) or `maglevd-frontend` (a read-only web dashboard with an optional authenticated admin surface). A small companion binary, `maglevt`, validates VIPs from outside the control plane by sending live HTTP probes and reporting failover behavior.

## Background

VPP's `lb` plugin implements Maglev consistent hashing inside the dataplane: a VIP is backed by a pool of Application Servers (ASes), each with an integer weight in `[0, 100]`, and incoming flows are hashed onto a bucket ring so that weight changes disturb as few existing flows as possible. The plugin knows nothing about backend health; if an AS dies while it holds buckets, traffic to those buckets is black-holed until something external tells `lb` to remove or re-weight the AS.

`vpp-maglev` is that external thing. Before `vpp-maglev`, operators maintained VIP configurations by hand and reacted to incidents with `vppctl`. The project replaces that loop with a daemon that owns the health story, reconciles it with the dataplane, and exposes the result through a uniform gRPC API so that CLIs, dashboards, and scripts all read the same source of truth.

## Goals and Non-Goals

### Product Goals

1. **Accurate backend health.** Detect that a backend is up, degraded, or down quickly enough to keep user-visible error rates low, and avoid flapping under transient faults.
2. **Correct VPP state.** The set of VIPs and per-AS weights in VPP converges to the configured intent, filtered by current health, for every supported failure mode.
3. **Restart neutrality.** Restarting `maglevd` with VPP already up MUST NOT cause traffic to be black-holed while health probes warm up.
4. **Operator control.** A human can pause, drain, or weight-shift a backend in seconds without editing config files.
5. **Uniform observability.** Every state transition, VPP API call, and probe result is emitted as a structured log, a Prometheus metric, or a streaming event — ideally all three.
6. **One source of truth.** Every other component (CLI, web frontend, scripts) reads `maglevd` through one typed interface. There is no secondary control plane.

### Non-Goals

- `vpp-maglev` is not a VPP installer or packaging layer. It assumes VPP is already running with the `lb` plugin loaded.
- It does not implement its own dataplane fast path. All forwarding stays in VPP; `maglevd` only programs the plugin.
- It is not a generic service mesh. There is no L7 routing, cert issuance, service discovery, or east-west policy — only VIPs, pools, and backends.
- It is not a config store. Configuration is a YAML file on disk; the gRPC API can check and reload it but cannot author it.
- It does not secure its own transport. gRPC runs insecure by default; TLS, mTLS, or firewalls are the operator's responsibility.

## Requirements

Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that later sections can cite it.

### Functional Requirements

**FR-1 Health checking**

- **FR-1.1** The system supports ICMP, TCP, HTTP, and HTTPS health checks, each with its own protocol-specific success criteria.
- **FR-1.2** Each health check MUST apply HAProxy rise/fall semantics with operator-configurable thresholds.
- **FR-1.3** A health check MAY declare distinct `interval`, `fast-interval`, and `down-interval` values so that recovery from a degraded or down state is faster than steady-state polling.
- **FR-1.4** Each probe attempt is bounded by a configurable per-probe timeout, independent of the scheduling interval.
- **FR-1.5** If the configuration sets `healthchecker.netns`, every probe MUST execute inside the named Linux network namespace.
- **FR-1.6** The first probe result against a newly-created backend forces an immediate transition out of `Unknown`, without waiting for `rise` or `fall` consecutive results.
- **FR-1.7** A backend MAY omit its `healthcheck` reference to declare itself **static**. A static backend is not probed and is treated as permanently Up; it still participates in pool failover and still honors operator Pause and Disable overrides.

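To make the interval and threshold knobs concrete, here is an illustrative health-check stanza. Only the knob names (`interval`, `fast-interval`, `down-interval`, the per-probe timeout, and the rise/fall thresholds) come from this document; the surrounding YAML shape is a sketch, not the authoritative schema.

```yaml
healthchecks:
  http-ok:
    type: http            # FR-1.1: icmp | tcp | http | https
    rise: 3               # FR-1.2: consecutive successes to go Up
    fall: 2               # FR-1.2: consecutive failures to go Down
    interval: 5s          # FR-1.3: nominal cadence
    fast-interval: 1s     # FR-1.3: while in the degraded zone
    down-interval: 10s    # FR-1.3: while fully down
    timeout: 2s           # FR-1.4: bounds one attempt, not the cadence
```
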
**FR-2 Aggregation and pool failover**

- **FR-2.1** A frontend MAY reference one or more named pools. Each referenced pool MUST contain at least one `(backend, configured-weight)` tuple; an empty pool is a configuration error and is rejected at load time.
- **FR-2.2** At any time, at most one pool — the first, in configuration order, that contains a healthy backend with non-zero configured weight — is active; backends in other pools contribute zero effective weight.
- **FR-2.3** The effective weight of a `(frontend, pool, backend)` tuple is the configured weight when the backend is Up **and** the pool is active, and zero in every other case.
- **FR-2.4** A frontend's aggregate state is Up when at least one backend has non-zero effective weight, Unknown when every referenced backend is still Unknown (or the frontend references no backends), and Down otherwise.

**FR-3 Operator control**

- **FR-3.1** Operators can pause and resume individual backends at runtime. Pausing stops the probe worker, freezes the rise/fall counter, and drives effective weight to zero in **every** pool and **every** frontend that references the backend. Existing flows are not torn down; this is a soft drain.
- **FR-3.2** Operators can disable and re-enable individual backends at runtime. Disabling drives effective weight to zero in **every** pool and **every** frontend that references the backend, and MUST cause existing flows to be torn down on the next VPP sync.
- **FR-3.3** Operators can set the configured weight of a specific `(frontend, pool, backend)` tuple at runtime.
- **FR-3.4** Operator overrides (Pause, Disable) and operator weight mutations survive a configuration **reload** (`SIGHUP`) as long as the underlying backend and tuple still exist in the new configuration.
- **FR-3.5** Operator overrides and operator weight mutations do **not** survive a `maglevd` **restart**. After a restart, the YAML configuration file is authoritative for every backend and every tuple: paused backends come back unpaused, disabled backends come back enabled, mutated weights revert to the configured value. Operators who need persistent changes must edit the config file.

**FR-4 VPP reconciliation**

- **FR-4.1** For every backend state transition that changes an effective weight, `maglevd` pushes the resulting AS state into VPP for every affected VIP.
- **FR-4.2** `maglevd` runs a periodic full reconciliation on a configurable cadence (default thirty seconds) as a safety net against missed events and VPP restarts.
- **FR-4.3** Weight-to-zero is communicated to VPP as a graceful drain by default; transitions to Disabled and transitions to Down while `flush-on-down` is true MUST tear existing flows down on the next sync.
- **FR-4.4** `maglevd` tolerates VPP disconnects by auto-reconnecting and resuming reconciliation once the connection is re-established.

**FR-5 Configuration**

- **FR-5.1** Configuration is loaded from a single YAML file specified at startup and referenced by all later operations.
- **FR-5.2** Configuration validation distinguishes **parse errors** (malformed YAML) from **semantic errors** (structural invariants) and MUST report each with its own exit code from `--check`: 0 (OK), 1 (parse), 2 (semantic).
- **FR-5.3** `maglevd` reloads its configuration on `SIGHUP` without restarting the process, without restarting unchanged probe workers, and without losing operator overrides (see FR-3.4).
- **FR-5.4** A parse or semantic error encountered during reload MUST leave the running configuration in place.
- **FR-5.5** The same validation and reload semantics are also reachable through gRPC (`CheckConfig`, `ReloadConfig`).

**FR-6 Observability**

- **FR-6.1** All logs are emitted as structured JSON on stdout at a configurable level.
- **FR-6.2** `maglevd` exposes Prometheus metrics for probe outcomes, probe latency, backend state transitions, VPP API traffic, and VPP LB sync mutations.
- **FR-6.3** A streaming gRPC API multiplexes log entries, backend transitions, and frontend aggregate transitions to any number of subscribers with per-subscriber filters.
- **FR-6.4** Per-VIP packet counters from VPP's stats segment are surfaced through both the gRPC API and the Prometheus surface.

**FR-7 Clients and peripheral tools**

- **FR-7.1** An interactive CLI (`maglevc`) provides a tab-completing shell and a one-shot command mode, both backed by the same command tree.
- **FR-7.2** A web frontend (`maglevd-frontend`) can multiplex more than one `maglevd` in a single process and present their combined state.
- **FR-7.3** The web frontend partitions its HTTP surface into a public read-only path (`/view/`) and an authenticated mutating path (`/admin/`). If credentials are not configured, `/admin/` MUST NOT be advertised (the path returns 404).
- **FR-7.4** An out-of-band tester (`maglevt`) probes configured VIPs from outside the control plane, measures latency, and tallies a configurable response header.

### Non-Functional Requirements

**NFR-1 Availability and reliability**

- **NFR-1.1** A `maglevd` outage MUST NOT stop the dataplane. While `maglevd` is absent, VPP continues to forward traffic with its last-programmed state.
- **NFR-1.2** Restarting `maglevd` with VPP up MUST NOT black-hole new flows during the probe warm-up window; this is enforced by the startup warmup state machine described under `maglevd`.
- **NFR-1.3** The warmup clock is tied to process start and MUST NOT be reset by VPP reconnects or configuration reloads.
- **NFR-1.4** A `maglevd`-side reload with a broken file MUST NOT interrupt any running probe.

**NFR-2 Determinism and correctness**

- **NFR-2.1** Two `maglevd` instances given the same configuration and the same backend state MUST issue the same sequence of `lb_as_add_del` calls to VPP, so that VPP's bucket assignment is stable across process swaps. This is the job of the deterministic AS ordering rule.
- **NFR-2.2** Configuration reload MUST be atomic: either every change in the new file takes effect, or none of them do.
- **NFR-2.3** Probe scheduling SHOULD apply bounded jitter so that, after a daemon restart or a configuration reload, probes do not phase-lock to the wall clock.
- **NFR-2.4** Operator mutations, event-driven syncs, and periodic full syncs against VPP MUST be serialized with respect to one another; they MUST NOT interleave.

**NFR-3 Performance and scalability**

- **NFR-3.1** Probing N backends costs roughly N goroutines doing mostly idle waits; there is no central probe scheduler.
- **NFR-3.2** Event fan-out, transition history, and per-subscriber event queues MUST all be bounded; no structure grows without limit under sustained load.
- **NFR-3.3** VPP stats snapshots are published as an atomic pointer so that Prometheus scrapes and gRPC counter reads are wait-free.
- **NFR-3.4** A gRPC subscriber that cannot keep up MUST be dropped rather than blocking the central fan-out.

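A minimal Go sketch of the NFR-3.3 publication pattern, using illustrative names rather than `maglevd`'s actual types: the writer builds a fresh immutable snapshot and swaps one pointer, and readers load whatever is current without ever blocking the writer.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// statsSnapshot is an immutable view of per-VIP counters. The stats
// reader goroutine builds a fresh snapshot on each cycle and swaps the
// pointer; scrapers load whichever snapshot is current (NFR-3.3).
type statsSnapshot struct {
	packetsPerVIP map[string]uint64
}

var current atomic.Pointer[statsSnapshot]

// publish atomically replaces the current snapshot.
func publish(s *statsSnapshot) { current.Store(s) }

// read never blocks; it may observe a snapshot one cycle stale.
func read() *statsSnapshot { return current.Load() }

func main() {
	publish(&statsSnapshot{packetsPerVIP: map[string]uint64{"192.0.2.1": 42}})
	fmt.Println(read().packetsPerVIP["192.0.2.1"]) // prints 42
}
```

The map inside a snapshot is never mutated after publication, which is what makes the lock-free read safe.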
**NFR-4 Security**

- **NFR-4.1** `maglevd` runs with only the Linux capabilities it actually needs: `CAP_NET_RAW` only when ICMP probes are in use, `CAP_SYS_ADMIN` only when `healthchecker.netns` is set.
- **NFR-4.2** gRPC transport security is explicitly out of scope; the daemon runs insecure by default and deployments SHOULD front it with a firewall, a trusted network, or a TLS-terminating sidecar.
- **NFR-4.3** The web frontend's mutating surface MUST be hidden entirely (HTTP 404) when either of its basic-auth environment variables is unset.

**NFR-5 Operability**

- **NFR-5.1** Every CLI flag on every binary SHOULD have an environment-variable equivalent so that the binaries can be driven purely through env in container deployments.
- **NFR-5.2** `maglevd --check` MUST provide a stable exit-code contract (0 / 1 / 2) for use by packaging scripts and `ExecStartPre` handlers.
- **NFR-5.3** Dashboards can track state in real time through the streaming event interface rather than by tight polling.
- **NFR-5.4** `maglevc` and `maglevd-frontend` MUST NOT maintain any authoritative state of their own; all truth lives in `maglevd`.

## Architecture Overview

### Process Model

The system ships as three independent executables plus one optional companion tester:

- **`maglevd`** — the long-running daemon. Hosts both the health checker and the VPP control plane.
- **`maglevc`** — short-lived CLI client.
- **`maglevd-frontend`** — long-running web dashboard (optional).
- **`maglevt`** — short-lived out-of-band probe TUI (optional).

VPP itself is a fourth moving part, but it is an external dependency, not part of the `vpp-maglev` codebase.

### Data Flow

Configuration flows **in** from a YAML file on disk (read by `maglevd`) and from runtime mutations issued over gRPC by `maglevc` or `maglevd-frontend`. Health state flows **out** of `maglevd` in three directions: into VPP (as AS weight changes), into Prometheus (as metrics), and into gRPC clients (as streaming events and snapshot reads). Traffic counters flow **back in** from VPP's stats segment and are surfaced through the same gRPC and Prometheus channels. No component writes to VPP except `maglevd`. No component serves `maglevd`'s state except `maglevd` itself.

## Components

### maglevd

`maglevd` is the entire control plane. It is a single Go process that bundles three internal concerns — a fleet of probe workers, a VPP reconciler, and a gRPC server — around one shared, versioned view of `(config, backend state, frontend state)`.

#### Responsibilities

- Load and validate configuration; accept reloads on `SIGHUP` (FR-5.3, FR-5.4).
- Run one health-check worker per backend defined in config (NFR-3.1).
- Maintain each backend's rise/fall counter and derive its state (FR-1.2, FR-1.6).
- Aggregate backend state into per-frontend state, honoring pool-based failover and per-backend operator overrides (FR-2.x, FR-3.x).
- Connect to VPP's binary API and stats socket, reconnecting automatically on disconnect (FR-4.4).
- Compute a desired VPP `lb` state from current configuration and health, and drive VPP to match it (FR-4.1, FR-4.2).
- Expose the whole picture through a gRPC service and a Prometheus `/metrics` endpoint (FR-6.x).

#### Probe Types and Intervals

Four probe types are supported (FR-1.1):

- **ICMP** — sends an echo request, expects a matching reply. This probe type MUST have access to a raw socket, which requires `CAP_NET_RAW` (NFR-4.1).
- **TCP** — establishes a TCP connection and immediately closes it. No payload is exchanged.
- **HTTP** — issues a request against a configured path, matches the response code against a configured numeric range, and optionally matches the response body against a regular expression.
- **HTTPS** — HTTP over TLS with configurable SNI and an option to skip certificate verification.

Each health check configures three candidate intervals (FR-1.3): the nominal `interval`, an optional faster `fast-interval` used while the counter is in its degraded zone, and an optional slower `down-interval` used while the backend is fully down. If an optional interval is not set, the nominal interval is used. Every scheduled sleep receives bounded random jitter; this is the mechanism that satisfies NFR-2.3.

Each probe also has a `timeout` (FR-1.4). The probe-level timeout bounds a single attempt; the interval bounds the time between the **start** of consecutive attempts, with the actual probe latency deducted from the next sleep so that slow probes do not push the schedule later and later.

If the configuration sets `healthchecker.netns`, every probe of every type MUST run inside that Linux network namespace (FR-1.5). Entering a netns requires `CAP_SYS_ADMIN`; without it, probes will fail and the backend will go down. This is a deliberate deployment choice, not a bug — see the security subsection below.

#### Rise/Fall State Machine

Each backend carries a single integer counter in the closed range `[0, rise + fall − 1]`. A backend is considered **Up** when the counter is at or above `rise`, and **Down** otherwise. A successful probe increments the counter, saturating at the maximum; a failing probe decrements it, saturating at zero. This is the HAProxy hysteresis model adapted to a single scalar (FR-1.2).

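A compact sketch of that counter, with illustrative names. The document specifies only the *state* after the first result (FR-1.6), not the counter value; jumping straight to a saturation bound is an assumption made here for simplicity:

```go
package main

import "fmt"

// riseFall models the single saturating counter described above:
// Up when counter >= rise, Down otherwise.
type riseFall struct {
	rise, fall int
	counter    int  // in [0, rise+fall-1]
	seen       bool // false until the first probe result (Unknown)
}

// observe folds one probe result into the counter. The first result
// forces a definitive state immediately (FR-1.6, assumed here to jump
// to a saturation bound); later results move one step at a time.
func (r *riseFall) observe(success bool) {
	top := r.rise + r.fall - 1
	if !r.seen {
		r.seen = true
		if success {
			r.counter = top
		} else {
			r.counter = 0
		}
		return
	}
	if success && r.counter < top {
		r.counter++
	} else if !success && r.counter > 0 {
		r.counter--
	}
}

func (r *riseFall) up() bool { return r.seen && r.counter >= r.rise }

func main() {
	b := &riseFall{rise: 3, fall: 2}
	b.observe(true) // first result: immediately Up
	fmt.Println(b.up()) // prints true
	b.observe(false) // one failure is hysteresis, not a transition
	fmt.Println(b.up()) // prints true
	b.observe(false)
	b.observe(false)
	fmt.Println(b.up()) // prints false
}
```
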
Four additional states overlay the rise/fall logic:

- **Unknown** — the backend has not yet produced any probe result since `maglevd` started (or since it was re-added by a reload). An Unknown backend contributes zero effective weight and the transition to Up or Down is taken on the *first* result rather than after `rise` or `fall` consecutive results (FR-1.6). This asymmetric rule lets fresh daemons discover the world quickly while still requiring hysteresis for steady-state flaps.
- **Paused** — operator override (FR-3.1). The probe worker is stopped and the counter is frozen. Effective weight is zero in every pool and every frontend that references the backend, but existing flows are not torn down; this is a soft drain.
- **Disabled** — operator override (FR-3.2). The probe worker is stopped and effective weight is zero in every pool and every frontend that references the backend. Unlike Paused, Disabled causes existing flows to be torn down on the next VPP sync (FR-4.3).
- **Removed** — the backend was deleted by a configuration reload. Its final transition is emitted on the event stream and then all references are dropped.

Backends declared **static** (no `healthcheck` reference in config, FR-1.7) bypass the rise/fall machinery entirely. They are not probed, their counter is not maintained, and they enter Up on startup via a single synthetic pass. They still participate in pool-failover weight computation like any other backend and still honor operator Pause and Disable overrides.

Operator overrides and operator weight mutations are held in process memory only. They survive a `SIGHUP` reload (FR-3.4) but do **not** survive a daemon restart (FR-3.5): when `maglevd` starts, the YAML file is the sole source of truth, and any earlier runtime mutation is gone. Operators who need durable changes must commit them to the configuration file.

#### Aggregation to Frontend State

A frontend references one or more named pools. Each referenced pool contains one or more backends with a per-reference configured weight in `[0, 100]` (FR-2.1). The effective weight that `maglevd` computes for a given `(frontend, pool, backend)` tuple is (FR-2.3):

- The configured weight, if the backend is Up **and** the backend's pool is the active pool (see below).
- Zero in every other case.

The active pool is the first pool, in configuration order, that contains at least one Up backend whose configured weight is non-zero (FR-2.2). If no pool is active (e.g. all backends are Down), every backend contributes zero weight and the frontend's aggregate state is Down. A frontend with no backends at all, or with every referenced backend still in Unknown, is itself Unknown. A frontend with at least one non-zero effective weight is Up (FR-2.4).

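The active-pool and effective-weight rules reduce to two small functions. Types and names below are illustrative, and operator overrides and static backends are deliberately left out of the sketch:

```go
package main

import "fmt"

// Illustrative types; the real maglevd structures differ.
type backendRef struct {
	name   string
	weight int // configured weight, in [0, 100]
}

type pool struct {
	name     string
	backends []backendRef
}

// activePool returns the index of the first pool, in configuration
// order, that contains an Up backend with non-zero configured weight
// (FR-2.2), or -1 when no pool qualifies.
func activePool(pools []pool, up map[string]bool) int {
	for i, p := range pools {
		for _, b := range p.backends {
			if up[b.name] && b.weight > 0 {
				return i
			}
		}
	}
	return -1
}

// effectiveWeight applies FR-2.3 to backend b living in pool poolIdx:
// the configured weight only when the backend is Up and its pool is
// the active one, zero in every other case.
func effectiveWeight(pools []pool, poolIdx int, b backendRef, up map[string]bool) int {
	if poolIdx == activePool(pools, up) && up[b.name] {
		return b.weight
	}
	return 0
}

func main() {
	pools := []pool{
		{name: "primary", backends: []backendRef{{"a", 100}, {"b", 100}}},
		{name: "standby", backends: []backendRef{{"c", 50}}},
	}
	up := map[string]bool{"a": false, "b": false, "c": true}
	fmt.Println(activePool(pools, up))                               // primary all-Down: standby is active
	fmt.Println(effectiveWeight(pools, 1, pools[1].backends[0], up)) // standby carries its configured weight
	up["a"] = true
	fmt.Println(effectiveWeight(pools, 1, pools[1].backends[0], up)) // primary reclaims; standby drops to zero
}
```
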
Whether effective weight zero also flushes existing flows depends on the cause (FR-4.3):

- Up in a non-active pool: weight zero, **no** flush (standby pool).
- Down while `flush-on-down` is true: weight zero, flush.
- Disabled: weight zero, flush, always.
- Paused or Unknown: weight zero, no flush.

#### VPP Reconciliation

`maglevd` treats VPP's LB configuration as a desired-state reconciliation target. The desired state is a pure function of `(current config, current backend state)`; the observed state is read back from VPP through the `lb` plugin's binary API. A sync operation diffs the two and issues the minimal set of `lb_vip_add_del`, `lb_as_add_del`, and `lb_as_set_weight` messages to make them match.

Two triggers drive a sync:

1. **Event-driven, single VIP** (FR-4.1). When the health checker emits a backend transition, the reconciler recomputes desired state for every frontend that references that backend and syncs those VIPs. This is the primary path for convergence during incidents.
2. **Periodic, full** (FR-4.2). A background loop runs a full sync on a configurable interval (default thirty seconds). This is the safety net that closes gaps left by missed events, VPP restarts, or bugs in the event path.

For determinism (NFR-2.1), whenever a sync operation iterates over ASes it does so in a total order defined by the numeric representation of the AS address, with IPv4 addresses ordered before IPv6. Two `maglevd` instances given the same input MUST therefore issue the same `lb_as_add_del` sequence, which in turn means VPP produces the same bucket-to-AS assignment regardless of which instance is driving.

Operator mutations, event-driven syncs, and periodic full syncs are serialized through a single mutex at the VPP-call boundary (NFR-2.4); they never interleave.

#### Startup Warmup and Restart Neutrality

A naive sync loop would, on restart, immediately synthesize a desired state in which every backend is Unknown, map every backend through the effective-weight rules to zero, and push "zero weight everywhere" into VPP before a single probe had completed. The result would be a multi-second black hole on every `maglevd` restart. NFR-1.2 forbids this, and the warmup state machine is how it is enforced.

The warmup has three phases, keyed off two configurable delays `startup-min-delay` (default five seconds) and `startup-max-delay` (default thirty seconds):

1. **Hands-off.** From process start to `startup-min-delay`, the reconciler MUST NOT write anything to VPP at all. Event-driven syncs are suppressed; the periodic full sync is suppressed.
2. **Per-VIP release.** From `startup-min-delay` to `startup-max-delay`, a VIP becomes eligible for sync the moment every backend it references has produced at least one probe result (i.e. none are Unknown). Eligible VIPs are released individually so that healthy VIPs converge as fast as their slowest backend, without being held back by unrelated slow VIPs.
3. **Watchdog.** At `startup-max-delay`, any VIPs still held are released unconditionally by a final full sync. This bounds the worst-case blackout to `startup-max-delay` rather than "as long as the slowest backend takes".

The warmup clock is tied to process start, not to VPP reconnect or configuration reload (NFR-1.3). Reconnecting to a flapping VPP does not re-enter warmup, and `SIGHUP` does not re-enter warmup.

Setting both delays to zero disables the warmup entirely, which is useful for tests but SHOULD NOT be done in production.

#### Configuration and Reload

Configuration lives in a single YAML file (FR-5.1), typically `/etc/vpp-maglev/maglev.yaml`. It is validated in two distinct phases (FR-5.2): a **parse** phase that catches YAML errors, and a **semantic** phase that enforces structural invariants such as:

- Every frontend whose VIPs share an address MUST use backends of the same address family (IPv4 or IPv6), because VPP picks an encap type per VIP and mixing families on one VIP is not supported.
- Every backend referenced by a frontend MUST exist.
- Every referenced health check MUST exist.
- Every pool referenced by a frontend MUST contain at least one backend (FR-2.1).
- VPP LB knobs MUST satisfy plugin constraints: `flow-timeout` in `[1s, 120s]`, `sticky-buckets-per-core` a power of two, `sync-interval` strictly positive, `startup-max-delay` not less than `startup-min-delay`.
- `transition-history` MUST be at least one.

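As an illustrative fragment that satisfies these invariants: the knob names (`flow-timeout`, `sticky-buckets-per-core`, `sync-interval`, `startup-min-delay`, `startup-max-delay`, `transition-history`) and the pool and reference rules come from this document, while the overall YAML layout is an assumption for illustration, not the authoritative schema.

```yaml
lb:
  flow-timeout: 40s              # must be in [1s, 120s]
  sticky-buckets-per-core: 1024  # must be a power of two
  sync-interval: 30s             # must be strictly positive
  startup-min-delay: 5s
  startup-max-delay: 30s         # must be >= startup-min-delay
transition-history: 128          # must be at least one
backends:
  web1: { address: 10.0.0.1, healthcheck: http-ok }
  web2: { address: 10.0.0.2 }    # no healthcheck: static, permanently Up
pools:
  primary:                       # a referenced pool may not be empty
    - { backend: web1, weight: 100 }
frontends:
  www:
    vip: 192.0.2.10
    pools: [primary]             # every referenced pool and backend must exist
```
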
`maglevd --check` runs both phases and exits with code 0 on success, 1 on parse errors, and 2 on semantic errors (NFR-5.2). This exit code contract is what packaging scripts and systemd `ExecStartPre` rely on.

On `SIGHUP` the same two-phase validation runs against the file on disk. If either phase fails, `maglevd` MUST log the error and leave the running configuration untouched (FR-5.4, NFR-1.4). On success, the delta is applied atomically (NFR-2.2): new backends spawn workers, removed backends have their workers stopped and emit a terminal `Removed` event, changed backends restart their workers, and metadata-only changes (address, weight, enable flag) are updated in place without restarting anything. Operator overrides (Pause, Disable) survive reloads (FR-3.4) but — to repeat the point from FR-3.5 — do **not** survive a daemon restart.

#### Lifecycle, Signals, and Security

`maglevd` handles three signals:

- **`SIGHUP`** triggers a configuration reload as described above.
- **`SIGTERM`** and **`SIGINT`** initiate a graceful shutdown: the gRPC server drains, stream subscribers are released, probe workers are cancelled, and the VPP connection is closed. VPP's last-programmed state is not torn down; traffic continues to flow (NFR-1.1).

`maglevd` requires two Linux capabilities, each tied to a specific feature (NFR-4.1):

- **`CAP_NET_RAW`** is required if and only if any configured health check is of type ICMP. Without it, raw-socket creation will fail and all ICMP probes will error out.
- **`CAP_SYS_ADMIN`** is required if and only if `healthchecker.netns` is set. The kernel's `setns(CLONE_NEWNET)` call requires it; without it, every probe will fail on namespace entry.

The shipped Debian unit grants both capabilities through `AmbientCapabilities` and `CapabilityBoundingSet`, which is why the package "just works" out of the box. Hand-run invocations SHOULD set capabilities explicitly (e.g. via `setcap`) rather than running as root.

`maglevd` does not secure its own gRPC listener (NFR-4.2). Operators SHOULD bind the listener to loopback, to a control-plane VRF, or behind a firewall, depending on their threat model. The design deliberately pushes transport security out of the binary on the theory that every deployment already has an answer for it.

#### Interfaces

**Presents.**

- **A gRPC service on a TCP listener** (default `:9090`). This is the *only* programmatic interface to `maglevd`. Every other component talks to `maglevd` through this interface and no other. The service has read-only methods (`List*`, `Get*`, `CheckConfig`), mutating methods (`PauseBackend`, `ResumeBackend`, `EnableBackend`, `DisableBackend`, `SetFrontendPoolBackendWeight`, `ReloadConfig`, `SyncVPPLBState`), and a single streaming method (`WatchEvents`) that multiplexes log entries and state transitions to any number of subscribers with per-subscriber filters (FR-6.3). gRPC reflection is enabled by default so that ad-hoc tooling can introspect the service.
- **A Prometheus `/metrics` HTTP endpoint** on a separate listener (default `:9091`) (FR-6.2). Counters are updated inline as probes run and VPP calls complete; gauges are computed on each scrape from the current checker and VPP state, so there is no sampling lag.
- **Structured JSON logs on stdout**, via `log/slog`, at a configurable level (FR-6.1). Key events — daemon start, config load, VPP connect/disconnect, backend transitions, LB sync mutations, warmup milestones — are logged at `info` or higher so that a default-level deployment has enough to post-mortem an incident.
- **Process exit codes** from `--check`: 0, 1, or 2 as described above (NFR-5.2). These form a small but load-bearing interface to packaging and systemd.

**Consumes.**

- **A YAML configuration file** on disk, passed via `--config` or `MAGLEV_CONFIG`. This is the declarative source of truth for intent; everything the operator mutates at runtime is a delta on top of it, and every runtime delta is lost on a daemon restart (FR-3.5).
- **VPP's binary API socket** (default `/run/vpp/api.sock`). The connection auto-reconnects on drop (FR-4.4), and while disconnected, the reconciler queues no work — the next periodic sync closes any gap.
- **VPP's stats segment socket** (default `/run/vpp/stats.sock`). Read periodically (five-second cadence) for per-VIP packet and byte counters (FR-6.4). Readers are non-blocking (NFR-3.3); a stale snapshot is always available.
- **The Linux kernel's namespace subsystem**, when `healthchecker.netns` is set. Requires `CAP_SYS_ADMIN`.
- **Raw sockets**, for ICMP probes. Requires `CAP_NET_RAW`.

### VPP Dataplane

The VPP dataplane is not part of the `vpp-maglev` codebase, but
it is the component every other piece revolves around, and its
contract with `maglevd` defines what `maglevd` is allowed to do.

#### Responsibilities

VPP's `lb` plugin implements Maglev consistent hashing in the
forwarding fast path. It owns:

- **Global configuration** — an IPv4 source address and an IPv6
source address used as the outer header for GRE-encapsulated
traffic to ASes, the number of sticky buckets per worker core,
and a per-flow idle timeout.
- **A set of VIPs**, each identified by an address prefix, an IP
protocol, and a port. A VIP carries an encap type (GRE4 or
GRE6, picked by the family of the AS addresses) and a flag
for source-IP sticky hashing.
- **A set of ASes per VIP**, each identified by address, with an
integer weight in `[0, 100]`, a `used`/`flushed` state, and a
bucket count derived from the Maglev ring.

It does **not** own: health, configuration intent, operator
overrides, transition history, or metrics. Those belong to
`maglevd`.

#### Interfaces

**Presents.**

- **A binary API** (GoVPP-style message exchange) for reading
and mutating VIP and AS state. `maglevd` is the sole user.
- **A stats segment** with per-VIP counters from the LB plugin
(existing-flow, first-flow, untracked, no-server) and
per-prefix FIB counters. The LB plugin bypasses the FIB for
forwarded packets, so per-backend traffic counters are not
available; this is a known limitation that operators consuming
metrics need to understand.
- **The forwarded-traffic fast path itself**, which is the whole
reason this project exists.

**Consumes.**

- `maglevd`'s binary-API writes — nothing else. There is no
third party programming `lb` state in a working deployment.

### maglevc

`maglevc` is the interactive and scripting CLI. It is a
short-lived client with no persistent state and no background
work (NFR-5.4).

#### Responsibilities

- Provide a human-readable tab-completing shell for `maglevd`
(FR-7.1).
- Dispatch one-shot commands for scripts and automation.
- Render state snapshots (frontends, backends, health checks,
VPP LB state, VPP counters) with optional ANSI color.
- Stream events in real time (`watch events`) with filters.

#### Interaction Model

With no positional arguments, `maglevc` starts a readline-based
REPL with a nested command tree: `show`, `set`, `watch`,
`config`, plus the usual `help`, `exit`, `quit`. Tab completion
is built from the same command tree the dispatcher uses, so
completion can never drift from the actual command set. With
positional arguments, `maglevc` executes one command against the
server and exits — in this mode color is off by default so that
pipes and logs stay clean, but `--color=true` can be set
explicitly.

#### Interfaces

**Presents.**

- **An interactive TTY shell** and a **one-shot command mode**.
Humans and scripts are the only consumers; there is no API,
no socket, no file output.

**Consumes.**

- **`maglevd`'s gRPC service**, over insecure credentials by
default. `maglevc` MUST NOT talk to VPP directly, MUST NOT
read the config file directly, and MUST NOT maintain any
state of its own across invocations (NFR-5.4). Everything it
shows and everything it mutates goes through the gRPC API.

### maglevd-frontend

`maglevd-frontend` is an optional web dashboard (FR-7.2). Unlike
`maglevc`, it is a long-running process: it holds open gRPC
streams, caches snapshots, and serves HTTP.

#### Responsibilities

- Connect to one or more `maglevd` servers simultaneously.
- Maintain a cached view of each server's state: frontends,
backends, health checks, VPP LB state, and VPP counters.
- Serve a SolidJS single-page application and a JSON API to
browsers.
- Stream live updates to browsers so that dashboards update
without polling (NFR-5.3).
- Expose an optional authenticated mutation surface (FR-7.3).

#### Multi-Server Multiplexing

A single `maglevd-frontend` process accepts a comma-separated
list of gRPC server addresses. For each one, it runs an
independent pool of goroutines: one to stream events, one to
refresh list-oriented data on a roughly one-second cadence, one
to refresh per-health-check detail, and one (debounced on
incoming events) to refresh VPP LB state and counters. Failures
on one server MUST NOT block the others, and the served JSON
state always reports per-server connection status so that the
SPA can mark partially-available views.

All per-server event streams publish into a single shared event
broker with a bounded replay buffer (capped both in time and in
event count, satisfying NFR-3.2). The broker assigns each event
a monotonic `epoch-seq` identifier so that browsers reconnecting
a dropped Server-Sent-Events stream can resume from where they
left off without a full refresh — and so that a broker restart,
which reshuffles the epoch, forces a full refresh rather than
silently handing out ambiguous IDs.

#### Read-Only and Admin Surfaces

The HTTP surface is partitioned into two paths (FR-7.3):

- **`/view/`** serves the SPA and a read-only JSON API. It is
always publicly accessible: there is no auth, and there are
no mutation endpoints under it at all. The design intent is
that `/view/` can be exposed to a broader audience (NOC,
management UIs, screens on walls) without risk.
- **`/admin/`** serves the SPA entry point and the mutating
JSON API behind HTTP basic auth. Credentials come from
`MAGLEV_FRONTEND_USER` and `MAGLEV_FRONTEND_PASSWORD`. If
either is unset or empty, the `/admin/` path MUST return 404
(NFR-4.3) — the admin surface is not merely locked, it is
not advertised. This makes accidental exposure self-limiting:
forgetting to set the env vars disables admin rather than
leaving it open.

Both surfaces talk to the same underlying cache; the difference
is only what endpoints exist.

#### Interfaces

**Presents.**

- **An HTTP listener** (default `:8080`) serving:
  - `/view/` — the SolidJS SPA (embedded in the binary).
  - `/view/api/*` — read-only JSON endpoints for version,
    server list, aggregated state, and per-server state.
  - `/view/api/events` — an SSE stream bridged from the
    internal event broker, with `Last-Event-ID` replay.
  - `/admin/` — the SPA entry point, gated on basic auth.
  - `/admin/api/*` — mutating JSON endpoints that translate
    to gRPC mutations against the appropriate `maglevd`.
  - `/healthz` — a liveness probe.

**Consumes.**

- **One or more `maglevd` gRPC services.** As with `maglevc`,
this is the *only* way `maglevd-frontend` reaches into the
system. It MUST NOT read the YAML config file and MUST NOT
talk to VPP directly (NFR-5.4).
- **Two environment variables**, `MAGLEV_FRONTEND_USER` and
`MAGLEV_FRONTEND_PASSWORD`, for the optional admin surface.

### maglevt

`maglevt` is a small out-of-band probe TUI (FR-7.4). It is not
part of the control loop at all; it is a validation tool that
an operator runs on a laptop, a jump host, or a monitoring box
to see VIPs the way a client sees them.

#### Responsibilities

- Read one or more `maglev.yaml` files and enumerate TCP-style
VIPs from the `frontends` section.
- Probe each VIP at a configurable interval with a real HTTP or
HTTPS request against a configurable path.
- Measure latency (min/max/average and a handful of
percentiles) and success rate over a rolling window.
- Tally the value of a configurable response header (by
default, `X-IPng-Frontend`) so that operators can see which
backend actually served each request. Because keep-alives are
disabled by default, this tally reflects fresh Maglev hashing
decisions rather than a pinned connection.

#### Scope Boundary

`maglevt` is intentionally decoupled from `maglevd`. It does
not talk gRPC, it does not read the VPP stats segment, and it
does not know or care whether the target VIPs are actually
served by the `vpp-maglev` control plane at all — it simply
probes addresses. This makes it useful in at least three
scenarios: validating a `maglevd` restart end-to-end from a
client perspective, debugging pool failover by watching the
header tally reshuffle, and sanity-checking that a given VIP is
reachable across deployments when the gRPC control plane is
unavailable or out of reach.

#### Interfaces

**Presents.**

- **A full-screen TUI** built on Bubble Tea, with a
deterministic grid layout and a few interactive toggles (e.g.
reverse-DNS lookup). There is no machine-readable output; if
you need metrics, use Prometheus on `maglevd`.

**Consumes.**

- **One or more YAML configuration files**, which it parses
with the same library `maglevd` uses. Only the subset of the
schema describing frontends is actually consumed; unknown
fields are ignored. Duplicate VIPs discovered across files
are de-duplicated by `(scheme, address, port)` so that
multi-file deployments don't double-probe.
- **The outbound network**, directly. No special capabilities
are required — `maglevt` is a plain HTTP client.

## Operational Concerns

### Configuration Reload Semantics

Reload is triggered by `SIGHUP` to `maglevd`, or by the
`ReloadConfig` gRPC method. Both paths run the same validation
as `--check`. A reload MUST NOT partially apply (NFR-2.2):
either every change in the new file takes effect, or none of
them do. A reload MUST NOT restart unchanged probe workers; the
probe state machine is preserved precisely because operators
use reloads as a routine operation and expect backends whose
health-check definitions did not change to simply keep running.

Operator overrides (Pause, Disable) survive a reload as long as
the backend still exists in the new config (FR-3.4). A backend
that disappears from the new config transitions to `Removed`
and its worker is stopped; if it reappears in a later reload it
starts again in `Unknown` with a fresh counter.

A daemon **restart** is different from a reload. On restart,
the YAML configuration is the sole source of truth: every
runtime override is gone, every runtime weight mutation is gone
(FR-3.5). Operators who need an override to persist across
restarts must commit the intended state to the config file.

### Failure Modes

- **VPP restart.** `maglevd` detects the disconnect, enters a
reconnect loop, and on reconnect reads VPP's version and
current state (FR-4.4). The warmup clock is not reset by VPP
reconnects (NFR-1.3) — a flapping VPP does not cause
`maglevd` to go hands-off every time. The next periodic full
sync pushes the current desired state into the freshly
restarted plugin.
- **`maglevd` restart with VPP up.** Handled by the warmup
state machine (NFR-1.2): new flows see the last-programmed
weights until probes catch up, not zeros.
- **`maglevd` restart with VPP also down.** VPP comes back
first, `maglevd` comes back second, warmup gates pushing
anything until probes converge. This is the worst-case path,
bounded by `startup-max-delay`.
- **Configuration reload with a broken file.** The reload is
rejected; the running configuration is retained; an error
is logged (FR-5.4). No probes are interrupted (NFR-1.4).
- **Probe namespace disappears.** Entering the namespace fails,
the probe is counted as a failure, and the backend
eventually transitions Down under normal rise/fall rules.
There is no special-case handling; this is by design, because
an operator removing the netns while `maglevd` is running is
an operational error that SHOULD manifest as a visible Down,
not as silent success.
- **gRPC subscriber too slow.** Per-subscriber event queues
are bounded (NFR-3.2). A subscriber that cannot keep up MUST
be dropped rather than backing up the central fan-out
(NFR-3.4).
- **Mid-flight weight mutation during sync.** Operator weight
changes and reconciler sync both route through the same
state-protected code path, so mutations are serialized rather
than interleaved with VPP writes (NFR-2.4).

### Observability

**Structured logging** (FR-6.1). All logs are slog-formatted
JSON written to stdout. The default level is `info`, which is
sized to produce one or two lines per incident rather than per
probe. The `debug` level dumps every probe attempt and every
VPP binary-API message, and is intended for post-mortem
investigation.

**Prometheus metrics** (FR-6.2, FR-6.4). `maglevd` exposes two
kinds of metric: inline metrics (probe-outcome counters,
probe-latency histograms, backend state-transition counters,
and VPP API and LB sync counters), updated as the events
happen; and on-demand gauges for current backend state,
rise/fall counter values, configured weights, VPP connection
status, VPP uptime, VPP info labels, and per-VIP LB plugin
counters. Gauges are sampled from live state on every scrape,
so there is no sampling staleness.

**Streaming events** (FR-6.3). The gRPC `WatchEvents` method
multiplexes three event families into one stream: log events
(the same structured logs the daemon writes to stdout), backend
transitions (one per affected frontend, since a single backend
may participate in multiple frontends), and frontend aggregate
transitions (Up/Down/Unknown flips at the frontend level).
Clients MAY filter by event family and by minimum log level.
The web frontend consumes this stream and re-publishes it to
browsers over SSE, with an epoch-seq replay buffer layered on
top.

### Security and Capabilities

`maglevd` needs `CAP_NET_RAW` for ICMP probes and
`CAP_SYS_ADMIN` for netns entry (NFR-4.1). Neither is optional
for the feature that needs it, and neither is required
otherwise; operators who use neither feature MAY run `maglevd`
as an unprivileged user with no capabilities at all.

`maglevd-frontend` needs no special capabilities — it is a
plain HTTP client of `maglevd` and a plain HTTP server for
browsers. It does handle user credentials (basic auth), which
are read from the environment and held in process memory;
operators SHOULD terminate the frontend behind a TLS reverse
proxy if it is exposed beyond a trusted network.

`maglevc` and `maglevt` need no special capabilities.

All gRPC traffic runs insecure by default (NFR-4.2). Securing
transport is an operational decision, not a build-time one;
deployments that require mTLS SHOULD terminate gRPC at a
sidecar or colocate control and data plane on a trusted
segment.

### Concurrency Model

The concurrency model inside `maglevd` is deliberately simple
and local:

- Each backend owns exactly one probe worker goroutine
(NFR-3.1). Workers do not share state with each other.
- All events — transitions and log records — travel through a
single central channel which is then fanned out to bounded
per-subscriber queues (NFR-3.2). The fan-out is the only
place where multiple subscribers can observe the same event.
- The configuration pointer is swapped atomically on reload
(NFR-2.2); readers take a read lock for the duration of a
single access, so the live config is always internally
consistent even mid-reload.
- The VPP stats snapshot is published as an atomic pointer
(NFR-3.3), so Prometheus scrapes and gRPC reads of counters
are wait-free.
- Reconciliation holds a mutex around VPP calls, which
serializes operator mutations, event-driven syncs, and
periodic full syncs against each other (NFR-2.4). This is
intentional: the order in which VPP sees mutations matters
for determinism, and serializing them is cheap at the scale
of control-plane events.

Deadlock avoidance is structural rather than audited:
dependencies between subsystems are one-way. The checker does
not call into VPP; the reconciler reads checker state and calls
VPP; VPP never calls back. `maglevd-frontend` and `maglevc`
only read from `maglevd` over gRPC. There is no cycle in the
wait-for graph.

## Alternatives Considered

This is a retrofit of a shipped system, so the alternatives
here are the ones the code actively rejects, not speculative
designs.

- **Several probe schedulers sharing one goroutine pool.**
Rejected in favor of one goroutine per backend. The
per-backend model is trivially correct, has no shared state,
and scales linearly with backend count at a cost of a few
kilobytes per backend.
- **`maglevd-frontend` as a sidecar per `maglevd`.** Rejected
in favor of one frontend speaking to many daemons. A single
dashboard pane across a fleet is the common operator
request; pushing multi-server logic into the frontend keeps
the daemon simple.
- **Operator actions expressed as config edits plus SIGHUP.**
Rejected in favor of direct gRPC mutations. Pausing a
backend during an incident should not require editing a
file, and the effect should survive subsequent reloads
(FR-3.4) — though, by deliberate design, not a daemon
restart (FR-3.5).
- **Persisting operator overrides across daemon restarts.**
Rejected in favor of making the YAML config file the sole
source of truth on startup (FR-3.5). Persisting runtime
overrides would require an on-disk side store and a clear
policy for what happens when the side store and the config
file disagree; keeping the daemon stateless on startup is
simpler and harder to get wrong.
- **Synchronous full sync after every transition.** Rejected
in favor of event-driven single-VIP syncs with a periodic
full sync as a safety net (FR-4.1, FR-4.2). Full syncs are
cheap but not free, and the blast radius of a transient bug
in the desired-state computation is smaller when
per-transition work only touches one VIP.
- **Letting `maglevt` read `maglevd`'s gRPC.** Rejected in
favor of reading VIPs from the YAML file directly, so that
`maglevt` remains useful when `maglevd` itself is the thing
being investigated.

## Open Questions

- **Mutual TLS for gRPC.** Currently insecure by default. A
future version may wire in standard mTLS support once a
credential-management story is picked.
- **Per-AS traffic counters.** The VPP `lb` plugin bypasses
the FIB and therefore does not produce per-AS traffic
counters. Surfacing real per-backend byte/packet counts
would require a VPP-side change.
- **High-availability of the control plane.** Two `maglevd`
instances on the same VPP would interleave writes harmlessly
thanks to determinism (NFR-2.1), but there is no leader
election and no formal story about which instance owns which
VIPs. Today, operators run a single `maglevd` per VPP host.