vpp-maglev Design Document

Metadata

Status Retrofit — describes shipped behavior as of v1.0.0
Author Pim van Pelt <pim@ipng.ch>
Last updated 2026-04-15
Audience Operators and contributors who will read the source tree next

The key words MUST, MUST NOT, SHOULD, SHOULD NOT, and MAY are used as described in RFC 2119, and are reserved in this document for requirements that are actually enforced in code or by an external dependency. Plain-language descriptions of what the system or an operator can do are written in lowercase — "can", "will", "does" — and should not be read as normative.

Summary

vpp-maglev is a control plane for the VPP lb (Maglev load balancer) plugin. A single daemon — maglevd — probes a fleet of backends, maintains an authoritative view of their health, and programs the VPP dataplane so that traffic hashed to a given VIP lands only on healthy backends. Operators drive the system through maglevc (an interactive CLI) or maglevd-frontend (a read-only web dashboard with an optional authenticated admin surface). A small companion binary, maglevt, validates VIPs from outside the control plane by sending live HTTP probes and reporting failover behavior.

Background

VPP's lb plugin implements Maglev consistent hashing inside the dataplane: a VIP is backed by a pool of Application Servers (ASes), each with an integer weight in [0, 100], and incoming flows are hashed onto a bucket ring so that weight changes disturb as few existing flows as possible. The plugin knows nothing about backend health; if an AS dies while it holds buckets, traffic to those buckets is black-holed until something external tells lb to remove or re-weight the AS.

vpp-maglev is that external thing. Before vpp-maglev, operators maintained VIP configurations by hand and reacted to incidents with vppctl. The project replaces that loop with a daemon that owns the health story, reconciles it with the dataplane, and exposes the result through a uniform gRPC API so that CLIs, dashboards, and scripts all read the same source of truth.

Goals and Non-Goals

Product Goals

  1. Accurate backend health. Detect that a backend is up, degraded, or down quickly enough to keep user-visible error rates low, and avoid flapping under transient faults.
  2. Correct VPP state. The set of VIPs and per-AS weights in VPP converges to the configured intent, filtered by current health, for every supported failure mode.
  3. Restart neutrality. Restarting maglevd with VPP already up MUST NOT cause traffic to be black-holed while health probes warm up.
  4. Operator control. A human can pause, drain, or weight-shift a backend in seconds without editing config files.
  5. Uniform observability. Every state transition, VPP API call, and probe result is emitted as a structured log, a Prometheus metric, or a streaming event — ideally all three.
  6. One source of truth. Every other component (CLI, web frontend, scripts) reads maglevd through one typed interface. There is no secondary control plane.

Non-Goals

  • vpp-maglev is not a VPP installer or packaging layer. It assumes VPP is already running with the lb plugin loaded.
  • It does not implement its own dataplane fast path. All forwarding stays in VPP; maglevd only programs the plugin.
  • It is not a generic service mesh. There is no L7 routing, cert issuance, service discovery, or east-west policy — only VIPs, pools, and backends.
  • It is not a config store. Configuration is a YAML file on disk; the gRPC API can check and reload it but cannot author it.
  • It does not secure its own transport. gRPC runs insecure by default; TLS, mTLS, or firewalls are the operator's responsibility.

Requirements

Each requirement carries a unique identifier (FR-X.Y or NFR-X.Y) so that later sections can cite it.

Functional Requirements

FR-1 Health checking

  • FR-1.1 The system supports ICMP, TCP, HTTP, and HTTPS health checks, each with its own protocol-specific success criteria.
  • FR-1.2 Each health check MUST apply HAProxy rise/fall semantics with operator-configurable thresholds.
  • FR-1.3 A health check MAY declare distinct interval, fast-interval, and down-interval values so that recovery from a degraded or down state is faster than steady-state polling.
  • FR-1.4 Each probe attempt is bounded by a configurable per-probe timeout, independent of the scheduling interval.
  • FR-1.5 If the configuration sets healthchecker.netns, every probe MUST execute inside the named Linux network namespace.
  • FR-1.6 The first probe result against a newly-created backend forces an immediate transition out of Unknown, without waiting for rise or fall consecutive results.
  • FR-1.7 A backend MAY omit its healthcheck reference to declare itself static. A static backend is not probed and is treated as permanently Up; it still participates in pool failover and still honors operator Pause and Disable overrides.

FR-2 Aggregation and pool failover

  • FR-2.1 A frontend MAY reference one or more named pools. Each referenced pool MUST contain at least one (backend, configured-weight) tuple; an empty pool is a configuration error and is rejected at load time.
  • FR-2.2 At most one pool is active at any time — the first, in configuration order, that contains a healthy backend with non-zero configured weight; backends in other pools contribute zero effective weight. If no pool qualifies, none is active.
  • FR-2.3 The effective weight of a (frontend, pool, backend) tuple is the configured weight when the backend is Up and the pool is active, and zero in every other case.
  • FR-2.4 A frontend's aggregate state is Up when at least one backend has non-zero effective weight, Unknown when every referenced backend is still Unknown (or the frontend references no backends), and Down otherwise.

FR-3 Operator control

  • FR-3.1 Operators can pause and resume individual backends at runtime. Pausing stops the probe worker, freezes the rise/fall counter, and drives effective weight to zero in every pool and every frontend that references the backend. Existing flows are not torn down; this is a soft drain.
  • FR-3.2 Operators can disable and re-enable individual backends at runtime. Disabling drives effective weight to zero in every pool and every frontend that references the backend, and MUST cause existing flows to be torn down on the next VPP sync.
  • FR-3.3 Operators can set the configured weight of a specific (frontend, pool, backend) tuple at runtime.
  • FR-3.4 Operator overrides (Pause, Disable) and operator weight mutations survive a configuration reload (SIGHUP) as long as the underlying backend and tuple still exist in the new configuration.
  • FR-3.5 Operator overrides and operator weight mutations do not survive a maglevd restart. After a restart, the YAML configuration file is authoritative for every backend and every tuple: paused backends come back unpaused, disabled backends come back enabled, mutated weights revert to the configured value. Operators who need persistent changes must edit the config file.

FR-4 VPP reconciliation

  • FR-4.1 For every backend state transition that changes an effective weight, maglevd pushes the resulting AS state into VPP for every affected VIP.
  • FR-4.2 maglevd runs a periodic full reconciliation on a configurable cadence (default thirty seconds) as a safety net against missed events and VPP restarts.
  • FR-4.3 Weight-to-zero is communicated to VPP as a graceful drain by default; transitions to Disabled and transitions to Down while flush-on-down is true MUST tear existing flows down on the next sync.
  • FR-4.4 maglevd tolerates VPP disconnects by auto-reconnecting and resuming reconciliation once the connection is re-established.

FR-5 Configuration

  • FR-5.1 Configuration is loaded from a single YAML file specified at startup and referenced by all later operations.
  • FR-5.2 Configuration validation distinguishes parse errors (malformed YAML) from semantic errors (structural invariants) and MUST report each with its own exit code from --check: 0 (OK), 1 (parse), 2 (semantic).
  • FR-5.3 maglevd reloads its configuration on SIGHUP without restarting the process, without restarting unchanged probe workers, and without losing operator overrides (see FR-3.4).
  • FR-5.4 A parse or semantic error encountered during reload MUST leave the running configuration in place.
  • FR-5.5 The same validation and reload semantics are also reachable through gRPC (CheckConfig, ReloadConfig).

FR-6 Observability

  • FR-6.1 All logs are emitted as structured JSON on stdout at a configurable level.
  • FR-6.2 maglevd exposes Prometheus metrics for probe outcomes, probe latency, backend state transitions, VPP API traffic, and VPP LB sync mutations.
  • FR-6.3 A streaming gRPC API multiplexes log entries, backend transitions, and frontend aggregate transitions to any number of subscribers with per-subscriber filters.
  • FR-6.4 Per-VIP packet counters from VPP's stats segment are surfaced through both the gRPC API and the Prometheus surface.

FR-7 Clients and peripheral tools

  • FR-7.1 An interactive CLI (maglevc) provides a tab-completing shell and a one-shot command mode, both backed by the same command tree.
  • FR-7.2 A web frontend (maglevd-frontend) can multiplex more than one maglevd in a single process and present their combined state.
  • FR-7.3 The web frontend partitions its HTTP surface into a public read-only path (/view/) and an authenticated mutating path (/admin/). If credentials are not configured, /admin/ MUST NOT be advertised (the path returns 404).
  • FR-7.4 An out-of-band tester (maglevt) probes configured VIPs from outside the control plane, measures latency, and tallies a configurable response header.

Non-Functional Requirements

NFR-1 Availability and reliability

  • NFR-1.1 A maglevd outage MUST NOT stop the dataplane. While maglevd is absent, VPP continues to forward traffic with its last-programmed state.
  • NFR-1.2 Restarting maglevd with VPP up MUST NOT black-hole new flows during the probe warm-up window; this is enforced by the startup warmup state machine described under maglevd.
  • NFR-1.3 The warmup clock is tied to process start and MUST NOT be reset by VPP reconnects or configuration reloads.
  • NFR-1.4 A maglevd-side reload with a broken file MUST NOT interrupt any running probe.

NFR-2 Determinism and correctness

  • NFR-2.1 Two maglevd instances given the same configuration and the same backend state MUST issue the same sequence of lb_as_add_del calls to VPP, so that VPP's bucket assignment is stable across process swaps. This is the job of the deterministic AS ordering rule.
  • NFR-2.2 Configuration reload MUST be atomic: either every change in the new file takes effect, or none of them do.
  • NFR-2.3 Probe scheduling SHOULD apply bounded jitter so that, after a daemon restart or a configuration reload, probes do not phase-lock to the wall clock.
  • NFR-2.4 Operator mutations, event-driven syncs, and periodic full syncs against VPP MUST be serialized with respect to one another; they MUST NOT interleave.

NFR-3 Performance and scalability

  • NFR-3.1 Probing N backends costs roughly N goroutines doing mostly idle waits; there is no central probe scheduler.
  • NFR-3.2 Event fan-out, transition history, and per-subscriber event queues MUST all be bounded; no structure grows without limit under sustained load.
  • NFR-3.3 VPP stats snapshots are published as an atomic pointer so that Prometheus scrapes and gRPC counter reads are wait-free.
  • NFR-3.4 A gRPC subscriber that cannot keep up MUST be dropped rather than blocking the central fan-out.

NFR-4 Security

  • NFR-4.1 maglevd runs with only the Linux capabilities it actually needs: CAP_NET_RAW only when ICMP probes are in use, CAP_SYS_ADMIN only when healthchecker.netns is set.
  • NFR-4.2 gRPC transport security is explicitly out of scope; the daemon runs insecure by default and deployments SHOULD front it with a firewall, a trusted network, or a TLS-terminating sidecar.
  • NFR-4.3 The web frontend's mutating surface MUST be hidden entirely (HTTP 404) when either of its basic-auth environment variables is unset.

NFR-5 Operability

  • NFR-5.1 Every CLI flag on every binary SHOULD have an environment-variable equivalent so that the binaries can be driven purely through env in container deployments.
  • NFR-5.2 maglevd --check MUST provide a stable exit-code contract (0 / 1 / 2) for use by packaging scripts and ExecStartPre handlers.
  • NFR-5.3 Dashboards can track state in real time through the streaming event interface rather than by tight polling.
  • NFR-5.4 maglevc and maglevd-frontend MUST NOT maintain any authoritative state of their own; all truth lives in maglevd.

Architecture Overview

Process Model

The system ships as three independent executables plus one optional companion tester:

  • maglevd — the long-running daemon. Hosts both the health checker and the VPP control plane.
  • maglevc — short-lived CLI client.
  • maglevd-frontend — long-running web dashboard (optional).
  • maglevt — short-lived out-of-band probe TUI (optional).

VPP itself is a fourth moving part, but it is an external dependency, not part of the vpp-maglev codebase.

Data Flow

Configuration flows in from a YAML file on disk (read by maglevd) and from runtime mutations issued over gRPC by maglevc or maglevd-frontend. Health state flows out of maglevd in three directions: into VPP (as AS weight changes), into Prometheus (as metrics), and into gRPC clients (as streaming events and snapshot reads). Traffic counters flow back in from VPP's stats segment and are surfaced through the same gRPC and Prometheus channels. No component writes to VPP except maglevd. No component serves maglevd's state except maglevd itself.

Components

maglevd

maglevd is the entire control plane. It is a single Go process that bundles three internal concerns — a fleet of probe workers, a VPP reconciler, and a gRPC server — around one shared, versioned view of (config, backend state, frontend state).

Responsibilities

  • Load and validate configuration; accept reloads on SIGHUP (FR-5.3, FR-5.4).
  • Run one health-check worker per backend defined in config (NFR-3.1).
  • Maintain each backend's rise/fall counter and derive its state (FR-1.2, FR-1.6).
  • Aggregate backend state into per-frontend state, honoring pool-based failover and per-backend operator overrides (FR-2.x, FR-3.x).
  • Connect to VPP's binary API and stats socket, reconnecting automatically on disconnect (FR-4.4).
  • Compute a desired VPP lb state from current configuration and health, and drive VPP to match it (FR-4.1, FR-4.2).
  • Expose the whole picture through a gRPC service and a Prometheus /metrics endpoint (FR-6.x).

Probe Types and Intervals

Four probe types are supported (FR-1.1):

  • ICMP — sends an echo request, expects a matching reply. This probe type MUST have access to a raw socket, which requires CAP_NET_RAW (NFR-4.1).
  • TCP — establishes a TCP connection and immediately closes it. No payload is exchanged.
  • HTTP — issues a request against a configured path, matches the response code against a configured numeric range, and optionally matches the response body against a regular expression.
  • HTTPS — HTTP over TLS with configurable SNI and an option to skip certificate verification.
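
To make the HTTP and HTTPS success criteria concrete, the sketch below shows roughly what a single probe attempt could look like. The type and field names are hypothetical stand-ins, not maglevd's actual identifiers; only the status-range check, the optional body regexp, the SNI and insecure options, and the per-probe timeout are taken from the text.

```go
// Sketch of a single HTTP(S) probe attempt. HTTPCheck and its fields are
// illustrative; the checks mirror the success criteria described above.
package probe

import (
	"context"
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"regexp"
	"time"
)

type HTTPCheck struct {
	URL        string         // e.g. "https://192.0.2.10/healthz"
	MinStatus  int            // lower bound of the accepted status range
	MaxStatus  int            // upper bound of the accepted status range
	BodyRegexp *regexp.Regexp // optional body match; nil skips the check
	ServerName string         // optional SNI override (HTTPS)
	Insecure   bool           // skip certificate verification (HTTPS)
	Timeout    time.Duration  // per-probe timeout (FR-1.4)
}

// Probe returns nil on success, or an error describing why the attempt failed.
func (c *HTTPCheck) Probe(ctx context.Context) error {
	client := &http.Client{
		Timeout: c.Timeout,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				ServerName:         c.ServerName,
				InsecureSkipVerify: c.Insecure,
			},
		},
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, c.URL, nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode < c.MinStatus || resp.StatusCode > c.MaxStatus {
		return fmt.Errorf("status %d outside [%d, %d]", resp.StatusCode, c.MinStatus, c.MaxStatus)
	}
	if c.BodyRegexp != nil {
		body, err := io.ReadAll(resp.Body)
		if err != nil {
			return err
		}
		if !c.BodyRegexp.Match(body) {
			return fmt.Errorf("body does not match %q", c.BodyRegexp)
		}
	}
	return nil
}
```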

Each health check configures three candidate intervals (FR-1.3): the nominal interval, an optional faster fast-interval used while the counter is in its degraded zone, and an optional slower down-interval used while the backend is fully down. If an optional interval is not set, the nominal interval is used. Every scheduled sleep receives bounded random jitter; this is the mechanism that satisfies NFR-2.3.

Each probe also has a timeout (FR-1.4). The probe-level timeout bounds a single attempt; the interval bounds the time between the start of consecutive attempts, with the actual probe latency deducted from the next sleep so that slow probes do not push the schedule later and later.
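
A minimal sketch of that scheduling loop follows; probeOnce, nextInterval, and the 10% jitter cap are illustrative stand-ins, not the daemon's actual identifiers.

```go
// Sketch of one probe worker's scheduling loop. The structure mirrors the
// text: each attempt is bounded by its own timeout (FR-1.4), observed probe
// latency is deducted from the next sleep, and every sleep receives bounded
// random jitter (NFR-2.3).
package probe

import (
	"context"
	"math/rand"
	"time"
)

func runWorker(ctx context.Context, timeout time.Duration,
	probeOnce func(context.Context) error, nextInterval func() time.Duration,
	report func(success bool)) {
	for ctx.Err() == nil {
		start := time.Now()

		probeCtx, cancel := context.WithTimeout(ctx, timeout)
		err := probeOnce(probeCtx)
		cancel()
		report(err == nil) // feeds the rise/fall counter

		interval := nextInterval() // nominal, fast-, or down-interval for the current state
		var jitter time.Duration
		if maxJitter := int64(interval) / 10; maxJitter > 0 {
			jitter = time.Duration(rand.Int63n(maxJitter))
		}
		sleep := interval + jitter - time.Since(start) // deduct probe latency
		if sleep < 0 {
			sleep = 0
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(sleep):
		}
	}
}
```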

If the configuration sets healthchecker.netns, every probe of every type MUST run inside that Linux network namespace (FR-1.5). Entering a netns requires CAP_SYS_ADMIN; without it, probes will fail and the backend will go down. This is a deliberate deployment choice, not a bug — see the security subsection below.

Rise/Fall State Machine

Each backend carries a single integer counter in the closed range [0, rise + fall - 1]. A backend is considered Up when the counter is at or above rise, and Down otherwise. A successful probe increments the counter, saturating at the maximum; a failing probe decrements it, saturating at zero. This is the HAProxy hysteresis model adapted to a single scalar (FR-1.2).
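
A sketch of that single-scalar counter, with hypothetical names; only the bounds and the Up threshold come from the text.

```go
// Sketch of the scalar hysteresis counter (FR-1.2): the counter lives in
// [0, rise+fall-1] and the backend is Up once it reaches rise.
package checker

type riseFall struct {
	rise, fall int
	value      int
}

func (c *riseFall) observe(success bool) {
	ceiling := c.rise + c.fall - 1
	switch {
	case success && c.value < ceiling:
		c.value++ // saturate at the maximum
	case !success && c.value > 0:
		c.value-- // saturate at zero
	}
}

func (c *riseFall) up() bool { return c.value >= c.rise }
```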

Four additional states overlay the rise/fall logic:

  • Unknown — the backend has not yet produced any probe result since maglevd started (or since it was re-added by a reload). An Unknown backend contributes zero effective weight and the transition to Up or Down is taken on the first result rather than after rise or fall consecutive results (FR-1.6). This asymmetric rule lets fresh daemons discover the world quickly while still requiring hysteresis for steady-state flaps.
  • Paused — operator override (FR-3.1). The probe worker is stopped and the counter is frozen. Effective weight is zero in every pool and every frontend that references the backend, but existing flows are not torn down; this is a soft drain.
  • Disabled — operator override (FR-3.2). The probe worker is stopped and effective weight is zero in every pool and every frontend that references the backend. Unlike Paused, Disabled causes existing flows to be torn down on the next VPP sync (FR-4.3).
  • Removed — the backend was deleted by a configuration reload. Its final transition is emitted on the event stream and then all references are dropped.

Backends declared static (no healthcheck reference in config, FR-1.7) bypass the rise/fall machinery entirely. They are not probed, their counter is not maintained, and they enter Up on startup via a single synthetic pass. They still participate in pool-failover weight computation like any other backend and still honor operator Pause and Disable overrides.

Operator overrides and operator weight mutations are held in process memory only. They survive a SIGHUP reload (FR-3.4) but do not survive a daemon restart (FR-3.5): when maglevd starts, the YAML file is the sole source of truth, and any earlier runtime mutation is gone. Operators who need durable changes must commit them to the configuration file.

Aggregation to Frontend State

A frontend references one or more named pools. Each referenced pool contains one or more backends with a per-reference configured weight in [0, 100] (FR-2.1). The effective weight that maglevd computes for a given (frontend, pool, backend) tuple is (FR-2.3):

  • The configured weight, if the backend is Up and the backend's pool is the active pool (see below).
  • Zero in every other case.

The active pool is the first pool, in configuration order, that contains at least one Up backend whose configured weight is non-zero (FR-2.2). If no pool is active (e.g. all backends are Down), every backend contributes zero weight and the frontend's aggregate state is Down. A frontend with no backends at all, or with every referenced backend still in Unknown, is itself Unknown. A frontend with at least one non-zero effective weight is Up (FR-2.4).
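
The following sketch shows one way to express the active-pool selection and the effective-weight rules; the types and the isUp callback are illustrative, not the daemon's actual data model.

```go
// Sketch of active-pool selection and effective weights for one frontend
// (FR-2.2, FR-2.3).
package agg

type backendRef struct {
	name   string
	weight int // configured weight in [0, 100]
}

type pool struct {
	backends []backendRef
}

type tupleWeight struct {
	pool    int    // index into the frontend's pool list
	backend string
	weight  int // effective weight
}

func effectiveWeights(pools []pool, isUp func(name string) bool) []tupleWeight {
	// Active pool: first, in configuration order, containing an Up backend
	// with non-zero configured weight (FR-2.2). -1 means no pool is active.
	active := -1
	for i, p := range pools {
		for _, b := range p.backends {
			if b.weight > 0 && isUp(b.name) {
				active = i
				break
			}
		}
		if active != -1 {
			break
		}
	}
	// Effective weight is the configured weight only for Up backends in the
	// active pool; zero in every other case (FR-2.3).
	var out []tupleWeight
	for i, p := range pools {
		for _, b := range p.backends {
			w := 0
			if i == active && isUp(b.name) {
				w = b.weight
			}
			out = append(out, tupleWeight{pool: i, backend: b.name, weight: w})
		}
	}
	return out
}
```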

Whether effective weight zero also flushes existing flows depends on the cause (FR-4.3):

  • Up in a non-active pool: weight zero, no flush (standby pool).
  • Down while flush-on-down is true: weight zero, flush.
  • Disabled: weight zero, flush, always.
  • Paused or Unknown: weight zero, no flush.

VPP Reconciliation

maglevd treats VPP's LB configuration as a desired-state reconciliation target. The desired state is a pure function of (current config, current backend state); the observed state is read back from VPP through the lb plugin's binary API. A sync operation diffs the two and issues the minimal set of lb_vip_add_del, lb_as_add_del, and lb_as_set_weight messages to make them match.

Two triggers drive a sync:

  1. Event-driven, single VIP (FR-4.1). When the health checker emits a backend transition, the reconciler recomputes desired state for every frontend that references that backend and syncs those VIPs. This is the primary path for convergence during incidents.
  2. Periodic, full (FR-4.2). A background loop runs a full sync on a configurable interval (default thirty seconds). This is the safety net that closes gaps left by missed events, VPP restarts, or bugs in the event path.

For determinism (NFR-2.1), whenever a sync operation iterates over ASes it does so in a total order defined by the numeric representation of the AS address, with IPv4 addresses ordered before IPv6. Two maglevd instances given the same input MUST therefore issue the same lb_as_add_del sequence, which in turn means VPP produces the same bucket-to-AS assignment regardless of which instance is driving.
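
This ordering matches what the standard library's netip.Addr.Compare produces (IPv4 before IPv6, then numeric value within each family), so a sketch of the sort is short; whether the shipped reconciler uses this helper or its own comparison is an implementation detail.

```go
// Sketch of the deterministic AS iteration order (NFR-2.1).
package reconcile

import (
	"net/netip"
	"slices"
)

// sortASes puts AS addresses into the total order used for every sync, so two
// daemons given the same input emit the same lb_as_add_del sequence.
func sortASes(ases []netip.Addr) {
	slices.SortFunc(ases, netip.Addr.Compare)
}
```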

Operator mutations, event-driven syncs, and periodic full syncs are serialized through a single mutex at the VPP-call boundary (NFR-2.4); they never interleave.

Startup Warmup and Restart Neutrality

A naive sync loop would, on restart, immediately synthesize a desired state in which every backend is Unknown, map every backend through the effective-weight rules to zero, and push "zero weight everywhere" into VPP before a single probe had completed. The result would be a multi-second black hole on every maglevd restart. NFR-1.2 forbids this, and the warmup state machine is how it is enforced.

The warmup has three phases, keyed off two configurable delays, startup-min-delay (default five seconds) and startup-max-delay (default thirty seconds):

  1. Hands-off. From process start to startup-min-delay, the reconciler MUST NOT write anything to VPP at all. Event-driven syncs are suppressed; the periodic full sync is suppressed.
  2. Per-VIP release. From startup-min-delay to startup-max-delay, a VIP becomes eligible for sync the moment every backend it references has produced at least one probe result (i.e. none are Unknown). Eligible VIPs are released individually so that healthy VIPs converge as fast as their slowest backend, without being held back by unrelated slow VIPs.
  3. Watchdog. At startup-max-delay, any VIPs still held are released unconditionally by a final full sync. This bounds the worst-case blackout to startup-max-delay rather than "as long as the slowest backend takes".

The warmup clock is tied to process start, not to VPP reconnect or configuration reload (NFR-1.3). Reconnecting to a flapping VPP does not re-enter warmup, and SIGHUP does not re-enter warmup.

Setting both delays to zero disables the warmup entirely, which is useful for tests but SHOULD NOT be done in production.
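
The three phases reduce to a small per-VIP gate decision. A sketch with illustrative names, assuming allProbed reports whether every backend the VIP references has produced at least one probe result:

```go
// Sketch of the per-VIP warmup gate (NFR-1.2).
package warmup

import "time"

func maySync(start, now time.Time, minDelay, maxDelay time.Duration, allProbed bool) bool {
	elapsed := now.Sub(start)
	switch {
	case elapsed < minDelay:
		return false // phase 1, hands-off: nothing is written to VPP
	case elapsed >= maxDelay:
		return true // phase 3, watchdog: release unconditionally
	default:
		return allProbed // phase 2: release this VIP once none of its backends is Unknown
	}
}
```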

Configuration and Reload

Configuration lives in a single YAML file (FR-5.1), typically /etc/vpp-maglev/maglev.yaml. It is validated in two distinct phases (FR-5.2): a parse phase that catches YAML errors, and a semantic phase that enforces structural invariants such as:

  • Every frontend whose VIPs share an address MUST use backends of the same address family (IPv4 or IPv6), because VPP picks an encap type per VIP and mixing families on one VIP is not supported.
  • Every backend referenced by a frontend MUST exist.
  • Every referenced health check MUST exist.
  • Every pool referenced by a frontend MUST contain at least one backend (FR-2.1).
  • VPP LB knobs MUST satisfy plugin constraints: flow-timeout in [1s, 120s], sticky-buckets-per-core a power of two, sync-interval strictly positive, startup-max-delay not less than startup-min-delay.
  • transition-history MUST be at least one.

maglevd --check runs both phases and exits with code 0 on success, 1 on parse errors, and 2 on semantic errors (NFR-5.2). This exit code contract is what packaging scripts and systemd ExecStartPre rely on.
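
A sketch of how that contract could be wired up; parseConfig and validateConfig are hypothetical stand-ins for the two phases, not the daemon's actual function names.

```go
// Sketch of the --check exit-code contract (FR-5.2, NFR-5.2).
package main

import (
	"fmt"
	"os"
)

type config struct{} // placeholder for the parsed configuration

func parseConfig(path string) (*config, error) { /* unmarshal the YAML file */ return &config{}, nil }
func validateConfig(c *config) error           { /* enforce structural invariants */ return nil }

func runCheck(path string) int {
	cfg, err := parseConfig(path) // phase 1: malformed YAML -> exit 1
	if err != nil {
		fmt.Fprintln(os.Stderr, "parse error:", err)
		return 1
	}
	if err := validateConfig(cfg); err != nil { // phase 2: semantic error -> exit 2
		fmt.Fprintln(os.Stderr, "semantic error:", err)
		return 2
	}
	return 0 // OK
}

func main() {
	path := "/etc/vpp-maglev/maglev.yaml"
	if len(os.Args) > 1 {
		path = os.Args[1]
	}
	os.Exit(runCheck(path))
}
```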

On SIGHUP the same two-phase validation runs against the file on disk. If either phase fails, maglevd MUST log the error and leave the running configuration untouched (FR-5.4, NFR-1.4). On success, the delta is applied atomically (NFR-2.2): new backends spawn workers, removed backends have their workers stopped and emit a terminal Removed event, changed backends restart their workers, and metadata-only changes (address, weight, enable flag) are updated in place without restarting anything. Operator overrides (Pause, Disable) survive reloads (FR-3.4) but — to repeat the point from FR-3.5 — do not survive a daemon restart.

Lifecycle, Signals, and Security

maglevd handles three signals:

  • SIGHUP triggers a configuration reload as described above.
  • SIGTERM and SIGINT initiate a graceful shutdown: the gRPC server drains, stream subscribers are released, probe workers are cancelled, and the VPP connection is closed. VPP's last-programmed state is not torn down; traffic continues to flow (NFR-1.1).

maglevd requires two Linux capabilities, each tied to a specific feature (NFR-4.1):

  • CAP_NET_RAW is required if and only if any configured health check is of type ICMP. Without it, raw-socket creation will fail and all ICMP probes will error out.
  • CAP_SYS_ADMIN is required if and only if healthchecker.netns is set. The kernel's setns(CLONE_NEWNET) call requires it; without it, every probe will fail on namespace entry.

The shipped Debian unit grants both capabilities through AmbientCapabilities and CapabilityBoundingSet, which is why the package "just works" out of the box. Hand-run invocations SHOULD set capabilities explicitly (e.g. via setcap) rather than running as root.

maglevd does not secure its own gRPC listener (NFR-4.2). Operators SHOULD bind the listener to loopback, to a control-plane VRF, or behind a firewall, depending on their threat model. The design deliberately pushes transport security out of the binary on the theory that every deployment already has an answer for it.

Interfaces

Presents.

  • A gRPC service on a TCP listener (default :9090). This is the only programmatic interface to maglevd. Every other component talks to maglevd through this interface and no other. The service has read-only methods (List*, Get*, CheckConfig), mutating methods (PauseBackend, ResumeBackend, EnableBackend, DisableBackend, SetFrontendPoolBackendWeight, ReloadConfig, SyncVPPLBState), and a single streaming method (WatchEvents) that multiplexes log entries and state transitions to any number of subscribers with per-subscriber filters (FR-6.3). gRPC reflection is enabled by default so that ad-hoc tooling can introspect the service.
  • A Prometheus /metrics HTTP endpoint on a separate listener (default :9091) (FR-6.2). Counters are updated inline as probes run and VPP calls complete; gauges are computed on each scrape from the current checker and VPP state, so there is no sampling lag.
  • Structured JSON logs on stdout, via log/slog, at a configurable level (FR-6.1). Key events — daemon start, config load, VPP connect/disconnect, backend transitions, LB sync mutations, warmup milestones — are logged at info or higher so that a default-level deployment has enough to post-mortem an incident.
  • Process exit codes from --check: 0, 1, or 2 as described above (NFR-5.2). These form a small but load-bearing interface to packaging and systemd.

Consumes.

  • A YAML configuration file on disk, passed via --config or MAGLEV_CONFIG. This is the declarative source of truth for intent; everything the operator mutates at runtime is a delta on top of it, and every runtime delta is lost on a daemon restart (FR-3.5).
  • VPP's binary API socket (default /run/vpp/api.sock). The connection auto-reconnects on drop (FR-4.4); while disconnected, the reconciler queues no work — the next periodic sync closes any gap.
  • VPP's stats segment socket (default /run/vpp/stats.sock). Read periodically (five-second cadence) for per-VIP packet and byte counters (FR-6.4). Readers are non-blocking (NFR-3.3); a stale snapshot is always available.
  • The Linux kernel's namespace subsystem, when healthchecker.netns is set. Requires CAP_SYS_ADMIN.
  • Raw sockets, for ICMP probes. Requires CAP_NET_RAW.

VPP Dataplane

The VPP dataplane is not part of the vpp-maglev codebase, but it is the component every other piece revolves around, and its contract with maglevd defines what maglevd is allowed to do.

Responsibilities

VPP's lb plugin implements Maglev consistent hashing in the forwarding fast path. It owns:

  • Global configuration — an IPv4 source address and an IPv6 source address used as the outer header for GRE-encapsulated traffic to ASes, the number of sticky buckets per worker core, and a per-flow idle timeout.
  • A set of VIPs, each identified by an address prefix, an IP protocol, and a port. A VIP carries an encap type (GRE4 or GRE6, picked by the family of the AS addresses) and a flag for source-IP sticky hashing.
  • A set of ASes per VIP, each identified by address, with an integer weight in [0, 100], a used/flushed state, and a bucket count derived from the Maglev ring.

It does not own: health, configuration intent, operator overrides, transition history, or metrics. Those belong to maglevd.

Interfaces

Presents.

  • A binary API (GoVPP-style message exchange) for reading and mutating VIP and AS state. maglevd is the sole user.
  • A stats segment with per-VIP counters from the LB plugin (existing-flow, first-flow, untracked, no-server) and per-prefix FIB counters. The LB plugin bypasses the FIB for forwarded packets, so per-backend traffic counters are not available; this is a known limitation that operators consuming metrics need to understand.
  • The forwarded-traffic fast path itself, which is the whole reason this project exists.

Consumes.

  • maglevd's binary-API writes — nothing else. There is no third party programming lb state in a working deployment.

maglevc

maglevc is the interactive and scripting CLI. It is a short-lived client with no persistent state and no background work (NFR-5.4).

Responsibilities

  • Provide a human-readable tab-completing shell for maglevd (FR-7.1).
  • Dispatch one-shot commands for scripts and automation.
  • Render state snapshots (frontends, backends, health checks, VPP LB state, VPP counters) with optional ANSI color.
  • Stream events in real time (watch events) with filters.

Interaction Model

With no positional arguments, maglevc starts a readline-based REPL with a nested command tree: show, set, watch, config, plus the usual help, exit, quit. Tab completion is built from the same command tree the dispatcher uses, so completion can never drift from the actual command set. With positional arguments, maglevc executes one command against the server and exits — in this mode color is off by default so that pipes and logs stay clean, but --color=true can be set explicitly.

Interfaces

Presents.

  • An interactive TTY shell and a one-shot command mode. Humans and scripts are the only consumers; there is no API, no socket, no file output.

Consumes.

  • maglevd's gRPC service, over insecure credentials by default. maglevc MUST NOT talk to VPP directly, MUST NOT read the config file directly, and MUST NOT maintain any state of its own across invocations (NFR-5.4). Everything it shows and everything it mutates goes through the gRPC API.

maglevd-frontend

maglevd-frontend is an optional web dashboard (FR-7.2). Unlike maglevc, it is a long-running process: it holds open gRPC streams, caches snapshots, and serves HTTP.

Responsibilities

  • Connect to one or more maglevd servers simultaneously.
  • Maintain a cached view of each server's state: frontends, backends, health checks, VPP LB state, and VPP counters.
  • Serve a SolidJS single-page application and a JSON API to browsers.
  • Stream live updates to browsers so that dashboards update without polling (NFR-5.3).
  • Expose an optional authenticated mutation surface (FR-7.3).

Multi-Server Multiplexing

A single maglevd-frontend process accepts a comma-separated list of gRPC server addresses. For each one, it runs an independent pool of goroutines: one to stream events, one to refresh list-oriented data on a roughly one-second cadence, one to refresh per-health-check detail, and one (debounced on incoming events) to refresh VPP LB state and counters. Failures on one server MUST NOT block the others, and the served JSON state always reports per-server connection status so that the SPA can mark partially-available views.

All per-server event streams publish into a single shared event broker with a bounded replay buffer (capped both in time and in event count, satisfying NFR-3.2). The broker assigns each event a monotonic epoch-seq identifier so that browsers reconnecting a dropped Server-Sent-Events stream can resume from where they left off without a full refresh — and so that a broker restart, which reshuffles the epoch, forces a full refresh rather than silently handing out ambiguous IDs.

Read-Only and Admin Surfaces

The HTTP surface is partitioned into two paths (FR-7.3):

  • /view/ serves the SPA and a read-only JSON API. It is always publicly accessible: there is no auth, and there are no mutation endpoints under it at all. The design intent is that /view/ can be exposed to a broader audience (NOC, management UIs, screens on walls) without risk.
  • /admin/ serves the SPA entry point and the mutating JSON API behind HTTP basic auth. Credentials come from MAGLEV_FRONTEND_USER and MAGLEV_FRONTEND_PASSWORD. If either is unset or empty, the /admin/ path MUST return 404 (NFR-4.3) — the admin surface is not merely locked, it is not advertised. This makes accidental exposure self-limiting: forgetting to set the env vars disables admin rather than leaving it open.

Both surfaces talk to the same underlying cache; the difference is only what endpoints exist.
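
A sketch of the gating rule, assuming only the two environment variables named above; the handler shape and realm string are illustrative.

```go
// Sketch of the /admin/ gating rule (FR-7.3, NFR-4.3): if either credential is
// unset, the admin surface answers 404 as if it did not exist.
package web

import (
	"crypto/subtle"
	"net/http"
	"os"
)

func adminOnly(next http.Handler) http.Handler {
	user := os.Getenv("MAGLEV_FRONTEND_USER")
	pass := os.Getenv("MAGLEV_FRONTEND_PASSWORD")
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if user == "" || pass == "" {
			http.NotFound(w, r) // not merely locked: not advertised at all
			return
		}
		u, p, ok := r.BasicAuth()
		if !ok ||
			subtle.ConstantTimeCompare([]byte(u), []byte(user)) != 1 ||
			subtle.ConstantTimeCompare([]byte(p), []byte(pass)) != 1 {
			w.Header().Set("WWW-Authenticate", `Basic realm="maglevd-frontend"`)
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```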

Interfaces

Presents.

  • An HTTP listener (default :8080) serving:
    • /view/ — the SolidJS SPA (embedded in the binary).
    • /view/api/* — read-only JSON endpoints for version, server list, aggregated state, and per-server state.
    • /view/api/events — an SSE stream bridged from the internal event broker, with Last-Event-ID replay.
    • /admin/ — the SPA entry point, gated on basic auth.
    • /admin/api/* — mutating JSON endpoints that translate to gRPC mutations against the appropriate maglevd.
    • /healthz — a liveness probe.

Consumes.

  • One or more maglevd gRPC services. As with maglevc, this is the only way maglevd-frontend reaches into the system. It MUST NOT read the YAML config file and MUST NOT talk to VPP directly (NFR-5.4).
  • Two environment variables, MAGLEV_FRONTEND_USER and MAGLEV_FRONTEND_PASSWORD, for the optional admin surface.

maglevt

maglevt is a small out-of-band probe TUI (FR-7.4). It is not part of the control loop at all; it is a validation tool that an operator runs on a laptop, a jump host, or a monitoring box to see VIPs the way a client sees them.

Responsibilities

  • Read one or more maglev.yaml files and enumerate TCP-style VIPs from the frontends section.
  • Probe each VIP at a configurable interval with a real HTTP or HTTPS request against a configurable path.
  • Measure latency (min/max/average and a handful of percentiles) and success rate over a rolling window.
  • Tally the value of a configurable response header (by default, X-IPng-Frontend) so that operators can see which backend actually served each request. Because keep-alives are disabled by default, this tally reflects fresh Maglev hashing decisions rather than a pinned connection.
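
A sketch of one probe and the header tally; the timeout, function names, and tally shape are illustrative, while the keep-alive behavior and the default header name come from the list above.

```go
// Sketch of one maglevt probe. Keep-alives are disabled so every request
// re-exercises Maglev hashing rather than riding a pinned connection.
package tester

import (
	"net/http"
	"time"
)

var client = &http.Client{
	Timeout:   3 * time.Second,
	Transport: &http.Transport{DisableKeepAlives: true},
}

// probeOnce issues one request and returns the value of the tally header
// (X-IPng-Frontend by default) plus the observed latency.
func probeOnce(url, header string) (served string, latency time.Duration, err error) {
	start := time.Now()
	resp, err := client.Get(url)
	if err != nil {
		return "", 0, err
	}
	resp.Body.Close()
	return resp.Header.Get(header), time.Since(start), nil
}

// tally accumulates which backend served each successful request.
func tally(counts map[string]int, served string) {
	counts[served]++
}
```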

Scope Boundary

maglevt is intentionally decoupled from maglevd. It does not talk gRPC, it does not read the VPP stats segment, and it does not know or care whether the target VIPs are actually served by the vpp-maglev control plane at all — it simply probes addresses. This makes it useful in at least three scenarios: validating a maglevd restart end-to-end from a client perspective, debugging pool failover by watching the header tally reshuffle, and sanity-checking that a given VIP is reachable across deployments when the gRPC control plane is unavailable or out of reach.

Interfaces

Presents.

  • A full-screen TUI built on Bubble Tea, with a deterministic grid layout and a few interactive toggles (e.g. reverse-DNS lookup). There is no machine-readable output; if you need metrics, use Prometheus on maglevd.

Consumes.

  • One or more YAML configuration files, which it parses with the same library maglevd uses. Only the subset of the schema describing frontends is actually consumed; unknown fields are ignored. Duplicate VIPs discovered across files are de-duplicated by (scheme, address, port) so that multi-file deployments don't double-probe.
  • The outbound network, directly. No special capabilities are required — maglevt is a plain HTTP client.

Operational Concerns

Configuration Reload Semantics

Reload is triggered by SIGHUP to maglevd, or by the ReloadConfig gRPC method. Both paths run the same validation as --check. A reload MUST NOT partially apply (NFR-2.2): either every change in the new file takes effect, or none of them do. A reload MUST NOT restart unchanged probe workers; the probe state machine is preserved precisely because operators use reloads as a routine operation and expect backends whose health-check definitions did not change to simply keep running.

Operator overrides (Pause, Disable) survive a reload as long as the backend still exists in the new config (FR-3.4). A backend that disappears from the new config transitions to Removed and its worker is stopped; if it reappears in a later reload it starts again in Unknown with a fresh counter.

A daemon restart is different from a reload. On restart, the YAML configuration is the sole source of truth: every runtime override is gone, every runtime weight mutation is gone (FR-3.5). Operators who need an override to persist across restarts must commit the intended state to the config file.

Failure Modes

  • VPP restart. maglevd detects the disconnect, enters a reconnect loop, and on reconnect reads VPP's version and current state (FR-4.4). The warmup clock is not reset by VPP reconnects (NFR-1.3) — a flapping VPP does not cause maglevd to go hands-off every time. The next periodic full sync pushes the current desired state into the freshly restarted plugin.
  • maglevd restart with VPP up. Handled by the warmup state machine (NFR-1.2): new flows see the last-programmed weights until probes catch up, not zeros.
  • maglevd restart with VPP also down. VPP comes back first, maglevd comes back second, warmup gates pushing anything until probes converge. This is the worst-case path, bounded by startup-max-delay.
  • Configuration reload with a broken file. The reload is rejected; the running configuration is retained; an error is logged (FR-5.4). No probes are interrupted (NFR-1.4).
  • Probe namespace disappears. Entering the namespace fails, the probe is counted as a failure, and the backend eventually transitions Down under normal rise/fall rules. There is no special-case handling; this is by design, because an operator removing the netns while maglevd is running is an operational error that SHOULD manifest as a visible Down, not as silent success.
  • gRPC subscriber too slow. Per-subscriber event queues are bounded (NFR-3.2). A subscriber that cannot keep up MUST be dropped rather than backing up the central fan-out (NFR-3.4).
  • Mid-flight weight mutation during sync. Operator weight changes and reconciler sync both route through the same state-protected code path, so mutations are serialized rather than interleaved with VPP writes (NFR-2.4).

Observability

Structured logging (FR-6.1). All logs are slog-formatted JSON written to stdout. The default level is info, which is sized to produce one or two lines per incident rather than per probe. The debug level dumps every probe attempt and every VPP binary-API message, and is intended for post-mortem investigation.

Prometheus metrics (FR-6.2, FR-6.4). maglevd exposes four classes of metric: inline counters for probe outcomes, probe-latency histograms, backend state-transition counters, and VPP API and LB sync counters; and on-demand gauges for current backend state, rise/fall counter values, configured weights, VPP connection status, VPP uptime, VPP info labels, and per-VIP LB plugin counters. Gauges are sampled from live state on every scrape, so there is no sampling staleness.
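
The on-demand gauges can be expressed directly with the Prometheus client library's GaugeFunc, which evaluates its callback at scrape time; the metric name and the accessor below are illustrative.

```go
// Sketch of an on-demand gauge: the exported value is computed from live state
// on every scrape, so there is no sampling lag.
package metrics

import "github.com/prometheus/client_golang/prometheus"

func registerBackendUpGauge(upBackends func() float64) {
	prometheus.MustRegister(prometheus.NewGaugeFunc(
		prometheus.GaugeOpts{
			Name: "maglev_backends_up", // hypothetical metric name
			Help: "Number of backends currently in state Up.",
		},
		upBackends, // called on every scrape against current checker state
	))
}
```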

Streaming events (FR-6.3). The gRPC WatchEvents method multiplexes three event families into one stream: log events (the same structured logs the daemon writes to stdout), backend transitions (one per affected frontend, since a single backend may participate in multiple frontends), and frontend aggregate transitions (Up/Down/Unknown flips at the frontend level). Clients MAY filter by event family and by minimum log level. The web frontend consumes this stream and re-publishes it to browsers over SSE, with an epoch-seq replay buffer layered on top.
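
A sketch of the bounded fan-out implied by NFR-3.2 and NFR-3.4, with illustrative names: each subscriber owns a bounded queue, and a publisher that finds a queue full drops the subscriber instead of blocking.

```go
// Sketch of a bounded event fan-out. A subscriber whose queue is full when an
// event arrives is dropped rather than allowed to block the central loop.
package events

import "sync"

type Event struct {
	Kind string // log entry, backend transition, or frontend transition
}

type Broker struct {
	mu   sync.Mutex
	subs map[chan Event]struct{}
}

func NewBroker() *Broker {
	return &Broker{subs: make(map[chan Event]struct{})}
}

// Subscribe returns a bounded queue; the caller must drain it promptly.
func (b *Broker) Subscribe(depth int) chan Event {
	ch := make(chan Event, depth)
	b.mu.Lock()
	b.subs[ch] = struct{}{}
	b.mu.Unlock()
	return ch
}

// Publish never blocks: a full subscriber queue means the subscriber is dropped.
func (b *Broker) Publish(ev Event) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for ch := range b.subs {
		select {
		case ch <- ev:
		default:
			delete(b.subs, ch) // too slow; drop instead of backing up the fan-out
			close(ch)
		}
	}
}
```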

Security and Capabilities

maglevd needs CAP_NET_RAW for ICMP probes and CAP_SYS_ADMIN for netns entry (NFR-4.1). Neither is optional for the feature that needs it, and neither is required otherwise; operators who use neither feature MAY run maglevd as an unprivileged user with no capabilities at all.

maglevd-frontend needs no special capabilities — it is a plain HTTP client of maglevd and a plain HTTP server for browsers. It does handle user credentials (basic auth), which are read from the environment and held in process memory; operators SHOULD terminate the frontend behind a TLS reverse proxy if it is exposed beyond a trusted network.

maglevc and maglevt need no special capabilities.

All gRPC traffic runs insecure by default (NFR-4.2). Securing transport is an operational decision, not a build-time one; deployments that require mTLS SHOULD terminate gRPC at a sidecar or colocate control and data plane on a trusted segment.

Concurrency Model

The concurrency model inside maglevd is deliberately simple and local:

  • Each backend owns exactly one probe worker goroutine (NFR-3.1). Workers do not share state with each other.
  • All events — transitions and log records — travel through a single central channel which is then fanned out to bounded per-subscriber queues (NFR-3.2). The fan-out is the only place where multiple subscribers can observe the same event.
  • The configuration pointer is swapped atomically on reload (NFR-2.2); readers take a read lock for the duration of a single access, so the live config is always internally consistent even mid-reload.
  • The VPP stats snapshot is published as an atomic pointer (NFR-3.3), so Prometheus scrapes and gRPC reads of counters are wait-free; a sketch of this publication follows the list.
  • Reconciliation holds a mutex around VPP calls, which serializes operator mutations, event-driven syncs, and periodic full syncs against each other (NFR-2.4). This is intentional: the order in which VPP sees mutations matters for determinism, and serializing them is cheap at the scale of control-plane events.
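
A sketch of the stats-snapshot publication referenced above, assuming a hypothetical Snapshot type; readers always see a complete, immutable snapshot and never block a writer.

```go
// Sketch of the wait-free stats snapshot (NFR-3.3): the stats reader stores a
// fresh snapshot on its cadence, and scrapes or gRPC reads load whatever is
// current without blocking.
package stats

import "sync/atomic"

type Snapshot struct {
	PerVIPPackets map[string]uint64 // rebuilt from scratch on every refresh
}

var current atomic.Pointer[Snapshot]

// publish is called by the stats reader on its periodic refresh.
func publish(s *Snapshot) { current.Store(s) }

// read is called by Prometheus scrapes and gRPC counter reads.
func read() *Snapshot { return current.Load() }
```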

Deadlock avoidance is structural rather than audited: dependencies between subsystems are one-way. The checker does not call into VPP; the reconciler reads checker state and calls VPP; VPP never calls back. maglevd-frontend and maglevc only read from maglevd over gRPC. There is no cycle in the wait-for graph.

Alternatives Considered

This is a retrofit of a shipped system, so the alternatives here are the ones the code actively rejects, not speculative designs.

  • Several probe schedulers sharing one goroutine pool. Rejected in favor of one goroutine per backend. The per-backend model is trivially correct, has no shared state, and scales linearly with backend count at a cost of a few kilobytes per backend.
  • maglevd-frontend as a sidecar per maglevd. Rejected in favor of one frontend speaking to many daemons. A single dashboard pane across a fleet is the common operator request; pushing multi-server logic into the frontend keeps the daemon simple.
  • Operator actions expressed as config edits plus SIGHUP. Rejected in favor of direct gRPC mutations. Pausing a backend during an incident should not require editing a file, and the effect should survive subsequent reloads (FR-3.4) — though, by deliberate design, not a daemon restart (FR-3.5).
  • Persisting operator overrides across daemon restarts. Rejected in favor of making the YAML config file the sole source of truth on startup (FR-3.5). Persisting runtime overrides would require an on-disk side store and a clear policy for what happens when the side store and the config file disagree; keeping the daemon stateless on startup is simpler and harder to get wrong.
  • Synchronous full sync after every transition. Rejected in favor of event-driven single-VIP syncs with a periodic full sync as a safety net (FR-4.1, FR-4.2). Full syncs are cheap but not free, and the blast radius of a transient bug in the desired-state computation is smaller when per-transition work only touches one VIP.
  • Letting maglevt read maglevd's gRPC. Rejected in favor of probing the YAML file directly so that maglevt remains useful when maglevd itself is the thing being investigated.

Open Questions

  • Mutual TLS for gRPC. Currently insecure by default. A future version may wire in standard mTLS support once a credential-management story is picked.
  • Per-AS traffic counters. The VPP lb plugin bypasses the FIB and therefore does not produce per-AS traffic counters. Surfacing real per-backend byte/packet counts would require a VPP-side change.
  • High-availability of the control plane. Two maglevd instances on the same VPP would interleave writes harmlessly thanks to determinism (NFR-2.1), but there is no leader election and no formal story about which instance owns which VIPs. Today, operators run a single maglevd per VPP host.