A small bubbletea TUI that reads maglev.yaml (repeatable --config),
enumerates every VIP, and probes each from outside the load balancer
on a tight cadence (default 100ms, ±10% jitter). HTTP/HTTPS VIPs get
a GET against a configurable URI (default /.well-known/ipng/healthz)
with per-VIP rolling latency (p50/p95/p99/max), lifetime N/FAIL
counters, LAST status, and a response-header tally. Non-HTTP VIPs
get a TCP connect probe. A bounded error panel classifies anomalies
as timeout / http-err / net-err / spike and auto-sizes to fill the
screen.
Utility: during a failover drill (backend flap, AS drain, config
push) the tally panel shows which backend each VIP is actually
steering to, with two-colour activity highlighting over a 5s
window — white = receiving traffic, grey = drained. Paired with
the rolling OK%/latency columns it gives an at-a-glance answer to
"is the VIP healthy from the outside right now, and which backend
is it hitting", without relying on maglevd's own view of the
world.
Also bumps Makefile/go.mod to build the new binary.
New metrics plus the corresponding documentation for everything that's
accumulated since the last Prometheus pass.
internal/metrics/metrics.go
- New VPPSource interface (IsConnected, VPPInfo) plus a metrics-local
VPPInfo struct that mirrors vpp.Info. Decoupling via interface +
struct-mirror keeps the dependency direction one-way (vpp → metrics),
so vpp can import metrics to update inline counters without a cycle.
- New Collector gauges scraped on demand: maglev_vpp_connected,
maglev_vpp_uptime_seconds (from /sys/boottime), maglev_vpp_connected_seconds
(time since maglevd connected), and maglev_vpp_info (static 1-gauge
carrying version, build_date, and pid as labels).
- New inline counters:
  - maglev_vpp_api_total{msg, direction, result}: bumped from the
    loggedChannel wrapper on every VPP binary-API send/recv. Gives full
    visibility into what maglevd is doing with VPP, broken down by
    message name, direction (send/recv), and result (success/failure).
  - maglev_vpp_lbsync_total{scope, kind}: bumped from the reconciler
    at the end of each SyncLBStateAll/SyncLBStateVIP run. kind ∈
    {vip_added, vip_removed, as_added, as_removed, as_weight_updated};
    scope ∈ {all, vip}. Zero-valued kinds are not emitted so noise
    stays low.
- Register() signature now takes a VPPSource (may be nil) alongside
the existing StateSource.
internal/vpp/client.go
- New VPPInfo() (metrics.VPPInfo, bool) shim method on *Client that
satisfies metrics.VPPSource. Returns (_, false) when disconnected so
the collector skips the vpp_* gauges cleanly.
internal/vpp/apilog.go
- The loggedChannel's SendRequest / SendMultiRequest / ReceiveReply
paths now call metrics.VPPAPITotal.WithLabelValues(...).Inc() in
addition to slog.Debug. Since every VPP API call in the codebase
must go through loggedChannel (NewAPIChannel is unexported), this
one instrumentation point catches everything.
internal/vpp/lbsync.go
- New recordSyncStats(scope, st) helper called once at the end of
SyncLBStateAll and SyncLBStateVIP to bump maglev_vpp_lbsync_total.
Zero-valued stats are skipped.
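The zero-skipping behaviour of recordSyncStats might look like the sketch below. A plain map stands in for the Prometheus counter vector, the `syncStats` field names are guesses, and adding the per-run count (rather than incrementing by one) is an assumption.

```go
package main

import "fmt"

// syncStats is a hypothetical shape for the reconciler's per-run totals.
type syncStats struct {
	VIPAdded, VIPRemoved, ASAdded, ASRemoved, ASWeightUpdated int
}

// labels is the {scope, kind} label pair of maglev_vpp_lbsync_total.
type labels struct{ scope, kind string }

// lbsyncTotal stands in for the real counter vector.
var lbsyncTotal = map[labels]int{}

// recordSyncStats adds each non-zero stat to the counter for its kind.
// Kinds that stayed at zero never create a series.
func recordSyncStats(scope string, st syncStats) {
	kinds := []struct {
		kind string
		n    int
	}{
		{"vip_added", st.VIPAdded},
		{"vip_removed", st.VIPRemoved},
		{"as_added", st.ASAdded},
		{"as_removed", st.ASRemoved},
		{"as_weight_updated", st.ASWeightUpdated},
	}
	for _, k := range kinds {
		if k.n == 0 {
			continue
		}
		lbsyncTotal[labels{scope, k.kind}] += k.n
	}
}

func main() {
	recordSyncStats("all", syncStats{VIPAdded: 2, ASWeightUpdated: 1})
	fmt.Println(len(lbsyncTotal), lbsyncTotal[labels{"all", "vip_added"}])
}
```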
cmd/maglevd/main.go
- Added github.com/grpc-ecosystem/go-grpc-middleware/providers/prometheus
for the standard gRPC server metrics (grpc_server_started_total,
grpc_server_handled_total, grpc_server_handling_seconds, etc.,
labelled by service/method/type/code).
- Constructs grpcprom.NewServerMetrics(WithServerHandlingTimeHistogram())
before creating the grpc.Server, installs it as UnaryInterceptor +
StreamInterceptor, then calls InitializeMetrics(srv) after service
registration so every method appears at 0 on the first scrape
instead of materialising lazily on first RPC.
- Passes the vppClient (or nil) as a metrics.VPPSource to
metrics.Register so the vpp_* gauges are emitted when integration
is enabled and silently omitted otherwise.
docs/user-guide.md
- New 'Prometheus metrics' section in the maglevd chapter,
tabulating every metric family: backend state gauges, probe
counters/histogram, transition counters, the new VPP gauges and
counters, and the standard gRPC server metrics.
- 'show frontends <name>' description updated to mention the two
weight columns ('weight' = configured from YAML, 'effective' =
state-aware after pool-failover logic).
- Pause / disable descriptions clarified: transition history is
preserved across these operator actions.
docs/healthchecks.md
- New 'Static (no-healthcheck) backends' section explaining that
backends without a healthcheck use rise/fall=1, fire a synthetic
passing probe immediately on startup (no 30s wait), and idle at
30s between iterations thereafter.
- New 'Pool failover' section documenting the priority-tier model,
the active-pool definition, when promotion happens, cascading to
further tiers, and graceful drain on demotion. Points readers at
'maglevc show frontends <name>' as the inspection interface.
docs/config-guide.md
- healthcheck field doc now describes static-backend behavior and
cross-references healthchecks.md.
- pools field doc now explains failover semantics at a high level
and cross-references the detailed healthchecks.md section.
VPP client (internal/vpp/)
- New package managing connections to both VPP API and stats sockets,
treated as a unit: if either drops, both are torn down and
re-established together.
- Run() loop: connect, fetch version via vpe.ShowVersion, read
/sys/boottime from the stats segment, log vpp-connect, then monitor
with control_ping every 10s. On failure, disconnect both, retry
after 5s.
- Registers as client name "vpp-maglev" (visible in VPP's
"show api clients").
- Flags: --vpp-api-addr (default /run/vpp/api.sock) and
--vpp-stats-addr (default /run/vpp/stats.sock). Empty api addr
disables VPP integration entirely.
gRPC / proto
- Add GetVPPInfo RPC returning VPPInfo: version, build_date,
build_directory, pid, boottime_ns, connecttime_ns. Both times are
Unix timestamps in nanoseconds; the client computes durations
locally for display.
- Returns codes.Unavailable if VPP is disabled or not connected.
maglevc
- Add 'show vpp info' command displaying version, build-date,
build-dir, vpp-pid, vpp-boottime (with duration), and connected
time (with duration).
Prometheus metrics (internal/metrics/, cmd/maglevd/)
- New --metrics-addr flag (default :9091, env MAGLEV_METRICS_ADDR)
serving /metrics via promhttp.
- Gauge metrics scraped on demand via a custom prometheus.Collector:
maglev_backend_state, maglev_backend_health, maglev_backend_enabled,
maglev_frontend_pool_backend_weight.
- Inline counter/histogram metrics updated per probe:
maglev_probe_total (by backend, type, result, code),
maglev_probe_duration_seconds (by backend, type),
maglev_backend_transitions_total (by backend, from, to).
- StateSource interface in metrics package breaks the import cycle
with checker; checker.Checker satisfies it via GetBackendInfo.
Integration tests
- Run maglevd inside a containerlab node (debian:trixie-slim with
build/ bind-mounted) instead of on the host. Eliminates port
collisions with any host maglevd.
- maglevc commands run via docker exec into the maglevd container.
- Add 6 Prometheus test cases: endpoint reachable, all backends
report state=up, probe counters non-zero, duration histogram
populated, pool weights correct, transition counters present.
gRPC / proto
- Rename WatchBackendEvents → WatchEvents; return a stream of Event
oneof (LogEvent, BackendEvent, FrontendEvent) with optional filter
flags (log, log_level, backend, frontend)
- Add EnableBackend, DisableBackend, SetFrontendPoolBackendWeight RPCs
- Rename PauseResumeRequest → BackendRequest
- Add CheckConfig RPC returning ok/parse_error/semantic_error
maglevd
- Route slog through a LogBroadcaster (slog.Handler) so WatchEvents
subscribers can receive structured log records independently of the
daemon's own --log-level
- Add --reflection flag (default true) to toggle gRPC server reflection
- Add --check flag: validates config file and exits 0/1/2
- SIGHUP: run config.Check before applying a reload; log parse and
semantic errors separately; refuse the reload on any error
- Rename default config path /etc/maglev → /etc/vpp-maglev
maglevc
- Add 'watch events [num <n>] [log [level <level>]] [backend] [frontend]'
command; prints compact protojson, stops on any keypress or Ctrl-C;
uses cbreak mode (not raw) so output post-processing is preserved
- Add 'set backend <name> enable|disable'
- Add 'set frontend <name> pool <pool> backend <name> weight <0-100>'
- Add 'config check' command
Debian packaging
- Rename service unit to vpp-maglevd.service
- Rename conffiles to /etc/default/vpp-maglev and /etc/vpp-maglev/
- Create maglevd system user/group in postinst; add to vpp group if present
- Add postrm; add adduser to Depends