vpp-maglev

Author	SHA1	Message	Date
Pim van Pelt	9a3c5c5dc0	checker: fix ResumeBackend leaking goroutine on non-paused backend; v1.0.2 Calling ResumeBackend on a backend that wasn't actually paused (state != StatePaused) would overwrite w.cancel and spawn a fresh probe goroutine without cancelling the old one, leaving two probe loops running for the same backend until process exit. The guard now mirrors EnableBackend's early-return on a non-target state.	2026-04-19 20:39:07 +02:00
Pim van Pelt	a8f02b913d	maglevt: show FQDN in header instead of config paths; v1.0.1 Replace the cfgPath field in the TUI header with the system's fully-qualified hostname via gethostname + CNAME lookup, matching what `hostname -f` produces. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 22:20:35 +02:00
Pim van Pelt	6a48c12449	Add multi-arch Docker build and docker-compose stack Introduce a multi-stage Alpine Dockerfile that cross-compiles via buildx ($BUILDPLATFORM -> $TARGETARCH) so a single invocation produces both linux/amd64 and linux/arm64 images without a qemu-emulated builder. `make docker` loads the native-arch image locally for smoke tests; `make docker-push` publishes a multi-arch manifest. Ship a docker-compose.yaml with opt-in profiles for maglevd/frontend and a .env.example template so operators can mirror /etc/default/vpp-maglev muscle memory into containers.	2026-04-15 18:09:35 +02:00
Pim van Pelt	bc6ccaa844	v1.0.0 — first release Bump VERSION to 1.0.0 and cut the first tagged release of vpp-maglev. Also in this commit: - maglevc: MAGLEV_SERVER env var as an alternative to the --server flag, matching the MAGLEV_CONFIG / MAGLEV_GRPC_ADDR convention on the other binaries. The flag takes precedence when both are set. - Rename cmd/maglevd -> cmd/server and cmd/maglevc -> cmd/client so the source directory names are decoupled from binary names (the frontend and tester commands already followed this convention). Build outputs and the Debian packages are unchanged.	2026-04-15 15:29:31 +02:00
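One common way to get the "flag takes precedence over the environment" behaviour mentioned above is to feed the environment value in as the flag's default. The sketch below assumes that approach; the `serverAddr` helper and the help text are hypothetical, and maglevc may well implement this differently.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// serverAddr resolves the server address from args and the environment.
// The env value is used as the flag default, so an explicit flag wins.
func serverAddr(args []string, getenv func(string) string) (string, error) {
	fs := flag.NewFlagSet("maglevc", flag.ContinueOnError)
	server := fs.String("server", getenv("MAGLEV_SERVER"), "maglevd gRPC address")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *server, nil
}

func main() {
	addr, err := serverAddr(os.Args[1:], os.Getenv)
	if err != nil {
		os.Exit(2)
	}
	fmt.Println("server:", addr)
}
```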
Pim van Pelt	177d81cca1	Split Debian package into vpp-maglevd + vpp-maglev; add maglevt.1 manpage vpp-maglevd ships maglevd, maglevd-frontend, both systemd units, and the config conffiles. vpp-maglev ships maglevc and maglevt as pure client tools so jump hosts and workstations can install them without pulling in the daemon. pkg-deb now emits four .debs per release (2 packages x 2 archs); build-deb.sh takes a package-name argument and dispatches accordingly.	2026-04-15 15:29:23 +02:00
Pim van Pelt	6d78921edd	Restart-neutral VPP LB sync; deterministic AS ordering; maglevt cadence; v0.9.5 Three reliability fixes bundled with docs updates. Restart-neutral VPP LB sync via a startup warmup window (internal/vpp/warmup.go). Before this, a maglevd restart would immediately issue SyncLBStateAll with every backend still in StateUnknown — mapped through BackendEffectiveWeight to weight 0 — and VPP would black-hole all new flows until the checker's rise counters caught up, several seconds later. The new warmup tracker owns a process-wide state machine gated by two config knobs: vpp.lb.startup-min-delay (default 5s) is an absolute hands-off window during which neither the periodic sync loop nor the per-transition reconciler touches VPP; vpp.lb.startup-max-delay (default 30s) is the watchdog for a per-VIP release phase that runs between the two, releasing each frontend as soon as every backend it references reaches a non-Unknown state. At max-delay a final SyncLBStateAll runs for any stragglers still in Unknown. Config reload does not reset the clock. Both delays can be set to 0 to disable the warmup entirely. The reconciler's suppressed-during-warmup events log at DEBUG so operators can still see them with --log-level debug. Unit tests cover the tracker state machine, allBackendsKnown precondition, and the zero-delay escape hatch. Deterministic AS iteration in VPP LB sync. reconcileVIP and recreateVIP now issue their lb_as_add_del / lb_as_set_weight calls in numeric IP order (IPv4 before IPv6, ascending within each family) via a new sortedIPKeys helper, instead of Go map iteration order. VPP's LB plugin breaks per-bucket ties in the Maglev lookup table by insertion position in its internal AS vec, so without a stable call order two maglevd instances on the same config could push identical AS sets into VPP in different orders and produce divergent new-flow tables.
Numeric sort is used in preference to lexicographic so the sync log stays human-readable: string order would place 10.0.0.10 before 10.0.0.2, and the same problem in v6. Unit tests cover empty, single, v4/v6 numeric vs lexicographic, v4-before-v6 grouping, a 1000-iteration stability loop against Go's randomised map iteration, insertion-order invariance, and the desiredAS call-site type. maglevt interval fix. runProbeLoop used to sleep the full jittered interval after every probe, so a 100ms --interval with a 30ms probe actually produced a 130ms period. The sleep now subtracts result.Duration so cadence matches the flag. Probes that overrun clamp sleep to zero and fire the next probe immediately without trying to catch up on missed cycles — a slow backend doesn't get flooded with back-to-back probes at the moment it's already struggling. Docs. config-guide now documents flush-on-down and the new startup-min-delay / startup-max-delay knobs; user-guide's maglevd section explains the restart-neutrality property, the three warmup phases, and the relevant slog lines operators should watch for during a bounce. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 12:53:42 +02:00
Pim van Pelt	6b2b04b2d1	Frontend flush-on-down policy; v0.9.3 Adds a per-frontend flush-on-down flag (default true) that causes maglevd to set is_flush=true on lb_as_set_weight when a backend transitions to StateDown, tearing down existing flows pinned to the dead AS instead of just draining them. rise/fall debouncing in the health checker already absorbs single-probe flaps, so a fall-counted down is almost always a real outage — and during a real outage the client-visible "connection refused" oscillation window (where VPP keeps steering existing flows at a dead AS until retry) is a reliability regression worth closing by default. Operators who want the pre-flag drain-only behaviour can set flush-on-down: false per frontend. BackendEffectiveWeight's truth table grows one axis: StateDown now returns (0, flushOnDown); StateDisabled still unconditionally flushes; StateUnknown / StatePaused still never flush. The unit test pins all four combinations. The flag surfaces in the gRPC FrontendInfo message and in `maglevc show frontend <name>` right next to src-ip-sticky.	2026-04-15 01:43:04 +02:00
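The truth table described above might look roughly like this in code. The function name and signature here are illustrative rather than the real BackendEffectiveWeight; the StateUp row (configured weight, no flush) and the zero weight for StatePaused are assumptions, and pool priority is ignored.

```go
package main

import "fmt"

type State int

const (
	StateUnknown State = iota
	StateUp
	StateDown
	StatePaused
	StateDisabled
)

// effectiveWeight maps a backend state to the weight pushed to VPP and
// whether existing flows should be flushed (is_flush on lb_as_set_weight).
func effectiveWeight(s State, configured uint8, flushOnDown bool) (weight uint8, flush bool) {
	switch s {
	case StateUp:
		return configured, false // assumption: healthy backends keep their weight
	case StateDown:
		return 0, flushOnDown // per-frontend policy decides
	case StateDisabled:
		return 0, true // operator disable always flushes
	default: // StateUnknown, StatePaused
		return 0, false // never flush
	}
}

func main() {
	for _, s := range []State{StateUnknown, StateUp, StateDown, StatePaused, StateDisabled} {
		w, f := effectiveWeight(s, 100, true)
		fmt.Printf("state=%d weight=%d flush=%v\n", s, w, f)
	}
}
```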
Pim van Pelt	6293521157	New maglevt TUI component: out-of-band VIP health monitor A small bubbletea TUI that reads maglev.yaml (repeatable --config), enumerates every VIP, and probes each from outside the load balancer on a tight cadence (default 100ms, ±10% jitter). HTTP/HTTPS VIPs get a GET against a configurable URI (default /.well-known/ipng/healthz) with per-VIP rolling latency (p50/p95/p99/max), lifetime N/FAIL counters, LAST status, and a response-header tally. Non-HTTP VIPs get a TCP connect probe. A bounded error panel classifies anomalies as timeout / http-err / net-err / spike and auto-sizes to fill the screen. Utility: during a failover drill (backend flap, AS drain, config push) the tally panel shows which backend each VIP is actually steering to, with two-colour activity highlighting over a 5s window — white = receiving traffic, grey = drained. Paired with the rolling OK%/latency columns it gives an at-a-glance answer to "is the VIP healthy from the outside right now, and which backend is it hitting", without relying on maglevd's own view of the world. Also bumps Makefile/go.mod to build the new binary.	2026-04-15 01:23:52 +02:00
Pim van Pelt	744b1cb3d2	install-deps Makefile target; docs refresh; golangci-lint v2 clean Makefile: - New install-deps umbrella target split into three sub-targets: install-deps-apt — Debian/Trixie-packaged build deps (nodejs, npm, protobuf-compiler, git, make, dpkg-dev, ca-certificates, curl, tar). Uses sudo when not already root. install-deps-go — ensures a Go toolchain >= GO_VERSION (go.mod floor, default 1.25.0). Short-circuits when the system Go is already recent enough; otherwise downloads the upstream tarball from go.dev/dl/ into /usr/local/go. Trixie only ships 1.24 so this step is load-bearing. install-deps-go-tools — go install protoc-gen-go, protoc-gen-go-grpc, and golangci-lint/v2/cmd/golangci-lint. Then asserts the installed golangci-lint version parses as >= GOLANGCI_LINT_VERSION (default 1.64.0, the floor that supports Go 1.25 syntax) to catch stale binaries in $GOPATH/bin before they silently run against Go 1.25 code. - Parser bug fixed: golangci-lint v1.x prints "has version v1.64.8" but v2.x dropped the 'v' prefix and prints "has version 2.11.4". The original sed regex required the 'v' and returned an empty match on v2.x, making the assertion explode with "could not parse version output". Fixed by switching to extended regex (sed -En) with 'v?' so both forms parse cleanly. - GO_VERSION and GOLANGCI_LINT_VERSION exposed as Makefile variables so operators can override on the command line, e.g. make install-deps GO_VERSION=1.25.5 GOLANGCI_LINT_VERSION=2.0.0 - .PHONY extended with the four new target names. Docs: - README.md: capability note rewritten to cover CAP_NET_RAW (ICMP) and the new CAP_SYS_ADMIN requirement when healthchecker.netns is set, plus a paragraph explaining that the Debian systemd unit grants both automatically. Docker example gained a second variant that shows the additional --cap-add SYS_ADMIN and /var/run/netns bind mount for netns-scoped deployments.
Also notes that maglevd-frontend ignores SIGHUP so controlling-terminal disconnects don't kill it. - docs/user-guide.md: Capabilities section rewritten as a bulleted list covering both caps, with the EPERM error string and three different ways to grant them (systemd unit, setcap, systemd-run); 'show vpp lb counters' command description updated to explain that per-backend packet counts are no longer shown (LB plugin's forwarding node bypasses ip{4,6}_lookup_inline, so /net/route/to at the backend's FIB entry never ticks for LB-forwarded traffic); new ~75-line "What the SPA shows" subsection covering the scope selector + maglev_scope cookie, the per-maglevd frontend cards, the health-cascade icon table (ok / bug-buckets / primary-drained / degraded / unknown), the lb buckets column semantics, the maglev_zippy_open cookie, the admin-mode lifecycle dialogs with their plain-English consequence text, and the debug panel. - docs/config-guide.md: healthchecker.netns field gains a capability-requirement note spelling out setns(CLONE_NEWNET), the EPERM symptom string, and the /var/run/netns/ readability requirement. - docs/healthchecks.md: new "Jitter" subsection explaining the +/-10% scaling on every computed interval, and a "Probe timing while a probe is in flight" subsection that explains why fast-interval alone doesn't give fast fault detection against hanging backends (the probe loop is synchronous, so each iteration is timeout + fast-interval; the advice is to lower timeout, not fast-interval). - docs/maglevd.8: description paragraph corrected (dropped the per-backend stats claim and added a short note pointing at the LB plugin forwarding-path bypass); new CAPABILITIES section between SIGNALS and FILES covering both CAP_NET_RAW and CAP_SYS_ADMIN with the drop-in-override hint.
- docs/maglevd-frontend.8: new SIGNALS section documenting the explicit SIGHUP ignore (so a controlling-terminal disconnect doesn't kill the daemon); description extended with paragraphs on the two persistence cookies (maglev_scope, maglev_zippy_open) and on the health-cascade icon + lb buckets column. - docs/maglevc.1: left untouched — intentionally minimal and delegates to docs/user-guide.md. Lint (26 issues across 12 files, all errcheck / ineffassign / S1021): - cmd/frontend/handlers.go: _, _ = fmt.Fprintf(...) for the SSE retry hint and resync control-event writes. - cmd/maglevc/commands.go: bulk-prefix every fmt.Fprintf(w, ...) with _, _ =; also merged 'var watchEventsOptSlot *Node; ... = &Node{...}' into a single := declaration (staticcheck S1021) — the self-referencing pattern still works because the Children back-ref is assigned on the next statement, not inside the struct literal. - cmd/maglevc/complete.go: _, _ = fmt.Fprintf(ql.rl.Stderr(), ...) for the banner and help writes; removed the ineffectual 'partial = ""' assignment (nothing downstream reads partial after that branch, so setting it was dead code flagged by ineffassign). - cmd/maglevc/shell.go: defer func() { _ = rl.Close() }() for the readline instance; _, _ = fmt.Fprintf(rl.Stderr(), ...) for error display in the REPL loop. - cmd/maglevc/main.go: defer func() { _ = conn.Close() }() for the gRPC client connection. - internal/grpcapi/server_test.go: _ = conn.Close() in the test teardown closure. - internal/prober/http.go: _ = c.Close() in the TLS-handshake-failed path; defer func() { _ = conn.Close() }() and defer func() { _ = resp.Body.Close() }() for the two deferred cleanups. - internal/prober/http_test.go: defer func() { _ = resp.Body.Close() }() plus three _, _ = fmt.Fprint(w, ...) in the httptest.Server handlers and _, _ = fmt.Sscanf(...) when parsing the test listener's port. - internal/prober/icmp.go: defer func() { _ = pc.Close() }() for the ICMP packet conn.
- internal/prober/netns.go: defer func() { _ = origNs.Close() }(), defer func() { _ = netns.Set(origNs) }(), defer func() { _ = targetNs.Close() }() — also dropped a stray //nolint:errcheck that was no longer needed once the closure wrapping handled the discard. - internal/prober/tcp.go: _ = conn.Close() in the L4-only path, _ = tlsConn.Close() in the failed and succeeded handshake branches, _ = tlsConn.SetDeadline(...) (also dropped a //nolint:errcheck previously covering it). Iterative 'make lint' runs were needed because golangci-lint v2.x caps same-linter reports per pass, so the first pass reported 21, then 4, then 3, then 1, then 0. Final pass: 0 issues. make test is green across every package, and make build produces all three binaries cleanly.	2026-04-14 17:37:53 +02:00
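The optional-'v' parser fix above was made in sed; for readers more at home in Go, the same pattern can be sketched with the `regexp` package. This is an illustrative equivalent rather than code from the repository, and the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// versionRE accepts both "has version v1.64.8" and "has version 2.11.4":
// the 'v' prefix is optional, mirroring the 'v?' fix in the Makefile.
var versionRE = regexp.MustCompile(`has version v?([0-9]+)\.([0-9]+)\.([0-9]+)`)

// parseVersion extracts major, minor, and patch from the version banner.
func parseVersion(out string) ([3]int, bool) {
	m := versionRE.FindStringSubmatch(out)
	if m == nil {
		return [3]int{}, false
	}
	var v [3]int
	for i := range v {
		n, err := strconv.Atoi(m[i+1])
		if err != nil {
			return [3]int{}, false
		}
		v[i] = n
	}
	return v, true
}

// atLeast reports whether v >= floor, comparing major, then minor, then patch.
func atLeast(v, floor [3]int) bool {
	for i := range v {
		if v[i] != floor[i] {
			return v[i] > floor[i]
		}
	}
	return true
}

func main() {
	floor := [3]int{1, 64, 0}
	for _, out := range []string{
		"golangci-lint has version v1.64.8 built with go1.24",
		"golangci-lint has version 2.11.4 built with go1.25",
	} {
		v, ok := parseVersion(out)
		fmt.Println(v, ok, atLeast(v, floor))
	}
}
```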
Pim van Pelt	4288e22b71	Compile/Test static binaries, for better portability	2026-04-13 14:38:43 +02:00
Pim van Pelt	35643fd774	Rename maglev-frontend → maglevd-frontend; v0.9.1; API RX/TX pulse Rename the web dashboard binary to maglevd-frontend and move it to /usr/sbin (it's a daemon and belongs with maglevd). The systemd unit name stays vpp-maglev-frontend.service since that prefix is the package name. Manpage, README, user-guide, and debian packaging all updated in lockstep; bump to 0.9.1 for the first real release. All frontend env vars are now prefixed MAGLEV_FRONTEND_ so a single /etc/default/vpp-maglev can be shared with maglevd without collisions. Every flag has an env equivalent for Docker use. MAGLEV_FRONTEND_USER and MAGLEV_FRONTEND_PASSWORD still gate the /admin surface. VPPInfoPanel now pulses "API: ↑↓" indicators in the zippy title whenever a vpp-api-send / vpp-api-recv log event arrives on the SSE stream for the scoped maglevd — 250ms blue flash, re-triggerable, with the two arrows tightly kerned via negative letter-spacing.	2026-04-13 00:13:52 +02:00
Pim van Pelt	25e9d79aba	Frontend: live clocks, admin mode, backend actions; packaging polish Builds on the maglev-frontend component introduced in `284b4cc` with quality-of-life improvements, an authenticated /admin surface, a live-action control plane, and Debian packaging cleanup. - Backend state now renders live: maglevd's FrontendEvent synthetic from==to replay hydrates FrontendSnapshot.State on WatchEvents subscribe, and live transitions update both the in-process cache and every connected browser via a new applyFrontendTransition reducer. Shown as a StatusBadge next to the frontend name. - VPP connection state surfaces in the VPP zippy title as a green/red badge. Driven by vpp-connect / vpp-disconnect and by the steady stream of vpp-api-send/recv debug heartbeats so a silent VPP drop is caught within one debug-log tick. - Probe heartbeat dot becomes ❤️ while a probe is in flight and reverts to · on probe-done. Fixed-size wrapper so the emoji swap doesn't jiggle the row; both states share the same font-size. - Flash component replaced its subtle background-only fade with a scale-pop + yellow halo box-shadow + longer duration so weight/effective/state changes are unmissable on tiny numeric cells. Initial mount still skipped via defer so no flash on load. - Last-transition age is now a live countdown driven by a global 1-second ticker signal (one timer, many subscribers). Two most significant units: 10m30s / 1h12m / 1d16h. Sub-second ages render as "now" to absorb clock skew between maglevd and the browser. - Event stream is now chronological (oldest at top) with tail-style auto-scroll, pause/resume, and the toolbar moved below the list. Row separators removed. Also shown only in /admin (see below) so /view stays a focused read-only surface. - Table nowrap so backend names like nginx0-frggh0 and the "last transition" header don't wrap. Frontends render in the order returned by ListFrontends instead of Go map iteration order so reload doesn't shuffle VIP order.
- IPng logo in the header, clickable, links to the git repo. Header padding reduced so the logo can fill the bar up to the separator. Version + commit + build date shown in the brand area (fetched once from /view/api/version). - "view" / "admin" mode tag moved to sit just left of the admin toggle button so it reads as a pair. - Prettier wired in as the web-side fixstyle via a new fixstyle-web Make target that also runs from `make fixstyle`. Added .prettierrc.json and .prettierignore; 8 existing files were normalized in place. - Fixed a "20555d ago" rendering bug: maglevd's synthetic backend-replay events (from==to, at_unix_ns=0) were corrupting the local cache's LastTransition via applyBackendTransition. Backend synthetic events are now skipped entirely (refreshAll covers initial hydration for backends), while frontend synthetic events are still applied because FrontendInfo doesn't carry state — the event is the only source. - New MAGLEV_FRONTEND_USER / MAGLEV_FRONTEND_PASSWORD env vars. When both are set and non-empty, /admin/ becomes a basic-auth-protected SPA shell backed by the same embedded index.html as /view/. The SPA detects its base path via a new stores/mode.ts isAdmin constant and conditionally renders admin-only sections (currently: the Event Stream / DebugPanel). When disabled, /admin/ returns 404 (not 501) so operators who didn't configure it see no teasing affordance, and the SPA's admin-toggle button is hidden entirely via the admin_enabled flag on /view/api/version. - basicAuth uses crypto/subtle.ConstantTimeCompare for both user and password so timing can't distinguish a wrong username from a wrong password. - New POST /admin/api/{maglevd}/backend/{name}/{pause|resume|enable|disable} endpoint, gated by the same basic-auth middleware as the SPA shell.
maglevClient.BackendAction wraps the four matching gRPC RPCs and returns a fresh BackendSnapshot; the same transition lands via WatchEvents so every connected browser converges through the normal reducer path. - BackendActionsMenu Solid component: kebab (⋮) button in a new trailing column rendered only in /admin. Click-outside and Escape close the popover (document listeners installed only while open). Actions are state-aware: up/down/unknown → pause, disable; paused → resume, disable; disabled → enable; removed → menu suppressed entirely. Busy indicator per action; errors render inline under the item list. - Structured audit log: every mutation logs an admin-backend-action record with maglevd / backend / action / resulting state. - Renamed debian/vpp-maglevd.service → debian/vpp-maglev.service to align naming with the new vpp-maglev-frontend.service sibling. postinst handles upgrades by stopping + disabling any lingering vpp-maglevd.service before enabling the renamed unit; prerm stops both (the frontend unit is installed but not enabled by default — operators opt in with systemctl enable). - New debian/vpp-maglev-frontend.service (hardened: NoNewPrivileges, ProtectSystem=strict, ProtectHome, PrivateTmp, no capabilities). Reads the same /etc/default/vpp-maglev conffile and expands MAGLEV_FRONTEND_ARGS via `ExecStart=/usr/bin/maglev-frontend $MAGLEV_FRONTEND_ARGS` so word-splitting works. - docs/maglev-frontend.8 manpage documenting flags, endpoints, and SSE reverse-proxy requirements. - build-deb.sh: drops the commit hash from the .deb filename (now vpp-maglev_<version>_<arch>.deb) and no longer takes the commit as a CLI arg. Binaries continue to carry the commit via -ldflags so `maglevd --version` et al are the authoritative "which build is running" answer.	2026-04-12 20:04:53 +02:00
Pim van Pelt	284b4cc9a4	New maglev-frontend component; promote LB sync events to INFO Introduces maglev-frontend, a responsive, real-time web dashboard for one or more running maglevd instances. Source lives at cmd/frontend/; the built binary is maglev-frontend. It is a single Go process with the SolidJS SPA embedded via //go:embed — no runtime file dependencies. Architecture - One persistent gRPC connection per configured maglevd (-server A,B,C). Each connection runs three background loops: a WatchEvents stream subscribed at log_level=debug for live events, a 30s refresh loop as a safety net for drift, and a 5s health loop that surfaces connection drops quickly. - In-process pub/sub broker with a 30s / 2000-event replay ring using <epoch>-<seq> monotonic IDs. Short browser reconnects (nginx idle, wifi flap, laptop wake) silently replay buffered events via the EventSource Last-Event-ID header; longer outages or frontend restarts fall through to a "resync" event that triggers a full state refetch. - HTTP surface: /view/ (SPA), /view/api/state, /view/api/state/{name}, /view/api/maglevds, /view/api/version, /view/api/events (SSE), /healthz, and an /admin/* placeholder returning 501 for a future basic-auth mutation surface. - SSE handler follows the full operational checklist: retry hint, 15s : ping heartbeat, Flush after every write, r.Context().Done() teardown, X-Accel-Buffering: no, and no gzip. SolidJS SPA (cmd/frontend/web/, Vite + TypeScript) - solid-js/store for a reactive per-maglevd state tree; reducers apply backend transitions, maglevd-status flips, and resync refetches. - Scope selector tabs for multi-maglevd support, per-maglevd frontend cards with pool tables showing state, configured weight, effective weight, and last-transition age. - ProbeHeartbeat component turns a middle-dot into ❤️ on probe-start and back on probe-done, driven by real log events; fixed-size wrapper so the emoji swap doesn't jiggle the row. 
- Flash wrapper animates any primitive on change (1s yellow fade via Web Animations API, skipped on first mount). Wired into the state badge, configured weight, and effective weight columns. - DebugPanel: chronological rolling event tail with tail-style auto-scroll, pause/resume, and scope/firehose filter. Syntactic highlight for vpp-lb-sync-* events with fixed-order attribute formatting. - Live effective_weight updates: vpp-lb-sync-as-added/removed/weight-updated log events are routed through a reducer that walks the snapshot's pool rows and sets effective_weight on every match without waiting for the 30s refresh. - Header shows build version + commit with build date in a tooltip, fetched once from /view/api/version on mount. - Prettier wired in as the web-side fixstyle; make fixstyle now tidies both Go and web in one shot via a new fixstyle-web target. Per-mutation VPP LB sync logging - Promotes the addVIP/delVIP/addAS/delAS/setASWeight helpers from slog.Debug to slog.Info and renames them from vpp-lbsync-* to vpp-lb-sync-{vip-added,vip-removed,as-added,as-removed,as-weight-updated}. Matching rename for vpp-lb-sync-start / -done / -error / -vip-recreate. The Prometheus metric name (maglev_vpp_lbsync_total) is left alone to preserve dashboards. - setASWeight now takes the prior weight so the event can emit from=X to=Y and the UI can show the delta. - The vip field in every event is the bare address (no /32 or /128 mask), matching the CLI output style. - Any listener on the gRPC WatchEvents stream — CLI watch events or maglev-frontend — now sees every VIP/AS dataplane change in real time without needing to raise the log level. Build and tooling - Makefile: maglev-frontend added to BINARIES; build / build-amd64 / build-arm64 emit the binary alongside maglevd and maglevc. A new maglev-frontend-web target rebuilds the SolidJS bundle via npm. - web/dist/ is tracked so a bare `go build` keeps working for Go-only contributors and CI.
- .gitignore skips cmd/frontend/web/node_modules/. Stability fixes - maglevd's WatchEvents synthetic replay events (from==to, at_unix_ns=0) were corrupting the frontend's LastTransition cache with at=0, rendering as "20555d ago" in the browser. Client now skips synthetic events: the cache comes from refreshAll and doesn't need them. - Frontends, Backends, and HealthChecks are now served in the order returned by the corresponding List* RPC instead of Go map iteration order, so reloads and refreshes keep the SPA stable.	2026-04-12 17:48:31 +02:00
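The replay ring with `<epoch>-<seq>` IDs described in this commit can be modelled in a few lines. The sketch below is a simplification with hypothetical names: it bounds the ring by event count only (the 30s age limit is left out) and is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

type event struct {
	ID   string
	Data string
}

// ring keeps the most recent max events; IDs are "<epoch>-<seq>".
type ring struct {
	epoch int64 // changes on every process start, e.g. the start time
	seq   uint64
	max   int
	buf   []event
}

// Publish appends an event, assigns its ID, and trims the ring to max entries.
func (r *ring) Publish(data string) event {
	r.seq++
	e := event{ID: fmt.Sprintf("%d-%d", r.epoch, r.seq), Data: data}
	r.buf = append(r.buf, e)
	if len(r.buf) > r.max {
		r.buf = r.buf[len(r.buf)-r.max:]
	}
	return e
}

// Since returns the events after lastID. ok is false when the ID is from
// another epoch, malformed, or older than the ring, meaning the client
// needs a full resync instead of a replay.
func (r *ring) Since(lastID string) (events []event, ok bool) {
	a, b, found := strings.Cut(lastID, "-")
	epoch, err1 := strconv.ParseInt(a, 10, 64)
	seq, err2 := strconv.ParseUint(b, 10, 64)
	if !found || err1 != nil || err2 != nil || epoch != r.epoch || seq > r.seq {
		return nil, false
	}
	oldest := r.seq - uint64(len(r.buf)) + 1
	if seq+1 < oldest {
		return nil, false // gap: events were dropped from the ring
	}
	return append([]event(nil), r.buf[seq+1-oldest:]...), true
}

func main() {
	r := &ring{epoch: 100, max: 3}
	for _, d := range []string{"a", "b", "c", "d"} {
		r.Publish(d)
	}
	fmt.Println(r.Since("100-2")) // short reconnect: replay
	fmt.Println(r.Since("100-0")) // too old: resync
	fmt.Println(r.Since("99-4"))  // previous process: resync
}
```

In an SSE handler the `lastID` argument would come from the browser's `Last-Event-ID` request header.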
Pim van Pelt	d3c5c86037	VPP load-balancer dataplane integration: state, sync, and global conf This commit wires maglevd through to VPP's LB plugin end-to-end, using locally-generated GoVPP bindings for the newer v2 API messages. VPP binapi (vendored) - New package internal/vpp/binapi/ containing lb, lb_types, ip_types, and interface_types, generated from a local VPP build (~/src/vpp) via a new 'make vpp-binapi' target. GoVPP v0.12.0 upstream lacks the v2 messages we need (lb_conf_get, lb_add_del_vip_v2, lb_add_del_as_v2, lb_as_v2_dump, lb_as_set_weight), so we commit the generated output in-tree. - All generated files go through our loggedChannel wrapper; every VPP API send/receive is recorded at DEBUG via slog (vpp-api-send / vpp-api-recv / vpp-api-send-multi / vpp-api-recv-multi) so the full wire-level trail is auditable. NewAPIChannel is unexported — callers must use c.apiChannel(). Read path: GetLBState{All,VIP} - GetLBStateAll returns a full snapshot (global conf + every VIP with its attached application servers). - GetLBStateVIP looks up a single VIP by (prefix, protocol, port) and returns (nil, nil) when the VIP doesn't exist in VPP. This is the efficient path for targeted updates on a busy LB. - Helpers factored out: getLBConf, dumpAllVIPs, dumpASesForVIP, lookupVIP, vipFromDetails. Write path: SyncLBState{All,VIP} - SyncLBStateAll reconciles every configured frontend with VPP: creates missing VIPs, removes stale ones (with AS flush), and reconciles AS membership and weights within VIPs that exist on both sides. - SyncLBStateVIP targets a single frontend by name. Never removes VIPs. Returns ErrFrontendNotFound (wrapped with the name) when the frontend isn't in config, so callers can use errors.Is. - Shared reconcileVIP helper does the per-VIP AS diff; removeVIP is used only by the full-sync pass. - LbAddDelVipV2 requests always set NewFlowsTableLength=1024. 
The .api default=1024 annotation is only applied by VAT/CLI parsers, not wire-level marshalling — sending 0 caused VPP to vec_validate with mask 0xFFFFFFFF and OOM-panic. - Pool semantics: backends in the primary (first) pool of a frontend get their configured weight; backends in secondary pools get weight 0. All backends are installed so higher layers can flip weights on failover without add/remove churn. - Every individual change emits a DEBUG slog (vpp-lbsync-vip-add/del, vpp-lbsync-as-add/del, vpp-lbsync-as-weight). Start/done INFO logs carry a scope=all|vip label plus aggregate counts. Global conf push: SetLBConf - New SetLBConf(cfg) sends lb_conf with ipv4-src, ipv6-src, sticky-buckets, and flow-timeout. Called automatically on VPP (re)connect and after every config reload (via doReloadConfig). Results are cached on the Client so redundant pushes are silently skipped — only actual changes produce a vpp-lb-conf-set INFO log line. Periodic drift reconciliation - vpp.Client.lbSyncLoop runs in a goroutine tied to each VPP connection's lifetime. Its first tick is immediate (startup and post-reconnect sync quickly); subsequent ticks fire every vpp.lb.sync-interval from config (default 30s). Purpose: catch drift if something/someone modifies VPP state by hand. The loop uses a ConfigSource interface (satisfied by checker.Checker via its new Config() accessor) to avoid an import cycle with the checker package. Config schema additions (maglev.vpp.lb) - sync-interval: positive Go duration, default 30s. - ipv4-src-address: REQUIRED. Used as the outer source for GRE4 encap to application servers. Missing this is a hard semantic error — maglevd --check exits 2 and the daemon refuses to start. VPP GRE needs a source address and every VIP we program uses GRE, so there is no meaningful config without it. - ipv6-src-address: REQUIRED. Same treatment as ipv4-src-address. - sticky-buckets-per-core: default 65536, must be a power of 2.
- flow-timeout: default 40s, must be a whole number of seconds in [1s, 120s]. - VPP validation runs at the end of convert() so structural errors in healthchecks/backends/frontends surface first — operators fix those, then get the VPP-specific requirements. gRPC API - New GetVPPLBState RPC returning VPPLBState: global conf + VIPs with ASes. Mirrors the read-path but strips fields irrelevant to our GRE-only deployment (srv_type, dscp, target_port). - New SyncVPPLBState RPC with optional frontend_name. Unset → full sync (may remove stale VIPs). Set → single-VIP sync (never removes). Returns codes.NotFound for unknown frontends, codes.Unavailable when VPP integration is disabled or disconnected. maglevc (CLI) - New 'show vpp lbstate' command displaying the LB plugin state. VPP-only fields irrelevant to the GRE dataplane are suppressed. Per-AS lines use a key-value format ("address X weight Y flow-table-buckets Z") instead of a tabwriter column, which avoids the ANSI-color alignment issue we hit with mixed label/data rows. - New 'sync vpp lbstate [<name>]' command. Without a name, triggers a full reconciliation; with a name, targets one frontend. - Previous 'show vpp lb' renamed to 'show vpp lbstate' for consistency with the new sync command. Test fixtures - validConfig and all ad-hoc config_test.go fixtures that reach the end of convert() now include the two required vpp.lb src addresses. - tests/01-maglevd/maglevd-lab/maglev.yaml gains a vpp.lb section so the robot integration tests can still load the config. - cmd/maglevc/tree_test.go gains expected paths for the new commands. Docs - config-guide.md: new 'vpp' section in the basic structure, detailed vpp.lb field reference, noting ipv4/ipv6 src addresses as REQUIRED (hard error) with no defaults; example config updated.
- user-guide.md: documented 'show vpp info', 'show vpp lbstate', 'sync vpp lbstate [<name>]', new --vpp-api-addr and --vpp-stats-addr flags, the vpp-lb-conf-set log line, and corrected the pause/resume description to reflect that pause cancels the probe goroutine. - debian/maglev.yaml: example config gains a vpp.lb block with src addresses and commented optional overrides.	2026-04-12 10:58:44 +02:00
Pim van Pelt	1815675fb6	Distinguish disabled from removed backend state; add make fixstyle Add StateDisabled for operator-initiated disable, keeping StateRemoved for backends that disappear during a config reload. Previously both used StateRemoved, which was confusing: "removed" implies the backend no longer exists in config, but a disabled backend is still present and can be re-enabled on the fly. - health: add StateDisabled with String() "disabled", Disable() method with probe code "disabled". Record() rejects probes in all three inactive states (paused, disabled, removed). - checker: DisableBackend calls backend.Disable() instead of Remove(). - docs: healthchecks.md rewritten for pause (goroutine cancelled, not just results discarded), and separate disabled/removed state rows. user-guide.md updated to match. - Makefile: add fixstyle target (gofmt -w .).	2026-04-11 21:04:24 +02:00
Pim van Pelt	8bde00eb61	Fix pause to cancel probe goroutine; add Robot Framework integration tests Pause semantics - PauseBackend now cancels the probe goroutine so no HTTP/TCP/ICMP traffic is sent while the backend is paused. Previously the goroutine kept running and results were silently discarded. - ResumeBackend launches a fresh probe goroutine on the existing worker, preserving transition history. The backend re-enters unknown state. Integration tests (tests/01-maglevd/) - Containerlab topology with 3 nginx:alpine backends on a dedicated management network (172.20.30.0/24) with static IPs. - maglevd config with 200ms HTTP health-check interval for fast test convergence (rise=2, fall=2). - 8 test cases: deploy lab, start maglevd, all backends reach up, nginx logs confirm probes arriving, pause stops probes (probe count stable), resume restarts probes, disable stops probes, enable restarts probes. VPP dataplane test (tests/02-vpp-lb/) - Rewrite 01-e2e-lab.robot to match the actual single-VPP topology: test client-to-server ping through VPP bridge domains and verify nginx is serving on all app servers. The previous version referenced a non-existent topology file and tested OSPF/BFD between two VPP nodes that don't exist in this lab. Build infrastructure - Add 'make robot-test' target with TEST= for suite selection. - Add tests/.venv target for Robot Framework virtualenv. - Make IMAGE optional in rf-run.sh. - Add .gitignore entries for test output, venv, logs, and clab state.	2026-04-11 20:19:36 +02:00
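The pause semantics in the commit above (pause cancels the probe goroutine, resume launches a fresh one on the same worker) can be illustrated with a context-based sketch. Everything here is hypothetical: the `worker` type, `newWorker`, `launch`, `settle`, and the ticker standing in for real HTTP/TCP/ICMP probes are assumptions, and the actual checker may be structured differently.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// demoInterval is a short probe interval chosen for the demonstration.
const demoInterval = time.Millisecond

// worker is a hypothetical per-backend probe worker.
type worker struct {
	interval time.Duration
	cancel   context.CancelFunc
	done     chan struct{} // closed when the probe goroutine has exited
	paused   bool
	probes   atomic.Int64 // stands in for probes sent on the wire
}

func newWorker(interval time.Duration) *worker {
	w := &worker{interval: interval}
	w.launch()
	return w
}

// launch starts one probe goroutine bound to a fresh cancellable context.
func (w *worker) launch() {
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan struct{})
	w.cancel, w.done = cancel, done
	go func() {
		defer close(done)
		t := time.NewTicker(w.interval)
		defer t.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-t.C:
				w.probes.Add(1)
			}
		}
	}()
}

// Pause cancels the probe goroutine and waits for it to exit, so no
// further probes are sent once Pause returns.
func (w *worker) Pause() {
	if w.paused {
		return
	}
	w.paused = true
	w.cancel()
	<-w.done
}

// Resume starts a fresh goroutine only when the worker is paused, so a
// second probe loop is never spawned next to a running one.
func (w *worker) Resume() {
	if !w.paused {
		return
	}
	w.paused = false
	w.launch()
}

// settle sleeps long enough for several probe intervals to elapse.
func settle() { time.Sleep(50 * demoInterval) }

func main() {
	w := newWorker(demoInterval)
	settle()
	w.Pause()
	atPause := w.probes.Load()
	settle()
	fmt.Println("stable while paused:", w.probes.Load() == atPause)
	w.Resume()
	settle()
	fmt.Println("resumed:", w.probes.Load() > atPause)
	w.Pause()
}
```

Waiting on `done` inside `Pause` is a design choice of this sketch: it makes "probe count stable after pause" a guarantee rather than a timing accident, which matches how the integration tests in the commit verify pause.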
Pim van Pelt	d612086a5f	Pools, CLI, versioning, Debian packaging, HTTPS fix - Replaced flat `backends: [...]` list on frontends with an ordered `pools:` list; each pool has a name and a map of backends with per-pool weights (0–100, default 100). Pools express priority: first pool with a healthy backend wins. - Removed global backend weight (was on the backend, now lives in the pool). - Config validation enforces non-empty pools, non-empty pool names, weight range, and consistent address families across all pools of a frontend. - Added `PoolBackendInfo { name, weight }` and changed `PoolInfo.backends` from `repeated string` to `repeated PoolBackendInfo` so weights are visible over the API. - Full interactive shell with readline, tab completion, and `?` inline help. - Command tree parser (Walk) handles fixed keywords and dynamic slot nodes; prefix matching with exact-match priority. - Commands: `show version/frontends/frontend/backends/backend/healthchecks/healthcheck`, `set backend <name> pause|resume`, `quit`/`exit`. - `show frontend` output is hierarchical (pools → backends) with per-backend weights and `[disabled]` notation; pool section uses fixed-width formatting so ANSI color codes don't corrupt tabwriter alignment. - `-color` flag (default true) wraps static field labels in dark-blue ANSI; works correctly with tabwriter because all labels carry identical-length escape sequences. - `cmd/version.go` package holds `version`, `commit`, `date` vars set at build time via `-ldflags -X`. - `make build` / `make build-amd64` / `make build-arm64` all inject `VERSION=0.1.1`, `COMMIT_HASH` (from `git rev-parse --short HEAD`), and `DATE` (UTC ISO-8601). - `maglevc` prints version on interactive startup and exposes `show version`. - `maglevd` logs version/commit/date at startup; `-version` flag prints and exits. - `doHTTPProbe` was building an `https://` target URL even though TLS was already applied to the connection inside `inNetns`. 
`http.Transport` then wrapped the connection in a second TLS layer, producing "http: server gave HTTP response to HTTPS client". Fixed by always using `http://` in the target URL. - Added `TestHTTPSProbe` using `httptest.NewTLSServer` to cover the full path. - New `docs/user-guide.md`: maglevd flags/signals, maglevc commands, shell completion, and command-tree parser walkthrough. - New `docs/healthchecks.md`: state machine, rise/fall model, probe intervals, all transition events with log examples. - Updated `docs/config-guide.md`: pools design, removed global weight from backends, updated all examples. - Updated `README.md`: packaging table, build paths, corrected binary locations (`/usr/sbin/maglevd`), config filename (`.yaml`). - `debian/` directory contains `control.in`, `maglevd.service`, `default.maglev`, `maglev.yaml` (example config), `conffiles`, `postinst`, `prerm`. - `debian/build-deb.sh` stages a package tree and calls `dpkg-deb`; emits `build/vpp-maglev_<version>~<commit>_<arch>.deb`. - Cross-compiles for amd64 and arm64 in one `make pkg-deb` invocation. - `maglevd` installed to `/usr/sbin/`, `maglevc` to `/usr/bin/`. - Service reads `MAGLEV_CONFIG` from `/etc/default/maglev` (default: `/etc/maglev/maglev.yaml`). - Man pages `maglevd(8)` and `maglevc(1)` live in `docs/` and are gzip'd into the package. - All build output goes to `build/<arch>/`; `build/` is gitignored.	2026-04-11 12:18:17 +02:00
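The pool priority rule from the commit above ("first pool with a healthy backend wins") can be sketched in a few lines of Go. The `Pool` and `PoolBackend` types, the `activePool` function, and the health map are hypothetical names for illustration. The commit describes pool backends as a map in the config; a slice is used here only to keep iteration order deterministic.

```go
package main

import "fmt"

// PoolBackend pairs a backend name with its per-pool weight (0 to 100).
type PoolBackend struct {
	Name   string
	Weight int
}

// Pool is one entry in a frontend's ordered pool list.
type Pool struct {
	Name     string
	Backends []PoolBackend
}

// activePool walks the pools in priority order and returns the first one
// that has at least one healthy backend, trimmed to its healthy members.
// The boolean is false when no pool has a healthy backend.
func activePool(pools []Pool, healthy map[string]bool) (Pool, bool) {
	for _, p := range pools {
		var live []PoolBackend
		for _, b := range p.Backends {
			if healthy[b.Name] {
				live = append(live, b)
			}
		}
		if len(live) > 0 {
			return Pool{Name: p.Name, Backends: live}, true
		}
	}
	return Pool{}, false
}

func main() {
	pools := []Pool{
		{Name: "primary", Backends: []PoolBackend{{"web1", 100}, {"web2", 100}}},
		{Name: "fallback", Backends: []PoolBackend{{"web3", 50}}},
	}
	// Both primary backends are down, so the fallback pool is selected.
	health := map[string]bool{"web1": false, "web2": false, "web3": true}
	p, ok := activePool(pools, health)
	fmt.Println(p.Name, ok)
}
```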
Pim van Pelt	46e78ec36f	First stab at maglevc	2026-04-11 02:48:00 +02:00
Pim van Pelt	040d6f5853	Revision: Rename to 'maglevd'; Refactor config structure	2026-04-10 22:15:20 +02:00
Pim van Pelt	b84b3274b1	Initial revision of healthchecker, inspired by HAProxy	2026-04-10 17:30:44 +02:00

20 Commits