Commit Graph

20 Commits

Author SHA1 Message Date
Pim van Pelt
9a3c5c5dc0 checker: fix ResumeBackend leaking goroutine on non-paused backend; v1.0.2
Calling ResumeBackend on a backend that wasn't actually paused (state
!= StatePaused) would overwrite w.cancel and spawn a fresh probe
goroutine without cancelling the old one, leaving two probe loops
running for the same backend until process exit. The guard now mirrors
EnableBackend's early-return on a non-target state.
2026-04-19 20:39:07 +02:00
Pim van Pelt
a8f02b913d maglevt: show FQDN in header instead of config paths; v1.0.1
Replace the cfgPath field in the TUI header with the system's
fully-qualified hostname via gethostname + CNAME lookup, matching
what `hostname -f` produces.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 22:20:35 +02:00
Pim van Pelt
6a48c12449 Add multi-arch Docker build and docker-compose stack
Introduce a multi-stage Alpine Dockerfile that cross-compiles via
buildx ($BUILDPLATFORM -> $TARGETARCH) so a single invocation produces
both linux/amd64 and linux/arm64 images without a qemu-emulated
builder. `make docker` loads the native-arch image locally for smoke
tests; `make docker-push` publishes a multi-arch manifest. Ship a
docker-compose.yaml with opt-in profiles for maglevd/frontend and a
.env.example template so operators can mirror /etc/default/vpp-maglev
muscle memory into containers.
2026-04-15 18:09:35 +02:00
Pim van Pelt
bc6ccaa844 v1.0.0 — first release
Bump VERSION to 1.0.0 and cut the first tagged release of vpp-maglev.

Also in this commit:

- maglevc: MAGLEV_SERVER env var as an alternative to the --server
  flag, matching the MAGLEV_CONFIG / MAGLEV_GRPC_ADDR convention on
  the other binaries. The flag takes precedence when both are set.
- Rename cmd/maglevd -> cmd/server and cmd/maglevc -> cmd/client so
  the source directory names are decoupled from binary names (the
  frontend and tester commands already followed this convention).
  Build outputs and the Debian packages are unchanged.
2026-04-15 15:29:31 +02:00
Pim van Pelt
177d81cca1 Split Debian package into vpp-maglevd + vpp-maglev; add maglevt.1 manpage
vpp-maglevd ships maglevd, maglevd-frontend, both systemd units, and
the config conffiles. vpp-maglev ships maglevc and maglevt as pure
client tools so jump hosts and workstations can install them without
pulling in the daemon. pkg-deb now emits four .debs per release
(2 packages x 2 archs); build-deb.sh takes a package-name argument
and dispatches accordingly.
2026-04-15 15:29:23 +02:00
Pim van Pelt
6d78921edd Restart-neutral VPP LB sync; deterministic AS ordering; maglevt cadence; v0.9.5
Three reliability fixes bundled with docs updates.

Restart-neutral VPP LB sync via a startup warmup window
(internal/vpp/warmup.go). Before this, a maglevd restart would
immediately issue SyncLBStateAll with every backend still in
StateUnknown — mapped through BackendEffectiveWeight to weight
0 — and VPP would black-hole all new flows until the checker's
rise counters caught up, several seconds later. The new warmup
tracker owns a process-wide state machine gated by two config
knobs: vpp.lb.startup-min-delay (default 5s) is an absolute
hands-off window during which neither the periodic sync loop
nor the per-transition reconciler touches VPP; vpp.lb.
startup-max-delay (default 30s) is the watchdog for a per-VIP
release phase that runs between the two, releasing each frontend
as soon as every backend it references reaches a non-Unknown
state. At max-delay a final SyncLBStateAll runs for any stragglers
still in Unknown. Config reload does not reset the clock. Both
delays can be set to 0 to disable the warmup entirely. The
reconciler's suppressed-during-warmup events log at DEBUG so
operators can still see them with --log-level debug. Unit tests
cover the tracker state machine, allBackendsKnown precondition,
and the zero-delay escape hatch.

Deterministic AS iteration in VPP LB sync. reconcileVIP and
recreateVIP now issue their lb_as_add_del / lb_as_set_weight
calls in numeric IP order (IPv4 before IPv6, ascending within
each family) via a new sortedIPKeys helper, instead of Go map
iteration order. VPP's LB plugin breaks per-bucket ties in the
Maglev lookup table by insertion position in its internal AS
vec, so without a stable call order two maglevd instances on
the same config could push identical AS sets into VPP in
different orders and produce divergent new-flow tables. Numeric
sort is used in preference to lexicographic so the sync log
stays human-readable: string order would place 10.0.0.10 before
10.0.0.2, and the same problem in v6. Unit tests cover empty,
single, v4/v6 numeric vs lexicographic, v4-before-v6 grouping,
a 1000-iteration stability loop against Go's randomised map
iteration, insertion-order invariance, and the desiredAS
call-site type.

maglevt interval fix. runProbeLoop used to sleep the full
jittered interval after every probe, so a 100ms --interval
with a 30ms probe actually produced a 130ms period. The sleep
now subtracts result.Duration so cadence matches the flag.
Probes that overrun clamp sleep to zero and fire the next
probe immediately without trying to catch up on missed cycles
— a slow backend doesn't get flooded with back-to-back probes
at the moment it's already struggling.

Docs. config-guide now documents flush-on-down and the new
startup-min-delay / startup-max-delay knobs; user-guide's
maglevd section explains the restart-neutrality property, the
three warmup phases, and the relevant slog lines operators
should watch for during a bounce.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 12:53:42 +02:00
Pim van Pelt
6b2b04b2d1 Frontend flush-on-down policy; v0.9.3
Adds a per-frontend flush-on-down flag (default true) that causes
maglevd to set is_flush=true on lb_as_set_weight when a backend
transitions to StateDown, tearing down existing flows pinned to
the dead AS instead of just draining them. rise/fall debouncing
in the health checker already absorbs single-probe flaps, so a
fall-counted down is almost always a real outage — and during a
real outage the client-visible "connection refused" oscillation
window (where VPP keeps steering existing flows at a dead AS
until retry) is a reliability regression worth closing by default.
Operators who want the pre-flag drain-only behaviour can set
flush-on-down: false per frontend.

BackendEffectiveWeight's truth table grows one axis: StateDown
now returns (0, flushOnDown); StateDisabled still unconditionally
flushes; StateUnknown / StatePaused still never flush. The unit
test pins all four combinations.

The flag surfaces in the gRPC FrontendInfo message and in
`maglevc show frontend <name>` right next to src-ip-sticky.
2026-04-15 01:43:04 +02:00
Pim van Pelt
6293521157 New maglevt TUI component: out-of-band VIP health monitor
A small bubbletea TUI that reads maglev.yaml (repeatable --config),
enumerates every VIP, and probes each from outside the load balancer
on a tight cadence (default 100ms, ±10% jitter). HTTP/HTTPS VIPs get
a GET against a configurable URI (default /.well-known/ipng/healthz)
with per-VIP rolling latency (p50/p95/p99/max), lifetime N/FAIL
counters, LAST status, and a response-header tally. Non-HTTP VIPs
get a TCP connect probe. A bounded error panel classifies anomalies
as timeout / http-err / net-err / spike and auto-sizes to fill the
screen.

Utility: during a failover drill (backend flap, AS drain, config
push) the tally panel shows which backend each VIP is actually
steering to, with two-colour activity highlighting over a 5s
window — white = receiving traffic, grey = drained. Paired with
the rolling OK%/latency columns it gives an at-a-glance answer to
"is the VIP healthy from the outside right now, and which backend
is it hitting", without relying on maglevd's own view of the
world.

Also bumps Makefile/go.mod to build the new binary.
2026-04-15 01:23:52 +02:00
Pim van Pelt
744b1cb3d2 install-deps Makefile target; docs refresh; golangci-lint v2 clean
Makefile:
- New install-deps umbrella target split into three sub-targets:
  install-deps-apt        — Debian/Trixie-packaged build deps
                            (nodejs, npm, protobuf-compiler, git, make,
                            dpkg-dev, ca-certificates, curl, tar). Uses
                            sudo when not already root.
  install-deps-go         — ensures a Go toolchain >= GO_VERSION (go.mod
                            floor, default 1.25.0). Short-circuits when
                            the system Go is already recent enough;
                            otherwise downloads the upstream tarball
                            from go.dev/dl/ into /usr/local/go. Trixie
                            only ships 1.24 so this step is load-bearing.
  install-deps-go-tools   — go install protoc-gen-go, protoc-gen-go-grpc,
                            and golangci-lint/v2/cmd/golangci-lint. Then
                            asserts the installed golangci-lint version
                            parses as >= GOLANGCI_LINT_VERSION (default
                            1.64.0, the floor that supports Go 1.25
                            syntax) to catch stale binaries in $GOPATH
                            /bin before they silently run against Go
                            1.25 code.
- Parser bug fixed: golangci-lint v1.x prints "has version v1.64.8" but
  v2.x dropped the 'v' prefix and prints "has version 2.11.4". The
  original sed regex required the 'v' and returned an empty match on
  v2.x, making the assertion explode with "could not parse version
  output". Fixed by switching to extended regex (sed -En) with 'v?' so
  both forms parse cleanly.
- GO_VERSION and GOLANGCI_LINT_VERSION exposed as Makefile variables
  so operators can override on the command line, e.g.
    make install-deps GO_VERSION=1.25.5 GOLANGCI_LINT_VERSION=2.0.0
- .PHONY extended with the four new target names.

Docs:
- README.md: capability note rewritten to cover CAP_NET_RAW (ICMP) and
  the new CAP_SYS_ADMIN requirement when healthchecker.netns is set,
  plus a paragraph explaining that the Debian systemd unit grants both
  automatically. Docker example gained a second variant that shows the
  additional --cap-add SYS_ADMIN and /var/run/netns bind mount for
  netns-scoped deployments. Also notes that maglevd-frontend ignores
  SIGHUP so controlling-terminal disconnects don't kill it.
- docs/user-guide.md: Capabilities section rewritten as a bulleted
  list covering both caps, with the EPERM error string and three
  different ways to grant them (systemd unit, setcap, systemd-run);
  'show vpp lb counters' command description updated to explain that
  per-backend packet counts are no longer shown (LB plugin's
  forwarding node bypasses ip{4,6}_lookup_inline, so /net/route/to at
  the backend's FIB entry never ticks for LB-forwarded traffic); new
  ~75-line "What the SPA shows" subsection covering the scope
  selector + maglev_scope cookie, the per-maglevd frontend cards, the
  health-cascade icon table (ok / bug-buckets / primary-drained /
  degraded / unknown), the lb buckets column semantics, the
  maglev_zippy_open cookie, the admin-mode lifecycle dialogs with
  their plain-English consequence text, and the debug panel.
- docs/config-guide.md: healthchecker.netns field gains a capability-
  requirement note spelling out setns(CLONE_NEWNET), the EPERM
  symptom string, and the /var/run/netns/ readability requirement.
- docs/healthchecks.md: new "Jitter" subsection explaining the +/-10%
  scaling on every computed interval, and a "Probe timing while a
  probe is in flight" subsection that explains why fast-interval alone
  doesn't give fast fault detection against hanging backends (the
  probe loop is synchronous, so each iteration is timeout +
  fast-interval; the advice is to lower timeout, not fast-interval).
- docs/maglevd.8: description paragraph corrected (dropped the
  per-backend stats claim and added a short note pointing at the LB
  plugin forwarding-path bypass); new CAPABILITIES section between
  SIGNALS and FILES covering both CAP_NET_RAW and CAP_SYS_ADMIN with
  the drop-in-override hint.
- docs/maglevd-frontend.8: new SIGNALS section documenting the
  explicit SIGHUP ignore (so a controlling-terminal disconnect doesn't
  kill the daemon); description extended with paragraphs on the two
  persistence cookies (maglev_scope, maglev_zippy_open) and on the
  health-cascade icon + lb buckets column.
- docs/maglevc.1: left untouched — intentionally minimal and delegates
  to docs/user-guide.md.

Lint (26 issues across 12 files, all errcheck / ineffassign / S1021):
- cmd/frontend/handlers.go: _, _ = fmt.Fprintf(...) for the SSE retry
  hint and resync control-event writes.
- cmd/maglevc/commands.go: bulk-prefix every fmt.Fprintf(w, ...) with
  _, _ =; also merged 'var watchEventsOptSlot *Node; ... = &Node{...}'
  into a single := declaration (staticcheck S1021) — the self-
  referencing pattern still works because the Children back-ref is
  assigned on the next statement, not inside the struct literal.
- cmd/maglevc/complete.go: _, _ = fmt.Fprintf(ql.rl.Stderr(), ...)
  for the banner and help writes; removed the ineffectual
  'partial = ""' assignment (nothing downstream reads partial after
  that branch, so setting it was dead code flagged by ineffassign).
- cmd/maglevc/shell.go: defer func() { _ = rl.Close() }() for the
  readline instance; _, _ = fmt.Fprintf(rl.Stderr(), ...) for error
  display in the REPL loop.
- cmd/maglevc/main.go: defer func() { _ = conn.Close() }() for the
  gRPC client connection.
- internal/grpcapi/server_test.go: _ = conn.Close() in the test
  teardown closure.
- internal/prober/http.go: _ = c.Close() in the TLS-handshake-failed
  path; defer func() { _ = conn.Close() }() and defer func() { _ =
  resp.Body.Close() }() for the two deferred cleanups.
- internal/prober/http_test.go: defer func() { _ = resp.Body.Close()
  }() plus three _, _ = fmt.Fprint(w, ...) in the httptest.Server
  handlers and _, _ = fmt.Sscanf(...) when parsing the test listener's
  port.
- internal/prober/icmp.go: defer func() { _ = pc.Close() }() for the
  ICMP packet conn.
- internal/prober/netns.go: defer func() { _ = origNs.Close() }(),
  defer func() { _ = netns.Set(origNs) }(), defer func() { _ =
  targetNs.Close() }() — also dropped a stray //nolint:errcheck that
  was no longer needed once the closure wrapping handled the discard.
- internal/prober/tcp.go: _ = conn.Close() in the L4-only path,
  _ = tlsConn.Close() in the failed and succeeded handshake branches,
  _ = tlsConn.SetDeadline(...) (also dropped a //nolint:errcheck
  previously covering it).

Iterative 'make lint' runs were needed because golangci-lint v2.x
caps same-linter reports per pass, so the first pass reported 21,
then 4, then 3, then 1, then 0. Final pass: 0 issues. make test is
green across every package, and make build produces all three
binaries cleanly.
2026-04-14 17:37:53 +02:00
Pim van Pelt
4288e22b71 Compile/Test static binaries, for better portability 2026-04-13 14:38:43 +02:00
Pim van Pelt
35643fd774 Rename maglev-frontend → maglevd-frontend; v0.9.1; API RX/TX pulse
Rename the web dashboard binary to maglevd-frontend and move it to
/usr/sbin (it's a daemon and belongs with maglevd). The systemd unit
name stays vpp-maglev-frontend.service since that prefix is the
package name. Manpage, README, user-guide, and debian packaging all
updated in lockstep; bump to 0.9.1 for the first real release.

All frontend env vars are now prefixed MAGLEV_FRONTEND_ so a single
/etc/default/vpp-maglev can be shared with maglevd without collisions.
Every flag has an env equivalent for Docker use. MAGLEV_FRONTEND_USER
and MAGLEV_FRONTEND_PASSWORD still gate the /admin surface.

VPPInfoPanel now pulses "API: ↑↓" indicators in the zippy title
whenever a vpp-api-send / vpp-api-recv log event arrives on the SSE
stream for the scoped maglevd — 250ms blue flash, re-triggerable,
with the two arrows tightly kerned via negative letter-spacing.
2026-04-13 00:13:52 +02:00
Pim van Pelt
25e9d79aba Frontend: live clocks, admin mode, backend actions; packaging polish
Builds on the maglev-frontend component introduced in 284b4cc with
quality-of-life improvements, an authenticated /admin surface, a
live-action control plane, and Debian packaging cleanup.

 - Backend state now renders live: maglevd's FrontendEvent synthetic
   from==to replay hydrates FrontendSnapshot.State on WatchEvents
   subscribe, and live transitions update both the in-process cache
   and every connected browser via a new applyFrontendTransition
   reducer. Shown as a StatusBadge next to the frontend name.
 - VPP connection state surfaces in the VPP zippy title as a
   green/red badge. Driven by vpp-connect / vpp-disconnect and by
   the steady stream of vpp-api-send/recv debug heartbeats so a
   silent VPP drop is caught within one debug-log tick.
 - Probe heartbeat dot becomes ❤️ while a probe is in flight and
   reverts to · on probe-done. Fixed-size wrapper so the emoji swap
   doesn't jiggle the row; both states share the same font-size.
 - Flash component replaced its subtle background-only fade with a
   scale-pop + yellow halo box-shadow + longer duration so
   weight/effective/state changes are unmissable on tiny numeric
   cells. Initial mount still skipped via defer so no flash on load.
 - Last-transition age is now a live countdown driven by a global
   1-second ticker signal (one timer, many subscribers). Two most
   significant units: 10m30s / 1h12m / 1d16h. Sub-second ages
   render as "now" to absorb clock skew between maglevd and the
   browser.
 - Event stream is now chronological (oldest at top) with tail-
   style auto-scroll, pause/resume, and the toolbar moved below the
   list. Row separators removed. Also shown only in /admin (see
   below) so /view stays a focused read-only surface.
 - Table nowrap so backend names like nginx0-frggh0 and the
   "last transition" header don't wrap. Frontends render in the
   order returned by ListFrontends instead of Go map iteration
   order so reload doesn't shuffle VIP order.
 - IPng logo in the header, clickable, links to the git repo.
   Header padding reduced so the logo can fill the bar up to the
   separator. Version + commit + build date shown in the brand area
   (fetched once from /view/api/version).
 - "view" / "admin" mode tag moved to sit just left of the admin
   toggle button so it reads as a pair.
 - Prettier wired in as the web-side fixstyle via a new
   fixstyle-web Make target that also runs from `make fixstyle`.
   Added .prettierrc.json and .prettierignore; 8 existing files
   were normalized in place.

 - Fixed a "20555d ago" rendering bug: maglevd's synthetic
   backend-replay events (from==to, at_unix_ns=0) were corrupting
   the local cache's LastTransition via applyBackendTransition.
   Backend synthetic events are now skipped entirely (refreshAll
   covers initial hydration for backends), while frontend synthetic
   events are still applied because FrontendInfo doesn't carry
   state — the event is the only source.

 - New MAGLEV_FRONTEND_USER / MAGLEV_FRONTEND_PASSWORD env vars.
   When both are set and non-empty, /admin/ becomes a basic-auth-
   protected SPA shell backed by the same embedded index.html as
   /view/. The SPA detects its base path via a new stores/mode.ts
   isAdmin constant and conditionally renders admin-only sections
   (currently: the Event Stream / DebugPanel). When disabled,
   /admin/ returns 404 (not 501) so operators who didn't configure
   it see no teasing affordance, and the SPA's admin-toggle button
   is hidden entirely via the admin_enabled flag on
   /view/api/version.
 - basicAuth uses crypto/subtle.ConstantTimeCompare for both user
   and password so timing can't distinguish a wrong username from
   a wrong password.

 - New POST /admin/api/{maglevd}/backend/{name}/{pause|resume|
   enable|disable} endpoint, gated by the same basic-auth
   middleware as the SPA shell. maglevClient.BackendAction wraps
   the four matching gRPC RPCs and returns a fresh BackendSnapshot;
   the same transition lands via WatchEvents so every connected
   browser converges through the normal reducer path.
 - BackendActionsMenu Solid component: kebab (⋮) button in a new
   trailing column rendered only in /admin. Click-outside and
   Escape close the popover (document listeners installed only
   while open). Actions are state-aware: up/down/unknown → pause,
   disable; paused → resume, disable; disabled → enable;
   removed → menu suppressed entirely. Busy indicator per action;
   errors render inline under the item list.
 - Structured audit log: every mutation logs an
   admin-backend-action record with maglevd / backend / action /
   resulting state.

 - Renamed debian/vpp-maglevd.service → debian/vpp-maglev.service
   to align naming with the new vpp-maglev-frontend.service
   sibling. postinst handles upgrades by stopping + disabling any
   lingering vpp-maglevd.service before enabling the renamed unit;
   prerm stops both (the frontend unit is installed but not
   enabled by default — operators opt in with systemctl enable).
 - New debian/vpp-maglev-frontend.service (hardened:
   NoNewPrivileges, ProtectSystem=strict, ProtectHome, PrivateTmp,
   no capabilities). Reads the same /etc/default/vpp-maglev
   conffile and expands MAGLEV_FRONTEND_ARGS via
   `ExecStart=/usr/bin/maglev-frontend $MAGLEV_FRONTEND_ARGS` so
   word-splitting works.
 - docs/maglev-frontend.8 manpage documenting flags, endpoints,
   and SSE reverse-proxy requirements.
 - build-deb.sh: drops the commit hash from the .deb filename
   (now vpp-maglev_<version>_<arch>.deb) and no longer takes the
   commit as a CLI arg. Binaries continue to carry the commit via
   -ldflags so `maglevd --version` et al are the authoritative
   "which build is running" answer.
2026-04-12 20:04:53 +02:00
Pim van Pelt
284b4cc9a4 New maglev-frontend component; promote LB sync events to INFO
Introduces maglev-frontend, a responsive, real-time web dashboard for one
or more running maglevd instances. Source lives at cmd/frontend/; the
built binary is maglev-frontend. It is a single Go process with the
SolidJS SPA embedded via //go:embed — no runtime file dependencies.

Architecture
 - One persistent gRPC connection per configured maglevd (-server A,B,C).
   Each connection runs three background loops: a WatchEvents stream
   subscribed at log_level=debug for live events, a 30s refresh loop as
   a safety net for drift, and a 5s health loop that surfaces connection
   drops quickly.
 - In-process pub/sub broker with a 30s / 2000-event replay ring using
   <epoch>-<seq> monotonic IDs. Short browser reconnects (nginx idle,
   wifi flap, laptop wake) silently replay buffered events via the
   EventSource Last-Event-ID header; longer outages or frontend restarts
   fall through to a "resync" event that triggers a full state refetch.
 - HTTP surface: /view/ (SPA), /view/api/state, /view/api/state/{name},
   /view/api/maglevds, /view/api/version, /view/api/events (SSE),
   /healthz, and an /admin/* placeholder returning 501 for a future
   basic-auth mutation surface.
 - SSE handler follows the full operational checklist: retry hint, 15s
   : ping heartbeat, Flush after every write, r.Context().Done() teardown,
   X-Accel-Buffering: no, and no gzip.

SolidJS SPA (cmd/frontend/web/, Vite + TypeScript)
 - solid-js/store for a reactive per-maglevd state tree; reducers apply
   backend transitions, maglevd-status flips, and resync refetches.
 - Scope selector tabs for multi-maglevd support, per-maglevd frontend
   cards with pool tables showing state, configured weight, effective
   weight, and last-transition age.
 - ProbeHeartbeat component turns a middle-dot into ❤️ on probe-start and
   back on probe-done, driven by real log events; fixed-size wrapper so
   the emoji swap doesn't jiggle the row.
 - Flash wrapper animates any primitive on change (1s yellow fade via
   Web Animations API, skipped on first mount). Wired into the state
   badge, configured weight, and effective weight columns.
 - DebugPanel: chronological rolling event tail with tail-style auto-
   scroll, pause/resume, and scope/firehose filter. Syntactic highlight
   for vpp-lb-sync-* events with fixed-order attribute formatting.
 - Live effective_weight updates: vpp-lb-sync-as-added/removed/weight-
   updated log events are routed through a reducer that walks the
   snapshot's pool rows and sets effective_weight on every match
   without waiting for the 30s refresh.
 - Header shows build version + commit with build date in a tooltip,
   fetched once from /view/api/version on mount.
 - Prettier wired in as the web-side fixstyle; make fixstyle now tidies
   both Go and web in one shot via a new fixstyle-web target.

Per-mutation VPP LB sync logging
 - Promotes the addVIP/delVIP/addAS/delAS/setASWeight helpers from
   slog.Debug to slog.Info and renames them from vpp-lbsync-* to
   vpp-lb-sync-{vip-added,vip-removed,as-added,as-removed,as-weight-
   updated}. Matching rename for vpp-lb-sync-start / -done / -error /
   -vip-recreate. The Prometheus metric name (maglev_vpp_lbsync_total)
   is left alone to preserve dashboards.
 - setASWeight now takes the prior weight so the event can emit
   from=X to=Y and the UI can show the delta.
 - The vip field in every event is the bare address (no /32 or /128
   mask), matching the CLI output style.
 - Any listener on the gRPC WatchEvents stream — CLI watch events or
   maglev-frontend — now sees every VIP/AS dataplane change in real
   time without needing to raise the log level.

Build and tooling
 - Makefile: maglev-frontend added to BINARIES; build / build-amd64 /
   build-arm64 emit the binary alongside maglevd and maglevc. A new
   maglev-frontend-web target rebuilds the SolidJS bundle via npm.
 - web/dist/ is tracked so a bare `go build` keeps working for Go-only
   contributors and CI.
 - .gitignore skips cmd/frontend/web/node_modules/.

Stability fixes
 - maglevd's WatchEvents synthetic replay events (from==to, at_unix_ns=0)
   were corrupting the frontend's LastTransition cache with at=0,
   rendering as "20555d ago" in the browser. Client now skips synthetic
   events: the cache comes from refreshAll and doesn't need them.
 - Frontends, Backends, and HealthChecks are now served in the order
   returned by the corresponding List* RPC instead of Go map iteration
   order, so reloads and refreshes keep the SPA stable.
2026-04-12 17:48:31 +02:00
Pim van Pelt
d3c5c86037 VPP load-balancer dataplane integration: state, sync, and global conf
This commit wires maglevd through to VPP's LB plugin end-to-end, using
locally-generated GoVPP bindings for the newer v2 API messages.

VPP binapi (vendored)
- New package internal/vpp/binapi/ containing lb, lb_types, ip_types, and
  interface_types, generated from a local VPP build (~/src/vpp) via a new
  'make vpp-binapi' target. GoVPP v0.12.0 upstream lacks the v2 messages we
  need (lb_conf_get, lb_add_del_vip_v2, lb_add_del_as_v2, lb_as_v2_dump,
  lb_as_set_weight), so we commit the generated output in-tree.
- All generated files go through our loggedChannel wrapper; every VPP API
  send/receive is recorded at DEBUG via slog (vpp-api-send / vpp-api-recv /
  vpp-api-send-multi / vpp-api-recv-multi) so the full wire-level trail is
  auditable. NewAPIChannel is unexported — callers must use c.apiChannel().

Read path: GetLBState{All,VIP}
- GetLBStateAll returns a full snapshot (global conf + every VIP with its
  attached application servers).
- GetLBStateVIP looks up a single VIP by (prefix, protocol, port) and
  returns (nil, nil) when the VIP doesn't exist in VPP. This is the
  efficient path for targeted updates on a busy LB.
- Helpers factored out: getLBConf, dumpAllVIPs, dumpASesForVIP, lookupVIP,
  vipFromDetails.

Write path: SyncLBState{All,VIP}
- SyncLBStateAll reconciles every configured frontend with VPP: creates
  missing VIPs, removes stale ones (with AS flush), and reconciles AS
  membership and weights within VIPs that exist on both sides.
- SyncLBStateVIP targets a single frontend by name. Never removes VIPs.
  Returns ErrFrontendNotFound (wrapped with the name) when the frontend
  isn't in config, so callers can use errors.Is.
- Shared reconcileVIP helper does the per-VIP AS diff; removeVIP is used
  only by the full-sync pass.
- LbAddDelVipV2 requests always set NewFlowsTableLength=1024. The .api
  default=1024 annotation is only applied by VAT/CLI parsers, not wire-
  level marshalling — sending 0 caused VPP to vec_validate with mask
  0xFFFFFFFF and OOM-panic.
- Pool semantics: backends in the primary (first) pool of a frontend get
  their configured weight; backends in secondary pools get weight 0. All
  backends are installed so higher layers can flip weights on failover
  without add/remove churn.
- Every individual change emits a DEBUG slog (vpp-lbsync-vip-add/del,
  vpp-lbsync-as-add/del, vpp-lbsync-as-weight). Start/done INFO logs
  carry a scope=all|vip label plus aggregate counts.

Global conf push: SetLBConf
- New SetLBConf(cfg) sends lb_conf with ipv4-src, ipv6-src, sticky-buckets,
  and flow-timeout. Called automatically on VPP (re)connect and after
  every config reload (via doReloadConfig). Results are cached on the
  Client so redundant pushes are silently skipped — only actual changes
  produce a vpp-lb-conf-set INFO log line.

Periodic drift reconciliation
- vpp.Client.lbSyncLoop runs in a goroutine tied to each VPP connection's
  lifetime. Its first tick is immediate (startup and post-reconnect
  sync quickly); subsequent ticks fire every vpp.lb.sync-interval from
  config (default 30s). Purpose: catch drift if something/someone
  modifies VPP state by hand. The loop uses a ConfigSource interface
  (satisfied by checker.Checker via its new Config() accessor) to avoid
  an import cycle with the checker package.

Config schema additions (maglev.vpp.lb)
- sync-interval: positive Go duration, default 30s.
- ipv4-src-address: REQUIRED. Used as the outer source for GRE4 encap
  to application servers. Missing this is a hard semantic error —
  maglevd --check exits 2 and the daemon refuses to start. VPP GRE
  needs a source address and every VIP we program uses GRE, so there
  is no meaningful config without it.
- ipv6-src-address: REQUIRED. Same treatment as ipv4-src-address.
- sticky-buckets-per-core: default 65536, must be a power of 2.
- flow-timeout: default 40s, must be a whole number of seconds in [1s, 120s].
- VPP validation runs at the end of convert() so structural errors in
  healthchecks/backends/frontends surface first — operators fix those,
  then get the VPP-specific requirements.

gRPC API
- New GetVPPLBState RPC returning VPPLBState: global conf + VIPs with
  ASes. Mirrors the read-path but strips fields irrelevant to our
  GRE-only deployment (srv_type, dscp, target_port).
- New SyncVPPLBState RPC with optional frontend_name. Unset → full sync
  (may remove stale VIPs). Set → single-VIP sync (never removes).
  Returns codes.NotFound for unknown frontends, codes.Unavailable when
  VPP integration is disabled or disconnected.

maglevc (CLI)
- New 'show vpp lbstate' command displaying the LB plugin state. VPP-only
  fields the dataplane irrelevant to GRE are suppressed. Per-AS lines use
  a key-value format ("address X  weight Y  flow-table-buckets Z")
  instead of a tabwriter column, which avoids the ANSI-color alignment
  issue we hit with mixed label/data rows.
- New 'sync vpp lbstate [<name>]' command. Without a name, triggers a
  full reconciliation; with a name, targets one frontend.
- Previous 'show vpp lb' renamed to 'show vpp lbstate' for consistency
  with the new sync command.

Test fixtures
- validConfig and all ad-hoc config_test.go fixtures that reach the end
  of convert() now include the two required vpp.lb src addresses.
- tests/01-maglevd/maglevd-lab/maglev.yaml gains a vpp.lb section so the
  robot integration tests can still load the config.
- cmd/maglevc/tree_test.go gains expected paths for the new commands.

Docs
- config-guide.md: new 'vpp' section in the basic structure, detailed
  vpp.lb field reference, noting ipv4/ipv6 src addresses as REQUIRED
  (hard error) with no defaults; example config updated.
- user-guide.md: documented 'show vpp info', 'show vpp lbstate',
  'sync vpp lbstate [<name>]', new --vpp-api-addr and --vpp-stats-addr
  flags, the vpp-lb-conf-set log line, and corrected the pause/resume
  description to reflect that pause cancels the probe goroutine.
- debian/maglev.yaml: example config gains a vpp.lb block with src
  addresses and commented optional overrides.
2026-04-12 10:58:44 +02:00
Pim van Pelt
1815675fb6 Distinguish disabled from removed backend state; add make fixstyle
Add StateDisabled for operator-initiated disable, keeping StateRemoved
for backends that disappear during a config reload. Previously both
used StateRemoved, which was confusing: "removed" implies the backend
no longer exists in config, but a disabled backend is still present
and can be re-enabled on the fly.

- health: add StateDisabled with String() "disabled", Disable() method
  with probe code "disabled". Record() rejects probes in all three
  inactive states (paused, disabled, removed).
- checker: DisableBackend calls backend.Disable() instead of Remove().
- docs: healthchecks.md rewritten for pause (goroutine cancelled, not
  just results discarded), and separate disabled/removed state rows.
  user-guide.md updated to match.
- Makefile: add fixstyle target (gofmt -w .).
2026-04-11 21:04:24 +02:00
Pim van Pelt
8bde00eb61 Fix pause to cancel probe goroutine; add Robot Framework integration tests
Pause semantics
- PauseBackend now cancels the probe goroutine so no HTTP/TCP/ICMP
  traffic is sent while the backend is paused. Previously the goroutine
  kept running and results were silently discarded.
- ResumeBackend launches a fresh probe goroutine on the existing worker,
  preserving transition history. The backend re-enters unknown state.

Integration tests (tests/01-maglevd/)
- Containerlab topology with 3 nginx:alpine backends on a dedicated
  management network (172.20.30.0/24) with static IPs.
- maglevd config with 200ms HTTP health-check interval for fast test
  convergence (rise=2, fall=2).
- 8 test cases: deploy lab, start maglevd, all backends reach up,
  nginx logs confirm probes arriving, pause stops probes (probe count
  stable), resume restarts probes, disable stops probes, enable
  restarts probes.

VPP dataplane test (tests/02-vpp-lb/)
- Rewrite 01-e2e-lab.robot to match the actual single-VPP topology:
  test client-to-server ping through VPP bridge domains and verify
  nginx is serving on all app servers. The previous version referenced
  a non-existent topology file and tested OSPF/BFD between two VPP
  nodes that don't exist in this lab.

Build infrastructure
- Add 'make robot-test' target with TEST= for suite selection.
- Add tests/.venv target for Robot Framework virtualenv.
- Make IMAGE optional in rf-run.sh.
- Add .gitignore entries for test output, venv, logs, and clab state.
2026-04-11 20:19:36 +02:00
Pim van Pelt
d612086a5f Pools, CLI, versioning, Debian packaging, HTTPS fix
- Replaced flat `backends: [...]` list on frontends with an ordered `pools:`
  list; each pool has a name and a map of backends with per-pool weights (0–100,
  default 100). Pools express priority: first pool with a healthy backend wins.
- Removed global backend weight (was on the backend, now lives in the pool).
- Config validation enforces non-empty pools, non-empty pool names, weight
  range, and consistent address families across all pools of a frontend.

- Added `PoolBackendInfo { name, weight }` and changed `PoolInfo.backends` from
  `repeated string` to `repeated PoolBackendInfo` so weights are visible over
  the API.

- Full interactive shell with readline, tab completion, and `?` inline help.
- Command tree parser (Walk) handles fixed keywords and dynamic slot nodes;
  prefix matching with exact-match priority.
- Commands: `show version/frontends/frontend/backends/backend/healthchecks/
  healthcheck`, `set backend <name> pause|resume`, `quit`/`exit`.
- `show frontend` output is hierarchical (pools → backends) with per-backend
  weights and `[disabled]` notation; pool section uses fixed-width formatting
  so ANSI color codes don't corrupt tabwriter alignment.
- `-color` flag (default true) wraps static field labels in dark-blue ANSI;
  works correctly with tabwriter because all labels carry identical-length
  escape sequences.

- `cmd/version.go` package holds `version`, `commit`, `date` vars set at build
  time via `-ldflags -X`.
- `make build` / `make build-amd64` / `make build-arm64` all inject
  `VERSION=0.1.1`, `COMMIT_HASH` (from `git rev-parse --short HEAD`), and
  `DATE` (UTC ISO-8601).
- `maglevc` prints version on interactive startup and exposes `show version`.
- `maglevd` logs version/commit/date at startup; `-version` flag prints and exits.

- `doHTTPProbe` was building a `https://` target URL even though TLS was already
  applied to the connection inside `inNetns`. `http.Transport` then wrapped the
  connection in a second TLS layer, producing "http: server gave HTTP response
  to HTTPS client". Fixed by always using `http://` in the target URL.
- Added `TestHTTPSProbe` using `httptest.NewTLSServer` to cover the full path.

- New `docs/user-guide.md`: maglevd flags/signals, maglevc commands, shell
  completion, and command-tree parser walkthrough.
- New `docs/healthchecks.md`: state machine, rise/fall model, probe intervals,
  all transition events with log examples.
- Updated `docs/config-guide.md`: pools design, removed global weight from
  backends, updated all examples.
- Updated `README.md`: packaging table, build paths, corrected binary locations
  (`/usr/sbin/maglevd`), config filename (`.yaml`).

- `debian/` directory contains `control.in`, `maglevd.service`, `default.maglev`,
  `maglev.yaml` (example config), `conffiles`, `postinst`, `prerm`.
- `debian/build-deb.sh` stages a package tree and calls `dpkg-deb`; emits
  `build/vpp-maglev_<version>~<commit>_<arch>.deb`.
- Cross-compiles for amd64 and arm64 in one `make pkg-deb` invocation.
- `maglevd` installed to `/usr/sbin/`, `maglevc` to `/usr/bin/`.
- Service reads `MAGLEV_CONFIG` from `/etc/default/maglev`
  (default: `/etc/maglev/maglev.yaml`).
- Man pages `maglevd(8)` and `maglevc(1)` live in `docs/` and are gzip'd into
  the package.
- All build output goes to `build/<arch>/`; `build/` is gitignored.
2026-04-11 12:18:17 +02:00
Pim van Pelt
46e78ec36f First stab at maglevc 2026-04-11 02:48:00 +02:00
Pim van Pelt
040d6f5853 Revision: Rename to 'maglevd'; Refactor config structure 2026-04-10 22:15:20 +02:00
Pim van Pelt
b84b3274b1 Initial revisin of healthchecker, inspired by HAProxy 2026-04-10 17:30:44 +02:00