Commit Graph

12 Commits

Author SHA1 Message Date
Pim van Pelt
701674399a Release v1.0.0 under Apache-2.0 license 2026-04-15 18:14:28 +02:00
Pim van Pelt
744b1cb3d2 install-deps Makefile target; docs refresh; golangci-lint v2 clean
Makefile:
- New install-deps umbrella target split into three sub-targets:
  install-deps-apt        — Debian/Trixie-packaged build deps
                            (nodejs, npm, protobuf-compiler, git, make,
                            dpkg-dev, ca-certificates, curl, tar). Uses
                            sudo when not already root.
  install-deps-go         — ensures a Go toolchain >= GO_VERSION (go.mod
                            floor, default 1.25.0). Short-circuits when
                            the system Go is already recent enough;
                            otherwise downloads the upstream tarball
                            from go.dev/dl/ into /usr/local/go. Trixie
                            only ships 1.24 so this step is load-bearing.
  install-deps-go-tools   — go install protoc-gen-go, protoc-gen-go-grpc,
                            and golangci-lint/v2/cmd/golangci-lint. Then
                            asserts the installed golangci-lint version
                            parses as >= GOLANGCI_LINT_VERSION (default
                            1.64.0, the floor that supports Go 1.25
                            syntax) to catch stale binaries in $GOPATH
                            /bin before they silently run against Go
                            1.25 code.
- Parser bug fixed: golangci-lint v1.x prints "has version v1.64.8" but
  v2.x dropped the 'v' prefix and prints "has version 2.11.4". The
  original sed regex required the 'v' and returned an empty match on
  v2.x, making the assertion explode with "could not parse version
  output". Fixed by switching to extended regex (sed -En) with 'v?' so
  both forms parse cleanly.
- GO_VERSION and GOLANGCI_LINT_VERSION exposed as Makefile variables
  so operators can override on the command line, e.g.
    make install-deps GO_VERSION=1.25.5 GOLANGCI_LINT_VERSION=2.0.0
- .PHONY extended with the four new target names.

Docs:
- README.md: capability note rewritten to cover CAP_NET_RAW (ICMP) and
  the new CAP_SYS_ADMIN requirement when healthchecker.netns is set,
  plus a paragraph explaining that the Debian systemd unit grants both
  automatically. Docker example gained a second variant that shows the
  additional --cap-add SYS_ADMIN and /var/run/netns bind mount for
  netns-scoped deployments. Also notes that maglevd-frontend ignores
  SIGHUP so controlling-terminal disconnects don't kill it.
- docs/user-guide.md: Capabilities section rewritten as a bulleted
  list covering both caps, with the EPERM error string and three
  different ways to grant them (systemd unit, setcap, systemd-run);
  'show vpp lb counters' command description updated to explain that
  per-backend packet counts are no longer shown (LB plugin's
  forwarding node bypasses ip{4,6}_lookup_inline, so /net/route/to at
  the backend's FIB entry never ticks for LB-forwarded traffic); new
  ~75-line "What the SPA shows" subsection covering the scope
  selector + maglev_scope cookie, the per-maglevd frontend cards, the
  health-cascade icon table (ok / bug-buckets / primary-drained /
  degraded / unknown), the lb buckets column semantics, the
  maglev_zippy_open cookie, the admin-mode lifecycle dialogs with
  their plain-English consequence text, and the debug panel.
- docs/config-guide.md: healthchecker.netns field gains a capability-
  requirement note spelling out setns(CLONE_NEWNET), the EPERM
  symptom string, and the /var/run/netns/ readability requirement.
- docs/healthchecks.md: new "Jitter" subsection explaining the +/-10%
  scaling on every computed interval, and a "Probe timing while a
  probe is in flight" subsection that explains why fast-interval alone
  doesn't give fast fault detection against hanging backends (the
  probe loop is synchronous, so each iteration is timeout +
  fast-interval; the advice is to lower timeout, not fast-interval).
- docs/maglevd.8: description paragraph corrected (dropped the
  per-backend stats claim and added a short note pointing at the LB
  plugin forwarding-path bypass); new CAPABILITIES section between
  SIGNALS and FILES covering both CAP_NET_RAW and CAP_SYS_ADMIN with
  the drop-in-override hint.
- docs/maglevd-frontend.8: new SIGNALS section documenting the
  explicit SIGHUP ignore (so a controlling-terminal disconnect doesn't
  kill the daemon); description extended with paragraphs on the two
  persistence cookies (maglev_scope, maglev_zippy_open) and on the
  health-cascade icon + lb buckets column.
- docs/maglevc.1: left untouched — intentionally minimal and delegates
  to docs/user-guide.md.

Lint (26 issues across 12 files, all errcheck / ineffassign / S1021):
- cmd/frontend/handlers.go: _, _ = fmt.Fprintf(...) for the SSE retry
  hint and resync control-event writes.
- cmd/maglevc/commands.go: bulk-prefix every fmt.Fprintf(w, ...) with
  _, _ =; also merged 'var watchEventsOptSlot *Node; ... = &Node{...}'
  into a single := declaration (staticcheck S1021) — the self-
  referencing pattern still works because the Children back-ref is
  assigned on the next statement, not inside the struct literal.
- cmd/maglevc/complete.go: _, _ = fmt.Fprintf(ql.rl.Stderr(), ...)
  for the banner and help writes; removed the ineffectual
  'partial = ""' assignment (nothing downstream reads partial after
  that branch, so setting it was dead code flagged by ineffassign).
- cmd/maglevc/shell.go: defer func() { _ = rl.Close() }() for the
  readline instance; _, _ = fmt.Fprintf(rl.Stderr(), ...) for error
  display in the REPL loop.
- cmd/maglevc/main.go: defer func() { _ = conn.Close() }() for the
  gRPC client connection.
- internal/grpcapi/server_test.go: _ = conn.Close() in the test
  teardown closure.
- internal/prober/http.go: _ = c.Close() in the TLS-handshake-failed
  path; defer func() { _ = conn.Close() }() and defer func() { _ =
  resp.Body.Close() }() for the two deferred cleanups.
- internal/prober/http_test.go: defer func() { _ = resp.Body.Close()
  }() plus three _, _ = fmt.Fprint(w, ...) in the httptest.Server
  handlers and _, _ = fmt.Sscanf(...) when parsing the test listener's
  port.
- internal/prober/icmp.go: defer func() { _ = pc.Close() }() for the
  ICMP packet conn.
- internal/prober/netns.go: defer func() { _ = origNs.Close() }(),
  defer func() { _ = netns.Set(origNs) }(), defer func() { _ =
  targetNs.Close() }() — also dropped a stray //nolint:errcheck that
  was no longer needed once the closure wrapping handled the discard.
- internal/prober/tcp.go: _ = conn.Close() in the L4-only path,
  _ = tlsConn.Close() in the failed and succeeded handshake branches,
  _ = tlsConn.SetDeadline(...) (also dropped a //nolint:errcheck
  previously covering it).

Iterative 'make lint' runs were needed because golangci-lint v2.x
caps same-linter reports per pass, so the first pass reported 21,
then 4, then 3, then 1, then 0. Final pass: 0 issues. make test is
green across every package, and make build produces all three
binaries cleanly.
2026-04-14 17:37:53 +02:00
Pim van Pelt
fb62532fd5 VPP LB counters, src-ip-sticky, and frontend state aggregation
New feature: per-VIP / per-backend runtime counters
  * New GetVPPLBCounters RPC serving an in-process snapshot refreshed
    by a 5s scrape loop (internal/vpp/lbstats.go). Each cycle pulls
    the LB plugin's four SimpleCounters (next, first, untracked,
    no-server) plus the FIB /net/route/to CombinedCounter for every
    VIP and every backend host prefix via a single DumpStats call.
  * FIB stats-index discovery via ip_route_lookup (internal/vpp/
    fibstats.go); per-worker reduction happens in the collector.
  * Prometheus collector exports vip_packets_total (kind label),
    vip_route_{packets,bytes}_total, and backend_route_{packets,
    bytes}_total. Metrics source interface extended with VIPStats /
    BackendRouteStats; vpp.Client publishes snapshots via
    atomic.Pointer and clears them on disconnect.
  * New 'show vpp lb counters' CLI command. The 'show vpp lbstate'
    and 'sync vpp lbstate' commands are restructured under 'show
    vpp lb {state,counters}' / 'sync vpp lb state' to make room
    for the new verb.

New feature: src-ip-sticky frontends
  * New frontend YAML key 'src-ip-sticky' (bool). Plumbed through
    config.Frontend, desiredVIP, and the lb_add_del_vip_v2 call.
  * Reflected in gRPC FrontendInfo.src_ip_sticky and VPPLBVIP.
    src_ip_sticky, and shown in 'show vpp lb state' output.
  * Scraped back from VPP by parsing 'show lb vips verbose' through
    cli_inband — lb_vip_details does not expose the flag. The same
    scrape also recovers the LB pool index for each VIP, which the
    stats-segment counters are keyed on. This is a documented
    temporary workaround until VPP ships an lb_vip_v2_dump.
  * src_ip_sticky cannot be mutated on a live VIP, so a flipped flag
    triggers a tear-down-and-recreate in reconcileVIP (ASes deleted
    with flush, VIP deleted, then re-added). Flip is logged.

New feature: frontend state aggregation and events
  * New health.FrontendState (unknown/up/down) and FrontendTransition
    types. A frontend is 'up' iff at least one backend has a nonzero
    effective weight, 'unknown' iff no backend has real state yet,
    and 'down' otherwise.
  * Checker tracks per-frontend aggregate state, recomputing after
    each backend transition and emitting a frontend-transition Event
    on change. Reload drops entries for removed frontends.
  * checker.Event gains an optional FrontendTransition pointer;
    backend- vs. frontend-transition events are demultiplexed on
    that field.
  * WatchEvents now sends an initial snapshot of frontend state on
    connect (mirroring the existing backend snapshot), subscribes
    once to the checker stream, and fans out to backend/frontend
    handlers based on the client's filter flags. The proto
    FrontendEvent message grows name + transition fields.
  * New Checker.FrontendState accessor.

Refactor: pure health helpers
  * Moved the priority-failover selector and the (pool idx, active
    pool, state, cfg weight) → (vpp weight, flush) mapping out of
    internal/vpp/lbsync.go into a new internal/health/weights.go so
    the checker can reuse them for frontend-state computation
    without importing internal/vpp.
  * New functions: health.ActivePoolIndex, BackendEffectiveWeight,
    EffectiveWeights, ComputeFrontendState. lbsync.go now calls
    these directly; vpp.EffectiveWeights is a thin wrapper over
    health.EffectiveWeights retained for the gRPC observability
    path. Fully unit-tested in internal/health/weights_test.go.

maglevc polish
  * --color default is now mode-aware: on in the interactive shell,
    off in one-shot mode so piped output is script-safe. Explicit
    --color=true/false still overrides.
  * New stripHostMask helper drops /32 and /128 from VIP display;
    non-host prefixes pass through unchanged.
  * Counter table column order fixed (first before next) and
    packets/bytes columns renamed to fib-packets/fib-bytes to
    clarify they come from the FIB, not the LB plugin.

Docs
  * config-guide: document src-ip-sticky, including the VIP
    recreate-on-change caveat.
  * user-guide, maglevc.1, maglevd.8: updated command tree, new
    counters command, color defaults, and the src-ip-sticky field.
2026-04-12 16:07:39 +02:00
Pim van Pelt
3227263d68 Add GoVPP integration and GetVPPInfo gRPC call
VPP client (internal/vpp/)
- New package managing connections to both VPP API and stats sockets,
  treated as a unit: if either drops, both are torn down and
  re-established together.
- Run() loop: connect, fetch version via vpe.ShowVersion, read
  /sys/boottime from the stats segment, log vpp-connect, then monitor
  with control_ping every 10s. On failure, disconnect both, retry
  after 5s.
- Registers as client name "vpp-maglev" (visible in VPP's
  "show api clients").
- Flags: --vpp-api-addr (default /run/vpp/api.sock) and
  --vpp-stats-addr (default /run/vpp/stats.sock). Empty api addr
  disables VPP integration entirely.

gRPC / proto
- Add GetVPPInfo RPC returning VPPInfo: version, build_date,
  build_directory, pid, boottime_ns, connecttime_ns. Both times are
  unix timestamps in nanoseconds — the client computes durations
  locally for display.
- Returns codes.Unavailable if VPP is disabled or not connected.

maglevc
- Add 'show vpp info' command displaying version, build-date,
  build-dir, vpp-pid, vpp-boottime (with duration), and connected
  time (with duration).
2026-04-11 22:03:28 +02:00
Pim van Pelt
1815675fb6 Distinguish disabled from removed backend state; add make fixstyle
Add StateDisabled for operator-initiated disable, keeping StateRemoved
for backends that disappear during a config reload. Previously both
used StateRemoved, which was confusing: "removed" implies the backend
no longer exists in config, but a disabled backend is still present
and can be re-enabled on the fly.

- health: add StateDisabled with String() "disabled", Disable() method
  with probe code "disabled". Record() rejects probes in all three
  inactive states (paused, disabled, removed).
- checker: DisableBackend calls backend.Disable() instead of Remove().
- docs: healthchecks.md rewritten for pause (goroutine cancelled, not
  just results discarded), and separate disabled/removed state rows.
  user-guide.md updated to match.
- Makefile: add fixstyle target (gofmt -w .).
2026-04-11 21:04:24 +02:00
Pim van Pelt
58391f5463 Add WatchEvents, enable/disable/weight RPCs, and config check
gRPC / proto
- Rename WatchBackendEvents → WatchEvents; return a stream of Event
  oneof (LogEvent, BackendEvent, FrontendEvent) with optional filter
  flags (log, log_level, backend, frontend)
- Add EnableBackend, DisableBackend, SetFrontendPoolBackendWeight RPCs
- Rename PauseResumeRequest → BackendRequest
- Add CheckConfig RPC returning ok/parse_error/semantic_error

maglevd
- Route slog through a LogBroadcaster (slog.Handler) so WatchEvents
  subscribers can receive structured log records independently of the
  daemon's own --log-level
- Add --reflection flag (default true) to toggle gRPC server reflection
- Add --check flag: validates config file and exits 0/1/2
- SIGHUP: use config.Check before applying reload; log parse vs semantic
  error separately; refuse reload on any error
- Rename default config path /etc/maglev → /etc/vpp-maglev

maglevc
- Add 'watch events [num <n>] [log [level <level>]] [backend] [frontend]'
  command; prints compact protojson, stops on any keypress or Ctrl-C;
  uses cbreak mode (not raw) so output post-processing is preserved
- Add 'set backend <name> enable|disable'
- Add 'set frontend <name> pool <pool> backend <name> weight <0-100>'
- Add 'config check' command

Debian packaging
- Rename service unit to vpp-maglevd.service
- Rename conffiles to /etc/default/vpp-maglev and /etc/vpp-maglev/
- Create maglevd system user/group in postinst; add to vpp group if present
- Add postrm; add adduser to Depends
2026-04-11 16:42:11 +02:00
Pim van Pelt
d612086a5f Pools, CLI, versioning, Debian packaging, HTTPS fix
- Replaced flat `backends: [...]` list on frontends with an ordered `pools:`
  list; each pool has a name and a map of backends with per-pool weights (0–100,
  default 100). Pools express priority: first pool with a healthy backend wins.
- Removed global backend weight (was on the backend, now lives in the pool).
- Config validation enforces non-empty pools, non-empty pool names, weight
  range, and consistent address families across all pools of a frontend.

- Added `PoolBackendInfo { name, weight }` and changed `PoolInfo.backends` from
  `repeated string` to `repeated PoolBackendInfo` so weights are visible over
  the API.

- Full interactive shell with readline, tab completion, and `?` inline help.
- Command tree parser (Walk) handles fixed keywords and dynamic slot nodes;
  prefix matching with exact-match priority.
- Commands: `show version/frontends/frontend/backends/backend/healthchecks/
  healthcheck`, `set backend <name> pause|resume`, `quit`/`exit`.
- `show frontend` output is hierarchical (pools → backends) with per-backend
  weights and `[disabled]` notation; pool section uses fixed-width formatting
  so ANSI color codes don't corrupt tabwriter alignment.
- `-color` flag (default true) wraps static field labels in dark-blue ANSI;
  works correctly with tabwriter because all labels carry identical-length
  escape sequences.

- `cmd/version.go` package holds `version`, `commit`, `date` vars set at build
  time via `-ldflags -X`.
- `make build` / `make build-amd64` / `make build-arm64` all inject
  `VERSION=0.1.1`, `COMMIT_HASH` (from `git rev-parse --short HEAD`), and
  `DATE` (UTC ISO-8601).
- `maglevc` prints version on interactive startup and exposes `show version`.
- `maglevd` logs version/commit/date at startup; `-version` flag prints and exits.

- `doHTTPProbe` was building a `https://` target URL even though TLS was already
  applied to the connection inside `inNetns`. `http.Transport` then wrapped the
  connection in a second TLS layer, producing "http: server gave HTTP response
  to HTTPS client". Fixed by always using `http://` in the target URL.
- Added `TestHTTPSProbe` using `httptest.NewTLSServer` to cover the full path.

- New `docs/user-guide.md`: maglevd flags/signals, maglevc commands, shell
  completion, and command-tree parser walkthrough.
- New `docs/healthchecks.md`: state machine, rise/fall model, probe intervals,
  all transition events with log examples.
- Updated `docs/config-guide.md`: pools design, removed global weight from
  backends, updated all examples.
- Updated `README.md`: packaging table, build paths, corrected binary locations
  (`/usr/sbin/maglevd`), config filename (`.yaml`).

- `debian/` directory contains `control.in`, `maglevd.service`, `default.maglev`,
  `maglev.yaml` (example config), `conffiles`, `postinst`, `prerm`.
- `debian/build-deb.sh` stages a package tree and calls `dpkg-deb`; emits
  `build/vpp-maglev_<version>~<commit>_<arch>.deb`.
- Cross-compiles for amd64 and arm64 in one `make pkg-deb` invocation.
- `maglevd` installed to `/usr/sbin/`, `maglevc` to `/usr/bin/`.
- Service reads `MAGLEV_CONFIG` from `/etc/default/maglev`
  (default: `/etc/maglev/maglev.yaml`).
- Man pages `maglevd(8)` and `maglevc(1)` live in `docs/` and are gzip'd into
  the package.
- All build output goes to `build/<arch>/`; `build/` is gitignored.
2026-04-11 12:18:17 +02:00
Pim van Pelt
ad7d7e20fc go fmt 2026-04-11 03:05:07 +02:00
Pim van Pelt
d8ad89d115 When the server exits (^C or because docker/systemd exits it), streaming gRPC clients must be
closed. Currently, the server does not exit until the gRPC client disconnects.
2026-04-11 02:18:44 +02:00
Pim van Pelt
530d85740e Add LICENSE and README + config-guide 2026-04-10 22:22:56 +02:00
Pim van Pelt
040d6f5853 Revision: Rename to 'maglevd'; Refactor config structure 2026-04-10 22:15:20 +02:00
Pim van Pelt
b84b3274b1 Initial revisin of healthchecker, inspired by HAProxy 2026-04-10 17:30:44 +02:00