d3c5c860379a0d32830346a98ced1ae7e11d7671
5 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
d3c5c86037 |
VPP load-balancer dataplane integration: state, sync, and global conf
This commit wires maglevd through to VPP's LB plugin end-to-end, using
locally-generated GoVPP bindings for the newer v2 API messages.
VPP binapi (vendored)
- New package internal/vpp/binapi/ containing lb, lb_types, ip_types, and
interface_types, generated from a local VPP build (~/src/vpp) via a new
'make vpp-binapi' target. GoVPP v0.12.0 upstream lacks the v2 messages we
need (lb_conf_get, lb_add_del_vip_v2, lb_add_del_as_v2, lb_as_v2_dump,
lb_as_set_weight), so we commit the generated output in-tree.
- All generated files go through our loggedChannel wrapper; every VPP API
send/receive is recorded at DEBUG via slog (vpp-api-send / vpp-api-recv /
vpp-api-send-multi / vpp-api-recv-multi) so the full wire-level trail is
auditable. NewAPIChannel is unexported — callers must use c.apiChannel().
Read path: GetLBState{All,VIP}
- GetLBStateAll returns a full snapshot (global conf + every VIP with its
attached application servers).
- GetLBStateVIP looks up a single VIP by (prefix, protocol, port) and
returns (nil, nil) when the VIP doesn't exist in VPP. This is the
efficient path for targeted updates on a busy LB.
- Helpers factored out: getLBConf, dumpAllVIPs, dumpASesForVIP, lookupVIP,
vipFromDetails.
Write path: SyncLBState{All,VIP}
- SyncLBStateAll reconciles every configured frontend with VPP: creates
missing VIPs, removes stale ones (with AS flush), and reconciles AS
membership and weights within VIPs that exist on both sides.
- SyncLBStateVIP targets a single frontend by name. Never removes VIPs.
Returns ErrFrontendNotFound (wrapped with the name) when the frontend
isn't in config, so callers can use errors.Is.
- Shared reconcileVIP helper does the per-VIP AS diff; removeVIP is used
only by the full-sync pass.
- LbAddDelVipV2 requests always set NewFlowsTableLength=1024. The .api
default=1024 annotation is only applied by VAT/CLI parsers, not wire-
level marshalling — sending 0 caused VPP to vec_validate with mask
0xFFFFFFFF and OOM-panic.
- Pool semantics: backends in the primary (first) pool of a frontend get
their configured weight; backends in secondary pools get weight 0. All
backends are installed so higher layers can flip weights on failover
without add/remove churn.
- Every individual change emits a DEBUG slog (vpp-lbsync-vip-add/del,
vpp-lbsync-as-add/del, vpp-lbsync-as-weight). Start/done INFO logs
carry a scope=all|vip label plus aggregate counts.
Global conf push: SetLBConf
- New SetLBConf(cfg) sends lb_conf with ipv4-src, ipv6-src, sticky-buckets,
and flow-timeout. Called automatically on VPP (re)connect and after
every config reload (via doReloadConfig). Results are cached on the
Client so redundant pushes are silently skipped — only actual changes
produce a vpp-lb-conf-set INFO log line.
Periodic drift reconciliation
- vpp.Client.lbSyncLoop runs in a goroutine tied to each VPP connection's
lifetime. Its first tick is immediate (startup and post-reconnect
sync quickly); subsequent ticks fire every vpp.lb.sync-interval from
config (default 30s). Purpose: catch drift if something/someone
modifies VPP state by hand. The loop uses a ConfigSource interface
(satisfied by checker.Checker via its new Config() accessor) to avoid
an import cycle with the checker package.
Config schema additions (maglev.vpp.lb)
- sync-interval: positive Go duration, default 30s.
- ipv4-src-address: REQUIRED. Used as the outer source for GRE4 encap
to application servers. Missing this is a hard semantic error —
maglevd --check exits 2 and the daemon refuses to start. VPP GRE
needs a source address and every VIP we program uses GRE, so there
is no meaningful config without it.
- ipv6-src-address: REQUIRED. Same treatment as ipv4-src-address.
- sticky-buckets-per-core: default 65536, must be a power of 2.
- flow-timeout: default 40s, must be a whole number of seconds in [1s, 120s].
- VPP validation runs at the end of convert() so structural errors in
healthchecks/backends/frontends surface first — operators fix those,
then get the VPP-specific requirements.
gRPC API
- New GetVPPLBState RPC returning VPPLBState: global conf + VIPs with
ASes. Mirrors the read-path but strips fields irrelevant to our
GRE-only deployment (srv_type, dscp, target_port).
- New SyncVPPLBState RPC with optional frontend_name. Unset → full sync
(may remove stale VIPs). Set → single-VIP sync (never removes).
Returns codes.NotFound for unknown frontends, codes.Unavailable when
VPP integration is disabled or disconnected.
maglevc (CLI)
- New 'show vpp lbstate' command displaying the LB plugin state. VPP-only
fields the dataplane irrelevant to GRE are suppressed. Per-AS lines use
a key-value format ("address X weight Y flow-table-buckets Z")
instead of a tabwriter column, which avoids the ANSI-color alignment
issue we hit with mixed label/data rows.
- New 'sync vpp lbstate [<name>]' command. Without a name, triggers a
full reconciliation; with a name, targets one frontend.
- Previous 'show vpp lb' renamed to 'show vpp lbstate' for consistency
with the new sync command.
Test fixtures
- validConfig and all ad-hoc config_test.go fixtures that reach the end
of convert() now include the two required vpp.lb src addresses.
- tests/01-maglevd/maglevd-lab/maglev.yaml gains a vpp.lb section so the
robot integration tests can still load the config.
- cmd/maglevc/tree_test.go gains expected paths for the new commands.
Docs
- config-guide.md: new 'vpp' section in the basic structure, detailed
vpp.lb field reference, noting ipv4/ipv6 src addresses as REQUIRED
(hard error) with no defaults; example config updated.
- user-guide.md: documented 'show vpp info', 'show vpp lbstate',
'sync vpp lbstate [<name>]', new --vpp-api-addr and --vpp-stats-addr
flags, the vpp-lb-conf-set log line, and corrected the pause/resume
description to reflect that pause cancels the probe goroutine.
- debian/maglev.yaml: example config gains a vpp.lb block with src
addresses and commented optional overrides.
|
||
|
|
3227263d68 |
Add GoVPP integration and GetVPPInfo gRPC call
VPP client (internal/vpp/) - New package managing connections to both VPP API and stats sockets, treated as a unit: if either drops, both are torn down and re-established together. - Run() loop: connect, fetch version via vpe.ShowVersion, read /sys/boottime from the stats segment, log vpp-connect, then monitor with control_ping every 10s. On failure, disconnect both, retry after 5s. - Registers as client name "vpp-maglev" (visible in VPP's "show api clients"). - Flags: --vpp-api-addr (default /run/vpp/api.sock) and --vpp-stats-addr (default /run/vpp/stats.sock). Empty api addr disables VPP integration entirely. gRPC / proto - Add GetVPPInfo RPC returning VPPInfo: version, build_date, build_directory, pid, boottime_ns, connecttime_ns. Both times are unix timestamps in nanoseconds — the client computes durations locally for display. - Returns codes.Unavailable if VPP is disabled or not connected. maglevc - Add 'show vpp info' command displaying version, build-date, build-dir, vpp-pid, vpp-boottime (with duration), and connected time (with duration). |
||
|
|
1815675fb6 |
Distinguish disabled from removed backend state; add make fixstyle
Add StateDisabled for operator-initiated disable, keeping StateRemoved for backends that disappear during a config reload. Previously both used StateRemoved, which was confusing: "removed" implies the backend no longer exists in config, but a disabled backend is still present and can be re-enabled on the fly. - health: add StateDisabled with String() "disabled", Disable() method with probe code "disabled". Record() rejects probes in all three inactive states (paused, disabled, removed). - checker: DisableBackend calls backend.Disable() instead of Remove(). - docs: healthchecks.md rewritten for pause (goroutine cancelled, not just results discarded), and separate disabled/removed state rows. user-guide.md updated to match. - Makefile: add fixstyle target (gofmt -w .). |
||
|
|
4ab3096c8b |
Add Prometheus metrics endpoint; containerize integration tests
Prometheus metrics (internal/metrics/, cmd/maglevd/) - New --metrics-addr flag (default :9091, env MAGLEV_METRICS_ADDR) serving /metrics via promhttp. - Gauge metrics scraped on demand via a custom prometheus.Collector: maglev_backend_state, maglev_backend_health, maglev_backend_enabled, maglev_frontend_pool_backend_weight. - Inline counter/histogram metrics updated per probe: maglev_probe_total (by backend, type, result, code), maglev_probe_duration_seconds (by backend, type), maglev_backend_transitions_total (by backend, from, to). - StateSource interface in metrics package breaks the import cycle with checker; checker.Checker satisfies it via GetBackendInfo. Integration tests - Run maglevd inside a containerlab node (debian:trixie-slim with build/ bind-mounted) instead of on the host. Eliminates port collisions with any host maglevd. - maglevc commands run via docker exec into the maglevd container. - Add 6 Prometheus test cases: endpoint reachable, all backends report state=up, probe counters non-zero, duration histogram populated, pool weights correct, transition counters present. |
||
|
|
8bde00eb61 |
Fix pause to cancel probe goroutine; add Robot Framework integration tests
Pause semantics - PauseBackend now cancels the probe goroutine so no HTTP/TCP/ICMP traffic is sent while the backend is paused. Previously the goroutine kept running and results were silently discarded. - ResumeBackend launches a fresh probe goroutine on the existing worker, preserving transition history. The backend re-enters unknown state. Integration tests (tests/01-maglevd/) - Containerlab topology with 3 nginx:alpine backends on a dedicated management network (172.20.30.0/24) with static IPs. - maglevd config with 200ms HTTP health-check interval for fast test convergence (rise=2, fall=2). - 8 test cases: deploy lab, start maglevd, all backends reach up, nginx logs confirm probes arriving, pause stops probes (probe count stable), resume restarts probes, disable stops probes, enable restarts probes. VPP dataplane test (tests/02-vpp-lb/) - Rewrite 01-e2e-lab.robot to match the actual single-VPP topology: test client-to-server ping through VPP bridge domains and verify nginx is serving on all app servers. The previous version referenced a non-existent topology file and tested OSPF/BFD between two VPP nodes that don't exist in this lab. Build infrastructure - Add 'make robot-test' target with TEST= for suite selection. - Add tests/.venv target for Robot Framework virtualenv. - Make IMAGE optional in rf-run.sh. - Add .gitignore entries for test output, venv, logs, and clab state. |