fb62532fd53f4bc42da7bbe5608fead916727c6e
4 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
fb62532fd5 |
VPP LB counters, src-ip-sticky, and frontend state aggregation
New feature: per-VIP / per-backend runtime counters
* New GetVPPLBCounters RPC serving an in-process snapshot refreshed
by a 5s scrape loop (internal/vpp/lbstats.go). Each cycle pulls
the LB plugin's four SimpleCounters (next, first, untracked,
no-server) plus the FIB /net/route/to CombinedCounter for every
VIP and every backend host prefix via a single DumpStats call.
* FIB stats-index discovery via ip_route_lookup (internal/vpp/
fibstats.go); per-worker reduction happens in the collector.
* Prometheus collector exports vip_packets_total (kind label),
vip_route_{packets,bytes}_total, and backend_route_{packets,
bytes}_total. Metrics source interface extended with VIPStats /
BackendRouteStats; vpp.Client publishes snapshots via
atomic.Pointer and clears them on disconnect.
* New 'show vpp lb counters' CLI command. The 'show vpp lbstate'
and 'sync vpp lbstate' commands are restructured under 'show
vpp lb {state,counters}' / 'sync vpp lb state' to make room
for the new verb.
New feature: src-ip-sticky frontends
* New frontend YAML key 'src-ip-sticky' (bool). Plumbed through
config.Frontend, desiredVIP, and the lb_add_del_vip_v2 call.
* Reflected in gRPC FrontendInfo.src_ip_sticky and VPPLBVIP.
src_ip_sticky, and shown in 'show vpp lb state' output.
* Scraped back from VPP by parsing 'show lb vips verbose' through
cli_inband — lb_vip_details does not expose the flag. The same
scrape also recovers the LB pool index for each VIP, which the
stats-segment counters are keyed on. This is a documented
temporary workaround until VPP ships an lb_vip_v2_dump.
* src_ip_sticky cannot be mutated on a live VIP, so a flipped flag
triggers a tear-down-and-recreate in reconcileVIP (ASes deleted
with flush, VIP deleted, then re-added). Flip is logged.
New feature: frontend state aggregation and events
* New health.FrontendState (unknown/up/down) and FrontendTransition
types. A frontend is 'up' iff at least one backend has a nonzero
effective weight, 'unknown' iff no backend has real state yet,
and 'down' otherwise.
* Checker tracks per-frontend aggregate state, recomputing after
each backend transition and emitting a frontend-transition Event
on change. Reload drops entries for removed frontends.
* checker.Event gains an optional FrontendTransition pointer;
backend- vs. frontend-transition events are demultiplexed on
that field.
* WatchEvents now sends an initial snapshot of frontend state on
connect (mirroring the existing backend snapshot), subscribes
once to the checker stream, and fans out to backend/frontend
handlers based on the client's filter flags. The proto
FrontendEvent message grows name + transition fields.
* New Checker.FrontendState accessor.
Refactor: pure health helpers
* Moved the priority-failover selector and the (pool idx, active
pool, state, cfg weight) → (vpp weight, flush) mapping out of
internal/vpp/lbsync.go into a new internal/health/weights.go so
the checker can reuse them for frontend-state computation
without importing internal/vpp.
* New functions: health.ActivePoolIndex, BackendEffectiveWeight,
EffectiveWeights, ComputeFrontendState. lbsync.go now calls
these directly; vpp.EffectiveWeights is a thin wrapper over
health.EffectiveWeights retained for the gRPC observability
path. Fully unit-tested in internal/health/weights_test.go.
maglevc polish
* --color default is now mode-aware: on in the interactive shell,
off in one-shot mode so piped output is script-safe. Explicit
--color=true/false still overrides.
* New stripHostMask helper drops /32 and /128 from VIP display;
non-host prefixes pass through unchanged.
* Counter table column order fixed (first before next) and
packets/bytes columns renamed to fib-packets/fib-bytes to
clarify they come from the FIB, not the LB plugin.
Docs
* config-guide: document src-ip-sticky, including the VIP
recreate-on-change caveat.
* user-guide, maglevc.1, maglevd.8: updated command tree, new
counters command, color defaults, and the src-ip-sticky field.
|
||
|
|
d5fbf5c640 |
Prometheus: add VPP, LB sync, and gRPC metrics; expand docs
New metrics plus the corresponding documentation for everything that's
accumulated since the last Prometheus pass.
internal/metrics/metrics.go
- New VPPSource interface (IsConnected, VPPInfo) plus a metrics-local
VPPInfo struct that mirrors vpp.Info. Decoupling via interface +
struct-mirror keeps the dependency direction one-way (vpp → metrics),
so vpp can import metrics to update inline counters without a cycle.
- New Collector gauges scraped on demand: maglev_vpp_connected,
maglev_vpp_uptime_seconds (from /sys/boottime), maglev_vpp_connected_seconds
(time since maglevd connected), and maglev_vpp_info (static 1-gauge
carrying version, build_date, and pid as labels).
- New inline counters:
- maglev_vpp_api_total{msg, direction, result} — bumped from the
loggedChannel wrapper on every VPP binary-API send/recv. Gives full
visibility into what maglevd is doing with VPP, broken down by
message name, direction (send/recv), and result (success/failure).
- maglev_vpp_lbsync_total{scope, kind} — bumped from the reconciler
at the end of each SyncLBStateAll/SyncLBStateVIP run. kind ∈
{vip_added, vip_removed, as_added, as_removed, as_weight_updated};
scope ∈ {all, vip}. Zero-valued kinds are not emitted so noise
stays low.
- Register() signature now takes a VPPSource (may be nil) alongside
the existing StateSource.
internal/vpp/client.go
- New VPPInfo() (metrics.VPPInfo, bool) shim method on *Client that
satisfies metrics.VPPSource. Returns (_, false) when disconnected so
the collector skips the vpp_* gauges cleanly.
internal/vpp/apilog.go
- The loggedChannel's SendRequest / SendMultiRequest / ReceiveReply
paths now call metrics.VPPAPITotal.WithLabelValues(...).Inc() in
addition to slog.Debug. Since every VPP API call in the codebase
must go through loggedChannel (NewAPIChannel is unexported), this
one instrumentation point catches everything.
internal/vpp/lbsync.go
- New recordSyncStats(scope, st) helper called once at the end of
SyncLBStateAll and SyncLBStateVIP to bump maglev_vpp_lbsync_total.
Zero-valued stats are skipped.
cmd/maglevd/main.go
- Added github.com/grpc-ecosystem/go-grpc-middleware/providers/prometheus
for the standard gRPC server metrics (grpc_server_started_total,
grpc_server_handled_total, grpc_server_handling_seconds, etc.,
labelled by service/method/type/code).
- Constructs grpcprom.NewServerMetrics(WithServerHandlingTimeHistogram())
before creating the grpc.Server, installs it as UnaryInterceptor +
StreamInterceptor, then calls InitializeMetrics(srv) after service
registration so every method appears at 0 on the first scrape
instead of materialising lazily on first RPC.
- Passes the vppClient (or nil) as a metrics.VPPSource to
metrics.Register so the vpp_* gauges are emitted when integration
is enabled and silently omitted otherwise.
docs/user-guide.md
- New 'Prometheus metrics' section in the maglevd chapter,
tabulating every metric family: backend state gauges, probe
counters/histogram, transition counters, the new VPP gauges and
counters, and the standard gRPC server metrics.
- 'show frontends <name>' description updated to mention the two
weight columns ('weight' = configured from YAML, 'effective' =
state-aware after pool-failover logic).
- Pause / disable descriptions clarified: transition history is
preserved across these operator actions.
docs/healthchecks.md
- New 'Static (no-healthcheck) backends' section explaining that
backends without a healthcheck use rise/fall=1, fire a synthetic
passing probe immediately on startup (no 30s wait), and idle at
30s between iterations thereafter.
- New 'Pool failover' section documenting the priority-tier model,
the active-pool definition, when promotion happens, cascading to
further tiers, and graceful drain on demotion. Points readers at
'maglevc show frontends <name>' as the inspection interface.
docs/config-guide.md
- healthcheck field doc now describes static-backend behavior and
cross-references healthchecks.md.
- pools field doc now explains failover semantics at a high level
and cross-references the detailed healthchecks.md section.
|
||
|
|
1815675fb6 |
Distinguish disabled from removed backend state; add make fixstyle
Add StateDisabled for operator-initiated disable, keeping StateRemoved for backends that disappear during a config reload. Previously both used StateRemoved, which was confusing: "removed" implies the backend no longer exists in config, but a disabled backend is still present and can be re-enabled on the fly. - health: add StateDisabled with String() "disabled", Disable() method with probe code "disabled". Record() rejects probes in all three inactive states (paused, disabled, removed). - checker: DisableBackend calls backend.Disable() instead of Remove(). - docs: healthchecks.md rewritten for pause (goroutine cancelled, not just results discarded), and separate disabled/removed state rows. user-guide.md updated to match. - Makefile: add fixstyle target (gofmt -w .). |
||
|
|
4ab3096c8b |
Add Prometheus metrics endpoint; containerize integration tests
Prometheus metrics (internal/metrics/, cmd/maglevd/) - New --metrics-addr flag (default :9091, env MAGLEV_METRICS_ADDR) serving /metrics via promhttp. - Gauge metrics scraped on demand via a custom prometheus.Collector: maglev_backend_state, maglev_backend_health, maglev_backend_enabled, maglev_frontend_pool_backend_weight. - Inline counter/histogram metrics updated per probe: maglev_probe_total (by backend, type, result, code), maglev_probe_duration_seconds (by backend, type), maglev_backend_transitions_total (by backend, from, to). - StateSource interface in metrics package breaks the import cycle with checker; checker.Checker satisfies it via GetBackendInfo. Integration tests - Run maglevd inside a containerlab node (debian:trixie-slim with build/ bind-mounted) instead of on the host. Eliminates port collisions with any host maglevd. - maglevc commands run via docker exec into the maglevd container. - Add 6 Prometheus test cases: endpoint reachable, all backends report state=up, probe counters non-zero, duration histogram populated, pool weights correct, transition counters present. |