Prometheus: add VPP, LB sync, and gRPC metrics; expand docs
New metrics plus the corresponding documentation for everything that's
accumulated since the last Prometheus pass.
internal/metrics/metrics.go
- New VPPSource interface (IsConnected, VPPInfo) plus a metrics-local
VPPInfo struct that mirrors vpp.Info. Decoupling via interface +
struct-mirror keeps the dependency direction one-way (vpp → metrics),
so vpp can import metrics to update inline counters without a cycle.
- New Collector gauges scraped on demand: maglev_vpp_connected,
maglev_vpp_uptime_seconds (from /sys/boottime), maglev_vpp_connected_seconds
(time since maglevd connected), and maglev_vpp_info (static 1-gauge
carrying version, build_date, and pid as labels).
- New inline counters:
- maglev_vpp_api_total{msg, direction, result} — bumped from the
loggedChannel wrapper on every VPP binary-API send/recv. Gives full
visibility into what maglevd is doing with VPP, broken down by
message name, direction (send/recv), and result (success/failure).
- maglev_vpp_lbsync_total{scope, kind} — bumped from the reconciler
at the end of each SyncLBStateAll/SyncLBStateVIP run. kind ∈
{vip_added, vip_removed, as_added, as_removed, as_weight_updated};
scope ∈ {all, vip}. Zero-valued kinds are not emitted so noise
stays low.
- Register() signature now takes a VPPSource (may be nil) alongside
the existing StateSource.
internal/vpp/client.go
- New VPPInfo() (metrics.VPPInfo, bool) shim method on *Client that
satisfies metrics.VPPSource. Returns (_, false) when disconnected so
the collector skips the vpp_* gauges cleanly.
internal/vpp/apilog.go
- The loggedChannel's SendRequest / SendMultiRequest / ReceiveReply
paths now call metrics.VPPAPITotal.WithLabelValues(...).Inc() in
addition to slog.Debug. Since every VPP API call in the codebase
must go through loggedChannel (NewAPIChannel is unexported), this
one instrumentation point catches everything.
internal/vpp/lbsync.go
- New recordSyncStats(scope, st) helper called once at the end of
SyncLBStateAll and SyncLBStateVIP to bump maglev_vpp_lbsync_total.
Zero-valued stats are skipped.
cmd/maglevd/main.go
- Added github.com/grpc-ecosystem/go-grpc-middleware/providers/prometheus
for the standard gRPC server metrics (grpc_server_started_total,
grpc_server_handled_total, grpc_server_handling_seconds, etc.,
labelled by service/method/type/code).
- Constructs grpcprom.NewServerMetrics(WithServerHandlingTimeHistogram())
before creating the grpc.Server, installs it as UnaryInterceptor +
StreamInterceptor, then calls InitializeMetrics(srv) after service
registration so every method appears at 0 on the first scrape
instead of materialising lazily on first RPC.
- Passes the vppClient (or nil) as a metrics.VPPSource to
metrics.Register so the vpp_* gauges are emitted when integration
is enabled and silently omitted otherwise.
docs/user-guide.md
- New 'Prometheus metrics' section in the maglevd chapter,
tabulating every metric family: backend state gauges, probe
counters/histogram, transition counters, the new VPP gauges and
counters, and the standard gRPC server metrics.
- 'show frontends <name>' description updated to mention the two
weight columns ('weight' = configured from YAML, 'effective' =
state-aware after pool-failover logic).
- Pause / disable descriptions clarified: transition history is
preserved across these operator actions.
docs/healthchecks.md
- New 'Static (no-healthcheck) backends' section explaining that
backends without a healthcheck use rise/fall=1, fire a synthetic
passing probe immediately on startup (no 30s wait), and idle at
30s between iterations thereafter.
- New 'Pool failover' section documenting the priority-tier model,
the active-pool definition, when promotion happens, cascading to
further tiers, and graceful drain on demotion. Points readers at
'maglevc show frontends <name>' as the inspection interface.
docs/config-guide.md
- healthcheck field doc now describes static-backend behavior and
cross-references healthchecks.md.
- pools field doc now explains failover semantics at a high level
and cross-references the detailed healthchecks.md section.
---

## Static (no-healthcheck) backends

A backend with no `healthcheck` field in YAML skips the probe loop entirely.
Instead of actually probing, `maglevd` synthesises a single passing result
on startup. Specifically:

- The worker's rise/fall counters are forced to `1/1`, so a single synthetic
  pass is enough to reach `StateUp`.
- The first "probe" fires immediately (zero sleep). Subsequent iterations
  idle at 30 seconds — there is nothing to do.
- The backend reaches `up` within milliseconds of startup.

Static backends are useful for administrative VIPs where the caller knows the
backend is always available, or for test configurations where deterministic
state is more valuable than real health signals.

---
## Pool failover

Every frontend has one or more pools. The pools are priority tiers: pool[0]
is the primary, pool[1] is the first fallback, pool[2] the next, and so on.
At any moment, `maglevd` computes an **active pool** — the first pool that
contains at least one backend in `StateUp`:

- As long as pool[0] has any up backend, it stays active. Its up backends
  receive traffic at their configured weights; backends in lower-priority
  pools stay on standby with effective weight 0.
- When pool[0] has zero up backends (all down, paused, disabled, or still
  unknown), pool[1] is promoted: its up backends get their configured
  weights, and pool[0] backends stay at 0 until at least one recovers.
- The same rule cascades to pool[2], pool[3], etc., for further fallback
  tiers.
- When no pool has any up backend, every backend's effective weight is 0
  and the VIP serves nothing.

Failover is evaluated on every backend state transition and also on the
periodic VPP drift reconciliation (every `maglev.vpp.lb.sync-interval`).
The resulting effective weight for each backend can be inspected via
`maglevc show frontends <name>` — each pool backend row shows both the
configured weight and the effective weight after failover.

Demotion on recovery (e.g. pool[1] → standby when pool[0] comes back up)
drains gracefully: the demoted backends have their weight set to 0 but
existing flows in the VPP flow table are left to drain naturally. The only
state that forces immediate flow-table flushing is operator `disable`.

---
## Log lines

All state changes produce a structured log line at `INFO` level: