Prometheus: add VPP, LB sync, and gRPC metrics; expand docs

Adds new metrics, plus documentation for everything that has
accumulated since the last Prometheus pass.

internal/metrics/metrics.go
- New VPPSource interface (IsConnected, VPPInfo) plus a metrics-local
  VPPInfo struct that mirrors vpp.Info. Decoupling via interface +
  struct-mirror keeps the dependency direction one-way (vpp → metrics),
  so vpp can import metrics to update inline counters without a cycle.
- New Collector gauges scraped on demand: maglev_vpp_connected,
  maglev_vpp_uptime_seconds (from /sys/boottime), maglev_vpp_connected_seconds
  (time since maglevd connected), and maglev_vpp_info (static 1-gauge
  carrying version, build_date, and pid as labels).
- New inline counters:
  - maglev_vpp_api_total{msg, direction, result} — bumped from the
    loggedChannel wrapper on every VPP binary-API send/recv. Gives full
    visibility into what maglevd is doing with VPP, broken down by
    message name, direction (send/recv), and result (success/failure).
  - maglev_vpp_lbsync_total{scope, kind} — bumped from the reconciler
    at the end of each SyncLBStateAll/SyncLBStateVIP run. kind ∈
    {vip_added, vip_removed, as_added, as_removed, as_weight_updated};
    scope ∈ {all, vip}. Zero-valued kinds are not emitted so noise
    stays low.
- Register() signature now takes a VPPSource (may be nil) alongside
  the existing StateSource.

internal/vpp/client.go
- New VPPInfo() (metrics.VPPInfo, bool) shim method on *Client that
  satisfies metrics.VPPSource. Returns (_, false) when disconnected so
  the collector skips the vpp_* gauges cleanly.

internal/vpp/apilog.go
- The loggedChannel's SendRequest / SendMultiRequest / ReceiveReply
  paths now call metrics.VPPAPITotal.WithLabelValues(...).Inc() in
  addition to slog.Debug. Since every VPP API call in the codebase
  must go through loggedChannel (NewAPIChannel is unexported), this
  one instrumentation point catches everything.
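
The single-instrumentation-point idea can be illustrated with a small wrapper
sketch. Everything here is hypothetical and simplified: apiChannel is not the
real VPP channel interface, apiCounter is a stdlib stand-in for a counter
labelled by {msg, direction, result}, and the message names are placeholders.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// apiCounter is a tiny stand-in for a counter labelled by
// {msg, direction, result}.
type apiCounter struct {
	mu sync.Mutex
	n  map[[3]string]int
}

func newAPICounter() *apiCounter { return &apiCounter{n: map[[3]string]int{}} }

func (c *apiCounter) Inc(msg, direction, result string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n[[3]string{msg, direction, result}]++
}

func (c *apiCounter) Get(msg, direction, result string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n[[3]string{msg, direction, result}]
}

// apiChannel is a hypothetical, heavily simplified request/reply channel.
type apiChannel interface {
	Send(msg string) error
	Recv(msg string) error
}

// countedChannel wraps an apiChannel and bumps the counter on every call,
// so callers that can only obtain the wrapper are always counted.
type countedChannel struct {
	inner apiChannel
	total *apiCounter
}

func resultLabel(err error) string {
	if err != nil {
		return "failure"
	}
	return "success"
}

func (c countedChannel) Send(msg string) error {
	err := c.inner.Send(msg)
	c.total.Inc(msg, "send", resultLabel(err))
	return err
}

func (c countedChannel) Recv(msg string) error {
	err := c.inner.Recv(msg)
	c.total.Inc(msg, "recv", resultLabel(err))
	return err
}

// failingRecv is a fake transport whose replies always fail.
type failingRecv struct{}

func (failingRecv) Send(string) error { return nil }
func (failingRecv) Recv(string) error { return errors.New("timeout") }

func main() {
	total := newAPICounter()
	ch := countedChannel{inner: failingRecv{}, total: total}
	_ = ch.Send("example_request")
	_ = ch.Recv("example_reply")
	fmt.Println(total.Get("example_request", "send", "success"))
	fmt.Println(total.Get("example_reply", "recv", "failure"))
}
```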

internal/vpp/lbsync.go
- New recordSyncStats(scope, st) helper called once at the end of
  SyncLBStateAll and SyncLBStateVIP to bump maglev_vpp_lbsync_total.
  Zero-valued stats are skipped.
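
A rough sketch of what such a helper might look like, assuming the counter is
advanced by the number of mutations of each kind. The syncStats field names
and the lbsyncCounter map are invented for this example, and the counter is
passed as an extra parameter only to keep the sketch self-contained; the real
helper takes (scope, st).

```go
package main

import "fmt"

// syncStats holds per-run mutation counts (field names are assumptions).
type syncStats struct {
	VIPAdded, VIPRemoved, ASAdded, ASRemoved, ASWeightUpdated int
}

// lbsyncCounter stands in for a counter labelled by {scope, kind}.
type lbsyncCounter map[[2]string]int

// recordSyncStats adds each non-zero stat to the counter. Zero-valued kinds
// are skipped, so no series is created for mutations that never happened.
func recordSyncStats(total lbsyncCounter, scope string, st syncStats) {
	kinds := []struct {
		kind string
		n    int
	}{
		{"vip_added", st.VIPAdded},
		{"vip_removed", st.VIPRemoved},
		{"as_added", st.ASAdded},
		{"as_removed", st.ASRemoved},
		{"as_weight_updated", st.ASWeightUpdated},
	}
	for _, k := range kinds {
		if k.n == 0 {
			continue
		}
		total[[2]string{scope, k.kind}] += k.n
	}
}

func main() {
	total := lbsyncCounter{}
	recordSyncStats(total, "all", syncStats{VIPAdded: 2, ASWeightUpdated: 5})
	recordSyncStats(total, "vip", syncStats{ASAdded: 1})
	fmt.Println(len(total), total[[2]string{"all", "as_weight_updated"}])
}
```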

cmd/maglevd/main.go
- Added github.com/grpc-ecosystem/go-grpc-middleware/providers/prometheus
  for the standard gRPC server metrics (grpc_server_started_total,
  grpc_server_handled_total, grpc_server_handling_seconds, etc.,
  labelled by service/method/type/code).
- Constructs grpcprom.NewServerMetrics(WithServerHandlingTimeHistogram())
  before creating the grpc.Server, installs it as UnaryInterceptor +
  StreamInterceptor, then calls InitializeMetrics(srv) after service
  registration so every method appears at 0 on the first scrape
  instead of materialising lazily on first RPC.
- Passes the vppClient (or nil) as a metrics.VPPSource to
  metrics.Register so the vpp_* gauges are emitted when integration
  is enabled and silently omitted otherwise.

docs/user-guide.md
- New 'Prometheus metrics' section in the maglevd chapter,
  tabulating every metric family: backend state gauges, probe
  counters/histogram, transition counters, the new VPP gauges and
  counters, and the standard gRPC server metrics.
- 'show frontends <name>' description updated to mention the two
  weight columns ('weight' = configured from YAML, 'effective' =
  state-aware after pool-failover logic).
- Pause / disable descriptions clarified: transition history is
  preserved across these operator actions.

docs/healthchecks.md
- New 'Static (no-healthcheck) backends' section explaining that
  backends without a healthcheck use rise/fall=1, fire a synthetic
  passing probe immediately on startup (no 30s wait), and idle at
  30s between iterations thereafter.
- New 'Pool failover' section documenting the priority-tier model,
  the active-pool definition, when promotion happens, cascading to
  further tiers, and graceful drain on demotion. Points readers at
  'maglevc show frontends <name>' as the inspection interface.
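
To make the 'weight' versus 'effective' distinction concrete, here is a toy
model of a priority-tier failover. It is only one plausible reading and not
the documented algorithm: it assumes the active pool is the first pool (in
configured order) with at least one backend that is up, that up backends in
the active pool keep their configured weight, and that everything else gets
an effective weight of 0. Graceful drain is not modelled, and all type and
function names are hypothetical.

```go
package main

import "fmt"

type backend struct {
	Name   string
	Weight int  // configured weight
	Up     bool // current health state, simplified to a boolean
}

type pool struct {
	Name     string
	Backends []backend
}

// effectiveWeights returns the effective weight per backend name under the
// simplified model described above. Backend names are assumed to be unique.
func effectiveWeights(pools []pool) map[string]int {
	eff := map[string]int{}
	active := -1
	for i, p := range pools {
		for _, b := range p.Backends {
			eff[b.Name] = 0
			if active == -1 && b.Up {
				active = i // first tier with a healthy backend wins
			}
		}
	}
	if active == -1 {
		return eff // nothing is up anywhere
	}
	for _, b := range pools[active].Backends {
		if b.Up {
			eff[b.Name] = b.Weight
		}
	}
	return eff
}

func main() {
	pools := []pool{
		{Name: "primary", Backends: []backend{{"a", 100, false}, {"b", 50, false}}},
		{Name: "secondary", Backends: []backend{{"c", 100, true}}},
	}
	fmt.Println(effectiveWeights(pools))
}
```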

docs/config-guide.md
- healthcheck field doc now describes static-backend behavior and
  cross-references healthchecks.md.
- pools field doc now explains failover semantics at a high level
  and cross-references the detailed healthchecks.md section.
2026-04-12 13:00:29 +02:00
parent 0049c2ae73
commit d5fbf5c640
10 changed files with 322 additions and 28 deletions


@@ -41,8 +41,47 @@ special capabilities.
All log output is written to stdout as JSON using Go's `log/slog`. The first
line logged after the logger is configured is a `starting` record that includes
`version`, `commit`, and `date`. Every state change emits a `backend-transition`
-line at `INFO` level. Set `--log-level debug` to see individual probe attempts
-and their outcomes.
+line at `INFO` level. Set `--log-level debug` to see individual probe attempts,
+every VPP binary-API call (`vpp-api-send` / `vpp-api-recv` with full payload),
+and the per-VIP sync operations (`vpp-lbsync-vip-add`, `vpp-lbsync-as-weight`,
+etc.) as they happen.
+### Prometheus metrics
+`maglevd` exposes Prometheus metrics on `--metrics-addr` (default `:9091`) at
+the `/metrics` path. Metric families:
+**Health-check and backend state (gauges, on-demand):**
+| Metric | Labels | Description |
+|---|---|---|
+| `maglev_backend_state` | `backend`, `address`, `healthcheck`, `state` | 1 for the current state row per backend, 0 otherwise. |
+| `maglev_backend_health` | `backend` | Current rise/fall counter value. |
+| `maglev_backend_enabled` | `backend` | 1 if enabled, 0 if disabled. |
+| `maglev_frontend_pool_backend_weight` | `frontend`, `pool`, `backend` | Configured weight from YAML. |
+**Probe counters and latency (inline):**
+| Metric | Labels | Description |
+|---|---|---|
+| `maglev_probe_total` | `backend`, `type`, `result`, `code` | Probes executed. `result` is `success` or `failure`. |
+| `maglev_probe_duration_seconds` | `backend`, `type` | Histogram of probe wall time. |
+| `maglev_backend_transitions_total` | `backend`, `from`, `to` | State machine transitions. |
+**VPP integration (when enabled):**
+| Metric | Labels | Description |
+|---|---|---|
+| `maglev_vpp_connected` | — | 1 if maglevd currently has a live VPP connection. |
+| `maglev_vpp_uptime_seconds` | — | Seconds since VPP started (from `/sys/boottime`). |
+| `maglev_vpp_connected_seconds` | — | Seconds since maglevd established the current VPP connection. |
+| `maglev_vpp_info` | `version`, `build_date`, `pid` | Static VPP build metadata; always 1. |
+| `maglev_vpp_api_total` | `msg`, `direction`, `result` | VPP binary-API calls. `direction` is `send` or `recv`; `result` is `success` or `failure`. |
+| `maglev_vpp_lbsync_total` | `scope`, `kind` | Per-mutation sync counters. `scope` is `all` or `vip`; `kind` is one of `vip_added`, `vip_removed`, `as_added`, `as_removed`, `as_weight_updated`. |
+**gRPC server (standard `go-grpc-middleware/prometheus` metrics):**
+`grpc_server_started_total`, `grpc_server_handled_total`,
+`grpc_server_msg_received_total`, `grpc_server_msg_sent_total`, and
+`grpc_server_handling_seconds`, labelled by `grpc_service`, `grpc_method`,
+and `grpc_type`; `grpc_server_handled_total` additionally carries
+`grpc_code`. Every method is pre-registered at zero so time series exist
+on the first scrape.
---
@@ -74,8 +113,12 @@ show version Print build version, commit hash, and build dat
show frontends [<name>] Without name: list all frontend names.
With name: show address, protocol, port, description,
-and pools. Each pool lists its backends with weights
-(if != 100) and marks disabled backends with [disabled].
+and pools. Each pool lists its backends with two
+weight columns:
+weight — configured weight from the YAML
+effective — state-aware weight after pool failover
+(what gets programmed into VPP)
+Disabled backends are marked with [disabled].
show backends [<name>] Without name: list all backend names.
With name: show address, current state (with duration),
@@ -104,13 +147,16 @@ sync vpp lbstate [<name>] Reconcile the VPP load-balancer dataplane from
drift, and once on startup.
set backend <name> pause Stop health checking for a backend. Cancels the probe
-goroutine so no further traffic is sent, and freezes
-the state at whatever it was when paused.
+goroutine so no further traffic is sent, and sets the
+state to 'paused'. The backend's transition history is
+preserved, so 'show backend <name>' still shows where
+it came from.
set backend <name> resume Resume health checking. A fresh probe goroutine is
started and the backend re-enters unknown state.
-set backend <name> disable Stop probing entirely and remove the backend from rotation.
-The backend remains visible (state: disabled) and can be
-re-enabled without reloading configuration.
+set backend <name> disable Stop probing entirely and remove the backend from
+rotation. The backend remains visible (state: disabled)
+with its transition history intact and can be re-enabled
+without reloading configuration.
set backend <name> enable Re-enable a disabled backend. A fresh probe goroutine is
started and the backend re-enters unknown state.