vpp-maglev/internal/metrics/metrics.go
Pim van Pelt d5fbf5c640 Prometheus: add VPP, LB sync, and gRPC metrics; expand docs
New metrics plus the corresponding documentation for everything that's
accumulated since the last Prometheus pass.

internal/metrics/metrics.go
- New VPPSource interface (IsConnected, VPPInfo) plus a metrics-local
  VPPInfo struct that mirrors vpp.Info. Decoupling via interface +
  struct-mirror keeps the dependency direction one-way (vpp → metrics),
  so vpp can import metrics to update inline counters without a cycle.
- New Collector gauges scraped on demand: maglev_vpp_connected,
  maglev_vpp_uptime_seconds (from /sys/boottime), maglev_vpp_connected_seconds
  (time since maglevd connected), and maglev_vpp_info (static 1-gauge
  carrying version, build_date, and pid as labels).
- New inline counters:
  - maglev_vpp_api_total{msg, direction, result} — bumped from the
    loggedChannel wrapper on every VPP binary-API send/recv. Gives full
    visibility into what maglevd is doing with VPP, broken down by
    message name, direction (send/recv), and result (success/failure).
  - maglev_vpp_lbsync_total{scope, kind} — bumped from the reconciler
    at the end of each SyncLBStateAll/SyncLBStateVIP run. kind ∈
    {vip_added, vip_removed, as_added, as_removed, as_weight_updated};
    scope ∈ {all, vip}. Zero-valued kinds are not emitted so noise
    stays low.
- Register() signature now takes a VPPSource (may be nil) alongside
  the existing StateSource.

internal/vpp/client.go
- New VPPInfo() (metrics.VPPInfo, bool) shim method on *Client that
  satisfies metrics.VPPSource. Returns (_, false) when disconnected so
  the collector skips the vpp_* gauges cleanly.

internal/vpp/apilog.go
- The loggedChannel's SendRequest / SendMultiRequest / ReceiveReply
  paths now call metrics.VPPAPITotal.WithLabelValues(...).Inc() in
  addition to slog.Debug. Since every VPP API call in the codebase
  must go through loggedChannel (NewAPIChannel is unexported), this
  one instrumentation point catches everything.
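
A minimal sketch of the single-instrumentation-point idea, assuming a much simpler channel shape than the real one. The sender interface, countingChannel, and flaky are hypothetical, a plain map stands in for the CounterVec, and the message names are only illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// sender is a hypothetical, heavily simplified stand-in for a VPP API channel.
type sender interface {
	Send(msg string) error
}

// labels is the {msg, direction, result} label set of maglev_vpp_api_total.
type labels struct{ msg, direction, result string }

// countingChannel wraps a sender and counts every call by label set.
// The map stands in for the CounterVec and is not safe for concurrent use.
type countingChannel struct {
	inner  sender
	counts map[labels]int
}

func newCounting(inner sender) *countingChannel {
	return &countingChannel{inner: inner, counts: map[labels]int{}}
}

// Send forwards to the wrapped sender and bumps the matching counter.
func (c *countingChannel) Send(msg string) error {
	err := c.inner.Send(msg)
	result := "success"
	if err != nil {
		result = "failure"
	}
	c.counts[labels{msg, "send", result}]++
	return err
}

// flaky fails for one message name so both outcomes can be observed.
type flaky struct{ failOn string }

func (f flaky) Send(msg string) error {
	if msg == f.failOn {
		return errors.New("send failed")
	}
	return nil
}

func main() {
	ch := newCounting(flaky{failOn: "lb_add_del_vip"})
	_ = ch.Send("show_version")
	_ = ch.Send("show_version")
	_ = ch.Send("lb_add_del_vip")
	for k, v := range ch.counts {
		fmt.Printf("%s %s %s = %d\n", k.msg, k.direction, k.result, v)
	}
}
```

Because callers only ever hold the wrapper, no call site needs its own counter update.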

internal/vpp/lbsync.go
- New recordSyncStats(scope, st) helper called once at the end of
  SyncLBStateAll and SyncLBStateVIP to bump maglev_vpp_lbsync_total.
  Zero-valued stats are skipped.
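
The zero-skipping behaviour might look roughly like this. It is a sketch under assumptions: the syncStats field names and the record helper are hypothetical, and a map stands in for maglev_vpp_lbsync_total. It also assumes the counter is advanced by the number of mutations rather than by one per run.

```go
package main

import "fmt"

// syncStats is a hypothetical shape for the result of one sync run; the
// field names are assumptions chosen to match the kind label values.
type syncStats struct {
	VIPAdded, VIPRemoved, ASAdded, ASRemoved, ASWeightUpdated int
}

// key is the {scope, kind} label set.
type key struct{ scope, kind string }

// record adds every non-zero stat to counts; zero-valued kinds never
// create a series, which keeps the scrape output small.
func record(counts map[key]int, scope string, st syncStats) {
	entries := []struct {
		kind string
		n    int
	}{
		{"vip_added", st.VIPAdded},
		{"vip_removed", st.VIPRemoved},
		{"as_added", st.ASAdded},
		{"as_removed", st.ASRemoved},
		{"as_weight_updated", st.ASWeightUpdated},
	}
	for _, e := range entries {
		if e.n == 0 {
			continue
		}
		counts[key{scope, e.kind}] += e.n
	}
}

func main() {
	counts := map[key]int{}
	record(counts, "all", syncStats{VIPAdded: 1, ASAdded: 3})
	record(counts, "vip", syncStats{ASWeightUpdated: 2})
	for k, v := range counts {
		fmt.Printf("%s %s = %d\n", k.scope, k.kind, v)
	}
}
```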

cmd/maglevd/main.go
- Added github.com/grpc-ecosystem/go-grpc-middleware/providers/prometheus
  for the standard gRPC server metrics (grpc_server_started_total,
  grpc_server_handled_total, grpc_server_handling_seconds, etc.,
  labelled by service/method/type/code).
- Constructs grpcprom.NewServerMetrics(WithServerHandlingTimeHistogram())
  before creating the grpc.Server, installs it as UnaryInterceptor +
  StreamInterceptor, then calls InitializeMetrics(srv) after service
  registration so every method appears at 0 on the first scrape
  instead of materialising lazily on first RPC.
- Passes the vppClient (or nil) as a metrics.VPPSource to
  metrics.Register so the vpp_* gauges are emitted when integration
  is enabled and silently omitted otherwise.

docs/user-guide.md
- New 'Prometheus metrics' section in the maglevd chapter,
  tabulating every metric family: backend state gauges, probe
  counters/histogram, transition counters, the new VPP gauges and
  counters, and the standard gRPC server metrics.
- 'show frontends <name>' description updated to mention the two
  weight columns ('weight' = configured from YAML, 'effective' =
  state-aware after pool-failover logic).
- Pause / disable descriptions clarified: transition history is
  preserved across these operator actions.

docs/healthchecks.md
- New 'Static (no-healthcheck) backends' section explaining that
  backends without a healthcheck use rise/fall=1, fire a synthetic
  passing probe immediately on startup (no 30s wait), and idle at
  30s between iterations thereafter.
- New 'Pool failover' section documenting the priority-tier model,
  the active-pool definition, when promotion happens, cascading to
  further tiers, and graceful drain on demotion. Points readers at
  'maglevc show frontends <name>' as the inspection interface.

docs/config-guide.md
- healthcheck field doc now describes static-backend behavior and
  cross-references healthchecks.md.
- pools field doc now explains failover semantics at a high level
  and cross-references the detailed healthchecks.md section.
2026-04-12 13:00:35 +02:00


// Copyright (c) 2026, Pim van Pelt <pim@ipng.ch>

// Package metrics exposes Prometheus metrics for maglevd.
//
// Gauge-type metrics (backend state, health counter, weights, VPP connection
// info) are collected on demand when Prometheus scrapes /metrics via the
// Collector. Counter and histogram metrics (probe totals, probe duration,
// transitions, VPP API calls, LB sync operations) are updated inline from
// the probe loop and VPP sync paths.
package metrics

import (
	"fmt"
	"time"

	"git.ipng.ch/ipng/vpp-maglev/internal/config"
	"git.ipng.ch/ipng/vpp-maglev/internal/health"
	"github.com/prometheus/client_golang/prometheus"
)
// BackendInfo holds the health and config state needed by the collector.
type BackendInfo struct {
	Health  *health.Backend
	Enabled bool
	HCName  string // healthcheck name from config
}

// StateSource provides read-only access to the running checker state.
type StateSource interface {
	ListBackends() []string
	GetBackendInfo(name string) (BackendInfo, bool)
	ListFrontends() []string
	GetFrontend(name string) (config.Frontend, bool)
}

// VPPInfo mirrors vpp.Info so the metrics package doesn't need to import
// internal/vpp (which would create an import cycle, because vpp imports
// metrics to update counters inline).
type VPPInfo struct {
	Version        string
	BuildDate      string
	PID            uint32
	BootTime       time.Time
	ConnectedSince time.Time
}
// VPPSource provides read-only access to the VPP client's state. *vpp.Client
// satisfies this interface through a small VPPInfo shim method defined in
// internal/vpp, so the metrics package stays decoupled from the vpp
// package's concrete types.
type VPPSource interface {
	IsConnected() bool
	VPPInfo() (VPPInfo, bool)
}
// ---- inline metrics (updated from the probe loop and VPP paths) ------------
var (
	ProbeTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: "maglev",
		Subsystem: "probe",
		Name:      "total",
		Help:      "Total number of health-check probes executed.",
	}, []string{"backend", "type", "result", "code"})

	ProbeDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Namespace: "maglev",
		Subsystem: "probe",
		Name:      "duration_seconds",
		Help:      "Health-check probe duration in seconds.",
		Buckets:   []float64{.001, .0025, .005, .01, .025, .05, .1, .25, .5, 1, 2.5},
	}, []string{"backend", "type"})

	TransitionTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: "maglev",
		Subsystem: "backend",
		Name:      "transitions_total",
		Help:      "Total number of backend state transitions.",
	}, []string{"backend", "from", "to"})

	// ---- VPP API counters ---------------------------------------------------

	VPPAPITotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: "maglev",
		Subsystem: "vpp_api",
		Name:      "total",
		Help:      "Total number of VPP binary-API messages sent to or received from VPP.",
	}, []string{"msg", "direction", "result"})

	// ---- LB sync counters ---------------------------------------------------

	// LBSyncTotal counts individual dataplane mutations performed by the
	// sync path. kind ∈ {vip_added, vip_removed, as_added, as_removed,
	// as_weight_updated}; scope ∈ {all, vip}.
	LBSyncTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: "maglev",
		Subsystem: "vpp_lbsync",
		Name:      "total",
		Help:      "Total number of VPP load-balancer sync operations applied to the dataplane.",
	}, []string{"scope", "kind"})
)
// ---- collector (scraped on demand) -----------------------------------------

// Collector implements prometheus.Collector by querying the running checker
// on each scrape. This avoids stale label sets when backends are added or
// removed by a config reload.
type Collector struct {
	src StateSource
	vpp VPPSource // optional; nil when VPP integration is disabled

	backendState     *prometheus.Desc
	backendHealth    *prometheus.Desc
	backendEnabled   *prometheus.Desc
	poolWeight       *prometheus.Desc
	vppConnected     *prometheus.Desc
	vppUptimeSeconds *prometheus.Desc
	vppConnectedFor  *prometheus.Desc
	vppInfo          *prometheus.Desc
}
// NewCollector creates a Collector backed by the given StateSource. vpp may
// be nil when VPP integration is disabled; in that case vpp_* metrics are
// simply not emitted.
func NewCollector(src StateSource, vpp VPPSource) *Collector {
	return &Collector{
		src: src,
		vpp: vpp,
		backendState: prometheus.NewDesc(
			"maglev_backend_state",
			"Current backend state (1 = active for the given state label).",
			[]string{"backend", "address", "healthcheck", "state"}, nil,
		),
		backendHealth: prometheus.NewDesc(
			"maglev_backend_health",
			"Current health counter value.",
			[]string{"backend"}, nil,
		),
		backendEnabled: prometheus.NewDesc(
			"maglev_backend_enabled",
			"Whether the backend is enabled (1) or disabled (0).",
			[]string{"backend"}, nil,
		),
		poolWeight: prometheus.NewDesc(
			"maglev_frontend_pool_backend_weight",
			"Configured weight of a backend in a frontend pool (0-100).",
			[]string{"frontend", "pool", "backend"}, nil,
		),
		vppConnected: prometheus.NewDesc(
			"maglev_vpp_connected",
			"Whether maglevd currently has an established connection to VPP (1) or not (0).",
			nil, nil,
		),
		vppUptimeSeconds: prometheus.NewDesc(
			"maglev_vpp_uptime_seconds",
			"Seconds since VPP started (from the /sys/boottime stats counter).",
			nil, nil,
		),
		vppConnectedFor: prometheus.NewDesc(
			"maglev_vpp_connected_seconds",
			"Seconds since maglevd established the current VPP connection.",
			nil, nil,
		),
		vppInfo: prometheus.NewDesc(
			"maglev_vpp_info",
			"Static VPP build information. Always 1; metadata is conveyed via labels.",
			[]string{"version", "build_date", "pid"}, nil,
		),
	}
}
// Describe implements prometheus.Collector.
func (c *Collector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.backendState
	ch <- c.backendHealth
	ch <- c.backendEnabled
	ch <- c.poolWeight
	ch <- c.vppConnected
	ch <- c.vppUptimeSeconds
	ch <- c.vppConnectedFor
	ch <- c.vppInfo
}
// Collect implements prometheus.Collector.
func (c *Collector) Collect(ch chan<- prometheus.Metric) {
	states := []health.State{
		health.StateUnknown,
		health.StateUp,
		health.StateDown,
		health.StatePaused,
		health.StateDisabled,
		health.StateRemoved,
	}

	for _, name := range c.src.ListBackends() {
		info, ok := c.src.GetBackendInfo(name)
		if !ok {
			continue
		}
		addr := info.Health.Address.String()

		// One time-series per possible state; the current state is 1, rest 0.
		for _, s := range states {
			val := 0.0
			if info.Health.State == s {
				val = 1.0
			}
			ch <- prometheus.MustNewConstMetric(
				c.backendState, prometheus.GaugeValue, val,
				name, addr, info.HCName, s.String(),
			)
		}

		ch <- prometheus.MustNewConstMetric(
			c.backendHealth, prometheus.GaugeValue,
			float64(info.Health.Counter.Health), name,
		)

		enabled := 0.0
		if info.Enabled {
			enabled = 1.0
		}
		ch <- prometheus.MustNewConstMetric(
			c.backendEnabled, prometheus.GaugeValue, enabled, name,
		)
	}

	for _, feName := range c.src.ListFrontends() {
		fe, ok := c.src.GetFrontend(feName)
		if !ok {
			continue
		}
		for _, pool := range fe.Pools {
			for beName, pb := range pool.Backends {
				ch <- prometheus.MustNewConstMetric(
					c.poolWeight, prometheus.GaugeValue,
					float64(pb.Weight), feName, pool.Name, beName,
				)
			}
		}
	}

	// ---- VPP gauges -------------------------------------------------------
	if c.vpp == nil {
		return
	}
	connected := 0.0
	if c.vpp.IsConnected() {
		connected = 1.0
	}
	ch <- prometheus.MustNewConstMetric(c.vppConnected, prometheus.GaugeValue, connected)

	info, ok := c.vpp.VPPInfo()
	if !ok {
		return
	}
	if !info.BootTime.IsZero() {
		ch <- prometheus.MustNewConstMetric(
			c.vppUptimeSeconds, prometheus.GaugeValue,
			time.Since(info.BootTime).Seconds(),
		)
	}
	if !info.ConnectedSince.IsZero() {
		ch <- prometheus.MustNewConstMetric(
			c.vppConnectedFor, prometheus.GaugeValue,
			time.Since(info.ConnectedSince).Seconds(),
		)
	}
	ch <- prometheus.MustNewConstMetric(
		c.vppInfo, prometheus.GaugeValue, 1.0,
		info.Version, info.BuildDate, fmt.Sprintf("%d", info.PID),
	)
}
// Register registers all metrics with the given registry. vpp may be nil,
// in which case the vpp_* gauges are not emitted; the VPP API and LB sync
// counters are registered either way.
func Register(reg prometheus.Registerer, src StateSource, vpp VPPSource) *Collector {
	coll := NewCollector(src, vpp)
	reg.MustRegister(coll)
	reg.MustRegister(ProbeTotal)
	reg.MustRegister(ProbeDuration)
	reg.MustRegister(TransitionTotal)
	reg.MustRegister(VPPAPITotal)
	reg.MustRegister(LBSyncTotal)
	return coll
}