Pim van Pelt d5fbf5c640 Prometheus: add VPP, LB sync, and gRPC metrics; expand docs
New metrics plus the corresponding documentation for everything that's
accumulated since the last Prometheus pass.

internal/metrics/metrics.go
- New VPPSource interface (IsConnected, VPPInfo) plus a metrics-local
  VPPInfo struct that mirrors vpp.Info. Decoupling via interface +
  struct-mirror keeps the dependency direction one-way (vpp → metrics),
  so vpp can import metrics to update inline counters without a cycle.
- New Collector gauges scraped on demand: maglev_vpp_connected,
  maglev_vpp_uptime_seconds (from /sys/boottime), maglev_vpp_connected_seconds
  (time since maglevd connected), and maglev_vpp_info (static 1-gauge
  carrying version, build_date, and pid as labels).
- New inline counters:
  - maglev_vpp_api_total{msg, direction, result} — bumped from the
    loggedChannel wrapper on every VPP binary-API send/recv. Gives full
    visibility into what maglevd is doing with VPP, broken down by
    message name, direction (send/recv), and result (success/failure).
  - maglev_vpp_lbsync_total{scope, kind} — bumped from the reconciler
    at the end of each SyncLBStateAll/SyncLBStateVIP run. kind ∈
    {vip_added, vip_removed, as_added, as_removed, as_weight_updated};
    scope ∈ {all, vip}. Zero-valued kinds are not emitted so noise
    stays low.
- Register() signature now takes a VPPSource (may be nil) alongside
  the existing StateSource.

internal/vpp/client.go
- New VPPInfo() (metrics.VPPInfo, bool) shim method on *Client that
  satisfies metrics.VPPSource. Returns (_, false) when disconnected so
  the collector skips the vpp_* gauges cleanly.

internal/vpp/apilog.go
- The loggedChannel's SendRequest / SendMultiRequest / ReceiveReply
  paths now call metrics.VPPAPITotal.WithLabelValues(...).Inc() in
  addition to slog.Debug. Since every VPP API call in the codebase
  must go through loggedChannel (NewAPIChannel is unexported), this
  one instrumentation point catches everything.

internal/vpp/lbsync.go
- New recordSyncStats(scope, st) helper called once at the end of
  SyncLBStateAll and SyncLBStateVIP to bump maglev_vpp_lbsync_total.
  Zero-valued stats are skipped.

cmd/maglevd/main.go
- Added github.com/grpc-ecosystem/go-grpc-middleware/providers/prometheus
  for the standard gRPC server metrics (grpc_server_started_total,
  grpc_server_handled_total, grpc_server_handling_seconds, etc.,
  labelled by service/method/type/code).
- Constructs grpcprom.NewServerMetrics(WithServerHandlingTimeHistogram())
  before creating the grpc.Server, installs it as UnaryInterceptor +
  StreamInterceptor, then calls InitializeMetrics(srv) after service
  registration so every method appears at 0 on the first scrape
  instead of materialising lazily on first RPC.
- Passes the vppClient (or nil) as a metrics.VPPSource to
  metrics.Register so the vpp_* gauges are emitted when integration
  is enabled and silently omitted otherwise.

docs/user-guide.md
- New 'Prometheus metrics' section in the maglevd chapter,
  tabulating every metric family: backend state gauges, probe
  counters/histogram, transition counters, the new VPP gauges and
  counters, and the standard gRPC server metrics.
- 'show frontends <name>' description updated to mention the two
  weight columns ('weight' = configured from YAML, 'effective' =
  state-aware after pool-failover logic).
- Pause / disable descriptions clarified: transition history is
  preserved across these operator actions.

docs/healthchecks.md
- New 'Static (no-healthcheck) backends' section explaining that
  backends without a healthcheck use rise/fall=1, fire a synthetic
  passing probe immediately on startup (no 30s wait), and idle at
  30s between iterations thereafter.
- New 'Pool failover' section documenting the priority-tier model,
  the active-pool definition, when promotion happens, cascading to
  further tiers, and graceful drain on demotion. Points readers at
  'maglevc show frontends <name>' as the inspection interface.

docs/config-guide.md
- healthcheck field doc now describes static-backend behavior and
  cross-references healthchecks.md.
- pools field doc now explains failover semantics at a high level
  and cross-references the detailed healthchecks.md section.
2026-04-12 13:00:35 +02:00
2026-04-10 22:22:56 +02:00

maglevd

Health checker and gRPC control plane for VPP Maglev load balancing.

Build and Install

make          # builds build/<arch>/maglevd and build/<arch>/maglevc
make test     # runs all tests
make pkg-deb  # Creates a debian package for arm64 and amd64

Requires Go 1.25+ and (for make proto) protoc with protoc-gen-go and protoc-gen-go-grpc.

Produces vpp-maglev_<version>_amd64.deb and vpp-maglev_<version>_arm64.deb in the build/ directory by cross-compiling with GOOS=linux GOARCH=<arch>. Requires dpkg-deb (available on any Debian/Ubuntu host).

Running

After installing, the unit is enabled but not started automatically:

# edit /etc/vpp-maglev/maglev.yaml, then:
systemctl enable --now vpp-maglevd

Or run the server and client by hand:

maglevd --config /etc/vpp-maglev/maglev.yaml --grpc-addr :9090
maglevd --version                        # print version and exit

maglevc --server localhost:9090          # interactive shell
maglevc show frontends                   # one-shot
maglevc -color=false show backends       # one-shot, no ANSI color
maglevc set backend nginx0-ams pause

Send SIGHUP to maglevd to reload config without restarting. maglevd requires CAP_NET_RAW for ICMP health checks.

Check out a minimal configuration file in [debian/maglev.yaml]. See docs/user-guide.md for flags, signals, and maglevc usage. See docs/config-guide.md for the full configuration reference. See docs/healthchecks.md for health state machine details.

Docker

docker build -t maglevd .
docker run --cap-add NET_RAW -v /etc/vpp-maglev:/etc/vpp-maglev maglevd
Description
A health-checking maglev controlplane for VPP
Readme Apache-2.0 1.6 MiB
Languages
Go 79%
TypeScript 12.8%
CSS 2.9%
Makefile 2.2%
RobotFramework 2.2%
Other 0.8%