New metrics plus the corresponding documentation for everything that's
accumulated since the last Prometheus pass.
internal/metrics/metrics.go
- New VPPSource interface (IsConnected, VPPInfo) plus a metrics-local
VPPInfo struct that mirrors vpp.Info. Decoupling via interface +
struct-mirror keeps the dependency direction one-way (vpp → metrics),
so vpp can import metrics to update inline counters without a cycle.
- New Collector gauges scraped on demand: maglev_vpp_connected,
maglev_vpp_uptime_seconds (from /sys/boottime), maglev_vpp_connected_seconds
(time since maglevd connected), and maglev_vpp_info (an info-style gauge
fixed at 1, carrying version, build_date, and pid as labels).
- New inline counters:
- maglev_vpp_api_total{msg, direction, result} — bumped from the
loggedChannel wrapper on every VPP binary-API send/recv. Gives full
visibility into what maglevd is doing with VPP, broken down by
message name, direction (send/recv), and result (success/failure).
- maglev_vpp_lbsync_total{scope, kind} — bumped from the reconciler
at the end of each SyncLBStateAll/SyncLBStateVIP run. kind ∈
{vip_added, vip_removed, as_added, as_removed, as_weight_updated};
scope ∈ {all, vip}. Zero-valued kinds are not emitted so noise
stays low.
- Register() signature now takes a VPPSource (may be nil) alongside
the existing StateSource.
internal/vpp/client.go
- New VPPInfo() (metrics.VPPInfo, bool) shim method on *Client that
satisfies metrics.VPPSource. Returns (_, false) when disconnected so
the collector skips the vpp_* gauges cleanly.
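Collapsed into one file, the interface pair and the shim might look like this (a sketch; field names beyond version/build_date/pid, and the Client internals, are assumptions):

```go
package main

import "fmt"

// Sketch of the metrics-side types. In the real tree these live in
// internal/metrics; vpp imports metrics, never the reverse.
type VPPInfo struct {
	Version   string // field names are illustrative
	BuildDate string
	PID       int
}

type VPPSource interface {
	IsConnected() bool
	VPPInfo() (VPPInfo, bool)
}

// Sketch of the vpp-side shim on *Client.
type Client struct {
	connected bool
	version   string
}

func (c *Client) IsConnected() bool { return c.connected }

// VPPInfo returns (_, false) when disconnected so the collector can
// skip the vpp_* gauges cleanly.
func (c *Client) VPPInfo() (VPPInfo, bool) {
	if !c.connected {
		return VPPInfo{}, false
	}
	return VPPInfo{Version: c.version}, true
}

func main() {
	var src VPPSource = &Client{connected: false}
	if _, ok := src.VPPInfo(); !ok {
		fmt.Println("disconnected: vpp_* gauges skipped")
	}
}
```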
internal/vpp/apilog.go
- The loggedChannel's SendRequest / SendMultiRequest / ReceiveReply
paths now call metrics.VPPAPITotal.WithLabelValues(...).Inc() in
addition to slog.Debug. Since every VPP API call in the codebase
must go through loggedChannel (NewAPIChannel is unexported), this
one instrumentation point catches everything.
internal/vpp/lbsync.go
- New recordSyncStats(scope, st) helper called once at the end of
SyncLBStateAll and SyncLBStateVIP to bump maglev_vpp_lbsync_total.
Zero-valued stats are skipped.
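The skip-zero behaviour might be shaped like this (a sketch: the stats field names are hypothetical, and a plain map stands in for the prometheus CounterVec to keep the example dependency-free):

```go
package main

import "fmt"

// syncStats is a hypothetical shape of the reconciler's per-run stats;
// real field names may differ.
type syncStats struct {
	VIPAdded, VIPRemoved, ASAdded, ASRemoved, ASWeightUpdated int
}

// counter stands in for the maglev_vpp_lbsync_total{scope, kind} CounterVec.
var counter = map[[2]string]int{}

// recordSyncStats bumps one label pair per non-zero kind, so zero-valued
// kinds never appear in the scrape output.
func recordSyncStats(scope string, st syncStats) {
	for kind, n := range map[string]int{
		"vip_added":         st.VIPAdded,
		"vip_removed":       st.VIPRemoved,
		"as_added":          st.ASAdded,
		"as_removed":        st.ASRemoved,
		"as_weight_updated": st.ASWeightUpdated,
	} {
		if n > 0 { // skip zero-valued stats
			counter[[2]string{scope, kind}] += n
		}
	}
}

func main() {
	recordSyncStats("all", syncStats{ASAdded: 2})
	fmt.Println(len(counter)) // only the non-zero kind was emitted
}
```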
cmd/maglevd/main.go
- Added github.com/grpc-ecosystem/go-grpc-middleware/providers/prometheus
for the standard gRPC server metrics (grpc_server_started_total,
grpc_server_handled_total, grpc_server_handling_seconds, etc.,
labelled by service/method/type/code).
- Constructs grpcprom.NewServerMetrics(WithServerHandlingTimeHistogram())
before creating the grpc.Server, installs it as UnaryInterceptor +
StreamInterceptor, then calls InitializeMetrics(srv) after service
registration so every method appears at 0 on the first scrape
instead of materialising lazily on first RPC.
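As a fragment, the wiring is roughly the following (sketched from the go-grpc-middleware v2 provider's documented API, not lifted from cmd/maglevd/main.go):

```go
srvMetrics := grpcprom.NewServerMetrics(
	grpcprom.WithServerHandlingTimeHistogram(),
)
prometheus.MustRegister(srvMetrics)

srv := grpc.NewServer(
	grpc.UnaryInterceptor(srvMetrics.UnaryServerInterceptor()),
	grpc.StreamInterceptor(srvMetrics.StreamServerInterceptor()),
)

// ... register services on srv ...

// After registration, pre-populate every method's counters at 0 so the
// first scrape shows the full method set.
srvMetrics.InitializeMetrics(srv)
```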
- Passes the vppClient (or nil) as a metrics.VPPSource to
metrics.Register so the vpp_* gauges are emitted when integration
is enabled and silently omitted otherwise.
docs/user-guide.md
- New 'Prometheus metrics' section in the maglevd chapter,
tabulating every metric family: backend state gauges, probe
counters/histogram, transition counters, the new VPP gauges and
counters, and the standard gRPC server metrics.
- 'show frontends <name>' description updated to mention the two
weight columns ('weight' = configured from YAML, 'effective' =
state-aware after pool-failover logic).
- Pause / disable descriptions clarified: transition history is
preserved across these operator actions.
docs/healthchecks.md
- New 'Static (no-healthcheck) backends' section explaining that
backends without a healthcheck use rise/fall=1, fire a synthetic
passing probe immediately on startup (no 30s wait), and idle at
30s between iterations thereafter.
- New 'Pool failover' section documenting the priority-tier model,
the active-pool definition, when promotion happens, cascading to
further tiers, and graceful drain on demotion. Points readers at
'maglevc show frontends <name>' as the inspection interface.
docs/config-guide.md
- healthcheck field doc now describes static-backend behavior and
cross-references healthchecks.md.
- pools field doc now explains failover semantics at a high level
and cross-references the detailed healthchecks.md section.
# Health Checking

`maglevd` probes each backend independently of how many frontends reference it.
Every backend runs exactly one probe goroutine. State changes are broadcast as
gRPC events to all connected `WatchEvents` subscribers.

---

## States

| State | Meaning |
|---|---|
| `unknown` | Initial state; also entered after a resume or enable. |
| `up` | Backend is healthy and eligible to receive traffic. |
| `down` | Backend has failed enough consecutive probes to be considered offline. |
| `paused` | Health checking stopped by an operator. No probes are sent. |
| `disabled` | Backend disabled by an operator. No probes are sent. |
| `removed` | Backend removed from configuration by a reload. No probes are sent. |

---

## Rise / fall counter

The state machine is driven by a single-integer health counter modelled on
HAProxy's.

```
counter ∈ [0, rise + fall − 1]   (called Max below)

backend is UP   when counter ≥ rise
backend is DOWN when counter < rise
```

On each probe:

- **pass** — counter increments, ceiling at Max.
- **fail** — counter decrements, floor at 0.

This gives **hysteresis**: a backend that is fully healthy (counter = Max) needs
`fall` consecutive failures before it transitions to down. A backend that is
fully down (counter = 0) needs `rise` consecutive passes to come back up. A
backend that oscillates between passing and failing stays in the degraded range
without bouncing between up and down.

### Expedited unknown resolution

When a backend enters `unknown` state (new, restarted, resumed, or re-enabled),
its counter is pre-loaded to `rise − 1`. This means a single probe result is
enough to resolve the state:

- **1 pass** → `up`
- **1 fail** → `down` (also via the special unknown shortcut below)

In addition, any failure while state is `unknown` transitions immediately to
`down`, regardless of the counter value.

### Example: rise=2, fall=3 (Max=4)

```
counter:   0     1     2     3     4
state:   DOWN  DOWN   UP    UP    UP
                       ^
                 rise boundary
```

A backend starting from unknown has counter=1 (rise−1). One pass → counter=2
→ up. One fail while unknown → down immediately.

A backend that just became up sits at counter=2, right at the rise boundary:
a single failure (2→1) takes it back down.

A backend that has been fully healthy for a while sits at counter=4. It needs 3
failures to go down (4→3→2→1, crossing the rise boundary at 2→1).
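The counter rules above condense into a few lines of Go (a sketch, not the maglevd source; all names are invented):

```go
package main

import "fmt"

type worker struct {
	rise, fall int
	counter    int
	unknown    bool // true until the first probe result resolves state
}

// newWorker pre-loads the counter to rise−1 so one probe resolves unknown.
func newWorker(rise, fall int) *worker {
	return &worker{rise: rise, fall: fall, counter: rise - 1, unknown: true}
}

func (w *worker) max() int { return w.rise + w.fall - 1 }

func (w *worker) up() bool { return !w.unknown && w.counter >= w.rise }

// probe applies one result: pass increments (ceiling Max), fail decrements
// (floor 0); any failure while unknown resolves straight to down.
func (w *worker) probe(pass bool) {
	if pass {
		if w.counter < w.max() {
			w.counter++
		}
	} else {
		if w.counter > 0 {
			w.counter--
		}
		if w.unknown {
			w.counter = 0 // unknown shortcut: immediately down
		}
	}
	w.unknown = false
}

func main() {
	w := newWorker(2, 3) // rise=2, fall=3, Max=4
	w.probe(true)        // unknown → up after a single pass
	fmt.Println(w.up(), w.counter) // true 2
}
```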

---

## Probe intervals

The interval used between probes depends on the backend's counter state:

| Condition | Interval used |
|---|---|
| State is `unknown` | `fast-interval` (falls back to `interval`) |
| Counter = Max (fully healthy) | `interval` |
| Counter = 0 (fully down) | `down-interval` (falls back to `interval`) |
| Counter between 0 and Max (degraded) | `fast-interval` (falls back to `interval`) |

Using `fast-interval` in degraded and unknown states means a flapping or
recovering backend is re-evaluated quickly without waiting a full `interval`.
Using `down-interval` for fully down backends reduces probe traffic to servers
that are known to be offline.

---

## Transition events

Every state change is logged as `backend-transition` and emitted as a gRPC
`BackendEvent` to all active `WatchEvents` streams.

### Backend added (config load or reload)

```
unknown → unknown (code: start)
```

The counter is pre-loaded to `rise − 1`. The first probe fires after
`fast-interval` (or `interval` if not configured). One pass produces `unknown →
up`; one fail produces `unknown → down`.

If multiple backends start together they are staggered across the first
`interval` to avoid probe bursts.

### Probe pass

- Counter increments.
- If counter reaches `rise` from below: `down → up` (or `unknown → up`).
- If already up: no transition. Next probe at `fast-interval` if degraded,
  `interval` if fully healthy.

### Probe fail

- Counter decrements.
- If counter drops below `rise` from above: `up → down`.
- If state is `unknown`: transition immediately to `down` regardless of counter.
- Next probe at `down-interval` if fully down, `fast-interval` if degraded.

### Pause

```
<any> → paused (operator action)
```

The counter is reset to 0. The probe goroutine is cancelled — no further
probes are sent and no traffic reaches the backend while it is paused. The
backend stays `paused` until explicitly resumed.

### Resume

```
paused → unknown (operator action)
```

The counter is reset to `rise − 1`. A fresh probe goroutine is started,
which fires its first probe after `fast-interval` (or `interval` if not
configured). One pass produces `unknown → up`; one fail produces `unknown →
down`.

### Disable

```
<any> → disabled (operator action)
```

The probe goroutine is cancelled and the backend is marked `enabled: false`.
No further probes are sent. The backend remains visible via the gRPC API (state
`disabled`) and can be re-enabled without a config reload.

### Enable

```
disabled → unknown (operator action, via fresh goroutine)
```

A new probe goroutine is started and the backend re-enters `unknown` with the
counter pre-loaded to `rise − 1`. The `enabled` flag is set back to `true`.
The first probe fires after `fast-interval` and resolves state as described
under *Backend added*.

### Backend removed (config reload)

```
<any> → removed (code: removed)
```

The probe goroutine stops. No further state changes occur. The removed event is
emitted using the frontend map from before the reload so that consumers can
correlate it to the correct frontend.

### Backend healthcheck config changed (config reload)

The old probe goroutine is stopped (`<any> → removed`) and a new one started
(`unknown → unknown`, code: `start`). The new goroutine resolves state on the
first probe as described under *Backend added* above.

### Backend metadata changed without healthcheck change (config reload)

Weight, enabled flag, and similar fields are updated in place. The probe
goroutine is not restarted and no transition event is emitted.

---

## Static (no-healthcheck) backends

A backend with no `healthcheck` field in YAML skips the probe loop entirely.
Instead of actually probing, `maglevd` synthesises a single passing result
on startup. Specifically:

- The worker's rise/fall counters are forced to `1/1`, so a single synthetic
  pass is enough to reach `StateUp`.
- The first "probe" fires immediately (zero sleep). Subsequent iterations
  idle at 30 seconds — there is nothing to do.
- The backend reaches `up` within milliseconds of startup.

Static backends are useful for administrative VIPs where the caller knows the
backend is always available, or for test configurations where deterministic
state is more valuable than real health signals.

---

## Pool failover

Every frontend has one or more pools. The pools are priority tiers: pool[0]
is the primary, pool[1] is the first fallback, pool[2] the next, and so on.
At any moment, `maglevd` computes an **active pool** — the first pool that
contains at least one backend in `StateUp`:

- As long as pool[0] has any up backend, it stays active. Its up backends
  receive traffic at their configured weights; backends in lower-priority
  pools stay on standby with effective weight 0.
- When pool[0] has zero up backends (all down, paused, disabled, or still
  unknown), pool[1] is promoted: its up backends get their configured
  weights, and pool[0] backends stay at 0 until at least one recovers.
- The same rule cascades to pool[2], pool[3], etc., for further fallback
  tiers.
- When no pool has any up backend, every backend's effective weight is 0
  and the VIP serves nothing.

Failover is evaluated on every backend state transition and also on the
periodic VPP drift reconciliation (every `maglev.vpp.lb.sync-interval`).
The resulting effective weight for each backend can be inspected via
`maglevc show frontends <name>` — each pool backend row shows both the
configured weight and the effective weight after failover.

Demotion on recovery (e.g. pool[1] → standby when pool[0] comes back up)
drains gracefully: the demoted backends have their weight set to 0 but
existing flows in the VPP flow table are left to drain naturally. The only
state that forces immediate flow-table flushing is operator `disable`.
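The active-pool rule above can be sketched in a few lines (illustrative types, not the maglevd source):

```go
package main

import "fmt"

type backend struct {
	name   string
	weight int
	up     bool
}

// effectiveWeights picks the first pool with at least one up backend and
// gives its up backends their configured weights; everything else gets 0.
func effectiveWeights(pools [][]backend) map[string]int {
	eff := map[string]int{}
	for _, p := range pools {
		for _, b := range p {
			eff[b.name] = 0 // default: standby / down
		}
	}
	for _, p := range pools { // priority order: pool[0] first
		active := false
		for _, b := range p {
			if b.up {
				active = true
			}
		}
		if active {
			for _, b := range p {
				if b.up {
					eff[b.name] = b.weight
				}
			}
			return eff // lower tiers stay on standby
		}
	}
	return eff // no pool has an up backend: VIP serves nothing
}

func main() {
	pools := [][]backend{
		{{"primary0", 100, false}}, // pool[0] all down
		{{"backup0", 50, true}},    // pool[1] promoted
	}
	fmt.Println(effectiveWeights(pools)["backup0"]) // prints 50
}
```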

---

## Log lines

All state changes produce a structured log line at `INFO` level:

```json
{"level":"INFO","msg":"backend-transition","backend":"nginx0-ams","from":"up","to":"paused"}
{"level":"INFO","msg":"backend-transition","backend":"nginx0-ams","from":"paused","to":"unknown"}
{"level":"INFO","msg":"backend-transition","backend":"nginx0-ams","from":"unknown","to":"up","code":"L7OK","detail":""}
```

Probe-driven transitions also carry `code` and `detail` fields from the probe
result (e.g. `L4CON`, `L7STS`, `connection refused`). Operator-driven
transitions (pause, resume, disable, enable) carry empty code and detail.