New feature: per-VIP / per-backend runtime counters
* New GetVPPLBCounters RPC serving an in-process snapshot refreshed
by a 5s scrape loop (internal/vpp/lbstats.go). Each cycle pulls
the LB plugin's four SimpleCounters (next, first, untracked,
no-server) plus the FIB /net/route/to CombinedCounter for every
VIP and every backend host prefix via a single DumpStats call.
* FIB stats-index discovery via ip_route_lookup
(internal/vpp/fibstats.go); per-worker reduction happens in the
collector.
* Prometheus collector exports vip_packets_total (kind label),
vip_route_{packets,bytes}_total, and
backend_route_{packets,bytes}_total. Metrics source interface
extended with VIPStats / BackendRouteStats; vpp.Client publishes
snapshots via atomic.Pointer and clears them on disconnect.
* New 'show vpp lb counters' CLI command. The 'show vpp lbstate'
and 'sync vpp lbstate' commands are restructured under 'show
vpp lb {state,counters}' / 'sync vpp lb state' to make room
for the new verb.
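The per-worker reduction and the atomic.Pointer publish/clear cycle described above can be sketched roughly as follows. This is a minimal illustration, not the code in internal/vpp: the CombinedCounter, Snapshot and store types and the reduceWorkers helper are hypothetical names chosen for the example, and the real snapshot layout is not shown in this commit message.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// CombinedCounter is a hypothetical packets/bytes pair; the stats segment
// is assumed to hold one such value per worker thread.
type CombinedCounter struct {
	Packets uint64
	Bytes   uint64
}

// reduceWorkers sums the per-worker values of one counter into a total.
func reduceWorkers(perWorker []CombinedCounter) CombinedCounter {
	var total CombinedCounter
	for _, c := range perWorker {
		total.Packets += c.Packets
		total.Bytes += c.Bytes
	}
	return total
}

// Snapshot is a hypothetical immutable result of one scrape cycle.
type Snapshot struct {
	VIPRoute map[string]CombinedCounter // keyed by VIP prefix
}

// store lets the scrape loop publish snapshots that readers (an RPC
// handler, a metrics collector) can load without taking a lock.
type store struct {
	snap atomic.Pointer[Snapshot]
}

func (s *store) publish(snap *Snapshot) { s.snap.Store(snap) }
func (s *store) clear()                 { s.snap.Store(nil) } // e.g. on disconnect
func (s *store) load() *Snapshot        { return s.snap.Load() }

func main() {
	var s store
	total := reduceWorkers([]CombinedCounter{
		{Packets: 10, Bytes: 640},
		{Packets: 5, Bytes: 320},
	})
	s.publish(&Snapshot{VIPRoute: map[string]CombinedCounter{"192.0.2.1/32": total}})
	fmt.Println(s.load().VIPRoute["192.0.2.1/32"])

	s.clear()
	fmt.Println(s.load() == nil)
}
```

Readers must treat a nil load as "no data yet", which is what a cleared snapshot after a disconnect looks like in this sketch.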
New feature: src-ip-sticky frontends
* New frontend YAML key 'src-ip-sticky' (bool). Plumbed through
config.Frontend, desiredVIP, and the lb_add_del_vip_v2 call.
* Reflected in gRPC FrontendInfo.src_ip_sticky and
VPPLBVIP.src_ip_sticky, and shown in 'show vpp lb state' output.
* Scraped back from VPP by parsing 'show lb vips verbose' through
cli_inband — lb_vip_details does not expose the flag. The same
scrape also recovers the LB pool index for each VIP, which the
stats-segment counters are keyed on. This is a documented
temporary workaround until VPP ships an lb_vip_v2_dump.
* src_ip_sticky cannot be mutated on a live VIP, so a flipped flag
triggers a tear-down-and-recreate in reconcileVIP (ASes deleted
with flush, VIP deleted, then re-added). Flip is logged.
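The recreate-on-flip decision can be illustrated with a small sketch. It only models the ordering stated above (ASes deleted with flush, VIP deleted, VIP re-added); the vip type and planSticky function are hypothetical, and the real reconcileVIP is not shown in this commit message.

```go
package main

import "fmt"

// vip is a hypothetical, reduced view of a VIP; only the field that
// matters for this decision is modeled.
type vip struct {
	Prefix      string
	SrcIPSticky bool
}

// planSticky returns the ordered steps needed when the desired sticky
// flag differs from the live one, or nil when the flag already matches.
// Because the flag cannot be changed in place, a mismatch means recreate.
func planSticky(desired, live vip) []string {
	if desired.SrcIPSticky == live.SrcIPSticky {
		return nil
	}
	return []string{
		"delete ASes (flush)",
		"delete VIP",
		"add VIP",
	}
}

func main() {
	live := vip{Prefix: "192.0.2.1/32", SrcIPSticky: false}
	desired := vip{Prefix: "192.0.2.1/32", SrcIPSticky: true}
	for i, step := range planSticky(desired, live) {
		fmt.Printf("%d. %s\n", i+1, step)
	}
}
```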
New feature: frontend state aggregation and events
* New health.FrontendState (unknown/up/down) and FrontendTransition
types. A frontend is 'up' iff at least one backend has a nonzero
effective weight, 'unknown' iff no backend has real state yet,
and 'down' otherwise.
* Checker tracks per-frontend aggregate state, recomputing after
each backend transition and emitting a frontend-transition Event
on change. Reload drops entries for removed frontends.
* checker.Event gains an optional FrontendTransition pointer;
backend- vs. frontend-transition events are demultiplexed on
that field.
* WatchEvents now sends an initial snapshot of frontend state on
connect (mirroring the existing backend snapshot), subscribes
once to the checker stream, and fans out to backend/frontend
handlers based on the client's filter flags. The proto
FrontendEvent message grows name + transition fields.
* New Checker.FrontendState accessor.
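The up/unknown/down rule for a frontend can be sketched as a single pass over its backends. This is an illustration under assumptions: the member type and frontendState function are hypothetical stand-ins, and the actual signature of health.ComputeFrontendState is not shown here.

```go
package main

import "fmt"

// FrontendState mirrors the three aggregate states named above.
type FrontendState int

const (
	FrontendStateUnknown FrontendState = iota
	FrontendStateUp
	FrontendStateDown
)

// member is a hypothetical per-backend input to the reduction.
type member struct {
	Known           bool // false while the backend has no probe result yet
	EffectiveWeight int  // weight after health and failover are applied
}

// frontendState reduces backend inputs to one aggregate state:
// up if any backend has a nonzero effective weight, unknown if no
// backend has real state yet (including no backends at all), else down.
func frontendState(members []member) FrontendState {
	anyKnown := false
	for _, m := range members {
		if m.EffectiveWeight > 0 {
			return FrontendStateUp
		}
		if m.Known {
			anyKnown = true
		}
	}
	if !anyKnown {
		return FrontendStateUnknown
	}
	return FrontendStateDown
}

func main() {
	fmt.Println(frontendState(nil) == FrontendStateUnknown)
	fmt.Println(frontendState([]member{{Known: true, EffectiveWeight: 0}}) == FrontendStateDown)
	fmt.Println(frontendState([]member{{Known: true}, {Known: true, EffectiveWeight: 100}}) == FrontendStateUp)
}
```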
Refactor: pure health helpers
* Moved the priority-failover selector and the (pool idx, active
pool, state, cfg weight) → (vpp weight, flush) mapping out of
internal/vpp/lbsync.go into a new internal/health/weights.go so
the checker can reuse them for frontend-state computation
without importing internal/vpp.
* New functions: health.ActivePoolIndex, BackendEffectiveWeight,
EffectiveWeights, ComputeFrontendState. lbsync.go now calls
these directly; vpp.EffectiveWeights is a thin wrapper over
health.EffectiveWeights retained for the gRPC observability
path. Fully unit-tested in internal/health/weights_test.go.
maglevc polish
* --color default is now mode-aware: on in the interactive shell,
off in one-shot mode so piped output is script-safe. Explicit
--color=true/false still overrides.
* New stripHostMask helper drops /32 and /128 from VIP display;
non-host prefixes pass through unchanged.
* Counter table column order fixed (first before next) and
packets/bytes columns renamed to fib-packets/fib-bytes to
clarify they come from the FIB, not the LB plugin.
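One way to drop host masks for display is to compare the prefix length with the address width, so that an IPv6 /32 is not mistaken for a host route. The sketch below uses net/netip and a hypothetical name, stripHostMaskSketch; the actual stripHostMask helper in maglevc is not shown here and may be written differently.

```go
package main

import (
	"fmt"
	"net/netip"
)

// stripHostMaskSketch returns the bare address for host prefixes
// (/32 for IPv4, /128 for IPv6) and the input unchanged otherwise.
// Strings that do not parse as a prefix are also returned unchanged.
func stripHostMaskSketch(s string) string {
	p, err := netip.ParsePrefix(s)
	if err != nil {
		return s
	}
	// A host prefix covers every bit of the address.
	if p.Bits() == p.Addr().BitLen() {
		return p.Addr().String()
	}
	return s
}

func main() {
	for _, in := range []string{
		"192.0.2.1/32",
		"2001:db8::1/128",
		"2001:db8::/32",
		"198.51.100.0/24",
	} {
		fmt.Printf("%s -> %s\n", in, stripHostMaskSketch(in))
	}
}
```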
Docs
* config-guide: document src-ip-sticky, including the VIP
recreate-on-change caveat.
* user-guide, maglevc.1, maglevd.8: updated command tree, new
counters command, color defaults, and the src-ip-sticky field.
271 lines
8.0 KiB
Go
// Copyright (c) 2026, Pim van Pelt <pim@ipng.ch>

package health

import (
	"net"
	"time"
)

// CheckLayer indicates at which network layer a probe stopped.
type CheckLayer int

const (
	LayerUnknown CheckLayer = iota
	LayerL4                 // TCP connect
	LayerL6                 // TLS handshake
	LayerL7                 // Application (HTTP response, ICMP reply)
)

// ProbeResult is the outcome of a single probe execution.
type ProbeResult struct {
	OK     bool
	Layer  CheckLayer
	Code   string // "L4OK", "L4TOUT", "L4CON", "L7OK", "L7TOUT", "L7RSP", "L7STS"
	Detail string // human-readable, e.g. "HTTP 503", "connection refused"
}

// State represents the health state of a backend.
type State int

const (
	StateUnknown  State = iota // initial state before first probe
	StateUp                    // backend is healthy
	StateDown                  // backend has failed enough probes
	StatePaused                // operator paused health checking
	StateDisabled              // operator disabled the backend
	StateRemoved               // backend removed from configuration by reload
)

// String returns the lowercase name of the state. Values outside the
// declared range are reported as "unknown".
func (s State) String() string {
	switch s {
	case StateUnknown:
		return "unknown"
	case StateUp:
		return "up"
	case StateDown:
		return "down"
	case StatePaused:
		return "paused"
	case StateDisabled:
		return "disabled"
	case StateRemoved:
		return "removed"
	default:
		return "unknown"
	}
}

// Transition records a single state change event.
type Transition struct {
	From   State
	To     State
	At     time.Time
	Result ProbeResult
}

// FrontendState is the aggregated state of a frontend derived from the
// effective weights of its member backends. Frontends do not have their
// own rise/fall counters: they're purely a reduction over backend state.
//
//   - unknown: no backends, or every referenced backend is in StateUnknown
//     (the checker has no probe data yet).
//   - up: at least one backend has effective weight > 0, so the VIP has
//     something to serve.
//   - down: backends exist with real state, but none have effective
//     weight > 0, so the VIP has nothing to serve.
type FrontendState int

const (
	FrontendStateUnknown FrontendState = iota
	FrontendStateUp
	FrontendStateDown
)

// String returns the lowercase name of the frontend state. Values outside
// the declared range are reported as "unknown".
func (s FrontendState) String() string {
	switch s {
	case FrontendStateUnknown:
		return "unknown"
	case FrontendStateUp:
		return "up"
	case FrontendStateDown:
		return "down"
	}
	return "unknown"
}

// FrontendTransition records a frontend state change event.
type FrontendTransition struct {
	From FrontendState
	To   FrontendState
	At   time.Time
}

// HealthCounter is HAProxy's single-integer rise/fall model.
//
// Health ∈ [0, Rise+Fall-1]. Server is UP when Health >= Rise, DOWN when
// Health < Rise. On success Health increments (ceiling Rise+Fall-1); on
// failure Health decrements (floor 0). This gives hysteresis: a flapping
// backend stays in the degraded range without bouncing between UP and DOWN.
type HealthCounter struct {
	Health int
	Rise   int
	Fall   int
}

// Max returns the ceiling of Health, Rise+Fall-1.
func (h *HealthCounter) Max() int { return h.Rise + h.Fall - 1 }

// IsUp reports whether Health is at or above the Rise threshold.
func (h *HealthCounter) IsUp() bool { return h.Health >= h.Rise }

// IsDegraded reports whether Health lies strictly between 0 and Max.
func (h *HealthCounter) IsDegraded() bool { return h.Health > 0 && h.Health < h.Max() }

// RecordPass increments the counter. Returns true if the server just became UP.
func (h *HealthCounter) RecordPass() bool {
	wasUp := h.IsUp()
	if h.Health < h.Max() {
		h.Health++
	}
	return !wasUp && h.IsUp()
}

// RecordFail decrements the counter. Returns true if the server just went DOWN.
func (h *HealthCounter) RecordFail() bool {
	wasDown := !h.IsUp()
	if h.Health > 0 {
		h.Health--
	}
	return !wasDown && !h.IsUp()
}

// Backend tracks the health state of a named backend.
type Backend struct {
	Name        string
	Address     net.IP
	State       State
	Counter     HealthCounter
	Transitions []Transition // newest first, capped at maxHistory
}

// New creates a Backend in StateUnknown with the health counter pre-loaded to
// Rise-1, so the very first probe resolves the state: one pass → Up, any
// fail → Down (via the StateUnknown shortcut in Record).
func New(name string, addr net.IP, rise, fall int) *Backend {
	return &Backend{
		Name:    name,
		Address: addr,
		State:   StateUnknown,
		Counter: HealthCounter{Rise: rise, Fall: fall, Health: rise - 1},
	}
}

// Record applies a probe result to the health counter and transitions state if
// needed. Returns true if the state changed. Results are ignored while the
// backend is paused, disabled, or removed.
//
// StateUnknown transitions to StateDown on the first failure (any evidence of
// failure means the backend is not yet confirmed reachable), and to StateUp
// once the counter reaches Rise. With the Rise-1 preload applied by New,
// Resume, and Enable, that happens on the first pass.
func (b *Backend) Record(r ProbeResult, maxHistory int) bool {
	if b.State == StatePaused || b.State == StateDisabled || b.State == StateRemoved {
		return false
	}
	if r.OK {
		if b.Counter.RecordPass() {
			b.transition(StateUp, r, maxHistory)
			return true
		}
	} else if b.Counter.RecordFail() || b.State == StateUnknown {
		// RecordFail is evaluated first, so the counter is always updated.
		b.transition(StateDown, r, maxHistory)
		return true
	}
	return false
}

// Pause transitions the backend to StatePaused. Returns true if the state changed.
func (b *Backend) Pause(maxHistory int) bool {
	if b.State == StatePaused {
		return false
	}
	b.transition(StatePaused, ProbeResult{}, maxHistory)
	b.Counter.Health = 0
	return true
}

// Resume transitions a paused backend back to StateUnknown, resetting the
// counter. Returns true if the state changed.
func (b *Backend) Resume(maxHistory int) bool {
	if b.State != StatePaused {
		return false
	}
	b.transition(StateUnknown, ProbeResult{}, maxHistory)
	b.Counter.Health = b.Counter.Rise - 1
	return true
}

// NextInterval returns the appropriate probe interval based on state and counter:
//
//   - Unknown (initial / post-resume): fastInterval (falls back to interval);
//     probe quickly to establish state
//   - Fully healthy (counter at max): interval
//   - Fully down (counter at 0): downInterval (falls back to interval)
//   - Degraded (anywhere in between): fastInterval (falls back to interval)
func (b *Backend) NextInterval(interval, fastInterval, downInterval time.Duration) time.Duration {
	if b.State == StateUnknown {
		if fastInterval > 0 {
			return fastInterval
		}
		return interval
	}
	if b.Counter.Health == b.Counter.Max() {
		return interval
	}
	if b.Counter.Health == 0 {
		if downInterval > 0 {
			return downInterval
		}
		return interval
	}
	if fastInterval > 0 {
		return fastInterval
	}
	return interval
}

// Start records the initial StateUnknown transition when a backend is first
// created or restarted. It exists solely to populate the transition history
// and fire a reload event; the state does not change.
func (b *Backend) Start(maxHistory int) Transition {
	b.transition(StateUnknown, ProbeResult{Code: "start"}, maxHistory)
	return b.Transitions[0]
}

// Disable transitions the backend to StateDisabled. Returns the transition.
// After this call no further probe results are accepted.
func (b *Backend) Disable(maxHistory int) Transition {
	b.transition(StateDisabled, ProbeResult{Code: "disabled"}, maxHistory)
	return b.Transitions[0]
}

// Enable transitions a disabled backend back to StateUnknown, resetting the
// counter so the first probe result resolves state (rise-1 preload gives
// 1-pass → Up, 1-fail → Down). Returns the transition.
func (b *Backend) Enable(maxHistory int) Transition {
	b.transition(StateUnknown, ProbeResult{Code: "enabled"}, maxHistory)
	b.Counter.Health = b.Counter.Rise - 1
	return b.Transitions[0]
}

// Remove transitions the backend to StateRemoved. Returns the transition.
// After this call no further probe results are accepted.
func (b *Backend) Remove(maxHistory int) Transition {
	b.transition(StateRemoved, ProbeResult{Code: "removed"}, maxHistory)
	return b.Transitions[0]
}

// transition prepends a new Transition (newest first), truncates the history
// to maxHistory entries, and updates State.
func (b *Backend) transition(to State, r ProbeResult, maxHistory int) {
	t := Transition{From: b.State, To: to, At: time.Now(), Result: r}
	b.Transitions = append([]Transition{t}, b.Transitions...)
	if len(b.Transitions) > maxHistory {
		b.Transitions = b.Transitions[:maxHistory]
	}
	b.State = to
}
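
The hysteresis that HealthCounter provides is easiest to see by stepping a counter through a few probes. The standalone sketch below copies the counter logic from the listing so that it compiles on its own; failsUntilDown is a hypothetical helper added only for the example, and the Rise=2, Fall=3 values are arbitrary.

```go
package main

import "fmt"

// HealthCounter repeats the type from the listing so this example is
// self-contained.
type HealthCounter struct {
	Health int
	Rise   int
	Fall   int
}

func (h *HealthCounter) Max() int   { return h.Rise + h.Fall - 1 }
func (h *HealthCounter) IsUp() bool { return h.Health >= h.Rise }

// RecordPass increments the counter and reports an UP edge.
func (h *HealthCounter) RecordPass() bool {
	wasUp := h.IsUp()
	if h.Health < h.Max() {
		h.Health++
	}
	return !wasUp && h.IsUp()
}

// RecordFail decrements the counter and reports a DOWN edge.
func (h *HealthCounter) RecordFail() bool {
	wasDown := !h.IsUp()
	if h.Health > 0 {
		h.Health--
	}
	return !wasDown && !h.IsUp()
}

// failsUntilDown is a hypothetical helper: it feeds failures to an up
// counter and returns how many were needed before it reported down.
func failsUntilDown(h *HealthCounter) int {
	n := 0
	for h.IsUp() {
		n++
		h.RecordFail()
	}
	return n
}

func main() {
	// Preloaded to Rise-1, as New does for a fresh backend.
	h := &HealthCounter{Rise: 2, Fall: 3, Health: 1}

	fmt.Println("first pass brings it up:", h.RecordPass())

	// Further passes saturate at Max and produce no new edge.
	for i := 0; i < 5; i++ {
		h.RecordPass()
	}
	fmt.Println("saturated at max:", h.Health == h.Max())

	// From the ceiling it takes Fall consecutive failures to go down.
	fmt.Println("failures needed to go down:", failsUntilDown(h))
}
```

With the counter at its ceiling, a single failed probe only moves it into the degraded range; the DOWN edge fires once Health drops below Rise.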