Makefile:
- New install-deps umbrella target split into three sub-targets:
  - install-deps-apt — Debian/Trixie-packaged build deps (nodejs, npm,
    protobuf-compiler, git, make, dpkg-dev, ca-certificates, curl,
    tar). Uses sudo when not already root.
  - install-deps-go — ensures a Go toolchain >= GO_VERSION (the go.mod
    floor, default 1.25.0). Short-circuits when the system Go is
    already recent enough; otherwise downloads the upstream tarball
    from go.dev/dl/ into /usr/local/go. Trixie only ships 1.24, so
    this step is load-bearing.
  - install-deps-go-tools — go install protoc-gen-go,
    protoc-gen-go-grpc, and golangci-lint/v2/cmd/golangci-lint. Then
    asserts that the installed golangci-lint version parses as >=
    GOLANCI_LINT_VERSION (default 1.64.0, the floor that supports Go
    1.25 syntax) to catch stale binaries in $GOPATH/bin before they
    silently run against Go 1.25 code.
- Parser bug fixed: golangci-lint v1.x prints "has version v1.64.8" but
v2.x dropped the 'v' prefix and prints "has version 2.11.4". The
original sed regex required the 'v' and returned an empty match on
v2.x, making the assertion explode with "could not parse version
output". Fixed by switching to extended regex (sed -En) with 'v?' so
both forms parse cleanly.
- GO_VERSION and GOLANGCI_LINT_VERSION exposed as Makefile variables
so operators can override on the command line, e.g.
make install-deps GO_VERSION=1.25.5 GOLANGCI_LINT_VERSION=2.0.0
- .PHONY extended with the four new target names.
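For reference, a minimal Go port of the fixed matching logic from the
parser-bug item above. The Makefile does this with sed -En; the regexp
and harness here are illustrative, not the literal Makefile recipe:

```go
package main

import (
	"fmt"
	"regexp"
)

// The optional "v" accepts both the v1.x output ("has version v1.64.8")
// and the v2.x output ("has version 2.11.4").
var versionRe = regexp.MustCompile(`has version v?([0-9]+(\.[0-9]+)+)`)

func main() {
	for _, out := range []string{
		"golangci-lint has version v1.64.8",
		"golangci-lint has version 2.11.4",
	} {
		m := versionRe.FindStringSubmatch(out)
		if m == nil {
			fmt.Println("could not parse version output")
			continue
		}
		fmt.Println(m[1]) // "1.64.8", then "2.11.4"
	}
}
```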
Docs:
- README.md: capability note rewritten to cover CAP_NET_RAW (ICMP) and
the new CAP_SYS_ADMIN requirement when healthchecker.netns is set,
plus a paragraph explaining that the Debian systemd unit grants both
automatically. Docker example gained a second variant that shows the
additional --cap-add SYS_ADMIN and /var/run/netns bind mount for
netns-scoped deployments. Also notes that maglevd-frontend ignores
SIGHUP so controlling-terminal disconnects don't kill it.
- docs/user-guide.md:
  - Capabilities section rewritten as a bulleted list covering both
    caps, with the EPERM error string and three different ways to
    grant them (systemd unit, setcap, systemd-run).
  - 'show vpp lb counters' command description updated to explain that
    per-backend packet counts are no longer shown (the LB plugin's
    forwarding node bypasses ip{4,6}_lookup_inline, so /net/route/to
    at the backend's FIB entry never ticks for LB-forwarded traffic).
  - New ~75-line "What the SPA shows" subsection covering the scope
    selector + maglev_scope cookie, the per-maglevd frontend cards,
    the health-cascade icon table (ok / bug-buckets / primary-drained /
    degraded / unknown), the lb buckets column semantics, the
    maglev_zippy_open cookie, the admin-mode lifecycle dialogs with
    their plain-English consequence text, and the debug panel.
- docs/config-guide.md: healthchecker.netns field gains a capability-
requirement note spelling out setns(CLONE_NEWNET), the EPERM
symptom string, and the /var/run/netns/ readability requirement.
- docs/healthchecks.md: new "Jitter" subsection explaining the +/-10%
scaling on every computed interval, and a "Probe timing while a
probe is in flight" subsection that explains why fast-interval alone
doesn't give fast fault detection against hanging backends (the
probe loop is synchronous, so each iteration is timeout +
fast-interval; the advice is to lower timeout, not fast-interval).
- docs/maglevd.8: description paragraph corrected (dropped the
per-backend stats claim and added a short note pointing at the LB
plugin forwarding-path bypass); new CAPABILITIES section between
SIGNALS and FILES covering both CAP_NET_RAW and CAP_SYS_ADMIN with
the drop-in-override hint.
- docs/maglevd-frontend.8: new SIGNALS section documenting the
explicit SIGHUP ignore (so a controlling-terminal disconnect doesn't
kill the daemon); description extended with paragraphs on the two
persistence cookies (maglev_scope, maglev_zippy_open) and on the
health-cascade icon + lb buckets column.
- docs/maglevc.1: left untouched — intentionally minimal and delegates
to docs/user-guide.md.
Lint (26 issues across 12 files, all errcheck / ineffassign / S1021):
- cmd/frontend/handlers.go: _, _ = fmt.Fprintf(...) for the SSE retry
hint and resync control-event writes.
- cmd/maglevc/commands.go: bulk-prefix every fmt.Fprintf(w, ...) with
_, _ =; also merged 'var watchEventsOptSlot *Node; ... = &Node{...}'
into a single := declaration (staticcheck S1021) — the self-
referencing pattern still works because the Children back-ref is
assigned on the next statement, not inside the struct literal.
- cmd/maglevc/complete.go: _, _ = fmt.Fprintf(ql.rl.Stderr(), ...)
for the banner and help writes; removed the ineffectual
'partial = ""' assignment (nothing downstream reads partial after
that branch, so setting it was dead code flagged by ineffassign).
- cmd/maglevc/shell.go: defer func() { _ = rl.Close() }() for the
readline instance; _, _ = fmt.Fprintf(rl.Stderr(), ...) for error
display in the REPL loop.
- cmd/maglevc/main.go: defer func() { _ = conn.Close() }() for the
gRPC client connection.
- internal/grpcapi/server_test.go: _ = conn.Close() in the test
teardown closure.
- internal/prober/http.go: _ = c.Close() in the TLS-handshake-failed
path; defer func() { _ = conn.Close() }() and defer func() { _ =
resp.Body.Close() }() for the two deferred cleanups.
- internal/prober/http_test.go: defer func() { _ = resp.Body.Close()
}() plus three _, _ = fmt.Fprint(w, ...) in the httptest.Server
handlers and _, _ = fmt.Sscanf(...) when parsing the test listener's
port.
- internal/prober/icmp.go: defer func() { _ = pc.Close() }() for the
ICMP packet conn.
- internal/prober/netns.go: defer func() { _ = origNs.Close() }(),
defer func() { _ = netns.Set(origNs) }(), defer func() { _ =
targetNs.Close() }() — also dropped a stray //nolint:errcheck that
was no longer needed once the closure wrapping handled the discard.
- internal/prober/tcp.go: _ = conn.Close() in the L4-only path,
_ = tlsConn.Close() in the failed and succeeded handshake branches,
_ = tlsConn.SetDeadline(...) (also dropped a //nolint:errcheck
previously covering it).
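The recurring pattern above, sketched once. The function name and
signature are illustrative, not code from the tree:

```go
package prober

import (
	"net"
	"time"
)

// A bare `defer conn.Close()` leaves the returned error unchecked and
// trips errcheck; wrapping the call in a closure and assigning to _
// makes the discard explicit, so no //nolint:errcheck is needed.
func probeOnce(addr string, timeout time.Duration) error {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return err
	}
	defer func() { _ = conn.Close() }() // explicit, errcheck-clean discard
	return nil
}
```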
Iterative 'make lint' runs were needed because golangci-lint v2.x
caps same-linter reports per pass, so the first pass reported 21,
then 4, then 3, then 1, then 0. Final pass: 0 issues. make test is
green across every package, and make build produces all three
binaries cleanly.
Health Checking
maglevd probes each backend independently of how many frontends reference it.
Every backend runs exactly one probe goroutine. State changes are broadcast as
gRPC events to all connected WatchEvents subscribers.
States
| State | Meaning |
|---|---|
| unknown | Initial state; also entered after a resume or enable. |
| up | Backend is healthy and eligible to receive traffic. |
| down | Backend has failed enough consecutive probes to be considered offline. |
| paused | Health checking stopped by an operator. No probes are sent. |
| disabled | Backend disabled by an operator. No probes are sent. |
| removed | Backend removed from configuration by a reload. No probes are sent. |
Rise / fall counter
The state machine is driven by a single-integer health counter in the
style of HAProxy's server checks:

- counter ∈ [0, rise + fall − 1] (called Max below)
- backend is UP when counter ≥ rise
- backend is DOWN when counter < rise
On each probe:
- pass — counter increments, ceiling at Max.
- fail — counter decrements, floor at 0.
This gives hysteresis: a backend that is fully healthy (counter = Max)
needs fall consecutive failures before it transitions to down, and a
backend that is fully down (counter = 0) needs rise consecutive passes
to come back up. A backend that alternates between passing and failing
hovers in the degraded middle of the range instead of swinging between
the extremes.
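A compact sketch of the counter arithmetic, using illustrative names
rather than maglevd's actual types (the rise − 1 pre-load it includes
is covered in the next subsection):

```go
package health

// Counter is a sketch of the single-integer rise/fall counter
// described above.
type Counter struct {
	rise, fall int
	value      int // clamped to [0, rise+fall-1]
}

// NewCounter pre-loads the counter to rise-1, the "expedited unknown
// resolution" starting point: one pass reaches rise and resolves up.
func NewCounter(rise, fall int) *Counter {
	return &Counter{rise: rise, fall: fall, value: rise - 1}
}

func (c *Counter) max() int { return c.rise + c.fall - 1 }

// Observe applies one probe result and reports whether the backend is
// up afterwards (up iff value >= rise).
func (c *Counter) Observe(pass bool) (up bool) {
	if pass {
		if c.value < c.max() {
			c.value++ // ceiling at Max
		}
	} else if c.value > 0 {
		c.value-- // floor at 0
	}
	return c.value >= c.rise
}
```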
Expedited unknown resolution
When a backend enters unknown state (new, restarted, resumed, or re-enabled)
its counter is pre-loaded to rise − 1. This means a single probe result is
enough to resolve the state:
- 1 pass → up
- 1 fail → down (also via the special unknown shortcut below)
In addition, any failure while state is unknown transitions immediately to
down, regardless of the counter value.
Example: rise=2, fall=3 (Max=4)
counter: 0 1 2 3 4
state: DOWN DOWN UP UP UP
^
rise boundary
A backend starting from unknown has counter=1 (rise−1). One pass → counter=2 → up. One fail while unknown → down immediately.
A backend that just became up sits at counter=2, right on the rise boundary: a single failure (2→1) crosses the boundary and sends it straight back down.
A backend that has been fully healthy for a while sits at counter=4. It needs 3 failures to go down (4→3→2→1, crossing the rise boundary at 2→1).
Probe intervals
The interval used between probes depends on the backend's counter state:
| Condition | Interval used |
|---|---|
| State is unknown | fast-interval (falls back to interval) |
| Counter = Max (fully healthy) | interval |
| Counter = 0 (fully down) | down-interval (falls back to interval) |
| Counter between 0 and Max (degraded) | fast-interval (falls back to interval) |
Using fast-interval in degraded and unknown states means a flapping or
recovering backend is re-evaluated quickly without waiting a full interval.
Using down-interval for fully down backends reduces probe traffic to servers
that are known to be offline.
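A sketch of the selection logic implied by the table. The hyphenated
names are the real config fields; the Go identifiers and the
zero-means-unset convention are assumptions of this example:

```go
package health

import "time"

// nextInterval picks the sleep before the next probe, following the
// interval table above.
func nextInterval(unknown bool, counter, max int,
	interval, fastInterval, downInterval time.Duration) time.Duration {
	if fastInterval == 0 {
		fastInterval = interval // fast-interval falls back to interval
	}
	if downInterval == 0 {
		downInterval = interval // down-interval falls back to interval
	}
	switch {
	case unknown:
		return fastInterval
	case counter == max: // fully healthy
		return interval
	case counter == 0: // fully down
		return downInterval
	default: // degraded
		return fastInterval
	}
}
```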
Jitter
Every computed interval is then scaled by a uniformly-distributed random
factor in [0.9, 1.1) before the probe worker sleeps. The ±10% jitter
prevents all probes from aligning on the same tick after a restart or a
config reload — a deployment with dozens of backends would otherwise send a
bursty, phase-locked flight of probes every interval. The jitter is
applied once per probe iteration, not averaged across iterations, so the
long-run cadence is still the configured interval.
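As a sketch (math/rand/v2 and the function name are assumptions of
this example):

```go
package health

import (
	"math/rand/v2"
	"time"
)

// jittered scales a computed interval by a uniform factor in
// [0.9, 1.1) so probe workers drift out of phase instead of firing in
// lockstep after a restart or reload.
func jittered(d time.Duration) time.Duration {
	f := 0.9 + 0.2*rand.Float64() // uniform in [0.9, 1.1)
	return time.Duration(float64(d) * f)
}
```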
Probe timing while a probe is in flight
The probe worker loop is synchronous: each iteration blocks on the probe's
completion (or its timeout) before computing the next sleepFor. That
means a fully-timing-out probe effectively runs at
timeout + fast-interval cadence, not fast-interval cadence. If you
want fast fault detection against backends that hang rather than refuse
the connection (e.g. a dead TCP stack, or an unreachable backend via a
blackhole route), lower timeout rather than fast-interval. Setting
fast-interval below timeout doesn't make probes fire more frequently —
it just changes the idle gap between a completed probe and the next one.
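A sketch of that loop shape (all names illustrative):

```go
package health

import (
	"context"
	"time"
)

// probeLoop blocks on each probe (bounded by timeout) before sleeping,
// so a hanging backend is re-probed roughly every timeout + interval,
// not every interval.
func probeLoop(ctx context.Context, timeout time.Duration,
	probe func(context.Context) error, next func() time.Duration) {
	for ctx.Err() == nil {
		pctx, cancel := context.WithTimeout(ctx, timeout)
		_ = probe(pctx) // blocks up to timeout against a hanging backend
		cancel()
		select {
		case <-time.After(next()): // idle gap starts AFTER the probe returns
		case <-ctx.Done():
		}
	}
}
```

With next() returning the jittered fast-interval, a backend that never
answers is probed once per timeout + fast-interval, which is why
timeout is the knob that matters for hang detection.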
Transition events
Every state change is logged as backend-transition and emitted as a gRPC
BackendEvent to all active WatchEvents streams.
Backend added (config load or reload)
unknown → unknown (code: start)
The counter is pre-loaded to rise − 1. The first probe fires after
fast-interval (or interval if not configured). One pass produces unknown → up; one fail produces unknown → down.
If multiple backends start together they are staggered across the first
interval to avoid probe bursts.
Probe pass
- Counter increments.
- If counter reaches rise from below: down → up (or unknown → up).
- If already up: no transition. Next probe at fast-interval if degraded, interval if fully healthy.
Probe fail
- Counter decrements.
- If counter drops below rise from above: up → down.
- If state is unknown: transition immediately to down regardless of counter.
- Next probe at down-interval if fully down, fast-interval if degraded.
Pause
<any> → paused (operator action)
The counter is reset to 0. The probe goroutine is cancelled — no further
probes are sent and no traffic reaches the backend while it is paused. The
backend stays paused until explicitly resumed.
Resume
paused → unknown (operator action)
The counter is reset to rise − 1. A fresh probe goroutine is started,
which fires its first probe after fast-interval (or interval if not
configured). One pass produces unknown → up; one fail produces unknown → down.
Disable
<any> → disabled (operator action)
The probe goroutine is cancelled and the backend is marked enabled: false.
No further probes are sent. The backend remains visible via the gRPC API (state
disabled) and can be re-enabled without a config reload.
Enable
disabled → unknown (operator action, via fresh goroutine)
A new probe goroutine is started and the backend re-enters unknown with the
counter pre-loaded to rise − 1. The enabled flag is set back to true.
The first probe fires after fast-interval and resolves state as described
under Backend added.
Backend removed (config reload)
<any> → removed (code: removed)
The probe goroutine stops. No further state changes occur. The removed event is emitted using the frontend map from before the reload so that consumers can correlate it to the correct frontend.
Backend healthcheck config changed (config reload)
The old probe goroutine is stopped (<any> → removed) and a new one started
(unknown → unknown, code: start). The new goroutine resolves state on the
first probe as described under Backend added above.
Backend metadata changed without healthcheck change (config reload)
Weight, enabled flag, and similar fields are updated in place. The probe goroutine is not restarted and no transition event is emitted.
Static (no-healthcheck) backends
A backend with no healthcheck field in YAML skips the probe loop entirely.
Instead of actually probing, maglevd synthesises a single passing result
on startup. Specifically:
- The worker's rise/fall counters are forced to 1/1, so a single synthetic pass is enough to reach StateUp.
- The first "probe" fires immediately (zero sleep). Subsequent iterations idle at 30 seconds — there is nothing to do.
- The backend reaches up within milliseconds of startup.
Static backends are useful for administrative VIPs where the caller knows the backend is always available, or for test configurations where deterministic state is more valuable than real health signals.
Pool failover
Every frontend has one or more pools. The pools are priority tiers: pool[0]
is the primary, pool[1] is the first fallback, pool[2] the next, and so on.
At any moment, maglevd computes an active pool — the first pool that
contains at least one backend in StateUp (see the sketch after this list):
- As long as pool[0] has any up backend, it stays active. Its up backends receive traffic at their configured weights; backends in lower-priority pools stay on standby with effective weight 0.
- When pool[0] has zero up backends (all down, paused, disabled, or still unknown), pool[1] is promoted: its up backends get their configured weights, and pool[0] backends stay at 0 until at least one recovers.
- The same rule cascades to pool[2], pool[3], etc., for further fallback tiers.
- When no pool has any up backend, every backend's effective weight is 0 and the VIP serves nothing.
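A sketch of the active-pool rule under illustrative types (not
maglevd's actual API):

```go
package failover

// Backend is an illustrative stand-in for a pool member.
type Backend struct {
	Up     bool
	Weight uint32
}

// EffectiveWeights picks the first pool with at least one up backend
// as the active pool; its up backends keep their configured weights,
// every other backend's effective weight is 0.
func EffectiveWeights(pools [][]Backend) [][]uint32 {
	active := -1
	for i := range pools {
		for _, b := range pools[i] {
			if b.Up {
				active = i
				break
			}
		}
		if active >= 0 {
			break
		}
	}
	out := make([][]uint32, len(pools))
	for i, pool := range pools {
		out[i] = make([]uint32, len(pool))
		for j, b := range pool {
			if i == active && b.Up {
				out[i][j] = b.Weight
			} // otherwise the effective weight stays 0
		}
	}
	return out
}
```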
Failover is evaluated on every backend state transition and also on the
periodic VPP drift reconciliation (every maglev.vpp.lb.sync-interval).
The resulting effective weight for each backend can be inspected via
maglevc show frontends <name> — each pool backend row shows both the
configured weight and the effective weight after failover.
Demotion on recovery (e.g. pool[1] → standby when pool[0] comes back up)
drains gracefully: the demoted backends have their weight set to 0 but
existing flows in the VPP flow table are left to drain naturally. The only
state that forces immediate flow-table flushing is operator disable.
Log lines
All state changes produce a structured log line at INFO level:
{"level":"INFO","msg":"backend-transition","backend":"nginx0-ams","from":"up","to":"paused"}
{"level":"INFO","msg":"backend-transition","backend":"nginx0-ams","from":"paused","to":"unknown"}
{"level":"INFO","msg":"backend-transition","backend":"nginx0-ams","from":"unknown","to":"up","code":"L7OK","detail":""}
Probe-driven transitions also carry code and detail fields from the probe
result (e.g. L4CON, L7STS, connection refused). Operator-driven
transitions (pause, resume, disable, enable) carry empty code and detail.