# Health Checking
maglevd probes each backend independently of how many frontends reference it.
Every backend runs exactly one probe goroutine. State changes are broadcast as
gRPC events to all connected `WatchEvents` subscribers.
## States

| State | Meaning |
|---|---|
| `unknown` | Initial state; also entered after a resume or enable. |
| `up` | Backend is healthy and eligible to receive traffic. |
| `down` | Backend has failed enough consecutive probes to be considered offline. |
| `paused` | Health checking stopped by an operator. No probes are sent. |
| `disabled` | Backend disabled by an operator. No probes are sent. |
| `removed` | Backend removed from configuration by a reload. No probes are sent. |
## Rise / fall counter

The state machine is driven by a single-integer health counter in the style of
HAProxy's rise/fall checks:

- `counter ∈ [0, rise + fall − 1]` (the upper bound is called Max below)
- the backend is up when `counter ≥ rise`
- the backend is down when `counter < rise`
On each probe:
- pass — counter increments, ceiling at Max.
- fail — counter decrements, floor at 0.
This gives hysteresis: a backend that is barely up (counter = rise) needs
fall consecutive failures before it transitions to down. A backend that is
fully down (counter = 0) needs rise consecutive passes to come back up. A
backend that oscillates between passing and failing stays in the degraded range
without bouncing between up and down.
## Expedited unknown resolution
When a backend enters the `unknown` state (new, restarted, resumed, or
re-enabled), its counter is pre-loaded to rise − 1, so a single probe result
is enough to resolve the state:
- 1 pass → `up`
- 1 fail → `down` (also via the special unknown shortcut below)

In addition, any failure while the state is `unknown` transitions immediately to
`down`, regardless of the counter value.
### Example: rise=2, fall=3 (Max=4)

```
counter:  0     1     2     3     4
state:    DOWN  DOWN  UP    UP    UP
                      ^
                      rise boundary
```
- A backend starting from `unknown` has counter=1 (rise−1). One pass → counter=2 → `up`. One fail while `unknown` → `down` immediately.
- A backend that just became `up` sits at counter=2. It needs 3 failures to go `down` (2→1→0, crossing the rise boundary at 2→1).
- A backend that has been fully healthy for a while sits at counter=4. It needs 3 failures to go `down` (4→3→2→1, crossing the rise boundary at 2→1).
## Probe intervals
The interval used between probes depends on the backend's counter state:
| Condition | Interval used |
|---|---|
| State is `unknown` | `fast-interval` (falls back to `interval`) |
| Counter = Max (fully healthy) | `interval` |
| Counter = 0 (fully down) | `down-interval` (falls back to `interval`) |
| Counter between 0 and Max (degraded) | `fast-interval` (falls back to `interval`) |
Using `fast-interval` in degraded and unknown states means a flapping or
recovering backend is re-evaluated quickly without waiting a full `interval`.
Using `down-interval` for fully down backends reduces probe traffic to servers
that are known to be offline.
## Transition events

Every state change is logged as `backend-transition` and emitted as a gRPC
`BackendEvent` to all active `WatchEvents` streams.
### Backend added (config load or reload)

`unknown → unknown` (code: `start`)
The counter is pre-loaded to rise − 1. The first probe fires after
`fast-interval` (or `interval` if not configured). One pass produces
`unknown → up`; one fail produces `unknown → down`.
If multiple backends start together they are staggered across the first
interval to avoid probe bursts.
### Probe pass

- Counter increments.
- If counter reaches `rise` from below: `down → up` (or `unknown → up`).
- If already up: no transition. Next probe at `fast-interval` if degraded, `interval` if fully healthy.
### Probe fail

- Counter decrements.
- If counter drops below `rise` from above: `up → down`.
- If state is `unknown`: transition immediately to `down` regardless of counter.
- Next probe at `down-interval` if fully down, `fast-interval` if degraded.
### Pause

`<any> → paused` (operator action)
The counter is reset to 0. The probe goroutine is cancelled — no further
probes are sent and no traffic reaches the backend while it is paused. The
backend stays paused until explicitly resumed.
### Resume

`paused → unknown` (operator action)

The counter is reset to rise − 1. A fresh probe goroutine is started,
which fires its first probe after `fast-interval` (or `interval` if not
configured). One pass produces `unknown → up`; one fail produces
`unknown → down`.
### Disable

`<any> → disabled` (operator action)

The probe goroutine is cancelled and the backend is marked `enabled: false`.
No further probes are sent. The backend remains visible via the gRPC API (state
`disabled`) and can be re-enabled without a config reload.
### Enable

`disabled → unknown` (operator action, via fresh goroutine)

A new probe goroutine is started and the backend re-enters `unknown` with the
counter pre-loaded to rise − 1. The `enabled` flag is set back to `true`.
The first probe fires after `fast-interval` and resolves state as described
under Backend added.
### Backend removed (config reload)

`<any> → removed` (code: `removed`)
The probe goroutine stops. No further state changes occur. The `removed` event
is emitted using the frontend map from before the reload, so that consumers can
correlate it with the correct frontend.
### Backend healthcheck config changed (config reload)

The old probe goroutine is stopped (`<any> → removed`) and a new one started
(`unknown → unknown`, code: `start`). The new goroutine resolves state on the
first probe as described under Backend added above.
### Backend metadata changed without healthcheck change (config reload)

Weight, the `enabled` flag, and similar fields are updated in place. The probe
goroutine is not restarted and no transition event is emitted.
## Log lines
All state changes produce a structured log line at INFO level:
```
{"level":"INFO","msg":"backend-transition","backend":"nginx0-ams","from":"up","to":"paused"}
{"level":"INFO","msg":"backend-transition","backend":"nginx0-ams","from":"paused","to":"unknown"}
{"level":"INFO","msg":"backend-transition","backend":"nginx0-ams","from":"unknown","to":"up","code":"L7OK","detail":""}
```
Probe-driven transitions also carry `code` and `detail` fields from the probe
result (e.g. `L4CON`, `L7STS`, `connection refused`). Operator-driven
transitions (pause, resume, disable, enable) carry empty `code` and `detail`.