Distinguish disabled from removed backend state; add make fixstyle

Add StateDisabled for operator-initiated disable, keeping StateRemoved
for backends that disappear during a config reload. Previously both
used StateRemoved, which was confusing: "removed" implies the backend
no longer exists in config, but a disabled backend is still present
and can be re-enabled on the fly.

- health: add StateDisabled with String() "disabled", Disable() method
  with probe code "disabled". Record() rejects probes in all three
  inactive states (paused, disabled, removed).
- checker: DisableBackend calls backend.Disable() instead of Remove().
- docs: healthchecks.md rewritten for pause (goroutine cancelled, not
  just results discarded), and separate disabled/removed state rows.
  user-guide.md updated to match.
- Makefile: add fixstyle target (gofmt -w .).
This commit is contained in:
2026-04-11 21:04:17 +02:00
parent 4ab3096c8b
commit 1815675fb6
10 changed files with 68 additions and 30 deletions

View File

@@ -14,7 +14,7 @@ LDFLAGS := -X '$(MODULE)/cmd.version=$(VERSION)' \
TEST ?= tests/
.PHONY: all build build-amd64 build-arm64 test proto lint pkg-deb robot-test clean
.PHONY: all build build-amd64 build-arm64 test proto lint fixstyle pkg-deb robot-test clean
all: build
@@ -48,6 +48,9 @@ $(GEN_FILES): $(PROTO_FILE)
--go-grpc_out=. --go-grpc_opt=module=$(MODULE) \
$(PROTO_FILE)
fixstyle:
gofmt -w .
lint:
golangci-lint run ./...

View File

@@ -2,7 +2,7 @@
`maglevd` probes each backend independently of how many frontends reference it.
Every backend runs exactly one probe goroutine. State changes are broadcast as
gRPC events to all connected `WatchBackendEvents` subscribers.
gRPC events to all connected `WatchEvents` subscribers.
---
@@ -10,11 +10,12 @@ gRPC events to all connected `WatchBackendEvents` subscribers.
| State | Meaning |
|---|---|
| `unknown` | Initial state; also entered after a resume or backend restart. |
| `unknown` | Initial state; also entered after a resume or enable. |
| `up` | Backend is healthy and eligible to receive traffic. |
| `down` | Backend has failed enough consecutive probes to be considered offline. |
| `paused` | Health checking suspended by an operator. Probes fire but results are discarded. |
| `removed` | Backend was removed from configuration. No further probes are accepted. |
| `paused` | Health checking stopped by an operator. No probes are sent. |
| `disabled` | Backend disabled by an operator. No probes are sent. |
| `removed` | Backend removed from configuration by a reload. No probes are sent. |
---
@@ -41,9 +42,9 @@ without bouncing between up and down.
### Expedited unknown resolution
When a backend enters `unknown` state (new, restarted, or resumed) its counter
is pre-loaded to `rise 1`. This means a single probe result is enough to
resolve the state:
When a backend enters `unknown` state (new, restarted, resumed, or re-enabled)
its counter is pre-loaded to `rise 1`. This means a single probe result is
enough to resolve the state:
- **1 pass** → `up`
- **1 fail** → `down` (also via the special unknown shortcut below)
@@ -92,7 +93,7 @@ that are known to be offline.
## Transition events
Every state change is logged as `backend-transition` and emitted as a gRPC
`BackendEvent` to all active `WatchBackendEvents` streams.
`BackendEvent` to all active `WatchEvents` streams.
### Backend added (config load or reload)
@@ -100,7 +101,7 @@ Every state change is logged as `backend-transition` and emitted as a gRPC
unknown → unknown (code: start)
```
The counter is pre-loaded to `rise 1`. The first probe fires immediately at
The counter is pre-loaded to `rise 1`. The first probe fires after
`fast-interval` (or `interval` if not configured). One pass produces `unknown →
up`; one fail produces `unknown → down`.
@@ -127,8 +128,9 @@ If multiple backends start together they are staggered across the first
<any> → paused (operator action)
```
The counter is reset to 0. Probes continue to fire on their normal schedule but
all results are discarded. The backend stays `paused` until explicitly resumed.
The counter is reset to 0. The probe goroutine is cancelled — no further
probes are sent and no traffic reaches the backend while it is paused. The
backend stays `paused` until explicitly resumed.
### Resume
@@ -136,9 +138,31 @@ all results are discarded. The backend stays `paused` until explicitly resumed.
paused → unknown (operator action)
```
The counter is reset to `rise 1`. The probe goroutine is woken immediately
(no wait for the next scheduled probe). One subsequent pass produces `unknown →
up`; one fail produces `unknown → down`.
The counter is reset to `rise 1`. A fresh probe goroutine is started,
which fires its first probe after `fast-interval` (or `interval` if not
configured). One pass produces `unknown → up`; one fail produces `unknown →
down`.
### Disable
```
<any> → disabled (operator action)
```
The probe goroutine is cancelled and the backend is marked `enabled: false`.
No further probes are sent. The backend remains visible via the gRPC API (state
`disabled`) and can be re-enabled without a config reload.
### Enable
```
disabled → unknown (operator action, via fresh goroutine)
```
A new probe goroutine is started and the backend re-enters `unknown` with the
counter pre-loaded to `rise 1`. The `enabled` flag is set back to `true`.
The first probe fires after `fast-interval` and resolves state as described
under *Backend added*.
### Backend removed (config reload)
@@ -175,4 +199,4 @@ All state changes produce a structured log line at `INFO` level:
Probe-driven transitions also carry `code` and `detail` fields from the probe
result (e.g. `L4CON`, `L7STS`, `connection refused`). Operator-driven
transitions (pause, resume) carry empty code and detail.
transitions (pause, resume, disable, enable) carry empty code and detail.

View File

@@ -87,7 +87,7 @@ set backend <name> pause Suspend health checking for a backend, freezing
set backend <name> resume Resume health checking; backend re-enters unknown state
and is probed immediately.
set backend <name> disable Stop probing entirely and remove the backend from rotation.
The backend remains visible (state: removed) and can be
The backend remains visible (state: disabled) and can be
re-enabled without reloading configuration.
set backend <name> enable Re-enable a disabled backend. A fresh probe goroutine is
started and the backend re-enters unknown state.

View File

@@ -350,7 +350,7 @@ func (c *Checker) DisableBackend(name string) (BackendSnapshot, bool) {
return BackendSnapshot{Health: w.backend, Config: w.entry}, true
}
maxHistory := c.cfg.HealthChecker.TransitionHistory
t := w.backend.Remove(maxHistory)
t := w.backend.Disable(maxHistory)
slog.Info("backend-disable", "backend", name)
c.emitForBackend(name, w.backend.Address, t, c.cfg.Frontends)
w.cancel()

View File

@@ -310,8 +310,8 @@ func TestEnableDisable(t *testing.T) {
if !ok {
t.Fatal("DisableBackend: not found")
}
if b.Health.State != health.StateRemoved {
t.Errorf("after disable: state=%s, want removed", b.Health.State)
if b.Health.State != health.StateDisabled {
t.Errorf("after disable: state=%s, want disabled", b.Health.State)
}
if b.Config.Enabled {
t.Error("after disable: Enabled should be false")

View File

@@ -268,8 +268,8 @@ func TestEnableDisableBackend(t *testing.T) {
if err != nil {
t.Fatalf("DisableBackend: %v", err)
}
if info.State != "removed" {
t.Errorf("after disable: got %q, want removed", info.State)
if info.State != "disabled" {
t.Errorf("after disable: got %q, want disabled", info.State)
}
if info.Enabled {
t.Error("after disable: Enabled should be false")

View File

@@ -30,10 +30,11 @@ type State int
const (
StateUnknown State = iota // initial state before first probe
StateUp
StateDown
StatePaused
StateRemoved // backend was removed from configuration
StateUp // backend is healthy
StateDown // backend has failed enough probes
StatePaused // operator paused health checking
StateDisabled // operator disabled the backend
StateRemoved // backend removed from configuration by reload
)
func (s State) String() string {
@@ -46,6 +47,8 @@ func (s State) String() string {
return "down"
case StatePaused:
return "paused"
case StateDisabled:
return "disabled"
case StateRemoved:
return "removed"
default:
@@ -123,7 +126,7 @@ func New(name string, addr net.IP, rise, fall int) *Backend {
// failure means the backend is not yet confirmed reachable), and to StateUp
// once the counter reaches Rise consecutive passes.
func (b *Backend) Record(r ProbeResult, maxHistory int) bool {
if b.State == StatePaused || b.State == StateRemoved {
if b.State == StatePaused || b.State == StateDisabled || b.State == StateRemoved {
return false
}
if r.OK {
@@ -196,6 +199,13 @@ func (b *Backend) Start(maxHistory int) Transition {
return b.Transitions[0]
}
// Disable transitions the backend to StateDisabled. Returns the transition.
// After this call no further probe results are accepted.
func (b *Backend) Disable(maxHistory int) Transition {
b.transition(StateDisabled, ProbeResult{Code: "disabled"}, maxHistory)
return b.Transitions[0]
}
// Remove transitions the backend to StateRemoved. Returns the transition.
// After this call no further probe results are accepted.
func (b *Backend) Remove(maxHistory int) Transition {

View File

@@ -333,6 +333,7 @@ func TestStateString(t *testing.T) {
{StateUp, "up"},
{StateDown, "down"},
{StatePaused, "paused"},
{StateDisabled, "disabled"},
{StateRemoved, "removed"},
}
for _, c := range cases {

View File

@@ -112,6 +112,7 @@ func (c *Collector) Collect(ch chan<- prometheus.Metric) {
health.StateUp,
health.StateDown,
health.StatePaused,
health.StateDisabled,
health.StateRemoved,
}
@@ -173,4 +174,3 @@ func Register(reg prometheus.Registerer, src StateSource) *Collector {
reg.MustRegister(TransitionTotal)
return coll
}

View File

@@ -59,7 +59,7 @@ Resume backend restarts probing
Disable backend stops probing
Maglevc set backend nginx2 disable
Backend Should Have State nginx2 removed
Backend Should Have State nginx2 disabled
Backend Should Be Disabled nginx2
Sleep 1s
${before} = Get Probe Count nginx2