Frontend aggregate state: SPA-side derive + checker fixes

The web UI showed the wrong up/down state for frontends whose pool
composition had been touched by a mix of runtime disable/enable and
weight changes: a frontend with every backend at effective_weight=0
would still display "up", while a sibling frontend with a serving
fallback backend would display "down". Two independent bugs,
each fixed at its own layer.

On the fast path (healthCheckEqual returns true), Reload did
`w.entry = b`, blindly replacing the runtime worker entry with the
fresh YAML record. YAML's default for Enabled is true, so any
backend the operator had runtime-disabled would have its Enabled
flag silently reset while the worker's backend.State stayed at
StateDisabled. Subsequent EnableBackend calls then early-returned
on `if w.entry.Enabled` and never transitioned the state machine
— the CLI reported "enabled, state is 'disabled'" and the backend
was permanently stuck.

Fix: preserve w.entry.Enabled across the fast-path replacement.

    runtimeEnabled := w.entry.Enabled
    w.entry = b
    w.entry.Enabled = runtimeEnabled

Runtime operator state now outlives config reloads. On the worker-
restart path (different health check) the new worker is
structurally fresh and the YAML's Enabled is still authoritative.

Both DisableBackend and EnableBackend used `w.entry.Enabled` as
their idempotency check, which meant a stuck `Enabled=true,
State=disabled` combo couldn't be repaired even after the Reload
fix: that fix only prevents new drift, so backends that were
already stuck before the upgrade stayed stuck. Switched both
methods to key on `w.backend.State`:

 - DisableBackend: if state == StateDisabled, sync the flag but
   don't emit a redundant transition; otherwise do the full
   state transition + flag flip + worker cancel.
 - EnableBackend: if state != StateDisabled, sync the flag but
   don't emit a redundant transition; otherwise do the full
   transition + flag flip + probe-goroutine restart.

Either method will now unstick any inconsistency between the
flag and the state machine: drift from a future panic, from a
code path we haven't thought of, or from backends already stuck
before this commit is repaired on the next enable/disable call.

Changing a backend's weight can flip a frontend between up and
down (e.g. zeroing the last non-zero-weighted backend in the
active pool), but SetFrontendPoolBackendWeight never called
updateFrontendState, so the checker's cached frontend state
would drift from reality until the next genuine backend
transition happened to trigger a recompute. The symptom was
"show frontends nginx-ip4-http" reporting up even with every
effective_weight=0.

Fix: call c.updateFrontendState(frontendName, fe) after the
weight mutation, under the same lock. The recompute emits a
FrontendEvent transition if the aggregate flipped, so any
WatchEvents consumer picks up the change live.

stores/state.ts's recomputeEffectiveWeights is renamed to
recomputeDerivedState and now also writes fe.state, using the
same rule as health.ComputeFrontendState:
unknown if no backends or all unknown, up if any effective
weight > 0, down otherwise. Called from every mutation path
(replaceAll, replaceSnapshot, applyBackendTransition,
applyConfiguredWeight) so the SPA is authoritative for *display*
state and doesn't inherit any staleness the server's cached
frontendStates map might have.
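
For concreteness, a minimal sketch of that derive rule in
TypeScript. The view-model shapes and the effective-weight
formula (configured weight while up, else 0) are assumptions for
illustration, not the real stores/state.ts types:

    // Sketch only: view-model shapes and the effectiveWeight rule
    // are assumed, not copied from stores/state.ts.
    type BackendView = {
      state: "up" | "down" | "unknown" | "disabled";
      configuredWeight: number;
      effectiveWeight: number;
    };

    type FrontendView = {
      state: "up" | "down" | "unknown";
      backends: BackendView[];
    };

    function recomputeDerivedState(fe: FrontendView): void {
      for (const b of fe.backends) {
        // Assumed rule: a backend contributes its configured
        // weight only while it is serving.
        b.effectiveWeight = b.state === "up" ? b.configuredWeight : 0;
      }
      // Same aggregate rule as health.ComputeFrontendState on the
      // server: unknown if no backends or all unknown, up if any
      // effective weight > 0, down otherwise.
      if (
        fe.backends.length === 0 ||
        fe.backends.every((b) => b.state === "unknown")
      ) {
        fe.state = "unknown";
      } else if (fe.backends.some((b) => b.effectiveWeight > 0)) {
        fe.state = "up";
      } else {
        fe.state = "down";
      }
    }

Because every mutation reducer ends by calling this, a stale
fe.state can never outlive the backends array it was derived
from.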

applyFrontendTransition is now a no-op for the state field —
the server's `to` value is no longer trusted because
recomputeDerivedState walks the local backends array on every
update and produces a fresh, correct answer. The reducer is kept
as a named function so sse.ts's dispatch table still has a
landing spot for "frontend" events (they still feed the
DebugPanel via pushEvent); the empty body is deliberate, not a
bug — a comment at the top spells it out.
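
The shape, as a sketch (the event type and dispatch-table names
are assumptions; only applyFrontendTransition and pushEvent
appear in this commit):

    // The event shape is assumed for illustration.
    interface FrontendEvent {
      frontend: string;
      from: string;
      to: string;
    }

    // Intentionally a no-op for the state field:
    // recomputeDerivedState re-derives fe.state from the local
    // backends array on every mutation, so the server's `to`
    // value is not trusted for display.
    function applyFrontendTransition(_ev: FrontendEvent): void {}

    // Hypothetical shape of sse.ts's dispatch table; "frontend"
    // events still land on the named reducer (after feeding the
    // DebugPanel via pushEvent), so the entry keeps existing.
    const dispatch: Record<string, (ev: FrontendEvent) => void> = {
      frontend: applyFrontendTransition,
    };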

@@ -138,8 +138,22 @@ func (c *Checker) Reload(ctx context.Context, cfg *config.Config) error {
 		hc := cfg.HealthChecks[b.HealthCheck]
 		if w, ok := c.workers[name]; ok {
 			if healthCheckEqual(w.hc, hc) {
-				// Update entry metadata (weight, etc.) in place without restart.
+				// Update entry metadata (address, healthcheck name)
+				// in place without restart. Preserve the runtime
+				// Enabled flag — the operator's
+				// PauseBackend/DisableBackend/EnableBackend state
+				// must outlive config reloads so an operator who
+				// disabled a backend and then reloaded config
+				// (e.g. to adjust weights on an unrelated
+				// frontend) doesn't find their disabled backend
+				// silently re-enabled while its worker state
+				// remains stuck at StateDisabled. The YAML's
+				// Enabled field is still authoritative on the
+				// worker-restart path below (where the backend
+				// is structurally new to this worker instance).
+				runtimeEnabled := w.entry.Enabled
 				w.entry = b
+				w.entry.Enabled = runtimeEnabled
 				continue
 			}
 			slog.Info("backend-restart", "backend", name)
@@ -237,6 +251,13 @@ func (c *Checker) GetFrontend(name string) (config.Frontend, bool) {
 // SetFrontendPoolBackendWeight updates the weight of a backend within a named
 // pool of a frontend. Returns the updated FrontendInfo and a descriptive error
 // if the frontend, pool, or backend is not found or the weight is out of range.
+//
+// After mutating the weight, updateFrontendState is re-run for the affected
+// frontend so the aggregate state reflects the new effective weights. A
+// weight change can flip a frontend between UP and DOWN (e.g. zeroing the
+// last non-zero-weighted backend in the active pool), and without this
+// call the checker's cached frontend state would drift from reality until
+// the next genuine backend transition happens to trigger a recompute.
 func (c *Checker) SetFrontendPoolBackendWeight(frontendName, poolName, backendName string, weight int) (config.Frontend, error) {
 	if weight < 0 || weight > 100 {
 		return config.Frontend{}, fmt.Errorf("weight %d out of range [0, 100]", weight)
@@ -259,6 +280,7 @@ func (c *Checker) SetFrontendPoolBackendWeight(frontendName, poolName, backendNa
 		fe.Pools[i].Backends[backendName] = pb
 		c.cfg.Frontends[frontendName] = fe
 		slog.Info("frontend-pool-weight", "frontend", frontendName, "pool", poolName, "backend", backendName, "weight", weight)
+		c.updateFrontendState(frontendName, fe)
 		return fe, nil
 	}
 	return config.Frontend{}, fmt.Errorf("pool %q not found in frontend %q", poolName, frontendName)
@@ -410,6 +432,13 @@ func (c *Checker) ResumeBackend(name string) (BackendSnapshot, error) {
 // DisableBackend stops health checking for a backend and removes it from active
 // rotation. The worker entry is kept in the map so the backend remains visible
 // via GetBackend and can be re-enabled with EnableBackend.
+//
+// Preconditions are keyed on w.backend.State rather than w.entry.Enabled so
+// that any drift between the two fields (e.g. a past Reload that reset the
+// flag without transitioning state) is self-healing: if the state is not
+// already disabled we always do the full transition and bring the flag in
+// line, and if the state is already disabled we fix up the flag without a
+// no-op transition.
 func (c *Checker) DisableBackend(name string) (BackendSnapshot, bool) {
 	c.mu.Lock()
 	defer c.mu.Unlock()
@@ -417,7 +446,14 @@ func (c *Checker) DisableBackend(name string) (BackendSnapshot, bool) {
 	if !ok {
 		return BackendSnapshot{}, false
 	}
-	if !w.entry.Enabled {
+	if w.backend.State == health.StateDisabled {
+		// Already disabled at the state level; make sure the flag
+		// reflects reality without emitting a redundant transition.
+		w.entry.Enabled = false
+		if b, ok := c.cfg.Backends[name]; ok {
+			b.Enabled = false
+			c.cfg.Backends[name] = b
+		}
 		return BackendSnapshot{Health: w.backend, Config: w.entry}, true
 	}
 	maxHistory := c.cfg.HealthChecker.TransitionHistory
@@ -439,6 +475,12 @@ func (c *Checker) DisableBackend(name string) (BackendSnapshot, bool) {
 // EnableBackend re-enables a previously disabled backend. The existing
 // Backend struct is reused — its transition history is preserved — and a
 // fresh probe goroutine is launched. The backend re-enters StateUnknown.
+//
+// Preconditions are keyed on w.backend.State rather than w.entry.Enabled, so
+// drift between the two (most commonly caused by a Reload that reset the
+// flag while the worker state was still disabled) doesn't wedge the backend
+// — we always do the full transition when the state is disabled, and skip
+// it (while syncing the flag) when it's not.
 func (c *Checker) EnableBackend(name string) (BackendSnapshot, bool) {
 	c.mu.Lock()
 	defer c.mu.Unlock()
@@ -446,7 +488,13 @@ func (c *Checker) EnableBackend(name string) (BackendSnapshot, bool) {
 	if !ok {
 		return BackendSnapshot{}, false
 	}
-	if w.entry.Enabled {
+	if w.backend.State != health.StateDisabled {
+		// Not in the disabled state — just make the flag match.
+		w.entry.Enabled = true
+		if b, ok := c.cfg.Backends[name]; ok {
+			b.Enabled = true
+			c.cfg.Backends[name] = b
+		}
 		return BackendSnapshot{Health: w.backend, Config: w.entry}, true
 	}
 	w.entry.Enabled = true