Frontend aggregate state: SPA-side derive + checker fixes
The web UI showed the wrong up/down state for frontends whose pool
composition had been touched by a mix of runtime disable/enable and
weight changes: a frontend with every backend at effective_weight=0
would still display "up", while a sibling frontend with a serving
fallback backend would display "down". Two independent bugs, each
fixed on its own layer.
On the fast path (healthCheckEqual returns true), Reload did
`w.entry = b`, blindly replacing the runtime worker entry with the
fresh YAML record. YAML's default for Enabled is true, so any
backend the operator had runtime-disabled would have its Enabled
flag silently reset while the worker's backend.State stayed at
StateDisabled. Subsequent EnableBackend calls then early-returned
on `if w.entry.Enabled` and never transitioned the state machine
— the CLI reported "enabled, state is 'disabled'" and the backend
was permanently stuck.
Fix: preserve w.entry.Enabled across the fast-path replacement.
    runtimeEnabled := w.entry.Enabled
    w.entry = b
    w.entry.Enabled = runtimeEnabled
Runtime operator state now outlives config reloads. On the worker-
restart path (different health check) the new worker is
structurally fresh and the YAML's Enabled is still authoritative.
Both DisableBackend and EnableBackend used `w.entry.Enabled` as their
idempotency check, which meant a stuck `Enabled=true, State=disabled`
combination couldn't be repaired even after the Reload fix (bad state
that already existed would survive the upgrade). Both methods now key
on `w.backend.State`:
- DisableBackend: if state == StateDisabled, sync the flag but
don't emit a redundant transition; otherwise do the full
state transition + flag flip + worker cancel.
- EnableBackend: if state != StateDisabled, sync the flag but
don't emit a redundant transition; otherwise do the full
transition + flag flip + probe-goroutine restart.
Either method will now repair any inconsistency between the flag and
the state machine on the next enable/disable call — whether it comes
from future drift (a panic, a code path we haven't thought of) or
from backends already stuck before this commit.
Changing a backend's weight can flip a frontend between up and
down (e.g. zeroing the last non-zero-weighted backend in the
active pool), but SetFrontendPoolBackendWeight never called
updateFrontendState, so the checker's cached frontend state
would drift from reality until the next genuine backend
transition happened to trigger a recompute. The symptom was
"show frontends nginx-ip4-http" reporting up even with every
effective_weight=0.
Fix: call c.updateFrontendState(frontendName, fe) after the
weight mutation, under the same lock. The recompute emits a
FrontendEvent transition if the aggregate flipped, so any
WatchEvents consumer picks up the change live.
In stores/state.ts, recomputeEffectiveWeights is renamed to
recomputeDerivedState and extended: it now also writes fe.state using
the same rule as health.ComputeFrontendState — unknown if there are no
backends or all are unknown, up if any effective weight > 0, down
otherwise. It is called from every mutation path (replaceAll,
replaceSnapshot, applyBackendTransition, applyConfiguredWeight), so
the SPA is authoritative for *display* state and doesn't inherit any
staleness from the server's cached frontendStates map.
applyFrontendTransition is now a no-op for the state field —
the server's `to` value is no longer trusted because
recomputeDerivedState walks the local backends array on every
update and produces a fresh, correct answer. The reducer is kept
as a named function so sse.ts's dispatch table still has a
landing spot for "frontend" events (they still feed the
DebugPanel via pushEvent); the empty body is deliberate, not a
bug — a comment at the top spells it out.
@@ -138,8 +138,22 @@ func (c *Checker) Reload(ctx context.Context, cfg *config.Config) error {
 		hc := cfg.HealthChecks[b.HealthCheck]
 		if w, ok := c.workers[name]; ok {
 			if healthCheckEqual(w.hc, hc) {
-				// Update entry metadata (weight, etc.) in place without restart.
+				// Update entry metadata (address, healthcheck name)
+				// in place without restart. Preserve the runtime
+				// Enabled flag — the operator's
+				// PauseBackend/DisableBackend/EnableBackend state
+				// must outlive config reloads so an operator who
+				// disabled a backend and then reloaded config
+				// (e.g. to adjust weights on an unrelated
+				// frontend) doesn't find their disabled backend
+				// silently re-enabled while its worker state
+				// remains stuck at StateDisabled. The YAML's
+				// Enabled field is still authoritative on the
+				// worker-restart path below (where the backend
+				// is structurally new to this worker instance).
+				runtimeEnabled := w.entry.Enabled
 				w.entry = b
+				w.entry.Enabled = runtimeEnabled
 				continue
 			}
 			slog.Info("backend-restart", "backend", name)
@@ -237,6 +251,13 @@ func (c *Checker) GetFrontend(name string) (config.Frontend, bool) {
 // SetFrontendPoolBackendWeight updates the weight of a backend within a named
 // pool of a frontend. Returns the updated FrontendInfo and a descriptive error
 // if the frontend, pool, or backend is not found or the weight is out of range.
+//
+// After mutating the weight, updateFrontendState is re-run for the affected
+// frontend so the aggregate state reflects the new effective weights. A
+// weight change can flip a frontend between UP and DOWN (e.g. zeroing the
+// last non-zero-weighted backend in the active pool), and without this
+// call the checker's cached frontend state would drift from reality until
+// the next genuine backend transition happens to trigger a recompute.
 func (c *Checker) SetFrontendPoolBackendWeight(frontendName, poolName, backendName string, weight int) (config.Frontend, error) {
 	if weight < 0 || weight > 100 {
 		return config.Frontend{}, fmt.Errorf("weight %d out of range [0, 100]", weight)
@@ -259,6 +280,7 @@ func (c *Checker) SetFrontendPoolBackendWeight(frontendName, poolName, backendNa
 		fe.Pools[i].Backends[backendName] = pb
 		c.cfg.Frontends[frontendName] = fe
 		slog.Info("frontend-pool-weight", "frontend", frontendName, "pool", poolName, "backend", backendName, "weight", weight)
+		c.updateFrontendState(frontendName, fe)
 		return fe, nil
 	}
 	return config.Frontend{}, fmt.Errorf("pool %q not found in frontend %q", poolName, frontendName)
@@ -410,6 +432,13 @@ func (c *Checker) ResumeBackend(name string) (BackendSnapshot, error) {
 // DisableBackend stops health checking for a backend and removes it from active
 // rotation. The worker entry is kept in the map so the backend remains visible
 // via GetBackend and can be re-enabled with EnableBackend.
+//
+// Preconditions are keyed on w.backend.State rather than w.entry.Enabled so
+// that any drift between the two fields (e.g. a past Reload that reset the
+// flag without transitioning state) is self-healing: if the state is not
+// already disabled we always do the full transition and bring the flag in
+// line, and if the state is already disabled we fix up the flag without a
+// no-op transition.
 func (c *Checker) DisableBackend(name string) (BackendSnapshot, bool) {
 	c.mu.Lock()
 	defer c.mu.Unlock()
@@ -417,7 +446,14 @@ func (c *Checker) DisableBackend(name string) (BackendSnapshot, bool) {
 	if !ok {
 		return BackendSnapshot{}, false
 	}
-	if !w.entry.Enabled {
+	if w.backend.State == health.StateDisabled {
+		// Already disabled at the state level; make sure the flag
+		// reflects reality without emitting a redundant transition.
+		w.entry.Enabled = false
+		if b, ok := c.cfg.Backends[name]; ok {
+			b.Enabled = false
+			c.cfg.Backends[name] = b
+		}
 		return BackendSnapshot{Health: w.backend, Config: w.entry}, true
 	}
 	maxHistory := c.cfg.HealthChecker.TransitionHistory
@@ -439,6 +475,12 @@ func (c *Checker) DisableBackend(name string) (BackendSnapshot, bool) {
 // EnableBackend re-enables a previously disabled backend. The existing
 // Backend struct is reused — its transition history is preserved — and a
 // fresh probe goroutine is launched. The backend re-enters StateUnknown.
+//
+// Preconditions are keyed on w.backend.State rather than w.entry.Enabled, so
+// drift between the two (most commonly caused by a Reload that reset the
+// flag while the worker state was still disabled) doesn't wedge the backend
+// — we always do the full transition when the state is disabled, and skip
+// it (while syncing the flag) when it's not.
 func (c *Checker) EnableBackend(name string) (BackendSnapshot, bool) {
 	c.mu.Lock()
 	defer c.mu.Unlock()
@@ -446,7 +488,13 @@ func (c *Checker) EnableBackend(name string) (BackendSnapshot, bool) {
 	if !ok {
 		return BackendSnapshot{}, false
 	}
-	if w.entry.Enabled {
+	if w.backend.State != health.StateDisabled {
+		// Not in the disabled state — just make the flag match.
+		w.entry.Enabled = true
+		if b, ok := c.cfg.Backends[name]; ok {
+			b.Enabled = true
+			c.cfg.Backends[name] = b
+		}
 		return BackendSnapshot{Health: w.backend, Config: w.entry}, true
 	}
 	w.entry.Enabled = true