Frontend aggregate state: SPA-side derive + checker fixes
The web UI showed the wrong up/down state for frontends whose pool
composition had been touched by a mix of runtime disable/enable and
weight changes: a frontend with every backend at effective_weight=0
would still display "up", while a sibling frontend with a serving
fallback backend would display "down". Two independent bugs, each
fixed on its own layer.
In the checker's Reload, the fast path (healthCheckEqual returns true) did
`w.entry = b`, blindly replacing the runtime worker entry with the
fresh YAML record. YAML's default for Enabled is true, so any
backend the operator had runtime-disabled would have its Enabled
flag silently reset while the worker's backend.State stayed at
StateDisabled. Subsequent EnableBackend calls then early-returned
on `if w.entry.Enabled` and never transitioned the state machine
— the CLI reported "enabled, state is 'disabled'" and the backend
was permanently stuck.
Fix: preserve w.entry.Enabled across the fast-path replacement:

    runtimeEnabled := w.entry.Enabled
    w.entry = b
    w.entry.Enabled = runtimeEnabled

Runtime operator state now outlives config reloads. On the worker-
restart path (different health check) the new worker is
structurally fresh and the YAML's Enabled is still authoritative.
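
For orientation, a condensed sketch of the per-backend reload loop as it
now behaves (paraphrased from the Reload hunk further down; surrounding
bookkeeping elided):

    // Sketch only: per-backend body of Reload, paraphrased from the diff.
    if w, ok := c.workers[name]; ok {
        if healthCheckEqual(w.hc, hc) {
            // Fast path: same health check, keep the running worker.
            // Preserve the operator's runtime Enabled flag across the
            // YAML replacement so a runtime-disabled backend stays disabled.
            runtimeEnabled := w.entry.Enabled
            w.entry = b
            w.entry.Enabled = runtimeEnabled
            continue
        }
        // Restart path: the health check changed, so the worker is torn
        // down and rebuilt, and the YAML's Enabled field is authoritative.
        slog.Info("backend-restart", "backend", name)
    }
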
DisableBackend and EnableBackend both used `w.entry.Enabled` as their
idempotency check, which meant a stuck `Enabled=true, State=disabled`
combination couldn't be repaired even after the Reload fix (bad state
created before the upgrade simply carried over). Both methods now key
on `w.backend.State`:
- DisableBackend: if state == StateDisabled, sync the flag but
don't emit a redundant transition; otherwise do the full
state transition + flag flip + worker cancel.
- EnableBackend: if state != StateDisabled, sync the flag but
don't emit a redundant transition; otherwise do the full
transition + flag flip + probe-goroutine restart.
Either method will now unstick any inconsistency between the
flag and the state machine — future drift from a panic, a new
code path we haven't thought of, or existing already-stuck
backends from before this commit are all repaired on the next
enable/disable call.
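
A minimal sketch of the new precondition shape, condensed from the
EnableBackend hunk below (DisableBackend mirrors it with the conditions
inverted):

    // Sketch only: EnableBackend's idempotency check now keys on the
    // worker's state machine, not the config flag, so flag/state drift
    // self-heals on the next call.
    if w.backend.State != health.StateDisabled {
        // Not actually disabled: bring the flag back in line and return
        // without emitting a redundant transition.
        w.entry.Enabled = true
        return BackendSnapshot{Health: w.backend, Config: w.entry}, true
    }
    // Genuinely disabled: full transition out of StateDisabled, flag flip,
    // and a fresh probe goroutine (elided).
    w.entry.Enabled = true
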
Changing a backend's weight can flip a frontend between up and
down (e.g. zeroing the last non-zero-weighted backend in the
active pool), but SetFrontendPoolBackendWeight never called
updateFrontendState, so the checker's cached frontend state
would drift from reality until the next genuine backend
transition happened to trigger a recompute. The symptom was
"show frontends nginx-ip4-http" reporting up even with every
effective_weight=0.
Fix: call c.updateFrontendState(frontendName, fe) after the
weight mutation, under the same lock. The recompute emits a
FrontendEvent transition if the aggregate flipped, so any
WatchEvents consumer picks up the change live.
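
Condensed from the hunk below, the success path now ends like this, still
under the checker's lock:

    // Sketch only: tail of SetFrontendPoolBackendWeight's success path.
    fe.Pools[i].Backends[backendName] = pb
    c.cfg.Frontends[frontendName] = fe
    // Recompute the aggregate immediately; emits a FrontendEvent if it flipped.
    c.updateFrontendState(frontendName, fe)
    return fe, nil
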
In stores/state.ts, recomputeEffectiveWeights is renamed to
recomputeDerivedState and extended: it now also writes fe.state
using the same rule as health.ComputeFrontendState:
unknown if no backends or all unknown, up if any effective
weight > 0, down otherwise. Called from every mutation path
(replaceAll, replaceSnapshot, applyBackendTransition,
applyConfiguredWeight) so the SPA is authoritative for *display*
state and doesn't inherit any staleness the server's cached
frontendStates map might have.
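
The shared rule is small enough to show as a standalone sketch (Go; the
helper name and types here are illustrative, not the real health package
API):

    // frontendState derives the aggregate display state from the backends a
    // frontend references: unknown if there are none or all are still unknown,
    // up if any has a non-zero effective weight, down otherwise.
    type poolBackend struct {
        State           string // "up", "down", "unknown", ...
        EffectiveWeight int
    }

    func frontendState(backends []poolBackend) string {
        if len(backends) == 0 {
            return "unknown"
        }
        anyEffective, allUnknown := false, true
        for _, b := range backends {
            if b.State != "unknown" {
                allUnknown = false
            }
            if b.EffectiveWeight > 0 {
                anyEffective = true
            }
        }
        switch {
        case allUnknown:
            return "unknown"
        case anyEffective:
            return "up"
        default:
            return "down"
        }
    }
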
applyFrontendTransition is now a no-op for the state field —
the server's `to` value is no longer trusted because
recomputeDerivedState walks the local backends array on every
update and produces a fresh, correct answer. The reducer is kept
as a named function so sse.ts's dispatch table still has a
landing spot for "frontend" events (they still feed the
DebugPanel via pushEvent); the empty body is deliberate, not a
bug — a comment at the top spells it out.
Files changed (the vendored dist assets are regenerated build output):

    cmd/frontend/web/dist/assets/index-C-XMkBf5.js  (vendored new file, 1 line; diff suppressed: line too long)
    cmd/frontend/web/dist/index.html                (vendored, 2 lines changed)
cmd/frontend/web/dist/index.html:

@@ -4,7 +4,7 @@
     <meta charset="UTF-8" />
     <meta name="viewport" content="width=device-width, initial-scale=1.0" />
     <title>maglev</title>
-    <script type="module" crossorigin src="/view/assets/index-BBNMNdtq.js"></script>
+    <script type="module" crossorigin src="/view/assets/index-C-XMkBf5.js"></script>
     <link rel="stylesheet" crossorigin href="/view/assets/index-CxDuAfMR.css">
   </head>
   <body>
stores/state.ts:

@@ -8,22 +8,36 @@ import type {
 } from "../types";
 import { tick } from "./tick";
 
-// recomputeEffectiveWeights mirrors the server-side
-// health.EffectiveWeights / ActivePoolIndex logic so the SPA can keep
-// pool.effective_weight correct the moment a backend transitions,
-// without waiting for the 30s refresh. Walking every frontend is cheap
-// — O(frontends × pools × backends-per-pool) with tiny constants —
-// and it's strictly a function of the backend state map, so there's no
-// risk of drift vs. the server as long as the rule stays the same.
+// recomputeDerivedState mirrors the server-side
+// health.EffectiveWeights / ActivePoolIndex / ComputeFrontendState
+// logic so the SPA can keep pool.effective_weight AND the
+// per-frontend aggregate state correct the moment any backend
+// transitions or any configured weight changes, without waiting for
+// the 30s refresh. Walking every frontend is cheap — O(frontends ×
+// pools × backends-per-pool) with tiny constants — and it's
+// strictly a function of the backend state map + configured
+// weights, so there's no risk of drift vs. the server as long as
+// the rules stay identical. The SPA is the authoritative source of
+// truth for *display* state: the server's cached frontendStates
+// field can be stale (e.g. after a SetFrontendPoolBackendWeight
+// call that doesn't re-run updateFrontendState, or after a long-
+// lived WatchEvents stream where a past transition corrupted the
+// client's cache) and the SPA recomputes from its own live
+// backends array to avoid inheriting any staleness.
 //
-// Rule: a backend gets its configured pool weight iff it is up AND
-// belongs to the currently-active pool; everything else is 0. The
-// active pool is the first pool containing a backend that is both
-// up AND has a non-zero configured weight — a pool whose up backends
-// are all weight=0 contributes no serving capacity and gets skipped
-// over in priority failover. Kept in lock-step with
-// internal/health/weights.go.
-function recomputeEffectiveWeights(snap: StateSnapshot) {
+// Effective weight rule: a backend gets its configured pool weight
+// iff it is up AND belongs to the currently-active pool; everything
+// else is 0. The active pool is the first pool containing a backend
+// that is both up AND has a non-zero configured weight — a pool
+// whose up backends are all weight=0 contributes no serving
+// capacity and gets skipped over in priority failover. Kept in
+// lock-step with internal/health/weights.go ActivePoolIndex.
+//
+// Frontend state rule: unknown if no backends or every referenced
+// backend is still in StateUnknown; up if any backend in any pool
+// has effective_weight > 0; otherwise down. Kept in lock-step with
+// internal/health/weights.go ComputeFrontendState.
+function recomputeDerivedState(snap: StateSnapshot) {
   const stateOf: Record<string, string> = {};
   for (const b of snap.backends) stateOf[b.name] = b.state;
   for (const fe of snap.frontends) {
@@ -41,13 +55,30 @@ function recomputeEffectiveWeights(snap: StateSnapshot) {
         break;
       }
     }
+    let anyEffective = false;
+    let seenAny = false;
+    let allUnknown = true;
+    const seen = new Set<string>();
     for (let i = 0; i < fe.pools.length; i++) {
       for (const pb of fe.pools[i].backends) {
         const st = stateOf[pb.name];
         pb.effective_weight = st === "up" && i === activePool ? pb.weight : 0;
+        if (pb.effective_weight > 0) anyEffective = true;
+        if (!seen.has(pb.name)) {
+          seen.add(pb.name);
+          seenAny = true;
+          if (st !== "unknown") allUnknown = false;
+        }
       }
     }
+    if (!seenAny || allUnknown) {
+      fe.state = "unknown";
+    } else if (anyEffective) {
+      fe.state = "up";
+    } else {
+      fe.state = "down";
+    }
   }
 }
 
 // FrontendState keys snapshots by maglevd name. A single store drives the
@@ -61,6 +92,14 @@ const [state, setState] = createStore<FrontendState>({ byName: {} });
 export { state };
 
 export function replaceSnapshot(snap: StateSnapshot) {
+  // Recompute effective weights + aggregate frontend state locally
+  // from the snapshot's backends array, rather than trusting the
+  // server's state field verbatim. The server can be stale (the
+  // checker's frontendStates map is only updated on backend
+  // transitions, not on weight changes), so deriving from our own
+  // backend data is the only way to guarantee the display stays
+  // consistent with reality.
+  recomputeDerivedState(snap);
   setState(
     produce((s) => {
       s.byName[snap.maglevd.name] = snap;
@@ -70,7 +109,10 @@ export function replaceSnapshot(snap: StateSnapshot) {
 
 export function replaceAll(snaps: StateSnapshot[]) {
   const byName: Record<string, StateSnapshot> = {};
-  for (const s of snaps) byName[s.maglevd.name] = s;
+  for (const s of snaps) {
+    recomputeDerivedState(s);
+    byName[s.maglevd.name] = s;
+  }
   setState({ byName });
 }
 
@@ -96,25 +138,26 @@ export function applyBackendTransition(maglevd: string, p: BackendEventPayload)
       }
       // A backend state change can shift which pool is active and
       // therefore which pool-memberships get non-zero effective
-      // weights. Recompute for every frontend — not just the one
+      // weights, and in turn can flip the frontend's aggregate
+      // state. Recompute for every frontend — not just the one
       // pointed at by this backend — because pool-failover is a
       // per-frontend computation and the same backend can appear in
       // multiple frontends with different pool placements.
-      recomputeEffectiveWeights(snap);
+      recomputeDerivedState(snap);
     }),
   );
 }
 
-export function applyFrontendTransition(maglevd: string, p: FrontendEventPayload) {
-  setState(
-    produce((s) => {
-      const snap = s.byName[maglevd];
-      if (!snap) return;
-      const fe = snap.frontends.find((x) => x.name === p.frontend);
-      if (!fe) return;
-      fe.state = p.transition.to;
-    }),
-  );
+// Frontend-transition events arrive from the server's checker, but
+// the SPA no longer trusts their `to` field — recomputeDerivedState
+// walks the local backends array on every backend event and every
+// hydration to produce an up-to-date frontend state that the server
+// can't make stale. Kept as a named reducer so sse.ts's dispatch
+// table still has a landing spot for "frontend" events (they also
+// flow into the DebugPanel via pushEvent); the body is deliberately
+// empty — not a bug.
+export function applyFrontendTransition(_maglevd: string, _p: FrontendEventPayload) {
+  // no-op — state is derived client-side, see recomputeDerivedState
 }
 
 export function applyVPPStatus(maglevd: string, state: string) {
@@ -165,7 +208,7 @@ export function applyConfiguredWeight(
       const pb = p.backends.find((x) => x.name === backend);
       if (!pb) return;
       pb.weight = weight;
-      recomputeEffectiveWeights(snap);
+      recomputeDerivedState(snap);
     }),
   );
 }
Checker (Go):

@@ -138,8 +138,22 @@ func (c *Checker) Reload(ctx context.Context, cfg *config.Config) error {
 		hc := cfg.HealthChecks[b.HealthCheck]
 		if w, ok := c.workers[name]; ok {
 			if healthCheckEqual(w.hc, hc) {
-				// Update entry metadata (weight, etc.) in place without restart.
+				// Update entry metadata (address, healthcheck name)
+				// in place without restart. Preserve the runtime
+				// Enabled flag — the operator's
+				// PauseBackend/DisableBackend/EnableBackend state
+				// must outlive config reloads so an operator who
+				// disabled a backend and then reloaded config
+				// (e.g. to adjust weights on an unrelated
+				// frontend) doesn't find their disabled backend
+				// silently re-enabled while its worker state
+				// remains stuck at StateDisabled. The YAML's
+				// Enabled field is still authoritative on the
+				// worker-restart path below (where the backend
+				// is structurally new to this worker instance).
+				runtimeEnabled := w.entry.Enabled
 				w.entry = b
+				w.entry.Enabled = runtimeEnabled
 				continue
 			}
 			slog.Info("backend-restart", "backend", name)
@@ -237,6 +251,13 @@ func (c *Checker) GetFrontend(name string) (config.Frontend, bool) {
 // SetFrontendPoolBackendWeight updates the weight of a backend within a named
 // pool of a frontend. Returns the updated FrontendInfo and a descriptive error
 // if the frontend, pool, or backend is not found or the weight is out of range.
+//
+// After mutating the weight, updateFrontendState is re-run for the affected
+// frontend so the aggregate state reflects the new effective weights. A
+// weight change can flip a frontend between UP and DOWN (e.g. zeroing the
+// last non-zero-weighted backend in the active pool), and without this
+// call the checker's cached frontend state would drift from reality until
+// the next genuine backend transition happens to trigger a recompute.
 func (c *Checker) SetFrontendPoolBackendWeight(frontendName, poolName, backendName string, weight int) (config.Frontend, error) {
 	if weight < 0 || weight > 100 {
 		return config.Frontend{}, fmt.Errorf("weight %d out of range [0, 100]", weight)
@@ -259,6 +280,7 @@ func (c *Checker) SetFrontendPoolBackendWeight(frontendName, poolName, backendNa
 			fe.Pools[i].Backends[backendName] = pb
 			c.cfg.Frontends[frontendName] = fe
 			slog.Info("frontend-pool-weight", "frontend", frontendName, "pool", poolName, "backend", backendName, "weight", weight)
+			c.updateFrontendState(frontendName, fe)
 			return fe, nil
 		}
 	return config.Frontend{}, fmt.Errorf("pool %q not found in frontend %q", poolName, frontendName)
@@ -410,6 +432,13 @@ func (c *Checker) ResumeBackend(name string) (BackendSnapshot, error) {
 // DisableBackend stops health checking for a backend and removes it from active
 // rotation. The worker entry is kept in the map so the backend remains visible
 // via GetBackend and can be re-enabled with EnableBackend.
+//
+// Preconditions are keyed on w.backend.State rather than w.entry.Enabled so
+// that any drift between the two fields (e.g. a past Reload that reset the
+// flag without transitioning state) is self-healing: if the state is not
+// already disabled we always do the full transition and bring the flag in
+// line, and if the state is already disabled we fix up the flag without a
+// no-op transition.
 func (c *Checker) DisableBackend(name string) (BackendSnapshot, bool) {
 	c.mu.Lock()
 	defer c.mu.Unlock()
@@ -417,7 +446,14 @@ func (c *Checker) DisableBackend(name string) (BackendSnapshot, bool) {
 	if !ok {
 		return BackendSnapshot{}, false
 	}
-	if !w.entry.Enabled {
+	if w.backend.State == health.StateDisabled {
+		// Already disabled at the state level; make sure the flag
+		// reflects reality without emitting a redundant transition.
+		w.entry.Enabled = false
+		if b, ok := c.cfg.Backends[name]; ok {
+			b.Enabled = false
+			c.cfg.Backends[name] = b
+		}
 		return BackendSnapshot{Health: w.backend, Config: w.entry}, true
 	}
 	maxHistory := c.cfg.HealthChecker.TransitionHistory
@@ -439,6 +475,12 @@ func (c *Checker) DisableBackend(name string) (BackendSnapshot, bool) {
 // EnableBackend re-enables a previously disabled backend. The existing
 // Backend struct is reused — its transition history is preserved — and a
 // fresh probe goroutine is launched. The backend re-enters StateUnknown.
+//
+// Preconditions are keyed on w.backend.State rather than w.entry.Enabled, so
+// drift between the two (most commonly caused by a Reload that reset the
+// flag while the worker state was still disabled) doesn't wedge the backend
+// — we always do the full transition when the state is disabled, and skip
+// it (while syncing the flag) when it's not.
 func (c *Checker) EnableBackend(name string) (BackendSnapshot, bool) {
 	c.mu.Lock()
 	defer c.mu.Unlock()
@@ -446,7 +488,13 @@ func (c *Checker) EnableBackend(name string) (BackendSnapshot, bool) {
 	if !ok {
 		return BackendSnapshot{}, false
 	}
-	if w.entry.Enabled {
+	if w.backend.State != health.StateDisabled {
+		// Not in the disabled state — just make the flag match.
+		w.entry.Enabled = true
+		if b, ok := c.cfg.Backends[name]; ok {
+			b.Enabled = true
+			c.cfg.Backends[name] = b
+		}
 		return BackendSnapshot{Health: w.backend, Config: w.entry}, true
 	}
 	w.entry.Enabled = true