Frontend aggregate state: SPA-side derivation + checker fixes

The web UI showed the wrong up/down state for frontends whose pool
composition had been touched by a mix of runtime disable/enable and
weight changes: a frontend with every backend at effective_weight=0
would still display "up", while a sibling frontend with a serving
fallback backend would display "down". Two independent bugs, each
fixed at its own layer.

The first bug was in the checker. On the fast path
(healthCheckEqual returns true), Reload did `w.entry = b`,
blindly replacing the runtime worker entry with the
fresh YAML record. YAML's default for Enabled is true, so any
backend the operator had runtime-disabled would have its Enabled
flag silently reset while the worker's backend.State stayed at
StateDisabled. Subsequent EnableBackend calls then early-returned
on `if w.entry.Enabled` and never transitioned the state machine
— the CLI reported "enabled, state is 'disabled'" and the backend
was permanently stuck.
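
Why the reload clobbers the flag: the decoded record always
carries Enabled=true unless the YAML explicitly says otherwise.
A minimal sketch of one common way to get that default (the exact
mechanism in this repo is an assumption, not shown in this diff):

    // Assumed pattern: pre-seed Enabled before decoding, so an
    // absent "enabled:" key decodes to true. Every decoded record b
    // then arrives with Enabled=true unless the operator explicitly
    // wrote "enabled: false", which is why a blind `w.entry = b`
    // clobbers a runtime disable.
    b := config.Backend{Enabled: true}
    if err := yaml.Unmarshal(data, &b); err != nil {
        return err
    }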

Fix: preserve w.entry.Enabled across the fast-path replacement.

    // Fast path: same health check, so the worker is kept. Refresh
    // its entry, but carry the operator's runtime Enabled flag
    // forward across the replacement.
    runtimeEnabled := w.entry.Enabled
    w.entry = b
    w.entry.Enabled = runtimeEnabled

Runtime operator state now outlives config reloads. On the
worker-restart path (the health check changed), the new worker is
structurally fresh, so the YAML's Enabled field is still
authoritative there.

Both DisableBackend and EnableBackend used `w.entry.Enabled` as
their idempotency check, which meant a stuck Enabled=true,
State=disabled combination couldn't be repaired even after the
Reload fix (bad state written before the upgrade would simply
survive it). Both methods now key on `w.backend.State` instead
(sketched after the list below):

 - DisableBackend: if state == StateDisabled, sync the flag but
   don't emit a redundant transition; otherwise do the full
   state transition + flag flip + worker cancel.
 - EnableBackend: if state != StateDisabled, sync the flag but
   don't emit a redundant transition; otherwise do the full
   transition + flag flip + probe-goroutine restart.
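
A condensed sketch of the EnableBackend side (locking, the
cfg.Backends flag sync, and history trimming are elided here; the
full version is in the diff below):

    func (c *Checker) EnableBackend(name string) (BackendSnapshot, bool) {
        w, ok := c.workers[name]
        if !ok {
            return BackendSnapshot{}, false
        }
        if w.backend.State != health.StateDisabled {
            // Not disabled at the state level: repair the flag only.
            // No redundant transition, no new probe goroutine.
            w.entry.Enabled = true
            return BackendSnapshot{Health: w.backend, Config: w.entry}, true
        }
        w.entry.Enabled = true
        // ...full transition back to StateUnknown + probe restart...
        return BackendSnapshot{Health: w.backend, Config: w.entry}, true
    }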

Either method will now unstick any inconsistency between the
flag and the state machine: future drift (a panic, a code path we
haven't thought of) and backends already stuck from before this
commit are all repaired on the next enable/disable call.

The second bug: changing a backend's weight can flip a frontend
between up and down (e.g. zeroing the last non-zero-weighted
backend in the active pool), but SetFrontendPoolBackendWeight
never called updateFrontendState, so the checker's cached
frontend state drifted from reality until the next genuine
backend transition happened to trigger a recompute. The symptom:
`show frontends nginx-ip4-http` reported "up" even with every
effective_weight=0.

Fix: call c.updateFrontendState(frontendName, fe) after the
weight mutation, under the same lock. The recompute emits a
FrontendEvent transition if the aggregate flipped, so any
WatchEvents consumer picks up the change live.
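
The relevant tail of the method, condensed from the diff below:

    fe.Pools[i].Backends[backendName] = pb
    c.cfg.Frontends[frontendName] = fe
    // Still under c.mu: recompute the aggregate up/down/unknown
    // state so a weight change that empties (or refills) the
    // active pool flips the frontend immediately; a FrontendEvent
    // is emitted if the aggregate changed.
    c.updateFrontendState(frontendName, fe)
    return fe, nil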

In stores/state.ts, recomputeEffectiveWeights is renamed to
recomputeDerivedState and extended: it now also writes fe.state
using the same rule as health.ComputeFrontendState (unknown if
there are no backends or all are still unknown, up if any
effective weight > 0, down otherwise). It runs on every mutation
path (replaceAll, replaceSnapshot, applyBackendTransition,
applyConfiguredWeight), so the SPA is authoritative for *display*
state and doesn't inherit any staleness the server's cached
frontendStates map might have.
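
The shared rule, sketched in Go (the real implementation is
internal/health/weights.go ComputeFrontendState, mirrored by
recomputeDerivedState in stores/state.ts; the function name,
signature, and the flattened per-backend effective-weight map
here are assumptions made for brevity):

    // unknown: the frontend references no backends, or every one of
    // them is still unknown; up: any referenced backend holds a
    // non-zero effective weight; down: everything else.
    func computeFrontendState(fe config.Frontend, stateOf map[string]string, effective map[string]int) string {
        seenAny, allUnknown, anyEffective := false, true, false
        for _, pool := range fe.Pools {
            for name := range pool.Backends {
                seenAny = true
                if stateOf[name] != "unknown" {
                    allUnknown = false
                }
                if effective[name] > 0 {
                    anyEffective = true
                }
            }
        }
        switch {
        case !seenAny || allUnknown:
            return "unknown"
        case anyEffective:
            return "up"
        default:
            return "down"
        }
    }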

applyFrontendTransition is now a no-op for the state field —
the server's `to` value is no longer trusted because
recomputeDerivedState walks the local backends array on every
update and produces a fresh, correct answer. The reducer is kept
as a named function so sse.ts's dispatch table still has a
landing spot for "frontend" events (they still feed the
DebugPanel via pushEvent); the empty body is deliberate, not a
bug — a comment at the top spells it out.
2026-04-12 23:50:22 +02:00
parent 4347bb9b05
commit 1191b3d994
5 changed files with 125 additions and 34 deletions

Two file diffs suppressed because one or more lines are too long.

index.html

@@ -4,7 +4,7 @@
     <meta charset="UTF-8" />
     <meta name="viewport" content="width=device-width, initial-scale=1.0" />
     <title>maglev</title>
-    <script type="module" crossorigin src="/view/assets/index-BBNMNdtq.js"></script>
+    <script type="module" crossorigin src="/view/assets/index-C-XMkBf5.js"></script>
     <link rel="stylesheet" crossorigin href="/view/assets/index-CxDuAfMR.css">
   </head>
   <body>

stores/state.ts

@@ -8,22 +8,36 @@ import type {
 } from "../types";
 import { tick } from "./tick";
 
-// recomputeEffectiveWeights mirrors the server-side
-// health.EffectiveWeights / ActivePoolIndex logic so the SPA can keep
-// pool.effective_weight correct the moment a backend transitions,
-// without waiting for the 30s refresh. Walking every frontend is cheap
-// — O(frontends × pools × backends-per-pool) with tiny constants —
-// and it's strictly a function of the backend state map, so there's no
-// risk of drift vs. the server as long as the rule stays the same.
+// recomputeDerivedState mirrors the server-side
+// health.EffectiveWeights / ActivePoolIndex / ComputeFrontendState
+// logic so the SPA can keep pool.effective_weight AND the
+// per-frontend aggregate state correct the moment any backend
+// transitions or any configured weight changes, without waiting for
+// the 30s refresh. Walking every frontend is cheap — O(frontends ×
+// pools × backends-per-pool) with tiny constants — and it's
+// strictly a function of the backend state map + configured
+// weights, so there's no risk of drift vs. the server as long as
+// the rules stay identical. The SPA is the authoritative source of
+// truth for *display* state: the server's cached frontendStates
+// field can be stale (e.g. after a SetFrontendPoolBackendWeight
+// call that doesn't re-run updateFrontendState, or after a long-
+// lived WatchEvents stream where a past transition corrupted the
+// client's cache) and the SPA recomputes from its own live
+// backends array to avoid inheriting any staleness.
 //
-// Rule: a backend gets its configured pool weight iff it is up AND
-// belongs to the currently-active pool; everything else is 0. The
-// active pool is the first pool containing a backend that is both
-// up AND has a non-zero configured weight — a pool whose up backends
-// are all weight=0 contributes no serving capacity and gets skipped
-// over in priority failover. Kept in lock-step with
-// internal/health/weights.go.
-function recomputeEffectiveWeights(snap: StateSnapshot) {
+// Effective weight rule: a backend gets its configured pool weight
+// iff it is up AND belongs to the currently-active pool; everything
+// else is 0. The active pool is the first pool containing a backend
+// that is both up AND has a non-zero configured weight — a pool
+// whose up backends are all weight=0 contributes no serving
+// capacity and gets skipped over in priority failover. Kept in
+// lock-step with internal/health/weights.go ActivePoolIndex.
+//
+// Frontend state rule: unknown if no backends or every referenced
+// backend is still in StateUnknown; up if any backend in any pool
+// has effective_weight > 0; otherwise down. Kept in lock-step with
+// internal/health/weights.go ComputeFrontendState.
+function recomputeDerivedState(snap: StateSnapshot) {
   const stateOf: Record<string, string> = {};
   for (const b of snap.backends) stateOf[b.name] = b.state;
   for (const fe of snap.frontends) {
@@ -41,13 +55,30 @@ function recomputeEffectiveWeights(snap: StateSnapshot) {
         break;
       }
     }
+    let anyEffective = false;
+    let seenAny = false;
+    let allUnknown = true;
+    const seen = new Set<string>();
     for (let i = 0; i < fe.pools.length; i++) {
       for (const pb of fe.pools[i].backends) {
         const st = stateOf[pb.name];
         pb.effective_weight = st === "up" && i === activePool ? pb.weight : 0;
+        if (pb.effective_weight > 0) anyEffective = true;
+        if (!seen.has(pb.name)) {
+          seen.add(pb.name);
+          seenAny = true;
+          if (st !== "unknown") allUnknown = false;
+        }
       }
     }
+    if (!seenAny || allUnknown) {
+      fe.state = "unknown";
+    } else if (anyEffective) {
+      fe.state = "up";
+    } else {
+      fe.state = "down";
+    }
   }
 }
@@ -61,6 +92,14 @@ const [state, setState] = createStore<FrontendState>({ byName: {} });
 export { state };
 
 export function replaceSnapshot(snap: StateSnapshot) {
+  // Recompute effective weights + aggregate frontend state locally
+  // from the snapshot's backends array, rather than trusting the
+  // server's state field verbatim. The server can be stale (the
+  // checker's frontendStates map is only updated on backend
+  // transitions, not on weight changes), so deriving from our own
+  // backend data is the only way to guarantee the display stays
+  // consistent with reality.
+  recomputeDerivedState(snap);
   setState(
     produce((s) => {
       s.byName[snap.maglevd.name] = snap;
@@ -70,7 +109,10 @@ export function replaceSnapshot(snap: StateSnapshot) {
 export function replaceAll(snaps: StateSnapshot[]) {
   const byName: Record<string, StateSnapshot> = {};
-  for (const s of snaps) byName[s.maglevd.name] = s;
+  for (const s of snaps) {
+    recomputeDerivedState(s);
+    byName[s.maglevd.name] = s;
+  }
   setState({ byName });
 }
@@ -96,25 +138,26 @@ export function applyBackendTransition(maglevd: string, p: BackendEventPayload)
       }
       // A backend state change can shift which pool is active and
       // therefore which pool-memberships get non-zero effective
-      // weights. Recompute for every frontend — not just the one
+      // weights, and in turn can flip the frontend's aggregate
+      // state. Recompute for every frontend — not just the one
      // pointed at by this backend — because pool-failover is a
       // per-frontend computation and the same backend can appear in
       // multiple frontends with different pool placements.
-      recomputeEffectiveWeights(snap);
+      recomputeDerivedState(snap);
     }),
   );
 }
 
-export function applyFrontendTransition(maglevd: string, p: FrontendEventPayload) {
-  setState(
-    produce((s) => {
-      const snap = s.byName[maglevd];
-      if (!snap) return;
-      const fe = snap.frontends.find((x) => x.name === p.frontend);
-      if (!fe) return;
-      fe.state = p.transition.to;
-    }),
-  );
+// Frontend-transition events arrive from the server's checker, but
+// the SPA no longer trusts their `to` field — recomputeDerivedState
+// walks the local backends array on every backend event and every
+// hydration to produce an up-to-date frontend state that the server
+// can't make stale. Kept as a named reducer so sse.ts's dispatch
+// table still has a landing spot for "frontend" events (they also
+// flow into the DebugPanel via pushEvent); the body is deliberately
+// empty — not a bug.
+export function applyFrontendTransition(_maglevd: string, _p: FrontendEventPayload) {
+  // no-op — state is derived client-side, see recomputeDerivedState
 }
 
 export function applyVPPStatus(maglevd: string, state: string) {
@@ -165,7 +208,7 @@ export function applyConfiguredWeight(
       const pb = p.backends.find((x) => x.name === backend);
       if (!pb) return;
       pb.weight = weight;
-      recomputeEffectiveWeights(snap);
+      recomputeDerivedState(snap);
     }),
   );
 }

checker (Go)

@@ -138,8 +138,22 @@ func (c *Checker) Reload(ctx context.Context, cfg *config.Config) error {
 		hc := cfg.HealthChecks[b.HealthCheck]
 		if w, ok := c.workers[name]; ok {
 			if healthCheckEqual(w.hc, hc) {
-				// Update entry metadata (weight, etc.) in place without restart.
+				// Update entry metadata (address, healthcheck name)
+				// in place without restart. Preserve the runtime
+				// Enabled flag — the operator's
+				// PauseBackend/DisableBackend/EnableBackend state
+				// must outlive config reloads so an operator who
+				// disabled a backend and then reloaded config
+				// (e.g. to adjust weights on an unrelated
+				// frontend) doesn't find their disabled backend
+				// silently re-enabled while its worker state
+				// remains stuck at StateDisabled. The YAML's
+				// Enabled field is still authoritative on the
+				// worker-restart path below (where the backend
+				// is structurally new to this worker instance).
+				runtimeEnabled := w.entry.Enabled
 				w.entry = b
+				w.entry.Enabled = runtimeEnabled
 				continue
 			}
 			slog.Info("backend-restart", "backend", name)
@@ -237,6 +251,13 @@ func (c *Checker) GetFrontend(name string) (config.Frontend, bool) {
 // SetFrontendPoolBackendWeight updates the weight of a backend within a named
 // pool of a frontend. Returns the updated FrontendInfo and a descriptive error
 // if the frontend, pool, or backend is not found or the weight is out of range.
+//
+// After mutating the weight, updateFrontendState is re-run for the affected
+// frontend so the aggregate state reflects the new effective weights. A
+// weight change can flip a frontend between UP and DOWN (e.g. zeroing the
+// last non-zero-weighted backend in the active pool), and without this
+// call the checker's cached frontend state would drift from reality until
+// the next genuine backend transition happens to trigger a recompute.
 func (c *Checker) SetFrontendPoolBackendWeight(frontendName, poolName, backendName string, weight int) (config.Frontend, error) {
 	if weight < 0 || weight > 100 {
 		return config.Frontend{}, fmt.Errorf("weight %d out of range [0, 100]", weight)
@@ -259,6 +280,7 @@ func (c *Checker) SetFrontendPoolBackendWeight(frontendName, poolName, backendName string, weight int) (config.Frontend, error) {
 			fe.Pools[i].Backends[backendName] = pb
 			c.cfg.Frontends[frontendName] = fe
 			slog.Info("frontend-pool-weight", "frontend", frontendName, "pool", poolName, "backend", backendName, "weight", weight)
+			c.updateFrontendState(frontendName, fe)
 			return fe, nil
 		}
 	}
 	return config.Frontend{}, fmt.Errorf("pool %q not found in frontend %q", poolName, frontendName)
@@ -410,6 +432,13 @@ func (c *Checker) ResumeBackend(name string) (BackendSnapshot, error) {
 // DisableBackend stops health checking for a backend and removes it from active
 // rotation. The worker entry is kept in the map so the backend remains visible
 // via GetBackend and can be re-enabled with EnableBackend.
+//
+// Preconditions are keyed on w.backend.State rather than w.entry.Enabled so
+// that any drift between the two fields (e.g. a past Reload that reset the
+// flag without transitioning state) is self-healing: if the state is not
+// already disabled we always do the full transition and bring the flag in
+// line, and if the state is already disabled we fix up the flag without a
+// no-op transition.
 func (c *Checker) DisableBackend(name string) (BackendSnapshot, bool) {
 	c.mu.Lock()
 	defer c.mu.Unlock()
@@ -417,7 +446,14 @@ func (c *Checker) DisableBackend(name string) (BackendSnapshot, bool) {
 	if !ok {
 		return BackendSnapshot{}, false
 	}
-	if !w.entry.Enabled {
+	if w.backend.State == health.StateDisabled {
+		// Already disabled at the state level; make sure the flag
+		// reflects reality without emitting a redundant transition.
+		w.entry.Enabled = false
+		if b, ok := c.cfg.Backends[name]; ok {
+			b.Enabled = false
+			c.cfg.Backends[name] = b
+		}
 		return BackendSnapshot{Health: w.backend, Config: w.entry}, true
 	}
 	maxHistory := c.cfg.HealthChecker.TransitionHistory
@@ -439,6 +475,12 @@
 // EnableBackend re-enables a previously disabled backend. The existing
 // Backend struct is reused — its transition history is preserved — and a
 // fresh probe goroutine is launched. The backend re-enters StateUnknown.
+//
+// Preconditions are keyed on w.backend.State rather than w.entry.Enabled, so
+// drift between the two (most commonly caused by a Reload that reset the
+// flag while the worker state was still disabled) doesn't wedge the backend
+// — we always do the full transition when the state is disabled, and skip
+// it (while syncing the flag) when it's not.
 func (c *Checker) EnableBackend(name string) (BackendSnapshot, bool) {
 	c.mu.Lock()
 	defer c.mu.Unlock()
@@ -446,7 +488,13 @@ func (c *Checker) EnableBackend(name string) (BackendSnapshot, bool) {
 	if !ok {
 		return BackendSnapshot{}, false
 	}
-	if w.entry.Enabled {
+	if w.backend.State != health.StateDisabled {
+		// Not in the disabled state — just make the flag match.
+		w.entry.Enabled = true
+		if b, ok := c.cfg.Backends[name]; ok {
+			b.Enabled = true
+			c.cfg.Backends[name] = b
+		}
 		return BackendSnapshot{Health: w.backend, Config: w.entry}, true
 	}
 	w.entry.Enabled = true