Restart-neutral VPP LB sync; deterministic AS ordering; maglevt cadence; v0.9.5

Three reliability fixes bundled with docs updates.

Restart-neutral VPP LB sync via a startup warmup window
(internal/vpp/warmup.go). Before this, a maglevd restart would
immediately issue SyncLBStateAll with every backend still in
StateUnknown — mapped through BackendEffectiveWeight to weight
0 — and VPP would black-hole all new flows until the checker's
rise counters caught up, several seconds later. The new warmup
tracker owns a process-wide state machine gated by two config
knobs: vpp.lb.startup-min-delay (default 5s) is an absolute
hands-off window during which neither the periodic sync loop
nor the per-transition reconciler touches VPP;
vpp.lb.startup-max-delay (default 30s) is the watchdog for a per-VIP
release phase that runs between the two, releasing each frontend
as soon as every backend it references reaches a non-Unknown
state. At max-delay a final SyncLBStateAll runs for any stragglers
still in Unknown. Config reload does not reset the clock. Both
delays can be set to 0 to disable the warmup entirely. The
reconciler's suppressed-during-warmup events log at DEBUG so
operators can still see them with --log-level debug. Unit tests
cover the tracker state machine, allBackendsKnown precondition,
and the zero-delay escape hatch.
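
For illustration, a minimal Go sketch of a tracker with these
phases. All names below (warmupTracker, maySync, the phase
constants) are assumptions for exposition, not the actual API of
internal/vpp/warmup.go; only the gating semantics and the
allBackendsKnown precondition come from the description above.

```go
package vpp

import (
	"sync"
	"time"
)

type warmupPhase int

const (
	phaseHandsOff warmupPhase = iota // before startup-min-delay: nothing touches VPP
	phaseRelease                     // between min- and max-delay: per-VIP release
	phaseDone                        // after max-delay, or warmup disabled
)

// warmupTracker is an illustrative sketch; the real tracker in
// internal/vpp/warmup.go may be shaped differently.
type warmupTracker struct {
	mu       sync.Mutex
	start    time.Time       // process start; a config reload must not reset this
	minDelay time.Duration   // vpp.lb.startup-min-delay
	maxDelay time.Duration   // vpp.lb.startup-max-delay
	released map[string]bool // frontends already released to the reconciler
}

func newWarmupTracker(minDelay, maxDelay time.Duration) *warmupTracker {
	return &warmupTracker{
		start:    time.Now(),
		minDelay: minDelay,
		maxDelay: maxDelay,
		released: make(map[string]bool),
	}
}

func (w *warmupTracker) phase(now time.Time) warmupPhase {
	if w.minDelay == 0 && w.maxDelay == 0 {
		return phaseDone // zero-delay escape hatch: warmup disabled entirely
	}
	switch elapsed := now.Sub(w.start); {
	case elapsed < w.minDelay:
		return phaseHandsOff
	case elapsed < w.maxDelay:
		return phaseRelease
	default:
		return phaseDone // the driver runs the final SyncLBStateAll here
	}
}

// maySync reports whether a sync touching only vip may run now.
// allBackendsKnown is true once every backend the VIP references
// has left StateUnknown.
func (w *warmupTracker) maySync(vip string, allBackendsKnown bool) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	switch w.phase(time.Now()) {
	case phaseHandsOff:
		return false // suppressed; callers log the skipped event at DEBUG
	case phaseRelease:
		if !w.released[vip] && allBackendsKnown {
			w.released[vip] = true // release this frontend for SyncLBStateVIP
		}
		return w.released[vip]
	default:
		return true
	}
}
```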

Deterministic AS iteration in VPP LB sync. reconcileVIP and
recreateVIP now issue their lb_as_add_del / lb_as_set_weight
calls in numeric IP order (IPv4 before IPv6, ascending within
each family) via a new sortedIPKeys helper, instead of Go map
iteration order. VPP's LB plugin breaks per-bucket ties in the
Maglev lookup table by insertion position in its internal AS
vec, so without a stable call order two maglevd instances on
the same config could push identical AS sets into VPP in
different orders and produce divergent new-flow tables. Numeric
sort is used in preference to lexicographic so the sync log
stays human-readable: string order would place 10.0.0.10 before
10.0.0.2, and lexicographic v6 has the same problem. Unit tests cover empty,
single, v4/v6 numeric vs lexicographic, v4-before-v6 grouping,
a 1000-iteration stability loop against Go's randomised map
iteration, insertion-order invariance, and the desiredAS
call-site type.
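
A sketch of what such a helper can look like, assuming the
desiredAS map is keyed by net/netip addresses; the generic
signature is a guess, and only the ordering contract (IPv4 before
IPv6, numerically ascending within a family) comes from the change.

```go
package vpp

import (
	"net/netip"
	"sort"
)

// sortedIPKeys returns m's keys ordered IPv4-before-IPv6 and numerically
// ascending within each family, so lb_as_add_del / lb_as_set_weight calls
// are issued in the same order on every maglevd instance.
func sortedIPKeys[V any](m map[netip.Addr]V) []netip.Addr {
	keys := make([]netip.Addr, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		a, b := keys[i], keys[j]
		if a.Is4() != b.Is4() {
			return a.Is4() // group all IPv4 addresses first
		}
		return a.Less(b) // numeric order: 10.0.0.2 before 10.0.0.10
	})
	return keys
}
```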

maglevt interval fix. runProbeLoop used to sleep the full
jittered interval after every probe, so a 100ms --interval
with a 30ms probe actually produced a 130ms period. The sleep
now subtracts result.Duration so cadence matches the flag.
When a probe overruns the interval, the sleep clamps to zero and
the next probe fires immediately, without trying to catch up on
missed cycles: a slow backend doesn't get flooded with
back-to-back probes at the moment it's already struggling.
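
The fix itself reduces to a clamped subtraction. A sketch, where
nextSleep is a hypothetical name and only the subtract-and-clamp
behaviour comes from the change:

```go
package main

import "time"

// nextSleep returns how long the probe loop should pause so the overall
// cadence matches the jittered interval: the interval minus the time the
// probe itself took, clamped to zero when the probe overran.
func nextSleep(jitteredInterval, probeDuration time.Duration) time.Duration {
	if sleep := jitteredInterval - probeDuration; sleep > 0 {
		return sleep
	}
	return 0 // overrun: fire immediately, don't repay missed cycles
}
```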

Docs. config-guide now documents flush-on-down and the new
startup-min-delay / startup-max-delay knobs; user-guide's
maglevd section explains the restart-neutrality property, the
three warmup phases, and the relevant slog lines operators
should watch for during a bounce.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 11:25:53 +02:00
parent 695ebc4bd1
commit 6d78921edd
10 changed files with 1257 additions and 23 deletions


@@ -108,10 +108,36 @@ VPP's GRE encap needs a source address and every VIP `maglevd` programs uses GRE
* ***lb.flow-timeout***: Idle time after which an established flow is removed
from the table. Must be a whole number of seconds between `1s` and `120s`
inclusive. Defaults to `40s`.
* ***lb.startup-min-delay***: Absolute hands-off window at the start of the
`maglevd` process. For the first `startup-min-delay` seconds of the
process's life, no VPP LB sync of any kind is issued — neither the
periodic `SyncLBStateAll` loop nor the per-transition `SyncLBStateVIP`
path from the reconciler touches VPP. This makes a `maglevd` restart
dataplane-neutral: without this gate, the first sync would fire before
any probes had completed, every backend would still be in `StateUnknown`,
and every AS would be reprogrammed to weight 0 until the rise counters
caught up — producing a visible black-hole window of several seconds
on every restart. A non-negative Go duration. Must be `<= startup-max-delay`.
Defaults to `5s`. Set to `0s` (together with `startup-max-delay: 0s`) to
disable the warmup entirely and sync VPP immediately on startup.
* ***lb.startup-max-delay***: Watchdog for the per-VIP release phase that
follows `startup-min-delay`. Between `min-delay` and `max-delay`, each
frontend is released individually as soon as every backend it references
has reached a non-`Unknown` state, and a single `SyncLBStateVIP` runs
against the newly-released frontend. At `max-delay` the warmup driver
unconditionally runs `SyncLBStateAll`, marking the warmup phase complete
regardless of whether any backends are still `StateUnknown` — stragglers
get programmed at their current effective weight at that point, which
for a `StateUnknown` backend means weight 0. A non-negative Go duration.
Must be `>= startup-min-delay`. Defaults to `30s`. Set to `0s` together
with `startup-min-delay: 0s` to disable the warmup entirely.
These values are pushed to VPP via `lb_conf` when `maglevd` connects to
VPP and again after every config reload (whenever they change). A log line
`vpp-lb-conf-set` records the effective values. The `startup-*` settings
are latched at the first successful VPP connect and are not re-read on
subsequent config reloads — a reload that changes them only takes effect
on the next process start.
Example:
```yaml
@@ -123,6 +149,8 @@ maglev:
ipv6-src-address: 2001:db8::1
sticky-buckets-per-core: 65536
flow-timeout: 40s
startup-min-delay: 5s
startup-max-delay: 30s
```
---
@@ -313,6 +341,22 @@ ordered list of backend pools. The gRPC API exposes frontends by name.
application servers are deleted with flush, then the VIP itself is deleted) and
recreate it with the new value; VPP has no API to mutate `src_ip_sticky` on an
existing VIP, and existing flow state cannot be preserved across the flip.
* ***flush-on-down***: Boolean, default `true`. Controls what happens to existing
flows pinned to a backend that transitions to `down`. When `true`, maglevd
issues `lb_as_set_weight(is_flush=true)` on the down transition, clearing VPP's
flow-table entries for that backend so existing connections are torn down
immediately and reshuffle onto healthy backends. When `false`, the weight drops
to 0 (drain only): new flows skip the dead backend via the Maglev lookup table,
but existing sticky flows keep being steered at the dead IP until the client
retries, producing visible "connection refused" oscillations during an outage.
The default is `true` because the healthcheck's `rise` / `fall` counters already
absorb single-probe flaps, so a fall-counted `down` is almost always a real
outage where immediate session teardown is the safer behaviour. Set it to `false`
per frontend only when existing sessions are expensive enough that you'd rather
keep them pinned to a potentially-dead backend than reset them (e.g. long-lived
WebSocket connections with expensive reconnect logic). The `disabled` state
always flushes regardless of this flag — `disabled` is an explicit operator
signal that the backend is going away.
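A hedged Go sketch of the decision this flag controls; the function and
type names below are illustrative, not `maglevd`'s actual code:
```go
// Illustrative only: names are assumptions, not maglevd's real code.
type BackendState int

const (
	StateUp BackendState = iota
	StateDown
	StateDisabled
)

// flushFor decides whether the lb_as_set_weight call issued when a
// backend leaves service sets is_flush (tearing down existing flows)
// in addition to dropping the weight to 0.
func flushFor(state BackendState, flushOnDown bool) bool {
	switch state {
	case StateDisabled:
		return true // explicit operator signal: always flush
	case StateDown:
		return flushOnDown // per-frontend flush-on-down knob
	default:
		return false
	}
}
```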
Each pool has: