Restart-neutral VPP LB sync; deterministic AS ordering; maglevt cadence; v0.9.5

Three reliability fixes bundled with docs updates.

Restart-neutral VPP LB sync via a startup warmup window
(internal/vpp/warmup.go). Before this, a maglevd restart would
immediately issue SyncLBStateAll with every backend still in
StateUnknown — mapped through BackendEffectiveWeight to weight
0 — and VPP would black-hole all new flows until the checker's
rise counters caught up, several seconds later. The new warmup
tracker owns a process-wide state machine gated by two config
knobs: vpp.lb.startup-min-delay (default 5s) is an absolute
hands-off window during which neither the periodic sync loop
nor the per-transition reconciler touches VPP;
vpp.lb.startup-max-delay (default 30s) is the watchdog for a per-VIP
release phase that runs between the two, releasing each frontend
as soon as every backend it references reaches a non-Unknown
state. At max-delay a final SyncLBStateAll runs for any stragglers
still in Unknown. Config reload does not reset the clock. Both
delays can be set to 0 to disable the warmup entirely. The
reconciler's suppressed-during-warmup events log at DEBUG so
operators can still see them with --log-level debug. Unit tests
cover the tracker state machine, allBackendsKnown precondition,
and the zero-delay escape hatch.
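
For illustration, a minimal Go sketch of a tracker with these
phases. All names below (warmupTracker, maySync, the phase
constants) are assumptions for exposition, not the actual API of
internal/vpp/warmup.go; only the gating semantics and the
allBackendsKnown precondition come from the description above.

```go
package vpp

import (
	"sync"
	"time"
)

type warmupPhase int

const (
	phaseHandsOff warmupPhase = iota // before startup-min-delay: nothing touches VPP
	phaseRelease                     // between min- and max-delay: per-VIP release
	phaseDone                        // after max-delay, or warmup disabled
)

// warmupTracker is an illustrative sketch; the real tracker in
// internal/vpp/warmup.go may be shaped differently.
type warmupTracker struct {
	mu       sync.Mutex
	start    time.Time       // process start; a config reload must not reset this
	minDelay time.Duration   // vpp.lb.startup-min-delay
	maxDelay time.Duration   // vpp.lb.startup-max-delay
	released map[string]bool // frontends already released to the reconciler
}

func newWarmupTracker(minDelay, maxDelay time.Duration) *warmupTracker {
	return &warmupTracker{
		start:    time.Now(),
		minDelay: minDelay,
		maxDelay: maxDelay,
		released: make(map[string]bool),
	}
}

func (w *warmupTracker) phase(now time.Time) warmupPhase {
	if w.minDelay == 0 && w.maxDelay == 0 {
		return phaseDone // zero-delay escape hatch: warmup disabled entirely
	}
	switch elapsed := now.Sub(w.start); {
	case elapsed < w.minDelay:
		return phaseHandsOff
	case elapsed < w.maxDelay:
		return phaseRelease
	default:
		return phaseDone // the driver runs the final SyncLBStateAll here
	}
}

// maySync reports whether a sync touching only vip may run now.
// allBackendsKnown is true once every backend the VIP references
// has left StateUnknown.
func (w *warmupTracker) maySync(vip string, allBackendsKnown bool) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	switch w.phase(time.Now()) {
	case phaseHandsOff:
		return false // suppressed; callers log the skipped event at DEBUG
	case phaseRelease:
		if !w.released[vip] && allBackendsKnown {
			w.released[vip] = true // release this frontend for SyncLBStateVIP
		}
		return w.released[vip]
	default:
		return true
	}
}
```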

Deterministic AS iteration in VPP LB sync. reconcileVIP and
recreateVIP now issue their lb_as_add_del / lb_as_set_weight
calls in numeric IP order (IPv4 before IPv6, ascending within
each family) via a new sortedIPKeys helper, instead of Go map
iteration order. VPP's LB plugin breaks per-bucket ties in the
Maglev lookup table by insertion position in its internal AS
vec, so without a stable call order two maglevd instances on
the same config could push identical AS sets into VPP in
different orders and produce divergent new-flow tables. Numeric
sort is used in preference to lexicographic so the sync log
stays human-readable: string order would place 10.0.0.10 before
10.0.0.2, and lexicographic v6 has the same problem. Unit tests cover empty,
single, v4/v6 numeric vs lexicographic, v4-before-v6 grouping,
a 1000-iteration stability loop against Go's randomised map
iteration, insertion-order invariance, and the desiredAS
call-site type.
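
A sketch of what such a helper can look like, assuming the
desiredAS map is keyed by net/netip addresses; the generic
signature is a guess, and only the ordering contract (IPv4 before
IPv6, numerically ascending within a family) comes from the change.

```go
package vpp

import (
	"net/netip"
	"sort"
)

// sortedIPKeys returns m's keys ordered IPv4-before-IPv6 and numerically
// ascending within each family, so lb_as_add_del / lb_as_set_weight calls
// are issued in the same order on every maglevd instance.
func sortedIPKeys[V any](m map[netip.Addr]V) []netip.Addr {
	keys := make([]netip.Addr, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		a, b := keys[i], keys[j]
		if a.Is4() != b.Is4() {
			return a.Is4() // group all IPv4 addresses first
		}
		return a.Less(b) // numeric order: 10.0.0.2 before 10.0.0.10
	})
	return keys
}
```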

maglevt interval fix. runProbeLoop used to sleep the full
jittered interval after every probe, so a 100ms --interval
with a 30ms probe actually produced a 130ms period. The sleep
now subtracts result.Duration so cadence matches the flag.
When a probe overruns the interval, the sleep clamps to zero and
the next probe fires immediately, without trying to catch up on
missed cycles: a slow backend doesn't get flooded with
back-to-back probes at the moment it's already struggling.
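
The fix itself reduces to a clamped subtraction. A sketch, where
nextSleep is a hypothetical name and only the subtract-and-clamp
behaviour comes from the change:

```go
package main

import "time"

// nextSleep returns how long the probe loop should pause so the overall
// cadence matches the jittered interval: the interval minus the time the
// probe itself took, clamped to zero when the probe overran.
func nextSleep(jitteredInterval, probeDuration time.Duration) time.Duration {
	if sleep := jitteredInterval - probeDuration; sleep > 0 {
		return sleep
	}
	return 0 // overrun: fire immediately, don't repay missed cycles
}
```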

Docs. config-guide now documents flush-on-down and the new
startup-min-delay / startup-max-delay knobs; user-guide's
maglevd section explains the restart-neutrality property, the
three warmup phases, and the relevant slog lines operators
should watch for during a bounce.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 11:25:53 +02:00
parent 695ebc4bd1
commit 6d78921edd
10 changed files with 1257 additions and 23 deletions


@@ -108,10 +108,36 @@ VPP's GRE encap needs a source address and every VIP `maglevd` programs uses GRE
* ***lb.flow-timeout***: Idle time after which an established flow is removed
from the table. Must be a whole number of seconds between `1s` and `120s`
inclusive. Defaults to `40s`.
* ***lb.startup-min-delay***: Absolute hands-off window at the start of the
`maglevd` process. For the first `startup-min-delay` seconds of the
process's life, no VPP LB sync of any kind is issued — neither the
periodic `SyncLBStateAll` loop nor the per-transition `SyncLBStateVIP`
path from the reconciler touches VPP. This makes a `maglevd` restart
dataplane-neutral: without this gate, the first sync would fire before
any probes had completed, every backend would still be in `StateUnknown`,
and every AS would be reprogrammed to weight 0 until the rise counters
caught up — producing a visible black-hole window of several seconds
on every restart. A non-negative Go duration. Must be `<= startup-max-delay`.
Defaults to `5s`. Set to `0s` (together with `startup-max-delay: 0s`) to
disable the warmup entirely and sync VPP immediately on startup.
* ***lb.startup-max-delay***: Watchdog for the per-VIP release phase that
follows `startup-min-delay`. Between `min-delay` and `max-delay`, each
frontend is released individually as soon as every backend it references
has reached a non-`Unknown` state, and a single `SyncLBStateVIP` runs
against the newly-released frontend. At `max-delay` the warmup driver
unconditionally runs `SyncLBStateAll`, marking the warmup phase complete
regardless of whether any backends are still `StateUnknown` — stragglers
get programmed at their current effective weight at that point, which
for a `StateUnknown` backend means weight 0. A non-negative Go duration.
Must be `>= startup-min-delay`. Defaults to `30s`. Set to `0s` together
with `startup-min-delay: 0s` to disable the warmup entirely.
These values are pushed to VPP via `lb_conf` when `maglevd` connects to
VPP and again after every config reload (whenever they change). A log line
`vpp-lb-conf-set` records the effective values. The `startup-*` settings
are latched at the first successful VPP connect and are not re-read on
subsequent config reloads — a reload that changes them only takes effect
on the next process start.
Example:
```yaml
@@ -123,6 +149,8 @@ maglev:
ipv6-src-address: 2001:db8::1
sticky-buckets-per-core: 65536
flow-timeout: 40s
startup-min-delay: 5s
startup-max-delay: 30s
```
---
@@ -313,6 +341,22 @@ ordered list of backend pools. The gRPC API exposes frontends by name.
application servers are deleted with flush, then the VIP itself is deleted) and
recreate it with the new value; VPP has no API to mutate `src_ip_sticky` on an
existing VIP, and existing flow state cannot be preserved across the flip.
* ***flush-on-down***: Boolean, default `true`. Controls what happens to existing
flows pinned to a backend that transitions to `down`. When `true`, maglevd
issues `lb_as_set_weight(is_flush=true)` on the down transition, clearing VPP's
flow-table entries for that backend so existing connections are torn down
immediately and reshuffle onto healthy backends. When `false`, the weight drops
to 0 (drain only): new flows skip the dead backend via the Maglev lookup table,
but existing sticky flows keep being steered at the dead IP until the client
retries, producing visible "connection refused" oscillations during an outage.
The default is `true` because the healthcheck's `rise` / `fall` counters already
absorb single-probe flaps, so a fall-counted `down` is almost always a real
outage where immediate session teardown is the safer behaviour. Set it to `false`
per frontend only when existing sessions are expensive enough that you'd rather
keep them pinned to a potentially-dead backend than reset them (e.g. long-lived
WebSocket connections with expensive reconnect logic). The `disabled` state
always flushes regardless of this flag — `disabled` is an explicit operator
signal that the backend is going away.
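A hedged Go sketch of the decision this flag controls; the function and
type names below are illustrative, not `maglevd`'s actual code:
```go
// Illustrative only: names are assumptions, not maglevd's real code.
type BackendState int

const (
	StateUp BackendState = iota
	StateDown
	StateDisabled
)

// flushFor decides whether the lb_as_set_weight call issued when a
// backend leaves service sets is_flush (tearing down existing flows)
// in addition to dropping the weight to 0.
func flushFor(state BackendState, flushOnDown bool) bool {
	switch state {
	case StateDisabled:
		return true // explicit operator signal: always flush
	case StateDown:
		return flushOnDown // per-frontend flush-on-down knob
	default:
		return false
	}
}
```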
Each pool has: