Three reliability fixes bundled with docs updates.

Restart-neutral VPP LB sync via a startup warmup window (internal/vpp/warmup.go). Before this, a maglevd restart would immediately issue SyncLBStateAll with every backend still in StateUnknown (mapped through BackendEffectiveWeight to weight 0), and VPP would black-hole all new flows until the checker's rise counters caught up, several seconds later. The new warmup tracker owns a process-wide state machine gated by two config knobs: vpp.lb.startup-min-delay (default 5s) is an absolute hands-off window during which neither the periodic sync loop nor the per-transition reconciler touches VPP; vpp.lb.startup-max-delay (default 30s) is the watchdog for a per-VIP release phase that runs between the two, releasing each frontend as soon as every backend it references reaches a non-Unknown state. At max-delay a final SyncLBStateAll runs for any stragglers still in Unknown. Config reload does not reset the clock. Both delays can be set to 0 to disable the warmup entirely. The reconciler's suppressed-during-warmup events log at DEBUG so operators can still see them with --log-level debug. Unit tests cover the tracker state machine, allBackendsKnown precondition, and the zero-delay escape hatch.

Deterministic AS iteration in VPP LB sync. reconcileVIP and recreateVIP now issue their lb_as_add_del / lb_as_set_weight calls in numeric IP order (IPv4 before IPv6, ascending within each family) via a new sortedIPKeys helper, instead of Go map iteration order. VPP's LB plugin breaks per-bucket ties in the Maglev lookup table by insertion position in its internal AS vec, so without a stable call order two maglevd instances on the same config could push identical AS sets into VPP in different orders and produce divergent new-flow tables. Numeric sort is used in preference to lexicographic so the sync log stays human-readable: string order would place 10.0.0.10 before 10.0.0.2, and the same problem exists in v6. Unit tests cover empty, single, v4/v6 numeric vs lexicographic, v4-before-v6 grouping, a 1000-iteration stability loop against Go's randomised map iteration, insertion-order invariance, and the desiredAS call-site type.

maglevt interval fix. runProbeLoop used to sleep the full jittered interval after every probe, so a 100ms --interval with a 30ms probe actually produced a 130ms period. The sleep now subtracts result.Duration so cadence matches the flag. Probes that overrun clamp sleep to zero and fire the next probe immediately without trying to catch up on missed cycles, so a slow backend doesn't get flooded with back-to-back probes at the moment it's already struggling.

Docs. config-guide now documents flush-on-down and the new startup-min-delay / startup-max-delay knobs; user-guide's maglevd section explains the restart-neutrality property, the three warmup phases, and the relevant slog lines operators should watch for during a bounce.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
maglevd Configuration Guide
Overview
maglevd consumes a YAML configuration file of a specific format. Validation is performed
in two stages:
- Structural parsing: the YAML is unmarshalled into typed Go structs. Unknown fields and type mismatches are rejected immediately.
- Semantic validation: cross-field and cross-object rules are enforced, for example ensuring that every backend referenced by a frontend exists, that address families are consistent within a frontend, and that IP source addresses are the correct family.
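As a rough illustration of the semantic stage, the sketch below checks two of the rules named above: every referenced backend must exist, and all backends in a frontend must share one address family. The types and the validateFrontend function are hypothetical simplifications made for this example, not maglevd's actual structs.

```go
package main

import (
	"fmt"
	"net/netip"
)

// Hypothetical, simplified config types; the real maglevd structs may differ.
type Backend struct{ Address string }

type Frontend struct {
	Address string
	Pools   [][]string // each pool is a list of backend names
}

// validateFrontend applies two semantic rules: every referenced backend
// must exist, and all backends in one frontend must share an address family.
func validateFrontend(name string, fe Frontend, backends map[string]Backend) error {
	var wantV4, seen bool
	for _, pool := range fe.Pools {
		for _, bn := range pool {
			b, ok := backends[bn]
			if !ok {
				return fmt.Errorf("frontend %q: unknown backend %q", name, bn)
			}
			addr, err := netip.ParseAddr(b.Address)
			if err != nil {
				return fmt.Errorf("backend %q: %w", bn, err)
			}
			isV4 := addr.Is4()
			if seen && isV4 != wantV4 {
				return fmt.Errorf("frontend %q: mixed address families", name)
			}
			wantV4, seen = isV4, true
		}
	}
	return nil
}

func main() {
	backends := map[string]Backend{
		"a": {Address: "198.51.100.10"},
		"b": {Address: "2001:db8::10"},
	}
	fe := Frontend{Address: "198.51.100.1", Pools: [][]string{{"a", "b"}}}
	// Pool mixes an IPv4 and an IPv6 backend, so an error is reported.
	fmt.Println(validateFrontend("web", fe, backends))
}
```

Keeping these checks separate from YAML unmarshalling means a check-only run can report cross-object problems without starting any probes.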
If you want to get started quickly, take a look at the example config.
Basic structure
The YAML configuration file has the following top-level structure:
maglev:
  healthchecker:
    [ Global health checker settings ]
  vpp:
    lb:
      [ VPP load-balancer integration settings ]
  healthchecks:
    my-check:
      [ Health check definition ]
  backends:
    my-backend:
      [ Backend definition ]
  frontends:
    my-frontend:
      [ Frontend (VIP) definition ]
All five sections live under the top-level maglev: key. The healthchecks, backends,
and frontends sections are maps keyed by an arbitrary name of your choosing. Names must be
unique within their section and are case-sensitive. The vpp section is required when
maglevd has a working VPP connection — its lb.ipv4-src-address and lb.ipv6-src-address
fields are mandatory and maglevd will refuse to start without them.
healthchecker
Global settings for the health checker engine.
- transition-history: An integer >= 1 that controls how many state transitions are retained per backend for display via the gRPC API. Defaults to 5.
- netns: The name of a Linux network namespace in which probes are executed. When empty or omitted, probes run in the current (default) network namespace. Useful when backends are reachable only through a dedicated dataplane namespace.

  Capability requirement: setting this field makes maglevd call setns(CLONE_NEWNET) on the probe thread before each probe, which the kernel only permits to processes holding CAP_SYS_ADMIN in the target namespace's user namespace (setns(2)). The Debian systemd unit (vpp-maglev.service) already grants this capability; if you run maglevd by hand under a non-root user, make sure the binary has CAP_SYS_ADMIN via setcap cap_net_raw,cap_sys_admin=eip /usr/sbin/maglevd or equivalent, otherwise every probe fails with enter netns "<name>": operation not permitted and all backends transition to down on their first probe.

  Also make sure the named namespace is mounted under /var/run/netns/ (which is where ip netns add puts it) and that it is readable by the user maglevd runs as; the default mode from ip netns add is 0644, which is fine for any user.
Example:
maglev:
  healthchecker:
    transition-history: 10
    netns: dataplane
vpp
Settings controlling the integration with a locally running VPP instance. The
vpp section is a map with a single sub-section, lb. Both lb.ipv4-src-address
and lb.ipv6-src-address are required — maglevd --check exits with a
semantic error and the daemon refuses to start when either is missing, because
VPP's GRE encap needs a source address and every VIP maglevd programs uses GRE.
- lb.ipv4-src-address: Required. The IPv4 source address VPP uses when encapsulating IPv4 traffic into GRE4 tunnels to application servers. Must be a valid IPv4 address. No default.
- lb.ipv6-src-address: Required. The IPv6 source address VPP uses when encapsulating IPv6 traffic into GRE6 tunnels. Must be a valid IPv6 address. No default.
- lb.sync-interval: A positive Go duration (e.g. 30s, 1m) controlling how often maglevd reconciles the VPP load-balancer dataplane against its running configuration. On startup, a full sync runs once the warmup window described under lb.startup-min-delay and lb.startup-max-delay allows it (immediately if both are 0s); subsequent syncs fire at this interval as long as the VPP connection is up. Defaults to 30s. The purpose is to catch drift (for example, a VIP added to VPP by hand) and bring VPP back in line with the maglev config.
- lb.sticky-buckets-per-core: The number of buckets per worker thread in the established-flow table. Must be a power of 2. Defaults to 65536 (64k).
- lb.flow-timeout: Idle time after which an established flow is removed from the table. Must be a whole number of seconds between 1s and 120s inclusive. Defaults to 40s.
- lb.startup-min-delay: Absolute hands-off window at the start of the maglevd process. For the first startup-min-delay seconds of the process's life, no VPP LB sync of any kind is issued: neither the periodic SyncLBStateAll loop nor the per-transition SyncLBStateVIP path from the reconciler touches VPP. This makes a maglevd restart dataplane-neutral: without this gate, the first sync would fire before any probes had completed, every backend would still be in StateUnknown, and every AS would be reprogrammed to weight 0 until the rise counters caught up, producing a visible black-hole window of several seconds on every restart. A non-negative Go duration. Must be <= startup-max-delay. Defaults to 5s. Set to 0s (together with startup-max-delay: 0s) to disable the warmup entirely and sync VPP immediately on startup.
- lb.startup-max-delay: Watchdog for the per-VIP release phase that follows startup-min-delay. Between min-delay and max-delay, each frontend is released individually as soon as every backend it references has reached a non-Unknown state, and a single SyncLBStateVIP runs against the newly-released frontend. At max-delay the warmup driver unconditionally runs SyncLBStateAll, marking the warmup phase complete regardless of whether any backends are still StateUnknown; stragglers get programmed at their current effective weight at that point, which for a StateUnknown backend means weight 0. A non-negative Go duration. Must be >= startup-min-delay. Defaults to 30s. Set to 0s together with startup-min-delay: 0s to disable the warmup entirely.
These values are pushed to VPP via lb_conf when maglevd connects to
VPP and again after every config reload (whenever they change). A log line
vpp-lb-conf-set records the effective values. The startup-* settings
are latched at the first successful VPP connect and are not re-read on
subsequent config reloads — a reload that changes them only takes effect
on the next process start.
Example:
maglev:
  vpp:
    lb:
      sync-interval: 60s
      ipv4-src-address: 10.0.0.1
      ipv6-src-address: 2001:db8::1
      sticky-buckets-per-core: 65536
      flow-timeout: 40s
      startup-min-delay: 5s
      startup-max-delay: 30s
healthchecks
A named map of health check definitions. Each health check describes how to probe a backend. Backends reference health checks by name. The same health check can be reused across any number of backends; each backend is probed exactly once regardless of how many frontends reference it.
Common fields (all types):
- type: Required. One of icmp, tcp, http, or https.
- port: The destination port to probe. Required for tcp, http, and https. Must be omitted for icmp.
- probe-ipv4-src: An optional IPv4 source address used when probing IPv4 backends. Must be an IPv4 address. When omitted, the OS chooses the source address.
- probe-ipv6-src: An optional IPv6 source address used when probing IPv6 backends. Must be an IPv6 address. When omitted, the OS chooses the source address.
- interval: Required. A positive Go duration string (e.g. 2s, 500ms) controlling how often a probe is sent when the backend is fully healthy (counter at maximum).
- fast-interval: Optional. A positive duration used instead of interval while the backend's health counter is degraded (between down and up) or in unknown state. When omitted, interval is used.
- down-interval: Optional. A positive duration used instead of interval while the backend is fully down (counter at zero). When omitted, interval is used. Setting this to a longer value reduces probe traffic to backends that are known to be offline.
- timeout: Required. A positive duration after which an in-flight probe is abandoned and counted as a failure.
- rise: The number of consecutive successes required to transition from down to up. Defaults to 2. Must be >= 1.
- fall: The number of consecutive failures required to transition from up to down. Defaults to 3. Must be >= 1.
type: icmp
Sends an ICMP echo request (ping) to the backend address. Requires CAP_NET_RAW. No port
may be specified. No params block is used.
healthchecks:
  ping:
    type: icmp
    probe-ipv4-src: 10.0.0.1
    probe-ipv6-src: 2001:db8::1
    interval: 2s
    timeout: 1s
    rise: 2
    fall: 3
type: tcp
Opens a TCP connection to the backend and immediately closes it upon success. Use params to
optionally wrap the connection in TLS.
- params.ssl: A boolean. When true, a TLS handshake is performed after the TCP connection is established. Defaults to false.
- params.server-name: The TLS SNI hostname sent during the handshake. When omitted, the backend IP address is used.
- params.insecure-skip-verify: A boolean. When true, the TLS certificate presented by the server is not verified. Defaults to false.
healthchecks:
  imaps-check:
    type: tcp
    port: 993
    params:
      ssl: true
      server-name: imaps.example.com
    interval: 5s
    timeout: 3s
    rise: 2
    fall: 3
type: http / https
Opens a TCP (or TLS for https) connection, sends an HTTP request, and evaluates the response
code. An optional regexp can additionally match against the response body.
- params.path: Required. The HTTP request path, e.g. /healthz.
- params.host: The Host header value sent in the request. When omitted, the backend IP address is used.
- params.response-code: The expected HTTP response code. Can be a single value ("200") or an inclusive range ("200-299"). Defaults to "200".
- params.response-regexp: An optional Go regular expression matched against the response body. If specified, the body must match for the probe to succeed.
- params.server-name: The TLS SNI hostname (https only). Defaults to the value of params.host if not set.
- params.insecure-skip-verify: A boolean. Skip TLS certificate verification (https only). Defaults to false.
healthchecks:
  nginx-http:
    type: http
    port: 80
    params:
      path: /healthz
      host: nginx.example.com
      response-code: "200-204"
    interval: 2s
    fast-interval: 500ms
    down-interval: 30s
    timeout: 3s
    rise: 2
    fall: 3
  nginx-https:
    type: https
    port: 443
    params:
      path: /healthz
      host: nginx.example.com
      server-name: nginx.example.com
      insecure-skip-verify: false
    interval: 5s
    timeout: 3s
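The response-code field accepts either a single code or an inclusive range. One way such a spec could be evaluated is sketched here; matchCode is a hypothetical helper, and the real parser may be stricter about whitespace or code bounds.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// matchCode reports whether code satisfies spec, where spec is either a
// single value such as "200" or an inclusive range such as "200-299".
func matchCode(spec string, code int) (bool, error) {
	loStr, hiStr, isRange := strings.Cut(spec, "-")
	lo, err := strconv.Atoi(loStr)
	if err != nil {
		return false, fmt.Errorf("bad response-code %q: %w", spec, err)
	}
	hi := lo
	if isRange {
		if hi, err = strconv.Atoi(hiStr); err != nil {
			return false, fmt.Errorf("bad response-code %q: %w", spec, err)
		}
	}
	if lo > hi {
		return false, fmt.Errorf("bad response-code %q: empty range", spec)
	}
	return code >= lo && code <= hi, nil
}

func main() {
	for _, code := range []int{200, 204, 301} {
		ok, err := matchCode("200-204", code)
		fmt.Println(code, ok, err)
	}
}
```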
backends
A named map of individual backend servers. Each backend has a single IP address and optionally references a health check by name. Backends are probed exactly once, even if they appear in multiple frontends.
- address: Required. The IPv4 or IPv6 address of this backend server.
- healthcheck: The name of a health check defined in the healthchecks section. When empty or omitted, the backend is static: no probing is performed and the backend enters StateUp immediately on startup (via a synthetic pass, rise/fall forced to 1/1). This is useful for backends that are always available or managed by other means. See healthchecks.md for details on the static-backend behavior.
- enabled: A boolean controlling whether this backend participates in any frontend. When false, the backend is excluded entirely and no probe goroutine is started. Defaults to true.
Examples:
backends:
  nginx0-ams:
    address: 198.51.100.10
    healthcheck: nginx-http
  nginx0-lon:
    address: 198.51.100.11
    healthcheck: nginx-http
  nginx0-draining:
    address: 198.51.100.12
    healthcheck: nginx-http
    enabled: false
  static-backend:
    address: 198.51.100.20
    # no healthcheck: assumed always healthy
frontends
A named map of virtual IPs (VIPs). Each frontend ties together a listener address with an ordered list of backend pools. The gRPC API exposes frontends by name.
- description: An optional free-text string for documentation purposes.
- address: Required. The IPv4 or IPv6 address of the VIP.
- protocol: The IP protocol, either tcp or udp. When omitted, the frontend matches all traffic to the VIP address regardless of protocol. If port is specified, protocol must also be set.
- port: The destination port of the VIP, an integer between 1 and 65535. Requires protocol to be set. When omitted, the frontend matches all ports. Note that the frontend port is independent of the healthcheck port: a frontend on port 443 may use a healthcheck that probes port 80.
- pools: Required. A non-empty ordered list of pool objects. Pools express priority: the first pool is preferred; subsequent pools act as fallbacks. When every backend in pool[0] leaves StateUp (down, paused, disabled, or not yet probed), pool[1] is automatically promoted and its up backends take over serving traffic. The promotion cascades across further tiers. See healthchecks.md for the full failover semantics. All backends across all pools in a frontend must have addresses of the same address family (all IPv4 or all IPv6).
- src-ip-sticky: Boolean, default false. When true, the VPP load-balancer programs this VIP with source-IP-based stickiness: all flows from the same client source IP hash to the same backend (subject to the Maglev consistent-hash bucket assignment). Use this for protocols that require session affinity at the L3 level, or when clients open many short flows that should land on one backend. Changing this field in a running config and reloading causes maglevd to tear down the VIP (all application servers are deleted with flush, then the VIP itself is deleted) and recreate it with the new value; VPP has no API to mutate src_ip_sticky on an existing VIP, and existing flow state cannot be preserved across the flip.
- flush-on-down: Boolean, default true. Controls what happens to existing flows pinned to a backend that transitions to down. When true, maglevd issues lb_as_set_weight(is_flush=true) on the down transition, clearing VPP's flow-table entries for that backend so existing connections are torn down immediately and reshuffle onto healthy backends. When false, the weight drops to 0 (drain only): new flows skip the dead backend via the Maglev lookup table, but existing sticky flows keep being steered at the dead IP until the client retries, producing visible "connection refused" oscillations during an outage. The default is true because the healthcheck's rise/fall counters already absorb single-probe flaps, so a fall-counted down is almost always a real outage where immediate session teardown is the safer behaviour. Set it to false per frontend only when existing sessions are expensive enough that you'd rather keep them pinned to a potentially-dead backend than reset them (e.g. long-lived WebSocket connections with expensive reconnect logic). The disabled state always flushes regardless of this flag, because disabled is an explicit operator signal that the backend is going away.
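The flush-on-down rule can be condensed into a small decision function. This is only a sketch: shouldFlush and the string state names are hypothetical, and returning false for every state other than down and disabled is an assumption of this example rather than documented behaviour.

```go
package main

import "fmt"

// shouldFlush decides whether existing flows to a backend are flushed
// when it enters newState, given the frontend's flush-on-down setting.
func shouldFlush(newState string, flushOnDown bool) bool {
	switch newState {
	case "disabled":
		return true // explicit operator signal: always flush
	case "down":
		return flushOnDown // drain only when the flag is false
	default:
		return false // assumption: other transitions never flush
	}
}

func main() {
	fmt.Println(shouldFlush("down", true))      // flush and reset sessions
	fmt.Println(shouldFlush("down", false))     // drain only
	fmt.Println(shouldFlush("disabled", false)) // flag is ignored
}
```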
Each pool has:
- name: Required. A non-empty string identifying the pool (e.g. primary, fallback).
- backends: A map of backend names to per-pool backend options. Every name must refer to an existing entry in the backends section.
Per-pool backend options:
- weight: An integer between 0 and 100 (inclusive) expressing the relative weight of this backend within the pool. 0 keeps the backend in the pool but assigns it no traffic. Defaults to 100. Weight is per-pool, not global: the same backend can appear with different weights in different frontends.
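Pool priority and per-pool weights combine roughly as follows: the first pool with at least one up backend is active, and only its up backends carry their configured weight. The sketch below encodes that reading with hypothetical types; whether an up backend with weight 0 keeps its pool active is an assumption here, and healthchecks.md has the authoritative failover semantics.

```go
package main

import "fmt"

// PoolBackend and Pool are hypothetical, simplified types.
type PoolBackend struct {
	Name   string
	Weight int  // configured per-pool weight, 0..100
	Up     bool // true when the backend is in StateUp
}

type Pool struct {
	Name     string
	Backends []PoolBackend
}

// activePool returns the index of the first pool that has at least one
// up backend, or -1 when no pool can serve traffic.
func activePool(pools []Pool) int {
	for i, p := range pools {
		for _, b := range p.Backends {
			if b.Up {
				return i
			}
		}
	}
	return -1
}

// effectiveWeights gives up backends in the active pool their configured
// weight and everything else weight 0.
func effectiveWeights(pools []Pool) map[string]int {
	out := make(map[string]int)
	active := activePool(pools)
	for i, p := range pools {
		for _, b := range p.Backends {
			w := 0
			if i == active && b.Up {
				w = b.Weight
			}
			out[b.Name] = w
		}
	}
	return out
}

func main() {
	pools := []Pool{
		{Name: "primary", Backends: []PoolBackend{{"nginx0-ams", 10, false}, {"nginx0-lon", 100, false}}},
		{Name: "fallback", Backends: []PoolBackend{{"nginx0-fra", 100, true}}},
	}
	fmt.Println(activePool(pools), effectiveWeights(pools))
}
```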
Examples:
frontends:
  nginx-v4-http:
    description: "IPv4 HTTP VIP with fallback"
    address: 198.51.100.1
    protocol: tcp
    port: 80
    pools:
      - name: primary
        backends:
          nginx0-ams: { weight: 10 }
          nginx0-lon: {}
      - name: fallback
        backends:
          nginx0-fra: {}
  maildrop-imaps:
    description: "IMAPS VIP"
    address: 2001:db8::1
    protocol: tcp
    port: 993
    src-ip-sticky: true
    pools:
      - name: primary
        backends:
          maildrop0-ams: {}
          maildrop0-lon: {}
For a detailed description of the health state machine, probe intervals, and all transition events, see healthchecks.md. For a user guide on how to use the maglev daemon and client, see the user-guide.md.