Three reliability fixes bundled with docs updates.

**Restart-neutral VPP LB sync via a startup warmup window** (`internal/vpp/warmup.go`). Before this, a maglevd restart would immediately issue `SyncLBStateAll` with every backend still in `StateUnknown` (mapped through `BackendEffectiveWeight` to weight 0), and VPP would black-hole all new flows until the checker's rise counters caught up, several seconds later. The new warmup tracker owns a process-wide state machine gated by two config knobs: `vpp.lb.startup-min-delay` (default 5s) is an absolute hands-off window during which neither the periodic sync loop nor the per-transition reconciler touches VPP; `vpp.lb.startup-max-delay` (default 30s) is the watchdog for a per-VIP release phase that runs between the two, releasing each frontend as soon as every backend it references reaches a non-Unknown state. At max-delay a final `SyncLBStateAll` runs for any stragglers still in Unknown. Config reload does not reset the clock. Both delays can be set to 0 to disable the warmup entirely. The reconciler's suppressed-during-warmup events log at DEBUG so operators can still see them with `--log-level debug`. Unit tests cover the tracker state machine, the `allBackendsKnown` precondition, and the zero-delay escape hatch.

**Deterministic AS iteration in VPP LB sync.** `reconcileVIP` and `recreateVIP` now issue their `lb_as_add_del` / `lb_as_set_weight` calls in numeric IP order (IPv4 before IPv6, ascending within each family) via a new `sortedIPKeys` helper, instead of Go map iteration order. VPP's LB plugin breaks per-bucket ties in the Maglev lookup table by insertion position in its internal AS vec, so without a stable call order two maglevd instances on the same config could push identical AS sets into VPP in different orders and produce divergent new-flow tables. Numeric sort is used in preference to lexicographic so the sync log stays human-readable: string order would place 10.0.0.10 before 10.0.0.2, with the same problem in v6.
Unit tests cover empty, single, v4/v6 numeric vs lexicographic, v4-before-v6 grouping, a 1000-iteration stability loop against Go's randomised map iteration, insertion-order invariance, and the `desiredAS` call-site type.

**maglevt interval fix.** `runProbeLoop` used to sleep the full jittered interval after every probe, so a 100ms `--interval` with a 30ms probe actually produced a 130ms period. The sleep now subtracts `result.Duration` so cadence matches the flag. Probes that overrun clamp the sleep to zero and fire the next probe immediately, without trying to catch up on missed cycles: a slow backend doesn't get flooded with back-to-back probes at the moment it's already struggling.

**Docs.** config-guide now documents `flush-on-down` and the new `startup-min-delay` / `startup-max-delay` knobs; user-guide's maglevd section explains the restart-neutrality property, the three warmup phases, and the relevant slog lines operators should watch for during a bounce.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# maglevd Configuration Guide

## Overview

`maglevd` consumes a YAML configuration file of a specific format. Validation is performed
in two stages:

1. **Structural parsing**: the YAML is unmarshalled into typed Go structs. Unknown fields and
   type mismatches are rejected immediately.
2. **Semantic validation**: cross-field and cross-object rules are enforced, for example
   ensuring that every backend referenced by a frontend exists, that address families are
   consistent within a frontend, and that IP source addresses are the correct family.

If you want to get started quickly, take a look at the [example config](../debian/maglev.yaml).

## Basic structure

The YAML configuration file has the following top-level structure:

```yaml
maglev:
  healthchecker:
    [ Global health checker settings ]

  vpp:
    lb:
      [ VPP load-balancer integration settings ]

  healthchecks:
    my-check:
      [ Health check definition ]

  backends:
    my-backend:
      [ Backend definition ]

  frontends:
    my-frontend:
      [ Frontend (VIP) definition ]
```

All five sections live under the top-level `maglev:` key. The `healthchecks`, `backends`,
and `frontends` sections are maps keyed by an arbitrary name of your choosing. Names must be
unique within their section and are case-sensitive. The `vpp` section is required when
`maglevd` has a working VPP connection: its `lb.ipv4-src-address` and `lb.ipv6-src-address`
fields are mandatory and `maglevd` will refuse to start without them.

---

## healthchecker

Global settings for the health checker engine.

* ***transition-history***: An integer >= 1 that controls how many state transitions are
  retained per backend for display via the gRPC API. Defaults to `5`.
* ***netns***: The name of a Linux network namespace in which probes are executed. When
  empty or omitted, probes run in the current (default) network namespace. Useful when
  backends are reachable only through a dedicated dataplane namespace.

  **Capability requirement**: setting this field makes `maglevd` call
  `setns(CLONE_NEWNET)` on the probe thread before each probe, which the
  kernel only permits to processes holding `CAP_SYS_ADMIN` in the target
  namespace's user namespace (`setns(2)`). The Debian systemd unit
  (`vpp-maglev.service`) already grants this capability; if you run
  `maglevd` by hand under a non-root user, make sure the binary has
  `CAP_SYS_ADMIN` via `setcap cap_net_raw,cap_sys_admin=eip /usr/sbin/maglevd`
  or equivalent. Otherwise every probe fails with
  `enter netns "<name>": operation not permitted` and all backends
  transition to `down` on their first probe.

  Also make sure the named namespace is mounted under `/var/run/netns/`
  (which is where `ip netns add` puts it) and that it is readable by
  the user `maglevd` runs as; the default mode from `ip netns add` is
  `0644`, which is fine for any user.

Example:

```yaml
maglev:
  healthchecker:
    transition-history: 10
    netns: dataplane
```

---

## vpp

Settings controlling the integration with a locally running VPP instance. The
`vpp` section is a map with a single sub-section, `lb`. Both `lb.ipv4-src-address`
and `lb.ipv6-src-address` are **required**: `maglevd --check` exits with a
semantic error and the daemon refuses to start when either is missing, because
VPP's GRE encap needs a source address and every VIP `maglevd` programs uses GRE.

* ***lb.ipv4-src-address***: Required. The IPv4 source address VPP uses when
  encapsulating IPv4 traffic into GRE4 tunnels to application servers. Must
  be a valid IPv4 address. No default.
* ***lb.ipv6-src-address***: Required. The IPv6 source address VPP uses when
  encapsulating IPv6 traffic into GRE6 tunnels. Must be a valid IPv6 address.
  No default.
* ***lb.sync-interval***: A positive Go duration (e.g. `30s`, `1m`) controlling
  how often `maglevd` reconciles the VPP load-balancer dataplane against its
  running configuration. On startup, an immediate full sync runs; subsequent
  syncs fire at this interval as long as the VPP connection is up. Defaults
  to `30s`. The purpose is to catch drift (for example, a VIP added to VPP
  by hand) and bring VPP back in line with the maglev config.
* ***lb.sticky-buckets-per-core***: The number of buckets per worker thread in
  the established-flow table. Must be a power of 2. Defaults to `65536` (64k).
* ***lb.flow-timeout***: Idle time after which an established flow is removed
  from the table. Must be a whole number of seconds between `1s` and `120s`
  inclusive. Defaults to `40s`.
* ***lb.startup-min-delay***: Absolute hands-off window at the start of the
  `maglevd` process. For the first `startup-min-delay` of the process's
  life, no VPP LB sync of any kind is issued: neither the periodic
  `SyncLBStateAll` loop nor the per-transition `SyncLBStateVIP`
  path from the reconciler touches VPP. This makes a `maglevd` restart
  dataplane-neutral: without this gate, the first sync would fire before
  any probes had completed, every backend would still be in `StateUnknown`,
  and every AS would be reprogrammed to weight 0 until the rise counters
  caught up, producing a visible black-hole window of several seconds
  on every restart. A non-negative Go duration. Must be `<= startup-max-delay`.
  Defaults to `5s`. Set to `0s` (together with `startup-max-delay: 0s`) to
  disable the warmup entirely and sync VPP immediately on startup.
* ***lb.startup-max-delay***: Watchdog for the per-VIP release phase that
  follows `startup-min-delay`. Between `min-delay` and `max-delay`, each
  frontend is released individually as soon as every backend it references
  has reached a non-`Unknown` state, and a single `SyncLBStateVIP` runs
  against the newly released frontend. At `max-delay` the warmup driver
  unconditionally runs `SyncLBStateAll`, marking the warmup phase complete
  regardless of whether any backends are still `StateUnknown`; stragglers
  get programmed at their current effective weight at that point, which
  for a `StateUnknown` backend means weight 0. A non-negative Go duration.
  Must be `>= startup-min-delay`. Defaults to `30s`. Set to `0s` together
  with `startup-min-delay: 0s` to disable the warmup entirely.

These values are pushed to VPP via `lb_conf` when `maglevd` connects to
VPP and again after every config reload (whenever they change). A log line
`vpp-lb-conf-set` records the effective values. The `startup-*` settings
are latched at the first successful VPP connect and are not re-read on
subsequent config reloads; a reload that changes them only takes effect
on the next process start.

Example:

```yaml
maglev:
  vpp:
    lb:
      sync-interval: 60s
      ipv4-src-address: 10.0.0.1
      ipv6-src-address: 2001:db8::1
      sticky-buckets-per-core: 65536
      flow-timeout: 40s
      startup-min-delay: 5s
      startup-max-delay: 30s
```

---

## healthchecks

A named map of health check definitions. Each health check describes *how* to probe a backend.
Backends reference health checks by name. The same health check can be reused across any number
of backends; each backend is probed exactly once regardless of how many frontends reference it.

Common fields (all types):

* ***type***: Required. One of `icmp`, `tcp`, `http`, or `https`.
* ***port***: The destination port to probe. Required for `tcp`, `http`, and `https`.
  Must be omitted for `icmp`.
* ***probe-ipv4-src***: An optional IPv4 source address used when probing IPv4 backends.
  Must be an IPv4 address. When omitted, the OS chooses the source address.
* ***probe-ipv6-src***: An optional IPv6 source address used when probing IPv6 backends.
  Must be an IPv6 address. When omitted, the OS chooses the source address.
* ***interval***: Required. A positive Go duration string (e.g. `2s`, `500ms`) controlling
  how often a probe is sent when the backend is fully healthy (counter at maximum).
* ***fast-interval***: Optional. A positive duration used instead of `interval` while the
  backend's health counter is degraded (between down and up) or in `unknown` state. When
  omitted, `interval` is used.
* ***down-interval***: Optional. A positive duration used instead of `interval` while the
  backend is fully down (counter at zero). When omitted, `interval` is used. Setting this to
  a longer value reduces probe traffic to backends that are known to be offline.
* ***timeout***: Required. A positive duration after which an in-flight probe is abandoned
  and counted as a failure.
* ***rise***: The number of consecutive successes required to transition from down to up.
  Defaults to `2`. Must be >= 1.
* ***fall***: The number of consecutive failures required to transition from up to down.
  Defaults to `3`. Must be >= 1.

### type: icmp

Sends an ICMP echo request (ping) to the backend address. Requires `CAP_NET_RAW`. No `port`
may be specified. No `params` block is used.

```yaml
healthchecks:
  ping:
    type: icmp
    probe-ipv4-src: 10.0.0.1
    probe-ipv6-src: 2001:db8::1
    interval: 2s
    timeout: 1s
    rise: 2
    fall: 3
```

### type: tcp

Opens a TCP connection to the backend and immediately closes it upon success. Use `params` to
optionally wrap the connection in TLS.

* ***params.ssl***: A boolean. When `true`, a TLS handshake is performed after the TCP
  connection is established. Defaults to `false`.
* ***params.server-name***: The TLS SNI hostname sent during the handshake. When omitted,
  the backend IP address is used.
* ***params.insecure-skip-verify***: A boolean. When `true`, the TLS certificate presented
  by the server is not verified. Defaults to `false`.

```yaml
healthchecks:
  imaps-check:
    type: tcp
    port: 993
    params:
      ssl: true
      server-name: imaps.example.com
    interval: 5s
    timeout: 3s
    rise: 2
    fall: 3
```

### type: http / https

Opens a TCP (or TLS for `https`) connection, sends an HTTP request, and evaluates the response
code. An optional regexp can additionally match against the response body.

* ***params.path***: Required. The HTTP request path, e.g. `/healthz`.
* ***params.host***: The `Host` header value sent in the request. When omitted, the backend
  IP address is used.
* ***params.response-code***: The expected HTTP response code. Can be a single value (`"200"`)
  or an inclusive range (`"200-299"`). Defaults to `"200"`.
* ***params.response-regexp***: An optional Go regular expression matched against the response
  body. If specified, the body must match for the probe to succeed.
* ***params.server-name***: The TLS SNI hostname (`https` only). Defaults to the value of
  `params.host` if not set.
* ***params.insecure-skip-verify***: A boolean. Skip TLS certificate verification (`https`
  only). Defaults to `false`.

```yaml
healthchecks:
  nginx-http:
    type: http
    port: 80
    params:
      path: /healthz
      host: nginx.example.com
      response-code: "200-204"
    interval: 2s
    fast-interval: 500ms
    down-interval: 30s
    timeout: 3s
    rise: 2
    fall: 3

  nginx-https:
    type: https
    port: 443
    params:
      path: /healthz
      host: nginx.example.com
      server-name: nginx.example.com
      insecure-skip-verify: false
    interval: 5s
    timeout: 3s
```
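
The `response-code` semantics (single value or inclusive range) can be sketched with a small matcher; `codeMatches` is a hypothetical helper, not maglevd's actual parser:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// codeMatches reports whether code satisfies a response-code spec,
// which is either a single code ("200") or an inclusive range ("200-299").
func codeMatches(spec string, code int) (bool, error) {
	lo, hi, found := strings.Cut(spec, "-")
	if !found {
		hi = lo // single value: range of one
	}
	min, err := strconv.Atoi(lo)
	if err != nil {
		return false, fmt.Errorf("bad response-code %q: %w", spec, err)
	}
	max, err := strconv.Atoi(hi)
	if err != nil {
		return false, fmt.Errorf("bad response-code %q: %w", spec, err)
	}
	return min <= code && code <= max, nil
}

func main() {
	ok, _ := codeMatches("200-204", 204)
	fmt.Println(ok) // true: inside the inclusive range
	ok, _ = codeMatches("200", 301)
	fmt.Println(ok) // false: single-value spec
}
```
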

---

## backends

A named map of individual backend servers. Each backend has a single IP address and optionally
references a health check by name. Backends are probed exactly once, even if they appear in
multiple frontends.

* ***address***: Required. The IPv4 or IPv6 address of this backend server.
* ***healthcheck***: The name of a health check defined in the `healthchecks` section.
  When empty or omitted, the backend is static: no probing is performed and the backend
  enters `StateUp` immediately on startup (via a synthetic pass, rise/fall forced to 1/1).
  This is useful for backends that are always available or managed by other means. See
  [healthchecks.md](healthchecks.md) for details on the static-backend behavior.
* ***enabled***: A boolean controlling whether this backend participates in any frontend.
  When `false`, the backend is excluded entirely and no probe goroutine is started.
  Defaults to `true`.

Examples:

```yaml
backends:
  nginx0-ams:
    address: 198.51.100.10
    healthcheck: nginx-http
  nginx0-lon:
    address: 198.51.100.11
    healthcheck: nginx-http
  nginx0-draining:
    address: 198.51.100.12
    healthcheck: nginx-http
    enabled: false
  static-backend:
    address: 198.51.100.20
    # no healthcheck: assumed always healthy
```

---

## frontends

A named map of virtual IPs (VIPs). Each frontend ties together a listener address with an
ordered list of backend pools. The gRPC API exposes frontends by name.

* ***description***: An optional free-text string for documentation purposes.
* ***address***: Required. The IPv4 or IPv6 address of the VIP.
* ***protocol***: The IP protocol, either `tcp` or `udp`. When omitted, the frontend matches
  all traffic to the VIP address regardless of protocol. If `port` is specified, `protocol`
  must also be set.
* ***port***: The destination port of the VIP, an integer between 1 and 65535. Requires
  `protocol` to be set. When omitted, the frontend matches all ports. Note that the
  frontend port is independent of the healthcheck port: a frontend on port 443 may use
  a healthcheck that probes port 80.
* ***pools***: Required. A non-empty ordered list of pool objects. Pools express priority:
  the first pool is preferred; subsequent pools act as fallbacks. When every backend in
  `pool[0]` leaves `StateUp` (down, paused, disabled, or not yet probed), `pool[1]` is
  automatically promoted: its up backends take over serving traffic. The promotion
  cascades across further tiers. See [healthchecks.md](healthchecks.md#pool-failover)
  for the full failover semantics. All backends across all pools in a frontend must
  have addresses of the same address family (all IPv4 or all IPv6).
* ***src-ip-sticky***: Boolean, default `false`. When `true`, the VPP load-balancer
  programs this VIP with source-IP-based stickiness: all flows from the same client
  source IP hash to the same backend (subject to the Maglev consistent-hash bucket
  assignment). Use this for protocols that require session affinity at the L3 level,
  or when clients open many short flows that should land on one backend. Changing this
  field in a running config and reloading causes maglevd to tear down the VIP (all
  application servers are deleted with flush, then the VIP itself is deleted) and
  recreate it with the new value; VPP has no API to mutate `src_ip_sticky` on an
  existing VIP, and existing flow state cannot be preserved across the flip.
* ***flush-on-down***: Boolean, default `true`. Controls what happens to existing
  flows pinned to a backend that transitions to `down`. When `true`, maglevd
  issues `lb_as_set_weight(is_flush=true)` on the down transition, clearing VPP's
  flow-table entries for that backend so existing connections are torn down
  immediately and reshuffle onto healthy backends. When `false`, the weight drops
  to 0 (drain only): new flows skip the dead backend via the Maglev lookup table,
  but existing sticky flows keep being steered to the dead IP until the client
  retries, producing visible "connection refused" oscillations during an outage.
  The default is `true` because the healthcheck's `rise` / `fall` counters already
  absorb single-probe flaps, so a fall-counted `down` is almost always a real
  outage where immediate session teardown is the safer behaviour. Set it to `false`
  per frontend only when existing sessions are expensive enough that you'd rather
  keep them pinned to a potentially-dead backend than reset them (e.g. long-lived
  WebSocket connections with expensive reconnect logic). The `disabled` state
  always flushes regardless of this flag: `disabled` is an explicit operator
  signal that the backend is going away.
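
The pool-promotion rule above can be sketched as "first pool with at least one up backend wins". This is an illustrative reduction; maglevd's real promotion logic also accounts for paused/disabled states and per-backend weights:

```go
package main

import "fmt"

// activePool returns the index of the first pool (in priority order)
// containing at least one up backend, or -1 if none does.
// pools[i][j] is true when backend j of pool i is in StateUp.
func activePool(pools [][]bool) int {
	for i, pool := range pools {
		for _, up := range pool {
			if up {
				return i
			}
		}
	}
	return -1
}

func main() {
	// Primary fully down, fallback has one up backend: fallback (index 1) serves.
	fmt.Println(activePool([][]bool{{false, false}, {true, false}}))
}
```
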

Each pool has:

* ***name***: Required. A non-empty string identifying the pool (e.g. `primary`, `fallback`).
* ***backends***: A map of backend names to per-pool backend options. Every name must refer
  to an existing entry in the `backends` section.

Per-pool backend options:

* ***weight***: An integer between 0 and 100 (inclusive) expressing the relative weight of
  this backend within the pool. `0` keeps the backend in the pool but assigns it no traffic.
  Defaults to `100`. Weight is per-pool, not global: the same backend can appear with
  different weights in different frontends.

Examples:

```yaml
frontends:
  nginx-v4-http:
    description: "IPv4 HTTP VIP with fallback"
    address: 198.51.100.1
    protocol: tcp
    port: 80
    pools:
      - name: primary
        backends:
          nginx0-ams: { weight: 10 }
          nginx0-lon: {}
      - name: fallback
        backends:
          nginx0-fra: {}

  maildrop-imaps:
    description: "IMAPS VIP"
    address: 2001:db8::1
    protocol: tcp
    port: 993
    src-ip-sticky: true
    pools:
      - name: primary
        backends:
          maildrop0-ams: {}
          maildrop0-lon: {}
```

---

For a detailed description of the health state machine, probe intervals, and all transition events,
see [healthchecks.md](healthchecks.md). For a user guide on how to use the maglev daemon and client,
see the [user-guide.md](user-guide.md).