
# Health Checking

`maglevd` probes each backend independently of how many frontends reference it.
Every backend runs exactly one probe goroutine. State changes are broadcast as
gRPC events to all connected `WatchBackendEvents` subscribers.

---

## States

| State | Meaning |
|---|---|
| `unknown` | Initial state; also entered after a resume or backend restart. |
| `up` | Backend is healthy and eligible to receive traffic. |
| `down` | Backend has failed enough consecutive probes to be considered offline. |
| `paused` | Health checking suspended by an operator. Probes fire but results are discarded. |
| `removed` | Backend was removed from configuration. No further probes are accepted. |

---

## Rise / fall counter

The state machine is driven by a single-integer health counter, modeled on HAProxy's.

```
counter ∈ [0, rise + fall − 1]   (called Max below)

backend is UP   when counter ≥ rise
backend is DOWN when counter < rise
```

On each probe:
- **pass** — counter increments, ceiling at Max.
- **fail** — counter decrements, floor at 0.

This gives **hysteresis**: a backend that is fully up (counter = Max) needs
`fall` consecutive failures before it transitions to down. A backend that is
fully down (counter = 0) needs `rise` consecutive passes to come back up. A
backend that alternates between passing and failing moves only a step or two
within its current range, so it does not bounce between up and down unless the
counter is already next to the rise boundary.
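
The counter rules above can be sketched as a small Go type. This is a minimal
illustration, not `maglevd`'s actual code: the `health` type and its method
names are hypothetical.

```go
package main

import "fmt"

// health is a hypothetical type modelling the single-integer counter.
type health struct {
	rise, fall, counter int
}

// ceiling returns Max, the upper bound of the counter: rise + fall - 1.
func (h *health) ceiling() int { return h.rise + h.fall - 1 }

// up reports whether the counter is at or above the rise boundary.
func (h *health) up() bool { return h.counter >= h.rise }

// pass records a successful probe: increment, capped at Max.
func (h *health) pass() {
	if h.counter < h.ceiling() {
		h.counter++
	}
}

// fail records a failed probe: decrement, floored at 0.
func (h *health) fail() {
	if h.counter > 0 {
		h.counter--
	}
}

func main() {
	h := &health{rise: 2, fall: 3, counter: 4} // start fully healthy (Max=4)
	for i := 1; i <= 3; i++ {
		h.fail()
		fmt.Printf("after fail %d: counter=%d up=%v\n", i, h.counter, h.up())
	}
}
```

Starting from Max with rise=2 and fall=3, the first two failures leave the
backend up and the third one crosses the rise boundary.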

### Expedited unknown resolution

When a backend enters `unknown` state (new, restarted, or resumed) its counter
is pre-loaded to `rise − 1`. This means a single probe result is enough to
resolve the state:

- **1 pass** → `up`
- **1 fail** → `down` (also via the special unknown shortcut below)

In addition, any failure while state is `unknown` transitions immediately to
`down`, regardless of the counter value.

### Example: rise=2, fall=3 (Max=4)

```
counter:  0   1   2   3   4
state:   DOWN DOWN  UP  UP  UP
                ^
                rise boundary
```

A backend starting from unknown has counter=1 (rise−1). One pass → counter=2
→ up. One fail while unknown → down immediately.

A backend that just became up sits at counter=2. A single failure takes it down
again (2→1, crossing the rise boundary); it has to accumulate further passes
before it can absorb failures.

A backend that has been fully healthy for a while sits at counter=4. It needs 3
failures to go down (4→3→2→1, crossing the rise boundary at 2→1).

---

## Probe intervals

The interval used between probes depends on the backend's state and counter:

| Condition | Interval used |
|---|---|
| State is `unknown` | `fast-interval` (falls back to `interval`) |
| Counter = Max (fully healthy) | `interval` |
| Counter = 0 (fully down) | `down-interval` (falls back to `interval`) |
| Counter between 0 and Max (degraded) | `fast-interval` (falls back to `interval`) |

Using `fast-interval` in degraded and unknown states means a flapping or
recovering backend is re-evaluated quickly without waiting a full `interval`.
Using `down-interval` for fully down backends reduces probe traffic to servers
that are known to be offline.
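
The table can be read as a small selection function. The sketch below is an
assumption-laden illustration: the `intervals` type, the use of a zero duration
to mean "not configured", and checking `unknown` before the counter are all
choices made for this example, and the durations in `main` are arbitrary.

```go
package main

import (
	"fmt"
	"time"
)

// intervals holds the three configured probe intervals. In this sketch a
// zero value for fast or down means "not configured" (an assumption).
type intervals struct {
	interval, fast, down time.Duration
}

// orDefault returns d, or the base interval when d is unset.
func (iv intervals) orDefault(d time.Duration) time.Duration {
	if d == 0 {
		return iv.interval
	}
	return d
}

// next picks the delay before the next probe, following the table row by row.
// Checking unknown first is an assumption about precedence.
func (iv intervals) next(unknown bool, counter, maxCounter int) time.Duration {
	switch {
	case unknown:
		return iv.orDefault(iv.fast)
	case counter == maxCounter:
		return iv.interval
	case counter == 0:
		return iv.orDefault(iv.down)
	default: // degraded: strictly between 0 and Max
		return iv.orDefault(iv.fast)
	}
}

func main() {
	// Illustrative values only.
	iv := intervals{interval: 2 * time.Second, fast: 500 * time.Millisecond, down: 30 * time.Second}
	fmt.Println("unknown:      ", iv.next(true, 1, 4))
	fmt.Println("fully healthy:", iv.next(false, 4, 4))
	fmt.Println("fully down:   ", iv.next(false, 0, 4))
	fmt.Println("degraded:     ", iv.next(false, 2, 4))
}
```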

---

## Transition events

Every state change is logged as `backend-transition` and emitted as a gRPC
`BackendEvent` to all active `WatchBackendEvents` streams.

### Backend added (config load or reload)

```
unknown → unknown  (code: start)
```

The counter is pre-loaded to `rise − 1`. The first probe fires immediately at
`fast-interval` (or `interval` if not configured). One pass produces `unknown →
up`; one fail produces `unknown → down`.

If multiple backends start together they are staggered across the first
`interval` to avoid probe bursts.
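
One way to picture the staggering is an even spread of start offsets over one
`interval`. This is only a sketch: the document does not say how `maglevd`
computes the offsets, so the even spacing and the `staggerOffsets` helper are
assumptions for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// staggerOffsets spreads the first probe of n backends evenly over one
// interval. Even spacing is an assumption; a real scheduler might add jitter.
func staggerOffsets(n int, interval time.Duration) []time.Duration {
	offsets := make([]time.Duration, n)
	for i := range offsets {
		offsets[i] = interval * time.Duration(i) / time.Duration(n)
	}
	return offsets
}

func main() {
	for i, d := range staggerOffsets(4, 2*time.Second) {
		fmt.Printf("backend %d: first probe after %v\n", i, d)
	}
}
```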

### Probe pass

- Counter increments.
- If counter reaches `rise` from below: `down → up` (or `unknown → up`).
- If already up: no transition. Next probe at `fast-interval` if degraded,
  `interval` if fully healthy.

### Probe fail

- Counter decrements.
- If counter drops below `rise` from above: `up → down`.
- If state is `unknown`: transition immediately to `down` regardless of counter.
- Next probe at `down-interval` if fully down, `fast-interval` if degraded.
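
The pass and fail rules, including the `unknown` shortcut, can be combined into
one transition function. The following is a hedged sketch: the `backend` struct
and `observe` method are hypothetical names, and the `paused` and `removed`
states are left out to keep it short.

```go
package main

import "fmt"

// State names follow the States table in this document.
type State string

const (
	Unknown State = "unknown"
	Up      State = "up"
	Down    State = "down"
)

// backend is a hypothetical struct; field names are illustrative.
type backend struct {
	state               State
	rise, fall, counter int
}

// newBackend starts in unknown with the counter pre-loaded to rise-1.
func newBackend(rise, fall int) *backend {
	return &backend{state: Unknown, rise: rise, fall: fall, counter: rise - 1}
}

// observe applies one probe result and reports whether the state changed.
func (b *backend) observe(pass bool) (from, to State, changed bool) {
	from = b.state
	if pass {
		if b.counter < b.rise+b.fall-1 {
			b.counter++
		}
	} else if b.counter > 0 {
		b.counter--
	}
	switch {
	case !pass && (from == Unknown || b.counter < b.rise):
		// Dropped below the rise boundary, or failed while unknown.
		b.state = Down
	case pass && b.counter >= b.rise:
		b.state = Up
	}
	return from, b.state, from != b.state
}

func main() {
	b := newBackend(2, 3)
	for _, pass := range []bool{true, true, true, false, false, false} {
		if from, to, changed := b.observe(pass); changed {
			fmt.Printf("%s -> %s (counter=%d)\n", from, to, b.counter)
		}
	}
}
```

With rise=2 and fall=3 the run reports two transitions: `unknown -> up` on the
first pass and `up -> down` on the third consecutive failure.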

### Pause

```
<any> → paused  (operator action)
```

The counter is reset to 0. Probes continue to fire on their normal schedule but
all results are discarded. The backend stays `paused` until explicitly resumed.

### Resume

```
paused → unknown  (operator action)
```

The counter is reset to `rise − 1`. The probe goroutine is woken immediately
(no wait for the next scheduled probe). One subsequent pass produces `unknown →
up`; one fail produces `unknown → down`.
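
Pause and resume can be sketched as two small methods plus a wake-up channel.
Everything here is illustrative: the `backend` struct, the `wake` channel, and
the `accept` helper are hypothetical, and locking is omitted even though a real
implementation shared with a probe goroutine would need it.

```go
package main

import "fmt"

type State string

const (
	Unknown State = "unknown"
	Paused  State = "paused"
)

// backend is a hypothetical struct for this sketch.
type backend struct {
	state   State
	rise    int
	counter int
	wake    chan struct{} // hypothetical: nudges the probe goroutine
}

// pause moves the backend to paused and resets the counter to 0.
func (b *backend) pause() {
	b.state = Paused
	b.counter = 0
}

// resume returns to unknown with the counter pre-loaded to rise-1 and asks
// the probe goroutine to run now instead of waiting for its timer.
func (b *backend) resume() {
	b.state = Unknown
	b.counter = b.rise - 1
	select {
	case b.wake <- struct{}{}:
	default: // a wake-up is already pending; nothing more to do
	}
}

// accept reports whether a probe result should be applied or discarded.
func (b *backend) accept() bool { return b.state != Paused }

func main() {
	b := &backend{state: Unknown, rise: 2, counter: 1, wake: make(chan struct{}, 1)}
	b.pause()
	fmt.Println(b.state, b.counter, b.accept())
	b.resume()
	fmt.Println(b.state, b.counter, b.accept(), len(b.wake))
}
```

The buffered channel with a non-blocking send keeps `resume` from blocking when
the probe goroutine has not yet consumed an earlier wake-up.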

### Backend removed (config reload)

```
<any> → removed  (code: removed)
```

The probe goroutine stops. No further state changes occur. The removed event is
emitted using the frontend map from before the reload so that consumers can
correlate it to the correct frontend.

### Backend healthcheck config changed (config reload)

The old probe goroutine is stopped (`<any> → removed`) and a new one started
(`unknown → unknown`, code: `start`). The new goroutine resolves state on the
first probe as described under *Backend added* above.

### Backend metadata changed without healthcheck change (config reload)

Weight, enabled flag, and similar fields are updated in place. The probe
goroutine is not restarted and no transition event is emitted.

---

## Log lines

All state changes produce a structured log line at `INFO` level:

```json
{"level":"INFO","msg":"backend-transition","backend":"nginx0-ams","from":"up","to":"paused"}
{"level":"INFO","msg":"backend-transition","backend":"nginx0-ams","from":"paused","to":"unknown"}
{"level":"INFO","msg":"backend-transition","backend":"nginx0-ams","from":"unknown","to":"up","code":"L7OK","detail":""}
```

Probe-driven transitions also carry `code` and `detail` fields from the probe
result (e.g. code `L4CON` or `L7STS`, detail `connection refused`).
Operator-driven transitions (pause, resume) have no probe result, so their code
and detail are empty and the log line omits both fields.
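
A log consumer could decode these lines with Go's standard `encoding/json`
package. The sketch below assumes only the fields visible in the sample lines;
the `transition` struct and `parseTransition` helper are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// transition mirrors the fields seen in the sample log lines. Code and
// Detail stay empty when a line does not include them.
type transition struct {
	Level   string `json:"level"`
	Msg     string `json:"msg"`
	Backend string `json:"backend"`
	From    string `json:"from"`
	To      string `json:"to"`
	Code    string `json:"code"`
	Detail  string `json:"detail"`
}

// parseTransition decodes one log line and rejects other message types.
func parseTransition(line string) (transition, error) {
	var tr transition
	if err := json.Unmarshal([]byte(line), &tr); err != nil {
		return tr, err
	}
	if tr.Msg != "backend-transition" {
		return tr, fmt.Errorf("not a backend-transition line: %q", tr.Msg)
	}
	return tr, nil
}

func main() {
	line := `{"level":"INFO","msg":"backend-transition","backend":"nginx0-ams","from":"unknown","to":"up","code":"L7OK","detail":""}`
	tr, err := parseTransition(line)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s: %s -> %s (%s)\n", tr.Backend, tr.From, tr.To, tr.Code)
}
```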