New maglevt TUI component: out-of-band VIP health monitor

A small bubbletea TUI that reads maglev.yaml (repeatable --config),
enumerates every VIP, and probes each from outside the load balancer
on a tight cadence (default 100ms, ±10% jitter). HTTP/HTTPS VIPs get
a GET against a configurable URI (default /.well-known/ipng/healthz)
with per-VIP rolling latency (p50/p95/p99/max), lifetime N/FAIL
counters, LAST status, and a response-header tally. Non-HTTP VIPs
get a TCP connect probe. A bounded error panel classifies anomalies
as timeout / http-err / net-err / spike and auto-sizes to fill the
screen.
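
A minimal sketch of how the ±10% jitter can be applied to the 100ms
cadence (jitteredInterval is an illustrative name, not one from the
tree):

    import (
        "math/rand"
        "time"
    )

    // jitteredInterval returns base scaled by a uniform factor in
    // [0.9, 1.1), spreading probes so the per-VIP loops don't fire
    // in lockstep.
    func jitteredInterval(base time.Duration) time.Duration {
        f := 1 + (rand.Float64()*2-1)*0.10
        return time.Duration(float64(base) * f)
    }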

Utility: during a failover drill (backend flap, AS drain, config
push) the tally panel shows which backend each VIP is actually
steering to, with activity highlighting over a 5s window: green =
receiving traffic, orange = receiving less than the leader, grey =
drained. Paired with the rolling OK%/latency columns it gives an
at-a-glance answer to "is the VIP healthy from the outside right
now, and which backend is it hitting?", without relying on
maglevd's own view of the world.
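
The classification behind that highlighting boils down to a windowed
per-backend delta. A minimal sketch, assuming the view compares each
backend's delta against the VIP's busiest backend (classifyBackend is
an illustrative name, not one from the tree):

    // classifyBackend maps a backend's hit-count delta over the 5s
    // window to its tally-panel colour; leader is the largest delta
    // among the VIP's backends.
    func classifyBackend(delta, leader int) string {
        switch {
        case delta == 0:
            return "grey" // drained: no traffic this window
        case delta < leader:
            return "orange" // receiving less than the leader
        default:
            return "green" // actively receiving traffic
        }
    }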

Also bumps Makefile/go.mod to build the new binary.
commit 6293521157 (parent 744b1cb3d2), 2026-04-15 01:23:34 +02:00
8 changed files with 1890 additions and 1 deletion
cmd/tester/model.go (new file)
@@ -0,0 +1,349 @@
// Copyright (c) 2026, Pim van Pelt <pim@ipng.ch>

package main

import (
"fmt"
"time"
tea "github.com/charmbracelet/bubbletea"
)

// tallyWindow is the sliding-window length used to classify tally
// entries as "actively receiving traffic" versus "drained". A probe
// snapshot of each VIP's tally is rotated into vipState.tallyOld
// every tallyWindow, so on steady state the delta (tally - tallyOld)
// reflects somewhere between one and two windows (5-10s) of
// activity: long enough to be noise-resistant, short enough that a
// flush or graceful drain greys out within a rotation or two.
const tallyWindow = 5 * time.Second

// errWindowSize is the hard storage cap for Model.events. It isn't
// the number of rows rendered to screen — that's computed per-frame
// in View() from the terminal height so the events panel fills
// whatever space is left after the header, table, tally, and
// footer. This cap only exists to stop the ring from growing
// unbounded on a very long-running session that's seeing constant
// anomalies: 500 × ~100 bytes per event is ~50 KB, negligible.
const errWindowSize = 500

// errKind classifies why an event ended up in the error panel. The
// four kinds map one-to-one to the four situations Update flags
// from a probeResultMsg: a probe hit its timeout deadline, a probe
// came back with an HTTP 4xx/5xx, a probe failed at the network
// layer (connection refused, reset, unreachable, TLS handshake
// error), or a probe completed successfully but with a latency
// more than 25% above the rolling-window max.
type errKind int

const (
kindTimeout errKind = iota
kindHTTPErr
kindNetErr
kindSpike
)

func (k errKind) String() string {
switch k {
case kindTimeout:
return "timeout"
case kindHTTPErr:
return "http-err"
case kindNetErr:
return "net-err"
case kindSpike:
return "spike"
}
return "unknown"
}
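
// For reference, the "more than 25% above the rolling-window max"
// spike test used for kindSpike lives with the rolling-window code
// elsewhere in the package; it amounts to something like this sketch
// (the shipped implementation may differ in detail):
//
//	func (r *rolling) isSpike(ns uint64) bool {
//		return r.maxNS > 0 && ns > r.maxNS+r.maxNS/4
//	}
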
// errEvent is one entry in the bounded error-panel ring. VIPIdx
// points back into Model.vips so the view can look up the scheme
// and address for the row label at render time (we don't store a
// formatted label here to keep the event struct cheap and to let
// the view decide how to style it).
type errEvent struct {
At time.Time
VIPIdx int
Kind errKind
Detail string
}

// Model is the bubbletea Model for maglevt. Held by value throughout
// so bubbletea's copy-on-Update semantics work naturally; mutable
// per-VIP state lives behind *vipState pointers in the vips slice so
// probeResultMsg handlers can update rolling/tally without copying
// the whole model.
type Model struct {
cfgPath string
vips []*vipState
opts probeOpts
startAt time.Time
width int
height int
help bool // whether the help overlay is currently shown
events []errEvent // bounded ring of recent anomalies (size errWindowSize)
// showDNS toggles between hostname and IP-literal display in
// the ADDR column and the tally/events labels. On by default:
// operators usually know VIPs by name, and the 'd' key flips
// to the raw literal when they need to see which address
// family or which specific IP the row represents. Hostnames
// come in asynchronously via hostnameMsg, so vipState.hostname
// may still be empty for a VIP even when showDNS is true —
// the display falls back to the IP literal in that case.
showDNS bool
}

// vipState is the mutable per-VIP record threaded through the tea
// dispatch loop. vipState.info is the immutable descriptor built at
// startup (see probe.go::vipInfo), while everything else on this
// struct is rewritten as probe results arrive.
type vipState struct {
info *vipInfo
// hostname is the PTR-record lookup result for info.ip, filled
// in asynchronously by runDNSLookup via hostnameMsg. Empty
// until the lookup returns (or forever, if it fails or times
// out). The UI consults Model.showDNS to decide whether to
// use it.
hostname string
// Rolling stats populated from every probeResultMsg. Separate
// from tally so reset semantics match the user's mental model:
// pressing 'r' blows away both, but a future pause-clear-resume
// cadence could reset just one.
rolling *rolling
tally map[string]int
// tallyOld / tallyNew are the two-slot rotating snapshots used
// by the tally panel to classify each backend as green (still
// receiving traffic), orange (receiving less than the leader),
// or grey (drained). tallyNew is captured every tallyWindow;
// on the next rotation it shifts into tallyOld, so the delta
// (tally - tallyOld) always spans between 1 and 2 tallyWindow
// units of activity. tallyAt is the wall-clock time tallyNew
// was captured and drives the rotation decision in tickMsg.
tallyOld map[string]int
tallyNew map[string]int
tallyAt time.Time
// Lifetime counters. Unlike the rolling window these never
// forget until the operator hits 'r'. The N column in the
// probe table renders totalProbes; FAIL renders totalFails
// tinted red when non-zero so a failure that rolled off the
// 100-sample rolling window still leaves a visible mark on
// the cumulative count.
totalProbes int64
totalFails int64
// Last-seen values for the rightmost LAST column. These are
// the "what happened on the most recent probe" snapshot the
// UI shows as green/yellow/red.
lastAt time.Time
lastOK bool
lastCode int
lastErr string
lastHeader string
lastDur time.Duration
}

// tickMsg drives the periodic redraw even when no probe results are
// arriving (so the uptime counter in the header ticks along even on
// a completely idle VIP set). 250ms is fast enough to look live
// without burning CPU on layout work.
type tickMsg time.Time

func tickCmd() tea.Cmd {
return tea.Tick(250*time.Millisecond, func(t time.Time) tea.Msg {
return tickMsg(t)
})
}

// Init kicks off the periodic redraw. The alt-screen entry and window
// title are set at NewProgram time in main.go.
func (m Model) Init() tea.Cmd {
return tickCmd()
}
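
// For context, the main.go side is the usual bubbletea wiring,
// roughly (newModel is an illustrative name; the window-title setup
// mentioned above is elided):
//
//	p := tea.NewProgram(newModel(cfg, opts), tea.WithAltScreen())
//	if _, err := p.Run(); err != nil {
//		log.Fatal(err)
//	}
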
// Update handles every tea.Msg delivered to the program. Five
// message classes:
//
//   - tea.WindowSizeMsg — resize; cache width/height for the View.
//   - tea.KeyMsg — keybindings (quit, pause, reset, help, DNS toggle).
//   - hostnameMsg — async PTR lookup result; fill in vipState.hostname.
//   - probeResultMsg — probe goroutine delivered a new sample;
//     update rolling/tally/last* and the sparkline.
//   - tickMsg — periodic redraw; re-arm the timer and rotate the
//     tally snapshots.
func (m Model) Update(msg tea.Msg) (tea.Model, tea.Cmd) {
switch msg := msg.(type) {
case tea.WindowSizeMsg:
m.width = msg.Width
m.height = msg.Height
return m, nil
case tea.KeyMsg:
switch msg.String() {
case "q", "ctrl+c":
return m, tea.Quit
case " ":
paused.Store(!paused.Load())
return m, nil
case "r":
// Unified reset: wipe per-VIP rolling windows, per-VIP
// tallies, per-VIP sparklines, the global event ring,
// and the global uptime origin so the header clock
// starts fresh. 'r' is the one-key way to start a
// clean capture window, which matches the "I'm about
// to do a failover, watch this" flow.
for _, v := range m.vips {
v.rolling.reset()
v.tally = map[string]int{}
v.tallyOld = map[string]int{}
v.tallyNew = map[string]int{}
v.tallyAt = time.Time{}
v.totalProbes = 0
v.totalFails = 0
v.lastAt = time.Time{}
v.lastOK = false
v.lastCode = 0
v.lastErr = ""
v.lastHeader = ""
v.lastDur = 0
}
m.events = nil
m.startAt = time.Now()
return m, nil
case "h", "?":
m.help = !m.help
return m, nil
case "d":
m.showDNS = !m.showDNS
return m, nil
}
case hostnameMsg:
if msg.VIPIdx >= 0 && msg.VIPIdx < len(m.vips) {
m.vips[msg.VIPIdx].hostname = msg.Hostname
}
return m, nil
case probeResultMsg:
if msg.VIPIdx < 0 || msg.VIPIdx >= len(m.vips) {
return m, nil
}
v := m.vips[msg.VIPIdx]
ns := uint64(msg.Duration.Nanoseconds())
prevMax := v.rolling.maxNS
spike := v.rolling.isSpike(ns)
v.lastAt = msg.At
v.lastOK = msg.OK
v.lastCode = msg.Code
v.lastErr = msg.Err
v.lastHeader = msg.Header
v.lastDur = msg.Duration
v.totalProbes++
if !msg.OK {
v.totalFails++
}
v.rolling.record(ns, msg.OK)
if msg.Header != "" {
v.tally[msg.Header]++
}
// Classify the sample for the error panel. Order matters:
// a network error is always more interesting than a latency
// observation (the latency is noise from the failure
// itself), an HTTP error is more interesting than a spike
// (a 503 dominates a 10ms vs 12ms latency blip), and a
// spike is only flagged on otherwise-successful samples.
if ev, ok := classifyEvent(msg, v, spike, prevMax); ok {
m.events = append(m.events, ev)
if len(m.events) > errWindowSize {
m.events = m.events[len(m.events)-errWindowSize:]
}
}
return m, nil
case tickMsg:
// Rotate each VIP's tally snapshot once the window has
// elapsed. Skipping while paused keeps the tally colours
// frozen at their pre-pause state instead of decaying
// everything to grey as deltas naturally fall to zero.
if !paused.Load() {
now := time.Time(msg)
for _, v := range m.vips {
if v.tallyAt.IsZero() || now.Sub(v.tallyAt) >= tallyWindow {
v.tallyOld = v.tallyNew
v.tallyNew = cloneTally(v.tally)
v.tallyAt = now
}
}
}
return m, tickCmd()
}
return m, nil
}

// cloneTally returns a shallow copy of src suitable for the two-slot
// rotation in vipState. The snapshot must be independent of the live
// tally because subsequent probes keep mutating the original map;
// without the copy, tallyNew and tally would alias and the delta
// would always be zero.
func cloneTally(src map[string]int) map[string]int {
out := make(map[string]int, len(src))
for k, v := range src {
out[k] = v
}
return out
}
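
// The view consumes these snapshots as a windowed per-backend delta,
// along the lines of this sketch (tallyDelta is an illustrative
// name; the real helper lives with the view code):
//
//	func tallyDelta(cur, old map[string]int) map[string]int {
//		d := make(map[string]int, len(cur))
//		for k, n := range cur {
//			if dn := n - old[k]; dn > 0 {
//				d[k] = dn
//			}
//		}
//		return d
//	}
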
// classifyEvent inspects a probeResultMsg and returns the matching
// errEvent (if any) for the error panel. Returns (_, false) when
// the sample is uninteresting — a boring 2xx/3xx HTTP response or
// a successful TCP connect. The four classes are checked in
// priority order: network/timeout errors trump HTTP status trump
// latency spikes, because a failed probe's latency is noise
// inherited from the failure.
//
// prevMax is the rolling-window max seen *before* this sample was
// recorded. It's included in the spike Detail so the operator can
// see the baseline the current probe blew past ("482ms (prev max
// 98ms)") rather than just an absolute number with no context.
func classifyEvent(msg probeResultMsg, v *vipState, spike bool, prevMax uint64) (errEvent, bool) {
if msg.Err != "" {
kind := kindNetErr
// shortError already collapses "i/o timeout" and
// "context deadline exceeded" to the literal "timeout"
// token, so an equality check is enough to distinguish
// hit-the-deadline failures from refused / reset /
// unreachable errors.
if msg.Err == "timeout" {
kind = kindTimeout
}
return errEvent{
At: msg.At,
VIPIdx: msg.VIPIdx,
Kind: kind,
Detail: msg.Err,
}, true
}
if v.info.scheme != "tcp" && msg.Code >= 400 {
return errEvent{
At: msg.At,
VIPIdx: msg.VIPIdx,
Kind: kindHTTPErr,
Detail: fmt.Sprintf("HTTP %d", msg.Code),
}, true
}
if spike {
return errEvent{
At: msg.At,
VIPIdx: msg.VIPIdx,
Kind: kindSpike,
Detail: fmt.Sprintf("%s (prev max %s)",
formatDur(msg.Duration),
formatDur(time.Duration(prevMax))),
}, true
}
return errEvent{}, false
}