fix: UDP listener parses batched datagrams

nginx-ipng-stats-plugin's ipng_stats_logtail directive buffers many log
lines into a single UDP datagram (default buffer=64k flush=1s). The
listener was treating each datagram as exactly one log line, so any
datagram with N>1 lines failed the v1 field-count check and was
silently dropped. In production this showed up as logtail_udp_packets_received_total
roughly 4x logtail_udp_loglines_success_total — matching typical
burst-coalesced 4-lines-per-batch ratios.

Fix: strip trailing CRLF, split the payload on '\n', parse each
non-empty line independently. Counter semantics now match the names:

  packets_received  — datagrams off the socket (one per recvfrom)
  loglines_success  — log lines parsed OK (may be many per datagram)
  loglines_consumed — log lines forwarded to the store (not dropped)

After the fix, loglines_success ≈ packets_received × avg_lines_per_batch.
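
The split-and-count step described above can be sketched roughly as
follows. This is an illustration under assumptions, not the listener's
actual code: splitDatagram, handleDatagram and the counters struct are
hypothetical names, and the parse/forward callbacks stand in for the
real v1 parser and the store channel.

```go
package main

import (
	"bytes"
	"fmt"
)

// counters mirrors the three counter names from the commit message.
// The struct itself is hypothetical.
type counters struct {
	packetsReceived  int
	loglinesSuccess  int
	loglinesConsumed int
}

// splitDatagram strips trailing CR/LF from the payload, splits it on
// '\n', and returns the non-empty lines. Trimming a stray '\r' at the
// end of each line is an assumption made for this sketch.
func splitDatagram(payload []byte) [][]byte {
	payload = bytes.TrimRight(payload, "\r\n")
	var lines [][]byte
	for _, line := range bytes.Split(payload, []byte{'\n'}) {
		line = bytes.TrimRight(line, "\r")
		if len(line) == 0 {
			continue // skip empty lines between or after records
		}
		lines = append(lines, line)
	}
	return lines
}

// handleDatagram counts one packet, then handles each line on its own.
// parse reports whether a line is a valid record; forward reports
// whether the store accepted it (false models a full channel).
func handleDatagram(c *counters, payload []byte, parse, forward func([]byte) bool) {
	c.packetsReceived++
	for _, line := range splitDatagram(payload) {
		if !parse(line) {
			continue // parse failure: neither success nor consumed
		}
		c.loglinesSuccess++
		if forward(line) {
			c.loglinesConsumed++
		}
	}
}

func main() {
	var c counters
	ok := func([]byte) bool { return true }
	handleDatagram(&c, []byte("line-a\nline-b\n\nline-c\r\n"), ok, ok)
	fmt.Println(c.packetsReceived, c.loglinesSuccess, c.loglinesConsumed)
}
```

One datagram carrying three lines bumps packets_received once and
loglines_success three times, which is the ratio the counters are
meant to expose.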

Regression test TestUDPListenerBatchedDatagram sends one datagram with
three '\n'-separated v1 lines and asserts all three LogRecords arrive,
plus loglines_success >= 3 * packets_received.

Docs (user-guide.md, design.md) now explain the datagram-vs-line unit
distinction so operators don't misread the ratio.
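
As a worked example of reading the two units together, the sketch
below derives the average lines per datagram and the back-pressure
drop count from two counter scrapes. The snapshot type, the helper
names and the sample values are all made up for illustration.

```go
package main

import "fmt"

// snapshot holds one scrape of the three UDP ingest counters.
// The type and its field names are hypothetical.
type snapshot struct {
	PacketsReceived  float64 // datagrams
	LoglinesSuccess  float64 // log lines
	LoglinesConsumed float64 // log lines
}

// avgLinesPerDatagram returns the increase in parsed lines divided by
// the increase in datagrams between two scrapes, or 0 if no datagrams
// arrived in the interval.
func avgLinesPerDatagram(prev, cur snapshot) float64 {
	packets := cur.PacketsReceived - prev.PacketsReceived
	if packets <= 0 {
		return 0
	}
	return (cur.LoglinesSuccess - prev.LoglinesSuccess) / packets
}

// backPressureDrops returns the number of lines that parsed OK but
// were not forwarded to the store during the interval.
func backPressureDrops(prev, cur snapshot) float64 {
	parsed := cur.LoglinesSuccess - prev.LoglinesSuccess
	consumed := cur.LoglinesConsumed - prev.LoglinesConsumed
	return parsed - consumed
}

func main() {
	// Invented sample values: 50 datagrams, 200 parsed lines,
	// 190 forwarded lines in the interval.
	prev := snapshot{PacketsReceived: 100, LoglinesSuccess: 400, LoglinesConsumed: 400}
	cur := snapshot{PacketsReceived: 150, LoglinesSuccess: 600, LoglinesConsumed: 590}
	fmt.Println("avg lines per datagram:", avgLinesPerDatagram(prev, cur))
	fmt.Println("back-pressure drops:", backPressureDrops(prev, cur))
}
```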

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 11:59:43 +02:00
parent a554cfc2ee
commit e1f8bc5eb4
4 changed files with 110 additions and 23 deletions


@@ -322,11 +322,17 @@ The collector exposes a Prometheus-compatible `/metrics` endpoint on `--prom-lis
**UDP ingest counters** — lets operators distinguish parse failures from back-pressure drops:
- `logtail_udp_packets_received_total` — datagrams read off the socket.
-- `logtail_udp_loglines_success_total` — parsed OK.
-- `logtail_udp_loglines_consumed_total` — forwarded to the store (not dropped).
+- `logtail_udp_loglines_success_total` — log lines parsed OK.
+- `logtail_udp_loglines_consumed_total` — log lines forwarded to the store (not dropped).
-`received - success` is the parse-failure rate; `success - consumed` is the back-pressure
-drop rate. Alert on either being non-zero.
+Note the unit mismatch: `packets_*` counts datagrams, `loglines_*` counts log lines.
+The nginx plugin batches many log lines into a single UDP datagram (default `buffer=64k
+flush=1s`), so `loglines_success ≫ packets_received` is normal — operators should see
+roughly `loglines_success / packets_received ≈ avg lines per batch`.
+`loglines_success - loglines_consumed` is the back-pressure drop rate (channel full).
+A large gap between `packets_received * expected_lines_per_packet` and `loglines_success`
+indicates parse failures.
**Prometheus scrape config:**