nginx-logtail/PLAN_COLLECTOR.md
2026-03-14 20:07:32 +01:00
Collector v0 — Implementation Plan ✓ COMPLETE

Module path: git.ipng.ch/ipng/nginx-logtail

Scope: A working collector that tails files, aggregates into memory, and serves TopN, Trend, and StreamSnapshots over gRPC. Full vertical slice, no optimisation passes yet.


Step 1 — Repo scaffolding

  • go mod init git.ipng.ch/ipng/nginx-logtail
  • .gitignore
  • Install deps: google.golang.org/grpc, google.golang.org/protobuf, github.com/fsnotify/fsnotify

Step 2 — Proto (proto/logtail.proto)

Write the full proto file as specified in README.md DESIGN § Protobuf API. Generate Go stubs with protoc. Commit generated files. This defines the contract everything else builds on.

Step 3 — Parser (cmd/collector/parser.go)

  • LogRecord struct: Website, ClientPrefix, URI, Status string
  • ParseLine(line string) (LogRecord, bool) — SplitN on tab, discard the query string at ?, return false for lines with fewer than 8 fields
  • TruncateIP(addr string, v4bits, v6bits int) string — handle IPv4 and IPv6
  • Unit-tested with table-driven tests: normal line, short line, IPv6, query string stripping, /24 and /48 truncation
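A minimal sketch of the Step 3 parser. The tab-field positions used here (0 = website, 1 = client address, 4 = URI, 5 = status) are illustrative assumptions; the real layout is fixed by the nginx log_format in the design doc.

```go
package main

import (
	"fmt"
	"strings"
)

// LogRecord mirrors the struct from Step 3.
type LogRecord struct {
	Website, ClientPrefix, URI, Status string
}

// ParseLine splits a tab-separated access-log line. The limit of 9 means
// any extra trailing fields stay joined in the last element, which is
// harmless since only the leading fields are used.
func ParseLine(line string) (LogRecord, bool) {
	fields := strings.SplitN(line, "\t", 9)
	if len(fields) < 8 {
		return LogRecord{}, false
	}
	uri := fields[4]
	// Discard the query string: everything from '?' onward.
	if i := strings.IndexByte(uri, '?'); i >= 0 {
		uri = uri[:i]
	}
	return LogRecord{Website: fields[0], ClientPrefix: fields[1], URI: uri, Status: fields[5]}, true
}

func main() {
	r, ok := ParseLine("example.org\t192.0.2.7\t-\t-\t/index.html?q=1\t200\t-\t-")
	fmt.Println(ok, r.URI, r.Status) // true /index.html 200
}
```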

Step 4 — Store (cmd/collector/store.go)

Implement in order, each piece testable independently:

  1. Tuple4 and live map — map[Tuple4]int64, cap enforcement at 100K, Ingest(r LogRecord)
  2. Fine ring — [60]Snapshot circular buffer; rotate() heap-selects the top-50K from the live map, appends to the ring, resets the live map
  3. Coarse ring — [288]Snapshot, populated every 5 fine rotations by merging the last 5 fine snapshots into a top-5K snapshot
  4. QueryTopN(filter, groupBy, n, window) — RLock, sum bucket range, group by dimension, apply filter, heap-select top N
  5. QueryTrend(filter, window) — per-bucket count sum, returns one point per bucket
  6. Store.Run(ch <-chan LogRecord) — single goroutine: read channel → Ingest, minute ticker → rotate()
  7. Snapshot broadcast — per-subscriber buffered channel fan-out; Subscribe() <-chan Snapshot / Unsubscribe(ch)
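The heap selection that rotate() and QueryTopN both rely on can be sketched with container/heap: keep a size-k min-heap and evict the smallest entry whenever a larger count arrives, giving O(n log k) instead of sorting all n entries. The entry type and key shape are illustrative, not the store's real Tuple4.

```go
package main

import (
	"container/heap"
	"fmt"
)

// entry pairs a grouping key with its summed count.
type entry struct {
	key   string
	count int64
}

// minHeap keeps the smallest count on top so it can be evicted cheaply.
type minHeap []entry

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].count < h[j].count }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(entry)) }
func (h *minHeap) Pop() any {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// topK heap-selects the k largest counts in O(n log k) — the same idea
// rotate() uses to shrink a 100K-entry live map to a top-50K snapshot.
func topK(counts map[string]int64, k int) []entry {
	h := &minHeap{}
	for key, c := range counts {
		if h.Len() < k {
			heap.Push(h, entry{key, c})
		} else if c > (*h)[0].count {
			(*h)[0] = entry{key, c}
			heap.Fix(h, 0)
		}
	}
	// Pop ascending, filling the result from the back for descending order.
	out := make([]entry, h.Len())
	for i := len(out) - 1; i >= 0; i-- {
		out[i] = heap.Pop(h).(entry)
	}
	return out
}

func main() {
	got := topK(map[string]int64{"/a": 5, "/b": 9, "/c": 1, "/d": 7}, 2)
	fmt.Println(got) // [{/b 9} {/d 7}]
}
```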

Step 5 — Tailer (cmd/collector/tailer.go)

  • Tailer struct: path, fsnotify watcher, output channel
  • On start: open file, seek to EOF, register fsnotify watch
  • On fsnotify.Write: bufio.Scanner reads all new lines, sends LogRecord to channel
  • On fsnotify.Rename / Remove: drain to EOF, close fd, retry open with 100 ms backoff (up to 5 s), resume from position 0 — no lines lost between drain and reopen
  • Tailer.Run(ctx context.Context) — blocks until context cancelled

Step 6 — gRPC server (cmd/collector/server.go)

  • Server wraps *Store, implements LogtailServiceServer
  • TopN: store.QueryTopN → marshal to proto response
  • Trend: store.QueryTrend → marshal to proto response
  • StreamSnapshots: store.Subscribe(), loop sending snapshots until client disconnects or context done, then store.Unsubscribe(ch)
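The Subscribe/Unsubscribe fan-out that StreamSnapshots consumes can be sketched with a mutex-guarded set of buffered channels. Dropping a snapshot when a subscriber's buffer is full (rather than blocking the rotate goroutine) is an assumption here, not something the plan states; the Snapshot stub stands in for the store's real snapshot type.

```go
package main

import (
	"fmt"
	"sync"
)

// Snapshot stands in for the store's per-minute snapshot type.
type Snapshot struct{ Minute int }

// broadcaster implements the per-subscriber buffered-channel fan-out
// from Step 4.7 that StreamSnapshots builds on.
type broadcaster struct {
	mu   sync.Mutex
	subs map[chan Snapshot]struct{}
}

func newBroadcaster() *broadcaster {
	return &broadcaster{subs: make(map[chan Snapshot]struct{})}
}

func (b *broadcaster) Subscribe() chan Snapshot {
	ch := make(chan Snapshot, 4) // buffered so Publish never blocks
	b.mu.Lock()
	b.subs[ch] = struct{}{}
	b.mu.Unlock()
	return ch
}

func (b *broadcaster) Unsubscribe(ch chan Snapshot) {
	b.mu.Lock()
	delete(b.subs, ch)
	b.mu.Unlock()
	close(ch)
}

func (b *broadcaster) Publish(s Snapshot) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for ch := range b.subs {
		select {
		case ch <- s:
		default: // slow subscriber: drop rather than stall rotation
		}
	}
}

func main() {
	b := newBroadcaster()
	ch := b.Subscribe()
	b.Publish(Snapshot{Minute: 1})
	fmt.Println(<-ch) // {1}
	b.Unsubscribe(ch)
}
```

In the gRPC handler, the loop would range over the subscribed channel and send each snapshot to the stream, calling Unsubscribe when the stream's context is done.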

Step 7 — Main (cmd/collector/main.go)

Flags:

  • --listen default :9090
  • --logs comma-separated log file paths
  • --source name for this collector instance (default: hostname)
  • --v4prefix default 24
  • --v6prefix default 48

Wire-up: create channel → start store.Run goroutine → start one Tailer goroutine per log path → start gRPC server → signal.NotifyContext for clean shutdown on SIGINT/SIGTERM.
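The wire-up order above can be sketched as follows. storeRun and tailerRun are stand-ins for Store.Run and Tailer.Run, and the demo sends itself SIGINT so it terminates; a real run waits for the operator.

```go
package main

import (
	"context"
	"fmt"
	"os/signal"
	"sync"
	"syscall"
)

func storeRun(ctx context.Context, ch <-chan string) {
	for {
		select {
		case <-ctx.Done():
			return
		case line := <-ch:
			_ = line // Ingest(ParseLine(line)) in the real collector
		}
	}
}

func tailerRun(ctx context.Context, path string, ch chan<- string) {
	<-ctx.Done() // the real tailer reads fsnotify events here
}

// run wires channel → store goroutine → one tailer per path, then blocks
// until the signal-driven context is cancelled. Returns true on clean
// shutdown (context cancelled, all goroutines drained).
func run(paths []string) bool {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	ch := make(chan string, 1024) // tailers → store
	var wg sync.WaitGroup
	wg.Add(1)
	go func() { defer wg.Done(); storeRun(ctx, ch) }()
	for _, p := range paths {
		wg.Add(1)
		go func(p string) { defer wg.Done(); tailerRun(ctx, p, ch) }(p)
	}

	// Simulate SIGINT so the demo exits.
	syscall.Kill(syscall.Getpid(), syscall.SIGINT)
	wg.Wait()
	return ctx.Err() != nil
}

func main() {
	if run([]string{"/var/log/nginx/access.log"}) {
		fmt.Println("clean shutdown")
	}
}
```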

Step 8 — Smoke test

  • Generate fake log lines at 10K/s (small Go script or shell one-liner)
  • Run collector against them
  • Use grpcurl to call TopN and verify results
  • Check runtime.MemStats to confirm memory stays well under 1 GB

Deferred (not in v0)

  • cmd/cli, cmd/aggregator, cmd/frontend
  • ClickHouse export
  • TLS / auth
  • Prometheus metrics endpoint

Implementation notes

Deviation from plan: MultiTailer

Step 5 planned one Tailer struct per file. During implementation this was changed to a single MultiTailer with one shared fsnotify.Watcher. Reason: one watcher per file creates one inotify instance per file, and the kernel default limit is 128 instances per user, which hundreds of log files would exhaust. The MultiTailer uses a single instance and routes events by path via a map[string]*fileState.

Deviation from plan: IPv6 /48 semantics

The design doc said "truncate to /48". /48 keeps the first three full 16-bit groups intact (e.g. 2001:db8:cafe::1 → 2001:db8:cafe::/48). An early test expected 2001:db8:ca00::/48 (truncating mid-group), which was wrong. The code is correct; the test was fixed.


Test results

Run with: go test ./cmd/collector/ -v -count=1 -timeout 120s

| Test | What it covers |
| --- | --- |
| TestParseLine (7 cases) | Tab parsing, query string stripping, bad lines |
| TestTruncateIP | IPv4 /24 and IPv6 /48 masking |
| TestIngestAndRotate | Live map → fine ring rotation |
| TestLiveMapCap | Hard cap at 100K entries, no panic beyond cap |
| TestQueryTopN | Ranked results from ring buffer |
| TestQueryTopNWithFilter | Filter by HTTP status code |
| TestQueryTrend | Per-bucket counts, oldest-first ordering |
| TestCoarseRingPopulated | 5 fine ticks → 1 coarse bucket, count aggregation |
| TestSubscribeBroadcast | Fan-out channel delivery after rotation |
| TestTopKOrdering | Heap select returns correct top-K descending |
| TestMultiTailerReadsLines | Live file write → LogRecord received on channel |
| TestMultiTailerMultipleFiles | 5 files, one watcher, all lines received |
| TestMultiTailerLogRotation | RENAME → drain → retry → new file tailed correctly |
| TestExpandGlobs | Glob pattern expands to matching files only |
| TestExpandGlobsDeduplication | Same file via path + glob deduplicated to one |
| TestMemoryBudget | Full ring fill stays within 1 GB heap |
| TestGRPCEndToEnd | Real gRPC server: TopN, filtered TopN, Trend, StreamSnapshots |

Total: 17 tests, all passing.


Benchmark results

Run with: go test ./cmd/collector/ -bench=. -benchtime=3s

Hardware: 12th Gen Intel Core i7-12700T

| Benchmark | ns/op | Throughput | Headroom vs 10K/s |
| --- | --- | --- | --- |
| BenchmarkParseLine | 418 | ~2.4M lines/s | 240× |
| BenchmarkIngest | 152 | ~6.5M records/s | 650× |

Both the parser and the store ingestion goroutine have several hundred times more capacity than the 10 000 lines/second peak requirement. The bottleneck at scale will be fsnotify event delivery and kernel I/O, not the Go code.