nginx-logtail/PLAN_COLLECTOR.md
2026-03-14 20:07:32 +01:00
Collector v0 — Implementation Plan ✓ COMPLETE

Module path: git.ipng.ch/ipng/nginx-logtail

Scope: A working collector that tails files, aggregates into memory, and serves TopN, Trend, and StreamSnapshots over gRPC. Full vertical slice, no optimisation passes yet.


Step 1 — Repo scaffolding

  • go mod init git.ipng.ch/ipng/nginx-logtail
  • .gitignore
  • Install deps: google.golang.org/grpc, google.golang.org/protobuf, github.com/fsnotify/fsnotify

Step 2 — Proto (proto/logtail.proto)

Write the full proto file as specified in README.md DESIGN § Protobuf API. Generate Go stubs with protoc. Commit generated files. This defines the contract everything else builds on.

Step 3 — Parser (cmd/collector/parser.go)

  • LogRecord struct: Website, ClientPrefix, URI, Status string
  • ParseLine(line string) (LogRecord, bool) — SplitN on tab, discard the query string at ?, return false for lines with fewer than 8 fields
  • TruncateIP(addr string, v4bits, v6bits int) string — handle IPv4 and IPv6
  • Unit-tested with table-driven tests: normal line, short line, IPv6, query string stripping, /24 and /48 truncation
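A minimal sketch of the Step 3 parser. The tab-field positions used here (0 = website, 1 = client address, 4 = URI, 5 = status) are illustrative assumptions; the real layout is fixed by the nginx log_format in the design doc.

```go
package main

import (
	"fmt"
	"strings"
)

// LogRecord mirrors the struct from Step 3.
type LogRecord struct {
	Website, ClientPrefix, URI, Status string
}

// ParseLine splits a tab-separated access-log line. The limit of 9 means
// any extra trailing fields stay joined in the last element, which is
// harmless since only the leading fields are used.
func ParseLine(line string) (LogRecord, bool) {
	fields := strings.SplitN(line, "\t", 9)
	if len(fields) < 8 {
		return LogRecord{}, false
	}
	uri := fields[4]
	// Discard the query string: everything from '?' onward.
	if i := strings.IndexByte(uri, '?'); i >= 0 {
		uri = uri[:i]
	}
	return LogRecord{Website: fields[0], ClientPrefix: fields[1], URI: uri, Status: fields[5]}, true
}

func main() {
	r, ok := ParseLine("example.org\t192.0.2.7\t-\t-\t/index.html?q=1\t200\t-\t-")
	fmt.Println(ok, r.URI, r.Status) // true /index.html 200
}
```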

Step 4 — Store (cmd/collector/store.go)

Implement in order, each piece testable independently:

  1. Tuple4 and live map — map[Tuple4]int64, cap enforcement at 100K, Ingest(r LogRecord)
  2. Fine ring — [60]Snapshot circular buffer; rotate() heap-selects the top-50K from the live map, appends to the ring, resets the live map
  3. Coarse ring — [288]Snapshot, populated every 5 fine rotations by merging the last 5 fine snapshots into a top-5K snapshot
  4. QueryTopN(filter, groupBy, n, window) — RLock, sum bucket range, group by dimension, apply filter, heap-select top N
  5. QueryTrend(filter, window) — per-bucket count sum, returns one point per bucket
  6. Store.Run(ch <-chan LogRecord) — single goroutine: read channel → Ingest, minute ticker → rotate()
  7. Snapshot broadcast — per-subscriber buffered channel fan-out; Subscribe() <-chan Snapshot / Unsubscribe(ch)
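The heap selection that rotate() and QueryTopN both rely on can be sketched with container/heap: keep a size-k min-heap and evict the smallest entry whenever a larger count arrives, giving O(n log k) instead of sorting all n entries. The entry type and key shape are illustrative, not the store's real Tuple4.

```go
package main

import (
	"container/heap"
	"fmt"
)

// entry pairs a grouping key with its summed count.
type entry struct {
	key   string
	count int64
}

// minHeap keeps the smallest count on top so it can be evicted cheaply.
type minHeap []entry

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].count < h[j].count }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(entry)) }
func (h *minHeap) Pop() any {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// topK heap-selects the k largest counts in O(n log k) — the same idea
// rotate() uses to shrink a 100K-entry live map to a top-50K snapshot.
func topK(counts map[string]int64, k int) []entry {
	h := &minHeap{}
	for key, c := range counts {
		if h.Len() < k {
			heap.Push(h, entry{key, c})
		} else if c > (*h)[0].count {
			(*h)[0] = entry{key, c}
			heap.Fix(h, 0)
		}
	}
	// Pop ascending, filling the result from the back for descending order.
	out := make([]entry, h.Len())
	for i := len(out) - 1; i >= 0; i-- {
		out[i] = heap.Pop(h).(entry)
	}
	return out
}

func main() {
	got := topK(map[string]int64{"/a": 5, "/b": 9, "/c": 1, "/d": 7}, 2)
	fmt.Println(got) // [{/b 9} {/d 7}]
}
```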

Step 5 — Tailer (cmd/collector/tailer.go)

  • Tailer struct: path, fsnotify watcher, output channel
  • On start: open file, seek to EOF, register fsnotify watch
  • On fsnotify.Write: bufio.Scanner reads all new lines, sends LogRecord to channel
  • On fsnotify.Rename / Remove: drain to EOF, close fd, retry open with 100 ms backoff (up to 5 s), resume from position 0 — no lines lost between drain and reopen
  • Tailer.Run(ctx context.Context) — blocks until context cancelled

Step 6 — gRPC server (cmd/collector/server.go)

  • Server wraps *Store, implements LogtailServiceServer
  • TopN: store.QueryTopN → marshal to proto response
  • Trend: store.QueryTrend → marshal to proto response
  • StreamSnapshots: store.Subscribe(), loop sending snapshots until client disconnects or context done, then store.Unsubscribe(ch)
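The Subscribe/Unsubscribe fan-out that StreamSnapshots consumes can be sketched with a mutex-guarded set of buffered channels. Dropping a snapshot when a subscriber's buffer is full (rather than blocking the rotate goroutine) is an assumption here, not something the plan states; the Snapshot stub stands in for the store's real snapshot type.

```go
package main

import (
	"fmt"
	"sync"
)

// Snapshot stands in for the store's per-minute snapshot type.
type Snapshot struct{ Minute int }

// broadcaster implements the per-subscriber buffered-channel fan-out
// from Step 4.7 that StreamSnapshots builds on.
type broadcaster struct {
	mu   sync.Mutex
	subs map[chan Snapshot]struct{}
}

func newBroadcaster() *broadcaster {
	return &broadcaster{subs: make(map[chan Snapshot]struct{})}
}

func (b *broadcaster) Subscribe() chan Snapshot {
	ch := make(chan Snapshot, 4) // buffered so Publish never blocks
	b.mu.Lock()
	b.subs[ch] = struct{}{}
	b.mu.Unlock()
	return ch
}

func (b *broadcaster) Unsubscribe(ch chan Snapshot) {
	b.mu.Lock()
	delete(b.subs, ch)
	b.mu.Unlock()
	close(ch)
}

func (b *broadcaster) Publish(s Snapshot) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for ch := range b.subs {
		select {
		case ch <- s:
		default: // slow subscriber: drop rather than stall rotation
		}
	}
}

func main() {
	b := newBroadcaster()
	ch := b.Subscribe()
	b.Publish(Snapshot{Minute: 1})
	fmt.Println(<-ch) // {1}
	b.Unsubscribe(ch)
}
```

In the gRPC handler, the loop would range over the subscribed channel and send each snapshot to the stream, calling Unsubscribe when the stream's context is done.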

Step 7 — Main (cmd/collector/main.go)

Flags:

  • --listen default :9090
  • --logs comma-separated log file paths
  • --source name for this collector instance (default: hostname)
  • --v4prefix default 24
  • --v6prefix default 48

Wire-up: create channel → start store.Run goroutine → start one Tailer goroutine per log path → start gRPC server → signal.NotifyContext for clean shutdown on SIGINT/SIGTERM.
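The wire-up order above can be sketched as follows. storeRun and tailerRun are stand-ins for Store.Run and Tailer.Run, and the demo sends itself SIGINT so it terminates; a real run waits for the operator.

```go
package main

import (
	"context"
	"fmt"
	"os/signal"
	"sync"
	"syscall"
)

func storeRun(ctx context.Context, ch <-chan string) {
	for {
		select {
		case <-ctx.Done():
			return
		case line := <-ch:
			_ = line // Ingest(ParseLine(line)) in the real collector
		}
	}
}

func tailerRun(ctx context.Context, path string, ch chan<- string) {
	<-ctx.Done() // the real tailer reads fsnotify events here
}

// run wires channel → store goroutine → one tailer per path, then blocks
// until the signal-driven context is cancelled. Returns true on clean
// shutdown (context cancelled, all goroutines drained).
func run(paths []string) bool {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	ch := make(chan string, 1024) // tailers → store
	var wg sync.WaitGroup
	wg.Add(1)
	go func() { defer wg.Done(); storeRun(ctx, ch) }()
	for _, p := range paths {
		wg.Add(1)
		go func(p string) { defer wg.Done(); tailerRun(ctx, p, ch) }(p)
	}

	// Simulate SIGINT so the demo exits.
	syscall.Kill(syscall.Getpid(), syscall.SIGINT)
	wg.Wait()
	return ctx.Err() != nil
}

func main() {
	if run([]string{"/var/log/nginx/access.log"}) {
		fmt.Println("clean shutdown")
	}
}
```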

Step 8 — Smoke test

  • Generate fake log lines at 10K/s (small Go script or shell one-liner)
  • Run collector against them
  • Use grpcurl to call TopN and verify results
  • Check runtime.MemStats to confirm memory stays well under 1 GB

Deferred (not in v0)

  • cmd/cli, cmd/aggregator, cmd/frontend
  • ClickHouse export
  • TLS / auth
  • Prometheus metrics endpoint

Implementation notes

Deviation from plan: MultiTailer

Step 5 planned one Tailer struct per file. During implementation this was changed to a single MultiTailer with one shared fsnotify.Watcher. Reason: one watcher per file creates one inotify instance per file, and the kernel default limit is 128 instances per user, which hundreds of log files would exhaust. The MultiTailer uses a single instance and routes events by path via a map[string]*fileState.

Deviation from plan: IPv6 /48 semantics

The design doc said "truncate to /48". /48 keeps the first three full 16-bit groups intact (e.g. 2001:db8:cafe::1 → 2001:db8:cafe::/48). An early test expected 2001:db8:ca00::/48 (truncating mid-group), which was wrong. The code is correct; the test was fixed.


Test results

Run with: go test ./cmd/collector/ -v -count=1 -timeout 120s

| Test | What it covers |
| --- | --- |
| TestParseLine (7 cases) | Tab parsing, query string stripping, bad lines |
| TestTruncateIP | IPv4 /24 and IPv6 /48 masking |
| TestIngestAndRotate | Live map → fine ring rotation |
| TestLiveMapCap | Hard cap at 100K entries, no panic beyond cap |
| TestQueryTopN | Ranked results from ring buffer |
| TestQueryTopNWithFilter | Filter by HTTP status code |
| TestQueryTrend | Per-bucket counts, oldest-first ordering |
| TestCoarseRingPopulated | 5 fine ticks → 1 coarse bucket, count aggregation |
| TestSubscribeBroadcast | Fan-out channel delivery after rotation |
| TestTopKOrdering | Heap select returns correct top-K descending |
| TestMultiTailerReadsLines | Live file write → LogRecord received on channel |
| TestMultiTailerMultipleFiles | 5 files, one watcher, all lines received |
| TestMultiTailerLogRotation | RENAME → drain → retry → new file tailed correctly |
| TestExpandGlobs | Glob pattern expands to matching files only |
| TestExpandGlobsDeduplication | Same file via path + glob deduplicated to one |
| TestMemoryBudget | Full ring fill stays within 1 GB heap |
| TestGRPCEndToEnd | Real gRPC server: TopN, filtered TopN, Trend, StreamSnapshots |

Total: 17 tests, all passing.


Benchmark results

Run with: go test ./cmd/collector/ -bench=. -benchtime=3s

Hardware: 12th Gen Intel Core i7-12700T

| Benchmark | ns/op | Throughput | Headroom vs 10K/s |
| --- | --- | --- | --- |
| BenchmarkParseLine | 418 | ~2.4M lines/s | 240× |
| BenchmarkIngest | 152 | ~6.5M records/s | 650× |

Both the parser and the store ingestion goroutine have several hundred times more capacity than the 10 000 lines/second peak requirement. The bottleneck at scale will be fsnotify event delivery and kernel I/O, not the Go code.