nginx-logtail/PLAN_COLLECTOR.md (2026-03-14)
# Collector v0 — Implementation Plan ✓ COMPLETE
Module path: `git.ipng.ch/ipng/nginx-logtail`
**Scope:** A working collector that tails files, aggregates into memory, and serves `TopN`,
`Trend`, and `StreamSnapshots` over gRPC. Full vertical slice, no optimisation passes yet.
---
## Step 1 — Repo scaffolding
- `go mod init git.ipng.ch/ipng/nginx-logtail`
- `.gitignore`
- Install deps: `google.golang.org/grpc`, `google.golang.org/protobuf`, `github.com/fsnotify/fsnotify`
## Step 2 — Proto (`proto/logtail.proto`)
Write the full proto file as specified in README.md DESIGN § Protobuf API. Generate Go stubs with
`protoc`. Commit generated files. This defines the contract everything else builds on.
## Step 3 — Parser (`cmd/collector/parser.go`)
- `LogRecord` struct: `Website`, `ClientPrefix`, `URI`, `Status string`
- `ParseLine(line string) (LogRecord, bool)` — `SplitN` on tab, discard query string at `?`,
return `false` for lines with fewer than 8 fields
- `TruncateIP(addr string, v4bits, v6bits int) string` — handle IPv4 and IPv6
- Unit-tested with table-driven tests: normal line, short line, IPv6, query string stripping,
/24 and /48 truncation
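A minimal sketch of the two parser functions. The field positions (0 = website, 2 = client IP, 6 = URI, 7 = status) are illustrative assumptions, not the real nginx `log_format` layout:

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// LogRecord is the parsed record from Step 3.
type LogRecord struct {
	Website, ClientPrefix, URI, Status string
}

// ParseLine splits a tab-separated access-log line. Field positions
// here are placeholders; match them to the real log format.
func ParseLine(line string) (LogRecord, bool) {
	fields := strings.Split(line, "\t")
	if len(fields) < 8 {
		return LogRecord{}, false
	}
	uri := fields[6]
	if i := strings.IndexByte(uri, '?'); i >= 0 {
		uri = uri[:i] // discard query string
	}
	return LogRecord{
		Website:      fields[0],
		ClientPrefix: TruncateIP(fields[2], 24, 48),
		URI:          uri,
		Status:       fields[7],
	}, true
}

// TruncateIP masks addr to a /v4bits (IPv4) or /v6bits (IPv6) prefix
// and returns it in CIDR notation; unparseable input passes through.
func TruncateIP(addr string, v4bits, v6bits int) string {
	ip := net.ParseIP(addr)
	if ip == nil {
		return addr
	}
	bits, total := v6bits, 128
	if v4 := ip.To4(); v4 != nil {
		ip, bits, total = v4, v4bits, 32
	}
	mask := net.CIDRMask(bits, total)
	return (&net.IPNet{IP: ip.Mask(mask), Mask: mask}).String()
}

func main() {
	rec, ok := ParseLine("example.com\t-\t192.0.2.55\t-\t-\t-\t/index.html?q=1\t200")
	fmt.Println(ok, rec.ClientPrefix, rec.URI) // true 192.0.2.0/24 /index.html
}
```

Returning the prefix in CIDR notation (`192.0.2.0/24`) rather than a bare address keeps the truncation width visible in query results.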
## Step 4 — Store (`cmd/collector/store.go`)
Implement in order, each piece testable independently:
1. **`Tuple4` and live map** — `map[Tuple4]int64`, cap enforcement at 100K, `Ingest(r LogRecord)`
2. **Fine ring buffer** — `[60]Snapshot` circular array; `rotate()` heap-selects top-50K from
the live map, appends to ring, resets live map
3. **Coarse ring buffer** — `[288]Snapshot`, populated every 5 fine rotations by merging
the last 5 fine snapshots into a top-5K snapshot
4. **`QueryTopN(filter, groupBy, n, window)`** — RLock, sum bucket range, group by dimension,
apply filter, heap-select top N
5. **`QueryTrend(filter, window)`** — per-bucket count sum, returns one point per bucket
6. **`Store.Run(ch <-chan LogRecord)`** — single goroutine: read channel → `Ingest`, minute
ticker → `rotate()`
7. **Snapshot broadcast** — per-subscriber buffered channel fan-out;
`Subscribe() <-chan Snapshot` / `Unsubscribe(ch)`
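The heap-select shared by `rotate()` and `QueryTopN` can be sketched as a size-K min-heap over counts, which selects the K largest of n entries in O(n log k). `Tuple4`'s exact dimensions are not spelled out above, so the four string fields here are an assumption:

```go
package main

import (
	"container/heap"
	"fmt"
	"sort"
)

// Tuple4 is the live-map key; the four dimensions are assumed to be
// website, client prefix, URI, and status.
type Tuple4 struct {
	Website, ClientPrefix, URI, Status string
}

type entry struct {
	Key   Tuple4
	Count int64
}

// minHeap keeps the current K largest entries, smallest on top.
type minHeap []entry

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].Count < h[j].Count }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(entry)) }
func (h *minHeap) Pop() any {
	old := *h
	e := old[len(old)-1]
	*h = old[:len(old)-1]
	return e
}

// topK returns the k highest-count entries in descending order;
// rotate() would call this with k = 50 000.
func topK(m map[Tuple4]int64, k int) []entry {
	h := &minHeap{}
	for key, c := range m {
		switch {
		case h.Len() < k:
			heap.Push(h, entry{key, c})
		case c > (*h)[0].Count:
			(*h)[0] = entry{key, c} // evict current minimum
			heap.Fix(h, 0)
		}
	}
	out := append([]entry(nil), *h...)
	sort.Slice(out, func(i, j int) bool { return out[i].Count > out[j].Count })
	return out
}

func main() {
	m := map[Tuple4]int64{
		{URI: "/a"}: 5, {URI: "/b"}: 9, {URI: "/c"}: 1, {URI: "/d"}: 7,
	}
	for _, e := range topK(m, 2) {
		fmt.Println(e.Key.URI, e.Count) // /b 9, then /d 7
	}
}
```

The min-heap variant avoids sorting the full 100K-entry live map on every rotation; only the surviving K entries are sorted at the end.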
## Step 5 — Tailer (`cmd/collector/tailer.go`)
- `Tailer` struct: path, fsnotify watcher, output channel
- On start: open file, seek to EOF, register fsnotify watch
- On `fsnotify.Write`: `bufio.Scanner` reads all new lines, sends `LogRecord` to channel
- On `fsnotify.Rename` / `Remove`: drain to EOF, close fd, retry open with 100 ms backoff
(up to 5 s), resume from position 0 — no lines lost between drain and reopen
- `Tailer.Run(ctx context.Context)` — blocks until context cancelled
## Step 6 — gRPC server (`cmd/collector/server.go`)
- `Server` wraps `*Store`, implements `LogtailServiceServer`
- `TopN`: `store.QueryTopN` → marshal to proto response
- `Trend`: `store.QueryTrend` → marshal to proto response
- `StreamSnapshots`: `store.Subscribe()`, loop sending snapshots until client disconnects
or context done, then `store.Unsubscribe(ch)`
## Step 7 — Main (`cmd/collector/main.go`)
Flags:
- `--listen` default `:9090`
- `--logs` comma-separated log file paths
- `--source` name for this collector instance (default: hostname)
- `--v4prefix` default `24`
- `--v6prefix` default `48`
Wire-up: create channel → start `store.Run` goroutine → start one `Tailer` goroutine per log
path → start gRPC server → `signal.NotifyContext` for clean shutdown on SIGINT/SIGTERM.
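The flag wiring might look like the following sketch; putting the flags on a fresh `FlagSet` keeps the defaults testable. `main` would then build the channel, store, tailers, and gRPC server under `signal.NotifyContext`:

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// Config collects the collector's flags from Step 7.
type Config struct {
	Listen, Logs, Source string
	V4Prefix, V6Prefix   int
}

// parseFlags defines the Step 7 flags with their documented defaults.
func parseFlags(args []string) (Config, error) {
	fs := flag.NewFlagSet("collector", flag.ContinueOnError)
	host, _ := os.Hostname() // default --source is the hostname
	var c Config
	fs.StringVar(&c.Listen, "listen", ":9090", "gRPC listen address")
	fs.StringVar(&c.Logs, "logs", "", "comma-separated log file paths")
	fs.StringVar(&c.Source, "source", host, "name for this collector instance")
	fs.IntVar(&c.V4Prefix, "v4prefix", 24, "IPv4 truncation prefix length")
	fs.IntVar(&c.V6Prefix, "v6prefix", 48, "IPv6 truncation prefix length")
	if err := fs.Parse(args); err != nil {
		return Config{}, err
	}
	return c, nil
}

func main() {
	c, err := parseFlags(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	// Shutdown would hang off signal.NotifyContext(context.Background(),
	// syscall.SIGINT, syscall.SIGTERM) as described above.
	fmt.Printf("%+v\n", c)
}
```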
## Step 8 — Smoke test
- Generate fake log lines at 10K/s (small Go script or shell one-liner)
- Run collector against them
- Use `grpcurl` to call `TopN` and verify results
- Check `runtime.MemStats` to confirm memory stays well under 1 GB
---
## Deferred (not in v0)
- `cmd/cli`, `cmd/aggregator`, `cmd/frontend`
- ClickHouse export
- TLS / auth
- Prometheus metrics endpoint
---
## Implementation notes
### Deviation from plan: MultiTailer
Step 5 planned one `Tailer` struct per file. During implementation this was changed to a single
`MultiTailer` with one shared `fsnotify.Watcher`. Reason: one watcher per file creates one inotify
instance per file; the kernel default limit is 128 instances per user, which would be hit with
100s of log files. The `MultiTailer` uses a single instance and routes events by path via a
`map[string]*fileState`.
### Deviation from plan: IPv6 /48 semantics
The design doc said "truncate to /48". `/48` keeps the first three full 16-bit groups intact
(e.g. `2001:db8:cafe::1` → `2001:db8:cafe::/48`). An early test expected `2001:db8:ca00::/48`
(truncating mid-group), which was wrong. The code is correct; the test was fixed.
---
## Test results
Run with: `go test ./cmd/collector/ -v -count=1 -timeout 120s`
| Test | What it covers |
|-----------------------------|----------------------------------------------------|
| `TestParseLine` (7 cases) | Tab parsing, query string stripping, bad lines |
| `TestTruncateIP` | IPv4 /24 and IPv6 /48 masking |
| `TestIngestAndRotate` | Live map → fine ring rotation |
| `TestLiveMapCap` | Hard cap at 100 K entries, no panic beyond cap |
| `TestQueryTopN` | Ranked results from ring buffer |
| `TestQueryTopNWithFilter` | Filter by HTTP status code |
| `TestQueryTrend` | Per-bucket counts, oldest-first ordering |
| `TestCoarseRingPopulated` | 5 fine ticks → 1 coarse bucket, count aggregation |
| `TestSubscribeBroadcast` | Fan-out channel delivery after rotation |
| `TestTopKOrdering` | Heap select returns correct top-K descending |
| `TestMultiTailerReadsLines` | Live file write → LogRecord received on channel |
| `TestMultiTailerMultipleFiles` | 5 files, one watcher, all lines received |
| `TestMultiTailerLogRotation`| RENAME → drain → retry → new file tailed correctly |
| `TestExpandGlobs` | Glob pattern expands to matching files only |
| `TestExpandGlobsDeduplication` | Same file via path + glob deduplicated to one |
| `TestMemoryBudget` | Full ring fill stays within 1 GB heap |
| `TestGRPCEndToEnd` | Real gRPC server: TopN, filtered TopN, Trend, StreamSnapshots |
**Total: 17 tests, all passing.**
---
## Benchmark results
Run with: `go test ./cmd/collector/ -bench=. -benchtime=3s`
Hardware: 12th Gen Intel Core i7-12700T
| Benchmark | ns/op | throughput | headroom vs 10K/s |
|--------------------|-------|----------------|-------------------|
| `BenchmarkParseLine` | 418 | ~2.4M lines/s | 240× |
| `BenchmarkIngest` | 152 | ~6.5M records/s| 650× |
Both the parser and the store ingestion goroutine have several hundred times more capacity than
the 10 000 lines/second peak requirement. The bottleneck at scale will be fsnotify event delivery
and kernel I/O, not the Go code.