Collector implementation

2026-03-14 20:07:22 +01:00
parent 4393ae2726
commit 6ca296b2e8
16 changed files with 3052 additions and 0 deletions

PLAN_COLLECTOR.md
# Collector v0 — Implementation Plan ✓ COMPLETE
Module path: `git.ipng.ch/ipng/nginx-logtail`
**Scope:** A working collector that tails files, aggregates into memory, and serves `TopN`,
`Trend`, and `StreamSnapshots` over gRPC. Full vertical slice, no optimisation passes yet.
---
## Step 1 — Repo scaffolding
- `go mod init git.ipng.ch/ipng/nginx-logtail`
- `.gitignore`
- Install deps: `google.golang.org/grpc`, `google.golang.org/protobuf`, `github.com/fsnotify/fsnotify`
## Step 2 — Proto (`proto/logtail.proto`)
Write the full proto file as specified in README.md DESIGN § Protobuf API. Generate Go stubs with
`protoc`. Commit generated files. This defines the contract everything else builds on.
## Step 3 — Parser (`cmd/collector/parser.go`)
- `LogRecord` struct: `Website`, `ClientPrefix`, `URI`, `Status string`
- `ParseLine(line string) (LogRecord, bool)` — `SplitN` on tab, discard the query string at `?`,
return `false` for lines with fewer than 8 fields
- `TruncateIP(addr string, v4bits, v6bits int) string` — handle IPv4 and IPv6
- Unit-tested with table-driven tests: normal line, short line, IPv6, query string stripping,
/24 and /48 truncation
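A minimal sketch of the parser described above. The exact tab-field positions used here (0 = website, 1 = client IP, 4 = URI, 5 = status) are assumptions for illustration; the real layout is defined by the nginx `log_format` referenced in README.md:

```go
package main

import "strings"

// LogRecord mirrors the struct described in Step 3.
type LogRecord struct {
	Website      string
	ClientPrefix string
	URI          string
	Status       string
}

// ParseLine splits one tab-separated access-log line. Lines with fewer
// than 8 fields are rejected; the query string is discarded at '?'.
func ParseLine(line string) (LogRecord, bool) {
	fields := strings.SplitN(line, "\t", 9)
	if len(fields) < 8 {
		return LogRecord{}, false
	}
	uri := fields[4] // field position is an assumption
	if i := strings.IndexByte(uri, '?'); i >= 0 {
		uri = uri[:i] // strip the query string
	}
	return LogRecord{
		Website:      fields[0],
		ClientPrefix: fields[1], // truncated later via TruncateIP
		URI:          uri,
		Status:       fields[5],
	}, true
}
```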
## Step 4 — Store (`cmd/collector/store.go`)
Implement in order, each piece testable independently:
1. **`Tuple4` and live map** — `map[Tuple4]int64`, cap enforcement at 100K, `Ingest(r LogRecord)`
2. **Fine ring buffer** — `[60]Snapshot` circular array; `rotate()` heap-selects top-50K from
   live map, appends to ring, resets live map
3. **Coarse ring buffer** — `[288]Snapshot`, populated every 5 fine rotations by merging
   the last 5 fine snapshots into a top-5K snapshot
4. **`QueryTopN(filter, groupBy, n, window)`** — RLock, sum bucket range, group by dimension,
apply filter, heap-select top N
5. **`QueryTrend(filter, window)`** — per-bucket count sum, returns one point per bucket
6. **`Store.Run(ch <-chan LogRecord)`** — single goroutine: read channel → `Ingest`, minute
ticker → `rotate()`
7. **Snapshot broadcast** — per-subscriber buffered channel fan-out;
`Subscribe() <-chan Snapshot` / `Unsubscribe(ch)`
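The heap-selection step shared by `rotate()` and `QueryTopN` can be sketched as follows. Keys are plain strings here for brevity; the real store selects over `Tuple4` keys or grouped dimensions:

```go
package main

import "container/heap"

// kv pairs a grouping key with its summed count.
type kv struct {
	key   string
	count int64
}

// minHeap keeps the smallest count at the root so it can be evicted
// whenever a larger candidate arrives.
type minHeap []kv

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].count < h[j].count }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(kv)) }
func (h *minHeap) Pop() any {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// topK selects the k largest entries from counts, returned in
// descending order, in O(n log k) instead of sorting everything.
func topK(counts map[string]int64, k int) []kv {
	h := make(minHeap, 0, k)
	for key, c := range counts {
		if len(h) < k {
			heap.Push(&h, kv{key, c})
		} else if c > h[0].count {
			h[0] = kv{key, c}
			heap.Fix(&h, 0)
		}
	}
	// drain the min-heap back-to-front to get descending order
	out := make([]kv, len(h))
	for i := len(out) - 1; i >= 0; i-- {
		out[i] = heap.Pop(&h).(kv)
	}
	return out
}
```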
## Step 5 — Tailer (`cmd/collector/tailer.go`)
- `Tailer` struct: path, fsnotify watcher, output channel
- On start: open file, seek to EOF, register fsnotify watch
- On `fsnotify.Write`: `bufio.Scanner` reads all new lines, sends `LogRecord` to channel
- On `fsnotify.Rename` / `Remove`: drain to EOF, close fd, retry open with 100 ms backoff
(up to 5 s), resume from position 0 — no lines lost between drain and reopen
- `Tailer.Run(ctx context.Context)` — blocks until context cancelled
## Step 6 — gRPC server (`cmd/collector/server.go`)
- `Server` wraps `*Store`, implements `LogtailServiceServer`
- `TopN`: `store.QueryTopN` → marshal to proto response
- `Trend`: `store.QueryTrend` → marshal to proto response
- `StreamSnapshots`: `store.Subscribe()`, loop sending snapshots until client disconnects
or context done, then `store.Unsubscribe(ch)`
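The fan-out from Step 4.7 that `StreamSnapshots` consumes could be sketched like this. The proto type is replaced by a stand-in `Snapshot` struct, and the buffer size of 8 is an assumption; the key property is that a slow subscriber drops snapshots rather than blocking `rotate()`:

```go
package main

import "sync"

// Snapshot is a stand-in for the generated proto Snapshot message.
type Snapshot struct{ Bucket int64 }

// broadcaster fans snapshots out to per-subscriber buffered channels.
type broadcaster struct {
	mu   sync.Mutex
	subs map[chan Snapshot]struct{}
}

func newBroadcaster() *broadcaster {
	return &broadcaster{subs: make(map[chan Snapshot]struct{})}
}

func (b *broadcaster) Subscribe() chan Snapshot {
	ch := make(chan Snapshot, 8) // buffer size is an assumption
	b.mu.Lock()
	b.subs[ch] = struct{}{}
	b.mu.Unlock()
	return ch
}

func (b *broadcaster) Unsubscribe(ch chan Snapshot) {
	b.mu.Lock()
	delete(b.subs, ch)
	b.mu.Unlock()
	close(ch)
}

// Publish is called after each rotation; sends never block.
func (b *broadcaster) Publish(s Snapshot) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for ch := range b.subs {
		select {
		case ch <- s:
		default: // subscriber is slow; drop rather than block
		}
	}
}
```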
## Step 7 — Main (`cmd/collector/main.go`)
Flags:
- `--listen` default `:9090`
- `--logs` comma-separated log file paths
- `--source` name for this collector instance (default: hostname)
- `--v4prefix` default `24`
- `--v6prefix` default `48`
Wire-up: create channel → start `store.Run` goroutine → start one `Tailer` goroutine per log
path → start gRPC server → `signal.NotifyContext` for clean shutdown on SIGINT/SIGTERM.
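The flag surface above can be sketched as follows. Using a named `FlagSet` instead of the package-level `flag` globals is a choice of this sketch (it keeps the parsing testable), not part of the plan:

```go
package main

import (
	"flag"
	"os"
	"strings"
)

// parseFlags assembles the Step 7 flag set with the defaults listed in
// the plan. main would then create the LogRecord channel, start
// store.Run and one tailer per path, serve gRPC, and shut down cleanly
// via signal.NotifyContext on SIGINT/SIGTERM.
func parseFlags(args []string) (listen string, logs []string, err error) {
	fs := flag.NewFlagSet("collector", flag.ContinueOnError)
	listenF := fs.String("listen", ":9090", "gRPC listen address")
	logsF := fs.String("logs", "", "comma-separated log file paths")
	host, _ := os.Hostname()
	fs.String("source", host, "name for this collector instance")
	fs.Int("v4prefix", 24, "IPv4 truncation prefix length")
	fs.Int("v6prefix", 48, "IPv6 truncation prefix length")
	if err := fs.Parse(args); err != nil {
		return "", nil, err
	}
	return *listenF, strings.Split(*logsF, ","), nil
}
```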
## Step 8 — Smoke test
- Generate fake log lines at 10K/s (small Go script or shell one-liner)
- Run collector against them
- Use `grpcurl` to call `TopN` and verify results
- Check `runtime.MemStats` to confirm memory stays well under 1 GB
---
## Deferred (not in v0)
- `cmd/cli`, `cmd/aggregator`, `cmd/frontend`
- ClickHouse export
- TLS / auth
- Prometheus metrics endpoint
---
## Implementation notes
### Deviation from plan: MultiTailer
Step 5 planned one `Tailer` struct per file. During implementation this was changed to a single
`MultiTailer` with one shared `fsnotify.Watcher`. Reason: one watcher per file creates one inotify
instance per file; the kernel default limit is 128 instances per user, which would be hit with
100s of log files. The `MultiTailer` uses a single instance and routes events by path via a
`map[string]*fileState`.
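The path-routing core of that design could look like this. The shared `fsnotify.Watcher` and per-file offset handling are elided, and the `fileState` fields are illustrative:

```go
package main

import "path/filepath"

// fileState holds the per-file bookkeeping (open fd, read offset) that
// the shared event loop updates; fields here are illustrative.
type fileState struct {
	path   string
	offset int64
}

// MultiTailer owns one watcher (elided) and routes its events by cleaned
// path, so the process consumes a single inotify instance regardless of
// how many files are tailed.
type MultiTailer struct {
	files map[string]*fileState
}

func NewMultiTailer(paths []string) *MultiTailer {
	m := &MultiTailer{files: make(map[string]*fileState)}
	for _, p := range paths {
		p = filepath.Clean(p)
		m.files[p] = &fileState{path: p}
	}
	return m
}

// lookup resolves a watcher event name to the matching fileState, or nil
// for events on files we do not tail.
func (m *MultiTailer) lookup(eventName string) *fileState {
	return m.files[filepath.Clean(eventName)]
}
```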
### Deviation from plan: IPv6 /48 semantics
The design doc said "truncate to /48". `/48` keeps the first three full 16-bit groups intact
(e.g. `2001:db8:cafe::1` → `2001:db8:cafe::/48`). An early test expected `2001:db8:ca00::/48`
(truncating mid-group), which was wrong. The code is correct; the test was fixed.
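A mask-based `TruncateIP` exhibits exactly the correct behaviour described above (returning unparseable input unchanged is an assumption of this sketch):

```go
package main

import "net"

// TruncateIP masks addr to the configured prefix length, matching the
// Step 3 signature. CIDRMask zeroes whole bits, so a /48 keeps the first
// three 16-bit groups of an IPv6 address intact.
func TruncateIP(addr string, v4bits, v6bits int) string {
	ip := net.ParseIP(addr)
	if ip == nil {
		return addr // leave unparseable input alone (assumption)
	}
	if v4 := ip.To4(); v4 != nil {
		return v4.Mask(net.CIDRMask(v4bits, 32)).String()
	}
	return ip.Mask(net.CIDRMask(v6bits, 128)).String()
}
```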
---
## Test results
Run with: `go test ./cmd/collector/ -v -count=1 -timeout 120s`
| Test | What it covers |
|-----------------------------|----------------------------------------------------|
| `TestParseLine` (7 cases) | Tab parsing, query string stripping, bad lines |
| `TestTruncateIP` | IPv4 /24 and IPv6 /48 masking |
| `TestIngestAndRotate` | Live map → fine ring rotation |
| `TestLiveMapCap` | Hard cap at 100 K entries, no panic beyond cap |
| `TestQueryTopN` | Ranked results from ring buffer |
| `TestQueryTopNWithFilter` | Filter by HTTP status code |
| `TestQueryTrend` | Per-bucket counts, oldest-first ordering |
| `TestCoarseRingPopulated` | 5 fine ticks → 1 coarse bucket, count aggregation |
| `TestSubscribeBroadcast` | Fan-out channel delivery after rotation |
| `TestTopKOrdering` | Heap select returns correct top-K descending |
| `TestMultiTailerReadsLines` | Live file write → LogRecord received on channel |
| `TestMultiTailerMultipleFiles` | 5 files, one watcher, all lines received |
| `TestMultiTailerLogRotation`| RENAME → drain → retry → new file tailed correctly |
| `TestExpandGlobs` | Glob pattern expands to matching files only |
| `TestExpandGlobsDeduplication` | Same file via path + glob deduplicated to one |
| `TestMemoryBudget` | Full ring fill stays within 1 GB heap |
| `TestGRPCEndToEnd` | Real gRPC server: TopN, filtered TopN, Trend, StreamSnapshots |
**Total: 17 tests, all passing.**
---
## Benchmark results
Run with: `go test ./cmd/collector/ -bench=. -benchtime=3s`
Hardware: 12th Gen Intel Core i7-12700T
| Benchmark | ns/op | throughput | headroom vs 10K/s |
|--------------------|-------|----------------|-------------------|
| `BenchmarkParseLine` | 418 | ~2.4M lines/s | 240× |
| `BenchmarkIngest` | 152 | ~6.5M records/s| 650× |
Both the parser and the store ingestion goroutine have several hundred times more capacity than
the 10 000 lines/second peak requirement. The bottleneck at scale will be fsnotify event delivery
and kernel I/O, not the Go code.