Add aggregator backfill, pulling fine+coarse buckets from collectors

commit eddb04ced4 (parent d2dcd88c4b), 2026-03-25 07:03:46 +01:00
11 changed files with 419 additions and 1384 deletions


@@ -264,6 +264,7 @@ message Snapshot {
string source = 1;
int64 timestamp = 2;
repeated TopNEntry entries = 3; // full top-50K for this bucket
bool is_coarse = 4; // true for 5-min coarse buckets (DumpSnapshots only)
}
// Target discovery: list the collectors behind the queried endpoint
@@ -274,15 +275,22 @@ message TargetInfo {
}
message ListTargetsResponse { repeated TargetInfo targets = 1; }
// Backfill: dump full ring buffer contents for aggregator restart recovery
message DumpSnapshotsRequest {}
// Response reuses Snapshot; is_coarse distinguishes fine (1-min) from coarse (5-min) buckets.
// Stream closes after all historical data is sent (unlike StreamSnapshots which stays open).
service LogtailService {
rpc TopN(TopNRequest) returns (TopNResponse);
rpc Trend(TrendRequest) returns (TrendResponse);
rpc StreamSnapshots(SnapshotRequest) returns (stream Snapshot);
rpc ListTargets(ListTargetsRequest) returns (ListTargetsResponse);
rpc DumpSnapshots(DumpSnapshotsRequest) returns (stream Snapshot);
}
// Both collector and aggregator implement LogtailService.
// The aggregator's StreamSnapshots re-streams the merged view.
// ListTargets: aggregator returns all configured collectors; collector returns itself.
// DumpSnapshots: collector only; aggregator calls this on startup to backfill its ring.
```
## Program 1 — Collector
@@ -334,11 +342,16 @@ service LogtailService {
- **TopN query**: RLock ring, sum bucket range, apply filter, group by dimension, heap-select top N.
- **Trend query**: per-bucket filtered sum, returns one `TrendPoint` per bucket.
- **Subscriber fan-out**: per-subscriber buffered channel; `Subscribe`/`Unsubscribe` for streaming.
- **`DumpRings()`**: acquires `RLock`, copies both ring arrays and their head/filled pointers
(just slice headers — microseconds), releases lock, then returns chronologically-ordered fine
and coarse snapshot slices. The lock is never held during serialisation or network I/O.
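The copy-then-unroll shape can be sketched as follows; `Store`, `Bucket`, and the head/fill cursor layout are assumptions for illustration, not the actual code. The key property is that buckets are immutable once written, so copying the pointer slices under `RLock` is a safe snapshot:

```go
package main

import "sync"

// Bucket is a hypothetical per-bucket aggregate; immutable once written,
// which is what makes a pointer-slice copy a safe snapshot.
type Bucket struct {
	Timestamp int64
	Counts    map[string]int64
}

// Store holds the tiered rings; head is the next write slot, fill the
// number of valid entries.
type Store struct {
	mu                     sync.RWMutex
	fine, coarse           []*Bucket
	fineHead, fineFill     int
	coarseHead, coarseFill int
}

// DumpRings copies the ring slice headers and cursors under RLock
// (microseconds of work), releases the lock, then unrolls each ring
// chronologically. Serialisation and network I/O by the caller happen
// entirely outside the lock.
func (s *Store) DumpRings() (fine, coarse []*Bucket) {
	s.mu.RLock()
	f := append([]*Bucket(nil), s.fine...) // copies pointers, not buckets
	c := append([]*Bucket(nil), s.coarse...)
	fh, ff := s.fineHead, s.fineFill
	ch, cf := s.coarseHead, s.coarseFill
	s.mu.RUnlock()
	return unroll(f, fh, ff), unroll(c, ch, cf)
}

// unroll returns the fill valid entries oldest-first; the oldest sits at
// head-fill modulo the ring length.
func unroll(ring []*Bucket, head, fill int) []*Bucket {
	out := make([]*Bucket, 0, fill)
	n := len(ring)
	for i := 0; i < fill; i++ {
		out = append(out, ring[(head-fill+i+n)%n])
	}
	return out
}
```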
### server.go
- gRPC server on configurable port (default `:9090`).
- `TopN` and `Trend`: unary, answered from the ring buffer under RLock.
- `StreamSnapshots`: registers a subscriber channel; loops receiving from it and `Send`ing each snapshot on the stream; 30 s keepalive ticker.
- `DumpSnapshots`: calls `DumpRings()`, streams all fine buckets (`is_coarse=false`) then all
coarse buckets (`is_coarse=true`), then closes the stream. No lock held during streaming.
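The handler's fine-then-coarse ordering and close-on-return behaviour can be sketched like this; the generated proto types are stubbed (`sendFn` models the stream's `Send`), so names here are illustrative, not the real server code:

```go
package main

// Snapshot stands in for the generated proto message.
type Snapshot struct {
	Timestamp int64
	IsCoarse  bool
}

type Bucket struct{ Timestamp int64 }

// sendFn models stream.Send on the server-side DumpSnapshots stream.
type sendFn func(*Snapshot) error

// dumpSnapshots streams every fine bucket (IsCoarse=false), then every
// coarse bucket (IsCoarse=true), then returns nil; returning is what closes
// the stream and tells the aggregator the dump is complete. No store lock
// is held here: the caller passes the already-copied DumpRings output.
func dumpSnapshots(fine, coarse []*Bucket, send sendFn) error {
	for _, b := range fine {
		if err := send(&Snapshot{Timestamp: b.Timestamp}); err != nil {
			return err // client went away; abandon the dump
		}
	}
	for _, b := range coarse {
		if err := send(&Snapshot{Timestamp: b.Timestamp, IsCoarse: true}); err != nil {
			return err
		}
	}
	return nil
}
```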
## Program 2 — Aggregator
@@ -362,6 +375,23 @@ service LogtailService {
to the same 1-minute cadence as collectors regardless of how many collectors are connected.
- Same tiered ring structure as the collector store; populated from `merger.TopK()` each tick.
- `QueryTopN`, `QueryTrend`, `Subscribe`/`Unsubscribe` — identical interface to collector store.
- **`LoadHistorical(fine, coarse []Snapshot)`**: writes pre-merged backfill snapshots directly into
the ring arrays under `mu.Lock()`, sets head and filled counters, then returns. Safe to call
concurrently with queries. The live ticker continues from the updated head after this returns.
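A sketch of the install step, under the assumption (consistent with the description above) that backfill snapshots arrive chronologically and are written from slot 0, with head left pointing at the next write slot; `Cache` and its field names are illustrative:

```go
package main

import "sync"

type Bucket struct{ Timestamp int64 }

// Cache mirrors the collector's tiered ring store; head is the next write
// slot and fill the number of valid entries.
type Cache struct {
	mu                     sync.Mutex
	fine, coarse           []*Bucket
	fineHead, fineFill     int
	coarseHead, coarseFill int
}

// LoadHistorical installs pre-merged backfill snapshots (assumed
// chronological) under the write lock and updates head/fill so the live
// ticker's next write lands right after the newest backfilled bucket.
func (c *Cache) LoadHistorical(fine, coarse []*Bucket) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.fineHead, c.fineFill = load(c.fine, fine)
	c.coarseHead, c.coarseFill = load(c.coarse, coarse)
}

// load writes buckets into ring starting at slot 0, keeping only the
// newest len(ring) buckets, and returns the new head and fill.
func load(ring, buckets []*Bucket) (head, fill int) {
	n := len(ring)
	if n == 0 {
		return 0, 0
	}
	if len(buckets) > n {
		buckets = buckets[len(buckets)-n:]
	}
	copy(ring, buckets)
	return len(buckets) % n, len(buckets)
}
```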
### backfill.go
- **`Backfill(ctx, collectorAddrs, cache)`**: called once at aggregator startup (in a goroutine,
after the gRPC server is already listening so the frontend is never blocked).
- Dials all collectors concurrently and calls `DumpSnapshots` on each.
- Accumulates entries per timestamp in `map[unix-second]map[label]count`; multiple collectors'
contributions for the same bucket are summed — the same delta-merge semantics as the live path.
- Sorts timestamps chronologically, runs `TopKFromMap` per bucket, caps to ring size.
- Calls `cache.LoadHistorical` once with the merged results.
- **Graceful degradation**: if a collector returns `Unimplemented` (old binary without
  `DumpSnapshots`), an informational message is logged and that collector is skipped; live
  streaming still starts normally. Any other error is logged (with timing) and the collector
  likewise skipped. Partial backfill (some collectors succeeding, some failing) is supported.
- Logs per-collector stats: bucket counts, total entry counts, and wall-clock duration.
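The accumulate-then-select step above can be sketched as follows; `Snapshot`, `Entry`, and `topKFromMap` are simplified stand-ins for the proto types and the real selection helper (which heap-selects rather than fully sorting):

```go
package main

import "sort"

// Entry and Snapshot stand in for the generated proto types.
type Entry struct {
	Label string
	Count int64
}
type Snapshot struct {
	Timestamp int64
	Entries   []Entry
}

// mergeDumps sums every collector's contribution per (timestamp, label),
// matching the live path's delta-merge semantics, then emits one top-K
// bucket per timestamp in chronological order.
func mergeDumps(perCollector [][]Snapshot, k int) []Snapshot {
	acc := map[int64]map[string]int64{}
	for _, dump := range perCollector {
		for _, snap := range dump {
			m := acc[snap.Timestamp]
			if m == nil {
				m = map[string]int64{}
				acc[snap.Timestamp] = m
			}
			for _, e := range snap.Entries {
				m[e.Label] += e.Count
			}
		}
	}
	ts := make([]int64, 0, len(acc))
	for t := range acc {
		ts = append(ts, t)
	}
	sort.Slice(ts, func(i, j int) bool { return ts[i] < ts[j] })

	out := make([]Snapshot, 0, len(ts))
	for _, t := range ts {
		out = append(out, Snapshot{Timestamp: t, Entries: topKFromMap(acc[t], k)})
	}
	return out
}

// topKFromMap selects the k highest-count labels (full sort for brevity).
func topKFromMap(m map[string]int64, k int) []Entry {
	es := make([]Entry, 0, len(m))
	for l, c := range m {
		es = append(es, Entry{l, c})
	}
	sort.Slice(es, func(i, j int) bool { return es[i].Count > es[j].Count })
	if len(es) > k {
		es = es[:k]
	}
	return es
}
```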
### registry.go
- **`TargetRegistry`**: `sync.RWMutex`-protected `map[addr → name]`. Initialised with the
@@ -489,3 +519,6 @@ with a non-zero code on gRPC error.
| Regex filters compiled once per query (`CompiledFilter`) | Up to 288 × 5 000 per-entry calls — compiling per-entry would dominate query latency |
| Filter expression box (`q=`) redirects to canonical URL | Filter state stays in individual `f_*` params; URLs remain shareable and bookmarkable |
| `ListTargets` + frontend source picker | "Which nginx is busiest?" answered by switching `target=` to a collector; no data model changes, no extra memory |
| Backfill via `DumpSnapshots` on restart | Aggregator recovers full 24h ring from collectors on restart; gRPC server starts first so frontend is never blocked during backfill |
| `DumpRings()` copies under lock, streams without lock | Lock held for microseconds (slice-header copy only); network I/O happens outside the lock so minute rotation is never delayed |
| Backfill merges per-timestamp across collectors | Correct cross-collector sums per bucket, same semantics as live delta-merge; collectors that don't support `DumpSnapshots` are skipped gracefully |