Add is_tor plumbing from collector->aggregator->frontend/cli

2026-03-23 22:17:39 +01:00
parent b89caa594c
commit cd7f15afaf
20 changed files with 1815 additions and 212 deletions

@@ -27,7 +27,7 @@ Add the `logtail` log format to your `nginx.conf` and apply it to each `server`
```nginx
http {
-log_format logtail '$host\t$remote_addr\t$msec\t$request_method\t$request_uri\t$status\t$body_bytes_sent\t$request_time';
+log_format logtail '$host\t$remote_addr\t$msec\t$request_method\t$request_uri\t$status\t$body_bytes_sent\t$request_time\t$is_tor';
server {
access_log /var/log/nginx/access.log logtail;
@@ -38,7 +38,10 @@ http {
```
The format is tab-separated with fixed field positions. Query strings are stripped from the URI
-by the collector at ingest time — only the path is tracked.
+by the collector at ingest time — only the path is tracked. `$is_tor` must be set to `1` when
+the client IP is a TOR exit node and `0` otherwise (this is typically populated by a custom nginx
+variable or a Lua script that checks the IP against a TOR exit list). The field is optional for
+backward compatibility — log lines without it are accepted and treated as `is_tor=0`.
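
The optional-field rule above can be sketched in a few lines of Python. This is an illustration only, not the collector's actual parser: `LogRecord` and `parse_logtail_line` are hypothetical names, and treating any ninth-field value other than `1` as non-TOR is an assumption.

```python
from typing import NamedTuple


class LogRecord(NamedTuple):
    """One parsed `logtail` line (hypothetical type, for illustration)."""
    host: str
    remote_addr: str
    msec: float
    method: str
    uri_path: str
    status: int
    body_bytes_sent: int
    request_time: float
    is_tor: bool


def parse_logtail_line(line: str) -> LogRecord:
    """Parse a tab-separated logtail line with 8 fields (old) or 9 fields (new)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (8, 9):
        raise ValueError(f"expected 8 or 9 tab-separated fields, got {len(fields)}")
    host, remote_addr, msec, method, uri, status, body_bytes, request_time = fields[:8]
    # A missing ninth field means the line predates $is_tor: treat it as is_tor=0.
    # Assumption: any value other than "1" also counts as non-TOR.
    is_tor = len(fields) == 9 and fields[8] == "1"
    return LogRecord(
        host=host,
        remote_addr=remote_addr,
        msec=float(msec),
        method=method,
        uri_path=uri.split("?", 1)[0],  # drop the query string, keep only the path
        status=int(status),
        body_bytes_sent=int(body_bytes),
        request_time=float(request_time),
        is_tor=is_tor,
    )
```

A line written with the old eight-field format parses with `is_tor` set to `False`, so existing logs keep working after the format change.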
---
@@ -64,14 +67,15 @@ windows, and exposes a gRPC interface for the aggregator (and directly for the C
### Flags
-| Flag | Default | Description |
-|----------------|--------------|-----------------------------------------------------------|
-| `--listen` | `:9090` | gRPC listen address |
-| `--logs` | — | Comma-separated log file paths or glob patterns |
-| `--logs-file` | — | File containing one log path/glob per line |
-| `--source` | hostname | Name for this collector in query responses |
-| `--v4prefix` | `24` | IPv4 prefix length for client bucketing (e.g. /24 → /23) |
-| `--v6prefix` | `48` | IPv6 prefix length for client bucketing |
+| Flag | Default | Description |
+|-------------------|--------------|-----------------------------------------------------------|
+| `--listen` | `:9090` | gRPC listen address |
+| `--logs` | — | Comma-separated log file paths or glob patterns |
+| `--logs-file` | — | File containing one log path/glob per line |
+| `--source` | hostname | Name for this collector in query responses |
+| `--v4prefix` | `24` | IPv4 prefix length for client bucketing (e.g. /24 → /23) |
+| `--v6prefix` | `48` | IPv6 prefix length for client bucketing |
+| `--scan-interval` | `10s` | How often to rescan glob patterns for new/removed files |
At least one of `--logs` or `--logs-file` is required.
@@ -124,7 +128,7 @@ The collector is designed to stay well under 1 GB:
| Coarse ring (288 × 5-min) | 288 × 5 000 | ~268 MB |
| **Total** | | **~845 MB** |
-When the live map reaches 100 000 distinct 4-tuples, new keys are dropped for the rest of that
+When the live map reaches 100 000 distinct 5-tuples, new keys are dropped for the rest of that
minute. Existing keys continue to accumulate counts. The cap resets at each minute rotation.
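
The capping behaviour described above can be sketched as follows. This is a simplified illustration, not the collector's code: the `LiveMap` class and its `dropped` counter are hypothetical, and the key layout `(website, prefix, uri, status, is_tor)` is an assumption about what the 5-tuple contains.

```python
from collections import Counter

MAX_LIVE_KEYS = 100_000  # cap on distinct keys per minute, as quoted above


class LiveMap:
    """Per-minute counter that stops admitting new keys once the cap is hit."""

    def __init__(self, max_keys: int = MAX_LIVE_KEYS) -> None:
        self.max_keys = max_keys
        self.counts: Counter = Counter()
        self.dropped = 0  # hypothetical bookkeeping, not mentioned in the source

    def add(self, key: tuple) -> None:
        # Existing keys always accumulate; new keys are admitted only below the cap.
        if key in self.counts or len(self.counts) < self.max_keys:
            self.counts[key] += 1
        else:
            self.dropped += 1

    def rotate(self) -> Counter:
        """Hand back the finished minute and start a fresh map, which resets the cap."""
        finished = self.counts
        self.counts = Counter()
        self.dropped = 0
        return finished
```

Under this assumed key layout, adding `is_tor` can at most double the number of distinct keys for the same traffic, since each former 4-tuple can now appear once with `is_tor=0` and once with `is_tor=1`.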
### Time windows
@@ -284,6 +288,10 @@ Supported fields and operators:
| `website` | `=` `~=` | `website~=gouda.*` |
| `uri` | `=` `~=` | `uri~=^/api/` |
| `prefix` | `=` | `prefix=1.2.3.0/24` |
+| `is_tor` | `=` `!=` | `is_tor=1`, `is_tor!=0` |
+`is_tor=1` and `is_tor!=0` are equivalent (TOR traffic only). `is_tor=0` and `is_tor!=1` are
+equivalent (non-TOR traffic only).
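
The equivalence between the four spellings reduces to a single boolean. The sketch below shows one way to normalise them; `parse_is_tor_filter` is a hypothetical helper, not part of logtail, and accepting the bare CLI-style values (`1`, `!=0`) alongside the full `is_tor=...` form is an assumption made for convenience.

```python
def parse_is_tor_filter(expr: str) -> bool:
    """Return True for "TOR only" and False for "non-TOR only".

    Accepts `is_tor=1`, `is_tor!=0`, `is_tor=0`, `is_tor!=1`, and the same
    expressions without the `is_tor` prefix (`1`, `!=0`, `0`, `!=1`).
    """
    expr = expr.strip()
    if expr.startswith("is_tor"):
        expr = expr[len("is_tor"):]
    if expr.startswith("!="):
        negate, value = True, expr[2:]
    elif expr.startswith("="):
        negate, value = False, expr[1:]
    else:
        negate, value = False, expr  # bare value, as passed to a CLI flag
    if value not in ("0", "1"):
        raise ValueError(f"is_tor value must be 0 or 1, got {value!r}")
    # `!=` flips the meaning: "!=0" selects TOR traffic, "!=1" selects non-TOR.
    return (value == "1") != negate
```

Because only two values exist, `!=` adds no expressive power here; it simply mirrors the operator syntax of the other fields.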
`~=` means RE2 regex match. Values with spaces or quotes may be wrapped in double or single
quotes: `uri~="^/search\?q="`.
@@ -303,8 +311,8 @@ accept RE2 regular expressions. The breadcrumb strip shows them as `website~=gou
`uri~=^/api/` with the usual `×` remove link.
**URL sharing** — all filter state is in the URL query string (`w`, `by`, `f_website`,
-`f_prefix`, `f_uri`, `f_status`, `f_website_re`, `f_uri_re`, `n`). Copy the URL to share an
-exact view with another operator, or bookmark a recurring query.
+`f_prefix`, `f_uri`, `f_status`, `f_website_re`, `f_uri_re`, `f_is_tor`, `n`). Copy the URL to
+share an exact view with another operator, or bookmark a recurring query.
**JSON output** — append `&raw=1` to any URL to receive the TopN result as JSON instead of
HTML. Useful for scripting without the CLI binary:
@@ -359,6 +367,7 @@ logtail-cli targets [flags] list targets known to the queried endpoint
| `--status` | — | Filter: HTTP status expression (`200`, `!=200`, `>=400`, `<500`, …) |
| `--website-re`| — | Filter: RE2 regex against website |
| `--uri-re` | — | Filter: RE2 regex against request URI |
+| `--is-tor` | — | Filter: `1` or `!=0` = TOR only; `0` or `!=1` = non-TOR only |
### `topn` flags
@@ -455,6 +464,12 @@ logtail-cli topn --target agg:9091 --window 5m --website-re 'gouda.*'
# Filter by URI regex: all /api/ paths
logtail-cli topn --target agg:9091 --window 5m --group-by uri --uri-re '^/api/'
+# Show only TOR traffic — which websites are TOR clients hitting?
+logtail-cli topn --target agg:9091 --window 5m --is-tor 1
+# Show non-TOR traffic only — exclude exit nodes from the view
+logtail-cli topn --target agg:9091 --window 5m --is-tor 0
# Compare two collectors side by side in one command
logtail-cli topn --target nginx1:9090,nginx2:9090 --window 5m