Add designdoc and AP2.0 license for this nginx module
201
LICENSE
Normal file
@@ -0,0 +1,201 @@
|
||||
Apache License
|
||||
Version 2.0, January 2004
|
||||
http://www.apache.org/licenses/
|
||||
|
||||
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
|
||||
|
||||
1. Definitions.
|
||||
|
||||
"License" shall mean the terms and conditions for use, reproduction,
|
||||
and distribution as defined by Sections 1 through 9 of this document.
|
||||
|
||||
"Licensor" shall mean the copyright owner or entity authorized by
|
||||
the copyright owner that is granting the License.
|
||||
|
||||
"Legal Entity" shall mean the union of the acting entity and all
|
||||
other entities that control, are controlled by, or are under common
|
||||
control with that entity. For the purposes of this definition,
|
||||
"control" means (i) the power, direct or indirect, to cause the
|
||||
direction or management of such entity, whether by contract or
|
||||
otherwise, or (ii) ownership of fifty percent (50%) or more of the
|
||||
outstanding shares, or (iii) beneficial ownership of such entity.
|
||||
|
||||
"You" (or "Your") shall mean an individual or Legal Entity
|
||||
exercising permissions granted by this License.
|
||||
|
||||
"Source" form shall mean the preferred form for making modifications,
|
||||
including but not limited to software source code, documentation
|
||||
source, and configuration files.
|
||||
|
||||
"Object" form shall mean any form resulting from mechanical
|
||||
transformation or translation of a Source form, including but
|
||||
not limited to compiled object code, generated documentation,
|
||||
and conversions to other media types.
|
||||
|
||||
"Work" shall mean the work of authorship, whether in Source or
|
||||
Object form, made available under the License, as indicated by a
|
||||
copyright notice that is included in or attached to the work
|
||||
(an example is provided in the Appendix below).
|
||||
|
||||
"Derivative Works" shall mean any work, whether in Source or Object
|
||||
form, that is based on (or derived from) the Work and for which the
|
||||
editorial revisions, annotations, elaborations, or other modifications
|
||||
represent, as a whole, an original work of authorship. For the purposes
|
||||
of this License, Derivative Works shall not include works that remain
|
||||
separable from, or merely link (or bind by name) to the interfaces of,
|
||||
the Work and Derivative Works thereof.
|
||||
|
||||
"Contribution" shall mean any work of authorship, including
|
||||
the original version of the Work and any modifications or additions
|
||||
to that Work or Derivative Works thereof, that is intentionally
|
||||
submitted to Licensor for inclusion in the Work by the copyright owner
|
||||
or by an individual or Legal Entity authorized to submit on behalf of
|
||||
the copyright owner. For the purposes of this definition, "submitted"
|
||||
means any form of electronic, verbal, or written communication sent
|
||||
to the Licensor or its representatives, including but not limited to
|
||||
communication on electronic mailing lists, source code control systems,
|
||||
and issue tracking systems that are managed by, or on behalf of, the
|
||||
Licensor for the purpose of discussing and improving the Work, but
|
||||
excluding communication that is conspicuously marked or otherwise
|
||||
designated in writing by the copyright owner as "Not a Contribution."
|
||||
|
||||
"Contributor" shall mean Licensor and any individual or Legal Entity
|
||||
on behalf of whom a Contribution has been received by Licensor and
|
||||
subsequently incorporated within the Work.
|
||||
|
||||
2. Grant of Copyright License. Subject to the terms and conditions of
|
||||
this License, each Contributor hereby grants to You a perpetual,
|
||||
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
|
||||
copyright license to reproduce, prepare Derivative Works of,
|
||||
publicly display, publicly perform, sublicense, and distribute the
|
||||
Work and such Derivative Works in Source or Object form.
|
||||
|
||||
3. Grant of Patent License. Subject to the terms and conditions of
|
||||
this License, each Contributor hereby grants to You a perpetual,
|
||||
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
|
||||
(except as stated in this section) patent license to make, have made,
|
||||
use, offer to sell, sell, import, and otherwise transfer the Work,
|
||||
where such license applies only to those patent claims licensable
|
||||
by such Contributor that are necessarily infringed by their
|
||||
Contribution(s) alone or by combination of their Contribution(s)
|
||||
with the Work to which such Contribution(s) was submitted. If You
|
||||
institute patent litigation against any entity (including a
|
||||
cross-claim or counterclaim in a lawsuit) alleging that the Work
|
||||
or a Contribution incorporated within the Work constitutes direct
|
||||
or contributory patent infringement, then any patent licenses
|
||||
granted to You under this License for that Work shall terminate
|
||||
as of the date such litigation is filed.
|
||||
|
||||
4. Redistribution. You may reproduce and distribute copies of the
|
||||
Work or Derivative Works thereof in any medium, with or without
|
||||
modifications, and in Source or Object form, provided that You
|
||||
meet the following conditions:
|
||||
|
||||
(a) You must give any other recipients of the Work or
|
||||
Derivative Works a copy of this License; and
|
||||
|
||||
(b) You must cause any modified files to carry prominent notices
|
||||
stating that You changed the files; and
|
||||
|
||||
(c) You must retain, in the Source form of any Derivative Works
|
||||
that You distribute, all copyright, patent, trademark, and
|
||||
attribution notices from the Source form of the Work,
|
||||
excluding those notices that do not pertain to any part of
|
||||
the Derivative Works; and
|
||||
|
||||
(d) If the Work includes a "NOTICE" text file as part of its
|
||||
distribution, then any Derivative Works that You distribute must
|
||||
include a readable copy of the attribution notices contained
|
||||
within such NOTICE file, excluding those notices that do not
|
||||
pertain to any part of the Derivative Works, in at least one
|
||||
of the following places: within a NOTICE text file distributed
|
||||
as part of the Derivative Works; within the Source form or
|
||||
documentation, if provided along with the Derivative Works; or,
|
||||
within a display generated by the Derivative Works, if and
|
||||
wherever such third-party notices normally appear. The contents
|
||||
of the NOTICE file are for informational purposes only and
|
||||
do not modify the License. You may add Your own attribution
|
||||
notices within Derivative Works that You distribute, alongside
|
||||
or as an addendum to the NOTICE text from the Work, provided
|
||||
that such additional attribution notices cannot be construed
|
||||
as modifying the License.
|
||||
|
||||
You may add Your own copyright statement to Your modifications and
|
||||
may provide additional or different license terms and conditions
|
||||
for use, reproduction, or distribution of Your modifications, or
|
||||
for any such Derivative Works as a whole, provided Your use,
|
||||
reproduction, and distribution of the Work otherwise complies with
|
||||
the conditions stated in this License.
|
||||
|
||||
5. Submission of Contributions. Unless You explicitly state otherwise,
|
||||
any Contribution intentionally submitted for inclusion in the Work
|
||||
by You to the Licensor shall be under the terms and conditions of
|
||||
this License, without any additional terms or conditions.
|
||||
Notwithstanding the above, nothing herein shall supersede or modify
|
||||
the terms of any separate license agreement you may have executed
|
||||
with Licensor regarding such Contributions.
|
||||
|
||||
6. Trademarks. This License does not grant permission to use the trade
|
||||
names, trademarks, service marks, or product names of the Licensor,
|
||||
except as required for describing the origin of the Work and
|
||||
reproducing the content of the NOTICE file.
|
||||
|
||||
7. Disclaimer of Warranty. Unless required by applicable law or
|
||||
agreed to in writing, Licensor provides the Work (and each
|
||||
Contributor provides its Contributions) on an "AS IS" BASIS,
|
||||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
|
||||
implied, including, without limitation, any warranties or conditions
|
||||
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
|
||||
PARTICULAR PURPOSE. You are solely responsible for determining the
|
||||
appropriateness of using or redistributing the Work and assume any
|
||||
risks associated with Your exercise of permissions under this License.
|
||||
|
||||
8. Limitation of Liability. In no event and under no legal theory,
|
||||
whether in tort (including negligence), contract, or otherwise,
|
||||
unless required by applicable law (such as deliberate and grossly
|
||||
negligent acts) or agreed to in writing, shall any Contributor be
|
||||
liable to You for damages, including any direct, indirect, special,
|
||||
incidental, or consequential damages of any character arising as a
|
||||
result of this License or out of the use or inability to use the
|
||||
Work (including but not limited to damages for loss of goodwill,
|
||||
work stoppage, computer failure or malfunction, or any and all
|
||||
other commercial damages or losses), even if such Contributor
|
||||
has been advised of the possibility of such damages.
|
||||
|
||||
9. Accepting Warranty or Additional Liability. While redistributing
|
||||
the Work or Derivative Works thereof, You may choose to offer,
|
||||
and charge a fee for, acceptance of support, warranty, indemnity,
|
||||
or other liability obligations and/or rights consistent with this
|
||||
License. However, in accepting such obligations, You may act only
|
||||
on Your own behalf and on Your sole responsibility, not on behalf
|
||||
of any other Contributor, and only if You agree to indemnify,
|
||||
defend, and hold each Contributor harmless for any liability
|
||||
incurred by, or claims asserted against, such Contributor by reason
|
||||
of your accepting any such warranty or additional liability.
|
||||
|
||||
END OF TERMS AND CONDITIONS
|
||||
|
||||
APPENDIX: How to apply the Apache License to your work.
|
||||
|
||||
To apply the Apache License to your work, attach the following
|
||||
boilerplate notice, with the fields enclosed by brackets "[]"
|
||||
replaced with your own identifying information. (Don't include
|
||||
the brackets!) The text should be enclosed in the appropriate
|
||||
comment syntax for the file format. We also recommend that a
|
||||
file or class name and description of purpose be included on the
|
||||
same "printed page" as the copyright notice for easier
|
||||
identification within third-party archives.
|
||||
|
||||
Copyright 2026 Pim van Pelt <pim@ipng.ch>
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License");
|
||||
you may not use this file except in compliance with the License.
|
||||
You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software
|
||||
distributed under the License is distributed on an "AS IS" BASIS,
|
||||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
See the License for the specific language governing permissions and
|
||||
limitations under the License.
|
||||
613
docs/design.md
Normal file
@@ -0,0 +1,613 @@
|
||||
# nginx-vpp-maglev-plugin Design Document
|
||||
|
||||
## Metadata
|
||||
|
||||
| | |
|
||||
| --- | --- |
|
||||
| **Status** | Draft — describes intended behavior for `v0.1.0` |
|
||||
| **Author** | Pim van Pelt `<pim@ipng.ch>` |
|
||||
| **Last updated** | 2026-04-16 |
|
||||
| **Audience** | Operators and contributors building the nginx-side observability half of `vpp-maglev` |
|
||||
|
||||
The key words **MUST**, **MUST NOT**, **SHOULD**, **SHOULD NOT**, and **MAY** are used as described in
|
||||
[RFC 2119](https://datatracker.ietf.org/doc/html/rfc2119), and are reserved in this document for requirements that are intended to be
|
||||
enforced in code or by an external dependency. Plain-language descriptions of what the system or an operator can do are written in
|
||||
lowercase — "can", "will", "does" — and should not be read as normative.
|
||||
|
||||
## Summary
|
||||
|
||||
`nginx-vpp-maglev-plugin` is a dynamic nginx module and its surrounding Debian packaging. Loaded into stock upstream nginx, the module records
|
||||
per-VIP traffic counters — requests, status codes, bytes, latency — and attributes them to the specific `vpp-maglev` instance whose GRE
|
||||
tunnel delivered each connection. A small HTTP scrape endpoint exposes the counters as both Prometheus text and JSON so that
|
||||
`maglevd-frontend`, Prometheus, and ad-hoc `curl` sessions can all read the same data. The module is the nginx-side answer to the open
|
||||
question in [`vpp-maglev/docs/design.md`](../../vpp-maglev/docs/design.md) about per-backend traffic counters: VPP's `lb` plugin bypasses
|
||||
the FIB and cannot produce them, so the backends report what they see.
|
||||
|
||||
## Background
|
||||
|
||||
`vpp-maglev` programs VPP's `lb` plugin so that traffic hashed to a VIP lands on a pool of healthy Application Servers (ASes). For the
|
||||
deployment this module targets, every AS is an nginx instance receiving GRE-encapsulated traffic from one or more `maglevd` daemons,
|
||||
decapsulating it, and terminating or proxying HTTP and HTTPS as it would for any other inbound client.
|
||||
|
||||
The design document for `vpp-maglev` identifies **per-AS traffic counters** as an explicit open question: VPP's `lb` fast path bypasses
|
||||
the FIB, so VPP exposes per-VIP counters in the stats segment but not per-backend ones. An operator looking at the `maglevd-frontend`
|
||||
status page for a frontend with four backends can see the frontend's aggregate packet rate but not which backend is carrying how much of
|
||||
it, which errors are concentrated on which backend, or whether one backend's p95 latency is drifting.
|
||||
|
||||
This project closes that gap from the opposite end. The nginx instances that serve the traffic already observe everything an operator
|
||||
wants to see — they are the authoritative source for request rate, response code mix, bytes moved, and latency distributions. A small
|
||||
in-process module emits those numbers on an HTTP endpoint, and `maglevd-frontend` fans out to the backends of each frontend and aggregates
|
||||
the result into the existing status page.
|
||||
|
||||
## Goals and Non-Goals
|
||||
|
||||
### Product Goals
|
||||
|
||||
1. **Per-VIP, per-maglev traffic visibility.** For each VIP, the module records request count, status-code distribution, bytes in and out,
|
||||
and request-duration histograms, split by which `maglevd` instance delivered the traffic.
|
||||
2. **Negligible hot-path cost.** At steady state, a request traversing an nginx worker with the module loaded pays at most a handful of
|
||||
non-atomic integer increments and a histogram bucket update. No locks, no allocations, no system calls.
|
||||
3. **Two readers, one endpoint.** A single HTTP location serves both Prometheus text and JSON, so a site running Prometheus and a site
|
||||
using only the `maglevd-frontend` UI can both consume the module without extra configuration.
|
||||
4. **Packaging as a dynamic module.** The module builds with nginx's `--with-compat` ABI and ships as a Debian package that loads into
|
||||
stock upstream nginx without recompiling nginx itself.
|
||||
5. **Composable with normal nginx use.** A host running the module as a maglev backend **and** serving unrelated direct web traffic on the
|
||||
same ports MUST remain a correct nginx deployment. The module MUST NOT change the semantics of any existing directive; it only adds new
|
||||
parameters and directives that are no-ops when unused.
|
||||
6. **Graceful reload.** An `nginx -s reload` MUST NOT reset counters, lose history, or drop in-flight connections from the module's point
|
||||
of view.
|
||||
|
||||
### Non-Goals
|
||||
|
||||
- The module is **not** a generic nginx metrics exporter. It does not aim to replace `nginx-module-vts`, `ngx_http_stub_status`, or
|
||||
`nginx-lua-prometheus`. Its metric set is deliberately narrow and shaped by the `maglevd-frontend` status page.
|
||||
- The module does **not** terminate TLS, rewrite headers, or alter the request in any way. It is observation-only.
|
||||
- The module does **not** talk to `maglevd` directly. It does not initiate gRPC, it does not read maglev configuration, and it does not
|
||||
know which maglev instance owns which VIP. The attribution tag it emits is a string supplied by the operator in the `listen` directive;
|
||||
nothing more.
|
||||
- The module does **not** provide per-client-IP, per-path, or per-User-Agent counters. Those dimensions explode cardinality and belong in
|
||||
access logs and existing log-analysis tools.
|
||||
- The module does **not** provide persistent storage. Counters live in shared memory for the lifetime of the nginx master process; on
|
||||
restart they start at zero. Consumers who need historical retention SHOULD scrape the endpoint with Prometheus and rely on its storage.
|
||||
- The module does **not** own the GRE tunnels, the VIP addresses, or the `SO_BINDTODEVICE` privilege. Tunnel creation, VIP binding, and
|
||||
nginx master privileges are the operator's responsibility.
|
||||
|
||||
## Requirements
|
||||
|
||||
Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that later sections can cite it.
|
||||
|
||||
### Functional Requirements
|
||||
|
||||
**FR-1 Attribution**
|
||||
|
||||
- **FR-1.1** The module MUST support a new parameter on the nginx `listen` directive, `device=<ifname>`, which causes the resulting
|
||||
listening socket to be created with `SO_BINDTODEVICE` set to the named interface. A listen directive without `device=` MUST create a
|
||||
plain listening socket as stock nginx does.
|
||||
- **FR-1.2** The module MUST support a new parameter on the nginx `listen` directive, `source=<tag>`, which attaches a short string tag to
|
||||
the listening socket. The tag is the dimension the scrape endpoint exports for every counter that came in on that listener.
|
||||
- **FR-1.3** A listening socket with neither `device=` nor `source=` MUST be tagged with the configured default source string (see
|
||||
`ipng_stats_default_source`, FR-5.3). The built-in default for that directive is the literal string `direct`.
|
||||
- **FR-1.4** A listening socket with `device=X` but no `source=` MUST be tagged with the interface name `X`.
|
||||
- **FR-1.5** Two `listen` directives that share `address:port` but differ in `device=` MUST coexist, and the kernel's TCP socket lookup
|
||||
rules MUST be relied on to dispatch each SYN to the most specific match. The module MUST NOT attempt to duplicate this logic in
|
||||
userspace.
|
||||
- **FR-1.6** A `listen` directive that uses a wildcard address (`80`, `[::]:80`) together with `device=<ifname>` MUST accept only
|
||||
connections whose ingress interface is `<ifname>`, for any local address served through that interface. This is the intended deployment
|
||||
shape: wildcard fallback plus per-tunnel device-bound listeners.
|
||||
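The tagging rules in FR-1.2 through FR-1.4 reduce to a three-way precedence. The following is a minimal sketch of that precedence, not the module's code: the function name `resolve_source_tag` and its arguments are hypothetical, and it assumes that an explicit `source=` wins when both parameters are present on the same `listen` line.

```python
from typing import Optional

DEFAULT_SOURCE = "direct"  # default value of ipng_stats_default_source (FR-5.3)


def resolve_source_tag(device: Optional[str] = None,
                       source: Optional[str] = None,
                       default_source: str = DEFAULT_SOURCE) -> str:
    """Return the attribution tag for one listening socket (hypothetical helper)."""
    if source:
        # FR-1.2: an explicit source=<tag> is used as-is (assumed to win over device=).
        return source
    if device:
        # FR-1.4: device=X without source= is tagged with the interface name X.
        return device
    # FR-1.3: neither parameter present, fall back to the configured default.
    return default_source
```

For example, `resolve_source_tag(device="gre-mg1")` yields `gre-mg1`, while a plain `listen 80;` resolves to `direct` unless the operator overrides the default.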
|
||||
**FR-2 Counters**
|
||||
|
||||
- **FR-2.1** The module MUST maintain, for every observed `(source, vip, status_code)` tuple, the following counters: total requests,
|
||||
total bytes received (sum of request bytes including request line, headers, and body), total bytes sent (sum of response bytes
|
||||
including status line, headers, and body), and a fixed-bucket histogram of request duration in milliseconds.
|
||||
- **FR-2.2** When an upstream is used to serve the request, the module MUST additionally maintain a fixed-bucket histogram of upstream
|
||||
response time in milliseconds, keyed by the same `(source, vip)` pair.
|
||||
- **FR-2.3** The histogram bucket boundaries MUST be fixed at module initialization and MUST be the same for every `(source, vip)` key.
|
||||
The default boundaries are `{1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000}` milliseconds plus an implicit `+Inf` bucket.
|
||||
Operators MAY override the boundaries via the `ipng_stats_buckets` directive at the `http` level.
|
||||
- **FR-2.4** The module MUST additionally maintain, per `(source, vip)` pair, exponentially-weighted moving averages for instantaneous
|
||||
request rate with decay windows of 1 second, 10 seconds, and 60 seconds. EWMAs are updated from the periodic flush tick (see FR-4.2),
|
||||
not from the request path.
|
||||
- **FR-2.5** The `vip` dimension of every counter MUST be the connection's `$server_addr` in its canonical textual form (dotted-quad for
|
||||
IPv4, RFC 5952 lowercase-compressed form for IPv6). IPv6 zone identifiers (scope-ids), if any, MUST be stripped during canonicalization;
|
||||
link-local VIPs (which are not expected in practice) are attributed under their scope-less textual form. Port is not part of the key;
|
||||
a VIP that listens on both 80 and 443 MUST be aggregated.
|
||||
- **FR-2.6** The `status_code` dimension MUST be the full three-digit HTTP status code as recorded by nginx at log phase. The module MUST
|
||||
NOT bucket codes into classes (2xx/3xx/4xx/5xx); bucketing is the consumer's job.
|
||||
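FR-2.4 names the decay windows but not the smoothing formula. One common choice, shown here purely as an assumption, derives the weight from the tick length and the window as `alpha = 1 - exp(-dt / window)`, so the average stays well behaved if the flush interval changes. The class name `RateEwma` is illustrative.

```python
import math


class RateEwma:
    """Request-rate EWMA for one (source, vip) pair and one decay window."""

    def __init__(self, window_s: float) -> None:
        self.window_s = window_s
        self.rate = 0.0  # smoothed requests per second

    def tick(self, new_requests: int, dt_s: float) -> float:
        """Fold the requests counted during the last dt_s seconds into the average.

        Intended to be called from the periodic flush tick, never per request.
        """
        instantaneous = new_requests / dt_s
        # Assumed weighting: longer ticks or shorter windows move the average faster.
        alpha = 1.0 - math.exp(-dt_s / self.window_s)
        self.rate += alpha * (instantaneous - self.rate)
        return self.rate
```

A `(source, vip)` pair would hold three such objects, with windows of 1, 10, and 60 seconds, all fed the same per-tick request delta.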
|
||||
**FR-3 Scrape endpoint**
|
||||
|
||||
- **FR-3.1** The module MUST provide a new nginx handler directive, `ipng_stats;`, that, when placed in a `location` block, causes that
|
||||
location to serve the module's counters. The directive MUST NOT be combined with other content handlers in the same location.
|
||||
- **FR-3.2** The `ipng_stats` handler MUST support content negotiation via the `Accept` request header:
|
||||
- `Accept: application/json` → JSON output.
|
||||
- `Accept: text/plain` (or anything else, including absent) → Prometheus text exposition format.
|
||||
- **FR-3.3** The handler MUST support a `source=<tag>` query parameter that filters the output to only counters whose source dimension
|
||||
equals the supplied tag. The comparison is exact-match and case-sensitive.
|
||||
- **FR-3.4** The handler MUST support a `vip=<address>` query parameter that filters the output to only counters whose VIP dimension
|
||||
equals the supplied address. The comparison uses the canonicalized form of FR-2.5.
|
||||
- **FR-3.5** Both filters MAY be supplied together; their effect is the intersection.
|
||||
- **FR-3.6** The JSON schema MUST be documented in `docs/scrape-api.md` and MUST version via a top-level `schema` field so that breaking
|
||||
changes can be signaled by a version bump instead of silently breaking existing consumers.
|
||||
- **FR-3.7** The Prometheus text output MUST use stable metric names prefixed with `nginx_ipng_` and MUST label every series with `source`
|
||||
and `vip`. Counter metrics additionally carry a `code` label.
|
||||
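The `source=` and `vip=` filters of FR-3.3 through FR-3.5 can be sketched as below. This is an illustration under stated assumptions rather than the handler itself: `canonical_vip` and `filter_keys` are hypothetical names, keys are modeled as plain `(source, vip, code)` tuples, and Python's `ipaddress` module stands in for the FR-2.5 canonicalization (zone identifier stripped, IPv6 in lowercase compressed form).

```python
import ipaddress
from typing import Iterable, List, Optional, Tuple

Key = Tuple[str, str, str]  # (source, vip, code)


def canonical_vip(text: str) -> str:
    """Canonical textual address: zone id stripped, IPv6 lowercase and compressed."""
    address_part = text.split("%", 1)[0]
    return str(ipaddress.ip_address(address_part))


def filter_keys(keys: Iterable[Key],
                source: Optional[str] = None,
                vip: Optional[str] = None) -> List[Key]:
    """Keep keys matching every supplied filter (intersection, FR-3.5)."""
    wanted_vip = canonical_vip(vip) if vip is not None else None
    return [
        key for key in keys
        # source is compared exactly and case-sensitively (FR-3.3).
        if (source is None or key[0] == source)
        # vip is compared in canonical form (FR-3.4).
        and (wanted_vip is None or key[1] == wanted_vip)
    ]
```

Canonicalizing the query value before comparing means a request for `vip=2001:DB8::1` matches counters stored under `2001:db8::1`.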
|
||||
**FR-4 Hot path and flush**
|
||||
|
||||
- **FR-4.1** Per-request counter updates MUST occur in the nginx log phase and MUST be localized to the current worker's private counter
|
||||
table. The module MUST NOT take any locks on the request path and MUST NOT issue any atomic operation on the request path.
|
||||
- **FR-4.2** Each worker MUST run a periodic timer, default one second, that flushes the worker's private counter deltas into the
|
||||
shared-memory zone using atomic adds. The flush interval is configurable via the `ipng_stats_flush_interval` directive.
|
||||
- **FR-4.3** The scrape handler MUST read only from the shared-memory zone. Workers MUST NOT read from each other's private tables.
|
||||
- **FR-4.4** Histogram updates MUST be branch-light: the module MUST precompute a small lookup that maps elapsed milliseconds to a bucket
|
||||
index using binary search over the fixed boundary array, and MUST NOT scan the array linearly.
|
||||
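The bucket lookup required by FR-4.4 can be illustrated with a binary search over the default boundaries from FR-2.3. The sketch assumes each boundary is an inclusive upper bound, as with Prometheus `le` labels; the text above does not state this explicitly. The function names are illustrative.

```python
from bisect import bisect_left
from typing import List, Sequence

# Default boundaries from FR-2.3, in milliseconds; the +Inf bucket is implicit.
DEFAULT_BOUNDS_MS = (1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000)


def bucket_index(elapsed_ms: int, bounds: Sequence[int] = DEFAULT_BOUNDS_MS) -> int:
    """Index of the first bucket whose upper bound is >= elapsed_ms.

    Returns len(bounds) when the value exceeds every boundary (the +Inf bucket).
    """
    return bisect_left(bounds, elapsed_ms)


def observe(counts: List[int], elapsed_ms: int,
            bounds: Sequence[int] = DEFAULT_BOUNDS_MS) -> None:
    """Increment the single bucket that elapsed_ms falls into."""
    counts[bucket_index(elapsed_ms, bounds)] += 1
```

With twelve boundaries the search takes at most four comparisons. The counts kept here are per bucket; a Prometheus exposition would accumulate them into cumulative `le` values at scrape time.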
|
||||
**FR-5 Directives**
|
||||
|
||||
- **FR-5.1** `ipng_stats_zone name:size` at the `http` level declares the shared-memory zone the module uses. `name` is the zone name (no
|
||||
default); `size` is a size with suffix (`k`, `m`). The directive is mandatory if the module is loaded.
|
||||
- **FR-5.2** `ipng_stats_flush_interval <duration>` at the `http` level sets the worker flush cadence. Default `1s`. Minimum `100ms`.
|
||||
- **FR-5.3** `ipng_stats_default_source <tag>` at the `http` level sets the tag applied to listening sockets that have neither `device=`
|
||||
nor `source=`. Default `direct`.
|
||||
- **FR-5.4** `ipng_stats_buckets <ms ms ms ...>` at the `http` level overrides the default histogram bucket boundaries. Values MUST be
|
||||
strictly increasing positive integers.
|
||||
- **FR-5.5** `ipng_stats on|off` at the `http`, `server`, or `location` level opts a context into or out of counting. Default `on` at the
|
||||
`http` level when the module is loaded. A location serving the `ipng_stats` handler MUST NOT have itself counted (the module
|
||||
automatically sets `off` for the scrape location).
|
||||
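The validation rule in FR-5.4 is small enough to state as code. The sketch below is a hypothetical parser for the arguments of `ipng_stats_buckets`; the function names are invented for illustration, and rejecting an empty list is an extra assumption not spelled out above.

```python
from typing import List, Sequence


def parse_buckets(args: Sequence[str]) -> List[int]:
    """Validate bucket boundaries: strictly increasing positive integers (FR-5.4)."""
    if not args:
        # Assumption: a directive with no boundaries is treated as an error.
        raise ValueError("at least one bucket boundary is required")
    bounds: List[int] = []
    for arg in args:
        if not (arg.isascii() and arg.isdigit()):
            raise ValueError(f"not a non-negative integer: {arg!r}")
        value = int(arg)
        if value <= 0:
            raise ValueError(f"boundary must be positive: {arg!r}")
        if bounds and value <= bounds[-1]:
            raise ValueError(f"boundaries must be strictly increasing at {arg!r}")
        bounds.append(value)
    return bounds


def buckets_are_valid(args: Sequence[str]) -> bool:
    """Convenience wrapper that reports validity instead of raising."""
    try:
        parse_buckets(args)
    except ValueError:
        return False
    return True
```

Rejecting bad boundaries at configuration time keeps the request path free of checks, since the binary search in FR-4.4 relies on a sorted array.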
|
||||
**FR-6 Packaging**
|
||||
|
||||
- **FR-6.1** The module MUST build as a dynamic module using nginx's `--with-compat --add-dynamic-module=...` flow, against the nginx-dev
|
||||
headers of the target Debian release, so that the resulting `.so` loads into stock upstream nginx on that release without rebuilding
|
||||
nginx itself.
|
||||
- **FR-6.2** The module MUST ship as a Debian package named `libnginx-mod-http-ipng-stats`, following the `libnginx-mod-http-*` naming
|
||||
convention used by existing third-party nginx modules packaged for Debian.
|
||||
- **FR-6.3** The package MUST install:
|
||||
- `/usr/lib/nginx/modules/ngx_http_ipng_stats_module.so`
|
||||
- `/etc/nginx/modules-available/50-mod-http-ipng-stats.conf` containing the `load_module` directive.
|
||||
- A symlink `/etc/nginx/modules-enabled/50-mod-http-ipng-stats.conf → ../modules-available/50-mod-http-ipng-stats.conf` created in the
|
||||
package's postinst.
|
||||
- **FR-6.4** The package postinst MUST run `nginx -t` after installing the module. If the test fails, postinst MUST remove the
|
||||
`modules-enabled` symlink and report a non-fatal warning so that a broken upgrade does not leave the operator's nginx unable to start.
|
||||
|
||||
### Non-Functional Requirements
|
||||
|
||||
**NFR-1 Correctness under concurrency**
|
||||
|
||||
- **NFR-1.1** Per-worker counter tables MUST be owned exclusively by their worker and MUST NOT be read or written by any other worker,
|
||||
any handler, or any timer other than the worker's own flush timer.
|
||||
- **NFR-1.2** Flushes from workers into the shared zone MUST use relaxed atomic `fetch_add` on 64-bit lanes. The module MUST NOT rely on
|
||||
`memset`, `memcpy`, or any unaligned access for shared-zone updates.
|
||||
- **NFR-1.3** A scrape that races with a flush MUST observe a monotonically non-decreasing counter value; temporary readings that see
|
||||
partial flushes across different keys are acceptable, but a single counter MUST never appear to decrease.
|
||||
- **NFR-1.4** Histogram bucket counts and sum/count fields MUST be updated in a way that a concurrent scrape never observes
|
||||
`count > sum-of-buckets`. This is achieved by updating bucket counts before the sum/count and by a scraper that reads sum/count before
|
||||
bucket counts.
|
||||
|
||||
**NFR-2 Hot-path cost**
|
||||
|
||||
- **NFR-2.1** The per-request cost of the log-phase handler MUST be bounded by: one listening-socket pointer deref, one VIP pointer deref
|
||||
(cached on the connection struct), a constant-time status-code index computation, a constant number of integer increments, and a
|
||||
`O(log B)` histogram binary search where `B` is the number of buckets. No syscalls, no allocations, no locks.
|
||||
- **NFR-2.2** The per-flush cost per worker MUST be bounded by `O(K)` atomic adds, where `K` is the number of distinct
|
||||
`(source, vip, code)` keys touched by that worker since the last flush. Keys untouched during an interval MUST NOT be visited.
|
||||
- **NFR-2.3** The scrape cost MUST be bounded by `O(K_total)` reads from the shared zone plus `O(K_total)` string format operations,
|
||||
where `K_total` is the number of distinct keys in the zone.
|
||||
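The split between the private hot-path table (FR-4.1) and the `O(K)` flush (NFR-2.2) can be modeled as follows. This is a single-threaded sketch with hypothetical names: a Python dict stands in for the shared-memory zone, and the plain addition in `flush` stands in for the relaxed atomic `fetch_add` the module would use.

```python
from collections import defaultdict
from typing import Dict, Tuple

Key = Tuple[str, str, str]  # (source, vip, code)


class WorkerCounters:
    """Private request counters owned by one worker; no locking needed."""

    def __init__(self) -> None:
        # Holds only keys touched since the last flush.
        self.pending: Dict[Key, int] = defaultdict(int)

    def record(self, source: str, vip: str, code: str) -> None:
        """Log-phase hot path: one plain increment on worker-private state."""
        self.pending[(source, vip, code)] += 1

    def flush(self, shared: Dict[Key, int]) -> int:
        """Add pending deltas to the shared table; return the number of keys visited."""
        touched = len(self.pending)
        for key, delta in self.pending.items():
            # Stand-in for a relaxed atomic fetch_add on the shared zone.
            shared[key] = shared.get(key, 0) + delta
        # Clearing the table means untouched keys cost nothing on the next flush.
        self.pending.clear()
        return touched
```

Because `pending` is emptied on every flush, an idle worker's flush visits zero keys, which is the property NFR-2.2 asks for.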
|
||||
**NFR-3 Memory bounds**
|
||||
|
||||
- **NFR-3.1** The shared-memory zone MUST be sized by the operator at module-load time (FR-5.1) and MUST NOT grow beyond that size. When
|
||||
the zone is full, the module MUST drop new keys, increment a dedicated `nginx_ipng_zone_full_events_total` counter, and log at `warn`
|
||||
level no more than once per minute per worker.
|
||||
- **NFR-3.2** The per-worker private counter table MUST be bounded by the same total key count the shared zone admits. A worker MUST NOT
|
||||
accumulate private state that exceeds the shared-zone capacity.
|
||||
- **NFR-3.3** The set of distinct status codes observed is small (typically ≤ 60) and MUST NOT be allowed to explode due to non-standard
|
||||
responses; the module MUST clamp any observed code `< 100` or `>= 600` into a single bucket labeled `code="unknown"` rather than
|
||||
allocating a new key.
|
||||
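The clamp in NFR-3.3 and the constant-time index mentioned in NFR-2.1 fit in a few lines. The sketch assumes a flat table of 500 slots for codes 100 through 599 plus one extra slot for everything else; that layout and the names `code_label` and `code_index` are illustrative, not taken from the module.

```python
UNKNOWN_SLOT = 500  # assumed slot shared by all out-of-range codes


def code_label(status: int) -> str:
    """Value of the `code` label, clamping out-of-range codes to "unknown"."""
    if 100 <= status < 600:
        return str(status)
    return "unknown"


def code_index(status: int) -> int:
    """Constant-time slot for a status code in a 501-entry table."""
    if 100 <= status < 600:
        return status - 100
    return UNKNOWN_SLOT
```

With this layout a misbehaving upstream that returns status `999` lands in the same slot as one that returns `0`, so it cannot grow the key space.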
|
||||
**NFR-4 Reload neutrality**
|
||||
|
||||
- **NFR-4.1** `nginx -s reload` spawns a new set of workers while the old workers drain. The shared-memory zone MUST survive this
|
||||
transition; counters MUST NOT reset on reload.
|
||||
- **NFR-4.2** New workers MUST attach to the existing shared-memory zone under the same name, reconstruct their private counter tables
|
||||
lazily from observed traffic, and resume flushing.
|
||||
- **NFR-4.3** The `source` tag for any given listening socket is recomputed at reload time from the new configuration. If the operator
|
||||
renames a tag, new traffic MUST use the new tag.
|
||||
- **NFR-4.4** When a `source` tag is no longer present in any listening socket after a configuration reload, its counters MUST be
|
||||
evicted from the shared-memory zone on the first flush tick following the reload. The module MUST NOT retain historical counters under
|
||||
defunct tags indefinitely. Rename is expected to be rare and evicting the old entries immediately is acceptable.
|
||||
|
||||
**NFR-5 Packaging robustness**
|
||||
|
||||
- **NFR-5.1** The module MUST compile cleanly against the nginx-dev headers of the currently supported Debian stable and testing
|
||||
releases. CI MUST build one `.deb` per supported release and MUST fail if any target breaks.
|
||||
- **NFR-5.2** The module MUST NOT depend on any shared library beyond `libc` and nginx's own runtime. No `libnetfilter_*`, no `libcurl`,
|
||||
no `libjson*`.
|
||||
- **NFR-5.3** A version mismatch between the `.so` and the installed nginx binary MUST be detected by nginx at load time (nginx
|
||||
checks the module's version and signature when it processes `load_module`). The package postinst MUST NOT attempt to work around a mismatch; it reports the failure and leaves the
|
||||
operator to upgrade the nginx package.
|
||||
|
||||
**NFR-6 Security**
|
||||
|
||||
- **NFR-6.1** The module MUST NOT require any Linux capability beyond what stock nginx already needs. The `SO_BINDTODEVICE` call is made
|
||||
in the nginx master process which is already privileged during the bind step; workers never call `setsockopt(SO_BINDTODEVICE)`
|
||||
themselves.
|
||||
- **NFR-6.2** Access to the scrape endpoint MUST be restricted only through `allow`/`deny` ACLs placed in the operator's nginx config. The module
|
||||
MUST NOT ship its own authentication or authorization; it is a plain nginx handler and inherits the enclosing `location` block's access
|
||||
controls.
|
||||
- **NFR-6.3** The module MUST NOT log client IPs, request paths, `User-Agent`, or any other per-request personally-identifying field. It
|
||||
logs only aggregate counters and its own operational events.
|
||||
|
||||
**NFR-7 Documentation**
|
||||
|
||||
- **NFR-7.1** The repository MUST ship a `docs/user-guide.md` that walks an operator through installing the Debian package, loading the
|
||||
module, configuring a minimal end-to-end deployment (GRE tunnels, VIPs, `listen` lines, scrape endpoint), verifying that counters are
|
||||
flowing, and integrating the scrape endpoint with both `maglevd-frontend` and a standalone Prometheus scraper. The user guide is the
|
||||
document an operator reads once to get from a freshly-installed package to a working, observable deployment.
|
||||
- **NFR-7.2** The repository MUST ship a `docs/config-guide.md` that enumerates every directive and `listen` parameter introduced by the
|
||||
module, together with the nginx configuration contexts (`http`, `server`, `location`, or `listen`) in which each is legal, the allowed
|
||||
values, the default, and a one-sentence summary of behavior. The config guide is the document an operator greps when they need to know
|
||||
where a given knob is allowed to appear.
|
||||
|
||||
## Architecture Overview
|
||||
|
||||
### Process Model
|
||||
|
||||
The project ships one dynamic nginx module:
|
||||
|
||||
- **`ngx_http_ipng_stats_module.so`** — the dynamic module, loaded by nginx's master at startup via `load_module`. It runs entirely inside
|
||||
the nginx process model: code executes in nginx workers during the request lifecycle and during per-worker timers. No separate process
|
||||
is launched.
|
||||
|
||||
There is no daemon, no socket the module listens on, no control plane. Everything the module does is done inline with nginx.
|
||||
|
||||
### Data Flow
|
||||
|
||||
Requests enter nginx through one of two listener classes:
|
||||
|
||||
1. **Device-bound listeners** (`listen ... device=X source=Y`) accept only connections whose ingress interface is `X`. Each is tagged
|
||||
with a source string `Y`.
|
||||
2. **Wildcard fallback listeners** (`listen 80;`, `listen [::]:80;`) accept everything that didn't match a more specific listener. They
|
||||
are tagged with the configured default source (FR-1.3).
|
||||
|
||||
During request processing nginx behaves exactly as it would without the module: no handler runs early, no header is rewritten. At log
|
||||
phase, the module's log-phase handler increments the worker-local counter table keyed by `(source, vip, status_code)`.
|
||||
|
||||
A per-worker timer, firing at the configured flush interval (FR-5.2), walks the dirty keys in the worker-local table and applies their
|
||||
deltas to the shared-memory zone via atomic adds.
|
||||
|
||||
The scrape handler, when invoked at `GET /ipng-stats` (or whatever location the operator chose), reads the shared-memory zone directly
|
||||
and formats the output per the requested content type.
|
||||
|
||||
`maglevd-frontend` fetches the scrape endpoint of each backend in its configured fleet at roughly the same cadence it already uses for
|
||||
maglevd state. It filters server-side via `?source=<its own tag>` so that it only sees the traffic it delivered. The aggregated view is
|
||||
rendered alongside the existing maglev status page.
|
||||
|
||||
No component in this project writes to anything outside nginx's own memory. In particular, the module does not touch the file system,
|
||||
does not emit log lines on the request path, and does not speak to any upstream.
|
||||
|
||||
## Components
|
||||
|
||||
### The nginx module
|
||||
|
||||
`ngx_http_ipng_stats_module` is the entire technical surface of this project. It is a single C module conforming to nginx's
|
||||
dynamic-module ABI.
|
||||
|
||||
#### Responsibilities
|
||||
|
||||
- Parse new `listen` parameters `device=` and `source=` and attach their values to each listening socket's config (FR-1.1, FR-1.2).
|
||||
- Call `setsockopt(SO_BINDTODEVICE)` in the master process at bind time for listeners that set `device=` (FR-1.1, NFR-6.1).
|
||||
- Maintain per-worker private counter tables keyed by `(source_id, vip_id, status_code)` (FR-2.1, NFR-1.1).
|
||||
- Run a per-worker flush timer that moves deltas into the shared-memory zone atomically (FR-4.2, NFR-1.2).
|
||||
- Update EWMAs at flush time (FR-2.4).
|
||||
- Serve the scrape endpoint with content negotiation and optional filters (FR-3).
|
||||
- Honor `ipng_stats on|off` at any config level (FR-5.5).
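
As an illustration of the first responsibility, the sketch below parses one `listen` argument at a time and picks out `device=` and `source=`. It is standalone C rather than nginx configuration-parsing code; the struct, the function name, and the 64-byte tag limit are assumptions made for the example.

```c
#include <string.h>

#define IPNG_IFNAME_MAX 16   /* same size as Linux IFNAMSIZ, including the NUL */
#define IPNG_SOURCE_MAX 64   /* hypothetical limit on tag length */

typedef struct {
    char device[IPNG_IFNAME_MAX];  /* interface for SO_BINDTODEVICE, "" if unset */
    char source[IPNG_SOURCE_MAX];  /* attribution tag, "" if unset */
} ipng_listen_params_t;

/* Returns 1 if arg was consumed, 0 if it is not one of the module's
   parameters, and -1 if the value is empty or too long. */
static int
ipng_parse_listen_arg(const char *arg, ipng_listen_params_t *p)
{
    const char *val;
    char       *dst;
    size_t      cap, len;

    if (strncmp(arg, "device=", 7) == 0) {
        val = arg + 7;
        dst = p->device;
        cap = sizeof(p->device);

    } else if (strncmp(arg, "source=", 7) == 0) {
        val = arg + 7;
        dst = p->source;
        cap = sizeof(p->source);

    } else {
        return 0;   /* leave unknown parameters to nginx's own parser */
    }

    len = strlen(val);
    if (len == 0 || len >= cap) {
        return -1;
    }

    memcpy(dst, val, len + 1);   /* copy including the terminating NUL */
    return 1;
}
```

Rejecting an over-long interface name at parse time keeps the failure at `nginx -t` rather than at bind time.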
|
||||
|
||||
#### Attribution Model
|
||||
|
||||
The module's single novel idea is that per-maglev attribution is done by the Linux kernel's TCP socket lookup, not by any userspace
|
||||
inspection. Each `maglevd` instance terminates its GRE tunnel on a dedicated interface on the nginx host; the operator writes one
|
||||
`listen ... device=<ifname> source=<tag>` line per `(family, tunnel)` pair. nginx binds that listening socket to the interface with
`SO_BINDTODEVICE`, which causes the kernel to match it only against connections whose ingress interface is that tunnel. A wildcard
`listen 80;` and `listen [::]:80;` pair provides the fallback for traffic arriving on any other interface, typically normal web traffic
that did not come from maglev.
|
||||
|
||||
The kernel's TCP listener lookup prefers a more-specific (device-matching) listener over a less-specific (wildcard) one, so the fallback
|
||||
and the device-bound listeners coexist without conflicts. The module does not need to duplicate this logic and does not try to.
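
The preference described above can be pictured with a deliberately simplified model. This is not the kernel's code: the scores, the types, and the interface indexes are invented for illustration, and real lookup also considers address and port.

```c
#include <stddef.h>

typedef struct {
    int         bound_ifindex;  /* 0 means wildcard (no SO_BINDTODEVICE) */
    const char *source;         /* tag the module attached to this listener */
} listener_t;

/* Score one listener for a connection that arrived on ingress_ifindex. */
static int
listener_score(const listener_t *l, int ingress_ifindex)
{
    if (l->bound_ifindex == 0) {
        return 1;               /* wildcard: matches, but weakly */
    }
    if (l->bound_ifindex == ingress_ifindex) {
        return 2;               /* bound to the ingress device: best match */
    }
    return -1;                  /* bound to another device: never matches */
}

/* Pick the highest-scoring listener, or NULL if none matches. */
static const listener_t *
listener_pick(const listener_t *ls, size_t n, int ingress_ifindex)
{
    const listener_t *best = NULL;
    int               best_score = 0, s;
    size_t            i;

    for (i = 0; i < n; i++) {
        s = listener_score(&ls[i], ingress_ifindex);
        if (s > best_score) {
            best = &ls[i];
            best_score = s;
        }
    }
    return best;
}
```

With one wildcard listener and one listener per tunnel, traffic from a tunnel lands on that tunnel's listener and everything else falls through to the wildcard.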
|
||||
|
||||
Because the `device=` binding uses a wildcard address, the module does not need to know the set of VIPs served through each tunnel.
|
||||
Adding a VIP (binding an address to `lo` and writing a new `server` block) does not require touching the `listen` lines. Adding a
|
||||
new maglev instance (a new GRE tunnel) does. This is the correct split: VIPs are vhost-level concerns and change often; maglev instances
|
||||
are fleet-level concerns and change rarely.
|
||||
|
||||
The design assumes GRE tunnels used as `device=` sources carry **only** maglev-originated traffic. Any other traffic arriving on such an
|
||||
interface is silently misattributed to that maglev's source tag. This is a deployment invariant, not a defect.
|
||||
|
||||
#### Counter Data Model
|
||||
|
||||
Counters are stored as a flat hash table in a shared-memory zone. The key is the tuple `(source_id, vip_id, status_code)` where
|
||||
`source_id` and `vip_id` are small integers assigned at first observation and reused thereafter. The value is a fixed-size record
|
||||
containing:
|
||||
|
||||
- `requests` (u64)
|
||||
- `bytes_in` (u64)
|
||||
- `bytes_out` (u64)
|
||||
- `duration_hist` — `B+1` u64 lanes (one per bucket plus the `+Inf` bucket)
|
||||
- `duration_sum_ms` (u64)
|
||||
- `upstream_hist` — same shape, only updated when an upstream served the request
|
||||
- `upstream_sum_ms` (u64)
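
A minimal sketch of how such a record and its key could be laid out in C. The bucket count, the 16-bit ID widths, and the packing scheme are assumptions for illustration; the design only fixes the field list above.

```c
#include <stdint.h>

#define IPNG_BUCKETS 11   /* B: hypothetical number of finite buckets */

/* One lane per finite bucket plus one for +Inf. */
typedef struct {
    uint64_t lanes[IPNG_BUCKETS + 1];
} ipng_hist_t;

/* Fixed-size value record, in the same order as the field list. */
typedef struct {
    uint64_t    requests;
    uint64_t    bytes_in;
    uint64_t    bytes_out;
    ipng_hist_t duration_hist;
    uint64_t    duration_sum_ms;
    ipng_hist_t upstream_hist;     /* only updated for proxied requests */
    uint64_t    upstream_sum_ms;
} ipng_counter_t;

/* Pack (source_id, vip_id, status_code) into one 64-bit hash key. */
static uint64_t
ipng_key_pack(uint16_t source_id, uint16_t vip_id, uint16_t status_code)
{
    return ((uint64_t) source_id << 32)
           | ((uint64_t) vip_id << 16)
           | (uint64_t) status_code;
}
```

Because every field is a `u64`, the record has no padding and its size is `8 * (5 + 2 * (B + 1))` bytes, which is 232 bytes for the assumed `B = 11`.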
|
||||
|
||||
A parallel table keyed by `(source_id, vip_id)` — one row per VIP — holds the EWMAs for instantaneous rate. EWMAs are floats but updated
|
||||
only from the flush tick, so there is no float contention on the request path.
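
The flush-time EWMA update could look like the sketch below. The smoothing factor `alpha`, the priming behavior, and the names are assumptions; FR-2.4 remains the authority on the actual smoothing parameters.

```c
#include <stdint.h>

typedef struct {
    double rate;     /* smoothed requests per second */
    int    primed;   /* 0 until the first sample has been folded in */
} ipng_ewma_t;

/* Fold one flush interval's request delta into the running rate.
   interval_s is the flush interval in seconds; alpha is in (0, 1]. */
static void
ipng_ewma_update(ipng_ewma_t *e, uint64_t delta, double interval_s,
    double alpha)
{
    double sample = (double) delta / interval_s;

    if (!e->primed) {
        e->rate = sample;       /* start from the first observation */
        e->primed = 1;
        return;
    }

    e->rate = alpha * sample + (1.0 - alpha) * e->rate;
}
```

With `alpha = 0.5` and a one-second interval, deltas of 100, 0, and 200 requests give rates of 100, 50, and 125 requests per second.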
|
||||
|
||||
The module also keeps a small string interning table for source and VIP strings, keyed by the integer IDs above, so that the scrape
|
||||
endpoint can recover the original strings without re-parsing configuration.
|
||||
|
||||
String interning is capacity-bounded: the zone is sized by the operator, and once capacity is exhausted new keys are dropped with a
|
||||
counter bump and an infrequent log line (NFR-3.1). In practice, the number of distinct VIPs on a single nginx host is small (tens, maybe
|
||||
low hundreds), and the number of distinct source tags is the number of maglev instances (single digits). The dominant factor is
|
||||
`status_code`; ~60 keys per VIP is a typical steady state.
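
The drop-when-full behavior can be sketched as a small bounded interning table. The capacity, the string limit, and the linear search are simplifications chosen for the example; a real zone would be sized by `ipng_stats_zone`.

```c
#include <stdint.h>
#include <string.h>

#define IPNG_INTERN_CAP    4    /* hypothetical capacity, tiny for the example */
#define IPNG_INTERN_MAXLEN 64   /* hypothetical maximum string size */

typedef struct {
    char     strs[IPNG_INTERN_CAP][IPNG_INTERN_MAXLEN];
    uint32_t used;
    uint64_t full_events;   /* counts insertions dropped because of capacity */
} ipng_intern_t;

/* Return the ID for s, assigning one on first observation.
   Returns -1 if s is unusable or the table is full. */
static int32_t
ipng_intern(ipng_intern_t *t, const char *s)
{
    uint32_t i;
    size_t   len = strlen(s);

    if (len == 0 || len >= IPNG_INTERN_MAXLEN) {
        return -1;
    }

    for (i = 0; i < t->used; i++) {
        if (strcmp(t->strs[i], s) == 0) {
            return (int32_t) i;          /* already interned: reuse the ID */
        }
    }

    if (t->used == IPNG_INTERN_CAP) {
        t->full_events++;                /* drop the new key, keep old ones */
        return -1;
    }

    memcpy(t->strs[t->used], s, len + 1);
    return (int32_t) t->used++;
}

/* Recover the original string for an ID, or NULL if the ID is unknown. */
static const char *
ipng_intern_str(const ipng_intern_t *t, int32_t id)
{
    return (id >= 0 && (uint32_t) id < t->used) ? t->strs[id] : NULL;
}
```

Existing IDs keep resolving after the table fills; only new strings are refused, which matches the failure mode described above.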
|
||||
|
||||
#### Hot Path
|
||||
|
||||
The log-phase handler is deliberately short. Pseudocode:
|
||||
|
||||
```c
static ngx_int_t
ipng_stats_log_handler(ngx_http_request_t *r)
{
    ipng_listen_ctx_t  *lctx;
    ipng_counter_t     *counter;
    ngx_time_t         *tp;
    ngx_msec_int_t      elapsed_ms;
    ngx_uint_t          code_idx;

    if (!ipng_stats_enabled(r)) {
        return NGX_OK;
    }

    /* lctx carries the listener's source_id; the VIP id is resolved from
       the connection's local address on first use and then cached on the
       connection */
    lctx = ngx_http_ipng_stats_listen_ctx(r->connection->listening);

    /* on a wildcard listener local_sockaddr holds the wildcard address
       until nginx is asked to look up the real destination address */
    if (ngx_connection_local_sockaddr(r->connection, NULL, 0) != NGX_OK) {
        return NGX_OK;
    }

    code_idx = ipng_status_to_index(r->headers_out.status);
    counter = ipng_worker_slot(lctx, r->connection->local_sockaddr, code_idx);

    counter->requests++;
    counter->bytes_in += r->request_length;
    counter->bytes_out += r->connection->sent;

    /* r->start_msec is only the millisecond part of the start time, so
       combine it with r->start_sec, as nginx does for $request_time */
    tp = ngx_timeofday();
    elapsed_ms = (ngx_msec_int_t)
                     ((tp->sec - r->start_sec) * 1000
                      + (tp->msec - r->start_msec));
    elapsed_ms = ngx_max(elapsed_ms, 0);

    ipng_hist_add(&counter->duration_hist, elapsed_ms);
    counter->duration_sum_ms += elapsed_ms;

    /* upstream lanes are only touched when an upstream served the request */
    if (r->upstream_states && r->upstream_states->nelts > 0) {
        ngx_msec_int_t up_ms = ipng_upstream_total_ms(r);

        ipng_hist_add(&counter->upstream_hist, up_ms);
        counter->upstream_sum_ms += up_ms;
    }

    return NGX_OK;
}
```
|
||||
|
||||
Nothing here touches shared memory. `ipng_worker_slot` resolves a private table slot using a small per-worker hash keyed by
|
||||
`(source_id, vip_id, code_idx)`. VIP lookup is cached on the connection so that keep-alive requests reuse the resolved ID.
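
The two helpers named in the pseudocode, `ipng_status_to_index` and `ipng_hist_add`, could be as small as the sketch below. The bucket bounds and the index layout are assumptions made for the example, and this standalone `ipng_hist_add` takes a plain `int64_t` rather than nginx's `ngx_msec_int_t`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical upper bounds in milliseconds; the real set comes from
   ipng_stats_buckets. */
static const uint32_t ipng_bounds_ms[] = { 1, 5, 10, 50, 100, 500, 1000, 5000 };

#define IPNG_NBOUNDS (sizeof(ipng_bounds_ms) / sizeof(ipng_bounds_ms[0]))

typedef struct {
    uint64_t lanes[IPNG_NBOUNDS + 1];   /* last lane is the +Inf bucket */
} ipng_hist_t;

/* Count one observation in the first bucket whose bound is >= ms.
   Lanes are stored non-cumulatively here; a scrape would sum them. */
static void
ipng_hist_add(ipng_hist_t *h, int64_t ms)
{
    size_t i;

    if (ms < 0) {
        ms = 0;                         /* clamp clock skew to zero */
    }

    for (i = 0; i < IPNG_NBOUNDS; i++) {
        if ((uint64_t) ms <= ipng_bounds_ms[i]) {
            break;
        }
    }

    h->lanes[i]++;                      /* i == IPNG_NBOUNDS means +Inf */
}

/* Map an HTTP status to a dense index; 0 collects anything out of range. */
static unsigned
ipng_status_to_index(unsigned status)
{
    return (status >= 100 && status <= 599) ? status - 99 : 0;
}
```

Both helpers are branch-light and allocation-free, which keeps the log-phase handler within the hot-path budget the section describes.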
|
||||
|
||||
#### Flush Timer
|
||||
|
||||
At the interval configured by `ipng_stats_flush_interval` (default 1s), the worker:
|
||||
|
||||
1. Iterates its dirty-slot list (slots touched since the previous flush).
|
||||
2. For each dirty slot, computes the delta versus the last flushed snapshot stored in the same slot.
|
||||
3. Applies the delta to the shared-zone slot using 64-bit relaxed `fetch_add` on each counter lane.
|
||||
4. Updates EWMAs from the delta.
|
||||
5. Clears the dirty list (not the slot itself; slot state is preserved so the next flush can compute deltas again).
|
||||
|
||||
The worker never walks the entire table — only dirty slots — so idle VIPs cost nothing.
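
The flush steps can be sketched with C11 atomics as follows. This is a standalone illustration: the slot layout, the three-lane record, and the names are assumptions, and the EWMA update (step 4) is left out.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define IPNG_LANES 3   /* requests, bytes_in, bytes_out; histograms omitted */

typedef struct {
    uint64_t           live[IPNG_LANES];     /* bumped on the request path */
    uint64_t           flushed[IPNG_LANES];  /* snapshot from the last flush */
    _Atomic(uint64_t) *shared;               /* matching shared-zone lanes */
    int                dirty;                /* set when the slot is touched */
} ipng_worker_slot_t;

/* Flush every slot on the dirty list. Returns the number of lanes that
   actually moved a non-zero delta into the shared zone. */
static size_t
ipng_flush(ipng_worker_slot_t **dirty, size_t ndirty)
{
    size_t i, l, moved = 0;

    for (i = 0; i < ndirty; i++) {
        ipng_worker_slot_t *s = dirty[i];

        for (l = 0; l < IPNG_LANES; l++) {
            uint64_t delta = s->live[l] - s->flushed[l];

            if (delta == 0) {
                continue;
            }

            /* relaxed ordering: counters only need atomicity, not ordering */
            atomic_fetch_add_explicit(&s->shared[l], delta,
                                      memory_order_relaxed);
            s->flushed[l] = s->live[l];
            moved++;
        }

        s->dirty = 0;   /* off the dirty list; live/flushed state is kept */
    }

    return moved;
}
```

Keeping the `flushed` snapshot next to the live lanes is what lets the next tick compute a delta without reading the shared zone.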
|
||||
|
||||
#### Scrape Handler
|
||||
|
||||
The `ipng_stats` handler is a leaf content handler. It:
|
||||
|
||||
1. Parses `?source=` and `?vip=` into exact-match filters.
|
||||
2. Parses `Accept:` to pick output format.
|
||||
3. Walks the shared-memory zone under a shared lock (readers hold the read side of a rwlock; flushes and interners hold the write side
|
||||
briefly).
|
||||
4. Emits each matching key in the chosen format directly into an nginx chain buffer.
|
||||
|
||||
Output buffering and sending are standard nginx content handler code. The handler does not allocate during the walk; it uses a
|
||||
fixed-size buffer per chain link and requests new links only when full.
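
Filtering and emitting one row into a fixed-size buffer might look like the sketch below. The metric name `nginx_ipng_requests_total` is an assumed member of the `nginx_ipng_*` family, the row struct is invented for the example, and label-value escaping is not handled.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *source;
    const char *vip;
    unsigned    code;
    uint64_t    requests;
} ipng_row_t;

/* A NULL filter matches everything; otherwise require an exact match. */
static int
ipng_filter_match(const char *filter, const char *value)
{
    return filter == NULL || strcmp(filter, value) == 0;
}

/* Write one Prometheus text line for row into buf.
   Returns the number of bytes written, or 0 if the row is filtered out
   or does not fit (the caller would then request a new chain link). */
static size_t
ipng_emit_row(char *buf, size_t cap, const ipng_row_t *row,
    const char *source_filter, const char *vip_filter)
{
    int n;

    if (!ipng_filter_match(source_filter, row->source)
        || !ipng_filter_match(vip_filter, row->vip))
    {
        return 0;
    }

    n = snprintf(buf, cap,
                 "nginx_ipng_requests_total"
                 "{source=\"%s\",vip=\"%s\",code=\"%u\"} %llu\n",
                 row->source, row->vip, row->code,
                 (unsigned long long) row->requests);

    if (n < 0 || (size_t) n >= cap) {
        return 0;   /* truncated: do not emit a partial line */
    }

    return (size_t) n;
}
```

Treating a truncated write as "does not fit" means a line is never split across two chain links.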
|
||||
|
||||
#### Presents and Consumes
|
||||
|
||||
**Presents.**
|
||||
|
||||
- **One nginx content handler**, `ipng_stats`, usable in any `location` block. Serves Prometheus text and JSON, filtered by optional
|
||||
query parameters.
|
||||
- **Two new `listen` parameters**, `device=` and `source=`, usable anywhere a `listen` directive is used.
|
||||
- **Five new directives**: `ipng_stats_zone`, `ipng_stats_flush_interval`, `ipng_stats_default_source`, `ipng_stats_buckets`, and
`ipng_stats` (on/off). The first four are set at `http` level; `ipng_stats on|off` is honored at any config level (FR-5.5).
|
||||
- **A Prometheus metric family** prefixed `nginx_ipng_*`, labelled `source`, `vip`, and (for request counters) `code`.
|
||||
|
||||
**Consumes.**
|
||||
|
||||
- **An nginx shared-memory zone** declared by `ipng_stats_zone`. The zone is allocated from nginx's own shared-memory pool.
|
||||
- **The Linux `SO_BINDTODEVICE` socket option**, applied in the nginx master process during bind.
|
||||
- **The nginx log phase and connection structures** — standard module embedding, no private kernel calls.
|
||||
|
||||
### The Debian package
|
||||
|
||||
`libnginx-mod-http-ipng-stats` is the packaging wrapper. There is no ambition to build RPMs, Alpine packages, or a Homebrew formula;
|
||||
Debian is the target and upstream nginx on Debian is the platform.
|
||||
|
||||
#### Responsibilities
|
||||
|
||||
- Build the module against the target release's nginx-dev headers with `--with-compat` (NFR-5.1, NFR-5.3).
|
||||
- Install the compiled `.so` into `/usr/lib/nginx/modules` (FR-6.3).
|
||||
- Drop a `load_module` stanza into `/etc/nginx/modules-available/` and enable it by default via a symlink in `modules-enabled/`
|
||||
(FR-6.3).
|
||||
- Sanity-check the resulting config with `nginx -t` in the postinst and back out cleanly if it fails (FR-6.4).
|
||||
|
||||
#### Build
|
||||
|
||||
The build is a plain `debian/rules` invocation that:
|
||||
|
||||
1. Fetches the nginx source for the installed `nginx-dev` version.
|
||||
2. Runs `./configure --with-compat --add-dynamic-module=...` pointed at the module tree.
|
||||
3. Builds only the module (`make modules`).
|
||||
4. Installs the resulting `.so` into the package tree.
|
||||
|
||||
No nginx binary is produced, shipped, or touched. The package is strictly additive.
|
||||
|
||||
#### Presents and Consumes
|
||||
|
||||
**Presents.**
|
||||
|
||||
- **One Debian package** per supported release.
|
||||
- **One dynamic module** loadable into stock upstream nginx.
|
||||
|
||||
**Consumes.**
|
||||
|
||||
- **The target release's `nginx-dev` package** at build time.
|
||||
- **The running `nginx` package** at install time, for `nginx -t` validation.
|
||||
|
||||
## Operational Concerns
|
||||
|
||||
### Deployment Topology
|
||||
|
||||
A typical deployment on a single nginx host looks like:
|
||||
|
||||
- One GRE tunnel per maglev instance, terminated on the nginx host by the operator's networking layer (systemd-networkd, Netplan, or a
|
||||
hand-rolled interface config). Interface names follow a consistent pattern, typically `gre-<tag>` — e.g. `gre-mg1`, `gre-mg2`.
|
||||
- VIPs bound to a local dummy or loopback interface so the kernel accepts inner packets destined for them.
|
||||
- A hand-maintained `listen` include file with one device-bound listen per `(family, tunnel)` pair, reused across vhosts.
|
||||
- Fallback `listen 80;` and `listen [::]:80;` in whichever server blocks serve direct web traffic.
|
||||
- A single scrape location, e.g. `location = /ipng-stats`, served from a locked-down server block that only allows the maglev fleet and
|
||||
the local Prometheus scraper.
|
||||
|
||||
### Configuration
|
||||
|
||||
A minimal working configuration is about fifteen lines:
|
||||
|
||||
```nginx
|
||||
load_module modules/ngx_http_ipng_stats_module.so;
|
||||
|
||||
http {
|
||||
ipng_stats_zone ipng:4m;
|
||||
|
||||
server {
|
||||
listen 80;
|
||||
listen [::]:80;
|
||||
include /etc/nginx/ipng-maglev/listens.conf;
|
||||
|
||||
server_name _;
|
||||
# ... normal vhost content
|
||||
}
|
||||
|
||||
server {
|
||||
listen 127.0.0.1:9113;
|
||||
location = /ipng-stats {
|
||||
ipng_stats;
|
||||
allow 127.0.0.1;
|
||||
allow 2001:db8::/48; # maglev fleet
|
||||
deny all;
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
`listens.conf` is eight lines (two families × four maglevs) and stable across vhost changes.
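
For orientation, a `listens.conf` for that four-maglev case might look like the fragment below. The interface names follow the `gre-<tag>` pattern from the deployment topology; the tags `mg1` to `mg4` are placeholders, not values taken from a real deployment.

```nginx
# /etc/nginx/ipng-maglev/listens.conf
# One device-bound listen per (family, tunnel) pair.
listen 80      device=gre-mg1 source=mg1;
listen [::]:80 device=gre-mg1 source=mg1;
listen 80      device=gre-mg2 source=mg2;
listen [::]:80 device=gre-mg2 source=mg2;
listen 80      device=gre-mg3 source=mg3;
listen [::]:80 device=gre-mg3 source=mg3;
listen 80      device=gre-mg4 source=mg4;
listen [::]:80 device=gre-mg4 source=mg4;
```

Adding a fifth maglev instance adds two lines here and changes nothing in the vhost definitions.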
|
||||
|
||||
### Nginx Reload Semantics
|
||||
|
||||
`nginx -s reload` forks fresh workers, has old workers finish in-flight requests, and then shuts the old workers down. The module's
|
||||
shared-memory zone is declared by name, which survives the reload; new workers attach to the same zone and continue accumulating
|
||||
counters against the same keys. Counters MUST NOT reset on reload (NFR-4.1).
|
||||
|
||||
Source tags are recomputed from the new configuration on reload (NFR-4.3). Renaming a tag in configuration means new traffic appears
under the new name. The old name's counters are evicted from the zone on the first flush tick after the reload (NFR-4.4), so history
recorded under the old name is not carried over; consumers that need it SHOULD read it from Prometheus.
|
||||
|
||||
### Observability of the Module Itself
|
||||
|
||||
The module emits a handful of meta-metrics on the same scrape endpoint:
|
||||
|
||||
- `nginx_ipng_zone_bytes_used` / `nginx_ipng_zone_bytes_total` — zone high-water and capacity.
|
||||
- `nginx_ipng_zone_full_events_total` — number of key insertions that were dropped because the zone was full.
|
||||
- `nginx_ipng_flushes_total` — number of per-worker flush ticks that have run.
|
||||
- `nginx_ipng_flush_duration_seconds` — histogram of flush durations.
|
||||
- `nginx_ipng_scrape_duration_seconds` — histogram of scrape handler durations.
|
||||
|
||||
These make it possible to alert on "the module is running hot" and "the zone is full" without having to run a second scraper against
|
||||
some other endpoint.
|
||||
|
||||
### Failure Modes
|
||||
|
||||
- **Shared zone full.** New keys are dropped, a counter is incremented, a rate-limited warning is logged, and the operator is expected
|
||||
to resize the zone. Existing keys continue updating normally (NFR-3.1).
|
||||
- **Worker crash.** The crashed worker's private counter deltas since its last flush are lost. The shared zone is unaffected. Since the
|
||||
default flush interval is one second, the worst-case data loss is one second of that worker's traffic. This is acceptable for an
|
||||
observability plane.
|
||||
- **nginx master crash / package upgrade.** The shared zone is torn down with the old master. When the new master starts, the zone is
|
||||
recreated empty. Counters start from zero. Consumers that need history SHOULD read from Prometheus, which retains history across
|
||||
restarts.
|
||||
- **Device disappears.** If an operator removes a GRE tunnel without removing its `listen` line, nginx's bind will fail on the next
|
||||
reload and the reload will error cleanly. The module does not hide this; a failing `nginx -t` is the right answer.
|
||||
- **Traffic on a wildcard listener that should have been device-bound.** The traffic is counted under `direct` (or the configured
|
||||
default). This is detectable: if the operator expects zero traffic under `direct` and the dashboard shows non-zero, a maglev instance
|
||||
is probably missing from the `listen` include.
|
||||
- **Slow scrape on a large zone.** Scrape cost is linear in the number of keys (NFR-2.3). On a host with a very large VIP count, the
|
||||
operator SHOULD increase the flush interval, lower the scrape frequency, or both. The module does not cap scrape runtime.
|
||||
- **Maglev frontend is down.** The module is unaffected; its counters continue to increment and the Prometheus scrape continues to work.
|
||||
When the frontend comes back, it resumes fetching. No state is lost.
|
||||
|
||||
### Security
|
||||
|
||||
- **Capabilities.** The module needs no capabilities beyond what nginx already has. `SO_BINDTODEVICE` is called by the master during
|
||||
bind; workers never call it (NFR-6.1).
|
||||
- **Scrape access control.** The operator MUST place the scrape `location` behind an `allow`/`deny` ACL. The module does not ship auth;
|
||||
this is deliberate, and documented (NFR-6.2).
|
||||
- **No PII.** The module records only aggregate per-VIP counters. Client IP, request path, headers, and cookies are not touched.
|
||||
Access-log-style observation belongs in nginx's own access log (NFR-6.3).
|
||||
- **Zone sizing as a soft DoS mitigation.** Because new keys are dropped when the zone is full rather than allocating unbounded memory,
|
||||
a stream of bogus traffic cannot cause the module to exhaust nginx's memory. The tradeoff is that a real new VIP added after zone
|
||||
exhaustion won't be tracked until the operator resizes — explicit and visible in the meta-metrics.
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
- **OpenResty + `lua-nginx-module` + `nginx-lua-prometheus`.** Rejected. Adds a large runtime dependency just for a narrow feature. The
|
||||
deployment target is stock upstream nginx on Debian, and shipping an entirely different nginx build would defeat half the point of
|
||||
packaging.
|
||||
- **Access log tailing sidecar.** Rejected. Decoupled but introduces a second deploy unit, a log-rotation race, and a synchronization
|
||||
gap between access log truncation and counter accuracy. Also loses live EWMAs.
|
||||
- **`nginx-module-vts`.** Considered. VTS is a perfectly good general-purpose metric module, but it has no concept of "which ingress
|
||||
interface did this request come in on", which is the entire innovation here. Adapting VTS to attribute by ingress interface would be a
|
||||
bigger diff than writing a purpose-built module.
|
||||
- **Attribution via CONNMARK on a single shared GRE tunnel.** Rejected after investigation. Netfilter loses the outer GRE source during
|
||||
decapsulation; the outer and inner conntrack entries are independent and mark does not cross. Even if tagging worked, `SO_MARK` on an
|
||||
accepted socket does not reflect incoming packet or conntrack mark without a per-packet `libnetfilter_conntrack` lookup, which is too
|
||||
heavy for a log-phase handler.
|
||||
- **Attribution via multiple GRE tunnels and CONNMARK.** Rejected as strictly worse than `SO_BINDTODEVICE`: it still requires per-maglev
|
||||
tunnels, still needs nginx to read the mark (hard), and adds a netfilter dependency. `SO_BINDTODEVICE` solves the same problem with
|
||||
kernel primitives nginx already knows about.
|
||||
- **Attribution via eBPF `SO_REUSEPORT` programs.** Rejected as dramatic overkill for a problem the kernel already solves for free via
|
||||
socket-lookup specificity.
|
||||
- **Per-VIP enumeration in `listen` directives.** Rejected in favor of wildcard `listen 80 device=gre-mg1;`. The wildcard form works
|
||||
because nginx routes by `server_name` post-accept, so the `listen` only needs to express `(port, device)` and does not need the VIP
|
||||
address. This makes the generated include file size independent of the VIP count.
|
||||
- **Pushing counters from the module into `maglevd` over gRPC.** Rejected. It inverts the wait-for graph (maglevd's design doc is
|
||||
careful to keep the daemon free of callbacks from the backends), it complicates restart neutrality, and it adds a gRPC client to a C
|
||||
module. Pull-based scrape keeps maglevd out of the traffic-metrics business, matches the doc's philosophy, and lets the frontend use
|
||||
its existing per-server goroutine model.
|
||||
- **Shipping separate JSON and Prometheus handlers.** Rejected. Content negotiation on one handler is simpler to configure and serves
|
||||
both audiences from one ACL.
|
||||
|
||||
## Decisions Deferred Post-v0.1
|
||||
|
||||
- **Histogram bucket overrides per `source` or per `vip`.** v0.1 keeps FR-2.3's module-level set. If a single nginx instance ever serves
|
||||
both latency-sensitive (API) and bulk (download) traffic on the same host such that one bucket set is too compromised, making buckets
|
||||
per-`source` or per-`vip` is possible but multiplies memory and complicates Prometheus output.
|
||||
- **TLS handshake metrics.** The module reports `request_duration` from the start of the HTTP request, not from TCP accept. For
|
||||
TLS-terminating frontends a handshake-time fraction is invisible. Adding a `tls_handshake_duration` histogram is deferred until
|
||||
operators ask for it.
|
||||
- **`maglevd-frontend` fetch cadence.** Whichever cadence the frontend adopts for traffic counters (the existing ~one-second refresh,
or an SSE bridge layered on top), the module supports it. The choice is on the frontend side.
|
||||