Harden scrape rendering and add AddressSanitizer test suite

Move all heap allocation out of the slab-mutex critical section in
render_prom/render_json: snapshot cardinality under a brief lock,
allocate aggs/snaps/string tables outside the lock, then re-acquire
only to deep-copy strings and walk the LRU into the pre-allocated
buffers. A worker crash during output buffer allocation can no
longer leave the shared-memory zone locked, and a corrupt cardinality
count is caught by a 10k sanity cap rather than causing a runaway
ngx_pcalloc.

Add build-asan and tests/02-asan/: a full sanitizer-instrumented
nginx + module built via apt-source, and a 2-node containerlab
Robot suite that drives reload storms, concurrent scrape-during-reload,
and intern-table growth, failing if AddressSanitizer or UBSan
reports anything on stderr. The two Robot suites now check for
their required build artifacts up front so `make robot-test` no
longer rebuilds them on every invocation.
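
As a rough illustration of the failure check, the scan for sanitizer output could be sketched in shell like this. The helper name `scan_sanitizer_logs` and the throwaway demo files are hypothetical; only the pattern list is taken from the suite's `Assert No Sanitizer Findings` keyword.

```shell
#!/bin/bash
# Hypothetical helper: print every sanitizer finding found in the given
# files. Returns 0 when at least one finding exists, 1 when all are clean.
scan_sanitizer_logs() {
    local hits
    # -s silences "No such file" errors: per-pid logs only exist once a
    # sanitizer has reported something. grep's exit status is ignored
    # because a missing file can make it non-zero even when lines matched.
    hits=$(grep -sE 'AddressSanitizer|LeakSanitizer|runtime error|SUMMARY:' -- "$@" || true)
    if [ -n "$hits" ]; then
        printf '%s\n' "$hits"
        return 0
    fi
    return 1
}

# Demo on throwaway files (names and contents are made up).
demo_dir=$(mktemp -d)
printf 'notice: start worker process\n' > "$demo_dir/nginx.stderr"
printf '==42==ERROR: AddressSanitizer: heap-use-after-free\n' > "$demo_dir/asan.42"

if scan_sanitizer_logs "$demo_dir/nginx.stderr" "$demo_dir/asan.42" "$demo_dir/ubsan.1"; then
    echo "findings detected"
else
    echo "clean"
fi
rm -rf "$demo_dir"
```

Deciding on the captured text rather than on grep's exit status mirrors what the Robot keyword does with `|| true`.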

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 10:58:51 +02:00
parent cdcbb07c9a
commit fdef2a552b
8 changed files with 746 additions and 232 deletions

tests/02-asan/02-asan.robot

@@ -0,0 +1,184 @@
# SPDX-License-Identifier: Apache-2.0
*** Settings ***
Documentation       AddressSanitizer + UBSan stress suite for
...                 ngx_http_ipng_stats_module. Deploys a 2-node containerlab
...                 topology running an ASan-instrumented nginx (built by
...                 `make build-asan`), exercises the code paths most likely
...                 to surface memory errors (shared-zone init and reuse,
...                 scrape rendering under the slab mutex, log-phase
...                 interning, logtail UDP flush), and fails if any
...                 AddressSanitizer or UBSan finding appears in the nginx
...                 stderr during the run.
...
...                 This suite is deliberately not a superset of 01-module:
...                 it's a landing zone for memory-correctness cases.
...                 Functional coverage (attribution, filters, counters)
...                 lives in 01-module.
Library             OperatingSystem
Library             String
Suite Setup         Deploy Lab
Suite Teardown      Cleanup Lab
Test Teardown       Assert No Sanitizer Findings

*** Variables ***
${lab-name}                 ipng-stats-asan
${lab-file}                 lab/ipng-stats-asan.clab.yml
${runtime}                  docker
${CLAB_BIN}                 sudo containerlab
${SERVER}                   clab-${lab-name}-server
${CLIENT}                   clab-${lab-name}-client
${SCRAPE_URL}               http://172.20.41.2:9113/stats
${DATAPLANE_URL}            http://10.0.1.1:8080
${STRESS_RELOADS}           10
${STRESS_REQ_PER_LOOP}      25

*** Test Cases ***
ASan nginx starts and serves a scrape
    [Documentation]    The ASan-instrumented nginx boots with the module
    ...    loaded, and a bare scrape returns the expected
    ...    preamble. Touches init_zone, postconfig, and the
    ...    scrape renderer with an empty LRU.
    ${output} =    Scrape Prometheus
    Should Contain    ${output}    nginx-ipng-stats-plugin
    Should Contain    ${output}    nginx_ipng_requests_total

Scrape an empty JSON report
    [Documentation]    JSON renderer path with zero records; catches
    ...    off-by-one errors in the bracket emission.
    ${rc}    ${output} =    Run And Return Rc And Output
    ...    curl -sf -H 'Accept: application/json' ${SCRAPE_URL}
    Should Be Equal As Integers    ${rc}    0
    Should Contain    ${output}    "schema":2
    Should Contain    ${output}    "records":[

Reload storm without traffic
    [Documentation]    Back-to-back reloads with no traffic in between.
    ...    Exercises init_zone's zone-reuse branch and the
    ...    shctx magic check; the cardinality is zero so the
    ...    renderer's naggs_alloc == 0 path is also covered.
    FOR    ${i}    IN RANGE    ${STRESS_RELOADS}
        Docker Exec    ${SERVER}    ngxasan -s reload
        Sleep    200ms
        Scrape Prometheus
    END

Reload storm with interleaved traffic
    [Documentation]    Generate traffic, reload, scrape, repeat. This is
    ...    the scenario that surfaced the original crash: the
    ...    scrape path walks the shared-zone LRU while workers
    ...    are being cycled. Traffic here hits a single path;
    ...    intern-table growth is driven separately by the
    ...    large-cardinality test.
    FOR    ${i}    IN RANGE    ${STRESS_RELOADS}
        Generate Traffic    ${STRESS_REQ_PER_LOOP}
        Docker Exec    ${SERVER}    ngxasan -s reload
        Sleep    200ms
        Scrape Prometheus
    END

Concurrent scrape during reload
    [Documentation]    Scrape in a tight loop while issuing reloads from
    ...    a parallel shell. The renderer's snapshot step
    ...    deep-copies strings under the slab mutex; a
    ...    concurrent intern_shared grow during that window
    ...    would surface here as use-after-free. We run the
    ...    whole dance in one bash -c so Robot doesn't have
    ...    to babysit the background pid.
    Generate Traffic    ${STRESS_REQ_PER_LOOP}
    ${rc}    ${output} =    Run And Return Rc And Output
    ...    bash -c '( for i in $(seq 1 200); do curl -sf ${SCRAPE_URL} > /dev/null || true; done ) & scraper=$!; for i in 1 2 3 4 5; do docker exec ${SERVER} ngxasan -s reload; sleep 0.3; done; wait $scraper'
    Should Be Equal As Integers    ${rc}    0

Large cardinality intern table growth
    [Documentation]    Drive enough distinct request paths that the
    ...    per-VIP vip/source interning array grows past its
    ...    initial slab_alloc. This exercises the realloc
    ...    path (ngx_slab_free_locked of the old entries
    ...    buffer, copy into the new one) inside the log
    ...    handler.
    FOR    ${i}    IN RANGE    60
        Docker Exec Ignore Rc    ${CLIENT}    curl -s ${DATAPLANE_URL}/path${i}
    END
    Sleep    500ms
    Scrape Prometheus

*** Keywords ***
# --- Lab lifecycle ---
Deploy Lab
    Require ASan Build
    Run    ${CLAB_BIN} --runtime ${runtime} destroy -t ${CURDIR}/${lab-file} --cleanup 2>&1 || true
    ${rc}    ${output} =    Run And Return Rc And Output
    ...    ${CLAB_BIN} --runtime ${runtime} deploy -t ${CURDIR}/${lab-file}
    Log    ${output}
    Should Be Equal As Integers    ${rc}    0
    Wait Until Keyword Succeeds    90s    3s    Server Is Ready
    Wait Until Keyword Succeeds    60s    3s    Client Can Reach Server

Require ASan Build
    [Documentation]    Fail fast with an actionable message if the user
    ...    forgot to run `make build-asan` before invoking
    ...    this suite.
    ${rc} =    Run And Return Rc    test -x ${EXECDIR}/build/nginx-asan/sbin/nginx
    Run Keyword If    ${rc} != 0
    ...    Fail    ASan nginx not found: run `make build-asan` first.

Server Is Ready
    ${rc}    ${output} =    Run And Return Rc And Output    curl -sf ${SCRAPE_URL}
    Should Be Equal As Integers    ${rc}    0

Client Can Reach Server
    ${rc}    ${output} =    Run And Return Rc And Output
    ...    docker exec ${CLIENT} curl -sf ${DATAPLANE_URL}/
    Should Be Equal As Integers    ${rc}    0

Cleanup Lab
    Run    docker logs ${SERVER} > ${EXECDIR}/tests/out/asan-server-docker.log 2>&1
    Run    docker exec ${SERVER} cat /tmp/nginx.err > ${EXECDIR}/tests/out/asan-nginx-err.log 2>&1
    Run    docker exec ${SERVER} cat /tmp/nginx.stderr > ${EXECDIR}/tests/out/asan-nginx-stderr.log 2>&1
    Run    docker exec ${SERVER} bash -c 'cat /tmp/asan.* 2>/dev/null; cat /tmp/ubsan.* 2>/dev/null' > ${EXECDIR}/tests/out/asan-reports.log 2>&1
    Run    ${CLAB_BIN} --runtime ${runtime} destroy -t ${CURDIR}/${lab-file} --cleanup

# --- Sanitizer assertion ---
Assert No Sanitizer Findings
    [Documentation]    Fail the current test if the ASan or UBSan
    ...    runtime wrote any findings to stderr or their
    ...    per-pid log files. Runs after every test case:
    ...    we want the failing test to be the one that
    ...    produced the finding, not a later one.
    ${rc}    ${hits} =    Run And Return Rc And Output
    ...    docker exec ${SERVER} bash -c 'grep -E "AddressSanitizer|LeakSanitizer|runtime error|SUMMARY:" /tmp/nginx.stderr /tmp/asan.* /tmp/ubsan.* 2>/dev/null || true'
    # Should Be Empty avoids pasting multi-line, quote-bearing sanitizer
    # output into an evaluated expression, which would not parse.
    Should Be Empty    ${hits}    Sanitizer findings detected:\n${hits}

# --- Traffic generation ---
Generate Traffic
    [Arguments]    ${count}
    FOR    ${i}    IN RANGE    ${count}
        Docker Exec Ignore Rc    ${CLIENT}    curl -s ${DATAPLANE_URL}/
    END

# --- Scraping ---
Scrape Prometheus
    ${rc}    ${output} =    Run And Return Rc And Output    curl -sf ${SCRAPE_URL}
    Should Be Equal As Integers    ${rc}    0
    RETURN    ${output}

# --- Container helpers ---
Docker Exec
    [Arguments]    ${container}    ${cmd}
    ${rc}    ${output} =    Run And Return Rc And Output
    ...    docker exec ${container} ${cmd}
    Should Be Equal As Integers    ${rc}    0
    RETURN    ${output}

Docker Exec Ignore Rc
    [Arguments]    ${container}    ${cmd}
    ${rc}    ${output} =    Run And Return Rc And Output
    ...    docker exec ${container} ${cmd}
    RETURN    ${output}


@@ -0,0 +1,23 @@
#!/bin/bash
# SPDX-License-Identifier: Apache-2.0
# Client container entrypoint for the ASan test suite. Identical in
# spirit to tests/01-module/lab/client/start.sh — kept as a separate
# file so this suite's lab can be torn down and redeployed without
# affecting 01-module state.
set -e
apt-get update -qq
apt-get install -y -qq curl iproute2 > /dev/null 2>&1
echo "Waiting for eth1 ..."
while ! ip link show eth1 > /dev/null 2>&1; do
    sleep 0.2
done
ip link set eth1 up
# MY_IP comes from the containerlab node env; fail loudly if it is missing.
ip addr add "${MY_IP:?MY_IP must be set}" dev eth1
# Drop the mgmt default route so data-plane traffic goes out eth1.
ip route del default 2>/dev/null || true
exec sleep infinity
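
The `eth1` wait loop above polls without an upper bound. If a bounded wait is preferred, a retry helper along these lines could be used; `wait_for` and its arguments are hypothetical and not part of this commit.

```shell
#!/bin/bash
# wait_for TRIES DELAY CMD...: run CMD until it succeeds or TRIES attempts
# have been made, sleeping DELAY seconds after each failed attempt.
# Returns 0 on success, 1 if CMD never succeeded.
wait_for() {
    local tries=$1 delay=$2 i=0
    shift 2
    while [ "$i" -lt "$tries" ]; do
        if "$@" > /dev/null 2>&1; then
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Usage sketch (interface name taken from the script above):
#   wait_for 150 0.2 ip link show eth1 || { echo "eth1 never appeared" >&2; exit 1; }
```

With `set -e` active, the `|| { ...; exit 1; }` form keeps the failure message explicit instead of letting the script die silently.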


@@ -0,0 +1,46 @@
# SPDX-License-Identifier: Apache-2.0
# Containerlab topology for the AddressSanitizer/UBSan test suite.
#
# The server container bind-mounts build/nginx-asan/ — the
# sanitizer-instrumented nginx built by `make build-asan`. The binary
# was compiled against host glibc, so the container image must match
# the host's Debian release (trixie/13) for the .so and libasan to be
# ABI-compatible. The binary is run directly (no .deb install): the
# `make pkg-deb` path is exercised by tests/01-module/.
#
# Topology: one server + one client with a single data-plane link.
# Unlike 01-module we don't need multi-interface attribution here —
# this suite is focused on memory correctness, not traffic tagging.
name: ipng-stats-asan
mgmt:
  network: ipng-stats-asan-net
  ipv4-subnet: 172.20.41.0/24
topology:
  nodes:
    server:
      kind: linux
      image: debian:trixie-slim
      mgmt-ipv4: 172.20.41.2
      binds:
        # RW because nginx chowns client_body_temp/ and writes to logs/
        # on master startup; it's a build artifact so we don't mind.
        - ../../../build/nginx-asan:/opt/nginx-asan
        - ./server/nginx.conf:/opt/nginx-asan/conf/nginx.conf:ro
        - ./server/start.sh:/start.sh:ro
      cmd: bash /start.sh
    client:
      kind: linux
      image: debian:trixie-slim
      mgmt-ipv4: 172.20.41.11
      binds:
        - ./client/start.sh:/start.sh:ro
      cmd: bash /start.sh
      env:
        MY_IP: 10.0.1.2/24
  links:
    - endpoints: ["server:eth1", "client:eth1"]


@@ -0,0 +1,50 @@
# SPDX-License-Identifier: Apache-2.0
# Minimal nginx config for the ASan test suite. Exercises the code paths
# most likely to surface memory errors: shared-zone init/reload, the
# scrape renderer (under slab mutex), the log-phase handler's interning,
# and logtail UDP buffering.
load_module /opt/nginx-asan/modules/ngx_http_ipng_stats_module.so;
daemon off;
master_process on;
worker_processes 2;
pid /tmp/nginx.pid;
error_log /tmp/nginx.err info;
events {
    worker_connections 128;
}
http {
    access_log off;
    ipng_stats_zone ipng:1m;
    ipng_stats_flush_interval 300ms;
    ipng_stats_default_source direct;
    log_format logtail '$remote_addr\t$request_method\t$request_uri\t$status';
    ipng_stats_logtail logtail udp://127.0.0.1:9514 buffer=4k flush=300ms;
    server {
        # Mgmt scrape endpoint.
        listen 172.20.41.2:9113;
        location = /stats {
            ipng_stats;
            allow all;
        }
    }
    server {
        # Data plane: client traffic lands here.
        listen 10.0.1.1:8080 device=eth1 ipng_source_tag=cl1;
        listen 172.20.41.2:8080;
        location / {
            return 200 "ok $server_addr\n";
        }
        location /notfound {
            return 404 "nope\n";
        }
    }
}


@@ -0,0 +1,61 @@
#!/bin/bash
# SPDX-License-Identifier: Apache-2.0
# Server container entrypoint for the ASan test suite. Installs libasan
# runtime (the sanitizer-instrumented binary was linked against host
# gcc's libasan.so.8), wires up the data-plane interface, and execs the
# ASan nginx in the foreground with stderr captured so the Robot suite
# can grep for AddressSanitizer/UBSan findings at teardown.
set -e
apt-get update -qq
apt-get install -y -qq libasan8 libubsan1 ncat iproute2 curl > /dev/null 2>&1
# Wait for containerlab to attach the data-plane veth, configure the IP.
echo "Waiting for eth1 ..."
while ! ip link show eth1 > /dev/null 2>&1; do
    sleep 0.2
done
ip link set eth1 up
ip addr add 10.0.1.1/24 dev eth1
# UDP logtail listener — drains the module's datagrams so sendto() has
# a real destination. The test doesn't assert on this file's contents
# (01-module already covers logtail semantics); we just need the socket
# to exist so ASan sees a complete write/flush cycle in the module.
mkdir -p /var/log/nginx
ncat -u -l -k 127.0.0.1 9514 --recv-only >> /var/log/nginx/logtail-udp.log &
# ASan options:
#   detect_odr_violation=0: nginx intentionally duplicates symbols like
#     ngx_module_names between the main binary and each dynamic module.
#   abort_on_error=1, halt_on_error=1: fail fast so the Robot suite
#     sees the exit status. Because log_path is set, the report itself
#     goes to /tmp/asan.<pid> rather than to /tmp/nginx.stderr.
#   detect_leaks=0: nginx exits without running its pool destructors in
#     many paths; leak detection is not the goal here.
#   log_path: the sanitizer runtime writes each report to <log_path>.<pid>,
#     so even when nginx wipes its own error log on reload the ASan traces
#     survive for post-run inspection.
ASAN_OPTS="detect_odr_violation=0:abort_on_error=1:halt_on_error=1:detect_leaks=0:log_path=/tmp/asan"
UBSAN_OPTS="print_stacktrace=1:halt_on_error=0:log_path=/tmp/ubsan"
# Wrapper so every subsequent `docker exec ... ngxasan ...` (e.g. the
# reload signal from the Robot suite) inherits the same sanitizer
# settings. `docker exec` does not carry the master's env.
cat > /usr/local/bin/ngxasan <<EOF
#!/bin/bash
export ASAN_OPTIONS="${ASAN_OPTS}"
export UBSAN_OPTIONS="${UBSAN_OPTS}"
exec /opt/nginx-asan/sbin/nginx -p /opt/nginx-asan -c /opt/nginx-asan/conf/nginx.conf "\$@"
EOF
chmod +x /usr/local/bin/ngxasan
export ASAN_OPTIONS="${ASAN_OPTS}"
export UBSAN_OPTIONS="${UBSAN_OPTS}"
# Tee stderr so both docker logs and /tmp/nginx.stderr see it. The
# Robot suite greps this file in addition to the per-pid /tmp/asan.*
# and /tmp/ubsan.* logs that log_path redirects sanitizer reports to.
exec /opt/nginx-asan/sbin/nginx -p /opt/nginx-asan -c /opt/nginx-asan/conf/nginx.conf \
2> >(tee /tmp/nginx.stderr >&2)
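
The wrapper generation above depends on heredoc quoting: with an unquoted `EOF` delimiter, `${ASAN_OPTS}` is expanded when the wrapper is written, while `\$@` stays literal until the wrapper runs. A minimal sketch of that behaviour follows. It uses a hypothetical `demo_opts` value, a temporary directory, and `printenv` as a stand-in for nginx.

```shell
#!/bin/bash
# Illustrative values only; the real script uses ASAN_OPTS and UBSAN_OPTS.
demo_opts="halt_on_error=1:log_path=/tmp/asan"
demo_dir=$(mktemp -d)

# Unquoted delimiter: ${demo_opts} is substituted now, \$@ is written
# to the file as a literal $@ and only expands when the wrapper runs.
cat > "$demo_dir/wrapper" <<EOF
#!/bin/bash
export ASAN_OPTIONS="${demo_opts}"
exec "\$@"
EOF

# Start the wrapper from an environment without ASAN_OPTIONS, similar to
# a fresh `docker exec` session, and let the child report what it sees.
# The demo invokes it via `bash` so it does not depend on exec permissions.
result=$(env -u ASAN_OPTIONS bash "$demo_dir/wrapper" printenv ASAN_OPTIONS)
echo "child saw ASAN_OPTIONS=$result"
rm -rf "$demo_dir"
```

Baking the options into the wrapper at generation time keeps later invocations independent of whatever environment the caller happens to have.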