Harden scrape rendering and add AddressSanitizer test suite

Move all heap allocation out of the slab-mutex critical section in render_prom/render_json: snapshot cardinality under a brief lock, allocate aggs/snaps/string tables outside the lock, then re-acquire only to deep-copy strings and walk the LRU into the pre-allocated buffers. A worker crash during output buffer allocation can no longer leave the shared-memory zone locked, and a corrupt cardinality count is caught by a 10k sanity cap rather than causing a runaway ngx_pcalloc.

Add build-asan and tests/02-asan/: a full sanitizer-instrumented nginx + module built via apt-source, and a 2-node containerlab Robot suite that drives reload storms, concurrent scrape-during-reload, and intern-table growth, failing if AddressSanitizer or UBSan reports anything on stderr.

The two Robot suites now check for their required build artifacts up front so `make robot-test` no longer rebuilds them on every invocation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
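
The lock-scope change described in the commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the module's code: `zone_t`, `record_t`, `zone_snapshot`, and the fixed 64-byte name slot are hypothetical, a pthread mutex stands in for the nginx slab mutex, and `calloc` stands in for `ngx_pcalloc`. Only the 10k cap and the three-step order (count under a brief lock, allocate unlocked, re-lock to copy) come from the message.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define SNAP_SANITY_CAP 10000   /* mirrors the 10k cap named in the commit message */
#define SNAP_NAME_LEN   64      /* hypothetical fixed string slot size */

typedef struct {
    char          name[SNAP_NAME_LEN];
    unsigned long requests;
} record_t;

/* Hypothetical stand-in for the shared-memory zone. */
typedef struct {
    pthread_mutex_t lock;
    size_t          count;      /* cardinality; may change between locks */
    record_t       *records;
} zone_t;

/*
 * Copy the zone into a freshly allocated array. Returns the number of
 * records copied, or -1 if the count fails the sanity cap or the
 * allocation fails. On success *out is owned by the caller (free() it).
 */
long
zone_snapshot(zone_t *z, record_t **out)
{
    size_t    cap, n, i;
    record_t *snap;

    *out = NULL;

    /* Step 1: brief lock, read the cardinality only. */
    pthread_mutex_lock(&z->lock);
    cap = z->count;
    pthread_mutex_unlock(&z->lock);

    if (cap > SNAP_SANITY_CAP) {
        return -1;              /* corrupt count: refuse to allocate */
    }

    /* Step 2: allocate with no lock held. Always request at least one
     * slot so a NULL result can only mean allocation failure. */
    snap = calloc(cap ? cap : 1, sizeof(record_t));
    if (snap == NULL) {
        return -1;
    }

    /* Step 3: re-acquire and copy; nothing in this section allocates.
     * The zone may have grown meanwhile, so clamp to what was allocated. */
    pthread_mutex_lock(&z->lock);
    n = z->count < cap ? z->count : cap;
    for (i = 0; i < n; i++) {
        snprintf(snap[i].name, sizeof(snap[i].name), "%s", z->records[i].name);
        snap[i].requests = z->records[i].requests;
    }
    pthread_mutex_unlock(&z->lock);

    *out = snap;
    return (long) n;
}
```

Because the only work done while the lock is held is a bounded copy into memory that already exists, a failure in the allocator can no longer happen with the mutex taken. The cost is that records added between step 1 and step 3 are dropped from that scrape.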
@@ -201,6 +201,7 @@ Request count accuracy
# --- Lab lifecycle ---

Deploy Lab
    Require Deb Build
    Run    ${CLAB_BIN} --runtime ${runtime} destroy -t ${CURDIR}/${lab-file} --cleanup 2>&1 || true
    ${rc}    ${output} =    Run And Return Rc And Output
    ...    ${CLAB_BIN} --runtime ${runtime} deploy -t ${CURDIR}/${lab-file}
@@ -210,6 +211,16 @@ Deploy Lab
    Wait Until Keyword Succeeds    60s    3s    Client Can Reach Server    ${CLIENT1}    10.0.1.1
    Wait Until Keyword Succeeds    60s    3s    Client Can Reach Server    ${CLIENT2}    10.0.2.1

Require Deb Build
    [Documentation]    Fail fast with an actionable message if the user
    ...    forgot to run `make pkg-deb` before invoking this
    ...    suite. The server container dpkg-installs the
    ...    built .deb via its bind-mount of build/.
    ${rc}    ${output} =    Run And Return Rc And Output
    ...    bash -c 'ls ${EXECDIR}/build/libnginx-mod-http-ipng-stats_*.deb 2>/dev/null'
    Run Keyword If    ${rc} != 0
    ...    Fail    Module .deb not found; run `make pkg-deb` first.

Server Is Ready
    ${rc}    ${output} =    Run And Return Rc And Output
    ...    curl -sf ${SCRAPE_URL}

184
tests/02-asan/02-asan.robot
Normal file
@@ -0,0 +1,184 @@
# SPDX-License-Identifier: Apache-2.0
*** Settings ***
Documentation     AddressSanitizer + UBSan stress suite for
...               ngx_http_ipng_stats_module. Deploys a 2-node containerlab
...               topology running an ASan-instrumented nginx (built by
...               `make build-asan`), exercises the code paths most likely
...               to surface memory errors (shared-zone init and reuse,
...               scrape rendering under the slab mutex, log-phase
...               interning, logtail UDP flush), and fails if any
...               AddressSanitizer or UBSan finding appears in the nginx
...               stderr during the run.
...
...               This suite is deliberately not a superset of 01-module;
...               it's a landing zone for memory-correctness cases.
...               Functional coverage (attribution, filters, counters)
...               lives in 01-module.
Library           OperatingSystem
Library           String
Suite Setup       Deploy Lab
Suite Teardown    Cleanup Lab
Test Teardown     Assert No Sanitizer Findings

*** Variables ***
${lab-name}               ipng-stats-asan
${lab-file}               lab/ipng-stats-asan.clab.yml
${runtime}                docker
${CLAB_BIN}               sudo containerlab
${SERVER}                 clab-${lab-name}-server
${CLIENT}                 clab-${lab-name}-client
${SCRAPE_URL}             http://172.20.41.2:9113/stats
${DATAPLANE_URL}          http://10.0.1.1:8080
${STRESS_RELOADS}         10
${STRESS_REQ_PER_LOOP}    25

*** Test Cases ***

ASan nginx starts and serves a scrape
    [Documentation]    The ASan-instrumented nginx boots with the module
    ...    loaded, and a bare scrape returns the expected
    ...    preamble. Touches init_zone, postconfig, and the
    ...    scrape renderer with an empty LRU.
    ${output} =    Scrape Prometheus
    Should Contain    ${output}    nginx-ipng-stats-plugin
    Should Contain    ${output}    nginx_ipng_requests_total

Scrape an empty JSON report
    [Documentation]    JSON renderer path with zero records: catches
    ...    off-by-one errors in the bracket emission.
    ${rc}    ${output} =    Run And Return Rc And Output
    ...    curl -sf -H 'Accept: application/json' ${SCRAPE_URL}
    Should Be Equal As Integers    ${rc}    0
    Should Contain    ${output}    "schema":2
    Should Contain    ${output}    "records":[

Reload storm without traffic
    [Documentation]    Back-to-back reloads with no traffic in between.
    ...    Exercises init_zone's zone-reuse branch and the
    ...    shctx magic check; the cardinality is zero so the
    ...    renderer's naggs_alloc == 0 path is also covered.
    FOR    ${i}    IN RANGE    ${STRESS_RELOADS}
        Docker Exec    ${SERVER}    ngxasan -s reload
        Sleep    200ms
        Scrape Prometheus
    END

Reload storm with interleaved traffic
    [Documentation]    Generate traffic, reload, scrape, repeat. This is
    ...    the scenario that surfaced the original crash: the
    ...    scrape path walks the shared-zone LRU while workers
    ...    are being cycled. Also grows the interning table
    ...    by using a handful of distinct paths.
    FOR    ${i}    IN RANGE    ${STRESS_RELOADS}
        Generate Traffic    ${STRESS_REQ_PER_LOOP}
        Docker Exec    ${SERVER}    ngxasan -s reload
        Sleep    200ms
        Scrape Prometheus
    END

Concurrent scrape during reload
    [Documentation]    Scrape in a tight loop while issuing reloads from
    ...    a parallel shell. The renderer's snapshot step
    ...    deep-copies strings under the slab mutex; a
    ...    concurrent intern_shared grow during that window
    ...    would surface here as use-after-free. We run the
    ...    whole dance in one bash -c so Robot doesn't have
    ...    to babysit the background pid.
    Generate Traffic    ${STRESS_REQ_PER_LOOP}
    ${rc}    ${output} =    Run And Return Rc And Output
    ...    bash -c '( for i in $(seq 1 200); do curl -sf ${SCRAPE_URL} > /dev/null || true; done ) & scraper=$!; for i in 1 2 3 4 5; do docker exec ${SERVER} ngxasan -s reload; sleep 0.3; done; wait $scraper'
    Should Be Equal As Integers    ${rc}    0

Large cardinality intern table growth
    [Documentation]    Drive enough distinct request paths that the
    ...    per-VIP vip/source interning array grows past its
    ...    initial slab_alloc; this exercises the realloc
    ...    path (ngx_slab_free_locked of the old entries
    ...    buffer, copy into the new one) inside the log
    ...    handler.
    FOR    ${i}    IN RANGE    60
        Docker Exec Ignore Rc    ${CLIENT}    curl -s ${DATAPLANE_URL}/path${i}
    END
    Sleep    500ms
    Scrape Prometheus

*** Keywords ***

# --- Lab lifecycle ---

Deploy Lab
    Require ASan Build
    Run    ${CLAB_BIN} --runtime ${runtime} destroy -t ${CURDIR}/${lab-file} --cleanup 2>&1 || true
    ${rc}    ${output} =    Run And Return Rc And Output
    ...    ${CLAB_BIN} --runtime ${runtime} deploy -t ${CURDIR}/${lab-file}
    Log    ${output}
    Should Be Equal As Integers    ${rc}    0
    Wait Until Keyword Succeeds    90s    3s    Server Is Ready
    Wait Until Keyword Succeeds    60s    3s    Client Can Reach Server

Require ASan Build
    [Documentation]    Fail fast with an actionable message if the user
    ...    forgot to run `make build-asan` before invoking
    ...    this suite.
    ${rc} =    Run And Return Rc    test -x ${EXECDIR}/build/nginx-asan/sbin/nginx
    Run Keyword If    ${rc} != 0
    ...    Fail    ASan nginx not found; run `make build-asan` first.

Server Is Ready
    ${rc}    ${output} =    Run And Return Rc And Output    curl -sf ${SCRAPE_URL}
    Should Be Equal As Integers    ${rc}    0

Client Can Reach Server
    ${rc}    ${output} =    Run And Return Rc And Output
    ...    docker exec ${CLIENT} curl -sf ${DATAPLANE_URL}/
    Should Be Equal As Integers    ${rc}    0

Cleanup Lab
    Run    docker logs ${SERVER} > ${EXECDIR}/tests/out/asan-server-docker.log 2>&1
    Run    docker exec ${SERVER} cat /tmp/nginx.err > ${EXECDIR}/tests/out/asan-nginx-err.log 2>&1
    Run    docker exec ${SERVER} cat /tmp/nginx.stderr > ${EXECDIR}/tests/out/asan-nginx-stderr.log 2>&1
    Run    docker exec ${SERVER} bash -c 'cat /tmp/asan.* 2>/dev/null; cat /tmp/ubsan.* 2>/dev/null' > ${EXECDIR}/tests/out/asan-reports.log 2>&1
    Run    ${CLAB_BIN} --runtime ${runtime} destroy -t ${CURDIR}/${lab-file} --cleanup

# --- Sanitizer assertion ---

Assert No Sanitizer Findings
    [Documentation]    Fail the current test if the ASan or UBSan
    ...    runtime wrote any findings to stderr or their
    ...    per-pid log files. Runs after every test case:
    ...    we want the failing test to be the one that
    ...    produced the finding, not a later one.
    ${rc}    ${hits} =    Run And Return Rc And Output
    ...    docker exec ${SERVER} bash -c 'grep -E "AddressSanitizer|LeakSanitizer|runtime error|SUMMARY:" /tmp/nginx.stderr /tmp/asan.* /tmp/ubsan.* 2>/dev/null || true'
    # Compare via $hits rather than '${hits}': sanitizer output spans
    # several lines and may contain quotes, which would break the
    # substituted Python expression.
    Run Keyword If    $hits != ''
    ...    Fail    Sanitizer findings detected:\n${hits}

# --- Traffic generation ---

Generate Traffic
    [Arguments]    ${count}
    FOR    ${i}    IN RANGE    ${count}
        Docker Exec Ignore Rc    ${CLIENT}    curl -s ${DATAPLANE_URL}/
    END

# --- Scraping ---

Scrape Prometheus
    ${rc}    ${output} =    Run And Return Rc And Output    curl -sf ${SCRAPE_URL}
    Should Be Equal As Integers    ${rc}    0
    RETURN    ${output}

# --- Container helpers ---

Docker Exec
    [Arguments]    ${container}    ${cmd}
    ${rc}    ${output} =    Run And Return Rc And Output
    ...    docker exec ${container} ${cmd}
    Should Be Equal As Integers    ${rc}    0
    RETURN    ${output}

Docker Exec Ignore Rc
    [Arguments]    ${container}    ${cmd}
    ${rc}    ${output} =    Run And Return Rc And Output
    ...    docker exec ${container} ${cmd}
    RETURN    ${output}
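
The "Large cardinality intern table growth" case targets a grow-by-copy path: allocate a larger entries buffer, copy the old entries across, free the old buffer. A rough sketch of that shape is shown here for orientation. Everything in it is an assumption made for illustration: `intern_table_t` and `intern_append` are hypothetical names, the doubling policy and initial size of 4 are invented, and plain `malloc`/`free` stand in for the slab allocator calls the suite documentation mentions.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical interning table: an array of string pointers. */
typedef struct {
    const char **entries;   /* strings themselves are not owned here */
    size_t       used;
    size_t       cap;
} intern_table_t;

/*
 * Append s, growing the entries buffer when it is full.
 * Returns the index of the new entry, or -1 on allocation failure
 * (in which case the table is left unchanged).
 */
long
intern_append(intern_table_t *t, const char *s)
{
    if (t->used == t->cap) {
        size_t       ncap  = t->cap ? t->cap * 2 : 4;
        const char **grown = malloc(ncap * sizeof(*grown));

        if (grown == NULL) {
            return -1;
        }
        if (t->used > 0) {
            memcpy(grown, t->entries, t->used * sizeof(*grown));
        }

        /* Any pointer into the old buffer taken before this point is
         * dangling after the free; a reader that still holds one is the
         * use-after-free an ASan build would report. */
        free(t->entries);
        t->entries = grown;
        t->cap     = ncap;
    }

    t->entries[t->used] = s;
    return (long) t->used++;
}
```

With these invented sizes, appending 60 entries forces four reallocations, which is the kind of repeated growth the 60 distinct `/path${i}` requests are meant to provoke in the real table.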
23
tests/02-asan/lab/client/start.sh
Executable file
@@ -0,0 +1,23 @@
#!/bin/bash
# SPDX-License-Identifier: Apache-2.0
# Client container entrypoint for the ASan test suite. Identical in
# spirit to tests/01-module/lab/client/start.sh; kept as a separate
# file so this suite's lab can be torn down and redeployed without
# affecting 01-module state.

set -e

apt-get update -qq
apt-get install -y -qq curl iproute2 > /dev/null 2>&1

echo "Waiting for eth1 ..."
while ! ip link show eth1 > /dev/null 2>&1; do
    sleep 0.2
done
ip link set eth1 up
ip addr add ${MY_IP} dev eth1

# Drop the mgmt default route so data-plane traffic goes out eth1.
ip route del default 2>/dev/null || true

exec sleep infinity
46
tests/02-asan/lab/ipng-stats-asan.clab.yml
Normal file
@@ -0,0 +1,46 @@
# SPDX-License-Identifier: Apache-2.0
# Containerlab topology for the AddressSanitizer/UBSan test suite.
#
# The server container bind-mounts build/nginx-asan/, the
# sanitizer-instrumented nginx built by `make build-asan`. The binary
# was compiled against host glibc, so the container image must match
# the host's Debian release (trixie/13) for the .so and libasan to be
# ABI-compatible. The binary is run directly (no .deb install): the
# `make pkg-deb` path is exercised by tests/01-module/.
#
# Topology: one server + one client with a single data-plane link.
# Unlike 01-module we don't need multi-interface attribution here;
# this suite is focused on memory correctness, not traffic tagging.

name: ipng-stats-asan

mgmt:
  network: ipng-stats-asan-net
  ipv4-subnet: 172.20.41.0/24

topology:
  nodes:
    server:
      kind: linux
      image: debian:trixie-slim
      mgmt-ipv4: 172.20.41.2
      binds:
        # RW because nginx chowns client_body_temp/ and writes to logs/
        # on master startup; it's a build artifact so we don't mind.
        - ../../../build/nginx-asan:/opt/nginx-asan
        - ./server/nginx.conf:/opt/nginx-asan/conf/nginx.conf:ro
        - ./server/start.sh:/start.sh:ro
      cmd: bash /start.sh

    client:
      kind: linux
      image: debian:trixie-slim
      mgmt-ipv4: 172.20.41.11
      binds:
        - ./client/start.sh:/start.sh:ro
      cmd: bash /start.sh
      env:
        MY_IP: 10.0.1.2/24

  links:
    - endpoints: ["server:eth1", "client:eth1"]
50
tests/02-asan/lab/server/nginx.conf
Normal file
@@ -0,0 +1,50 @@
# SPDX-License-Identifier: Apache-2.0
# Minimal nginx config for the ASan test suite. Exercises the code paths
# most likely to surface memory errors: shared-zone init/reload, the
# scrape renderer (under slab mutex), the log-phase handler's interning,
# and logtail UDP buffering.

load_module /opt/nginx-asan/modules/ngx_http_ipng_stats_module.so;

daemon off;
master_process on;
worker_processes 2;
pid /tmp/nginx.pid;
error_log /tmp/nginx.err info;

events {
    worker_connections 128;
}

http {
    access_log off;
    ipng_stats_zone ipng:1m;
    ipng_stats_flush_interval 300ms;
    ipng_stats_default_source direct;

    log_format logtail '$remote_addr\t$request_method\t$request_uri\t$status';
    ipng_stats_logtail logtail udp://127.0.0.1:9514 buffer=4k flush=300ms;

    server {
        # Mgmt scrape endpoint.
        listen 172.20.41.2:9113;

        location = /stats {
            ipng_stats;
            allow all;
        }
    }

    server {
        # Data plane: client traffic lands here.
        listen 10.0.1.1:8080 device=eth1 ipng_source_tag=cl1;
        listen 172.20.41.2:8080;

        location / {
            return 200 "ok $server_addr\n";
        }
        location /notfound {
            return 404 "nope\n";
        }
    }
}
61
tests/02-asan/lab/server/start.sh
Executable file
@@ -0,0 +1,61 @@
#!/bin/bash
# SPDX-License-Identifier: Apache-2.0
# Server container entrypoint for the ASan test suite. Installs libasan
# runtime (the sanitizer-instrumented binary was linked against host
# gcc's libasan.so.8), wires up the data-plane interface, and execs the
# ASan nginx in the foreground with stderr captured so the Robot suite
# can grep for AddressSanitizer/UBSan findings at teardown.

set -e

apt-get update -qq
apt-get install -y -qq libasan8 libubsan1 ncat iproute2 curl > /dev/null 2>&1

# Wait for containerlab to attach the data-plane veth, configure the IP.
echo "Waiting for eth1 ..."
while ! ip link show eth1 > /dev/null 2>&1; do
    sleep 0.2
done
ip link set eth1 up
ip addr add 10.0.1.1/24 dev eth1

# UDP logtail listener: drains the module's datagrams so sendto() has
# a real destination. The test doesn't assert on this file's contents
# (01-module already covers logtail semantics); we just need the socket
# to exist so ASan sees a complete write/flush cycle in the module.
mkdir -p /var/log/nginx
ncat -u -l -k 127.0.0.1 9514 --recv-only >> /var/log/nginx/logtail-udp.log &

# ASan options:
#   detect_odr_violation=0: nginx intentionally duplicates symbols like
#     ngx_module_names between the main binary and each dynamic module.
#   abort_on_error=1, halt_on_error=1: fail fast so the Robot suite
#     sees the exit status and the ASan report is preserved at the tail
#     of /tmp/nginx.stderr.
#   detect_leaks=0: nginx exits without running its pool destructors in
#     many paths; leak detection is not the goal here.
#   log_path: ASan writes each finding to this prefix + pid, so even
#     when nginx wipes its own error log on reload the ASan traces
#     survive for post-run inspection.
ASAN_OPTS="detect_odr_violation=0:abort_on_error=1:halt_on_error=1:detect_leaks=0:log_path=/tmp/asan"
UBSAN_OPTS="print_stacktrace=1:halt_on_error=0:log_path=/tmp/ubsan"

# Wrapper so every subsequent `docker exec ... ngxasan ...` (e.g. the
# reload signal from the Robot suite) inherits the same sanitizer
# settings. `docker exec` does not carry the master's env.
cat > /usr/local/bin/ngxasan <<EOF
#!/bin/bash
export ASAN_OPTIONS="${ASAN_OPTS}"
export UBSAN_OPTIONS="${UBSAN_OPTS}"
exec /opt/nginx-asan/sbin/nginx -p /opt/nginx-asan -c /opt/nginx-asan/conf/nginx.conf "\$@"
EOF
chmod +x /usr/local/bin/ngxasan

export ASAN_OPTIONS="${ASAN_OPTS}"
export UBSAN_OPTIONS="${UBSAN_OPTS}"

# Tee stderr so both docker logs and /tmp/nginx.stderr see it. The
# Robot suite inspects the file; ASan writes its report to stderr
# before abort_on_error kicks the process.
exec /opt/nginx-asan/sbin/nginx -p /opt/nginx-asan -c /opt/nginx-asan/conf/nginx.conf \
    2> >(tee /tmp/nginx.stderr >&2)