Pim van Pelt fdef2a552b Harden scrape rendering and add AddressSanitizer test suite
Move all heap allocation out of the slab-mutex critical section in
render_prom/render_json: snapshot cardinality under a brief lock,
allocate aggs/snaps/string tables outside the lock, then re-acquire
only to deep-copy strings and walk the LRU into the pre-allocated
buffers. A worker crash during output buffer allocation can no
longer leave the shared-memory zone locked, and a corrupt cardinality
count is caught by a 10k sanity cap rather than causing a runaway
ngx_pcalloc.
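
The locking pattern described above can be sketched in a few lines. The module itself is C and this is not its code: `Zone`, `render_snapshot` and `MAX_CARDINALITY` are hypothetical names, and a `threading.Lock` stands in for the slab mutex. It is a minimal illustration of "count under a brief lock, allocate unlocked, re-lock only to copy", assuming the cardinality is a counter stored next to the LRU.

```python
import threading

MAX_CARDINALITY = 10_000  # the "10k sanity cap" from the commit message


class Zone:
    """Hypothetical stand-in for the shared zone guarded by the slab mutex."""

    def __init__(self, records):
        self.mutex = threading.Lock()
        self.lru = list(records)     # (key, hits) pairs
        self.count = len(self.lru)   # cardinality counter stored in the zone


def render_snapshot(zone):
    # Step 1: hold the lock only long enough to read the cardinality.
    with zone.mutex:
        n = zone.count

    # A corrupt count is rejected before it can size an allocation.
    if n < 0 or n > MAX_CARDINALITY:
        raise ValueError(f"implausible cardinality: {n}")

    # Step 2: allocate the output buffer with the lock released.
    snaps = [None] * n

    # Step 3: re-acquire only to copy records into the existing buffer.
    # The LRU may have changed in between, so copy at most n entries.
    with zone.mutex:
        copied = 0
        for key, hits in zone.lru[:n]:
            snaps[copied] = (key, hits)
            copied += 1

    return snaps[:copied]
```

Because both the cap check and the allocation run with the lock released, a failure in either step cannot leave the mutex held.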

Add build-asan and tests/02-asan/: a full sanitizer-instrumented
nginx + module built via apt-source, and a 2-node containerlab
Robot suite that drives reload storms, concurrent scrape-during-reload,
and intern-table growth, failing if AddressSanitizer or UBSan
reports anything on stderr. The two Robot suites now check for
their required build artifacts up front so `make robot-test` no
longer rebuilds them on every invocation.
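
For readers who want the "fail if a sanitizer reports anything" rule outside Robot, here is a rough Python sketch. It applies the same alternation the suite greps for to text that has already been captured. The function names are hypothetical, and the suite's own check runs `grep` inside the server container instead.

```python
import re

# Same alternation the suite greps for; any matching line counts as a finding.
FINDING = re.compile(r"AddressSanitizer|LeakSanitizer|runtime error|SUMMARY:")


def sanitizer_findings(text):
    """Return the lines of `text` that look like ASan/UBSan reports."""
    return [line for line in text.splitlines() if FINDING.search(line)]


def assert_clean(text):
    """Raise AssertionError listing every matching line, if there are any."""
    hits = sanitizer_findings(text)
    if hits:
        raise AssertionError("Sanitizer findings detected:\n" + "\n".join(hits))
```

Joining the hits into the failure message keeps a multi-line report readable in the test output.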

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 10:58:51 +02:00

# SPDX-License-Identifier: Apache-2.0
*** Settings ***
Documentation       AddressSanitizer + UBSan stress suite for
...                 ngx_http_ipng_stats_module. Deploys a 2-node containerlab
...                 topology running an ASan-instrumented nginx (built by
...                 `make build-asan`), exercises the code paths most likely
...                 to surface memory errors (shared-zone init and reuse,
...                 scrape rendering under the slab mutex, log-phase
...                 interning, logtail UDP flush), and fails if any
...                 AddressSanitizer or UBSan finding appears in the nginx
...                 stderr during the run.
...
...                 This suite is deliberately not a superset of 01-module;
...                 it's a landing zone for memory-correctness cases.
...                 Functional coverage (attribution, filters, counters)
...                 lives in 01-module.
Library             OperatingSystem
Library             String
Suite Setup         Deploy Lab
Suite Teardown      Cleanup Lab
Test Teardown       Assert No Sanitizer Findings

*** Variables ***
${lab-name}               ipng-stats-asan
${lab-file}               lab/ipng-stats-asan.clab.yml
${runtime}                docker
${CLAB_BIN}               sudo containerlab
${SERVER}                 clab-${lab-name}-server
${CLIENT}                 clab-${lab-name}-client
${SCRAPE_URL}             http://172.20.41.2:9113/stats
${DATAPLANE_URL}          http://10.0.1.1:8080
${STRESS_RELOADS}         10
${STRESS_REQ_PER_LOOP}    25

*** Test Cases ***
ASan nginx starts and serves a scrape
    [Documentation]    The ASan-instrumented nginx boots with the module
    ...                loaded, and a bare scrape returns the expected
    ...                preamble. Touches init_zone, postconfig, and the
    ...                scrape renderer with an empty LRU.
    ${output} =    Scrape Prometheus
    Should Contain    ${output}    nginx-ipng-stats-plugin
    Should Contain    ${output}    nginx_ipng_requests_total

Scrape an empty JSON report
    [Documentation]    JSON renderer path with zero records; catches
    ...                off-by-one errors in the bracket emission.
    ${rc}    ${output} =    Run And Return Rc And Output
    ...    curl -sf -H 'Accept: application/json' ${SCRAPE_URL}
    Should Be Equal As Integers    ${rc}    0
    Should Contain    ${output}    "schema":2
    Should Contain    ${output}    "records":[

Reload storm without traffic
    [Documentation]    Back-to-back reloads with no traffic in between.
    ...                Exercises init_zone's zone-reuse branch and the
    ...                shctx magic check; the cardinality is zero so the
    ...                renderer's naggs_alloc == 0 path is also covered.
    FOR    ${i}    IN RANGE    ${STRESS_RELOADS}
        Docker Exec    ${SERVER}    ngxasan -s reload
        Sleep    200ms
        Scrape Prometheus
    END

Reload storm with interleaved traffic
    [Documentation]    Generate traffic, reload, scrape, repeat. This is
    ...                the scenario that surfaced the original crash: the
    ...                scrape path walks the shared-zone LRU while workers
    ...                are being cycled. Also grows the interning table
    ...                by using a handful of distinct paths.
    FOR    ${i}    IN RANGE    ${STRESS_RELOADS}
        Generate Traffic    ${STRESS_REQ_PER_LOOP}
        Docker Exec    ${SERVER}    ngxasan -s reload
        Sleep    200ms
        Scrape Prometheus
    END

Concurrent scrape during reload
    [Documentation]    Scrape in a tight loop while issuing reloads from
    ...                a parallel shell. The renderer's snapshot step
    ...                deep-copies strings under the slab mutex; a
    ...                concurrent intern_shared grow during that window
    ...                would surface here as use-after-free. We run the
    ...                whole dance in one bash -c so Robot doesn't have
    ...                to babysit the background pid.
    Generate Traffic    ${STRESS_REQ_PER_LOOP}
    ${rc}    ${output} =    Run And Return Rc And Output
    ...    bash -c '( for i in $(seq 1 200); do curl -sf ${SCRAPE_URL} > /dev/null || true; done ) & scraper=$!; for i in 1 2 3 4 5; do docker exec ${SERVER} ngxasan -s reload; sleep 0.3; done; wait $scraper'
    Should Be Equal As Integers    ${rc}    0

Large cardinality intern table growth
    [Documentation]    Drive enough distinct request paths that the
    ...                per-VIP vip/source interning array grows past its
    ...                initial slab_alloc; this exercises the realloc
    ...                path (ngx_slab_free_locked of the old entries
    ...                buffer, copy into the new one) inside the log
    ...                handler.
    FOR    ${i}    IN RANGE    60
        Docker Exec Ignore Rc    ${CLIENT}    curl -s ${DATAPLANE_URL}/path${i}
    END
    Sleep    500ms
    Scrape Prometheus

*** Keywords ***
# --- Lab lifecycle ---
Deploy Lab
    Require ASan Build
    Run    ${CLAB_BIN} --runtime ${runtime} destroy -t ${CURDIR}/${lab-file} --cleanup 2>&1 || true
    ${rc}    ${output} =    Run And Return Rc And Output
    ...    ${CLAB_BIN} --runtime ${runtime} deploy -t ${CURDIR}/${lab-file}
    Log    ${output}
    Should Be Equal As Integers    ${rc}    0
    Wait Until Keyword Succeeds    90s    3s    Server Is Ready
    Wait Until Keyword Succeeds    60s    3s    Client Can Reach Server

Require ASan Build
    [Documentation]    Fail fast with an actionable message if the user
    ...                forgot to run `make build-asan` before invoking
    ...                this suite.
    ${rc} =    Run And Return Rc    test -x ${EXECDIR}/build/nginx-asan/sbin/nginx
    Run Keyword If    ${rc} != 0
    ...    Fail    ASan nginx not found; run `make build-asan` first.

Server Is Ready
    ${rc}    ${output} =    Run And Return Rc And Output    curl -sf ${SCRAPE_URL}
    Should Be Equal As Integers    ${rc}    0

Client Can Reach Server
    ${rc}    ${output} =    Run And Return Rc And Output
    ...    docker exec ${CLIENT} curl -sf ${DATAPLANE_URL}/
    Should Be Equal As Integers    ${rc}    0

Cleanup Lab
    Run    docker logs ${SERVER} > ${EXECDIR}/tests/out/asan-server-docker.log 2>&1
    Run    docker exec ${SERVER} cat /tmp/nginx.err > ${EXECDIR}/tests/out/asan-nginx-err.log 2>&1
    Run    docker exec ${SERVER} cat /tmp/nginx.stderr > ${EXECDIR}/tests/out/asan-nginx-stderr.log 2>&1
    Run    docker exec ${SERVER} bash -c 'cat /tmp/asan.* 2>/dev/null; cat /tmp/ubsan.* 2>/dev/null' > ${EXECDIR}/tests/out/asan-reports.log 2>&1
    Run    ${CLAB_BIN} --runtime ${runtime} destroy -t ${CURDIR}/${lab-file} --cleanup

# --- Sanitizer assertion ---
Assert No Sanitizer Findings
    [Documentation]    Fail the current test if the ASan or UBSan
    ...                runtime wrote any findings to stderr or their
    ...                per-pid log files. Runs after every test case:
    ...                we want the failing test to be the one that
    ...                produced the finding, not a later one.
    ${rc}    ${hits} =    Run And Return Rc And Output
    ...    docker exec ${SERVER} bash -c 'grep -E "AddressSanitizer|LeakSanitizer|runtime error|SUMMARY:" /tmp/nginx.stderr /tmp/asan.* /tmp/ubsan.* 2>/dev/null || true'
    # Should Be Empty avoids splicing multi-line grep output into a Python
    # expression, which would fail to evaluate instead of reporting the hits.
    Should Be Empty    ${hits}    msg=Sanitizer findings detected:\n${hits}

# --- Traffic generation ---
Generate Traffic
    [Arguments]    ${count}
    FOR    ${i}    IN RANGE    ${count}
        Docker Exec Ignore Rc    ${CLIENT}    curl -s ${DATAPLANE_URL}/
    END

# --- Scraping ---
Scrape Prometheus
    ${rc}    ${output} =    Run And Return Rc And Output    curl -sf ${SCRAPE_URL}
    Should Be Equal As Integers    ${rc}    0
    RETURN    ${output}

# --- Container helpers ---
Docker Exec
    [Arguments]    ${container}    ${cmd}
    ${rc}    ${output} =    Run And Return Rc And Output
    ...    docker exec ${container} ${cmd}
    Should Be Equal As Integers    ${rc}    0
    RETURN    ${output}

Docker Exec Ignore Rc
    [Arguments]    ${container}    ${cmd}
    ${rc}    ${output} =    Run And Return Rc And Output
    ...    docker exec ${container} ${cmd}
    RETURN    ${output}