Move all heap allocation out of the slab-mutex critical section in render_prom/render_json: snapshot cardinality under a brief lock, allocate aggs/snaps/string tables outside the lock, then re-acquire only to deep-copy strings and walk the LRU into the pre-allocated buffers. A worker crash during output buffer allocation can no longer leave the shared-memory zone locked, and a corrupt cardinality count is caught by a 10k sanity cap rather than causing a runaway ngx_pcalloc.

Add build-asan and tests/02-asan/: a full sanitizer-instrumented nginx + module built via apt-source, and a 2-node containerlab Robot suite that drives reload storms, concurrent scrape-during-reload, and intern-table growth, failing if AddressSanitizer or UBSan reports anything on stderr.

The two Robot suites now check for their required build artifacts up front so `make robot-test` no longer rebuilds them on every invocation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
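The locking pattern described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the module's code: it uses a plain pthread mutex in place of the nginx slab mutex, and `record_t`, `shared_zone_t`, `snapshot_zone` and `MAX_RECORDS` are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sanity cap, mirroring the 10k cap in the commit message. */
#define MAX_RECORDS 10000

/* Hypothetical stand-ins for the shared-zone structures. */
typedef struct {
    char          name[32];
    unsigned long hits;
} record_t;

typedef struct {
    pthread_mutex_t lock;     /* stand-in for the slab mutex */
    size_t          count;    /* cardinality stored in shared memory */
    record_t       *records;  /* records living in the shared zone */
} shared_zone_t;

/*
 * Copy the zone's records into a freshly allocated buffer.
 * Returns the number of records copied, or -1 on failure.
 * On success the caller owns *out and must free() it.
 */
long snapshot_zone(shared_zone_t *zone, record_t **out)
{
    size_t    n, copied;
    record_t *buf;

    assert(zone != NULL && out != NULL);

    /* Phase 1: hold the lock only long enough to read the cardinality. */
    pthread_mutex_lock(&zone->lock);
    n = zone->count;
    pthread_mutex_unlock(&zone->lock);

    /* A corrupt count is rejected before it can drive a huge allocation. */
    if (n > MAX_RECORDS) {
        return -1;
    }

    /* Phase 2: allocate with the lock released, so a failure or crash
     * here cannot leave the shared zone locked. */
    buf = calloc(n ? n : 1, sizeof(*buf));
    if (buf == NULL) {
        return -1;
    }

    /* Phase 3: re-acquire and copy into the pre-allocated buffer.
     * The zone may have changed in between, so never copy more than n. */
    pthread_mutex_lock(&zone->lock);
    copied = zone->count < n ? zone->count : n;
    memcpy(buf, zone->records, copied * sizeof(*buf));
    pthread_mutex_unlock(&zone->lock);

    *out = buf;
    return (long) copied;
}
```

A caller would render its output from the returned buffer with no lock held and then free it. Records added between the first and second lock are simply left out of that scrape, which is the trade-off this pattern accepts.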
185 lines
8.1 KiB
Plaintext
# SPDX-License-Identifier: Apache-2.0
*** Settings ***
Documentation       AddressSanitizer + UBSan stress suite for
...                 ngx_http_ipng_stats_module. Deploys a 2-node containerlab
...                 topology running an ASan-instrumented nginx (built by
...                 `make build-asan`), exercises the code paths most likely
...                 to surface memory errors (shared-zone init and reuse,
...                 scrape rendering under the slab mutex, log-phase
...                 interning, logtail UDP flush) and fails if any
...                 AddressSanitizer or UBSan finding appears in the nginx
...                 stderr during the run.
...
...                 This suite is deliberately not a superset of 01-module;
...                 it's a landing zone for memory-correctness cases.
...                 Functional coverage (attribution, filters, counters)
...                 lives in 01-module.
Library             OperatingSystem
Library             String
Suite Setup         Deploy Lab
Suite Teardown      Cleanup Lab
Test Teardown       Assert No Sanitizer Findings

*** Variables ***
${lab-name}               ipng-stats-asan
${lab-file}               lab/ipng-stats-asan.clab.yml
${runtime}                docker
${CLAB_BIN}               sudo containerlab
${SERVER}                 clab-${lab-name}-server
${CLIENT}                 clab-${lab-name}-client
${SCRAPE_URL}             http://172.20.41.2:9113/stats
${DATAPLANE_URL}          http://10.0.1.1:8080
${STRESS_RELOADS}         10
${STRESS_REQ_PER_LOOP}    25

*** Test Cases ***

ASan nginx starts and serves a scrape
    [Documentation]    The ASan-instrumented nginx boots with the module
    ...                loaded, and a bare scrape returns the expected
    ...                preamble. Touches init_zone, postconfig, and the
    ...                scrape renderer with an empty LRU.
    ${output} =    Scrape Prometheus
    Should Contain    ${output}    nginx-ipng-stats-plugin
    Should Contain    ${output}    nginx_ipng_requests_total

Scrape an empty JSON report
    [Documentation]    JSON renderer path with zero records; catches
    ...                off-by-one errors in the bracket emission.
    ${rc}    ${output} =    Run And Return Rc And Output
    ...    curl -sf -H 'Accept: application/json' ${SCRAPE_URL}
    Should Be Equal As Integers    ${rc}    0
    Should Contain    ${output}    "schema":2
    Should Contain    ${output}    "records":[

Reload storm without traffic
    [Documentation]    Back-to-back reloads with no traffic in between.
    ...                Exercises init_zone's zone-reuse branch and the
    ...                shctx magic check; the cardinality is zero so the
    ...                renderer's naggs_alloc == 0 path is also covered.
    FOR    ${i}    IN RANGE    ${STRESS_RELOADS}
        Docker Exec    ${SERVER}    ngxasan -s reload
        Sleep    200ms
        Scrape Prometheus
    END

Reload storm with interleaved traffic
    [Documentation]    Generate traffic, reload, scrape, repeat. This is
    ...                the scenario that surfaced the original crash: the
    ...                scrape path walks the shared-zone LRU while workers
    ...                are being cycled. Also grows the interning table
    ...                by using a handful of distinct paths.
    FOR    ${i}    IN RANGE    ${STRESS_RELOADS}
        Generate Traffic    ${STRESS_REQ_PER_LOOP}
        Docker Exec    ${SERVER}    ngxasan -s reload
        Sleep    200ms
        Scrape Prometheus
    END

Concurrent scrape during reload
    [Documentation]    Scrape in a tight loop while issuing reloads from
    ...                a parallel shell. The renderer's snapshot step
    ...                deep-copies strings under the slab mutex; a
    ...                concurrent intern_shared grow during that window
    ...                would surface here as use-after-free. We run the
    ...                whole dance in one bash -c so Robot doesn't have
    ...                to babysit the background pid.
    Generate Traffic    ${STRESS_REQ_PER_LOOP}
    ${rc}    ${output} =    Run And Return Rc And Output
    ...    bash -c '( for i in $(seq 1 200); do curl -sf ${SCRAPE_URL} > /dev/null || true; done ) & scraper=$!; for i in 1 2 3 4 5; do docker exec ${SERVER} ngxasan -s reload; sleep 0.3; done; wait $scraper'
    Should Be Equal As Integers    ${rc}    0

Large cardinality intern table growth
    [Documentation]    Drive enough distinct request paths that the
    ...                per-VIP vip/source interning array grows past its
    ...                initial slab_alloc; this exercises the realloc
    ...                path (ngx_slab_free_locked of the old entries
    ...                buffer, copy into the new one) inside the log
    ...                handler.
    FOR    ${i}    IN RANGE    60
        Docker Exec Ignore Rc    ${CLIENT}    curl -s ${DATAPLANE_URL}/path${i}
    END
    Sleep    500ms
    Scrape Prometheus

*** Keywords ***

# --- Lab lifecycle ---

Deploy Lab
    Require ASan Build
    Run    ${CLAB_BIN} --runtime ${runtime} destroy -t ${CURDIR}/${lab-file} --cleanup 2>&1 || true
    ${rc}    ${output} =    Run And Return Rc And Output
    ...    ${CLAB_BIN} --runtime ${runtime} deploy -t ${CURDIR}/${lab-file}
    Log    ${output}
    Should Be Equal As Integers    ${rc}    0
    Wait Until Keyword Succeeds    90s    3s    Server Is Ready
    Wait Until Keyword Succeeds    60s    3s    Client Can Reach Server

Require ASan Build
    [Documentation]    Fail fast with an actionable message if the user
    ...                forgot to run `make build-asan` before invoking
    ...                this suite.
    ${rc} =    Run And Return Rc    test -x ${EXECDIR}/build/nginx-asan/sbin/nginx
    Run Keyword If    ${rc} != 0
    ...    Fail    ASan nginx not found; run `make build-asan` first.

Server Is Ready
    ${rc}    ${output} =    Run And Return Rc And Output    curl -sf ${SCRAPE_URL}
    Should Be Equal As Integers    ${rc}    0

Client Can Reach Server
    ${rc}    ${output} =    Run And Return Rc And Output
    ...    docker exec ${CLIENT} curl -sf ${DATAPLANE_URL}/
    Should Be Equal As Integers    ${rc}    0

Cleanup Lab
    Run    docker logs ${SERVER} > ${EXECDIR}/tests/out/asan-server-docker.log 2>&1
    Run    docker exec ${SERVER} cat /tmp/nginx.err > ${EXECDIR}/tests/out/asan-nginx-err.log 2>&1
    Run    docker exec ${SERVER} cat /tmp/nginx.stderr > ${EXECDIR}/tests/out/asan-nginx-stderr.log 2>&1
    Run    docker exec ${SERVER} bash -c 'cat /tmp/asan.* 2>/dev/null; cat /tmp/ubsan.* 2>/dev/null' > ${EXECDIR}/tests/out/asan-reports.log 2>&1
    Run    ${CLAB_BIN} --runtime ${runtime} destroy -t ${CURDIR}/${lab-file} --cleanup

# --- Sanitizer assertion ---

Assert No Sanitizer Findings
    [Documentation]    Fail the current test if the ASan or UBSan
    ...                runtime wrote any findings to stderr or their
    ...                per-pid log files. Runs after every test case:
    ...                we want the failing test to be the one that
    ...                produced the finding, not a later one.
    ${rc}    ${hits} =    Run And Return Rc And Output
    ...    docker exec ${SERVER} bash -c 'grep -E "AddressSanitizer|LeakSanitizer|runtime error|SUMMARY:" /tmp/nginx.stderr /tmp/asan.* /tmp/ubsan.* 2>/dev/null || true'
    # Should Be Empty avoids evaluating ${hits} inside a quoted expression,
    # which breaks when a report contains single quotes (UBSan quotes type
    # names, for example).
    Should Be Empty    ${hits}    msg=Sanitizer findings detected:\n${hits}

# --- Traffic generation ---

Generate Traffic
    [Arguments]    ${count}
    FOR    ${i}    IN RANGE    ${count}
        Docker Exec Ignore Rc    ${CLIENT}    curl -s ${DATAPLANE_URL}/
    END

# --- Scraping ---

Scrape Prometheus
    ${rc}    ${output} =    Run And Return Rc And Output    curl -sf ${SCRAPE_URL}
    Should Be Equal As Integers    ${rc}    0
    RETURN    ${output}

# --- Container helpers ---

Docker Exec
    [Arguments]    ${container}    ${cmd}
    ${rc}    ${output} =    Run And Return Rc And Output
    ...    docker exec ${container} ${cmd}
    Should Be Equal As Integers    ${rc}    0
    RETURN    ${output}

Docker Exec Ignore Rc
    [Arguments]    ${container}    ${cmd}
    ${rc}    ${output} =    Run And Return Rc And Output
    ...    docker exec ${container} ${cmd}
    RETURN    ${output}