ADR 020: Parallel Audit Completeness vs. Fail-Fast
Status: Active (v0.7.0 "Quartz Maturity")
Decider: Architecture Lead
Date: 2026-05-02
Context
Zenzic uses a ProcessPoolExecutor to scan documentation files in parallel
when a repository contains 50 or more Markdown files (ADAPTIVE_PARALLEL_THRESHOLD
in core/scanner.py). Each worker executes _scan_single_file() independently
and returns an IntegrityReport containing any findings, including SecurityFinding
objects emitted by the Shield (Z201/Z202/Z203).
In the implementation prior to v0.7.0, the coordinator collected results by
iterating over futures_map.items() in submission order, calling
fut.result(timeout=30) on each future in turn. This design had two consequences:
- No early termination. If file 1 of 500 contained a credential (Z201, Exit Code 2), the remaining 499 files were still scanned to completion before the CLI could report the breach. On large repositories, this wasted significant CI compute time.
- Sequential result collection. A slow worker at position 2 would block collection of all subsequent results until it completed or timed out, even if workers 3–500 had already finished.
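The pre-v0.7.0 collection pattern can be sketched as follows (a simplified illustration: `futures_map` and the 30-second timeout follow this ADR; the function name and scaffolding are assumptions):

```python
import concurrent.futures

def collect_sequentially(futures_map, timeout=30):
    """Pre-v0.7.0 sketch: collect results strictly in submission order.

    A slow future early in the map blocks collection of every later
    result, even those that have already completed.
    """
    reports = []
    for fut, md_file in futures_map.items():
        # Blocks up to `timeout` seconds on THIS future before moving on.
        reports.append(fut.result(timeout=timeout))
    return reports
```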
Two abort mechanisms were evaluated before the adopted solution:
- `multiprocessing.Manager().Event()` — a shared boolean flag visible to both coordinator and workers. Rejected. Passing a manager event to `_worker()` makes it stateful: its output would depend on external shared state rather than solely on its inputs (`md_file`, `config`, `rule_engine`). This violates Pillar 3: Pure Functions First — a founding invariant of the Zenzic architecture. `_worker()` must remain a pure function.
- `concurrent.futures.as_completed()` — an iterator that yields futures in completion order. Initially proposed, then replaced. `as_completed()` provides no per-batch timeout guarantee: a deadlocked final worker would block the generator indefinitely. The ZRT-002 protection (Z009 for deadlocked workers) cannot be preserved without introducing a separate per-future timeout mechanism that negates the simplicity advantage of `as_completed()`.
Decision
From v0.7.0, the parallel coordinator uses
`concurrent.futures.wait()` with `return_when=FIRST_COMPLETED` and a local `_abort` flag. On the first `SecurityFinding` in a completed worker result, all still-queued (PENDING) futures are cancelled immediately. The ZRT-002 deadlock guard is preserved.
The implementation replaces the for fut, md_file in futures_map.items() loop
with a while _pending loop. Each iteration calls:
done, _pending = concurrent.futures.wait(
_pending,
timeout=_WORKER_TIMEOUT_S,
return_when=concurrent.futures.FIRST_COMPLETED,
)
When a completed report contains security_findings, the coordinator sets
_abort = True and calls pending_fut.cancel() on every future still in
_pending. Subsequent iterations discard results silently.
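Putting these pieces together, the v0.7.0 collection loop can be sketched as below (a simplified illustration: it assumes each future resolves to a report object exposing a `security_findings` list, as this ADR describes; the function name and surrounding scaffolding are not from the codebase):

```python
import concurrent.futures

_WORKER_TIMEOUT_S = 30

def collect_fail_fast(futures, timeout=_WORKER_TIMEOUT_S):
    """v0.7.0 sketch: collect in completion order, abort on first breach."""
    reports = []
    _abort = False
    _pending = set(futures)
    while _pending:
        done, _pending = concurrent.futures.wait(
            _pending,
            timeout=timeout,
            return_when=concurrent.futures.FIRST_COMPLETED,
        )
        if not done:
            # ZRT-002 guard: nothing completed within the window --
            # cancel the stragglers; the caller emits Z009 per stalled file.
            for fut in _pending:
                fut.cancel()
            break
        for fut in done:
            if fut.cancelled():
                continue  # a future we cancelled earlier resurfaces as done
            report = fut.result()
            if _abort:
                continue  # breach already seen: discard silently
            reports.append(report)
            if report.security_findings:
                _abort = True
                for pending_fut in _pending:
                    pending_fut.cancel()  # only PENDING tasks can be cancelled
    return reports
```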
Behavioural changes in v0.7.0:
| Scenario | Pre-v0.7.0 | v0.7.0 |
|---|---|---|
| No security breach | All files scanned | All files scanned (unchanged) |
| Security breach in file 1/500 | All 500 files scanned | Breach detected; pending tasks cancelled |
| Deadlocked worker | Z009 after 30 s per-worker | Z009 if no worker completes in 30 s |
| Result order | Submission order → sorted | Completion order → sorted |
Cancellation semantics: future.cancel() operates only on tasks that have
not yet been dispatched to a worker process (PENDING state). Tasks already
RUNNING cannot be interrupted — they complete and their results are silently
discarded (not added to the report). The fail-fast is therefore a
best-effort CI optimisation, not a hard execution guarantee.
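This distinction is easy to demonstrate outside Zenzic (an illustrative stdlib demo, not project code):

```python
import concurrent.futures
import threading
import time

def demo_cancel():
    """Show that RUNNING futures cannot be cancelled but PENDING ones can."""
    gate = threading.Event()
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as ex:
        running = ex.submit(gate.wait)       # occupies the only worker slot
        time.sleep(0.1)                      # give it time to start RUNNING
        pending = ex.submit(time.sleep, 0)   # queued behind it: still PENDING
        outcome = (running.cancel(), pending.cancel())
        gate.set()                           # release the running worker
    return outcome
```

`cancel()` returns `False` for the running task and `True` for the queued one, which is exactly why the fail-fast can only be best-effort.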
ZRT-002 preservation: If concurrent.futures.wait() returns an empty done
set (no worker completed within _WORKER_TIMEOUT_S seconds), all remaining
pending futures are cancelled and a Z009 finding is emitted for each stalled
file. This protects against ReDoS patterns in [[custom_rules]] that somehow
bypass the startup canary (_assert_regex_canary()).
Rationale
1. Pillar 3 Preserved
The fail-fast is implemented entirely in the coordinator, which is orchestration logic — not analysis logic. The coordinator is the only scope where multiple futures are visible simultaneously. No analysis function is aware of the abort state.
_worker() and _scan_single_file() are unchanged in v0.7.0. Given the
same inputs, they produce the same output. They have no dependency on shared
state. This functional purity is what makes them deterministic in isolation and
trivially testable.
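That determinism is cheap to check in a test harness. A minimal sketch (the harness itself is illustrative; only the argument names follow this ADR):

```python
def is_deterministic_twice(worker, md_file, config, rule_engine):
    """Call the worker twice with identical inputs; a pure function must
    yield identical results (a necessary, though not sufficient, check)."""
    return worker(md_file, config, rule_engine) == worker(md_file, config, rule_engine)
```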
2. Audit-Complete Semantics for Running Workers
Workers already executing when a breach is detected are allowed to complete
naturally. Their results are discarded by the coordinator. This prevents the
scenario where a partially-written IntegrityReport (from a worker interrupted
mid-execution) corrupts the findings list or leaves file handles open.
3. Deterministic Output
The final reports list is always sorted by file_path after collection.
CLI output is reproducible regardless of worker completion order, pool size,
or how many files were scanned before the abort.
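The final step amounts to a single sort (a sketch; only `file_path` and the sort-after-collection rule come from this ADR):

```python
def finalize(reports):
    """Sort collected reports by file_path so CLI output does not depend
    on worker completion order or pool size."""
    return sorted(reports, key=lambda r: r.file_path)
```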
4. wait(FIRST_COMPLETED) vs as_completed()
as_completed() was the initially proposed mechanism. It was replaced by
wait(return_when=FIRST_COMPLETED) for one specific reason: the ZRT-002
deadlock guard. With as_completed(), a deadlocked last worker causes the
generator to block indefinitely with no way to enforce a timeout per pending
batch. With wait(timeout=_WORKER_TIMEOUT_S), an empty done set after 30
seconds unconditionally triggers the Z009 guard — no additional mechanism needed.
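The timeout behaviour is observable with a stalled stand-in worker (an illustrative stdlib demo, not project code):

```python
import concurrent.futures
import threading

def probe_stall(timeout=0.2):
    """Simulate a deadlocked worker and show wait() returning an empty
    done set once the per-batch timeout elapses."""
    gate = threading.Event()
    ex = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    stalled = ex.submit(gate.wait)  # never finishes until gate.set()
    done, pending = concurrent.futures.wait(
        {stalled},
        timeout=timeout,
        return_when=concurrent.futures.FIRST_COMPLETED,
    )
    gate.set()                      # unblock so shutdown can proceed
    ex.shutdown()
    return len(done), len(pending)
```

An empty `done` set with a non-empty `pending` set is the condition that triggers the Z009 guard.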
Invariants
- `_worker()` must remain a pure, stateless function. No shared state, queue, or event may be passed to it.
- The `_abort` flag is a local variable in the coordinator loop. It is not exported, not shared with workers, and not visible outside the `with executor` block.
- Results are always sorted by `file_path` before being returned. The completion order from `wait()` is never the final output order.
- ZRT-002 deadlock guard: if no future completes within `_WORKER_TIMEOUT_S` seconds, all remaining futures are cancelled and a Z009 finding is emitted for each stalled file.
Consequences
- On repositories with a security breach in the first few files, CI runtime is reduced proportionally to the number of cancelled workers.
- On repositories with no breach, performance is identical to the previous implementation (all workers complete, all results collected).
- The `ADAPTIVE_PARALLEL_THRESHOLD` constant retains its role: below 50 files, sequential mode is used and this ADR does not apply. The sequential path is unchanged.
- The fail-fast applies to parallel mode only. A scan that produces zero security findings is unaffected by this change.