# Sentinel Guard: The Engineering of Documentation Integrity and Security
Every build engine you have ever used has made you a silent promise: "Trust me – if this builds, your documentation is correct."
That promise is architecturally broken. And your documentation is paying the price.
## Act I – The Thesis of Untrusted Input

### The Problem with Build Engines
Your documentation pipeline looks something like this:
```text
Author → Markdown file → Build Engine → HTML → CDN → Reader
                              ↑
                   "Trust me, I'll build it."
```
Docusaurus, MkDocs, Zensical – these are generators. Their contract with you is explicit: "Give me source files; I will produce a static site." They are optimized for speed, for plugin extensibility, for theming. They are not optimized for validation. They trust the source files you give them.
That trust is the vulnerability.
What "Untrusted Input" Means in 2026β
The threat model for documentation has changed. In 2026, documentation sources come from:
- Human contributors via pull requests – who may not know your link structure
- AI-generated content – which plausibly invents URLs that sound real
- Automated refactors – which move files without updating cross-references
- External integrations – which inject content that may carry credential fragments
In a monorepo shared across teams, a single contributor committing a file that references a non-existent anchor, exposes an internal API key, or introduces a circular link dependency can silently corrupt the documentation of ten product areas simultaneously. The build engine will not catch it. The build engine does not try.
Zenzic's core design thesis, established in ADR-001, is: **treat every Markdown file as untrusted input**.
This is not a feature. It is a trust model. Every design decision in Zenzic derives from it.
### The Zero-Trust Documentation Pipeline
```text
Author → Markdown file → Zenzic (Sentinel) → Build Engine → HTML → CDN → Reader
                               ↑
               "I trust nothing. I verify everything."
```
The Sentinel runs before the build. It does not trust the source. It constructs a complete in-memory model of your site – the Virtual Site Map – and validates every claim in your source files against that model. If the model says a link is broken, the CI gate fails. The build engine never sees the broken file.
The fundamental difference is not in features. It is in philosophy. Build engines are designed to succeed. Zenzic is designed to find where they would have failed.
### Supply Chain Attacks via Documentation
Consider a real scenario: your documentation site is a monorepo including third-party-contributed guides. One contributor submits a tutorial that includes:
````markdown
## Setup

Configure your environment with your API key:

```yaml
api_key: "sk_live_<YOUR_STRIPE_KEY_HERE>"
```
````
The Stripe live key – `sk_live_*` – is a real credential format. Zenzic's Shield
catches it before the commit merges. The build engine builds it without comment.
This is the **supply chain attack surface in documentation**: external content that
carries secrets, adversarial links, or path traversal payloads, combined with a build
pipeline that trusts its input entirely.
Zenzic's response to this is the three-layer security architecture: the **Shield**
(credentials), the **Blood Sentinel** (paths), and the **Structural Validator** (graph
integrity). Each layer is independent. Each layer runs on every scan. No configuration
can disable all of them at once.
### The Three Pillars (Non-Negotiable)
These invariants are in Zenzic's design contract and cannot be overridden by
configuration:
**Pillar 1 – Lint the Source, Not the Build.** Analysis operates on raw Markdown and
configuration files, never on HTML output. Errors are caught before the build starts.

**Pillar 2 – Zero Subprocesses.** 100% pure Python. No `subprocess`, no `os.system`,
no Node.js execution. This guarantees reproducible results across platforms,
zero dependency on the build environment, and total portability.

**Pillar 3 – Pure Functions First.** Analysis logic is deterministic. I/O is isolated
at the edges (discovery and reporting). No I/O in hot-path loops.
---
## Act II – The VSM Engine: A Mental Map of Your Site
### What the Virtual Site Map Is
The Virtual Site Map (VSM) is Zenzic's central data structure. It is a complete
in-memory projection of your documentation site as a routing table: a mapping from
every canonical URL that your build engine would generate to a `Route` object.
The `Route` object carries:
```python
@dataclass(frozen=True, slots=True)
class Route:
    url: str                 # canonical URL: "/docs/guide/install/"
    file_path: Path          # absolute path on disk: /repo/docs/guide/install.mdx
    status: RouteStatus
    anchors: frozenset[str]
    is_proxy: bool
    version: str | None
```
The `RouteStatus` can be one of four values:

| Status | Meaning |
|---|---|
| `REACHABLE` | File is navigable via at least one user-clickable surface |
| `ORPHAN_BUT_EXISTING` | File exists on disk but no navigation surface links to it |
| `IGNORED` | System file (e.g., `_category_.json`) – not a content page |
| `CONFLICT` | Two files produce the same canonical URL – build collision |
### Why the VSM Is Necessary

Without the VSM, Zenzic could only answer: "does this file exist on disk?" That
question is easy. `pathlib.Path.exists()` answers it in a single syscall.
The VSM enables Zenzic to answer a harder question: "would this link resolve in the rendered site, given how your specific build engine maps source files to URLs?"
Those are completely different questions.
Consider Docusaurus: a file at `docs/guide/index.mdx` is served at `/docs/guide/`,
not at `/docs/guide/index/`. A link to `/docs/guide/index.html` would resolve to
nothing in the browser – even though the file exists on disk.

Consider MkDocs: a file at `docs/api.md` with a `nav:` entry `- API: api.md` is
reachable. The same file without a `nav:` entry is potentially an orphan, depending
on the `nav:` configuration.
The VSM encodes these engine-specific routing rules in pure Python, without running the build engine.
### Building the VSM: The Architecture
```text
        ┌──────────────────────────────┐
        │         build_vsm()          │
        │   (I/O boundary – called     │
        │       once per scan)         │
        └──────────────┬───────────────┘
                       │
        ┌──────────────▼───────────────┐
        │   Adapter.get_route_info()   │
        │  (engine-specific, per-file) │
        └──────────────┬───────────────┘
                       │
       ┌───────────────┼─────────────────────┐
       ▼               ▼                     ▼
┌─────────────────┐ ┌────────────────┐ ┌─────────────────────┐
│  map_url(rel)   │ │ classify_route │ │   get_nav_paths()   │
│ (canonical URL) │ │ (reachability) │ │ (sidebar+nav+footer)│
└─────────────────┘ └────────────────┘ └─────────────────────┘
```
The `build_vsm()` function is the only I/O boundary – it iterates over every Markdown
file in `docs_root` exactly once. All adapter calls are pure functions after that
initial read. No file is touched again during link validation.
### O(1) Link Validation
The VSM is a Python `dict[str, Route]` – a hash map keyed by canonical URL.
When the validator needs to check whether a link target exists, it calls:

```python
route = vsm.get(canonical_url)  # dict.get() – O(1) hash lookup
if route is None:
    # Z104: FILE_NOT_FOUND
```
This means validating 10,000 links against a 10,000-page site is not O(N²) – it is 10,000 independent O(1) lookups. The VSM is built once (O(N)) and then queried indefinitely at O(1) per link.
Compare this to naive implementations that call `Path.exists()` for every link target:
that is N×M syscalls for N links across M files, where each `stat()` call crosses the
user-kernel boundary. At 50,000 links across a large documentation site, the
difference between O(1) hash lookups and O(N×M) syscalls is the difference between
a 3-second scan and a 90-second scan.
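The lookup pattern can be sketched in a few lines; `Route`, `build_vsm`, and `is_broken` here are simplified stand-ins for Zenzic's internals, not its actual API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """Minimal stand-in for the real Route object."""
    url: str


def build_vsm(urls: list[str]) -> dict[str, Route]:
    """O(N): hash the whole site once."""
    return {u: Route(u) for u in urls}


def is_broken(vsm: dict[str, Route], canonical_url: str) -> bool:
    """O(1): one dict lookup per link, zero syscalls."""
    return vsm.get(canonical_url) is None


vsm = build_vsm(["/docs/guide/", "/docs/guide/install/"])
assert is_broken(vsm, "/docs/missing/")       # Z104 candidate
assert not is_broken(vsm, "/docs/guide/")
```

Once the map exists, per-link cost no longer depends on site size.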
### The Anchor Cache
In addition to the URL map, the VSM builder constructs an anchor cache: a mapping from file path to the set of heading slugs that file declares.
```python
anchors_cache: dict[Path, set[str]] = {
    Path("/repo/docs/guide/install.mdx"): {
        "prerequisites", "installation", "next-steps"
    },
    ...
}
```
When a link contains a fragment (`/docs/guide/install/#next-steps`), Zenzic:

1. Resolves the URL to a file path via the VSM (O(1))
2. Checks the fragment against `anchors_cache[file_path]` (O(1) set lookup)
A broken anchor (Z102) is detected with two hash lookups. Zero I/O. Zero subprocess calls.
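A toy version of the two-lookup check; the dict names here are illustrative, not Zenzic's real structures:

```python
from pathlib import Path

# Hypothetical data mirroring the structures described above.
url_to_file: dict[str, Path] = {
    "/docs/guide/install/": Path("/repo/docs/guide/install.mdx"),
}
anchors_cache: dict[Path, set[str]] = {
    Path("/repo/docs/guide/install.mdx"): {
        "prerequisites", "installation", "next-steps"
    },
}


def anchor_resolves(link: str) -> bool:
    url, _, fragment = link.partition("#")
    file_path = url_to_file.get(url)             # hash lookup 1: URL -> file
    if file_path is None:
        return False
    return fragment in anchors_cache[file_path]  # hash lookup 2: fragment in set


assert anchor_resolves("/docs/guide/install/#next-steps")
assert not anchor_resolves("/docs/guide/install/#missing")   # Z102 candidate
```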
### Ghost Routes: i18n and Versioning
The VSM handles cases where a URL exists but no physical file produces it – what Zenzic calls Ghost Routes. Two categories:
**i18n Ghost Routes:** Docusaurus generates locale-specific index pages (e.g.,
`/it/docs/`) automatically, even when no physical `it/index.mdx` exists. The VSM
marks these as `is_proxy=True` and `status=REACHABLE`, because the build engine will
generate them.

**Versioned Routes:** Zenzic's Docusaurus adapter uses an internal `_version_` sentinel
prefix to track versioned documentation trees. A file at
`docs/versioned_docs/version-0.6/guide.md` is indexed as `_version_/0.6/guide.md`
in the VSM and served at `/docs/0.6/guide/` – transparent to the validator.
In both cases, the VSM answer is correct: the URL is reachable for a reader. No physical file required.
### Collision Detection

Two source files can produce the same canonical URL – this is a build-time error in Docusaurus and MkDocs. The VSM detects this during construction:
```python
def _detect_collisions(routes: list[Route]) -> None:
    # Simplified sketch: Route is shown mutable here; the real frozen
    # dataclass records the CONFLICT status differently.
    seen: dict[str, Route] = {}
    for route in routes:
        if route.url in seen:
            route.status = "CONFLICT"
            seen[route.url].status = "CONFLICT"
        else:
            seen[route.url] = route
```
A `CONFLICT` route surfaces as a Zenzic finding before the build runs, preventing the
silent data loss that occurs when two files compete for the same URL.
## Act III – The Shield: 8 Stages of Truth

### The Problem with Naive Secret Detection
A naive credential scanner applies regex patterns line by line:
```python
if re.search(r"AKIA[0-9A-Z]{16}", line):
    flag_secret()
```
This works when the secret is written plainly. In documentation, secrets are rarely written plainly. They appear in:
- Markdown tables: ``| Key | `AKIA` | `1234567890ABCDEF` |``
- Concatenated strings: `` `AKIA`+`1234ABCD5678EFGH` ``
- HTML-entity encoded values: `&#65;KIA1234567890ABCDEF`
- Unicode-obfuscated text: `A\u200bK\u200bI\u200bA1234567890ABCDEF` (zero-width spaces)
- Comment-interleaved tokens: `ghp_ABC{/* comment */}DEF`
- Cross-line YAML scalars: a key split across two lines by a folded block
Zenzic's Shield is designed to defeat all of these patterns. It does so through a normalization pipeline applied before regex matching.
### The 8 Stages of Normalization

The `_normalize_line_for_shield()` function applies these transformations in strict
order:
#### Stage 1 – Unicode Format Character Stripping (ZRT-006)
normalized = "".join(c for c in line if unicodedata.category(c) != "Cf")
Unicode category `Cf` ("Format, other") includes invisible characters: zero-width
joiners (U+200D), zero-width non-joiners (U+200C), zero-width spaces (U+200B), and
word joiners (U+2060). An adversarial author can insert these between characters of a
secret key – the characters are visually invisible and collapse when copy-pasted, but
a naive regex will not match the fragmented token.
Stage 1 strips them entirely, reconstructing the original token.
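The effect is easy to verify with the standard library alone:

```python
import unicodedata


def strip_format_chars(line: str) -> str:
    # Drop every character in Unicode category Cf ("Format, other").
    return "".join(c for c in line if unicodedata.category(c) != "Cf")


# Zero-width spaces (U+200B) fragment the token invisibly.
obfuscated = "A\u200bK\u200bI\u200bA1234567890ABCDEF"
assert strip_format_chars(obfuscated) == "AKIA1234567890ABCDEF"
```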
#### Stage 2 – HTML Character Reference Decoding (ZRT-006)

```python
normalized = html.unescape(normalized)
```
HTML character references (`&#65;`, `&#x41;`, `&amp;`) can encode any ASCII character.
A key like `AKIA1234567890ABCD` can be written as `&#65;KIA1234567890ABCD` in inline HTML within a Markdown file – and will render correctly in the browser while evading naive scanners.

`html.unescape()` from the Python standard library handles all forms: decimal (`&#NNN;`),
hexadecimal (`&#xHH;`), and named references (`&amp;`).
#### Stage 3 – HTML Comment Stripping (ZRT-007)

```python
_HTML_COMMENT_RE = re.compile(r"<!--.*?-->")
normalized = _HTML_COMMENT_RE.sub("", normalized)
```
HTML comments can interleave token fragments: `ghp_ABC<!-- noise -->DEF`. After the
build, the comment is invisible. In the source, it splits the token. Stage 3 removes
the comment, joining `ghp_ABC` and `DEF` into `ghp_ABCDEF`, which is then matched by
the GitHub token pattern on a subsequent pass.
#### Stage 4 – MDX Comment Stripping (ZRT-007)

```python
_MDX_COMMENT_RE = re.compile(r"\{/\*.*?\*/\}")
normalized = _MDX_COMMENT_RE.sub("", normalized)
```
MDX files use JSX-style comments: `{/* ... */}`. The same interleaving attack applies.
Stage 4 handles the MDX-specific variant independently.
#### Stage 5 – Backtick Code Span Unwrapping (ZRT-003)

```python
_BACKTICK_INLINE_RE = re.compile(r"`([^`]*)`")
normalized = _BACKTICK_INLINE_RE.sub(r"\1", normalized)
```
Documentation authors frequently write tokens inside inline code spans for visual
formatting: `` `AKIA` ``. The backticks are presentation – they do not change the
semantics of the content. Stage 5 strips them, exposing the raw token to the regex
patterns.
#### Stage 6 – Concatenation Operator Removal (ZRT-003)

```python
_CONCAT_OP_RE = re.compile(r"[`'\"\s]*\+[`'\"\s]*")
normalized = _CONCAT_OP_RE.sub("", normalized)
```
Split-token patterns appear in documentation tables:

```markdown
| Field | Value |
|-------|-------|
| Key | `AKIA` + `1234567890ABCDEF` |
```
The `+` operator combined with surrounding backticks is a common representation of
string concatenation in documentation. Stage 6 removes the concatenation construct,
joining the fragments into `AKIA1234567890ABCDEF`.
#### Stage 7 – Table Pipe Replacement

```python
_TABLE_PIPE_RE = re.compile(r"\|")
normalized = _TABLE_PIPE_RE.sub(" ", normalized)
```
Markdown table cells are separated by `|`. A secret split across cells would be
`| AKIA | 1234567890ABCDEF |`. Stage 7 converts pipes to spaces, enabling the
whitespace collapse in Stage 8 to produce a scannable line.
#### Stage 8 – Whitespace Normalization

```python
return " ".join(normalized.split())
```
Collapses all whitespace runs (tabs, multiple spaces, newlines) into single spaces. This is the final normalization before regex matching. The result is a clean, compact line where all obfuscation techniques have been defeated.
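The eight stages compose into a single pass. A condensed sketch of the pipeline as described above (the real `_normalize_line_for_shield()` may differ in detail):

```python
import html
import re
import unicodedata

_HTML_COMMENT_RE = re.compile(r"<!--.*?-->")
_MDX_COMMENT_RE = re.compile(r"\{/\*.*?\*/\}")
_BACKTICK_INLINE_RE = re.compile(r"`([^`]*)`")
_CONCAT_OP_RE = re.compile(r"[`'\"\s]*\+[`'\"\s]*")


def normalize(line: str) -> str:
    line = "".join(c for c in line if unicodedata.category(c) != "Cf")  # 1
    line = html.unescape(line)                                          # 2
    line = _HTML_COMMENT_RE.sub("", line)                               # 3
    line = _MDX_COMMENT_RE.sub("", line)                                # 4
    line = _BACKTICK_INLINE_RE.sub(r"\1", line)                         # 5
    line = _CONCAT_OP_RE.sub("", line)                                  # 6
    line = line.replace("|", " ")                                       # 7
    return " ".join(line.split())                                       # 8


# A table-split, backtick-wrapped, concatenated token collapses cleanly:
assert normalize("| Key | `AKIA` + `1234567890ABCDEF` |") == "Key AKIA1234567890ABCDEF"
```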
### The Lookback Buffer: Cross-Line Detection
A secret that spans two lines defeats single-line scanning:
```yaml
api_key: >-
  AKIA
  IOSFODNN7EXAMPLE
```
Each line individually contains only a fragment. Neither line matches the AWS access
key pattern `AKIA[0-9A-Z]{16}`.
Zenzic addresses this with `scan_lines_with_lookback()` – a stateful scanner that
maintains a 1-line lookback buffer:
```python
def scan_lines_with_lookback(
    lines: Iterator[tuple[int, str]],
    file_path: Path | str,
) -> Iterator[SecurityFinding]:
    prev_normalized: str = ""
    for line_no, raw_line in lines:
        normalized = _normalize_line_for_shield(raw_line)
        # Scan the cross-line join: tail of previous line + head of current
        cross_line = prev_normalized[-40:] + normalized[:40]
        yield from scan_line_for_secrets(cross_line, file_path, line_no)
        # Scan the current line independently
        yield from scan_line_for_secrets(raw_line, file_path, line_no)
        prev_normalized = normalized
```
The cross-line join concatenates the last 40 characters of the normalized previous line with the first 40 characters of the normalized current line β enough to reconstruct any secret split across a line boundary, while keeping memory bounded.
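A self-contained sketch of the lookback idea, using the split AWS key from the example above (simplified: "normalization" here is just whitespace collapsing):

```python
import re

_AWS_KEY_RE = re.compile(r"AKIA[0-9A-Z]{16}")


def scan_with_lookback(lines: list[str]) -> list[str]:
    hits: list[str] = []
    prev = ""
    for line in lines:
        norm = " ".join(line.split())        # simplified normalization
        # Join the tail of the previous line with the head of the current one.
        cross_line = prev[-40:] + norm[:40]
        if _AWS_KEY_RE.search(cross_line):
            hits.append(cross_line)
        prev = norm
    return hits


lines = ["api_key: >-", "  AKIA", "  IOSFODNN7EXAMPLE"]
assert scan_with_lookback(lines) == ["AKIAIOSFODNN7EXAMPLE"]
```

Neither source line matches on its own; the join reconstructs the full token.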
### Dual-Form Scanning

Even after normalization, Zenzic scans each line in two forms:

1. **Raw form** – the line exactly as it appears in the source, ensuring that normally
   formatted secrets are always caught with correct column positions for reporting.
2. **Normalized form** – after all 8 stages, ensuring that obfuscated secrets are
   reconstructed and matched.

Duplicate findings (same secret type on the same line in both forms) are suppressed
via a `seen: set[str]` de-duplication pass.
### ReDoS Prevention: The F2-1 Hardening

Regex patterns applied to pathological inputs can cause catastrophic backtracking – a ReDoS (Regular Expression Denial of Service) attack. A crafted Markdown file with a megabyte-long line could cause a regex engine to consume unbounded CPU.
Zenzic's F2-1 hardening establishes a maximum line length constant:
```python
_MAX_LINE_LENGTH: int = 1_048_576  # 1 MiB
```
Lines exceeding this limit are silently truncated before scanning. No secret longer than 1 MiB exists in practice; a line longer than 1 MiB is not legitimate documentation.
Additionally, all regex patterns used in `_SECRETS` undergo an eager ReDoS
pre-flight check at engine construction time (ZRT-002):
```python
def _assert_regex_canary(rule: BaseRule) -> None:
    """Verify that the rule's regex does not exhibit catastrophic backtracking."""
    # Applies a timing canary against a known-adversarial input.
    # Raises PluginContractError if the pattern exceeds the time budget.
```
Custom rules loaded via the `zenzic.rules` entry-point group are subject to the same
pre-flight check before the first file is scanned.
### The 9 Secret Families
Zenzic's Shield v0.7.0 detects credentials across 9 families:
| Family | Pattern | Example prefix |
|---|---|---|
| OpenAI API key | `sk-[a-zA-Z0-9]{48}` | `sk-a1B2c3...` |
| GitHub token | `gh[pousr]_[a-zA-Z0-9]{36}` | `ghp_`, `gho_`, `ghu_`, `ghs_`, `ghr_` |
| AWS access key | `AKIA[0-9A-Z]{16}` | `AKIAIOSFODNN7EXAMPLE` |
| Stripe live key | `sk_live_[0-9a-zA-Z]{24}` | `sk_live_4xK8...` |
| Slack token | `xox[baprs]-[0-9a-zA-Z]{10,48}` | `xoxb-`, `xoxa-`, ... |
| Google API key | `AIza[0-9A-Za-z\-_]{35}` | `AIzaSyB...` |
| Private key header | `-----BEGIN [A-Z ]+ PRIVATE KEY-----` | RSA, EC, DSA |
| Hex-encoded payload | `(?:\\x[0-9a-fA-F]{2}){3,}` | `\x41\x4b\x49\x41...` |
| GitLab PAT | `glpat-[A-Za-z0-9\-_]{20,}` | `glpat-aBcDeFgHiJkL...` |
Each pattern is pre-compiled at import time – zero compilation overhead during scanning.
The set is additive: new families are added by appending to the `_SECRETS` list.
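A reduced sketch of the pattern table as a scannable list (three of the nine families; the structure is assumed, not Zenzic's actual `_SECRETS`):

```python
import re

# Three of the nine families, pre-compiled once at import time (illustrative).
_SECRETS: list[tuple[str, re.Pattern[str]]] = [
    ("AWS access key", re.compile(r"AKIA[0-9A-Z]{16}")),
    ("GitHub token", re.compile(r"gh[pousr]_[a-zA-Z0-9]{36}")),
    ("Stripe live key", re.compile(r"sk_live_[0-9a-zA-Z]{24}")),
]


def scan_line_for_secrets(line: str) -> list[str]:
    """Return the names of every secret family matched in the line."""
    return [family for family, rx in _SECRETS if rx.search(line)]


assert scan_line_for_secrets("key: AKIAIOSFODNN7EXAMPLE") == ["AWS access key"]
assert scan_line_for_secrets("plain prose, no credentials") == []
```

Appending a new `(name, compiled_pattern)` pair is the entire cost of adding a family.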
### Exit Code 2: The Sacred Exit

Any detection by the Shield causes Zenzic to exit with code 2. This exit code is
non-suppressible – it cannot be silenced by `--exit-zero`, `fail-on-error: false`,
or any configuration flag.
The rationale: a CI system that can be configured to ignore credential exposure is not a security gate. It is theater. Exit code 2 is the guarantee that the security contract cannot be bypassed by configuration drift or operator error.
| Exit code | Meaning | Suppressible? |
|---|---|---|
| 0 | All checks passed | – |
| 1 | Quality findings (broken links, orphans, placeholders) | Yes |
| 2 | Security breach (Shield: credential detected) | NEVER |
| 3 | Fatal breach (Blood Sentinel: path traversal) | NEVER |
The Shield operates in Pass 1A – before any structural analysis. A file that triggers exit 2 does not proceed to link validation or orphan detection. The Sentinel reports the breach and stops.
## Act IV – Blood Sentinel: Kernel-Level Sandboxing

### Path Traversal in CI/CD
In a CI/CD pipeline, Zenzic runs in a containerized runner. The runner has access to:

- SSH keys: `/home/runner/.ssh/id_rsa`
- System secrets: `/etc/passwd`, `/etc/shadow`
- Runner tokens: `/var/run/secrets/kubernetes.io/serviceaccount/token`
A Markdown file can embed a path traversal attack:

```markdown
[Evil link](../../../../etc/passwd)
[Another attack](../../../home/runner/.ssh/id_rsa)
```
A documentation site that renders these files to HTML becomes a vector for exfiltrating runner secrets, depending on the deployment mechanism and how static assets are served.
More critically: Zenzic itself reads file contents to validate them. A path traversal
in a link target could cause Zenzic to validate `/etc/passwd` as a documentation file
and include its content in a report. This is the tool-level attack – abusing the
validator to read secrets from the runner filesystem.
The Blood Sentinel prevents both categories.
### The `os.path.normpath` Collapse

The defense is built into `InMemoryPathResolver._build_target()`:
```python
def _build_target(self, source_file: Path, path_part: str) -> str:
    if path_part.startswith("/"):
        raw = self._root_str + os.sep + path_part.lstrip("/")
    elif path_part.startswith("@site/docs/"):
        raw = self._root_str + os.sep + path_part[len("@site/docs/"):]
    elif path_part.startswith("@site/"):
        raw = self._repo_root_str + os.sep + path_part[len("@site/"):]
    else:
        raw = str(source_file.parent) + os.sep + path_part
    return os.path.normpath(raw)  # ← The collapse
```
`os.path.normpath()` is pure string arithmetic – no syscalls, no `stat()`, no
`readlink()`. It collapses all `.` and `..` segments mathematically.
The result:

```text
source:    /repo/docs/guide/install.mdx
link:      ../../../../etc/passwd

raw      = /repo/docs/guide/../../../../etc/passwd
normpath → /etc/passwd
```
The target string `/etc/passwd` is produced before any filesystem call is made.
Then the Shield check:
```python
shield_ok = (
    target_str == self._root_str
    or target_str.startswith(self._root_prefix)
)
if not shield_ok:
    return PathTraversal(raw_href=href)
```
`/etc/passwd` does not start with `/repo/docs/` – `PathTraversal` is returned
immediately. Zero filesystem access. Zero data exposure. Exit 3.
### The Multi-Root Perimeter

Zenzic handles multi-locale Docusaurus projects where both `docs/` and
`i18n/it/docusaurus-plugin-content-docs/current/` contain cross-referencing files.
The `InMemoryPathResolver` constructor accepts an `allowed_roots` parameter – a list
of additional authorized boundaries:
```python
_extra = [self._coerce_path(r) for r in (allowed_roots or [])]
_pairs: list[tuple[str, str]] = []
for _r in [self._root_dir, *_extra]:
    _s = str(_r)
    _pairs.append((_s, _s + os.sep))
self._allowed_root_pairs: tuple[tuple[str, str], ...] = tuple(_pairs)
```
The Shield check becomes:

```python
shield_ok = any(
    target_str == root_str or target_str.startswith(root_prefix)
    for root_str, root_prefix in self._allowed_root_pairs
)
```
A relative link from `docs/guide.mdx` to `../i18n/it/guide.mdx` is valid only if
`i18n/it/docusaurus-plugin-content-docs/current/` is in `allowed_roots`. Without
explicit authorization, it produces `PathTraversal`. The perimeter is explicitly
declared, not inferred.
### The `@site/` Alias: Security Analysis

Docusaurus allows `@site/` as an alias for the project root in import statements
and static asset references. Zenzic maps this alias to `repo_root`:
```python
elif path_part.startswith("@site/docs/"):
    raw = self._root_str + os.sep + path_part[len("@site/docs/"):]
elif path_part.startswith("@site/"):
    raw = self._repo_root_str + os.sep + path_part[len("@site/"):]
```
A path like `@site/../etc/passwd` becomes:

```text
raw      = /repo/../etc/passwd
normpath → /etc/passwd
```
The normpath collapse happens before the perimeter check. `@site/` is not an
escape hatch from the Blood Sentinel. It is an alias for a specific root, and all
`..` traversals through it are collapsed and checked identically.
### Exit Code 3: Non-Negotiable Termination

Path traversal findings (Z202/Z203) cause exit 3. Like exit 2, this is non-suppressible. A path traversal in a documentation source is not a quality finding. It is an attempted perimeter breach. The Sentinel terminates.

| Code | Name | Meaning |
|---|---|---|
| Z202 | PATH_TRAVERSAL | Confirmed: resolved path escapes docs_root |
| Z203 | PATH_TRAVERSAL_SUSPICIOUS | Unresolvable path with traversal segments |
The distinction: Z202 is triggered when `normpath` produces a path that fails the prefix
check. Z203 is triggered when the href contains `../` segments but cannot be fully
resolved (e.g., missing fragments, malformed URLs). Both produce exit 3.
## Act V – The Docusaurus Adapter: `isCategoryIndex` and URL Collapsing

### The Routing Problem
Docusaurus maps source files to URLs through a set of rules that are not always obvious to documentation authors. Zenzic must replicate these rules exactly in Python to produce correct VSM entries.
The most complex rule is `isCategoryIndex` collapsing: when a file's name matches certain patterns, its URL is collapsed to the parent directory, not a file slug.
### The Three Collapsing Cases

From `_docusaurus.py`, the collapsing logic:
```python
if parts:
    file_name_lower = parts[-1].lower()
    parent_name_lower = parts[-2].lower() if len(parts) >= 2 else None
    if (
        file_name_lower == "index"                    # Case 1: index file
        or file_name_lower == "readme"                # Case 2: README file
        or (
            parent_name_lower is not None
            and file_name_lower == parent_name_lower  # Case 3: folder-match
        )
    ):
        parts = parts[:-1]  # collapse to parent
```
**Case 1 – Index collapse:**

```text
docs/guide/index.mdx → /docs/guide/
docs/index.mdx       → /docs/
```

**Case 2 – README collapse:**

```text
docs/guide/README.md → /docs/guide/
docs/README.md       → /docs/
```

**Case 3 – Folder-match collapse (isCategoryIndex):**

```text
docs/guide/guide.mdx → /docs/guide/   (filename == parent dirname)
docs/api/api.md      → /docs/api/
```
This third case is frequently surprising to authors: a file named after its parent directory is silently collapsed to the directory URL by Docusaurus. Zenzic replicates this behavior exactly, producing the correct canonical URL in the VSM.
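The three cases condense into a small, testable function; this is an illustrative re-implementation, not the adapter's actual `map_url()` (slugs, versions, and i18n are omitted):

```python
def map_url(rel_posix: str, route_base_path: str = "docs") -> str:
    """Derive a canonical URL from a docs-relative POSIX path (sketch)."""
    stem = rel_posix.rsplit(".", 1)[0]          # strip .md / .mdx
    parts = stem.split("/")
    name = parts[-1].lower()
    parent = parts[-2].lower() if len(parts) >= 2 else None
    if name in ("index", "readme") or name == parent:
        parts = parts[:-1]                      # collapse to parent directory
    tail = "/".join(parts)
    return "/" + route_base_path + ("/" + tail if tail else "") + "/"


assert map_url("guide/index.mdx") == "/docs/guide/"
assert map_url("guide/README.md") == "/docs/guide/"
assert map_url("guide/guide.mdx") == "/docs/guide/"   # folder-match collapse
assert map_url("guide/setup.mdx") == "/docs/guide/setup/"
assert map_url("index.mdx") == "/docs/"
```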
### URL Priority: Frontmatter Slug First

Before filesystem derivation, Zenzic checks for a `slug:` frontmatter declaration:
```python
# Stage 1: frontmatter slug override
slug = self._slug_map.get(rel_posix)
if slug is not None:
    if slug.startswith("/"):
        # Absolute slug: prefix with routeBasePath
        rbp = self._route_base_path or "docs"
        return "/" + rbp + slug.rstrip("/") + "/"
    else:
        # Relative slug: replace last path segment
        parent = rel.parent
        return "/" + parent.as_posix() + "/" + slug.strip("/") + "/"
```
The full URL resolution priority:

1. Frontmatter `slug:` – absolute or relative override
2. `isCategoryIndex` – index/README/folder-match collapse
3. Extension stripping – `.md`/`.mdx` removed
4. `routeBasePath` prefix – default `"docs"`, configurable
### The `provides_index()` Contract

The `provides_index(directory_path)` method determines whether a directory has a
landing page – required for the Z401 (MISSING_DIRECTORY_INDEX) check:
```python
def provides_index(self, directory_path: Path) -> bool:
    index_files = ("index.md", "index.mdx", "README.md", "README.mdx")
    if any((directory_path / f).exists() for f in index_files):
        return True
    category_json = directory_path / "_category_.json"
    if category_json.exists():
        data = json.loads(category_json.read_text(encoding="utf-8"))
        link = data.get("link", {})
        return isinstance(link, dict) and link.get("type") == "generated-index"
    return False
```
A directory provides an index when:

- an `index.md`, `index.mdx`, `README.md`, or `README.mdx` exists inside it, or
- a `_category_.json` declares `"link": { "type": "generated-index" }`, causing
  Docusaurus to auto-generate a category index page.
I/O is permitted in `provides_index()` because it is called once per directory during
the discovery phase – never inside per-link or per-file hot loops.
### The Three-Surface Harvester
For orphan detection, Zenzic's Docusaurus adapter aggregates navigation paths from three sources:
```python
def get_nav_paths(self) -> frozenset[str]:
    """Merge sidebar + navbar + footer into a single navigable path set."""
    return (
        self._parse_sidebars()             # sidebars.ts / sidebars.js
        | self._parse_config_navigation()  # navbar.items + footer.links
    )
```
Sidebar parsing (`_parse_sidebars()`): reads `sidebars.ts` or `sidebars.js` via
pure-Python regex. Strips JS-style line and block comments before parsing. Handles
both `type: 'doc'` explicit entries and bare string IDs.

Config navigation (`_parse_config_navigation()`): reads `docusaurus.config.ts`
via regex, extracts `to:` URL paths from `navbar.items` and `footer.links`, strips
`baseUrl` and `routeBasePath` prefixes, and probes for `.md`/`.mdx` files on disk.
A file is `ORPHAN_BUT_EXISTING` only if absent from sidebar AND navbar AND footer.
A changelog linked only in the navbar is `REACHABLE`. A legal notice linked only in
the footer is `REACHABLE`. This is R21 – UX-Discoverability.
### The Slug Law: Physical Consistency

Zenzic's own documentation enforces the Slug Law (ADR-003): no `slug:` frontmatter
that diverges from the physical file path. The rationale is architectural: the
autogenerated sidebar uses `type: 'autogenerated'` – it resolves URLs from file paths.
A diverged `slug:` creates a URL that the sidebar cannot resolve, causing navigation
failures without a build-time error.

The VSM enforces this indirectly: if a `slug:` produces a URL that no sidebar entry
references, the file is `ORPHAN_BUT_EXISTING`. The Slug Law converts this from a
silent failure to a Zenzic finding.
## Act VI – The Rule Engine: Adaptive Parallelism

### The `AdaptiveRuleEngine`

Custom rules in Zenzic – declared in `[[custom_rules]]` or implemented as Python
classes via the `zenzic.rules` entry-point group – are applied through the
`AdaptiveRuleEngine`:
```python
class AdaptiveRuleEngine:
    def __init__(self, rules: Sequence[BaseRule]) -> None:
        for rule in rules:
            _assert_pickleable(rule)    # eager pickle validation
            _assert_regex_canary(rule)  # ZRT-002: ReDoS pre-flight
        self._rules = rules

    def run(self, file_path: Path, text: str) -> list[RuleFinding]:
        """Pure function: file path + text → findings. No I/O."""
        findings: list[RuleFinding] = []
        for rule in self._rules:
            try:
                findings.extend(rule.check(file_path, text))
            except Exception as exc:
                # Rule failures are caught and converted to RULE-ENGINE-ERROR findings.
                # One faulty plugin cannot abort the scan of the entire docs tree.
                findings.append(RuleFinding(...))
        return findings
```
Rules are validated eagerly at construction time, before the first file is scanned. A rule that fails pickle serialization is rejected immediately β not silently inside a worker process during a long parallel scan.
### The 50-File Threshold
Zenzic's scanner switches between sequential and parallel execution based on the number of files:
```python
ADAPTIVE_PARALLEL_THRESHOLD: int = 50  # in scanner.py

use_parallel = workers != 1 and len(md_files) >= ADAPTIVE_PARALLEL_THRESHOLD
```
Below 50 files: sequential execution. The overhead of spawning a
`ProcessPoolExecutor` – approximately 200–400 ms on a cold interpreter – exceeds
the parallelism benefit for small documentation sets.

At or above 50 files, a `ProcessPoolExecutor` is used:
```python
with concurrent.futures.ProcessPoolExecutor(max_workers=actual_workers) as executor:
    futures_map = {
        executor.submit(_worker, item): item[0]
        for item in work_items
    }
    for future in concurrent.futures.as_completed(futures_map):
        results.extend(future.result())
```
Each file is dispatched to an independent worker process. The worker receives a
serialized `(file_path, config, rules)` tuple via pickle – which is why the eager
pickle validation at `AdaptiveRuleEngine` construction is load-bearing. A
non-pickleable lambda in a custom rule would silently fail inside the worker process;
the eager check catches it in the main process at startup.
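The dispatch decision itself is a pure predicate and can be exercised in isolation (a sketch mirroring the `scanner.py` excerpt above):

```python
ADAPTIVE_PARALLEL_THRESHOLD: int = 50


def should_parallelize(n_files: int, workers: int) -> bool:
    # workers == 1 forces sequential mode regardless of corpus size.
    return workers != 1 and n_files >= ADAPTIVE_PARALLEL_THRESHOLD


assert not should_parallelize(10, workers=8)    # small site: pool overhead dominates
assert should_parallelize(50, workers=8)        # threshold reached: fan out
assert not should_parallelize(5000, workers=1)  # explicit sequential override
```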
### Pure Function Discipline: Why It Matters for Parallelism

Pillar 3 – Pure Functions First – is not a style preference. It is an architectural requirement for correctness under parallelism.
A rule that holds mutable state between `check()` calls (e.g., a counter, a cache)
would produce data races when two workers process files simultaneously. A rule that
makes I/O calls inside `check()` would suffer from TOCTOU (time-of-check to
time-of-use) races in a parallel context.
Pure functions – deterministic, stateless, side-effect-free – are safe to execute
concurrently without synchronization. The `AdaptiveRuleEngine` guarantees this by
contract: any rule that cannot be expressed as a pure function cannot satisfy the
`PluginContractError` validation and will not be admitted to the engine.
### The Pickle Serialization Check

Custom rules loaded via `entry_points(group="zenzic.rules")` are validated with:
```python
def _assert_pickleable(rule: BaseRule) -> None:
    try:
        pickle.dumps(rule)
    except Exception as exc:
        raise PluginContractError(
            f"Rule '{rule.rule_id}' cannot be pickled and is incompatible with "
            f"multiprocessing: {exc}"
        ) from exc
```
This is an eager contract check: the error is raised before any file is touched,
with a clear message pointing to the rule that failed. Without this check, the failure
would manifest as a cryptic `BrokenPipeError` or `EOFError` inside a worker process
at scan time – far harder to diagnose.
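The eager check is easy to reproduce; `PluginContractError` and the sample rule below are redefined locally to keep the example self-contained:

```python
import pickle


class PluginContractError(Exception):
    pass


def _assert_pickleable(rule) -> None:
    try:
        pickle.dumps(rule)
    except Exception as exc:
        raise PluginContractError(f"rule is not picklable: {exc}") from exc


class CountTodos:
    """A well-behaved rule: plain class instances pickle fine."""
    rule_id = "X900"


_assert_pickleable(CountTodos())         # passes silently

failed = False
try:
    _assert_pickleable(lambda text: [])  # lambdas cannot be pickled
except PluginContractError:
    failed = True
assert failed
```

The lambda is rejected in the main process, with a readable error, before any worker is spawned.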
## Act VII – Enterprise Integration: SARIF and the Quality Gate

### SARIF 2.1.0: Documentation in Your Security Dashboard
SARIF (Static Analysis Results Interchange Format) is the standard output format for security tools consumed by GitHub Code Scanning, Azure DevOps, and other CI/CD platforms.
Zenzic produces valid SARIF 2.1.0 with:

```shell
zenzic check all ./docs --format sarif > zenzic.sarif
```
The SARIF output includes:
- Tool descriptor with Zenzic version and URI
- Rules array with one entry per Zxxx code found (ID, name, helpUri, severity)
- Results array with location (file + line + column), message, and level
A minimal SARIF result for a broken link:

```json
{
  "ruleId": "Z101",
  "level": "error",
  "message": {
    "text": "Z101 LINK_BROKEN: './install.mdx' → './guide/setup.mdx' does not exist"
  },
  "locations": [{
    "physicalLocation": {
      "artifactLocation": { "uri": "docs/install.mdx" },
      "region": { "startLine": 42, "startColumn": 12 }
    }
  }]
}
```
Upload to GitHub Code Scanning:
```yaml
name: Documentation Integrity Gate
on: [push, pull_request]
jobs:
  sentinel:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Zenzic Sentinel
        run: uvx zenzic check all ./docs --format sarif > zenzic.sarif
      - name: Upload to GitHub Security
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: zenzic.sarif
        if: always()  # upload even when Zenzic fails
```
The if: always() is critical: when Zenzic exits with code 1 (quality findings), the
step is marked as failed, but the SARIF upload must still execute to surface the
findings in the Security tab. Without if: always(), a failed step would abort before
uploading, producing silence instead of visibility.
For teams using zenzic-action:
```yaml
- uses: PythonWoods/zenzic-action@v1
  with:
    version: "0.7.0"
    format: sarif
    upload-sarif: "true"
```
The action handles the SARIF upload and the if: always() semantics automatically,
including SARIF integrity validation: if the SARIF file is truncated by runner OOM
or SIGKILL, the action emits a ::warning annotation rather than uploading a
false-clean result (Output-First Semantics, ADR-004 in zenzic-action).
Machine Silence: Rule R20
When --format sarif or --format json is active, Zenzic enforces Machine Silence
(R20): zero Rich banners, headers, or informational panels are written to stdout.
The output stream is a machine-readable format and must remain 100% valid against its
schema.
This is enforced at the CLI level:
```python
_MACHINE_FORMATS = frozenset({"json", "sarif"})

if output_format not in _MACHINE_FORMATS:
    print_header(console)
```
A script that pipes zenzic check all --format json | jq '.findings' receives
valid JSON with no banner contamination.
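A consumer script can rely on that guarantee. The sketch below assumes only the top-level findings array from the jq example; the per-finding field names (code, path, line) are illustrative, not the documented schema:

```python
import json

def security_findings(report_json: str) -> list[dict]:
    """Filter a `--format json` report down to Z2xx (security) findings.

    Assumes a top-level "findings" array, as in the jq example; the
    per-finding field names are illustrative assumptions.
    """
    report = json.loads(report_json)
    return [
        f for f in report.get("findings", [])
        if f.get("code", "").startswith("Z2")
    ]

# A fabricated report standing in for real `--format json` output.
sample = json.dumps({
    "findings": [
        {"code": "Z101", "path": "docs/a.mdx", "line": 4},
        {"code": "Z201", "path": "docs/b.mdx", "line": 9},
    ]
})
```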
The Quality Score: zenzic score
Beyond binary pass/fail, Zenzic provides a quality score, a 0–100 metric computed from the weighted sum of findings across all check categories:
```shell
zenzic score ./docs
```
The score can be used as a regression gate:
```shell
zenzic diff ./docs  # compare current score to last snapshot
```
diff compares the current scan result against a stored snapshot (zenzic.snapshot.json
in the repo root). A score regression (e.g., score drops from 97 to 91) causes a
non-zero exit, enabling CI to block merges that degrade documentation quality.
This is the Quality Gate pattern: not a binary pass/fail, but a tracked trend
with a configurable failure threshold (fail_under in zenzic.toml).
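The scoring mechanics can be approximated as follows. The weights, normalization, and exact formula here are illustrative assumptions, not Zenzic's internal algorithm:

```python
# Hypothetical per-category weights; the real weighting inside
# `zenzic score` is internal to Zenzic and may differ.
WEIGHTS = {"Z1": 3.0, "Z2": 10.0, "Z3": 2.0, "Z4": 1.0, "Z5": 0.5}

def quality_score(finding_codes: list[str], total_files: int) -> float:
    """0-100: a perfect score minus a weighted, size-normalized penalty."""
    penalty = sum(WEIGHTS.get(code[:2], 1.0) for code in finding_codes)
    return max(0.0, 100.0 - 100.0 * penalty / max(total_files * 10, 1))

def gate(score: float, fail_under: float = 95.0) -> int:
    """Exit-code semantics for CI: non-zero when below the fail_under gate."""
    return 0 if score >= fail_under else 1
```

The gate mirrors the fail_under behavior: a clean scan passes, a heavily penalized one produces a non-zero exit for CI to block on.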
The Diagnostic Code Registry: Zxxx
Every Zenzic finding carries a Zxxx code from core/codes.py, the single source
of truth for the diagnostic registry.
The full registry by category:
| Range | Category | Codes |
|---|---|---|
| Z1xx | Link Integrity | Z101 LINK_BROKEN, Z102 ANCHOR_MISSING, Z103 UNREACHABLE_LINK, Z104 FILE_NOT_FOUND, Z105 ABSOLUTE_PATH, Z106 ALT_TEXT_MISSING |
| Z2xx | Security | Z201 SHIELD_SECRET, Z202 PATH_TRAVERSAL, Z203 PATH_TRAVERSAL_SUSPICIOUS |
| Z3xx | Reference Integrity | Z301 DANGLING_REF, Z302 DEAD_DEF, Z303 CIRCULAR_LINK |
| Z4xx | Structure | Z401 MISSING_DIRECTORY_INDEX, Z402 ORPHAN_PAGE, Z403 SNIPPET_UNREACHABLE, Z404 CONFIG_ASSET_MISSING |
| Z5xx | Content Quality | Z501 PLACEHOLDER, Z502 SHORT_CONTENT, Z503 SNIPPET_ERROR, Z504 QUALITY_REGRESSION |
| Z9xx | Engine / System | Z901 RULE_ERROR, Z902 RULE_TIMEOUT, Z903 UNUSED_ASSET, Z904 DISCOVERY_ERROR |
The codes are stable across versions. A CI system that filters findings by Z201
(credentials) can do so independently of Zenzic version bumps. The codes are the
documented API surface for tooling integration.
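Because the codes are stable, downstream tooling can filter on them mechanically. A sketch using the standard SARIF 2.1.0 shape (runs[].results[].ruleId), with a fabricated sample document:

```python
import json

def findings_in_range(sarif_text: str, prefix: str) -> list[dict]:
    """Select SARIF 2.1.0 results whose ruleId falls in a Zxxx range,
    e.g. prefix "Z2" for the security category."""
    sarif = json.loads(sarif_text)
    return [
        result
        for run in sarif.get("runs", [])
        for result in run.get("results", [])
        if result.get("ruleId", "").startswith(prefix)
    ]

# A minimal fabricated SARIF document for demonstration.
sample_sarif = json.dumps({
    "version": "2.1.0",
    "runs": [{"results": [
        {"ruleId": "Z101", "level": "error"},
        {"ruleId": "Z201", "level": "error"},
    ]}],
})
```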
Act VIII — Performance: The Numbers
The Adaptive Parallelism Benchmark
The 50-file threshold is a conservative heuristic derived from empirical measurement:
| File count | Sequential (ms) | Parallel (ms) | Crossover |
|---|---|---|---|
| 10 | 28 | 380 | Sequential wins |
| 25 | 71 | 390 | Sequential wins |
| 50 | 142 | 395 | Roughly equal |
| 100 | 284 | 412 | Parallel wins |
| 500 | 1,420 | 680 | Parallel wins (2×) |
| 1,000 | 2,840 | 920 | Parallel wins (3×) |
| 10,000 | 28,400 | 4,200 | Parallel wins (6.7×) |
Measurements on a 4-core runner, cold start. Custom rules with moderate complexity.
The ~380 ms fixed overhead of ProcessPoolExecutor spawn is the reason the threshold
is not set lower. A threshold of 10 files would cause sequential scans of small repos
to pay the spawn cost without benefit.
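The dispatch decision itself is small. A sketch of the shape, with the threshold constant taken from the benchmark above:

```python
from concurrent.futures import ProcessPoolExecutor

PARALLEL_THRESHOLD = 50  # below this, the ~380 ms spawn overhead beats the gain

def scan(files: list[str], check_one, workers: int = 4) -> list:
    """Run check_one over files: sequentially for small sets, in a
    process pool above the threshold. check_one must be a picklable,
    pure callable, the same constraint the rule contract enforces."""
    if len(files) < PARALLEL_THRESHOLD:
        return [check_one(f) for f in files]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(check_one, files))
```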
VSM Construction vs Link Validation
The scan time breakdown for a 1,000-file project:
```text
Discovery (walk + read):  ~450 ms  (I/O bound: disk sequential)
VSM construction:         ~120 ms  (CPU bound: adapter URL mapping)
Anchor cache build:        ~80 ms  (CPU bound: heading slug extraction)
Link validation:           ~95 ms  (CPU bound: 50,000 hash lookups)
Orphan detection:          ~35 ms  (CPU bound: frozenset intersection)
Shield scan:              ~210 ms  (CPU bound: regex over 1M lines)
Report rendering:          ~40 ms  (CPU bound: Rich formatting)
─────────────────────────────────────
Total:                  ~1,030 ms
```
:::note Benchmark conditions
These figures are for synthetic Markdown files (minimal frontmatter, no JSX, ~10
lines of prose). Real-world MDX files with frontmatter, JSX components, tables, and
dense link graphs cost significantly more per file. Measured against the real
zenzic-doc project (59 MDX pages): ~7 ms/file vs ~0.5 ms/file for synthetic files.
Run python scripts/benchmark.py --repo <path> to measure your own project.
:::
Link validation at 50,000 links takes 95 ms, less than the report rendering phase.
This is the O(1) hash map in practice: 50,000 dict.get() calls at ~1.9 µs each.
Memory Profile
The VSM for a 10,000-file project:
```text
Route objects:  10,000 × ~280 bytes   = ~2.8 MB
Anchor cache:   10,000 × ~1,200 bytes = ~12.0 MB
md_contents:    10,000 × ~8,000 bytes = ~80.0 MB
─────────────────────────────────────────────────
Total RSS:                              ~95 MB
```
The dominant cost is md_contents: the raw Markdown text held in memory for the
Shield scan. Zenzic holds all files in memory simultaneously to avoid repeated I/O
during multi-pass analysis. For projects above 50,000 files, a chunked processing
mode is planned for a future release.
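The estimate above can be reproduced from the per-file figures; the byte sizes are the approximations quoted in the table, not measured object sizes:

```python
# Approximate per-file costs quoted above (estimates, not measurements).
PER_FILE_BYTES = {
    "route": 280,          # Route object
    "anchors": 1_200,      # anchor cache entry
    "md_contents": 8_000,  # raw Markdown text (the dominant term)
}

def vsm_rss_mb(file_count: int) -> float:
    """Back-of-envelope resident memory for the VSM, in megabytes."""
    return file_count * sum(PER_FILE_BYTES.values()) / 1_000_000
```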
Cross-Platform CI Matrix
Zenzic's test suite runs a 3×3 platform matrix on every commit:
```yaml
OS: [ubuntu-latest, windows-latest, macos-latest]
Python: ["3.11", "3.12", "3.13"]
```
9 parallel CI jobs. All 1,342+ tests must pass on all 9 combinations. This is the portability guarantee: Zenzic's output is identical across all platforms. A scan that passes on Ubuntu passes on macOS and Windows, which is critical for teams using heterogeneous development environments.
Act IX — The Adapter Contract: Extending Zenzic
The BaseAdapter Protocol
Zenzic's Core (validator.py, scanner.py) contains zero engine-name references.
This is the Purity Protocol: Rule R21 (Protocol Sovereignty). Any engine-specific
behavior must be declared via the AdapterProtocol and queried by the Core.
The adapter protocol (simplified):
```python
class AdapterProtocol(Protocol):
    def get_nav_paths(self) -> frozenset[str]:
        """Return navigable paths from all user-clickable surfaces."""
        ...

    def map_url(self, rel: Path) -> str:
        """Map a source file to its canonical URL."""
        ...

    def classify_route(self, rel: Path, nav_paths: frozenset[str]) -> RouteStatus:
        """Classify a route as REACHABLE, ORPHAN_BUT_EXISTING, IGNORED, or CONFLICT."""
        ...

    def provides_index(self, directory_path: Path) -> bool:
        """True when the directory will have a landing page."""
        ...

    def get_metadata_files(self) -> list[Path]:
        """Return Level 1 System Guardrail files (excluded from all checks)."""
        ...

    def get_link_scheme_bypasses(self) -> frozenset[str]:
        """Return URI schemes that bypass Z105 absolute-path validation."""
        ...
```
The Core calls adapter.get_nav_paths(). It receives a frozenset[str]. What
generated that frozenset (sidebars.ts, mkdocs.yml, or zensical.toml) is
invisible to the Core.
Adding a new adapter requires implementing this protocol. Adding an engine-specific
behavior by modifying validator.py is a protocol violation and will be rejected
in code review.
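To make the contract concrete, here is a hypothetical minimal adapter for a flat Markdown tree. The method set mirrors the protocol above, but the behavior (everything reachable, no metadata files, no bypass schemes) is illustrative and far simpler than any real Zenzic adapter; the RouteStatus return is shown as a plain string:

```python
from pathlib import Path

class FlatTreeAdapter:
    """Hypothetical adapter for a bare Markdown tree: every file is
    navigable and URLs mirror source paths."""

    def get_nav_paths(self) -> frozenset[str]:
        return frozenset()  # no navigation config: nothing declared clickable

    def map_url(self, rel: Path) -> str:
        return "/" + rel.with_suffix("").as_posix()

    def classify_route(self, rel: Path, nav_paths: frozenset[str]) -> str:
        return "REACHABLE"  # no nav contract, so nothing can be an orphan

    def provides_index(self, directory_path: Path) -> bool:
        return (directory_path / "index.md").exists()

    def get_metadata_files(self) -> list[Path]:
        return []  # nothing to declare as Level 1 Guardrails

    def get_link_scheme_bypasses(self) -> frozenset[str]:
        return frozenset()  # no engine-specific URI schemes
```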
The pathname:/// Bypass (Rule R16)
Docusaurus uses pathname:/// as a Diplomatic Courier: an escape hatch for linking
to static assets that are not part of the docs routing system:
```md
[Download PDF](pathname:///assets/whitepaper.pdf)
```
The Z105 gate (ABSOLUTE_PATH) normally fires on any path starting with /. The
pathname:/// URI scheme is exempt in Docusaurus mode:
```python
def get_link_scheme_bypasses(self) -> frozenset[str]:
    return frozenset({"pathname"})
```
The Core queries adapter.get_link_scheme_bypasses() before applying Z105. This is
R16 (Protocol Awareness) in action: engine-specific behavior declared in the
adapter, queried by the Core, with no if engine == "docusaurus" in Core logic.
In all other engines (MkDocs, Zensical, Standalone), pathname:/// is unrecognized
and triggers Z105 normally. The bypass is scoped precisely.
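The Core-side gate can be sketched as follows. This illustrates the decision order only, not Zenzic's actual implementation; external http(s)/mailto links are stubbed out here because they belong to a different check:

```python
def z105_fires(link: str, bypass_schemes: frozenset[str]) -> bool:
    """Decide whether the absolute-path gate trips for a link.

    A scheme declared by the active adapter is exempt; external targets
    are assumed to be handled elsewhere (illustrative simplification).
    """
    scheme, sep, rest = link.partition(":")
    if sep and scheme in bypass_schemes:
        return False  # e.g. pathname:/// under the Docusaurus adapter
    if sep and scheme in {"http", "https", "mailto"}:
        return False  # external target: out of Z105's scope
    path = rest if sep else link
    return path.startswith("/")
```

Under the Docusaurus adapter the bypass set is frozenset({"pathname"}); every other adapter returns an empty set, so the same link trips the gate there.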
Level 1 System Guardrails
Adapter metadata files (docusaurus.config.ts, mkdocs.yml, zensical.toml,
package.json, pyproject.toml) are declared as Level 1 System Guardrails
via get_metadata_files(). These files are:
- Permanently excluded from Z903 (UNUSED_ASSET) checks
- Permanently excluded from all quality checks
- Never presented to the user as orphans, placeholders, or short-content warnings
The rationale (Rule R13, Intelligent Perimeter): asking the user to manually exclude their own build configuration files from analysis is a failure of the tool, not a configuration task. The adapter knows what its metadata files are; the Core does not need to be told.
Act X — Getting Started
Immediate Verification (No Installation)
```shell
uvx zenzic lab
```
uvx resolves the latest Zenzic from PyPI, installs it in an isolated temporary
environment, and runs the interactive Lab. Seventeen Acts, each demonstrating a
distinct capability. The entire experience requires no project setup.
Start with Act 3, the Shield in action against a planted Stripe live key. Watch the Sentinel exit with code 2. That exit code is the promise.
Your First Scan
```shell
uvx zenzic check all ./docs
```
Zenzic will:
- Discover your documentation engine (Docusaurus, MkDocs, Zensical, or Standalone)
- Build the VSM from your source files
- Run the Shield across every line of every file
- Validate all internal links against the VSM
- Detect orphan pages via R21 (navbar + sidebar + footer analysis)
- Report all findings with Zxxx codes, file paths, and line numbers
On a 100-page Docusaurus site: expect 2β4 seconds, cold start.
Pinned CI Integration
```yaml
name: Documentation Integrity Gate
on:
  push:
    branches: [main]
  pull_request:
jobs:
  sentinel:
    runs-on: ubuntu-latest
    permissions:
      security-events: write  # required for SARIF upload
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v5
      - uses: PythonWoods/zenzic-action@v1
        with:
          version: "0.7.0"  # pinned for a deterministic CI gate
          format: sarif
          upload-sarif: "true"
```
Version pinning (version: "0.7.0") is mandatory for production pipelines. latest
is appropriate only for exploration: it introduces non-determinism into your CI gate.
zenzic.toml Configuration
```toml
docs_dir = "docs"
fail_under = 95  # quality score gate: fail if score drops below 95

# Excluded external URLs (temporary; remove after deployment)
excluded_external_urls = [
  "https://internal.corp.example.com/api",
]

# Excluded asset patterns (Docusaurus sidebar metadata)
excluded_assets = [
  "**/_category_.json",
]

[build_context]
engine = "docusaurus"
base_url = "/"
default_locale = "en"
locales = ["it", "fr"]
```
The 4-level configuration priority: CLI flags > zenzic.toml > pyproject.toml
[tool.zenzic] > built-in defaults. CLI flags always win. This allows temporary
overrides without modifying project configuration.
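The priority chain maps naturally onto a left-to-right lookup. A sketch using collections.ChainMap, with illustrative keys and default values (not Zenzic's real defaults):

```python
from collections import ChainMap

def resolve_config(cli: dict, zenzic_toml: dict, pyproject: dict) -> ChainMap:
    """Left-to-right lookup: the first layer that defines a key wins.
    Keys and default values here are illustrative assumptions."""
    defaults = {"docs_dir": "docs", "fail_under": 0, "format": "human"}
    return ChainMap(cli, zenzic_toml, pyproject, defaults)
```

A CLI flag shadows the same key in zenzic.toml, which shadows pyproject.toml, which shadows the built-in default, exactly the 4-level order described above.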
Standalone Mode
For projects with no build system (raw Markdown directories, GitHub wikis, plain doc trees):
```toml
docs_dir = "."

[build_context]
engine = "standalone"
```
In Standalone mode:
- Orphan detection (Z402) is disabled: there is no navigation contract
- Link validation still runs: broken links are broken regardless of engine
- The Shield still runs: credentials are credentials regardless of engine
- The Blood Sentinel still runs: path traversal is path traversal regardless of engine
The security guarantees are engine-independent. Only the navigation contract is scoped.
Brand Integrity: Z905 BRAND_OBSOLESCENCE
The fourth dimension of the Safe Harbor, beyond structural, security, and content correctness, is narrative integrity. A documentation suite that refers to a deprecated release codename has a different class of bug: it tells the wrong story.
Configure [project_metadata] in zenzic.toml to activate the Brand Integrity layer:
```toml
[project_metadata]
release_name = "Quartz"
obsolete_names = ["Obsidian"]
obsolete_names_exclude_patterns = ["CHANGELOG*.md", "adr-*.mdx"]
```
The zenzic:ignore Z905 escape hatch is precise by design: it applies to a single line,
not a whole file. A CHANGELOG entry that says "Released under the Obsidian codename"
is historical fact. An architecture page that describes the current system as
"Obsidian-based" is a lie that the source code has already corrected.
Every rule in the Quartz Core must pass a three-dimensional admission test before it ships: Structural Integrity (broken links, orphans, missing indices), Hardened Security (credentials, path traversal), or Technical Accessibility (machine-readable contracts for downstream tooling; Z505 is the canonical example). Rules that fail this filter (line length, list style, spelling) are deliberately out of scope. Zenzic is a Sentinel, not a Proofreader.
Epilogue: The Documentation is the Source
The engineering tradition treats documentation as secondary β a description of the system, not the system itself. This tradition is breaking down.
In 2026, documentation is:
- The primary interface for internal APIs in large organizations
- The trust signal that developers use to evaluate whether a library is maintained
- The compliance artifact that auditors examine in regulated industries
- The attack surface that adversaries probe for exposed credentials and path traversal
A documentation pipeline that trusts its input is not a pipeline. It is a hope.
Zenzic exists because the question "is this documentation correct?" is not the same question as "did this build succeed?" A build that succeeds on broken documentation has not validated anything. It has just run faster.
The Safe Harbor is not a metaphor. It is an architectural guarantee: every file that passes Zenzic's three layers β the Structural Validator, the Shield, and the Blood Sentinel β has been verified against the navigation contract of your specific build engine, scanned for all known credential formats with 8-stage normalization, and checked for path traversal against an explicitly declared perimeter.
That is the promise. Every exit-0 scan is the proof.
For the full engineering history of how these layers were designed, tested under AI-generated siege, and hardened across five sprints, read 🛡️ The Zenzic Chronicles →.
| Resource | Location |
|---|---|
| GitHub | github.com/PythonWoods/zenzic |
| Documentation | zenzic.dev |
| PyPI | pypi.org/project/zenzic |
| Lab | uvx zenzic lab |