Sentinel Guard: The Engineering of Documentation Integrity and Security

· 35 min read
PythonWoods
Creator of Zenzic

Every build engine you have ever used has made you a silent promise: "Trust me — if this builds, your documentation is correct."

That promise is architecturally broken. And your documentation is paying the price.


## Act I — The Thesis of Untrusted Input

### The Problem with Build Engines

Your documentation pipeline looks something like this:

```
Author → Markdown file → Build Engine → HTML → CDN → Reader
                              ↑
                  "Trust me, I'll build it."
```

Docusaurus, MkDocs, Zensical — these are generators. Their contract with you is explicit: "Give me source files; I will produce a static site." They are optimized for speed, for plugin extensibility, for theming. They are not optimized for validation. They trust the source files you give them.

That trust is the vulnerability.

What "Untrusted Input" Means in 2026​

The threat model for documentation has changed. In 2026, documentation sources come from:

- Human contributors via pull requests — who may not know your link structure
- AI-generated content — which plausibly invents URLs that sound real
- Automated refactors — which move files without updating cross-references
- External integrations — which inject content that may carry credential fragments

In a monorepo shared across teams, a single contributor committing a file that references a non-existent anchor, exposes an internal API key, or introduces a circular link dependency can silently corrupt the documentation of ten product areas simultaneously. The build engine will not catch it. The build engine does not try.

Zenzic's core design thesis, established in ADR-001, is: treat every Markdown file as untrusted input.

This is not a feature. It is a trust model. Every design decision in Zenzic derives from it.

### The Zero-Trust Documentation Pipeline

```
Author → Markdown file → Zenzic (Sentinel) → Build Engine → HTML → CDN → Reader
                                ↑
                "I trust nothing. I verify everything."
```

The Sentinel runs before the build. It does not trust the source. It constructs a complete in-memory model of your site — the Virtual Site Map — and validates every claim in your source files against that model. If the model says a link is broken, the CI gate fails. The build engine never sees the broken file.

The fundamental difference is not in features. It is in philosophy. Build engines are designed to succeed. Zenzic is designed to find where they would have failed.

### Supply Chain Attacks via Documentation

Consider a real scenario: your documentation site is a monorepo including third-party-contributed guides. One contributor submits a tutorial that includes:

## Setup

Configure your environment with your API key:

```yaml
api_key: "sk_live_<YOUR_STRIPE_KEY_HERE>"
```

The Stripe live key — `sk_live_*` — is a real credential format. Zenzic's Shield
catches it before the commit merges. The build engine builds it without comment.

This is the **supply chain attack surface in documentation**: external content that
carries secrets, adversarial links, or path traversal payloads, combined with a build
pipeline that trusts its input entirely.

Zenzic's response to this is the three-layer security architecture: the **Shield**
(credentials), the **Blood Sentinel** (paths), and the **Structural Validator** (graph
integrity). Each layer is independent, each runs on every scan, and none of them
can be disabled.

### The Three Pillars (Non-Negotiable)

These invariants are in Zenzic's design contract and cannot be overridden by
configuration:

**Pillar 1 — Lint the Source, Not the Build.** Analysis operates on raw Markdown and
configuration files, never on HTML output. Errors are caught before the build starts.

**Pillar 2 — Zero Subprocesses.** 100% pure Python. No `subprocess`, no `os.system`,
no Node.js execution. This guarantees reproducible results across platforms,
zero dependency on the build environment, and total portability.

**Pillar 3 — Pure Functions First.** Analysis logic is deterministic. I/O is isolated
at the edges (discovery and reporting). No I/O in hot-path loops.

---

## Act II — The VSM Engine: A Mental Map of Your Site

### What the Virtual Site Map Is

The Virtual Site Map (VSM) is Zenzic's central data structure. It is a complete
in-memory projection of your documentation site as a routing table: a mapping from
every canonical URL that your build engine would generate to a `Route` object.

The `Route` object carries:

```python
@dataclass(frozen=True, slots=True)
class Route:
    url: str                # canonical URL: "/docs/guide/install/"
    file_path: Path         # absolute path on disk: /repo/docs/guide/install.mdx
    status: RouteStatus
    anchors: frozenset[str]
    is_proxy: bool
    version: str | None
```

The `RouteStatus` can be one of four values:

| Status | Meaning |
|--------|---------|
| `REACHABLE` | File is navigable via at least one user-clickable surface |
| `ORPHAN_BUT_EXISTING` | File exists on disk but no navigation surface links to it |
| `IGNORED` | System file (e.g., `_category_.json`), not a content page |
| `CONFLICT` | Two files produce the same canonical URL: a build collision |
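A minimal sketch of how such a status might be declared (the exact enum in Zenzic's source may differ):

```python
from enum import Enum, auto

class RouteStatus(Enum):
    """Reachability classification for a single route in the VSM."""
    REACHABLE = auto()            # navigable via at least one click surface
    ORPHAN_BUT_EXISTING = auto()  # on disk, but no navigation links to it
    IGNORED = auto()              # system file, not a content page
    CONFLICT = auto()             # two files claim the same canonical URL
```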

### Why the VSM Is Necessary

Without the VSM, Zenzic could only answer: "does this file exist on disk?" That question is easy. `pathlib.Path.exists()` answers it in a single syscall.

The VSM enables Zenzic to answer a harder question: "would this link resolve in the rendered site, given how your specific build engine maps source files to URLs?"

Those are completely different questions.

Consider Docusaurus: a file at `docs/guide/index.mdx` is served at `/docs/guide/`, not at `/docs/guide/index/`. A link to `/docs/guide/index.html` would resolve to nothing in the browser, even though the file exists on disk.

Consider MkDocs: a file at `docs/api.md` with the `nav:` entry `- API: api.md` is reachable. The same file without a `nav:` entry is potentially an orphan, depending on the `nav:` configuration.

The VSM encodes these engine-specific routing rules in pure Python, without running the build engine.
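As an illustration of such a rule, a minimal Docusaurus-style mapping might look like this (a sketch covering only the index/README collapse, not Zenzic's adapter code):

```python
from pathlib import PurePosixPath

def docusaurus_url(rel_path: str, route_base: str = "docs") -> str:
    """Sketch of a Docusaurus-style source-to-URL mapping: strip the
    extension and collapse index/README files onto the parent directory."""
    parts = list(PurePosixPath(rel_path).with_suffix("").parts)
    if parts and parts[-1].lower() in {"index", "readme"}:
        parts = parts[:-1]  # index.mdx serves the directory URL itself
    return "/" + "/".join([route_base, *parts]) + "/"
```

With this sketch, `docusaurus_url("guide/index.mdx")` yields `/docs/guide/` while `docusaurus_url("guide/install.mdx")` yields `/docs/guide/install/` — the engine-specific distinction a plain `Path.exists()` check can never make.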

### Building the VSM: The Architecture

```
┌──────────────────────────────┐
│         build_vsm()          │
│   (I/O boundary — called     │
│       once per scan)         │
└──────────────┬───────────────┘
               │
┌──────────────▼───────────────┐
│  Adapter.get_route_info()    │
│ (engine-specific, per-file)  │
└──────────────┬───────────────┘
               │
   ┌───────────┴───────┬──────────────────┐
   │                   │                  │
┌──▼──────────────┐ ┌──▼─────────────┐ ┌──▼───────────────────┐
│  map_url(rel)   │ │ classify_route │ │   get_nav_paths()    │
│ (canonical URL) │ │ (reachability) │ │ (sidebar+nav+footer) │
└─────────────────┘ └────────────────┘ └──────────────────────┘
```

The `build_vsm()` function is the only I/O boundary — it iterates over every Markdown file in `docs_root` exactly once. All adapter calls are pure functions after that initial read. No file is touched again during link validation.

The VSM is a Python `dict[str, Route]` — a hash map keyed by canonical URL.

When the validator needs to check whether a link target exists, it calls:

```python
route = vsm.get(canonical_url)  # dict.get() — O(1) hash lookup
if route is None:
    ...  # Z104: FILE_NOT_FOUND
```

This means validating 10,000 links against a 10,000-page site is not O(N²) — it is 10,000 independent O(1) lookups. The VSM is built once (O(N)) and then queried indefinitely at O(1) per link.

Compare this to naive implementations that call `Path.exists()` for every link target: that is N×M syscalls for N links across M files, where each `stat()` call crosses the user-kernel boundary. At 50,000 links across a large documentation site, the difference between O(1) hash lookups and O(N×M) syscalls is the difference between a 3-second scan and a 90-second scan.

### The Anchor Cache

In addition to the URL map, the VSM builder constructs an anchor cache: a mapping from file path to the set of heading slugs that file declares.

```python
anchors_cache: dict[Path, set[str]] = {
    Path("/repo/docs/guide/install.mdx"): {
        "prerequisites", "installation", "next-steps"
    },
    ...
}
```

When a link contains a fragment (`/docs/guide/install/#next-steps`), Zenzic:

  1. Resolves the URL to a file path via the VSM (O(1))
  2. Checks the fragment against `anchors_cache[file_path]` (O(1) set lookup)

A broken anchor (Z102) is detected with two hash lookups. Zero I/O. Zero subprocess calls.
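The two-lookup check can be sketched as follows (function and variable names are illustrative, not Zenzic's API):

```python
from pathlib import Path
from types import SimpleNamespace

def check_anchor_link(url, fragment, vsm, anchors_cache):
    """Return an error code, or None if the link resolves. Pure lookups, no I/O."""
    route = vsm.get(url)                      # O(1): does the URL exist at all?
    if route is None:
        return "Z104"                         # FILE_NOT_FOUND
    if fragment and fragment not in anchors_cache.get(route.file_path, set()):
        return "Z102"                         # ANCHOR_BROKEN
    return None

# Hypothetical one-page site:
page = SimpleNamespace(file_path=Path("/repo/docs/guide/install.mdx"))
vsm = {"/docs/guide/install/": page}
anchors = {page.file_path: {"prerequisites", "installation", "next-steps"}}
```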

### Ghost Routes: i18n and Versioning

The VSM handles cases where a URL exists but no physical file produces it — what Zenzic calls Ghost Routes. There are two categories:

**i18n Ghost Routes**: Docusaurus generates locale-specific index pages (e.g., `/it/docs/`) automatically, even when no physical `it/index.mdx` exists. The VSM marks these as `is_proxy=True` and `status=REACHABLE`, because the build engine will generate them.

**Versioned Routes**: Zenzic's Docusaurus adapter uses an internal `_version_` sentinel prefix to track versioned documentation trees. A file at `docs/versioned_docs/version-0.6/guide.md` is indexed as `_version_/0.6/guide.md` in the VSM and served at `/docs/0.6/guide/` — transparent to the validator.

In both cases, the VSM's answer is correct: the URL is reachable for a reader. No physical file is required.

### Collision Detection

Two source files can produce the same canonical URL — a build-time error in Docusaurus and MkDocs. The VSM detects this during construction. Because `Route` is a frozen dataclass, conflicting entries are rebuilt with `dataclasses.replace()` rather than mutated in place:

```python
from dataclasses import replace

def _detect_collisions(routes: list[Route]) -> list[Route]:
    seen: dict[str, Route] = {}
    conflicted: set[str] = set()
    for route in routes:
        if route.url in seen:
            conflicted.add(route.url)
        else:
            seen[route.url] = route
    # Route is frozen: conflicting entries are rebuilt, never mutated.
    return [
        replace(r, status=RouteStatus.CONFLICT) if r.url in conflicted else r
        for r in routes
    ]
```

A CONFLICT route surfaces as a Zenzic finding before the build runs, preventing the silent data loss that occurs when two files compete for the same URL.


## Act III — The Shield: 8 Stages of Truth

### The Problem with Naive Secret Detection

A naive credential scanner applies regex patterns line by line:

```python
if re.search(r"AKIA[0-9A-Z]{16}", line):
    flag_secret()
```

This works when the secret is written plainly. In documentation, secrets are rarely written plainly. They appear in:

- Markdown tables: `| Key | AKIA | 1234567890ABCDEF |`
- Concatenated strings: `` `AKIA` + `1234ABCD5678EFGH` ``
- HTML-entity encoded values: `&#65;&#75;&#73;&#65;1234567890ABCDEF`
- Unicode-obfuscated text: `A\u200bK\u200bI\u200bA1234567890ABCDEF` (zero-width spaces)
- Comment-interleaved tokens: `ghp_ABC{/* comment */}DEF`
- Cross-line YAML scalars: a key split across two lines by a folded block

Zenzic's Shield is designed to defeat all of these patterns. It does so through a normalization pipeline applied before regex matching.

### The 8 Stages of Normalization

The `_normalize_line_for_shield()` function applies these transformations in strict order:

#### Stage 1 — Unicode Format Character Stripping (ZRT-006)

```python
normalized = "".join(c for c in line if unicodedata.category(c) != "Cf")
```

Unicode category Cf ("Format, other") includes invisible characters: zero-width joiners (U+200D), zero-width non-joiners (U+200C), zero-width spaces (U+200B), and word joiners (U+2060). An adversarial author can insert these between the characters of a secret key — the characters are visually invisible and collapse when copy-pasted, but a naive regex will not match the fragmented token.

Stage 1 strips them entirely, reconstructing the original token.
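A self-contained demonstration of the technique (the helper name is illustrative):

```python
import unicodedata

def strip_format_chars(line: str) -> str:
    """Remove Unicode category Cf characters (zero-width spaces, joiners,
    word joiners): invisible in rendered output, fatal to naive regexes."""
    return "".join(c for c in line if unicodedata.category(c) != "Cf")

obfuscated = "A\u200bK\u200bI\u200bA1234567890ABCDEF"  # zero-width spaces inside
assert strip_format_chars(obfuscated) == "AKIA1234567890ABCDEF"
```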

#### Stage 2 — HTML Character Reference Decoding (ZRT-006)

```python
normalized = html.unescape(normalized)
```

HTML character references (`&#65;`, `&#x41;`, `&amp;`) can encode any ASCII character. A key like `AKIA1234567890ABCD` can be written as `&#65;&#75;&#73;&#65;1234567890&#65;&#66;&#67;&#68;` in inline HTML within a Markdown file — and will render correctly in the browser while evading naive scanners.

`html.unescape()` from the Python standard library handles all forms: decimal (`&#NNN;`), hexadecimal (`&#xHH;`), and named references (`&amp;`).

#### Stage 3 — HTML Comment Stripping (ZRT-007)

```python
_HTML_COMMENT_RE = re.compile(r"<!--.*?-->")
normalized = _HTML_COMMENT_RE.sub("", normalized)
```

HTML comments can interleave token fragments: `ghp_ABC<!-- noise -->DEF`. After the build, the comment is invisible. In the source, it splits the token. Stage 3 removes the comment, joining `ghp_ABC` and `DEF` into `ghp_ABCDEF`, which is then matched by the GitHub token pattern on a subsequent pass.

#### Stage 4 — MDX Comment Stripping (ZRT-007)

```python
_MDX_COMMENT_RE = re.compile(r"\{/\*.*?\*/\}")
normalized = _MDX_COMMENT_RE.sub("", normalized)
```

MDX files use JSX-style comments: `{/* ... */}`. The same interleaving attack applies. Stage 4 handles the MDX-specific variant independently.

#### Stage 5 — Backtick Code Span Unwrapping (ZRT-003)

```python
_BACKTICK_INLINE_RE = re.compile(r"`([^`]*)`")
normalized = _BACKTICK_INLINE_RE.sub(r"\1", normalized)
```

Documentation authors frequently write tokens inside inline code spans for visual formatting: `` `AKIA` ``. The backticks are presentation — they do not change the semantics of the content. Stage 5 strips them, exposing the raw token to the regex patterns.

#### Stage 6 — Concatenation Operator Removal (ZRT-003)

```python
_CONCAT_OP_RE = re.compile(r"[`'\"\s]*\+[`'\"\s]*")
normalized = _CONCAT_OP_RE.sub("", normalized)
```

Split-token patterns appear in documentation tables:

```
| Field | Value |
|-------|-------|
| Key | `AKIA` + `1234567890ABCDEF` |
```

The `+` operator joined with surrounding backticks is a common representation of string concatenation in documentation. Stage 6 removes the concatenation construct, joining the fragments into `AKIA1234567890ABCDEF`.

#### Stage 7 — Table Pipe Replacement

```python
_TABLE_PIPE_RE = re.compile(r"\|")
normalized = _TABLE_PIPE_RE.sub(" ", normalized)
```

Markdown table cells are separated by `|`. A secret split across cells would look like `| AKIA | 1234567890ABCDEF |`. Stage 7 converts pipes to spaces, enabling the whitespace collapse in Stage 8 to produce a scannable line.

#### Stage 8 — Whitespace Normalization

```python
return " ".join(normalized.split())
```

This collapses all whitespace runs (tabs, multiple spaces, newlines) into single spaces. It is the final normalization before regex matching. The result is a clean, compact line in which all of the obfuscation techniques above have been defeated.
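Assembled from the snippets above, the full pipeline can be sketched as one function (a simplified reconstruction, not the exact Zenzic source):

```python
import html
import re
import unicodedata

_HTML_COMMENT_RE = re.compile(r"<!--.*?-->")
_MDX_COMMENT_RE = re.compile(r"\{/\*.*?\*/\}")
_BACKTICK_INLINE_RE = re.compile(r"`([^`]*)`")
_CONCAT_OP_RE = re.compile(r"[`'\"\s]*\+[`'\"\s]*")

def normalize_line_for_shield(line: str) -> str:
    """Apply the 8 stages in strict order; pure string work, no I/O."""
    s = "".join(c for c in line if unicodedata.category(c) != "Cf")  # 1
    s = html.unescape(s)                                             # 2
    s = _HTML_COMMENT_RE.sub("", s)                                  # 3
    s = _MDX_COMMENT_RE.sub("", s)                                   # 4
    s = _BACKTICK_INLINE_RE.sub(r"\1", s)                            # 5
    s = _CONCAT_OP_RE.sub("", s)                                     # 6
    s = s.replace("|", " ")                                          # 7
    return " ".join(s.split())                                       # 8

# A table row hiding a split AWS-style key:
row = "| Key | `AKIA` + `1234567890ABCDEF` |"
assert normalize_line_for_shield(row) == "Key AKIA1234567890ABCDEF"
```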

### The Lookback Buffer: Cross-Line Detection

A secret that spans two lines defeats single-line scanning:

```yaml
api_key: >-
  AKIA
  IOSFODNN7EXAMPLE
```

Each line individually contains only a fragment. Neither line matches the AWS access key pattern `AKIA[0-9A-Z]{16}`.

Zenzic addresses this with `scan_lines_with_lookback()` — a stateful scanner that maintains a one-line lookback buffer:

```python
def scan_lines_with_lookback(
    lines: Iterator[tuple[int, str]],
    file_path: Path | str,
) -> Iterator[SecurityFinding]:
    prev_normalized: str = ""
    for line_no, raw_line in lines:
        normalized = _normalize_line_for_shield(raw_line)
        # Scan the cross-line join: tail of previous line + head of current
        cross_line = prev_normalized[-40:] + normalized[:40]
        yield from scan_line_for_secrets(cross_line, file_path, line_no)
        # Scan the current line independently
        yield from scan_line_for_secrets(raw_line, file_path, line_no)
        prev_normalized = normalized
```

The cross-line join concatenates the last 40 characters of the normalized previous line with the first 40 characters of the normalized current line — enough to reconstruct any secret split across a line boundary, while keeping memory bounded.

### Dual-Form Scanning

Even after normalization, Zenzic scans each line in two forms:

  1. **Raw form** — the line exactly as it appears in the source, ensuring that normally formatted secrets are always caught, with correct column positions for reporting.
  2. **Normalized form** — after all 8 stages, ensuring that obfuscated secrets are reconstructed and matched.

Duplicate findings (the same secret type on the same line in both forms) are suppressed via a `seen: set[str]` de-duplication pass.

### ReDoS Prevention: The F2-1 Hardening

Regex patterns applied to pathological inputs can cause catastrophic backtracking — a ReDoS (Regular Expression Denial of Service) attack. A crafted Markdown file with a megabyte-long line could cause a regex engine to consume unbounded CPU.

Zenzic's F2-1 hardening establishes a maximum line length constant:

```python
_MAX_LINE_LENGTH: int = 1_048_576  # 1 MiB
```

Lines exceeding this limit are silently truncated before scanning. No secret longer than 1 MiB exists in practice; a line longer than 1 MiB is not legitimate documentation.

Additionally, all regex patterns used in `_SECRETS` undergo an eager ReDoS pre-flight check at engine construction time (ZRT-002):

```python
def _assert_regex_canary(rule: BaseRule) -> None:
    """Verify that the rule's regex does not exhibit catastrophic backtracking."""
    # Applies a timing canary against a known-adversarial input.
    # Raises PluginContractError if the pattern exceeds the time budget.
```

Custom rules loaded via the `zenzic.rules` entry-point group are subject to the same pre-flight check before the first file is scanned.

### The 9 Secret Families

Zenzic's Shield v0.7.0 detects credentials across 9 families:

| Family | Pattern | Example prefix |
|--------|---------|----------------|
| OpenAI API key | `sk-[a-zA-Z0-9]{48}` | `sk-a1B2c3...` |
| GitHub token | `gh[pousr]_[a-zA-Z0-9]{36}` | `ghp_`, `gho_`, `ghu_`, `ghs_`, `ghr_` |
| AWS access key | `AKIA[0-9A-Z]{16}` | `AKIAIOSFODNN7EXAMPLE` |
| Stripe live key | `sk_live_[0-9a-zA-Z]{24}` | `sk_live_4xK8...` |
| Slack token | `xox[baprs]-[0-9a-zA-Z]{10,48}` | `xoxb-`, `xoxa-`, ... |
| Google API key | `AIza[0-9A-Za-z\-_]{35}` | `AIzaSyB...` |
| Private key header | `-----BEGIN [A-Z ]+ PRIVATE KEY-----` | RSA, EC, DSA |
| Hex-encoded payload | `(?:\\x[0-9a-fA-F]{2}){3,}` | `\x41\x4b\x49\x41...` |
| GitLab PAT | `glpat-[A-Za-z0-9\-_]{20,}` | `glpat-aBcDeFgHiJkL...` |

Each pattern is pre-compiled at import time — zero compilation overhead during scanning. The set is additive: new families are added by appending to the `_SECRETS` list.
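A sketch of how such a pre-compiled pattern table works (two families shown; the structure is illustrative, not Zenzic's exact `_SECRETS` definition):

```python
import re

# Patterns compiled once at import time: no per-line compilation cost.
_SECRETS: list[tuple[str, re.Pattern[str]]] = [
    ("AWS access key", re.compile(r"AKIA[0-9A-Z]{16}")),
    ("GitHub token",   re.compile(r"gh[pousr]_[a-zA-Z0-9]{36}")),
]

def scan_line_for_secrets(line: str) -> list[str]:
    """Return the family name of every credential pattern found in the line."""
    return [name for name, pattern in _SECRETS if pattern.search(line)]

assert scan_line_for_secrets("key: AKIAIOSFODNN7EXAMPLE") == ["AWS access key"]
```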

### Exit Code 2: The Sacred Exit

Any detection by the Shield causes Zenzic to exit with code 2. This exit code is non-suppressible — it cannot be silenced by `--exit-zero`, `fail-on-error: false`, or any configuration flag.

The rationale: a CI system that can be configured to ignore credential exposure is not a security gate. It is theater. Exit code 2 is the guarantee that the security contract cannot be bypassed by configuration drift or operator error.

```
Exit 0 — All checks passed
Exit 1 — Quality findings (broken links, orphans, placeholders) — suppressible
Exit 2 — Security breach (Shield: credential detected) — NEVER suppressible
Exit 3 — Fatal breach (Blood Sentinel: path traversal) — NEVER suppressible
```

The Shield operates in Pass 1A — before any structural analysis. A file that triggers exit 2 does not proceed to link validation or orphan detection. The Sentinel reports the breach and stops.
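The contract can be modeled as a pure severity fold (a sketch of the behavior described above, not Zenzic's actual implementation; the names are illustrative):

```python
from enum import IntEnum

class Severity(IntEnum):
    OK = 0
    QUALITY = 1   # broken links, orphans: suppressible
    SECURITY = 2  # Shield, credential detected: never suppressible
    FATAL = 3     # Blood Sentinel, path traversal: never suppressible

def resolve_exit_code(severities: list[Severity], exit_zero: bool = False) -> int:
    """Worst finding wins; only quality findings honor --exit-zero."""
    worst = max(severities, default=Severity.OK)
    if worst == Severity.QUALITY and exit_zero:
        return 0          # quality findings alone may be silenced
    return int(worst)     # exits 2 and 3 ignore all configuration
```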


## Act IV — Blood Sentinel: Kernel-Level Sandboxing

### Path Traversal in CI/CD

In a CI/CD pipeline, Zenzic runs in a containerized runner. The runner has access to:

- SSH keys: `/home/runner/.ssh/id_rsa`
- System secrets: `/etc/passwd`, `/etc/shadow`
- Runner tokens: `/var/run/secrets/kubernetes.io/serviceaccount/token`

A Markdown file can embed a path traversal attack:

```md
[Evil link](../../../../etc/passwd)
[Another attack](../../../home/runner/.ssh/id_rsa)
```

A documentation site that renders these files to HTML becomes a vector for exfiltrating runner secrets, depending on the deployment mechanism and how static assets are served.

More critically: Zenzic itself reads file contents to validate them. A path traversal in a link target could cause Zenzic to validate `/etc/passwd` as a documentation file and include its content in a report. This is the tool-level attack — abusing the validator to read secrets from the runner filesystem.

The Blood Sentinel prevents both categories.

### The os.path.normpath Collapse

The defense is built into `InMemoryPathResolver._build_target()`:

```python
def _build_target(self, source_file: Path, path_part: str) -> str:
    if path_part.startswith("/"):
        raw = self._root_str + os.sep + path_part.lstrip("/")
    elif path_part.startswith("@site/docs/"):
        raw = self._root_str + os.sep + path_part[len("@site/docs/"):]
    elif path_part.startswith("@site/"):
        raw = self._repo_root_str + os.sep + path_part[len("@site/"):]
    else:
        raw = str(source_file.parent) + os.sep + path_part
    return os.path.normpath(raw)  # ← the collapse
```

`os.path.normpath()` is pure string arithmetic — no syscalls, no `stat()`, no `readlink()`. It collapses all `.` and `..` segments mathematically.

The result:

```
source: /repo/docs/guide/install.mdx
link:   ../../../../etc/passwd

raw      = /repo/docs/guide/../../../../etc/passwd
normpath → /etc/passwd
```

The target string `/etc/passwd` is produced before any filesystem call is made. Then the Shield check runs:

```python
shield_ok = (
    target_str == self._root_str
    or target_str.startswith(self._root_prefix)
)
if not shield_ok:
    return PathTraversal(raw_href=href)
```

`/etc/passwd` does not start with `/repo/docs/`, so `PathTraversal` is returned immediately. Zero filesystem access. Zero data exposure. Exit 3.
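The collapse-then-check sequence can be reproduced end to end in a few lines (a sketch with illustrative paths, using `posixpath` so the output is the same on every platform):

```python
import posixpath

DOCS_ROOT = "/repo/docs"  # illustrative perimeter

def is_traversal(source_dir: str, href: str) -> bool:
    """Pure string arithmetic: normalize the joined path, then require it to
    stay inside the docs root. No stat(), no readlink(), no syscalls."""
    target = posixpath.normpath(posixpath.join(source_dir, href))
    return not (target == DOCS_ROOT or target.startswith(DOCS_ROOT + "/"))

assert is_traversal("/repo/docs/guide", "../../../../etc/passwd") is True
assert is_traversal("/repo/docs/guide", "../intro.mdx") is False
```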

### The Multi-Root Perimeter

Zenzic handles multi-locale Docusaurus projects where both `docs/` and `i18n/it/docusaurus-plugin-content-docs/current/` contain cross-referencing files.

The `InMemoryPathResolver` constructor accepts an `allowed_roots` parameter — a list of additional authorized boundaries:

```python
_extra = [self._coerce_path(r) for r in (allowed_roots or [])]
_pairs: list[tuple[str, str]] = []
for _r in [self._root_dir, *_extra]:
    _s = str(_r)
    _pairs.append((_s, _s + os.sep))
self._allowed_root_pairs: tuple[tuple[str, str], ...] = tuple(_pairs)
```

The Shield check becomes:

```python
shield_ok = any(
    target_str == root_str or target_str.startswith(root_prefix)
    for root_str, root_prefix in self._allowed_root_pairs
)
```

A relative link from `docs/guide.mdx` to `../i18n/it/guide.mdx` is valid only if `i18n/it/docusaurus-plugin-content-docs/current/` is in `allowed_roots`. Without explicit authorization, it produces `PathTraversal`. The perimeter is explicitly declared, not inferred.

### The @site/ Alias: Security Analysis

Docusaurus allows `@site/` as an alias for the project root in import statements and static asset references. Zenzic maps this alias to `repo_root`:

```python
elif path_part.startswith("@site/docs/"):
    raw = self._root_str + os.sep + path_part[len("@site/docs/"):]
elif path_part.startswith("@site/"):
    raw = self._repo_root_str + os.sep + path_part[len("@site/"):]
```

A path like `@site/../etc/passwd` becomes:

```
raw      = /repo/../etc/passwd
normpath → /etc/passwd
```

The normpath collapse happens before the perimeter check. `@site/` is not an escape hatch from the Blood Sentinel. It is an alias for a specific root, and all `..` traversals through it are collapsed and checked identically.

### Exit Code 3: Non-Negotiable Termination

Path traversal findings (Z202/Z203) cause exit 3. Like exit 2, this is non-suppressible. A path traversal in a documentation source is not a quality finding. It is an attempted perimeter breach. The Sentinel terminates.

```
Z202 PATH_TRAVERSAL            — confirmed: resolved path escapes docs_root
Z203 PATH_TRAVERSAL_SUSPICIOUS — unresolvable path with traversal segments
```

The distinction: Z202 is triggered when normpath produces a path that fails the prefix check. Z203 is triggered when the href contains `../` segments but cannot be fully resolved (e.g., missing fragments, malformed URLs). Both produce exit 3.


## Act V — The Docusaurus Adapter: isCategoryIndex and URL Collapsing

### The Routing Problem

Docusaurus maps source files to URLs through a set of rules that are not always obvious to documentation authors. Zenzic must replicate these rules exactly in Python to produce correct VSM entries.

The most complex rule is `isCategoryIndex` collapsing: when a file's name matches certain patterns, its URL is collapsed to the parent directory, not to a file slug.

### The Three Collapsing Cases

From `_docusaurus.py`, the collapsing logic:

```python
if parts:
    file_name_lower = parts[-1].lower()
    parent_name_lower = parts[-2].lower() if len(parts) >= 2 else None
    if (
        file_name_lower == "index"     # Case 1: index file
        or file_name_lower == "readme" # Case 2: README file
        or (
            parent_name_lower is not None
            and file_name_lower == parent_name_lower  # Case 3: folder-match
        )
    ):
        parts = parts[:-1]  # collapse to parent
```

Case 1 — Index collapse:

```
docs/guide/index.mdx → /docs/guide/
docs/index.mdx       → /docs/
```

Case 2 — README collapse:

```
docs/guide/README.md → /docs/guide/
docs/README.md       → /docs/
```

Case 3 — Folder-match collapse (isCategoryIndex):

```
docs/guide/guide.mdx → /docs/guide/   (filename == parent dirname)
docs/api/api.md      → /docs/api/
```

This third case frequently surprises authors: a file named after its parent directory is silently collapsed to the directory URL by Docusaurus. Zenzic replicates this behavior exactly, producing the correct canonical URL in the VSM.

### URL Priority: Frontmatter Slug First

Before filesystem derivation, Zenzic checks for a `slug:` frontmatter declaration:

```python
# Stage 1: frontmatter slug override
slug = self._slug_map.get(rel_posix)
if slug is not None:
    if slug.startswith("/"):
        # Absolute slug: prefix with routeBasePath
        rbp = self._route_base_path or "docs"
        return "/" + rbp + slug.rstrip("/") + "/"
    else:
        # Relative slug: replace last path segment
        parent = rel.parent
        return "/" + parent.as_posix() + "/" + slug.strip("/") + "/"
```

The full URL resolution priority:

  1. Frontmatter `slug:` — absolute or relative override
  2. `isCategoryIndex` — index/README/folder-match collapse
  3. Extension stripping — `.md` / `.mdx` removed
  4. `routeBasePath` prefix — default `"docs"`, configurable

### The provides_index() Contract

The `provides_index(directory_path)` method determines whether a directory has a landing page — required for the Z401 (MISSING_DIRECTORY_INDEX) check:

```python
def provides_index(self, directory_path: Path) -> bool:
    index_files = ("index.md", "index.mdx", "README.md", "README.mdx")
    if any((directory_path / f).exists() for f in index_files):
        return True
    category_json = directory_path / "_category_.json"
    if category_json.exists():
        data = json.loads(category_json.read_text(encoding="utf-8"))
        link = data.get("link", {})
        return isinstance(link, dict) and link.get("type") == "generated-index"
    return False
```

A directory provides an index when:

  1. An `index.md`, `index.mdx`, `README.md`, or `README.mdx` exists inside it, or
  2. A `_category_.json` declares `"link": { "type": "generated-index" }`, causing Docusaurus to auto-generate a category index page.

I/O is permitted in `provides_index()` because it is called once per directory during the discovery phase — never inside per-link or per-file hot loops.

### The Three-Surface Harvester

For orphan detection, Zenzic's Docusaurus adapter aggregates navigation paths from three sources:

```python
def get_nav_paths(self) -> frozenset[str]:
    """Merge sidebar + navbar + footer into a single navigable path set."""
    return (
        self._parse_sidebars()            # sidebars.ts / sidebars.js
        | self._parse_config_navigation() # navbar.items + footer.links
    )
```

Sidebar parsing (`_parse_sidebars()`): reads `sidebars.ts` or `sidebars.js` via pure-Python regex. It strips JS-style line and block comments before parsing, and handles both explicit `type: 'doc'` entries and bare string IDs.

Config navigation (`_parse_config_navigation()`): reads `docusaurus.config.ts` via regex, extracts `to:` URL paths from `navbar.items` and `footer.links`, strips `baseUrl` and `routeBasePath` prefixes, and probes for `.md`/`.mdx` files on disk.

A file is ORPHAN_BUT_EXISTING only if it is absent from the sidebar AND the navbar AND the footer. A changelog linked only in the navbar is REACHABLE. A legal notice linked only in the footer is REACHABLE. This is R21 — UX-Discoverability.
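The resulting classification is a single set-membership test (a sketch; names are illustrative):

```python
def classify_reachability(url: str, nav_paths: frozenset[str]) -> str:
    """R21: a page is an orphan only if NO surface (sidebar, navbar,
    footer) links to it; one hit on any surface makes it REACHABLE."""
    return "REACHABLE" if url in nav_paths else "ORPHAN_BUT_EXISTING"

# Hypothetical merged three-surface set:
nav = frozenset({"/docs/intro/", "/changelog/", "/legal/"})
assert classify_reachability("/changelog/", nav) == "REACHABLE"
```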

### The Slug Law: Physical Consistency

Zenzic's own documentation enforces the Slug Law (ADR-003): no `slug:` frontmatter may diverge from the physical file path. The rationale is architectural: the autogenerated sidebar uses `type: 'autogenerated'` — it resolves URLs from file paths. A diverged `slug:` creates a URL that the sidebar cannot resolve, causing navigation failures without a build-time error.

The VSM enforces this indirectly: if a `slug:` produces a URL that no sidebar entry references, the file is ORPHAN_BUT_EXISTING. The Slug Law converts this from a silent failure into a Zenzic finding.


## Act VI — The Rule Engine: Adaptive Parallelism

### The AdaptiveRuleEngine

Custom rules in Zenzic — declared in `[[custom_rules]]` or implemented as Python classes via the `zenzic.rules` entry-point group — are applied through the `AdaptiveRuleEngine`:

```python
class AdaptiveRuleEngine:
    def __init__(self, rules: Sequence[BaseRule]) -> None:
        for rule in rules:
            _assert_pickleable(rule)    # eager pickle validation
            _assert_regex_canary(rule)  # ZRT-002: ReDoS pre-flight
        self._rules = rules

    def run(self, file_path: Path, text: str) -> list[RuleFinding]:
        """Pure function: file path + text → findings. No I/O."""
        findings: list[RuleFinding] = []
        for rule in self._rules:
            try:
                findings.extend(rule.check(file_path, text))
            except Exception as exc:
                # Rule failures are caught and converted to RULE-ENGINE-ERROR
                # findings. One faulty plugin cannot abort the scan of the
                # entire docs tree.
                findings.append(RuleFinding(...))
        return findings
```

Rules are validated eagerly at construction time, before the first file is scanned. A rule that fails pickle serialization is rejected immediately — not silently inside a worker process during a long parallel scan.

### The 50-File Threshold

Zenzic's scanner switches between sequential and parallel execution based on the number of files:

```python
ADAPTIVE_PARALLEL_THRESHOLD: int = 50  # in scanner.py

use_parallel = workers != 1 and len(md_files) >= ADAPTIVE_PARALLEL_THRESHOLD
```

Below 50 files, execution is sequential: the overhead of spawning a `ProcessPoolExecutor` — approximately 200–400 ms on a cold interpreter — exceeds the parallelism benefit for small documentation sets.

At or above 50 files, a `ProcessPoolExecutor` is used:

```python
with concurrent.futures.ProcessPoolExecutor(max_workers=actual_workers) as executor:
    futures_map = {
        executor.submit(_worker, item): item[0]
        for item in work_items
    }
    for future in concurrent.futures.as_completed(futures_map):
        results.extend(future.result())
```

Each file is dispatched to an independent worker process. The worker receives a serialized `(file_path, config, rules)` tuple via pickle — which is why the eager pickle validation at `AdaptiveRuleEngine` construction is load-bearing. A non-pickleable lambda in a custom rule would silently fail inside the worker process; the eager check catches it in the main process at startup.

### Pure Function Discipline: Why It Matters for Parallelism

Pillar 3 — Pure Functions First — is not a style preference. It is an architectural requirement for correctness under parallelism.

A rule that holds mutable state between `check()` calls (e.g., a counter or a cache) would produce data races when two workers process files simultaneously. A rule that makes I/O calls inside `check()` would suffer from TOCTOU (time-of-check to time-of-use) races in a parallel context.

Pure functions — deterministic, stateless, side-effect-free — are safe to execute concurrently without synchronization. The AdaptiveRuleEngine guarantees this by contract: any rule that cannot be expressed as a pure function fails the plugin contract validation (a `PluginContractError`) and is not admitted to the engine.

The Pickle Serialization Check​

Custom rules loaded via entry_points(group="zenzic.rules") are validated with:

```python
def _assert_pickleable(rule: BaseRule) -> None:
    try:
        pickle.dumps(rule)
    except Exception as exc:
        raise PluginContractError(
            f"Rule '{rule.rule_id}' cannot be pickled and is incompatible with "
            f"multiprocessing: {exc}"
        ) from exc
```

This is an eager contract check: the error is raised before any file is touched, with a clear message pointing to the rule that failed. Without this check, the failure would manifest as a cryptic BrokenPipeError or EOFError inside a worker process at scan time — far harder to diagnose.
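To see what the eager check prevents, try to pickle an instance that carries a lambda. The `LambdaRule` class below is a hypothetical stand-in for a misbehaving custom rule:

```python
import pickle

class LambdaRule:
    """Hypothetical misbehaving rule: the lambda is defined locally,
    so instances cannot cross a process boundary via pickle."""
    rule_id = "X999"

    def __init__(self):
        self.matcher = lambda line: "TODO" in line  # non-pickleable attribute

try:
    pickle.dumps(LambdaRule())
    failed = False
except Exception as exc:
    failed = True
    print(f"caught in the main process: {type(exc).__name__}")

# Without the eager contract check, this same defect would surface later
# as a cryptic failure inside a worker at scan time.
assert failed
```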


## Act VII — Enterprise Integration: SARIF and the Quality Gate

### SARIF 2.1.0: Documentation in Your Security Dashboard

SARIF (Static Analysis Results Interchange Format) is the standard output format for security tools consumed by GitHub Code Scanning, Azure DevOps, and other CI/CD platforms.

Zenzic produces valid SARIF 2.1.0 with:

```bash
zenzic check all ./docs --format sarif > zenzic.sarif
```

The SARIF output includes:

- Tool descriptor with Zenzic version and URI
- Rules array with one entry per Zxxx code found (ID, name, helpUri, severity)
- Results array with location (file + line + column), message, and level

A minimal SARIF result for a broken link:

```json
{
  "ruleId": "Z101",
  "level": "error",
  "message": {
    "text": "Z101 LINK_BROKEN: './install.mdx' → './guide/setup.mdx' does not exist"
  },
  "locations": [{
    "physicalLocation": {
      "artifactLocation": { "uri": "docs/install.mdx" },
      "region": { "startLine": 42, "startColumn": 12 }
    }
  }]
}
```

Upload to GitHub Code Scanning:

```yaml title=".github/workflows/zenzic.yml"
name: Documentation Integrity Gate

on: [push, pull_request]

jobs:
  sentinel:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run Zenzic Sentinel
        run: uvx zenzic check all ./docs --format sarif > zenzic.sarif

      - name: Upload to GitHub Security
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: zenzic.sarif
        if: always()  # upload even when Zenzic fails
```

The `if: always()` is critical: when Zenzic exits with code 1 (quality findings), the step is marked as failed — but the SARIF upload must still execute to surface the findings in the Security tab. Without `if: always()`, a failed step would abort before uploading, producing silence instead of visibility.

For teams using zenzic-action:

```yaml title=".github/workflows/zenzic.yml"
- uses: PythonWoods/zenzic-action@v1
  with:
    version: "0.7.0"
    format: sarif
    upload-sarif: "true"
```

The action handles the SARIF upload and the `if: always()` semantics automatically, including SARIF integrity validation — if the SARIF file is truncated by runner OOM or SIGKILL, the action emits a `::warning` annotation rather than uploading a false-clean result (Output-First Semantics, ADR-004 in zenzic-action).

### Machine Silence: Rule R20

When --format sarif or --format json is active, Zenzic enforces Machine Silence (R20): zero Rich banners, headers, or informational panels are written to stdout. The output stream is a machine-readable format and must remain 100% valid against its schema.

This is enforced at the CLI level:

```python
_MACHINE_FORMATS = frozenset({"json", "sarif"})

if output_format not in _MACHINE_FORMATS:
    print_header(console)
```
A script that pipes `zenzic check all --format json | jq '.findings'` receives valid JSON with no banner contamination.

### The Quality Score: `zenzic score`

Beyond binary pass/fail, Zenzic provides a quality score — a 0–100 metric computed from the weighted sum of findings across all check categories:

```bash
zenzic score ./docs
```

The score can be used as a regression gate:

```bash
zenzic diff ./docs  # compare current score to last snapshot
```

`zenzic diff` compares the current scan result against a stored snapshot (`zenzic.snapshot.json` in the repo root). A score regression (e.g., a drop from 97 to 91) causes a non-zero exit, enabling CI to block merges that degrade documentation quality.

This is the Quality Gate pattern: not a binary pass/fail, but a tracked trend with a configurable failure threshold (fail_under in zenzic.toml).
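The scoring idea can be sketched as a weighted penalty normalized by project size. The weights and normalization below are illustrative assumptions — Zenzic's real formula lives in its scoring module:

```python
# Illustrative severity weights -- not Zenzic's actual values.
SEVERITY_WEIGHTS = {"error": 5, "warning": 2, "note": 1}

def quality_score(findings: list[dict], total_files: int) -> int:
    """Hypothetical 0-100 score: weighted finding penalty, normalized
    by file count so large projects are not penalized unfairly."""
    penalty = sum(SEVERITY_WEIGHTS.get(f["level"], 1) for f in findings)
    score = 100 - (penalty / max(total_files, 1)) * 10
    return max(0, min(100, round(score)))

# One error (5) + one warning (2) across 7 files -> penalty density 1.0 -> 90.
print(quality_score([{"level": "error"}, {"level": "warning"}], total_files=7))  # 90
```

A `fail_under` gate is then a one-line comparison: exit non-zero when the computed score drops below the configured threshold.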

### The Diagnostic Code Registry: Zxxx

Every Zenzic finding carries a Zxxx code from `core/codes.py` — the single source of truth for the diagnostic registry.

The full registry by category:

| Range | Category | Codes |
|-------|----------|-------|
| Z1xx | Link Integrity | Z101 LINK_BROKEN, Z102 ANCHOR_MISSING, Z103 UNREACHABLE_LINK, Z104 FILE_NOT_FOUND, Z105 ABSOLUTE_PATH, Z106 ALT_TEXT_MISSING |
| Z2xx | Security | Z201 SHIELD_SECRET, Z202 PATH_TRAVERSAL, Z203 PATH_TRAVERSAL_SUSPICIOUS |
| Z3xx | Reference Integrity | Z301 DANGLING_REF, Z302 DEAD_DEF, Z303 CIRCULAR_LINK |
| Z4xx | Structure | Z401 MISSING_DIRECTORY_INDEX, Z402 ORPHAN_PAGE, Z403 SNIPPET_UNREACHABLE, Z404 CONFIG_ASSET_MISSING |
| Z5xx | Content Quality | Z501 PLACEHOLDER, Z502 SHORT_CONTENT, Z503 SNIPPET_ERROR, Z504 QUALITY_REGRESSION |
| Z9xx | Engine / System | Z901 RULE_ERROR, Z902 RULE_TIMEOUT, Z903 UNUSED_ASSET, Z904 DISCOVERY_ERROR |

The codes are stable across versions. A CI system that filters findings by Z201 (credentials) can do so independently of Zenzic version bumps. The codes are the documented API surface for tooling integration.
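Because the codes are stable, downstream tooling can key on them directly. A minimal sketch of filtering a SARIF payload for Z201 credential findings (the payload here is hand-built for illustration, not real Zenzic output):

```python
import json

# Hand-built SARIF-shaped payload -- structure only, for illustration.
sarif_text = json.dumps({"runs": [{"results": [
    {"ruleId": "Z101", "level": "error"},
    {"ruleId": "Z201", "level": "error"},
    {"ruleId": "Z402", "level": "warning"},
]}]})

sarif = json.loads(sarif_text)
credential_findings = [
    result
    for run in sarif["runs"]
    for result in run["results"]
    if result["ruleId"] == "Z201"  # stable across Zenzic version bumps
]
print(len(credential_findings))  # 1
```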


## Act VIII — Performance: The Numbers

### The Adaptive Parallelism Benchmark

The 50-file threshold is a conservative heuristic derived from empirical measurement:

| File count | Sequential (ms) | Parallel (ms) | Winner |
|-----------:|----------------:|--------------:|--------|
| 10 | 28 | 380 | Sequential wins |
| 25 | 71 | 390 | Sequential wins |
| 50 | 142 | 395 | Sequential wins |
| 100 | 284 | 412 | Sequential wins (gap closing) |
| 500 | 1,420 | 680 | Parallel wins (2×) |
| 1,000 | 2,840 | 920 | Parallel wins (3×) |
| 10,000 | 28,400 | 4,200 | Parallel wins (6.7×) |

Measurements on a 4-core runner, cold start. Custom rules with moderate complexity.

The ~380 ms fixed overhead of ProcessPoolExecutor spawn is the reason the threshold is not set lower. A threshold of 10 files would push scans of small repos into parallel mode, paying the spawn cost without benefit.

The scan time breakdown for a 1,000-file project:

```text
Discovery (walk + read):  ~450 ms  (I/O bound — disk sequential)
VSM construction:         ~120 ms  (CPU bound — adapter URL mapping)
Anchor cache build:        ~80 ms  (CPU bound — heading slug extraction)
Link validation:           ~95 ms  (CPU bound — 50,000 hash lookups)
Orphan detection:          ~35 ms  (CPU bound — frozenset intersection)
Shield scan:              ~210 ms  (CPU bound — regex over 1M lines)
Report rendering:          ~40 ms  (CPU bound — Rich formatting)
─────────────────────────────────────
Total:                 ~1,030 ms
```

:::note Benchmark conditions

These figures are for synthetic Markdown files (minimal frontmatter, no JSX, ~10 lines of prose). Real-world MDX files with frontmatter, JSX components, tables, and dense link graphs cost significantly more per file. Measured against the real zenzic-doc project (59 MDX pages): ~7 ms/file vs ~0.5 ms/file for synthetic files. Run `python scripts/benchmark.py --repo <path>` to measure your own project.

:::

Link validation at 50,000 links takes 95 ms — less than the report rendering phase. This is the O(1) hash map in practice: 50,000 `dict.get()` calls at ~1.9 µs each.
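The O(1) claim is easy to sanity-check locally. Absolute numbers are machine-dependent; the ~1.9 µs figure above comes from Zenzic's own benchmark run:

```python
import timeit

# Build a URL map at the scale of the 50,000-link scenario above.
url_map = {f"/docs/page-{i}": f"page-{i}.mdx" for i in range(50_000)}
links = [f"/docs/page-{i}" for i in range(0, 50_000, 10)]  # 5,000 probes

# Time repeated lookups; per-call cost stays flat regardless of map size.
total = timeit.timeit(lambda: [url_map.get(u) for u in links], number=20)
per_lookup = total / (20 * len(links))
print(f"~{per_lookup * 1e6:.2f} µs per dict.get()")  # machine-dependent
```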

### Memory Profile

The VSM for a 10,000-file project:

```text
Route objects:  10,000 × ~280 bytes   =  ~2.8 MB
Anchor cache:   10,000 × ~1,200 bytes = ~12.0 MB
md_contents:    10,000 × ~8,000 bytes = ~80.0 MB
─────────────────────────────────────────────────
Total RSS:                              ~95 MB
```

The dominant cost is md_contents — the raw Markdown text held in memory for the Shield scan. Zenzic holds all files in memory simultaneously to avoid repeated I/O during multi-pass analysis. For projects above 50,000 files, a chunked processing mode is planned for a future release.

### Cross-Platform CI Matrix

Zenzic's test suite runs a 3×3 platform matrix on every commit:

```yaml
os: [ubuntu-latest, windows-latest, macos-latest]
python: ["3.11", "3.12", "3.13"]
```

9 parallel CI jobs. All 1,342+ tests must pass on all 9 combinations. This is the portability guarantee: Zenzic's output is identical across all platforms. A scan that passes on Ubuntu passes on macOS and Windows — critical for teams using heterogeneous development environments.


## Act IX — The Adapter Contract: Extending Zenzic

### The BaseAdapter Protocol

Zenzic's Core (`validator.py`, `scanner.py`) contains zero engine-name references. This is Purity Protocol — Rule R21 (Protocol Sovereignty). Any engine-specific behavior must be declared via the AdapterProtocol and queried by the Core.

The adapter protocol (simplified):

```python
from pathlib import Path
from typing import Protocol

class AdapterProtocol(Protocol):
    def get_nav_paths(self) -> frozenset[str]:
        """Return navigable paths from all user-clickable surfaces."""
        ...

    def map_url(self, rel: Path) -> str:
        """Map a source file to its canonical URL."""
        ...

    def classify_route(self, rel: Path, nav_paths: frozenset[str]) -> RouteStatus:
        """Classify a route as REACHABLE, ORPHAN_BUT_EXISTING, IGNORED, or CONFLICT."""
        ...

    def provides_index(self, directory_path: Path) -> bool:
        """True when the directory will have a landing page."""
        ...

    def get_metadata_files(self) -> list[Path]:
        """Return Level 1 System Guardrail files (excluded from all checks)."""
        ...

    def get_link_scheme_bypasses(self) -> frozenset[str]:
        """Return URI schemes that bypass Z105 absolute-path validation."""
        ...
```
The Core calls `adapter.get_nav_paths()`. It receives a `frozenset[str]`. What generated that frozenset — whether it came from `sidebars.ts`, `mkdocs.yml`, or `zensical.toml` — is invisible to the Core.

Adding a new adapter requires implementing this protocol. Adding engine-specific behavior by modifying `validator.py` is a protocol violation and will be rejected in code review.
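A new adapter is just a class that satisfies the protocol. The sketch below is an illustrative adapter for a plain-Markdown tree — it implements a subset of the methods, and every body here is an assumption, not Zenzic's shipped Standalone adapter:

```python
from pathlib import Path

class PlainTreeAdapter:
    """Illustrative adapter sketch for a raw Markdown tree (protocol subset)."""

    def get_nav_paths(self) -> frozenset[str]:
        return frozenset()  # a raw tree declares no navigation contract

    def map_url(self, rel: Path) -> str:
        # guide/setup.md -> /guide/setup
        return "/" + rel.with_suffix("").as_posix()

    def provides_index(self, directory_path: Path) -> bool:
        return (directory_path / "index.md").exists()

    def get_metadata_files(self) -> list[Path]:
        return [Path("pyproject.toml")]

    def get_link_scheme_bypasses(self) -> frozenset[str]:
        return frozenset()  # no engine-specific escape hatches

print(PlainTreeAdapter().map_url(Path("guide/setup.md")))  # /guide/setup
```

The Core never sees the class; it sees frozensets, strings, and paths returned through the protocol surface.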

### The pathname:/// Bypass (Rule R16)

Docusaurus uses pathname:/// as a Diplomatic Courier — an escape hatch for linking to static assets that are not part of the docs routing system:

```markdown
[Download PDF](pathname:///assets/whitepaper.pdf)
```

The Z105 gate (ABSOLUTE_PATH) normally fires on any path starting with /. The pathname:/// URI scheme is exempt in Docusaurus mode:

```python
def get_link_scheme_bypasses(self) -> frozenset[str]:
    return frozenset({"pathname"})
```

The Core queries `adapter.get_link_scheme_bypasses()` before applying Z105. This is R16 — Protocol Awareness — in action: engine-specific behavior declared in the adapter, queried by the Core, with no `if engine == "docusaurus"` in Core logic.

In all other engines (MkDocs, Zensical, Standalone), pathname:/// is unrecognized and triggers Z105 normally. The bypass is scoped precisely.
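The Core-side gate can be sketched in a few lines. `z105_fires` is a hypothetical helper name for illustration, not the real validator function:

```python
from urllib.parse import urlsplit

def z105_fires(link: str, scheme_bypasses: frozenset[str]) -> bool:
    """Hypothetical sketch of the Z105 gate: bypass schemes are checked
    before the absolute-path rule is applied."""
    if urlsplit(link).scheme in scheme_bypasses:
        return False  # e.g. pathname:/// in Docusaurus mode
    return link.startswith("/")

print(z105_fires("pathname:///assets/whitepaper.pdf", frozenset({"pathname"})))  # False
print(z105_fires("/assets/whitepaper.pdf", frozenset()))                         # True
```

With an empty bypass set (MkDocs, Zensical, Standalone), the same `pathname:///` link falls through to the absolute-path check.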

### Level 1 System Guardrails

Adapter metadata files — `docusaurus.config.ts`, `mkdocs.yml`, `zensical.toml`, `package.json`, `pyproject.toml` — are declared as Level 1 System Guardrails via `get_metadata_files()`. These files are:

- Permanently excluded from Z903 (UNUSED_ASSET) checks
- Permanently excluded from all quality checks
- Never presented to the user as orphans, placeholders, or short-content warnings

The rationale (Rule R13 — Intelligent Perimeter): asking the user to manually exclude their own build configuration files from analysis is a failure of the tool, not a configuration task. The adapter knows what its metadata files are; the Core does not need to be told.
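Mechanically, the guardrail is a set subtraction applied before any check runs. The helper name below is hypothetical:

```python
from pathlib import Path

def checkable_files(discovered: set[Path], metadata_files: list[Path]) -> set[Path]:
    """Hypothetical sketch: drop adapter-declared guardrail files up front,
    so they never reach Z903 or any quality check."""
    return discovered - set(metadata_files)

discovered = {Path("docs/intro.mdx"), Path("docusaurus.config.ts")}
print(checkable_files(discovered, [Path("docusaurus.config.ts")]))
# the config file is gone before any rule sees it
```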


## Act X — Getting Started

### Immediate Verification (No Installation)

```bash
uvx zenzic lab
```

uvx resolves the latest Zenzic from PyPI, installs it in an isolated temporary environment, and runs the interactive Lab. Seventeen Acts, each demonstrating a distinct capability. The entire experience requires no project setup.

Start with Act 3 — the Shield in action against a planted Stripe live key. Watch the Sentinel exit with code 2. That exit code is the promise.

### Your First Scan

```bash
uvx zenzic check all ./docs
```

Zenzic will:

  1. Discover your documentation engine (Docusaurus, MkDocs, Zensical, or Standalone)
  2. Build the VSM from your source files
  3. Run the Shield across every line of every file
  4. Validate all internal links against the VSM
  5. Detect orphan pages via R21 (navbar + sidebar + footer analysis)
  6. Report all findings with Zxxx codes, file paths, and line numbers

On a 100-page Docusaurus site: expect 2–4 seconds, cold start.

### Pinned CI Integration

```yaml title=".github/workflows/zenzic.yml"
name: Documentation Integrity Gate

on:
  push:
    branches: [main]
  pull_request:

jobs:
  sentinel:
    runs-on: ubuntu-latest
    permissions:
      security-events: write  # required for SARIF upload
    steps:
      - uses: actions/checkout@v4

      - uses: astral-sh/setup-uv@v5

      - uses: PythonWoods/zenzic-action@v1
        with:
          version: "0.7.0"  # pinned — deterministic CI gate
          format: sarif
          upload-sarif: "true"
```
Version pinning (`version: "0.7.0"`) is mandatory for production pipelines. `latest` is appropriate only for exploration — it introduces non-determinism into your CI gate.

### `zenzic.toml` Configuration

```toml title="zenzic.toml"
docs_dir = "docs"
fail_under = 95  # quality score gate: fail if score drops below 95

# Excluded external URLs (temporary — remove after deployment)
excluded_external_urls = [
  "https://internal.corp.example.com/api",
]

# Excluded asset patterns (Docusaurus sidebar metadata)
excluded_assets = [
  "**/_category_.json",
]

[build_context]
engine = "docusaurus"
base_url = "/"
default_locale = "en"
locales = ["it", "fr"]
```

The 4-level configuration priority: CLI flags > `zenzic.toml` > `pyproject.toml` `[tool.zenzic]` > built-in defaults. CLI flags always win. This allows temporary overrides without modifying project configuration.
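The priority chain can be sketched as a first-match lookup across the four layers. The `resolve` helper is illustrative, not Zenzic's config loader:

```python
def resolve(key: str, cli: dict, zenzic_toml: dict, pyproject: dict, defaults: dict):
    """First layer that defines the key wins:
    CLI > zenzic.toml > pyproject.toml > built-in defaults."""
    for layer in (cli, zenzic_toml, pyproject, defaults):
        if key in layer:
            return layer[key]
    raise KeyError(key)

# zenzic.toml sets 95; no CLI flag is given, so 95 beats pyproject's 80.
print(resolve("fail_under", {}, {"fail_under": 95}, {"fail_under": 80}, {"fail_under": 0}))  # 95
```

A temporary CLI override (`--fail-under 70`, say) would simply populate the first layer and win without touching any file.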

### Standalone Mode

For projects with no build system — raw Markdown directories, GitHub wikis, plain doc trees:

```toml title="zenzic.toml"
docs_dir = "."

[build_context]
engine = "standalone"
```

In Standalone mode:

- Orphan detection (Z402) is disabled — there is no navigation contract
- Link validation still runs — broken links are broken regardless of engine
- The Shield still runs — credentials are credentials regardless of engine
- The Blood Sentinel still runs — path traversal is path traversal regardless of engine

The security guarantees are engine-independent. Only the navigation contract is scoped.


## Brand Integrity: Z905 BRAND_OBSOLESCENCE

The fourth dimension of the Safe Harbor — beyond structural, security, and content correctness — is narrative integrity. A documentation suite that refers to a deprecated release codename has a different class of bug: it tells the wrong story.

Configure [project_metadata] in zenzic.toml to activate the Brand Integrity layer:

```toml title="zenzic.toml"
[project_metadata]
release_name = "Quartz"
obsolete_names = ["Obsidian"]
obsolete_names_exclude_patterns = ["CHANGELOG*.md", "adr-*.mdx"]
```

A violation is reported as:

```text
✘  docs/explanation/architecture.mdx  Z905  Obsolete brand term 'Obsidian': use 'Quartz'
   instead. Add <!-- zenzic:ignore Z905 --> (Markdown) or {/* zenzic:ignore Z905 */} (MDX)
   to the line to suppress intentional references.
✘  1 error
```

The `zenzic:ignore Z905` escape hatch is precise by design: it applies to a single line, not a whole file. A CHANGELOG entry that says "Released under the Obsidian codename" is historical fact. An architecture page that describes the current system as "Obsidian-based" is a lie that the source code has already corrected.
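Line scoping means the suppression marker is parsed per line, never per file. A sketch of the matching — the regex and helper are assumptions, not Zenzic's implementation:

```python
import re

# Matches "zenzic:ignore Zxxx" wherever it appears on a single line.
_IGNORE = re.compile(r"zenzic:ignore\s+(Z\d{3})")

def suppressed_codes(line: str) -> set[str]:
    """Codes suppressed on this one line -- neighboring lines are unaffected."""
    return set(_IGNORE.findall(line))

line = "Released under the Obsidian codename <!-- zenzic:ignore Z905 -->"
print(suppressed_codes(line))          # {'Z905'}
print(suppressed_codes("plain text"))  # set()
```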


## The Sentinel's Filter — Why Every Quartz Rule Exists

Every rule in the Quartz Core must pass a three-dimensional admission test before it ships: Structural Integrity (broken links, orphans, missing indices), Hardened Security (credentials, path traversal), or Technical Accessibility (machine-readable contracts for downstream tooling — Z505 is the canonical example). Rules that fail this filter — line length, list style, spelling — are deliberately out of scope. Zenzic is a Sentinel, not a Proofreader.

Read the full rationale →

## Epilogue: The Documentation is the Source

The engineering tradition treats documentation as secondary — a description of the system, not the system itself. This tradition is breaking down.

In 2026, documentation is:

- The primary interface for internal APIs in large organizations
- The trust signal that developers use to evaluate whether a library is maintained
- The compliance artifact that auditors examine in regulated industries
- The attack surface that adversaries probe for exposed credentials and path traversal

A documentation pipeline that trusts its input is not a pipeline. It is a hope.

Zenzic exists because the question "is this documentation correct?" is not the same question as "did this build succeed?" A build that succeeds on broken documentation has not validated anything. It has just run faster.

The Safe Harbor is not a metaphor. It is an architectural guarantee: every file that passes Zenzic's three layers — the Structural Validator, the Shield, and the Blood Sentinel — has been verified against the navigation contract of your specific build engine, scanned for all known credential formats with 8-stage normalization, and checked for path traversal against an explicitly declared perimeter.

That is the promise. Every exit-0 scan is the proof.


For the full engineering history of how these layers were designed, tested under AI-generated siege, and hardened across five sprints — read 🛡️ The Zenzic Chronicles →.


- GitHub: github.com/PythonWoods/zenzic
- Documentation: zenzic.dev
- PyPI: pypi.org/project/zenzic
- Lab: `uvx zenzic lab`